feat: Quasi-mature OpenTelemetry integration#223
Conversation
|
@Helveg I tried to get opentelemetry logs from the bsb reconstruction on HPC with the following command: |
|
Can you try some smaller debug examples first?
If none of them work, I think I have a CINECA login, or can use yours but I keep forgetting how hahaha :D I can try to debug the code there then. |
|
Ok I have a small test. The following command produces a jsonlines` file on my local PC but not on login node of cineca: # bla.yaml does not exist
opentelemetry-instrument --traces_exporter jsonlines bsb compile bla.yaml -v4 Should we look into python libs differences? |
|
Ok it seems I was lacking some opentelemetry libraries despite the code running without throwing any errors. opentelemetry-api 1.40.0
opentelemetry-distro 0.61b0
opentelemetry-exporter-otlp 1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc 1.40.0
opentelemetry-exporter-otlp-proto-http 1.40.0
opentelemetry-instrumentation 0.61b0
opentelemetry-proto 1.40.0
opentelemetry-sdk 1.40.0
opentelemetry-semantic-conventions 0.61b0
`` |
|
let's just say all of them. it's hard to tell because the otel package ecosystem for python specifically is a MESS.it's multiple monorepos on github,that all contain names pace packages for multiple names paces 🥲 it's really hard to tell what comes from what package. but I'll fix that as part of the move to a |
…bsb packages lint and format pass
Saving trace.get_tracer_provider() returns the proxy provider when no real provider is set. Restoring that into _TRACER_PROVIDER on exit makes ProxyTracerProvider.get_tracer recurse into itself. Snapshot the raw global instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers tracer registry idempotence, the explicit-version path that skips importlib.metadata lookup, and basic span context-manager behaviour. Keeps the package's CI test target from being a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xture - Skip the rank/bcast dance entirely when MPI size == 1; in serial mode it is observably equivalent to a plain start_as_current_span. - When broadcasting, treat an invalid local SpanContext as a hard error. Rank 0 still bcasts None first so non-root ranks unblock and raise the same misconfiguration error instead of deadlocking. - Drop the dead set_span_in_context call (its return value was discarded). - OTelFixture: ignore blank trailing lines so the file-exporter readback works on Windows, where text-mode write doubles \r\n. - Make tests use OTelFixture and add opentelemetry-sdk to the test deps so bsb-otel's tests can exercise the broadcast path under a real provider. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous hard-error stance broke OTel's "API works without SDK" contract. Now rank 0 broadcasts None when its local tracer produced an invalid context, and non-root ranks treat that signal as "no parent to share" and fall through to their own local start_as_current_span (also a no-op when there's no provider). No NonRecordingSpan(invalid_ctx) ever gets attached, so line 65's is_valid check stays honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the OTel "API works without SDK" contract for BsbTracer: with no provider configured, trace() yields a non-recording span, and under MPI the broadcast path stays deadlock-free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The broadcast-on-no-parent design assumed every rank entered each trace() in lockstep. Rank-divergent callers (e.g. bsb-hdf5's per-rank file ops, serialized via MPILock rather than collectives) deadlock under that contract. Introduce an MPI-communicator contextvar that BsbTracer broadcasts on: - use_communicator(comm) overrides the contextual broadcast comm. - local_tracing() is the COMM_SELF shorthand; spans in the block stay per-rank (size==1 in the contextual comm → no broadcast). A pre-existing broadcast parent set up above the block is still inherited, so cross-rank correlation is preserved through that parent. - Without a configured SDK provider, skip the broadcast branch entirely — there's nothing meaningful to share, and forcing a collective on every trace() call would lock up rank-divergent code. This restores OTel's "API works without SDK" contract. mpi.rank/mpi.size span attributes still report the global communicator's rank/size; only bsb-otel's internal bcast follows the contextual comm. Includes an env-gated (BSB_OTEL_TRACE=1) _btrace diagnostic helper used to track down this very class of deadlock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd ) uv warned: Failed to parse environment file at position 97 "/.../site-packages:". uv s --env-file does not perform shell-style variable substitution, so stays literal AND makes the whole line fail to parse, leaving PYTHONPATH unset in the child. Drop the suffix completely so the file is plain PYTHONPATH=/nest/path with no substitution. Trade-off: bsb-nest test processes will not see the workflow s OTel sitecustomize entry, which is fine -- bsb-nest is not where the otel-jsonlines diagnostics live.
…ort nest uv 0.11 --env-file does not override variables already set in the parent shell, so the outer PYTHONPATH (which we set to expose the OTel sitecustomize) wins over nest_vars.sh PYTHONPATH and the child python loses access to the nest module. Just put NEST python lib in the outer PYTHONPATH alongside sitecustomize: harmless for other test venvs (they have no reason to import nest) and required for bsb-nest.
The wrap closes its span cleanly even when the test fails because unittest catches the exception and records it on the TestResult. Intercept addError/addFailure on the result hand-off to set ERROR status + record_exception on the active span. With the BatchSpanProcessor exporting periodically, post-mortem we can now identify which tests failed in the jsonlines artifact without relying on truncated buffered stdout.
test_mpi_broadcast_parent_chain assumes _tracer.trace(broadcast_root) runs with no active parent so BsbTracer takes the bcast branch and the broadcast root has the same span_id on every rank. With wrap_tests_with_traces every test inherits the test wrapper span, which turns broadcast_root into a regular per-rank child. Attach a fresh OTel context with INVALID_SPAN around the inner fixture so the bcast branch fires.
…ting Previous version chained addError/addFailure wrappers across tests: each test pulls the current method as _orig and re-wraps, so by the Nth test the call goes through N nested wrappers (one per prior test), each touching an already-ended span. Restore the originals in a finally block so each tests wrap is independent. Pre-existing test_label parallel flake was actually surfaced by this accumulation - reverting to per-test wrap should let it pass again.
…log failures outer_span on non-root ranks is a NonRecordingSpan, so set_status/ record_exception on it is silent. Emit a short-lived child span through the underlying SDK tracer instead so the failure is visible on every rank in the jsonlines artifact.
…ommand Same pattern as test_mpi_broadcast_parent_chain: handle_command internally opens a cli span via BsbTracer.trace, expects no parent so bcast branch fires (rank 0 recording, non-root NonRecording). wrap_tests_with_traces turns it into a regular per-rank child instead, so non-root ranks also record. Attach an INVALID_SPAN context around handle_command to restore the no-parent assumption.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #223 +/- ##
==========================================
+ Coverage 81.06% 82.90% +1.83%
==========================================
Files 180 156 -24
Lines 17815 15385 -2430
Branches 2140 1794 -346
==========================================
- Hits 14442 12755 -1687
+ Misses 2821 2202 -619
+ Partials 552 428 -124 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Print len(ps), load_ids, get_unique_labels, mask shape, and label mask right before and after the all-False label call so the next CI flake gives concrete data on what state the failing rank had.
…bel diagnostic Python 3.12 build hung 9+ minutes at shutdown after all 477 heartbeats fired; orphan mpiexec/coverage at cancel. BatchSpanProcessor export daemon thread interacting with mpi4py atexit MPI_Finalize is the suspected cause. SimpleSpanProcessor has no daemon thread - each span is exported synchronously at end_of_span, so shutdown has nothing to wait for. Revert test_label inline diagnostic prints since the failure pattern moved off test_label (it was a pre-existing flake that surfaced briefly when the deadlock fix unblocked bsb-hdf5 PARALLEL).
…arallel tests bsb_test engines test_label: ps.label() and get_unique_labels() use mpilock to serialise writes/reads but do not barrier across ranks at op end. A fast rank can race ahead into its second ps.label() call before a slower rank reaches its get_unique_labels() read, so the slower rank observes labels that should not exist yet. Barrier between the calls. bsb-hdf5 test_mr TestHandcrafted: tearDownClass deletes test files on rank 0 only. Without a leading barrier, rank 0 can finish all its tests and start deleting while another rank is still in its last test method, producing FileNotFoundError on the in-flight test. Barrier at the start of tearDownClass aligns ranks before deletion.
… shell Adds opentelemetry_distro + opentelemetry_configurator entry points named bsb_jsonlines. The distro pins jsonlines as the traces exporter and selects our companion configurator, which builds a TracerProvider backed by SimpleSpanProcessor (avoiding the BatchSpanProcessor / mpi4py atexit deadlock seen on Python 3.12). CI Run unittests step drops the sitecustomize heredoc, outer PYTHONPATH, NEST_LIB injection, and BSB_OTEL_JSONLINES env var. It now opts into our distro the canonical OTel way: wrap nx with opentelemetry-instrument and set OTEL_PYTHON_DISTRO=bsb_jsonlines. __init__ stays a docstring DMZ, no implicit auto-init on bsb_otel import. Local equivalent for any developer: OTEL_PYTHON_DISTRO=bsb_jsonlines OTEL_EXPORTER_JSONLINES_PATH=./traces.jsonlines uv run --project packages/bsb-core opentelemetry-instrument ./nx run-many -t test
…ecord Two cheap guards added to the auto-instrumented method wrapper: - If opentelemetry has no TracerProvider configured, bypass the trace context manager entirely and call the original method. - Even when a provider IS set, defer the heavy json.dumps(self.__tree__()) attribute until after the span exists and gate it on span.is_recording(). Non-recording spans (no SDK on this rank, NonRecordingSpan, local_tracing()) skip the serialization.
OTEL_PYTHON_CONFIGURATOR is not re-exported as a constant from opentelemetry.environment_variables in the pinned opentelemetry-api version, so the distro module failed to import under opentelemetry-instrument auto_instrumentation _load_distro and broke every Python subprocess (bsb-nest test loader was the visible symptom).
uv --env-file (dotenvy) lets the parent PYTHONPATH win, and opentelemetry-instrument injects its own PYTHONPATH before exec, so nest_vars.sh could never override it. Export NEST's site-packages into the outer PYTHONPATH in the workflow so it survives both layers. Drop the now-redundant sed patch from install-nest. Also correct the misleading "older version" comment in the bsb_jsonlines distro — OTEL_PYTHON_CONFIGURATOR isn't a constant in any opentelemetry-api version, including main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t path The early-return when _TRACER_PROVIDER is None already short-circuits the entire wrapper, so an additional is_recording() check inside the span only adds branching without saving any work in the no-SDK case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the _btrace/_shtrace stderr-print helpers gated on $BSB_OTEL_TRACE — they were investigation scaffolding for the parallel-deadlock work and the OTel spans themselves now carry the same information. Removes the helpers and every call site in BsbTracer.trace and _SpannedHandle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the specific BatchSpanProcessor reference — the heartbeat span is exported by whichever processor is configured (Simple or Batch), and the testing wiring no longer picks a single one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l meta Engine.__init__ broadcasts rank-0's root to all ranks, so an fs store is implicitly shared across MPI ranks. FileStore.store() was two raw file writes back-to-back: opening the meta path in "w" mode truncates it before json.dump runs, so a concurrent reader iterating os.listdir(files/) could read the empty meta and JSONDecodeError. Route both writes through a tmp+os.replace helper and write meta first, so the meta file is fully on disk before the content file appears in listdir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous atomic-write change staged the tmp file in the same dir as the final path. For the content file that meant a tmpXXXXX entry appeared briefly in files/, which os.listdir(files/) would return — and _path_to_id then crashes trying to base64-decode tmpXXXXX. Stage from the engine root instead; same filesystem so os.replace stays atomic, and files/ only ever contains valid id entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clarify the two rules the atomic-write fix relies on and put the discovery contract next to `all()` so a future reader iterating `file_meta/` is forced to think about the write order before changing it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the flip-the-order escape hatch from the comment; just say what the rule is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drodarie
left a comment
There was a problem hiding this comment.
Maybe consider improving the code coverage of the unittests for bsb-otel.
The last report from codecov seems to indicate that a lot of the new code is not tested.
| BSB_QUIET: true | ||
| run: | | ||
| cd packages/bsb-otel | ||
| uv run python -m unittest -v \ |
There was a problem hiding this comment.
why not using coverage so that these tests contribute to the overall package coverage?
Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>
|
Could you please also add some explanation in the bsb docs regarding jsonlines logs and how to load them afterwards too? |
Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>
Features
bsb-otelpackage: MPI-aware tracer, jsonlines + OTLP exporters, unittest wrapper that emits a span per test@config.nodemethods and CLI command handlers viabsb.profilingbsb_jsonlinesOTel distro/configurator entry point; per-rank trace artifacts uploaded on every buildhpc-cinecaClaude Code skill capturing the SLURM/MPI workflowFlakes / races fixed
JSONDecodeError: Expecting value: line 1 column 1 (char 0)in fs FileStore — atomic tmp+rename, meta lands before contenttest_cache_survivalparallel deadlock — BsbTracer's collective bcast no longer fires from asymmetric contexts (@timeoutdaemon threads and pool worker dispatchers had lost their parent span)SimpleSpanProcessorMPI_Initunderopentelemetry-instrument— bsb-otel restructured as an entry-point DMZ so the two-phase startup doesn't dragbsb.servicesin pre-execopentelemetry-instrument+uv --env-file— NEST site-packages exported in the workflowtest_labelrank ordering, hdf5tearDownClasscleanup) — MPI barriers at op boundaries📚 Documentation preview 📚: https://bsb-hdf5--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://bsb-core--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://nrn-patch--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://bsb-json--223.org.readthedocs.build/en/223/
📚 Documentation preview 📚: https://bsb-test--223.org.readthedocs.build/en/223/