feat: Quasi-mature OpenTelemetry integration by Helveg · Pull Request #223 · dbbs-lab/bsb

Helveg · 2026-02-25T10:38:09Z

Features

bsb-otel package: MPI-aware tracer, jsonlines + OTLP exporters, unittest wrapper that emits a span per test
Auto-instrument @config.node methods and CLI command handlers via bsb.profiling
CI registers a bsb_jsonlines OTel distro/configurator entry point; per-rank trace artifacts uploaded on every build
Unittest FAIL/ERROR outcomes recorded on the test span so failures surface on every rank
New hpc-cineca Claude Code skill capturing the SLURM/MPI workflow

Flakes / races fixed

JSONDecodeError: Expecting value: line 1 column 1 (char 0) in fs FileStore — atomic tmp+rename, meta lands before content
test_cache_survival parallel deadlock — BsbTracer's collective bcast no longer fires from asymmetric contexts (@timeout daemon threads and pool worker dispatchers had lost their parent span)
Python 3.12 + mpi4py atexit shutdown hang on OTel export — default test processor switched to SimpleSpanProcessor
SLURM PMI2 error 14 / double MPI_Init under opentelemetry-instrument — bsb-otel restructured as an entry-point DMZ so the two-phase startup doesn't drag bsb.services in pre-exec
bsb-nest PYTHONPATH lost under opentelemetry-instrument + uv --env-file — NEST site-packages exported in the workflow
Misc parallel races (test_label rank ordering, hdf5 tearDownClass cleanup) — MPI barriers at op boundaries

📚 Documentation preview 📚: https://bsb-hdf5--223.org.readthedocs.build/en/223/

📚 Documentation preview 📚: https://bsb-core--223.org.readthedocs.build/en/223/

📚 Documentation preview 📚: https://nrn-patch--223.org.readthedocs.build/en/223/

📚 Documentation preview 📚: https://bsb-json--223.org.readthedocs.build/en/223/

📚 Documentation preview 📚: https://bsb-test--223.org.readthedocs.build/en/223/

drodarie · 2026-03-13T09:18:35Z

@Helveg I tried to get opentelemetry logs from the bsb reconstruction on HPC with the following command:
mpirun -n 48 opentelemetry-instrument --traces_exporter jsonlines bsb compile divergent_declive_column.yaml -v4 --clear
Unfortunately, the batch ran to the end but I did not get any jsonlines files and there is no feedback in the logs explaining why...

Helveg · 2026-03-13T09:37:12Z

Can you try some smaller debug examples first?

Simple single core bsb compile on the login node with an empty skeleton config
Multi core skeleton compile
...

If none of them work, I think I have a CINECA login, or can use yours but I keep forgetting how hahaha :D I can try to debug the code there then.

drodarie · 2026-03-13T10:03:31Z

Ok I have a small test. The following command produces a jsonlines file on my local PC but not on the CINECA login node:

# bla.yaml does not exist
opentelemetry-instrument --traces_exporter jsonlines bsb compile bla.yaml -v4

Should we look into differences between the installed Python libraries?

drodarie · 2026-03-13T10:12:22Z

Ok it seems I was lacking some opentelemetry libraries despite the code running without throwing any errors.
Now it produced files on CINECA, I will try again.
@Helveg Could you confirm which libraries are necessary?

opentelemetry-api                        1.40.0
opentelemetry-distro                     0.61b0
opentelemetry-exporter-otlp              1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc   1.40.0
opentelemetry-exporter-otlp-proto-http   1.40.0
opentelemetry-instrumentation            0.61b0
opentelemetry-proto                      1.40.0
opentelemetry-sdk                        1.40.0
opentelemetry-semantic-conventions       0.61b0

Helveg · 2026-03-13T10:24:47Z

Let's just say all of them. It's hard to tell because the otel package ecosystem for Python specifically is a MESS. It's multiple monorepos on GitHub that all contain namespace packages for multiple namespaces 🥲 It's really hard to tell what comes from what package. But I'll fix that as part of the move to a bsb-otel package!

…bsb packages lint and format pass

Saving trace.get_tracer_provider() returns the proxy provider when no real provider is set. Restoring that into _TRACER_PROVIDER on exit makes ProxyTracerProvider.get_tracer recurse into itself. Snapshot the raw global instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Covers tracer registry idempotence, the explicit-version path that skips importlib.metadata lookup, and basic span context-manager behaviour. Keeps the package's CI test target from being a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…xture
- Skip the rank/bcast dance entirely when MPI size == 1; in serial mode it is observably equivalent to a plain start_as_current_span.
- When broadcasting, treat an invalid local SpanContext as a hard error. Rank 0 still bcasts None first so non-root ranks unblock and raise the same misconfiguration error instead of deadlocking.
- Drop the dead set_span_in_context call (its return value was discarded).
- OTelFixture: ignore blank trailing lines so the file-exporter readback works on Windows, where text-mode write doubles \r\n.
- Make tests use OTelFixture and add opentelemetry-sdk to the test deps so bsb-otel's tests can exercise the broadcast path under a real provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous hard-error stance broke OTel's "API works without SDK" contract. Now rank 0 broadcasts None when its local tracer produced an invalid context, and non-root ranks treat that signal as "no parent to share" and fall through to their own local start_as_current_span (also a no-op when there's no provider). No NonRecordingSpan(invalid_ctx) ever gets attached, so line 65's is_valid check stays honest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Locks in the OTel "API works without SDK" contract for BsbTracer: with no provider configured, trace() yields a non-recording span, and under MPI the broadcast path stays deadlock-free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The broadcast-on-no-parent design assumed every rank entered each trace() in lockstep. Rank-divergent callers (e.g. bsb-hdf5's per-rank file ops, serialized via MPILock rather than collectives) deadlock under that contract.

Introduce an MPI-communicator contextvar that BsbTracer broadcasts on:
- use_communicator(comm) overrides the contextual broadcast comm.
- local_tracing() is the COMM_SELF shorthand; spans in the block stay per-rank (size==1 in the contextual comm → no broadcast). A pre-existing broadcast parent set up above the block is still inherited, so cross-rank correlation is preserved through that parent.
- Without a configured SDK provider, skip the broadcast branch entirely: there's nothing meaningful to share, and forcing a collective on every trace() call would lock up rank-divergent code. This restores OTel's "API works without SDK" contract.

mpi.rank/mpi.size span attributes still report the global communicator's rank/size; only bsb-otel's internal bcast follows the contextual comm.

Includes an env-gated (BSB_OTEL_TRACE=1) _btrace diagnostic helper used to track down this very class of deadlock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nd ) uv warned: Failed to parse environment file at position 97 "/.../site-packages:". uv's --env-file does not perform shell-style variable substitution, so the suffix stays literal AND makes the whole line fail to parse, leaving PYTHONPATH unset in the child. Drop the suffix completely so the file is plain PYTHONPATH=/nest/path with no substitution. Trade-off: bsb-nest test processes will not see the workflow's OTel sitecustomize entry, which is fine -- bsb-nest is not where the otel-jsonlines diagnostics live.

…ort nest uv 0.11 --env-file does not override variables already set in the parent shell, so the outer PYTHONPATH (which we set to expose the OTel sitecustomize) wins over nest_vars.sh PYTHONPATH and the child python loses access to the nest module. Just put NEST python lib in the outer PYTHONPATH alongside sitecustomize: harmless for other test venvs (they have no reason to import nest) and required for bsb-nest.

The wrap closes its span cleanly even when the test fails because unittest catches the exception and records it on the TestResult. Intercept addError/addFailure on the result hand-off to set ERROR status + record_exception on the active span. With the BatchSpanProcessor exporting periodically, post-mortem we can now identify which tests failed in the jsonlines artifact without relying on truncated buffered stdout.

test_mpi_broadcast_parent_chain assumes _tracer.trace(broadcast_root) runs with no active parent so BsbTracer takes the bcast branch and the broadcast root has the same span_id on every rank. With wrap_tests_with_traces every test inherits the test wrapper span, which turns broadcast_root into a regular per-rank child. Attach a fresh OTel context with INVALID_SPAN around the inner fixture so the bcast branch fires.

…ting Previous version chained addError/addFailure wrappers across tests: each test pulls the current method as _orig and re-wraps, so by the Nth test the call goes through N nested wrappers (one per prior test), each touching an already-ended span. Restore the originals in a finally block so each test's wrap is independent. Pre-existing test_label parallel flake was actually surfaced by this accumulation - reverting to per-test wrap should let it pass again.

…log failures outer_span on non-root ranks is a NonRecordingSpan, so set_status/ record_exception on it is silent. Emit a short-lived child span through the underlying SDK tracer instead so the failure is visible on every rank in the jsonlines artifact.

…ommand Same pattern as test_mpi_broadcast_parent_chain: handle_command internally opens a cli span via BsbTracer.trace, expects no parent so bcast branch fires (rank 0 recording, non-root NonRecording). wrap_tests_with_traces turns it into a regular per-rank child instead, so non-root ranks also record. Attach an INVALID_SPAN context around handle_command to restore the no-parent assumption.

codecov · 2026-05-16T15:35:14Z

Codecov Report

❌ Patch coverage is 50.42254% with 176 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.90%. Comparing base (1f47773) to head (ba6b81e).

Files with missing lines	Patch %	Lines
packages/bsb-otel/bsb_otel/testing.py	36.09%	82 Missing and 3 partials ⚠️
packages/bsb-otel/bsb_otel/exporters.py	0.00%	29 Missing ⚠️
packages/bsb-otel/bsb_otel/tracer.py	71.60%	19 Missing and 4 partials ⚠️
packages/bsb-otel/bsb_otel/_distro.py	0.00%	19 Missing ⚠️
packages/bsb-hdf5/bsb_hdf5/resource.py	48.00%	13 Missing ⚠️
packages/bsb-core/bsb/storage/fs/file_store.py	75.00%	4 Missing ⚠️
packages/bsb-core/bsb/profiling.py	77.77%	1 Missing and 1 partial ⚠️
packages/bsb-otel/bsb_otel/_otel_env.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #223      +/-   ##
==========================================
+ Coverage   81.06%   82.90%   +1.83%     
==========================================
  Files         180      156      -24     
  Lines       17815    15385    -2430     
  Branches     2140     1794     -346     
==========================================
- Hits        14442    12755    -1687     
+ Misses       2821     2202     -619     
+ Partials      552      428     -124

Print len(ps), load_ids, get_unique_labels, mask shape, and label mask right before and after the all-False label call so the next CI flake gives concrete data on what state the failing rank had.

…bel diagnostic Python 3.12 build hung 9+ minutes at shutdown after all 477 heartbeats fired; orphan mpiexec/coverage at cancel. BatchSpanProcessor export daemon thread interacting with mpi4py atexit MPI_Finalize is the suspected cause. SimpleSpanProcessor has no daemon thread - each span is exported synchronously at end_of_span, so shutdown has nothing to wait for. Revert test_label inline diagnostic prints since the failure pattern moved off test_label (it was a pre-existing flake that surfaced briefly when the deadlock fix unblocked bsb-hdf5 PARALLEL).

…arallel tests bsb_test engines test_label: ps.label() and get_unique_labels() use mpilock to serialise writes/reads but do not barrier across ranks at op end. A fast rank can race ahead into its second ps.label() call before a slower rank reaches its get_unique_labels() read, so the slower rank observes labels that should not exist yet. Barrier between the calls. bsb-hdf5 test_mr TestHandcrafted: tearDownClass deletes test files on rank 0 only. Without a leading barrier, rank 0 can finish all its tests and start deleting while another rank is still in its last test method, producing FileNotFoundError on the in-flight test. Barrier at the start of tearDownClass aligns ranks before deletion.

… shell Adds opentelemetry_distro + opentelemetry_configurator entry points named bsb_jsonlines. The distro pins jsonlines as the traces exporter and selects our companion configurator, which builds a TracerProvider backed by SimpleSpanProcessor (avoiding the BatchSpanProcessor / mpi4py atexit deadlock seen on Python 3.12).

CI Run unittests step drops the sitecustomize heredoc, outer PYTHONPATH, NEST_LIB injection, and BSB_OTEL_JSONLINES env var. It now opts into our distro the canonical OTel way: wrap nx with opentelemetry-instrument and set OTEL_PYTHON_DISTRO=bsb_jsonlines. __init__ stays a docstring DMZ, no implicit auto-init on bsb_otel import.

Local equivalent for any developer:

OTEL_PYTHON_DISTRO=bsb_jsonlines OTEL_EXPORTER_JSONLINES_PATH=./traces.jsonlines uv run --project packages/bsb-core opentelemetry-instrument ./nx run-many -t test

…ecord Two cheap guards added to the auto-instrumented method wrapper:
- If opentelemetry has no TracerProvider configured, bypass the trace context manager entirely and call the original method.
- Even when a provider IS set, defer the heavy json.dumps(self.__tree__()) attribute until after the span exists and gate it on span.is_recording(). Non-recording spans (no SDK on this rank, NonRecordingSpan, local_tracing()) skip the serialization.

OTEL_PYTHON_CONFIGURATOR is not re-exported as a constant from opentelemetry.environment_variables in the pinned opentelemetry-api version, so the distro module failed to import under opentelemetry-instrument auto_instrumentation _load_distro and broke every Python subprocess (bsb-nest test loader was the visible symptom).

uv --env-file (dotenvy) lets the parent PYTHONPATH win, and opentelemetry-instrument injects its own PYTHONPATH before exec, so nest_vars.sh could never override it. Export NEST's site-packages into the outer PYTHONPATH in the workflow so it survives both layers. Drop the now-redundant sed patch from install-nest. Also correct the misleading "older version" comment in the bsb_jsonlines distro — OTEL_PYTHON_CONFIGURATOR isn't a constant in any opentelemetry-api version, including main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t path The early-return when _TRACER_PROVIDER is None already short-circuits the entire wrapper, so an additional is_recording() check inside the span only adds branching without saving any work in the no-SDK case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the _btrace/_shtrace stderr-print helpers gated on $BSB_OTEL_TRACE — they were investigation scaffolding for the parallel-deadlock work and the OTel spans themselves now carry the same information. Removes the helpers and every call site in BsbTracer.trace and _SpannedHandle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the specific BatchSpanProcessor reference — the heartbeat span is exported by whichever processor is configured (Simple or Batch), and the testing wiring no longer picks a single one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l meta Engine.__init__ broadcasts rank-0's root to all ranks, so an fs store is implicitly shared across MPI ranks. FileStore.store() was two raw file writes back-to-back: opening the meta path in "w" mode truncates it before json.dump runs, so a concurrent reader iterating os.listdir(files/) could read the empty meta and JSONDecodeError. Route both writes through a tmp+os.replace helper and write meta first, so the meta file is fully on disk before the content file appears in listdir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous atomic-write change staged the tmp file in the same dir as the final path. For the content file that meant a tmpXXXXX entry appeared briefly in files/, which os.listdir(files/) would return — and _path_to_id then crashes trying to base64-decode tmpXXXXX. Stage from the engine root instead; same filesystem so os.replace stays atomic, and files/ only ever contains valid id entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Clarify the two rules the atomic-write fix relies on and put the discovery contract next to `all()` so a future reader iterating `file_meta/` is forced to think about the write order before changing it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the flip-the-order escape hatch from the comment; just say what the rule is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drodarie

Maybe consider improving the code coverage of the unittests for bsb-otel.
The last report from codecov seems to indicate that a lot of the new code is not tested.

drodarie · 2026-05-20T12:34:23Z

+          BSB_QUIET: true
+        run: |
+          cd packages/bsb-otel
+          uv run python -m unittest -v \


Why not use coverage so that these tests contribute to the overall package coverage?

Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>

drodarie · 2026-05-25T10:00:34Z

Could you please also add some explanation in the bsb docs regarding jsonlines logs and how to load them afterwards too?

Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>

Robin De Schepper added 9 commits November 26, 2025 02:07

fixed incorrect all attribute

ee0e643

added jsonlines file exporter

8e2b519

added telemetry test case with fixture to check if spans are recorded

91847c7

wip to start make sure that the unittests setup OTLP export

b3cff25

chore: placed all at end of file

a2c279a

feat(test): auto-wrapped test cases in otel trace

6d6d3b1

Merge branch 'main' into feature/test-telemetry

48f9908

added otel instrumentation entry points integration

a114d59

support adding random prefix to path for sep files per rank under mpi

64c227e

Helveg marked this pull request as draft February 25, 2026 10:55

Robin De Schepper and others added 15 commits April 14, 2026 22:58

created bsb otel

ff59327

instrumented bsb hdf5

5f3ebdf

fixed stray occurences of placement.particle

2759496

moved more otel code to bsb-otel

860803b

propagate context during pool.schedule

87c33bb

finished otel with BsbTracer, hdf5 and mpilok instrumentation

1a51355

pass pattern along to custom test discovery

e1c59bb

chore: Merge branch 'origin/main' into feature/test-telemetry

f2da033

fix: Make bsb-otel configuration files uniform with respect to other …

c69343a

…bsb packages lint and format pass

Robin De Schepper added 8 commits May 16, 2026 16:02

chore: trigger CI rerun to verify bsb-hdf5 test flakiness

a307ba4

Robin De Schepper and others added 14 commits May 16, 2026 17:46

investigate(test): dump state in test_label parallel flake

7bdc1b1

docs(storage/fs): tighten the discovery-source-of-truth note

f7019c5

Helveg marked this pull request as ready for review May 16, 2026 19:33

Helveg requested a review from drodarie May 16, 2026 19:33

drodarie requested changes May 20, 2026

Apply suggestions from code review

9e6a567

Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>

drodarie reviewed May 23, 2026

Comment thread devtools/editable-install.txt Outdated

Update devtools/editable-install.txt

ba6b81e

Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>
