Skip to content

feat: Quasi-mature OpenTelemetry integration#223

Open
Helveg wants to merge 84 commits into
mainfrom
feature/test-telemetry
Open

feat: Quasi-mature OpenTelemetry integration#223
Helveg wants to merge 84 commits into
mainfrom
feature/test-telemetry

Conversation

@Helveg
Copy link
Copy Markdown
Contributor

@Helveg Helveg commented Feb 25, 2026

Features

  • bsb-otel package: MPI-aware tracer, jsonlines + OTLP exporters, unittest wrapper that emits a span per test
  • Auto-instrument @config.node methods and CLI command handlers via bsb.profiling
  • CI registers a bsb_jsonlines OTel distro/configurator entry point; per-rank trace artifacts uploaded on every build
  • Unittest FAIL/ERROR outcomes recorded on the test span so failures surface on every rank
  • New hpc-cineca Claude Code skill capturing the SLURM/MPI workflow

Flakes / races fixed

  • JSONDecodeError: Expecting value: line 1 column 1 (char 0) in fs FileStore — atomic tmp+rename, meta lands before content
  • test_cache_survival parallel deadlock — BsbTracer's collective bcast no longer fires from asymmetric contexts (@timeout daemon threads and pool worker dispatchers had lost their parent span)
  • Python 3.12 + mpi4py atexit shutdown hang on OTel export — default test processor switched to SimpleSpanProcessor
  • SLURM PMI2 error 14 / double MPI_Init under opentelemetry-instrument — bsb-otel restructured as an entry-point DMZ so the two-phase startup doesn't drag bsb.services in pre-exec
  • bsb-nest PYTHONPATH lost under opentelemetry-instrument + uv --env-file — NEST site-packages exported in the workflow
  • Misc parallel races (test_label rank ordering, hdf5 tearDownClass cleanup) — MPI barriers at op boundaries

📚 Documentation preview 📚: https://bsb-hdf5--223.org.readthedocs.build/en/223/


📚 Documentation preview 📚: https://bsb-core--223.org.readthedocs.build/en/223/


📚 Documentation preview 📚: https://nrn-patch--223.org.readthedocs.build/en/223/


📚 Documentation preview 📚: https://bsb-json--223.org.readthedocs.build/en/223/


📚 Documentation preview 📚: https://bsb-test--223.org.readthedocs.build/en/223/

@Helveg Helveg marked this pull request as draft February 25, 2026 10:55
@drodarie
Copy link
Copy Markdown
Contributor

@Helveg I tried to get opentelemetry logs from the bsb reconstruction on HPC with the following command:
mpirun -n 48 opentelemetry-instrument --traces_exporter jsonlines bsb compile divergent_declive_column.yaml -v4 --clear
Unfortunately, the batch ran to the end but I did not get any jsonlines files and there is no feedback in the logs explaining why...

@Helveg
Copy link
Copy Markdown
Contributor Author

Helveg commented Mar 13, 2026

Can you try some smaller debug examples first?

  • Simple single core bsb compile on the login node with an empty skeleton config
  • Multi core skeleton compile
  • ...

If none of them work, I think I have a CINECA login, or can use yours but I keep forgetting how hahaha :D I can try to debug the code there then.

@drodarie
Copy link
Copy Markdown
Contributor

Ok I have a small test. The following command produces a jsonlines` file on my local PC but not on login node of cineca:

# bla.yaml does not exist
opentelemetry-instrument --traces_exporter jsonlines bsb compile bla.yaml -v4  

Should we look into python libs differences?

@drodarie
Copy link
Copy Markdown
Contributor

Ok it seems I was lacking some opentelemetry libraries despite the code running without throwing any errors.
Now it produced files on CINECA, I will try again.
@Helveg Could you confirm which libraries are necessary?

opentelemetry-api                        1.40.0
opentelemetry-distro                     0.61b0
opentelemetry-exporter-otlp              1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc   1.40.0
opentelemetry-exporter-otlp-proto-http   1.40.0
opentelemetry-instrumentation            0.61b0
opentelemetry-proto                      1.40.0
opentelemetry-sdk                        1.40.0
opentelemetry-semantic-conventions       0.61b0
``

@Helveg
Copy link
Copy Markdown
Contributor Author

Helveg commented Mar 13, 2026

let's just say all of them. it's hard to tell because the otel package ecosystem for python specifically is a MESS.it's multiple monorepos on github,that all contain names pace packages for multiple names paces 🥲 it's really hard to tell what comes from what package. but I'll fix that as part of the move to a bsb-otel package!

Robin De Schepper and others added 15 commits April 14, 2026 22:58
Saving trace.get_tracer_provider() returns the proxy provider when no real
provider is set. Restoring that into _TRACER_PROVIDER on exit makes
ProxyTracerProvider.get_tracer recurse into itself. Snapshot the raw global
instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers tracer registry idempotence, the explicit-version path that skips
importlib.metadata lookup, and basic span context-manager behaviour. Keeps
the package's CI test target from being a no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xture

- Skip the rank/bcast dance entirely when MPI size == 1; in serial mode it
  is observably equivalent to a plain start_as_current_span.
- When broadcasting, treat an invalid local SpanContext as a hard error.
  Rank 0 still bcasts None first so non-root ranks unblock and raise the
  same misconfiguration error instead of deadlocking.
- Drop the dead set_span_in_context call (its return value was discarded).
- OTelFixture: ignore blank trailing lines so the file-exporter readback
  works on Windows, where text-mode write doubles \r\n.
- Make tests use OTelFixture and add opentelemetry-sdk to the test deps so
  bsb-otel's tests can exercise the broadcast path under a real provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous hard-error stance broke OTel's "API works without SDK"
contract. Now rank 0 broadcasts None when its local tracer produced an
invalid context, and non-root ranks treat that signal as "no parent to
share" and fall through to their own local start_as_current_span (also a
no-op when there's no provider). No NonRecordingSpan(invalid_ctx) ever
gets attached, so line 65's is_valid check stays honest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the OTel "API works without SDK" contract for BsbTracer: with no
provider configured, trace() yields a non-recording span, and under MPI
the broadcast path stays deadlock-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The broadcast-on-no-parent design assumed every rank entered each
trace() in lockstep. Rank-divergent callers (e.g. bsb-hdf5's per-rank
file ops, serialized via MPILock rather than collectives) deadlock under
that contract.

Introduce an MPI-communicator contextvar that BsbTracer broadcasts on:

- use_communicator(comm) overrides the contextual broadcast comm.
- local_tracing() is the COMM_SELF shorthand; spans in the block stay
  per-rank (size==1 in the contextual comm → no broadcast). A
  pre-existing broadcast parent set up above the block is still
  inherited, so cross-rank correlation is preserved through that parent.
- Without a configured SDK provider, skip the broadcast branch entirely
  — there's nothing meaningful to share, and forcing a collective on
  every trace() call would lock up rank-divergent code. This restores
  OTel's "API works without SDK" contract.

mpi.rank/mpi.size span attributes still report the global communicator's
rank/size; only bsb-otel's internal bcast follows the contextual comm.

Includes an env-gated (BSB_OTEL_TRACE=1) _btrace diagnostic helper used
to track down this very class of deadlock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Robin De Schepper added 8 commits May 16, 2026 16:02
…nd )

uv warned: Failed to parse environment file at position 97
"/.../site-packages:".  uv s --env-file does not perform
shell-style variable substitution, so  stays literal AND
makes the whole line fail to parse, leaving PYTHONPATH unset in the
child.  Drop the suffix completely so the file is plain
PYTHONPATH=/nest/path with no substitution.  Trade-off: bsb-nest test
processes will not see the workflow s OTel sitecustomize entry, which
is fine -- bsb-nest is not where the otel-jsonlines diagnostics live.
…ort nest

uv 0.11 --env-file does not override variables already set in the
parent shell, so the outer PYTHONPATH (which we set to expose the OTel
sitecustomize) wins over nest_vars.sh PYTHONPATH and the child python
loses access to the nest module.  Just put NEST python lib in the outer
PYTHONPATH alongside sitecustomize: harmless for other test venvs (they
have no reason to import nest) and required for bsb-nest.
The wrap closes its span cleanly even when the test fails because
unittest catches the exception and records it on the TestResult.
Intercept addError/addFailure on the result hand-off to set ERROR
status + record_exception on the active span.  With the BatchSpanProcessor
exporting periodically, post-mortem we can now identify which tests
failed in the jsonlines artifact without relying on truncated buffered
stdout.
test_mpi_broadcast_parent_chain assumes _tracer.trace(broadcast_root)
runs with no active parent so BsbTracer takes the bcast branch and the
broadcast root has the same span_id on every rank.  With
wrap_tests_with_traces every test inherits the test wrapper span, which
turns broadcast_root into a regular per-rank child.  Attach a fresh OTel
context with INVALID_SPAN around the inner fixture so the bcast branch
fires.
…ting

Previous version chained addError/addFailure wrappers across tests: each
test pulls the current method as _orig and re-wraps, so by the Nth test
the call goes through N nested wrappers (one per prior test), each
touching an already-ended span.  Restore the originals in a finally
block so each tests wrap is independent.  Pre-existing test_label
parallel flake was actually surfaced by this accumulation - reverting
to per-test wrap should let it pass again.
…log failures

outer_span on non-root ranks is a NonRecordingSpan, so set_status/
record_exception on it is silent.  Emit a short-lived child span
through the underlying SDK tracer instead so the failure is visible
on every rank in the jsonlines artifact.
…ommand

Same pattern as test_mpi_broadcast_parent_chain: handle_command internally
opens a cli span via BsbTracer.trace, expects no parent so bcast branch
fires (rank 0 recording, non-root NonRecording).  wrap_tests_with_traces
turns it into a regular per-rank child instead, so non-root ranks also
record.  Attach an INVALID_SPAN context around handle_command to restore
the no-parent assumption.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 16, 2026

Codecov Report

❌ Patch coverage is 50.42254% with 176 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.90%. Comparing base (1f47773) to head (ba6b81e).

Files with missing lines Patch % Lines
packages/bsb-otel/bsb_otel/testing.py 36.09% 82 Missing and 3 partials ⚠️
packages/bsb-otel/bsb_otel/exporters.py 0.00% 29 Missing ⚠️
packages/bsb-otel/bsb_otel/tracer.py 71.60% 19 Missing and 4 partials ⚠️
packages/bsb-otel/bsb_otel/_distro.py 0.00% 19 Missing ⚠️
packages/bsb-hdf5/bsb_hdf5/resource.py 48.00% 13 Missing ⚠️
packages/bsb-core/bsb/storage/fs/file_store.py 75.00% 4 Missing ⚠️
packages/bsb-core/bsb/profiling.py 77.77% 1 Missing and 1 partial ⚠️
packages/bsb-otel/bsb_otel/_otel_env.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #223      +/-   ##
==========================================
+ Coverage   81.06%   82.90%   +1.83%     
==========================================
  Files         180      156      -24     
  Lines       17815    15385    -2430     
  Branches     2140     1794     -346     
==========================================
- Hits        14442    12755    -1687     
+ Misses       2821     2202     -619     
+ Partials      552      428     -124     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Robin De Schepper and others added 14 commits May 16, 2026 17:46
Print len(ps), load_ids, get_unique_labels, mask shape, and label mask
right before and after the all-False label call so the next CI flake
gives concrete data on what state the failing rank had.
…bel diagnostic

Python 3.12 build hung 9+ minutes at shutdown after all 477 heartbeats
fired; orphan mpiexec/coverage at cancel.  BatchSpanProcessor export
daemon thread interacting with mpi4py atexit MPI_Finalize is the
suspected cause.  SimpleSpanProcessor has no daemon thread - each span
is exported synchronously at end_of_span, so shutdown has nothing to
wait for.

Revert test_label inline diagnostic prints since the failure pattern
moved off test_label (it was a pre-existing flake that surfaced briefly
when the deadlock fix unblocked bsb-hdf5 PARALLEL).
…arallel tests

bsb_test engines test_label: ps.label() and get_unique_labels() use
mpilock to serialise writes/reads but do not barrier across ranks at op
end.  A fast rank can race ahead into its second ps.label() call before
a slower rank reaches its get_unique_labels() read, so the slower rank
observes labels that should not exist yet.  Barrier between the calls.

bsb-hdf5 test_mr TestHandcrafted: tearDownClass deletes test files on
rank 0 only.  Without a leading barrier, rank 0 can finish all its
tests and start deleting while another rank is still in its last test
method, producing FileNotFoundError on the in-flight test.  Barrier at
the start of tearDownClass aligns ranks before deletion.
… shell

Adds opentelemetry_distro + opentelemetry_configurator entry points
named bsb_jsonlines. The distro pins jsonlines as the traces exporter
and selects our companion configurator, which builds a TracerProvider
backed by SimpleSpanProcessor (avoiding the BatchSpanProcessor /
mpi4py atexit deadlock seen on Python 3.12).

CI Run unittests step drops the sitecustomize heredoc, outer
PYTHONPATH, NEST_LIB injection, and BSB_OTEL_JSONLINES env var. It now
opts into our distro the canonical OTel way: wrap nx with
opentelemetry-instrument and set OTEL_PYTHON_DISTRO=bsb_jsonlines.
__init__ stays a docstring DMZ, no implicit auto-init on bsb_otel
import.

Local equivalent for any developer:
  OTEL_PYTHON_DISTRO=bsb_jsonlines OTEL_EXPORTER_JSONLINES_PATH=./traces.jsonlines     uv run --project packages/bsb-core opentelemetry-instrument ./nx run-many -t test
…ecord

Two cheap guards added to the auto-instrumented method wrapper:
- If opentelemetry has no TracerProvider configured, bypass the trace
  context manager entirely and call the original method.
- Even when a provider IS set, defer the heavy
  json.dumps(self.__tree__()) attribute until after the span exists and
  gate it on span.is_recording().  Non-recording spans (no SDK on this
  rank, NonRecordingSpan, local_tracing()) skip the serialization.
OTEL_PYTHON_CONFIGURATOR is not re-exported as a constant from
opentelemetry.environment_variables in the pinned opentelemetry-api
version, so the distro module failed to import under
opentelemetry-instrument auto_instrumentation _load_distro and broke
every Python subprocess (bsb-nest test loader was the visible
symptom).
uv --env-file (dotenvy) lets the parent PYTHONPATH win, and
opentelemetry-instrument injects its own PYTHONPATH before exec, so
nest_vars.sh could never override it. Export NEST's site-packages
into the outer PYTHONPATH in the workflow so it survives both
layers. Drop the now-redundant sed patch from install-nest. Also
correct the misleading "older version" comment in the bsb_jsonlines
distro — OTEL_PYTHON_CONFIGURATOR isn't a constant in any
opentelemetry-api version, including main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t path

The early-return when _TRACER_PROVIDER is None already short-circuits
the entire wrapper, so an additional is_recording() check inside the
span only adds branching without saving any work in the no-SDK case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the _btrace/_shtrace stderr-print helpers gated on
$BSB_OTEL_TRACE — they were investigation scaffolding for the
parallel-deadlock work and the OTel spans themselves now carry the
same information. Removes the helpers and every call site in
BsbTracer.trace and _SpannedHandle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the specific BatchSpanProcessor reference — the heartbeat span
is exported by whichever processor is configured (Simple or Batch),
and the testing wiring no longer picks a single one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l meta

Engine.__init__ broadcasts rank-0's root to all ranks, so an fs store
is implicitly shared across MPI ranks. FileStore.store() was two raw
file writes back-to-back: opening the meta path in "w" mode truncates
it before json.dump runs, so a concurrent reader iterating
os.listdir(files/) could read the empty meta and JSONDecodeError.

Route both writes through a tmp+os.replace helper and write meta
first, so the meta file is fully on disk before the content file
appears in listdir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous atomic-write change staged the tmp file in the same dir
as the final path. For the content file that meant a tmpXXXXX entry
appeared briefly in files/, which os.listdir(files/) would return —
and _path_to_id then crashes trying to base64-decode tmpXXXXX. Stage
from the engine root instead; same filesystem so os.replace stays
atomic, and files/ only ever contains valid id entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clarify the two rules the atomic-write fix relies on and put the
discovery contract next to `all()` so a future reader iterating
`file_meta/` is forced to think about the write order before changing
it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the flip-the-order escape hatch from the comment; just say what
the rule is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Helveg Helveg marked this pull request as ready for review May 16, 2026 19:33
@Helveg Helveg requested a review from drodarie May 16, 2026 19:33
Copy link
Copy Markdown
Contributor

@drodarie drodarie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider improving the code coverage of the unittests for bsb-otel.
The last report from codecov seems to indicate that a lot of the new code is not tested.

BSB_QUIET: true
run: |
cd packages/bsb-otel
uv run python -m unittest -v \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not using coverage so that these tests contribute to the overall package coverage?

Comment thread .github/workflows/build.yml
Comment thread packages/bsb-arbor/pyproject.toml
Comment thread packages/bsb-yaml/pyproject.toml
Comment thread packages/bsb-test/pyproject.toml
Comment thread packages/bsb-neuron/pyproject.toml
Comment thread packages/bsb-nest/pyproject.toml
Comment thread packages/bsb-json/pyproject.toml
Comment thread packages/bsb-hdf5/pyproject.toml
Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>
Comment thread devtools/editable-install.txt Outdated
@drodarie
Copy link
Copy Markdown
Contributor

Could you please also add some explanation in the bsb docs regarding jsonlines logs and how to load them afterwards too?

Co-authored-by: Dimitri RODARIE <d.rodarie@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants