
[codex] fix orchestrator routed experts memory retention #2623

Draft
samsja wants to merge 1 commit into main from fix/orchestrator-routed-experts-memory


samsja commented May 25, 2026

Summary

Fixes bounded, per-step retention of routed-experts memory in the orchestrator step loop.

The routed replay path copies each rollout's tokens["routed_experts"] payload into TrainingSample.routed_experts, but the original rollout sidecars stayed attached to train_rollouts until end-of-step cleanup. The results list from interleave_rollout(...) also kept packed samples alive after the orchestrator had already extracted train_examples.

This change:

  • clears routed-expert sidecars from rollout trajectories after conversion into training samples
  • deletes the intermediate results list once samples have been extracted
  • includes filter_df and timing_df in the explicit per-step cleanup before malloc_trim(0)
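The cleanup order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TrainingSample` here is a stand-in dataclass, and `rollout_to_sample`, `clear_routed_experts`, `trim_heap`, and `run_step` are hypothetical helper names. The nested list built inside `run_step` only imitates the shape of what `interleave_rollout(...)` returns, and `malloc_trim` is reached through `ctypes`, which assumes glibc on Linux.

```python
import ctypes
import gc
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrainingSample:
    # Stand-in for the real training sample; only the field relevant here.
    routed_experts: Optional[bytes] = None


def rollout_to_sample(rollout: Dict) -> TrainingSample:
    # Copy the sidecar payload so the sample no longer depends on the rollout.
    payload = rollout["tokens"].get("routed_experts")
    copied = bytes(payload) if payload is not None else None
    return TrainingSample(routed_experts=copied)


def clear_routed_experts(rollouts: List[Dict]) -> None:
    # Drop the original sidecars once they have been copied into samples.
    for rollout in rollouts:
        rollout["tokens"].pop("routed_experts", None)


def trim_heap() -> bool:
    # Ask glibc to return free heap pages to the OS; no-op elsewhere.
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)
        return True
    except (OSError, AttributeError):
        return False


def run_step(train_rollouts: List[Dict]) -> List[TrainingSample]:
    # Hypothetical stand-in for the intermediate results list.
    results = [[rollout_to_sample(r)] for r in train_rollouts]
    train_examples = [s for samples in results for s in samples]
    del results  # the intermediate list is no longer needed
    clear_routed_experts(train_rollouts)
    gc.collect()
    trim_heap()
    return train_examples
```

In this sketch the rollouts keep everything except the sidecar, so later per-step bookkeeping that still reads `train_rollouts` would be unaffected.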

This reduces per-step RSS retention and peak memory in router replay runs. It does not claim to fix every possible monotonic production leak; ZMQ backpressure and monitor futures remain separate things to inspect if RSS still slopes upward.

Validation

  • uv run pytest tests/unit/orchestrator/test_trajectories.py tests/unit/orchestrator/test_batch.py -q
  • uv run ruff check src/prime_rl/orchestrator/orchestrator.py src/prime_rl/orchestrator/trajectories.py
  • uv run ruff format --check src/prime_rl/orchestrator/orchestrator.py src/prime_rl/orchestrator/trajectories.py
  • synthetic RSS probe comparing pre-patch-like retention vs patched cleanup:
    • pre-patch-like retained about +53.3 MB after cleanup for the probe payload
    • patched cleanup returned to baseline (+0.0 MB mean over baseline)
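A probe in the same spirit can be sketched as below. Note the substitution: it uses `tracemalloc`, which tracks Python-level allocations, rather than process RSS, so it is easier to run deterministically but will not reproduce the numbers reported above. The function name `retained_bytes`, the payload sizes, and the rollout shape are all assumptions made for illustration.

```python
import tracemalloc


def retained_bytes(clear_sidecars: bool,
                   n_rollouts: int = 8,
                   payload_bytes: int = 256 * 1024) -> int:
    """Return traced bytes still held after the samples are released."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()

    # Hypothetical rollouts carrying a routed-experts sidecar each.
    rollouts = [
        {"tokens": {"routed_experts": bytearray(payload_bytes)}}
        for _ in range(n_rollouts)
    ]
    # Conversion copies each payload, as the routed replay path does.
    samples = [bytes(r["tokens"]["routed_experts"]) for r in rollouts]

    if clear_sidecars:
        # Patched behaviour: drop the originals right after conversion.
        for r in rollouts:
            r["tokens"].pop("routed_experts", None)

    # Simulate the samples having been handed off and released.
    del samples

    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline


if __name__ == "__main__":
    print("pre-patch-like retained bytes:", retained_bytes(False))
    print("patched retained bytes:", retained_bytes(True))
```

The rollouts stay alive until the function returns, so the difference between the two calls comes only from whether the sidecars were cleared.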
