[Evo2] Switch from static to faster/more modern dynamic inference engine#1597
jstjohn wants to merge 10 commits into
📝 Walkthrough
This PR migrates Evo2's inference from Megatron Core's static inference engine to a native dynamic-inference path. Hyena recurrent state (FIR/IIR filters) is packed into MCore's Mamba slots with paged-KV attention, per-layer CUDA graphs are supported, and generation uses an explicit sampler with temperature/top-k/top-p. CLI arguments and tests are updated accordingly.
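The walkthrough mentions an explicit sampler with temperature/top-k/top-p. As a rough illustration of what such a sampler typically does, here is a minimal pure-Python sketch. The function name `sample_token`, its defaults, and the filter order (temperature, then top-k, then top-p) are assumptions made for this example and are not taken from the PR's code.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=0.0, rng=None):
    """Sample one token id from raw logits (hypothetical helper, not the PR's sampler)."""
    rng = rng or random.Random()
    if temperature <= 0.0:
        # Assumption: treat a non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if 0.0 < top_p < 1.0:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept

    # Draw one survivor; random.choices renormalises the weights itself.
    survivors = [probs[i] for i in order]
    return rng.choices(order, weights=survivors, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is the property the review comment below about seed handling is concerned with.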
🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed (1 inconclusive)
@coderabbitai review
✅ Action performed: Review finished.
🧹 Nitpick comments (1)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (1)
1236-1238: 💤 Low value. Consider threading seed through components instead of tokenizer attribute.
Setting `_evo2_seed` on the tokenizer object is unconventional. A cleaner approach would be to add the seed to `Evo2NativeDynamicComponents` or pass it as a parameter to `generate()`. However, this works correctly and the attribute name clearly indicates it's Evo2-specific.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py` around lines 1236 - 1238, The code currently mutates components.tokenizer._evo2_seed to thread the resolved seed (random_seed) into the native sampler RNG; instead, add a dedicated field on the components container (e.g., Evo2NativeDynamicComponents.seed or evo2_seed) or extend the generate() signature to accept a seed parameter and pass random_seed through that API, then update usages in _generate_native_dynamic to read from the new components field or the generate() parameter instead of tokenizer._evo2_seed; keep the attribute name evo2_seed to preserve clarity and remove the tokenizer mutation.
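The suggestion above can be sketched roughly as follows. `DynamicComponents` is a hypothetical stand-in for `Evo2NativeDynamicComponents`, and the toy `generate()` only draws random token ids; neither reflects the real signatures in `infer.py`. The point is only that the seed travels through an explicit field or argument rather than a private attribute on the tokenizer.

```python
import random
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class DynamicComponents:
    """Hypothetical stand-in for the components container named in the review."""

    tokenizer: Any
    evo2_seed: Optional[int] = None  # explicit field instead of tokenizer._evo2_seed


def generate(
    components: DynamicComponents,
    num_tokens: int,
    vocab_size: int,
    seed: Optional[int] = None,
) -> List[int]:
    """Toy generator: an explicit seed argument wins, else the components field."""
    resolved = seed if seed is not None else components.evo2_seed
    rng = random.Random(resolved)  # resolved=None falls back to OS entropy
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]
```

With this shape the tokenizer is never mutated, and two calls that resolve to the same seed return the same token ids.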
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 72911025-adde-46a4-bb89-73f7badfbc0f
📒 Files selected for processing (11)
- bionemo-recipes/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/evo2_provider.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_block.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_layer.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/text_generation_controller.py
- bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
- bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
💤 Files with no reviewable changes (1)
- bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/text_generation_controller.py
… evo2 Signed-off-by: John St John <jstjohn@nvidia.com>
dd5fa3e to b0cfaed
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
/ok to test ac4d138
…to jstjohn/evo2_dynamic_inference_engine
Signed-off-by: John St. John <jstjohn@nvidia.com>
farhadrgh left a comment
The benchmark numbers don't disclose the kernel choice, so I am assuming these are without the subq-ops? Also, the tests don't specifically exercise CUDA graph + subq-ops prefill together.
I suggest modifying the existing test_subquadratic_ops_matches_baseline in test_infer.py into a 2×2 parametrization that adds the CUDA-graph toggle:
@pytest.mark.parametrize("cuda_graph_impl", ["none", "local"])
@pytest.mark.parametrize("use_subquadratic_ops", [False, True])
def test_subquadratic_ops_with_cuda_graph_matches_baseline(
    mbridge_checkpoint_path, tmp_path, use_subquadratic_ops, cuda_graph_impl
):

…fp8 inference

Chunked prefill (Hyena):
- step_fir/step_iir accept an L-token block and thread the FIR ring / real-pole IIR modal state in one vectorized pass (equivalent to looping the single-token step); L==1 keeps the bit-identical, CUDA-graphed decode path.
- Fix step_iir block dtype crash: cast residues/D to fp32 (the recurrence runs in fp32 to match the persistent iir_state buffer; einsum needs matching dtypes).

FP8 inference (was always bf16 at generation; the "fp8" tests only converted in fp8):
- setup_inference_engine now runs inference at the chosen precision and, for full fp8 (fp8 on all TE linears), calls mcore prepare_model_for_fp8_inference so each linear pads the token dim to the fp8 alignment -> single-token decode no longer fails assert_dim_for_fp8_exec. No-op for bf16 and vortex-style fp8.

Tests:
- test_batch_generate_mbridge exercises bf16 / vortex fp8 / full fp8 at inference with the second-half accuracy readout (full fp8 gets its own golden values).
- Add full-fp8 with/without chunked-prefill test; add a bf16 IIR block unit test (catches the dtype mismatch on CPU) plus FIR/IIR block==per-token-loop unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
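The commit message above states that the block versions of step_fir/step_iir are equivalent to looping the single-token step. The toy below illustrates that property for a scalar signal in pure Python. It is a sketch under stated assumptions, not the PR's implementation: the function names, the ring layout (oldest input first), and the IIR form h_k <- p_k * h_k + x with y = sum_k r_k * h_k + d * x are chosen here for illustration only.

```python
def fir_step(ring, weights, x):
    """Single-token FIR step; ring holds the previous len(weights)-1 inputs, oldest first."""
    window = ring + [x]  # oldest ... newest
    # weights[0] multiplies the newest input, weights[j] the input j steps back.
    y = sum(w * v for w, v in zip(weights, reversed(window)))
    return y, window[1:]  # drop the oldest input to form the new ring


def fir_block(ring, weights, xs):
    """Block FIR step: consume all of xs at once, return outputs and the new ring."""
    history = ring + list(xs)
    k = len(weights)
    ys = [
        sum(weights[j] * history[t + k - 1 - j] for j in range(k))
        for t in range(len(xs))
    ]
    return ys, history[len(xs):]


def iir_step(state, poles, residues, d, x):
    """Single-token real-pole IIR step (assumed form, see lead-in)."""
    new_state = [p * h + x for p, h in zip(poles, state)]
    y = sum(r * h for r, h in zip(residues, new_state)) + d * x
    return y, new_state


def iir_block(state, poles, residues, d, xs):
    """Block IIR step via the closed form h[t] = p^(t+1) * h0 + sum_s p^(t-s) * x[s]."""
    n = len(xs)
    ys = []
    for t in range(n):
        h_t = [
            p ** (t + 1) * h + sum(p ** (t - s) * xs[s] for s in range(t + 1))
            for p, h in zip(poles, state)
        ]
        ys.append(sum(r * h for r, h in zip(residues, h_t)) + d * xs[t])
    final = [
        p ** n * h + sum(p ** (n - 1 - s) * xs[s] for s in range(n))
        for p, h in zip(poles, state)
    ]
    return ys, final
```

Feeding the same inputs through the single-token functions in a loop and through the block functions once should give the same outputs and final state, up to floating-point rounding in the IIR case. That is the shape of check the "block==per-token-loop" unit tests in the commit message describe.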
142aae7 to 884b2f4
Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Description
Benchmarked with a 1024-token prompt and 1024 requested generated tokens. All runs verified the output JSONL reported 1024 prompt tokens and 1024 completion tokens. Comparison performed on 2xA6000 GPUs at bf16 precision.
origin/main static engine
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.
For more details, see CONTRIBUTING
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
`/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
New Features
Improvements
Documentation