[Evo2] Switch from static to faster/more modern dynamic inference engine#1597

Open
jstjohn wants to merge 10 commits into main from jstjohn/evo2_dynamic_inference_engine

Collaborator

@jstjohn jstjohn commented Jun 3, 2026

Description

  • New dynamic inference engine in evo2 with cudagraph support.

Benchmarked with a 1024-token prompt and 1024 requested generated tokens. All runs verified the output JSONL reported 1024 prompt tokens and 1024 completion tokens. Comparison performed on 2xA6000 GPUs at bf16 precision.

| Model | Parallelism | Prompt / Generation | origin/main static engine | Dynamic engine | Speedup | Tokens verified |
|---|---|---|---|---|---|---|
| Evo2 1B | TP=1 | 1024 / 1024 | 38.7 tok/s, 26.44s | 129.6 tok/s, 7.90s | 3.35x | 1024 prompt + 1024 completion |
| Evo2 7B | TP=2 | 1024 / 1024 | 28.4 tok/s, 36.07s | 62.2 tok/s, 16.47s | 2.19x | 1024 prompt + 1024 completion |
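
As a quick sanity check on the benchmark figures, the reported speedups follow from the throughput numbers, and the throughput numbers are consistent with 1024 generated tokens divided by wall time. The sketch below only re-derives those ratios from the values quoted above; the `runs` structure and the `summarize` helper are illustrative names, not part of the PR.

```python
# Re-derive the speedup column from the benchmark figures (values copied from above).
GENERATED_TOKENS = 1024

# model -> ((static tok/s, static seconds), (dynamic tok/s, dynamic seconds))
runs = {
    "Evo2 1B (TP=1)": ((38.7, 26.44), (129.6, 7.90)),
    "Evo2 7B (TP=2)": ((28.4, 36.07), (62.2, 16.47)),
}


def summarize(static, dynamic):
    """Return (throughput speedup, wall-time speedup, implied dynamic tok/s)."""
    (static_tps, static_s), (dynamic_tps, dynamic_s) = static, dynamic
    return (
        dynamic_tps / static_tps,
        static_s / dynamic_s,
        GENERATED_TOKENS / dynamic_s,
    )


for name, (static, dynamic) in runs.items():
    tps_x, time_x, implied = summarize(static, dynamic)
    print(
        f"{name}: {tps_x:.2f}x by throughput, {time_x:.2f}x by wall time, "
        f"{implied:.1f} tok/s implied by 1024 tokens / wall time"
    )
```

Both ways of computing the ratio agree with the Speedup column to two decimals, which suggests the tok/s figures count generated tokens only.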

Usage

TODO: Add code snippet

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Native dynamic inference engine for improved performance and memory efficiency
    • Chunked prefill and dynamic batching support for flexible inference control
  • Improvements

    • Optimized inference state management through in-place operations
    • CUDA graph acceleration for faster decode-time inference
  • Documentation

    • Updated fine-tuning tutorial with corrected CLI arguments for prediction commands

Contributor

coderabbitai Bot commented Jun 3, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4fdae5d3-a116-4240-a2a5-572159d7deeb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR migrates Evo2's inference from Megatron Core's static inference engine to a native dynamic-inference path. Hyena recurrent state (FIR/IIR filters) is packed into MCore's Mamba slots with paged-KV attention, per-layer CUDA graphs are supported, and generation uses an explicit sampler with temperature/top-k/top-p. CLI arguments and tests are updated accordingly.
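
To make the sampling step concrete, here is a minimal pure-Python sketch of temperature, top-k, and top-p (nucleus) filtering over a single logit vector. It is not the PR's `_sample_from_logits()`: the function name `sample_token`, its signature, and the convention that `temperature == 0` means greedy decoding are assumptions made for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=0.0, rng=None):
    """Sample one token id from raw logits (hypothetical helper, not the PR's API).

    temperature == 0 -> greedy argmax; top_k == 0 and top_p == 0 disable those filters.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()

    # Temperature-scaled softmax, shifted by the max logit for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p > 0.0:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # random.choices renormalizes the surviving weights internally.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]
```

Passing an explicitly seeded `random.Random` makes repeated runs reproducible, which is the property a determinism test would rely on.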

Changes

Evo2 Dynamic Inference Engine

| Layer | File(s) | Summary |
|---|---|---|
| Hyena Recurrent State Shapes & Layer Type Mapping | src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py, src/bionemo/evo2/models/megatron/hyena/hyena_block.py | HyenaMixerStateShapes dataclass defines conv/FIR and operator-specific SSM state layouts. HyenaStack maps Hyena layers to MAMBA symbols and aggregates per-request state shapes across all Hyena layers on the rank, with uniform conv shape and padded SSM shape. |
| In-Place Recurrent State Buffer Updates | src/bionemo/evo2/models/megatron/hyena/engine.py, src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py | FIR and IIR states are now persistent fp32 buffers with in-place operations: ring-buffer shifts via copy_(torch.roll(...)) and recurrence via mul_/add_. Cached filter state tensors are cast to fp32 before storage in inference context. |
| Dynamic Context State Binding & Packing Utilities | src/bionemo/evo2/models/evo2_provider.py | _PackedHyenaSlotStateDict adapters remap Hyena filter writes into live dynamic context Mamba state sub-slices. New helpers build MambaInferenceStateConfig, compute paged-KV buffer_size_gb mirroring mcore's hybrid arithmetic, and bind Hyena layer state views to dynamic context by installing per-bucket filter dicts and registering conv/SSM sub-slices. |
| Hyena Layer CUDA Graph Integration | src/bionemo/evo2/models/megatron/hyena/hyena_layer.py | HyenaLayer now extends GraphableMegatronModule and adds create_mcore_cudagraph_manager() and _should_call_local_cudagraph() for per-layer local CUDA graph capture during inference decode when configured. |
| Native Dynamic Inference Setup & Engine Configuration | src/bionemo/evo2/run/infer.py | Replaces static engine/wrapper/controller with native dynamic setup: new Evo2InferenceComponents and Evo2NativeDynamicComponents containers, tokenizer adaptation for generation, CUDA graph configuration and RNG seeding, model provider constraints (flash_decode off, sequence parallelism off), and wiring onto Evo2-specific dynamic context with Hyena state packing. |
| Native Dynamic Generation Engine & Sampling | src/bionemo/evo2/run/infer.py | _generate_native_dynamic() implements per-prompt dynamic inference with optional chunked prefill, dynamic context sizing, and request lifecycle (add_request→bind_views→initialize_state→forward/sample→update). _sample_from_logits() provides self-contained greedy/top-k/top-p/temperature sampling with optional per-token logprob collection. |
| CLI Parameters & Public Generation API | src/bionemo/evo2/run/infer.py | generate() and infer() signatures extended with enable_chunked_prefill, inference_dynamic_batching_max_tokens, and inference_dynamic_batching_block_size parameters. Prompt-segmentation-threshold removed. Seed threaded into sampler RNG via tokenizer attribute. |
| Test Refactoring & New Native Dynamic Edge Cases | tests/bionemo/evo2/run/test_infer.py | run_infer_subprocess() extended to support max-seq-length and return-log-probs controls and now returns full JSONL record. HyenaInferenceContext tests refactored to standalone functions. Comprehensive new suite of native dynamic edge-case tests covering full-prefill, chunked-prefill, single-token decode, FIR ring handling, longer generation, determinism, prompt sensitivity, and TP=2 CUDA graph execution. |
| Documentation & Example Updates | src/bionemo/evo2/run/infer.py, src/bionemo/evo2/run/infer_example_simple.py, tests/bionemo/evo2/run/test_infer.py, tests/bionemo/evo2/test_evo2.py, examples/fine-tuning-tutorial.ipynb | Module and function docstrings updated to emphasize native MCore dynamic-inference path. Test file comments clarify generation routes through public Evo2 endpoint. Fine-tuning tutorial notebook switches CLI flag from --input-fasta to --fasta. |
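
The in-place state updates summarized above (a ring-buffer shift for the FIR state and a multiply-then-add recurrence for the IIR state) can be illustrated without any tensor library. The sketch below is a toy model using plain Python lists: the class name `RecurrentState`, the shift direction, and the one-value-per-pole layout are assumptions, not the PR's buffers. What it shows is that each step writes into storage allocated once, which is what lets a captured CUDA graph replay against fixed memory addresses.

```python
class RecurrentState:
    """Toy persistent state: a FIR ring of the last K inputs plus one IIR value per pole."""

    def __init__(self, fir_len, poles):
        self.fir_ring = [0.0] * fir_len  # allocated once, reused every step
        self.iir_state = [0.0] * len(poles)
        self.poles = list(poles)

    def step(self, x):
        # Ring shift: drop the oldest sample and append the newest, writing into
        # the same list object (loosely analogous to buf.copy_(torch.roll(buf, -1))
        # followed by overwriting the last slot).
        self.fir_ring[:] = self.fir_ring[1:] + [x]
        # Recurrence: state <- pole * state + x, element by element, in place
        # (loosely analogous to state.mul_(pole).add_(x)).
        for i, pole in enumerate(self.poles):
            self.iir_state[i] = pole * self.iir_state[i] + x
```

With `fir_len=3` and a single pole of 0.5, feeding 1.0, 2.0, 3.0, 4.0 leaves the ring holding `[2.0, 3.0, 4.0]` and the IIR state at 6.125, while `id(state.fir_ring)` stays the same across all four steps.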

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA-BioNeMo/bionemo-framework#1419: Modifies bionemo/evo2/run/infer.py's inference setup and generation workflow; this PR switches to native dynamic-inference request lifecycle while the other PR involves static/legacy inference path changes.

Suggested labels

enhancement

Suggested reviewers

  • pstjohn
  • jwilber
  • trvachov
  • dorotat-nv
  • cspades

Poem

🐰 From static chains, a rabbit hops free,
Dynamic contexts now bind Hyena with glee!
Paged-KV buffers and CUDA graphs dance,
Native sampling takes inference's stance.
The state shapes align, the rings roll in place—
MCore's dynamic path wins the race! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The PR description includes performance benchmarking data and the change type, but lacks a detailed usage example and has an incomplete pre-submit checklist. | Add a concrete usage example code snippet to replace the 'TODO' placeholder and document how users interact with the new dynamic inference engine. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: switching from static to dynamic inference engine with performance improvements. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@jstjohn jstjohn added the ciflow:notebooks label (Run Jupyter notebooks execution tests for docs and bionemo2) Jun 3, 2026
Collaborator Author

jstjohn commented Jun 3, 2026

@coderabbitai review

Contributor

coderabbitai Bot commented Jun 3, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (1)

1236-1238: 💤 Low value

Consider threading seed through components instead of tokenizer attribute.

Setting _evo2_seed on the tokenizer object is unconventional. A cleaner approach would be to add the seed to Evo2NativeDynamicComponents or pass it as a parameter to generate(). However, this works correctly and the attribute name clearly indicates it's Evo2-specific.
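
One way to read this suggestion is sketched below: the seed travels as an explicit field on a components container, and the sampler RNG is built from that field, so the tokenizer is never mutated. `ComponentsSketch` and `make_sampler_rng` are hypothetical names standing in for whatever the PR's container and sampler setup look like; Python's `random.Random` is used here only to keep the example dependency-free.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentsSketch:
    """Hypothetical stand-in for a components container; not the PR's class."""

    tokenizer: object
    seed: Optional[int] = None  # carried explicitly instead of tokenizer._evo2_seed


def make_sampler_rng(components: ComponentsSketch) -> random.Random:
    """Build the sampler RNG from the container's seed (None -> nondeterministic)."""
    return random.Random(components.seed)
```

The trade-off is small either way: the explicit field makes the data flow visible in the container's type, at the cost of touching the container definition and its call sites.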

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py` around
lines 1236 - 1238, The code currently mutates components.tokenizer._evo2_seed to
thread the resolved seed (random_seed) into the native sampler RNG; instead, add
a dedicated field on the components container (e.g.,
Evo2NativeDynamicComponents.seed or evo2_seed) or extend the generate()
signature to accept a seed parameter and pass random_seed through that API, then
update usages in _generate_native_dynamic to read from the new components field
or the generate() parameter instead of tokenizer._evo2_seed; keep the attribute
name evo2_seed to preserve clarity and remove the tokenizer mutation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 72911025-adde-46a4-bb89-73f7badfbc0f

📥 Commits

Reviewing files that changed from the base of the PR and between aa2692e and dd5fa3e.

📒 Files selected for processing (11)
  • bionemo-recipes/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/evo2_provider.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_block.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_layer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/text_generation_controller.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
💤 Files with no reviewable changes (1)
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/text_generation_controller.py

… evo2

Signed-off-by: John St John <jstjohn@nvidia.com>
@jstjohn jstjohn force-pushed the jstjohn/evo2_dynamic_inference_engine branch from dd5fa3e to b0cfaed Compare June 3, 2026 01:48
jstjohn added 3 commits June 2, 2026 19:50
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Collaborator Author

jstjohn commented Jun 3, 2026

/ok to test ac4d138

jstjohn added 2 commits June 3, 2026 15:51
Signed-off-by: John St. John <jstjohn@nvidia.com>
Collaborator

@farhadrgh farhadrgh left a comment

The benchmark numbers don't disclose the kernel choice, so I am assuming these are without the subq-ops? Also, the tests don't specifically exercise CUDA graph + subq-ops prefill together.

I suggest extending the existing test_subquadratic_ops_matches_baseline in test_infer.py to a 2×2 parametrization that adds the CUDA-graph toggle:

@pytest.mark.parametrize("cuda_graph_impl", ["none", "local"])
@pytest.mark.parametrize("use_subquadratic_ops", [False, True])
def test_subquadratic_ops_with_cuda_graph_matches_baseline(
    mbridge_checkpoint_path, tmp_path, use_subquadratic_ops, cuda_graph_impl
):
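
For readers less familiar with stacked `pytest.mark.parametrize` decorators: pytest runs the test once per element of the Cartesian product of the two parameter lists, so the signature above yields four cases. The sketch below enumerates them with `itertools.product`; it does not import pytest and makes no claim about pytest's case ordering or generated test IDs.

```python
from itertools import product

cuda_graph_impls = ["none", "local"]
subquadratic_flags = [False, True]

# One dict per test invocation implied by the two stacked decorators.
cases = [
    {"use_subquadratic_ops": subq, "cuda_graph_impl": impl}
    for impl, subq in product(cuda_graph_impls, subquadratic_flags)
]

for case in cases:
    print(case)
```

The combination of interest for this review comment is `use_subquadratic_ops=True` with `cuda_graph_impl="local"`, which a single-axis parametrization would not cover.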

Collaborator

@moradza do we need a .contiguous() here?


copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…fp8 inference

Chunked prefill (Hyena):
- step_fir/step_iir accept an L-token block and thread the FIR ring / real-pole
  IIR modal state in one vectorized pass (equivalent to looping the single-token
  step); L==1 keeps the bit-identical, CUDA-graphed decode path.
- Fix step_iir block dtype crash: cast residues/D to fp32 (the recurrence runs in
  fp32 to match the persistent iir_state buffer; einsum needs matching dtypes).
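
The block/loop equivalence described above can be illustrated for a single real pole. This is a sketch under simplifying assumptions: scalar state, one pole, no residues or D term, and plain Python floats rather than fp32 tensors. `step_token` and `step_block` are illustrative names, not the PR's `step_iir`.

```python
def step_token(state, pole, x):
    """Single-token real-pole recurrence: s <- pole * s + x."""
    return pole * state + x


def step_block(state, pole, xs):
    """Closed form for an L-token block: returns (per-token states, final state).

    s_t = pole**(t+1) * s_0 + sum_{j<=t} pole**(t-j) * x_j
    """
    states = [
        pole ** (t + 1) * state + sum(pole ** (t - j) * xs[j] for j in range(t + 1))
        for t in range(len(xs))
    ]
    return states, (states[-1] if states else state)
```

Looping `step_token` over the same inputs gives the same per-token states up to floating-point rounding, and for a one-token block the closed form reduces to the single-token step.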

FP8 inference (was always bf16 at generation; the "fp8" tests only converted in fp8):
- setup_inference_engine now runs inference at the chosen precision and, for full
  fp8 (fp8 on all TE linears), calls mcore prepare_model_for_fp8_inference so each
  linear pads the token dim to the fp8 alignment -> single-token decode no longer
  fails assert_dim_for_fp8_exec. No-op for bf16 and vortex-style fp8.
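
The padding idea mentioned above amounts to rounding the token count up to the next multiple of an alignment. The helper below is a sketch of that arithmetic only; `padded_token_count` is a hypothetical name, and the alignment value is left as a parameter because the commit message does not state it.

```python
def padded_token_count(num_tokens: int, alignment: int) -> int:
    """Round num_tokens up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    # Ceiling division written with floor division on the negated value.
    return -(-num_tokens // alignment) * alignment
```

For single-token decode, `padded_token_count(1, alignment)` returns `alignment` itself, which is why an unpadded one-token step would trip a divisibility assertion.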

Tests:
- test_batch_generate_mbridge exercises bf16 / vortex fp8 / full fp8 at inference
  with the second-half accuracy readout (full fp8 gets its own golden values).
- Add full-fp8 with/without chunked-prefill test; add a bf16 IIR block unit test
  (catches the dtype mismatch on CPU) plus FIR/IIR block==per-token-loop unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
@jstjohn jstjohn force-pushed the jstjohn/evo2_dynamic_inference_engine branch from 142aae7 to 884b2f4 Compare June 4, 2026 18:05
jstjohn added 3 commits June 4, 2026 13:06
Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St John <jstjohn@nvidia.com>
Labels

ciflow:notebooks Run Jupyter notebooks execution tests for docs and bionemo2

2 participants