[Evo2] Switch from static to faster/more modern dynamic inference engine by jstjohn · Pull Request #1597 · NVIDIA-BioNeMo/bionemo-framework

jstjohn · 2026-06-03T01:24:20Z

Description

Adds a new dynamic inference engine to Evo2, with CUDA graph support.

Benchmarked with a 1024-token prompt and 1024 requested generated tokens. All runs verified the output JSONL reported 1024 prompt tokens and 1024 completion tokens. Comparison performed on 2xA6000 GPUs at bf16 precision.

| Model | Parallelism | Prompt / Generation (tokens) | `origin/main` static engine | Dynamic engine | Speedup | Tokens verified |
|---|---|---|---|---|---|---|
| Evo2 1B | TP=1 | 1024 / 1024 | 38.7 tok/s, 26.44s | 129.6 tok/s, 7.90s | 3.35x | 1024 prompt + 1024 completion |
| Evo2 7B | TP=2 | 1024 / 1024 | 28.4 tok/s, 36.07s | 62.2 tok/s, 16.47s | 2.19x | 1024 prompt + 1024 completion |
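
As a quick consistency check on the table above, the reported throughput and speedup figures can be recomputed from the wall-clock times. This is only a sketch: it assumes the tok/s column counts the 1024 generated tokens (not the prompt), which is what the numbers are consistent with, and the helper names are illustrative.

```python
# Wall-clock times copied from the benchmark table (1024 generated tokens per run).
GENERATED_TOKENS = 1024

runs = {
    "Evo2 1B (TP=1)": {"static_s": 26.44, "dynamic_s": 7.90},
    "Evo2 7B (TP=2)": {"static_s": 36.07, "dynamic_s": 16.47},
}


def throughput(tokens: int, seconds: float) -> float:
    """Generated tokens per second of wall-clock time."""
    return tokens / seconds


def speedup(static_s: float, dynamic_s: float) -> float:
    """Wall-clock speedup of the dynamic engine over the static engine."""
    return static_s / dynamic_s


for name, t in runs.items():
    static_tps = throughput(GENERATED_TOKENS, t["static_s"])
    dynamic_tps = throughput(GENERATED_TOKENS, t["dynamic_s"])
    ratio = speedup(t["static_s"], t["dynamic_s"])
    print(f"{name}: {static_tps:.1f} -> {dynamic_tps:.1f} tok/s, {ratio:.2f}x")
```

Rounded to the table's precision, the recomputed values match the reported 38.7, 129.6, 28.4 and 62.2 tok/s and the 3.35x and 2.19x speedups.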

Usage

TODO: Add code snippet

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

ciflow:skip - Skip all CI tests for this PR
ciflow:notebooks - Run Jupyter notebooks execution tests
ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
/ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

@coderabbitai review - Triggers a standard review
@coderabbitai full review - Triggers a comprehensive review

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Summary by CodeRabbit

New Features
- Native dynamic inference engine for improved performance and memory efficiency
- Chunked prefill and dynamic batching support for flexible inference control
Improvements
- Optimized inference state management through in-place operations
- CUDA graph acceleration for faster decode-time inference
Documentation
- Updated fine-tuning tutorial with corrected CLI arguments for prediction commands

coderabbitai · 2026-06-03T01:24:27Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4fdae5d3-a116-4240-a2a5-572159d7deeb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR migrates Evo2's inference from Megatron Core's static inference engine to a native dynamic-inference path. Hyena recurrent state (FIR/IIR filters) is packed into MCore's Mamba slots with paged-KV attention, per-layer CUDA graphs are supported, and generation uses an explicit sampler with temperature/top-k/top-p. CLI arguments and tests are updated accordingly.
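
For readers unfamiliar with the sampler controls named in the walkthrough, the following is a minimal pure-Python sketch of greedy, temperature, top-k and top-p selection. It is not the PR's `_sample_from_logits()`; the function name, the convention that `top_k=0` and `top_p=0.0` disable a filter, and the use of Python's `random` module are all assumptions made for illustration.

```python
import math
import random


def sample_from_logits(logits, temperature=1.0, top_k=0, top_p=0.0, rng=None):
    """Pick a token id from raw logits (hypothetical helper, not the PR's code)."""
    if temperature == 0.0 or top_k == 1:
        # Greedy decoding: return the arg-max token.
        return max(range(len(logits)), key=lambda i: logits[i])
    if rng is None:
        rng = random.Random()
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Candidate ids ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    if 0.0 < top_p < 1.0:
        # Nucleus filter: keep the smallest prefix whose mass reaches top_p.
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    # random.choices renormalizes the surviving weights internally.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


# A seeded RNG makes repeated draws reproducible.
logits = [0.1, 2.0, -1.0]
first = sample_from_logits(logits, temperature=0.8, top_k=2, rng=random.Random(0))
second = sample_from_logits(logits, temperature=0.8, top_k=2, rng=random.Random(0))
assert first == second
```

Passing the RNG explicitly keeps the sampler free of hidden global state, which is one way to get the determinism the new tests are described as checking.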

Changes

Evo2 Dynamic Inference Engine

Layer / File(s)	Summary
Hyena Recurrent State Shapes & Layer Type Mapping `src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py`, `src/bionemo/evo2/models/megatron/hyena/hyena_block.py`	`HyenaMixerStateShapes` dataclass defines conv/FIR and operator-specific SSM state layouts. `HyenaStack` maps Hyena layers to `MAMBA` symbols and aggregates per-request state shapes across all Hyena layers on the rank, with uniform conv shape and padded SSM shape.
In-Place Recurrent State Buffer Updates `src/bionemo/evo2/models/megatron/hyena/engine.py`, `src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py`	FIR and IIR states are now persistent fp32 buffers with in-place operations: ring-buffer shifts via `copy_(torch.roll(...))` and recurrence via `mul_`/`add_`. Cached filter state tensors are cast to fp32 before storage in inference context.
Dynamic Context State Binding & Packing Utilities `src/bionemo/evo2/models/evo2_provider.py`	`_PackedHyenaSlotStateDict` adapters remap Hyena filter writes into live dynamic context Mamba state sub-slices. New helpers build `MambaInferenceStateConfig`, compute paged-KV `buffer_size_gb` mirroring mcore's hybrid arithmetic, and bind Hyena layer state views to dynamic context by installing per-bucket filter dicts and registering conv/SSM sub-slices.
Hyena Layer CUDA Graph Integration `src/bionemo/evo2/models/megatron/hyena/hyena_layer.py`	`HyenaLayer` now extends `GraphableMegatronModule` and adds `create_mcore_cudagraph_manager()` and `_should_call_local_cudagraph()` for per-layer local CUDA graph capture during inference decode when configured.
Native Dynamic Inference Setup & Engine Configuration `src/bionemo/evo2/run/infer.py`	Replaces static engine/wrapper/controller with native dynamic setup: new `Evo2InferenceComponents` and `Evo2NativeDynamicComponents` containers, tokenizer adaptation for generation, CUDA graph configuration and RNG seeding, model provider constraints (flash_decode off, sequence parallelism off), and wiring onto Evo2-specific dynamic context with Hyena state packing.
Native Dynamic Generation Engine & Sampling `src/bionemo/evo2/run/infer.py`	`_generate_native_dynamic()` implements per-prompt dynamic inference with optional chunked prefill, dynamic context sizing, and request lifecycle (add_request→bind_views→initialize_state→forward/sample→update). `_sample_from_logits()` provides self-contained greedy/top-k/top-p/temperature sampling with optional per-token logprob collection.
CLI Parameters & Public Generation API `src/bionemo/evo2/run/infer.py`	`generate()` and `infer()` signatures extended with `enable_chunked_prefill`, `inference_dynamic_batching_max_tokens`, and `inference_dynamic_batching_block_size` parameters. Prompt-segmentation-threshold removed. Seed threaded into sampler RNG via tokenizer attribute.
Test Refactoring & New Native Dynamic Edge Cases `tests/bionemo/evo2/run/test_infer.py`	`run_infer_subprocess()` extended to support max-seq-length and return-log-probs controls and now returns full JSONL record. HyenaInferenceContext tests refactored to standalone functions. Comprehensive new suite of native dynamic edge-case tests covering full-prefill, chunked-prefill, single-token decode, FIR ring handling, longer generation, determinism, prompt sensitivity, and TP=2 CUDA graph execution.
Documentation & Example Updates `src/bionemo/evo2/run/infer.py`, `src/bionemo/evo2/run/infer_example_simple.py`, `tests/bionemo/evo2/run/test_infer.py`, `tests/bionemo/evo2/test_evo2.py`, `examples/fine-tuning-tutorial.ipynb`	Module and function docstrings updated to emphasize native MCore dynamic-inference path. Test file comments clarify generation routes through public Evo2 endpoint. Fine-tuning tutorial notebook switches CLI flag from `--input-fasta` to `--fasta`.
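
The in-place FIR ring-buffer update mentioned in the table can be illustrated with a small sketch. The PR describes the shift as `copy_(torch.roll(...))` on persistent fp32 buffers; the version below uses a plain Python list so it runs without PyTorch, and the function names, tap ordering and zero initial state are assumptions rather than the PR's actual layout.

```python
def fir_step(state, taps, x):
    """Advance a causal FIR filter by one token, updating `state` in place."""
    # Shift the ring left by one slot and write the newest sample at the end.
    state[:-1] = state[1:]
    state[-1] = x
    # taps[0] weights the newest sample, taps[-1] the oldest.
    return sum(h * s for h, s in zip(taps, reversed(state)))


def fir_direct(taps, xs):
    """Reference: full causal convolution with zero left-padding."""
    out = []
    for t in range(len(xs)):
        out.append(sum(taps[j] * xs[t - j] for j in range(len(taps)) if t - j >= 0))
    return out


taps = [0.5, 0.25, 0.125]
xs = [1.0, 2.0, 3.0, 4.0]

state = [0.0] * len(taps)  # persistent buffer, allocated once
buffer_id = id(state)
stepped = [fir_step(state, taps, x) for x in xs]

# Token-by-token decoding reproduces the full convolution ...
assert all(abs(a - b) < 1e-12 for a, b in zip(stepped, fir_direct(taps, xs)))
# ... and the same buffer object is reused on every step.
assert id(state) == buffer_id
```

Keeping the buffer identity fixed is what matters for CUDA graph capture in the real code, since a captured graph replays against fixed memory addresses.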

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA-BioNeMo/bionemo-framework#1419: Modifies bionemo/evo2/run/infer.py's inference setup and generation workflow; this PR switches to native dynamic-inference request lifecycle while the other PR involves static/legacy inference path changes.

Suggested labels

enhancement

Suggested reviewers

pstjohn
jwilber
trvachov
dorotat-nv
cspades

Poem

🐰 From static chains, a rabbit hops free,
Dynamic contexts now bind Hyena with glee!
Paged-KV buffers and CUDA graphs dance,
Native sampling takes inference's stance.
The state shapes align, the rings roll in place—
MCore's dynamic path wins the race! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	PR description includes performance benchmarking data and change type but lacks a detailed usage example and has an incomplete pre-submit checklist.	Add a concrete usage example code snippet to replace the 'TODO' placeholder and document how users interact with the new dynamic inference engine.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and accurately summarizes the main change: switching from static to dynamic inference engine with performance improvements.
Docstring Coverage	✅ Passed	Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

jstjohn · 2026-06-03T01:28:14Z

@coderabbitai review

coderabbitai · 2026-06-03T01:28:19Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (1)

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (1)
1236-1238: 💤 Low value

Consider threading seed through components instead of tokenizer attribute.

Setting _evo2_seed on the tokenizer object is unconventional. A cleaner approach would be to add the seed to Evo2NativeDynamicComponents or pass it as a parameter to generate(). However, this works correctly and the attribute name clearly indicates it's Evo2-specific.
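
One way the suggestion above could look is sketched below. `DynamicComponents` and `make_sampler_rng` are hypothetical stand-ins, not names from the PR, and Python's `random.Random` stands in for whatever RNG the real sampler uses.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class DynamicComponents:
    """Stand-in for the components container; field names are illustrative."""

    tokenizer: object
    seed: Optional[int] = None


def make_sampler_rng(components: DynamicComponents) -> random.Random:
    """Build the sampler RNG from the container instead of a tokenizer attribute."""
    return random.Random(components.seed)


components = DynamicComponents(tokenizer=object(), seed=1234)
rng_a = make_sampler_rng(components)
rng_b = make_sampler_rng(components)

# Two RNGs built from the same seed yield the same stream.
assert [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]
# The tokenizer object is left untouched.
assert not hasattr(components.tokenizer, "_evo2_seed")
```

Carrying the seed on the container keeps the tokenizer free of sampler state and makes the dependency visible in the type.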
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py` around
lines 1236 - 1238, The code currently mutates components.tokenizer._evo2_seed to
thread the resolved seed (random_seed) into the native sampler RNG; instead, add
a dedicated field on the components container (e.g.,
Evo2NativeDynamicComponents.seed or evo2_seed) or extend the generate()
signature to accept a seed parameter and pass random_seed through that API, then
update usages in _generate_native_dynamic to read from the new components field
or the generate() parameter instead of tokenizer._evo2_seed; keep the attribute
name evo2_seed to preserve clarity and remove the tokenizer mutation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py`:
- Around line 1236-1238: The code currently mutates
components.tokenizer._evo2_seed to thread the resolved seed (random_seed) into
the native sampler RNG; instead, add a dedicated field on the components
container (e.g., Evo2NativeDynamicComponents.seed or evo2_seed) or extend the
generate() signature to accept a seed parameter and pass random_seed through
that API, then update usages in _generate_native_dynamic to read from the new
components field or the generate() parameter instead of tokenizer._evo2_seed;
keep the attribute name evo2_seed to preserve clarity and remove the tokenizer
mutation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 72911025-adde-46a4-bb89-73f7badfbc0f

📥 Commits

Reviewing files that changed from the base of the PR and between aa2692e and dd5fa3e.

📒 Files selected for processing (11)

bionemo-recipes/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/evo2_provider.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_block.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_layer.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/text_generation_controller.py
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py

💤 Files with no reviewable changes (1)

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/text_generation_controller.py

jstjohn · 2026-06-03T15:19:59Z

/ok to test ac4d138

farhadrgh

The benchmark numbers don't disclose the kernel choice, so I am assuming these are without the subq-ops? Also, the tests don't specifically exercise CUDA graph + subq-ops prefill together.

I suggest modifying the existing test_subquadratic_ops_matches_baseline in test_infer.py into a 2×2 parametrization that adds the CUDA-graph toggle:

import pytest

@pytest.mark.parametrize("cuda_graph_impl", ["none", "local"])
@pytest.mark.parametrize("use_subquadratic_ops", [False, True])
def test_subquadratic_ops_with_cuda_graph_matches_baseline(
    mbridge_checkpoint_path, tmp_path, use_subquadratic_ops, cuda_graph_impl
):
    ...  # test body not given in the comment
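
Stacked `pytest.mark.parametrize` decorators run the test once per combination of their values, so the 2×2 suggestion expands to four cases. The sketch below only enumerates them with `itertools.product`; the constant names are illustrative.

```python
import itertools

CUDA_GRAPH_IMPLS = ["none", "local"]
USE_SUBQUADRATIC_OPS = [False, True]

# One test invocation per (use_subquadratic_ops, cuda_graph_impl) pair.
cases = list(itertools.product(USE_SUBQUADRATIC_OPS, CUDA_GRAPH_IMPLS))
for use_subq, cuda_graph_impl in cases:
    print(f"use_subquadratic_ops={use_subq}, cuda_graph_impl={cuda_graph_impl}")
```

The `(True, "local")` case is the combination the comment says is not currently exercised.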

farhadrgh · 2026-06-04T15:33:50Z

@moradza do we need a .contiguous() here?

copy-pr-bot · 2026-06-04T17:35:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…fp8 inference

Chunked prefill (Hyena):
- step_fir/step_iir accept an L-token block and thread the FIR ring / real-pole IIR modal state in one vectorized pass (equivalent to looping the single-token step); L==1 keeps the bit-identical, CUDA-graphed decode path.
- Fix step_iir block dtype crash: cast residues/D to fp32 (the recurrence runs in fp32 to match the persistent iir_state buffer; einsum needs matching dtypes).

FP8 inference (was always bf16 at generation; the "fp8" tests only converted in fp8):
- setup_inference_engine now runs inference at the chosen precision and, for full fp8 (fp8 on all TE linears), calls mcore prepare_model_for_fp8_inference so each linear pads the token dim to the fp8 alignment -> single-token decode no longer fails assert_dim_for_fp8_exec. No-op for bf16 and vortex-style fp8.

Tests:
- test_batch_generate_mbridge exercises bf16 / vortex fp8 / full fp8 at inference with the second-half accuracy readout (full fp8 gets its own golden values).
- Add full-fp8 with/without chunked-prefill test; add a bf16 IIR block unit test (catches the dtype mismatch on CPU) plus FIR/IIR block==per-token-loop unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
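
The "block equals per-token loop" property described in the commit message above can be sketched for a real-pole modal IIR. The recurrence form used here (each mode decays by its pole, then adds the input; the output is the residue-weighted sum of modes plus a direct term `d * x`) is an assumption for illustration and may differ from the PR's `step_iir`; all names are hypothetical.

```python
import math


def iir_step(state, poles, residues, d, x):
    """One-token update of a real-pole modal IIR; `state` is modified in place."""
    for k, p in enumerate(poles):
        state[k] = p * state[k] + x  # decay the mode, then inject the input
    return sum(r * s for r, s in zip(residues, state)) + d * x


def iir_block(state, poles, residues, d, xs):
    """Process a block of tokens from the closed form, then advance `state`."""
    ys = []
    for t, x_t in enumerate(xs):
        y = d * x_t
        for k, p in enumerate(poles):
            # Mode k at step t: decayed initial state plus decayed past inputs.
            mode = p ** (t + 1) * state[k]
            mode += sum(p ** (t - i) * xs[i] for i in range(t + 1))
            y += residues[k] * mode
        ys.append(y)
    n = len(xs)
    for k, p in enumerate(poles):
        state[k] = p ** n * state[k] + sum(p ** (n - 1 - i) * xs[i] for i in range(n))
    return ys


poles, residues, d = [0.9, 0.5], [1.0, -0.25], 0.1
xs = [1.0, -2.0, 0.5, 3.0]

loop_state = [0.2, -0.1]
block_state = list(loop_state)
loop_ys = [iir_step(loop_state, poles, residues, d, x) for x in xs]
block_ys = iir_block(block_state, poles, residues, d, xs)

# Outputs and the carried state agree up to floating-point rounding.
assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(loop_ys, block_ys))
assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(loop_state, block_state))
```

The comparison uses a tolerance because the closed form and the recurrence accumulate rounding differently; this sketch makes no claim about bit-identical results.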

jstjohn requested review from jwilber, pstjohn and trvachov as code owners June 3, 2026 01:24

jstjohn added the ciflow:notebooks Run Jupyter notebooks execution tests for docs and bionemo2 label Jun 3, 2026

coderabbitai Bot reviewed Jun 3, 2026

Switch from static to faster/more modern dynamic inference engine for…

b0cfaed

… evo2 Signed-off-by: John St John <jstjohn@nvidia.com>

jstjohn force-pushed the jstjohn/evo2_dynamic_inference_engine branch from dd5fa3e to b0cfaed Compare June 3, 2026 01:48

jstjohn added 3 commits June 2, 2026 19:50

Address feedback around RNG setting

7bce12c

Signed-off-by: John St John <jstjohn@nvidia.com>

Handle mcore dynamic inference context padding in the hyena mixer

bd53980

Signed-off-by: John St John <jstjohn@nvidia.com>

Add docstrings to tests

ac4d138

Signed-off-by: John St John <jstjohn@nvidia.com>

jstjohn added 2 commits June 3, 2026 15:51

Merge branch 'main' of github.com:NVIDIA-BioNeMo/bionemo-framework in…

a486814

…to jstjohn/evo2_dynamic_inference_engine

Address CI failures

dab433e

Signed-off-by: John St. John <jstjohn@nvidia.com>

farhadrgh reviewed Jun 4, 2026

jstjohn force-pushed the jstjohn/evo2_dynamic_inference_engine branch from 142aae7 to 884b2f4 Compare June 4, 2026 18:05

jstjohn added 3 commits June 4, 2026 13:06

PR feedback

9b5992f

Signed-off-by: John St. John <jstjohn@nvidia.com>

add help message to subq safety about system driver issues

7823b23

Signed-off-by: John St John <jstjohn@nvidia.com>

No subq + cudagraph

c271c65

Signed-off-by: John St John <jstjohn@nvidia.com>

jstjohn wants to merge 10 commits into main from jstjohn/evo2_dynamic_inference_engine
