fix(geneformer): lazy-import nemo.lightning to avoid nvidia-resiliency-ext assertion #1605

Closed
svc-bionemo wants to merge 1 commit into NVIDIA-BioNeMo:main from svc-bionemo:svc-bionemo/fix-nightly-20260605-93a8dc5e

@svc-bionemo

Collaborator

Problem

The nightly CI for unit-tests (models/geneformer) is failing because geneformer/convert.py imports nemo.lightning at module level, which triggers the full megatron-core import chain. This hits an AssertionError: Minimum required nvidia-resiliency-ext package version is 0.6.0 in megatron/core/dist_checkpointing/strategies/nvrx.py.

4 tests fail at import time:

  • test_geneformer_checkpoint_loss[Geneformer-V1-10M]
  • test_geneformer_checkpoint_weight_compatibility[Geneformer-V1-10M]
  • test_te_bert_layer_and_hf_bert_layer_similar_output_values_random_inputs
  • test_geneformer_model_loss_validity

Fix

Make nemo.lightning a lazy import in convert.py:

  1. Removed module-level from nemo.lightning import io
  2. Kept the tensor math functions (_pack_qkv_weight_impl, etc.) at module level — they have no nemo dependency
  3. Added _make_transforms() that imports io locally and creates the StateTransform objects on demand
  4. convert_geneformer_hf_to_te and convert_geneformer_te_to_hf now import io locally
  5. Preserved backward-compatible module-level exports (_unpack_qkv_weight, _unpack_qkv_bias) for tests that import them directly
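The steps above follow a common lazy-import pattern. Here is a minimal sketch of it, not the code from this PR: `pack_qkv`, `make_transforms`, and `convert` are hypothetical names, and the standard-library `json` module stands in for `nemo.lightning.io` so the example runs anywhere.

```python
import importlib

# Hypothetical stand-in for the heavy dependency (nemo.lightning.io in the PR).
HEAVY_MODULE = "json"


def pack_qkv(q, k, v):
    """Pure helper with no heavy dependency, so it can live at module level."""
    return list(q) + list(k) + list(v)


def make_transforms(module_name=HEAVY_MODULE):
    """Import the heavy dependency only when transforms are actually needed."""
    io = importlib.import_module(module_name)  # deferred until call time
    return {"io": io, "pack_qkv": pack_qkv}


def convert(weights, module_name=HEAVY_MODULE):
    """Entry point that triggers the deferred import when called."""
    transforms = make_transforms(module_name)
    return transforms["pack_qkv"](*weights)
```

With this layout, importing the module that defines `pack_qkv` never touches the heavy dependency; only a call to `make_transforms` or `convert` does.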

Root Cause

The CI container image has nvidia-resiliency-ext below version 0.6.0, which megatron-core now requires. This is an environment issue, but making the import lazy is the correct defensive fix regardless — convert.py should not force the entire megatron-core stack to load just to define tensor manipulation utilities.

Failed CI Run

https://github.com/NVIDIA-BioNeMo/bionemo-framework/actions/runs/27009502134

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>

…y-ext assertion

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f3f7cc28-f01f-4175-b528-b6b5be5e995c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@svc-bionemo svc-bionemo closed this Jun 5, 2026
@svc-bionemo

Collaborator Author

Closing in favor of #1598, which resolves the same issue.

Why this approach failed: Lazy-importing nemo.lightning inside convert_geneformer_hf_to_te() / convert_geneformer_te_to_hf() does defer the import past module load time, but the tests still call these functions, so the import chain (nemo.lightning → megatron.core → nvidia-resiliency-ext version assertion) still fires at test time. The underlying problem is that the CI base image ships nvidia-resiliency-ext==0.5.0 while megatron-core==0.17.1 asserts >=0.6.0.
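The failure mode can be reproduced in isolation. The sketch below is illustrative only: it writes a throwaway module (hypothetically named `fake_heavy_dep`) whose body raises an `AssertionError`, loosely imitating a dependency that checks a minimum version at import time. It then shows that a function-local import still fails as soon as the function is called.

```python
import importlib.util
import os
import tempfile

# Hypothetical stand-in for a dependency that asserts a minimum version on import.
FAKE_SOURCE = 'raise AssertionError("Minimum required package version is 0.6.0")\n'


def load_fake_dependency(path):
    """Load the module at `path`; executing its body raises AssertionError."""
    spec = importlib.util.spec_from_file_location("fake_heavy_dep", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module body
    return module


def convert(path):
    """Lazy import: nothing is loaded until this function is called."""
    load_fake_dependency(path)
    return "converted"


def run_demo():
    """Return the assertion message raised when `convert` is finally called."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake_heavy_dep.py")
        with open(path, "w") as fh:
            fh.write(FAKE_SOURCE)
        # Defining `convert` above was harmless; calling it triggers the import.
        try:
            convert(path)
        except AssertionError as exc:
            return str(exc)
    return None
```

Deferring the import moves the failure from collection time to call time. It does not remove the failure for any test that exercises the conversion functions.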

Fix in #1598: Uses a .ci_build.sh script to install nvidia-resiliency-ext>=0.6.0 with --no-deps, avoiding the transitive protobuf conflict while satisfying the megatron-core runtime assertion.
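A preflight check of this kind can surface the version mismatch before any heavy import runs. This is a rough sketch under stated assumptions, not what #1598 or megatron-core does. `parse_release`, `meets_minimum`, and `check_distribution` are hypothetical helpers, and the parser handles only simple dotted numeric versions, where a full implementation would follow PEP 440.

```python
from importlib import metadata


def parse_release(version):
    """Parse leading numeric components, e.g. '0.5.0' -> (0, 5, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two simple dotted versions, padding the shorter with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_distribution(name, minimum):
    """Return True/False for an installed distribution, or None if absent."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed, minimum)
```

For the versions quoted in this thread, `meets_minimum("0.5.0", "0.6.0")` is false, which matches the assertion the tests hit.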
