fix(geneformer): lazy-import nemo.lightning to avoid nvidia-resiliency-ext assertion #1605
…y-ext assertion Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Closing in favor of #1598, which resolves the same issue.
Why this approach failed: Lazy-importing
Fix in #1598: Uses a
Problem
The nightly CI for `unit-tests (models/geneformer)` is failing because `geneformer/convert.py` imports `nemo.lightning` at module level, which triggers the full megatron-core import chain. This hits an `AssertionError: Minimum required nvidia-resiliency-ext package version is 0.6.0` in `megatron/core/dist_checkpointing/strategies/nvrx.py`.

4 tests fail at import time:
- `test_geneformer_checkpoint_loss[Geneformer-V1-10M]`
- `test_geneformer_checkpoint_weight_compatibility[Geneformer-V1-10M]`
- `test_te_bert_layer_and_hf_bert_layer_similar_output_values_random_inputs`
- `test_geneformer_model_loss_validity`

Fix
Make `nemo.lightning` a lazy import in `convert.py`:
- Remove the module-level `from nemo.lightning import io`.
- Keep the raw helper functions (`_pack_qkv_weight_impl`, etc.) at module level; they have no nemo dependency.
- Add `_make_transforms()`, which imports `io` locally and creates the `StateTransform` objects on demand.
- `convert_geneformer_hf_to_te` and `convert_geneformer_te_to_hf` now import `io` locally.
- Keep the existing names (`_unpack_qkv_weight`, `_unpack_qkv_bias`) available for tests that import them directly.

Root Cause
The CI container image has `nvidia-resiliency-ext` below version 0.6.0, which megatron-core now requires. This is an environment issue, but making the import lazy is the correct defensive fix regardless: `convert.py` should not force the entire megatron-core stack to load just to define tensor manipulation utilities.

Failed CI Run
https://github.com/NVIDIA-BioNeMo/bionemo-framework/actions/runs/27009502134
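
For reference, the lazy-import layout described under Fix can be sketched roughly as below. This is an illustration only, not the code in `convert.py`: the stdlib module `colorsys` stands in for `nemo.lightning` so the sketch runs without NeMo installed, and the names `pack_qkv`, `make_transforms`, `convert`, and `heavy_is_loaded` are hypothetical.

```python
import importlib
import sys

# Stand-in for the heavy dependency (`nemo.lightning` in this PR). A small stdlib
# module is used purely so the sketch runs without NeMo installed.
HEAVY_MODULE = "colorsys"


def pack_qkv(q, k, v):
    """Hypothetical pure helper: concatenates three lists.

    It needs nothing from the heavy dependency, so it can live at module level.
    """
    return [*q, *k, *v]


def make_transforms():
    """Build the transform table on demand, importing the heavy module only here."""
    heavy = importlib.import_module(HEAVY_MODULE)  # deferred import
    return {"pack_qkv": pack_qkv, "backend": heavy.__name__}


def convert(q, k, v):
    """Entry point: the heavy import is paid on the first call, not at module import."""
    transforms = make_transforms()
    return transforms["pack_qkv"](q, k, v)


def heavy_is_loaded():
    """Report whether the stand-in heavy module has been imported yet."""
    return HEAVY_MODULE in sys.modules
```

Importing a module laid out this way only defines the helpers. The deferred import runs on the first `convert(...)` call, after which `heavy_is_loaded()` reports `True`. Tests that need only `pack_qkv` never trigger it.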
Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>