The NeMo Framework is NVIDIA's GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models. It scales pretraining and post-training workloads seamlessly from a single GPU to clusters with thousands of nodes, and supports both Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish.
The Eval library ("NeMo Eval") is a comprehensive evaluation module within the NeMo Framework for LLMs. It offers streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses.
- Multi-Backend Deployment: Supports PyTriton for single- and multi-node serving, and the Ray Serve deployment backend for multi-instance evaluations
- Comprehensive Evaluation: Includes state-of-the-art evaluation harnesses for academic benchmarks, reasoning benchmarks, code generation, and safety testing
- Adapter System: Features a flexible architecture with chained interceptors for customizable request and response processing
- Production-Ready: Supports high-performance inference with CUDA graphs and flash decoding
- Multi-GPU and Multi-Node Support: Enables distributed inference across multiple GPUs and compute nodes
- OpenAI-Compatible API: Provides RESTful endpoints aligned with OpenAI API specifications
- Python 3.10 or higher
- CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)
- NeMo Framework container (recommended)
For quick exploration of NeMo Eval, we recommend installing our pip package:
pip install nemo-eval
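To confirm the installation, a quick import check like the one below can be used. The __version__ attribute is an assumption based on the package layout (package_info.py); the fallback keeps the check safe either way.

# Hedged installation check: verifies the package imports; __version__ is an
# assumption, so a fallback string is printed if the attribute is absent.
import nemo_eval
print(getattr(nemo_eval, "__version__", "nemo-eval installed"))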
For optimal performance and user experience, use the latest version of the NeMo Framework container. Please fetch the most recent $TAG and run the following command to start a container:
docker run --rm -it -w /workdir -v $(pwd):/workdir \
  --entrypoint bash \
  --gpus all \
  nvcr.io/nvidia/nemo:${TAG}
To install NeMo Eval with uv, please refer to our Contributing Guide.
from nemo_eval.api import deploy
# Deploy a NeMo checkpoint
deploy(
    nemo_checkpoint="/path/to/your/checkpoint",
    serving_backend="pytriton",  # or "ray"
    server_port=8080,
    num_gpus=1,
    max_input_len=4096,
    max_batch_size=8
)
from nvidia_eval_commons.core.evaluate import evaluate
from nvidia_eval_commons.api.api_dataclasses import ApiEndpoint, EvaluationConfig, EvaluationTarget
# Configure evaluation
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    model_id="megatron_model"
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="gsm8k", output_dir="results")
# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
print(results)
| Checkpoint Type | Inference Backend | Deployment Server | Evaluation Harnesses Supported |
|---|---|---|---|
| NeMo FW checkpoint via Megatron Core backend | Megatron Core in-framework inference engine | PyTriton (single- and multi-node model parallelism), Ray (single-node model parallelism with multi-instance evals) | lm-evaluation-harness, simple-evals, BigCode, BFCL, safety-harness, garak |
- PyTriton Backend: Provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility via a FastAPI interface. Supports model parallelism across single-node and multi-node configurations. Note: multi-instance evaluation is not supported with this backend.
- Ray Backend: Enables multi-instance evaluation with model parallelism on a single node using Ray Serve, while maintaining OpenAI API compatibility. Multi-node support is coming soon. Either backend exposes the same OpenAI-compatible completions endpoint, which can also be queried directly (see the sketch below).
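A minimal sketch of querying the deployed endpoint over plain HTTP, assuming a model has already been deployed on port 8080 with the default megatron_model model ID and that the payload follows the standard OpenAI completions schema:

# Minimal sketch: query the OpenAI-compatible completions endpoint directly.
# Assumes a model is already deployed on port 8080 (see the deployment
# examples in this README); fields follow the OpenAI completions schema.
import requests

response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={
        "model": "megatron_model",            # model_id used at deployment time
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.0,
    },
    timeout=120,
)
print(response.json())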
- NVIDIA Eval Factory: Provides standardized benchmark evaluations using packages from NVIDIA Eval Factory, bundled in the NeMo Framework container. The lm-evaluation-harness is pre-installed by default, and additional tools listed in the support matrix can be added as needed. For more information, see the documentation.
- Adapter System: Flexible request/response processing pipeline built from interceptors that provide modular processing (a conceptual sketch follows this list):
  - Available Interceptors: Modular components for request/response processing
    - SystemMessageInterceptor: Customize system prompts
    - RequestLoggingInterceptor: Log incoming requests
    - ResponseLoggingInterceptor: Log outgoing responses
    - ResponseReasoningInterceptor: Process reasoning outputs
    - EndpointInterceptor: Route requests to the actual model
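The sketch below illustrates the chained-interceptor idea only; the class names and the build_chain helper are hypothetical and do not mirror nemo_eval's internal implementation. Each interceptor can inspect or modify a request before handing it to the next one, with the endpoint finally forwarding it to the model.

# Conceptual sketch of a chained-interceptor pipeline. The classes and the
# build_chain helper are illustrative only, not nemo_eval internals.
from typing import Callable, Dict, List

Request = Dict[str, object]
Handler = Callable[[Request], Request]

class SystemMessage:
    """Injects a system prompt into the request before passing it on."""
    def __init__(self, prompt: str):
        self.prompt = prompt
    def __call__(self, request: Request, nxt: Handler) -> Request:
        return nxt({**request, "system": self.prompt})

class RequestLogging:
    """Logs each request as it flows through the chain."""
    def __call__(self, request: Request, nxt: Handler) -> Request:
        print("request:", request)
        return nxt(request)

def build_chain(interceptors: List, endpoint: Handler) -> Handler:
    # Wrap the endpoint so that earlier interceptors run first.
    handler = endpoint
    for interceptor in reversed(interceptors):
        handler = (lambda icpt, nxt: lambda req: icpt(req, nxt))(interceptor, handler)
    return handler

# The endpoint here is a stub standing in for the deployed model server.
chain = build_chain(
    [SystemMessage("Think step by step."), RequestLogging()],
    endpoint=lambda req: {"completion": "42", "echo": req},
)
print(chain({"prompt": "6 * 7 = ?"}))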
from nemo_eval.api import deploy
# Deploy model
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    server_port=8080,
    num_gpus=1,
    max_input_len=8192,
    max_batch_size=4
)
from nvidia_eval_commons.core.evaluate import evaluate
from nvidia_eval_commons.api.api_dataclasses import ApiEndpoint, ConfigParams, EvaluationConfig, EvaluationTarget
# Configure Endpoint
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    model_id="megatron_model"
)
# Evaluation target configuration
target = EvaluationTarget(api_endpoint=api_endpoint)
# Configure EvaluationConfig with type, number of samples to evaluate on, etc.
config = EvaluationConfig(
    type="gsm8k",
    output_dir="results",
    params=ConfigParams(limit_samples=10)
)
# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
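Besides the returned summary, artifacts are written under the configured output_dir. A small, hedged way to inspect both (the exact file layout under results/ depends on the benchmark, so this simply lists whatever the run produced):

# Print the returned summary and list the artifacts the run produced under
# output_dir ("results" in the configuration above).
import pathlib
import pprint

pprint.pprint(results)
for artifact in sorted(pathlib.Path("results").rglob("*")):
    print(artifact)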
The example below demonstrates how to configure an adapter to provide a custom system prompt. Requests and responses are processed through interceptors, which are automatically selected based on the parameters defined in AdapterConfig.
from nemo_eval.utils.api import AdapterConfig
# Configure adapter for reasoning
adapter_config = AdapterConfig(
    api_url="http://0.0.0.0:8080/v1/completions/",
    use_reasoning=True,
    end_reasoning_token="</think>",
    custom_system_prompt="You are a helpful assistant that thinks step by step.",
    max_logged_requests=5,
    max_logged_responses=5
)
# Run evaluation with adapter
results = evaluate(
    target_cfg=target,
    eval_cfg=config,
    adapter_cfg=adapter_config
)
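Conceptually, with use_reasoning=True the reasoning portion of each completion is separated from the final answer at the end_reasoning_token before scoring. The snippet below is an illustration of that post-processing idea only, not the ResponseReasoningInterceptor's actual code:

# Illustration only: keep the text after the final end-of-reasoning token.
def strip_reasoning(completion: str, end_token: str = "</think>") -> str:
    head, sep, tail = completion.rpartition(end_token)
    return tail.strip() if sep else completion.strip()

print(strip_reasoning("<think>17 + 25 = 42</think> The answer is 42."))
# -> "The answer is 42."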
# Deploy with tensor parallelism or pipeline parallelism
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    num_gpus=4,
    tensor_parallelism_size=4,
    pipeline_parallelism_size=1,
    max_input_len=8192,
    max_batch_size=8
)
# Deploy using Ray Serve
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="ray",
    num_gpus=2,
    num_replicas=2,
    num_cpus_per_replica=8,
    server_port=8080,
    include_dashboard=True,
    cuda_visible_devices="0,1"
)
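With two replicas behind the Ray endpoint, evaluation requests can be issued concurrently. The sketch below reuses the evaluate() flow from above; the parallelism field on ConfigParams is an assumption made for illustration, so check nvidia_eval_commons.api.api_dataclasses for the exact name before relying on it:

# Hedged sketch: fan evaluation requests out across the Ray replicas deployed
# above. The "parallelism" field is an assumption; verify it against the
# ConfigParams dataclass.
from nvidia_eval_commons.api.api_dataclasses import (
    ApiEndpoint, ConfigParams, EvaluationConfig, EvaluationTarget,
)
from nvidia_eval_commons.core.evaluate import evaluate

target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://0.0.0.0:8080/v1/completions/",
        model_id="megatron_model",
    )
)
config = EvaluationConfig(
    type="gsm8k",
    output_dir="results_ray",
    params=ConfigParams(limit_samples=10, parallelism=2),  # parallelism: assumed field
)
results = evaluate(target_cfg=target, eval_cfg=config)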
Eval/
├── src/nemo_eval/              # Main package
│   ├── api.py                  # Main API functions
│   ├── package_info.py         # Package metadata
│   ├── adapters/               # Adapter system
│   │   ├── server.py           # Adapter server
│   │   ├── utils.py            # Adapter utilities
│   │   └── interceptors/       # Request/response interceptors
│   └── utils/                  # Utility modules
│       ├── api.py              # API configuration classes
│       ├── base.py             # Base utilities
│       └── ray_deploy.py       # Ray deployment utilities
├── tests/                      # Test suite
│   ├── unit_tests/             # Unit tests
│   └── functional_tests/       # Functional tests
├── tutorials/                  # Tutorial notebooks
├── scripts/                    # Reference nemo-run scripts
├── docs/                       # Documentation
├── docker/                     # Docker configuration
└── external/                   # External dependencies
We welcome contributions! Please see our Contributing Guide for details on development setup, testing, and code style guidelines.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: NeMo Documentation
- NeMo Export Deploy - Model export and deployment
Note: This project is actively maintained by NVIDIA. For the latest updates and features, please check our releases page.