diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md new file mode 100644 index 000000000..957fbfc54 --- /dev/null +++ b/.claude/skills/evaluation/SKILL.md @@ -0,0 +1,307 @@ +--- +name: evaluation +description: Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment). +license: Apache-2.0 +# Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073) +# https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant +# Modifications: renamed to evaluation, added workspace management (Step 0), +# auto-detect ModelOpt quantization format, quantization-aware benchmark defaults. +--- + +## NeMo Evaluator Launcher Assistant + +You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below. + +### Workspace (multi-user / Slack bot) + +If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications. 
+ +### Workflow + +```text +Config Generation Progress: +- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set) +- [ ] Step 1: Check if nel is installed and if user has existing config +- [ ] Step 2: Build the base config file +- [ ] Step 3: Configure model path and parameters +- [ ] Step 4: Fill in remaining missing values +- [ ] Step 5: Confirm tasks (iterative) +- [ ] Step 6: Advanced - Multi-node (Data Parallel) +- [ ] Step 7: Advanced - Interceptors +- [ ] Step 8: Run the evaluation +``` + +**Step 1: Check prerequisites** + +Test that `nel` is installed with `nel --version`. If not, instruct the user to `pip install nemo-evaluator-launcher`. + +If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running. + +**Step 2: Build the base config file** + +Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion: + +1. Execution: + +- Local +- SLURM + +2. Deployment: + +- None (External) +- vLLM +- SGLang +- NIM +- TRT-LLM + +3. Auto-export: + +- None (auto-export disabled) +- MLflow +- wandb + +4. Model type + +- Base +- Chat +- Reasoning + +5. Benchmarks: + Allow for multiple choices in this question. +1. Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...) +2. Code Evaluation (like HumanEval, MBPP, and LiveCodeBench) +3. Math & Reasoning (like AIME, GPQA, MATH-500, ...) +4. Safety & Security (like Garak and Safety Harness) +5. Multilingual (like MMATH, Global MMLU, MMLU-Prox) + +DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config. + +> **Note:** These categories come from NEL's `build-config` CLI. 
If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead. + +When you have all the answers, run the script to build the base config: + +```bash +nel skills build-config --execution --deployment --model_type --benchmarks [--export ] [--output ] +``` + +Where `--output` depends on what the user provides: + +- Omit: Uses current directory with auto-generated filename +- Directory: Writes to that directory with auto-generated filename +- File path (*.yaml): Writes to that specific file + +It never overwrites existing files. + +**Step 3: Configure model path and parameters** + +Ask for model path. Determine type: + +- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: ` and `deployment.hf_model_handle: null` +- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: ` and `deployment.checkpoint_path: null` + +**Auto-detect ModelOpt quantization format** (checkpoint paths only): + +Check for `hf_quant_config.json` in the checkpoint directory: + +```bash +cat /hf_quant_config.json 2>/dev/null +``` + +If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`: + +| `quant_algo` | Flag to add | +|-------------|-------------| +| `FP8` | `--quantization modelopt` | +| `W4A8_AWQ` | `--quantization modelopt` | +| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` | +| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure | + +If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed. + +> **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below. 
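
The detection steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of NEL or ModelOpt: the function name `detect_quant_flag` is hypothetical, and reading `quant_algo` out of the `config.json` `quantization_config` section is an assumption (the text above only says to look for `quant_method: "modelopt"` there).

```python
import json
from pathlib import Path

# Mapping copied from the quant_algo table above.
QUANT_FLAGS = {
    "FP8": "--quantization modelopt",
    "W4A8_AWQ": "--quantization modelopt",
    "NVFP4": "--quantization modelopt_fp4",
    "NVFP4_AWQ": "--quantization modelopt_fp4",
}
# "Other values" row: the first flag to try when the algorithm is not listed.
FALLBACK_FLAG = "--quantization modelopt"


def _load_json(path):
    """Return the parsed JSON file, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None


def detect_quant_flag(checkpoint_dir):
    """Return the flag for deployment.extra_args, or None if unquantized."""
    root = Path(checkpoint_dir)

    # 1. Preferred source: hf_quant_config.json -> quantization.quant_algo
    hf_quant = _load_json(root / "hf_quant_config.json")
    if isinstance(hf_quant, dict):
        algo = (hf_quant.get("quantization") or {}).get("quant_algo")
        if algo:
            return QUANT_FLAGS.get(algo, FALLBACK_FLAG)

    # 2. Fallback: config.json with quantization_config.quant_method == "modelopt".
    #    Assumption: if a quant_algo field is present there, map it the same way.
    config = _load_json(root / "config.json")
    if isinstance(config, dict):
        qcfg = config.get("quantization_config") or {}
        if qcfg.get("quant_method") == "modelopt":
            return QUANT_FLAGS.get(qcfg.get("quant_algo"), FALLBACK_FLAG)

    # 3. Neither file marks the checkpoint as quantized: no flag needed.
    return None
```

A `None` result means no flag is added. Any returned flag is still a starting point: when the algorithm is not in the table, confirm the choice against the vLLM/SGLang docs before running.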
+ +**Quantization-aware benchmark defaults:** + +When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include. + +Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm. + +**Step 4: Fill in remaining missing values** + +- Find all remaining `???` missing values in the config. +- Ask the user only for values that couldn't be auto-discovered from the model card (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI). Don't propose any defaults here. Let the user give you the values in plain text. +- Ask the user if they want to change any other defaults e.g. execution partition or walltime (if running on SLURM) or add MLflow/wandb tags (if auto-export enabled). + +**Step 5: Confirm tasks (iterative)** + +Show tasks in the current config. Loop until the user confirms the task list is final: + +1. Tell the user: "Run `nel ls tasks` to see all available tasks". +2. Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides. + To add per-task `nemo_evaluator_config` as specified by the user, e.g.: + + ```yaml + tasks: + - name: + nemo_evaluator_config: + config: + params: + temperature: + max_new_tokens: + ... + ``` + +3. Apply changes. +4. Show updated list and ask: "Is the task list final, or do you want to make more changes?" + +**Known Issues** + +- NeMo-Skills workaround (self-deployment only): If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level: + + ```yaml + target: + api_endpoint: + api_key_name: DUMMY_API_KEY + ``` + + For the None (External) deployment the `api_key_name` should be already defined. The `DUMMY_API_KEY` export is handled in Step 8. 
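
For Step 4, listing the remaining `???` placeholders can be sketched as a plain line scan. This is an illustrative helper only (`find_missing_values` is a hypothetical name, not a NEL function), and it deliberately avoids a YAML parser, so it assumes each placeholder sits on its own `key: ???` line.

```python
import re

# Matches "key: ???" (optionally quoted, optionally followed by a comment).
# Line-based on purpose: no YAML parsing, so nested paths are not resolved.
_MISSING = re.compile(r"""^\s*(?:-\s*)?([\w.\-]+)\s*:\s*['"]?\?\?\?['"]?\s*(?:#.*)?$""")


def find_missing_values(yaml_text):
    """Return (line_number, key) pairs for values still set to ???."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = _MISSING.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

The line numbers make it easy to show the user exactly which entries still need a value. An empty result suggests the config is ready for the task confirmation in Step 5.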
+ +**Step 6: Advanced - Multi-node** + +If the user needs multi-node evaluation (model >120B, or more throughput), read `references/multi-node.md` for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined). + +**Step 7: Advanced - Interceptors** + +- Tell the user they should see: . +- DON'T provide any general information about what interceptors typically do in API frameworks without reading the docs. Read the webpage only if the user asks about interceptors, so that you can provide precise information. +- If the user asks you to configure an interceptor, read that interceptor's webpage and configure it following the `--overrides` syntax, but put the values in the YAML config under `evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config` (NOT under `target.api_endpoint.adapter_config`) instead of using CLI overrides. + Defining an `interceptors` list overrides the full interceptor chain, which can have unintended consequences such as disabling default interceptors. Instead, use the fields listed in the `CLI Configuration` section after the `--overrides` keyword to configure interceptors in the YAML config. + +**Documentation Errata** + +- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`). + +**Step 8: Run the evaluation** + +Print the following commands to the user. Propose executing them in order, to confirm the config works as expected before the full run. + +**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands. 
+ +```bash +# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands): +export NEMO_EVALUATOR_TRUST_PRE_CMD=1 + +# If using nemo_skills.* tasks with self-deployment: +export DUMMY_API_KEY=dummy +``` + +1. **Dry-run** (validates config without running): + + ```bash + nel run --config --dry-run + ``` + +2. **Test with limited samples** (quick validation run): + + ```bash + nel run --config -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10 + ``` + +3. **Re-run a single task** (useful for debugging or re-testing after config changes): + + ```bash + nel run --config -t + ``` + + Combine with `-o` for limited samples: `nel run --config -t -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10` + +4. **Full evaluation** (production run): + + ```bash + nel run --config + ``` + +After the dry-run, check the output from `nel` for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation. + +**Monitoring Progress** + +After job submission, you can monitor progress using: + +1. **Check job status:** + + ```bash + nel status + nel info + ``` + +2. **Stream logs** (Local execution only): + + ```bash + nel logs + ``` + + Note: `nel logs` is not supported for SLURM execution. + +3. **Inspect logs via SSH** (SLURM workaround): + + When `nel logs` is unavailable (SLURM), use SSH to inspect logs directly: + + First, get log locations: + + ```bash + nel info --logs + ``` + + Then, use SSH to view logs: + + **Check server deployment logs:** + + ```bash + ssh @ "tail -100 --logs`>/server--*.log" + ``` + + Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl). 
+ + **Check evaluation client logs:** + + ```bash + ssh @ "tail -100 --logs`>/client-.log" + ``` + + Shows evaluation progress, task execution, and results. + + **Check SLURM scheduler logs:** + + ```bash + ssh @ "tail -100 --logs`>/slurm-.log" + ``` + + Shows job scheduling, health checks, and overall execution flow. + + **Search for errors:** + + ```bash + ssh @ "grep -i 'error\|warning\|failed' --logs`>/*.log" + ``` + +--- + +Direct users with issues to: + +- **GitHub Issues:** +- **GitHub Discussions:** + +Now, copy this checklist and track your progress: + +```text +Config Generation Progress: +- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set) +- [ ] Step 1: Check if nel is installed and if user has existing config +- [ ] Step 2: Build the base config file +- [ ] Step 3: Configure model path and parameters +- [ ] Step 4: Fill in remaining missing values +- [ ] Step 5: Confirm tasks (iterative) +- [ ] Step 6: Advanced - Multi-node (Data Parallel) +- [ ] Step 7: Advanced - Interceptors +- [ ] Step 8: Run the evaluation +``` diff --git a/.claude/skills/evaluation/references/model-card-research.md b/.claude/skills/evaluation/references/model-card-research.md new file mode 100644 index 000000000..4397f8873 --- /dev/null +++ b/.claude/skills/evaluation/references/model-card-research.md @@ -0,0 +1,30 @@ +# Model Card Research + +Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. 
Extract ALL relevant configurations: + +- Sampling params (`temperature`, `top_p`) +- Context length (`deployment.extra_args: "--max-model-len "`) +- TP/DP settings (to set them appropriately, AskUserQuestion on how many GPUs the model will be deployed) +- Reasoning config (if applicable): + - reasoning on/off: use either: + - `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched) + - `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched). + - reasoning effort/budget (if it's configurable, AskUserQuestion what reasoning effort they want) + - higher `max_new_tokens` + - etc. +- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command) +- Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with `deployment.image` e.g. vLLM above `vllm/vllm-openai:v0.11.0` stopped supporting `rope-scaling` arg used by Qwen models) +- ARM64 / non-standard GPU compatibility: The default `vllm/vllm-openai` image only supports common GPU architectures. 
For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead: + - Example: `deployment.image: nvcr.io/nvidia/vllm:26.01-py3` + - AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints +- Any preparation requirements (e.g., downloading reasoning parsers, custom plugins): + - If the model card mentions downloading files (like reasoning parsers, custom plugins) before deployment, add `deployment.pre_cmd` with the download command + - Use `curl` instead of `wget` as it's more widely available in Docker containers + - Example: `pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py` + - When using `pip install` in `pre_cmd`, always use `--no-cache-dir` to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems) + - Example: `pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation` +- Any other model-specific requirements + +Remember to also check the `evaluation.nemo_evaluator_config` and `evaluation.tasks.*.nemo_evaluator_config` overrides for parameters to adjust (e.g. disabling reasoning)! + +Present the findings, explain each setting, and ask the user to confirm or adjust. If no model card is found, ask the user directly for the above configurations. diff --git a/.claude/skills/evaluation/references/multi-node.md b/.claude/skills/evaluation/references/multi-node.md new file mode 100644 index 000000000..a7b9d27fb --- /dev/null +++ b/.claude/skills/evaluation/references/multi-node.md @@ -0,0 +1,53 @@ +# Multi-Node Evaluation Patterns + +There are two base multi-node patterns, plus a combination of both. Ask the user which applies: + +## Pattern A: Multi-instance (independent instances with HAProxy) + +Use only if the model has >120B parameters or the user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances." 
+ +```yaml +execution: + num_nodes: 4 # Total nodes + num_instances: 4 # 4 independent instances → HAProxy auto-enabled +``` + +## Pattern B: Multi-node single instance (Ray TP/PP across nodes) + +When a single model is too large for one node and needs pipeline parallelism across nodes. Use `vllm_ray` deployment config: + +```yaml +defaults: + - deployment: vllm_ray # Built-in Ray cluster setup (replaces manual pre_cmd) + +execution: + num_nodes: 2 # Single instance spanning 2 nodes + +deployment: + tensor_parallel_size: 8 + pipeline_parallel_size: 2 +``` + +## Pattern A+B combined: Multi-instance with multi-node instances + +For very large models needing both cross-node parallelism AND multiple instances: + +```yaml +defaults: + - deployment: vllm_ray + +execution: + num_nodes: 4 # Total nodes + num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled + +deployment: + tensor_parallel_size: 8 + pipeline_parallel_size: 2 +``` + +## Common Confusions + +- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance. +- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas). +- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance. +- `num_nodes` must be divisible by `num_instances`. diff --git a/.claude/skills/evaluation/references/quantization-benchmarks.md b/.claude/skills/evaluation/references/quantization-benchmarks.md new file mode 100644 index 000000000..a0ca45453 --- /dev/null +++ b/.claude/skills/evaluation/references/quantization-benchmarks.md @@ -0,0 +1,26 @@ +# Quantization-Aware Benchmark Recommendations + +When evaluating a quantized checkpoint, prioritize benchmarks that are sensitive to precision loss. 
+ +## Sensitivity ranking + +| Priority | Benchmarks | Why | +|----------|-----------|-----| +| **Always include** | MMLU | General knowledge — typically shows measurable accuracy loss from quantization | +| **Recommended** | GSM8K, ARC-Challenge | Math reasoning and general reasoning — sensitive to precision loss | +| **Good to add** | HumanEval, Winogrande | Code generation and commonsense — catches subtle degradation | +| **Less useful for quant comparison** | IFEval | Instruction following — typically less affected, but worth including for aggressive quantization like FP4 | + +## Recommended sets by use case + +| Use case | Benchmarks | +|----------|-----------| +| Quick sanity check | MMLU | +| Standard quant validation | MMLU, GSM8K, ARC-Challenge | +| Thorough evaluation | MMLU, GSM8K, ARC-Challenge, HumanEval, Winogrande | +| Code-focused model | HumanEval, MBPP, MMLU | +| Reasoning model | GSM8K, MATH-500, GPQA, MMLU | + +## How to use + +Present these recommendations to the user and ask which to include. If the user already specified benchmarks, keep their choice but mention any accuracy-sensitive benchmarks they may have missed. 
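
The "keep their choice but mention what they missed" rule can be sketched as a small lookup. This is a hedged illustration: the dictionary keys (`quick`, `standard`, and so on) and the function name `missed_benchmarks` are hypothetical labels introduced here, and the benchmark names are the display names from the tables above, not NEL task identifiers.

```python
# Recommended sets copied from the "Recommended sets by use case" table.
# The short keys are illustrative labels, not NEL options.
RECOMMENDED_SETS = {
    "quick": ["MMLU"],
    "standard": ["MMLU", "GSM8K", "ARC-Challenge"],
    "thorough": ["MMLU", "GSM8K", "ARC-Challenge", "HumanEval", "Winogrande"],
    "code": ["HumanEval", "MBPP", "MMLU"],
    "reasoning": ["GSM8K", "MATH-500", "GPQA", "MMLU"],
}


def missed_benchmarks(user_choice, use_case="standard"):
    """Return recommended benchmarks the user did not pick (case-insensitive).

    The user's own selection is never modified; the result is only a list of
    suggestions to mention.
    """
    chosen = {name.lower() for name in user_choice}
    return [b for b in RECOMMENDED_SETS[use_case] if b.lower() not in chosen]
```

For example, a user who asked for MMLU and GSM8K under the standard set would be told about ARC-Challenge, while their two choices stay in the config unchanged.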
diff --git a/.claude/skills/evaluation/tests/evals.json b/.claude/skills/evaluation/tests/evals.json new file mode 100644 index 000000000..0f35dacd7 --- /dev/null +++ b/.claude/skills/evaluation/tests/evals.json @@ -0,0 +1,65 @@ +[ + { + "name": "nemotron3-nano-bf16-reasoning", + "skills": ["evaluation"], + "query": "Help me evaluate Nemotron 3 Nano BF16 from NVIDIA", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running 'nel --version'", + "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks) before generating the config", + "Runs 'nel skills build-config' with correct flags matching user answers: --execution slurm --deployment vllm --model-type reasoning --benchmarks standard code math_reasoning --export mlflow", + "Searches the web for the model card on HuggingFace and extracts model-specific settings", + "Sets correct HF handle: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", + "Sets reasoning sampling params from model card: temperature=1.0, top_p=1.0", + "Configures reasoning toggle via params_to_add with chat_template_kwargs.enable_thinking (not via system prompt)", + "Disables reasoning for IFEval task using enable_thinking: false with use_system_prompt: false", + "Adds deployment.pre_cmd using curl (not wget) to download nano_v3_reasoning_parser.py from HuggingFace", + "Adds vLLM extra_args including --trust-remote-code, --reasoning-parser-plugin, --reasoning-parser nano_v3, --max-num-seqs 8", + "Pins vLLM image to v0.12.0 or later as required by model card", + "Adds target.api_endpoint.api_key_name: DUMMY_API_KEY for nemo_skills tasks with self-deployment", + "Fills in all ??? 
placeholders after asking the user for SLURM hostname, account, output_dir, MLflow tracking_uri, and experiment_name", + "Applies user-requested SLURM customizations: partition batch_short, walltime 00:20:00, MLflow tag scenario: demo", + "Presents task list and waits for user confirmation before proceeding", + "Configures request and response logging interceptors under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config using correct field names (max_logged_requests/max_logged_responses, not max_saved_*)", + "Handles dry-run failure for missing HF_TOKEN_FOR_GPQA_DIAMOND by offering to fix the config", + "Successfully submits test run with limit_samples=10 after dry-run passes", + "Provides monitoring commands (nel status, nel info --logs) and inspects server logs via SSH when asked" + ] + }, + { + "name": "quantized-checkpoint-local-vllm", + "skills": ["evaluation"], + "query": "evaluate my FP8 quantized Llama checkpoint at ./llama-3.1-8b-fp8 on MMLU and GSM8K", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running nel --version", + "Asks all 5 base config questions (execution, deployment, auto-export, model type, benchmarks)", + "Runs nel skills build-config with correct flags matching user answers", + "Sets deployment.checkpoint_path to ./llama-3.1-8b-fp8 and deployment.hf_model_handle to null", + "Auto-detects quantization format by reading ./llama-3.1-8b-fp8/hf_quant_config.json", + "Finds quant_algo=FP8 and adds --quantization modelopt to deployment.extra_args", + "Recommends accuracy-sensitive benchmarks from references/quantization-benchmarks.md", + "Searches web for Llama-3.1-8B model card and extracts sampling params, context length, TP settings", + "Fills in remaining missing values by asking user", + "Runs dry-run, then test with limit_samples=10, then full evaluation", + "Reports accuracy results per benchmark" + ] + }, + { + "name": "slurm-quantized-model", + "skills": ["evaluation"], + "query": 
"Evaluate my quantized Llama-3.1-8B-FP8 checkpoint on mmlu and gsm8k on the SLURM cluster", + "files": [], + "expected_behavior": [ + "Verifies nel is installed by running nel --version", + "Asks 5 base config questions with execution=slurm pre-selected based on user request", + "Runs nel skills build-config with --execution slurm --deployment vllm --benchmarks standard", + "Detects FP8 quantization from hf_quant_config.json and sets deployment.extra_args with --quantization modelopt", + "Reads references/quantization-benchmarks.md and recommends accuracy-sensitive benchmarks", + "Uses WebSearch to research model card for sampling params and context length", + "Fills in SLURM-specific values: hostname, account, partition from user input", + "Runs dry-run validation before full evaluation", + "Provides SSH-based log monitoring commands for SLURM execution" + ] + } +] diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml index 4c5a69014..f6de39a4a 100644 --- a/.markdownlint-cli2.yaml +++ b/.markdownlint-cli2.yaml @@ -2,6 +2,8 @@ config: MD013: false # line-length MD024: false # no-duplicate-heading MD028: false # no-blanks-blockquote + MD029: false # ol-prefix — upstream NEL skill uses actual numbers MD033: false # no-inline-html + MD036: false # no-emphasis-as-heading — upstream NEL skill uses **Bold** as headers MD041: false # first-line-heading MD059: false # no-hard-tabs