Problem
The preflight LLM check in .github/run-eval/resolve_model_config.py fails for thinking models like NVIDIA Nemotron-3 Super 120B.
When enable_thinking: true is set in the model config, the model emits its output as reasoning_content instead of content; with the preflight's small token budget, the entire budget is consumed by reasoning (note finish_reason=length below), so content comes back empty. The current preflight check only validates content, so it always sees an empty response and aborts the evaluation.
✗ NVIDIA Nemotron-3 Super 120B: Empty response (finish_reason=length, usage=Usage(completion_tokens=100, ...))
Fix
Also accept reasoning_content alongside content. Since reasoning_content is a provider-specific extension and not a standard field on the SDK message type, guard the access with getattr:
message = response.choices[0].message if response.choices else None
response_content = message.content if message else None
# reasoning_content is a provider extension; it may be absent on non-thinking models
reasoning_content = getattr(message, "reasoning_content", None) if message else None
if response_content or reasoning_content: