Add task validation CLI #302
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b4f45c79d
any thoughts? @lorenss-m
hud/datasets/runner.py
Outdated
        task_list = [t if isinstance(t, Task) else Task.from_v4(t) for t in tasks]

        if not task_list:
            raise ValueError("No tasks to run")
Duplicated task normalization logic in runner functions
Medium Severity
The task normalization logic in run_dataset_async (lines 159-175) is nearly identical to that in run_dataset (lines 76-94). Both functions normalize agent_type from a string to the AgentType enum and normalize tasks from various input types to list[Task]. These ~17 lines of duplicated code should be extracted into a shared helper function such as _normalize_tasks().
579b4f0 to 0191b4d
Cursor Bugbot has reviewed your changes and found 2 potential issues.
    def _load_raw_tasks(source: str) -> tuple[list[dict[str, Any]], list[str]]:
        path = Path(source)
        if path.exists() and path.suffix.lower() in {".json", ".jsonl"}:
Case sensitivity mismatch between validation and loading
Low Severity
The _load_raw_tasks and _load_raw_from_file functions use case-insensitive extension matching via .suffix.lower(), while the existing load_tasks function in loader.py uses case-sensitive matching. This means a file like tasks.JSONL would pass validation but fail when actually loaded via load_tasks, because loader.py wouldn't recognize the uppercase extension and would incorrectly try to fetch it as a HuggingFace dataset.
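The mismatch can be reproduced with pathlib alone. In the sketch below, is_local_task_file is a hypothetical shared predicate (not a function from the PR) that both the validator and the loader could call. The case-sensitive variant only mirrors the behaviour this comment attributes to loader.py, and the path.exists() check from the diff is left out to keep the example self-contained.

```python
from pathlib import Path

LOCAL_SUFFIXES = {".json", ".jsonl"}


def is_local_task_file(source: str) -> bool:
    """Hypothetical shared predicate: case-insensitive suffix match."""
    return Path(source).suffix.lower() in LOCAL_SUFFIXES


def is_local_task_file_case_sensitive(source: str) -> bool:
    """Mirrors the case-sensitive check the review attributes to loader.py."""
    return Path(source).suffix in LOCAL_SUFFIXES


# The two checks disagree only on upper- or mixed-case extensions.
for name in ("tasks.jsonl", "tasks.JSONL", "org/dataset"):
    print(name, is_local_task_file(name), is_local_task_file_case_sensitive(name))
```

Routing both code paths through one predicate would remove the disagreement whichever matching rule the maintainers prefer.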
Additional Locations (1)
        module = importlib.util.module_from_spec(spec)  # type: ignore[arg-type]
        assert spec and spec.loader
        spec.loader.exec_module(module)
        return module.validate_command
Test uses unnecessarily complex importlib module loading
Medium Severity
The _load_validate_command() function uses importlib.util.spec_from_file_location to manually load the module when a simple import would work: from hud.cli.validate import validate_command. This pattern is inconsistent with other tests in hud/cli/tests/ which use standard imports.
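For comparison, here is roughly what the manual path has to get right, next to the one-line import the comment recommends. load_module_from_path and the temporary fake_validate module are hypothetical and exist only to make the sketch runnable; note that the None check belongs before module_from_spec, which is the reverse of the order in the snippet above.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: Path) -> ModuleType:
    """Manual loading: three steps plus an explicit None check."""
    spec = importlib.util.spec_from_file_location(name, path)
    # Check the spec BEFORE using it, so a bad path fails with a clear error.
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Throwaway module so the example does not depend on the hud package.
with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "fake_validate.py"
    module_path.write_text("def validate_command():\n    return 'ok'\n")
    module = load_module_from_path("fake_validate", module_path)
    result = module.validate_command()

# The equivalent in the test suite, per the review comment, would simply be:
# from hud.cli.validate import validate_command
```

Manual loading is usually reserved for files that are not importable as part of a package, which the review suggests is not the case here.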


Summary
hud validate to check task files or HF datasets without running them
Note
Low Risk
Adds a new CLI command and validation-only code paths; main risk is false positives/negatives in task parsing/validation rather than runtime behavior changes.
Overview
Adds a new hud validate command that loads tasks from a local .json/.jsonl file or a dataset slug and validates each entry without running an eval.
Validation now checks v4-style tasks via validate_v4_task (when detected) and always attempts Pydantic Task construction, aggregating and printing per-task errors before exiting non-zero on failure. Includes unit tests covering valid tasks, missing required fields, and non-dict entries in the tasks list.
Written by Cursor Bugbot for commit 0191b4d. This will update automatically on new commits.
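The aggregate-then-exit behaviour described in the overview can be sketched as follows. This is not the PR's implementation: collect_errors, require_prompt, and exit_code are hypothetical names, and require_prompt stands in for the real checks (validate_v4_task and Task construction), which are not reproduced here.

```python
from __future__ import annotations

from typing import Any, Callable


def collect_errors(
    entries: list[Any],
    check: Callable[[dict[str, Any]], None],
) -> list[str]:
    """Run `check` on every entry and gather errors instead of stopping early."""
    errors: list[str] = []
    for index, entry in enumerate(entries):
        # Non-dict entries are reported, not passed to the checker.
        if not isinstance(entry, dict):
            errors.append(
                f"task[{index}]: expected an object, got {type(entry).__name__}"
            )
            continue
        try:
            check(entry)
        except (ValueError, TypeError, KeyError) as exc:
            errors.append(f"task[{index}]: {exc}")
    return errors


def require_prompt(entry: dict[str, Any]) -> None:
    """Hypothetical stand-in for the real per-task validation."""
    if "prompt" not in entry:
        raise ValueError("missing required field 'prompt'")


def exit_code(errors: list[str]) -> int:
    """Non-zero when any task failed, matching the described CLI behaviour."""
    return 1 if errors else 0


errors = collect_errors([{"prompt": "x"}, {}, "oops"], require_prompt)
for line in errors:
    print(line)
```

Collecting every error before exiting lets a user fix a whole task file in one pass instead of rerunning the command after each failure.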