add deep research agent benchmark script. #2135
base: main
Conversation
Dependency Review: ✅ No vulnerabilities or license issues found. Scanned files: none.
Pull Request Overview
Adds a benchmarking harness for measuring the accuracy of the Deep Research Agent by integrating dataset loaders, an LLM-based scoring function, and CLI support.
- Wraps the agent's raw output in a JSON field ("answer") in research_agent.py (see the sketch after this list).
- Introduces eval.py to load questions, invoke the agent, score answers via an LLM judge, and save results.
- Updates documentation under benchmark/accuracy and the main README to cover setup and usage.
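For illustration, here is a minimal sketch of the output-wrapping behavior described in the first point; the function name and surrounding structure are assumptions, not the PR's actual code in research_agent.py:

```python
import json


def wrap_answer(raw_output: str) -> str:
    """Wrap the agent's raw text output in a JSON payload under an "answer" key.

    Illustrative only: the PR states the return payload becomes {"answer": ...},
    but the exact placement inside research_agent.py may differ.
    """
    return json.dumps({"answer": raw_output})


if __name__ == "__main__":
    print(wrap_answer("The capital of France is Paris."))
    # -> {"answer": "The capital of France is Paris."}
```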
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| DeepResearchAgent/research_agent.py | Return payload now wrapped as {"answer": ...} |
| DeepResearchAgent/benchmark/accuracy/eval.py | New benchmark script for accuracy evaluation |
| DeepResearchAgent/benchmark/accuracy/README.md | Instructions for running the accuracy benchmark |
| DeepResearchAgent/README.md | Added note on configuring deep_researcher.yaml in setup |
Comments suppressed due to low confidence (2)
DeepResearchAgent/benchmark/accuracy/eval.py:93
- [nitpick] There are no unit tests covering key functions like load_questions, process_single_question, or run_benchmark. Adding tests will help ensure correctness when loading datasets and computing accuracy (a minimal test sketch follows the snippet below).
def load_questions(dataset_names: list[str] | None = None) -> list[dict[str, str]]:
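A minimal sketch of the kind of unit test the comment asks for, assuming eval.py is importable from the benchmark/accuracy directory and that load_questions keeps the signature shown above; the dataset name and import path are assumptions, not confirmed by the PR:

```python
# test_eval.py -- illustrative test sketch, not part of the PR.
from eval import load_questions  # assumes the test file sits next to benchmark/accuracy/eval.py


def test_load_questions_returns_list_of_str_dicts():
    # "together-search-bench" is the default dataset documented in this PR's README table.
    questions = load_questions(["together-search-bench"])
    assert isinstance(questions, list)
    assert questions, "expected at least one question from the dataset"
    for item in questions:
        assert isinstance(item, dict)
        assert all(isinstance(k, str) and isinstance(v, str) for k, v in item.items())
```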
DeepResearchAgent/benchmark/accuracy/eval.py:334
- The args.agent_config field is referenced in metadata but no --agent-config argument is defined in the parser. Consider adding a corresponding parser argument (sketched below) or renaming this field to match --service-url or another existing flag.
"agent_config": args.agent_config,
| Argument | Default value | Description |
|--------|-------------|-------------|
| --datasets | together-search-bench | benchmark datasets, support "smolagents:simpleqa", "hotpotqa", "simpleqa", "together-search-bench" |
[nitpick] Grammar: change "support" to "supports" for correct subject-verb agreement in the description.
Suggested change:
|--datasets|together-search-bench| benchmark datasets, support "smolagents:simpleqa", "hotpotqa", "simpleqa", "together-search-bench" |
|--datasets|together-search-bench| benchmark datasets, supports "smolagents:simpleqa", "hotpotqa", "simpleqa", "together-search-bench" |
for more information, see https://pre-commit.ci
Can you add some benchmark results to the README? And add some recommendations about which LLMs to use with vllm-gaudi?
No problem, I will.
Description
Add a benchmark script for evaluating the accuracy of the Deep Research Agent.
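For context on the evaluation flow described above, here is a minimal sketch of an LLM-as-judge scoring step; the client library, judge model, and prompt are placeholders and not the PR's actual implementation in eval.py:

```python
from openai import OpenAI  # example judge backend only; eval.py may use a different client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_answer(question: str, predicted: str, reference: str) -> bool:
    """Ask an LLM judge whether the predicted answer matches the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {predicted}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```

Accuracy over a run would then be the fraction of questions the judge marks CORRECT.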