add deep research agent benchmark script. #2135
base: main
Conversation
Dependency Review: ✅ No vulnerabilities or license issues found. Scanned files: none.
Pull Request Overview
Adds a benchmarking harness for measuring the accuracy of the Deep Research Agent by integrating dataset loaders, an LLM-based scoring function, and CLI support.
- Wraps the agent's raw output in a JSON field ("answer") in research_agent.py (see the sketch after this list).
- Introduces eval.py to load questions, invoke the agent, score answers via an LLM judge, and save results.
- Updates documentation under benchmark/accuracy and the main README to cover setup and usage.
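For illustration, here is a minimal sketch of the output-wrapping behavior described in the first point; the function name and surrounding structure are assumptions, not the PR's actual code in research_agent.py:

```python
import json


def wrap_answer(raw_output: str) -> str:
    """Wrap the agent's raw text output in a JSON payload under an "answer" key.

    Illustrative only: the PR states the return payload becomes {"answer": ...},
    but the exact placement inside research_agent.py may differ.
    """
    return json.dumps({"answer": raw_output})


if __name__ == "__main__":
    print(wrap_answer("The capital of France is Paris."))
    # -> {"answer": "The capital of France is Paris."}
```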
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| DeepResearchAgent/research_agent.py | Return payload now wrapped as {"answer": ...} |
| DeepResearchAgent/benchmark/accuracy/eval.py | New benchmark script for accuracy evaluation |
| DeepResearchAgent/benchmark/accuracy/README.md | Instructions for running the accuracy benchmark |
| DeepResearchAgent/README.md | Added note on configuring deep_researcher.yaml in setup |
Comments suppressed due to low confidence (2)
DeepResearchAgent/benchmark/accuracy/eval.py:93
- [nitpick] There are no unit tests covering key functions like load_questions, process_single_question, or run_benchmark. Adding tests will help ensure correctness when loading datasets and computing accuracy (a minimal test sketch follows the snippet below).
def load_questions(dataset_names: list[str] | None = None) -> list[dict[str, str]]:
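A minimal sketch of the kind of unit test the comment asks for, assuming eval.py is importable from the benchmark/accuracy directory and that load_questions keeps the signature shown above; the dataset name and import path are assumptions, not confirmed by the PR:

```python
# test_eval.py -- illustrative test sketch, not part of the PR.
from eval import load_questions  # assumes the test file sits next to benchmark/accuracy/eval.py


def test_load_questions_returns_list_of_str_dicts():
    # "together-search-bench" is the default dataset documented in this PR's README table.
    questions = load_questions(["together-search-bench"])
    assert isinstance(questions, list)
    assert questions, "expected at least one question from the dataset"
    for item in questions:
        assert isinstance(item, dict)
        assert all(isinstance(k, str) and isinstance(v, str) for k, v in item.items())
```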
DeepResearchAgent/benchmark/accuracy/eval.py:334
- The args.agent_config field is referenced in metadata but no --agent-config argument is defined in the parser. Consider adding a corresponding parser argument (sketched below) or renaming this field to match --service-url or another existing flag.
"agent_config": args.agent_config,
| Argument | Default value | Description |
|--------|-------------|-------------|
| --datasets | together-search-bench | benchmark datasets, support "smolagents:simpleqa", "hotpotqa", "simpleqa", "together-search-bench" |
[nitpick] Grammar: change "support" to "supports" for correct subject-verb agreement in the description.
Suggested change:
|--datasets|together-search-bench| benchmark datasets, support "smolagents:simpleqa", "hotpotqa", "simpleqa", "together-search-bench" |
|--datasets|together-search-bench| benchmark datasets, supports "smolagents:simpleqa", "hotpotqa", "simpleqa", "together-search-bench" |
for more information, see https://pre-commit.ci
Can you add some benchmark results to the README? And add some recommendations about which LLMs to use with vllm-gaudi?
No problem, I will.
Description
Add a benchmark script for evaluating the accuracy of the Deep Research Agent.
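For context on the evaluation flow described above, here is a minimal sketch of an LLM-as-judge scoring step; the client library, judge model, and prompt are placeholders and not the PR's actual implementation in eval.py:

```python
from openai import OpenAI  # example judge backend only; eval.py may use a different client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_answer(question: str, predicted: str, reference: str) -> bool:
    """Ask an LLM judge whether the predicted answer matches the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {predicted}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```

Accuracy over a run would then be the fraction of questions the judge marks CORRECT.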