37 commits
055e4e6
Update software-agent-sdk submodule to main
openhands-agent Nov 7, 2025
a6ec978
initial commit, eval for code search
openhands-agent Nov 7, 2025
36fa267
Num runs should be managed by the user externally
adityasoni9998 Nov 7, 2025
7d3d360
Update software-agent-sdk submodule to main
adityasoni9998 Nov 7, 2025
5bf46dd
docker works
adityasoni9998 Nov 7, 2025
1fc3cac
example config for qwen3
adityasoni9998 Nov 7, 2025
5f74f63
local runtime works
adityasoni9998 Nov 7, 2025
5e2820d
use host network in agent sdk
adityasoni9998 Nov 9, 2025
bfe182a
add eval
adityasoni9998 Nov 10, 2025
72ef6ff
add eval
adityasoni9998 Nov 10, 2025
b891149
add analysis code
adityasoni9998 Nov 10, 2025
479c081
module-level rewards
adityasoni9998 Dec 4, 2025
86957d8
fine-grained rewards eval
adityasoni9998 Dec 8, 2025
fe75fb2
fine-grained rewards
adityasoni9998 Dec 8, 2025
64bb3ee
docker doesn't work but local does
adityasoni9998 Dec 8, 2025
db8e7bb
update README
adityasoni9998 Dec 8, 2025
6b92366
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 22, 2025
6d52715
revert to only allow local workspace in agentic code search
adityasoni9998 Dec 22, 2025
76b4a01
minor code bug fix
adityasoni9998 Dec 22, 2025
dea232c
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 29, 2025
a417dc6
Update software-agent-sdk submodule to match trainer
adityasoni9998 Dec 29, 2025
11ea94e
update parser config
adityasoni9998 Dec 29, 2025
7730bac
add dataset
adityasoni9998 Dec 29, 2025
160f527
Merge main into agentic_code_search and fix CI issues
openhands-agent Jan 8, 2026
39b8e0a
update agent-sdk
adityasoni9998 Jan 25, 2026
67dcc25
working checkpoint
adityasoni9998 Feb 23, 2026
9c1202e
prompt cleanup
adityasoni9998 Feb 23, 2026
65357c9
update eval code
adityasoni9998 Feb 23, 2026
a086441
cleanup code
adityasoni9998 Feb 23, 2026
a7198ae
rollout logic
adityasoni9998 Feb 23, 2026
482a100
add reminder logic to run_infer.py
adityasoni9998 Feb 23, 2026
c4eeee1
fix regression -- detect conversation ending properly
adityasoni9998 Feb 23, 2026
b449f20
polish metric computation
adityasoni9998 Feb 24, 2026
26fbbb4
minor update
adityasoni9998 Mar 18, 2026
6262db1
minor update
adityasoni9998 Mar 18, 2026
41717ea
Revise README with upcoming details notice
adityasoni9998 Mar 19, 2026
7cf83b8
Revise README for CodeScout evaluation setup
adityasoni9998 Mar 19, 2026
7 changes: 7 additions & 0 deletions .gitignore
@@ -217,3 +217,10 @@ workspace/
# Evaluation outputs
eval_outputs/
builds/
qwen3*/
sonnet*/
sft*/
logs/
gpt*/
claude*/
agentic_code_search_oss_outputs/
45 changes: 44 additions & 1 deletion README.md
@@ -1,4 +1,47 @@
# OpenHands Benchmarks
# Evaluation Code for CodeScout

This repository contains the code for all evaluation experiments in the paper [CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents](https://arxiv.org/abs/2603.17829).

This codebase was developed by forking the [OpenHands/benchmarks](https://github.com/OpenHands/benchmarks) repository, originally designed for evaluating OpenHands v1 on various benchmarks, which provided several utility functions to build on. The original README of the benchmarks repository is included [here](./README_upstream.md).

> NOTE: While this repository contains evaluation code for several other benchmarks, it is intended only for CodeScout-related evaluations in the [agentic_code_search](./benchmarks/agentic_code_search/) directory.

## Environment Setup

### Pre-requisites

Note that all experiments are run locally on a Unix machine. Before installing dependencies, you must install uv and ripgrep on the machine:

1. `uv >= 0.8.13` : [Installation instructions](https://docs.astral.sh/uv/getting-started/installation/).
2. `ripgrep`: [Installation instructions](https://github.com/burntsushi/ripgrep?tab=readme-ov-file#installation).
- **Note**: We used v15.1.0 in our experiments.
- We installed ripgrep via cargo:
```bash
# Step 1: Install Rust (if not already installed on the machine)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Step 2: Install ripgrep via cargo
cargo install ripgrep --version 15.1.0

# Step 3: Verify that the installation completed successfully - this command should run without errors
rg --version
```

## Installing Dependencies

Note that this codebase requires querying LLMs via an OpenAI-compatible endpoint. We use `vllm==0.10.2` to host models locally, and the command below will install this package in your environment. Furthermore, we have pinned a specific commit of the OpenHands software-agent-sdk for reproducibility. You can install all the dependencies using:

```bash
make build
```

Make sure to activate the virtual environment using `source .venv/bin/activate` before running experiments.
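
For reference, hosting a model behind an OpenAI-compatible endpoint with vllm might look like the sketch below. The model name and port are assumptions for illustration, not values taken from the paper's configs; substitute the checkpoint you intend to evaluate.

```shell
# Serve a model with an OpenAI-compatible API on localhost:8000
# (hypothetical model name; replace with the checkpoint under evaluation)
vllm serve Qwen/Qwen3-8B --port 8000 --served-model-name qwen3-8b
```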

Refer to [this README](benchmarks/agentic_code_search/README.md) for more details on reproducing results reported in the CodeScout paper.

----
# Original README of the OpenHands Benchmarks repository

This repository contains benchmark evaluation infrastructure for [OpenHands](https://github.com/OpenHands/OpenHands/) agents. It provides standardized evaluation pipelines for testing agent capabilities across various real-world tasks.

4 changes: 4 additions & 0 deletions benchmarks/agentic_code_search/README.md
@@ -0,0 +1,4 @@
## Agentic Code Search

Benchmarking code to evaluate LLMs on their ability to localize the code in a Python repository that must be edited to fix an issue described in natural language.
More details coming soon...
Empty file.
83 changes: 83 additions & 0 deletions benchmarks/agentic_code_search/custom_agent.py
@@ -0,0 +1,83 @@
import re
from concurrent.futures import ThreadPoolExecutor
from typing import TYPE_CHECKING

from openhands.sdk import Agent
from openhands.sdk.conversation import (
    ConversationState,
)
from openhands.sdk.logger import get_logger
from openhands.sdk.mcp import create_mcp_tools
from openhands.sdk.observability.laminar import (
    maybe_init_laminar,
)
from openhands.sdk.tool import ToolDefinition, resolve_tool


if TYPE_CHECKING:
    from openhands.sdk.conversation import ConversationState

logger = get_logger(__name__)
maybe_init_laminar()


class CustomAgent(Agent):
    def _initialize(self, state: "ConversationState"):
        """Create an AgentBase instance from an AgentSpec."""
        if self._tools:
            logger.warning("Agent already initialized; skipping re-initialization.")
            return

        tools: list[ToolDefinition] = []

        # Use ThreadPoolExecutor to parallelize tool resolution
        with ThreadPoolExecutor(max_workers=4) as executor:
            futures = []

            # Submit tool resolution tasks
            for tool_spec in self.tools:
                future = executor.submit(resolve_tool, tool_spec, state)
                futures.append(future)

            # Submit MCP tools creation if configured
            if self.mcp_config:
                future = executor.submit(create_mcp_tools, self.mcp_config, 30)
                futures.append(future)

            # Collect results as they complete
            for future in futures:
                result = future.result()
                tools.extend(result)

        logger.info(
            f"Loaded {len(tools)} tools from spec: {[tool.name for tool in tools]}"
        )
        if self.filter_tools_regex:
            pattern = re.compile(self.filter_tools_regex)
            tools = [tool for tool in tools if pattern.match(tool.name)]
            logger.info(
                f"Filtered to {len(tools)} tools after applying regex filter: "
                f"{[tool.name for tool in tools]}",
            )

        # Do not include built-in tools; not subject to filtering
        # Instantiate built-in tools using their .create() method
        # for tool_class in BUILT_IN_TOOLS:
        #     tools.extend(tool_class.create(state))

        # Check tool types
        for tool in tools:
            if not isinstance(tool, ToolDefinition):
                raise ValueError(
                    f"Tool {tool} is not an instance of 'ToolDefinition'. "
                    f"Got type: {type(tool)}"
                )

        # Check name duplicates
        tool_names = [tool.name for tool in tools]
        if len(tool_names) != len(set(tool_names)):
            duplicates = {name for name in tool_names if tool_names.count(name) > 1}
            raise ValueError(f"Duplicate tool names found: {duplicates}")

        # Store tools in a dict for easy access
        self._tools = {tool.name: tool for tool in tools}
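
The parallel tool-resolution pattern in `_initialize` can be isolated into a small, SDK-free sketch. The `resolve_tool` stub below is a hypothetical stand-in for the SDK resolver (which returns `ToolDefinition` objects, not strings); only the executor usage, duplicate-name check, and dict storage mirror the agent code above.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_tool(spec: str) -> list[str]:
    # Hypothetical stand-in: the real SDK resolver builds ToolDefinition objects.
    return [f"{spec}_tool"]


def load_tools(specs: list[str]) -> dict[str, str]:
    tools: list[str] = []
    # Resolve all tool specs in parallel, mirroring the agent's executor usage
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(resolve_tool, spec) for spec in specs]
        for future in futures:
            tools.extend(future.result())

    # Reject duplicate names, as _initialize does
    if len(tools) != len(set(tools)):
        duplicates = {name for name in tools if tools.count(name) > 1}
        raise ValueError(f"Duplicate tool names found: {duplicates}")

    # Store tools in a dict keyed by name for easy access
    return {name: name for name in tools}
```

Collecting `future.result()` in submission order keeps the tool list deterministic, which matters because the final dict is keyed by name.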