37 commits
055e4e6
Update software-agent-sdk submodule to main
openhands-agent Nov 7, 2025
a6ec978
initial commit, eval for code search
openhands-agent Nov 7, 2025
36fa267
Num runs should be managed by the user externally
adityasoni9998 Nov 7, 2025
7d3d360
Update software-agent-sdk submodule to main
adityasoni9998 Nov 7, 2025
5bf46dd
docker works
adityasoni9998 Nov 7, 2025
1fc3cac
example config for qwen3
adityasoni9998 Nov 7, 2025
5f74f63
local runtime works
adityasoni9998 Nov 7, 2025
5e2820d
use host network in agent sdk
adityasoni9998 Nov 9, 2025
bfe182a
add eval
adityasoni9998 Nov 10, 2025
72ef6ff
add eval
adityasoni9998 Nov 10, 2025
b891149
add analysis code
adityasoni9998 Nov 10, 2025
479c081
module-level rewards
adityasoni9998 Dec 4, 2025
86957d8
fine-grained rewards eval
adityasoni9998 Dec 8, 2025
fe75fb2
fine-grained rewards
adityasoni9998 Dec 8, 2025
64bb3ee
docker doesn't work but local does
adityasoni9998 Dec 8, 2025
db8e7bb
update README
adityasoni9998 Dec 8, 2025
6b92366
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 22, 2025
6d52715
revert to only allow local workspace in agentic code search
adityasoni9998 Dec 22, 2025
76b4a01
minor code bug fix
adityasoni9998 Dec 22, 2025
dea232c
Merge branch 'main' into agentic_code_search
adityasoni9998 Dec 29, 2025
a417dc6
Update software-agent-sdk submodule to match trainer
adityasoni9998 Dec 29, 2025
11ea94e
update parser config
adityasoni9998 Dec 29, 2025
7730bac
add dataset
adityasoni9998 Dec 29, 2025
160f527
Merge main into agentic_code_search and fix CI issues
openhands-agent Jan 8, 2026
39b8e0a
update agent-sdk
adityasoni9998 Jan 25, 2026
67dcc25
working checkpoint
adityasoni9998 Feb 23, 2026
9c1202e
prompt cleanup
adityasoni9998 Feb 23, 2026
65357c9
update eval code
adityasoni9998 Feb 23, 2026
a086441
cleanup code
adityasoni9998 Feb 23, 2026
a7198ae
rollout logic
adityasoni9998 Feb 23, 2026
482a100
add reminder logic to run_infer.py
adityasoni9998 Feb 23, 2026
c4eeee1
fix regression -- detect conversation ending properly
adityasoni9998 Feb 23, 2026
b449f20
polish metric computation
adityasoni9998 Feb 24, 2026
26fbbb4
minor update
adityasoni9998 Mar 18, 2026
6262db1
minor update
adityasoni9998 Mar 18, 2026
41717ea
Revise README with upcoming details notice
adityasoni9998 Mar 19, 2026
7cf83b8
Revise README for CodeScout evaluation setup
adityasoni9998 Mar 19, 2026
7 changes: 7 additions & 0 deletions .gitignore
@@ -217,3 +217,10 @@ workspace/
# Evaluation outputs
eval_outputs/
builds/
qwen3*/
sonnet*/
sft*/
logs/
gpt*/
claude*/
agentic_code_search_oss_outputs/
45 changes: 44 additions & 1 deletion README.md
@@ -1,4 +1,47 @@
# OpenHands Benchmarks
# Evaluation Code for CodeScout

This repository contains the code for all evaluation experiments in the paper [CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents](https://arxiv.org/abs/2603.17829).

This codebase was developed by forking the [OpenHands/benchmarks](https://github.com/OpenHands/benchmarks) repository, originally designed for evaluating OpenHands v1 on various benchmarks, which provided several utility functions to build on. The original README of the benchmarks repository is included [here](./README_upstream.md).

> NOTE: While this repository contains evaluation code for several other benchmarks, it is intended only for CodeScout-related evaluations in the [agentic_code_search](./benchmarks/agentic_code_search/) directory.

## Environment Setup

### Pre-requisites

Note that all experiments are run locally on a Unix machine. Before installing dependencies, you must install uv and ripgrep on the machine:

1. `uv >= 0.8.13` : [Installation instructions](https://docs.astral.sh/uv/getting-started/installation/).
2. `ripgrep`: [Installation instructions](https://github.com/burntsushi/ripgrep?tab=readme-ov-file#installation).
- **Note**: We used v15.1.0 in our experiments.
- We installed ripgrep via cargo:
```bash
# Step 1: Install Rust (if not already installed on the machine)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Step 2: Install ripgrep via cargo
cargo install ripgrep --version 15.1.0

# Step 3: Verify that the installation completed successfully - this command should run without errors
rg --version
```

## Installing Dependencies

Note that this codebase requires querying LLMs via an OpenAI-compatible endpoint. We use `vllm==0.10.2` to host models locally, and the command below will install this package in your environment. Furthermore, we have pinned a specific commit of the OpenHands software-agent-sdk for reproducibility. You can install all the dependencies using:

```bash
make build
```

Make sure to activate the virtual environment using `source .venv/bin/activate` before running experiments.
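
For reference, hosting a model behind an OpenAI-compatible endpoint with vllm might look like the sketch below. The model name and port are assumptions for illustration, not values taken from the paper's configs; substitute the checkpoint you intend to evaluate.

```shell
# Serve a model with an OpenAI-compatible API on localhost:8000
# (hypothetical model name; replace with the checkpoint under evaluation)
vllm serve Qwen/Qwen3-8B --port 8000 --served-model-name qwen3-8b
```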

Refer to [this README](benchmarks/agentic_code_search/README.md) for more details on reproducing results reported in the CodeScout paper.

----
# Original README of the OpenHands Benchmarks repository

This repository contains benchmark evaluation infrastructure for [OpenHands](https://github.com/OpenHands/OpenHands/) agents. It provides standardized evaluation pipelines for testing agent capabilities across various real-world tasks.

4 changes: 4 additions & 0 deletions benchmarks/agentic_code_search/README.md
@@ -0,0 +1,4 @@
## Agentic Code Search

Benchmarking code to evaluate LLMs on their ability to localize the code in a Python repository that must be edited to fix an issue described in natural language.
More details coming soon...
Empty file.
83 changes: 83 additions & 0 deletions benchmarks/agentic_code_search/custom_agent.py
@@ -0,0 +1,83 @@
import re
from concurrent.futures import ThreadPoolExecutor
from typing import TYPE_CHECKING

from openhands.sdk import Agent
from openhands.sdk.conversation import (
    ConversationState,
)
from openhands.sdk.logger import get_logger
from openhands.sdk.mcp import create_mcp_tools
from openhands.sdk.observability.laminar import (
    maybe_init_laminar,
)
from openhands.sdk.tool import ToolDefinition, resolve_tool


if TYPE_CHECKING:
    from openhands.sdk.conversation import ConversationState

logger = get_logger(__name__)
maybe_init_laminar()


class CustomAgent(Agent):
    def _initialize(self, state: "ConversationState"):
        """Create an AgentBase instance from an AgentSpec."""
        if self._tools:
            logger.warning("Agent already initialized; skipping re-initialization.")
            return

        tools: list[ToolDefinition] = []

        # Use ThreadPoolExecutor to parallelize tool resolution
        with ThreadPoolExecutor(max_workers=4) as executor:
            futures = []

            # Submit tool resolution tasks
            for tool_spec in self.tools:
                future = executor.submit(resolve_tool, tool_spec, state)
                futures.append(future)

            # Submit MCP tools creation if configured
            if self.mcp_config:
                future = executor.submit(create_mcp_tools, self.mcp_config, 30)
                futures.append(future)

            # Collect results as they complete
            for future in futures:
                result = future.result()
                tools.extend(result)

        logger.info(
            f"Loaded {len(tools)} tools from spec: {[tool.name for tool in tools]}"
        )
        if self.filter_tools_regex:
            pattern = re.compile(self.filter_tools_regex)
            tools = [tool for tool in tools if pattern.match(tool.name)]
            logger.info(
                f"Filtered to {len(tools)} tools after applying regex filter: "
                f"{[tool.name for tool in tools]}",
            )

        # Do not include built-in tools; not subject to filtering
        # Instantiate built-in tools using their .create() method
        # for tool_class in BUILT_IN_TOOLS:
        #     tools.extend(tool_class.create(state))

        # Check tool types
        for tool in tools:
            if not isinstance(tool, ToolDefinition):
                raise ValueError(
                    f"Tool {tool} is not an instance of 'ToolDefinition'. "
                    f"Got type: {type(tool)}"
                )

        # Check name duplicates
        tool_names = [tool.name for tool in tools]
        if len(tool_names) != len(set(tool_names)):
            duplicates = {name for name in tool_names if tool_names.count(name) > 1}
            raise ValueError(f"Duplicate tool names found: {duplicates}")

        # Store tools in a dict for easy access
        self._tools = {tool.name: tool for tool in tools}
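
The parallel tool-resolution pattern in `_initialize` can be isolated into a small, SDK-free sketch. The `resolve_tool` stub below is a hypothetical stand-in for the SDK resolver (which returns `ToolDefinition` objects, not strings); only the executor usage, duplicate-name check, and dict storage mirror the agent code above.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_tool(spec: str) -> list[str]:
    # Hypothetical stand-in: the real SDK resolver builds ToolDefinition objects.
    return [f"{spec}_tool"]


def load_tools(specs: list[str]) -> dict[str, str]:
    tools: list[str] = []
    # Resolve all tool specs in parallel, mirroring the agent's executor usage
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(resolve_tool, spec) for spec in specs]
        for future in futures:
            tools.extend(future.result())

    # Reject duplicate names, as _initialize does
    if len(tools) != len(set(tools)):
        duplicates = {name for name in tools if tools.count(name) > 1}
        raise ValueError(f"Duplicate tool names found: {duplicates}")

    # Store tools in a dict keyed by name for easy access
    return {name: name for name in tools}
```

Collecting `future.result()` in submission order keeps the tool list deterministic, which matters because the final dict is keyed by name.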