
Conversation

@Dsantra92

@Dsantra92 Dsantra92 commented Nov 10, 2025

There are still a lot of sharp edges and code cleanups to be done. This is still in the research stage and should ideally be merged only once we have actual benchmarks ready for QA.

Summary by CodeRabbit

  • New Features

    • Added benchmarking and evaluation infrastructure to assess tool performance on standardized questions.
    • Implemented semantic evaluation workflow for comparing tool outputs against expected results.
  • Chores

    • Updated dependencies with GPU acceleration support, modern async libraries, and enhanced ML tooling for improved performance and scalability.

@coderabbitai
Contributor

coderabbitai bot commented Nov 10, 2025

Walkthrough

This pull request introduces a comprehensive benchmarking infrastructure for evaluating LLM agents on code-related tasks. New modules implement Git repository management with parallel worktree creation, asynchronous orchestration of Potpie service integration, evaluation workflow orchestration with CSV-based benchmark data loading, and semantic evaluation metrics. Dependencies are significantly expanded to support GPU acceleration and modern ML/AI tooling.

Changes

  • Ignore Configuration: .gitignore
    Adds the repos/ directory to .gitignore to exclude the repository cache from version control.
  • Git Repository & Worktree Management: benchmark/download.py
    New module implementing a parallel Git workflow for cloning bare repositories, creating local controller clones, fetching commits, and generating multiple worktrees. Provides subprocess wrapping, idempotent worktree creation, and per-repo/per-outcome accounting with robust logging.
  • Potpie Service Integration: benchmark/potpie.py
    New async module providing an HTTP client wrapper (PotpieClient), a readiness coordination primitive (ReadinessCoordinator), and a background polling worker (PollingWorker) for managing project parsing, status polling, and message-based QA. Includes project caching via diskcache and an end-to-end workflow for aggregating LLM answers.
  • Evaluation Orchestration: benchmark/evaluate.py
    New script orchestrating end-to-end evaluation: loads benchmark problems from CSV, prepares worktrees, invokes agent evaluation tasks asynchronously, constructs test cases with expected outputs, runs metric-based evaluation, and outputs results.
  • Evaluation Metrics: benchmark/metrics.py
    New module defining semantic evaluation criteria (EVAL_CRITERIA) for scoring similarity between generated and expected answers, and instantiating a GEval evaluator (correctness) in strict mode.
  • Dependency Management: requirements.txt
    Major overhaul replacing the legacy web/ORM stack (uvicorn, sqlalchemy, alembic, celery, redis) with modern async and GPU-accelerated tooling. Adds GPU/accelerator packages (nvidia-cublas-cu12, torch==2.5.1, transformers==4.57.1), async frameworks (aiohttp, fastapi), ML/NLP libraries (langchain ecosystem, sentence-transformers), and updated data/serialization packages (pydantic, orjson).
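The parallel clone fan-out with per-outcome accounting described for benchmark/download.py above can be sketched with ThreadPoolExecutor. This is a hypothetical illustration, not the module's actual code: clone_repo and clone_all are stand-in names, and the body of clone_repo replaces the real subprocess-based git work with a trivial check.

```python
# Hypothetical sketch of the parallel fan-out pattern: submit one job
# per repo, collect results as they complete, and keep per-outcome
# accounting. clone_repo stands in for the real `git clone --bare`
# plus `git fetch` subprocess calls.
from concurrent.futures import ThreadPoolExecutor, as_completed


def clone_repo(repo_url: str) -> str:
    # Placeholder for the subprocess-based git work; returns a path.
    if not repo_url.startswith("https://"):
        raise ValueError(f"unsupported URL: {repo_url}")
    return f"repos/{repo_url.rsplit('/', 1)[-1]}"


def clone_all(repo_urls: list[str], max_workers: int = 4) -> dict[str, list[str]]:
    # Per-outcome accounting: which URLs succeeded and which failed.
    outcomes: dict[str, list[str]] = {"ok": [], "failed": []}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(clone_repo, url): url for url in repo_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                outcomes["ok"].append(url)
            except (ValueError, OSError):
                outcomes["failed"].append(url)
    return outcomes
```

Catching only the expected exception types in the accounting loop is the same discipline the review below asks for (no blind `except Exception`).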

Sequence Diagrams

sequenceDiagram
    participant User
    participant evaluate.py
    participant download.py
    participant potpie.py
    participant metrics.py
    participant deepeval

    User->>evaluate.py: Run evaluation (--agents potpie --input benchmark.csv)
    evaluate.py->>evaluate.py: Load benchmark problems from CSV
    evaluate.py->>download.py: prepare_worktrees(problems, batch_no)
    download.py->>download.py: Clone bare repos in parallel
    download.py->>download.py: Create controllers & worktrees per commit
    download.py-->>evaluate.py: Return repo_dict mapping
    
    evaluate.py->>potpie.py: get_all_st_answers(problems, repo_dict)
    potpie.py->>potpie.py: Initialize PotpieClient & ReadinessCoordinator
    potpie.py->>potpie.py: Start PollingWorker background task
    
    rect rgb(220, 240, 255)
    Note over potpie.py: For each problem
    potpie.py->>potpie.py: post_parse(commit_id, repo_path)
    potpie.py->>potpie.py: Enqueue project with PollingWorker
    potpie.py->>potpie.py: PollingWorker polls get_parse_status in background
    potpie.py->>potpie.py: Mark ready when status complete
    potpie.py->>potpie.py: send_message(project_id, question)
    end
    
    potpie.py-->>evaluate.py: Aggregated answers
    evaluate.py->>evaluate.py: Align answers with expected outputs
    evaluate.py->>evaluate.py: Create LLMTestCase instances
    evaluate.py->>deepeval: evaluate(test_cases, correctness metric)
    deepeval-->>evaluate.py: Success flags per test
    evaluate.py->>evaluate.py: Output results to CSV
    evaluate.py-->>User: Evaluation complete
sequenceDiagram
    participant PollingWorker
    participant queue as asyncio.Queue
    participant coordinator as ReadinessCoordinator
    participant client as PotpieClient

    loop Background loop
        PollingWorker->>queue: Get next project_id (blocking)
        queue-->>PollingWorker: project_id
        
        rect rgb(200, 220, 255)
        Note over PollingWorker: _handle_project with retry logic
        PollingWorker->>client: get_parse_status(project_id)
        alt Status not complete
            client-->>PollingWorker: pending/processing status
            PollingWorker->>queue: Re-enqueue project
        else Status complete
            client-->>PollingWorker: complete status
            PollingWorker->>coordinator: mark_ready(project_id)
            coordinator-->>PollingWorker: Event signaled
        end
        end
        
        PollingWorker->>coordinator: release_enqueue_slot(project_id)
    end
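The PollingWorker / ReadinessCoordinator handshake in the diagram above can be sketched with stdlib asyncio primitives. All names here are hypothetical stand-ins for the PR's classes, and the status check is faked; a real worker would also sleep between polls rather than re-enqueue immediately.

```python
# Minimal sketch of the polling loop: a worker drains an asyncio.Queue,
# re-enqueues projects whose parse is not finished, and signals an
# asyncio.Event once a project reports "complete".
import asyncio


class ReadinessCoordinator:
    def __init__(self) -> None:
        self._events: dict[str, asyncio.Event] = {}

    def event_for(self, project_id: str) -> asyncio.Event:
        return self._events.setdefault(project_id, asyncio.Event())

    def mark_ready(self, project_id: str) -> None:
        self.event_for(project_id).set()


async def polling_worker(queue: asyncio.Queue, coordinator: ReadinessCoordinator, get_status) -> None:
    while True:
        project_id = await queue.get()
        try:
            status = await get_status(project_id)
            if status == "complete":
                coordinator.mark_ready(project_id)
            else:
                await queue.put(project_id)  # re-enqueue; poll again later
        finally:
            queue.task_done()


async def demo() -> str:
    statuses = iter(["pending", "processing", "complete"])

    async def fake_status(_pid: str) -> str:
        return next(statuses)

    queue: asyncio.Queue = asyncio.Queue()
    coordinator = ReadinessCoordinator()
    await queue.put("proj-1")
    worker = asyncio.create_task(polling_worker(queue, coordinator, fake_status))
    await coordinator.event_for("proj-1").wait()  # blocks until marked ready
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return "ready"
```

The awaiter never touches the queue; it only waits on the per-project Event, which is why the coordinator can decouple "who polls" from "who consumes readiness".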

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:

  • benchmark/potpie.py: Async coordination patterns with ReadinessCoordinator, PollingWorker background loop, and event signaling mechanism require careful verification of race conditions and timeout handling
  • benchmark/download.py: Parallel processing with ThreadPoolExecutor, idempotency logic for worktree creation, and error propagation across repository operations need thorough testing paths
  • benchmark/evaluate.py: Integration flow between data loading, async task orchestration (asyncio.gather), test case construction, and metric evaluation; validation of answer alignment logic with batching
  • requirements.txt: Substantial dependency reorganization with major version bumps (torch, transformers, langchain ecosystem); verify compatibility matrix and GPU support dependencies (nvidia-cublas-cu12, etc.)

Suggested reviewers

  • dhirenmathur

Poem

🐰 Worktrees sprout in parallel glory,
Potpie whispers answers to each story,
Benchmarks dance through async skies,
GPU sparks in CUDA's eyes—
Code evaluated, truth descries! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.42%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the primary changes: introducing new QA evaluation orchestration scripts (evaluate.py, download.py, potpie.py) and supporting infrastructure.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch eval_qa

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 185d0e5 and 25bf1a2.

📒 Files selected for processing (5)
  • .gitignore (1 hunks)
  • benchmark/download.py (1 hunks)
  • benchmark/eval.py (1 hunks)
  • benchmark/potpie.py (1 hunks)
  • start.ps1 (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/potpie.py (1)
benchmark/download.py (2)
  • get_unique_repo_and_commits (26-42)
  • setup_all_worktrees (101-163)
🪛 Ruff (0.14.3)
benchmark/download.py

50-50: subprocess call: check for execution of untrusted input

(S603)


51-51: Starting a process with a partial executable path

(S607)


58-58: subprocess call: check for execution of untrusted input

(S603)


59-59: Starting a process with a partial executable path

(S607)


82-82: subprocess call: check for execution of untrusted input

(S603)


83-92: Starting a process with a partial executable path

(S607)


125-125: Do not catch blind exception: Exception

(BLE001)


160-160: Do not catch blind exception: Exception

(BLE001)

benchmark/potpie.py

41-43: Create your own exception

(TRY002)


41-43: Avoid specifying long messages outside the exception class

(TRY003)


62-62: Do not catch blind exception: Exception

(BLE001)


95-97: Create your own exception

(TRY002)


95-97: Avoid specifying long messages outside the exception class

(TRY003)


125-127: Create your own exception

(TRY002)


125-127: Avoid specifying long messages outside the exception class

(TRY003)


194-194: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


234-236: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

@Dsantra92 Dsantra92 changed the title from "Primary cript for QA evals" to "Primary script for QA evals" on Nov 11, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (5)
benchmark/download.py (3)

125-132: Specialize the exception handling as noted in the TODO.

Catching bare Exception can mask unexpected errors. The TODO comment acknowledges this should be specialized to subprocess.CalledProcessError or OSError.

Apply this diff:

             except (
-                Exception
-            ) as e:  # TODO: Specialize the exception later like CalledProcessError or OSError
+                subprocess.CalledProcessError,
+                OSError,
+            ) as e:

142-144: Remove redundant parent directory creation.

worktree_path.parent is bare_repo_path, which already exists after successful cloning. This mkdir call is redundant.

Apply this diff:

             for commit_id in repo_dicts[repo_url]:
                 worktree_path = bare_repo_path / commit_id
-                worktree_path.parent.mkdir(exist_ok=True)

155-163: Specialize exception handling for worktree creation failures.

Similar to bare cloning, catching bare Exception can mask unexpected errors. Consider specializing to subprocess.CalledProcessError or OSError.

Apply this diff:

             try:
                 worktree_path, commit_id = future.result()
                 repo_url = worktree_futures[future]
                 worktree_map[(repo_url, commit_id)] = worktree_path
                 logger.debug("Created worktree for commit {}", commit_id[:8])
-            except Exception as e:
+            except (subprocess.CalledProcessError, OSError) as e:
                 logger.error("Failed to create worktree: {}", str(e))
benchmark/potpie.py (2)

46-67: Use logging instead of print and specialize exception handling.

Line 63 uses print while the rest of the codebase uses logging/logger, creating inconsistency. Also, catching bare Exception can mask unexpected errors.

Apply this diff:

-        except Exception as e:
-            print(f"Error checking {project_id}: {e}")
+        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
+            logger.error("Error checking {}: {}", project_id, e)
             # Optionally re-queue on error
             await check_queue.put(project_id)

70-97: Consider creating custom exception classes.

Lines 95-97 raise a generic Exception with a formatted string message. Per static analysis hints, consider creating custom exception classes for better error handling and typing.

Example custom exceptions to add near the top of the file:

class PotPieAPIError(Exception):
    """Base exception for PotPie API errors."""
    pass

class ParseError(PotPieAPIError):
    """Error during repository parsing."""
    pass

Then use:

-                raise Exception(
-                    f"Failed to get response: {await response.text()} {response.status}"
-                )
+                raise ParseError(
+                    f"Failed to parse repository: {await response.text()} (status {response.status})"
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25bf1a2 and be0aadd.

📒 Files selected for processing (2)
  • benchmark/download.py (1 hunks)
  • benchmark/potpie.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/potpie.py (1)
benchmark/download.py (2)
  • get_unique_repo_and_commits (27-43)
  • setup_all_worktrees (102-164)
🪛 Ruff (0.14.4)
benchmark/potpie.py

41-43: Create your own exception

(TRY002)


41-43: Avoid specifying long messages outside the exception class

(TRY003)


62-62: Do not catch blind exception: Exception

(BLE001)


95-97: Create your own exception

(TRY002)


95-97: Avoid specifying long messages outside the exception class

(TRY003)


125-127: Create your own exception

(TRY002)


125-127: Avoid specifying long messages outside the exception class

(TRY003)


194-194: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


238-240: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

benchmark/download.py

51-51: subprocess call: check for execution of untrusted input

(S603)


52-52: Starting a process with a partial executable path

(S607)


59-59: subprocess call: check for execution of untrusted input

(S603)


60-60: Starting a process with a partial executable path

(S607)


83-83: subprocess call: check for execution of untrusted input

(S603)


84-93: Starting a process with a partial executable path

(S607)


126-126: Do not catch blind exception: Exception

(BLE001)


161-161: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (8)
benchmark/download.py (5)

11-24: LGTM! URL normalization handles trailing slashes correctly.

The function properly strips both whitespace and trailing slashes before checking for the .git suffix, addressing the past review concern.


46-66: LGTM! Bare repository cloning logic is sound.

The function correctly checks for existing repos, clones as bare, and fetches all references with proper error handling via check=True.


69-99: LGTM! Worktree creation handles cleanup and errors appropriately.

The function correctly removes existing worktrees (with check=False since they may not exist) before creating new detached worktrees with proper error handling.


183-197: LGTM! Cleanup function is straightforward and safe.

The function appropriately checks for existence before removal and provides clear logging.


200-209: LGTM! Main block demonstrates proper usage.

The example correctly shows the workflow: read CSV, setup worktrees, and output the mapping.

benchmark/potpie.py (3)

1-22: LGTM! Imports and data class are well-structured.

The dependencies and PotPieUserInfo dataclass are appropriate for the async HTTP workflow.


227-232: LGTM! Worker cleanup properly implemented.

The cleanup logic correctly waits for the queue to drain with join(), then cancels workers and gathers with return_exceptions=True. This addresses the past review concern about awaiting forever-running workers.


253-256: Main block structure is appropriate.

The use of load_dotenv() and asyncio.run() is correct. However, ensure BASE_URL is moved to module level as flagged in the earlier comment.

Comment on lines 27 to 43
def get_unique_repo_and_commits(csv_path: Path | str) -> dict[str, list[str]]:
csv_file_path = Path(csv_path)
required_columns = [
"repo_url",
"commit_id",
]
df: DataFrame = pd.read_csv(
csv_file_path, usecols=required_columns
) # pyright: ignore[reportUnknownMemberType]
df.dropna(inplace=True)
df["repo_url"] = df["repo_url"].apply(
_normalize_repo_url
) # pyright: ignore[reportUnknownMemberType]
repo_commit_dict = (
df.groupby("repo_url")["commit_id"].apply(set).to_dict()
) # pyright: ignore[reportUnknownMemberType]
return repo_commit_dict
Contributor


⚠️ Potential issue | 🟠 Major

Fix the return type hint to match the actual return value.

The function groups commits into sets (line 41: .apply(set)), but the return type annotation declares dict[str, list[str]]. This mismatch will cause type-checking failures.

Apply this diff:

-def get_unique_repo_and_commits(csv_path: Path | str) -> dict[str, list[str]]:
+def get_unique_repo_and_commits(csv_path: Path | str) -> dict[str, set[str]]:
🤖 Prompt for AI Agents
In benchmark/download.py around lines 27 to 43, the function
get_unique_repo_and_commits returns a dict mapping repo_url to sets of commit
IDs (it calls .apply(set)), but the type hint incorrectly declares dict[str,
list[str]]; update the function signature return type to dict[str, set[str]] so
the annotation matches the actual return value (alternatively, convert the sets
to lists before returning if lists are desired).
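A pandas-free sketch makes the type mismatch concrete: because each value is built with set semantics, the natural annotation is dict[str, set[str]]. The function and input below are hypothetical stand-ins for the pandas groupby in get_unique_repo_and_commits.

```python
# Pure-Python equivalent of grouping commit IDs per repo URL with
# .apply(set): values are sets, so duplicates collapse and the
# return annotation must say set, not list.
from collections import defaultdict


def group_commits(rows: list[tuple[str, str]]) -> dict[str, set[str]]:
    repo_commits: dict[str, set[str]] = defaultdict(set)
    for repo_url, commit_id in rows:
        repo_commits[repo_url].add(commit_id)  # duplicate commits collapse
    return dict(repo_commits)
```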

Comment on lines 167 to 179
def get_repo_name(repo_url: str) -> str:
"""
Get the name of the repository from the URL.
Args:
repo_url: The URL of the repository. Expected to be in the format
"https://<provider>.com/username/repository.git".
Returns:
The name of the repository.
"""
repo_slug_split = repo_url.strip().split("/")[-2:]
return "_".join(repo_slug_split)[:-4]
Contributor


🛠️ Refactor suggestion | 🟠 Major

Make the .git suffix removal more robust.

The [:-4] slice assumes the URL ends with .git, but offers no validation or error handling if called on a non-normalized URL. Use .removesuffix(".git") for clarity and correctness.

Apply this diff:

 def get_repo_name(repo_url: str) -> str:
     """
     Get the name of the repository from the URL.
 
     Args:
         repo_url: The URL of the repository. Expected to be in the format
             "https://<provider>.com/username/repository.git".
 
     Returns:
         The name of the repository.
     """
+    # Normalize first to ensure .git suffix
+    repo_url = _normalize_repo_url(repo_url)
     repo_slug_split = repo_url.strip().split("/")[-2:]
-    return "_".join(repo_slug_split)[:-4]
+    return "_".join(repo_slug_split).removesuffix(".git")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In benchmark/download.py around lines 167 to 179, the function strips the last
four characters with '[:-4]' to remove a '.git' suffix which is brittle if the
URL isn't normalized; replace that slice with repo_slug_split =
repo_url.strip().split("/")[-2:]; repo_name =
"_".join(repo_slug_split).removesuffix(".git") and return repo_name, ensuring
you use str.removesuffix(".git") so non-.git inputs remain unchanged and add no
slicing-based errors.
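The difference between the blind slice and .removesuffix is easy to demonstrate. The two helpers below are illustrative reductions of get_repo_name, not the module's actual code:

```python
# The [:-4] slice silently mangles URLs that were never normalized to
# end in ".git"; str.removesuffix only strips the suffix when present.
def repo_name_slice(repo_url: str) -> str:
    return "_".join(repo_url.strip().split("/")[-2:])[:-4]


def repo_name_safe(repo_url: str) -> str:
    return "_".join(repo_url.strip().split("/")[-2:]).removesuffix(".git")
```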

Comment on lines 183 to 194
jobs: list[Awaitable[str]] = []
for (_, commit_id), worktree_path in worktree_map.items():
jobs.append(
parse_single_repo(
user_info=conn_info,
commit_id=commit_id,
repo_path=worktree_path,
)
)
keys = list(worktree_map.keys())
project_ids = await asyncio.gather(*jobs)
repo_commit_to_project_id = dict(zip(keys, project_ids))
Contributor


🛠️ Refactor suggestion | 🟠 Major

Add strict=True to zip for length mismatch detection.

Line 194 uses zip() without the strict parameter. If keys and project_ids have mismatched lengths (due to partial failures), the silent truncation could cause data loss. Python 3.10+ supports strict=True to raise ValueError on mismatch.

Apply this diff:

     keys = list(worktree_map.keys())
     project_ids = await asyncio.gather(*jobs)
-    repo_commit_to_project_id = dict(zip(keys, project_ids))
+    repo_commit_to_project_id = dict(zip(keys, project_ids, strict=True))
🧰 Tools
🪛 Ruff (0.14.4)

194-194: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In benchmark/potpie.py around lines 183 to 194, the code builds keys and
project_ids then uses zip(keys, project_ids) which will silently truncate if
lengths differ; update the zip call to zip(keys, project_ids, strict=True) so a
ValueError is raised on length mismatch (or, if you must support pre-3.10, add
an explicit length check like if len(keys) != len(project_ids): raise
ValueError(...)) to detect partial failures instead of silently losing entries.

Comment on lines 234 to 241
test_cases = [
LLMTestCase(
input=question, actual_output=answer, expected_output=expected_answer
)
for question, answer, expected_answer in zip(
qa_df["question"], answers, expected_answers
)
]
Contributor


🛠️ Refactor suggestion | 🟠 Major

Add strict=True to zip for length mismatch detection.

Line 238-240 uses zip() without the strict parameter. If the lists have mismatched lengths, silent truncation could cause test cases to be skipped without warning.

Apply this diff:

     test_cases = [
         LLMTestCase(
             input=question, actual_output=answer, expected_output=expected_answer
         )
         for question, answer, expected_answer in zip(
-            qa_df["question"], answers, expected_answers
+            qa_df["question"], answers, expected_answers, strict=True
         )
     ]
🧰 Tools
🪛 Ruff (0.14.4)

238-240: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In benchmark/potpie.py around lines 234 to 241, the list comprehension uses
zip(...) to combine qa_df["question"], answers, and expected_answers but doesn't
detect length mismatches; update the zip call to zip(qa_df["question"], answers,
expected_answers, strict=True) so a ValueError is raised on unequal lengths and
adjust imports/environment if running on Python <3.10 to ensure strict is
supported.

@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
benchmark/evaluate.py (2)

80-80: Prefix unused variable with underscore.

The summary variable is unused. Prefix it with _ to indicate intentional non-use.

Apply this diff:

-    repo_map, summary = prepare_worktrees(
+    repo_map, _summary = prepare_worktrees(
         problem_sets, base_dir="/tmp/repos_batch", batch_no=nbatch, max_workers=6
     )

111-118: Add strict=True to zip for length mismatch detection.

If the lists have mismatched lengths due to partial failures during answer collection, the zip() will silently truncate to the shortest list, potentially hiding errors.

Apply this diff:

     test_cases = [
         LLMTestCase(
             input=question, actual_output=answer, expected_output=expected_answer
         )
         for question, answer, expected_answer in zip(
-            questions_batched, answers_flattened, expected_answers_batched
+            questions_batched, answers_flattened, expected_answers_batched, strict=True
         )
     ]
benchmark/download.py (1)

186-186: Prefix unused loop variable with underscore.

The loop variable idx is not used within the loop body.

Apply this diff:

-        for idx, (commit_id, problem_id) in enumerate(entries, 1):
+        for _idx, (commit_id, problem_id) in enumerate(entries, 1):
benchmark/potpie.py (2)

60-62: Consider using a custom exception class for HTTP errors.

The generic Exception makes it harder for callers to distinguish HTTP failures from other errors. However, this is acceptable for research code.

If you want to improve error handling, you could define:

class PotpieAPIError(Exception):
    def __init__(self, status: int, message: str):
        self.status = status
        super().__init__(f"API request failed with status {status}: {message}")

Then use it:

             else:
-                raise Exception(
-                    f"Failed to get response: {await resp.text()} {resp.status}",
-                )
+                text = await resp.text()
+                raise PotpieAPIError(resp.status, text)

333-337: Add a usage note or guard to the main block.

The __main__ block calls get_all_st_answers(None, None), which would fail since both arguments are required. This is fine for a test stub, but consider adding a comment or a proper guard for clarity.

 if __name__ == "__main__":
     from dotenv import load_dotenv
 
     _ = load_dotenv()
-    asyncio.run(get_all_st_answers(None, None))
+    # For testing: asyncio.run(get_all_st_answers(problems, repo_dict))
+    print("This module should be imported, not run directly.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be0aadd and f1428f6.

📒 Files selected for processing (5)
  • benchmark/download.py (1 hunks)
  • benchmark/evaluate.py (1 hunks)
  • benchmark/metrics.py (1 hunks)
  • benchmark/potpie.py (1 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-02-04T12:11:29.023Z
Learnt from: SujalThapaK
Repo: potpie-ai/potpie PR: 240
File: start.sh:0-0
Timestamp: 2025-02-04T12:11:29.023Z
Learning: Environment variables in the Potpie project should follow uppercase naming convention (e.g., `CELERY_QUEUE_NAME` instead of `Celery_QUEUE_NAME`) to maintain consistency and avoid case-sensitivity issues.

Applied to files:

  • benchmark/potpie.py
🧬 Code graph analysis (1)
benchmark/evaluate.py (2)
benchmark/download.py (1)
  • prepare_worktrees (270-327)
benchmark/potpie.py (2)
  • get_all_st_answers (269-330)
  • run (151-178)
🪛 OSV Scanner (2.2.4)
requirements.txt

[HIGH] 16-16: authlib 1.6.3: Authlib: JWS/JWT accepts unknown crit headers (RFC violation → possible authz bypass)

(GHSA-9ggr-2464-2j32)


[HIGH] 16-16: authlib 1.6.3: Authlib : JWE zip=DEF decompression bomb enables DoS

(GHSA-g7f3-828f-7h7m)


[HIGH] 16-16: authlib 1.6.3: Authlib is vulnerable to Denial of Service via Oversized JOSE Segments

(GHSA-pq5p-34cr-23v9)


[HIGH] 117-117: langchain-community 0.3.16: Langchain Community Vulnerable to XML External Entity (XXE) Attacks

(GHSA-pc6w-59fv-rh23)


[HIGH] 120-120: langchain-text-splitters 0.3.5: LangChain Text Splitters is vulnerable to XML External Entity (XXE) attacks due to unsafe XSLT parsing

(GHSA-m42m-m8cr-8m58)


[HIGH] 122-122: langgraph-checkpoint 2.0.10: LangGraph Checkpoint affected by RCE in "json" mode of JsonPlusSerializer

(GHSA-wwqv-p2pp-99h5)


[HIGH] 127-127: llama-index-core 0.10.68.post1: LlamaIndex vulnerable to DoS attack through uncontrolled recursive JSON parsing

(GHSA-3wxx-q3gv-pvvv)


[HIGH] 127-127: llama-index-core 0.10.68.post1: LlamaIndex affected by a Denial of Service (DOS) in JSONReader

(GHSA-7753-xrfw-ch36)


[HIGH] 127-127: llama-index-core 0.10.68.post1: llama-index-core insecurely handles temporary files

(GHSA-cr7q-2w66-hjcm)


[CRITICAL] 180-180: pillow 10.0.1: Arbitrary Code Execution in Pillow

(GHSA-3f63-hfp8-52jq)


[CRITICAL] 180-180: pillow 10.0.1: Pillow buffer overflow vulnerability

(GHSA-44wm-f244-xhp3)


[HIGH] 252-252: starlette 0.41.3: Starlette has possible denial-of-service vector when parsing large files in multipart forms

(GHSA-2c2j-9gv5-cj73)


[HIGH] 252-252: starlette 0.41.3: Starlette vulnerable to O(n^2) DoS via Range header merging in starlette.responses.FileResponse

(GHSA-7f5h-v6xp-fcq8)


[CRITICAL] 266-266: torch 2.5.1: undefined

(PYSEC-2025-41)


[CRITICAL] 266-266: torch 2.5.1: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 266-266: torch 2.5.1: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 266-266: torch 2.5.1: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Ruff (0.14.4)
benchmark/evaluate.py

80-80: Unpacked variable summary is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


81-81: Probable insecure usage of temporary file or directory: "/tmp/repos_batch"

(S108)


115-117: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

benchmark/download.py

19-19: subprocess call: check for execution of untrusted input

(S603)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


81-83: Avoid specifying long messages outside the exception class

(TRY003)


96-96: Consider moving this statement to an else block

(TRY300)


97-97: Do not catch blind exception: Exception

(BLE001)


109-111: Avoid specifying long messages outside the exception class

(TRY003)


139-141: Avoid specifying long messages outside the exception class

(TRY003)


186-186: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


210-210: Do not catch blind exception: Exception

(BLE001)


246-246: Do not catch blind exception: Exception

(BLE001)


255-255: Do not catch blind exception: Exception

(BLE001)


309-309: Do not catch blind exception: Exception

(BLE001)

benchmark/metrics.py

37-37: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)


38-38: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)


39-39: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

benchmark/potpie.py

60-62: Create your own exception

(TRY002)


60-62: Avoid specifying long messages outside the exception class

(TRY003)


164-164: Do not catch blind exception: Exception

(BLE001)


172-173: try-except-pass detected, consider logging the exception

(S110)


172-172: Do not catch blind exception: Exception

(BLE001)


198-198: Do not catch blind exception: Exception

(BLE001)


240-240: Do not catch blind exception: Exception

(BLE001)
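For the TRY002/TRY003 findings in potpie.py, the usual fix is a dedicated exception class so the long message lives in the class rather than at each raise site. A sketch under assumptions (the class name and the status semantics are invented here, based only on the PR summary's parse-status polling):

```python
class ProjectParseError(RuntimeError):
    """Raised when Potpie reports a terminal, non-ready parse status."""

    def __init__(self, project_id: str, status: str) -> None:
        super().__init__(f"project {project_id} ended in status {status!r}")
        self.project_id = project_id
        self.status = status


try:
    raise ProjectParseError("proj-123", "error")
except ProjectParseError as exc:
    print(exc)  # project proj-123 ended in status 'error'
```

Callers can then catch `ProjectParseError` specifically, which also resolves the adjacent BLE001 complaints about blanket `except Exception` handlers.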

Comment on lines +37 to +39
- Minor differences → 0.7–0.9
- Substantial but related differences → 0.3–0.6
- Completely different → 0.0–0.2

⚠️ Potential issue | 🟡 Minor

Replace EN DASH with HYPHEN-MINUS for consistency.

Lines 37-39 use the EN DASH character (–) instead of the standard HYPHEN-MINUS (-), which can cause encoding issues in some environments.

Apply this diff:

 Scoring Rules:
 - Base the score ONLY on similarity, NOT quality.
 - Wording differences do NOT reduce similarity if the meaning is the same.
-- Minor differences → 0.7–0.9
-- Substantial but related differences → 0.3–0.6
-- Completely different → 0.0–0.2
+- Minor differences → 0.7-0.9
+- Substantial but related differences → 0.3-0.6
+- Completely different → 0.0-0.2
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Minor differences → 0.7–0.9
- Substantial but related differences → 0.3–0.6
- Completely different → 0.0–0.2
- Minor differences → 0.7-0.9
- Substantial but related differences → 0.3-0.6
- Completely different → 0.0-0.2
🧰 Tools
🪛 Ruff (0.14.4)

37-37: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)


38-38: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)


39-39: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

🤖 Prompt for AI Agents
In benchmark/metrics.py around lines 37 to 39, the comment/strings use EN DASH
(–) characters; replace each EN DASH with the standard HYPHEN-MINUS (-)
character so the ranges read "0.7-0.9", "0.3-0.6", and "0.0-0.2". Update the
three lines accordingly, save the file using UTF-8, and run the linter/tests to
confirm no encoding or style issues remain.

async-timeout==4.0.3
asyncpg==0.30.0
attrs==25.4.0
authlib==1.6.3

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Address critical security vulnerabilities before production deployment.

Multiple dependencies have known high and critical severity vulnerabilities:

  • pillow 10.0.1 (line 180): Arbitrary code execution and buffer overflow (2 CRITICAL CVEs)
  • torch 2.5.1 (line 266): Remote code execution via torch.load and local DoS (4 CRITICAL CVEs)
  • authlib 1.6.3 (line 16): JWT/JWE vulnerabilities enabling authorization bypass and DoS (3 HIGH CVEs)
  • langchain-community 0.3.16 (line 117), langchain-text-splitters 0.3.5 (line 120): XXE attacks (HIGH)
  • langgraph-checkpoint 2.0.10 (line 122): RCE vulnerability (HIGH)
  • llama-index-core 0.10.68.post1 (line 127): DoS and insecure temp file handling (HIGH)
  • starlette 0.41.3 (line 252): DoS vectors (HIGH)

Since this is WIP, these should be resolved before merging to production.

Also applies to: 117-117, 120-120, 122-122, 127-127, 180-180, 252-252, 266-266


🌐 Web query:

What are the latest secure versions of pillow, torch, authlib, langchain-community, langchain-text-splitters, langgraph-checkpoint, llama-index-core, and starlette that address known CVEs as of November 2025?

💡 Result:

Summary — secure minimum versions (addressing known CVEs) as of 2025-11-17:

  • Pillow: >= 11.3.0. [1]
  • torch / PyTorch: recommend >= 2.7.0 (contains fixes for several 2024–2025 vulnerabilities; upgrade to the latest 2.7.x/2.8.x+ available for additional fixes). [2][3]
  • Authlib: >= 1.6.5 (1.6.4/1.6.5 patch specific CVEs; use 1.6.5 to cover recent JOSE/DoS fixes). [4][5]
  • langchain-community: >= 0.3.27. [6]
  • langchain-text-splitters: >= 0.3.9. [7]
  • langgraph-checkpoint: >= 3.0.0. [8]
  • llama-index-core: >= 0.13.0 (0.12.38 / 0.12.41 fixed earlier issues; 0.13.0 fixes the latest /tmp cache symlink issue). [9][10]
  • starlette: >= 0.49.1 (also note earlier fix 0.47.2 for multipart rollover). [11][12]

If you want, I can:

  • produce a pip/requirements.txt snippet with these minimums, or
  • check the current latest PyPI releases for each package and give exact latest versions (live) and links.

Citations:
[1] Pillow security / release notes (CVE fixes). (security-tracker.debian.org)
[2] PyTorch vuln fixes (use-after-free, etc. — fixed in 2.2.0 and later). (synscan.net)
[3] PyTorch 2.7.0 fixes for other 2025 issues; upgrade recommendation. (gist.github.com)
[4] Authlib JOSE/DoS advisory (fixed in 1.6.5). (cvedetails.com)
[5] Authlib crit-header/authz advisory (fixed in 1.6.4). (cvedetails.com)
[6] langchain-community GHSA / CVE advisories (fixed in 0.3.27 and earlier fixes). (advisories.gitlab.com)
[7] langchain-text-splitters GHSA / CVE advisory (fixed in 0.3.9). (github.com)
[8] langgraph-checkpoint RCE advisory (fixed in 3.0.0). (github.com)
[9] llama-index-core advisories and fixes (SQLi/JSON/other fixes; cache symlink fix in 0.13.0). (security.snyk.io)
[10] llama-index fixes for JSON/pickle/path traversal (0.12.38/0.12.41 etc.). (data.safetycli.com)
[11] Starlette CVE-2025-62727 (fixed in 0.49.1). (nvd.nist.gov)
[12] Starlette multipart/file rollover DoS (fixed in 0.47.2). (nvd.nist.gov)

Would you like me to check/lock the exact latest PyPI release numbers now and produce a requirements file?


Update all pinned dependency versions to address known security vulnerabilities before production deployment.

Pillow must be updated to >= 11.3.0, torch to >= 2.7.0, authlib to >= 1.6.5, langchain-community to >= 0.3.27, langchain-text-splitters to >= 0.3.9, langgraph-checkpoint to >= 3.0.0, llama-index-core to >= 0.13.0, and starlette to >= 0.49.1 to address critical and high-severity CVEs affecting arbitrary code execution, authorization bypass, DoS, and XXE vulnerabilities.

Lines affected: 16, 117, 120, 122, 127, 180, 252, 266

🧰 Tools
🪛 OSV Scanner (2.2.4)

[HIGH] 16-16: authlib 1.6.3: Authlib: JWS/JWT accepts unknown crit headers (RFC violation → possible authz bypass)

(GHSA-9ggr-2464-2j32)


[HIGH] 16-16: authlib 1.6.3: Authlib : JWE zip=DEF decompression bomb enables DoS

(GHSA-g7f3-828f-7h7m)


[HIGH] 16-16: authlib 1.6.3: Authlib is vulnerable to Denial of Service via Oversized JOSE Segments

(GHSA-pq5p-34cr-23v9)

🤖 Prompt for AI Agents
In requirements.txt around lines 16, 117, 120, 122, 127, 180, 252, and 266,
several pinned dependencies are vulnerable; update the listed packages to at
least the secure minimum versions: set authlib >= 1.6.5 (line 16),
langchain-community >= 0.3.27 (line 117), langchain-text-splitters >= 0.3.9
(line 120), langgraph-checkpoint >= 3.0.0 (line 122), llama-index-core >=
0.13.0 (line 127), Pillow >= 11.3.0 (line 180), starlette >= 0.49.1 (line 252),
and torch >= 2.7.0 (line 266); after updating, run pip-compile or your
dependency lock/update workflow and run the test suite to ensure compatibility,
resolving any API changes or pin conflicts that arise.

@Dsantra92 Dsantra92 requested a review from nndn November 20, 2025 07:35
