Primary script for QA evals #480
Conversation
Walkthrough

This pull request introduces a comprehensive benchmarking infrastructure for evaluating LLM agents on code-related tasks. New modules implement Git repository management with parallel worktree creation, asynchronous orchestration of Potpie service integration, evaluation workflow orchestration with CSV-based benchmark data loading, and semantic evaluation metrics. Dependencies are significantly expanded to support GPU acceleration and modern ML/AI tooling.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant User
    participant evaluate.py
    participant download.py
    participant potpie.py
    participant metrics.py
    participant deepeval
    User->>evaluate.py: Run evaluation (--agents potpie --input benchmark.csv)
    evaluate.py->>evaluate.py: Load benchmark problems from CSV
    evaluate.py->>download.py: prepare_worktrees(problems, batch_no)
    download.py->>download.py: Clone bare repos in parallel
    download.py->>download.py: Create controllers & worktrees per commit
    download.py-->>evaluate.py: Return repo_dict mapping
    evaluate.py->>potpie.py: get_all_st_answers(problems, repo_dict)
    potpie.py->>potpie.py: Initialize PotpieClient & ReadinessCoordinator
    potpie.py->>potpie.py: Start PollingWorker background task
    rect rgb(220, 240, 255)
        Note over potpie.py: For each problem
        potpie.py->>potpie.py: post_parse(commit_id, repo_path)
        potpie.py->>potpie.py: Enqueue project with PollingWorker
        potpie.py->>potpie.py: PollingWorker polls get_parse_status in background
        potpie.py->>potpie.py: Mark ready when status complete
        potpie.py->>potpie.py: send_message(project_id, question)
    end
    potpie.py-->>evaluate.py: Aggregated answers
    evaluate.py->>evaluate.py: Align answers with expected outputs
    evaluate.py->>evaluate.py: Create LLMTestCase instances
    evaluate.py->>deepeval: evaluate(test_cases, correctness metric)
    deepeval-->>evaluate.py: Success flags per test
    evaluate.py->>evaluate.py: Output results to CSV
    evaluate.py-->>User: Evaluation complete
```
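For orientation, the flow above can be reduced to a short asyncio driver. The sketch below is illustrative only: the import paths, the `prepare_worktrees`/`get_all_st_answers` signatures, and the CSV column names (`question`, `answer`) are assumptions, not the exact APIs introduced in this PR.

```python
# Minimal sketch of the evaluation flow in the diagram (assumed APIs).
import asyncio

import pandas as pd
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from download import prepare_worktrees    # assumed import path
from potpie import get_all_st_answers     # assumed import path


async def run_eval(csv_path: str, batch_no: int) -> None:
    problems = pd.read_csv(csv_path)                         # load benchmark problems
    repo_dict, _summary = prepare_worktrees(problems, batch_no=batch_no)
    answers = await get_all_st_answers(problems, repo_dict)  # one answer per question

    correctness = GEval(
        name="Correctness",
        criteria="Judge semantic similarity between actual and expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    )
    test_cases = [
        LLMTestCase(input=q, actual_output=a, expected_output=e)
        for q, a, e in zip(problems["question"], answers, problems["answer"], strict=True)
    ]
    evaluate(test_cases=test_cases, metrics=[correctness])   # records success per test


if __name__ == "__main__":
    asyncio.run(run_eval("benchmark.csv", batch_no=0))
```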
```mermaid
sequenceDiagram
    participant PollingWorker
    participant queue as asyncio.Queue
    participant coordinator as ReadinessCoordinator
    participant client as PotpieClient
    loop Background loop
        PollingWorker->>queue: Get next project_id (blocking)
        queue-->>PollingWorker: project_id
        rect rgb(200, 220, 255)
            Note over PollingWorker: _handle_project with retry logic
            PollingWorker->>client: get_parse_status(project_id)
            alt Status not complete
                client-->>PollingWorker: pending/processing status
                PollingWorker->>queue: Re-enqueue project
            else Status complete
                client-->>PollingWorker: complete status
                PollingWorker->>coordinator: mark_ready(project_id)
                coordinator-->>PollingWorker: Event signaled
            end
        end
        PollingWorker->>coordinator: release_enqueue_slot(project_id)
    end
```
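As a rough illustration of the re-enqueue pattern in this diagram, here is a minimal asyncio sketch. The `get_parse_status` call and the coordinator API are assumptions standing in for the PR's actual `PotpieClient` and `ReadinessCoordinator`.

```python
# Minimal sketch of the polling-worker pattern (assumed client/coordinator APIs).
import asyncio


class ReadinessCoordinator:
    def __init__(self) -> None:
        self._events: dict[str, asyncio.Event] = {}

    def event_for(self, project_id: str) -> asyncio.Event:
        return self._events.setdefault(project_id, asyncio.Event())

    def mark_ready(self, project_id: str) -> None:
        self.event_for(project_id).set()  # wake any coroutine awaiting this project


async def polling_worker(
    queue: asyncio.Queue, client, coordinator: ReadinessCoordinator
) -> None:
    while True:
        project_id = await queue.get()  # blocks until a project is enqueued
        try:
            status = await client.get_parse_status(project_id)  # assumed method
            if status != "complete":
                await asyncio.sleep(2)       # back off, then re-enqueue pending work
                await queue.put(project_id)
            else:
                coordinator.mark_ready(project_id)
        finally:
            queue.task_done()  # lets queue.join() observe drained work
```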
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks: 1 failed check (warning), 2 passed.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.gitignore (1 hunks)
benchmark/download.py (1 hunks)
benchmark/eval.py (1 hunks)
benchmark/potpie.py (1 hunks)
start.ps1 (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/potpie.py (1)
benchmark/download.py (2)
get_unique_repo_and_commits (26-42)
setup_all_worktrees (101-163)
🪛 Ruff (0.14.3)
benchmark/download.py
50-50: subprocess call: check for execution of untrusted input
(S603)
51-51: Starting a process with a partial executable path
(S607)
58-58: subprocess call: check for execution of untrusted input
(S603)
59-59: Starting a process with a partial executable path
(S607)
82-82: subprocess call: check for execution of untrusted input
(S603)
83-92: Starting a process with a partial executable path
(S607)
125-125: Do not catch blind exception: Exception
(BLE001)
160-160: Do not catch blind exception: Exception
(BLE001)
benchmark/potpie.py
41-43: Create your own exception
(TRY002)
41-43: Avoid specifying long messages outside the exception class
(TRY003)
62-62: Do not catch blind exception: Exception
(BLE001)
95-97: Create your own exception
(TRY002)
95-97: Avoid specifying long messages outside the exception class
(TRY003)
125-127: Create your own exception
(TRY002)
125-127: Avoid specifying long messages outside the exception class
(TRY003)
194-194: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
234-236: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
Actionable comments posted: 7
🧹 Nitpick comments (5)
benchmark/download.py (3)
125-132: Specialize the exception handling as noted in the TODO.

Catching bare `Exception` can mask unexpected errors. The TODO comment acknowledges this should be specialized to `subprocess.CalledProcessError` or `OSError`.

Apply this diff:

```diff
             except (
-                Exception
-            ) as e:  # TODO: Specialize the exception later like CalledProcessError or OSError
+                subprocess.CalledProcessError,
+                OSError,
+            ) as e:
```
142-144: Remove redundant parent directory creation.

`worktree_path.parent` is `bare_repo_path`, which already exists after successful cloning. This `mkdir` call is redundant.

Apply this diff:

```diff
         for commit_id in repo_dicts[repo_url]:
             worktree_path = bare_repo_path / commit_id
-            worktree_path.parent.mkdir(exist_ok=True)
```
155-163: Specialize exception handling for worktree creation failures.

Similar to bare cloning, catching bare `Exception` can mask unexpected errors. Consider specializing to `subprocess.CalledProcessError` or `OSError`.

Apply this diff:

```diff
         try:
             worktree_path, commit_id = future.result()
             repo_url = worktree_futures[future]
             worktree_map[(repo_url, commit_id)] = worktree_path
             logger.debug("Created worktree for commit {}", commit_id[:8])
-        except Exception as e:
+        except (subprocess.CalledProcessError, OSError) as e:
             logger.error("Failed to create worktree: {}", str(e))
```

benchmark/potpie.py (2)
46-67: Use logging instead of print and specialize exception handling.

Line 63 uses `print` rather than `logging`/`logger`, creating inconsistency. Also, catching bare `Exception` can mask unexpected errors.

Apply this diff:

```diff
-        except Exception as e:
-            print(f"Error checking {project_id}: {e}")
+        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
+            logging.error("Error checking {}: {}", project_id, e)
             # Optionally re-queue on error
             await check_queue.put(project_id)
```
70-97: Consider creating custom exception classes.

Lines 95-97 raise a generic `Exception` with a formatted string message. Per static analysis hints, consider creating custom exception classes for better error handling and typing.

Example custom exceptions to add near the top of the file:

```python
class PotPieAPIError(Exception):
    """Base exception for PotPie API errors."""
    pass


class ParseError(PotPieAPIError):
    """Error during repository parsing."""
    pass
```

Then use:

```diff
-        raise Exception(
-            f"Failed to get response: {await response.text()} {response.status}"
-        )
+        raise ParseError(
+            f"Failed to parse repository: {await response.text()} (status {response.status})"
+        )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
benchmark/download.py (1 hunks)
benchmark/potpie.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmark/potpie.py (1)
benchmark/download.py (2)
get_unique_repo_and_commits (27-43)
setup_all_worktrees (102-164)
🪛 Ruff (0.14.4)
benchmark/potpie.py
41-43: Create your own exception
(TRY002)
41-43: Avoid specifying long messages outside the exception class
(TRY003)
62-62: Do not catch blind exception: Exception
(BLE001)
95-97: Create your own exception
(TRY002)
95-97: Avoid specifying long messages outside the exception class
(TRY003)
125-127: Create your own exception
(TRY002)
125-127: Avoid specifying long messages outside the exception class
(TRY003)
194-194: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
238-240: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
benchmark/download.py
51-51: subprocess call: check for execution of untrusted input
(S603)
52-52: Starting a process with a partial executable path
(S607)
59-59: subprocess call: check for execution of untrusted input
(S603)
60-60: Starting a process with a partial executable path
(S607)
83-83: subprocess call: check for execution of untrusted input
(S603)
84-93: Starting a process with a partial executable path
(S607)
126-126: Do not catch blind exception: Exception
(BLE001)
161-161: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (8)
benchmark/download.py (5)
11-24: LGTM! URL normalization handles trailing slashes correctly.

The function properly strips both whitespace and trailing slashes before checking for the `.git` suffix, addressing the past review concern.
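For reference, a normalizer with the behavior this comment describes might look like the sketch below; this is an illustration of the described behavior, not necessarily the PR's exact implementation.

```python
def _normalize_repo_url(repo_url: str) -> str:
    # Strip whitespace and trailing slashes, then ensure a .git suffix.
    url = repo_url.strip().rstrip("/")
    return url if url.endswith(".git") else url + ".git"
```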
46-66: LGTM! Bare repository cloning logic is sound.

The function correctly checks for existing repos, clones as bare, and fetches all references with proper error handling via `check=True`.
69-99: LGTM! Worktree creation handles cleanup and errors appropriately.

The function correctly removes existing worktrees (with `check=False`, since they may not exist) before creating new detached worktrees with proper error handling.
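The remove-then-add sequence described here can be sketched as follows; paths and flags are illustrative, not the PR's exact invocation.

```python
import subprocess
from pathlib import Path


def create_detached_worktree(bare_repo: Path, commit_id: str, dest: Path) -> None:
    # The worktree may not exist yet, so tolerate failure with check=False.
    subprocess.run(
        ["git", "-C", str(bare_repo), "worktree", "remove", "--force", str(dest)],
        check=False,
        capture_output=True,
    )
    # Create a detached worktree pinned to the exact commit; fail loudly here.
    subprocess.run(
        ["git", "-C", str(bare_repo), "worktree", "add", "--detach", str(dest), commit_id],
        check=True,
    )
```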
183-197: LGTM! Cleanup function is straightforward and safe.

The function appropriately checks for existence before removal and provides clear logging.
200-209: LGTM! Main block demonstrates proper usage.

The example correctly shows the workflow: read CSV, setup worktrees, and output the mapping.
benchmark/potpie.py (3)
1-22: LGTM! Imports and data class are well-structured.

The dependencies and `PotPieUserInfo` dataclass are appropriate for the async HTTP workflow.
227-232: LGTM! Worker cleanup properly implemented.

The cleanup logic correctly waits for the queue to drain with `join()`, then cancels workers and gathers with `return_exceptions=True`. This addresses the past review concern about awaiting forever-running workers.
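That drain-then-cancel shutdown is a standard asyncio idiom; a generic sketch, independent of this PR's classes:

```python
import asyncio


async def shutdown_workers(queue: asyncio.Queue, workers: list[asyncio.Task]) -> None:
    await queue.join()  # wait until every queued item has been marked task_done()
    for worker in workers:
        worker.cancel()  # stop the forever-running loops
    # return_exceptions=True absorbs the CancelledError raised in each worker
    await asyncio.gather(*workers, return_exceptions=True)
```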
253-256: Main block structure is appropriate.

The use of `load_dotenv()` and `asyncio.run()` is correct. However, ensure `BASE_URL` is moved to module level as flagged in the earlier comment.
benchmark/download.py (outdated diff)
```python
def get_unique_repo_and_commits(csv_path: Path | str) -> dict[str, list[str]]:
    csv_file_path = Path(csv_path)
    required_columns = [
        "repo_url",
        "commit_id",
    ]
    df: DataFrame = pd.read_csv(
        csv_file_path, usecols=required_columns
    )  # pyright: ignore[reportUnknownMemberType]
    df.dropna(inplace=True)
    df["repo_url"] = df["repo_url"].apply(
        _normalize_repo_url
    )  # pyright: ignore[reportUnknownMemberType]
    repo_commit_dict = (
        df.groupby("repo_url")["commit_id"].apply(set).to_dict()
    )  # pyright: ignore[reportUnknownMemberType]
    return repo_commit_dict
```
Fix the return type hint to match the actual return value.
The function groups commits into sets (line 41: .apply(set)), but the return type annotation declares dict[str, list[str]]. This mismatch will cause type-checking failures.
Apply this diff:
```diff
-def get_unique_repo_and_commits(csv_path: Path | str) -> dict[str, list[str]]:
+def get_unique_repo_and_commits(csv_path: Path | str) -> dict[str, set[str]]:
```
🤖 Prompt for AI Agents
In benchmark/download.py around lines 27 to 43, the function
get_unique_repo_and_commits returns a dict mapping repo_url to sets of commit
IDs (it calls .apply(set)), but the type hint incorrectly declares dict[str,
list[str]]; update the function signature return type to dict[str, set[str]] so
the annotation matches the actual return value (alternatively, convert the sets
to lists before returning if lists are desired).
benchmark/download.py (outdated diff)
```python
def get_repo_name(repo_url: str) -> str:
    """
    Get the name of the repository from the URL.
    Args:
        repo_url: The URL of the repository. Expected to be in the format
        "https://<provider>.com/username/repository.git".
    Returns:
        The name of the repository.
    """
    repo_slug_split = repo_url.strip().split("/")[-2:]
    return "_".join(repo_slug_split)[:-4]
```
🛠️ Refactor suggestion | 🟠 Major
Make the .git suffix removal more robust.
The [:-4] slice assumes the URL ends with .git, but offers no validation or error handling if called on a non-normalized URL. Use .removesuffix(".git") for clarity and correctness.
Apply this diff:
```diff
 def get_repo_name(repo_url: str) -> str:
     """
     Get the name of the repository from the URL.
     Args:
         repo_url: The URL of the repository. Expected to be in the format
         "https://<provider>.com/username/repository.git".
     Returns:
         The name of the repository.
     """
+    # Normalize first to ensure .git suffix
+    repo_url = _normalize_repo_url(repo_url)
     repo_slug_split = repo_url.strip().split("/")[-2:]
-    return "_".join(repo_slug_split)[:-4]
+    return "_".join(repo_slug_split).removesuffix(".git")
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In benchmark/download.py around lines 167 to 179, the function strips the last
four characters with '[:-4]' to remove a '.git' suffix which is brittle if the
URL isn't normalized; replace that slice with repo_slug_split =
repo_url.strip().split("/")[-2:]; repo_name =
"_".join(repo_slug_split).removesuffix(".git") and return repo_name, ensuring
you use str.removesuffix(".git") so non-.git inputs remain unchanged and add no
slicing-based errors.
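A quick illustration of why `removesuffix` is safer than slicing (standard-library behavior, not code from this PR):

```python
>>> "user/repo.git".removesuffix(".git")
'user/repo'
>>> "user/repo".removesuffix(".git")  # non-.git input is left unchanged
'user/repo'
>>> "user/repo"[:-4]  # slicing silently corrupts it
'user/'
```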
benchmark/potpie.py (outdated diff)
```python
    jobs: list[Awaitable[str]] = []
    for (_, commit_id), worktree_path in worktree_map.items():
        jobs.append(
            parse_single_repo(
                user_info=conn_info,
                commit_id=commit_id,
                repo_path=worktree_path,
            )
        )
    keys = list(worktree_map.keys())
    project_ids = await asyncio.gather(*jobs)
    repo_commit_to_project_id = dict(zip(keys, project_ids))
```
🛠️ Refactor suggestion | 🟠 Major
Add strict=True to zip for length mismatch detection.
Line 194 uses zip() without the strict parameter. If keys and project_ids have mismatched lengths (due to partial failures), the silent truncation could cause data loss. Python 3.10+ supports strict=True to raise ValueError on mismatch.
Apply this diff:
```diff
     keys = list(worktree_map.keys())
     project_ids = await asyncio.gather(*jobs)
-    repo_commit_to_project_id = dict(zip(keys, project_ids))
+    repo_commit_to_project_id = dict(zip(keys, project_ids, strict=True))
```
🧰 Tools
🪛 Ruff (0.14.4)
194-194: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In benchmark/potpie.py around lines 183 to 194, the code builds keys and
project_ids then uses zip(keys, project_ids) which will silently truncate if
lengths differ; update the zip call to zip(keys, project_ids, strict=True) so a
ValueError is raised on length mismatch (or, if you must support pre-3.10, add
an explicit length check like if len(keys) != len(project_ids): raise
ValueError(...)) to detect partial failures instead of silently losing entries.
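A tiny demonstration of the failure mode and the fix, using plain Python 3.10+ semantics rather than the PR's data:

```python
keys = ["a", "b", "c"]
ids = [1, 2]  # imagine one parse job failed, so a result is missing

print(dict(zip(keys, ids)))  # {'a': 1, 'b': 2} -- 'c' is silently dropped

try:
    dict(zip(keys, ids, strict=True))
except ValueError as err:
    print(err)  # "zip() argument 2 is shorter than argument 1"
```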
benchmark/potpie.py (outdated diff)
```python
    test_cases = [
        LLMTestCase(
            input=question, actual_output=answer, expected_output=expected_answer
        )
        for question, answer, expected_answer in zip(
            qa_df["question"], answers, expected_answers
        )
    ]
```
🛠️ Refactor suggestion | 🟠 Major
Add strict=True to zip for length mismatch detection.
Lines 238-240 use zip() without the strict parameter. If the lists have mismatched lengths, silent truncation could cause test cases to be skipped without warning.
Apply this diff:
```diff
     test_cases = [
         LLMTestCase(
             input=question, actual_output=answer, expected_output=expected_answer
         )
         for question, answer, expected_answer in zip(
-            qa_df["question"], answers, expected_answers
+            qa_df["question"], answers, expected_answers, strict=True
         )
     ]
```
🪛 Ruff (0.14.4)
238-240: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In benchmark/potpie.py around lines 234 to 241, the list comprehension uses
zip(...) to combine qa_df["question"], answers, and expected_answers but doesn't
detect length mismatches; update the zip call to zip(qa_df["question"], answers,
expected_answers, strict=True) so a ValueError is raised on unequal lengths and
adjust imports/environment if running on Python <3.10 to ensure strict is
supported.
Actionable comments posted: 2
🧹 Nitpick comments (5)
benchmark/evaluate.py (2)
80-80: Prefix unused variable with underscore.

The `summary` variable is unused. Prefix it with `_` to indicate intentional non-use.

Apply this diff:

```diff
-    repo_map, summary = prepare_worktrees(
+    repo_map, _summary = prepare_worktrees(
         problem_sets, base_dir="/tmp/repos_batch", batch_no=nbatch, max_workers=6
     )
```
111-118: Add `strict=True` to zip for length mismatch detection.

If the lists have mismatched lengths due to partial failures during answer collection, the `zip()` will silently truncate to the shortest list, potentially hiding errors.

Apply this diff:

```diff
     test_cases = [
         LLMTestCase(
             input=question, actual_output=answer, expected_output=expected_answer
         )
         for question, answer, expected_answer in zip(
-            questions_batched, answers_flattened, expected_answers_batched
+            questions_batched, answers_flattened, expected_answers_batched, strict=True
         )
     ]
```

benchmark/download.py (1)
186-186: Prefix unused loop variable with underscore.

The loop variable `idx` is not used within the loop body.

Apply this diff:

```diff
-    for idx, (commit_id, problem_id) in enumerate(entries, 1):
+    for _idx, (commit_id, problem_id) in enumerate(entries, 1):
```

benchmark/potpie.py (2)
60-62: Consider using a custom exception class for HTTP errors.

The generic `Exception` makes it harder for callers to distinguish HTTP failures from other errors. However, this is acceptable for research code.

If you want to improve error handling, you could define:

```python
class PotpieAPIError(Exception):
    def __init__(self, status: int, message: str):
        self.status = status
        super().__init__(f"API request failed with status {status}: {message}")
```

Then use it:

```diff
         else:
-            raise Exception(
-                f"Failed to get response: {await resp.text()} {resp.status}",
-            )
+            text = await resp.text()
+            raise PotpieAPIError(resp.status, text)
```
333-337: Add a usage note or guard to the main block.

The `__main__` block calls `get_all_st_answers(None, None)`, which would fail since both arguments are required. This is fine for a test stub, but consider adding a comment or a proper guard for clarity.

```diff
 if __name__ == "__main__":
     from dotenv import load_dotenv

     _ = load_dotenv()
-    asyncio.run(get_all_st_answers(None, None))
+    # For testing: asyncio.run(get_all_st_answers(problems, repo_dict))
+    print("This module should be imported, not run directly.")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
benchmark/download.py (1 hunks)
benchmark/evaluate.py (1 hunks)
benchmark/metrics.py (1 hunks)
benchmark/potpie.py (1 hunks)
requirements.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-02-04T12:11:29.023Z
Learnt from: SujalThapaK
Repo: potpie-ai/potpie PR: 240
File: start.sh:0-0
Timestamp: 2025-02-04T12:11:29.023Z
Learning: Environment variables in the Potpie project should follow uppercase naming convention (e.g., `CELERY_QUEUE_NAME` instead of `Celery_QUEUE_NAME`) to maintain consistency and avoid case-sensitivity issues.
Applied to files:
benchmark/potpie.py
🧬 Code graph analysis (1)
benchmark/evaluate.py (2)
benchmark/download.py (1)
prepare_worktrees (270-327)
benchmark/potpie.py (2)
get_all_st_answers (269-330)
run (151-178)
🪛 OSV Scanner (2.2.4)
requirements.txt
[HIGH] 16-16: authlib 1.6.3: Authlib: JWS/JWT accepts unknown crit headers (RFC violation → possible authz bypass)
[HIGH] 16-16: authlib 1.6.3: Authlib : JWE zip=DEF decompression bomb enables DoS
[HIGH] 16-16: authlib 1.6.3: Authlib is vulnerable to Denial of Service via Oversized JOSE Segments
[HIGH] 117-117: langchain-community 0.3.16: Langchain Community Vulnerable to XML External Entity (XXE) Attacks
[HIGH] 120-120: langchain-text-splitters 0.3.5: LangChain Text Splitters is vulnerable to XML External Entity (XXE) attacks due to unsafe XSLT parsing
[HIGH] 122-122: langgraph-checkpoint 2.0.10: LangGraph Checkpoint affected by RCE in "json" mode of JsonPlusSerializer
[HIGH] 127-127: llama-index-core 0.10.68.post1: LlamaIndex vulnerable to DoS attack through uncontrolled recursive JSON parsing
[HIGH] 127-127: llama-index-core 0.10.68.post1: LlamaIndex affected by a Denial of Service (DOS) in JSONReader
[HIGH] 127-127: llama-index-core 0.10.68.post1: llama-index-core insecurely handles temporary files
[CRITICAL] 180-180: pillow 10.0.1: Arbitrary Code Execution in Pillow
[CRITICAL] 180-180: pillow 10.0.1: Pillow buffer overflow vulnerability
[HIGH] 252-252: starlette 0.41.3: Starlette has possible denial-of-service vector when parsing large files in multipart forms
[HIGH] 252-252: starlette 0.41.3: Starlette vulnerable to O(n^2) DoS via Range header merging in starlette.responses.FileResponse
[CRITICAL] 266-266: torch 2.5.1: undefined
(PYSEC-2025-41)
[CRITICAL] 266-266: torch 2.5.1: PyTorch susceptible to local Denial of Service
[CRITICAL] 266-266: torch 2.5.1: PyTorch: torch.load with weights_only=True leads to remote code execution
[CRITICAL] 266-266: torch 2.5.1: PyTorch Improper Resource Shutdown or Release vulnerability
🪛 Ruff (0.14.4)
benchmark/evaluate.py
80-80: Unpacked variable summary is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
81-81: Probable insecure usage of temporary file or directory: "/tmp/repos_batch"
(S108)
115-117: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
benchmark/download.py
19-19: subprocess call: check for execution of untrusted input
(S603)
51-51: Avoid specifying long messages outside the exception class
(TRY003)
81-83: Avoid specifying long messages outside the exception class
(TRY003)
96-96: Consider moving this statement to an else block
(TRY300)
97-97: Do not catch blind exception: Exception
(BLE001)
109-111: Avoid specifying long messages outside the exception class
(TRY003)
139-141: Avoid specifying long messages outside the exception class
(TRY003)
186-186: Loop control variable idx not used within loop body
Rename unused idx to _idx
(B007)
210-210: Do not catch blind exception: Exception
(BLE001)
246-246: Do not catch blind exception: Exception
(BLE001)
255-255: Do not catch blind exception: Exception
(BLE001)
309-309: Do not catch blind exception: Exception
(BLE001)
benchmark/metrics.py
37-37: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
38-38: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
39-39: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
benchmark/potpie.py
60-62: Create your own exception
(TRY002)
60-62: Avoid specifying long messages outside the exception class
(TRY003)
164-164: Do not catch blind exception: Exception
(BLE001)
172-173: try-except-pass detected, consider logging the exception
(S110)
172-172: Do not catch blind exception: Exception
(BLE001)
198-198: Do not catch blind exception: Exception
(BLE001)
240-240: Do not catch blind exception: Exception
(BLE001)
```
- Minor differences → 0.7–0.9
- Substantial but related differences → 0.3–0.6
- Completely different → 0.0–0.2
```
Replace EN DASH with HYPHEN-MINUS for consistency.
Lines 37-39 use the EN DASH character (–) instead of the standard HYPHEN-MINUS (-), which can cause encoding issues in some environments.
Apply this diff:
```diff
 Scoring Rules:
 - Base the score ONLY on similarity, NOT quality.
 - Wording differences do NOT reduce similarity if the meaning is the same.
-- Minor differences → 0.7–0.9
-- Substantial but related differences → 0.3–0.6
-- Completely different → 0.0–0.2
+- Minor differences → 0.7-0.9
+- Substantial but related differences → 0.3-0.6
+- Completely different → 0.0-0.2
```
🧰 Tools
🪛 Ruff (0.14.4)
37-37: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
38-38: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
39-39: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
🤖 Prompt for AI Agents
In benchmark/metrics.py around lines 37 to 39, the comment/strings use EN DASH
(–) characters; replace each EN DASH with the standard HYPHEN-MINUS (-)
character so the ranges read "0.7-0.9", "0.3-0.6", and "0.0-0.2". Update the
three lines accordingly, save the file using UTF-8, and run the linter/tests to
confirm no encoding or style issues remain.
```
async-timeout==4.0.3
asyncpg==0.30.0
attrs==25.4.0
authlib==1.6.3
```
🧩 Analysis chain
Address critical security vulnerabilities before production deployment.
Multiple dependencies have known high and critical severity vulnerabilities:
- pillow 10.0.1 (line 180): Arbitrary code execution and buffer overflow (2 CRITICAL CVEs)
- torch 2.5.1 (line 266): Remote code execution via `torch.load` and local DoS (4 CRITICAL CVEs)
- authlib 1.6.3 (line 16): JWT/JWE vulnerabilities enabling authorization bypass and DoS (3 HIGH CVEs)
- langchain-community 0.3.16 (line 117), langchain-text-splitters 0.3.5 (line 120): XXE attacks (HIGH)
- langgraph-checkpoint 2.0.10 (line 122): RCE vulnerability (HIGH)
- llama-index-core 0.10.68.post1 (line 127): DoS and insecure temp file handling (HIGH)
- starlette 0.41.3 (line 252): DoS vectors (HIGH)
Since this is WIP, these should be resolved before merging to production.
Also applies to: 117-117, 120-120, 122-122, 127-127, 180-180, 252-252, 266-266
🌐 Web query:
What are the latest secure versions of pillow, torch, authlib, langchain-community, langchain-text-splitters, langgraph-checkpoint, llama-index-core, and starlette that address known CVEs as of November 2025?
💡 Result:
Summary — secure minimum versions (addressing known CVEs) as of 2025-11-17:
- Pillow: >= 11.3.0. [1]
- torch / PyTorch: recommend >= 2.7.0 (contains fixes for several 2024–2025 vulnerabilities; upgrade to the latest 2.7.x/2.8.x+ available for additional fixes). [2][3]
- Authlib: >= 1.6.5 (1.6.4/1.6.5 patch specific CVEs; use 1.6.5 to cover recent JOSE/DoS fixes). [4][5]
- langchain-community: >= 0.3.27. [6]
- langchain-text-splitters: >= 0.3.9. [7]
- langgraph-checkpoint: >= 3.0.0. [8]
- llama-index-core: >= 0.13.0 (0.12.38 / 0.12.41 fixed earlier issues; 0.13.0 fixes the latest /tmp cache symlink issue). [9][10]
- starlette: >= 0.49.1 (also note earlier fix 0.47.2 for multipart rollover). [11][12]
Citations:
[1] Pillow security / release notes (CVE fixes). (security-tracker.debian.org)
[2] PyTorch vuln fixes (use-after-free, etc. — fixed in 2.2.0 and later). (synscan.net)
[3] PyTorch 2.7.0 fixes for other 2025 issues; upgrade recommendation. (gist.github.com)
[4] Authlib JOSE/DoS advisory (fixed in 1.6.5). (cvedetails.com)
[5] Authlib crit-header/authz advisory (fixed in 1.6.4). (cvedetails.com)
[6] langchain-community GHSA / CVE advisories (fixed in 0.3.27 and earlier fixes). (advisories.gitlab.com)
[7] langchain-text-splitters GHSA / CVE advisory (fixed in 0.3.9). (github.com)
[8] langgraph-checkpoint RCE advisory (fixed in 3.0.0). (github.com)
[9] llama-index-core advisories and fixes (SQLi/JSON/other fixes; cache symlink fix in 0.13.0). (security.snyk.io)
[10] llama-index fixes for JSON/pickle/path traversal (0.12.38/0.12.41 etc.). (data.safetycli.com)
[11] Starlette CVE-2025-62727 (fixed in 0.49.1). (nvd.nist.gov)
[12] Starlette multipart/file rollover DoS (fixed in 0.47.2). (nvd.nist.gov)
Citations:
- 1: https://security-tracker.debian.org/tracker/CVE-2025-48379?utm_source=openai
- 2: https://synscan.net/vuln/CVE-2024-31583?utm_source=openai
- 3: https://gist.github.com/shaoyuyoung/4bcefba4004f8271e64b5185c95a248a?utm_source=openai
- 4: https://www.cvedetails.com/cve/CVE-2025-61920/?utm_source=openai
- 5: https://www.cvedetails.com/cve/CVE-2025-59420/?utm_source=openai
- 6: https://advisories.gitlab.com/pkg/pypi/langchain-community/CVE-2025-6984/?utm_source=openai
- 7: https://github.com/advisories/GHSA-m42m-m8cr-8m58?utm_source=openai
- 8: https://github.com/advisories/GHSA-wwqv-p2pp-99h5?utm_source=openai
- 9: https://security.snyk.io/vuln/SNYK-PYTHON-LLAMAINDEXCORE-9510942?utm_source=openai
- 10: https://data.safetycli.com/vulnerabilities/CVE-2025-5302/79078/?utm_source=openai
- 11: https://nvd.nist.gov/vuln/detail/CVE-2025-62727?utm_source=openai
- 12: https://nvd.nist.gov/vuln/detail/CVE-2025-54121?utm_source=openai
Update all pinned dependency versions to address known security vulnerabilities before production deployment.
Pillow must be updated to >= 11.3.0, torch to >= 2.7.0, authlib to >= 1.6.5, langchain-community to >= 0.3.27, langchain-text-splitters to >= 0.3.9, langgraph-checkpoint to >= 3.0.0, llama-index-core to >= 0.13.0, and starlette to >= 0.49.1 to address critical and high-severity CVEs affecting arbitrary code execution, authorization bypass, DoS, and XXE vulnerabilities.
Lines affected: 16, 117, 120, 122, 127, 180, 252, 266
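Applied to this file, the bump might look like the pinned minimums below; the version numbers come from the advisory summary above and should be verified against PyPI before committing.

```
authlib>=1.6.5
langchain-community>=0.3.27
langchain-text-splitters>=0.3.9
langgraph-checkpoint>=3.0.0
llama-index-core>=0.13.0
pillow>=11.3.0
starlette>=0.49.1
torch>=2.7.0
```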
🧰 Tools
🪛 OSV Scanner (2.2.4)
[HIGH] 16-16: authlib 1.6.3: Authlib: JWS/JWT accepts unknown crit headers (RFC violation → possible authz bypass)
[HIGH] 16-16: authlib 1.6.3: Authlib : JWE zip=DEF decompression bomb enables DoS
[HIGH] 16-16: authlib 1.6.3: Authlib is vulnerable to Denial of Service via Oversized JOSE Segments
🤖 Prompt for AI Agents
In requirements.txt around lines 16, 117, 120, 122, 127, 180, 252, and 266, several pinned dependencies are vulnerable; update the listed packages to at least the secure minimum versions: set authlib >= 1.6.5 (line 16), langchain-community >= 0.3.27 (line 117), langchain-text-splitters >= 0.3.9 (line 120), langgraph-checkpoint >= 3.0.0 (line 122), llama-index-core >= 0.13.0 (line 127), Pillow >= 11.3.0 (line 180), starlette >= 0.49.1 (line 252), and torch >= 2.7.0 (line 266); after updating, run pip-compile or your dependency lock/update workflow and run the test suite to ensure compatibility, resolving any API changes or pin conflicts that arise.


A lot of sharp edges and code cleanups to be done. Still in research stage and should ideally be merged when we are ready with actual benchmarks for QA.