fix(#161): re-parse criteria when LLM returns stringified JSON dict by drewdrewthis · Pull Request #552 · langwatch/scenario

drewdrewthis · 2026-05-25T07:31:53Z

Problem

Some LLMs serialise nested objects (like the criteria map) as a quoted JSON string instead of an inline dict. When the JudgeAgent calls .values() on that string it blows up:

AttributeError: 'str' object has no attribute 'values'

Closes #161.

Fix

After json.loads(tool_call.function.arguments), check if criteria_verdicts is a str. If so, try a second json.loads() pass. On decode failure fall back to {} so the judge still produces a result rather than crashing the run.

Test

test_judge_agent_handles_criteria_as_json_string in tests/test_judge_agent.py exercises the exact payload shape that triggered the bug — criteria returned as "{\"test_criterion\": true}" (a JSON string, not a dict). The test was failing with AttributeError before the fix and passes after.

Checklist

Regression test added
No behaviour change when LLM returns a normal dict
Graceful fallback ({}) when the string is not valid JSON

🤖 Generated with Claude Code

Some LLMs serialise nested objects (like the criteria map) as a quoted JSON string instead of an inline dict. criteria_verdicts.values() then throws AttributeError: 'str' object has no attribute 'values'. Re-parse with json.loads() when a string is detected; fall back to {} on decode failure so the judge still produces a result rather than crashing the test run. Regression test added: test_judge_agent_handles_criteria_as_json_string exercises the exact payload shape that triggered the bug. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis · 2026-05-25T07:58:18Z

[grinder] READY for human review

CI: green (zero failing, zero pending)
ACs: met — JudgeAgent now re-parses criteria when LLM returns a JSON string instead of a dict; graceful fallback to {} on parse failure; regression test test_judge_agent_handles_criteria_as_json_string verifies the exact payload shape that triggered the bug
Threads: zero unresolved
Note: LLM evaluator declined auto-approve (modifies runtime parsing logic in JudgeAgent) — human review required

Verified by:
`command gh pr checks 552` → all checks pass or skipping; test (3.12) pass 7m31s, evaluate pass, javascript-complete pass
regression test in worktree: `PASSED` (.worktrees/issue161)

github-actions · 2026-05-25T08:04:35Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

Scope: Re-parse 'criteria' if it's returned as a JSON string in JudgeAgent (fallback to {} on decode failure) and add a regression test exercising this case.
Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
Classification: low-risk-change under the documented policy.

The PR only adjusts how the JudgeAgent parses LLM-returned arguments (re-parsing a stringified JSON 'criteria' and falling back to an empty dict) and adds a regression test. It does not touch authentication/authorization, secrets/encryption, database schemas/migrations, business-critical logic, or external integrations, so it fits the low-risk criteria and is easy to revert if needed.

An approving review has been submitted by automation. The PR may merge once required CI checks pass.

github-actions

Approved by automation: PR qualifies as low-risk-change under the documented policy.

drewdrewthis added grinding Grinder is actively managing this PR low-risk-change PR qualifies as low-risk per policy and can be merged without manual review labels May 25, 2026

github-actions Bot removed the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label May 25, 2026

drewdrewthis added pr-ready and removed grinding Grinder is actively managing this PR labels May 25, 2026

github-actions Bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label May 25, 2026

github-actions Bot approved these changes May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#161): re-parse criteria when LLM returns stringified JSON dict#552

fix(#161): re-parse criteria when LLM returns stringified JSON dict#552
drewdrewthis wants to merge 1 commit into
mainfrom
issue161/fix-judge-criteria-string

drewdrewthis commented May 25, 2026

Uh oh!

drewdrewthis commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

drewdrewthis commented May 25, 2026

Problem

Fix

Test

Checklist

Uh oh!

drewdrewthis commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant