
Fix OpenAI Responses API reasoning content capture#56

Open
82deutschmark wants to merge 1 commit into arcprize:main from 82deutschmark:claude/fix-reasoning-capture-0153X6wDAaMhgwkxJ7e42yGE

Conversation

@82deutschmark

Problem

The OpenAI Responses API streaming implementation was failing to capture reasoning content from GPT-5 models despite:

  • Correctly configuring reasoning.summary: "detailed" in models.yml
  • Correctly setting text.verbosity: "high" via _ensure_verbosity()
  • Models generating reasoning tokens (confirmed by usage.reasoning_tokens > 0)
  • API charging for reasoning tokens

Result: the reasoning_summary field in saved submissions was always null, so money was spent on reasoning tokens that were never captured.

Root Cause

The streaming response handler (_responses_stream()) had three critical bugs:

  1. Mock Response Object Missing reasoning Attribute: The _ResponsesResponse mock class didn't include a reasoning attribute, causing _get_reasoning_summary() to always return None.

  2. Wrong Field for Reasoning Content: Code was looking for reasoning in response.reasoning.summary, which is just the config parameter ("detailed"), not the actual reasoning content.

  3. Incorrect Parsing of Output Array: When retrieving the final response, the code didn't properly parse the output array structure. OpenAI Responses API returns reasoning in output array items with type: "reasoning", where each reasoning item has a summary field containing the actual reasoning text.
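The distinction in bugs 2 and 3 can be illustrated with a sketch of the response shape described above (the field values here are illustrative, not copied from a real API response):

```python
# Hypothetical final response from the Responses API. Note that
# `reasoning.summary` at the top level only echoes the config value
# ("detailed"), while the actual reasoning text lives in `output`
# items with type "reasoning".
final_response = {
    "reasoning": {"summary": "detailed"},  # config echo, NOT content
    "output": [
        {
            "type": "reasoning",
            "summary": [{"type": "summary_text", "text": "The grid repeats with a flip..."}],
        },
        {
            "type": "message",
            "content": [{"type": "output_text", "text": "final answer grid"}],
        },
    ],
}

# The buggy code read final_response["reasoning"]["summary"] ("detailed");
# the fix reads the reasoning items out of the output array instead.
reasoning_items = [o for o in final_response["output"] if o["type"] == "reasoning"]
```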

Solution

1. Added Reasoning Support to Mock Response

  • Created _ResponsesReasoning class with summary attribute
  • Added reasoning parameter to _ResponsesResponse.__init__()

2. Captured Reasoning During Streaming

  • Collect reasoning deltas via response.reasoning.delta chunks
  • Store in reasoning_chunks array for fallback
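The delta collection above can be sketched as follows; the exact event-type string is an assumption, and the `reasoning_chunks` buffer mirrors the fallback described in the PR:

```python
# Fallback buffer for reasoning text gathered during streaming.
reasoning_chunks = []

class _Event:
    """Stand-in for a streaming event from the SDK."""
    def __init__(self, type, delta=""):
        self.type, self.delta = type, delta

def handle_event(event):
    # Assumed event type for reasoning deltas; other deltas are ignored here.
    if getattr(event, "type", "") == "response.reasoning.delta":
        reasoning_chunks.append(event.delta)

for ev in [_Event("response.reasoning.delta", "step 1; "),
           _Event("response.output_text.delta", "42"),
           _Event("response.reasoning.delta", "step 2")]:
    handle_event(ev)

fallback_reasoning = "".join(reasoning_chunks)
```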

3. Fixed Output Array Parsing

After streaming completes, always retrieve the final response to parse the output array:

  • Look for output items with type: "reasoning"
  • Extract from summary field (plain text) first
  • Fall back to content field if summary not available
  • Handle both string and list content structures
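The parsing order above can be sketched as a small helper; this is an illustration of the described logic, not the PR's exact code:

```python
def extract_reasoning(output):
    """Collect reasoning text from Responses API output items.

    Prefers each item's `summary` field, falls back to `content`,
    and accepts both string and list-of-parts structures.
    """
    parts = []
    for item in output:
        if item.get("type") != "reasoning":
            continue
        value = item.get("summary") or item.get("content")
        if isinstance(value, str):
            parts.append(value)
        elif isinstance(value, list):
            for piece in value:
                text = piece.get("text") if isinstance(piece, dict) else str(piece)
                if text:
                    parts.append(text)
    return "\n".join(parts) or None
```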

4. Added Verbosity Helper

  • Implemented _ensure_verbosity() to automatically set text.verbosity: "high"
  • Ensures detailed output is returned from Responses API
  • Called before all Responses API requests

5. Improved _get_reasoning_summary()

  • Added _coerce_reasoning_summary_to_text() helper
  • Handles various reasoning summary structures (str/list/dict/objects)
  • Provides fallback to nested reasoning on output items

Verification

Tested with gpt-5-mini-2025-08-07-medium on task 00576224:

  • ✅ Reasoning captured successfully (previously null)
  • ✅ Full reasoning trace saved (~1,500 characters)
  • ✅ Reasoning tokens properly accounted for in usage/cost tracking
  • ✅ No performance impact

Files Modified

  • src/arc_agi_benchmarking/adapters/openai_base.py
    • Added _ResponsesReasoning class
    • Updated _ResponsesResponse to include reasoning attribute
    • Modified _responses_stream() to capture reasoning and parse output array
    • Added _ensure_verbosity() helper method
    • Enhanced _get_reasoning_summary() with robust parsing
    • Added _coerce_reasoning_summary_to_text() for normalization

Impact

  • All GPT-5 model submissions will now include full reasoning traces
  • No breaking changes - existing code continues to work
  • Better value - actually capturing reasoning content being paid for
  • Improved debugging - can inspect model's step-by-step thinking

@82deutschmark
Author

One additional bit of complexity: GPT-5.1 Codex models do not allow verbosity to be set to "high"; it can only be set to "medium". As long as reasoning_summary is "detailed", they should still return reasoning.
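The constraint described above could be handled with a hypothetical clamp before `_ensure_verbosity` runs; the model-name check is an assumption for illustration, not code from this PR:

```python
def clamp_verbosity(model_name, verbosity):
    """Downgrade "high" verbosity to "medium" for GPT-5.1 Codex models,
    which reject "high" per the comment above. Other models pass through."""
    if "codex" in model_name and verbosity == "high":
        return "medium"
    return verbosity
```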
