Fix OpenAI Responses API reasoning content capture #56
Open
82deutschmark wants to merge 1 commit into arcprize:main
## Problem
The OpenAI Responses API streaming implementation failed to capture
reasoning content from GPT-5 models even though:
- reasoning.summary: "detailed" was correctly configured in models.yml
- text.verbosity: "high" was correctly set via _ensure_verbosity()
- The models generated reasoning tokens (confirmed by usage.reasoning_tokens > 0)
- The API charged for those reasoning tokens

Result: the reasoning_summary field in saved submissions was always null, so
the reasoning tokens being paid for were never captured.
## Root Cause
The streaming response handler (_responses_stream()) had three critical bugs:
1. **Mock Response Object Missing reasoning Attribute**: The _ResponsesResponse
mock class had no reasoning attribute, so _get_reasoning_summary()
always returned None.
2. **Wrong Field for Reasoning Content**: The code looked for reasoning in
response.reasoning.summary, which only echoes the config parameter ("detailed"),
not the actual reasoning content.
3. **Incorrect Parsing of Output Array**: When retrieving the final response,
the code did not parse the output array structure correctly. The OpenAI Responses
API returns reasoning in output array items with type: "reasoning", and
each such item has a summary field containing the actual reasoning text.
## Solution
### 1. Added Reasoning Support to Mock Response
- Created _ResponsesReasoning class with summary attribute
- Added reasoning parameter to _ResponsesResponse.__init__()
### 2. Captured Reasoning During Streaming
- Collect reasoning deltas from response.reasoning.delta chunks
- Store them in a reasoning_chunks list as a fallback
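
The collection step could be sketched roughly as follows. Events are modeled here as plain dicts with `type` and `delta` keys, which is an assumption made to keep the example self-contained; the SDK yields typed event objects. The reasoning event name is the one given in this description and should be checked against the streaming events actually emitted. `collect_stream` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_stream(events: Iterable[Dict[str, Any]]) -> Tuple[str, str]:
    """Split streamed deltas into answer text and reasoning text."""
    text_chunks: List[str] = []
    reasoning_chunks: List[str] = []  # kept as a fallback source of reasoning
    for event in events:
        event_type = event.get("type", "")
        delta = event.get("delta") or ""
        if event_type == "response.reasoning.delta":  # name as given in this PR
            reasoning_chunks.append(delta)
        elif event_type == "response.output_text.delta":
            text_chunks.append(delta)
        # Other event types (created, completed, ...) are ignored in this sketch.
    return "".join(text_chunks), "".join(reasoning_chunks)
```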
### 3. Fixed Output Array Parsing
After streaming completes, always retrieve the final response to parse the
output array:
- Look for output items with type: "reasoning"
- Extract from summary field (plain text) first
- Fall back to content field if summary not available
- Handle both string and list content structures
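
The parsing rules above can be sketched as a small function. This is an illustration under stated assumptions, not the code in this PR: the output array is modeled as a list of dicts, text parts are assumed to carry their text under a `text` key, and `extract_reasoning` and `_part_text` are hypothetical names.

```python
from typing import Any, Dict, List, Optional


def _part_text(part: Any) -> str:
    """Return the text of one summary/content part (str or dict with 'text')."""
    if isinstance(part, str):
        return part
    if isinstance(part, dict):
        return str(part.get("text") or "")
    return ""


def extract_reasoning(output: List[Dict[str, Any]]) -> Optional[str]:
    """Join the text of every output item whose type is "reasoning"."""
    pieces: List[str] = []
    for item in output:
        if item.get("type") != "reasoning":
            continue
        # Prefer summary; fall back to content when summary is missing or empty.
        source = item.get("summary") or item.get("content")
        if isinstance(source, (str, dict)):
            source = [source]
        elif not isinstance(source, list):
            source = []
        for part in source:
            text = _part_text(part)
            if text:
                pieces.append(text)
    return "\n".join(pieces) if pieces else None
```

Returning None when nothing is found lets the caller fall back to the deltas collected during streaming.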
### 4. Added Verbosity Helper
- Implemented _ensure_verbosity() to automatically set text.verbosity: "high"
- Ensures detailed output is returned from Responses API
- Called before all Responses API requests
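
One way such a helper might look is sketched below. The `level` parameter and the choice to leave a caller-supplied verbosity untouched are assumptions, not confirmed behavior of `_ensure_verbosity()`; the parameter is included because the comment on this PR notes that some models only accept "medium". `ensure_verbosity` is a hypothetical standalone name.

```python
from typing import Any, Dict


def ensure_verbosity(kwargs: Dict[str, Any], level: str = "high") -> Dict[str, Any]:
    """Return a copy of the request kwargs with text.verbosity filled in.

    An existing text.verbosity value is preserved (assumed behavior).
    """
    updated = dict(kwargs)
    text_cfg = dict(updated.get("text") or {})  # copy so the input is not mutated
    text_cfg.setdefault("verbosity", level)
    updated["text"] = text_cfg
    return updated
```

For example, `ensure_verbosity(request_kwargs, level="medium")` would be the call for a model that rejects "high".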
### 5. Improved _get_reasoning_summary()
- Added _coerce_reasoning_summary_to_text() helper
- Handles various reasoning summary structures (str/list/dict/objects)
- Provides fallback to nested reasoning on output items
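
A rough sketch of the normalization idea, assuming that dict and object parts expose their text under a `text` key or attribute (with a nested `summary` key as a secondary guess). The function name mirrors the helper named above, but the body is illustrative only and may differ from the code in this PR.

```python
from types import SimpleNamespace
from typing import Any, Optional


def coerce_reasoning_summary_to_text(summary: Any) -> Optional[str]:
    """Flatten a str/list/dict/object reasoning summary into plain text."""
    if summary is None:
        return None
    if isinstance(summary, str):
        return summary.strip() or None
    if isinstance(summary, (list, tuple)):
        parts = [coerce_reasoning_summary_to_text(p) for p in summary]
        return "\n".join(p for p in parts if p) or None
    if isinstance(summary, dict):
        return coerce_reasoning_summary_to_text(
            summary.get("text") or summary.get("summary")
        )
    # Fallback for SDK-style objects: read a `text` attribute if present.
    return coerce_reasoning_summary_to_text(getattr(summary, "text", None))


# Mixed input: a dict part, an object part, and an empty string that is dropped.
example = coerce_reasoning_summary_to_text(
    [{"text": "first step"}, SimpleNamespace(text="second step"), ""]
)
```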
## Verification
Tested with gpt-5-mini-2025-08-07-medium on task 00576224:
- ✅ Reasoning captured successfully (previously null)
- ✅ Full reasoning trace saved (~1,500 characters)
- ✅ Reasoning tokens properly accounted for in usage/cost tracking
- ✅ No performance impact
## Files Modified
- src/arc_agi_benchmarking/adapters/openai_base.py
  - Added _ResponsesReasoning class
  - Updated _ResponsesResponse to include reasoning attribute
  - Modified _responses_stream() to capture reasoning and parse output array
  - Added _ensure_verbosity() helper method
  - Enhanced _get_reasoning_summary() with robust parsing
  - Added _coerce_reasoning_summary_to_text() for normalization
## Impact
- All GPT-5 model submissions will now include full reasoning traces
- No breaking changes - existing code continues to work
- Better value - actually capturing reasoning content being paid for
- Improved debugging - can inspect model's step-by-step thinking
Author comment:
One additional complication: GPT-5.1 Codex models do not allow verbosity to be set to high; it can only be set to medium. As long as reasoning_summary is set to detailed, reasoning should still be returned.