
Fix OpenAI Responses API reasoning content capture#56

Open
82deutschmark wants to merge 1 commit into arcprize:main from 82deutschmark:claude/fix-reasoning-capture-0153X6wDAaMhgwkxJ7e42yGE

Conversation

@82deutschmark

Problem

The OpenAI Responses API streaming implementation was failing to capture reasoning content from GPT-5 models despite:

  • Correctly configuring reasoning.summary: "detailed" in models.yml
  • Correctly setting text.verbosity: "high" via _ensure_verbosity()
  • Models generating reasoning tokens (confirmed by usage.reasoning_tokens > 0)
  • API charging for reasoning tokens

Result: the reasoning_summary field in saved submissions was always null, so money was spent on reasoning tokens that were never captured.

Root Cause

The streaming response handler (_responses_stream()) had three critical bugs:

  1. Mock Response Object Missing reasoning Attribute: The _ResponsesResponse mock class didn't include a reasoning attribute, causing _get_reasoning_summary() to always return None.

  2. Wrong Field for Reasoning Content: Code was looking for reasoning in response.reasoning.summary, which is just the config parameter ("detailed"), not the actual reasoning content.

  3. Incorrect Parsing of Output Array: When retrieving the final response, the code didn't properly parse the output array structure. OpenAI Responses API returns reasoning in output array items with type: "reasoning", where each reasoning item has a summary field containing the actual reasoning text.
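The distinction in bugs 2 and 3 can be illustrated with a sketch of the response shape described above (the field values here are illustrative, not copied from a real API response):

```python
# Hypothetical final response from the Responses API. Note that
# `reasoning.summary` at the top level only echoes the config value
# ("detailed"), while the actual reasoning text lives in `output`
# items with type "reasoning".
final_response = {
    "reasoning": {"summary": "detailed"},  # config echo, NOT content
    "output": [
        {
            "type": "reasoning",
            "summary": [{"type": "summary_text", "text": "The grid repeats with a flip..."}],
        },
        {
            "type": "message",
            "content": [{"type": "output_text", "text": "final answer grid"}],
        },
    ],
}

# The buggy code read final_response["reasoning"]["summary"] ("detailed");
# the fix reads the reasoning items out of the output array instead.
reasoning_items = [o for o in final_response["output"] if o["type"] == "reasoning"]
```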

Solution

1. Added Reasoning Support to Mock Response

  • Created _ResponsesReasoning class with summary attribute
  • Added reasoning parameter to _ResponsesResponse.__init__()

2. Captured Reasoning During Streaming

  • Collect reasoning deltas via response.reasoning.delta chunks
  • Store in reasoning_chunks array for fallback
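The delta collection above can be sketched as follows; the exact event-type string is an assumption, and the `reasoning_chunks` buffer mirrors the fallback described in the PR:

```python
# Fallback buffer for reasoning text gathered during streaming.
reasoning_chunks = []

class _Event:
    """Stand-in for a streaming event from the SDK."""
    def __init__(self, type, delta=""):
        self.type, self.delta = type, delta

def handle_event(event):
    # Assumed event type for reasoning deltas; other deltas are ignored here.
    if getattr(event, "type", "") == "response.reasoning.delta":
        reasoning_chunks.append(event.delta)

for ev in [_Event("response.reasoning.delta", "step 1; "),
           _Event("response.output_text.delta", "42"),
           _Event("response.reasoning.delta", "step 2")]:
    handle_event(ev)

fallback_reasoning = "".join(reasoning_chunks)
```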

3. Fixed Output Array Parsing

After streaming completes, always retrieve the final response to parse the output array:

  • Look for output items with type: "reasoning"
  • Extract from summary field (plain text) first
  • Fall back to content field if summary not available
  • Handle both string and list content structures
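The parsing order above can be sketched as a small helper; this is an illustration of the described logic, not the PR's exact code:

```python
def extract_reasoning(output):
    """Collect reasoning text from Responses API output items.

    Prefers each item's `summary` field, falls back to `content`,
    and accepts both string and list-of-parts structures.
    """
    parts = []
    for item in output:
        if item.get("type") != "reasoning":
            continue
        value = item.get("summary") or item.get("content")
        if isinstance(value, str):
            parts.append(value)
        elif isinstance(value, list):
            for piece in value:
                text = piece.get("text") if isinstance(piece, dict) else str(piece)
                if text:
                    parts.append(text)
    return "\n".join(parts) or None
```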

4. Added Verbosity Helper

  • Implemented _ensure_verbosity() to automatically set text.verbosity: "high"
  • Ensures detailed output is returned from Responses API
  • Called before all Responses API requests

5. Improved _get_reasoning_summary()

  • Added _coerce_reasoning_summary_to_text() helper
  • Handles various reasoning summary structures (str/list/dict/objects)
  • Provides fallback to nested reasoning on output items

Verification

Tested with gpt-5-mini-2025-08-07-medium on task 00576224:

  • ✅ Reasoning captured successfully (previously null)
  • ✅ Full reasoning trace saved (~1,500 characters)
  • ✅ Reasoning tokens properly accounted for in usage/cost tracking
  • ✅ No performance impact

Files Modified

  • src/arc_agi_benchmarking/adapters/openai_base.py
    • Added _ResponsesReasoning class
    • Updated _ResponsesResponse to include reasoning attribute
    • Modified _responses_stream() to capture reasoning and parse output array
    • Added _ensure_verbosity() helper method
    • Enhanced _get_reasoning_summary() with robust parsing
    • Added _coerce_reasoning_summary_to_text() for normalization

Impact

  • All GPT-5 model submissions will now include full reasoning traces
  • No breaking changes - existing code continues to work
  • Better value - actually capturing reasoning content being paid for
  • Improved debugging - can inspect model's step-by-step thinking

@82deutschmark
Author

One additional bit of complexity: GPT-5.1 Codex models do not allow verbosity to be set to "high"; it can only be set to "medium". As long as reasoning_summary is "detailed", they should still return reasoning.
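The constraint described above could be handled with a hypothetical clamp before `_ensure_verbosity` runs; the model-name check is an assumption for illustration, not code from this PR:

```python
def clamp_verbosity(model_name, verbosity):
    """Downgrade "high" verbosity to "medium" for GPT-5.1 Codex models,
    which reject "high" per the comment above. Other models pass through."""
    if "codex" in model_name and verbosity == "high":
        return "medium"
    return verbosity
```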
