
OOM failures persist in SWTBench evaluations despite fix in #433 #441

@juanmichelini

Description


Summary

Evaluation jobs continue to experience OOM kills on SWTBench despite the memory leak fix merged in #433 (PR #434). The fix addressed parent process memory accumulation but left worker process memory accumulation unaddressed. This results in continued OOM failures on long-running evaluations with 30 parallel workers.

Evidence

Recent OOM failure

  • Job: eval-22318536157-qwen3-code-fjvsl
  • Started: 2026-02-23 18:08:00 UTC (3 days after fix was merged)
  • OOMKilled: 2026-02-23 23:12:12 UTC (exit code 137)
  • Runtime: 5 hours 4 minutes
  • Configuration: 30 workers, 8Gi memory limit, SWT-bench dataset
  • Model: Qwen3-Coder (via OpenRouter)

This is the second Qwen3-Coder job to OOM on SWTBench (eval-22240291718-qwen3-code-c4cnd was also OOMKilled, after 72h 12m of runtime).

Pattern from logs

From the job logs (workers=30, memory limit 8Gi), multiple instances show:

  • "runtime init failure" with increasing resource_factor (2 → 4 → 8)
  • "Remote conversation ended with error"
  • "Remote conversation got stuck"
  • LLM provider validation errors

The combination of 30 parallel workers + retry resource_factor amplification + long-running conversations creates memory pressure that accumulates in worker processes.

Root cause: Incomplete fix in #433

Issue #433 correctly identified the O(N) memory accumulation problem and proposed three fixes:

  1. Implemented: Release EvalOutput.history after disk write (line 367)
  2. Not implemented: Add max_tasks_per_child to recycle worker processes
  3. Not implemented: Share metadata by reference instead of deep copying

What's missing

1. Worker processes never recycle (line 372):

```python
# Current implementation
pool = ProcessPoolExecutor(max_workers=self.num_workers)
```

Worker processes run for the entire evaluation duration (hours to days). They accumulate:

  • Fragmented Python heap from repeated large allocations
  • Leaked references in C extensions (OpenTelemetry, gRPC, multiprocessing internals)
  • Growing process RSS that never shrinks even after objects are freed

With 30 workers each processing multiple instances with retries and resource scaling, memory fragmentation compounds over time.

2. Deep metadata copies still occur (lines 422, 460):

```python
out.metadata = self.metadata.model_copy(deep=True)
```

This creates a full copy of metadata (including nested LLM config, environment vars, paths) for every instance result, adding 10-50 KB per output. With hundreds of instances × multiple attempts, this becomes measurable.

Why the fix in #433 wasn't sufficient

The fix reduced parent process memory from O(N × 20 MB) to O(workers × ~10 KB), which is correct. However:

  1. Worker process memory is independent - each worker process has its own Python heap that grows independently of the parent
  2. Resource factor amplification - failed instances retry with 2x/4x/8x resources, meaning some workers spawn runtimes requesting much more memory
  3. Long-running workers - with no recycling, a single worker might process 10-20+ instances over 5+ hours, accumulating fragmentation
  4. No per-worker memory limits - the 8Gi limit applies to the container (parent + all workers combined), not individual workers

Memory calculation

With 30 workers and 8Gi container limit:

  • Base allocation: ~267 MB average per worker
  • With resource_factor=8 on retries: some workers need 2+ GB temporarily
  • Memory fragmentation: Python heap doesn't release memory back to OS
  • Result: Combined RSS of all workers grows monotonically until OOM
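The budget above can be checked with quick arithmetic, treating the 8Gi limit as 8192 MiB (decimal GB gives the ~267 MB figure above) and ignoring the parent process's own share:

```python
limit_mib = 8 * 1024   # 8Gi container limit, expressed in MiB
workers = 30

per_worker = limit_mib / workers
print(round(per_worker))                 # ≈ 273 MiB average budget per worker

# A retry at resource_factor=8 can push a single worker to a ~2 GB
# transient peak, so its headroom has to come out of the other
# workers' shares of the container limit.
peak_mib = 2 * 1024
print(round(peak_mib / per_worker, 1))   # one such worker consumes 7.5 average shares
```

Even a handful of workers hitting their retry peaks simultaneously exhausts the container limit, which is consistent with the OOM kills landing hours into a run rather than at startup.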

Proposed fix

Apply the remaining two fixes from #433:

1. Recycle worker processes (evaluation.py:372)

```python
# Current
pool = ProcessPoolExecutor(max_workers=self.num_workers)

# Proposed
pool = ProcessPoolExecutor(
    max_workers=self.num_workers,
    max_tasks_per_child=10,  # recycle each worker after 10 instances
    # max_tasks_per_child rejects the default "fork" start method on
    # Linux (Python 3.11-3.13), so an explicit spawn (or forkserver)
    # context is required:
    mp_context=multiprocessing.get_context("spawn"),
)
```

This retires each worker process after it has completed 10 instances, releasing its accumulated fragmented memory back to the OS. Note that workers recycle individually, not as a pool: with 30 workers each capped at 10 tasks, processes are replaced continuously throughout the run instead of living for its entire duration, preventing unbounded memory growth.

2. Share metadata by reference (evaluation.py:422, 460)

```python
# Current
out.metadata = self.metadata.model_copy(deep=True)

# Proposed
out.metadata = self.metadata  # Metadata is read-only after initialization
```

Metadata is never modified after being set in the Evaluation constructor, so deep copying is unnecessary. This saves 10-50 KB per instance × attempts.
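The difference between the two assignments can be sketched with minimal pydantic models. LLMConfig and EvalMetadata below are hypothetical stand-ins for the project's actual metadata classes; frozen=True is one way to enforce the read-only assumption that makes reference sharing safe.

```python
from pydantic import BaseModel, ConfigDict

class LLMConfig(BaseModel):
    model_config = ConfigDict(frozen=True)  # guard the read-only assumption
    name: str
    temperature: float = 0.0

class EvalMetadata(BaseModel):
    model_config = ConfigDict(frozen=True)
    llm: LLMConfig
    dataset: str

meta = EvalMetadata(llm=LLMConfig(name="qwen3-coder"), dataset="swt-bench")

copied = meta.model_copy(deep=True)  # allocates a full new object graph
shared = meta                        # same objects, zero extra allocation

assert copied.llm is not meta.llm    # deep copy duplicated the nested config
assert shared.llm is meta.llm        # reference sharing did not
```

With frozen models, any accidental mutation of the shared metadata raises at the call site instead of silently corrupting other instances' results.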

Why this needs to be fixed

  1. Blocking SWTBench evaluations: Qwen3-Coder jobs consistently OOM, preventing completion of evaluation runs
  2. Unpredictable failures: OOM can occur at any point depending on retry patterns and resource scaling
  3. Wastes compute resources: Jobs run for hours before failing, wasting expensive GPU/CPU time
  4. Blocks CI/CD: Evaluation pipelines cannot reliably complete
  5. Affects all high-worker-count runs: Any benchmark with 30 workers on long-running instances will eventually hit this

Related issues

Implementation notes

The proposed fixes are:

  • Low risk: max_tasks_per_child is a standard Python 3.11+ feature (note that it requires a spawn or forkserver multiprocessing context; it is incompatible with fork)
  • Backward compatible: existing evaluations simply see lower memory usage
  • Well-tested pattern: worker recycling (cf. maxtasksperchild in multiprocessing.Pool) is widely used in production for memory-intensive parallel workloads
  • Minimal code change: only a few lines in evaluation.py

The fix should be applied before the next major evaluation run to prevent continued OOM failures.
