
OOM failures persist in SWTBench evaluations despite fix in #433 #441

@juanmichelini

Description


Summary

Evaluation jobs continue to experience OOM kills on SWTBench despite the memory leak fix merged in #433 (PR #434). The fix addressed parent process memory accumulation but left worker process memory accumulation unaddressed. This results in continued OOM failures on long-running evaluations with 30 parallel workers.

Evidence

Recent OOM failure

  • Job: eval-22318536157-qwen3-code-fjvsl
  • Started: 2026-02-23 18:08:00 UTC (3 days after fix was merged)
  • OOMKilled: 2026-02-23 23:12:12 UTC (exit code 137)
  • Runtime: 5 hours 4 minutes
  • Configuration: 30 workers, 8Gi memory limit, SWT-bench dataset
  • Model: Qwen3-Coder (via OpenRouter)

This is the second Qwen3-Coder job to OOM on SWTBench (eval-22240291718-qwen3-code-c4cnd was also OOMKilled, after 72h 12m of runtime).

Pattern from logs

From the job logs (workers=30, memory limit 8Gi), multiple instances show:

  • "runtime init failure" with increasing resource_factor (2 → 4 → 8)
  • "Remote conversation ended with error"
  • "Remote conversation got stuck"
  • LLM provider validation errors

The combination of 30 parallel workers + retry resource_factor amplification + long-running conversations creates memory pressure that accumulates in worker processes.

Root cause: Incomplete fix in #433

Issue #433 correctly identified the O(N) memory accumulation problem and proposed three fixes:

  1. Implemented: Release EvalOutput.history after disk write (line 367)
  2. Not implemented: Add max_tasks_per_child to recycle worker processes
  3. Not implemented: Share metadata by reference instead of deep copying

What's missing

1. Worker processes never recycle (line 372):

```python
# Current implementation
pool = ProcessPoolExecutor(max_workers=self.num_workers)
```

Worker processes run for the entire evaluation duration (hours to days). They accumulate:

  • Fragmented Python heap from repeated large allocations
  • Leaked references in C extensions (OpenTelemetry, gRPC, multiprocessing internals)
  • Growing process RSS that never shrinks even after objects are freed

With 30 workers each processing multiple instances with retries and resource scaling, memory fragmentation compounds over time.

2. Deep metadata copies still occur (lines 422, 460):

```python
out.metadata = self.metadata.model_copy(deep=True)
```

This creates a full copy of metadata (including nested LLM config, environment vars, paths) for every instance result, adding 10-50 KB per output. With hundreds of instances × multiple attempts, this becomes measurable.

Why the fix in #433 wasn't sufficient

The fix reduced parent process memory from O(N × 20 MB) to O(workers × ~10 KB), which is correct. However:

  1. Worker process memory is independent - each worker process has its own Python heap that grows independently of the parent
  2. Resource factor amplification - failed instances retry with 2x/4x/8x resources, meaning some workers spawn runtimes requesting much more memory
  3. Long-running workers - with no recycling, a single worker might process 10-20+ instances over 5+ hours, accumulating fragmentation
  4. No per-worker memory limits - the 8Gi limit applies to the container (parent + all workers combined), not individual workers

Memory calculation

With 30 workers and 8Gi container limit:

  • Base allocation: ~267 MB average per worker
  • With resource_factor=8 on retries: some workers need 2+ GB temporarily
  • Memory fragmentation: Python heap doesn't release memory back to OS
  • Result: Combined RSS of all workers grows monotonically until OOM
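The budget above can be checked with quick arithmetic, treating the 8Gi limit as 8192 MiB (decimal GB gives the ~267 MB figure above) and ignoring the parent process's own share:

```python
limit_mib = 8 * 1024   # 8Gi container limit, expressed in MiB
workers = 30

per_worker = limit_mib / workers
print(round(per_worker))                 # ≈ 273 MiB average budget per worker

# A retry at resource_factor=8 can push a single worker to a ~2 GB
# transient peak, so its headroom has to come out of the other
# workers' shares of the container limit.
peak_mib = 2 * 1024
print(round(peak_mib / per_worker, 1))   # one such worker consumes 7.5 average shares
```

Even a handful of workers hitting their retry peaks simultaneously exhausts the container limit, which is consistent with the OOM kills landing hours into a run rather than at startup.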

Proposed fix

Apply the remaining two fixes from #433:

1. Recycle worker processes (evaluation.py:372)

```python
# Current
pool = ProcessPoolExecutor(max_workers=self.num_workers)

# Proposed
pool = ProcessPoolExecutor(
    max_workers=self.num_workers,
    max_tasks_per_child=10,  # recycle each worker after 10 instances
    # max_tasks_per_child rejects the default "fork" start method on
    # Linux (Python 3.11-3.13), so an explicit spawn (or forkserver)
    # context is required:
    mp_context=multiprocessing.get_context("spawn"),
)
```

This retires each worker process after it has completed 10 instances, releasing its accumulated fragmented memory back to the OS. Note that workers recycle individually, not as a pool: with 30 workers each capped at 10 tasks, processes are replaced continuously throughout the run instead of living for its entire duration, preventing unbounded memory growth.

2. Share metadata by reference (evaluation.py:422, 460)

```python
# Current
out.metadata = self.metadata.model_copy(deep=True)

# Proposed
out.metadata = self.metadata  # Metadata is read-only after initialization
```

Metadata is never modified after being set in the Evaluation constructor, so deep copying is unnecessary. This saves 10-50 KB per instance × attempts.
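The difference between the two assignments can be sketched with minimal pydantic models. LLMConfig and EvalMetadata below are hypothetical stand-ins for the project's actual metadata classes; frozen=True is one way to enforce the read-only assumption that makes reference sharing safe.

```python
from pydantic import BaseModel, ConfigDict

class LLMConfig(BaseModel):
    model_config = ConfigDict(frozen=True)  # guard the read-only assumption
    name: str
    temperature: float = 0.0

class EvalMetadata(BaseModel):
    model_config = ConfigDict(frozen=True)
    llm: LLMConfig
    dataset: str

meta = EvalMetadata(llm=LLMConfig(name="qwen3-coder"), dataset="swt-bench")

copied = meta.model_copy(deep=True)  # allocates a full new object graph
shared = meta                        # same objects, zero extra allocation

assert copied.llm is not meta.llm    # deep copy duplicated the nested config
assert shared.llm is meta.llm        # reference sharing did not
```

With frozen models, any accidental mutation of the shared metadata raises at the call site instead of silently corrupting other instances' results.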

Why this needs to be fixed

  1. Blocking SWTBench evaluations: Qwen3-Coder jobs consistently OOM, preventing completion of evaluation runs
  2. Unpredictable failures: OOM can occur at any point depending on retry patterns and resource scaling
  3. Wastes compute resources: Jobs run for hours before failing, wasting expensive GPU/CPU time
  4. Blocks CI/CD: Evaluation pipelines cannot reliably complete
  5. Affects all high-worker-count runs: Any benchmark with 30 workers on long-running instances will eventually hit this

Related issues

Implementation notes

The proposed fixes are:

  • Low risk: max_tasks_per_child is a standard Python 3.11+ feature (note that it requires a spawn or forkserver multiprocessing context; it is incompatible with fork)
  • Backward compatible: existing evaluations simply see lower memory usage
  • Well-tested pattern: worker recycling (cf. maxtasksperchild in multiprocessing.Pool) is widely used in production for memory-intensive parallel workloads
  • Minimal code change: only a few lines in evaluation.py

The fix should be applied before the next major evaluation run to prevent continued OOM failures.
