Problem: runner leaks TCP connections — newHTTPClient() on every poll cycle exhausts ephemeral ports #3941

Description

@BK0STAR

Issue

The remote runner leaks TCP connections to the server until it exhausts all ephemeral source ports, after which it can no longer poll for jobs and all tasks remain stuck in waiting.

Root cause (in services/runners/job_pool.go):

sendProgress() and checkNewJobs() each call newHTTPClient() on every invocation, and both are invoked every second by the requestTimer ticker in JobPool.Run():

func (p *JobPool) sendProgress() (ok bool) {
      client := newHTTPClient()   // new http.Transport on every call
      ...
}

func (p *JobPool) checkNewJobs() {
      ...
      client := newHTTPClient()   // new http.Transport on every call
      ...
}

Each newHTTPClient() call creates a fresh http.Transport with its own keep-alive connection pool. A connection opened in one poll cycle is therefore never reused in the next, and because nothing closes it, it stays ESTABLISHED as an idle connection inside an abandoned transport. At ~2 requests per second, this leaks roughly 2 connections per second.

Observed impact (Pro v2.18.8, Kubernetes/OpenShift, remote runners):

After ~12h of uptime, each runner pod held ~28,200 ESTABLISHED connections to the server service:

$ netstat -tn | awk '{print $6}' | sort | uniq -c
  28232 ESTABLISHED

all pointing to the server:

  28232 172.30.54.155:3000

At that point every new outbound dial fails and the runner logs, twice per second:

level=error msg="Put \"http://semaphore:3000/api/internal/runners\": dial tcp 172.30.54.155:3000: connect: cannot assign requested address" action="send request" context=sending_progress
level=error msg="Get \"http://semaphore:3000/api/internal/runners\": dial tcp 172.30.54.155:3000: connect: cannot assign requested address" action="send request" context="checking new jobs"

Tasks dispatched to the runner stay in waiting forever. The accumulated connections (TCP buffers plus transports that the GC cannot reclaim) also inflate memory usage over time; we initially hit OOMKills at a 768Mi limit before identifying this leak.

Aggravating detail: in checkNewJobs(), defer resp.Body.Close() is placed after the early return for resp.StatusCode >= 400, so whenever the server answers with an error status the response body is never closed and leaks as well.

Suggested fix

Create the http.Client once (e.g. a field on JobPool, or a package-level lazily-initialized client) and reuse it in sendProgress(), checkNewJobs(), tryRegisterRunner() and Unregister(). The default
http.Transport keep-alive pool will then reuse a single connection instead of leaking one per request. Also move defer resp.Body.Close() before the status-code check in checkNewJobs().

Workaround

A Kubernetes liveness probe that fails once the runner holds 5000 or more ESTABLISHED connections, so the runner is restarted well before port exhaustion:

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - '[ "$(netstat -tn 2>/dev/null | grep -c ESTABLISHED)" -lt 5000 ]'
  initialDelaySeconds: 120
  periodSeconds: 60
  failureThreshold: 3

Impact

Ansible (task execution)

Installation method

Kubernetes

Database

Postgres

Semaphore Version

Pro v2.18.8 (server and runner) — bug still present in current develop source

Additional information

Two independent runner pods on the same cluster exhibited identical behavior (28,232 and 28,234 connections respectively after ~12h).

Impact

Configuration

Installation method

Docker

