Issue
The remote runner leaks TCP connections to the server until it exhausts all ephemeral source ports, after which it can no longer poll for jobs and all tasks remain stuck in waiting.
Root cause (in services/runners/job_pool.go):
sendProgress() and checkNewJobs() each call newHTTPClient() on every invocation, and both are invoked every second by the requestTimer ticker in JobPool.Run():
func (p *JobPool) sendProgress() (ok bool) {
    client := newHTTPClient() // new http.Transport on every call
    ...
}

func (p *JobPool) checkNewJobs() {
    ...
    client := newHTTPClient() // new http.Transport on every call
    ...
}
Each newHTTPClient() call creates a fresh http.Transport with its own keep-alive connection pool. A connection is therefore never reused across poll cycles, and the idle connection is never closed: it stays ESTABLISHED inside an abandoned transport. At ~2 requests/second the runner accumulates roughly 2 new connections per second.
Observed impact (Pro v2.18.8, Kubernetes/OpenShift, remote runners):
After ~12h of uptime, each runner pod held ~28,200 ESTABLISHED connections to the server service:
$ netstat -tn | awk '{print $6}' | sort | uniq -c
28232 ESTABLISHED
All of them point to the server.
At that point every new outbound dial fails and the runner logs, twice per second:
level=error msg="Put \"http://semaphore:3000/api/internal/runners\": dial tcp 172.30.54.155:3000: connect: cannot assign requested address" action="send request" context=sending_progress
level=error msg="Get \"http://semaphore:3000/api/internal/runners\": dial tcp 172.30.54.155:3000: connect: cannot assign requested address" action="send request" context="checking new jobs"
Tasks dispatched to the runner stay in waiting forever. The accumulated connections (TCP buffers plus the abandoned transports that stay on the heap) also inflate memory usage over time. We initially hit OOMKills at a 768Mi limit before identifying this leak.
Aggravating detail: in checkNewJobs(), defer resp.Body.Close() is placed after the early return for resp.StatusCode >= 400, so in error mode response bodies leak as well.
Suggested fix
Create the http.Client once (e.g. a field on JobPool, or a package-level lazily-initialized client) and reuse it in sendProgress(), checkNewJobs(), tryRegisterRunner() and Unregister(). The default http.Transport keep-alive pool will then reuse a single connection instead of leaking one per request. Also move defer resp.Body.Close() before the status-code check in checkNewJobs().
Workaround
Kubernetes liveness probe that restarts the runner pod before port exhaustion:
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - '[ "$(netstat -tn 2>/dev/null | grep -c ESTABLISHED)" -lt 5000 ]'
  initialDelaySeconds: 120
  periodSeconds: 60
  failureThreshold: 3
Impact
Ansible (task execution)
Installation method
Kubernetes
Database
Postgres
Semaphore Version
Pro v2.18.8 (server and runner) — bug still present in current develop source
Additional information
Two independent runner pods on the same cluster exhibited identical behavior (28,232 and 28,234 connections respectively after ~12h).