Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.11.0
Deployment Method
ArgoCD
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
This appears to be a subtle timing issue. I believe it reproduces when a GHA job is queued and then quickly cancelled.
Describe the bug
When an ephemeral runner starts up and is assigned to a job that has been cancelled, the EphemeralRunner sometimes ends up in a failed state.
The EphemeralRunner has this status:
$ kubectl describe ephemeralrunner/sculpt-ttqvx-runner-2s4qv
...
Status:
  Failures:
    2a9149f5-02da-475f-a15f-52f429182f60: true
    4be0e5df-5c17-4770-9071-c68ae9723ac9: true
    519cf001-8b1b-478c-b28a-3dc0d44b0109: true
    7f0ee4e7-72bd-4324-81c9-b325dda1d029: true
    a82f7aa6-975c-4437-af5c-8e7a2bfcf44a: true
    c9c461d0-3815-4ec7-afd0-825e81ff0e23: true
  Message: Pod has failed to start more than 5 times:
  Phase: Failed
  Ready: false
  Reason: TooManyPodFailures
  Runner Id: 41804
  Runner JIT Config: <omitting>
  Runner Name: sculpt-ttqvx-runner-2s4qv
Events:
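For anyone who wants to check whether other runners are stuck in the same state, here is a rough Python sketch that filters the JSON form of the EphemeralRunner list (for example, the output of `kubectl get ephemeralrunner -o json`). The helper name `failed_runners` is hypothetical, and the lower-case JSON keys (`phase`, `reason`, `failures`) are my assumption based on the field names that `kubectl describe` prints above; I have not verified them against the CRD.

```python
def failed_runners(runner_list, reason="TooManyPodFailures"):
    """Return (name, failure_count) for each runner that failed for `reason`.

    `runner_list` is the parsed JSON of a Kubernetes List object.
    The status keys used here are assumptions, not confirmed CRD fields.
    """
    matches = []
    for item in runner_list.get("items", []):
        status = item.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == reason:
            name = item.get("metadata", {}).get("name", "")
            matches.append((name, len(status.get("failures", {}))))
    return matches


# Sample input mirroring the status shown above (abridged to the fields used).
sample = {
    "items": [
        {
            "metadata": {"name": "sculpt-ttqvx-runner-2s4qv"},
            "status": {
                "phase": "Failed",
                "reason": "TooManyPodFailures",
                "failures": {
                    "2a9149f5-02da-475f-a15f-52f429182f60": True,
                    "4be0e5df-5c17-4770-9071-c68ae9723ac9": True,
                    "519cf001-8b1b-478c-b28a-3dc0d44b0109": True,
                    "7f0ee4e7-72bd-4324-81c9-b325dda1d029": True,
                    "a82f7aa6-975c-4437-af5c-8e7a2bfcf44a": True,
                    "c9c461d0-3815-4ec7-afd0-825e81ff0e23": True,
                },
            },
        }
    ]
}

print(failed_runners(sample))
```

With the sample above, the runner is reported with six recorded failures, which matches the "more than 5 times" message in the status.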
Some logs from the pod that fails to start:
[RUNNER 2025-05-19 16:32:14Z ERR GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=<omitted>&status=Online&runnerVersion=2.324.0&os=Linux&architecture=X64&disableUpdate=true failed. HTTP Status: NotFound
[RUNNER 2025-05-19 16:32:14Z INFO Runner] Deleting Runner Session...
[RUNNER 2025-05-19 16:32:14Z ERR Terminal] WRITE ERROR: An error occurred: Runner not found
[RUNNER 2025-05-19 16:32:14Z ERR Listener] at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
Runner listener exit with terminated error, stop the service, no retry needed.
Exiting runner...
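To tell this failure apart from other pod start failures, a small check like the following could scan a runner pod log for the two markers visible above: the `NotFound` response from the broker message endpoint and the `Runner not found` error. This is only a sketch. The function name `shows_runner_not_found` is hypothetical, and I am assuming these two strings are a reliable signature, which I have only observed in the logs from this one report.

```python
def shows_runner_not_found(log_text):
    """Return True if the log contains both markers seen in this report."""
    has_broker_404 = False
    has_not_found = False
    for line in log_text.splitlines():
        # Marker 1: the broker message poll came back with HTTP NotFound.
        if ("broker.actions.githubusercontent.com/message" in line
                and "HTTP Status: NotFound" in line):
            has_broker_404 = True
        # Marker 2: the runner reports that it no longer exists.
        if "Runner not found" in line:
            has_not_found = True
    return has_broker_404 and has_not_found


# Log excerpt copied from this report (session id omitted as in the original).
sample_log = """\
[RUNNER 2025-05-19 16:32:14Z ERR GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=<omitted>&status=Online&runnerVersion=2.324.0&os=Linux&architecture=X64&disableUpdate=true failed. HTTP Status: NotFound
[RUNNER 2025-05-19 16:32:14Z INFO Runner] Deleting Runner Session...
[RUNNER 2025-05-19 16:32:14Z ERR Terminal] WRITE ERROR: An error occurred: Runner not found
"""

print(shows_runner_not_found(sample_log))
```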
Describe the expected behavior
The EphemeralRunner should not enter a failed state in this case.
Additional Context
N/A
Controller Logs
https://gist.github.com/niodice/cc77fbf8ca7ec996c9b418c36f35d9d1
Runner Pod Logs
See above