EphemeralRunner stuck in failed state if the job it was allocated to is cancelled #4091

Open
@niodice

Description

Controller Version

0.11.0

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This is a subtle timing issue. I believe it is reproducible when a GHA job is queued and then quickly cancelled.

Describe the bug

When an ephemeral runner starts up and the job it was assigned to has already been cancelled, the EphemeralRunner sometimes ends up in the Failed phase.

The EphemeralRunner has this status:

$ kubectl describe ephemeralrunner/sculpt-ttqvx-runner-2s4qv
...
Status:
  Failures:
    2a9149f5-02da-475f-a15f-52f429182f60:  true
    4be0e5df-5c17-4770-9071-c68ae9723ac9:  true
    519cf001-8b1b-478c-b28a-3dc0d44b0109:  true
    7f0ee4e7-72bd-4324-81c9-b325dda1d029:  true
    a82f7aa6-975c-4437-af5c-8e7a2bfcf44a:  true
    c9c461d0-3815-4ec7-afd0-825e81ff0e23:  true
  Message:                                 Pod has failed to start more than 5 times:
  Phase:                                   Failed
  Ready:                                   false
  Reason:                                  TooManyPodFailures
  Runner Id:                               41804
  Runner JIT Config:                       <omitting>
  Runner Name:                             sculpt-ttqvx-runner-2s4qv
Events:
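
For anyone triaging the same symptom, here is a minimal sketch of how the status above can be read: six entries under Failures, which is over the limit of 5 named in the Message. The helper name `summarize_status` is hypothetical, the threshold is inferred from the message text rather than from the controller source, and the lower-case JSON keys are an assumption about how these fields appear in `kubectl get ephemeralrunner -o json` output.

```python
import json

# Threshold inferred from the status message
# "Pod has failed to start more than 5 times" (an assumption, not controller code).
MAX_POD_FAILURES = 5


def summarize_status(status):
    """Summarize an EphemeralRunner status dict (hypothetical helper)."""
    failures = status.get("failures") or {}
    return {
        "phase": status.get("phase"),
        "reason": status.get("reason"),
        "failure_count": len(failures),
        "over_threshold": len(failures) > MAX_POD_FAILURES,
    }


# Sample mirroring the status reported in this issue. Key names are assumed
# to be the lower-case forms of the fields shown by `kubectl describe`.
SAMPLE_STATUS = json.loads("""
{
  "failures": {
    "2a9149f5-02da-475f-a15f-52f429182f60": true,
    "4be0e5df-5c17-4770-9071-c68ae9723ac9": true,
    "519cf001-8b1b-478c-b28a-3dc0d44b0109": true,
    "7f0ee4e7-72bd-4324-81c9-b325dda1d029": true,
    "a82f7aa6-975c-4437-af5c-8e7a2bfcf44a": true,
    "c9c461d0-3815-4ec7-afd0-825e81ff0e23": true
  },
  "phase": "Failed",
  "reason": "TooManyPodFailures"
}
""")

if __name__ == "__main__":
    print(summarize_status(SAMPLE_STATUS))
```

Each key under Failures looks like a distinct pod attempt, so the count here reflects six separate pods that exited before the runner was marked Failed.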

Some logs from the pod that fails to start:

[RUNNER 2025-05-19 16:32:14Z ERR  GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=<omitted>&status=Online&runnerVersion=2.324.0&os=Linux&architecture=X64&disableUpdate=true failed. HTTP Status: NotFound

[RUNNER 2025-05-19 16:32:14Z INFO Runner] Deleting Runner Session...

[RUNNER 2025-05-19 16:32:14Z ERR  Terminal] WRITE ERROR: An error occurred: Runner not found

[RUNNER 2025-05-19 16:32:14Z ERR  Listener]    at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)

Runner listener exit with terminated error, stop the service, no retry needed.

Exiting runner...
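
To tell this case apart from a pod that genuinely cannot start, one option is to look for the two signatures in the log excerpt above: the broker request failing with `HTTP Status: NotFound` and the `Runner not found` error. The sketch below is only a triage heuristic under that assumption; the function name `looks_like_runner_not_found` is hypothetical and this is not how the controller classifies failures.

```python
import re

# Signatures copied from the runner pod log excerpt in this issue.
NOT_FOUND_PATTERNS = (
    re.compile(r"failed\. HTTP Status: NotFound"),
    re.compile(r"Runner not found"),
)


def looks_like_runner_not_found(log_lines):
    """Return True if any log line matches a 'runner not found' signature.

    Hypothetical helper for manual triage of runner pod logs.
    """
    return any(
        pattern.search(line)
        for line in log_lines
        for pattern in NOT_FOUND_PATTERNS
    )


# Two lines taken from the excerpt above, plus an unrelated one for contrast.
SAMPLE_LOG = [
    "[RUNNER 2025-05-19 16:32:14Z INFO Runner] Deleting Runner Session...",
    "[RUNNER 2025-05-19 16:32:14Z ERR  Terminal] WRITE ERROR: An error occurred: Runner not found",
]

if __name__ == "__main__":
    print(looks_like_runner_not_found(SAMPLE_LOG))
```

It could be fed the lines from `kubectl logs` for one of the failed pods, if the pod is still around.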

Describe the expected behavior

The EphemeralRunner should not enter the Failed phase in this case.

Additional Context

N/A

Controller Logs

https://gist.github.com/niodice/cc77fbf8ca7ec996c9b418c36f35d9d1

Runner Pod Logs

See above

Labels

  • bug: Something isn't working
  • gha-runner-scale-set: Related to the gha-runner-scale-set mode
  • needs triage: Requires review from the maintainers