Stale/mismatched escript can crash orchestrator during terminal-state reconciliation #82

@tmustier

Summary

I hit a failure mode in which the orchestrator kept running and looked healthy, but its runtime behavior did not match the current source tree, and the app eventually crashed during terminal-state reconciliation.

This looks like a stale or mismatched escript build artifact rather than a bad project WORKFLOW.md config.

What I observed

  • The live /api/v1/state response shape did not match the current Presenter.state_payload/2 source.
    • The live response contained legacy-looking keys such as tracked_issues, backoff_queue, and merge_queue.
    • The current source returns keys like tracked, retrying, and merge.
  • After a worker finished and the orchestrator reconciled terminal state, the app crashed and later restarted.
  • During the bad window, the dashboard/API became inconsistent (snapshot unavailable, delayed dispatch of newly unblocked issues).
  • After restart, dispatch resumed without any project-config changes.

Sanitized crash evidence

The primary crash was an undef during worker termination / terminal-state reconciliation:

error: Supervisor: {local,'Elixir.SymphonyElixir.Supervisor'}. Context: child_terminated.
Reason: {undef,[
  {'Elixir.SymphonyElixir.PiAnalytics',emit_symphony_run,...},
  {'Elixir.SymphonyElixir.Orchestrator',terminate_running_issue,5,[{file,"lib/symphony_elixir/orchestrator.ex"},{line,936}]},
  {'Elixir.SymphonyElixir.Orchestrator',reconcile_running_issue_states,4,...},
  {'Elixir.SymphonyElixir.Orchestrator',reconcile_running_issues,1,...},
  {'Elixir.SymphonyElixir.Orchestrator',maybe_dispatch,1,...}
]}

Immediately afterward, the restart emitted a second undef-style error from Phoenix shutdown handling:

error: Generic server ... terminating.
Reason: {'module could not be loaded',[
  {'Elixir.Stream.Reducers',chunk_every,...},
  {'Elixir.Phoenix.Socket.PoolDrainer',terminate,2,...}
]}

Why I think this is a stale-build / runtime-mismatch issue

  • The current source tree contains lib/symphony_elixir/pi_analytics.ex, so undef on SymphonyElixir.PiAnalytics.emit_symphony_run/2 strongly suggests the running executable was built from older code.
  • The live API response shape also did not match the current checked-out source.
  • Once the process restarted, the previously stalled dispatch path resumed without any WORKFLOW.md changes.
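One way to confirm this kind of mismatch from inside the running VM is to ask whether the module and function named in the undef are actually loadable there. The following is a minimal sketch, not code from the project: `RuntimeCheck` and `check/3` are hypothetical names, and it assumes you can evaluate code in the affected runtime.

```elixir
defmodule RuntimeCheck do
  @moduledoc "Hypothetical helper: reports whether a function is callable in the running VM."

  # Returns :ok, {:error, :module_not_loaded}, or {:error, :function_not_exported}.
  def check(module, function, arity) do
    cond do
      # The module is missing from the code path of this runtime.
      not Code.ensure_loaded?(module) -> {:error, :module_not_loaded}
      # The module exists, but this build of it lacks the function/arity.
      not function_exported?(module, function, arity) -> {:error, :function_not_exported}
      true -> :ok
    end
  end
end
```

For the crash above, `RuntimeCheck.check(SymphonyElixir.PiAnalytics, :emit_symphony_run, 2)` returning an error in the live runtime, while the source tree defines the function, would point to a stale artifact.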

Expected behavior

  • Running an out-of-date escript should fail loudly at startup, or at least report a clear build/source mismatch.
  • Analytics emission should never be able to crash the orchestrator.
  • Snapshot/dashboard behavior should degrade gracefully if one subsystem misbehaves.
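To illustrate the second expectation, here is a rough sketch of a wrapper that turns any failure in a best-effort call into a logged warning and an error tuple. `SafeEmit` and `run/2` are hypothetical names, not part of the Symphony codebase. An undef surfaces in Elixir as an `UndefinedFunctionError`, so the `rescue` clause covers it.

```elixir
defmodule SafeEmit do
  @moduledoc "Hypothetical wrapper for best-effort side effects such as analytics."
  require Logger

  # Runs `fun` and converts any raise, throw, or exit into {:error, reason}
  # so the caller (for example the orchestrator) keeps running.
  def run(label, fun) when is_function(fun, 0) do
    {:ok, fun.()}
  rescue
    error ->
      # Covers exceptions, including UndefinedFunctionError from a stale build.
      Logger.warning("#{label} failed: " <> Exception.message(error))
      {:error, error}
  catch
    kind, reason ->
      # Covers throws and exits raised inside `fun`.
      Logger.warning("#{label} failed: #{inspect({kind, reason})}")
      {:error, {kind, reason}}
  end
end
```

The call site in `terminate_running_issue/5` could then wrap the analytics call in `SafeEmit.run("analytics", fn -> ... end)` and ignore the result. This only protects the orchestrator from the analytics failure. It does not address the later Phoenix shutdown error.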

Suggested fixes

  1. Add build/source version metadata to the escript and validate it at startup.
    • If the built artifact does not match the checked-out source/build metadata, fail loudly.
  2. Wrap analytics emission defensively.
    • emit_symphony_run_analytics(...) should not be able to bring down the orchestrator.
    • Log a warning and continue if analytics emission fails.
  3. Add a regression test for terminal-state reconciliation when analytics/runtime modules are unavailable.
  4. Consider including an explicit API/schema/build version in /api/v1/state to make runtime/source mismatch diagnosis much easier.
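Fixes 1 and 4 could share one small module. The sketch below is only an illustration under stated assumptions: `BuildInfo` is a hypothetical module, `BUILD_REV` is an assumed environment variable that the build script would set (for example to the current commit hash), and the expected revision is assumed to be supplied by whatever launches the escript.

```elixir
defmodule BuildInfo do
  @moduledoc "Hypothetical build-metadata module; names and env var are assumptions."

  # Evaluated once, when this module is compiled into the escript.
  # BUILD_REV is an assumed variable exported by the build script.
  @build_rev System.get_env("BUILD_REV", "unknown")

  def build_rev, do: @build_rev

  # Compares the revision baked into the artifact with the expected one.
  def verify(expected_rev, actual_rev \\ @build_rev) do
    cond do
      actual_rev == "unknown" ->
        {:error, :no_build_metadata}

      actual_rev == expected_rev ->
        :ok

      true ->
        {:error, {:build_mismatch, %{expected: expected_rev, actual: actual_rev}}}
    end
  end

  # Startup variant: raises so a stale artifact fails loudly instead of running.
  def verify!(expected_rev, actual_rev \\ @build_rev) do
    case verify(expected_rev, actual_rev) do
      :ok -> :ok
      {:error, reason} -> raise "stale or unverifiable build: #{inspect(reason)}"
    end
  end

  # Adds the revision to a state payload map so API consumers can see it.
  def stamp(payload) when is_map(payload) do
    Map.put(payload, :build_rev, @build_rev)
  end
end
```

Calling `BuildInfo.verify!/1` early in application start would cover fix 1, and passing the `/api/v1/state` payload through `BuildInfo.stamp/1` would cover fix 4. How the expected revision reaches the runtime (a flag, a file in the checkout, or something else) is left open here.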

Why this matters

This failure mode is confusing operationally:

  • the process may still appear alive
  • the dashboard/API can become partially inconsistent
  • newly unblocked issues may not dispatch until restart/recovery

That makes it easy to misdiagnose as a repo/workflow configuration issue when the root problem is actually the Symphony runtime/build itself.
