Summary
I hit a failure mode where the orchestrator appeared healthy enough to keep running, but the runtime behavior did not match the current source tree and the app eventually crashed during terminal-state reconciliation.
This looks like a stale / mismatched escript build artifact problem rather than a bad project WORKFLOW.md config.
What I observed
- The live /api/v1/state response shape did not match the current Presenter.state_payload/2 source.
- I saw legacy-looking keys such as tracked_issues, backoff_queue, and merge_queue.
- The current source returns keys like tracked, retrying, and merge.
- After a worker finished and the orchestrator reconciled terminal state, the app crashed and later restarted.
- During the bad window, the dashboard/API became inconsistent (snapshot unavailable, delayed dispatch of newly unblocked issues).
- After restart, dispatch resumed without any project-config changes.
Sanitized crash evidence
The primary crash was an undef during worker termination / terminal-state reconciliation:
error: Supervisor: {local,'Elixir.SymphonyElixir.Supervisor'}. Context: child_terminated.
Reason: {undef,[
{'Elixir.SymphonyElixir.PiAnalytics',emit_symphony_run,...},
{'Elixir.SymphonyElixir.Orchestrator',terminate_running_issue,5,[{file,"lib/symphony_elixir/orchestrator.ex"},{line,936}]},
{'Elixir.SymphonyElixir.Orchestrator',reconcile_running_issue_states,4,...},
{'Elixir.SymphonyElixir.Orchestrator',reconcile_running_issues,1,...},
{'Elixir.SymphonyElixir.Orchestrator',maybe_dispatch,1,...}
]}
Immediately after, shutdown/restart emitted another undef-style error in Phoenix shutdown handling:
error: Generic server ... terminating.
Reason: {'module could not be loaded',[
{'Elixir.Stream.Reducers',chunk_every,...},
{'Elixir.Phoenix.Socket.PoolDrainer',terminate,2,...}
]}
Why I think this is a stale-build / runtime-mismatch issue
- The current source tree contains lib/symphony_elixir/pi_analytics.ex, so undef on SymphonyElixir.PiAnalytics.emit_symphony_run/2 strongly suggests the running executable was built from older code.
- The live API response shape also did not match the current checked-out source.
- Once the process restarted, the previously stalled dispatch path resumed without any WORKFLOW.md changes.
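
One way to confirm the mismatch from inside the running VM is to ask whether the function the trace names is actually callable. The sketch below is a hypothetical helper (RuntimeCheck is not part of Symphony) and assumes some way to evaluate code in the running node or at startup; it relies only on Code.ensure_loaded?/1 and function_exported?/3 from the Elixir standard library.

```elixir
defmodule RuntimeCheck do
  @moduledoc """
  Hypothetical helper (not part of Symphony): reports whether a
  module/function/arity is actually callable in the running VM.
  """

  @spec callable?(module(), atom(), arity()) :: boolean()
  def callable?(mod, fun, arity) do
    # Code.ensure_loaded?/1 tries to load the module from the code path
    # (for an escript, from the embedded archive) and returns a boolean.
    # function_exported?/3 then checks the loaded module's public exports.
    Code.ensure_loaded?(mod) and function_exported?(mod, fun, arity)
  end
end
```

In a stale build like the one described here, RuntimeCheck.callable?(SymphonyElixir.PiAnalytics, :emit_symphony_run, 2) would be expected to return false even though the source file exists in the checkout.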
Expected behavior
- Running an out-of-date escript should fail loudly at startup, or at least report a clear build/source mismatch.
- Analytics emission should never be able to crash the orchestrator.
- Snapshot/dashboard behavior should degrade gracefully if one subsystem misbehaves.
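
As a rough illustration of the "fail loudly at startup" expectation, here is a minimal sketch of a build guard. Everything in it is an assumption for illustration: the BuildCheck module, its function names, and the idea that the artifact carries an embedded build identifier (for example a git SHA captured at compile time) that can be compared with the identifier the operator expects.

```elixir
defmodule BuildCheck do
  @moduledoc """
  Hypothetical startup guard (names are illustrative, not Symphony's API).
  Compares the build identifier baked into the artifact with the one the
  operator expects, e.g. the current commit of the checkout.
  """

  @type result :: :ok | {:error, {:build_mismatch, String.t(), String.t()}}

  # Non-raising variant: useful for surfacing the mismatch in an API payload.
  @spec verify(String.t(), String.t()) :: result()
  def verify(embedded, expected) when embedded == expected, do: :ok
  def verify(embedded, expected), do: {:error, {:build_mismatch, embedded, expected}}

  # Raising variant: intended to abort startup with a clear message.
  @spec verify!(String.t(), String.t()) :: :ok
  def verify!(embedded, expected) do
    case verify(embedded, expected) do
      :ok ->
        :ok

      {:error, {:build_mismatch, built, want}} ->
        raise "build/source mismatch: artifact built from #{built}, expected #{want}"
    end
  end
end
```

How the embedded identifier is captured is left open here; a module attribute evaluated at compile time is one possibility. The non-raising verify/2 result could also feed a version field in the state payload.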
Suggested fixes
- Add build/source version metadata to the escript and validate it at startup.
- If the built artifact does not match the checked-out source/build metadata, fail loudly.
- Wrap analytics emission defensively. emit_symphony_run_analytics(...) should not be able to bring down the orchestrator.
- Log a warning and continue if analytics emission fails.
- Add a regression test for terminal-state reconciliation when analytics/runtime modules are unavailable.
- Consider including an explicit API/schema/build version in /api/v1/state to make runtime/source mismatch diagnosis much easier.
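
The defensive wrapping suggested above might look roughly like the following. This is a sketch, not Symphony's actual code: SafeAnalytics and safe_emit/1 are hypothetical names, and the real call site in the orchestrator may need a different shape. It assumes Elixir 1.11 or later for Logger.warning/1.

```elixir
defmodule SafeAnalytics do
  @moduledoc """
  Hypothetical wrapper (not Symphony's actual code): runs an analytics
  callback and converts any failure into a logged warning.
  """
  require Logger

  @spec safe_emit((-> any())) :: :ok | {:error, term()}
  def safe_emit(fun) when is_function(fun, 0) do
    fun.()
    :ok
  rescue
    # An undef surfaces in Elixir as UndefinedFunctionError, an exception,
    # so a plain rescue covers the kind of crash shown in the trace.
    error ->
      Logger.warning("analytics emission failed: " <> Exception.message(error))
      {:error, error}
  catch
    # Exits and throws are not exceptions; catch them separately.
    kind, reason ->
      Logger.warning("analytics emission failed: #{inspect({kind, reason})}")
      {:error, {kind, reason}}
  end
end
```

A caller would pass the emission as a zero-arity function, for example SafeAnalytics.safe_emit(fn -> SymphonyElixir.PiAnalytics.emit_symphony_run(run, result) end), where run and result stand in for whatever arguments the real function takes. Note that a wrapper like this only helps if it is itself present in the running build, so it complements rather than replaces a startup build check.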
Why this matters
This failure mode is confusing operationally:
- the process may still appear alive
- the dashboard/API can become partially inconsistent
- newly unblocked issues may not dispatch until restart/recovery
That makes it easy to misdiagnose as a repo/workflow configuration issue when the root problem is actually the Symphony runtime/build itself.