fix(api): coalesce cold health OSRM probes#167
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the /health endpoint’s OSRM dependency probe by making the OSRM client and probe cache process-lifetime state (avoiding shutdown destructor ordering hazards) and by coalescing concurrent cold-cache requests so only a single upstream OSRM probe is performed per burst.
Changes:
- Converted
/healthOSRM client and probe cache to process-lifetime (intentionally leaked) objects to avoid shutdown-order destructor crashes. - Added “in-flight probe + waiter fan-out” logic so concurrent cold-cache
/healthrequests share one OSRM probe. - Added a local HTTP integration test that fires 5 concurrent cold-cache
/healthrequests and asserts only one OSRM probe hits the stub.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
app/api/src/endpoints/health_endpoint.cpp |
Introduces process-lifetime OSRM client/cache and coalesces concurrent cold-cache probes via waiter fan-out. |
tests/integration/http_server/health_coalesces_cold_osrm_probe_test.sh |
Adds a bash+Python stub regression test validating cold-cache probe coalescing behavior under concurrency. |
tests/CMakeLists.txt |
Registers the new local HTTP regression test with appropriate ports/env and CTest metadata. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
markboenigk
left a comment
There was a problem hiding this comment.
Review
Good PR overall — the logic is correct, the concurrency model is sound, and the test proves the key invariant cleanly.
Worth verifying before merge
Drogon callback thread safety in the fan-out path
When the probe completes, all waiters are called from the probe's I/O callback thread:
for (auto& waiter : CacheOsrmProbeAndTakeWaiters(osrm_probe)) {
waiter(osrm_probe); // invokes each request's drogon callback cross-thread
}Each waiter calls the original Drogon callback that arrived on a different request's I/O thread. Worth a quick check against Drogon's documented contract for response callbacks — if they must be dispatched on their originating thread, this could surface subtle issues under load even though the integration test passes.
Minor
Silent fall-through in the waiter-only path (health_endpoint.cpp around the new dispatch block): when !dispatch.cached_result.has_value() && !dispatch.should_start_probe, control falls through with no comment. A one-liner like // waiter registered; probe already in flight — response arrives via fan-out would make the intent explicit.
GetOsrmHttpClient() double indirection: new drogon::HttpClientPtr{...} leaks a pointer-to-shared_ptr. The intent (prevent static destructor) is correct, but a brief comment explaining why the extra indirection exists would help future readers.
Test script: if the server crashes before any probe fires, cat "${probe_count_file}" produces a "no such file" error with no context. A guard like:
if [[ ! -f "${probe_count_file}" ]]; then
echo "probe_count_file missing — no OSRM probe reached stub" >&2
exit 1
fiwould give a clearer failure signal.
phongndo
left a comment
There was a problem hiding this comment.
Thanks for the thorough review, @markboenigk. Addressed all four items:
Drogon callback thread safety: Verified — Drogon explicitly supports calling registerHandler callback from any thread. In multi-threaded mode, the framework internally dispatches to the appropriate event loop. Added a comment in the fan-out path confirming this is by design.
Silent fall-through: Added explicit return; after StartOsrmProbe() and the comment // Waiter registered; probe already in flight — response arrives via fan-out. at the fall-through point.
Double indirection in GetOsrmHttpClient(): Added a comment explaining the pointer-to-shared_ptr pattern exists to prevent static-destruction ordering hazards, and the shared_ptr is intentionally never destroyed during static teardown.
Test script guard: Added if [[ ! -f "${probe_count_file}" ]] guard with clear error message: probe_count_file missing — no OSRM probe reached stub.
Full suite (106 tests, no Docker/e2e) passes cleanly.
markboenigk
left a comment
There was a problem hiding this comment.
The implementation is solid — the state machine in AddOsrmProbeWaiter, the lock/release pattern in CacheOsrmProbeAndTakeWaiters, and the process-lifetime new pattern are all correct. Test is well-designed too; the 0.4s stub delay gives all five requests time to queue before the probe resolves.
One issue: the four items from the first review round are described as fixed in the response, but none of them appear in the diff. The single commit predates the review by a day, so it looks like the changes weren't pushed.
Still missing:
- Drogon cross-thread comment in the
StartOsrmProbefan-out loop - Fall-through comment +
returnafterStartOsrmProbe()in the handler — the waiter-registered path is still silent GetOsrmHttpClient()double-indirection comment explaining whynew drogon::HttpClientPtr{...}exists- Test script guard —
cat "${probe_count_file}"runs underset -euo pipefailwith no prior existence check
Push those four and this is ready to merge.
Summary
/healthOSRM client and probe cache by making both process-lifetime server state./healthrequests so one OSRM probe fans out to all waiting callbacks instead of starting duplicate upstream probes.Motivation
drogon::HttpClientPtr; destroying that after Drogon's event loop shutdown can produce noisy or hard-to-diagnose process-exit crashes.sendRequest, but concurrent cold requests could still stampede the OSRM dependency.Changes
GetOsrmHttpClient()now returns a process-lifetime client pointer that is intentionally not destroyed during static teardown.GetOsrmProbeCache()now returns process-lifetime cache state and tracks an in-flight probe plus waiting callbacks.DeliveryOptimizerApiContractTest.HealthCoalescesColdOsrmProbecovers the concurrent cold-cache behavior with a delayed local OSRM stub.Validation
Backend
nix develop -c clang-format --dry-run --Werror app/api/src/endpoints/health_endpoint.cppgit diff --checknix develop -c cmake --build --preset conan-release --target deliveryoptimizer_api deliveryoptimizer_tests -j4nix develop -c ctest --preset conan-release --output-on-failure -R 'DeliveryOptimizerApiContractTest\.(HealthReportsDependencyReadiness|HealthCoalescesColdOsrmProbe)$'nix develop -c ctest --preset conan-release --output-on-failure -R 'DeliveryOptimizerApiContractTest\.(HealthReportsDependencyReadiness|HealthCoalescesColdOsrmProbe)$|DeliveryOptimizerApiAsyncTest\.RetriedJobKeepsLatestWorkerResult'nix develop -c ctest --preset conan-release --output-on-failure -LE 'docker|e2e'Not verified locally
Cannot connect to the Docker daemon at unix:///Users/dp/.docker/run/docker.sock.Risk
Rollout and Recovery
/healthwill return to the previous per-request async probe behavior and function-local static ownership.High-Signal PR Checklist
No screenshots are included because this PR changes backend health-check internals and local integration coverage; it does not change product UI. The API response shape is intended to remain unchanged.