Skip to content

fix(api): coalesce cold health OSRM probes#167

Open
phongndo wants to merge 1 commit into
benevolentbandwidth:mainfrom
phongndo:fix/health-endpoint-shutdown-blocking
Open

fix(api): coalesce cold health OSRM probes#167
phongndo wants to merge 1 commit into
benevolentbandwidth:mainfrom
phongndo:fix/health-endpoint-shutdown-blocking

Conversation

@phongndo
Copy link
Copy Markdown
Contributor

Summary

  • Removes shutdown-order-sensitive destructors from the /health OSRM client and probe cache by making both process-lifetime server state.
  • Coalesces concurrent cold-cache /health requests so one OSRM probe fans out to all waiting callbacks instead of starting duplicate upstream probes.
  • Adds a local HTTP regression that fires five concurrent cold-cache health requests and proves only one OSRM probe reaches the stub.

Motivation

  • The health endpoint owned a function-local static drogon::HttpClientPtr; destroying that after Drogon's event loop shutdown can produce noisy or hard-to-diagnose process-exit crashes.
  • The probe cache used the same function-local static pattern around mutex-protected state, leaving it exposed to static-destruction ordering hazards.
  • Cold-cache health checks should stay cheap and predictable under concurrency. The endpoint already used async sendRequest, but concurrent cold requests could still stampede the OSRM dependency.

Changes

  • GetOsrmHttpClient() now returns a process-lifetime client pointer that is intentionally not destroyed during static teardown.
  • GetOsrmProbeCache() now returns process-lifetime cache state and tracks an in-flight probe plus waiting callbacks.
  • The handler now checks cache freshness and registers waiters under the same mutex before starting a probe.
  • The OSRM probe completion caches the result, clears the in-flight marker, and responds to all waiters outside the cache lock.
  • DeliveryOptimizerApiContractTest.HealthCoalescesColdOsrmProbe covers the concurrent cold-cache behavior with a delayed local OSRM stub.

Validation

Backend

  • nix develop -c clang-format --dry-run --Werror app/api/src/endpoints/health_endpoint.cpp
  • git diff --check
  • nix develop -c cmake --build --preset conan-release --target deliveryoptimizer_api deliveryoptimizer_tests -j4
  • nix develop -c ctest --preset conan-release --output-on-failure -R 'DeliveryOptimizerApiContractTest\.(HealthReportsDependencyReadiness|HealthCoalescesColdOsrmProbe)$'
  • nix develop -c ctest --preset conan-release --output-on-failure -R 'DeliveryOptimizerApiContractTest\.(HealthReportsDependencyReadiness|HealthCoalescesColdOsrmProbe)$|DeliveryOptimizerApiAsyncTest\.RetriedJobKeepsLatestWorkerResult'
  • nix develop -c ctest --preset conan-release --output-on-failure -LE 'docker|e2e'

Not verified locally

  • Docker-backed E2E CTest jobs. Docker was not running locally: Cannot connect to the Docker daemon at unix:///Users/dp/.docker/run/docker.sock.

Risk

  • The client/cache are intentionally process-lifetime leaks. That is deliberate for server-global Drogon state and avoids destructor ordering at shutdown.
  • The probe fan-out keeps health responses waiting for the same OSRM timeout window as before, but reduces duplicate upstream probe work during a cold-cache burst.
  • If an OSRM probe hangs until timeout, all waiters still receive the same timeout-derived degraded response once Drogon completes the async request.

Rollout and Recovery

  • Roll out by merging the endpoint and regression test together.
  • Recover by reverting this commit; /health will return to the previous per-request async probe behavior and function-local static ownership.

High-Signal PR Checklist

  • The summary states the operational outcome, not just file names.
  • The motivation explains why this change is needed now.
  • The change list separates endpoint behavior from test coverage.
  • Validation includes exact commands run and intentionally skipped checks.
  • Risks and rollback steps are called out.
  • Screenshots or request/response examples are included for UI and API behavior changes.

No screenshots are included because this PR changes backend health-check internals and local integration coverage; it does not change product UI. The API response shape is intended to remain unchanged.

Copilot AI review requested due to automatic review settings May 12, 2026 01:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the /health endpoint’s OSRM dependency probe by making the OSRM client and probe cache process-lifetime state (avoiding shutdown destructor ordering hazards) and by coalescing concurrent cold-cache requests so only a single upstream OSRM probe is performed per burst.

Changes:

  • Converted /health OSRM client and probe cache to process-lifetime (intentionally leaked) objects to avoid shutdown-order destructor crashes.
  • Added “in-flight probe + waiter fan-out” logic so concurrent cold-cache /health requests share one OSRM probe.
  • Added a local HTTP integration test that fires 5 concurrent cold-cache /health requests and asserts only one OSRM probe hits the stub.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
app/api/src/endpoints/health_endpoint.cpp Introduces process-lifetime OSRM client/cache and coalesces concurrent cold-cache probes via waiter fan-out.
tests/integration/http_server/health_coalesces_cold_osrm_probe_test.sh Adds a bash+Python stub regression test validating cold-cache probe coalescing behavior under concurrency.
tests/CMakeLists.txt Registers the new local HTTP regression test with appropriate ports/env and CTest metadata.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread app/api/src/endpoints/health_endpoint.cpp
Copy link
Copy Markdown
Collaborator

@markboenigk markboenigk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Good PR overall — the logic is correct, the concurrency model is sound, and the test proves the key invariant cleanly.

Worth verifying before merge

Drogon callback thread safety in the fan-out path

When the probe completes, all waiters are called from the probe's I/O callback thread:

for (auto& waiter : CacheOsrmProbeAndTakeWaiters(osrm_probe)) {
  waiter(osrm_probe);  // invokes each request's drogon callback cross-thread
}

Each waiter calls the original Drogon callback that arrived on a different request's I/O thread. Worth a quick check against Drogon's documented contract for response callbacks — if they must be dispatched on their originating thread, this could surface subtle issues under load even though the integration test passes.

Minor

Silent fall-through in the waiter-only path (health_endpoint.cpp around the new dispatch block): when !dispatch.cached_result.has_value() && !dispatch.should_start_probe, control falls through with no comment. A one-liner like // waiter registered; probe already in flight — response arrives via fan-out would make the intent explicit.

GetOsrmHttpClient() double indirection: new drogon::HttpClientPtr{...} leaks a pointer-to-shared_ptr. The intent (prevent static destructor) is correct, but a brief comment explaining why the extra indirection exists would help future readers.

Test script: if the server crashes before any probe fires, cat "${probe_count_file}" produces a "no such file" error with no context. A guard like:

if [[ ! -f "${probe_count_file}" ]]; then
  echo "probe_count_file missing — no OSRM probe reached stub" >&2
  exit 1
fi

would give a clearer failure signal.

Copy link
Copy Markdown
Contributor Author

@phongndo phongndo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough review, @markboenigk. Addressed all four items:

Drogon callback thread safety: Verified — Drogon explicitly supports calling registerHandler callback from any thread. In multi-threaded mode, the framework internally dispatches to the appropriate event loop. Added a comment in the fan-out path confirming this is by design.

Silent fall-through: Added explicit return; after StartOsrmProbe() and the comment // Waiter registered; probe already in flight — response arrives via fan-out. at the fall-through point.

Double indirection in GetOsrmHttpClient(): Added a comment explaining the pointer-to-shared_ptr pattern exists to prevent static-destruction ordering hazards, and the shared_ptr is intentionally never destroyed during static teardown.

Test script guard: Added if [[ ! -f "${probe_count_file}" ]] guard with clear error message: probe_count_file missing — no OSRM probe reached stub.

Full suite (106 tests, no Docker/e2e) passes cleanly.

Copy link
Copy Markdown
Collaborator

@markboenigk markboenigk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is solid — the state machine in AddOsrmProbeWaiter, the lock/release pattern in CacheOsrmProbeAndTakeWaiters, and the process-lifetime new pattern are all correct. Test is well-designed too; the 0.4s stub delay gives all five requests time to queue before the probe resolves.

One issue: the four items from the first review round are described as fixed in the response, but none of them appear in the diff. The single commit predates the review by a day, so it looks like the changes weren't pushed.

Still missing:

  • Drogon cross-thread comment in the StartOsrmProbe fan-out loop
  • Fall-through comment + return after StartOsrmProbe() in the handler — the waiter-registered path is still silent
  • GetOsrmHttpClient() double-indirection comment explaining why new drogon::HttpClientPtr{...} exists
  • Test script guardcat "${probe_count_file}" runs under set -euo pipefail with no prior existence check

Push those four and this is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants