Bug Description
Per docs/en/guides/05-observability.md, /health is documented as a "simple liveness check" responding with {"status": "ok"}. However, the endpoint is gated behind the ASGI lifespan protocol — service.initialize() in app.py runs before yield, blocking ALL HTTP request processing, including /health, until initialization completes. When initialization triggers expensive collection recovery on existing workspace data, the server listens on port 1933 but does not respond to HTTP for minutes, causing the Docker entrypoint's health-check loop to time out and kill the still-initializing server.
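The failure mode can be sketched with stdlib asyncio alone (hypothetical names; the real code lives in openviking/server/app.py): because an ASGI server drives lifespan startup to completion before dispatching any request, nothing — including /health — can be answered until the pre-yield phase returns.

```python
import asyncio
import contextlib

events = []

@contextlib.asynccontextmanager
async def lifespan():
    # Mirrors app.py: service.initialize() runs before `yield`, so the
    # server cannot dispatch ANY request -- including /health -- until
    # this coroutine reaches the yield.
    events.append("initialize start")
    await asyncio.sleep(0.05)  # stands in for slow collection recovery
    events.append("initialize done")
    yield

async def handle_health():
    # The documented "simple liveness check"
    events.append("health answered")
    return {"status": "ok"}

async def serve_once():
    # ASGI servers complete lifespan startup before handling traffic
    async with lifespan():
        await handle_health()

asyncio.run(serve_once())
```

The ordering is fixed by the protocol: startup always finishes before the first request is served, which is exactly why a slow initialize() starves /health.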
Steps to Reproduce
- Deploy OpenViking v0.3.10 via Docker with an existing workspace volume containing populated vector data
- Start the container:
docker compose up -d
- While the server is initializing, run:
curl http://localhost:1933/health
- Observe:
curl hangs (no response) until initialization completes or the entrypoint kills the server
- Container restarts → loop repeats
Expected Behavior
/health should respond as soon as the HTTP server is accepting connections — matching the documented "simple liveness check" contract.
Actual Behavior
/health does not respond during service.initialize(). When the entrypoint's health-check timeout fires, it sends SIGTERM to the still-initializing server:
OpenViking HTTP Server is running on 0.0.0.0:1933
...
File "openviking/storage/vectordb/utils/data_processor.py", line 339
json.dumps(converted, ensure_ascii=False)
File "openviking/utils/process_lock.py", line 120
signal.SIGTERM → KeyboardInterrupt
[openviking-console-entrypoint] openviking-server exited before becoming healthy
Error Logs
2026-04-29 06:26:53,388 - uvicorn.error - ERROR - Traceback (most recent call last):
File "/app/.venv/lib/python3.13/site-packages/starlette/routing.py", line 694, in lifespan
async with self.lifespan_context(app) as maybe_state:
...
File "openviking/server/app.py", line 78, in lifespan
await service.initialize()
File "openviking/service/core.py", line 258, in initialize
await init_context_collection(self._vikingdb_manager)
...
File "openviking/storage/vectordb/collection/local_collection.py", line 996, in _recover
index.upsert_data(upsert_list)
File "openviking/storage/vectordb/utils/data_processor.py", line 339, in convert_fields_for_index
return json.dumps(converted, ensure_ascii=False)
File "openviking/utils/process_lock.py", line 120, in <lambda>
signal.SIGTERM, lambda sig, frame: (_cleanup(), signal.default_int_handler(sig, frame))
KeyboardInterrupt
[openviking-console-entrypoint] openviking-server exited before becoming healthy
OpenViking Version
v0.3.10
Python Version
3.13
Operating System
Linux (Debian, Docker)
Additional Context
Root cause: openviking/server/app.py:146-153 places the entire service.initialize() before yield in the ASGI lifespan. The slowest step is init_context_collection() at core.py:265, which triggers PersistCollection._recover() — duration scales with existing workspace data volume.
Proposed fix: Separate liveness from readiness by splitting initialize() into two phases:
init_essentials() (before yield, seconds): storage managers, embedder client, encryption setup
init_deferred() (after yield, background task): collection recovery, VikingFS init, queue workers
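The two-phase split above can be sketched with stdlib asyncio, simulating the lifespan as an async context manager (init_essentials/init_deferred are the hypothetical names from this proposal, not existing OpenViking APIs):

```python
import asyncio
import contextlib

phases = []

async def init_essentials():
    # fast phase: storage managers, embedder client, encryption setup
    phases.append("essentials")

async def init_deferred():
    # slow phase: collection recovery, VikingFS init, queue workers
    await asyncio.sleep(0)
    phases.append("deferred")

@contextlib.asynccontextmanager
async def lifespan():
    await init_essentials()                      # still blocks startup, but briefly
    task = asyncio.create_task(init_deferred())  # slow work runs in the background
    yield                                        # server starts serving /health here
    await task                                   # let deferred init finish on shutdown

async def main():
    async with lifespan():
        # The server is accepting requests at this point; deferred
        # init may or may not have completed yet.
        phases.append("serving")

asyncio.run(main())
```

Note that the server reaches "serving" before the deferred phase completes, which is the point of the split.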
/health responds {"status": "starting"} (HTTP 503) immediately after the server starts, then {"status": "ok"} (HTTP 200) when deferred init completes. The Docker entrypoint polls for "status":"ok" rather than just HTTP 200.
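The readiness contract itself is small enough to sketch (hypothetical helper names; the real handler would live next to the existing /health route):

```python
class ServiceState:
    """Tracks whether deferred initialization has finished."""
    def __init__(self):
        self.ready = False

def health_response(state):
    # Proposed contract: 503 {"status": "starting"} while deferred init
    # runs, 200 {"status": "ok"} once it completes.
    if state.ready:
        return 200, {"status": "ok"}
    return 503, {"status": "starting"}

def entrypoint_is_healthy(status_code, body):
    # The entrypoint should key on the body, not merely on getting an
    # HTTP response: a 503 "starting" server is alive but not ready.
    return status_code == 200 and body.get("status") == "ok"
```

With this split, a liveness probe can treat any HTTP response as "process is up", while the readiness check waits for "status":"ok".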
Secondary issue: openviking/utils/process_lock.py:120 uses signal.default_int_handler (SIGINT handler) for SIGTERM, raising KeyboardInterrupt instead of clean shutdown:
- signal.signal(signal.SIGTERM, lambda sig, frame: (_cleanup(), signal.default_int_handler(sig, frame)))
+ signal.signal(signal.SIGTERM, lambda sig, frame: (_cleanup(), sys.exit(0)))
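The proposed handler (which additionally needs `import sys`) can be exercised in isolation: delivering SIGTERM to the process now runs cleanup and raises SystemExit(0) instead of KeyboardInterrupt. A minimal self-test, with `_cleanup` standing in for the real lock-release logic:

```python
import os
import signal
import sys

cleanup_ran = []

def _cleanup():
    # stands in for the real lock-release logic in process_lock.py
    cleanup_ran.append(True)

# Proposed fix: clean exit via SystemExit rather than routing SIGTERM
# through the SIGINT handler (which raises KeyboardInterrupt).
signal.signal(signal.SIGTERM, lambda sig, frame: (_cleanup(), sys.exit(0)))

exit_code = None
try:
    os.kill(os.getpid(), signal.SIGTERM)  # deliver SIGTERM to ourselves
except SystemExit as exc:
    exit_code = exc.code
```

SystemExit propagates like a normal exception, so uvicorn and any `finally` blocks still get a chance to shut down cleanly, unlike the KeyboardInterrupt seen in the traceback above.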