[Bug]: /health endpoint blocked by FastAPI lifespan during service.initialize(), violating documented liveness semantics #1793

@A0nameless0man

Bug Description

Per docs/en/guides/05-observability.md, /health is documented as a "simple liveness check" responding {"status": "ok"}. However, the endpoint is gated behind the ASGI lifespan protocol — service.initialize() in app.py runs before yield, blocking ALL HTTP request processing, including /health, until initialization completes. When initialization triggers expensive collection recovery on existing workspace data, the server listens on port 1933 but does not respond to HTTP for minutes, causing the Docker entrypoint's health-check loop to time out and kill the still-initializing server.

Steps to Reproduce

  1. Deploy OpenViking v0.3.10 via Docker with an existing workspace volume containing populated vector data
  2. Start the container: docker compose up -d
  3. While the server is initializing, run: curl http://localhost:1933/health
  4. Observe: curl hangs (no response) until initialization completes or the entrypoint kills the server
  5. Container restarts → loop repeats

Expected Behavior

/health should respond as soon as the HTTP server is accepting connections — matching the documented "simple liveness check" contract.

Actual Behavior

/health does not respond during service.initialize(). When the entrypoint's health-check timeout fires, it sends SIGTERM to the still-initializing server:

OpenViking HTTP Server is running on 0.0.0.0:1933
  ...
  File "openviking/storage/vectordb/utils/data_processor.py", line 339
    json.dumps(converted, ensure_ascii=False)
  File "openviking/utils/process_lock.py", line 120
    signal.SIGTERM → KeyboardInterrupt
[openviking-console-entrypoint] openviking-server exited before becoming healthy

Error Logs

2026-04-29 06:26:53,388 - uvicorn.error - ERROR - Traceback (most recent call last):
  File "/app/.venv/lib/python3.13/site-packages/starlette/routing.py", line 694, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  ...
  File "openviking/server/app.py", line 78, in lifespan
    await service.initialize()
  File "openviking/service/core.py", line 258, in initialize
    await init_context_collection(self._vikingdb_manager)
  ...
  File "openviking/storage/vectordb/collection/local_collection.py", line 996, in _recover
    index.upsert_data(upsert_list)
  File "openviking/storage/vectordb/utils/data_processor.py", line 339, in convert_fields_for_index
    return json.dumps(converted, ensure_ascii=False)
  File "openviking/utils/process_lock.py", line 120, in <lambda>
    signal.SIGTERM, lambda sig, frame: (_cleanup(), signal.default_int_handler(sig, frame))
KeyboardInterrupt
[openviking-console-entrypoint] openviking-server exited before becoming healthy

OpenViking Version

v0.3.10

Python Version

3.13

Operating System

Linux (Debian, Docker)

Additional Context

Root cause: openviking/server/app.py:146-153 places the entire service.initialize() before yield in the ASGI lifespan. The slowest step is init_context_collection() at core.py:265, which triggers PersistCollection._recover() — duration scales with existing workspace data volume.

Proposed fix: Separate liveness from readiness by splitting initialize() into two phases:

  1. init_essentials() (before yield, seconds): storage managers, embedder client, encryption setup
  2. init_deferred() (after yield, background task): collection recovery, VikingFS init, queue workers

/health responds {"status": "starting"} (HTTP 503) immediately after the server starts, then {"status": "ok"} (HTTP 200) when deferred init completes. The Docker entrypoint polls for "status":"ok" rather than just HTTP 200.
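
The two-phase split above can be sketched without any web framework, as a rough illustration only. The names init_essentials() and init_deferred() come from the proposal; the Service class, its ready event, the health() helper, and the sleep durations are hypothetical stand-ins and are not taken from the OpenViking codebase. In a real FastAPI app the lifespan function would receive the app object rather than the service.

```python
import asyncio
from contextlib import asynccontextmanager


class Service:
    """Hypothetical stand-in for the real service object."""

    def __init__(self, deferred_seconds: float = 0.05) -> None:
        # Set once the slow phase has finished; /health reads this flag.
        self.ready = asyncio.Event()
        self._deferred_seconds = deferred_seconds

    async def init_essentials(self) -> None:
        # Placeholder for fast setup (storage managers, embedder client, ...).
        await asyncio.sleep(0)

    async def init_deferred(self) -> None:
        # Placeholder for slow work such as collection recovery.
        await asyncio.sleep(self._deferred_seconds)
        self.ready.set()


@asynccontextmanager
async def lifespan(service: Service):
    # Phase 1 runs before yield, so it must stay short.
    await service.init_essentials()
    # Phase 2 runs in the background; requests are served meanwhile.
    task = asyncio.create_task(service.init_deferred())
    try:
        yield
    finally:
        # On shutdown, stop the background phase if it is still running.
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)


def health(service: Service):
    # Returns (HTTP status, JSON body) as described in the proposal.
    if service.ready.is_set():
        return 200, {"status": "ok"}
    return 503, {"status": "starting"}


async def demo():
    service = Service()
    observed = []
    async with lifespan(service):
        observed.append(health(service))   # polled during deferred init
        await service.ready.wait()
        observed.append(health(service))   # polled after deferred init
    return observed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the sketch is that health() can answer as soon as the lifespan reaches yield, while the readiness flag carries the distinction the entrypoint would poll for.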

Secondary issue: openviking/utils/process_lock.py:120 forwards SIGTERM to signal.default_int_handler (the default SIGINT handler), which raises KeyboardInterrupt instead of shutting down cleanly:

- signal.signal(signal.SIGTERM, lambda sig, frame: (_cleanup(), signal.default_int_handler(sig, frame)))
+ signal.signal(signal.SIGTERM, lambda sig, frame: (_cleanup(), sys.exit(0)))
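
The difference between the two handlers can be checked in isolation. This is a small sketch that calls each handler body directly instead of installing it with signal.signal(); the function names are hypothetical and the _cleanup() call from the original lambda is omitted.

```python
import signal
import sys


def current_handler(sig, frame):
    # Mirrors the existing lambda: delegates to the default SIGINT handler.
    signal.default_int_handler(sig, frame)


def proposed_handler(sig, frame):
    # Mirrors the proposed lambda: exits with status 0.
    sys.exit(0)


def outcome(handler):
    # Invoke the handler as the interpreter would and report what it raises.
    try:
        handler(signal.SIGTERM, None)
    except KeyboardInterrupt:
        return "KeyboardInterrupt"
    except SystemExit as exc:
        return f"SystemExit({exc.code})"
    return "returned"


if __name__ == "__main__":
    print(outcome(current_handler))
    print(outcome(proposed_handler))
```

With the current handler the process unwinds with KeyboardInterrupt, which matches the traceback in the error logs; with the proposed one it unwinds with SystemExit carrying exit code 0.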
