Skip to content

Improve Twinkle service stability#230

Merged
Yunnglin merged 11 commits into
mainfrom
fix/service-stability
Jun 22, 2026
Merged

Improve Twinkle service stability#230
Yunnglin merged 11 commits into
mainfrom
fix/service-stability

Conversation

@Yunnglin

@Yunnglin Yunnglin commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Fix completion_mask padding so legal inputs no longer raise KeyError.
  • Add a FastAPI deployment-level exception boundary that returns 500 with traceback while keeping the server alive.
  • Preserve full traceback in task failure payloads.
  • Replace tail -f-based liveness in megatron/run.sh with a singleton watchdog/supervisor that restarts server/Ray locally.
  • Simplify online startup by setting default MODELSCOPE_CACHE and TWINKLE_WORK_DIR inside run.sh.
  • Add explicit restart mode via ./run.sh --restart or TWINKLE_RUN_EXISTING_ACTION=restart.

Root Cause

completion_mask was a valid input field but was missing from InputProcessor.padding_map, so normal padding could fail with KeyError. Separately, run.sh only stayed alive through tail -f run.log, so server/Ray failures could leave the process model fragile or fake-alive.

Startup / Restart

Default startup can now be shortened to:

bash /twinkle/cookbook/client/server/megatron/run.sh

For a code update that should actively restart an already-running service:

bash /twinkle/cookbook/client/server/megatron/run.sh --restart

or:

TWINKLE_RUN_EXISTING_ACTION=restart bash /twinkle/cookbook/client/server/megatron/run.sh

Default duplicate execution remains safe: without restart mode, a second run.sh exits instead of disrupting the live service.

Validation

  • bash -n cookbook/client/server/megatron/run.sh
  • bash cookbook/client/server/megatron/run.sh --help
  • git diff --check
  • /Users/yunlin/miniconda3/envs/twinkle/bin/python -m pre_commit run --files ...
  • /Users/yunlin/miniconda3/envs/twinkle/bin/python -m pytest tests/processor/test_processor.py tests/server/utils/test_task_errors.py tests/server/utils/test_task_queue_mixin.py tests/server/test_deployment_exception_boundary.py tests/server/test_app_builders_characterization.py tests/server/model -q
  • Local Ray Serve destructive smoke: completion_mask payload succeeded; repeated exception requests returned 500 + traceback; the same replica remained healthy and RUNNING.

Notes

Full megatron run.sh startup was not executed locally because it can stop local Ray/vLLM/Redis/LGTM processes and consume GPU resources. First online rollout should set bounded restart env vars before enabling unlimited supervisor restarts.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a watchdog supervisor script for Megatron, adds a completion mask to the input processor, and implements unhandled exception handling middleware in the FastAPI server alongside task queue error formatting. The review feedback highlights several critical issues: the exception-handling middleware is registered outermost, which interferes with telemetry and header injection; returning raw tracebacks to clients poses a security risk; the hardcoded lock file path in /tmp can cause permission conflicts in multi-user environments; spawning Python for health checks and date for time tracking in the shell script is highly inefficient; and a task failure helper is bypassed in the worker, causing code duplication.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/twinkle/server/deployment.py Outdated
Comment thread src/twinkle/server/deployment.py Outdated
Comment thread cookbook/client/server/megatron/run.sh Outdated
Comment thread cookbook/client/server/megatron/run.sh Outdated
Comment thread cookbook/client/server/megatron/run.sh Outdated
Comment thread cookbook/client/server/megatron/run.sh Outdated
Comment thread src/twinkle/server/utils/task_queue/worker.py Outdated
@Yunnglin Yunnglin marked this pull request as ready for review June 22, 2026 03:20
Copilot AI review requested due to automatic review settings June 22, 2026 03:20

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Twinkle’s service stability by (1) fixing padding for a valid input field (completion_mask), (2) ensuring unhandled FastAPI route exceptions return a structured 500 response (with traceback) rather than destabilizing the replica, and (3) making the Megatron service startup more resilient via a container entrypoint watchdog and a single-instance/restart mechanism in run.sh.

Changes:

  • Add completion_mask to InputProcessor.padding_map and cover it with a unit test to prevent padding-time KeyError.
  • Add a deployment-level exception boundary middleware that converts unhandled route exceptions into 500 JSON responses containing the traceback.
  • Improve operational robustness of Megatron service startup by introducing entrypoint.sh supervision and enhancing run.sh with single-instance + restart-request handling.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/twinkle/processor/base.py Adds completion_mask to padding defaults to prevent padding failures.
tests/processor/test_processor.py Adds regression coverage for completion_mask padding; adjusts multimodal assertion.
src/twinkle/server/deployment.py Adds an unhandled-exception middleware boundary returning 500 with traceback.
tests/server/test_deployment_exception_boundary.py Validates the exception boundary returns tracebacks and continues serving; checks replica header behavior.
tests/server/test_app_builders_characterization.py Updates expected middleware ordering to include the new exception boundary.
src/twinkle/server/utils/task_errors.py Introduces a shared helper for standard task failure payloads.
src/twinkle/server/utils/task_queue/worker.py Uses task_error_payload for failed task persistence.
src/twinkle/server/utils/task_queue/mixin.py Uses task_error_payload for background-task failure persistence.
tests/server/utils/test_task_errors.py Adds tests ensuring task failure payloads preserve traceback strings.
cookbook/client/server/megatron/run.sh Replaces tail-based liveness with a supervised foreground wait loop and adds restart/single-instance mechanics.
cookbook/client/server/megatron/entrypoint.sh Adds a container entrypoint watchdog that restarts run.sh based on Ray + HTTP health checks.

Comment thread cookbook/client/server/megatron/run.sh
Comment thread cookbook/client/server/megatron/run.sh
Comment thread src/twinkle/server/deployment.py
@Yunnglin Yunnglin merged commit ca5cb73 into main Jun 22, 2026
2 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants