Description
Currently, sandbox environments behave like a black box. When processes die (e.g., OOM, crash), users cannot see resource usage or termination reasons. This makes it hard to distinguish between application bugs and infrastructure issues.
Proposal
- Expose per-sandbox metrics (CPU, memory, disk, I/O) via API and dashboard.
- Surface process termination reasons (e.g., OOM kill, manual stop, internal crash).
- Enable integration with external observability tools (e.g., OTEL collector endpoint configuration).
- (Optional) Provide a built-in lightweight supervisor to restart main processes when killed.
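As an illustration of the termination-reason idea, here is a minimal sketch of how a sandbox runtime might map low-level exit information to the reasons listed above. All names here (`TerminationReason`, `classify_termination`, and the `oom_killed` / `stop_requested` flags) are hypothetical and not part of any existing Daytona API; the sketch assumes the runtime can already tell whether the kernel OOM killer fired and whether a stop was requested by the user.

```python
from enum import Enum


class TerminationReason(str, Enum):
    """Hypothetical reason values; the names are illustrative only."""
    OOM_KILL = "oom_kill"
    MANUAL_STOP = "manual_stop"
    INTERNAL_CRASH = "internal_crash"
    EXITED = "exited"


def classify_termination(
    exit_code: int,
    oom_killed: bool = False,
    stop_requested: bool = False,
) -> TerminationReason:
    """Map raw exit information to a user-facing termination reason.

    Assumes the caller supplies `oom_killed` (e.g. from the container
    runtime's state) and `stop_requested` (set when a user or API call
    asked for the stop).
    """
    # An OOM kill takes priority: it is an infrastructure-level cause
    # even if a stop happened to be requested at the same time.
    if oom_killed:
        return TerminationReason.OOM_KILL
    if stop_requested:
        return TerminationReason.MANUAL_STOP
    if exit_code == 0:
        return TerminationReason.EXITED
    # Any other non-zero exit is treated as a crash of the process itself.
    return TerminationReason.INTERNAL_CRASH
```

Checking the OOM flag first is a design choice in this sketch: it keeps infrastructure causes distinguishable from application bugs, which is the main pain point described above.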
Acceptance Criteria
- API returns real-time and historical CPU/memory/disk usage for each sandbox.
- Sandbox events include termination reason with timestamp.
- Dashboard displays metrics graphs and termination events.
- Configurable OTEL collector endpoint supported for streaming metrics/logs.
- (Optional) Supervisor process can be toggled per sandbox for auto-restarts.
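To make the first two criteria more concrete, the following sketch shows one possible shape for a timestamped termination event and a historical metrics query. The types `MetricSample` and `TerminationEvent`, their field names, and the `history` helper are assumptions made for illustration; the actual API schema is still to be defined.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class MetricSample:
    """One hypothetical resource-usage sample for a sandbox."""
    sandbox_id: str
    timestamp: datetime
    cpu_percent: float
    memory_bytes: int
    disk_bytes: int


@dataclass(frozen=True)
class TerminationEvent:
    """Hypothetical sandbox event carrying a reason and a timestamp."""
    sandbox_id: str
    timestamp: datetime
    reason: str

    def to_json(self) -> str:
        payload = asdict(self)
        # Serialize the timestamp as ISO 8601 so clients can parse it.
        payload["timestamp"] = self.timestamp.isoformat()
        return json.dumps(payload)


def history(
    samples: Iterable[MetricSample],
    since: datetime,
    until: datetime,
) -> List[MetricSample]:
    """Return samples within [since, until], ordered oldest first."""
    selected = [s for s in samples if since <= s.timestamp <= until]
    return sorted(selected, key=lambda s: s.timestamp)
```

A "real-time" reading could then be the most recent element of `history(...)`, while dashboards would plot the full list.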
Impact
- Users can self-diagnose issues like OOM without needing Daytona support.
- Faster debugging and reduced downtime for production-like workloads.
- Enables external monitoring and alerting pipelines (Datadog, Grafana, etc.).
- Improves perceived reliability and user trust by making sandbox failures explainable.