Add sandbox observability and resource monitoring #2487

@radisicc

Description

Sandbox environments currently behave like a black box. When a process dies (e.g., from an OOM kill or a crash), users cannot see its resource usage or the reason it terminated. This makes it hard to distinguish application bugs from infrastructure issues.

Proposal

  • Expose per-sandbox metrics (CPU, memory, disk, I/O) via API and dashboard.
  • Surface process termination reasons (e.g., OOM kill, manual stop, internal crash).
  • Enable integration with external observability tools (e.g., OTEL collector endpoint configuration).
  • (Optional) Provide a built-in lightweight supervisor to restart main processes when killed.
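
To make the termination-reason item concrete, here is a minimal sketch of how a reason could be derived on Linux. It assumes the sandbox runs in a cgroup v2 hierarchy, where the `memory.events` file exposes an `oom_kill` counter, and that the exit status follows Python's `subprocess` convention (a negative value means "killed by that signal"). The function names, the `TerminationEvent` type, and the reason strings are hypothetical, not an existing Daytona API.

```python
import signal
from dataclasses import dataclass
from datetime import datetime, timezone


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the contents of a cgroup v2 memory.events file ("key value" per line)."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


@dataclass
class TerminationEvent:  # hypothetical event shape
    reason: str          # "oom_kill", "manual_stop", "exited", or "internal_crash"
    exit_code: int
    timestamp: str       # ISO 8601, UTC


def classify_termination(
    returncode: int,
    oom_kills_before: int,
    oom_kills_after: int,
    stop_requested: bool = False,
) -> TerminationEvent:
    """Map an exit status plus cgroup OOM counters to a termination reason."""
    killed_by = -returncode if returncode < 0 else None
    if killed_by == signal.SIGKILL and oom_kills_after > oom_kills_before:
        # SIGKILL alone is ambiguous; the increased oom_kill counter is the evidence.
        reason = "oom_kill"
    elif stop_requested:
        reason = "manual_stop"
    elif returncode == 0:
        reason = "exited"
    else:
        reason = "internal_crash"
    return TerminationEvent(reason, returncode, datetime.now(timezone.utc).isoformat())
```

Comparing the `oom_kill` counter before and after the process exits avoids treating every SIGKILL as an OOM kill, since a manual stop can also end in SIGKILL.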

Acceptance Criteria

  • API returns real-time and historical CPU/memory/disk usage for each sandbox.
  • Sandbox events include termination reason with timestamp.
  • Dashboard displays metrics graphs and termination events.
  • Configurable OTEL collector endpoint supported for streaming metrics/logs.
  • (Optional) Supervisor process can be toggled per sandbox for auto-restarts.
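
As a rough illustration of the first criterion, the sketch below shows one way a per-sandbox store could serve both the latest sample and a historical range. Everything here is an assumption for discussion: the `MetricSample` fields, the `SandboxMetricsHistory` class, and the in-memory bounded buffer are hypothetical, and a real implementation would likely use persistent storage.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricSample:  # hypothetical sample shape
    timestamp: float   # Unix seconds
    cpu_percent: float
    memory_bytes: int
    disk_bytes: int


class SandboxMetricsHistory:
    """Bounded in-memory history of samples for one sandbox (hypothetical)."""

    def __init__(self, max_samples: int = 3600) -> None:
        # A deque with maxlen silently drops the oldest sample when full.
        self._samples: deque[MetricSample] = deque(maxlen=max_samples)

    def record(self, sample: MetricSample) -> None:
        self._samples.append(sample)

    def latest(self) -> Optional[MetricSample]:
        """Real-time view: the most recent sample, or None if nothing was recorded."""
        return self._samples[-1] if self._samples else None

    def between(self, start: float, end: float) -> list[MetricSample]:
        """Historical view: samples with start <= timestamp <= end, oldest first."""
        return [s for s in self._samples if start <= s.timestamp <= end]
```

An API handler could return `latest()` for the real-time endpoint and `between(start, end)` for the historical one, and the dashboard graphs could read from the same range query.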

Impact

  • Users can self-diagnose issues like OOM without needing Daytona support.
  • Faster debugging and reduced downtime for production-like workloads.
  • Enables external monitoring and alerting pipelines (Datadog, Grafana, etc.).
  • Improves reliability and trust.
