A GPU operations platform for data center GPU health monitoring, registration, and optimization.

    ┌─────────────────────────────────────────────────────────────────┐
    │                        GPU Ops Platform                         │
    ├─────────────────────────────────────────────────────────────────┤
    │ gputl CLI (Go) │ gputld Daemon (Go) │ Starlark Engine (Python)  │
    └─────────────────────────────────────────────────────────────────┘

| Component | Language | Description |
|---|---|---|
| gputld | Go | Main daemon for GPU registration and health monitoring |
| gputl | Go | CLI tool for GPU operations |
| Starlark | Python | Policy engine for GPU allocation and optimization |
- Go 1.21+
- Python 3.10+ (for policy engine)
- UV for Python dependency management (recommended); install with `pip install uv`
- NVIDIA GPU with NVML support

Build on Windows (PowerShell):

    cd gpu-ops-platform
    go mod download
    .\build.ps1
    # Or build and install to PATH
    .\build.ps1 -Install

Build on Linux/macOS:

    cd gpu-ops-platform
    go mod download
    make build
    # Or install to /usr/local/bin
    make install

Start the daemon on Windows:

    .\build.ps1 -RunDaemon
    # Or if binaries are already built
    .\bin\gputld.exe

Start the daemon on Linux/macOS:

    make run-daemon
    # or
    ./bin/gputld

The daemon starts:

- HTTP API server on http://localhost:8080
- Prometheus metrics on http://localhost:9090/metrics
- Health monitoring loop

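
Scripts that launch the daemon often need to wait until the HTTP API answers before calling it. The following is a minimal sketch of such a readiness check, not part of the platform itself. It assumes that `/health` replies with HTTP 200 once the daemon is up, which this README does not state explicitly; `http_status` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/health"  # health endpoint listed in this README


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def wait_until_ready(
    probe: Callable[[str], int] = http_status,
    url: str = HEALTH_URL,
    attempts: int = 10,
    delay: float = 0.5,
) -> bool:
    """Poll the health endpoint until it answers 200 or attempts run out.

    ASSUMPTION: a healthy daemon answers /health with status 200.
    """
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_ready()` right after starting `gputld` returns `True` as soon as the endpoint responds. The `probe` parameter exists so the polling logic can be exercised without a running daemon.
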

Use the CLI on Windows:

    .\bin\gputl.exe health
    .\bin\gputl.exe status
    .\bin\gputl.exe status 0
    .\bin\gputl.exe health-checks
    .\bin\gputl.exe register 0 --name "RTX_5070_Ti" --pool "production" --tags "high-perf,ml"

Use the CLI on Linux/macOS:

    ./bin/gputl health
    ./bin/gputl status
    ./bin/gputl status 0
    ./bin/gputl health-checks
    ./bin/gputl register 0 --name "RTX_5070_Ti" --pool "production" --tags "high-perf,ml"

Set up and run the Python policy engine:

    pip install uv
    cd python
    uv sync
    # Load and test policies
    uv run star --load-all
    # View registered pools
    uv run star --load-all --list-pools
    # Test allocation for specific GPU
    uv run star --load-all --test-gpu 0

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/api/v1/gpus` | GET | List all GPUs |
| `/api/v1/gpus/:id` | GET | Get GPU details |
| `/api/v1/register` | POST | Register a GPU |
| `/api/v1/unregister/:id` | DELETE | Unregister a GPU |
| `/api/v1/healthchecks` | GET | List health check configurations |

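
The endpoints in the table can be called from Python as well as from curl. The sketch below builds one request per endpoint using only the standard library. The paths and methods come from the table; the JSON body for `/api/v1/register` is an assumption modeled on the CLI flags (`--name`, `--pool`, `--tags`), since the actual request schema is not documented here. All function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default API address from this README


def list_gpus_request() -> urllib.request.Request:
    """GET /api/v1/gpus"""
    return urllib.request.Request(f"{BASE_URL}/api/v1/gpus", method="GET")


def gpu_details_request(gpu_id: int) -> urllib.request.Request:
    """GET /api/v1/gpus/:id"""
    return urllib.request.Request(f"{BASE_URL}/api/v1/gpus/{gpu_id}", method="GET")


def register_request(
    gpu_id: int, name: str, pool: str, tags: list[str]
) -> urllib.request.Request:
    """POST /api/v1/register

    ASSUMPTION: the field names below mirror the CLI flags; the daemon's
    real JSON schema may differ.
    """
    payload = {"id": gpu_id, "name": name, "pool": pool, "tags": tags}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unregister_request(gpu_id: int) -> urllib.request.Request:
    """DELETE /api/v1/unregister/:id"""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/unregister/{gpu_id}", method="DELETE"
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read()
```

With the daemon running, `send(list_gpus_request())` returns the raw response body. Building the request separately from sending it keeps the URL and payload easy to inspect before anything touches the network.
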

Query the API and metrics with curl:

    curl http://localhost:8080/api/v1/gpus
    curl http://localhost:8080/health
    curl http://localhost:9090/metrics

Project layout:

    gpu-ops-platform/
    ├── cmd/
    │   ├── gputld/              # Daemon entry point
    │   └── gputl/               # CLI entry point
    ├── pkg/
    │   ├── gpu/                 # GPU discovery and monitoring
    │   ├── registration/        # GPU registration service
    │   ├── health/              # Health checks
    │   ├── metrics/             # Prometheus metrics
    │   └── config/              # Configuration management
    ├── python/
    │   ├── policies/            # Starlark policy definitions
    │   │   ├── basic.gsky       # Basic pool policy
    │   │   ├── production.gsky  # Production pool policy
    │   │   └── development.gsky # Development pool policy
    │   └── starlark_engine/     # Starlark policy evaluator
    ├── bin/                     # Built binaries (created after build)
    ├── build.ps1                # Windows build script
    ├── Makefile                 # Unix Makefile
    └── go.mod                   # Go module dependencies

- Run `go generate` if you add new protobuf files
- Use `go test -v ./...` to run all tests
- Check the `/metrics` endpoint for current GPU metrics
- Look at `python/policies/` for examples of Starlark policy syntax
- Use `uv sync` in `python/` to install dependencies
- The Starlark engine uses pure Python for evaluation (no Go integration yet)
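
To inspect the `/metrics` output from a script rather than by eye, the Prometheus text format can be parsed with a few lines of Python. This is a simplified sketch that handles the common `name{labels} value` line shape and skips comments; it is not a full parser for the exposition format. The metric names in `SAMPLE` are hypothetical placeholders, since this README does not list the metrics the daemon exports.

```python
import urllib.request

METRICS_URL = "http://localhost:9090/metrics"  # metrics address from this README

# Hypothetical sample; the daemon's real metric names may differ.
SAMPLE = """\
# HELP gpu_temperature_celsius GPU temperature (hypothetical metric name)
# TYPE gpu_temperature_celsius gauge
gpu_temperature_celsius{gpu="0"} 61
gpu_utilization_percent{gpu="0"} 37.5
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample (name plus label set) to its numeric value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Split after the closing brace so spaces in label values are kept.
            end = line.rfind("}") + 1
            key, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            key, rest = parts[0], parts[1:]
        if rest:
            samples[key] = float(rest[0])  # an optional timestamp is ignored
    return samples


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> dict[str, float]:
    """Download and parse the metrics page of a running daemon."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

`parse_metrics(SAMPLE)` works offline, while `fetch_metrics()` needs the daemon to be running. For anything beyond quick checks, a dedicated Prometheus client library is likely the better choice.
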
- Explore the code: Start by looking at `pkg/gpu/gpu.go` for GPU discovery
- Write a policy: Create a new `.gsky` file in `python/policies/`
- Add health checks: See `pkg/health/health.go` to understand health monitoring
- NVML Integration: Replace mock GPU data with actual `github.com/NVIDIA/go-nvml` calls
- Add web UI: Build a dashboard to visualize GPU status and health

If `go` is not recognized as a command, Go is not in your PATH. Install Go from https://go.dev/dl/ and add it to your PATH.

For the Python policy engine, install Python 3.10+ from https://www.python.org/downloads/ and run:

    pip install uv
    cd python
    uv sync

If a port is already in use, kill the process using the port or change the port in the code.

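
Before killing a process or changing a port, it can help to confirm which of the daemon's ports is actually taken. The sketch below tries a TCP connection to each port and reports the ones that accept it. It assumes the default ports 8080 and 9090 mentioned earlier and only checks the loopback address; `port_in_use` and `busy_ports` are hypothetical helper names.

```python
import socket

DEFAULT_PORTS = (8080, 9090)  # HTTP API and Prometheus metrics ports from this README


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=DEFAULT_PORTS, host: str = "127.0.0.1") -> list[int]:
    """Return the subset of ports that already have a listener."""
    return [port for port in ports if port_in_use(port, host)]
```

If `busy_ports()` returns a non-empty list while `gputld` is not running, another program holds those ports.
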
MIT