|
| 1 | +# GitHub Copilot Instructions for NVSentinel |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +NVSentinel is a GPU Node Resilience System for Kubernetes that automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It's designed for high-performance computing environments running NVIDIA GPUs. |
| 6 | + |
| 7 | +**Status**: Experimental/Preview Release - APIs and configurations may change. |
| 8 | + |
| 9 | +## Architecture & Technologies |
| 10 | + |
| 11 | +### Core Technologies |
| 12 | +- **Language**: Go 1.25+ (primary), Python 3.12+ (monitoring tools) |
| 13 | +- **Container Platform**: Kubernetes 1.25+ |
| 14 | +- **Deployment**: Helm 3.0+, Tilt (development) |
| 15 | +- **Storage**: MongoDB (event store with change streams) |
| 16 | +- **Communication**: gRPC with Protocol Buffers |
| 17 | +- **GPU Monitoring**: NVIDIA DCGM (Data Center GPU Manager) |
| 18 | + |
| 19 | +### Project Structure |
| 20 | +``` |
| 21 | +├── health-monitors/ # Pluggable health detection modules |
| 22 | +│ ├── gpu-health-monitor/ # DCGM-based GPU monitoring (Python) |
| 23 | +│ ├── csp-health-monitor/ # Cloud provider health checks (Go) |
| 24 | +│ └── syslog-health-monitor/ # System log analysis (Go) |
| 25 | +├── health-events-analyzer/ # Event classification and routing |
| 26 | +├── fault-quarantine-module/ # Node isolation (cordon) |
| 27 | +├── node-drainer-module/ # Workload eviction |
| 28 | +├── fault-remediation-module/ # Break-fix automation |
| 29 | +├── labeler-module/ # Node labeling (DCGM version, driver status, Kata detection) |
| 30 | +├── janitor/ # State cleanup and maintenance |
| 31 | +├── platform-connectors/ # CSP integration (GCP, AWS, Azure) |
| 32 | +├── commons/ # Shared utilities |
| 33 | +├── data-models/ # Protocol Buffer definitions |
| 34 | +├── store-client-sdk/ # MongoDB client library |
| 35 | +└── distros/kubernetes/ # Helm charts |
| 36 | +``` |
| 37 | + |
| 38 | +## Coding Standards |
| 39 | + |
| 40 | +### Go Code Guidelines |
| 41 | + |
| 42 | +#### Module Organization |
| 43 | +- Each service is a separate Go module with its own `go.mod` |
| 44 | +- Use semantic import versioning |
| 45 | +- Keep dependencies minimal and up-to-date |
| 46 | +- Use `commons/` for shared utilities across modules |
| 47 | + |
| 48 | +#### Code Style |
| 49 | +- Follow standard Go conventions (gofmt, golint) |
| 50 | +- Use structured logging via `log/slog` |
| 51 | +- Error handling: wrap errors with context using `fmt.Errorf("context: %w", err)` |
| 52 | +- Within `retry.RetryOnConflict` blocks, return errors **without wrapping** to preserve retry behavior |
| 53 | +- Use meaningful variable names (`synced` over `ok` for cache sync checks) |
| 54 | + |
| 55 | +#### Kubernetes Integration |
| 56 | +- Use `client-go` for Kubernetes API interactions |
| 57 | +- Prefer informers over direct API calls for watching resources |
| 58 | +- Use `envtest` for testing Kubernetes controllers (not fake clients) |
| 59 | +- Implement proper shutdown handling with context cancellation |
| 60 | + |
| 61 | +#### Testing Requirements |
| 62 | +- Use `testify/assert` and `testify/require` for assertions |
| 63 | +- Write table-driven tests when testing multiple scenarios |
| 64 | +- Use `envtest` for integration tests with real Kubernetes API |
| 65 | +- Test coverage: aim for >80% on critical paths |
| 66 | +- Name tests descriptively: `TestFunctionName_Scenario_ExpectedBehavior` |
| 67 | + |
| 68 | +### Python Code Guidelines |
| 69 | +- Use Poetry for dependency management |
| 70 | +- Follow PEP 8 style guide |
| 71 | +- Use Black for formatting |
| 72 | +- Type hints required for all functions |
| 73 | +- Use dataclasses for structured data |
| 74 | + |
| 75 | +### Protobuf Guidelines |
| 76 | +- Define messages in `data-models/protobufs/` |
| 77 | +- Use semantic versioning for breaking changes |
| 78 | +- Include comprehensive comments for all fields |
| 79 | +- Generate code with: `make protos-generate` |
| 80 | + |
| 81 | +## Development Workflows |
| 82 | + |
| 83 | +### Building & Testing |
| 84 | +```bash |
| 85 | +# Lint and test all modules |
| 86 | +make lint-test-all |
| 87 | + |
| 88 | +# Lint specific module |
| 89 | +cd labeler-module && make lint |
| 90 | + |
| 91 | +# Test specific module |
| 92 | +cd health-events-analyzer && make test |
| 93 | + |
| 94 | +# Build container images (uses ko) |
| 95 | +make images |
| 96 | + |
| 97 | +# View all make targets |
| 98 | +make help |
| 99 | +``` |
| 100 | + |
| 101 | +### Version Management |
| 102 | +- All tool versions centralized in `.versions.yaml` |
| 103 | +- Use `yq` to read versions in scripts |
| 104 | +- Update versions in one place, propagates everywhere |
| 105 | +- Check versions: `make show-versions` |
| 106 | + |
| 107 | +### Local Development with Tilt |
| 108 | +```bash |
| 109 | +cd tilt |
| 110 | +tilt up # Start local development environment |
| 111 | +``` |
| 112 | + |
| 113 | +## Kata Containers Detection |
| 114 | + |
| 115 | +The labeler-module implements Kata Containers detection: |
| 116 | + |
| 117 | +### Detection Architecture |
| 118 | +- **Input labels** (on nodes): `katacontainers.io/kata-runtime` (default) + optional custom label |
| 119 | +- **Output label** (set by labeler): `nvsentinel.dgxc.nvidia.com/kata.enabled: "true"|"false"` |
| 120 | +- **Truthy values**: `"true"`, `"enabled"`, `"1"`, `"yes"` (case-insensitive) |
| 121 | +- **Lifecycle separation**: Pod events → DCGM/driver labels, Node events → kata labels |
| 122 | + |
| 123 | +### DaemonSet Variants |
| 124 | +- Separate DaemonSets for kata vs regular nodes |
| 125 | +- Selection via `nodeAffinity` based on kata.enabled label |
| 126 | +- Different volume mounts: |
| 127 | + - Regular: `/var/log` (file-based logs) |
| 128 | + - Kata: `/run/log/journal` and `/var/log/journal` (systemd journal) |
| 129 | + |
| 130 | +## Important Patterns |
| 131 | + |
| 132 | +### Error Handling in Retry Loops |
| 133 | +```go |
| 134 | +// ✅ CORRECT - Return error as-is for retry logic |
| 135 | +err := retry.RetryOnConflict(retry.DefaultBackoff, func() error { |
| 136 | + _, err := client.Update(ctx, obj, metav1.UpdateOptions{}) |
| 137 | + return err // Don't wrap! |
| 138 | +}) |
| 139 | + |
| 140 | +// ❌ WRONG - Wrapping breaks retry detection |
| 141 | +err := retry.RetryOnConflict(retry.DefaultBackoff, func() error { |
| 142 | + _, err := client.Update(ctx, obj, metav1.UpdateOptions{}) |
| 143 | + return fmt.Errorf("failed: %w", err) // Breaks retry! |
| 144 | +}) |
| 145 | +``` |
| 146 | + |
| 147 | +### Informer Usage |
| 148 | +```go |
| 149 | +// Use separate informers for different resource types |
| 150 | +podInformer := factory.Core().V1().Pods().Informer() |
| 151 | +nodeInformer := factory.Core().V1().Nodes().Informer() |
| 152 | + |
| 153 | +// Extract event handler setup into helper methods |
| 154 | +func (l *Labeler) registerPodEventHandlers() error { |
| 155 | + _, err := l.podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{ |
| 156 | + // handlers... |
| 157 | + }) |
| 158 | + return err |
| 159 | +} |
| 160 | +``` |
| 161 | + |
| 162 | +### Testing with envtest |
| 163 | +```go |
| 164 | +// Use real Kubernetes API server for tests |
| 165 | +testEnv := &envtest.Environment{ |
| 166 | + CRDDirectoryPaths: []string{}, |
| 167 | +} |
| 168 | +cfg, err := testEnv.Start() |
| 169 | +// Create clients from cfg |
| 170 | +// Run tests |
| 171 | +testEnv.Stop() |
| 172 | +``` |
| 173 | + |
| 174 | +## Documentation Standards |
| 175 | + |
| 176 | +### Code Comments |
| 177 | +- Package-level godoc for all packages |
| 178 | +- Function comments for exported functions |
| 179 | +- Inline comments for complex logic only |
| 180 | +- TODO comments should reference issues |
| 181 | + |
| 182 | +### Helm Chart Documentation |
| 183 | +- Document all values in `values.yaml` with inline comments |
| 184 | +- Include examples for non-obvious configurations |
| 185 | +- Note truthy value requirements where applicable |
| 186 | +- Explain DaemonSet variant selection logic |
| 187 | + |
| 188 | +## Security & Compliance |
| 189 | + |
| 190 | +### Licensing |
| 191 | +- All files must include Apache 2.0 license header |
| 192 | +- Use `make license-headers-add` to add headers |
| 193 | +- No GPL or viral licenses in dependencies |
| 194 | + |
| 195 | +### Security |
| 196 | +- Report security issues via SECURITY.md process |
| 197 | +- Never commit secrets or credentials |
| 198 | +- Use Kubernetes secrets for sensitive data |
| 199 | +- Scan dependencies: automated via CI |
| 200 | + |
| 201 | +## CI/CD Pipeline |
| 202 | + |
| 203 | +### GitHub Actions Workflows |
| 204 | +- **PR Checks**: lint, test, build for all modules |
| 205 | +- **Release**: automated image builds and Helm chart publishing |
| 206 | +- **Vulnerability Scanning**: daily Trivy scans uploaded to Security tab |
| 207 | +- **Provenance**: SLSA attestation for all artifacts |
| 208 | + |
| 209 | +### Container Images |
| 210 | +- Built with `ko` (Go) and `docker` (Python) |
| 211 | +- Multi-architecture support (amd64, arm64) |
| 212 | +- Published to `ghcr.io/nvidia/nvsentinel/*` |
| 213 | +- Signed with Sigstore cosign |
| 214 | + |
| 215 | +## Common Tasks |
| 216 | + |
| 217 | +### Adding a New Health Monitor |
| 218 | +1. Create directory in `health-monitors/` |
| 219 | +2. Implement gRPC service from `data-models/protobufs/` |
| 220 | +3. Add Makefile with standard targets |
| 221 | +4. Create Helm chart in `distros/kubernetes/nvsentinel/charts/` |
| 222 | +5. Register in platform-connectors for CSP integration |
| 223 | +6. Add to CI pipeline workflows |
| 224 | + |
| 225 | +### Updating Dependencies |
| 226 | +```bash |
| 227 | +# Update Go dependencies |
| 228 | +make gomod-tidy |
| 229 | + |
| 230 | +# Update Python dependencies |
| 231 | +cd health-monitors/gpu-health-monitor |
| 232 | +poetry update |
| 233 | +``` |
| 234 | + |
| 235 | +### Adding New Node Labels |
| 236 | +1. Update labeler-module logic in `pkg/labeler/labeler.go` |
| 237 | +2. Add tests in `labeler_test.go` |
| 238 | +3. Document in Helm chart values |
| 239 | +4. Update KATA_TESTING.md if kata-related |
| 240 | + |
| 241 | +## Debugging Tips |
| 242 | + |
| 243 | +### Local Development |
| 244 | +- Use Tilt for rapid iteration |
| 245 | +- Check logs: `kubectl logs -n nvsentinel <pod>` |
| 246 | +- Port-forward for direct access: `kubectl port-forward -n nvsentinel svc/janitor 8080:8080` |
| 247 | +- MongoDB Compass for event store inspection |
| 248 | + |
| 249 | +### Common Issues |
| 250 | +- **Informer not syncing**: Check RBAC permissions |
| 251 | +- **Labels not updating**: Verify labeler is running and has node permissions |
| 252 | +- **DaemonSet not scheduling**: Check node selectors and labels |
| 253 | +- **Kata detection failing**: Verify input labels on nodes |
| 254 | + |
| 255 | +## References |
| 256 | + |
| 257 | +- [DEVELOPMENT.md](../DEVELOPMENT.md) - Detailed development guide |
| 258 | +- [CONTRIBUTING.md](../CONTRIBUTING.md) - Contribution guidelines |
| 259 | +- [README.md](../README.md) - Project overview |
| 260 | +- [docs/](../docs/) - Architecture documentation |
0 commit comments