Skip to content

Commit 78d1fb8

Browse files
mchmarnyCopilot
andauthored
[Feature]: Auto detect kata mode and self configure monitoring (#217)
Co-authored-by: Copilot <[email protected]>
1 parent f2c34d7 commit 78d1fb8

File tree

26 files changed

+1557
-265
lines changed

26 files changed

+1557
-265
lines changed

.github/copilot-instructions.md

Lines changed: 260 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,260 @@
1+
# GitHub Copilot Instructions for NVSentinel
2+
3+
## Project Overview
4+
5+
NVSentinel is a GPU Node Resilience System for Kubernetes that automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It's designed for high-performance computing environments running NVIDIA GPUs.
6+
7+
**Status**: Experimental/Preview Release - APIs and configurations may change.
8+
9+
## Architecture & Technologies
10+
11+
### Core Technologies
12+
- **Language**: Go 1.25+ (primary), Python 3.12+ (monitoring tools)
13+
- **Container Platform**: Kubernetes 1.25+
14+
- **Deployment**: Helm 3.0+, Tilt (development)
15+
- **Storage**: MongoDB (event store with change streams)
16+
- **Communication**: gRPC with Protocol Buffers
17+
- **GPU Monitoring**: NVIDIA DCGM (Data Center GPU Manager)
18+
19+
### Project Structure
20+
```
21+
├── health-monitors/ # Pluggable health detection modules
22+
│ ├── gpu-health-monitor/ # DCGM-based GPU monitoring (Python)
23+
│ ├── csp-health-monitor/ # Cloud provider health checks (Go)
24+
│ └── syslog-health-monitor/ # System log analysis (Go)
25+
├── health-events-analyzer/ # Event classification and routing
26+
├── fault-quarantine-module/ # Node isolation (cordon)
27+
├── node-drainer-module/ # Workload eviction
28+
├── fault-remediation-module/ # Break-fix automation
29+
├── labeler-module/ # Node labeling (DCGM version, driver status, Kata detection)
30+
├── janitor/ # State cleanup and maintenance
31+
├── platform-connectors/ # CSP integration (GCP, AWS, Azure)
32+
├── commons/ # Shared utilities
33+
├── data-models/ # Protocol Buffer definitions
34+
├── store-client-sdk/ # MongoDB client library
35+
└── distros/kubernetes/ # Helm charts
36+
```
37+
38+
## Coding Standards
39+
40+
### Go Code Guidelines
41+
42+
#### Module Organization
43+
- Each service is a separate Go module with its own `go.mod`
44+
- Use semantic import versioning
45+
- Keep dependencies minimal and up-to-date
46+
- Use `commons/` for shared utilities across modules
47+
48+
#### Code Style
49+
- Follow standard Go conventions (gofmt, golint)
50+
- Use structured logging via `log/slog`
51+
- Error handling: wrap errors with context using `fmt.Errorf("context: %w", err)`
52+
- Within `retry.RetryOnConflict` blocks, return errors **without wrapping** to preserve retry behavior
53+
- Use meaningful variable names (`synced` over `ok` for cache sync checks)
54+
55+
#### Kubernetes Integration
56+
- Use `client-go` for Kubernetes API interactions
57+
- Prefer informers over direct API calls for watching resources
58+
- Use `envtest` for testing Kubernetes controllers (not fake clients)
59+
- Implement proper shutdown handling with context cancellation
60+
61+
#### Testing Requirements
62+
- Use `testify/assert` and `testify/require` for assertions
63+
- Write table-driven tests when testing multiple scenarios
64+
- Use `envtest` for integration tests with real Kubernetes API
65+
- Test coverage: aim for >80% on critical paths
66+
- Name tests descriptively: `TestFunctionName_Scenario_ExpectedBehavior`
67+
68+
### Python Code Guidelines
69+
- Use Poetry for dependency management
70+
- Follow PEP 8 style guide
71+
- Use Black for formatting
72+
- Type hints required for all functions
73+
- Use dataclasses for structured data
74+
75+
### Protobuf Guidelines
76+
- Define messages in `data-models/protobufs/`
77+
- Use semantic versioning for breaking changes
78+
- Include comprehensive comments for all fields
79+
- Generate code with: `make protos-generate`
80+
81+
## Development Workflows
82+
83+
### Building & Testing
84+
```bash
85+
# Lint and test all modules
86+
make lint-test-all
87+
88+
# Lint specific module
89+
cd labeler-module && make lint
90+
91+
# Test specific module
92+
cd health-events-analyzer && make test
93+
94+
# Build container images (uses ko)
95+
make images
96+
97+
# View all make targets
98+
make help
99+
```
100+
101+
### Version Management
102+
- All tool versions centralized in `.versions.yaml`
103+
- Use `yq` to read versions in scripts
104+
- Update versions in one place, propagates everywhere
105+
- Check versions: `make show-versions`
106+
107+
### Local Development with Tilt
108+
```bash
109+
cd tilt
110+
tilt up # Start local development environment
111+
```
112+
113+
## Kata Containers Detection
114+
115+
The labeler-module implements Kata Containers detection:
116+
117+
### Detection Architecture
118+
- **Input labels** (on nodes): `katacontainers.io/kata-runtime` (default) + optional custom label
119+
- **Output label** (set by labeler): `nvsentinel.dgxc.nvidia.com/kata.enabled: "true"|"false"`
120+
- **Truthy values**: `"true"`, `"enabled"`, `"1"`, `"yes"` (case-insensitive)
121+
- **Lifecycle separation**: Pod events → DCGM/driver labels, Node events → kata labels
122+
123+
### DaemonSet Variants
124+
- Separate DaemonSets for kata vs regular nodes
125+
- Selection via `nodeAffinity` based on kata.enabled label
126+
- Different volume mounts:
127+
- Regular: `/var/log` (file-based logs)
128+
- Kata: `/run/log/journal` and `/var/log/journal` (systemd journal)
129+
130+
## Important Patterns
131+
132+
### Error Handling in Retry Loops
133+
```go
134+
// ✅ CORRECT - Return error as-is for retry logic
135+
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
136+
_, err := client.Update(ctx, obj, metav1.UpdateOptions{})
137+
return err // Don't wrap!
138+
})
139+
140+
// ❌ WRONG - Wrapping breaks retry detection
141+
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
142+
_, err := client.Update(ctx, obj, metav1.UpdateOptions{})
143+
return fmt.Errorf("failed: %w", err) // Breaks retry!
144+
})
145+
```
146+
147+
### Informer Usage
148+
```go
149+
// Use separate informers for different resource types
150+
podInformer := factory.Core().V1().Pods().Informer()
151+
nodeInformer := factory.Core().V1().Nodes().Informer()
152+
153+
// Extract event handler setup into helper methods
154+
func (l *Labeler) registerPodEventHandlers() error {
155+
_, err := l.podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
156+
// handlers...
157+
})
158+
return err
159+
}
160+
```
161+
162+
### Testing with envtest
163+
```go
164+
// Use real Kubernetes API server for tests
165+
testEnv := &envtest.Environment{
166+
CRDDirectoryPaths: []string{},
167+
}
168+
cfg, err := testEnv.Start()
169+
// Create clients from cfg
170+
// Run tests
171+
testEnv.Stop()
172+
```
173+
174+
## Documentation Standards
175+
176+
### Code Comments
177+
- Package-level godoc for all packages
178+
- Function comments for exported functions
179+
- Inline comments for complex logic only
180+
- TODO comments should reference issues
181+
182+
### Helm Chart Documentation
183+
- Document all values in `values.yaml` with inline comments
184+
- Include examples for non-obvious configurations
185+
- Note truthy value requirements where applicable
186+
- Explain DaemonSet variant selection logic
187+
188+
## Security & Compliance
189+
190+
### Licensing
191+
- All files must include Apache 2.0 license header
192+
- Use `make license-headers-add` to add headers
193+
- No GPL or viral licenses in dependencies
194+
195+
### Security
196+
- Report security issues via SECURITY.md process
197+
- Never commit secrets or credentials
198+
- Use Kubernetes secrets for sensitive data
199+
- Scan dependencies: automated via CI
200+
201+
## CI/CD Pipeline
202+
203+
### GitHub Actions Workflows
204+
- **PR Checks**: lint, test, build for all modules
205+
- **Release**: automated image builds and Helm chart publishing
206+
- **Vulnerability Scanning**: daily Trivy scans uploaded to Security tab
207+
- **Provenance**: SLSA attestation for all artifacts
208+
209+
### Container Images
210+
- Built with `ko` (Go) and `docker` (Python)
211+
- Multi-architecture support (amd64, arm64)
212+
- Published to `ghcr.io/nvidia/nvsentinel/*`
213+
- Signed with Sigstore cosign
214+
215+
## Common Tasks
216+
217+
### Adding a New Health Monitor
218+
1. Create directory in `health-monitors/`
219+
2. Implement gRPC service from `data-models/protobufs/`
220+
3. Add Makefile with standard targets
221+
4. Create Helm chart in `distros/kubernetes/nvsentinel/charts/`
222+
5. Register in platform-connectors for CSP integration
223+
6. Add to CI pipeline workflows
224+
225+
### Updating Dependencies
226+
```bash
227+
# Update Go dependencies
228+
make gomod-tidy
229+
230+
# Update Python dependencies
231+
cd health-monitors/gpu-health-monitor
232+
poetry update
233+
```
234+
235+
### Adding New Node Labels
236+
1. Update labeler-module logic in `pkg/labeler/labeler.go`
237+
2. Add tests in `labeler_test.go`
238+
3. Document in Helm chart values
239+
4. Update KATA_TESTING.md if kata-related
240+
241+
## Debugging Tips
242+
243+
### Local Development
244+
- Use Tilt for rapid iteration
245+
- Check logs: `kubectl logs -n nvsentinel <pod>`
246+
- Port-forward for direct access: `kubectl port-forward -n nvsentinel svc/janitor 8080:8080`
247+
- MongoDB Compass for event store inspection
248+
249+
### Common Issues
250+
- **Informer not syncing**: Check RBAC permissions
251+
- **Labels not updating**: Verify labeler is running and has node permissions
252+
- **DaemonSet not scheduling**: Check node selectors and labels
253+
- **Kata detection failing**: Verify input labels on nodes
254+
255+
## References
256+
257+
- [DEVELOPMENT.md](../DEVELOPMENT.md) - Detailed development guide
258+
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Contribution guidelines
259+
- [README.md](../README.md) - Project overview
260+
- [docs/](../docs/) - Architecture documentation

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -430,5 +430,7 @@ images.json
430430
fault-quarantine-module/fault-quarantine-module
431431
fault-remediation-module/fault-remediation-module
432432
health-events-analyzer/health-events-analyzer
433+
health-monitors/syslog-health-monitor/syslog-health-monitor
434+
labeler-module/labeler-module
433435
node-drainer-module/node-drainer-module
434436
platform-connectors/platform-connectors

commons/pkg/annotation/prefix.go

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
// Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
2+
//
3+
// Licensed under the Apache License, Version 2.0 (the "License");
4+
// you may not use this file except in compliance with the License.
5+
// You may obtain a copy of the License at
6+
//
7+
// http://www.apache.org/licenses/LICENSE-2.0
8+
//
9+
// Unless required by applicable law or agreed to in writing, software
10+
// distributed under the License is distributed on an "AS IS" BASIS,
11+
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
// See the License for the specific language governing permissions and
13+
// limitations under the License.
14+
15+
package annotation
16+
17+
import "strings"
18+
19+
// HasPrefixMatch checks if any key in the annotations map starts with any of the provided prefixes.
20+
// Returns the matching annotation key and value if found, or empty strings if no match.
21+
func HasPrefixMatch(annotations map[string]string, prefixes []string) (string, string, bool) {
22+
for key, value := range annotations {
23+
for _, prefix := range prefixes {
24+
if strings.HasPrefix(key, prefix) {
25+
return key, value, true
26+
}
27+
}
28+
}
29+
30+
return "", "", false
31+
}
32+
33+
// FindAllPrefixMatches returns all annotation keys and values that match any of the provided prefixes.
34+
// Returns a map of matching annotation keys to their values.
35+
func FindAllPrefixMatches(annotations map[string]string, prefixes []string) map[string]string {
36+
matches := make(map[string]string)
37+
38+
for key, value := range annotations {
39+
for _, prefix := range prefixes {
40+
if strings.HasPrefix(key, prefix) {
41+
matches[key] = value
42+
break
43+
}
44+
}
45+
}
46+
47+
return matches
48+
}

0 commit comments

Comments
 (0)