Commit fed657f

chore: restoring the syslog kata flag, increase cert ready delay, copilot instruction
1 parent e77ccfa commit fed657f

File tree

3 files changed: +288 −5 lines changed


.github/copilot-instructions.md

Lines changed: 260 additions & 0 deletions
# GitHub Copilot Instructions for NVSentinel

## Project Overview

NVSentinel is a GPU Node Resilience System for Kubernetes that automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It is designed for high-performance computing environments running NVIDIA GPUs.

**Status**: Experimental/Preview Release - APIs and configurations may change.
## Architecture & Technologies

### Core Technologies

- **Language**: Go 1.25+ (primary), Python 3.12+ (monitoring tools)
- **Container Platform**: Kubernetes 1.25+
- **Deployment**: Helm 3.0+, Tilt (development)
- **Storage**: MongoDB (event store with change streams)
- **Communication**: gRPC with Protocol Buffers
- **GPU Monitoring**: NVIDIA DCGM (Data Center GPU Manager)
### Project Structure

```
├── health-monitors/           # Pluggable health detection modules
│   ├── gpu-health-monitor/    # DCGM-based GPU monitoring (Python)
│   ├── csp-health-monitor/    # Cloud provider health checks (Go)
│   └── syslog-health-monitor/ # System log analysis (Go)
├── health-events-analyzer/    # Event classification and routing
├── fault-quarantine-module/   # Node isolation (cordon)
├── node-drainer-module/       # Workload eviction
├── fault-remediation-module/  # Break-fix automation
├── labeler-module/            # Node labeling (DCGM version, driver status, Kata detection)
├── janitor/                   # State cleanup and maintenance
├── platform-connectors/       # CSP integration (GCP, AWS, Azure)
├── commons/                   # Shared utilities
├── data-models/               # Protocol Buffer definitions
├── store-client-sdk/          # MongoDB client library
└── distros/kubernetes/        # Helm charts
```
## Coding Standards

### Go Code Guidelines

#### Module Organization

- Each service is a separate Go module with its own `go.mod`
- Use semantic import versioning
- Keep dependencies minimal and up to date
- Use `commons/` for shared utilities across modules

#### Code Style

- Follow standard Go conventions (gofmt, golint)
- Use structured logging via `log/slog`
- Error handling: wrap errors with context using `fmt.Errorf("context: %w", err)`
- Within `retry.RetryOnConflict` blocks, return errors **without wrapping** to preserve retry behavior
- Use meaningful variable names (`synced` over `ok` for cache sync checks)

#### Kubernetes Integration

- Use `client-go` for Kubernetes API interactions
- Prefer informers over direct API calls for watching resources
- Use `envtest` for testing Kubernetes controllers (not fake clients)
- Implement proper shutdown handling with context cancellation

#### Testing Requirements

- Use `testify/assert` and `testify/require` for assertions
- Write table-driven tests when testing multiple scenarios
- Use `envtest` for integration tests with real Kubernetes API
- Test coverage: aim for >80% on critical paths
- Name tests descriptively: `TestFunctionName_Scenario_ExpectedBehavior`

### Python Code Guidelines

- Use Poetry for dependency management
- Follow the PEP 8 style guide
- Use Black for formatting
- Type hints required for all functions
- Use dataclasses for structured data

### Protobuf Guidelines

- Define messages in `data-models/protobufs/`
- Use semantic versioning for breaking changes
- Include comprehensive comments for all fields
- Generate code with: `make protos-generate`
## Development Workflows

### Building & Testing

```bash
# Lint and test all modules
make lint-test-all

# Lint a specific module
cd labeler-module && make lint

# Test a specific module
cd health-events-analyzer && make test

# Build container images (uses ko)
make images

# View all make targets
make help
```

### Version Management

- All tool versions are centralized in `.versions.yaml`
- Use `yq` to read versions in scripts
- Update a version in one place and it propagates everywhere
- Check versions: `make show-versions`

### Local Development with Tilt

```bash
cd tilt
tilt up  # Start local development environment
```
## Kata Containers Detection

The labeler-module implements Kata Containers detection:

### Detection Architecture

- **Input labels** (on nodes): `katacontainers.io/kata-runtime` (default) + optional custom label
- **Output label** (set by labeler): `nvsentinel.dgxc.nvidia.com/kata.enabled: "true"|"false"`
- **Truthy values**: `"true"`, `"enabled"`, `"1"`, `"yes"` (case-insensitive)
- **Lifecycle separation**: Pod events → DCGM/driver labels, Node events → kata labels
### DaemonSet Variants

- Separate DaemonSets for kata vs regular nodes
- Selection via `nodeAffinity` based on the kata.enabled label
- Different volume mounts:
  - Regular: `/var/log` (file-based logs)
  - Kata: `/run/log/journal` and `/var/log/journal` (systemd journal)
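The `nodeAffinity` selection for the kata variant might look like the following sketch. Only the `nvsentinel.dgxc.nvidia.com/kata.enabled` label comes from this document; the surrounding structure is the standard Kubernetes affinity schema, and where it lives in the chart's values is an assumption:

```yaml
# Illustrative affinity for the kata DaemonSet variant (not the chart's
# actual values layout). The regular variant would match values: ["false"]
# or omit the requirement.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvsentinel.dgxc.nvidia.com/kata.enabled
              operator: In
              values: ["true"]
```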
## Important Patterns

### Error Handling in Retry Loops

```go
// ✅ CORRECT - Return the error as-is so conflict detection works
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	_, err := client.Update(ctx, obj, metav1.UpdateOptions{})
	return err // Don't wrap!
})

// ❌ WRONG - Wrapping hides the conflict error from the retry logic
err = retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	_, err := client.Update(ctx, obj, metav1.UpdateOptions{})
	return fmt.Errorf("failed: %w", err) // Breaks retry!
})
```

### Informer Usage

```go
// Use separate informers for different resource types
podInformer := factory.Core().V1().Pods().Informer()
nodeInformer := factory.Core().V1().Nodes().Informer()

// Extract event handler setup into helper methods
func (l *Labeler) registerPodEventHandlers() error {
	_, err := l.podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// handlers...
	})
	return err
}
```

### Testing with envtest

```go
// Use a real Kubernetes API server for tests
testEnv := &envtest.Environment{
	CRDDirectoryPaths: []string{},
}
cfg, err := testEnv.Start()
require.NoError(t, err)
defer testEnv.Stop()
// Create clients from cfg and run tests
```

## Documentation Standards

### Code Comments

- Package-level godoc for all packages
- Function comments for exported functions
- Inline comments for complex logic only
- TODO comments should reference issues

### Helm Chart Documentation

- Document all values in `values.yaml` with inline comments
- Include examples for non-obvious configurations
- Note truthy value requirements where applicable
- Explain DaemonSet variant selection logic
## Security & Compliance

### Licensing

- All files must include the Apache 2.0 license header
- Use `make license-headers-add` to add headers
- No GPL or viral licenses in dependencies

### Security

- Report security issues via the SECURITY.md process
- Never commit secrets or credentials
- Use Kubernetes secrets for sensitive data
- Scan dependencies: automated via CI
## CI/CD Pipeline

### GitHub Actions Workflows

- **PR Checks**: lint, test, build for all modules
- **Release**: automated image builds and Helm chart publishing
- **Vulnerability Scanning**: daily Trivy scans uploaded to the Security tab
- **Provenance**: SLSA attestation for all artifacts

### Container Images

- Built with `ko` (Go) and `docker` (Python)
- Multi-architecture support (amd64, arm64)
- Published to `ghcr.io/nvidia/nvsentinel/*`
- Signed with Sigstore cosign
## Common Tasks

### Adding a New Health Monitor

1. Create a directory in `health-monitors/`
2. Implement the gRPC service from `data-models/protobufs/`
3. Add a Makefile with standard targets
4. Create a Helm chart in `distros/kubernetes/nvsentinel/charts/`
5. Register in platform-connectors for CSP integration
6. Add to CI pipeline workflows

### Updating Dependencies

```bash
# Update Go dependencies
make gomod-tidy

# Update Python dependencies
cd health-monitors/gpu-health-monitor
poetry update
```

### Adding New Node Labels

1. Update the labeler-module logic in `pkg/labeler/labeler.go`
2. Add tests in `labeler_test.go`
3. Document in Helm chart values
4. Update KATA_TESTING.md if kata-related
## Debugging Tips

### Local Development

- Use Tilt for rapid iteration
- Check logs: `kubectl logs -n nvsentinel <pod>`
- Port-forward for direct access: `kubectl port-forward -n nvsentinel svc/janitor 8080:8080`
- Use MongoDB Compass for event store inspection

### Common Issues

- **Informer not syncing**: Check RBAC permissions
- **Labels not updating**: Verify the labeler is running and has node permissions
- **DaemonSet not scheduling**: Check node selectors and labels
- **Kata detection failing**: Verify input labels on nodes
## References

- [DEVELOPMENT.md](../DEVELOPMENT.md) - Detailed development guide
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Contribution guidelines
- [README.md](../README.md) - Project overview
- [docs/](../docs/) - Architecture documentation

health-monitors/syslog-health-monitor/main.go

Lines changed: 12 additions & 1 deletion

```diff
@@ -63,6 +63,8 @@ var (
 	metricsPort = flag.String("metrics-port", "2112", "Port to expose Prometheus metrics on")
 	xidAnalyserEndpoint = flag.String("xid-analyser-endpoint", "",
 		"Endpoint to the XID analyser service.")
+	kataEnabled = flag.String("kata-enabled", "false",
+		"Indicates if this monitor is running in Kata Containers mode (set by DaemonSet variant).")
 )
 
 // ConfigFile matches the top-level structure of the YAML config file
@@ -90,7 +92,7 @@ func run() error {
 		return fmt.Errorf("NODE_NAME env not set and --node-name flag not provided, cannot run")
 	}
 
-	slog.Info("Configuration", "node", nodeName)
+	slog.Info("Configuration", "node", nodeName, "kata-enabled", *kataEnabled)
 
 	// Root context canceled on SIGINT/SIGTERM so goroutines can exit cleanly.
 	root := context.Background()
@@ -129,6 +131,15 @@ func run() error {
 		return fmt.Errorf("no checks defined in the config file")
 	}
 
+	// Add kata-specific journal tags if running in Kata mode
+	if *kataEnabled == "true" {
+		slog.Info("Kata mode enabled, adding containerd service filter to journal checks")
+		for i := range config.Checks {
+			// Add "-u containerd.service" tag to filter for containerd logs in systemd journal
+			config.Checks[i].Tags = append(config.Checks[i].Tags, "-u", "containerd.service")
+		}
+	}
+
 	slog.Info("Creating syslog monitor", "checksCount", len(config.Checks))
 
 	fdHealthMonitor, err := fd.NewSyslogMonitor(
```

platform-connectors/pkg/connectors/store/store_connector_test.go

Lines changed: 16 additions & 4 deletions

```diff
@@ -439,13 +439,25 @@ func TestPollTillCACertIsSuccessfullyMounted(t *testing.T) {
 		certPath := filepath.Join(tempDir, "ca.crt")
 		certContent := []byte("delayed certificate content")
 
+		// Write cert after a delay to simulate delayed mount
+		written := make(chan struct{})
 		go func() {
-			time.Sleep(1 * time.Second)
-			os.WriteFile(certPath, certContent, 0644)
+			time.Sleep(1500 * time.Millisecond) // Increase delay to ensure polling starts first
+			err := os.WriteFile(certPath, certContent, 0600)
+			if err != nil {
+				t.Logf("Failed to write cert file: %v", err)
+			}
+			close(written)
 		}()
 
-		result, err := pollTillCACertIsMountedSuccessfully(certPath, 5*time.Second, 500*time.Millisecond)
+		// Should retry and eventually read the cert
+		result, err := pollTillCACertIsMountedSuccessfully(certPath, 10*time.Second, 300*time.Millisecond)
+
+		// Wait for goroutine to finish
+		<-written
+
 		require.NoError(t, err)
-		require.Equal(t, certContent, result)
+		require.NotEmpty(t, result, "expected cert content but got empty slice")
+		require.Equal(t, certContent, result, "cert content mismatch")
 	})
 }
```
