Commit fcfad57

make the helm chart configurable and document all the values (#246)
1 parent f0524b2 commit fcfad57

21 files changed: +1739 -808 lines changed

README.md

Lines changed: 27 additions & 137 deletions
@@ -93,7 +93,7 @@ kubectl get nodes # Verify GPU nodes are visible
 ./scripts/validate-nvsentinel.sh --version v0.2.0 --verbose
 ```
 
-> **Testing**: The example above uses default settings. For testing with simulated GPU nodes, use [`tilt/release/values-release.yaml`](tilt/release/values-release.yaml). For production, customize values for your environment.
+> **Testing**: The example above uses default settings. For production, customize values for your environment.
 
 > **Production**: By default, only health monitoring is enabled. Enable fault quarantine and remediation modules via Helm values. See [Configuration](#-configuration) below.
 
@@ -156,168 +156,58 @@ graph TB
 
 ## ⚙️ Configuration
 
-### Global Settings
+NVSentinel is highly configurable with options for each module. For complete configuration documentation, see the **[Helm Chart README](distros/kubernetes/README.md)**.
 
-Control module enablement and behavior:
+### Quick Configuration Overview
 
 ```yaml
 global:
-dryRun: false # Test mode - no actual actions
+dryRun: false # Test mode - log actions without executing
 
 # Health Monitors (enabled by default)
 gpuHealthMonitor:
 enabled: true
 syslogHealthMonitor:
 enabled: true
 
-cspHealthMonitor:
-enabled: false # Cloud provider integration
-
 # Core Modules (disabled by default - enable for production)
-faultQuarantineModule:
+faultQuarantine:
 enabled: false
-nodeDrainerModule:
+nodeDrainer:
 enabled: false
-faultRemediationModule:
+faultRemediation:
 enabled: false
-healthEventsAnalyzer:
+janitor:
 enabled: false
+mongodbStore:
+enabled: false
 ```
 
-For detailed per-module configuration, see [Module Details](#-module-details).
+**Configuration Resources**:
+- **[Helm Chart Configuration Guide](distros/kubernetes/README.md#configuration)**: Complete configuration reference
+- **[values-full.yaml](distros/kubernetes/nvsentinel/values-full.yaml)**: Detailed reference with all options
+- **[values.yaml](distros/kubernetes/nvsentinel/values.yaml)**: Default values
 
 ## 📦 Module Details
 
-### 🔍 Health Monitors
-
-#### GPU Health Monitor
-Monitors GPU hardware health via DCGM - detects thermal issues, ECC errors, and XID events.
+For detailed module configuration, see the **[Helm Chart Configuration Guide](distros/kubernetes/README.md#module-specific-configuration)**.
 
-**Key Configuration**:
-```yaml
-global:
-gpuHealthMonitor:
-enabled: true
-useHostNetworking: false # Enable for direct DCGM access
-dcgm:
-service:
-endpoint: "nvidia-dcgm.gpu-operator.svc"
-port: 5555
-```
-
-#### Syslog Health Monitor
-Analyzes system logs for hardware and software fault patterns via journalctl.
-
-**Key Configuration**:
-```yaml
-global:
-syslogHealthMonitor:
-enabled: true
-pollingInterval: "30m"
-stateFile: "/var/run/syslog_health_monitor/state.json"
-```
-
-#### CSP Health Monitor
-Integrates with cloud provider APIs (GCP/AWS) for maintenance events.
+### 🔍 Health Monitors
 
-**Key Configuration**:
-```yaml
-global:
-cspHealthMonitor:
-enabled: false
-cspName: "gcp" # or "aws"
-configToml:
-maintenanceEventPollIntervalSeconds: 60
-```
+- **GPU Health Monitor**: Monitors GPU hardware health via DCGM - detects thermal issues, ECC errors, and XID events
+- **Syslog Health Monitor**: Analyzes system logs for hardware and software fault patterns via journalctl
+- **CSP Health Monitor**: Integrates with cloud provider APIs (GCP/AWS) for maintenance events
 
 ### 🏗️ Core Modules
 
-#### Platform Connectors
-Receives health events from monitors via gRPC, persists to MongoDB, and updates Kubernetes node status.
-
-**Key Configuration**:
-```yaml
-platformConnector:
-mongodbStore:
-enabled: true
-connectionString: "mongodb://nvsentinel-mongodb:27017"
-```
-
-#### Fault Quarantine Module
-Watches MongoDB for health events and cordons nodes based on configurable rules.
-
-**Key Configuration**:
-```yaml
-global:
-faultQuarantineModule:
-enabled: false
-config: |
-[[rule-sets]]
-name = "GPU fatal error ruleset"
-[[rule-sets.match.all]]
-expression = "event.isFatal == true"
-[rule-sets.cordon]
-shouldCordon = true
-```
-
-#### Node Drainer Module
-Gracefully evicts workloads from cordoned nodes with configurable policies.
-
-**Key Configuration**:
-```yaml
-global:
-nodeDrainerModule:
-enabled: false
-config: |
-evictionTimeoutInSeconds = "60"
-[[userNamespaces]]
-name = "runai-*"
-mode = "AllowCompletion"
-```
-
-#### Fault Remediation Module
-Triggers external break-fix systems after drain completion.
-
-**Key Configuration**:
-```yaml
-global:
-faultRemediationModule:
-enabled: false
-maintenanceResource:
-apiGroup: "janitor.dgxc.nvidia.com"
-namespace: "dgxc-janitor"
-```
-
-#### Health Events Analyzer
-Analyzes event patterns and generates recommended actions.
-
-**Key Configuration**:
-```yaml
-global:
-healthEventsAnalyzer:
-enabled: false
-config: |
-[[rules]]
-name = "XID Pattern Detection"
-time_window = "30m"
-recommended_action = "COMPONENT_RESET"
-```
-
-#### MongoDB Store
-Persistent storage for health events with real-time change streams.
-
-**Key Configuration**:
-```yaml
-mongodb:
-architecture: replicaset
-replicaCount: 3
-auth:
-enabled: true
-tls:
-enabled: true
-mTLS:
-enabled: true
-```
+- **Platform Connectors**: Receives health events from monitors via gRPC, persists to MongoDB, and updates Kubernetes node status
+- **Fault Quarantine**: Watches MongoDB for health events and cordons nodes based on configurable CEL rules
+- **Node Drainer**: Gracefully evicts workloads from cordoned nodes with per-namespace eviction strategies
+- **Fault Remediation**: Triggers external break-fix systems by creating maintenance CRDs after drain completion
+- **Janitor**: Executes node reboots and terminations via cloud provider APIs
+- **Health Events Analyzer**: Analyzes event patterns and generates recommended actions
+- **MongoDB Store**: Persistent storage for health events with real-time change streams
+- **Labeler**: Automatically labels nodes with DCGM and driver versions
 
 ## 📋 Requirements
 

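Note on the renamed keys: this commit renames the module toggles in the README (`faultQuarantineModule` to `faultQuarantine`, `nodeDrainerModule` to `nodeDrainer`, `faultRemediationModule` to `faultRemediation`) and adds `janitor` and `mongodbStore`. As a rough sketch only, an override file that turns on the quarantine and remediation path with the new names might look like the following. The file name is hypothetical, and the nesting under `global:` is an assumption based on the per-module snippets removed in this commit, so check `distros/kubernetes/nvsentinel/values.yaml` for the authoritative layout.

```yaml
# values-production.yaml (hypothetical file name, not part of this commit)
# Assumption: module toggles nest under `global:`; verify against values.yaml.
global:
  dryRun: true  # test mode: log actions without executing them

  # Health monitors (enabled by default, shown here for completeness)
  gpuHealthMonitor:
    enabled: true
  syslogHealthMonitor:
    enabled: true

  # Core modules (disabled by default)
  faultQuarantine:
    enabled: true
  nodeDrainer:
    enabled: true
  faultRemediation:
    enabled: true
  janitor:
    enabled: true
  mongodbStore:
    enabled: true  # Fault Quarantine watches MongoDB for health events
```

Keeping `dryRun: true` for a first rollout should let the logged actions be reviewed before switching it to `false`. Existing override files that still use the old `*Module` key names would presumably need the same renames.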