
Commit b3901d4

Merge branch 'main' into feat/attestation-policies
2 parents f4556ae + 43081b9 commit b3901d4

File tree

16 files changed (+1377, -131 lines)


distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl

Lines changed: 2 additions & 8 deletions
@@ -110,8 +110,6 @@ spec:
      image: "{{ $root.Values.global.syslogHealthMonitor.image.repository }}:{{ $root.Values.global.image.tag | default $root.Chart.AppVersion }}"
      imagePullPolicy: {{ $root.Values.global.image.pullPolicy }}
      args:
-       - "--config-file"
-       - "/etc/syslog-monitor/log_check_definitions.yaml"
        - "--polling-interval"
        - "{{ $root.Values.pollingInterval }}"
        - "--metrics-port"
@@ -122,6 +120,8 @@ spec:
        - "--xid-analyser-endpoint"
        - "http://localhost:8080"
        {{- end }}
+       - "--checks"
+       - "{{ join "," $root.Values.enabledChecks }}"
      resources:
        {{- toYaml $root.Values.resources | nindent 12 }}
      ports:
@@ -150,9 +150,6 @@ spec:
            apiVersion: v1
            fieldPath: spec.nodeName
      volumeMounts:
-       - name: config-volume
-         mountPath: /etc/syslog-monitor
-         readOnly: true
        - name: var-run-vol
          mountPath: /var/run/
        - name: syslog-state-vol
@@ -200,9 +197,6 @@ spec:
          value: "8080"
        {{- end }}
      volumes:
-       - name: config-volume
-         configMap:
-           name: {{ include "syslog-health-monitor.fullname" $root }}
        - name: var-run-vol
          hostPath:
            path: /var/run/nvsentinel

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/configmap.yaml

Lines changed: 0 additions & 48 deletions
This file was deleted.

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml

Lines changed: 5 additions & 0 deletions
@@ -75,3 +75,8 @@ driverWatcher:
    requests:
      cpu: 50m
      memory: 64Mi
+
+enabledChecks:
+  - SysLogsXIDError
+  - SysLogsSXIDError
+  - SysLogsGPUFallenOff
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# ADR-001: Architecture — Health Event Detection Interface

## Context

Hardware failures in accelerated computing clusters need to be detected quickly and acted upon to maintain system reliability. The system consists of multiple health monitoring components (GPU monitors, network monitors, switch monitors, etc.) that need to report failures to a central processing system.

The primary challenge is creating a clean separation between the detection logic (which may be vendor-specific or hardware-specific) and the platform-specific handling logic (which understands Kubernetes, cloud providers, etc.). This separation allows:
- Health monitors to be developed and maintained independently
- Easy addition of new types of health monitors
- Platform-agnostic health monitor binaries

Three architectural options were considered:
1. Direct Kubernetes API writes from monitors
2. Shared database for communication
3. gRPC-based interface with platform connectors

## Decision

Use a gRPC-based interface where health monitors report events to platform connectors over Unix Domain Sockets (UDS). Health monitors are standalone daemons that detect issues and encode events using Protocol Buffers, then send them via gRPC to platform connectors that translate events into platform-specific actions.

## Implementation

- Health monitors run as DaemonSet pods on every node
- Each monitor implements a gRPC client that connects to a Unix Domain Socket
- Platform connectors expose a gRPC server listening on the UDS at `/var/run/nvsentinel/platform-connector.sock`
- The interface uses the `HealthEventOccuredV1` RPC with `HealthEvents` message type
- Events include: agent name, component class, check name, fatality flag, error codes, impacted entities, and recommended actions
- All communication happens locally on the node - no network calls required

Key interface fields:
```
message HealthEvent {
  string agent                          // monitor name (e.g., "GPUHealthMonitor")
  string componentClass                 // component type (e.g., "GPU", "NIC")
  string checkName                      // specific check executed
  bool isFatal                          // requires immediate action
  bool isHealthy                        // current health status
  string message                        // human-readable description
  RecommendedAction recommendedAction   // suggested remediation
  repeated string errorCode             // machine-readable codes
  repeated Entity entitiesImpacted      // affected hardware (GPU UUIDs, etc.)
  map<string, string> metadata          // additional context
  google.protobuf.Timestamp generatedTimestamp
  string nodeName
}
```
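
As an illustration of the client side, a minimal Go sketch of a monitor dialing the UDS and calling the `HealthEventOccuredV1` RPC might look like this. The generated package (`pb`), its import path, the client constructor, and the exact field set of `HealthEvents` are assumptions for illustration; only the socket path, the RPC name, and the message/field names above come from this ADR.

```go
// Hypothetical health monitor client: dials the platform connector's UDS
// and reports a single event. Generated stubs are assumed to live in "pb".
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/timestamppb"

	pb "example.com/nvsentinel/gen/healthevent/v1" // assumed import path
)

func main() {
	// gRPC-Go resolves "unix://" targets to a local socket dial.
	conn, err := grpc.Dial(
		"unix:///var/run/nvsentinel/platform-connector.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial platform connector: %v", err)
	}
	defer conn.Close()

	client := pb.NewPlatformConnectorClient(conn) // assumed generated client name

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Field names mirror the ADR's interface sketch; the Events wrapper and
	// Version field follow the HealthEvents message mentioned above.
	_, err = client.HealthEventOccuredV1(ctx, &pb.HealthEvents{
		Version: "1.0.0",
		Events: []*pb.HealthEvent{{
			Agent:              "GPUHealthMonitor",
			ComponentClass:     "GPU",
			CheckName:          "SysLogsXIDError",
			IsFatal:            true,
			IsHealthy:          false,
			Message:            "XID error observed for GPU 0",
			ErrorCode:          []string{"79"},
			GeneratedTimestamp: timestamppb.Now(),
			NodeName:           "node-0",
		}},
	})
	if err != nil {
		log.Fatalf("report health event: %v", err)
	}
}
```
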
## Rationale

- **Loose coupling**: Health monitors don't need Kubernetes client libraries or cloud provider SDKs
- **Language agnostic**: Protocol Buffers and gRPC support many languages
- **Simple deployment**: Unix Domain Sockets don't require network configuration or service discovery
- **Security**: Local socket communication eliminates network attack surface
- **Performance**: UDS provides high throughput with low latency for local IPC

## Consequences

### Positive
- Health monitors can be written in any language (Python for GPU monitoring, C++ for low-level hardware)
- New health monitors can be added without modifying platform connectors
- Testing is simplified - monitors can be tested independently
- Binary portability - same monitor binary works across different platforms
- No authentication needed for local socket communication

### Negative
- Requires Unix Domain Socket volume mounts in pod specifications
- Additional abstraction layer adds complexity
- Health monitors and platform connectors must both be running
- Protocol Buffer schema changes require coordination

### Mitigations
- Use semantic versioning in the `HealthEvents.version` field
- Platform connectors maintain in-memory cache until health monitors connect
- Include retry logic in health monitors with exponential backoff (a sketch follows this list)
- Monitor pod anti-affinity rules prevent scheduling issues
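
A minimal sketch of the retry mitigation, assuming the monitor wraps its delivery call in a helper like the one below; the initial delay and cap are illustrative, not prescribed by this ADR.

```go
package monitor

import (
	"context"
	"fmt"
	"time"
)

// sendWithBackoff retries a delivery function with capped exponential backoff
// until it succeeds or the context is cancelled. The sendEvent callback stands
// in for the gRPC call sketched earlier; it is not part of any published API.
func sendWithBackoff(ctx context.Context, sendEvent func(context.Context) error) error {
	backoff := 500 * time.Millisecond
	const maxBackoff = 30 * time.Second

	for attempt := 1; ; attempt++ {
		err := sendEvent(ctx)
		if err == nil {
			return nil
		}
		// Wait before retrying, but give up promptly if the context ends.
		select {
		case <-ctx.Done():
			return fmt.Errorf("attempt %d failed and context ended: %w", attempt, err)
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```
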
## Alternatives Considered

### Direct Kubernetes API Integration
**Rejected** because: Health monitors would require Kubernetes client libraries, making them platform-dependent. This would increase binary size, add complexity, and make monitors harder to test independently. Additionally, it would require managing service account tokens and RBAC policies for every monitor.

### Shared Database Communication
**Rejected** because: Introducing a database dependency for every health monitor adds operational complexity. Monitors would need database drivers, connection management, and retry logic. It also creates a single point of failure and requires network connectivity for local communication.

## Notes

- Health monitors should implement graceful shutdown to drain pending events
- The gRPC interface is versioned to support future extensions
- Events are fire-and-forget from the monitor's perspective - the platform connector handles persistence and retries
- For testing, a mock platform connector can record events to files (a server sketch follows this list)
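
For the testing note above, a mock connector could look roughly like the sketch below. The generated server types (`UnimplementedPlatformConnectorServer`, `RegisterPlatformConnectorServer`, `HealthEventResponse`) and the output path are assumptions; only the socket path and RPC name come from this ADR.

```go
// Hypothetical mock platform connector for tests: serves the gRPC interface
// on a UDS and appends each received event batch to a JSONL file.
package main

import (
	"context"
	"encoding/json"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"

	pb "example.com/nvsentinel/gen/healthevent/v1" // assumed import path
)

type mockConnector struct {
	pb.UnimplementedPlatformConnectorServer // assumed generated base type
	out *os.File
}

func (m *mockConnector) HealthEventOccuredV1(ctx context.Context, in *pb.HealthEvents) (*pb.HealthEventResponse, error) {
	// Record the batch as one JSON line so tests can assert on it later.
	if err := json.NewEncoder(m.out).Encode(in); err != nil {
		return nil, err
	}
	return &pb.HealthEventResponse{}, nil // assumed response message
}

func main() {
	sock := "/var/run/nvsentinel/platform-connector.sock"
	_ = os.Remove(sock) // clean up a stale socket from a previous run

	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatalf("listen on %s: %v", sock, err)
	}

	out, err := os.Create("/tmp/health-events.jsonl") // illustrative output path
	if err != nil {
		log.Fatalf("create output file: %v", err)
	}

	srv := grpc.NewServer()
	pb.RegisterPlatformConnectorServer(srv, &mockConnector{out: out})
	log.Printf("mock platform connector listening on %s", sock)
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```
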
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
# ADR-002: Infrastructure — Storage Layer Selection

## Context

The system needs persistent storage for health events that multiple components can read and react to in real-time. The Fault Quarantine, Node Drainer, and Fault Remediation modules all need to watch for new health events and take action accordingly.

Key requirements:
1. **Strong consistency**: All components must see the same view of events at any point in time
2. **High availability**: No single point of failure; survive node failures
3. **Watch/notification mechanism**: Components need to be notified when new events arrive
4. **Robust ecosystem**: Well-maintained with good documentation and client libraries

Storage candidates evaluated: etcd, MongoDB, Redis, CouchDB, and Cassandra.

## Decision

Use MongoDB with replica sets as the storage layer for health events. MongoDB provides strong consistency through configurable write concerns, high availability through automatic failover, and real-time notifications through Change Streams.

## Implementation

- Deploy MongoDB as a StatefulSet with 3 replicas for high availability
- Use majority write concern (`w: majority`) to ensure data durability
- Use majority read concern to prevent reading stale data
- Configure Change Streams for real-time event notifications to downstream modules
- Use transaction-based operations for atomic multi-event insertions
- Store health events as documents with indexes on: nodeName, timestamp, isFatal, componentClass
- Implement in-memory caching in the MongoDB connector to reduce duplicate writes

Deployment in the cluster:
```
mongodb-0 [Primary]
mongodb-1 [Secondary]
mongodb-2 [Secondary]
```

Key operations (a Go sketch follows this list):
- Platform connectors insert health events with majority write concern
- Quarantine/Drainer/Remediation modules establish Change Stream watches
- Aggregation pipelines used for complex event correlation
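
A sketch of the two key operations using the official MongoDB Go driver (v1 API): a majority-acknowledged insert and a Change Stream watch. The connection string, database/collection names, and document fields are placeholders, not the project's actual schema.

```go
// Sketch of a majority-acknowledged insert and a Change Stream watch.
// Names and the document shape are placeholders for illustration.
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readconcern"
	"go.mongodb.org/mongo-driver/mongo/writeconcern"
)

func main() {
	ctx := context.Background()

	// Connect to the replica set with majority read/write concerns so every
	// acknowledged event is durable and no component reads stale data.
	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://mongodb-0,mongodb-1,mongodb-2/?replicaSet=rs0").
		SetWriteConcern(writeconcern.Majority()).
		SetReadConcern(readconcern.Majority()))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	events := client.Database("nvsentinel").Collection("healthEvents")

	// Platform connector side: insert a health event document.
	_, err = events.InsertOne(ctx, bson.M{
		"nodeName":       "node-0",
		"componentClass": "GPU",
		"checkName":      "SysLogsXIDError",
		"isFatal":        true,
		"timestamp":      time.Now(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Consumer side (quarantine/drainer/remediation): watch new fatal events.
	pipeline := mongo.Pipeline{
		{{Key: "$match", Value: bson.D{{Key: "fullDocument.isFatal", Value: true}}}},
	}
	stream, err := events.Watch(ctx, pipeline,
		options.ChangeStream().SetFullDocument(options.UpdateLookup))
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close(ctx)

	for stream.Next(ctx) {
		var change bson.M
		if err := stream.Decode(&change); err != nil {
			log.Fatal(err)
		}
		log.Printf("change event: %v", change["fullDocument"])
		// stream.ResumeToken() can be persisted to resume after a disconnect.
	}
}
```
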
## Rationale

- **Strong consistency**: Write concern `majority` ensures data is persisted to most replicas before acknowledging
- **Automatic failover**: Replica sets automatically promote a secondary to primary on failure
- **Change Streams**: Provide resumable, real-time notifications of document changes without polling
- **Rich querying**: Aggregation pipelines enable complex event correlation (e.g., counting repeated non-fatal events)
- **Mature Go support**: Official MongoDB Go driver is well-maintained and feature-complete
- **Operational experience**: MongoDB is widely deployed and understood by operations teams

## Consequences

### Positive
- Components receive events in real-time without polling
- Replica sets provide automatic recovery from node failures
- Aggregation framework enables sophisticated event analysis
- Change Streams are resumable - clients can recover from disconnections without missing events
- Official Go driver simplifies development

### Negative
- MongoDB requires more resources than simpler key-value stores
- Stateful deployment requires persistent volumes
- Operators need MongoDB operational knowledge
- Replica set coordination adds latency vs single-node writes

### Mitigations
- Set resource limits appropriate for cluster size
- Use local SSDs for persistent volumes to reduce latency
- Provide monitoring dashboards and runbooks for common operations
- Implement connection pooling in clients to reduce overhead
- Use in-memory caching to minimize database writes

## Alternatives Considered

### etcd
**Rejected** because: While etcd provides strong consistency via Raft and excellent watch capabilities, it's optimized for small key-value data (typically < 1.5MB total). Health events include metadata, stack traces, and diagnostic information that can be large. etcd's performance degrades with larger values. Additionally, complex queries (like aggregating repeated events) would require client-side logic.

### Redis
**Rejected** because: Redis provides only eventual consistency through asynchronous replication. The `WAIT` command can enforce synchronous replication but doesn't provide the same consistency guarantees as consensus algorithms. More critically, Redis Pub/Sub is fire-and-forget - if a client disconnects, all events during disconnection are lost. This violates the requirement that no health events should be missed.

### CouchDB
**Rejected** because: CouchDB's changes feed can provide notifications, but the setup is complex and resuming after disconnections requires manual state management. The Go ecosystem for CouchDB is immature - there's no widely-adopted, production-ready client library. CouchDB's multi-master replication model provides eventual consistency by default, requiring additional configuration for stronger guarantees.

### Cassandra
**Rejected** because: Cassandra lacks built-in watch/notification mechanisms. Implementing event notifications would require external systems like Kafka, adding significant complexity. While Cassandra excels at write-heavy workloads, our read patterns (watching for events, running aggregations) don't align with Cassandra's strengths.

## Notes

- Change Streams require MongoDB replica sets (not standalone instances)
- For very large clusters (>10k nodes), consider sharding based on nodeName
- The health event schema should include TTL indexes to automatically clean up old events (an index sketch follows this list)
- Non-goals include using MongoDB as a general-purpose application database
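
The TTL cleanup mentioned above could be configured with an index along these lines (Go driver, v1 API); the field name, collection, connection string, and seven-day retention window are placeholders.

```go
// Placeholder TTL index: documents expire 7 days after their timestamp field.
// Field name, collection, and retention window are illustrative only.
package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func ensureTTLIndex(ctx context.Context, events *mongo.Collection) error {
	_, err := events.Indexes().CreateOne(ctx, mongo.IndexModel{
		Keys:    bson.D{{Key: "timestamp", Value: 1}},
		Options: options.Index().SetExpireAfterSeconds(7 * 24 * 3600),
	})
	return err
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://mongodb-0/?replicaSet=rs0"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	if err := ensureTTLIndex(ctx, client.Database("nvsentinel").Collection("healthEvents")); err != nil {
		log.Fatal(err)
	}
}
```
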

0 commit comments
