# ADR-001: Architecture — Health Event Detection Interface

## Context

Hardware failures in accelerated computing clusters must be detected quickly and acted upon to maintain system reliability. The system consists of multiple health monitoring components (GPU monitors, network monitors, switch monitors, etc.) that report failures to a central processing system.

The primary challenge is creating a clean separation between the detection logic (which may be vendor- or hardware-specific) and the platform-specific handling logic (which understands Kubernetes, cloud providers, etc.). This separation allows:
- Health monitors to be developed and maintained independently
- Easy addition of new types of health monitors
- Platform-agnostic health monitor binaries

Three architectural options were considered:
1. Direct Kubernetes API writes from monitors
2. Shared database for communication
3. gRPC-based interface with platform connectors

## Decision

Use a gRPC-based interface in which health monitors report events to platform connectors over Unix Domain Sockets (UDS). Health monitors are standalone daemons that detect issues, encode events using Protocol Buffers, and send them via gRPC to platform connectors, which translate the events into platform-specific actions.

## Implementation

- Health monitors run as DaemonSet pods on every node
- Each monitor implements a gRPC client that connects to a Unix Domain Socket
- Platform connectors expose a gRPC server listening on the UDS at `/var/run/nvsentinel/platform-connector.sock`
- The interface uses the `HealthEventOccuredV1` RPC with the `HealthEvents` message type
- Events include: agent name, component class, check name, fatality flag, error codes, impacted entities, and recommended actions
- All communication happens locally on the node; no network calls are required

Key interface fields (field numbers below are illustrative, not normative):
```proto
message HealthEvent {
  string agent = 1;                                   // monitor name (e.g., "GPUHealthMonitor")
  string componentClass = 2;                          // component type (e.g., "GPU", "NIC")
  string checkName = 3;                               // specific check executed
  bool isFatal = 4;                                   // requires immediate action
  bool isHealthy = 5;                                 // current health status
  string message = 6;                                 // human-readable description
  RecommendedAction recommendedAction = 7;            // suggested remediation
  repeated string errorCode = 8;                      // machine-readable codes
  repeated Entity entitiesImpacted = 9;               // affected hardware (GPU UUIDs, etc.)
  map<string, string> metadata = 10;                  // additional context
  google.protobuf.Timestamp generatedTimestamp = 11;  // when the event was detected
  string nodeName = 12;                               // node where the event occurred
}
```

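To make the flow concrete, here is a minimal sketch of a monitor-side client in Go. The generated package path and stub names (`pb.NewPlatformConnectorClient`, the `Events` and `Version` fields of the `HealthEvents` wrapper) are assumptions for illustration; only the RPC name, the message fields above, and the socket path come from this ADR.

```go
// Hypothetical monitor-side client; stub and wrapper field names are
// assumed, not taken from real generated code.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/timestamppb"

	pb "example.com/nvsentinel/gen/healthevent/v1" // hypothetical generated package
)

func main() {
	// grpc-go understands the unix:// scheme, so no TCP networking is involved.
	conn, err := grpc.NewClient("unix:///var/run/nvsentinel/platform-connector.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial platform connector: %v", err)
	}
	defer conn.Close()
	client := pb.NewPlatformConnectorClient(conn)

	event := &pb.HealthEvent{
		Agent:              "GPUHealthMonitor",
		ComponentClass:     "GPU",
		CheckName:          "EccErrorCheck",
		IsFatal:            true,
		IsHealthy:          false,
		Message:            "GPU reported an uncorrectable ECC error",
		ErrorCode:          []string{"ECC_UNCORRECTABLE"}, // illustrative code
		GeneratedTimestamp: timestamppb.Now(),
		NodeName:           "node-0",
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The ADR names the RPC HealthEventOccuredV1 carrying a HealthEvents message.
	_, err = client.HealthEventOccuredV1(ctx, &pb.HealthEvents{
		Version: "1.0.0",
		Events:  []*pb.HealthEvent{event},
	})
	if err != nil {
		log.Fatalf("report health event: %v", err)
	}
}
```

Because the socket is node-local, the same monitor binary works unchanged on any platform whose connector exposes this UDS.
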
## Rationale

- **Loose coupling**: Health monitors don't need Kubernetes client libraries or cloud provider SDKs
- **Language agnostic**: Protocol Buffers and gRPC support many languages
- **Simple deployment**: Unix Domain Sockets don't require network configuration or service discovery
- **Security**: Local socket communication eliminates the network attack surface
- **Performance**: UDS provides high throughput with low latency for local IPC

## Consequences

### Positive
- Health monitors can be written in any language (Python for GPU monitoring, C++ for low-level hardware)
- New health monitors can be added without modifying platform connectors
- Testing is simplified: monitors can be tested independently
- Binary portability: the same monitor binary works across different platforms
- No authentication is needed for local socket communication

### Negative
- Requires Unix Domain Socket volume mounts in pod specifications
- The additional abstraction layer adds complexity
- Health monitors and platform connectors must both be running
- Protocol Buffer schema changes require coordination

### Mitigations
- Use semantic versioning in the `HealthEvents.version` field
- Platform connectors maintain an in-memory cache until health monitors connect
- Include retry logic in health monitors with exponential backoff (see the sketch after this list)
- Monitor pod anti-affinity rules prevent scheduling issues

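A hedged sketch of the backoff mitigation, reusing the assumed generated stubs and imports from the client example above; the attempt count and delay bounds are arbitrary choices, not prescribed by this ADR:

```go
// Sketch of retry with exponential backoff; parameters are illustrative.
func reportWithRetry(ctx context.Context, client pb.PlatformConnectorClient, events *pb.HealthEvents) error {
	const maxAttempts = 5
	backoff := 100 * time.Millisecond

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if _, err = client.HealthEventOccuredV1(ctx, events); err == nil {
			return nil
		}
		log.Printf("report attempt %d/%d failed: %v; retrying in %s", attempt, maxAttempts, err, backoff)

		select {
		case <-time.After(backoff):
			backoff *= 2 // double the wait each attempt
			if backoff > 5*time.Second {
				backoff = 5 * time.Second // cap so retries stay responsive to shutdown drains
			}
		case <-ctx.Done():
			return ctx.Err() // stop retrying if the caller gives up
		}
	}
	return err
}
```

In practice a jitter term is often added to the delay to avoid synchronized retry bursts.
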
## Alternatives Considered

### Direct Kubernetes API Integration
**Rejected** because health monitors would require Kubernetes client libraries, making them platform-dependent. This would increase binary size, add complexity, and make monitors harder to test independently. It would also require managing service account tokens and RBAC policies for every monitor.

### Shared Database Communication
**Rejected** because introducing a database dependency for every health monitor adds operational complexity. Monitors would need database drivers, connection management, and retry logic. It also creates a single point of failure and requires network connectivity even for node-local communication.

## Notes

- Health monitors should implement graceful shutdown to drain pending events
- The gRPC interface is versioned to support future extensions
- Events are fire-and-forget from the monitor's perspective; the platform connector handles persistence and retries
- For testing, a mock platform connector can record events to files, as sketched below
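
One way such a mock could look in Go, under the same assumed stub names as the earlier sketches (the response message name `HealthEventResponse` and the registration helper are likewise assumptions):

```go
// Hypothetical mock platform connector for tests: serves the UDS and
// appends each received HealthEvents message as a JSON line.
package main

import (
	"context"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/protobuf/encoding/protojson"

	pb "example.com/nvsentinel/gen/healthevent/v1" // hypothetical generated package
)

type mockConnector struct {
	pb.UnimplementedPlatformConnectorServer // assumed generated base type
	out *os.File
}

func (m *mockConnector) HealthEventOccuredV1(ctx context.Context, in *pb.HealthEvents) (*pb.HealthEventResponse, error) {
	// Record the raw message so tests can assert on what monitors sent.
	line, err := protojson.Marshal(in)
	if err != nil {
		return nil, err
	}
	if _, err := m.out.Write(append(line, '\n')); err != nil {
		return nil, err
	}
	return &pb.HealthEventResponse{}, nil
}

func main() {
	const sock = "/var/run/nvsentinel/platform-connector.sock"
	_ = os.Remove(sock) // clear a stale socket from a previous run

	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatalf("listen on %s: %v", sock, err)
	}
	out, err := os.Create("events.jsonl")
	if err != nil {
		log.Fatalf("create event log: %v", err)
	}

	srv := grpc.NewServer()
	pb.RegisterPlatformConnectorServer(srv, &mockConnector{out: out}) // assumed registration helper
	log.Printf("mock platform connector listening on %s", sock)
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```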