# ADR-001: Architecture — Health Event Detection Interface

## Context

Hardware failures in accelerated computing clusters must be detected quickly and acted upon to maintain system reliability. The system consists of multiple health monitoring components (GPU monitors, network monitors, switch monitors, etc.) that report failures to a central processing system.

The primary challenge is creating a clean separation between the detection logic (which may be vendor- or hardware-specific) and the platform-specific handling logic (which understands Kubernetes, cloud providers, etc.). This separation allows:
- Health monitors to be developed and maintained independently
- Easy addition of new types of health monitors
- Platform-agnostic health monitor binaries

Three architectural options were considered:
1. Direct Kubernetes API writes from monitors
2. Shared database for communication
3. gRPC-based interface with platform connectors

## Decision

Use a gRPC-based interface in which health monitors report events to platform connectors over Unix Domain Sockets (UDS). Health monitors are standalone daemons that detect issues, encode events using Protocol Buffers, and send them via gRPC to platform connectors, which translate the events into platform-specific actions.

## Implementation

- Health monitors run as DaemonSet pods on every node
- Each monitor implements a gRPC client that connects to a Unix Domain Socket
- Platform connectors expose a gRPC server listening on the UDS at `/var/run/nvsentinel/platform-connector.sock`
- The interface uses the `HealthEventOccuredV1` RPC with the `HealthEvents` message type
- Events include: agent name, component class, check name, fatality flag, error codes, impacted entities, and recommended actions
- All communication happens locally on the node; no network calls are required

Key interface fields (field numbers below are illustrative, not normative):
```proto
message HealthEvent {
  string agent = 1;                                   // monitor name (e.g., "GPUHealthMonitor")
  string componentClass = 2;                          // component type (e.g., "GPU", "NIC")
  string checkName = 3;                               // specific check executed
  bool isFatal = 4;                                   // requires immediate action
  bool isHealthy = 5;                                 // current health status
  string message = 6;                                 // human-readable description
  RecommendedAction recommendedAction = 7;            // suggested remediation
  repeated string errorCode = 8;                      // machine-readable codes
  repeated Entity entitiesImpacted = 9;               // affected hardware (GPU UUIDs, etc.)
  map<string, string> metadata = 10;                  // additional context
  google.protobuf.Timestamp generatedTimestamp = 11;  // when the event was detected
  string nodeName = 12;                               // node where the event occurred
}
```

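To make the flow concrete, here is a minimal sketch of a monitor-side client in Go. The generated package path and stub names (`pb.NewPlatformConnectorClient`, the `Events` and `Version` fields of the `HealthEvents` wrapper) are assumptions for illustration; only the RPC name, the message fields above, and the socket path come from this ADR.

```go
// Hypothetical monitor-side client; stub and wrapper field names are
// assumed, not taken from real generated code.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/timestamppb"

	pb "example.com/nvsentinel/gen/healthevent/v1" // hypothetical generated package
)

func main() {
	// grpc-go understands the unix:// scheme, so no TCP networking is involved.
	conn, err := grpc.NewClient("unix:///var/run/nvsentinel/platform-connector.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial platform connector: %v", err)
	}
	defer conn.Close()
	client := pb.NewPlatformConnectorClient(conn)

	event := &pb.HealthEvent{
		Agent:              "GPUHealthMonitor",
		ComponentClass:     "GPU",
		CheckName:          "EccErrorCheck",
		IsFatal:            true,
		IsHealthy:          false,
		Message:            "GPU reported an uncorrectable ECC error",
		ErrorCode:          []string{"ECC_UNCORRECTABLE"}, // illustrative code
		GeneratedTimestamp: timestamppb.Now(),
		NodeName:           "node-0",
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The ADR names the RPC HealthEventOccuredV1 carrying a HealthEvents message.
	_, err = client.HealthEventOccuredV1(ctx, &pb.HealthEvents{
		Version: "1.0.0",
		Events:  []*pb.HealthEvent{event},
	})
	if err != nil {
		log.Fatalf("report health event: %v", err)
	}
}
```

Because the socket is node-local, the same monitor binary works unchanged on any platform whose connector exposes this UDS.
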
## Rationale

- **Loose coupling**: Health monitors don't need Kubernetes client libraries or cloud provider SDKs
- **Language agnostic**: Protocol Buffers and gRPC support many languages
- **Simple deployment**: Unix Domain Sockets don't require network configuration or service discovery
- **Security**: Local socket communication eliminates the network attack surface
- **Performance**: UDS provides high throughput with low latency for local IPC

## Consequences

### Positive
- Health monitors can be written in any language (Python for GPU monitoring, C++ for low-level hardware)
- New health monitors can be added without modifying platform connectors
- Testing is simplified: monitors can be tested independently
- Binary portability: the same monitor binary works across different platforms
- No authentication is needed for local socket communication

### Negative
- Requires Unix Domain Socket volume mounts in pod specifications
- The additional abstraction layer adds complexity
- Health monitors and platform connectors must both be running
- Protocol Buffer schema changes require coordination

### Mitigations
- Use semantic versioning in the `HealthEvents.version` field
- Platform connectors maintain an in-memory cache until health monitors connect
- Include retry logic in health monitors with exponential backoff (see the sketch after this list)
- Monitor pod anti-affinity rules prevent scheduling issues

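A hedged sketch of the backoff mitigation, reusing the assumed generated stubs and imports from the client example above; the attempt count and delay bounds are arbitrary choices, not prescribed by this ADR:

```go
// Sketch of retry with exponential backoff; parameters are illustrative.
func reportWithRetry(ctx context.Context, client pb.PlatformConnectorClient, events *pb.HealthEvents) error {
	const maxAttempts = 5
	backoff := 100 * time.Millisecond

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if _, err = client.HealthEventOccuredV1(ctx, events); err == nil {
			return nil
		}
		log.Printf("report attempt %d/%d failed: %v; retrying in %s", attempt, maxAttempts, err, backoff)

		select {
		case <-time.After(backoff):
			backoff *= 2 // double the wait each attempt
			if backoff > 5*time.Second {
				backoff = 5 * time.Second // cap so retries stay responsive to shutdown drains
			}
		case <-ctx.Done():
			return ctx.Err() // stop retrying if the caller gives up
		}
	}
	return err
}
```

In practice a jitter term is often added to the delay to avoid synchronized retry bursts.
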
## Alternatives Considered

### Direct Kubernetes API Integration
**Rejected** because health monitors would require Kubernetes client libraries, making them platform-dependent. This would increase binary size, add complexity, and make monitors harder to test independently. It would also require managing service account tokens and RBAC policies for every monitor.

### Shared Database Communication
**Rejected** because introducing a database dependency for every health monitor adds operational complexity. Monitors would need database drivers, connection management, and retry logic. It also creates a single point of failure and requires network connectivity even for node-local communication.

## Notes

- Health monitors should implement graceful shutdown to drain pending events
- The gRPC interface is versioned to support future extensions
- Events are fire-and-forget from the monitor's perspective; the platform connector handles persistence and retries
- For testing, a mock platform connector can record events to files, as sketched below
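
One way such a mock could look in Go, under the same assumed stub names as the earlier sketches (the response message name `HealthEventResponse` and the registration helper are likewise assumptions):

```go
// Hypothetical mock platform connector for tests: serves the UDS and
// appends each received HealthEvents message as a JSON line.
package main

import (
	"context"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/protobuf/encoding/protojson"

	pb "example.com/nvsentinel/gen/healthevent/v1" // hypothetical generated package
)

type mockConnector struct {
	pb.UnimplementedPlatformConnectorServer // assumed generated base type
	out *os.File
}

func (m *mockConnector) HealthEventOccuredV1(ctx context.Context, in *pb.HealthEvents) (*pb.HealthEventResponse, error) {
	// Record the raw message so tests can assert on what monitors sent.
	line, err := protojson.Marshal(in)
	if err != nil {
		return nil, err
	}
	if _, err := m.out.Write(append(line, '\n')); err != nil {
		return nil, err
	}
	return &pb.HealthEventResponse{}, nil
}

func main() {
	const sock = "/var/run/nvsentinel/platform-connector.sock"
	_ = os.Remove(sock) // clear a stale socket from a previous run

	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatalf("listen on %s: %v", sock, err)
	}
	out, err := os.Create("events.jsonl")
	if err != nil {
		log.Fatalf("create event log: %v", err)
	}

	srv := grpc.NewServer()
	pb.RegisterPlatformConnectorServer(srv, &mockConnector{out: out}) // assumed registration helper
	log.Printf("mock platform connector listening on %s", sock)
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```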