|
| 1 | +# NVSentinel Log Collection Guide |
| 2 | + |
| 3 | +This guide explains NVSentinel's automatic log collection functionality for troubleshooting GPU node faults. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +- [Overview](#overview) |
| 8 | +- [What Logs Are Collected](#what-logs-are-collected) |
| 9 | +- [Where Logs Are Stored](#where-logs-are-stored) |
| 10 | +- [When Logs Are Collected](#when-logs-are-collected) |
| 11 | +- [How to Download Logs](#how-to-download-logs) |
| 12 | +- [Log Rotation and Retention](#log-rotation-and-retention) |
| 13 | +- [Additional Resources](#additional-resources) |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## Overview |
| 18 | + |
| 19 | +When NVSentinel detects a fault on a GPU node, it automatically collects diagnostic logs for troubleshooting and root cause analysis. The logs are stored on an in-cluster file server and can be downloaded through a web browser. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## What Logs Are Collected |
| 24 | + |
| 25 | +### 1. NVIDIA Bug Report |
| 26 | +- **File**: `nvidia-bug-report-<node-name>-<timestamp>.log.gz` |
| 27 | +- **Description**: Comprehensive NVIDIA driver and GPU diagnostic report |
| 28 | +- **Collection Method**: |
| 29 | +  - **GPU Operator clusters**: Runs `nvidia-bug-report.sh` inside the nvidia-driver-daemonset pod |
| 30 | +  - **GCP COS clusters**: Executes the pre-installed nvidia-bug-report from the host filesystem |
| 31 | +- **Contains**: |
| 32 | +  - GPU configuration and status |
| 33 | +  - Driver version and details |
| 34 | +  - System information |
| 35 | +  - GPU error logs |
| 36 | +  - PCIe information |
| 37 | +  - DCGM diagnostics |
| 38 | + |
| 39 | +### 2. GPU Operator Must-Gather |
| 40 | +- **File**: `gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz` |
| 41 | +- **Description**: Kubernetes resources and logs for GPU operator components |
| 42 | +- **Contains**: |
| 43 | +  - GPU operator pod logs |
| 44 | +  - DCGM exporter logs |
| 45 | +  - Device plugin logs |
| 46 | +  - GPU feature discovery logs |
| 47 | +  - Operator configuration |
| 48 | +  - Kubernetes events |
| 49 | + |
| 50 | +### 3. GCP SOS Report (Optional) |
| 51 | +- **File**: `sosreport-<hostname>-<timestamp>.tar.xz` |
| 52 | +- **When Collected**: Only on GCP instances when `enableGcpSosCollection: true` |
| 53 | +- **Contains**: System logs, configuration files, network diagnostics, storage information |
| 54 | + |
| 55 | +### 4. AWS SOS Report (Optional) |
| 56 | +- **File**: `sosreport-<hostname>-nvsentinel-<unique-id>-<timestamp>.tar.xz` |
| 57 | +- **When Collected**: Only on AWS instances when `enableAwsSosCollection: true` |
| 58 | +- **Contains**: System logs, configuration files, network diagnostics, EC2 metadata |
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## Where Logs Are Stored |
| 63 | + |
| 64 | +### Storage Architecture |
| 65 | + |
| 66 | +```text |
| 67 | +Log Collector Job → In-Cluster File Server → Persistent Volume |
| 68 | +``` |
| 69 | + |
| 70 | +### In-Cluster File Server |
| 71 | + |
| 72 | +- **Service Name**: `nvsentinel-incluster-file-server` |
| 73 | +- **Namespace**: `nvsentinel` |
| 74 | +- **Internal URL**: `http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local` |
| 75 | +- **Technology**: NGINX with WebDAV support |
| 76 | + |
| 77 | +### Storage Configuration |
| 78 | + |
| 79 | +Configure persistence in your Helm values: |
| 80 | + |
| 81 | +```yaml |
| 82 | +# Helm values for file server persistence |
| 83 | +inclusterFileServer: |
| 84 | +  persistence: |
| 85 | +    enabled: true |
| 86 | +    storageClassName: "" # Uses default storage class |
| 87 | +    accessModes: |
| 88 | +      - ReadWriteOnce |
| 89 | +    size: 50Gi # Default size |
| 90 | +``` |
| 91 | +
|
| 92 | +### Directory Structure |
| 93 | +
|
| 94 | +Logs are organized by node name and timestamp: |
| 95 | +
|
| 96 | +```text |
| 97 | +/usr/share/nginx/html/ |
| 98 | +└── <node-name>/ |
| 99 | +    └── <timestamp>/ |
| 100 | +        ├── nvidia-bug-report-<node-name>-<timestamp>.log.gz |
| 101 | +        ├── gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz |
| 102 | +        ├── sosreport-<hostname>-<timestamp>.tar.xz (if GCP SOS enabled) |
| 103 | +        └── sosreport-<hostname>-nvsentinel-<id>-<timestamp>.tar.xz (if AWS SOS enabled) |
| 104 | +``` |
| 105 | + |
| 106 | +**Example**: |
| 107 | +```text |
| 108 | +/usr/share/nginx/html/ |
| 109 | +└── worker-node-01/ |
| 110 | +    └── 20250106-143022/ |
| 111 | +        ├── nvidia-bug-report-worker-node-01-20250106-143022.log.gz |
| 112 | +        └── gpu-operator-must-gather-worker-node-01-20250106-143022.tar.gz |
| 113 | +``` |
| 114 | + |
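Because each collection lands in its own `<timestamp>` directory, the most recent collection for a node can be found by sorting the directory names. The following is a minimal sketch, assuming the timestamp format shown in the example above (`YYYYMMDD-HHMMSS`, which sorts chronologically as plain text), GNU `find`, and a local copy of the tree. The `latest_collection` helper and the scratch directory are hypothetical and exist only for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# latest_collection ROOT NODE
# Hypothetical helper: prints the newest <timestamp> directory name under ROOT/NODE.
# Relies on YYYYMMDD-HHMMSS names sorting chronologically as text.
latest_collection() {
  local root="$1" node="$2"
  find "$root/$node" -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | sort | tail -n 1
}

# Scratch tree standing in for /usr/share/nginx/html.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/worker-node-01/20250106-143022" \
         "$demo_root/worker-node-01/20250107-091500" \
         "$demo_root/worker-node-02/20250105-235959"

latest="$(latest_collection "$demo_root" worker-node-01)"
echo "latest collection for worker-node-01: $latest"
```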
| 115 | +--- |
| 116 | + |
| 117 | +## When Logs Are Collected |
| 118 | + |
| 119 | +### Automatic Collection Triggers |
| 120 | + |
| 121 | +Logs are collected automatically when all of the following are true: |
| 122 | + |
| 123 | +1. A node has experienced a fault that triggered quarantine and drain |
| 124 | +2. The Fault Remediation Module detects that the drain has completed on that node |
| 125 | +3. Log collection is enabled in the fault-remediation chart configuration |
| 126 | + |
| 127 | +### Configuration |
| 128 | + |
| 129 | +Enable log collection in your Helm values: |
| 130 | + |
| 131 | +```yaml |
| 132 | +faultRemediation: |
| 133 | +  enabled: true |
| 134 | +  logCollector: |
| 135 | +    enabled: true # Set to true to enable automatic log collection |
| 136 | +    uploadURL: "http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local/upload" |
| 137 | +    gpuOperatorNamespaces: "gpu-operator" # Comma-separated list |
| 138 | +    enableGcpSosCollection: false # Enable for GCP clusters |
| 139 | +    enableAwsSosCollection: false # Enable for AWS clusters |
| 140 | +``` |
| 141 | +
|
| 142 | +### Job Lifecycle |
| 143 | +
|
| 144 | +1. **Creation**: The fault-remediation module creates a log collector job after the node drain completes |
| 145 | +2. **Execution**: The job runs with privileged access on the target node |
| 146 | +3. **Collection**: The job gathers all configured diagnostic logs (typically 5-15 minutes) |
| 147 | +4. **Upload**: The job uploads the collected logs to the file server |
| 148 | +5. **Completion**: The job completes and remains in the cluster until its TTL expires |
| 149 | +6. **TTL**: The job is automatically deleted 1 hour after completion (`ttlSecondsAfterFinished: 3600`) |
| 150 | + |
| 151 | +### Timeout Configuration |
| 152 | + |
| 153 | +You can configure the collection timeout: |
| 154 | + |
| 155 | +```yaml |
| 156 | +logCollector: |
| 157 | +  collectionTimeout: 900 # In seconds; default is 900 (15 minutes) |
| 158 | +``` |
| 159 | + |
| 160 | +--- |
| 161 | + |
| 162 | +## How to Download Logs |
| 163 | + |
| 164 | +### Using Port-Forward and Browser |
| 165 | + |
| 166 | +This is the simplest way to browse and download logs from your local machine. |
| 167 | + |
| 168 | +#### Step 1: Set up port-forward |
| 169 | + |
| 170 | +```bash |
| 171 | +kubectl port-forward -n nvsentinel svc/nvsentinel-incluster-file-server 8080:80 |
| 172 | +``` |
| 173 | + |
| 174 | +#### Step 2: Access via web browser |
| 175 | + |
| 176 | +Open your browser to: |
| 177 | +```text |
| 178 | +http://localhost:8080 |
| 179 | +``` |
| 180 | + |
| 181 | +You'll see a directory listing with all node folders. Navigate through the folders to find your logs. |
| 182 | + |
| 183 | +#### Step 3: Download files |
| 184 | + |
| 185 | +Click on any file to download it directly from the browser. |
| 186 | + |
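The same files can also be fetched from a terminal. The following is a minimal sketch, assuming the port-forward from Step 1 is still running on `localhost:8080` and that URL paths mirror the `<node-name>/<timestamp>/` directory layout shown in the browser listing. The `log_url` and `download_log` helpers are hypothetical names used for illustration, not part of NVSentinel.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the port-forward from Step 1 is listening on localhost:8080.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# log_url NODE TIMESTAMP FILE
# Hypothetical helper: prints the URL of one collected file.
log_url() {
  local node="$1" timestamp="$2" file="$3"
  printf '%s/%s/%s/%s\n' "${BASE_URL%/}" "$node" "$timestamp" "$file"
}

# download_log NODE TIMESTAMP FILE
# Hypothetical helper: saves FILE into the current directory.
download_log() {
  local url
  url="$(log_url "$@")"
  # -f: fail on HTTP errors, -sS: quiet but still report errors, -O: keep the remote file name
  curl -fsS -O "$url"
}

# Build (but do not fetch) the URL for the example file from the directory layout.
url="$(log_url worker-node-01 20250106-143022 \
  nvidia-bug-report-worker-node-01-20250106-143022.log.gz)"
echo "$url"
```

With the port-forward active, `download_log worker-node-01 20250106-143022 nvidia-bug-report-worker-node-01-20250106-143022.log.gz` would save that file locally.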
| 187 | +### Viewing Collected Logs |
| 188 | + |
| 189 | +After downloading, extract and view the logs: |
| 190 | + |
| 191 | +#### NVIDIA Bug Report |
| 192 | +```bash |
| 193 | +# Decompress and view |
| 194 | +gunzip nvidia-bug-report-<node-name>-<timestamp>.log.gz |
| 195 | +less nvidia-bug-report-<node-name>-<timestamp>.log |
| 196 | +``` |
| 197 | + |
| 198 | +#### GPU Operator Must-Gather |
| 199 | +```bash |
| 200 | +# Extract tarball |
| 201 | +tar -xzf gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz |
| 202 | +cd gpu-operator-must-gather-<node-name>-<timestamp>/ |
| 203 | +
|
| 204 | +# Browse collected resources |
| 205 | +ls -R |
| 206 | +``` |
| 207 | + |
| 208 | +#### SOS Reports |
| 209 | +```bash |
| 210 | +# Extract SOS report |
| 211 | +tar -xJf sosreport-<hostname>-<timestamp>.tar.xz |
| 212 | +cd sosreport-<hostname>-<timestamp>/ |
| 213 | +
|
| 214 | +# View summary |
| 215 | +less sos_reports/sos.txt |
| 216 | +``` |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## Log Rotation and Retention |
| 221 | + |
| 222 | +### Overview |
| 223 | + |
| 224 | +The file server includes an automated log cleanup service that manages disk space by removing old log files based on a configurable retention policy. |
| 225 | + |
| 226 | +### Configuration |
| 227 | + |
| 228 | +Configure log rotation in your Helm values: |
| 229 | + |
| 230 | +```yaml |
| 231 | +inclusterFileServer: |
| 232 | +  logCleanup: |
| 233 | +    enabled: true |
| 234 | +    retentionDays: 7 # Keep logs for 7 days (minimum: 1 day) |
| 235 | +    sleepInterval: 86400 # Run cleanup every 24 hours (in seconds) |
| 236 | +``` |
| 237 | + |
| 238 | +### How Log Rotation Works |
| 239 | + |
| 240 | +1. **Continuous Monitoring**: Cleanup service runs as a sidecar container in the file server pod |
| 241 | +2. **Periodic Cleanup**: Executes cleanup every `sleepInterval` seconds (default: 24 hours) |
| 242 | +3. **Age-Based Deletion**: Removes files last modified more than `retentionDays` days ago |
| 243 | +4. **Safe Operation**: Only operates within `/usr/share/nginx/html` directory for security |
| 244 | + |
| 245 | +### Cleanup Process |
| 246 | + |
| 247 | +The cleanup service uses the `find` command to identify and delete old files: |
| 248 | + |
| 249 | +```bash |
| 250 | +find /usr/share/nginx/html -type f -mtime +<retentionDays> -delete |
| 251 | +``` |
| 252 | + |
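To preview what a given retention value would remove, the same `find` expression can be run with `-print` in place of `-delete`. The following is a minimal sketch on a scratch directory, assuming GNU `find` and GNU `touch`; the file names are made up for the demonstration. It also shows two properties of the command above: `-mtime +N` counts whole 24-hour periods, so a file must be more than `N` full days old to match, and `-type f` leaves emptied directories in place.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch tree standing in for /usr/share/nginx/html.
root="$(mktemp -d)"
mkdir -p "$root/worker-node-01/20250106-143022" "$root/worker-node-02/20250114-090000"
old_file="$root/worker-node-01/20250106-143022/nvidia-bug-report-old.log.gz"
new_file="$root/worker-node-02/20250114-090000/nvidia-bug-report-new.log.gz"
touch -d '10 days ago' "$old_file"   # older than the retention window
touch -d '3 days ago' "$new_file"    # inside the retention window

retention_days=7

# Dry run: -print lists exactly what -delete would remove.
would_delete="$(find "$root" -type f -mtime +"$retention_days" -print)"
echo "would delete: $would_delete"

# Real run, same expression as the cleanup command above.
find "$root" -type f -mtime +"$retention_days" -delete

# Only regular files are matched, so the emptied timestamp directory remains.
empty_dirs="$(find "$root" -mindepth 1 -type d -empty)"
echo "empty directories left behind: $empty_dirs"
```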
| 253 | +### Safety Features |
| 254 | + |
| 255 | +1. **Minimum Retention**: Helm chart validates `retentionDays >= 1` to prevent accidental data loss |
| 256 | +2. **Path Validation**: Only cleans files within the designated directory |
| 257 | +3. **Timeout Protection**: Cleanup operations time out after 5 minutes |
| 258 | +4. **Error Tracking**: Failed cleanups are logged and tracked in metrics |
| 259 | + |
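The first three safety features can be illustrated with a small wrapper around `find`. This is a sketch of the general pattern only, not the cleanup sidecar's actual implementation: the `cleanup_once` function and the `ALLOWED_ROOT` variable are hypothetical, and it assumes GNU coreutils (`timeout`, `touch -d`).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical setting: the only directory tree cleanup may touch.
ALLOWED_ROOT="${ALLOWED_ROOT:-/usr/share/nginx/html}"

# cleanup_once ROOT RETENTION_DAYS [TIMEOUT_SECONDS]
# Hypothetical helper: returns non-zero without deleting anything if a guard fails.
cleanup_once() {
  local root="$1" retention_days="$2" timeout_seconds="${3:-300}"

  # Minimum retention: must be a whole number >= 1.
  if ! [[ "$retention_days" =~ ^[0-9]+$ ]] || (( 10#$retention_days < 1 )); then
    echo "refusing cleanup: retentionDays must be >= 1 (got '$retention_days')" >&2
    return 1
  fi

  # Path validation: ROOT must be the allowed directory or a path beneath it.
  case "$root" in
    "$ALLOWED_ROOT"|"$ALLOWED_ROOT"/*) ;;
    *) echo "refusing cleanup: '$root' is outside '$ALLOWED_ROOT'" >&2; return 1 ;;
  esac

  # Timeout protection: `timeout` exits with status 124 if the limit is hit.
  timeout "$timeout_seconds" find "$root" -type f -mtime +"$retention_days" -delete
}

# Demonstration on a scratch directory.
demo="$(mktemp -d)"
ALLOWED_ROOT="$demo"
touch -d '10 days ago' "$demo/old.log.gz"
touch "$demo/new.log.gz"

if cleanup_once "$demo" 0; then zero_rejected=no; else zero_rejected=yes; fi
if cleanup_once /tmp 7; then outside_rejected=no; else outside_rejected=yes; fi
cleanup_once "$demo" 7
remaining="$(ls "$demo")"
echo "zero_rejected=$zero_rejected outside_rejected=$outside_rejected remaining=$remaining"
```

The prefix check is deliberately simple and does not resolve `..` segments or symlinks; a stricter version would canonicalize the path first.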
| 260 | +### Manual Cleanup |
| 261 | + |
| 262 | +If needed, you can manually trigger cleanup or remove specific logs: |
| 263 | + |
| 264 | +```bash |
| 265 | +# Get the file server pod name |
| 266 | +FILE_SERVER_POD=$(kubectl get pods -n nvsentinel -l app.kubernetes.io/name=incluster-file-server -o jsonpath='{.items[0].metadata.name}') |
| 267 | +
|
| 268 | +# Remove logs for a specific node |
| 269 | +kubectl exec -n nvsentinel "$FILE_SERVER_POD" -- rm -rf /usr/share/nginx/html/<node-name> |
| 270 | +
|
| 271 | +# Remove files last modified more than 14 days ago |
| 272 | +kubectl exec -n nvsentinel "$FILE_SERVER_POD" -- find /usr/share/nginx/html -type f -mtime +14 -delete |
| 273 | +``` |
| 274 | + |
| 275 | +--- |
| 276 | + |
| 277 | +## Additional Resources |
| 278 | + |
| 279 | +- **[Metrics Documentation](METRICS.md)** - Prometheus metrics for monitoring log collection and file server operations |
| 280 | +- **[Troubleshooting Runbooks](runbooks/)** - Step-by-step guides for resolving common issues: |
| 281 | +  - [Log Collection Job Failures](runbooks/log-collection-job-failures.md) |
| 282 | +  - [Log Rotation Failures](runbooks/log-rotation-failures.md) |
| 283 | +- **[NVSentinel Overview](OVERVIEW.md)** - General overview of NVSentinel |
| 284 | +- **[Helm Chart Configuration](../distros/kubernetes/README.md)** - Complete Helm chart documentation |
| 285 | + |
| 286 | +--- |
| 287 | + |
| 288 | +## Support |
| 289 | + |
| 290 | +For issues or questions: |
| 291 | +- 🐛 **Bug Reports**: [Create an issue](https://github.com/NVIDIA/NVSentinel/issues/new) |
| 292 | +- ❓ **Questions**: [Start a discussion](https://github.com/NVIDIA/NVSentinel/discussions/new?category=q-a) |
| 293 | +- 📖 **Documentation**: [NVSentinel Docs](https://github.com/NVIDIA/NVSentinel/tree/main/docs) |