Commit 2ac0cd9

docs: add comprehensive log collection documentation (#285)
1 parent 0827e1e commit 2ac0cd9

5 files changed, +772 -0 lines changed

docs/LOG_COLLECTION.md

Lines changed: 293 additions & 0 deletions
@@ -0,0 +1,293 @@
# NVSentinel Log Collection Guide

This guide explains NVSentinel's automatic log collection functionality for troubleshooting GPU node faults.

## Table of Contents

- [Overview](#overview)
- [What Logs Are Collected](#what-logs-are-collected)
- [Where Logs Are Stored](#where-logs-are-stored)
- [When Logs Are Collected](#when-logs-are-collected)
- [How to Download Logs](#how-to-download-logs)
- [Log Rotation and Retention](#log-rotation-and-retention)
- [Additional Resources](#additional-resources)
- [Support](#support)

---

## Overview

When NVSentinel detects a fault on a GPU node, it automatically collects diagnostic logs to help with troubleshooting and root cause analysis. These logs are stored on an in-cluster file server and can be downloaded through a web browser.

---

## What Logs Are Collected

### 1. NVIDIA Bug Report
- **File**: `nvidia-bug-report-<node-name>-<timestamp>.log.gz`
- **Description**: Comprehensive NVIDIA driver and GPU diagnostic report
- **Collection Method**:
  - **GPU Operator clusters**: Runs `nvidia-bug-report.sh` inside the nvidia-driver-daemonset pod
  - **GCP COS clusters**: Executes pre-installed nvidia-bug-report from host filesystem
- **Contains**:
  - GPU configuration and status
  - Driver version and details
  - System information
  - GPU error logs
  - PCIe information
  - DCGM diagnostics

### 2. GPU Operator Must-Gather
- **File**: `gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz`
- **Description**: Kubernetes resources and logs for GPU operator components
- **Contains**:
  - GPU operator pod logs
  - DCGM exporter logs
  - Device plugin logs
  - GPU feature discovery logs
  - Operator configuration
  - Kubernetes events

### 3. GCP SOS Report (Optional)
- **File**: `sosreport-<hostname>-<timestamp>.tar.xz`
- **When Collected**: Only on GCP instances when `enableGcpSosCollection: true`
- **Contains**: System logs, configuration files, network diagnostics, storage information

### 4. AWS SOS Report (Optional)
- **File**: `sosreport-<hostname>-nvsentinel-<unique-id>-<timestamp>.tar.xz`
- **When Collected**: Only on AWS instances when `enableAwsSosCollection: true`
- **Contains**: System logs, configuration files, network diagnostics, EC2 metadata

---

## Where Logs Are Stored

### Storage Architecture

```text
Log Collector Job → In-Cluster File Server → Persistent Volume
```

### In-Cluster File Server

- **Service Name**: `nvsentinel-incluster-file-server`
- **Namespace**: `nvsentinel`
- **Internal URL**: `http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local`
- **Technology**: NGINX with WebDAV support

### Storage Configuration

Configure persistence in your Helm values:

```yaml
# Helm values for file server persistence
inclusterFileServer:
  persistence:
    enabled: true
    storageClassName: ""  # Uses default storage class
    accessModes:
      - ReadWriteOnce
    size: 50Gi  # Default size
```

### Directory Structure

Logs are organized by node name and timestamp:

```text
/usr/share/nginx/html/
└── <node-name>/
    └── <timestamp>/
        ├── nvidia-bug-report-<node-name>-<timestamp>.log.gz
        ├── gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz
        ├── sosreport-<hostname>-<timestamp>.tar.xz (if GCP SOS enabled)
        └── sosreport-<hostname>-nvsentinel-<id>-<timestamp>.tar.xz (if AWS SOS enabled)
```

**Example**:
```text
/usr/share/nginx/html/
└── worker-node-01/
    └── 20250106-143022/
        ├── nvidia-bug-report-worker-node-01-20250106-143022.log.gz
        └── gpu-operator-must-gather-worker-node-01-20250106-143022.tar.gz
```
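
The naming scheme above can be reproduced in a few lines of shell, which helps when scripting against the file server. This is a minimal sketch: the `YYYYMMDD-HHMMSS` timestamp layout is inferred from the example directory name rather than from a documented specification, and the helper names `format_log_timestamp` and `expected_artifacts` are hypothetical.

```bash
# Hypothetical helpers; the timestamp layout is an assumption based on
# the example directory name 20250106-143022.

# Turn a directory timestamp into a readable "YYYY-MM-DD HH:MM:SS" string.
format_log_timestamp() {
  if ! expr "$1" : '[0-9]\{8\}-[0-9]\{6\}$' >/dev/null; then
    echo "unexpected timestamp format: $1" >&2
    return 1
  fi
  printf '%s\n' "$1" |
    sed -E 's/^(.{4})(.{2})(.{2})-(.{2})(.{2})(.{2})$/\1-\2-\3 \4:\5:\6/'
}

# Print the relative paths of the two artifacts shown in the example tree.
# $1 is the node name, $2 the timestamp directory.
expected_artifacts() {
  echo "$1/$2/nvidia-bug-report-$1-$2.log.gz"
  echo "$1/$2/gpu-operator-must-gather-$1-$2.tar.gz"
}

format_log_timestamp "20250106-143022"   # prints 2025-01-06 14:30:22
expected_artifacts "worker-node-01" "20250106-143022"
```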

---

## When Logs Are Collected

### Automatic Collection Triggers

Logs are collected automatically when all of the following are true:

1. The Fault Remediation Module detects that a drain has completed on a node
2. Log collection is enabled in the fault-remediation chart configuration
3. The node has experienced a fault that triggered quarantine and drain

### Configuration

Enable log collection in your Helm values:

```yaml
faultRemediation:
  enabled: true
  logCollector:
    enabled: true  # Set to true to enable automatic log collection
    uploadURL: "http://nvsentinel-incluster-file-server.nvsentinel.svc.cluster.local/upload"
    gpuOperatorNamespaces: "gpu-operator"  # Comma-separated list
    enableGcpSosCollection: false  # Enable for GCP clusters
    enableAwsSosCollection: false  # Enable for AWS clusters
```

### Job Lifecycle

1. **Creation**: The fault-remediation module creates a log collector Job after the node drain completes
2. **Execution**: The Job runs with privileged access on the target node
3. **Collection**: The Job gathers all configured diagnostic logs (typically 5-15 minutes)
4. **Upload**: The Job uploads the collected logs to the file server
5. **Completion**: The finished Job stays in the cluster, available for inspection, until its TTL expires
6. **TTL**: Kubernetes automatically deletes the Job 1 hour after it finishes (`ttlSecondsAfterFinished: 3600`)
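
The TTL arithmetic in step 6 can be checked with a short calculation. This is a minimal sketch that assumes GNU `date`; the helper name `job_expiry` is hypothetical and the completion time is an illustrative value.

```bash
# Print the UTC time at which a finished Job becomes eligible for deletion.
# $1 is the completion time (ISO 8601, UTC), $2 the TTL in seconds.
# Converting to epoch seconds first avoids ambiguous relative-date parsing.
job_expiry() {
  completed_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((completed_epoch + $2))" +%Y-%m-%dT%H:%M:%SZ
}

job_expiry "2025-01-06T14:30:22Z" 3600   # prints 2025-01-06T15:30:22Z
```

On a live cluster the completion time is recorded in the Job's `.status.completionTime` field; the Job name and namespace to query depend on your deployment.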

### Timeout Configuration

You can configure the collection timeout:

```yaml
logCollector:
  collectionTimeout: 900  # In seconds; default is 900 (15 minutes)
```

---

## How to Download Logs

### Using Port-Forward and Browser

This is the simplest way to browse and download logs from your local machine.

#### Step 1: Set up port-forward

```bash
kubectl port-forward -n nvsentinel svc/nvsentinel-incluster-file-server 8080:80
```

Keep this command running while you browse; forwarding stops as soon as the command is interrupted.

#### Step 2: Access via web browser

Open your browser to:
```text
http://localhost:8080
```

You'll see a directory listing with all node folders. Navigate through the folders to find your logs.

#### Step 3: Download files

Click on any file to download it directly from the browser.
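
The same files can also be fetched from a terminal while the port-forward is running. This is a sketch under assumptions: it presumes the URL path mirrors the `<node-name>/<timestamp>/<file>` directory layout described earlier, and the helper names `log_url` and `download_log` are hypothetical.

```bash
# Build the download URL for one collected file.
# $1 base URL, $2 node name, $3 timestamp directory, $4 file name.
log_url() {
  printf '%s/%s/%s/%s\n' "${1%/}" "$2" "$3" "$4"
}

# Download one file into the current directory.
# -f makes curl fail on HTTP errors, -L follows redirects, -O keeps the name.
download_log() {
  curl -fL -O "$(log_url "$@")"
}

log_url "http://localhost:8080" "worker-node-01" "20250106-143022" \
  "nvidia-bug-report-worker-node-01-20250106-143022.log.gz"
```

Calling `download_log` with the same four arguments would fetch the file; it is only defined here, not run, because it needs a live port-forward.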

### Viewing Collected Logs

After downloading, extract and view the logs:

#### NVIDIA Bug Report
```bash
# Decompress and view
gunzip nvidia-bug-report-<node-name>-<timestamp>.log.gz
less nvidia-bug-report-<node-name>-<timestamp>.log
```

#### GPU Operator Must-Gather
```bash
# Extract tarball
tar -xzf gpu-operator-must-gather-<node-name>-<timestamp>.tar.gz
cd gpu-operator-must-gather-<node-name>-<timestamp>/

# Browse collected resources
ls -R
```

#### SOS Reports
```bash
# Extract SOS report
tar -xJf sosreport-<hostname>-<timestamp>.tar.xz
cd sosreport-<hostname>-<timestamp>/

# View summary
less sos_reports/sos.txt
```
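
Before extracting, it can be worth confirming that a download completed intact. The sketch below is not part of NVSentinel: `check_archive` is a hypothetical helper that uses the standard `gzip -t` and `tar -t` integrity checks, and it is demonstrated on a throwaway file so that no real log is required.

```bash
# Check that a downloaded archive is readable without extracting it.
check_archive() {
  case "$1" in
    *.tar.gz) tar -tzf "$1" >/dev/null ;;   # list the gzip-compressed tarball
    *.tar.xz) tar -tJf "$1" >/dev/null ;;   # list the xz-compressed tarball
    *.gz)     gzip -t "$1" ;;               # test the gzip stream
    *)        echo "unrecognized archive type: $1" >&2; return 2 ;;
  esac
}

# Demonstration on a temporary file standing in for a bug report.
demo_dir=$(mktemp -d)
echo "sample" > "$demo_dir/sample.log"
gzip "$demo_dir/sample.log"
check_archive "$demo_dir/sample.log.gz" && echo "sample.log.gz: OK"
```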

---

## Log Rotation and Retention

### Overview

The file server includes an automated log cleanup service that manages disk space by removing old log files based on a configurable retention policy.

### Configuration

Configure log rotation in your Helm values:

```yaml
inclusterFileServer:
  logCleanup:
    enabled: true
    retentionDays: 7  # Keep logs for 7 days (minimum: 1 day)
    sleepInterval: 86400  # Run cleanup every 24 hours (in seconds)
```

### How Log Rotation Works

1. **Continuous Monitoring**: The cleanup service runs as a sidecar container in the file server pod
2. **Periodic Cleanup**: It executes a cleanup every `sleepInterval` seconds (default: 24 hours)
3. **Age-Based Deletion**: It removes files older than `retentionDays` days, based on file modification time
4. **Safe Operation**: It operates only within the `/usr/share/nginx/html` directory for security

### Cleanup Process

The cleanup service uses the `find` command to identify and delete old files. Because of `-type f`, only files are deleted, so emptied node and timestamp directories remain:

```bash
find /usr/share/nginx/html -type f -mtime +<retentionDays> -delete
```
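
The age test in that command is easy to misread, so the following sketch exercises it on a temporary directory. It assumes GNU `find` and GNU `touch`: with GNU `find`, `-mtime +N` counts age in whole 24-hour periods and ignores any fraction, so `+7` matches only files that are at least 8 days old. The file names are illustrative.

```bash
# Create three files with different ages in a scratch directory.
demo_root=$(mktemp -d)
touch -d '240 hours ago' "$demo_root/ten-days-old.log"       # 10 days old
touch -d '180 hours ago' "$demo_root/seven-and-a-half.log"   # 7.5 days old
touch "$demo_root/fresh.log"                                  # just created

# Dry run: -print lists exactly what -delete would remove.
find "$demo_root" -type f -mtime +7 -print

# Same expression with -delete, as in the cleanup command above.
find "$demo_root" -type f -mtime +7 -delete

# Only ten-days-old.log is gone; the 7.5-day-old file is kept.
ls "$demo_root"
```

Running the `-print` form first is a cheap way to preview a retention change before any file is removed.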

### Safety Features

1. **Minimum Retention**: The Helm chart validates `retentionDays >= 1` to prevent accidental data loss
2. **Path Validation**: The service only cleans files within the designated directory
3. **Timeout Protection**: Cleanup operations time out after 5 minutes
4. **Error Tracking**: Failed cleanups are logged and counted in the `fileserver_log_rotation_failed_total` metric
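
The first three safeguards can be pictured as guards around the `find` call. The function below is a hypothetical sketch, not the actual sidecar script: the name `safe_cleanup`, its arguments, and the simple prefix check are all assumptions, and `timeout` is the GNU coreutils command.

```bash
# Hypothetical guarded cleanup pass; illustrative only.
# $1 directory to clean, $2 retention in days, $3 allowed root directory.
safe_cleanup() {
  # Minimum retention: require a whole number of at least 1.
  case "$2" in
    ''|*[!0-9]*) echo "retentionDays must be a whole number" >&2; return 1 ;;
  esac
  if [ "$2" -lt 1 ]; then
    echo "retentionDays must be at least 1" >&2
    return 1
  fi
  # Path validation: a plain prefix check against the allowed root.
  case "$1/" in
    "$3"/*) ;;
    *) echo "refusing to clean outside $3: $1" >&2; return 1 ;;
  esac
  # Timeout protection: stop find after 300 seconds (5 minutes).
  timeout 300 find "$1" -type f -mtime +"$2" -delete
}

demo_root=$(mktemp -d)
safe_cleanup "$demo_root" 0 "$demo_root" || echo "rejected: retention below 1 day"
safe_cleanup "/" 7 "$demo_root" || echo "rejected: path outside allowed root"
```

A prefix check like this does not resolve `..` segments or symlinks, so a production guard would need to canonicalize the path first.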

### Manual Cleanup

If needed, you can manually trigger cleanup or remove specific logs:

```bash
# Get the file server pod name
FILE_SERVER_POD=$(kubectl get pods -n nvsentinel -l app.kubernetes.io/name=incluster-file-server -o jsonpath='{.items[0].metadata.name}')

# Remove logs for a specific node
kubectl exec -n nvsentinel "$FILE_SERVER_POD" -- rm -rf /usr/share/nginx/html/<node-name>

# Remove files last modified more than 14 days ago
kubectl exec -n nvsentinel "$FILE_SERVER_POD" -- find /usr/share/nginx/html -type f -mtime +14 -delete
```

---

## Additional Resources

- **[Metrics Documentation](METRICS.md)** - Prometheus metrics for monitoring log collection and file server operations
- **[Troubleshooting Runbooks](runbooks/)** - Step-by-step guides for resolving common issues:
  - [Log Collection Job Failures](runbooks/log-collection-job-failures.md)
  - [Log Rotation Failures](runbooks/log-rotation-failures.md)
- **[NVSentinel Overview](OVERVIEW.md)** - General overview of NVSentinel
- **[Helm Chart Configuration](../distros/kubernetes/README.md)** - Complete Helm chart documentation

---

## Support

For issues or questions:
- 🐛 **Bug Reports**: [Create an issue](https://github.com/NVIDIA/NVSentinel/issues/new)
- ❓ **Questions**: [Start a discussion](https://github.com/NVIDIA/NVSentinel/discussions/new?category=q-a)
- 📖 **Documentation**: [NVSentinel Docs](https://github.com/NVIDIA/NVSentinel/tree/main/docs)

docs/METRICS.md

Lines changed: 17 additions & 0 deletions
@@ -108,6 +108,23 @@ This document outlines all Prometheus metrics exposed by NVSentinel components.
| `fault_remediation_log_collector_job_duration_seconds` | Histogram | `node_name`, `status` | Duration of log collector jobs in seconds |
| `fault_remediation_log_collector_errors_total` | Counter | `error_type`, `node_name` | Total number of errors encountered in log collector operations |

### File Server Metrics

#### HTTP Request Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `http_response_count_total` | Counter | `method`, `status`, `app` | Total HTTP responses by method and status code |
| `http_request_duration_seconds` | Histogram | `method`, `status` | HTTP request duration in seconds |

#### Log Rotation Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `fileserver_log_rotation_successful_total` | Counter | - | Total successful log cleanup operations |
| `fileserver_log_rotation_failed_total` | Counter | - | Total failed log cleanup operations |
| `fileserver_disk_space_free_bytes` | Gauge | - | Free disk space in bytes |

---

## Labeler Module

docs/runbooks/README.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# NVSentinel Runbooks

This directory contains troubleshooting runbooks for NVSentinel operations.

## Available Runbooks

- [Log Collection Job Failures](log-collection-job-failures.md) - Troubleshooting failed log collection jobs
- [Log Rotation Failures](log-rotation-failures.md) - Troubleshooting log rotation and cleanup issues

## How to Use These Runbooks

Each runbook includes:
- **Symptoms**: Signs that indicate the issue (including relevant alert names)
- **Diagnosis Steps**: Commands to identify the root cause
- **Common Issues and Solutions**: Known problems with fixes
- **Resolution Steps**: Actions to resolve the problem
