Commit dc76cf0

Merge pull request #11 from NVIDIA/add-kdump
feat(kdump): Added kdump package
2 parents 10e4279 + bdcaaa6 commit dc76cf0

File tree

11 files changed

+1055
-0
lines changed

README.md

Lines changed: 20 additions & 0 deletions
@@ -60,6 +60,24 @@ A package for managing the tuned system tuning daemon on Linux systems for autom
6060
- Support for built-in profiles (balanced, powersave, throughput-performance, etc.)
6161
- Idempotent operations safe for repeated execution
6262

63+
### 4. Kdump Package (`kdump/`)
64+
A package for automated installation and configuration of kdump crash dump collection on Linux systems.
65+
66+
**Capabilities:**
67+
- Multi-distribution support (Ubuntu/Debian, CentOS/RHEL/Amazon Linux, Fedora)
68+
- Automated kdump package installation and service management
69+
- Crashkernel parameter configuration in GRUB
70+
- Custom kdump configuration deployment via configmaps
71+
- Comprehensive validation and health checks
72+
- Safe uninstallation with complete cleanup
73+
74+
**Key features:**
75+
- Configure kernel crash dump functionality for debugging system failures
76+
- Automatic crashkernel memory reservation in GRUB
77+
- Support for custom kdump.conf configurations
78+
- Post-interrupt validation of crash kernel functionality
79+
- Complete lifecycle management (install, configure, validate, uninstall)
80+
6381
## Package Structure
6482

6583
Each package follows the standard skyhook package structure:
@@ -190,6 +208,7 @@ This validation step is crucial as the agent uses JSON schema validation to ensu
190208
- `shellscript` for custom scripts and automation
191209
- `tuning` for system-level configuration management
192210
- `tuned` for automated performance tuning with the tuned daemon
211+
- `kdump` for kernel crash dump collection and debugging
193212
2. **Review the package README** for specific usage instructions and examples
194213
3. **Create a Skyhook Custom Resource** referencing the package
195214
4. **Apply the SCR** to your cluster and monitor the package deployment
@@ -201,6 +220,7 @@ This validation step is crucial as the agent uses JSON schema validation to ensu
201220
- [Shellscript Package](./shellscript/README.md) - Usage guide for the shellscript package
202221
- [Tuning Package](./tuning/README.md) - Usage guide for the tuning package
203222
- [Tuned Package](./tuned/README.md) - Usage guide for the tuned package
223+
- [Kdump Package](./kdump/README.md) - Usage guide for the kdump package
204224
- [NVIDIA Skyhook Documentation](https://github.com/NVIDIA/skyhook) - Main skyhook operator documentation
205225

206226
## Contributing

kdump/Dockerfile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
#
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
FROM busybox:latest
18+
19+
RUN mkdir -p /skyhook-package/skyhook_dir
20+
COPY . /skyhook-package

kdump/README.md

Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
1+
# Kdump Package
2+
3+
This Skyhook Package provides automated installation and configuration of kdump for crash dump collection on Linux systems. It supports multiple distributions and handles the complete lifecycle from installation to post-interrupt validations.
4+
5+
## Overview
6+
7+
The kdump package configures kernel crash dump functionality, which captures the contents of system memory when a kernel panic occurs. This is essential for debugging kernel crashes and system failures in production environments.
8+
9+
**Capabilities:**
10+
- Multi-distribution support (Ubuntu/Debian, CentOS/RHEL/Amazon Linux/Fedora)
11+
- Automated kdump package installation
12+
- Crashkernel parameter configuration in GRUB
13+
- Kdump service configuration and management
14+
- Comprehensive validation and health checks
15+
- Safe uninstallation with cleanup
16+
17+
## Required ConfigMaps
18+
19+
### `crashkernel`
20+
Specifies the amount of memory to reserve for the crash kernel. The package adds this value to the kernel command line as `crashkernel=<value>`.
21+
22+
**Format:** Single line with the crashkernel value
23+
**Examples:**
24+
- `256M` - Reserve 256MB for crash kernel
25+
- `512M` - Reserve 512MB for crash kernel
26+
- `1G` - Reserve 1GB for crash kernel
27+
- `auto` - Let the system determine the appropriate size
28+
29+
See the kdump documentation linked at the end of this README for guidance on choosing a crashkernel size.
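
The simple forms listed above can be checked before the value is handed to the package. The following is a minimal sketch and not part of the package: `is_simple_crashkernel` is a hypothetical helper, and it accepts only a number with a `K`, `M`, or `G` suffix or the literal `auto`. The kernel also understands range and offset syntax, which this sketch deliberately rejects.

```bash
# Hypothetical helper: returns 0 when the value is one of the simple
# crashkernel forms from the examples above (e.g. 256M, 1G, auto).
is_simple_crashkernel() {
    case "$1" in
        auto) return 0 ;;
    esac
    ck_number=${1%[KMG]}                 # strip one trailing size suffix
    [ "$ck_number" != "$1" ] || return 1 # no suffix was present
    case "$ck_number" in
        ''|*[!0-9]*) return 1 ;;         # empty or contains a non-digit
    esac
    return 0
}

is_simple_crashkernel "256M" && echo "256M: accepted"
is_simple_crashkernel "lots" || echo "lots: rejected"
```

Rejecting unknown forms early is a design choice for this sketch: a malformed value would otherwise only surface after the reboot, when the reservation is silently missing.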
30+
31+
### `kdump.conf` (Optional)
32+
Custom kdump configuration file content. If not provided, the package uses system defaults.
33+
34+
**Format:** Standard kdump.conf format
35+
**Example:**
```
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
```
40+
41+
## Lifecycle Stages
42+
43+
### Apply Stage (`install_kdump.sh`)
44+
- Detects the Linux distribution
45+
- Installs appropriate kdump packages:
46+
  - **Ubuntu/Debian**: `kdump-tools`, `crash`, `makedumpfile`
47+
  - **CentOS/RHEL/Amazon/Fedora**: `kexec-tools`, `crash`
48+
- Enables kdump service for automatic startup
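
The detection step above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `install_kdump.sh`: `detect_pkg_family` is a hypothetical helper that reads the `ID` and `ID_LIKE` fields of an os-release file and maps them to one of the two package families listed above.

```bash
# Hypothetical helper: prints "debian", "rhel", or "unknown" for the given
# os-release file (defaults to /etc/os-release). The body runs in a subshell
# so that sourcing the file does not leak variables into the caller.
detect_pkg_family() (
    os_release=${1:-/etc/os-release}
    [ -r "$os_release" ] || { echo unknown; exit 0; }
    ID= ID_LIKE=
    . "$os_release"
    case " $ID $ID_LIKE " in
        *" debian "*|*" ubuntu "*) echo debian ;;
        *" rhel "*|*" centos "*|*" fedora "*|*" amzn "*) echo rhel ;;
        *) echo unknown ;;
    esac
)

# Report which packages would be installed, without installing anything.
case "$(detect_pkg_family)" in
    debian) echo "would install: kdump-tools crash makedumpfile" ;;
    rhel)   echo "would install: kexec-tools crash" ;;
    *)      echo "unsupported distribution" ;;
esac
```

Checking `ID_LIKE` as well as `ID` lets derivatives fall into the right family without listing each one by name.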
49+
50+
### Config Stage (`configure_kdump.sh`)
51+
- Reads crashkernel value from configmap
52+
- Configures GRUB with crashkernel parameter:
53+
  - Uses `/etc/default/grub.d/` if available (preferred)
54+
  - Falls back to modifying `/etc/default/grub` directly
55+
- Updates GRUB configuration (`update-grub` or `grub2-mkconfig`)
56+
- Copies custom kdump.conf if provided
57+
58+
### Post-Interrupt Check (`kdump_post_interrupt_check.sh`)
59+
- Validates crashkernel parameter is active in running kernel
60+
- Verifies kdump service is running and enabled
61+
- Performs comprehensive system state validation
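
The first of these checks can be approximated with a few lines of shell. This is a sketch, not the contents of `kdump_post_interrupt_check.sh`: `cmdline_has_crashkernel` is a hypothetical helper, and it takes the file to inspect as an optional argument so it can be exercised without rebooting.

```bash
# Hypothetical helper: succeeds when the given kernel command line file
# (default /proc/cmdline) contains a crashkernel= parameter.
cmdline_has_crashkernel() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -q '^crashkernel='
}

if cmdline_has_crashkernel; then
    echo "crashkernel parameter is active"
else
    echo "crashkernel parameter is missing; has the node rebooted?"
fi

# /sys/kernel/kexec_crash_loaded reads 1 once a crash kernel has been loaded.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "crash kernel loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

Splitting the command line into one parameter per line avoids matching `crashkernel=` when it only appears inside another parameter's value.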
62+
63+
### Uninstall Stage (`uninstall_kdump.sh`)
64+
- Removes crashkernel parameter from GRUB configuration
65+
- Updates GRUB to remove crash kernel reservation
66+
- Stops and disables kdump service
67+
- Removes installed kdump packages
68+
- Cleans up configuration files
69+
70+
**NOTE**: Uninstalling removes the crashkernel parameter from the GRUB configuration, but the memory reservation stays in effect until the node is rebooted. The kdump Skyhook package does not trigger that reboot.
71+
72+
## Example Skyhook Custom Resource
73+
74+
### Basic kdump setup with 256MB crash kernel:
```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: kdump-setup
spec:
  nodeSelectors:
    matchLabels:
      skyhook.nvidia.com/node-type: worker
  packages:
    kdump:
      version: 1.0.0
      image: ghcr.io/nvidia/skyhook-packages/kdump:1.0.0
      interrupt:
        type: reboot # required for crashkernel parameter to take effect
      configInterrupts:
        crashkernel:
          type: reboot
      configMap:
        crashkernel: "256M"
```
96+
97+
### Advanced setup with custom kdump configuration:
```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: kdump-advanced
spec:
  nodeSelectors:
    matchLabels:
      skyhook.nvidia.com/node-type: worker
  packages:
    kdump:
      version: 1.0.0
      image: ghcr.io/nvidia/skyhook-packages/kdump:1.0.0
      interrupt:
        type: reboot # required for crashkernel parameter to take effect
      configInterrupts:
        crashkernel:
          type: reboot
        kdump.conf:
          type: service
          # Use exactly one `services` entry; a YAML mapping cannot repeat the key.
          # services: ["kdump"] # For RHEL/CentOS/Fedora
          services: ["kdump-tools"] # For Debian based distros (matches the kdump-tools config below)
      configMap:
        crashkernel: "512M"
        kdump.conf: |
          # kdump-tools configuration
          # ---------------------------------------------------------------------------
          # USE_KDUMP - controls whether kdump will be configured
          #   0 - kdump kernel will not be loaded
          #   1 - kdump kernel will be loaded and kdump is configured
          #
          USE_KDUMP=1


          # ---------------------------------------------------------------------------
          # Kdump Kernel:
          # KDUMP_KERNEL - A full pathname to a kdump kernel.
          # KDUMP_INITRD - A full pathname to the kdump initrd (if used).
          #   If these are not set, kdump-config will try to use the current kernel
          #   and initrd if it is relocatable. Otherwise, you will need to specify
          #   these manually.
          KDUMP_KERNEL=/var/lib/kdump/vmlinuz
          KDUMP_INITRD=/var/lib/kdump/initrd.img


          # ---------------------------------------------------------------------------
          # vmcore Handling:
          # KDUMP_COREDIR - local path to save the vmcore to.
          # KDUMP_FAIL_CMD - This variable can be used to cause a reboot or
          #   start a shell if saving the vmcore fails. If not set, "reboot -f"
          #   is the default.
          #   Example - start a shell if the vmcore copy fails:
          #     KDUMP_FAIL_CMD="echo 'makedumpfile FAILED.'; /bin/bash; reboot -f"
          # KDUMP_DUMP_DMESG - This variable controls if the dmesg buffer is dumped.
          #   If unset or set to 1, the dmesg buffer is dumped. If set to 0, the dmesg
          #   buffer is not dumped.
          # KDUMP_NUM_DUMPS - This variable controls how many dump files are kept on
          #   the machine to prevent running out of disk space. If set to 0 or unset,
          #   the variable is ignored and no dump files are automatically purged.
          # KDUMP_COMPRESSION - Compress the dumpfile. No compression is used by default.
          #   Supported compressions: bzip2, gzip, lz4, xz
          KDUMP_COREDIR="/var/crash"
          #KDUMP_FAIL_CMD="reboot -f"
          #KDUMP_DUMP_DMESG=
          #KDUMP_NUM_DUMPS=
          #KDUMP_COMPRESSION=


          # ---------------------------------------------------------------------------
          # Makedumpfile options:
          # MAKEDUMP_ARGS - extra arguments passed to makedumpfile (8). The default,
          #   if unset, is to pass '-c -d 31' telling makedumpfile to use compression
          #   and reduce the corefile to in-use kernel pages only.
          #MAKEDUMP_ARGS="-c -d 31"


          # ---------------------------------------------------------------------------
          # Kexec/Kdump args
          # KDUMP_KEXEC_ARGS - Additional arguments to the kexec command used to load
          #   the kdump kernel
          #   Example - Use this option on x86 systems with PAE and more than
          #   4 gig of memory:
          #     KDUMP_KEXEC_ARGS="--elf64-core-headers"
          # KDUMP_CMDLINE - The default is to use the contents of /proc/cmdline.
          #   Set this variable to override /proc/cmdline.
          # KDUMP_CMDLINE_APPEND - Additional arguments to append to the command line
          #   for the kdump kernel. If unset, it defaults to
          #   "reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb"
          #KDUMP_KEXEC_ARGS=""
          #KDUMP_CMDLINE=""
          #KDUMP_CMDLINE_APPEND="reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb"


          # ---------------------------------------------------------------------------
          # Architecture specific Overrides:

          # ---------------------------------------------------------------------------
          # Remote dump facilities:
          # HOSTTAG - Select if hostname or IP address will be used as a prefix to the
          #   timestamped directory when sending files to the remote server.
          #   'ip' is the default.
          #HOSTTAG="hostname|[ip]"

          # NFS - Hostname and mount point of the NFS server configured to receive
          #   the crash dump. The syntax must be {HOSTNAME}:{MOUNTPOINT}
          #   (e.g. remote:/var/crash)
          # NFS_TIMEO - Timeout before NFS retries a request. See man nfs(5) for details.
          # NFS_RETRANS - Number of times NFS client retries a request. See man nfs(5) for details.
          #NFS="<nfs mount>"
          #NFS_TIMEO="600"
          #NFS_RETRANS="3"

          # FTP - Hostname and path of the FTP server configured to receive the crash dump.
          #   The syntax is {HOSTNAME}[:{PATH}] with PATH defaulting to /.
          # FTP_USER - FTP username. An anonymous upload will be used if not set.
          # FTP_PASSWORD - password for the FTP user
          # FTP_PORT=21 - FTP port. Port 21 will be used by default.
          #FTP="<server>:<path>"
          #FTP_USER=""
          #FTP_PASSWORD=""
          #FTP_PORT=21

          # SSH - username and hostname of the remote server that will receive the dump
          #   and dmesg files.
          # SSH_KEY - Full path of the ssh private key to be used to login to the remote
          #   server. use kdump-config propagate to send the public key to the
          #   remote server
          #SSH="<user at server>"
          #SSH_KEY="<path>"
```
228+
229+
## Important Notes
230+
231+
### Single Package Support
232+
**Note:** Only one kdump package should be enabled at any given time. Configuring multiple kdump packages simultaneously can lead to conflicts and unpredictable behavior.
233+
234+
### Reboot Requirement
235+
- **Initial Setup**: A reboot is required after applying the package for the crashkernel parameter to take effect
236+
- **Configuration Changes**: Changing the crashkernel value requires a reboot
237+
- **Service Changes**: Modifying kdump.conf may require a restart of the kdump service, but not a full reboot
238+
- **Uninstallation**: Uninstalling removes the crashkernel parameter from the GRUB configuration, but the memory reservation stays in effect until the node is rebooted. The kdump Skyhook package does not trigger that reboot.
239+
240+
### Memory Considerations
241+
- The crashkernel parameter reserves memory that is not available to the main system
242+
- Choose an appropriate size based on your system's total memory and debugging needs
243+
- Too small: May not capture complete crash dumps
244+
- Too large: Reduces available system memory
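
To put a candidate value in proportion, the reservation can be compared against total memory. The following is a rough sketch under stated assumptions: `crashkernel_to_mib` and `crashkernel_percent` are hypothetical helpers, they handle only plain sizes such as `512M` or `1G`, and they say nothing about whether a given size is large enough to capture a dump.

```bash
# Hypothetical helper: converts a simple size such as 512M or 1G to MiB.
crashkernel_to_mib() {
    ck_number=${1%[KMG]}
    case "$ck_number" in
        ''|*[!0-9]*) return 1 ;;   # not a plain number with a K/M/G suffix
    esac
    case "$1" in
        *K) echo $(( ck_number / 1024 )) ;;
        *M) echo "$ck_number" ;;
        *G) echo $(( ck_number * 1024 )) ;;
        *)  return 1 ;;
    esac
}

# Hypothetical helper: prints the reservation as a whole-number percentage of
# MemTotal. $1 = size, $2 = meminfo file (default /proc/meminfo).
crashkernel_percent() {
    ck_mib=$(crashkernel_to_mib "$1") || return 1
    ck_total_kib=$(awk '/^MemTotal:/ {print $2}' "${2:-/proc/meminfo}")
    [ -n "$ck_total_kib" ] || return 1
    echo $(( ck_mib * 1024 * 100 / ck_total_kib ))
}

crashkernel_percent "512M" || echo "could not determine MemTotal"
```

The percentage is truncated to a whole number, so small reservations on large machines print `0`.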
245+
246+
### Distribution Support
247+
The package automatically detects and supports:
248+
- **Ubuntu 18.04+** and **Debian 9+**
249+
- **CentOS 7+**, **RHEL 7+**, **Amazon Linux 2+**
250+
- **Fedora 30+**
251+
252+
### Validation
253+
The package includes comprehensive checks:
254+
- GRUB configuration validation
255+
- Kernel parameter verification
256+
- Service status monitoring
257+
- Cross-stage consistency validation
258+
259+
## Troubleshooting
260+
261+
### Common Issues
262+
263+
1. **Crashkernel not active after reboot**
264+
   - Verify GRUB configuration was updated correctly
265+
   - Check if secure boot is preventing kernel parameter changes
266+
   - Ensure sufficient memory is available for reservation
267+
268+
2. **Kdump service fails to start**
269+
   - Check system logs: `journalctl -u kdump` (`kdump-tools` on Debian-based distros)
270+
   - Verify crashkernel parameter is active: `cat /proc/cmdline`
271+
   - Ensure adequate memory is reserved
272+
273+
3. **Package installation fails**
274+
   - Verify network connectivity for package downloads
275+
   - Check distribution compatibility
276+
   - Review package manager logs
277+
278+
### Verification Commands
279+
```bash
# Check if crashkernel is active
grep crashkernel /proc/cmdline

# Verify kdump service status (the unit is named kdump-tools on Debian-based distros)
systemctl status kdump

# Check available crash dump space
df -h /var/crash

# Test crash dump functionality (USE WITH CAUTION: this panics the kernel immediately)
echo c > /proc/sysrq-trigger
```
293+
294+
## Security Considerations
295+
296+
- Crash dumps may contain sensitive information from system memory
297+
- Ensure proper access controls on crash dump storage locations
298+
- Consider encryption for crash dump files in sensitive environments
299+
- Regular cleanup of old crash dumps to prevent disk space issues
300+
301+
## Kdump Documentation:
302+
- [Official kernel documentation](https://docs.kernel.org/admin-guide/kdump/kdump.html)
303+
- [Red Hat kdump documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide)
