|
| 1 | +# Kdump Package |
| 2 | + |
| 3 | +This Skyhook Package provides automated installation and configuration of kdump for crash dump collection on Linux systems. It supports multiple distributions and handles the complete lifecycle, from installation through post-interrupt validation. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The kdump package configures kernel crash dump functionality, which captures the contents of system memory when a kernel panic occurs. This is essential for debugging kernel crashes and system failures in production environments. |
| 8 | + |
| 9 | +**Capabilities:** |
| 10 | +- Multi-distribution support (Ubuntu/Debian, CentOS/RHEL/Amazon Linux/Fedora) |
| 11 | +- Automated kdump package installation |
| 12 | +- Crashkernel parameter configuration in GRUB |
| 13 | +- Kdump service configuration and management |
| 14 | +- Comprehensive validation and health checks |
| 15 | +- Safe uninstallation with cleanup |
| 16 | + |
| 17 | +## Required ConfigMaps |
| 18 | + |
| 19 | +### `crashkernel` |
| 20 | +Specifies the amount of memory to reserve for the crash kernel. This value is added to the kernel command line. |
| 21 | + |
| 22 | +**Format:** Single line with the crashkernel value |
| 23 | +**Examples:** |
| 24 | +- `256M` - Reserve 256MB for crash kernel |
| 25 | +- `512M` - Reserve 512MB for crash kernel |
| 26 | +- `1G` - Reserve 1GB for crash kernel |
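| 27 | +- `auto` - Let the system determine the appropriate size (distribution-specific: only kernels that implement `crashkernel=auto`, such as those in RHEL/CentOS 7 and 8, honor it) |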
| 28 | + |
| 29 | +See the kdump documentation linked at the end of this README for guidance on choosing a crashkernel size. |
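As a rough illustration of the value format described above, the sketch below accepts only the simple forms listed in the examples (`auto`, or a size with a `K`, `M`, or `G` suffix). The helper name `is_valid_crashkernel` is hypothetical and not part of the package, and the kernel itself also accepts richer syntax (ranges and offsets) that this check deliberately rejects.

```bash
# Sketch only: is_valid_crashkernel is a hypothetical helper, not part of the package.
# Accepts "auto" or a positive integer followed by K, M, or G (for example 256M, 1G).
is_valid_crashkernel() {
    printf '%s\n' "$1" | grep -Eq '^(auto|[1-9][0-9]*[KMG])$'
}

# Usage:
#   is_valid_crashkernel "256M" && echo "looks like a simple crashkernel value"
```

A check like this could run before the value is written to the kernel command line, so that a typo such as `256` (no unit) is caught before a reboot.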
| 30 | + |
| 31 | +### `kdump.conf` (Optional) |
| 32 | +Custom kdump configuration file content. If not provided, the package uses system defaults. |
| 33 | + |
| 34 | +**Format:** Standard kdump.conf format |
| 35 | +**Example:** |
| 36 | +``` |
| 37 | +path /var/crash |
| 38 | +core_collector makedumpfile -l --message-level 1 -d 31 |
| 39 | +``` |
| 40 | + |
| 41 | +## Lifecycle Stages |
| 42 | + |
| 43 | +### Apply Stage (`install_kdump.sh`) |
| 44 | +- Detects the Linux distribution |
| 45 | +- Installs appropriate kdump packages: |
| 46 | + - **Ubuntu/Debian**: `kdump-tools`, `crash`, `makedumpfile` |
| 47 | + - **CentOS/RHEL/Amazon/Fedora**: `kexec-tools`, `crash` |
| 48 | +- Enables kdump service for automatic startup |
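The distribution-to-package mapping above can be sketched as a small shell function. This is not the code from `install_kdump.sh`; the function name `kdump_packages_for` is hypothetical, and the sketch assumes the distribution is identified by the `ID` field of `/etc/os-release` (where Amazon Linux reports `amzn`).

```bash
# Sketch only: kdump_packages_for is a hypothetical helper, not the package's script.
# Maps an /etc/os-release ID to the package names listed above.
kdump_packages_for() {
    case "$1" in
        ubuntu|debian)
            echo "kdump-tools crash makedumpfile" ;;
        centos|rhel|amzn|fedora)
            echo "kexec-tools crash" ;;
        *)
            echo "unsupported distribution: $1" >&2
            return 1 ;;
    esac
}

# Usage on a real host (reads ID from /etc/os-release):
#   . /etc/os-release && kdump_packages_for "$ID"
```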
| 49 | + |
| 50 | +### Config Stage (`configure_kdump.sh`) |
| 51 | +- Reads crashkernel value from configmap |
| 52 | +- Configures GRUB with crashkernel parameter: |
| 53 | + - Uses `/etc/default/grub.d/` if available (preferred) |
| 54 | + - Falls back to modifying `/etc/default/grub` directly |
| 55 | +- Updates GRUB configuration (`update-grub` or `grub2-mkconfig`) |
| 56 | +- Copies custom kdump.conf if provided |
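The drop-in-or-fallback choice described above might look roughly like the following. This is a sketch under stated assumptions, not the contents of `configure_kdump.sh`: the function name, the optional root-directory argument (used here only so the logic can be exercised outside a real `/etc`), and the drop-in file name `kdump-crashkernel.cfg` are all hypothetical. The fallback is not idempotent, so running it twice would append the parameter twice.

```bash
# Sketch only: names and file paths below are assumptions, not the package's own.
# Usage: write_crashkernel_config VALUE [ROOT]
write_crashkernel_config() {
    local value="$1"
    local root="${2:-}"
    local dropin_dir="$root/etc/default/grub.d"
    local grub_file="$root/etc/default/grub"

    if [ -d "$dropin_dir" ]; then
        # Preferred: a drop-in file that appends to the existing command line.
        printf 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX crashkernel=%s"\n' "$value" \
            > "$dropin_dir/kdump-crashkernel.cfg"
    else
        # Fallback: append crashkernel= inside the quoted GRUB_CMDLINE_LINUX value.
        local tmp
        tmp="$(mktemp)"
        sed -E "s|^(GRUB_CMDLINE_LINUX=\"[^\"]*)\"|\1 crashkernel=${value}\"|" \
            "$grub_file" > "$tmp" && cat "$tmp" > "$grub_file"
        rm -f "$tmp"
    fi
}

# After either branch, regenerate the GRUB config with update-grub or
# grub2-mkconfig, as the stage description above notes.
```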
| 57 | + |
| 58 | +### Post-Interrupt Check (`kdump_post_interrupt_check.sh`) |
| 59 | +- Validates crashkernel parameter is active in running kernel |
| 60 | +- Verifies kdump service is running and enabled |
| 61 | +- Performs comprehensive system state validation |
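A minimal sketch of the first check above, assuming it amounts to looking for a `crashkernel=` token on the running kernel's command line. The function name `crashkernel_active` is hypothetical, and the optional file argument exists only so the check can be tried against a sample file instead of `/proc/cmdline`.

```bash
# Sketch only: crashkernel_active is a hypothetical helper, not the package's script.
# Succeeds if the given command-line file contains a crashkernel=<value> token.
crashkernel_active() {
    local cmdline_file="${1:-/proc/cmdline}"
    grep -Eq '(^|[[:space:]])crashkernel=[^[:space:]]+' "$cmdline_file"
}

# Usage on a rebooted host:
#   crashkernel_active && echo "crashkernel parameter is active"
#   cat /sys/kernel/kexec_crash_size     # bytes reserved for the crash kernel
#   cat /sys/kernel/kexec_crash_loaded   # 1 once a crash kernel is loaded
#   systemctl is-active kdump            # kdump-tools on Debian-based distros
#   systemctl is-enabled kdump
```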
| 62 | + |
| 63 | +### Uninstall Stage (`uninstall_kdump.sh`) |
| 64 | +- Removes crashkernel parameter from GRUB configuration |
| 65 | +- Updates GRUB to remove crash kernel reservation |
| 66 | +- Stops and disables kdump service |
| 67 | +- Removes installed kdump packages |
| 68 | +- Cleans up configuration files |
| 69 | + |
| 70 | +**NOTE**: Uninstalling removes the crashkernel parameter from the GRUB configuration, but the change only takes effect after a reboot. The kdump Skyhook package does not perform that reboot. |
| 71 | + |
| 72 | +## Example Skyhook Custom Resource |
| 73 | + |
| 74 | +### Basic kdump setup with 256MB crash kernel: |
| 75 | +```yaml |
| 76 | +apiVersion: skyhook.nvidia.com/v1alpha1 |
| 77 | +kind: Skyhook |
| 78 | +metadata: |
| 79 | + name: kdump-setup |
| 80 | +spec: |
| 81 | + nodeSelectors: |
| 82 | + matchLabels: |
| 83 | + skyhook.nvidia.com/node-type: worker |
| 84 | + packages: |
| 85 | + kdump: |
| 86 | + version: 1.0.0 |
| 87 | + image: ghcr.io/nvidia/skyhook-packages/kdump:1.0.0 |
| 88 | + interrupt: |
| 89 | + type: reboot # required for crashkernel parameter to take effect |
| 90 | + configInterrupts: |
| 91 | + crashkernel: |
| 92 | + type: reboot |
| 93 | + configMap: |
| 94 | + crashkernel: "256M" |
| 95 | +``` |
| 96 | +
|
| 97 | +### Advanced setup with custom kdump configuration: |
| 98 | +```yaml |
| 99 | +apiVersion: skyhook.nvidia.com/v1alpha1 |
| 100 | +kind: Skyhook |
| 101 | +metadata: |
| 102 | + name: kdump-advanced |
| 103 | +spec: |
| 104 | + nodeSelectors: |
| 105 | + matchLabels: |
| 106 | + skyhook.nvidia.com/node-type: worker |
| 107 | + packages: |
| 108 | + kdump: |
| 109 | + version: 1.0.0 |
| 110 | + image: ghcr.io/nvidia/skyhook-packages/kdump:1.0.0 |
| 111 | + interrupt: |
| 112 | + type: reboot # required for crashkernel parameter to take effect |
| 113 | + configInterrupts: |
| 114 | + crashkernel: |
| 115 | + type: reboot |
| 116 | + kdump.conf: |
| 117 | + type: service |
| 118 | + # services: ["kdump"] # use this line instead on RHEL/CentOS/Fedora |
| 119 | + services: ["kdump-tools"] # Debian-based distros; set only one services key |
| 120 | + configMap: |
| 121 | + crashkernel: "512M" |
| 122 | + kdump.conf: | |
| 123 | + # kdump-tools configuration |
| 124 | + # --------------------------------------------------------------------------- |
| 125 | + # USE_KDUMP - controls whether kdump will be configured |
| 126 | + # 0 - kdump kernel will not be loaded |
| 127 | + # 1 - kdump kernel will be loaded and kdump is configured |
| 128 | + # |
| 129 | + USE_KDUMP=1 |
| 130 | +
|
| 131 | +
|
| 132 | + # --------------------------------------------------------------------------- |
| 133 | + # Kdump Kernel: |
| 134 | + # KDUMP_KERNEL - A full pathname to a kdump kernel. |
| 135 | + # KDUMP_INITRD - A full pathname to the kdump initrd (if used). |
| 136 | + # If these are not set, kdump-config will try to use the current kernel |
| 137 | + # and initrd if it is relocatable. Otherwise, you will need to specify |
| 138 | + # these manually. |
| 139 | + KDUMP_KERNEL=/var/lib/kdump/vmlinuz |
| 140 | + KDUMP_INITRD=/var/lib/kdump/initrd.img |
| 141 | +
|
| 142 | +
|
| 143 | + # --------------------------------------------------------------------------- |
| 144 | + # vmcore Handling: |
| 145 | + # KDUMP_COREDIR - local path to save the vmcore to. |
| 146 | + # KDUMP_FAIL_CMD - This variable can be used to cause a reboot or |
| 147 | + # start a shell if saving the vmcore fails. If not set, "reboot -f" |
| 148 | + # is the default. |
| 149 | + # Example - start a shell if the vmcore copy fails: |
| 150 | + # KDUMP_FAIL_CMD="echo 'makedumpfile FAILED.'; /bin/bash; reboot -f" |
| 151 | + # KDUMP_DUMP_DMESG - This variable controls if the dmesg buffer is dumped. |
| 152 | + # If unset or set to 1, the dmesg buffer is dumped. If set to 0, the dmesg |
| 153 | + # buffer is not dumped. |
| 154 | + # KDUMP_NUM_DUMPS - This variable controls how many dump files are kept on |
| 155 | + # the machine to prevent running out of disk space. If set to 0 or unset, |
| 156 | + # the variable is ignored and no dump files are automatically purged. |
| 157 | + # KDUMP_COMPRESSION - Compress the dumpfile. No compression is used by default. |
| 158 | + # Supported compressions: bzip2, gzip, lz4, xz |
| 159 | + KDUMP_COREDIR="/var/crash" |
| 160 | + #KDUMP_FAIL_CMD="reboot -f" |
| 161 | + #KDUMP_DUMP_DMESG= |
| 162 | + #KDUMP_NUM_DUMPS= |
| 163 | + #KDUMP_COMPRESSION= |
| 164 | +
|
| 165 | +
|
| 166 | + # --------------------------------------------------------------------------- |
| 167 | + # Makedumpfile options: |
| 168 | + # MAKEDUMP_ARGS - extra arguments passed to makedumpfile (8). The default, |
| 169 | + # if unset, is to pass '-c -d 31' telling makedumpfile to use compression |
| 170 | + # and reduce the corefile to in-use kernel pages only. |
| 171 | + #MAKEDUMP_ARGS="-c -d 31" |
| 172 | +
|
| 173 | +
|
| 174 | + # --------------------------------------------------------------------------- |
| 175 | + # Kexec/Kdump args |
| 176 | + # KDUMP_KEXEC_ARGS - Additional arguments to the kexec command used to load |
| 177 | + # the kdump kernel |
| 178 | + # Example - Use this option on x86 systems with PAE and more than |
| 179 | + # 4 gig of memory: |
| 180 | + # KDUMP_KEXEC_ARGS="--elf64-core-headers" |
| 181 | + # KDUMP_CMDLINE - The default is to use the contents of /proc/cmdline. |
| 182 | + # Set this variable to override /proc/cmdline. |
| 183 | + # KDUMP_CMDLINE_APPEND - Additional arguments to append to the command line |
| 184 | + # for the kdump kernel. If unset, it defaults to |
| 185 | + # "reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb" |
| 186 | + #KDUMP_KEXEC_ARGS="" |
| 187 | + #KDUMP_CMDLINE="" |
| 188 | + #KDUMP_CMDLINE_APPEND="reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb" |
| 189 | +
|
| 190 | +
|
| 191 | + # --------------------------------------------------------------------------- |
| 192 | + # Architecture specific Overrides: |
| 193 | +
|
| 194 | + # --------------------------------------------------------------------------- |
| 195 | + # Remote dump facilities: |
| 196 | + # HOSTTAG - Select if hostname or IP address will be used as a prefix to the |
| 197 | + # timestamped directory when sending files to the remote server. |
| 198 | + # 'ip' is the default. |
| 199 | + #HOSTTAG="hostname|[ip]" |
| 200 | +
|
| 201 | + # NFS - Hostname and mount point of the NFS server configured to receive |
| 202 | + # the crash dump. The syntax must be {HOSTNAME}:{MOUNTPOINT} |
| 203 | + # (e.g. remote:/var/crash) |
| 204 | + # NFS_TIMEO - Timeout before NFS retries a request. See man nfs(5) for details. |
| 205 | + # NFS_RETRANS - Number of times NFS client retries a request. See man nfs(5) for details. |
| 206 | + #NFS="<nfs mount>" |
| 207 | + #NFS_TIMEO="600" |
| 208 | + #NFS_RETRANS="3" |
| 209 | +
|
| 210 | + # FTP - Hostname and path of the FTP server configured to receive the crash dump. |
| 211 | + # The syntax is {HOSTNAME}[:{PATH}] with PATH defaulting to /. |
| 212 | + # FTP_USER - FTP username. An anonymous upload will be used if not set. |
| 213 | + # FTP_PASSWORD - password for the FTP user |
| 214 | + # FTP_PORT=21 - FTP port. Port 21 will be used by default. |
| 215 | + #FTP="<server>:<path>" |
| 216 | + #FTP_USER="" |
| 217 | + #FTP_PASSWORD="" |
| 218 | + #FTP_PORT=21 |
| 219 | +
|
| 220 | + # SSH - username and hostname of the remote server that will receive the dump |
| 221 | + # and dmesg files. |
| 222 | + # SSH_KEY - Full path of the ssh private key to be used to login to the remote |
| 223 | + # server. use kdump-config propagate to send the public key to the |
| 224 | + # remote server |
| 225 | + #SSH="<user at server>" |
| 226 | + #SSH_KEY="<path>" |
| 227 | +``` |
| 228 | +
|
| 229 | +## Important Notes |
| 230 | +
|
| 231 | +### Single Package Support |
| 232 | +**Note:** Only one kdump package should be enabled at any given time. Configuring multiple kdump packages simultaneously can lead to conflicts and unpredictable behavior. |
| 233 | +
|
| 234 | +### Reboot Requirement |
| 235 | +- **Initial Setup**: A reboot is required after applying the package for the crashkernel parameter to take effect |
| 236 | +- **Configuration Changes**: Changing the crashkernel value requires a reboot |
| 237 | +- **Service Changes**: Modifying kdump.conf may require a restart of the kdump service, but not a full reboot |
| 238 | +- **Uninstallation**: Uninstalling removes the crashkernel parameter from the GRUB configuration, but the change only takes effect after a reboot. The kdump Skyhook package does not perform that reboot. |
| 239 | +
|
| 240 | +### Memory Considerations |
| 241 | +- The crashkernel parameter reserves memory that is not available to the main system |
| 242 | +- Choose an appropriate size based on your system's total memory and debugging needs |
| 243 | +- Too small: May not capture complete crash dumps |
| 244 | +- Too large: Reduces available system memory |
| 245 | +
|
| 246 | +### Distribution Support |
| 247 | +The package automatically detects and supports: |
| 248 | +- **Ubuntu 18.04+** and **Debian 9+** |
| 249 | +- **CentOS 7+**, **RHEL 7+**, **Amazon Linux 2+** |
| 250 | +- **Fedora 30+** |
| 251 | +
|
| 252 | +### Validation |
| 253 | +The package includes comprehensive checks: |
| 254 | +- GRUB configuration validation |
| 255 | +- Kernel parameter verification |
| 256 | +- Service status monitoring |
| 257 | +- Cross-stage consistency validation |
| 258 | +
|
| 259 | +## Troubleshooting |
| 260 | +
|
| 261 | +### Common Issues |
| 262 | +
|
| 263 | +1. **Crashkernel not active after reboot** |
| 264 | + - Verify GRUB configuration was updated correctly |
| 265 | + - Check if secure boot is preventing kernel parameter changes |
| 266 | + - Ensure sufficient memory is available for reservation |
| 267 | +
|
| 268 | +2. **Kdump service fails to start** |
| 269 | + - Check system logs: `journalctl -u kdump` (`journalctl -u kdump-tools` on Debian-based distros) |
| 270 | + - Verify crashkernel parameter is active: `cat /proc/cmdline` |
| 271 | + - Ensure adequate memory is reserved |
| 272 | + |
| 273 | +3. **Package installation fails** |
| 274 | + - Verify network connectivity for package downloads |
| 275 | + - Check distribution compatibility |
| 276 | + - Review package manager logs |
| 277 | + |
| 278 | +### Verification Commands |
| 279 | + |
| 280 | +```bash |
| 281 | +# Check if crashkernel is active |
| 282 | +grep crashkernel /proc/cmdline |
| 283 | +
|
| 284 | +# Verify kdump service status |
| 285 | +systemctl status kdump  # use kdump-tools on Debian-based distros |
| 286 | +
|
| 287 | +# Check available crash dump space |
| 288 | +df -h /var/crash |
| 289 | +
|
| 290 | +# Test crash dump functionality (USE WITH CAUTION: run as root; this crashes the running system immediately) |
| 291 | +echo c > /proc/sysrq-trigger |
| 292 | +``` |
| 293 | + |
| 294 | +## Security Considerations |
| 295 | + |
| 296 | +- Crash dumps may contain sensitive information from system memory |
| 297 | +- Ensure proper access controls on crash dump storage locations |
| 298 | +- Consider encryption for crash dump files in sensitive environments |
| 299 | +- Regular cleanup of old crash dumps to prevent disk space issues |
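The access-control and cleanup points above could be sketched as follows, assuming the dump directory is dedicated to kernel crash dumps (on some systems `/var/crash` is shared with user-space crash reporters, in which case tightening its mode may not be appropriate). The helper name `secure_and_prune_dumps` and the 30-day default are hypothetical. On Debian-based systems, the `KDUMP_NUM_DUMPS` setting shown in the example configuration is an alternative way to limit retained dumps.

```bash
# Sketch only: secure_and_prune_dumps is a hypothetical helper, not part of the package.
# Usage: secure_and_prune_dumps [DUMP_DIR] [MAX_AGE_DAYS]
secure_and_prune_dumps() {
    local dump_dir="${1:-/var/crash}"
    local max_age_days="${2:-30}"

    # Restrict the directory to its owner (normally root).
    chmod 700 "$dump_dir" || return 1

    # Delete top-level entries last modified more than max_age_days days ago.
    find "$dump_dir" -mindepth 1 -maxdepth 1 -mtime +"$max_age_days" -exec rm -rf {} +
}

# Usage (as root):
#   secure_and_prune_dumps /var/crash 30
```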
| 300 | + |
| 301 | +## Kdump Documentation |
| 302 | +- [Official kernel documentation](https://docs.kernel.org/admin-guide/kdump/kdump.html) |
| 303 | +- [Red Hat kdump documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide) |