Troubleshooting

Dr. Jan-Philip Gehrcke edited this page Nov 18, 2025 · 6 revisions

Collecting data

Kubelet plugin logs

Collect all kubelet plugin logs into a single file:

kubectl logs \
    -n nvidia-dra-driver-gpu \
    -l nvidia-dra-driver-gpu-component=kubelet-plugin \
    --prefix \
    --all-containers \
    --timestamps \
    --tail=-1 \
    > dra-driver-dbg_plugins_$(date -u +"%Y-%m-%dT%H%M%SZ").log

Notes:

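  • In a larger-scale environment, this may fetch a lot of data: --tail=-1 requests the complete log of every matching container.
  • Adding --prefix and --timestamps is critical for debuggability: the prefix identifies the pod and container each line came from, and the timestamps allow lines from different pods to be correlated.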
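
Because every line carries a pod/container prefix and a timestamp, the combined file can be reordered into a single timeline. The following is a minimal sketch, not part of the driver tooling. It assumes each line has the layout `[pod/<name>/<container>] <timestamp> <message>` and that the timestamps are fixed-width UTC values, so that they sort lexicographically. The name sort_dra_log_by_time is a hypothetical helper.

```shell
# Hypothetical helper: print a collected log file ordered by timestamp.
# Assumption: field 1 is the [pod/...] prefix, field 2 is the timestamp.
sort_dra_log_by_time() {
  if [ -z "$1" ]; then echo "missing arg: log file" >&2; return 1; fi
  # -s (stable) keeps the original relative order of lines with equal timestamps.
  LC_ALL=C sort -s -k2,2 "$1"
}
```

Usage: sort_dra_log_by_time dra-driver-dbg_plugins_<timestamp>.log | less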

CD daemon logs (for a specific ComputeDomain)

Use this shell function (paste it into your terminal):

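get_all_cd_daemon_logs_for_cd_name() {
  if [ -z "$1" ]; then echo "missing arg: CD name" >&2; return 1; fi
  # Keep helper variables out of the interactive shell's environment.
  local cd_name="$1" cd_uid cd_label_kv filename
  # Read the UID from the object metadata; grepping `kubectl describe` output
  # can match more than one line containing "UID".
  cd_uid=$(kubectl get computedomains.resource.nvidia.com "${cd_name}" -o jsonpath='{.metadata.uid}') || return 1
  if [ -z "${cd_uid}" ]; then echo "no UID found for CD: ${cd_name}" >&2; return 1; fi
  cd_label_kv="resource.nvidia.com/computeDomain=${cd_uid}"
  filename="dra-driver-dbg_cd-daemons_$(date -u +"%Y-%m-%dT%H%M%SZ").log.gz"
  echo "fetching CD daemon logs for CD: ${cd_label_kv} (${cd_name}), creating ${filename}"
  # Fetch complete, prefixed, timestamped logs of all matching pods and compress them.
  kubectl logs \
    -n nvidia-dra-driver-gpu \
    -l "${cd_label_kv}" \
    --all-containers \
    --timestamps \
    --tail=-1 \
    --prefix | gzip > "${filename}"
}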

Run it for a specific CD. Example:

$ get_all_cd_daemon_logs_for_cd_name imex-channel-injection
fetching CD daemon logs for CD: resource.nvidia.com/computeDomain=a97f19b1-b41e-4266-8ecd-d2730f96dbb2 (imex-channel-injection), creating dra-driver-dbg_cd-daemons_2025-11-18T144249Z.log.gz
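
To check which pods actually contributed to such an archive, the prefix added by --prefix can be tallied. This is a small sketch under the assumption that the first whitespace-separated field of every line is the `[pod/<name>/<container>]` prefix. The name count_lines_per_pod is a hypothetical helper.

```shell
# Hypothetical helper: count log lines per pod/container prefix in a .log.gz file.
count_lines_per_pod() {
  if [ -z "$1" ]; then echo "missing arg: .log.gz file" >&2; return 1; fi
  # Decompress to stdout, count occurrences of field 1, sort by prefix.
  gzip -dc < "$1" \
    | awk '{ n[$1]++ } END { for (p in n) print n[p], p }' \
    | LC_ALL=C sort -k2
}
```

Usage: count_lines_per_pod dra-driver-dbg_cd-daemons_2025-11-18T144249Z.log.gz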

Controlling log verbosity

During helm install et al.

Log verbosity can be set for all components using the --set logVerbosity=<V> parameter during helm install ... or helm upgrade -i ....
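
As an illustration, the upgrade variant could be wrapped as follows. This is a sketch only: the release name and chart reference are placeholders that must match your own installation, helm_set_dra_log_verbosity is a hypothetical name, and the use of --reuse-values (to keep previously supplied values) is an assumption about what is wanted here.

```shell
# Hypothetical wrapper around `helm upgrade -i`.
# $1: release name (placeholder), $2: chart reference (placeholder), $3: verbosity.
helm_set_dra_log_verbosity() {
  if [ "$#" -ne 3 ]; then
    echo "usage: helm_set_dra_log_verbosity <release> <chart> <V>" >&2
    return 1
  fi
  # --reuse-values keeps the values of the previous release (assumption: desired).
  helm upgrade -i "$1" "$2" \
    --namespace nvidia-dra-driver-gpu \
    --reuse-values \
    --set logVerbosity="$3"
}
```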

Post-install

The verbosity can be changed after deployment and per-component, using various finer-grained mechanisms. Some examples are shown below.

Note that, for now, none of the components can change their log verbosity at runtime: a pod restart is always required to pick up the changed configuration.

Controller

Set log verbosity of just the controller pod:

kubectl set env deployment nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu LOG_VERBOSITY=6

This command restarts the controller pod.

Kubelet plugins

Set log verbosity across kubelet plugin instances:

kubectl set env ds nvidia-dra-driver-gpu-kubelet-plugin -n nvidia-dra-driver-gpu LOG_VERBOSITY=6

This command triggers a restart for all plugin pods.
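
Since both commands trigger a restart, it can be convenient to wait for the rollout to finish before collecting logs. Below is a minimal sketch that combines the `kubectl set env` calls shown above with `kubectl rollout status`. The function name set_dra_log_verbosity and its two-argument interface are hypothetical.

```shell
# Hypothetical helper: set LOG_VERBOSITY on one component and wait for the restart.
# $1: controller | kubelet-plugin, $2: verbosity level.
set_dra_log_verbosity() {
  local target
  case "$1" in
    controller)     target="deployment/nvidia-dra-driver-gpu-controller" ;;
    kubelet-plugin) target="daemonset/nvidia-dra-driver-gpu-kubelet-plugin" ;;
    *) echo "usage: set_dra_log_verbosity <controller|kubelet-plugin> <V>" >&2; return 1 ;;
  esac
  if [ -z "$2" ]; then echo "missing arg: verbosity" >&2; return 1; fi
  kubectl set env "${target}" -n nvidia-dra-driver-gpu "LOG_VERBOSITY=$2" || return 1
  # Block until the restarted pods have been rolled out.
  kubectl rollout status "${target}" -n nvidia-dra-driver-gpu
}
```

Usage: set_dra_log_verbosity kubelet-plugin 6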

ComputeDomain daemons

Set log verbosity of CD daemons started in the future (this restarts the controller pod):

kubectl set env deployment nvidia-dra-driver-gpu-controller -n nvidia-dra-driver-gpu LOG_VERBOSITY_CD_DAEMON=6
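
To confirm which verbosity settings are currently configured on the controller deployment, the environment can be listed with `kubectl set env --list`. A small sketch follows. The helper name show_controller_log_verbosity is hypothetical, and the filter assumes the variables are set as plain NAME=value entries.

```shell
# Hypothetical helper: show the LOG_VERBOSITY* variables set on the controller.
show_controller_log_verbosity() {
  kubectl set env deployment/nvidia-dra-driver-gpu-controller \
    -n nvidia-dra-driver-gpu --list \
    | grep '^LOG_VERBOSITY'
}
```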