Issues with ceph-csi, Nomad, and Podman #5615

@benemon

Description

Describe the bug

When attempting to create a Ceph RBD volume via the Ceph-CSI driver on Nomad, the node plugin fails with:

rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument

This fails for both static and dynamic volume mounts.

For static mounts, the volume registers successfully but cannot be staged or mounted on the client.

For dynamic mounts, volume creation fails with the above error.

I was walking through this guide:

https://docs.ceph.com/en/latest/rbd/rbd-nomad/

But trying to use Podman instead of Docker.

  • The environment is a set of 3 identical hosts.
  • Nomad server, Nomad client, and Ceph are co-deployed on each host.
  • The controller and node plugins deploy successfully.
$ podman version
Client:       Podman Engine
Version:      5.4.0
API Version:  5.4.0
Go Version:   go1.23.4 (Red Hat 1.23.4-1.el9)
Built:        Wed Feb 12 15:54:13 2025
OS/Arch:      linux/amd64

Environment details

  • Image/version of Ceph CSI driver: quay.io/cephcsi/cephcsi:v3.14.2 (also tried v3.15.0)
  • Helm chart version: N/A (using Nomad CSI integration, not Kubernetes)
  • Kernel version: RHEL 9.6 - 5.14.0-570.12.1.el9_6.x86_64
  • Mounter used: krbd (the controller log shows mounter: rbd); the rbd kernel module is loaded:
[root@node1 ~]# lsmod | grep rbd
rbd                   155648  0
libceph               614400  1 rbd
  • Orchestrator:
Nomad v1.10.5+ent
BuildDate 2025-09-10T12:00:47Z
Revision 2c21645c33d1447a55b84a235ba4dd9ba225501e
  • Ceph cluster version: ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)

Steps to reproduce (dynamic)

  1. Deploy Ceph cluster using cephadm.
  2. Deploy Ceph-CSI RBD plugin as Nomad CSI plugin with pool and secrets.
  3. Attempt to create a volume named rbd-test.
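
For reference, the reproduction boils down to the following commands (file names are illustrative; the job and volume specifications are the ones listed under Additional context below):

nomad job run csi-rbd-controller.nomad.hcl
nomad job run csi-rbd-nodes.nomad.hcl

# wait for healthy controller and node instances
nomad plugin status rbd.csi.ceph.com

# dynamic workflow
nomad volume create rbd-test.volume.hcl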

Actual results

Dynamic

  • Volume provisioning fails
  • Controller logs show:
2025-10-01T20:13:21.684252331+01:00 stderr F I1001 19:13:21.684231       1 utils.go:265] ID: 2014 Req-ID: rbd-test GRPC call: /csi.v1.Controller/CreateVolume
2025-10-01T20:13:21.684280988+01:00 stderr F I1001 19:13:21.684277       1 utils.go:266] ID: 2014 Req-ID: rbd-test GRPC request: {"accessibility_requirements":{},"capacity_range":{"limit_bytes":2000000000,"required_bytes":2000000000},"name":"rbd-test","parameters":{"clusterID":"eb7e38f6-954d-11f0-957a-b0416f15f052","imageFeatures":"layering","pool":"rbd"},"secrets":"***stripped***","volume_capabilities":[{"access_mode":{"mode":"SINGLE_NODE_WRITER"},"mount":{"fs_type":"ext4"}}]}
2025-10-01T20:13:21.684416476+01:00 stderr F I1001 19:13:21.684411       1 rbd_util.go:1411] ID: 2014 Req-ID: rbd-test setting disableInUseChecks: false image features: [layering] mounter: rbd
2025-10-01T20:13:21.691141742+01:00 stderr F E1001 19:13:21.691116       1 controllerserver.go:243] ID: 2014 Req-ID: rbd-test failed to connect to volume : failed to get connection: connecting failed: rados: ret=-22, Invalid argument
2025-10-01T20:13:21.691153007+01:00 stderr F E1001 19:13:21.691140       1 utils.go:270] ID: 2014 Req-ID: rbd-test GRPC error: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument

Static

  • Volume registration succeeds.
  • During staging/mount, Nomad fails with:
failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: node plugin returned an internal error
  • Node plugin logs show a similar error:
I1001 12:36:47.752182       1 utils.go:266] ID: 74 Req-ID: rbd/nomad-rbd-test GRPC request: {"staging_target_path":"/local/csi/staging/default/rbd-test/rw-file-system-single-node-writer","volume_capability":{"access_mode":{"mode":"SINGLE_NODE_WRITER"}}
E1001 12:36:47.753009       1 utils.go:291] ID: 74 Req-ID: rbd/nomad-rbd-test GRPC error: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument

Expected behavior

The Ceph RBD volume should stage and mount successfully on the Nomad client in both the static and dynamic workflows.


Logs

  • csi-rbdplugin logs: see above

Additional context

  • Pool and user keyring are valid; rbd map on the same node works fine:
rbd ls -p <pool_name>
rbd info <pool_name>/<image_name>
sudo rbd map <pool_name>/<image_name> --id <ceph_user> --keyfile /etc/ceph/<user>.keyring
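
A further check I can think of (not yet part of the verification above) is to test the same credentials with the ceph CLI from inside the running node-plugin container, which exercises a librados connection from the container's own environment. This assumes the cephcsi image ships the ceph CLI; the container name filter and key below are placeholders:

podman ps --filter name=csi-rbd-plugin-node
podman exec -it <csi-rbdplugin-container> \
  ceph -s -m node1,node2,node3 --id nomad-csi --key '<userKey>'
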
  • Issue appears specific to Ceph-CSI when used in Nomad.

  • Cluster health:

ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
  cluster:
    id:     eb7e38f6-954d-11f0-957a-b0416f15f052
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 7h)
    mgr: node1.pjfnba(active, since 7h), standbys: node3.bcisde, node2.scpkvw
    osd: 3 osds: 3 up (since 7h), 3 in (since 11d)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   8 pools, 137 pgs
    objects: 381 objects, 992 KiB
    usage:   217 MiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     137 active+clean

  • Pools:

[ceph: root@node1 /]# rados lspools
.mgr
rbd
default.rgw.buckets.data
.rgw.root
default.rgw.buckets.index
lab.rgw.log
lab.rgw.control
lab.rgw.meta
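
One more sanity check on the pool itself (nothing has pointed at this so far, just ruling it out): the rbd pool should have the rbd application enabled, which the guide's rbd pool init step normally handles.

ceph osd pool application get rbd
# if it comes back empty, re-run the init step from the guide:
rbd pool init rbd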



  • Controller job:
job "csi-rbd-controller" {
  datacenters = ["dc1"]
  node_pool   = "hci-pool"
  type        = "service"

  update {
    progress_deadline = "25m"
  }

  group "controller" {
    network {
      port "metrics" {
      }
    }

    task "csi-rbd-plugin-controller" {
      driver = "podman"

      template {
        data = <<EOF
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "node1",
    "node2",
    "node3"
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image        = "quay.io/cephcsi/cephcsi:v3.14.2"
        force_pull   = false
        network_mode = "host"

        volumes = [
          "local/config.json:/etc/ceph-csi-config/config.json:ro",
          "/home/nomad/csi/keys:/tmp/csi/keys:Z"
        ]

        args = [
          "--type=rbd",
          "--controllerserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=${NOMAD_PORT_metrics}"
        ]
        privileged = true
      }

      kill_timeout = "20m"

      csi_plugin {
        id        = "rbd.csi.ceph.com"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      service {
        name     = "ceph-csi-controller"
        provider = "nomad"
        port     = "metrics"
        tags     = ["prometheus"]
      }
    }
  }
}
job "csi-rbd-nodes" {
  datacenters = ["dc1"]
  node_pool   = "hci-pool"
  type        = "system"

  update {
    progress_deadline = "25m"
  }

  group "nodes" {
    task "csi-rbd-plugin-node" {
      driver = "podman"

      template {
        data = <<EOF
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "node1",
    "node2",
    "node3"
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image        = "quay.io/cephcsi/cephcsi:v3.14.2"
        force_pull   = false
        network_mode = "host"

        volumes = [
          "local/config.json:/etc/ceph-csi-config/config.json:ro",
          "/home/nomad/csi/keys:/tmp/csi/keys:Z",
          "/lib/modules/${attr.kernel.version}:/lib/modules/${attr.kernel.version}:ro"
        ]

        args = [
          "--type=rbd",
          "--nodeserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-node",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5"
        ]
        privileged = true
      }

      kill_timeout = "20m"

      csi_plugin {
        id        = "rbd.csi.ceph.com"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      service {
        name     = "ceph-csi-node"
        provider = "nomad"
      }
    }
  }
}

NOTE: I have also tried using the mon IPs instead of hostnames. Same response.
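
A couple of further checks that might narrow this down (container name is a placeholder; the SELinux note is only a hunch on my part, prompted by the keys mount using :Z while the config mount does not): confirm the rendered config.json is readable at the path ceph-csi expects inside the running plugin container, and that the mons answer on the default ports (3300 for msgr2, 6789 for msgr1).

podman exec <csi-rbdplugin-container> cat /etc/ceph-csi-config/config.json
podman exec <csi-rbdplugin-container> ls -lZ /etc/ceph-csi-config/
# mon reachability from the host (the plugins run with network_mode = "host")
nc -zv node1 3300
nc -zv node1 6789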

  • Volume:
id           = "rbd-test"
name         = "rbd-test"
type         = "csi"
plugin_id    = "rbd.csi.ceph.com"
capacity_min = "2G"
capacity_max = "2G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  userID  = "nomad-csi"
  userKey = "my-key"
}

parameters {
  clusterID     = "eb7e38f6-954d-11f0-957a-b0416f15f052"
  pool          = "rbd"
  imageFeatures = "layering"
}

mount_options {
  fs_type = "ext4"       
}

I hope this is a case of PEBKAC, but I'm struggling to find references for anyone running a modern Nomad version with Ceph-CSI and Podman!
