Describe the bug
When attempting to create a Ceph RBD volume via the Ceph-CSI driver on Nomad, the Ceph-CSI plugin fails with:
```
rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument
```
This happens for both static and dynamic volumes:
- For static mounts, the volume is created and registered successfully but cannot be staged or mounted on the client.
- For dynamic mounts, volume creation itself fails with the same error.
I was working through this guide:
https://docs.ceph.com/en/latest/rbd/rbd-nomad/
but using Podman instead of Docker.
- The environment is a set of 3 identical hosts.
- The Nomad server, Nomad client, and Ceph are co-deployed on each host.
- The controller and node plugins deploy successfully (see the status check below).
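A minimal check that both plugin halves register as healthy (standard Nomad CLI; output omitted here):
```
nomad plugin status rbd.csi.ceph.com
```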
```
$ podman version
Client: Podman Engine
Version: 5.4.0
API Version: 5.4.0
Go Version: go1.23.4 (Red Hat 1.23.4-1.el9)
Built: Wed Feb 12 15:54:13 2025
OS/Arch: linux/amd64
```
Environment details
- Image/version of Ceph CSI driver: quay.io/cephcsi/cephcsi:v3.14.2 (also tried v3.15.0)
- Helm chart version: N/A (using the Nomad CSI integration, not Kubernetes)
- Kernel version: RHEL 9.6 - 5.14.0-570.12.1.el9_6.x86_64
- Mounter used: rbd (the kernel rbd module is loaded):
```
[root@node1 ~]# lsmod | grep rbd
rbd                   155648  0
libceph               614400  1 rbd
```
- Orchestrator: Nomad v1.10.5+ent (BuildDate 2025-09-10T12:00:47Z, Revision 2c21645c33d1447a55b84a235ba4dd9ba225501e)
- Ceph cluster version: ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
Steps to reproduce (dynamic)
- Deploy the Ceph cluster using cephadm.
- Deploy the Ceph-CSI RBD plugin as a Nomad CSI plugin with the pool and secrets configured.
- Try to create a volume named rbd-test (see the command sketch after this list).
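Volume creation uses the standard Nomad CLI; a minimal sketch, assuming the spec shown under "Volume" below is saved to a file (the file name is illustrative):
```
nomad volume create rbd-test.volume.hcl
```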
Actual results
Dynamic
- Volume provisioning fails
- Controller logs show:
```
2025-10-01T20:13:21.684252331+01:00 stderr F I1001 19:13:21.684231 1 utils.go:265] ID: 2014 Req-ID: rbd-test GRPC call: /csi.v1.Controller/CreateVolume
2025-10-01T20:13:21.684280988+01:00 stderr F I1001 19:13:21.684277 1 utils.go:266] ID: 2014 Req-ID: rbd-test GRPC request: {"accessibility_requirements":{},"capacity_range":{"limit_bytes":2000000000,"required_bytes":2000000000},"name":"rbd-test","parameters":{"clusterID":"eb7e38f6-954d-11f0-957a-b0416f15f052","imageFeatures":"layering","pool":"rbd"},"secrets":"***stripped***","volume_capabilities":[{"access_mode":{"mode":"SINGLE_NODE_WRITER"},"mount":{"fs_type":"ext4"}}]}
2025-10-01T20:13:21.684416476+01:00 stderr F I1001 19:13:21.684411 1 rbd_util.go:1411] ID: 2014 Req-ID: rbd-test setting disableInUseChecks: false image features: [layering] mounter: rbd
2025-10-01T20:13:21.691141742+01:00 stderr F E1001 19:13:21.691116 1 controllerserver.go:243] ID: 2014 Req-ID: rbd-test failed to connect to volume : failed to get connection: connecting failed: rados: ret=-22, Invalid argument
2025-10-01T20:13:21.691153007+01:00 stderr F E1001 19:13:21.691140 1 utils.go:270] ID: 2014 Req-ID: rbd-test GRPC error: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument
```
Static
- Volume registration succeeds.
- During staging/mount, Nomad fails with:
```
failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: node plugin returned an internal error
```
- Logs show a similar error:
```
I1001 12:36:47.752182 1 utils.go:266] ID: 74 Req-ID: rbd/nomad-rbd-test GRPC request: {"staging_target_path":"/local/csi/staging/default/rbd-test/rw-file-system-single-node-writer","volume_capability":{"access_mode":{"mode":"SINGLE_NODE_WRITER"}}
E1001 12:36:47.753009 1 utils.go:291] ID: 74 Req-ID: rbd/nomad-rbd-test GRPC error: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument
```
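For completeness, the static volume was registered with the standard Nomad workflow before staging (a sketch; the file name is illustrative, and the registration step itself succeeds):
```
nomad volume register rbd-test-static.volume.hcl
```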
Expected behavior
The Ceph RBD volume should stage and mount successfully on the Nomad client in both the static and dynamic workflows.
Logs
- csi-rbdplugin logs: see above
Additional context
- The pool and user keyring are valid; rbd commands, including rbd map, work fine on the same node:
```
rbd ls -p <pool_name>
rbd info <pool_name>/<image_name>
sudo rbd map <pool_name>/<image_name> --id <ceph_user> --keyfile /etc/ceph/<user>.keyring
```
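For reference, the CSI user was created roughly along the lines of the rbd-nomad guide; the exact caps below are my assumption based on that guide, with the user name matching the userID in the volume spec:
```
# caps below are an assumption, per the rbd-nomad guide
ceph auth get-or-create client.nomad-csi \
  mon 'profile rbd' \
  osd 'profile rbd pool=rbd' \
  mgr 'profile rbd pool=rbd'
```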
- The issue appears specific to Ceph-CSI when used with Nomad.
- Cluster health:
```
ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)

  cluster:
    id:     eb7e38f6-954d-11f0-957a-b0416f15f052
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 7h)
    mgr: node1.pjfnba(active, since 7h), standbys: node3.bcisde, node2.scpkvw
    osd: 3 osds: 3 up (since 7h), 3 in (since 11d)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   8 pools, 137 pgs
    objects: 381 objects, 992 KiB
    usage:   217 MiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     137 active+clean
```
- Pools:
```
[ceph: root@node1 /]# rados lspools
.mgr
rbd
default.rgw.buckets.data
.rgw.root
default.rgw.buckets.index
lab.rgw.log
lab.rgw.control
lab.rgw.meta
```
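The rbd pool itself was created and initialized following the guide's pool setup, roughly as below (commands reconstructed from memory; the pool name matches the volume parameters):
```
# pool setup per the rbd-nomad guide
ceph osd pool create rbd
rbd pool init rbd
```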
- Controller:
```hcl
job "csi-rbd-controller" {
  datacenters = ["dc1"]
  node_pool   = "hci-pool"
  type        = "service"

  update {
    progress_deadline = "25m"
  }

  group "controller" {
    network {
      port "metrics" {}
    }

    task "csi-rbd-plugin-controller" {
      driver = "podman"

      template {
        data        = <<EOF
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "node1",
    "node2",
    "node3"
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image        = "quay.io/cephcsi/cephcsi:v3.14.2"
        force_pull   = false
        network_mode = "host"
        volumes = [
          "local/config.json:/etc/ceph-csi-config/config.json:ro",
          "/home/nomad/csi/keys:/tmp/csi/keys:Z"
        ]
        args = [
          "--type=rbd",
          "--controllerserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=${NOMAD_PORT_metrics}"
        ]
        privileged = true
      }

      kill_timeout = "20m"

      csi_plugin {
        id        = "rbd.csi.ceph.com"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      service {
        name     = "ceph-csi-controller"
        provider = "nomad"
        port     = "metrics"
        tags     = ["prometheus"]
      }
    }
  }
}
```
- Nodes:
```hcl
job "csi-rbd-nodes" {
  datacenters = ["dc1"]
  node_pool   = "hci-pool"
  type        = "system"

  update {
    progress_deadline = "25m"
  }

  group "nodes" {
    task "csi-rbd-plugin-node" {
      driver = "podman"

      template {
        data        = <<EOF
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "node1",
    "node2",
    "node3"
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image        = "quay.io/cephcsi/cephcsi:v3.14.2"
        force_pull   = false
        network_mode = "host"
        volumes = [
          "local/config.json:/etc/ceph-csi-config/config.json:ro",
          "/home/nomad/csi/keys:/tmp/csi/keys:Z",
          "/lib/modules/${attr.kernel.version}:/lib/modules/${attr.kernel.version}:ro"
        ]
        args = [
          "--type=rbd",
          "--nodeserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-node",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5"
        ]
        privileged = true
      }

      kill_timeout = "20m"

      csi_plugin {
        id        = "rbd.csi.ceph.com"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      service {
        name     = "ceph-csi-node"
        provider = "nomad"
      }
    }
  }
}
```
NOTE: I have also tried IP addresses for the mons; same error.
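For illustration, the IP-based variant of config.json looked roughly like this (the addresses are placeholders, not my actual mon IPs):
```json
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "192.168.0.11",
    "192.168.0.12",
    "192.168.0.13"
  ]
}]
```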
- Volume:
```hcl
id           = "rbd-test"
name         = "rbd-test"
type         = "csi"
plugin_id    = "rbd.csi.ceph.com"
capacity_min = "2G"
capacity_max = "2G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  userID  = "nomad-csi"
  userKey = "my-key"
}

parameters {
  clusterID     = "eb7e38f6-954d-11f0-957a-b0416f15f052"
  pool          = "rbd"
  imageFeatures = "layering"
}

mount_options {
  fs_type = "ext4"
}
```
I hope this is a case of PEBKAC, but I'm struggling to find references for anyone running a modern Nomad version with Ceph-CSI and Podman!