Describe the bug
When attempting to create a Ceph RBD volume via the Ceph-CSI driver on Nomad, the Ceph-CSI plugin fails with:
```
rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument
```
This happens for both static and dynamic volumes:
- For static mounts, the volume is created and registered successfully but cannot be staged or mounted on the client.
- For dynamic mounts, volume creation itself fails with the same error.
I was working through this guide:
https://docs.ceph.com/en/latest/rbd/rbd-nomad/
but using Podman instead of Docker.
- The environment is a set of 3 identical hosts.
- The Nomad server, Nomad client, and Ceph are co-deployed on each host.
- The controller and node plugins deploy successfully (see the status check below).
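A minimal check that both plugin halves register as healthy (standard Nomad CLI; output omitted here):
```
nomad plugin status rbd.csi.ceph.com
```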
```
$ podman version
Client: Podman Engine
Version: 5.4.0
API Version: 5.4.0
Go Version: go1.23.4 (Red Hat 1.23.4-1.el9)
Built: Wed Feb 12 15:54:13 2025
OS/Arch: linux/amd64
```
Environment details
- Image/version of Ceph CSI driver: quay.io/cephcsi/cephcsi:v3.14.2 (also tried v3.15.0)
- Helm chart version: N/A (using the Nomad CSI integration, not Kubernetes)
- Kernel version: RHEL 9.6 - 5.14.0-570.12.1.el9_6.x86_64
- Mounter used: rbd (the kernel rbd module is loaded):
```
[root@node1 ~]# lsmod | grep rbd
rbd                   155648  0
libceph               614400  1 rbd
```
- Orchestrator: Nomad v1.10.5+ent (BuildDate 2025-09-10T12:00:47Z, Revision 2c21645c33d1447a55b84a235ba4dd9ba225501e)
- Ceph cluster version: ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
Steps to reproduce (dynamic)
- Deploy the Ceph cluster using cephadm.
- Deploy the Ceph-CSI RBD plugin as a Nomad CSI plugin with the pool and secrets configured.
- Try to create a volume named rbd-test (see the command sketch after this list).
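Volume creation uses the standard Nomad CLI; a minimal sketch, assuming the spec shown under "Volume" below is saved to a file (the file name is illustrative):
```
nomad volume create rbd-test.volume.hcl
```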
Actual results
Dynamic
- Volume provisioning fails
- Controller logs show:
```
2025-10-01T20:13:21.684252331+01:00 stderr F I1001 19:13:21.684231 1 utils.go:265] ID: 2014 Req-ID: rbd-test GRPC call: /csi.v1.Controller/CreateVolume
2025-10-01T20:13:21.684280988+01:00 stderr F I1001 19:13:21.684277 1 utils.go:266] ID: 2014 Req-ID: rbd-test GRPC request: {"accessibility_requirements":{},"capacity_range":{"limit_bytes":2000000000,"required_bytes":2000000000},"name":"rbd-test","parameters":{"clusterID":"eb7e38f6-954d-11f0-957a-b0416f15f052","imageFeatures":"layering","pool":"rbd"},"secrets":"***stripped***","volume_capabilities":[{"access_mode":{"mode":"SINGLE_NODE_WRITER"},"mount":{"fs_type":"ext4"}}]}
2025-10-01T20:13:21.684416476+01:00 stderr F I1001 19:13:21.684411 1 rbd_util.go:1411] ID: 2014 Req-ID: rbd-test setting disableInUseChecks: false image features: [layering] mounter: rbd
2025-10-01T20:13:21.691141742+01:00 stderr F E1001 19:13:21.691116 1 controllerserver.go:243] ID: 2014 Req-ID: rbd-test failed to connect to volume : failed to get connection: connecting failed: rados: ret=-22, Invalid argument
2025-10-01T20:13:21.691153007+01:00 stderr F E1001 19:13:21.691140 1 utils.go:270] ID: 2014 Req-ID: rbd-test GRPC error: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument
```
Static
- Volume registration succeeds.
- During staging/mount, Nomad fails with:
```
failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: node plugin returned an internal error
```
- Logs show a similar error:
```
I1001 12:36:47.752182 1 utils.go:266] ID: 74 Req-ID: rbd/nomad-rbd-test GRPC request: {"staging_target_path":"/local/csi/staging/default/rbd-test/rw-file-system-single-node-writer","volume_capability":{"access_mode":{"mode":"SINGLE_NODE_WRITER"}}
E1001 12:36:47.753009 1 utils.go:291] ID: 74 Req-ID: rbd/nomad-rbd-test GRPC error: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-22, Invalid argument
```
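For completeness, the static volume was registered with the standard Nomad workflow before staging (a sketch; the file name is illustrative, and the registration step itself succeeds):
```
nomad volume register rbd-test-static.volume.hcl
```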
Expected behavior
The Ceph RBD volume should stage and mount successfully on the Nomad client in both the static and dynamic workflows.
Logs
- csi-rbdplugin logs: see above
Additional context
- The pool and user keyring are valid; rbd commands, including rbd map, work fine on the same node:
```
rbd ls -p <pool_name>
rbd info <pool_name>/<image_name>
sudo rbd map <pool_name>/<image_name> --id <ceph_user> --keyfile /etc/ceph/<user>.keyring
```
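For reference, the CSI user was created roughly along the lines of the rbd-nomad guide; the exact caps below are my assumption based on that guide, with the user name matching the userID in the volume spec:
```
# caps below are an assumption, per the rbd-nomad guide
ceph auth get-or-create client.nomad-csi \
  mon 'profile rbd' \
  osd 'profile rbd pool=rbd' \
  mgr 'profile rbd pool=rbd'
```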
- The issue appears specific to Ceph-CSI when used with Nomad.
- Cluster health:
```
ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)

  cluster:
    id:     eb7e38f6-954d-11f0-957a-b0416f15f052
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 7h)
    mgr: node1.pjfnba(active, since 7h), standbys: node3.bcisde, node2.scpkvw
    osd: 3 osds: 3 up (since 7h), 3 in (since 11d)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   8 pools, 137 pgs
    objects: 381 objects, 992 KiB
    usage:   217 MiB used, 2.7 TiB / 2.7 TiB avail
    pgs:     137 active+clean
```
- Pools:
```
[ceph: root@node1 /]# rados lspools
.mgr
rbd
default.rgw.buckets.data
.rgw.root
default.rgw.buckets.index
lab.rgw.log
lab.rgw.control
lab.rgw.meta
```
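The rbd pool itself was created and initialized following the guide's pool setup, roughly as below (commands reconstructed from memory; the pool name matches the volume parameters):
```
# pool setup per the rbd-nomad guide
ceph osd pool create rbd
rbd pool init rbd
```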
- Controller:
```hcl
job "csi-rbd-controller" {
  datacenters = ["dc1"]
  node_pool   = "hci-pool"
  type        = "service"

  update {
    progress_deadline = "25m"
  }

  group "controller" {
    network {
      port "metrics" {}
    }

    task "csi-rbd-plugin-controller" {
      driver = "podman"

      template {
        data        = <<EOF
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "node1",
    "node2",
    "node3"
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image        = "quay.io/cephcsi/cephcsi:v3.14.2"
        force_pull   = false
        network_mode = "host"
        volumes = [
          "local/config.json:/etc/ceph-csi-config/config.json:ro",
          "/home/nomad/csi/keys:/tmp/csi/keys:Z"
        ]
        args = [
          "--type=rbd",
          "--controllerserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=${NOMAD_PORT_metrics}"
        ]
        privileged = true
      }

      kill_timeout = "20m"

      csi_plugin {
        id        = "rbd.csi.ceph.com"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      service {
        name     = "ceph-csi-controller"
        provider = "nomad"
        port     = "metrics"
        tags     = ["prometheus"]
      }
    }
  }
}
```
- Nodes:
```hcl
job "csi-rbd-nodes" {
  datacenters = ["dc1"]
  node_pool   = "hci-pool"
  type        = "system"

  update {
    progress_deadline = "25m"
  }

  group "nodes" {
    task "csi-rbd-plugin-node" {
      driver = "podman"

      template {
        data        = <<EOF
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "node1",
    "node2",
    "node3"
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image        = "quay.io/cephcsi/cephcsi:v3.14.2"
        force_pull   = false
        network_mode = "host"
        volumes = [
          "local/config.json:/etc/ceph-csi-config/config.json:ro",
          "/home/nomad/csi/keys:/tmp/csi/keys:Z",
          "/lib/modules/${attr.kernel.version}:/lib/modules/${attr.kernel.version}:ro"
        ]
        args = [
          "--type=rbd",
          "--nodeserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-node",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5"
        ]
        privileged = true
      }

      kill_timeout = "20m"

      csi_plugin {
        id        = "rbd.csi.ceph.com"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 512
      }

      service {
        name     = "ceph-csi-node"
        provider = "nomad"
      }
    }
  }
}
```
NOTE: I have also tried IP addresses for the mons; same error.
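For illustration, the IP-based variant of config.json looked roughly like this (the addresses are placeholders, not my actual mon IPs):
```json
[{
  "clusterID": "eb7e38f6-954d-11f0-957a-b0416f15f052",
  "monitors": [
    "192.168.0.11",
    "192.168.0.12",
    "192.168.0.13"
  ]
}]
```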
- Volume:
```hcl
id           = "rbd-test"
name         = "rbd-test"
type         = "csi"
plugin_id    = "rbd.csi.ceph.com"
capacity_min = "2G"
capacity_max = "2G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  userID  = "nomad-csi"
  userKey = "my-key"
}

parameters {
  clusterID     = "eb7e38f6-954d-11f0-957a-b0416f15f052"
  pool          = "rbd"
  imageFeatures = "layering"
}

mount_options {
  fs_type = "ext4"
}
```
I hope this is a case of PEBKAC, but I'm struggling to find references for anyone running a modern Nomad version with Ceph-CSI and Podman!