# add runbooks for new alerts #363
# ODFCorePodRestarted

## Meaning

A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
restarted at least once in the last 24 hours while the Ceph cluster is active.

> **Review comment:** Why are the ODF operator and metrics exporter considered core pods?
## Impact

* If OSDs restart frequently or do not start within 5 minutes, the cluster
  may rebalance data onto other, more reliable disks. While this happens,
  the cluster is temporarily slightly less performant.
* Operator restarts delay configuration changes and health checks.
* Repeated restarts may indicate underlying instability (resource pressure,
  bugs, or node issues).
## Diagnosis

1. Identify the affected pod and namespace from the alert labels.
2. Follow the [pod debug](helpers/podDebug.md) helper.
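Before following the helper, it can be useful to confirm which pod is restarting and why. A sketch using `oc` (the namespace is the ODF default and the pod name is a placeholder):

```bash
# List pods in the ODF namespace sorted by restart count
oc get pods -n openshift-storage \
  --sort-by='.status.containerStatuses[0].restartCount'

# Show why the container last terminated (e.g. OOMKilled, Error)
oc get pod <pod-name> -n openshift-storage \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```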
## Mitigation

1. If the pod was OOMKilled: increase the memory limits for the container.
2. If the pod is in CrashLoopBackOff: check for configuration errors or
   version incompatibilities.
3. If the problem is node-related: cordon and drain the node; replace it if
   faulty.
4. Ensure high availability: run at least three MONs and distribute OSDs
   across nodes.
5. If the restarts are caused by a known bug, upgrade ODF to a fixed version.
> **Contributor:** No mitigation section? Please add mitigation steps.
>
> **Author:** I am not sure what mitigation steps should be added here, so I left it empty for now.
>
> **Member:** Mitigation for this is to either move workloads to other storage systems or (preferred) add more disks.
# ODFDiskUtilizationHigh

## Meaning

A Ceph OSD disk is more than 90% busy (the %util metric in iostat terms,
derived from node_disk_io_time_seconds_total), indicating heavy I/O load.
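The alert condition can be approximated with a PromQL query over the same metric; the device pattern below is an illustrative assumption and should be adjusted to the actual OSD devices:

```promql
# Fraction of time each disk was busy over 5m, as a percentage (iostat %util)
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*"}[5m]) * 100 > 90
```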
## Impact

* Increased I/O latency for block, object, and file clients.
* Reduced cluster throughput during peak workloads.
* Potential "slow request" warnings in the Ceph logs.
## Diagnosis

1. Identify the node and device from the alert labels.
2. Check the disk model and type:
   ```bash
   oc debug node/<node>
   lsblk -d -o NAME,ROTA,MODEL
   # Confirm it is an expected OSD device (HDD/SSD/NVMe)
   ```
3. Monitor real-time I/O (`2 5` = sample every 2 seconds, 5 times):
   ```bash
   iostat -x 2 5
   ```
4. Correlate with Ceph, running the commands from the Ceph toolbox pod:

   > **Review comment:** Are they supposed to run these commands on the toolbox pod?

   ```bash
   ceph osd df tree   # check weight and reweight
   ceph osd perf      # check commit/apply latency
   ```
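If the Ceph toolbox pod is not already running, it can typically be enabled through the StorageCluster resource; the resource name, namespace, and deployment name below are the ODF defaults and may differ in your cluster:

```bash
# Enable the Rook/Ceph toolbox deployment on the default StorageCluster
oc patch storagecluster ocs-storagecluster -n openshift-storage \
  --type merge --patch '{"spec": {"enableCephTools": true}}'

# Open a shell in the toolbox pod once it is running
oc rsh -n openshift-storage deploy/rook-ceph-tools
```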
## Mitigation

* Add more disks to the cluster to improve performance.
* Move workloads to another storage system.
# ODFNodeLatencyHighOnNONOSDNodes

## Meaning

ICMP round-trip time (RTT) latency to non-OSD ODF nodes (for example MON,
MGR, MDS, or client nodes) exceeds 100 milliseconds over the last 24 hours.
These nodes participate in the Ceph control plane or in client access but do
not store data.

> **Review comment:** What are client nodes here? CSI client?
## Impact

* Delayed Ceph monitor elections or quorum instability.
* Slower metadata operations in CephFS.
* Increased latency for CSI controller operations.

  > **Review comment:** What about the CSI node operations and csi-addons operations?

* Potential timeouts in ODF operator reconciliation.
* This configuration is not supported if the latency is permanent.

> **Member:** Another impact is that we do not support this if it is a permanent configuration ;)
## Diagnosis

1. From the alert, note the instance (node IP).
2. Test connectivity from another cluster node or a debug pod:

   > **Review comment:** From where are they supposed to run these commands?

   ```bash
   ping <node-ip>
   mtr <node-ip>
   ```
3. Check system load and network interface statistics on the node
   (`sar -n DEV 1 5` samples the network devices every 1 second, 5 times):
   ```bash
   oc debug node/<node-name>
   sar -n DEV 1 5
   ip -s link show <iface>
   ```
4. Review the Ceph monitor logs if the node hosts MONs:
   ```bash
   oc logs -l app=rook-ceph-mon -n openshift-storage
   ```
5. Check switch and network monitoring to see whether any ports are too busy.
## Mitigation

1. Ensure control-plane nodes are not oversubscribed or co-located with
   noisy workloads.
2. Validate the network path between MON/MGR nodes; prefer low-latency,
   dedicated links.
3. If the node is a client (for example, running applications), verify it is
   not on an overloaded subnet.
4. Tune kernel network parameters if packet loss or buffer drops are observed.
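As an illustration of the last step, socket buffer limits are among the parameters commonly inspected when buffer drops are observed; the values below are examples, not ODF-validated recommendations:

```bash
# Inspect current socket buffer maximums
sysctl net.core.rmem_max net.core.wmem_max

# Example: raise the maximums to 16 MiB (run on the node, e.g. via
# oc debug node/<node> and chroot /host; persist via a MachineConfig)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
```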
# ODFNodeLatencyHighOnOSDNodes

## Meaning

ICMP round-trip time (RTT) latency between ODF monitoring probes and OSD
nodes exceeds 10 milliseconds over the last 24 hours. This alert triggers
only on nodes that host Ceph OSD pods, indicating potential network
congestion or other issues on the storage network.
## Impact

* Increased latency in Ceph replication and recovery operations.
* Higher client I/O latency for RBD and CephFS workloads.
* Risk of OSDs being marked down if heartbeat timeouts occur.
* Degraded cluster performance and possible client timeouts.
## Diagnosis

1. Check the alert's instance label to get the node IP.
2. From a monitoring or debug pod, test connectivity:

   > **Review comment:** Can we provide an example command to get the monitoring or debug pod?

   ```bash
   ping <node-internal-ip>
   ```
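One way to run the connectivity test is from a node debug pod; a sketch (node name and target IP are placeholders):

```bash
# Start a debug pod on another storage node and ping the affected node
oc debug node/<other-node-name> -- chroot /host ping -c 5 <node-internal-ip>
```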
3. Use mtr or traceroute to analyze the path and hops.
4. Verify whether the node is under high CPU or network load:
   ```bash
   oc debug node/<node>
   top -b -n 1 | head -20
   sar -u 1 5
   ```
5. Check Ceph health and OSD status from the Ceph toolbox pod:

   > **Review comment:** Is the toolbox already enabled, or do they need to enable it?

   ```bash
   ceph osd status
   ceph -s
   ```
## Mitigation

1. Isolate traffic: confirm that storage traffic uses a dedicated VLAN or
   NIC, separate from management/tenant traffic.
2. Hardware check: inspect switch logs, NIC error counters
   (`ethtool -S <iface>`), and NIC firmware.
3. Topology: ensure OSD nodes are in the same rack/zone or connected via a
   low-latency fabric.
4. If the latency is transient, keep monitoring; if it is persistent, engage
   the network or infrastructure team.
> **Contributor:** Same here, add mitigation steps.
>
> **Contributor:** The MTU runbook should mention how to verify jumbo frames work end-to-end.
>
> **Author:** I am not sure about this; maybe we can work on it once you are back.
>
> **Member:** You can find many "Jumbo Frame" test instructions on the internet. In the end you use ping with a certain ICMP payload size (which differs between operating systems) and you tell the network stack not to fragment the packet but to send it whole. As a mitigation, customers need to ensure the node network interfaces are configured for 9000 bytes AND that all switches between the nodes also support 9000 bytes on their ports.
# ODFNodeMTULessThan9000

## Meaning

At least one physical or otherwise relevant network interface on an ODF node
has an MTU (Maximum Transmission Unit) of less than 9000 bytes, violating ODF
best practices for storage networks.
## Impact

* Suboptimal Ceph network performance due to increased packet overhead.
* Higher CPU utilization on OSD nodes from processing more packets.
* Potential packet fragmentation if mixed MTU sizes exist in the path.
* Reduced throughput during rebalancing or recovery.
## Diagnosis

1. List all nodes in the storage cluster:
   ```bash
   oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
   ```
2. For each node, check the interface MTUs:
   ```bash
   oc debug node/<node-name>
   ip link show
   # Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
   ```
3. Alternatively, use Prometheus:
   ```promql
   node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
   ```
4. Verify MTU consistency across all nodes and all switches in the storage fabric.
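To verify that jumbo frames work end-to-end between two nodes, send a non-fragmentable ICMP packet sized to fill a 9000-byte MTU. On Linux, the payload is the MTU minus the 20-byte IP header and the 8-byte ICMP header; the target IP is a placeholder:

```bash
# Payload size for a 9000-byte MTU: 9000 - 20 (IP header) - 8 (ICMP header)
PAYLOAD=$((9000 - 20 - 8))
echo "payload size: ${PAYLOAD}"

# Run from one storage node against another; -M do forbids fragmentation.
# If any NIC or switch in the path does not pass 9000-byte frames, this fails:
# ping -M do -c 4 -s "${PAYLOAD}" <target-node-ip>
```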
## Mitigation

1. Ensure the node network interfaces are configured for 9000 bytes.
2. Ensure all switches between the nodes support 9000 bytes on their ports.
# ODFNodeNICBandwidthSaturation

## Meaning

A network interface on an ODF node is operating at more than 90% of its
reported link speed, indicating potential bandwidth saturation.
## Impact

* Network congestion leading to packet drops or latency spikes.
* Slowed Ceph replication, backfill, and recovery.
* Client I/O timeouts or stalls.
* Possible Ceph OSD evictions due to heartbeat failures.
## Diagnosis

1. From the alert, note the instance and device labels.
2. Check current utilization (`sar -n DEV 1 5` samples every 1 second, 5 times):
   ```bash
   oc debug node/<node>
   sar -n DEV 1 5
   ```
3. Use Prometheus to graph throughput in bits per second:
   ```promql
   rate(node_network_receive_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
   rate(node_network_transmit_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
   ```
4. Determine whether the traffic is Ceph-related (for example, during a
   rebalance) or external.
|
||
| ## Mitigation | ||
|
|
||
| 1. Short term: Throttle non-essential traffic on the node. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how? |
||
| * Taint the OSD node to prevent scheduling of non-storage workloads. | ||
| * Drain existing non-essential pods from the node. | ||
| 2. Long term: | ||
| * Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE). | ||
| * Use multiple bonded interfaces with LACP. | ||
| * Separate storage and client traffic using VLANs or dedicated NICs. | ||
| 3. Tune Ceph osd_max_backfills, osd_recovery_max_active to reduce | ||
| recovery bandwidth. | ||
| 4. Enable NIC offload features (TSO, GRO) if disabled. | ||
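The short-term taint and eviction steps can be sketched with standard `oc` commands; the taint key/value and the pod/namespace names are illustrative placeholders, not ODF-defined values:

```bash
# Reserve the node for storage workloads (example taint key/value)
oc adm taint nodes <node-name> workload=storage-only:NoSchedule

# Evict an individual non-essential pod so its controller reschedules it
# elsewhere; do not drain the whole node, as that would evict the OSD pods
oc delete pod <pod-name> -n <app-namespace>
```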
> **Reviewer:** Existing runbooks reference shared helper documents, but the new runbooks embed all commands inline instead of referencing these. Consider using helper links for consistency and maintainability.