diff --git a/alerts/openshift-container-storage-operator/ODFCorePodRestarted.md b/alerts/openshift-container-storage-operator/ODFCorePodRestarted.md
new file mode 100644
index 00000000..bfac066c
--- /dev/null
+++ b/alerts/openshift-container-storage-operator/ODFCorePodRestarted.md
@@ -0,0 +1,28 @@
+# ODFCorePodRestarted
+
+## Meaning
+
+A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
+restarted at least once in the last 24 hours while the Ceph cluster is active.
+
+## Impact
+
+* If OSDs restart frequently or do not start up within 5 minutes, the
+  cluster might decide to rebalance the data onto other, more reliable
+  disks. If this happens, the cluster will temporarily be slightly less
+  performant.
+* Operator restarts delay configuration changes and health checks.
+* May indicate underlying instability (resource pressure, bugs, or node issues).
+
+## Diagnosis
+
+1. Identify the affected pod and namespace from the alert labels.
+2. [pod debug](helpers/podDebug.md)
+
+## Mitigation
+
+1. If OOMKilled: increase the memory limits for the container.
+2. If CrashLoopBackOff: check for configuration errors or version incompatibilities.
+3. If node-related: cordon and drain the node; replace it if faulty.
+4. Ensure HA: run at least 3 MONs and distribute OSDs across nodes.
+5. Update: if the restarts are due to a known bug, upgrade ODF to a fixed version.
diff --git a/alerts/openshift-container-storage-operator/ODFDiskUtilizationHigh.md b/alerts/openshift-container-storage-operator/ODFDiskUtilizationHigh.md
new file mode 100644
index 00000000..7bc9fe13
--- /dev/null
+++ b/alerts/openshift-container-storage-operator/ODFDiskUtilizationHigh.md
@@ -0,0 +1,37 @@
+# ODFDiskUtilizationHigh
+
+## Meaning
+
+A Ceph OSD disk is more than 90% busy (as measured by %util, using iostat
+semantics derived from node_disk_io_time_seconds_total), indicating heavy
+I/O load.
+
+## Impact
+
+* Increased I/O latency for block/object/file clients.
+* Reduced cluster throughput during peak workloads.
+* Potential for “slow request” warnings in Ceph logs.
+
+## Diagnosis
+
+1. Identify the node and device from the alert labels.
+2. Check the disk model and type:
+```bash
+oc debug node/<node_name>
+lsblk -d -o NAME,ROTA,MODEL
+# Confirm it is an expected OSD device (HDD/SSD/NVMe)
+```
+3. Monitor real-time I/O:
+```bash
+iostat -x 2 5
+```
+4. Correlate with Ceph:
+```bash
+ceph osd df tree  # check weight and reweight
+ceph osd perf     # check commit/apply latency
+```
+
+## Mitigation
+
+* Add more disks to the cluster to improve performance.
+* Move some workloads to another storage system.
+
diff --git a/alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnNONOSDNodes.md b/alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnNONOSDNodes.md
new file mode 100644
index 00000000..847a7855
--- /dev/null
+++ b/alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnNONOSDNodes.md
@@ -0,0 +1,44 @@
+# ODFNodeLatencyHighOnNONOSDNodes
+
+## Meaning
+
+ICMP RTT latency to non-OSD ODF nodes (e.g., MON, MGR, MDS, or client nodes)
+exceeds 100 milliseconds over the last 24 hours. These nodes participate in
+the Ceph control plane or client access but do not store data.
+
+## Impact
+
+* Delayed Ceph monitor elections or quorum instability.
+* Slower metadata operations in CephFS.
+* Increased latency for CSI controller operations.
+* Potential timeouts in ODF operator reconciliation.
+* Such latency is not supported as a permanent configuration.
+
+## Diagnosis
+
+1. From the alert, note the instance (node IP).
+2. Test connectivity:
+```bash
+ping <node_ip>
+mtr <node_ip>
+```
+3. Check system load and network interface stats on the node:
+```bash
+oc debug node/<node_name>
+sar -n DEV 1 5
+ip -s link show
+```
+4. Review Ceph monitor logs if the node hosts MONs:
+```bash
+oc logs -l app=rook-ceph-mon -n openshift-storage
+```
+5. Check switch-level network monitoring to see whether any ports are saturated.
+
+## Mitigation
+
+1. Ensure control-plane nodes are not oversubscribed or co-located with noisy workloads.
+2. Validate the network path between MON/MGR nodes; prefer low-latency, dedicated links.
+3. If the node is a client (e.g., running applications), verify it is not on an
+   overloaded subnet.
+4. Tune kernel network parameters if packet loss or buffer drops are observed.
diff --git a/alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnOSDNodes.md b/alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnOSDNodes.md
new file mode 100644
index 00000000..c2dbd20b
--- /dev/null
+++ b/alerts/openshift-container-storage-operator/ODFNodeLatencyHighOnOSDNodes.md
@@ -0,0 +1,48 @@
+# ODFNodeLatencyHighOnOSDNodes
+
+## Meaning
+
+ICMP round-trip time (RTT) latency between ODF monitoring probes and
+OSD nodes exceeds 10 milliseconds over the last 24 hours. This alert
+triggers only on nodes that host Ceph OSD pods, indicating potential
+network congestion or issues on the storage network.
+
+## Impact
+
+* Increased latency in Ceph replication and recovery operations.
+* Higher client I/O latency for RBD and CephFS workloads.
+* Risk of OSDs being marked down if heartbeat timeouts occur.
+* Degraded cluster performance and possible client timeouts.
+
+## Diagnosis
+
+1. Check the alert’s instance label to get the node IP.
+2. From a monitoring or debug pod, test connectivity:
+```bash
+ping <node_ip>
+```
+3. Use mtr or traceroute to analyze the path and hops.
+4. Verify whether the node is under high CPU or network load:
+```bash
+oc debug node/<node_name>
+top -b -n 1 | head -20
+sar -u 1 5
+```
+5. Check Ceph health and OSD status:
+```bash
+ceph osd status
+ceph -s
+```
+
+## Mitigation
+
+1. Isolate traffic: confirm storage traffic uses a dedicated VLAN or NIC, separate
+   from management/tenant traffic.
+2. Hardware check: inspect switch logs, NIC errors (ethtool -S <interface>),
+   and NIC firmware.
+3. Topology: ensure OSD nodes are in the same rack/zone or connected via a
+   low-latency fabric.
+4. If the latency is transient, monitor; if persistent, engage the network or
+   infrastructure team.
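
The 10 ms threshold check from the Diagnosis steps can be sketched as below. This is an illustrative snippet, not part of the runbook tooling: the `ping_summary` line is a made-up stand-in for the summary a real `ping -c 10 <node_ip>` run would print.

```bash
# Extract the average RTT from ping's "rtt min/avg/max/mdev" summary line.
# Sample value only; substitute real ping output when diagnosing.
ping_summary='rtt min/avg/max/mdev = 8.412/12.735/19.208/3.114 ms'

# Field 5 when splitting on "/" is the average RTT in milliseconds.
avg_rtt=$(echo "$ping_summary" | awk -F'/' '{print $5}')
echo "average RTT: ${avg_rtt} ms"

# Compare against the 10 ms threshold this alert uses for OSD nodes.
if awk -v rtt="$avg_rtt" 'BEGIN { exit !(rtt > 10) }'; then
  echo "above the 10 ms threshold for OSD nodes"
else
  echo "within the 10 ms threshold"
fi
```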
+
diff --git a/alerts/openshift-container-storage-operator/ODFNodeMTULessThan9000.md b/alerts/openshift-container-storage-operator/ODFNodeMTULessThan9000.md
new file mode 100644
index 00000000..255f095e
--- /dev/null
+++ b/alerts/openshift-container-storage-operator/ODFNodeMTULessThan9000.md
@@ -0,0 +1,38 @@
+# ODFNodeMTULessThan9000
+
+## Meaning
+
+At least one physical or otherwise relevant network interface on an ODF node
+has an MTU (Maximum Transmission Unit) of less than 9000 bytes, violating ODF
+best practices for storage networks.
+
+## Impact
+
+* Suboptimal Ceph network performance due to increased packet overhead.
+* Higher CPU utilization on OSD nodes from processing more packets.
+* Potential packet fragmentation if mixed MTU sizes exist in the path.
+* Reduced throughput during rebalancing or recovery.
+
+## Diagnosis
+
+1. List all nodes in the storage cluster:
+```bash
+oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
+```
+2. For each node, check the interface MTUs:
+```bash
+oc debug node/<node_name>
+ip link show
+# Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
+```
+3. Alternatively, use Prometheus:
+```promql
+node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
+```
+4. Verify MTU consistency across all nodes and all switches in the storage fabric.
+
+## Mitigation
+
+1. Ensure the node network interfaces are configured with an MTU of 9000 bytes.
+2. Ensure the switches between the nodes support an MTU of 9000 bytes on their ports.
\ No newline at end of file
diff --git a/alerts/openshift-container-storage-operator/ODFNodeNICBandwidthSaturation.md b/alerts/openshift-container-storage-operator/ODFNodeNICBandwidthSaturation.md
new file mode 100644
index 00000000..a69bb566
--- /dev/null
+++ b/alerts/openshift-container-storage-operator/ODFNodeNICBandwidthSaturation.md
@@ -0,0 +1,41 @@
+# ODFNodeNICBandwidthSaturation
+
+## Meaning
+
+A network interface on an ODF node is operating at more than 90% of its
+reported link speed, indicating potential bandwidth saturation.
+
+## Impact
+
+* Network congestion leading to packet drops or latency spikes.
+* Slowed Ceph replication, backfill, and recovery.
+* Client I/O timeouts or stalls.
+* Possible Ceph OSD evictions due to heartbeat failures.
+
+## Diagnosis
+
+1. From the alert, note the instance and device labels.
+2. Check current utilization:
+```bash
+oc debug node/<node_name>
+sar -n DEV 1 5
+```
+3. Use Prometheus to graph:
+```promql
+rate(node_network_receive_bytes_total{instance="<node>", device="<device>"}[5m]) * 8
+rate(node_network_transmit_bytes_total{...}) * 8
+```
+4. Determine whether the traffic is Ceph-related (e.g., during a rebalance) or external.
+
+## Mitigation
+
+1. Short term: throttle non-essential traffic on the node.
+   * Taint the OSD node to prevent scheduling of non-storage workloads.
+   * Drain existing non-essential pods from the node.
+2. Long term:
+   * Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE).
+   * Use multiple bonded interfaces with LACP.
+   * Separate storage and client traffic using VLANs or dedicated NICs.
+3. Tune Ceph osd_max_backfills and osd_recovery_max_active to reduce
+   recovery bandwidth.
+4. Enable NIC offload features (TSO, GRO) if they are disabled.
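
The saturation condition above (>90% of reported link speed) can be sketched numerically. This is an illustrative calculation only, mirroring what a rate()-over-counter expression computes; the byte-counter values and the 10 Gb/s link speed are made-up example numbers, not real measurements.

```bash
# Two samples of an interface receive-byte counter, taken 5 seconds apart,
# stand in for node_network_receive_bytes_total scraped by Prometheus.
bytes_t0=1000000000          # counter value at t0 (example)
bytes_t1=7000000000          # same counter 5 seconds later (example)
interval=5                   # seconds between samples
link_speed_bps=10000000000   # assumed 10 Gb/s link

# Utilization % = (delta bytes * 8 bits) / interval / link speed * 100.
util=$(awk -v b0="$bytes_t0" -v b1="$bytes_t1" -v t="$interval" -v s="$link_speed_bps" \
  'BEGIN { printf "%d", ((b1 - b0) * 8 / t) / s * 100 }')
echo "link utilization: ${util}%"

if [ "$util" -gt 90 ]; then
  echo "above the 90% saturation threshold"
fi
```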