Merged
28 changes: 28 additions & 0 deletions alerts/openshift-container-storage-operator/ODFCorePodRestarted.md
> **Contributor:** Existing runbooks reference shared helper documents like:
> * helpers/podDebug.md
> * helpers/troubleshootCeph.md
> * helpers/gatherLogs.md
> * helpers/networkConnectivity.md
>
> The new runbooks embed all commands inline instead of referencing these. Consider using helper links for consistency and maintainability.

@@ -0,0 +1,28 @@
# ODFCorePodRestarted

## Meaning

A core ODF pod (OSD, MON, MGR, ODF operator, or metrics exporter) has
restarted at least once in the last 24 hours while the Ceph cluster is active.

> **Reviewer:** Why are the ODF operator and metrics exporter considered core pods?

## Impact

* If OSDs are restarted frequently or do not start up within 5 minutes,
the cluster might decide to rebalance the data onto other more reliable
disks. If this happens, the cluster will temporarily be slightly less
performant.
* Operator restart delays configuration changes or health checks.
* May indicate underlying instability (resource pressure, bugs, or node issues).

## Diagnosis

1. Identify the pod and namespace from the alert labels.
2. Follow the [pod debug](helpers/podDebug.md) helper.

## Mitigation

1. If OOMKilled: Increase memory limits for the container.
2. If CrashLoopBackOff: Check for configuration errors or version incompatibilities.
3. If node-related: Cordon and drain the node; replace if faulty.
4. Ensure HA: MONs should be ≥3; OSDs should be distributed.
5. Update: If due to a known bug, upgrade ODF to a fixed version.
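The branching above can be sketched as a small triage helper; the reason-to-action mapping simply mirrors the list, and the `oc` command in the comment is one way to obtain the reason:

```shell
# Obtain the last termination reason with, for example:
#   oc -n openshift-storage get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
suggest_action() {
  case "$1" in
    OOMKilled)               echo "increase memory limits" ;;
    Error|CrashLoopBackOff)  echo "check configuration and version compatibility" ;;
    *)                       echo "inspect node health and events" ;;
  esac
}
suggest_action OOMKilled
```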
> **Contributor:** No mitigation section? Please add mitigation steps.

> **Contributor (Author):** I am not sure what mitigation steps should be added here, so I left it empty for now! @weirdwiz, if you have any suggestions, we can discuss offline.

> **Member:** Mitigation for this is to either move workloads to other storage systems or (preferred) add more disks.
> Ceph is one of the few storage systems that grows I/O performance linearly with capacity... so more disks = more performance.

@@ -0,0 +1,37 @@
# ODFDiskUtilizationHigh

## Meaning

A Ceph OSD disk is more than 90% busy (as measured by `%util` in iostat
semantics, derived from `node_disk_io_time_seconds_total`), indicating heavy I/O load.
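The metric arithmetic can be sketched with hypothetical counter samples: `node_disk_io_time_seconds_total` counts seconds the disk spent busy, so its per-second rate is the busy fraction.

```shell
# Hypothetical counter readings taken 10 seconds apart
T1=1000.0
T2=1009.2
INTERVAL=10
# rate() of busy-seconds == fraction of wall time the disk was busy
UTIL=$(awk -v a="$T1" -v b="$T2" -v i="$INTERVAL" 'BEGIN { printf "%.0f", (b - a) / i * 100 }')
echo "${UTIL}%"
```

A value above 90 would fire this alert.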

## Impact

* Increased I/O latency for block/object/file clients.
* Reduced cluster throughput during peak workloads.
* Potential for “slow request” warnings in Ceph logs.

## Diagnosis

1. Identify node and device from alert labels.
2. Check disk model and type:
```bash
oc debug node/<node>
lsblk -d -o NAME,ROTA,MODEL
# Confirm it’s an expected OSD device (HDD/SSD/NVMe)
```
3. Monitor real-time I/O:
```bash
iostat -x 2 5
```
4. Correlate with Ceph (run from the rook-ceph-tools toolbox pod):
```bash
# e.g. oc rsh -n openshift-storage deploy/rook-ceph-tools
ceph osd df tree   # check utilization, weight, and reweight
ceph osd perf      # check commit/apply latency
```

> **Reviewer:** Are they supposed to run these commands on the toolbox pod?

## Mitigation

* Add more disks to the cluster to enhance performance; Ceph I/O performance scales roughly linearly with capacity.
* Move the workloads to another storage system.

@@ -0,0 +1,44 @@
# ODFNodeLatencyHighOnNONOSDNodes

## Meaning

ICMP RTT latency to non-OSD ODF nodes (e.g., MON, MGR, MDS, or client nodes)
exceeds 100 milliseconds over the last 24 hours. These nodes participate in the
Ceph control plane or client access but do not store data.

> **Reviewer:** What are client nodes here? CSI client?

## Impact

* Delayed Ceph monitor elections or quorum instability.
* Slower metadata operations in CephFS.
* Increased latency for CSI controller operations.
* Potential timeouts in ODF operator reconciliation.
* Not supported if the high latency is a permanent configuration (rare, transient occurrences do not affect support status).

> **Reviewer:** What about the CSI node operations and csi-addons operations?

> **Member:** Another impact is that we do not support this if it is a permanent configuration ;) If this is a rare occurrence, support status is fine of course.


## Diagnosis

1. From the alert, note the instance (node IP).
2. Test connectivity to the node, for example from a debug pod on another node
   (`oc debug node/<other-node>`) or from a host with access to the node network:
```bash
ping <node-ip>
mtr <node-ip>
```

> **Reviewer:** From where are they supposed to run these commands?
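To compare the measured average RTT against the 100 ms threshold, the `ping` summary line can be parsed; a sketch using a hypothetical sample line:

```shell
# Hypothetical ping summary; field 5 when split on "/" is the average RTT in ms
SUMMARY='rtt min/avg/max/mdev = 0.045/0.112/0.310/0.070 ms'
AVG_MS=$(echo "$SUMMARY" | awk -F'/' '{print $5}')
echo "$AVG_MS"
# Flag if the average exceeds the 100 ms alert threshold
awk -v a="$AVG_MS" 'BEGIN { if (a + 0 > 100) print "above threshold"; else print "ok" }'
```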
3. Check system load and network interface stats on the node:
```bash
oc debug node/<node-name>
sar -n DEV 1 5          # sample network device stats every 1 second, 5 times
ip -s link show <iface>
```

> **Reviewer:** Please add details on what 1 and 5 are here and how to choose them.
4. Review Ceph monitor logs if the node hosts MONs:
```bash
oc logs -l app=rook-ceph-mon -n openshift-storage
```
> **Member:** Another step could be to check switch / networking monitoring to see if any ports are too busy.

5. Check switch/network monitoring to see if any ports are too busy.

## Mitigation

1. Ensure control-plane nodes are not oversubscribed or co-located with noisy workloads.
2. Validate the network path between MON/MGR nodes; prefer low-latency, dedicated links.
3. If the node is a client (e.g., running applications), verify it is not on an
overloaded subnet.
4. Tune kernel network parameters if packet loss or buffer drops are observed.
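The kernel tuning in step 4 might look like the following sysctl fragment; the file name and values are illustrative only, so validate buffer sizes for your NIC speed and workload before applying:

```
# /etc/sysctl.d/99-storage-net.conf (hypothetical file name, illustrative values)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```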
@@ -0,0 +1,48 @@
# ODFNodeLatencyHighOnOSDNodes

## Meaning

ICMP round-trip time (RTT) latency between ODF monitoring probes and
OSD nodes exceeds 10 milliseconds over the last 24 hours. This alert
triggers only on nodes that host Ceph OSD pods, indicating potential
network congestion or issues on the storage network.

## Impact

* Increased latency in Ceph replication and recovery operations.
* Higher client I/O latency for RBD and CephFS workloads.
* Risk of OSDs being marked down if heartbeat timeouts occur.
* Degraded cluster performance and possible client timeouts.


## Diagnosis

1. Check the alert’s instance label to get the node IP.
2. From a monitoring or debug pod, test connectivity. For example, start a debug
   pod on a different node with `oc debug node/<other-node>`, then:
```bash
ping <node-internal-ip>
```

> **Reviewer:** Can we provide an example command to get the monitoring or debug pod?
3. Use mtr or traceroute to analyze path and hops.
4. Verify if the node is under high CPU or network load:
```bash
oc debug node/<node>
top -b -n 1 | head -20
sar -u 1 5          # CPU stats, 1-second interval, 5 samples
```
5. Check Ceph health and OSD status from the rook-ceph-tools toolbox pod
   (if it is not running, enable it via the StorageCluster `enableCephTools` setting):
```bash
oc rsh -n openshift-storage deploy/rook-ceph-tools
ceph osd status
ceph -s
```

> **Reviewer:** Is the toolbox already enabled, or do they need to enable it?

## Mitigation

1. Isolate traffic: Confirm storage traffic uses a dedicated VLAN or NIC, separate
from management/tenant traffic.
2. Hardware check: Inspect switch logs, NIC errors (ethtool -S <iface>),
and NIC firmware.
3. Topology: Ensure OSD nodes are in the same rack/zone or connected via
low-latency fabric.
4. If latency is transient, monitor; if persistent, engage network or
infrastructure team.

> **Contributor:** Same here, add mitigation steps.

> **Contributor:** The MTU runbook should mention how to verify jumbo frames work end-to-end.

> **Contributor (Author):** I am not sure about this; maybe we can work on it once you are back.

> **Member:** You can find many "Jumbo Frame" test instructions on the internet, for example this one:
> https://blah.cloud/networks/test-jumbo-frames-working/
> In the end you use ping with a certain ICMP size (which differs between OSs) and you tell the network stack not to fragment the packet (but send it whole).
> As a mitigation, customers need to ensure the node network interfaces are configured for 9000 bytes AND that all switches in between the nodes also support 9000 bytes on their ports.

@@ -0,0 +1,38 @@
# ODFNodeMTULessThan9000

## Meaning

At least one physical or relevant network interface on an ODF node has an
MTU (Maximum Transmission Unit) less than 9000 bytes, violating ODF best
practices for storage networks.

## Impact

* Suboptimal Ceph network performance due to increased packet overhead.
* Higher CPU utilization on OSD nodes from processing more packets.
* Potential for packet fragmentation if mixed MTU sizes exist in the path.
* Reduced throughput during rebalancing or recovery.
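The packet-overhead point can be made concrete with hypothetical numbers: moving a 4 MiB chunk over a 1500-byte MTU takes roughly six times as many packets as over a 9000-byte MTU (header sizes assume plain IPv4/TCP with no options):

```shell
OBJECT_BYTES=4194304          # hypothetical 4 MiB payload
for MTU in 1500 9000; do
  PAYLOAD=$((MTU - 40))       # minus 20-byte IP + 20-byte TCP headers
  PACKETS=$(( (OBJECT_BYTES + PAYLOAD - 1) / PAYLOAD ))   # ceiling division
  echo "MTU $MTU -> $PACKETS packets"
done
```

Each packet costs an interrupt and header processing, which is where the extra CPU load on OSD nodes comes from.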


## Diagnosis

1. List all nodes in the storage cluster:
```bash
oc get nodes -l cluster.ocs.openshift.io/openshift-storage=''
```
2. For each node, check interface MTUs:
```bash
oc debug node/<node-name>
ip link show
# Look for interfaces like eth0, ens*, eno*, etc. (exclude veth, docker, cali)
```
3. Alternatively, use Prometheus:
```promql
node_network_mtu_bytes{device!~"^(veth|docker|flannel|cali|tun|tap).*"} < 9000
```
4. Verify MTU consistency across all nodes and all switches in the storage fabric.

## Mitigation

1. Ensure the node network interfaces are configured with a 9000-byte MTU.
2. Ensure all switches between the nodes support 9000 bytes on their ports.
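As suggested in review, an end-to-end jumbo-frame check is a non-fragmenting ping whose payload is the MTU minus the 20-byte IPv4 and 8-byte ICMP headers; a sketch (the peer IP is a placeholder):

```shell
MTU=9000
PAYLOAD=$((MTU - 28))   # 20-byte IP header + 8-byte ICMP header
echo "$PAYLOAD"
# On Linux, -M do sets the Don't Fragment bit; the ping succeeds only if the
# full 9000-byte frame traverses every switch in the path:
# ping -c 3 -M do -s "$PAYLOAD" <peer-node-ip>
```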
@@ -0,0 +1,41 @@
# ODFNodeNICBandwidthSaturation

## Meaning

A network interface on an ODF node is operating at >90% of its reported
link speed, indicating potential bandwidth saturation.

## Impact

* Network congestion leading to packet drops or latency spikes.
* Slowed Ceph replication, backfill, and recovery.
* Client I/O timeouts or stalls.
* Possible Ceph OSD evictions due to heartbeat failures.

## Diagnosis

1. From alert, note instance and device.
2. Check current utilization:
```bash
oc debug node/<node>
sar -n DEV 1 5      # sample network device stats every 1 second, 5 times
```
3. Use Prometheus to graph:
```promql
rate(node_network_receive_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
rate(node_network_transmit_bytes_total{instance="<ip>", device="<dev>"}[5m]) * 8
```
4. Determine if traffic is Ceph-related (e.g., during rebalance) or external.
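The saturation math can be sketched with hypothetical numbers: observed byte rate times 8, as a percentage of the link speed in bits per second.

```shell
BYTES_PER_SEC=2900000000      # hypothetical observed rate (~2.9 GB/s)
LINK_SPEED_BPS=25000000000    # 25GbE link
UTIL=$(( BYTES_PER_SEC * 8 * 100 / LINK_SPEED_BPS ))
echo "${UTIL}%"               # above 90 would fire this alert
```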

## Mitigation

1. Short term: Throttle non-essential traffic on the node:
   * Taint the OSD node to prevent scheduling of non-storage workloads.
   * Drain existing non-essential pods from the node.

> **Member:** how?
2. Long term:
* Upgrade to higher-speed NICs (e.g., 25GbE → 100GbE).
* Use multiple bonded interfaces with LACP.
* Separate storage and client traffic using VLANs or dedicated NICs.
3. Tune Ceph `osd_max_backfills` and `osd_recovery_max_active` to reduce
recovery bandwidth.
4. Enable NIC offload features (TSO, GRO) if disabled.