This repository was archived by the owner on Dec 20, 2022. It is now read-only.
Troubleshooting SparkRDMA
Peter Rudenko edited this page Mar 30, 2018 · 2 revisions
If you encounter Spark job failures or performance inconsistencies when using the SparkRDMA plugin, check the job logs for messages that point to the cause.
$ grep Rdma <your log file>
The output contains many informational messages, not all of which indicate an actual error. A common performance issue is oversubscription of a QP (queue pair). If you see the following warning, follow its recommendation and increase the spark.shuffle.io.rdmaSendDepth parameter.
17/08/14 14:33:38 WARN RdmaChannel: RDMA channel org.apache.spark.shuffle.rdma.RdmaChannel@7608ffc9 oversubscription detected. RDMA send queue depth is too small. To improve performance, please set set spark.shuffle.io.rdmaSendDepth to a higher value (current depth: 1024)
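To see which depth a job actually ran with, the warning can be parsed out of the log. The following is a minimal sketch, assuming the warning text matches the sample above; `current_send_depth` is a hypothetical helper name, not part of SparkRDMA.

```shell
# Hypothetical helper (not part of SparkRDMA): print the send queue depth
# reported by the most recent oversubscription warning in a log file.
# Assumes the warning contains "current depth: <number>" as in the sample above.
current_send_depth() {
  grep 'oversubscription detected' "$1" \
    | sed -n 's/.*current depth: \([0-9][0-9]*\).*/\1/p' \
    | tail -n 1
}
```

For example, `current_send_depth <your log file>` would print 1024 for the warning shown above. A larger value can then be passed to the job with `spark-submit --conf spark.shuffle.io.rdmaSendDepth=<larger value>`; how much larger is workload-dependent and is not specified here.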
If you see the error "Failed to bind. Make sure your NIC supports RDMA.", add the following to spark-env.sh:
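# Replace RDMA_INTERFACE_NAME with the name of your RDMA-capable network interface
RDMA_INTERFACE="RDMA_INTERFACE_NAME"
# Extract the interface's IPv4 address, without the prefix length
RDMA_IP=$(ip addr show "$RDMA_INTERFACE" | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
export SPARK_LOCAL_IP=$RDMA_IP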
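If the interface name is wrong or the interface has no IPv4 address, the snippet above silently exports an empty SPARK_LOCAL_IP. A more defensive variant might look like the sketch below. The function names `first_ipv4` and `set_spark_local_ip` are hypothetical, and taking only the first address when an interface has several is an assumption.

```shell
# Hypothetical helper: read `ip addr show <iface>` output on stdin and
# print the first IPv4 address, without the prefix length.
first_ipv4() {
  awk '$1 == "inet" { split($2, a, "/"); print a[1]; exit }'
}

# Hypothetical helper: export SPARK_LOCAL_IP for the given interface,
# or report an error and return non-zero if no IPv4 address is found.
set_spark_local_ip() {
  iface=$1
  addr=$(ip addr show "$iface" 2>/dev/null | first_ipv4)
  if [ -z "$addr" ]; then
    echo "No IPv4 address found on interface '$iface'; SPARK_LOCAL_IP not set" >&2
    return 1
  fi
  export SPARK_LOCAL_IP=$addr
}
```

In spark-env.sh this would be called as `set_spark_local_ip "RDMA_INTERFACE_NAME"`, with the placeholder replaced by the real interface name.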