-
Notifications
You must be signed in to change notification settings - Fork 73
Errors when using 2 or more nodes #37
Description
Hello!
I found out that SparkRDMA works correctly on our cluster only when 2 worker nodes interacting between each other. The number of executors on each node doesn't matter (I tried 1, 2, 4 executors per node). When 3 or more nodes take part in task processing, some errors occur (logs below). Job ends successfully, but it takes much more time than without RDMA. How can i solve this problem? In case when only 2 nodes take part in task processing I can see performance of SparkRDMA.
Cluster
4 nodes (all nodes can be workers)
| Component | Description |
|---|---|
| Storage | Samsung SSD 970 EVO Plus 250GB (NVMe) |
| OS | Kubuntu 20.04 LTS |
| Network switch | Huawei S5720 36-C with switching capacity of 598 Gbit/s |
| Network physical layer | single mode optical fiber |
| Network adapter | Mellanox ConnectX-4 Lx EN, 10 GbE single port SFP+ |
| Memory | 32 GB |
| CPU | Intel Core i7-9700 CPU @ 3.00GHz, 8 cores with 1 thread per core |
Yarn configurations (Hadoop 3.1.3)
| Configuration | Value |
|---|---|
| yarn.nodemanager.resource.memory-mb | 27648 |
| yarn.scheduler.maximum-allocation-mb | 27648 |
| yarn.scheduler.minimum-allocation-mb | 13824 |
Spark configurations (Spark 2.4.0 )
SparkRDMA-3.1
DiSNI-1.7
| Configuration | Value | Description |
|---|---|---|
| yarn.executor.num | 3 | For this test I use 3 nodes (1 executor per node) |
| yarn.executor.cores | 5 | |
| spark.executor.memory | 7g | I use only ~50% of possible memory to prevent OOM errors (very often it can be) |
| spark.executor.memoryOverhead | 8g | as suggested here |
| spark.driver.memory | 2g | |
| spark.default.parallelism | 45 | |
| spark.sql.shuffle.partitions | 45 | |
| spark.shuffle.manager | org.apache.spark.shuffle.rdma.RdmaShuffleManager | |
| spark.shuffle.compress | false | as suggested here |
| spark.shuffle.spill.compress | false | |
| spark.broadcast.compress | false | |
| spark.locality.wait | 0 |
I run tests using HiBench. Dataset profile - huge (about 30 GB). Workload - TeraSort.
Some of Spark logs:
2020-11-24 12:00:02 INFO YarnClientSchedulerBackend:54 - Stopped
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(0)
2020-11-24 12:00:02 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(2)
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(1)
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:345)
at org.apache.spark.shuffle.rdma.RdmaChannel.stop(RdmaChannel.java:894)
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:203)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "RdmaNode connection listening thread" java.lang.RuntimeException: Exception in RdmaNode listening thread java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:210)
at java.lang.Thread.run(Thread.java:748)
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO RdmaBufferManager:218 - Rdma buffers allocation statistics:
2020-11-24 12:00:02 INFO RdmaBufferManager:222 - Pre allocated 0, allocated 770 buffers of size 4 KB
2020-11-24 12:00:02 INFO disni:201 - deallocPd, pd 1
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:263 - destroyEventChannel, channel 0
2020-11-24 12:00:02 INFO MemoryStore:54 - MemoryStore cleared
2020-11-24 12:00:02 INFO BlockManager:54 - BlockManager stopped
2020-11-24 12:00:02 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2020-11-24 12:00:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2020-11-24 12:00:02 INFO SparkContext:54 - Successfully stopped SparkContext
2020-11-24 12:00:02 INFO ShutdownHookManager:54 - Shutdown hook called
2020-11-24 12:00:02 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-3a2b9379-56c3-4b40-ab40-e92f03a5c591
2020-11-24 12:00:02 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-9e3d837b-5d83-467f-9a12-591ffd0503a8
Yarn logs from one worker
2020-11-24 11:58:51 INFO Executor:54 - Finished task 40.0 in stage 2.0 (TID 355). 1502 bytes result sent to driver
2020-11-24 12:00:02 INFO CoarseGrainedExecutorBackend:54 - Driver commanded a shutdown
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(1)
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(2)
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(0)
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(3)
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO disni:285 - destroyQP, id 0
2020-11-24 12:00:02 INFO disni:214 - destroyCQ, cq 140630589705088
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:189 - destroyCompChannel, compChannel 0
2020-11-24 12:00:02 INFO disni:263 - destroyEventChannel, channel 0
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO disni:285 - destroyQP, id 0
2020-11-24 12:00:02 INFO disni:214 - destroyCQ, cq 94745206395344
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:189 - destroyCompChannel, compChannel 0
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO CoarseGrainedExecutorBackend:54 - Driver from hadoop-master:45735 disconnected during shutdown
2020-11-24 12:00:02 INFO CoarseGrainedExecutorBackend:54 - Driver from hadoop-master:45735 disconnected during shutdown
2020-11-24 12:00:02 INFO RdmaChannel:874 - Stopping RdmaChannel RdmaChannel(4)
2020-11-24 12:00:02 ERROR RdmaNode:384 - Failed to stop RdmaChannel during 50 ms
2020-11-24 12:00:02 INFO disni:257 - disconnect, id 0
2020-11-24 12:00:02 INFO disni:285 - destroyQP, id 0
2020-11-24 12:00:02 WARN RdmaChannel:897 - Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
2020-11-24 12:00:02 INFO disni:285 - destroyQP, id 0
2020-11-24 12:00:02 INFO disni:214 - destroyCQ, cq 94745206905120
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:189 - destroyCompChannel, compChannel 0
2020-11-24 12:00:02 INFO disni:214 - destroyCQ, cq 94745206402048
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:189 - destroyCompChannel, compChannel 0
2020-11-24 12:00:02 INFO disni:263 - destroyEventChannel, channel 0
2020-11-24 12:00:02 WARN RdmaChannel:897 - Failed to get RDMA_CM_EVENT_DISCONNECTED: getCmEvent() failed
2020-11-24 12:00:02 INFO disni:285 - destroyQP, id 0
2020-11-24 12:00:02 INFO disni:214 - destroyCQ, cq 140630593575520
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:189 - destroyCompChannel, compChannel 0
2020-11-24 12:00:02 INFO disni:263 - destroyEventChannel, channel 0
2020-11-24 12:00:02 INFO RdmaNode:213 - Exiting RdmaNode Listening Server
2020-11-24 12:00:02 INFO RdmaBufferManager:218 - Rdma buffers allocation statistics:
2020-11-24 12:00:02 INFO RdmaBufferManager:222 - Pre allocated 0, allocated 380 buffers of size 4 KB
2020-11-24 12:00:02 INFO RdmaBufferManager:222 - Pre allocated 0, allocated 87 buffers of size 4096 KB
2020-11-24 12:00:02 INFO RdmaBufferManager:222 - Pre allocated 0, allocated 24 buffers of size 1024 KB
2020-11-24 12:00:02 INFO RdmaBufferManager:222 - Pre allocated 0, allocated 10 buffers of size 2048 KB
2020-11-24 12:00:02 INFO disni:201 - deallocPd, pd 1
2020-11-24 12:00:02 INFO disni:274 - destroyCmId, id 0
2020-11-24 12:00:02 INFO disni:263 - destroyEventChannel, channel 0
2020-11-24 12:00:02 INFO MemoryStore:54 - MemoryStore cleared
2020-11-24 12:00:02 INFO BlockManager:54 - BlockManager stopped
2020-11-24 12:00:02 INFO ShutdownHookManager:54 - Shutdown hook called
