Skip to content
This repository was archived by the owner on Dec 20, 2022. It is now read-only.
This repository was archived by the owner on Dec 20, 2022. It is now read-only.

Error in accept call on a passive RdmaChannel #31

@rmunoz527

Description

@rmunoz527

Hi,

I am currently evaluating this library and not having done any specific configuration with respect to the infiniband network the spark nodes interconnect on. Can you point me in right direction on what might be cause of issue. See config and stacktrace below.

spark2-submit -v --num-executors 10 --executor-cores 5 --executor-memory 4G --conf spark.driver.extraClassPath=/opt/mellanox/spark-rdma-3.1.jar --conf spark.executor.extraClassPath=/opt/mellanox/spark-rdma-3.1.jar --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager --class com.github.ehiggs.spark.terasort.TeraSort /tmp/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /tmp/data/terasort_in /tmp/data/terasort_out

Parsed arguments:
master yarn
deployMode client
executorMemory 4G
executorCores 5
totalExecutorCores null
propertiesFile /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath /opt/mellanox/spark-rdma-3.1.jar
driverExtraLibraryPath /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native
driverExtraJavaOptions null
supervise false
queue null
numExecutors 10
files null
pyFiles null
archives null
mainClass com.github.ehiggs.spark.terasort.TeraSort
primaryResource file:/tmp/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar
name com.github.ehiggs.spark.terasort.TeraSort
childArgs [/tmp/data/terasort_in /tmp/data/terasort_out]
jars null
packages null
packagesExclusions null
repositories null

(spark.shuffle.manager,org.apache.spark.shuffle.rdma.RdmaShuffleManager)
(spark.executor.extraLibraryPath,/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.authenticate,false)
(spark.yarn.jars,local:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/)
(spark.driver.extraLibraryPath,/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.yarn.historyServer.address,


(spark.yarn.am.extraLibraryPath,/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.eventLog.enabled,true)
(spark.dynamicAllocation.schedulerBacklogTimeout,1)
(spark.yarn.config.gatewayPath,/opt/cloudera/parcels)
(spark.ui.killEnabled,true)
(spark.dynamicAllocation.maxExecutors,148)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.shuffle.service.enabled,true)
(spark.hadoop.yarn.application.classpath,)
(spark.dynamicAllocation.minExecutors,0)
(spark.dynamicAllocation.executorIdleTimeout,60)
(spark.yarn.config.replacementPath,{{HADOOP_COMMON_HOME}}/../../..)
(spark.sql.hive.metastore.version,1.1.0)
(spark.submit.deployMode,client)
(spark.shuffle.service.port,7337)
(spark.executor.extraClassPath,/opt/mellanox/spark-rdma-3.1.jar)
(spark.hadoop.mapreduce.application.classpath,)
(spark.eventLog.dir,
(spark.master,yarn)
(spark.dynamicAllocation.enabled,true)
(spark.sql.catalogImplementation,hive)
(spark.sql.hive.metastore.jars,${env:HADOOP_COMMON_HOME}/../hive/lib/
:${env:HADOOP_COMMON_HOME}/client/*)
(spark.driver.extraClassPath,/opt/mellanox/spark-rdma-3.1.jar)

19/05/02 13:54:17 WARN spark.SparkContext: Using an existing SparkContext; some configuration may not take effect.
[Stage 0:> (0 + 17) / 45]19/05/02 13:54:18 ERROR rdma.RdmaNode: Error in accept call on a passive RdmaChannel: java.io.IOException: createCQ() failed
java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:345)
at org.apache.spark.shuffle.rdma.RdmaChannel.stop(RdmaChannel.java:894)
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:176)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "RdmaNode connection listening thread" java.lang.RuntimeException: Exception in RdmaNode listening thread java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:210)
at java.lang.Thread.run(Thread.java:748)
19/05/02 13:54:20 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 0.0 (TID 3, , executor 3): java.lang.ArithmeticException: / by zero
at org.apache.spark.shuffle.rdma.RdmaNode.getNextCpuVector(RdmaNode.java:278)
at org.apache.spark.shuffle.rdma.RdmaNode.getRdmaChannel(RdmaNode.java:301)
at org.apache.spark.shuffle.rdma.RdmaShuffleManager.org$apache$spark$shuffle$rdma$RdmaShuffleManager$$getRdmaChannel(RdmaShuffleManager.scala:314)
at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getRdmaChannelToDriver(RdmaShuffleManager.scala:322)
at org.apache.spark.shuffle.rdma.RdmaShuffleManager.publishMapTaskOutput(RdmaShuffleManager.scala:410)
at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleWriter.stop(RdmaWrapperShuffleWriter.scala:118)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions