This repository was archived by the owner on Dec 20, 2022. It is now read-only.

Running HiBench with SparkRDMA

Peter Rudenko edited this page Mar 30, 2018 · 7 revisions

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO.

Steps to reproduce the TeraSort benchmark experiment:

  1. Environment: 17 nodes, each with 2x Intel Xeon E5-2697 v3 @ 2.60GHz (30 cores per Worker), 256GB RAM, non-flash storage (HDD), and a Mellanox ConnectX-4 network adapter on a 100GbE RoCE fabric, connected through a Mellanox Spectrum switch.
  2. Apache hadoop-2.7.4, HDFS (1 NameNode, 16 DataNodes).
  3. Spark 2.2 in standalone mode on all 17 nodes.
  4. Set up HiBench.
  5. Configure Hadoop and Spark settings in the HiBench conf directory.
  6. In HiBench/conf/hibench.conf, set:
hibench.scale.profile bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         1000

# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     7000
  7. In HiBench/conf/workloads/micro/terasort.conf, set:
hibench.terasort.bigdata.datasize               1890000000
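The datasize above is a record count, not a byte count. Assuming the classic 100-byte TeraSort record, it lines up exactly with the Input_data_size that later appears in hibench.report. A quick sanity check:

```python
# Hedged sketch: hibench.terasort.bigdata.datasize is a record count;
# a standard TeraSort record is 100 bytes (10-byte key + 90-byte payload).
RECORD_BYTES = 100
datasize_records = 1_890_000_000   # value set in terasort.conf above

input_bytes = datasize_records * RECORD_BYTES
print(input_bytes)  # 189000000000 == Input_data_size in hibench.report
```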
  8. Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh
  9. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2018-03-26    19:13:52  189000000000         79.931               2364539415            2364539415
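The Throughput(bytes/s) column is simply Input_data_size divided by Duration. Reproducing the baseline row's figure:

```python
# Verify the report's throughput arithmetic for the baseline run.
input_bytes = 189_000_000_000   # Input_data_size from hibench.report
duration_s = 79.931             # Duration(s) from hibench.report

throughput = input_bytes / duration_s
print(round(throughput))  # 2364539415, matching Throughput(bytes/s)
```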
  10. Add to HiBench/conf/spark.conf:
spark.driver.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
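Before re-running, it is easy to misspell one of these keys and silently fall back to the default sort-based shuffle. A minimal sketch (helper names are hypothetical, not part of HiBench or SparkRDMA) that checks a spark.conf-style properties file contains the settings above:

```python
# Hedged sketch: validate that the RDMA-related settings listed above are
# present in a "key value" properties file such as HiBench/conf/spark.conf.
REQUIRED = {
    "spark.shuffle.manager": "org.apache.spark.shuffle.rdma.RdmaShuffleManager",
    "spark.shuffle.compress": "false",
    "spark.shuffle.spill.compress": "false",
}

def parse_conf(text):
    """Parse 'key value' lines, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf

def missing_settings(text):
    """Return the required keys that are absent or have the wrong value."""
    conf = parse_conf(text)
    return [k for k, v in REQUIRED.items() if conf.get(k) != v]

sample = """\
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
"""
print(missing_settings(sample))  # [] -> all RDMA settings present
```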
  11. Run HiBench/bin/workloads/micro/terasort/spark/run.sh
  12. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2018-03-26    19:13:52  189000000000         79.931               2364539415            2364539415
ScalaSparkTerasort 2018-03-26    19:17:13  189000000000         52.166               3623049495            3623049495
  13. Overall improvement: with the RDMA shuffle manager, TeraSort duration dropped from 79.931s to 52.166s, roughly a 1.53x speedup (about 35% shorter runtime).
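The improvement can be quantified directly from the two report rows:

```python
# Compare the vanilla Spark run against the SparkRDMA run
# using the Duration(s) values from hibench.report.
baseline_s = 79.931   # ScalaSparkTerasort without RDMA
rdma_s = 52.166       # ScalaSparkTerasort with RdmaShuffleManager

speedup = baseline_s / rdma_s
reduction_pct = (1 - rdma_s / baseline_s) * 100
print(f"{speedup:.2f}x faster, {reduction_pct:.1f}% shorter runtime")
```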
