This repository was archived by the owner on Dec 20, 2022. It is now read-only.

Running HiBench with SparkRDMA

Peter Rudenko edited this page Mar 30, 2018 · 7 revisions

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO.

Steps to reproduce the TeraSort benchmark experiment:

  1. Environment: 17 nodes, each with 2x Intel Xeon E5-2697 v3 @ 2.60GHz (30 cores per Worker), 256GB RAM, non-flash storage (HDD), and a Mellanox ConnectX-4 network adapter on a 100GbE RoCE fabric, connected through a Mellanox Spectrum switch.
  2. Apache hadoop-2.7.4, HDFS (1 NameNode, 16 DataNodes).
  3. Spark 2.2 in standalone mode on all 17 nodes.
  4. Set up HiBench.
  5. Configure Hadoop and Spark settings in the HiBench conf directory.
  6. In HiBench/conf/hibench.conf, set:
hibench.scale.profile bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism         1000

# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism     7000
  7. In HiBench/conf/workloads/micro/terasort.conf, set:
hibench.terasort.bigdata.datasize               1890000000
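The datasize above is a record count, not a byte count. Assuming the classic 100-byte TeraSort record, it lines up exactly with the Input_data_size that later appears in hibench.report. A quick sanity check:

```python
# Hedged sketch: hibench.terasort.bigdata.datasize is a record count;
# a standard TeraSort record is 100 bytes (10-byte key + 90-byte payload).
RECORD_BYTES = 100
datasize_records = 1_890_000_000   # value set in terasort.conf above

input_bytes = datasize_records * RECORD_BYTES
print(input_bytes)  # 189000000000 == Input_data_size in hibench.report
```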
  8. Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh
  9. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2018-03-26    19:13:52  189000000000         79.931               2364539415            2364539415
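The Throughput(bytes/s) column is simply Input_data_size divided by Duration. Reproducing the baseline row's figure:

```python
# Verify the report's throughput arithmetic for the baseline run.
input_bytes = 189_000_000_000   # Input_data_size from hibench.report
duration_s = 79.931             # Duration(s) from hibench.report

throughput = input_bytes / duration_s
print(round(throughput))  # 2364539415, matching Throughput(bytes/s)
```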
  10. Add to HiBench/conf/spark.conf:
spark.driver.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
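Before re-running, it is easy to misspell one of these keys and silently fall back to the default sort-based shuffle. A minimal sketch (helper names are hypothetical, not part of HiBench or SparkRDMA) that checks a spark.conf-style properties file contains the settings above:

```python
# Hedged sketch: validate that the RDMA-related settings listed above are
# present in a "key value" properties file such as HiBench/conf/spark.conf.
REQUIRED = {
    "spark.shuffle.manager": "org.apache.spark.shuffle.rdma.RdmaShuffleManager",
    "spark.shuffle.compress": "false",
    "spark.shuffle.spill.compress": "false",
}

def parse_conf(text):
    """Parse 'key value' lines, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf

def missing_settings(text):
    """Return the required keys that are absent or have the wrong value."""
    conf = parse_conf(text)
    return [k for k, v in REQUIRED.items() if conf.get(k) != v]

sample = """\
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
"""
print(missing_settings(sample))  # [] -> all RDMA settings present
```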
  11. Run HiBench/bin/workloads/micro/terasort/spark/run.sh
  12. Open HiBench/report/hibench.report:
Type               Date          Time      Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort 2018-03-26    19:13:52  189000000000         79.931               2364539415            2364539415
ScalaSparkTerasort 2018-03-26    19:17:13  189000000000         52.166               3623049495            3623049495
  13. Overall improvement: with the RDMA shuffle manager, TeraSort duration dropped from 79.931s to 52.166s, roughly a 1.53x speedup (about 35% shorter runtime).
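The improvement can be quantified directly from the two report rows:

```python
# Compare the vanilla Spark run against the SparkRDMA run
# using the Duration(s) values from hibench.report.
baseline_s = 79.931   # ScalaSparkTerasort without RDMA
rdma_s = 52.166       # ScalaSparkTerasort with RdmaShuffleManager

speedup = baseline_s / rdma_s
reduction_pct = (1 - rdma_s / baseline_s) * 100
print(f"{speedup:.2f}x faster, {reduction_pct:.1f}% shorter runtime")
```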
