This repository was archived by the owner on Dec 20, 2022. It is now read-only.
# Running HiBench with SparkRDMA
Peter Rudenko edited this page Mar 30, 2018 · 7 revisions
HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO.
Steps to reproduce the TeraSort experiment:
- Environment: 17 nodes, each with 2x Intel Xeon E5-2697 v3 @ 2.60GHz (30 cores per Worker), 256GB RAM, non-flash storage (HDD), and a Mellanox ConnectX-4 network adapter on a 100GbE RoCE fabric, connected through a Mellanox Spectrum switch.
- Apache Hadoop 2.7.4 with HDFS (1 NameNode, 16 DataNodes).
- Spark 2.2 in standalone mode on 17 nodes.
- Set up HiBench.
- Configure Hadoop and Spark settings in the HiBench `conf` directory.
- In `HiBench/conf/hibench.conf` set:

```
hibench.scale.profile bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism 1000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism 7000
```
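As a rough sanity check on these parallelism settings, the task counts can be compared against the cluster's concurrent task slots. This sketch assumes 16 of the 17 nodes act as workers (matching the 16 HDFS DataNodes above; the worker count is an assumption, not stated explicitly on this page):

```python
import math

# Assumption: 16 worker nodes (one of the 17 nodes reserved for
# master/driver), 30 cores per worker as in the environment description.
workers = 16
cores_per_worker = 30
total_cores = workers * cores_per_worker  # 480 concurrent task slots

map_tasks = 1000       # hibench.default.map.parallelism
shuffle_tasks = 7000   # hibench.default.shuffle.parallelism

# Number of scheduling "waves" each stage needs to run all its tasks.
print(math.ceil(map_tasks / total_cores))      # map waves: 3
print(math.ceil(shuffle_tasks / total_cores))  # shuffle waves: 15
```

With ~15 waves of small shuffle tasks, the reduce stage is long enough to keep all cores busy while each individual task stays small.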
- Set in `HiBench/conf/workloads/micro/terasort.conf`:

```
hibench.terasort.bigdata.datasize 1890000000
```
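The `datasize` value counts TeraSort records, each 100 bytes in the TeraGen format, which is consistent with the `Input_data_size` shown in the report further down. A quick check:

```python
# hibench.terasort.bigdata.datasize counts 100-byte TeraSort records.
RECORD_BYTES = 100
datasize_records = 1_890_000_000

total_bytes = datasize_records * RECORD_BYTES
print(total_bytes)  # 189000000000 bytes (~189 GB), as in hibench.report
```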
- Run `HiBench/bin/workloads/micro/terasort/prepare/prepare.sh`, then `HiBench/bin/workloads/micro/terasort/spark/run.sh`.
- Open `HiBench/report/hibench.report`:

```
Type               Date       Time     Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkTerasort 2018-03-26 19:13:52 189000000000    79.931      2364539415          2364539415
```
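The `Throughput(bytes/s)` column is simply `Input_data_size` divided by `Duration(s)`; the reported figure can be reproduced from the other two columns:

```python
# Reproduce the baseline throughput figure from hibench.report.
input_bytes = 189_000_000_000
duration_s = 79.931

throughput = input_bytes / duration_s
print(int(throughput))  # 2364539415, matching the report
```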
- Add to `HiBench/conf/spark.conf`:

```
spark.driver.extraClassPath   /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager         org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress        false
spark.shuffle.spill.compress  false
```
- Run `HiBench/bin/workloads/micro/terasort/spark/run.sh` again.
- Open `HiBench/report/hibench.report`:

```
Type               Date       Time     Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkTerasort 2018-03-26 19:13:52 189000000000    79.931      2364539415          2364539415
ScalaSparkTerasort 2018-03-26 19:17:13 189000000000    52.166      3623049495          3623049495
```
- Overall improvement: ~35% shorter runtime (79.931 s → 52.166 s), a ~1.53x speedup with the RDMA shuffle.
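The improvement can be derived directly from the two report lines above:

```python
# Compare the two TeraSort runs from hibench.report.
baseline_s = 79.931  # default Spark shuffle
rdma_s = 52.166      # RdmaShuffleManager

speedup = baseline_s / rdma_s
reduction = 1 - rdma_s / baseline_s
print(f"{speedup:.2f}x faster, {reduction:.0%} shorter runtime")
# prints "1.53x faster, 35% shorter runtime"
```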
