This benchmark emulates a cluster management scenario. It uses a trace of timestamped tuples collected from an 11,000-machine compute cluster at Google. Each tuple is a monitoring event related to the tasks of compute jobs that execute on the cluster, such as the successful completion of a task, the failure of a task, or the submission of a high-priority task for a production job.
It has two queries, Query1 and Query2, that express common cluster monitoring tasks: Query1 combines a projection and an aggregation with a GROUP-BY clause to compute the sum of the requested share of CPU utilisation per job category; and Query2 combines a projection, a selection, and an aggregation with a GROUP-BY to report the average requested CPU utilisation of submitted tasks.
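The semantics of the two queries can be sketched in plain Java over an in-memory list of events. This is an illustrative sketch only: the field names (`category`, `cpuShare`, `eventType`), the `"SUBMIT"` event encoding, and the choice of `category` as the GROUP-BY key for Query2 are assumptions, not the benchmark's actual schema or implementation.

```java
import java.util.*;
import java.util.stream.*;

public class QuerySketch {
    // Hypothetical monitoring-event record; field names are assumptions.
    record TaskEvent(String category, double cpuShare, String eventType) {}

    // Query1: projection + aggregation with GROUP-BY — sum of the requested
    // CPU share per job category.
    static Map<String, Double> query1(List<TaskEvent> events) {
        return events.stream().collect(Collectors.groupingBy(
                TaskEvent::category,
                Collectors.summingDouble(TaskEvent::cpuShare)));
    }

    // Query2: projection + selection + aggregation with GROUP-BY — average
    // requested CPU share of submitted tasks (grouping key assumed).
    static Map<String, Double> query2(List<TaskEvent> events) {
        return events.stream()
                .filter(e -> e.eventType().equals("SUBMIT"))
                .collect(Collectors.groupingBy(
                        TaskEvent::category,
                        Collectors.averagingDouble(TaskEvent::cpuShare)));
    }

    public static void main(String[] args) {
        List<TaskEvent> events = List.of(
                new TaskEvent("prod", 0.5, "SUBMIT"),
                new TaskEvent("prod", 0.25, "FINISH"),
                new TaskEvent("batch", 0.1, "SUBMIT"));
        System.out.println(query1(events)); // per-category sums of CPU share
        System.out.println(query2(events)); // per-category averages over SUBMIT events
    }
}
```

In the actual benchmark these aggregations run over the streamed trace rather than an in-memory list, but the relational shape of each query is the same.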
For more details on the trace, the queries, and the experimental setup, see the accompanying paper.
The benchmark and the data-generator are built with Apache Maven. To build both, run:
mvn clean package
To run Query1, run:
./bin/spark-submit --class edu.sogang.benchmark.RunBench ASSEMBLED_JAR_PATH \
--query-name q1 --config-filename config.properties
To run Query2, run:
./bin/spark-submit --class edu.sogang.benchmark.RunBench ASSEMBLED_JAR_PATH \
--query-name q2 --config-filename config.properties
To run the data-generator, run:
java -jar ASSEMBLED_JAR_PATH --config-filename config.properties
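Both the queries and the data-generator read their settings from the properties file passed via --config-filename. The concrete keys are not documented here; the fragment below is purely a hypothetical illustration of what such a file might contain (every key name and value is an assumption):

```properties
# Hypothetical example only — the real key names may differ.
input.path=/data/google-trace/events
output.path=/data/benchmark/results
spark.master=local[*]
```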