cloudymoma/dataparorc

Data Generation & Processing Suite

High-performance data generation tools and Spark processing pipeline for ORC/Parquet formats.

Components

  • Rust Data Generators: ID, user, and event data generators
  • Java Spark Job: User-event join processor for Google Cloud Dataproc performance comparison
  • Java File Reader: ORC/Parquet file testing and inspection tool
  • Spark File Converter: Distributed converter between ORC and Parquet formats on GCS

Quick Start

# Build all components (Rust generators + Java tools + Spark Jobs)
make build

# Generate data (200M IDs, 1B events)
make id_gen user_gen event_gen

# Upload to GCS
make upload

The make build command compiles:

  • Rust data generators (release mode)
  • Java file reader (with dependencies)
  • Java Spark job (fat JAR)

Data Generators

ID Generator

./target/release/id_gen --out id.txt --number 200000000

Generates 200M unique hash-based string IDs.
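The README does not say which hash or ID length id_gen uses. As a minimal sketch of the idea only, assuming SHA-256 over a running counter truncated to 16 hex characters (both choices are assumptions, and `make_ids` is a hypothetical name), unique hash-based IDs could be produced like this:

```python
import hashlib


def make_ids(count, length=16):
    """Yield `count` distinct hex IDs derived from a running counter.

    Assumption: SHA-256 and a 16-character prefix are illustrative choices,
    not necessarily what the Rust id_gen binary does.
    """
    seen = set()
    counter = 0
    while len(seen) < count:
        digest = hashlib.sha256(str(counter).encode("utf-8")).hexdigest()[:length]
        counter += 1
        if digest not in seen:  # guard against collisions after truncation
            seen.add(digest)
            yield digest


ids = list(make_ids(5))
```

Tracking every ID in an in-memory set is fine for a demo but would be costly at 200M entries; a real generator would likely rely on a collision-free construction instead.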

User Generator

./target/release/user_gen --in id.txt --parquet user/user.parquet --orc user/user.orc --partition 256MB

Creates a user table with four fields per row: ID, age (20-60), gender (male/female), and an EU country.
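To make the row shape concrete, here is a small sketch of one user record. The field names, the country list, and the inclusive age bounds are assumptions for illustration; the actual schema is defined by the Rust user_gen tool.

```python
import random

# Hypothetical subset of EU country codes; the real list is not given here.
EU_COUNTRIES = ["DE", "FR", "IT", "ES", "NL", "PL", "SE"]


def make_user(user_id, rng):
    """Build one user row for the given ID (field names are assumed)."""
    return {
        "id": user_id,
        "age": rng.randint(20, 60),  # randint is inclusive on both ends
        "gender": rng.choice(["male", "female"]),
        "country": rng.choice(EU_COUNTRIES),
    }


rng = random.Random(42)  # seeded so repeated runs give the same rows
users = [make_user(f"id{i}", rng) for i in range(3)]
```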

Event Generator

./target/release/event_gen --in id.txt --parquet event/event.parquet --orc event/event.orc --partition 512MB --number 1000000000

Generates 1B events, each with an ID drawn at random from id.txt, a timestamp within the past 30 days, and an event name.
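A rough sketch of a single event row follows. The event names, field names, and second-level timestamp resolution are assumptions; only the "random ID, timestamp in the past 30 days, event name" shape comes from the description above.

```python
import random
from datetime import datetime, timedelta, timezone

EVENT_NAMES = ["login", "view", "click", "purchase"]  # hypothetical names
WINDOW_SECONDS = 30 * 24 * 60 * 60  # 30 days expressed in seconds


def make_event(ids, now, rng):
    """Build one event row whose timestamp falls within 30 days before `now`."""
    offset = rng.randrange(WINDOW_SECONDS)  # 0 <= offset < 30 days
    return {
        "id": rng.choice(ids),
        "timestamp": now - timedelta(seconds=offset),
        "event": rng.choice(EVENT_NAMES),
    }


rng = random.Random(7)
now = datetime(2024, 1, 31, tzinfo=timezone.utc)  # fixed "now" for the demo
events = [make_event(["a", "b", "c"], now, rng) for _ in range(4)]
```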

Spark Performance Comparison

Compare ORC vs Parquet performance on identical data volumes across different cluster tiers:

cd dataproc
./spark.sh jobserver  # Create cluster
./spark.sh job        # Run user-event join

The Spark job runs the same user-event join over the same data in both formats, so ORC and Parquet can be compared directly across Dataproc configurations (standard and premium tiers).
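The README does not spell out the join semantics. Assuming an inner join on the ID column (an assumption, as are the field names below), the result on a few in-memory rows would look like this pure-Python sketch. It illustrates what the join computes, not how the Java Spark job is written.

```python
def join_users_events(users, events):
    """Inner-join events to users on "id" using a hash lookup."""
    by_id = {u["id"]: u for u in users}  # build side: one entry per user
    joined = []
    for e in events:  # probe side: scan events once
        u = by_id.get(e["id"])
        if u is not None:  # events without a matching user are dropped
            joined.append({**u, **e})
    return joined


users = [{"id": "a", "age": 30}, {"id": "b", "age": 45}]
events = [
    {"id": "a", "event": "click"},
    {"id": "z", "event": "view"},  # no such user, so excluded from the result
    {"id": "a", "event": "login"},
]
joined = join_users_events(users, events)
```

Each matching event yields one output row carrying both the user and the event fields, so user "a" appears twice and user "b" not at all.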

Data Testing & Inspection

Inspect generated ORC/Parquet files with the wrapper script:

./file-reader.sh ./event/event-000000.parquet
./file-reader.sh ./user/user-000000.orc

Or run the JAR directly:

cd java-file-reader
java -jar target/file-reader-1.0.0.jar --file ../event/event-000000.parquet
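As a quick sanity check that does not depend on the Java tool, a file's format can be guessed from its leading bytes: Parquet files start with the magic bytes `PAR1` and ORC files start with `ORC`. The sketch below is a standalone helper with hypothetical function names, not part of file-reader.

```python
def sniff_bytes(head):
    """Classify a file from its first bytes using the format magic numbers."""
    if head[:4] == b"PAR1":
        return "parquet"
    if head[:3] == b"ORC":
        return "orc"
    return "unknown"


def sniff_file(path):
    """Read the first four bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_bytes(f.read(4))
```

For example, `sniff_file("event/event-000000.parquet")` should return `"parquet"` for a well-formed file. A matching header only shows the file starts correctly; it does not validate the footer or the data.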

Requirements

  • Rust (latest stable)
  • Java 11+
  • Maven 3.6+
  • Google Cloud SDK (for upload)

Output Structure

user/
├── user-000000.parquet
├── user-000000.orc
└── ...

event/
├── event-000000.parquet
├── event-000000.orc
└── ...

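Output is split into numbered files of roughly 256MB each for the user table and 512MB each for the event table, following the --partition values passed to the generators.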
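The naming pattern above (a base name, a six-digit zero-padded index, then the extension) and the size strings passed to --partition can be sketched as follows. Treating MB as 1024 * 1024 bytes is an assumption, as are the function names; the generators may use decimal units instead.

```python
import re

# Assumption: binary units (1 MB = 1024 * 1024 bytes).
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as "256MB" into a byte count."""
    match = re.fullmatch(r"(\d+)(KB|MB|GB)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def part_name(base, index, ext):
    """Format a partition file name like user-000000.parquet."""
    return f"{base}-{index:06d}.{ext}"


user_limit = parse_size("256MB")
first_user_file = part_name("user", 0, "parquet")
```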

About

Performance tests for the Google Cloud Dataproc Premium engine using the Parquet and ORC file formats.
