High-performance data generation tools and Spark processing pipeline for ORC/Parquet formats.
- Rust Data Generators: ID, user, and event data generators
- Java Spark Job: User-event join processor for Google Cloud Dataproc performance comparison
- Java File Reader: ORC/Parquet file testing and inspection tool
- Spark File Converter: Distributed converter between ORC and Parquet formats on GCS
# Build all components (Rust generators + Java tools + Spark Jobs)
make build
# Generate data (200M IDs, 1B events)
make id_gen user_gen event_gen
# Upload to GCS
make upload

The make build command compiles:
- Rust data generators (release mode)
- Java file reader (with dependencies)
- Java Spark job (fat JAR)
./target/release/id_gen --out id.txt --number 200000000

Generates 200M unique hash-based string IDs.
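As a rough illustration of how unique hash-based string IDs can be produced, the sketch below mixes a running counter through the SplitMix64 finalizer, which is a bijection on 64-bit integers, so distinct counters cannot collide. This is an assumption for illustration only; the mixing function, the 16-character hex format, and the names `mix64`, `make_id`, and `generate_ids` are hypothetical and may differ from what id_gen actually does.

```rust
/// Odd constant used to spread consecutive counters (the SplitMix64 "gamma").
const GAMMA: u64 = 0x9E37_79B9_7F4A_7C15;

/// SplitMix64 finalizer. Every step is invertible (xor-shift, multiply by an
/// odd constant), so distinct inputs always produce distinct outputs.
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: the ID for a given counter value, rendered as a
/// fixed-width 16-character lowercase hex string (an assumed format).
fn make_id(index: u64) -> String {
    let spread = index.wrapping_mul(GAMMA).wrapping_add(GAMMA);
    format!("{:016x}", mix64(spread))
}

/// Generate `count` IDs. Uniqueness follows from the mapping being a
/// bijection, so no deduplication pass is needed.
fn generate_ids(count: u64) -> Vec<String> {
    (0..count).map(make_id).collect()
}
```

At 200M IDs a real generator would stream lines to the output file rather than collect them into a vector; the vector here only keeps the sketch short.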
./target/release/user_gen --in id.txt --parquet user/user.parquet --orc user/user.orc --partition 256MB

Creates the user table with ID, age (20-60), gender (male/female), and EU country.
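The shape of a generated user row can be sketched as follows. This is a minimal, dependency-free illustration, not the actual user_gen code: the `User` struct, the `XorShift64` generator, the inclusive reading of the 20-60 age range, and the short country list are all assumptions.

```rust
/// Tiny xorshift64 PRNG, used only to keep this sketch free of external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        XorShift64(seed.max(1)) // the state must never be zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in `0..n` (modulo bias is ignored in this sketch).
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Hypothetical row layout matching the columns described above.
#[derive(Debug, Clone, PartialEq)]
struct User {
    id: String,
    age: u8,
    gender: &'static str,
    country: &'static str,
}

const GENDERS: [&str; 2] = ["male", "female"];
// Assumed subset of EU country codes; the real generator's list is not shown.
const EU_COUNTRIES: [&str; 6] = ["DE", "FR", "IT", "ES", "NL", "PL"];

/// Build one user row for an existing ID, with age in 20..=60 (assumed inclusive).
fn make_user(id: &str, rng: &mut XorShift64) -> User {
    User {
        id: id.to_string(),
        age: 20 + rng.below(41) as u8,
        gender: GENDERS[rng.below(GENDERS.len() as u64) as usize],
        country: EU_COUNTRIES[rng.below(EU_COUNTRIES.len() as u64) as usize],
    }
}
```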
./target/release/event_gen --in id.txt --parquet event/event.parquet --orc event/event.orc --partition 512MB --number 1000000000

Generates 1B events with random IDs, timestamps (past 30 days), and event names.
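One way to sample such an event is to pick a random ID from the loaded pool and subtract a random offset of less than 30 days from the current time. The sketch below assumes Unix timestamps in seconds and a made-up list of event names; `Event`, `make_event`, `next_rand`, and `unix_now` are hypothetical names, not the actual event_gen API.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Length of the sampling window: 30 days, in seconds.
const WINDOW_SECS: u64 = 30 * 24 * 60 * 60;
// Hypothetical event names; the real generator's list is not shown.
const EVENT_NAMES: [&str; 4] = ["login", "view", "click", "purchase"];

struct Event {
    user_id: String,
    timestamp: u64, // Unix seconds (assumed unit)
    name: &'static str,
}

/// One xorshift64 step; `state` must be non-zero.
fn next_rand(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Current time as Unix seconds.
fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the Unix epoch")
        .as_secs()
}

/// Build one event whose timestamp lies in the 30 days before `now_secs`.
fn make_event(ids: &[String], now_secs: u64, state: &mut u64) -> Event {
    assert!(!ids.is_empty(), "the ID pool must not be empty");
    let user_id = ids[(next_rand(state) % ids.len() as u64) as usize].clone();
    let offset = next_rand(state) % WINDOW_SECS; // 0 <= offset < 30 days
    let name = EVENT_NAMES[(next_rand(state) % EVENT_NAMES.len() as u64) as usize];
    Event {
        user_id,
        timestamp: now_secs.saturating_sub(offset),
        name,
    }
}
```

Because IDs are drawn with replacement, some users receive many events and others none, which is typical for a join benchmark of this kind.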
Compare ORC vs Parquet performance on identical data volumes across different cluster tiers:
cd dataproc
./spark.sh jobserver # Create cluster
./spark.sh job # Run user-event join

The Spark job processes the same data in both formats, enabling a direct performance comparison between ORC and Parquet across various Dataproc configurations (standard/premium tiers).
Test generated ORC/Parquet files using the wrapper script:
./file-reader.sh ./event/event-000000.parquet
./file-reader.sh ./user/user-000000.orc

Or directly:
cd java-file-reader
java -jar target/file-reader-1.0.0.jar --file ../event/event-000000.parquet

Requirements:
- Rust (latest stable)
- Java 11+
- Maven 3.6+
- Google Cloud SDK (for upload)
user/
├── user-000000.parquet
├── user-000000.orc
└── ...
event/
├── event-000000.parquet
├── event-000000.orc
└── ...
Files are partitioned at 256MB (user) / 512MB (event) boundaries.
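The relationship between the --partition value, the base output path, and the numbered files can be sketched like this. It assumes that MB means 1024 * 1024 bytes and that partition files insert a six-digit, zero-padded index before the extension, as the listing above suggests; `parse_size` and `partition_path` are hypothetical helpers, not the generators' actual functions.

```rust
use std::path::{Path, PathBuf};

/// Parse a size such as "256MB" into bytes. Units are assumed to be binary
/// (1 MB = 1024 * 1024 bytes); returns None for unknown or malformed input.
fn parse_size(s: &str) -> Option<u64> {
    let s = s.trim();
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, unit) = s.split_at(split);
    let n: u64 = digits.parse().ok()?;
    let multiplier = match unit.to_ascii_uppercase().as_str() {
        "" | "B" => 1,
        "KB" => 1 << 10,
        "MB" => 1 << 20,
        "GB" => 1 << 30,
        _ => return None,
    };
    n.checked_mul(multiplier)
}

/// Derive the path of partition `index` from a base path, for example
/// "user/user.parquet" with index 0 becomes "user/user-000000.parquet".
fn partition_path(base: &str, index: u32) -> PathBuf {
    let path = Path::new(base);
    let stem = path.file_stem().and_then(|s| s.to_str()).unwrap_or("part");
    let ext = path.extension().and_then(|s| s.to_str()).unwrap_or("");
    let name = if ext.is_empty() {
        format!("{stem}-{index:06}")
    } else {
        format!("{stem}-{index:06}.{ext}")
    };
    path.with_file_name(name)
}
```

A writer following this scheme would start a new file, with the next index, once the bytes written to the current file reach the parsed limit.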