First off, for those who don't know what Apache Parquet is, here is the quote from the official website:
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
The idea behind this repository is to compare common scenarios and measure, in a simple but useful way, how fast it is to work with common file formats versus Parquet.
The main thing I want to compare here is write/read speed in each language, for study purposes.
The plan is to make a small comparison between NodeJS, Python, Rust, Go, and so on, as I find the time.
Feel free to copy this repo for your own tests if you want.
A few improvements are implemented here to test Parquet reading times; more tweaks may come later.
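To give an idea of what each benchmark step does (not the repo's exact code), here is a minimal sketch in Python, assuming pandas with pyarrow installed; the function and file names are just placeholders:

```python
# Minimal sketch of the timing approach (hypothetical names, not the repo's exact code).
# Assumes pandas + pyarrow are installed.
import time
import pandas as pd


def benchmark(df: pd.DataFrame) -> dict:
    results = {}

    # JSON: write, read back, then filter transactions with amount > 0
    start = time.perf_counter()
    df.to_json("ledger.json", orient="records")
    write_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    loaded = pd.read_json("ledger.json", orient="records")
    read_ms = (time.perf_counter() - start) * 1000
    results["JSON"] = (write_ms, read_ms, int((loaded["amount"] > 0).sum()))

    # Parquet: same steps (CSV follows the same pattern with to_csv/read_csv)
    start = time.perf_counter()
    df.to_parquet("ledger.parquet")
    write_ms = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    loaded = pd.read_parquet("ledger.parquet")
    read_ms = (time.perf_counter() - start) * 1000
    results["Parquet"] = (write_ms, read_ms, int((loaded["amount"] > 0).sum()))

    return results
```

The console output below comes from running the Python version of the benchmark.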
schubert:~/Desenvolvimento/github-opensource/parquet-tests/python$ python3 main.py
Starting financial test benchmark
==========================================================================================
Benchmark with 500,000 transactions
Executing tests...
Writing JSON...
Reading JSON...
JSON filtered 149780 transactions with amount > 0
Writing CSV...
Reading CSV...
CSV filtered 149780 transactions with amount > 0
Writing Parquet file...
Reading Parquet...
Parquet filtered 149780 transactions with amount > 0
DETAILED RESULTS - FINANCIAL LEDGER
┌─────────────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┐
│ Format          │ Write (ms)   │ Read (ms)    │ Size (MB)    │ Gzip (MB)   │ Compression │
├─────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────────┤
│ JSON            │ 777.07       │ 3197.98      │ 276.19       │ 42.31       │ 84.7%       │
│ CSV             │ 6923.59      │ 1345.28      │ 179.39       │ 38.57       │ 78.5%       │
│ Parquet         │ 516.52       │ 4.28         │ 4.00         │ 3.83        │ 4.3%        │
└─────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────────┘
Preparing next volume
==========================================================================================
Benchmark with 1,500,000 transactions
Executing tests...
Writing JSON...
Reading JSON...
JSON filtered 451355 transactions with amount > 0
Writing CSV...
Reading CSV...
CSV filtered 451355 transactions with amount > 0
Writing Parquet file...
Reading Parquet...
Parquet filtered 451355 transactions with amount > 0
DETAILED RESULTS - FINANCIAL LEDGER
┌─────────────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┐
│ Format          │ Write (ms)   │ Read (ms)    │ Size (MB)    │ Gzip (MB)   │ Compression │
├─────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────────┤
│ JSON            │ 1672.54      │ 9012.81      │ 830.18       │ 126.64      │ 84.7%       │
│ CSV             │ 19432.07     │ 4059.87      │ 539.79       │ 115.52      │ 78.6%       │
│ Parquet         │ 795.25       │ 10.21        │ 11.99        │ 11.48       │ 4.3%        │
└─────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────────┘
Preparing next volume
==========================================================================================
Benchmark with 3,000,000 transactions
Executing tests...
Writing JSON...
Reading JSON...
JSON filtered 899134 transactions with amount > 0
Writing CSV...
Reading CSV...
CSV filtered 899134 transactions with amount > 0
Writing Parquet file...
Reading Parquet...
Parquet filtered 899134 transactions with amount > 0
DETAILED RESULTS - FINANCIAL LEDGER
┌─────────────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┐
│ Format          │ Write (ms)   │ Read (ms)    │ Size (MB)    │ Gzip (MB)   │ Compression │
├─────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────────┤
│ JSON            │ 3368.20      │ 19557.14     │ 1660.69      │ 252.85      │ 84.8%       │
│ CSV             │ 38921.95     │ 8248.35      │ 1079.90      │ 230.63      │ 78.6%       │
│ Parquet         │ 1593.29      │ 18.99        │ 23.99        │ 22.95       │ 4.3%        │
└─────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────────┘
Preparing next volume
Cleaning files and finishing....
0 benchmark files removed
The same tests were run on NodeJS. Here I did not spend as much time on optimization, so there is certainly room for improvement.
schubert:~/Desenvolvimento/github-opensource/parquet-tests/node$ bun start
$ bun run src/index.ts
Starting financial test benchmark
==========================================================================================
Benchmark with 500,000 transactions
Executing tests...
Writing JSON...
Reading JSON...
JSON filtered 149963 transactions with amount > 0
Writing CSV...
Reading CSV...
CSV filtered 149963 transactions with amount > 0
Writing Parquet file...
Reading Parquet...
Parquet filtered 149963 transactions with amount > 0
All analyses completed!
DETAILED RESULTS - FINANCIAL LEDGER
┌─────────────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┐
│ Format          │ Write (ms)   │ Read (ms)    │ Size (MB)    │ Gzip (MB)   │ Compression │
├─────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────────┤
│ JSON            │ 1024.03      │ 1240.23      │ 274.76       │ 43.14       │ 84.3%       │
│ CSV             │ 1704.57      │ 905.32       │ 167.71       │ 39.10       │ 76.7%       │
│ Parquet         │ 168.04       │ 86.54        │ 4.25         │ 4.06        │ 4.7%        │
└─────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────────┘
Preparing next volume
==========================================================================================
Benchmark with 1,500,000 transactions
Executing tests...
Writing JSON...
Reading JSON...
JSON filtered 450619 transactions with amount > 0
Writing CSV...
Reading CSV...
CSV filtered 450619 transactions with amount > 0
Writing Parquet file...
Reading Parquet...
Parquet filtered 450619 transactions with amount > 0
All analyses completed!
DETAILED RESULTS - FINANCIAL LEDGER
┌─────────────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┐
│ Format          │ Write (ms)   │ Read (ms)    │ Size (MB)    │ Gzip (MB)   │ Compression │
├─────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────────┤
│ JSON            │ 2939.30      │ 3725.82      │ 825.89       │ 129.13      │ 84.4%       │
│ CSV             │ 5339.41      │ 2609.46      │ 504.72       │ 117.10      │ 76.8%       │
│ Parquet         │ 371.58       │ 185.20       │ 12.51        │ 11.93       │ 4.6%        │
└─────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────────┘
Preparing next volume
==========================================================================================
Benchmark with 3,000,000 transactions
Executing tests...
Writing JSON...
Reading JSON...
JSON filtered 900065 transactions with amount > 0
Writing CSV...
Reading CSV...
CSV filtered 900065 transactions with amount > 0
Writing Parquet file...
Reading Parquet...
Parquet filtered 900065 transactions with amount > 0
All analyses completed!
DETAILED RESULTS - FINANCIAL LEDGER
┌─────────────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┐
│ Format          │ Write (ms)   │ Read (ms)    │ Size (MB)    │ Gzip (MB)   │ Compression │
├─────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────────┤
│ JSON            │ 10026.85     │ 7501.09      │ 1652.11      │ 257.84      │ 84.4%       │
│ CSV             │ 10861.86     │ 5693.82      │ 1009.81      │ 233.74      │ 76.9%       │
│ Parquet         │ 624.59       │ 369.47       │ 24.76        │ 23.62       │ 4.6%        │
└─────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────────┘
Preparing next volume
Cleaning files and finishing....
0 benchmark files removed
Parquet turned out to be ridiculously fast compared to JSON and CSV. Of course, the JSON and CSV handling could be improved, compressed, and so on, but they would still lose to Parquet on both speed and size.
At the very least, it is a very good approach to convert long-lived logs to Parquet before storing them in data lakes or even in backups.
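As a rough illustration (not part of this repo), a conversion like that can be just a few lines, again assuming pandas with pyarrow installed and placeholder file names:

```python
# Hypothetical example: convert an existing CSV log to Parquet before archiving.
# Assumes pandas + pyarrow; the file names are placeholders.
import pandas as pd

df = pd.read_csv("app-logs-2023.csv")       # load the original log
df.to_parquet("app-logs-2023.parquet",      # columnar, compressed output
              compression="snappy")
```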