A high-performance, streaming-first ETL engine for Node.js and TypeScript, designed for processing massive datasets with a constant memory footprint.
Visit our full documentation site for in-depth guides, API reference, and real-world recipes:
https://pujansrt.github.io/data-genie/
```bash
npm install @pujansrt/data-genie
```
Note: `zod`, `@aws-sdk/client-s3`, and `exceljs` are optional peer dependencies; install them only if you need schema validation, S3 transport, or Excel support.
```ts
import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
const writer = new JsonWriter('output.json');

const metrics = await Job.run(reader, writer);
console.log(`Processed ${metrics.recordCount} records!`);
```
Verify your transformations and filters instantly without writing any data.
```ts
// `pipeline` can be any configured reader, including one wrapped in transforms or filters
// Inspect the first 5 records in a beautiful console table
await Job.preview(pipeline);
```
Build your pipelines visually without writing code from scratch:
- Declarative Pipeline Builder: Visually configure your YAML pipelines and copy the generated config.
- TypeScript Code Generator: Generate full TypeScript boilerplate for complex ETL tasks (S3, SQL, Validation, etc.).
In our latest benchmarks (processing 500k records), Data-Genie used 100x less memory than standard array-based processing:
| Data Size | Naive Approach (Array-based) | Data-Genie (Streaming) |
|---|---|---|
| 100 KB | ~10 MB RAM | ~10 MB RAM |
| 100 MB | ~150 MB RAM | ~12 MB RAM |
| 10 GB | CRASH (OOM) | ~15 MB RAM |
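To make the comparison concrete, here is a minimal sketch of the array-based pattern from the left column (plain Node.js, not part of Data-Genie): the entire file is materialized in memory before any record is processed, so peak RAM grows with file size.

```ts
import { readFileSync } from 'node:fs';

// Naive array-based approach: the whole file (and the parsed array)
// must fit in RAM at once, so a 10 GB input will crash with OOM.
const lines = readFileSync('users.csv', 'utf8').split('\n');
const records = lines.map((line) => line.split(','));
console.log(`Loaded ${records.length} rows`);
```

Data-Genie's streaming readers instead hold only a small window of records in memory at a time, which is why the right column stays flat.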
- Streaming-First: Constant memory footprint regardless of file size (O(1) memory complexity).
- Multi-Format: Support for CSV, TSV, JSON, NDJSON, Parquet, Excel, and SQL.
- Transport Agnostic: Read/Write from Local Disk, AWS S3, HTTP APIs, or Memory.
- Fault Tolerant: Retries, Circuit Breakers, and Dead Letter Queues (DLQ).
- Event Emitters: Use Job events to build a monitoring UI for your ETL pipelines (see the sketch after this list).
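As a rough illustration of the event-driven monitoring idea, the sketch below wires a progress logger to a job. The instance-style `new Job(...)`, `job.on(...)`, `job.start()`, and the event names `'progress'` and `'error'` are assumptions for illustration, not confirmed API; check the documentation site for the actual event surface.

```ts
import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

// Hypothetical sketch: assumes Job instances are EventEmitters that fire
// 'progress' and 'error' events; the real event names may differ.
const job = new Job(new CSVReader('users.csv'), new JsonWriter('output.json'));

job.on('progress', ({ recordCount }: { recordCount: number }) => {
  process.stdout.write(`\rProcessed ${recordCount} records`);
});
job.on('error', (err: Error) => console.error('Job failed:', err));

await job.start(); // hypothetical instance counterpart to Job.run()
```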
Stream massive datasets directly from the cloud to your local machine.
```ts
import { S3Client } from '@aws-sdk/client-s3';
import { S3Source, ParquetReader, CSVWriter, Job } from '@pujansrt/data-genie';

const s3Client = new S3Client({}); // uses your standard AWS credential chain
const source = new S3Source(s3Client, 'mybucket', 'data/users.parquet');
const reader = new ParquetReader(source);
const writer = new CSVWriter('users.csv');

await Job.run(reader, writer);
```
Validate data in real-time and divert "poison" records to a Dead Letter Queue.
```ts
import { z } from 'zod';

const validator = new SchemaValidatingReader(reader, z.object({
  email: z.string().email(),
  age: z.number().min(18)
})).setDLQ(new JsonWriter('invalid_records.json'));

await Job.run(validator, new SQLWriter(db, 'users'));
```
Read once, transform, and write to multiple destinations in parallel.
```ts
const multiWriter = new MultiWriter(
  new ConsoleWriter(),
  new JsonWriter('processed.json'),
  new SQLWriter(db, 'audit_log')
);

await Job.run(pipeline, multiWriter);
```
See 15+ more recipes in our Cookbook.
Contributions are welcome, whether you're adding a new DataReader, fixing a bug, or improving the documentation:
- Check out our Contributing Guide.
- Look for Good First Issues.
- Submit a PR!
Want to see the performance difference on your own machine? We provide a built-in benchmark script that compares Data-Genie with a standard `fs.readFileSync` approach.
```bash
# Clone the repo and install dependencies
git clone https://github.com/pujansrt/data-genie.git
cd data-genie
npm install

# Run the benchmark
npx tsx benchmarks/run-benchmark.ts
```
MIT © Pujan Srivastava
