
Data-Genie 🧞‍♂️

A high-performance, streaming-first ETL engine for Node.js and TypeScript, designed for processing massive datasets with a constant memory footprint.



Documentation & Examples

Visit our full documentation site for in-depth guides, API reference, and real-world recipes:

https://pujansrt.github.io/data-genie/


Installation

npm install @pujansrt/data-genie

Note: zod, @aws-sdk/client-s3, and exceljs are optional peer dependencies; install them only if you use schema validation, S3 transport, or Excel support.
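
For example, to add all three at once (install only the ones your pipelines actually use):

npm install zod @aws-sdk/client-s3 exceljs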

Quick Start

import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
const writer = new JsonWriter('output.json');

const metrics = await Job.run(reader, writer);
console.log(`Processed ${metrics.recordCount} records!`);

Preview (Dry Run)

Verify your transformations and filters instantly without writing any data.

// Inspect the first 5 records in a beautiful console table;
// `pipeline` is any configured reader chain
await Job.preview(pipeline);
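
A self-contained sketch reusing the Quick Start reader (any composed reader chain should work the same way):

import { CSVReader, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
await Job.preview(reader); // prints a sample to the console; writes nothing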

Interactive Tools

Build your pipelines visually without writing code from scratch.


Why Data-Genie? (Performance Benchmark)

In our latest benchmarks (processing 500k records), Data-Genie used up to 100x less memory than standard array-based processing.

Data Size   Naive Approach (Array-based)   Data-Genie (Streaming)
100 KB      ~10 MB RAM                     ~10 MB RAM
100 MB      ~150 MB RAM                    ~12 MB RAM
10 GB       CRASH (OOM)                    ~15 MB RAM
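
The gap comes from how records are held in memory: a naive approach materializes the entire file in an array before writing, while a streaming reader touches one record at a time. A minimal Node.js illustration of the two patterns (generic code, not the library's internals):

import * as fs from 'node:fs';
import * as readline from 'node:readline';

// Naive: the whole file, plus the parsed array, must fit in RAM at once
const rows = fs.readFileSync('users.csv', 'utf8').split('\n');

// Streaming: only the current line is held, so memory stays flat
const rl = readline.createInterface({ input: fs.createReadStream('users.csv') });
for await (const line of rl) {
  console.log(line.length); // handle each record as it arrives (placeholder)
}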

Features

  • Streaming-First: Constant memory footprint regardless of file size (O(1) memory complexity).
  • Multi-Format: Support for CSV, TSV, JSON, NDJSON, Parquet, Excel, and SQL.
  • Transport Agnostic: Read/Write from Local Disk, AWS S3, HTTP APIs, or Memory.
  • Fault Tolerant: Retries, Circuit Breakers, and Dead Letter Queues (DLQ).
  • Event Emitters: Use Job events to build a monitoring UI for your ETL pipelines (see the sketch below).
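
A hedged sketch of event-based monitoring. The instance-style Job construction and the 'progress'/'error' event names below are illustrative assumptions, not confirmed API; see the documentation site for the actual event contract:

import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

// Assumption: a Job instance exposes Node-style .on() listeners
const job = new Job(new CSVReader('users.csv'), new JsonWriter('output.json'));
job.on('progress', (metrics) => console.log(`records so far: ${metrics.recordCount}`));
job.on('error', (err) => console.error('record failed:', err));
await job.run();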

Common Recipes

1. S3 Parquet to Local CSV

Stream massive datasets directly from the cloud to your local machine.

import { S3Source, ParquetReader, CSVWriter, Job } from '@pujansrt/data-genie';
import { S3Client } from '@aws-sdk/client-s3';

const s3Client = new S3Client({}); // region/credentials resolved from your environment
const source = new S3Source(s3Client, 'mybucket', 'data/users.parquet');
const reader = new ParquetReader(source);
const writer = new CSVWriter('users.csv');

await Job.run(reader, writer);

2. Schema Validation (Zod) + DLQ

Validate data in real-time and divert "poison" records to a Dead Letter Queue.

import { CSVReader, SchemaValidatingReader, JsonWriter, SQLWriter, Job } from '@pujansrt/data-genie';
import { z } from 'zod';

// `db` is your database handle (assumed to exist)
const reader = new CSVReader('users.csv');
const validator = new SchemaValidatingReader(reader, z.object({
    email: z.string().email(),
    age: z.number().min(18)
})).setDLQ(new JsonWriter('invalid_records.json'));

await Job.run(validator, new SQLWriter(db, 'users'));

3. Parallel Fan-out (Multi-Sink)

Read once, transform, and write to multiple destinations in parallel.

import { MultiWriter, ConsoleWriter, JsonWriter, SQLWriter, Job } from '@pujansrt/data-genie';

// `pipeline` is your configured reader chain; `db` is your database handle
const multiWriter = new MultiWriter(
  new ConsoleWriter(),
  new JsonWriter('processed.json'),
  new SQLWriter(db, 'audit_log')
);

await Job.run(pipeline, multiWriter);

See 15+ more recipes in our Cookbook


Contributing

Contributions are welcome, whether you're adding a new DataReader, fixing a bug, or improving documentation.

  1. Check out our Contributing Guide.
  2. Look for Good First Issues.
  3. Submit a PR!

Running Benchmarks

Want to see the performance difference on your own machine? We provide a built-in benchmark script that compares Data-Genie with a standard fs.readFileSync approach.

# Clone the repo and install dependencies
git clone https://github.com/pujansrt/data-genie.git
cd data-genie
npm install

# Run the benchmark
npx tsx benchmarks/run-benchmark.ts

License

MIT © Pujan Srivastava
