
Data-Genie 🧞‍♂️

A high-performance, streaming-first ETL engine for Node.js and TypeScript, designed for processing massive datasets with a constant memory footprint.



Documentation & Examples

Visit our full documentation site for in-depth guides, API reference, and real-world recipes:

https://pujansrt.github.io/data-genie/


Installation

npm install @pujansrt/data-genie

Note: zod, @aws-sdk/client-s3, and exceljs are optional peer dependencies; install them only if you use schema validation, S3 transport, or Excel support.
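
For example, to add all three at once (install only the ones your pipelines actually use):

npm install zod @aws-sdk/client-s3 exceljs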

Quick Start

import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
const writer = new JsonWriter('output.json');

const metrics = await Job.run(reader, writer);
console.log(`Processed ${metrics.recordCount} records!`);

Preview (Dry Run)

Verify your transformations and filters instantly without writing any data.

// Inspect the first 5 records in a beautiful console table;
// `pipeline` is any configured reader chain
await Job.preview(pipeline);
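
A self-contained sketch reusing the Quick Start reader (any composed reader chain should work the same way):

import { CSVReader, Job } from '@pujansrt/data-genie';

const reader = new CSVReader('users.csv');
await Job.preview(reader); // prints a sample to the console; writes nothing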

Interactive Tools

Build your pipelines visually without writing code from scratch.


Why Data-Genie? (Performance Benchmark)

In our latest benchmarks (processing 500k records), Data-Genie used up to 100x less memory than standard array-based processing.

Data Size   Naive Approach (Array-based)   Data-Genie (Streaming)
100 KB      ~10 MB RAM                     ~10 MB RAM
100 MB      ~150 MB RAM                    ~12 MB RAM
10 GB       CRASH (OOM)                    ~15 MB RAM
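
The gap comes from how records are held in memory: a naive approach materializes the entire file in an array before writing, while a streaming reader touches one record at a time. A minimal Node.js illustration of the two patterns (generic code, not the library's internals):

import * as fs from 'node:fs';
import * as readline from 'node:readline';

// Naive: the whole file, plus the parsed array, must fit in RAM at once
const rows = fs.readFileSync('users.csv', 'utf8').split('\n');

// Streaming: only the current line is held, so memory stays flat
const rl = readline.createInterface({ input: fs.createReadStream('users.csv') });
for await (const line of rl) {
  console.log(line.length); // handle each record as it arrives (placeholder)
}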

Features

  • Streaming-First: Constant memory footprint regardless of file size (O(1) memory complexity).
  • Multi-Format: Support for CSV, TSV, JSON, NDJSON, Parquet, Excel, and SQL.
  • Transport Agnostic: Read/Write from Local Disk, AWS S3, HTTP APIs, or Memory.
  • Fault Tolerant: Retries, Circuit Breakers, and Dead Letter Queues (DLQ).
  • Event Emitters: Use Job events to build a monitoring UI for your ETL pipelines (see the sketch below).
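
A hedged sketch of event-based monitoring. The instance-style Job construction and the 'progress'/'error' event names below are illustrative assumptions, not confirmed API; see the documentation site for the actual event contract:

import { CSVReader, JsonWriter, Job } from '@pujansrt/data-genie';

// Assumption: a Job instance exposes Node-style .on() listeners
const job = new Job(new CSVReader('users.csv'), new JsonWriter('output.json'));
job.on('progress', (metrics) => console.log(`records so far: ${metrics.recordCount}`));
job.on('error', (err) => console.error('record failed:', err));
await job.run();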

Common Recipes

1. S3 Parquet to Local CSV

Stream massive datasets directly from the cloud to your local machine.

import { S3Source, ParquetReader, CSVWriter, Job } from '@pujansrt/data-genie';
import { S3Client } from '@aws-sdk/client-s3';

const s3Client = new S3Client({}); // region/credentials resolved from your environment
const source = new S3Source(s3Client, 'mybucket', 'data/users.parquet');
const reader = new ParquetReader(source);
const writer = new CSVWriter('users.csv');

await Job.run(reader, writer);

2. Schema Validation (Zod) + DLQ

Validate data in real-time and divert "poison" records to a Dead Letter Queue.

import { CSVReader, SchemaValidatingReader, JsonWriter, SQLWriter, Job } from '@pujansrt/data-genie';
import { z } from 'zod';

// `db` is your database handle (assumed to exist)
const reader = new CSVReader('users.csv');
const validator = new SchemaValidatingReader(reader, z.object({
    email: z.string().email(),
    age: z.number().min(18)
})).setDLQ(new JsonWriter('invalid_records.json'));

await Job.run(validator, new SQLWriter(db, 'users'));

3. Parallel Fan-out (Multi-Sink)

Read once, transform, and write to multiple destinations in parallel.

import { MultiWriter, ConsoleWriter, JsonWriter, SQLWriter, Job } from '@pujansrt/data-genie';

// `pipeline` is your configured reader chain; `db` is your database handle
const multiWriter = new MultiWriter(
  new ConsoleWriter(),
  new JsonWriter('processed.json'),
  new SQLWriter(db, 'audit_log')
);

await Job.run(pipeline, multiWriter);

See 15+ more recipes in our Cookbook


Contributing

Contributions are welcome, whether you're adding a new DataReader, fixing a bug, or improving documentation.

  1. Check out our Contributing Guide.
  2. Look for Good First Issues.
  3. Submit a PR!

Running Benchmarks

Want to see the performance difference on your own machine? We provide a built-in benchmark script that compares Data-Genie with a standard fs.readFileSync approach.

# Clone the repo and install dependencies
git clone https://github.com/pujansrt/data-genie.git
cd data-genie
npm install

# Run the benchmark
npx tsx benchmarks/run-benchmark.ts

License

MIT © Pujan Srivastava
