LakeRAG — RAG Lakehouse System (Josys Bootcamp)

A production-style Lakehouse pipeline for document ingestion, transformation, and RAG-ready retrieval.

Overview

LakeRAG is a Scala-first Lakehouse architecture designed to process unstructured documents into clean, queryable datasets used for Retrieval-Augmented Generation (RAG).

System components

- Delta Lake–based ETL (Raw → Silver → Gold)
- Airflow orchestration
- Embeddings + vector index
- FastAPI retrieval layer
- Dockerized deployment

All modules live together in this monorepo.

Monorepo structure

scala-etl/      # Scala + Spark + Delta ETL jobs
airflow/        # Airflow DAGs (upcoming)
vector-db/      # Embeddings + FAISS/Chroma (upcoming)
fastapi/        # Retrieval API (upcoming)
docker/         # Docker & Compose setup
data/
  raw/          # Input files
  silver/       # Cleaned Delta tables
  gold/         # Curated Delta tables

Current progress

✅ Initial Scala ETL complete
- Raw → Silver (cleaning, normalization)
- Silver → Gold (aggregation)
- Delta Lake enabled

Future components will be added PR by PR.
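
The Raw → Silver → Gold steps listed above might look roughly like the following with Spark and Delta Lake. This is only a sketch, not the code in scala-etl: the `EtlSketch` object, the `users.csv` input, the `id`/`name`/`country` columns, the output table names, and the `../data` root are all hypothetical. It also assumes that spark-sql and the Delta Lake library are on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lower, trim}

// Hypothetical sketch; names and columns are illustrative assumptions.
object EtlSketch {

  // Raw -> Silver: drop rows without an id, trim and normalize text, de-duplicate by id.
  def toSilver(raw: DataFrame): DataFrame =
    raw
      .filter(col("id").isNotNull)
      .withColumn("name", trim(col("name")))
      .withColumn("country", lower(trim(col("country"))))
      .dropDuplicates("id")

  // Silver -> Gold: aggregate to one row per country with a user count.
  def toGold(silver: DataFrame): DataFrame =
    silver.groupBy("country").agg(count("*").as("user_count"))

  def main(args: Array[String]): Unit = {
    // Standard session settings for enabling Delta Lake on Spark.
    val spark = SparkSession.builder()
      .appName("etl-sketch")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Assumed location of the data/ folder relative to scala-etl/.
    val dataRoot = "../data"

    val raw = spark.read.option("header", "true").csv(s"$dataRoot/raw/users.csv")
    val silver = toSilver(raw)
    silver.write.format("delta").mode("overwrite").save(s"$dataRoot/silver/users")

    toGold(silver).write.format("delta").mode("overwrite")
      .save(s"$dataRoot/gold/users_by_country")

    spark.stop()
  }
}
```

Keeping `toSilver` and `toGold` as plain `DataFrame => DataFrame` functions means they can be unit-tested against small in-memory frames without touching Delta storage.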

Running the ETL

Full details are in scala-etl/README.md. Example:

cd scala-etl
sbt "runMain example.UserETL"

Upcoming work

- Data quality rules
- Chunking for document-based RAG
- Embeddings + vector index
- FastAPI /search and /summarize endpoints
- Airflow orchestration
- End-to-end Docker Compose setup
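
As an illustration of the chunking item above, here is a minimal sketch of fixed-size character chunking with overlap, written in plain Scala. The `Chunker` object and its `size` and `overlap` parameters are hypothetical; the chunking strategy for this project has not been decided, and a token- or sentence-based splitter may fit better.

```scala
// Hypothetical sketch; not part of the repository.
object Chunker {

  /** Split `text` into chunks of at most `size` characters, where each
    * chunk starts `size - overlap` characters after the previous one.
    */
  def chunk(text: String, size: Int, overlap: Int): Vector[String] = {
    require(size > 0, "size must be positive")
    require(overlap >= 0 && overlap < size, "overlap must be in [0, size)")

    val step = size - overlap
    val out = Vector.newBuilder[String]
    var start = 0
    while (start < text.length) {
      val end = math.min(start + size, text.length)
      out += text.substring(start, end)
      // Stop once a chunk reaches the end, so no redundant tail chunk is emitted.
      start = if (end == text.length) text.length else start + step
    }
    out.result()
  }
}
```

For example, `Chunker.chunk("abcdefghij", size = 4, overlap = 1)` returns `Vector("abcd", "defg", "ghij")`. The overlap keeps some shared context between neighbouring chunks, which tends to help when a relevant passage straddles a chunk boundary.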

