🧠 TransFi RAG Q&A System

Async-First Retrieval-Augmented Generation (Assignment Parts 1 & 2)

This project implements a Retrieval-Augmented Generation (RAG) pipeline to answer questions about TransFi’s products and solutions.
It uses asynchronous scraping, semantic embeddings, and vector-based retrieval to build a local knowledge base of TransFi’s website content.


🚀 Features

  • Async-first architecture — concurrent scraping & query processing (see the sketch after this list)
  • Website crawler for TransFi’s Products and Solutions pages
  • Text cleaning & chunking for structured ingestion
  • FAISS-based semantic search using Sentence Transformers
  • LLM-powered answer generation (HuggingFace or OpenAI)
  • Rich metrics logging for both ingestion and query phases
  • Modular design with utils/ for code reuse (ready for FastAPI Part 2)
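
To make "async-first" concrete, here is a minimal sketch of the concurrent-scraping pattern, assuming aiohttp; the function names are illustrative, not the repository's actual API.

import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # One page per coroutine; HTTP errors surface as exceptions.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.text()

async def scrape_all(urls: list[str]):
    # gather() runs every fetch concurrently instead of one request at a time.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls),
                                    return_exceptions=True)

pages = asyncio.run(scrape_all(["https://www.transfi.com/products"]))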

⚙️ Installation

1️⃣ Create and activate a virtual environment

python -m venv venv
source venv/bin/activate     # macOS/Linux
venv\Scripts\activate        # Windows

2️⃣ Install dependencies

pip install -r requirements.txt

Notes (Windows): faiss-cpu installs via pip. If PyTorch wheels fail, install a compatible version from https://pytorch.org and re-run the install.

🧩 Part 1 — Async-First RAG Scripts

🧱 Data Flow – ingest.py

Description: Scrapes TransFi’s “Products” and “Solutions” pages asynchronously, cleans the text, chunks it, generates embeddings, builds a FAISS index, and logs ingestion metrics.

Run

python ingest.py --url "https://www.transfi.com"

Configuration options

  • --url: Base site to crawl. Default: none (required when run as a script)
  • Data dirs: data/raw, data/clean, data/index (created automatically)
  • Embedding model: all-MiniLM-L6-v2 (in utils/embedding.py)
  • Chunking: max_len=500, overlap=50 (in utils/text_processing.py; see the sketch after this list)
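
The chunking defaults above imply a sliding window with overlap; a hypothetical sketch, assuming character-based windows (the actual utils/text_processing.py may differ):

def chunk_text(text: str, max_len: int = 500, overlap: int = 50) -> list[str]:
    # Sliding window: each chunk repeats the last `overlap` characters of
    # the previous one so sentences split at a boundary are not lost.
    step = max_len - overlap
    return [text[i:i + max_len] for i in range(0, len(text), step)
            if text[i:i + max_len].strip()]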

Sample output

=== Ingestion Metrics ===
Total Time: 45.2s
Pages Scraped: 23
Pages Failed: 2
Total Chunks Created: 456
Total Tokens Processed: 125,340
Indexing Time: 2.1s
Average Scraping Time per Page: 1.8s
Errors: None
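
For reference, the embed-and-index step described above can be sketched as follows, assuming sentence-transformers and faiss-cpu; the layout of utils/embedding.py is an assumption.

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def build_index(chunks: list[str]) -> faiss.IndexFlatL2:
    # Encode all chunks in one batch, then index them for exact L2 search.
    embeddings = model.encode(chunks, convert_to_numpy=True)
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index

# Persist alongside chunk metadata under data/index/, e.g.:
# faiss.write_index(index, "data/index/faiss.index")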

💬 Query Flow – query.py

Description: Retrieves relevant text chunks from the index, generates answers using an LLM, cites sources, and logs detailed query metrics.

Run (single question)

python query.py --question "What is BizPay and its key features?"

Run (batch questions)

python query.py --questions questions.txt

Run (concurrent batch)

python query.py --questions questions.txt --concurrent

Configuration options

  • --question: Single question string
  • --questions: Path to a text file (one question per line)
  • --concurrent: Run multiple questions concurrently (see the sketch after this list)
  • Index dir: data/index (in query.py)
  • LLM: gpt2 text-generation via transformers (in utils/llm_utils.py)
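
A hypothetical sketch of the retrieval step and the --concurrent batch path (the real query.py may structure this differently):

import asyncio
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(index: faiss.Index, question: str, k: int = 5) -> list[int]:
    # Embed the question and return the ids of the k nearest chunks.
    query_vec = model.encode([question], convert_to_numpy=True)
    _, ids = index.search(query_vec, k)
    return ids[0].tolist()

async def retrieve_batch(index: faiss.Index, questions: list[str]):
    # --concurrent maps onto asyncio.gather; each blocking retrieval runs
    # in a worker thread so the event loop stays responsive.
    tasks = [asyncio.to_thread(retrieve, index, q) for q in questions]
    return await asyncio.gather(*tasks)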

Sample output

Question: What is BizPay and its key features?

Answer:
BizPay enables businesses to process seamless cross-border payments...

Sources:
  1. BizPay - https://www.transfi.com/products/bizpay
     Snippet: "BizPay enables businesses to..."
  2. Solutions Overview - https://www.transfi.com/solutions
     Snippet: "Key features include..."

=== Query Metrics ===
Total Latency: 2.4s
Retrieval Time: 0.3s
LLM Time: 2.0s
Documents Retrieved: 5
Documents Used in Answer: 2
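
The answer itself comes from the gpt2 text-generation pipeline named in the options above; a minimal sketch, where the prompt format is an assumption:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def generate_answer(question: str, context: str) -> str:
    # Prepend retrieved context, then keep only the generated continuation.
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    output = generator(prompt, max_new_tokens=120)[0]["generated_text"]
    return output[len(prompt):].strip()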

🧩 Part 2 — FastAPI API + Webhook (Critical)

This adds a REST API around the RAG pipeline and a simple webhook receiver to demonstrate async callbacks.

Environment setup

  • Use the same virtual environment and dependencies from Part 1
  • Ensure the FAISS index exists (run ingest.py at least once) before using query endpoints

Exact run instructions

# Terminal 1
python webhook_receiver.py --port 8001

# Terminal 2
uvicorn api:app --port 8000

# Terminal 3 (trigger ingestion with webhook callback)
curl -X POST http://localhost:8000/api/ingest \
     -H "Content-Type: application/json" \
     -d '{"urls": ["https://www.transfi.com"], "callback_url": "http://localhost:8001/webhook"}'

You should see a webhook payload printed in the Terminal 1 window when ingestion completes.
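
For orientation, a hypothetical minimal receiver that behaves this way, assuming FastAPI and uvicorn (the actual webhook_receiver.py may be implemented differently):

import argparse
import uvicorn
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhook")
async def webhook(request: Request) -> dict:
    # Print whatever payload the API POSTs back on completion.
    print("Webhook received:", await request.json())
    return {"status": "received"}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8001)
    uvicorn.run(app, port=parser.parse_args().port)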

Endpoints

  • POST /api/ingest — body: { "urls": ["https://..."] , "callback_url": "http://.../webhook" }
    • Triggers background ingestion; immediately returns { "message": "Ingestion started" }
    • If callback_url is provided, a completion payload is POSTed to it (see the sketch after this list)
  • POST /api/query — body: { "question": "..." }
    • Returns { question, answer, sources[], metrics }
  • POST /api/query/batch — body: { "questions": ["...", "..."], "callback_url": "http://.../webhook" }
    • Executes questions concurrently; returns results and optionally sends them to callback_url
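
A sketch of the /api/ingest contract above, assuming FastAPI's BackgroundTasks and httpx for the callback; api.py's internals are not shown here and may differ.

import httpx
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class IngestRequest(BaseModel):
    urls: list[str]
    callback_url: str | None = None

async def run_ingestion(req: IngestRequest) -> None:
    # ... scrape, chunk, embed, index (see Part 1) ...
    metrics = {"status": "completed", "urls": req.urls}  # placeholder
    if req.callback_url:
        # POST the completion payload to the caller-supplied webhook,
        # with the 15s timeout noted in the configuration below.
        async with httpx.AsyncClient(timeout=15.0) as client:
            await client.post(req.callback_url, json={"metrics": metrics})

@app.post("/api/ingest")
async def ingest(req: IngestRequest, background: BackgroundTasks) -> dict:
    # Respond immediately; the heavy work runs after the response is sent.
    background.add_task(run_ingestion, req)
    return {"message": "Ingestion started"}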

Example commands — query flow

# With the webhook receiver (Terminal 1) and API (Terminal 2) from above
# still running, and ingestion triggered as shown, ask a question (Terminal 3)
curl -X POST http://localhost:8000/api/query \
     -H "Content-Type: application/json" \
     -d '{"question": "What is BizPay?"}'

# Or batch query with an optional webhook callback (Terminal 3)
curl -X POST http://localhost:8000/api/query/batch \
     -H "Content-Type: application/json" \
     -d '{"questions": ["What is BizPay?", "What are TransFi payouts?"], "callback_url": "http://localhost:8001/webhook"}'

Configuration options (API)

  • Server port: uvicorn api:app --port 8000
  • Index dir: data/index (configured in api.py)
  • Webhook timeout: 15s (in api.py send_webhook)
  • Retrieval top_k: 3 (in api.py process_question)

Sample webhook payload (ingestion completion)

{
  "metrics": {
    "status": "completed",
    "total_time": 42.13,
    "urls": ["https://www.transfi.com"]
  }
}
