A comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems using a custom golden dataset and a FAISS vector database. The project measures how well an LLM answers questions of varying type and difficulty under different retrieval strategies.
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                            RAG EVALUATION SYSTEM                             │
└──────────────────────────────────────────────────────────────────────────────┘

┌──────────────────┐     ┌──────────────────┐     ┌────────────────────────┐
│  Golden Dataset  │     │   Article URLs   │     │    FAISS Vector DB     │
│ (100 Q&A pairs)  │────▶│  (Source Links)  │────▶│ (Scraped Article Text) │
│                  │     │                  │     │                        │
│ • Questions      │     │  Scraping via    │     │ • Dense Embeddings     │
│ • Ground Truth   │     │  httpx + Claude  │     │ • BM25 Index           │
│ • Source IDs     │     │  text extraction │     │ • Full Article Content │
└──────────────────┘     └──────────────────┘     └────────────────────────┘
                                                              │
                                                              ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                               RETRIEVAL LAYER                                │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────────┐   │
│  │  Dense Search   │  │  Sparse Search  │  │       Hybrid Search         │   │
│  │  (Embeddings)   │  │     (BM25)      │  │  (0.7 Dense + 0.3 Sparse)   │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                               GENERATION LAYER                               │
│                                                                              │
│   Retrieved Context (Top-K=3) ───▶ Claude API ───▶ Generated Answer          │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                               EVALUATION LAYER                               │
│                                                                              │
│  ┌────────────┐  ┌────────────┐  ┌───────────────┐  ┌─────────────────────┐  │
│  │   Hit@K    │  │  Recall@K  │  │ Token Overlap │  │ Factual Consistency │  │
│  │ (Retrieval)│  │ (Retrieval)│  │   (Answer)    │  │      (Custom)       │  │
│  └────────────┘  └────────────┘  └───────────────┘  └─────────────────────┘  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                             STREAMLIT DASHBOARD                              │
│                                                                              │
│   Experiment Results  │  Interactive Query  │  Detailed Results              │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```
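The generation layer in the diagram can be sketched as below. The prompt template and variable names are illustrative assumptions, not the project's actual code; the commented-out client call follows the `anthropic` Python SDK message shape and requires an API key.

```python
# Sketch of the generation step: top-K retrieved articles are packed into a
# prompt for Claude. The prompt template here is an assumption; the real
# template lives in the project's experiment code.

def build_prompt(question: str, contexts: list[str]) -> str:
    """Join retrieved passages into a single grounded-QA prompt."""
    context_block = "\n\n".join(
        f"[Document {i + 1}]\n{text}" for i, text in enumerate(contexts)
    )
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context_block}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How does EBWT remediate PFAS?",
    ["Electron beam water treatment uses...", "PFAS are persistent..."],
)

# Sending the prompt (requires ANTHROPIC_API_KEY; model name from config.py):
# import anthropic
# client = anthropic.Anthropic()
# reply = client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=512,
#     messages=[{"role": "user", "content": prompt}],
# )
# answer = reply.content[0].text
```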
The golden dataset contains 100 question-answer pairs sourced from real news articles and publications. Questions are categorized by type and difficulty level.
- **Factual**: Questions with specific, verifiable answers (names, numbers, dates, facts)
- **Conceptual**: Questions requiring understanding of ideas, theories, or relationships
- **Procedural**: Questions about processes, methods, or how things work
Direct questions seeking specific information that can be verified against the source.
Example:
"How many countries will have immigrant visa processing suspended according to the State Department announcement?"
Answer: "The State Department announced it will suspend immigrant visa processing for nationals of 75 countries..."
Questions requiring deeper understanding and explanation of ideas or relationships.
Example:
"What is the 'barbell economy' concept and how does it relate to economic recession predictions?"
Answer: "The barbell economy describes an economic structure where wealth is concentrated at the two extremes - the wealthy and the poor - while the middle class shrinks..."
Questions about processes, mechanisms, or step-by-step methods.
Example:
"How does electron beam water treatment (EBWT) work to remediate PFAS contamination?"
Answer: "Electron beam water treatment uses a compact, high-average-power superconducting radio-frequency accelerator to generate electron beams that are directed at contaminated water..."
Each entry in `golden_dataset.jsonl` contains:

| Field | Type | Description |
|---|---|---|
| `id` | integer | Unique identifier for the question |
| `difficulty` | string | Difficulty level: `"easy"`, `"medium"`, or `"hard"` |
| `question_type` | string | Category: `"factual"`, `"conceptual"`, or `"procedural"` |
| `question` | string | The evaluation question |
| `answer` | string | Ground truth answer used for evaluation |
| `source_articles` | array | List of source article references |
| `source_articles[].article_id` | string | URL of the source article |
| `source_articles[].title` | string | Title of the source article |
| `source_articles[].source` | string | Publication name |
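A hedged sketch of how such entries might be loaded and validated; the sample record below is hypothetical and only mirrors the schema documented in the table, and the loader is illustrative rather than the project's own code.

```python
import json

# Hypothetical record matching the golden_dataset.jsonl schema above.
sample_line = json.dumps({
    "id": 1,
    "difficulty": "easy",
    "question_type": "factual",
    "question": "How many countries ...?",
    "answer": "The State Department announced ...",
    "source_articles": [
        {"article_id": "https://example.com/article",
         "title": "Example Title",
         "source": "Example News"}
    ],
})

def load_golden_dataset(lines):
    """Parse JSONL lines and check the fields the evaluator relies on."""
    required = {"id", "difficulty", "question_type",
                "question", "answer", "source_articles"}
    entries = []
    for line in lines:
        entry = json.loads(line)
        missing = required - entry.keys()
        if missing:
            raise ValueError(f"entry {entry.get('id')} missing: {missing}")
        entries.append(entry)
    return entries

dataset = load_golden_dataset([sample_line])
```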
The system uses FAISS (Facebook AI Similarity Search) for efficient vector storage and retrieval.
1. **URL Extraction**: Article URLs are extracted from the golden dataset's `source_articles` field
2. **Web Scraping**: Articles are fetched using `httpx` with proper headers and rate limiting
3. **Text Extraction**: Claude is used as an intelligent agent to extract clean article text from HTML, removing navigation, ads, and sidebars. Falls back to BeautifulSoup if Claude extraction fails.
4. **Embedding Generation**: Article text is converted to dense vectors using the `all-MiniLM-L6-v2` sentence transformer model (384 dimensions)
5. **Index Creation**:
   - **FAISS Index**: Dense embeddings stored in a flat L2 index
   - **BM25 Index**: Tokenized text stored for sparse keyword search
   - **Metadata**: Full article text, title, source, URL, and summary stored in a pickle file
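What the flat L2 index computes can be illustrated in pure Python: an exhaustive squared-Euclidean search over all stored vectors. The real pipeline calls FAISS (`faiss.IndexFlatL2(384)`) on `all-MiniLM-L6-v2` embeddings; the 3-dimensional toy vectors here are for illustration only.

```python
# Pure-Python sketch of a flat L2 search: compare the query against every
# stored vector and return the indices of the top_k closest ones.

def l2_search(index_vectors, query, top_k=3):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    scored = sorted(range(len(index_vectors)),
                    key=lambda i: sq_dist(index_vectors[i], query))
    return scored[:top_k]

vectors = [
    [0.0, 0.0, 1.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.9, 0.1, 0.0],   # doc 2
]
hits = l2_search(vectors, [1.0, 0.0, 0.0], top_k=2)
# doc 2 is closest to the query, so it ranks first
```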
| File | Description |
|---|---|
| `data/faiss_index` | FAISS vector index (binary) |
| `data/article_metadata.pkl` | Article metadata including full text |
To build the index, run:

```shell
python scripts/ingest_articles_to_faiss.py
```

The system evaluates RAG performance using 4 key metrics: 3 standard metrics and 1 custom metric.
Purpose: Measures if at least one relevant document was retrieved in the top-K results.
Formula:
```
Hit@K = 1 if |Retrieved_K ∩ Relevant| > 0 else 0
```
Interpretation: Binary metric (0 or 1). A score of 1 means the retriever successfully found at least one relevant document.
Purpose: Measures what fraction of all relevant documents were retrieved in the top-K results.
Formula:
```
Recall@K = |Retrieved_K ∩ Relevant| / |Relevant|
```
Interpretation: Ranges from 0 to 1. Higher values indicate better coverage of relevant documents.
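Both retrieval metrics follow directly from the formulas above. In this sketch documents are identified by their article URLs; the identifiers are illustrative.

```python
# Hit@K: did we retrieve at least one relevant document?
def hit_at_k(retrieved_k, relevant):
    return 1 if set(retrieved_k) & set(relevant) else 0

# Recall@K: what fraction of the relevant documents did we retrieve?
def recall_at_k(retrieved_k, relevant):
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved_k) & relevant) / len(relevant)

retrieved = ["url_a", "url_b", "url_c"]   # top-3 results
relevant = ["url_b", "url_d"]             # ground-truth sources

hit = hit_at_k(retrieved, relevant)        # url_b was found
recall = recall_at_k(retrieved, relevant)  # 1 of 2 relevant docs
```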
Purpose: Measures lexical similarity between generated answer and ground truth using token-level precision, recall, and F1.
Formula:
```
Precision = |Generated_tokens ∩ Truth_tokens| / |Generated_tokens|
Recall    = |Generated_tokens ∩ Truth_tokens| / |Truth_tokens|
F1        = 2 × (Precision × Recall) / (Precision + Recall)
```
Interpretation: Ranges from 0 to 1. Higher values indicate more word overlap with ground truth.
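A minimal implementation of the token-overlap F1, assuming a lowercased whitespace tokenizer and set-based overlap; the tokenizer in `metrics.py` may differ.

```python
# Token-level F1 between a generated answer and the ground truth.
def token_f1(generated: str, truth: str) -> float:
    gen = set(generated.lower().split())
    ref = set(truth.lower().split())
    if not gen or not ref:
        return 0.0
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)   # shared tokens / generated tokens
    recall = overlap / len(ref)      # shared tokens / truth tokens
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "visa processing suspended for 75 countries",
    "suspended visa processing for nationals of 75 countries",
)
```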
Purpose: Measures how well the generated answer is grounded in the retrieved context and aligned with ground truth, helping detect hallucinations.
Implementation:
- Tokenize generated answer, retrieved context, and ground truth
- Remove common stopwords to focus on content words
- Calculate what fraction of generated answer tokens appear in either context or ground truth
Formula:
```
Factual_Consistency = |Generated_content ∩ (Context_content ∪ Truth_content)| / |Generated_content|
```
Interpretation: Ranges from 0 to 1. Higher values indicate the answer is well-grounded in sources rather than hallucinated.
Why This Metric:
- Standard metrics like BLEU/ROUGE focus on n-gram overlap with ground truth only
- This metric also considers the retrieved context, rewarding answers that use information from retrieved documents
- Helps identify when the model generates plausible but unsupported information
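The three implementation steps above can be sketched as follows. The stopword list is a small illustrative subset, not the one used in `metrics.py`, and the tokenizer is a plain whitespace split.

```python
# Factual consistency: fraction of content words in the generated answer that
# are supported by the retrieved context or the ground truth.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "in", "for", "and"}

def content_tokens(text: str) -> set[str]:
    """Lowercase, split on whitespace, drop stopwords."""
    return {t for t in text.lower().split() if t not in STOPWORDS}

def factual_consistency(generated: str, context: str, truth: str) -> float:
    gen = content_tokens(generated)
    if not gen:
        return 0.0
    support = content_tokens(context) | content_tokens(truth)
    return len(gen & support) / len(gen)

score = factual_consistency(
    generated="visa processing suspended for 75 countries",
    context="the state department will suspend visa processing",
    truth="visa processing suspended for nationals of 75 countries",
)
```

Every content word in the generated answer appears in the context or the ground truth, so this answer scores 1.0; an answer full of unsupported words would score near 0.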
Install dependencies:

```shell
pip install -r requirements.txt
```

Build the FAISS index:

```shell
python scripts/ingest_articles_to_faiss.py
```

Run the experiments:

```shell
cd src
python run_experiments.py
```

Launch the dashboard:

```shell
streamlit run ui/app.py
```

Project layout:

```
eval_system/
├── data/
│   ├── golden_dataset.jsonl            # 100 Q&A pairs with source articles
│   ├── faiss_index                     # FAISS vector index (generated)
│   ├── article_metadata.pkl            # Article metadata (generated)
│   └── results/                        # Experiment results (generated)
├── scripts/
│   ├── ingest_articles_to_faiss.py     # Article scraping & indexing
│   └── query_faiss.py                  # CLI query tool
├── src/
│   ├── config.py                       # Configuration settings
│   ├── faiss_retriever.py              # FAISS retrieval module
│   ├── metrics.py                      # Evaluation metrics
│   ├── hybrid_retrieval_experiment.py
│   ├── context_window_experiment.py
│   └── run_experiments.py              # Main experiment runner
├── ui/
│   └── app.py                          # Streamlit dashboard
├── .env                                # API keys (not in git)
├── .gitignore
├── requirements.txt
└── README.md
```
Key settings in `src/config.py`:

| Setting | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence transformer for embeddings |
| `LLM_MODEL` | `claude-sonnet-4-20250514` | Claude model for generation |
| `DEFAULT_TOP_K` | `3` | Number of documents to retrieve |
| `SPARSE_WEIGHT` | `0.3` | BM25 weight in hybrid search |
| `DENSE_WEIGHT` | `0.7` | Embedding weight in hybrid search |
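The hybrid weights combine dense and sparse scores into one ranking. The sketch below assumes min-max normalization before weighting; whether `faiss_retriever.py` normalizes this way is an assumption, though the 0.7/0.3 weights come from the config above.

```python
# Hybrid scoring sketch: bring dense and BM25 scores onto a common [0, 1]
# scale, then combine with the configured weights.
DENSE_WEIGHT, SPARSE_WEIGHT = 0.7, 0.3

def minmax(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(dense_scores, bm25_scores, top_k=3):
    d, s = minmax(dense_scores), minmax(bm25_scores)
    combined = [DENSE_WEIGHT * di + SPARSE_WEIGHT * si
                for di, si in zip(d, s)]
    return sorted(range(len(combined)),
                  key=lambda i: combined[i], reverse=True)[:top_k]

# doc 1 wins on dense similarity, doc 2 on BM25; the dense weight dominates
ranking = hybrid_rank([0.2, 0.9, 0.4], [1.0, 2.0, 8.0], top_k=2)
```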