Advanced Retrieval Notebook Adaptation Guide for ComponentForge

Version: 1.0 Last Updated: 2025-10-07 Purpose: Adapt the "Advanced Retrieval with LangChain" notebook to work with ComponentForge shadcn/ui component patterns

Overview
Prerequisites
File Structure Setup
Task-by-Task Modifications
Activity 1: Evaluation
Golden Dataset
Expected Results
Troubleshooting

Overview

What's Changing?

The original notebook uses CSV data about projects. You'll adapt it to use ComponentForge's shadcn/ui component pattern library (10 JSON files containing Button, Card, Input, etc.).

Why This Approach?

✅ Educational: See all retrieval strategies in one place
✅ Self-contained: No need to modify your existing app
✅ Comparative: Easy to compare strategies side-by-side
✅ Portable: Can share notebook with others

What Stays the Same?

All retrieval strategy logic (Tasks 4-10)
RAG chain construction patterns
LCEL syntax and structure
Evaluation metrics (MRR, Hit@K)

What Changes?

Data source (CSV → JSON patterns)
Document structure and content
Test queries (projects → components)
Golden dataset (project domains → component patterns)

Prerequisites

1. Environment Setup

# Ensure you're in the backend directory
cd backend

# Activate virtual environment
source venv/bin/activate

# Install required packages
pip install langchain langchain-community langchain-openai langchain-cohere
pip install qdrant-client openai cohere
pip install langchain-experimental  # For semantic chunking
pip install pandas numpy  # For evaluation

2. API Keys Required

# Add to your .env or export directly
export OPENAI_API_KEY="your-openai-api-key"
export COHERE_API_KEY="your-cohere-api-key"

3. Data Files

Ensure these exist:

backend/data/patterns/
├── button.json
├── card.json
├── input.json
├── select.json
├── badge.json
├── alert.json
├── checkbox.json
├── radio.json
├── switch.json
└── tabs.json

4. Notebook Location

Create your notebook here:

backend/notebooks/retrieval_evaluation.ipynb

Or work in Google Colab and upload your patterns folder.

File Structure Setup

Directory Structure

component-forge/
├── backend/
│   ├── data/
│   │   ├── patterns/          # Your 10 JSON component files
│   │   └── evaluation/        # Golden dataset (you'll create)
│   │       └── golden_retrieval_tests.json
│   ├── notebooks/             # Create this folder
│   │   └── retrieval_evaluation.ipynb
│   └── docs/
│       └── NOTEBOOK_ADAPTATION_GUIDE.md  # This file

Create Notebook Directory

mkdir -p backend/notebooks
cd backend/notebooks
jupyter notebook
# Create new notebook: retrieval_evaluation.ipynb

Task-by-Task Modifications

Task 1: Dependencies (NO CHANGES)

Original Code: Keep as-is

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

Notes:

If running in Colab, you can hardcode keys (but don't commit!)
If running locally with .env, you can skip this cell

Task 2: Data Loading (MODIFIED)

❌ REMOVE Original Code:

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path=f"./data/Projects_with_Domains.csv",
    metadata_columns=[...]
)
synthetic_usecase_data = loader.load()

✅ REPLACE WITH ComponentForge Version:

## Task 2: Data Collection and Preparation (ComponentForge Version)

import json
from pathlib import Path
from langchain.docstore.document import Document

def load_component_patterns(patterns_dir="../data/patterns"):
    """
    Load shadcn/ui component patterns from JSON files.

    Each JSON file represents one component pattern with:
    - id, name, category, description
    - code (React TypeScript implementation)
    - metadata (variants, props, a11y features)

    Returns:
        List[Document]: LangChain documents with pattern content and metadata
    """
    patterns_path = Path(patterns_dir)
    documents = []

    if not patterns_path.exists():
        raise FileNotFoundError(
            f"Patterns directory not found: {patterns_path}\n"
            f"Please ensure you're running from the correct directory."
        )

    json_files = list(patterns_path.glob("*.json"))

    if not json_files:
        raise FileNotFoundError(
            f"No JSON files found in {patterns_path}\n"
            f"Expected files like button.json, card.json, etc."
        )

    for json_file in sorted(json_files):
        with open(json_file, 'r') as f:
            pattern = json.load(f)

        # Extract variant names
        variant_names = [v['name'] for v in pattern['metadata'].get('variants', [])]

        # Extract prop names and types
        prop_info = [
            f"{p['name']} ({p['type']})"
            for p in pattern['metadata'].get('props', [])
        ]

        # Extract a11y features
        a11y_features = pattern['metadata'].get('a11y', {}).get('features', [])

        # Create rich content for retrieval
        # This is what will be embedded and searched
        content = f"""
Component: {pattern['name']}
Category: {pattern['category']}
Description: {pattern['description']}
Framework: {pattern['framework']}
Library: {pattern['library']}

Variants Available: {', '.join(variant_names)}

Props: {', '.join(prop_info)}

Accessibility Features:
{chr(10).join(['- ' + feat for feat in a11y_features])}

Dependencies: {', '.join(pattern['metadata'].get('dependencies', []))}

Code Preview:
{pattern['code'][:500]}...
        """.strip()

        # Create LangChain Document with metadata
        doc = Document(
            page_content=content,
            metadata={
                "id": pattern["id"],
                "name": pattern["name"],
                "category": pattern["category"],
                "framework": pattern["framework"],
                "library": pattern["library"],
                "num_variants": len(variant_names),
                "num_props": len(prop_info),
                "source": str(json_file.name)
            }
        )
        documents.append(doc)

    return documents

# Load component patterns
component_patterns = load_component_patterns()

print(f"✅ Loaded {len(component_patterns)} component patterns")
print(f"📦 Components: {', '.join([d.metadata['name'] for d in component_patterns])}")

# View first pattern to verify structure
print("\n" + "="*60)
print("Sample Document (First Pattern):")
print("="*60)
print(f"Name: {component_patterns[0].metadata['name']}")
print(f"Category: {component_patterns[0].metadata['category']}")
print(f"\nContent Preview:")
print(component_patterns[0].page_content[:300] + "...")

Expected Output:

✅ Loaded 10 component patterns
📦 Components: Alert, Badge, Button, Card, Checkbox, Input, Radio, Select, Switch, Tabs

============================================================
Sample Document (First Pattern):
============================================================
Name: Alert
Category: feedback
...

Notes:

Adjust patterns_dir path based on your notebook location
If in Colab, upload patterns folder first
Content is rich to enable good semantic search

Task 3: Qdrant Setup (MODIFIED)

❌ REMOVE Original Code:

vectorstore = Qdrant.from_documents(
    synthetic_usecase_data,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecases"
)

✅ REPLACE WITH:

## Task 3: Setting up QDrant (ComponentForge Version)

from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings model (same as original)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store with component patterns
vectorstore = Qdrant.from_documents(
    component_patterns,  # ← Changed from synthetic_usecase_data
    embeddings,
    location=":memory:",
    collection_name="ComponentForge_Patterns"  # ← New collection name
)

print(f"✅ Created Qdrant vector store with {len(component_patterns)} patterns")
print(f"📊 Collection: ComponentForge_Patterns")
print(f"🔢 Embedding dimension: 1536 (text-embedding-3-small)")

Expected Output:

✅ Created Qdrant vector store with 10 patterns
📊 Collection: ComponentForge_Patterns
🔢 Embedding dimension: 1536 (text-embedding-3-small)

Task 4: Naive Retrieval (MODIFIED QUERIES)

Keep all code the same EXCEPT the test queries at the end.

❌ REMOVE Original Queries:

naive_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content
naive_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content
naive_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

✅ REPLACE WITH ComponentForge Queries:

# Test queries specific to component patterns

print("🔍 Testing Naive Retrieval with ComponentForge queries...\n")

# Query 1: Direct component name match
print("Q1: What is a button component with multiple visual styles?")
result1 = naive_retrieval_chain.invoke({
    "question": "What is a button component with multiple visual styles?"
})
print(f"A1: {result1['response'].content}\n")
print(f"📄 Retrieved {len(result1['context'])} documents\n")

# Query 2: Category-based search
print("Q2: What form components are available?")
result2 = naive_retrieval_chain.invoke({
    "question": "What form components are available?"
})
print(f"A2: {result2['response'].content}\n")

# Query 3: Feature-based search
print("Q3: Which components have accessibility features for keyboard navigation?")
result3 = naive_retrieval_chain.invoke({
    "question": "Which components have accessibility features for keyboard navigation?"
})
print(f"A3: {result3['response'].content}\n")

# Query 4: Variant-based search
print("Q4: Show me components with primary and secondary variants")
result4 = naive_retrieval_chain.invoke({
    "question": "Show me components with primary and secondary variants"
})
print(f"A4: {result4['response'].content}\n")

Expected Behavior:

Q1 should retrieve Button (exact match)
Q2 should retrieve Input, Select, Checkbox, Radio (category match)
Q3 should retrieve multiple components (semantic match on a11y)
Q4 should retrieve Button, Badge (variant match)

Task 5: BM25 (MODIFIED QUERIES)

Keep code the same, replace queries:

print("🔍 Testing BM25 Retrieval with ComponentForge queries...\n")

# BM25 excels at exact keyword matching
print("Q1: What is a button component?")
result1 = bm25_retrieval_chain.invoke({
    "question": "What is a button component?"
})
print(f"A1: {result1['response'].content}\n")

print("Q2: Find components in the 'form' category")
result2 = bm25_retrieval_chain.invoke({
    "question": "Find components in the 'form' category"
})
print(f"A2: {result2['response'].content}\n")

print("Q3: Which component has a 'destructive' variant?")
result3 = bm25_retrieval_chain.invoke({
    "question": "Which component has a 'destructive' variant?"
})
print(f"A3: {result3['response'].content}\n")

Answer Question #1:

After running, add your answer:

#### ❓ Question #1: BM25 vs Embeddings

**Example Query:** "Which component has a 'destructive' variant?"

**Why BM25 is better than embeddings for this query:**

1. **Exact Keyword Matching**: BM25 excels at finding the exact word "destructive"
   in the component metadata, while embeddings might match semantically similar
   terms like "dangerous", "warning", or "delete".

2. **Sparse Representation Advantage**: BM25 uses bag-of-words which directly
   matches the keyword "destructive" in the Button component's variants list.

3. **No Semantic Confusion**: Embeddings might retrieve Alert or Badge components
   because they have "warning" or "error" variants which are semantically similar
   to "destructive", but BM25 will only match the exact keyword.

4. **Other examples where BM25 outperforms for ComponentForge**:
   - "Find component with 'ghost' variant" (exact variant name)
   - "Which components use '@radix-ui/react-slot'?" (exact dependency)
   - "Show me components in 'shadcn/ui' library" (exact library name)

**When embeddings are better for ComponentForge**:
   - "A clickable element for user actions" (semantic: Button)
   - "Container for related information" (semantic: Card)
   - "Toggle between two states" (semantic: Switch/Checkbox)

Task 6: Contextual Compression (MODIFIED QUERIES)

Keep code the same, replace queries:

print("🔍 Testing Contextual Compression (Reranking) with ComponentForge queries...\n")

print("Q1: I need a button with visual feedback for actions")
result1 = contextual_compression_retrieval_chain.invoke({
    "question": "I need a button with visual feedback for actions"
})
print(f"A1: {result1['response'].content}\n")

print("Q2: Component for displaying status or categories")
result2 = contextual_compression_retrieval_chain.invoke({
    "question": "Component for displaying status or categories"
})
print(f"A2: {result2['response'].content}\n")

print("Q3: Form element with keyboard accessibility")
result3 = contextual_compression_retrieval_chain.invoke({
    "question": "Form element with keyboard accessibility"
})
print(f"A3: {result3['response'].content}\n")

Expected Behavior:

Reranking should improve precision by filtering top-10 to best top-3
Q2 should strongly prefer Badge over other components

Task 7: Multi-Query (MODIFIED QUERIES)

Keep code the same, replace queries and add logging:

# Enable logging to see generated query variations
import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s:%(name)s:%(message)s')

print("🔍 Testing Multi-Query Retrieval with ComponentForge queries...\n")

print("Q1: Interactive control for toggling states")
result1 = multi_query_retrieval_chain.invoke({
    "question": "Interactive control for toggling states"
})
print(f"A1: {result1['response'].content}\n")

print("Q2: Element for user input in forms")
result2 = multi_query_retrieval_chain.invoke({
    "question": "Element for user input in forms"
})
print(f"A2: {result2['response'].content}\n")

# Turn off logging for cleaner output
logging.basicConfig(level=logging.WARNING)

Answer Question #2:

#### ❓ Question #2: Multi-Query Reformulations

**How generating multiple reformulations improves recall for ComponentForge:**

1. **Terminology Variation**: Different reformulations capture various ways to
   describe the same component:
   - Original: "Interactive control for toggling states"
   - Variation 1: "Switch component for on/off functionality"
   - Variation 2: "Toggle button for binary choices"
   - Variation 3: "UI element for enabling/disabling features"

2. **Semantic Coverage**: Each reformulation might match different aspects of
   the component's description, metadata, or code, collectively increasing the
   chance of finding the right pattern.

3. **Synonym Expansion**: Reformulations naturally include synonyms:
   - "button" → "clickable element", "action trigger", "interactive control"
   - "input" → "text field", "form element", "data entry component"

4. **Real Example from ComponentForge**:
   - Query: "Element for user input in forms"
   - Reformulation 1: "Text input component for data entry" → matches Input
   - Reformulation 2: "Form field for user information" → matches Input, Select
   - Reformulation 3: "Keyboard accessible input control" → matches Input, Checkbox
   - Combined results have higher recall than single query

5. **Reduces False Negatives**: If one query phrasing misses the correct component,
   alternative reformulations provide backup paths to finding it.

Task 8: Parent-Document (MODIFIED)

Keep most code, but adjust the child splitting strategy:

Original code works, but you can optimize for component patterns:

## Task 8: Parent Document Retriever (ComponentForge Version)

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models
from langchain_qdrant import QdrantVectorStore

# Parent documents are full component patterns
parent_docs = component_patterns

# Child splitter - adjusted for component structure
# Components have natural sections: description, variants, props, code, a11y
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # Smaller chunks for component metadata
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Create new Qdrant collection for child chunks
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="component_patterns_children",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="component_patterns_children",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    client=client
)

# In-memory store for parent documents
store = InMemoryStore()

# Create retriever
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

# Add component patterns
parent_document_retriever.add_documents(parent_docs, ids=None)

print(f"✅ Created Parent-Document Retriever")
print(f"📄 Parent documents: {len(parent_docs)}")
print(f"🔍 Child chunks indexed in vector store")

# Test queries (same pattern as before)
print("\n🔍 Testing Parent-Document Retrieval...\n")

print("Q1: Component with size variants")
result1 = parent_document_retrieval_chain.invoke({
    "question": "Component with size variants"
})
print(f"A1: {result1['response'].content}\n")

Expected Behavior:

Child chunks might match on specific features (e.g., "size: sm")
Returns full parent pattern (complete component JSON content)

Task 9: Ensemble (NO CHANGES)

Keep all code exactly as-is.

The ensemble combines all previous retrievers:

from langchain.retrievers import EnsembleRetriever

retriever_list = [
    bm25_retriever,
    naive_retriever,
    parent_document_retriever,
    compression_retriever,
    multi_query_retriever
]

equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list,
    weights=equal_weighting
)

Replace queries with ComponentForge versions:

print("🔍 Testing Ensemble Retrieval (All Strategies Combined)...\n")

print("Q1: Accessible form component")
result1 = ensemble_retrieval_chain.invoke({
    "question": "Accessible form component"
})
print(f"A1: {result1['response'].content}\n")

print("Q2: Component for user actions with variants")
result2 = ensemble_retrieval_chain.invoke({
    "question": "Component for user actions with variants"
})
print(f"A2: {result2['response'].content}\n")

Task 10: Semantic Chunking (MODIFIED)

Adjust for component patterns:

## Task 10: Semantic Chunking (ComponentForge Version)

from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75  # Adjust threshold for components
)

# For components, semantic chunking might break at:
# - Description → Variants
# - Variants → Props
# - Props → Code
# - Code → Accessibility

# Apply to all component patterns
print("🔧 Applying semantic chunking to component patterns...")
semantic_documents = semantic_chunker.split_documents(component_patterns)

print(f"✅ Original documents: {len(component_patterns)}")
print(f"✅ After semantic chunking: {len(semantic_documents)}")
print(f"📊 Average chunks per component: {len(semantic_documents) / len(component_patterns):.1f}")

# Create new vector store
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="ComponentForge_SemanticChunks"
)

semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 10})

# Create chain
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Test
print("\n🔍 Testing Semantic Chunking Retrieval...\n")

print("Q1: Button variants")
result1 = semantic_retrieval_chain.invoke({
    "question": "Button variants"
})
print(f"A1: {result1['response'].content}\n")

Answer Question #3:

#### ❓ Question #3: Semantic Chunking with Repetitive Content

**If component patterns have short, repetitive sections (e.g., similar variant structures),
semantic chunking might:**

1. **Over-chunk**: Create too many small chunks because similarity scores are all high
   - Example: "default", "primary", "secondary" variants have similar semantic patterns

2. **Under-chunk**: Keep everything together if descriptions are too similar
   - Example: All buttons described as "clickable element" might not create boundaries

**Adjustments for ComponentForge:**

1. **Increase threshold**: Use `breakpoint_threshold_amount=90` (higher percentile)
   to only break at stronger semantic shifts

2. **Use different breakpoint type**: Try `gradient` instead of `percentile` to
   detect rate of change in semantic similarity

3. **Custom splitting logic**: Manually split on structural markers:
   ```python
   separators = ["Variants Available:", "Props:", "Accessibility Features:", "Code Preview:"]

Combine with RecursiveCharacterTextSplitter: Use semantic chunking for overall structure, then character-based for code sections

For ComponentForge specifically:

Components already have natural structure (metadata sections)
Semantic chunking may provide limited benefit over simple structural splitting
Best use case: Very long component descriptions or extensive usage examples


---

## Activity 1: Evaluation (COMPLETE REWRITE)

This is the most important section. Here's the complete evaluation implementation:

### Step 1: Create Golden Dataset

First, create this file manually:

**File:** `backend/data/evaluation/golden_retrieval_tests.json`

```json
[
  {
    "test_id": "button_exact",
    "query": "Button component",
    "expected_component": "Button",
    "expected_id": "shadcn-button",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "button_semantic",
    "query": "Clickable element with multiple visual styles",
    "expected_component": "Button",
    "expected_id": "shadcn-button",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "button_variant",
    "query": "Component with destructive variant",
    "expected_component": "Button",
    "expected_id": "shadcn-button",
    "difficulty": "medium",
    "type": "variant_match"
  },
  {
    "test_id": "card_exact",
    "query": "Card component",
    "expected_component": "Card",
    "expected_id": "shadcn-card",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "card_semantic",
    "query": "Container for grouping related content",
    "expected_component": "Card",
    "expected_id": "shadcn-card",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "card_feature",
    "query": "Component with header and footer sections",
    "expected_component": "Card",
    "expected_id": "shadcn-card",
    "difficulty": "hard",
    "type": "feature_match"
  },
  {
    "test_id": "input_exact",
    "query": "Input component",
    "expected_component": "Input",
    "expected_id": "shadcn-input",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "input_semantic",
    "query": "Text field for user data entry",
    "expected_component": "Input",
    "expected_id": "shadcn-input",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "input_feature",
    "query": "Form element with validation support",
    "expected_component": "Input",
    "expected_id": "shadcn-input",
    "difficulty": "medium",
    "type": "feature_match"
  },
  {
    "test_id": "select_exact",
    "query": "Select component",
    "expected_component": "Select",
    "expected_id": "shadcn-select",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "select_semantic",
    "query": "Dropdown menu for choosing options",
    "expected_component": "Select",
    "expected_id": "shadcn-select",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "select_feature",
    "query": "Component for selecting one option from multiple choices",
    "expected_component": "Select",
    "expected_id": "shadcn-select",
    "difficulty": "hard",
    "type": "semantic_match"
  },
  {
    "test_id": "badge_exact",
    "query": "Badge component",
    "expected_component": "Badge",
    "expected_id": "shadcn-badge",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "badge_semantic",
    "query": "Small label for status indication",
    "expected_component": "Badge",
    "expected_id": "shadcn-badge",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "badge_variant",
    "query": "Component with success, warning, and error variants",
    "expected_component": "Badge",
    "expected_id": "shadcn-badge",
    "difficulty": "hard",
    "type": "variant_match"
  },
  {
    "test_id": "alert_exact",
    "query": "Alert component",
    "expected_component": "Alert",
    "expected_id": "shadcn-alert",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "alert_semantic",
    "query": "Component for displaying important messages",
    "expected_component": "Alert",
    "expected_id": "shadcn-alert",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "alert_feature",
    "query": "Notification banner with icon support",
    "expected_component": "Alert",
    "expected_id": "shadcn-alert",
    "difficulty": "hard",
    "type": "feature_match"
  },
  {
    "test_id": "checkbox_exact",
    "query": "Checkbox component",
    "expected_component": "Checkbox",
    "expected_id": "shadcn-checkbox",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "checkbox_semantic",
    "query": "Binary selection control",
    "expected_component": "Checkbox",
    "expected_id": "shadcn-checkbox",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "checkbox_feature",
    "query": "Form control with checked state",
    "expected_component": "Checkbox",
    "expected_id": "shadcn-checkbox",
    "difficulty": "hard",
    "type": "feature_match"
  },
  {
    "test_id": "radio_exact",
    "query": "Radio component",
    "expected_component": "Radio",
    "expected_id": "shadcn-radio",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "radio_semantic",
    "query": "Single choice selector from group",
    "expected_component": "Radio",
    "expected_id": "shadcn-radio",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "radio_feature",
    "query": "Mutually exclusive option selector",
    "expected_component": "Radio",
    "expected_id": "shadcn-radio",
    "difficulty": "hard",
    "type": "semantic_match"
  },
  {
    "test_id": "switch_exact",
    "query": "Switch component",
    "expected_component": "Switch",
    "expected_id": "shadcn-switch",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "switch_semantic",
    "query": "Toggle control for on/off states",
    "expected_component": "Switch",
    "expected_id": "shadcn-switch",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "switch_feature",
    "query": "Interactive toggle with visual feedback",
    "expected_component": "Switch",
    "expected_id": "shadcn-switch",
    "difficulty": "hard",
    "type": "feature_match"
  },
  {
    "test_id": "tabs_exact",
    "query": "Tabs component",
    "expected_component": "Tabs",
    "expected_id": "shadcn-tabs",
    "difficulty": "easy",
    "type": "exact_match"
  },
  {
    "test_id": "tabs_semantic",
    "query": "Navigation between content panels",
    "expected_component": "Tabs",
    "expected_id": "shadcn-tabs",
    "difficulty": "medium",
    "type": "semantic_match"
  },
  {
    "test_id": "tabs_feature",
    "query": "Component for organizing content in separate views",
    "expected_component": "Tabs",
    "expected_id": "shadcn-tabs",
    "difficulty": "hard",
    "type": "semantic_match"
  }
]

Step 2: Evaluation Code

Add this as a new section in your notebook:

# 🤝 Breakout Room Part #2 - ComponentForge Evaluation
## Activity #1: Evaluate All Retrieval Strategies

import json
import time
import numpy as np
import pandas as pd
from pathlib import Path

# Load golden dataset
golden_dataset_path = Path("../data/evaluation/golden_retrieval_tests.json")

if not golden_dataset_path.exists():
    raise FileNotFoundError(
        f"Golden dataset not found at {golden_dataset_path}\n"
        "Please create this file using the template in NOTEBOOK_ADAPTATION_GUIDE.md"
    )

with open(golden_dataset_path, 'r') as f:
    golden_dataset = json.load(f)

print(f"✅ Loaded {len(golden_dataset)} test cases")
print(f"📊 Breakdown by difficulty:")
for difficulty in ['easy', 'medium', 'hard']:
    count = sum(1 for t in golden_dataset if t['difficulty'] == difficulty)
    print(f"   - {difficulty}: {count} tests")

### Evaluation Functions

def calculate_mrr(results, expected_id):
    """
    Mean Reciprocal Rank: 1/rank of first correct result

    Returns:
        float: 1.0 if expected_id is rank 1, 0.5 if rank 2, 0.33 if rank 3, etc.
               0.0 if not found
    """
    for rank, doc in enumerate(results, 1):
        if doc.metadata.get('id') == expected_id:
            return 1.0 / rank
    return 0.0

def calculate_hit_at_k(results, expected_id, k=3):
    """
    Hit@K: Is correct result in top-K?

    Returns:
        float: 1.0 if found in top-K, 0.0 otherwise
    """
    top_k_ids = [doc.metadata.get('id') for doc in results[:k]]
    return 1.0 if expected_id in top_k_ids else 0.0

def calculate_hit_at_1(results, expected_id):
    """
    Hit@1: Is correct result the top result?

    Returns:
        float: 1.0 if rank 1, 0.0 otherwise
    """
    if results and results[0].metadata.get('id') == expected_id:
        return 1.0
    return 0.0

### Evaluate Single Retriever

def evaluate_retriever(retriever_name, retriever, golden_dataset, verbose=True):
    """
    Evaluate a single retriever on the golden dataset.

    Args:
        retriever_name: Name for display
        retriever: LangChain retriever instance
        golden_dataset: List of test cases
        verbose: Print progress

    Returns:
        dict: Evaluation results with metrics
    """
    if verbose:
        print(f"\n{'='*70}")
        print(f"Evaluating: {retriever_name}")
        print(f"{'='*70}")

    mrr_scores = []
    hit_at_1_scores = []
    hit_at_3_scores = []
    latencies = []
    errors = []

    for i, test in enumerate(golden_dataset):
        if verbose and (i + 1) % 5 == 0:
            print(f"  Progress: {i + 1}/{len(golden_dataset)} tests...")

        try:
            # Measure latency
            start = time.time()
            retrieved_docs = retriever.invoke(test["query"])
            latency = (time.time() - start) * 1000  # Convert to ms

            # Calculate metrics
            mrr = calculate_mrr(retrieved_docs, test["expected_id"])
            hit_at_1 = calculate_hit_at_1(retrieved_docs, test["expected_id"])
            hit_at_3 = calculate_hit_at_k(retrieved_docs, test["expected_id"], k=3)

            mrr_scores.append(mrr)
            hit_at_1_scores.append(hit_at_1)
            hit_at_3_scores.append(hit_at_3)
            latencies.append(latency)

        except Exception as e:
            errors.append({"test_id": test["test_id"], "error": str(e)})
            # Add zero scores for failed tests
            mrr_scores.append(0.0)
            hit_at_1_scores.append(0.0)
            hit_at_3_scores.append(0.0)
            latencies.append(0.0)

    # Calculate aggregate metrics
    results = {
        "Retriever": retriever_name,
        "MRR": round(np.mean(mrr_scores), 3),
        "Hit@1": round(np.mean(hit_at_1_scores), 3),
        "Hit@3": round(np.mean(hit_at_3_scores), 3),
        "Avg Latency (ms)": round(np.mean(latencies), 1),
        "Total Tests": len(golden_dataset),
        "Errors": len(errors)
    }

    if verbose:
        print(f"  ✅ Completed: {results['MRR']:.3f} MRR, {results['Hit@3']:.1%} Hit@3, {results['Avg Latency (ms)']:.0f}ms")
        if errors:
            print(f"  ⚠️  Errors: {len(errors)} test(s) failed")

    return results, errors

### Run Evaluation on All Retrievers

print("\n" + "="*70)
print("Starting Comprehensive Retrieval Evaluation")
print("="*70)

retrievers_to_test = {
    "1. Naive (Semantic)": naive_retriever,
    "2. BM25": bm25_retriever,
    "3. Multi-Query": multi_query_retriever,
    "4. Parent-Document": parent_document_retriever,
    "5. Contextual Compression (Rerank)": compression_retriever,
    "6. Ensemble (RRF)": ensemble_retriever,
    "7. Semantic Chunking": semantic_retriever
}

all_results = []
all_errors = {}

for name, retriever in retrievers_to_test.items():
    result, errors = evaluate_retriever(name, retriever, golden_dataset, verbose=True)
    all_results.append(result)
    if errors:
        all_errors[name] = errors

### Display Results

results_df = pd.DataFrame(all_results)
results_df = results_df.sort_values("MRR", ascending=False)

print("\n" + "="*80)
print("ComponentForge Retrieval Evaluation Results")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

### Detailed Analysis by Difficulty

print("\n" + "="*80)
print("Performance Breakdown by Query Difficulty")
print("="*80)

for difficulty in ['easy', 'medium', 'hard']:
    difficulty_tests = [t for t in golden_dataset if t['difficulty'] == difficulty]
    print(f"\n{difficulty.upper()} queries ({len(difficulty_tests)} tests):")

    for name, retriever in retrievers_to_test.items():
        result, _ = evaluate_retriever(
            name,
            retriever,
            difficulty_tests,
            verbose=False
        )
        print(f"  {name}: MRR={result['MRR']:.3f}, Hit@3={result['Hit@3']:.1%}")

### Cost Analysis

print("\n" + "="*80)
print("Cost Analysis (per 100 queries)")
print("="*80)

cost_estimates = {
    "1. Naive (Semantic)": 0.01,  # Embedding cost only
    "2. BM25": 0.00,  # No API calls
    "3. Multi-Query": 0.05,  # 3-5 LLM calls per query
    "4. Parent-Document": 0.01,  # Embedding cost only
    "5. Contextual Compression (Rerank)": 0.20,  # Cohere reranking
    "6. Ensemble (RRF)": 0.25,  # All above combined
    "7. Semantic Chunking": 0.01  # Embedding cost only
}

cost_df = results_df.copy()
cost_df["Cost per 100 Queries ($)"] = [cost_estimates[r] for r in cost_df["Retriever"]]
cost_df["Cost per Query ($)"] = cost_df["Cost per 100 Queries ($)"] / 100

print(cost_df[["Retriever", "MRR", "Hit@3", "Avg Latency (ms)", "Cost per 100 Queries ($)"]].to_string(index=False))

### Final Analysis and Recommendation

print("\n" + "="*80)
print("ANALYSIS & RECOMMENDATION")
print("="*80)

best_by_mrr = results_df.iloc[0]
best_by_hit3 = results_df.sort_values("Hit@3", ascending=False).iloc[0]
fastest = results_df.sort_values("Avg Latency (ms)", ascending=True).iloc[0]

analysis = f"""
📊 EVALUATION SUMMARY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Dataset: {len(golden_dataset)} test queries across {len(component_patterns)} component patterns
Breakdown: {sum(1 for t in golden_dataset if t['difficulty']=='easy')} easy,
           {sum(1 for t in golden_dataset if t['difficulty']=='medium')} medium,
           {sum(1 for t in golden_dataset if t['difficulty']=='hard')} hard queries

🏆 BEST PERFORMERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Highest Accuracy (MRR): {best_by_mrr['Retriever']}
  ├─ MRR: {best_by_mrr['MRR']:.3f}
  ├─ Hit@3: {best_by_mrr['Hit@3']:.1%}
  └─ Latency: {best_by_mrr['Avg Latency (ms)']:.0f}ms

Best Top-3 Accuracy: {best_by_hit3['Retriever']}
  ├─ Hit@3: {best_by_hit3['Hit@3']:.1%}
  ├─ MRR: {best_by_hit3['MRR']:.3f}
  └─ Latency: {best_by_hit3['Avg Latency (ms)']:.0f}ms

Fastest Retrieval: {fastest['Retriever']}
  ├─ Latency: {fastest['Avg Latency (ms)']:.0f}ms
  ├─ MRR: {fastest['MRR']:.3f}
  └─ Cost: ${cost_estimates[fastest['Retriever']]:.4f}/query

💰 COST-PERFORMANCE TRADE-OFFS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Most Cost-Effective:
  - BM25: FREE, {results_df[results_df['Retriever']=='2. BM25']['MRR'].values[0]:.3f} MRR, {results_df[results_df['Retriever']=='2. BM25']['Avg Latency (ms)'].values[0]:.0f}ms
  - Good for exact keyword matches (component names, variant names)

Best Value (Performance/Cost):
  - Naive Semantic: $0.0001/query, {results_df[results_df['Retriever']=='1. Naive (Semantic)']['MRR'].values[0]:.3f} MRR
  - Good for semantic understanding, low cost

Premium Option (Highest Accuracy):
  - {best_by_mrr['Retriever']}: ${cost_estimates[best_by_mrr['Retriever']]:.4f}/query, {best_by_mrr['MRR']:.3f} MRR
  - Worth it if accuracy is critical

🎯 RECOMMENDATION FOR COMPONENTFORGE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current State: {len(component_patterns)} component patterns (small corpus)

Best Strategy: [DETERMINE BASED ON YOUR RESULTS]

Reasoning:
1. Performance: [Which strategy had best MRR/Hit@3?]
2. Cost: [Is the performance gain worth the cost?]
3. Latency: [Is latency acceptable for your use case?]
4. Corpus Size: With only {len(component_patterns)} patterns, simpler strategies may suffice

Alternative Consideration:
- If expanding to 50+ patterns, re-evaluate reranking and ensemble strategies
- Current corpus is small enough that BM25 + Semantic fusion likely optimal

Production Deployment:
- Use: [YOUR CHOSEN STRATEGY]
- Fallback: BM25 for offline/low-cost scenarios
- Monitor: Hit@3 rate in production, adjust if < 90%

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"""

print(analysis)

### Save Results

results_output_path = Path("../data/evaluation/retrieval_evaluation_results.json")
results_output_path.parent.mkdir(parents=True, exist_ok=True)

output_data = {
    "evaluation_date": pd.Timestamp.now().isoformat(),
    "corpus_size": len(component_patterns),
    "test_cases": len(golden_dataset),
    "results": all_results,
    "errors": all_errors,
    "cost_estimates": cost_estimates
}

with open(results_output_path, 'w') as f:
    json.dump(output_data, f, indent=2)

print(f"\n✅ Results saved to: {results_output_path}")

Expected Results

What Good Performance Looks Like

For a 10-pattern corpus, expect:

Retriever	MRR	Hit@3	Latency	Cost/100
BM25	0.65-0.75	85-90%	15-30ms	$0
Naive Semantic	0.70-0.85	90-95%	150-200ms	$0.01
Multi-Query	0.75-0.90	92-97%	400-600ms	$0.05
Compression/Rerank	0.80-0.95	95-98%	200-300ms	$0.20
Ensemble	0.75-0.90	93-97%	500-700ms	$0.25

Key Insights:

Easy queries (exact name): All retrievers should hit ~100%
Medium queries (semantic): Naive/Compression excel
Hard queries (ambiguous): Multi-Query and Ensemble help

Interpreting MRR

MRR = 1.0: Perfect (correct answer always rank 1)
MRR = 0.7: Good (average rank ~1.4)
MRR = 0.5: Acceptable (average rank 2)
MRR < 0.3: Poor (average rank > 3)

Interpreting Hit@3

Hit@3 > 95%: Excellent retrieval
Hit@3 85-95%: Good retrieval
Hit@3 < 85%: Needs improvement

Troubleshooting

Issue 1: "Patterns directory not found"

Solution:

# Adjust path based on notebook location
component_patterns = load_component_patterns("../../backend/data/patterns")

# Or use absolute path
import os
patterns_dir = os.path.join(os.getcwd(), "backend", "data", "patterns")
component_patterns = load_component_patterns(patterns_dir)

Issue 2: "API rate limit exceeded"

Solution:

import time

# Add delays between API calls
for test in golden_dataset:
    result = retriever.invoke(test["query"])
    time.sleep(0.5)  # 500ms delay

Issue 3: "Multi-Query takes too long"

Solution:

# Reduce number of query variations
# In MultiQueryRetriever, LLM generates ~3-5 queries by default
# You can't easily control this, but you can reduce test dataset size for quick tests

# Quick test with subset
quick_golden_dataset = golden_dataset[:10]  # Test with 10 instead of 30

Issue 4: "Cohere API errors"

Solution:

# Verify API key
import os
print(f"Cohere API Key set: {bool(os.getenv('COHERE_API_KEY'))}")

# Test with small batch first
compression_retriever.invoke("Button component")  # Should work if key is valid

Issue 5: "Low MRR scores across all retrievers"

Possible causes:

Incorrect expected_id in golden dataset: Verify IDs match pattern JSONs exactly
Poor document content: Ensure patterns loaded correctly (check Task 2 output)
Embedding issues: Verify OpenAI API key and embeddings working

Debug:

# Check what's actually being retrieved
test = golden_dataset[0]
results = naive_retriever.invoke(test["query"])

print(f"Query: {test['query']}")
print(f"Expected: {test['expected_id']}")
print(f"Retrieved IDs: {[d.metadata['id'] for d in results[:3]]}")

# If expected_id not in top-3, something is wrong with:
# - Document content (Task 2)
# - Expected IDs in golden dataset
# - Embeddings quality

Summary Checklist

Before running evaluation, ensure:

All 10 component pattern JSONs loaded successfully
Qdrant vectorstores created (3 total: main, parent-doc children, semantic chunks)
All 7 retrievers initialized without errors
Golden dataset JSON created with 30 test cases
API keys set (OPENAI_API_KEY, COHERE_API_KEY)
Dependencies installed (langchain, cohere, pandas, numpy)
Evaluation output directory exists: backend/data/evaluation/

Run the notebook top-to-bottom and you should get complete evaluation results!

Next Steps After Evaluation

Analyze Results: Which retriever performed best for your patterns?
Optimize Chosen Strategy: Fine-tune parameters (k, weights, thresholds)
Integrate into App: Use winning strategy in RetrievalService
Monitor Production: Track Hit@3 with real user queries
Re-evaluate at Scale: When corpus grows to 50+ patterns, re-run evaluation

Questions or Issues? Refer to ComponentForge docs or LangChain documentation for advanced customization.

FilesExpand file tree

notebook-guide.md

Latest commit

History

notebook-guide.md

File metadata and controls

Advanced Retrieval Notebook Adaptation Guide for ComponentForge

Table of Contents

Overview

What's Changing?

Why This Approach?

What Stays the Same?

What Changes?

Prerequisites

1. Environment Setup

2. API Keys Required

3. Data Files

4. Notebook Location

File Structure Setup

Directory Structure

Create Notebook Directory

Task-by-Task Modifications

Task 1: Dependencies (NO CHANGES)

Task 2: Data Loading (MODIFIED)

Task 3: Qdrant Setup (MODIFIED)

Task 4: Naive Retrieval (MODIFIED QUERIES)

Task 5: BM25 (MODIFIED QUERIES)

Task 6: Contextual Compression (MODIFIED QUERIES)

Task 7: Multi-Query (MODIFIED QUERIES)

Task 8: Parent-Document (MODIFIED)

Task 9: Ensemble (NO CHANGES)

Task 10: Semantic Chunking (MODIFIED)

Step 2: Evaluation Code

Expected Results

What Good Performance Looks Like

Interpreting MRR

Interpreting Hit@3

Troubleshooting

Issue 1: "Patterns directory not found"

Issue 2: "API rate limit exceeded"

Issue 3: "Multi-Query takes too long"

Issue 4: "Cohere API errors"

Issue 5: "Low MRR scores across all retrievers"

Summary Checklist

Next Steps After Evaluation