Use Lilac to automatically cluster and sample diverse examples from your LLM traces for evaluation.
This is an example showing how to use Eval Protocol's utilities with Lilac for intelligent data curation.
When evaluating LLMs, running on all your production traces is expensive and often redundant—many queries are semantically similar. This integration:
- Pulls traces from Langfuse (or any supported observability platform)
- Clusters them semantically using embeddings + HDBSCAN
- Samples diverse examples from each cluster
- Evaluates the representative subset
Result: Instead of evaluating 1000 similar traces, you evaluate 30 diverse ones that cover all query types.
100 traces → Lilac clustering → 6 semantic groups → 12 diverse samples
# Clone this repo
git clone <repo-url>
cd lilac-eval-example
# Run setup script (creates venv and installs everything)
./setup.sh

cp env.template .env
# Edit .env with your keys

source .venv/bin/activate

pytest test_lilac_preprocessing.py -v -s

The key is the preprocess_fn parameter in DynamicDataLoader. This function receives ALL loaded rows and returns a filtered/transformed subset:
@evaluation_test(
    data_loaders=DynamicDataLoader(
        generators=[langfuse_traces_generator],
        preprocess_fn=lilac_cluster_and_sample,  # ← Your Lilac logic here!
    ),
    ...
)
def test_my_evaluation(row: EvaluationRow) -> EvaluationRow:
    return evaluate(row)

The preprocessing function itself looks like this:

def lilac_cluster_and_sample(rows: List[EvaluationRow]) -> List[EvaluationRow]:
"""
1. Convert to DataFrame (for Lilac compatibility)
2. Create Lilac dataset
3. Cluster on user queries
4. Sample from each cluster
5. Convert back to EvaluationRows
"""
import lilac as ll
# Step 1: Convert to DataFrame using eval-protocol utility
df = evaluation_rows_to_dataframe(rows)
df["user_query"] = df["messages_json"].apply(extract_first_user_message)
# Step 2: Create Lilac dataset
config = ll.DatasetConfig(
namespace="local",
name="my_dataset",
source=ll.PandasSource(df),
)
dataset = ll.create_dataset(config)
# Step 3: Cluster (Lilac handles embedding + UMAP + HDBSCAN)
dataset.cluster("user_query")
# Step 4: Sample diverse examples from each cluster
df = dataset.to_pandas(include_signals=True)
# ... sampling logic per cluster ...
# Step 5: Convert back using eval-protocol utility
    return dataframe_to_evaluation_rows(df)

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ User Queries │ ──▶ │  Embed with  │ ──▶ │     UMAP     │ ──▶ │   HDBSCAN    │
│    (text)    │     │ Transformers │     │ (dim reduce) │     │ (clustering) │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                                                                       │
                                                                       ▼
┌──────────────┐     ┌──────────────┐     ┌─────────────────────────────────┐
│   Output:    │ ◀── │ Sample N per │ ◀── │  Clusters with auto-generated   │
│ Diverse Set  │     │   cluster    │     │   titles (via LLM, optional)    │
└──────────────┘     └──────────────┘     └─────────────────────────────────┘
- Embeds each user query using sentence transformers (jina-embeddings-v2-small-en)
- Reduces dimensions with UMAP (512 → 5 dimensions)
- Clusters with HDBSCAN (automatically determines cluster count)
- Names clusters using an LLM (optional, requires the API_MODEL env var)
- Samples N examples from each cluster for diversity
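The extract_first_user_message helper used in Step 1 is defined elsewhere in the repo and not shown above. As a rough sketch of what it could look like, assuming messages_json holds each row's messages serialized as a JSON list of role/content dicts (the exact format depends on evaluation_rows_to_dataframe):

import json

def extract_first_user_message(messages_json: str) -> str:
    """Hypothetical helper: pull the first user-role message out of the
    serialized messages column so Lilac can cluster on the query text."""
    try:
        messages = json.loads(messages_json)
    except (TypeError, ValueError):
        return ""
    for message in messages:
        if message.get("role") == "user":
            return message.get("content") or ""
    return ""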
============================================================
🌸 LILAC PREPROCESSING
============================================================
📥 Input: 100 rows
🧮 Clustering user queries...
Method: Embed → UMAP → HDBSCAN
Cluster naming: LLM (gpt-4o-mini)
📊 Found 6 clusters:
--------------------------------------------------
Cluster 0 "Account Management Requests": 14 items
e.g., "Update phone number on account"
Cluster 1 "Order Returns and Refunds": 26 items
e.g., "ORD-54656 shipping status?"
Cluster 2 "Customer Service Inquiries": 17 items
e.g., "Recovery options change"
✅ Output: 12 diverse samples
Strategy: 2 per cluster, max 30 total
============================================================
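The "2 per cluster, max 30 total" strategy shown above corresponds to the sampling step elided in the function earlier. A minimal sketch with pandas, assuming the assigned cluster id ends up in a cluster_id column (the actual column name produced by dataset.cluster depends on your Lilac version, so adjust accordingly):

import pandas as pd

SAMPLES_PER_CLUSTER = 2
MAX_TOTAL_SAMPLES = 30

def sample_per_cluster(df: pd.DataFrame, cluster_col: str = "cluster_id") -> pd.DataFrame:
    """Take up to SAMPLES_PER_CLUSTER rows from each cluster, capped at
    MAX_TOTAL_SAMPLES overall. cluster_col is an assumed column name."""
    sampled = df.groupby(cluster_col, group_keys=False).apply(
        lambda g: g.sample(n=min(SAMPLES_PER_CLUSTER, len(g)), random_state=0)
    )
    return sampled.head(MAX_TOTAL_SAMPLES).reset_index(drop=True)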
This example uses several Eval Protocol utilities that enable the Lilac integration:
from eval_protocol.adapters.lilac import (
    evaluation_rows_to_dataframe,
    dataframe_to_evaluation_rows,
)
# Convert EvaluationRows → DataFrame (for Lilac/pandas processing)
df = evaluation_rows_to_dataframe(rows)
# ... do clustering, filtering, transformations with pandas/Lilac ...
# Convert DataFrame → EvaluationRows (back to eval-protocol format)
filtered_rows = dataframe_to_evaluation_rows(df)

To pull traces from an observability platform, create an adapter for it:

from eval_protocol import create_langfuse_adapter
# Create adapter for your platform
adapter = create_langfuse_adapter()
# Pull traces and convert to EvaluationRows
rows = adapter.get_evaluation_rows(
    limit=100,                # How many traces
    hours_back=168,           # Time window (7 days)
    include_tool_calls=True,  # Include function calls
)

Supported platforms:
- create_langfuse_adapter() - Langfuse
- create_langsmith_adapter() - LangSmith
- create_braintrust_adapter() - Braintrust
- create_fireworks_tracing_adapter() - Fireworks Tracing
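The langfuse_traces_generator passed to DynamicDataLoader in the first snippet is simply a function that wraps one of these adapters and returns EvaluationRows. A minimal sketch (import paths and the limit/time-window values are illustrative, not the repo's exact implementation):

from typing import List

from eval_protocol import EvaluationRow, create_langfuse_adapter

def langfuse_traces_generator() -> List[EvaluationRow]:
    # Pull recent traces from Langfuse; DynamicDataLoader calls this at load time.
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(
        limit=100,               # how many traces (LANGFUSE_LIMIT in the example test)
        hours_back=168,          # last 7 days
        include_tool_calls=True,
    )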
from eval_protocol import DynamicDataLoader
data_loader = DynamicDataLoader(
    generators=[my_data_generator],   # Functions that return EvaluationRows
    preprocess_fn=my_preprocess_fn,   # Transform rows before evaluation
)

Edit these constants in test_lilac_preprocessing.py:
SAMPLES_PER_CLUSTER = 2   # How many samples from each cluster
MAX_TOTAL_SAMPLES = 30    # Cap on total output rows
LANGFUSE_LIMIT = 100      # How many traces to pull from Langfuse

To pull rows from a different source, swap the adapter or loader function:

# Langfuse
adapter = create_langfuse_adapter()
rows = adapter.get_evaluation_rows(limit=100)
# LangSmith
adapter = create_langsmith_adapter()
rows = adapter.get_evaluation_rows(limit=100)
# From a JSONL file
import json

def load_from_file():
    with open("traces.jsonl") as f:
        return [EvaluationRow.from_dict(json.loads(line)) for line in f]

You can write any preprocessing logic:
import random

def my_custom_preprocess(rows: List[EvaluationRow]) -> List[EvaluationRow]:
    # Filter by length
    rows = [r for r in rows if len(r.last_user_message().content) > 10]

    # Deduplicate
    seen = set()
    unique = []
    for r in rows:
        key = r.last_user_message().content[:100]
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Random sample
    return random.sample(unique, min(50, len(unique)))

Beyond clustering, Lilac offers:
Semantic search:

dataset.search("user_query", "password reset", limit=10)

PII detection:

from lilac.signals import PIISignal
dataset.compute_signal(PIISignal(), "user_query")

Interactive UI:

import lilac as ll
ll.start_server()  # Interactive UI at localhost:5432

📚 Full Lilac Documentation: https://docs.lilacml.com/
Required:
LANGFUSE_PUBLIC_KEY=pk-lf-... # Langfuse public key
LANGFUSE_SECRET_KEY=sk-lf-... # Langfuse secret key
LANGFUSE_HOST=https://cloud.langfuse.com
FIREWORKS_API_KEY=fw_...          # For model evaluation

Optional (for LLM cluster naming):
OPENAI_API_KEY=sk-... # OpenAI API key
API_MODEL=gpt-4o-mini             # Model for naming clusters

Troubleshooting:
- Langfuse authentication errors: ensure .env has valid LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.
- Generic cluster names: set API_MODEL to enable LLM cluster naming; otherwise it falls back to generic names.
- Queries that don't fit neatly into a cluster: normal! Uncertain points are assigned to the nearest cluster automatically.
- Slow first run: the first run downloads the embedding model (~400MB); subsequent runs use the cache.
- Python 3.10+
- ~2GB disk space for embedding model cache
- API keys for trace source + evaluation model
MIT