
Commit aadaf82

Update readme (#214)
* add deepwiki badge
* update README
1 parent ca4dfa1 · commit aadaf82

File tree

1 file changed (+69 −88 lines)

README.md

Lines changed: 69 additions & 88 deletions
````diff
@@ -1,133 +1,114 @@
 # Eval Protocol (EP)

 [![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
+[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/eval-protocol/python-sdk)

-**The open-source toolkit for building your internal model leaderboard.**
+**Stop guessing which AI model to use. Build a data-driven model leaderboard.**

-When you have multiple AI models to choose from—different versions, providers,
-or configurations—how do you know which one is best for your use case?
+With hundreds of models and configs, you need objective data to choose the right one for your use case. EP helps you evaluate real traces, compare models, and visualize results locally.

 ## 🚀 Features

-- **Custom Evaluations**: Write evaluations tailored to your specific business needs
-- **Auto-Evaluation**: Stack-rank models using LLMs as judges with just model traces using out-of-the-box evaluators
-- **RL Environments via MCP**: Build reinforcement learning environments using the Model Control Protocol (MCP) to simulate user interactions and advanced evaluation scenarios
-- **Consistent Testing**: Test across various models and configurations with a unified framework
-- **Resilient Runtime**: Automatic retries for unstable LLM APIs and concurrent execution for long-running evaluations
-- **Rich Visualizations**: Built-in pivot tables and visualizations for result analysis
-- **Data-Driven Decisions**: Make informed model deployment decisions based on comprehensive evaluation results
+- **Pytest authoring**: `@evaluation_test` decorator to configure evaluations
+- **Robust rollouts**: Handles flaky LLM APIs and parallel execution
+- **Integrations**: Works with Langfuse, LangSmith, Braintrust, Responses API
+- **Agent support**: LangGraph and Pydantic AI
+- **MCP RL envs**: Build reinforcement learning environments with MCP
+- **Built-in benchmarks**: AIME, tau-bench
+- **LLM judge**: Stack-rank models using pairwise Arena-Hard-Auto
+- **Local UI**: Pivot/table views for real-time analysis

-## Quick Examples
+## ⚡ Quickstart (no labels needed)

-### Basic Model Comparison
+Install with your tracing platform extras and set API keys:

-Compare models on a simple formatting task:
-
-```python test_bold_format.py
-from eval_protocol.models import EvaluateResult, EvaluationRow, Message
-from eval_protocol.pytest import default_single_turn_rollout_processor, evaluation_test
+```bash
+pip install 'eval-protocol[langfuse]'

-@evaluation_test(
-    input_messages=[
-        [
-            Message(role="system", content="Use bold text to highlight important information."),
-            Message(role="user", content="Explain why evaluations matter for AI agents. Make it dramatic!"),
-        ],
-    ],
-    completion_params=[
-        {"model": "fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct"},
-        {"model": "openai/gpt-4"},
-        {"model": "anthropic/claude-3-sonnet"}
-    ],
-    rollout_processor=default_single_turn_rollout_processor,
-    mode="pointwise",
-)
-def test_bold_format(row: EvaluationRow) -> EvaluationRow:
-    """Check if the model's response contains bold text."""
-    assistant_response = row.messages[-1].content
+# Model API keys (set what you need)
+export OPENAI_API_KEY=...
+export FIREWORKS_API_KEY=...
+export GEMINI_API_KEY=...

-    if assistant_response is None:
-        row.evaluation_result = EvaluateResult(score=0.0, reason="No response")
-        return row
+# Platform keys
+export LANGFUSE_PUBLIC_KEY=...
+export LANGFUSE_SECRET_KEY=...
+export LANGFUSE_HOST=https://your-deployment.com # optional
+```

-    has_bold = "**" in str(assistant_response)
-    score = 1.0 if has_bold else 0.0
-    reason = "Contains bold text" if has_bold else "No bold text found"
+Minimal evaluation using the built-in AHA judge:

-    row.evaluation_result = EvaluateResult(score=score, reason=reason)
-    return row
-```
+```python
+from datetime import datetime
+import pytest
+
+from eval_protocol import (
+    evaluation_test,
+    aha_judge,
+    EvaluationRow,
+    SingleTurnRolloutProcessor,
+    DynamicDataLoader,
+    create_langfuse_adapter,
+)

-### Using Datasets

-Evaluate models on existing datasets:
+def langfuse_data_generator() -> list[EvaluationRow]:
+    adapter = create_langfuse_adapter()
+    return adapter.get_evaluation_rows(
+        to_timestamp=datetime.utcnow(),
+        limit=20,
+        sample_size=5,
+    )

-```python
-from eval_protocol.pytest import evaluation_test
-from eval_protocol.adapters.huggingface import create_gsm8k_adapter

-@evaluation_test(
-    input_dataset=["development/gsm8k_sample.jsonl"],  # Local JSONL file
-    dataset_adapter=create_gsm8k_adapter(),  # Adapter to convert data
-    completion_params=[
-        {"model": "openai/gpt-4"},
-        {"model": "anthropic/claude-3-sonnet"}
+@pytest.mark.parametrize(
+    "completion_params",
+    [
+        {"model": "openai/gpt-4.1"},
+        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
     ],
-    mode="pointwise"
 )
-def test_math_reasoning(row: EvaluationRow) -> EvaluationRow:
-    # Your evaluation logic here
-    return row
+@evaluation_test(
+    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
+    rollout_processor=SingleTurnRolloutProcessor(),
+)
+async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
+    return await aha_judge(row)
 ```

+Run it:

-## 📚 Resources
+```bash
+pytest -q -s
+```

-- **[Documentation](https://evalprotocol.io)** - Complete guides and API reference
-- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** - Community discussions
-- **[GitHub](https://github.com/eval-protocol/python-sdk)** - Source code and examples
+The pytest output includes local links for a leaderboard and row-level traces (pivot/table) at `http://localhost:8000`.

 ## Installation

-**This library requires Python >= 3.10.**
+This library requires Python >= 3.10.

-### Basic Installation
-
-Install with pip:
+### pip

 ```bash
 pip install eval-protocol
 ```

-### Recommended Installation with uv
-
-For better dependency management and faster installs, we recommend using [uv](https://docs.astral.sh/uv/):
+### uv (recommended)

 ```bash
-# Install uv if you haven't already
+# Install uv (if needed)
 curl -LsSf https://astral.sh/uv/install.sh | sh

-# Install eval-protocol
+# Add to your project
 uv add eval-protocol
 ```

-### Optional Dependencies
-
-Install with additional features:
-
-```bash
-# For Langfuse integration
-pip install 'eval-protocol[langfuse]'
-
-# For HuggingFace datasets
-pip install 'eval-protocol[huggingface]'
-
-# For all adapters
-pip install 'eval-protocol[adapters]'
+## 📚 Resources

-# For development
-pip install 'eval-protocol[dev]'
-```
+- **[Documentation](https://evalprotocol.io)** – Guides and API reference
+- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** – Community
+- **[GitHub](https://github.com/eval-protocol/python-sdk)** – Source and examples

 ## License
````