This repository implements a framework for evaluating large language models (LLMs) using judge models. The system consists of two main components:
- Test LLMs - These models generate responses based on user queries.
- Judge LLMs - These models evaluate the accuracy and quality of test model responses.
This implementation is based on the paper "FACTS Grounding: A New Benchmark for Evaluating the Factuality of Large Language Models". You can also refer to the related DeepMind blog post for a more detailed discussion of the methodology and findings.
- Utilizes LangChain for making API calls to various LLM providers.
- Supports local model inference through Ollama (this implementation uses Llama 3.2).
- Calls test models to generate responses for user queries.
- Uses judge models to assess the test model responses against provided context documents.
- Supports multiple judge models for a more reliable evaluation.
- Leverages LangSmith for logging results, tracing, and performance monitoring.
- Saves results in JSONL format for easy analysis.
Ensure you have Python installed. Then, install required dependencies:
```bash
pip install -r requirements.txt
```
Make sure to define the environment variables or configuration required for model initialization. This can be done by modifying `utils.py` and `models.yaml`.
The application requires:
- LangChain environment variables for API connections
- LangSmith API key for result logging and tracing
- Google Cloud credentials for Vertex AI (if using Google models)
You can set these up in a `.env` file at the root of the project.
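A minimal `.env` sketch is shown below. The exact variable names depend on how `utils.py` loads configuration; the tracing variables shown are the standard LangChain/LangSmith ones, and the Google credentials path applies only if you use Vertex AI models.

```env
# Standard LangChain / LangSmith tracing variables (verify against utils.py)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=<your-langsmith-api-key>
LANGCHAIN_PROJECT=<your-project-name>
# Only needed when using Google models through Vertex AI
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```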
This implementation uses Llama 3.2 through Ollama for local inference. To use this or any other model locally:
- Install Ollama from ollama.ai
- Pull the desired model:

  ```bash
  ollama pull llama3.2
  ```
- Configure the model in your `models.yaml` file.
You can replace `llama3.2` with any other model supported by Ollama, according to your requirements.
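As a quick sanity check that the pulled model is reachable through LangChain, you can run a one-off call. This sketch assumes the `langchain-ollama` package is installed and the Ollama server is running locally:

```python
# Quick check that the locally pulled model responds via LangChain.
# Assumes `pip install langchain-ollama` and a running Ollama server.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2")  # any model name you have pulled
reply = llm.invoke("Reply with one word: ping")
print(reply.content)
```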
To execute the evaluation pipeline:
```bash
python main.py
```
- `LLM` class: Uses LangChain to call a specified model and retrieve its response.
- `JudgeLLM` class: Uses a list of judge models to evaluate test-model outputs.
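The snippet below is a minimal sketch of how such a pair of wrappers could be structured; the constructor arguments, method names, and prompt template are assumptions for illustration, not the repository's actual code:

```python
# A hypothetical sketch of the two wrappers; see the repository source
# for the real implementations.
from langchain_ollama import ChatOllama

class LLM:
    """Wraps one LangChain chat model and returns plain-text responses."""
    def __init__(self, model_name: str):
        self.model = ChatOllama(model=model_name)

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt).content

class JudgeLLM:
    """Runs every judge model over a single test-model response."""
    def __init__(self, judge_names: list[str]):
        self.judges = [LLM(name) for name in judge_names]

    def evaluate(self, context: str, query: str, response: str) -> list[str]:
        prompt = (
            f"Context:\n{context}\n\nQuery: {query}\n\nResponse: {response}\n\n"
            "Judge whether the response is fully grounded in the context."
        )
        return [judge.generate(prompt) for judge in self.judges]
```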
- Loads test and judge models from `models.yaml`.
- Iterates through sample prompts in `examples.csv`.
- Calls test models to generate responses via LangChain.
- Uses judge models to evaluate the responses.
- Logs interactions and results using LangSmith.
- Stores the results in JSONL format.
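For the JSONL step, each record is written as one JSON object per line. A sketch of what that might look like (the field names here are assumptions, not the repository's actual schema):

```python
import json

# Append one evaluation record per line (field names are illustrative).
record = {
    "query": "What is the capital of France?",
    "response": "Paris.",
    "judgements": ["accurate", "grounded"],
}
with open("judge_responses.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```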
- `test_responses.jsonl`: Stores responses from test models.
- `judge_responses.jsonl`: Stores evaluations from judge models.
- LangSmith dashboard: Contains detailed traces and analytics of model calls.
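Because each file holds one JSON object per line, the results are easy to load for analysis, for example with pandas (assuming it is installed):

```python
import pandas as pd

# Each line of the JSONL file becomes one row in the DataFrame.
judgements = pd.read_json("judge_responses.jsonl", lines=True)
print(judgements.head())
```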
- Modify `models.yaml` to change test and judge models (a hypothetical layout is sketched after this list).
- Update `examples.csv` with new sample queries.
- Adjust `evaluation_prompts.csv` for different evaluation criteria.
- Configure LangSmith settings for custom tracing and evaluation.
- Add different Ollama models by pulling them and updating the configuration.
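As a reference for the first point, here is a hypothetical `models.yaml` layout; the actual schema depends on how `utils.py` parses the file, so treat the keys and model names below as assumptions:

```yaml
# Hypothetical layout; check utils.py for the keys the loader expects.
test_models:
  - name: llama3.2
    provider: ollama
judge_models:
  - name: llama3.2
    provider: ollama
  - name: gemini-1.5-pro
    provider: vertexai
```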
This project is licensed under the MIT License. See the LICENSE file for more details.