
Commit becf508

error9098x and luarss authored
feat: add script-based auto-evaluation with Streamlit analysis (#114)
* feat: add script-based auto-evaluation with Streamlit analysis
* fix mypy/ruff checks
* fix core functionality, update readme and shift script_based requirements to top-level

---------

Signed-off-by: error9098x <[email protected]>
Co-authored-by: Jack Luar <[email protected]>
1 parent e882c5c commit becf508

18 files changed: +1083 −5 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,7 @@ venv/
 # docs
 documents.txt
 credentials.json
+creds.json

 # virtualenv
 .venv
@@ -29,3 +30,4 @@ credentials.json
 **/.deepeval-cache.json
 temp_test_run_data.json
 **/llm_tests_output.txt
+**/error_log.txt

evaluation/auto_evaluation/eval_main.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
     make_hallucination_metric,
 )
 from auto_evaluation.dataset import hf_pull, preprocess
-from tqdm import tqdm  # type: ignore
+from tqdm import tqdm

 eval_root_path = os.path.join(os.path.dirname(__file__), "..")
 load_dotenv(dotenv_path=os.path.join(eval_root_path, ".env"))

evaluation/human_evaluation/main.py

Lines changed: 7 additions & 3 deletions
@@ -2,9 +2,13 @@
 from dotenv import load_dotenv
 import os

-from utils.sheets import read_questions_and_answers, write_responses, find_new_questions
-from utils.api import fetch_endpoints, get_responses
-from utils.utils import (
+from human_evaluation.utils.sheets import (
+    read_questions_and_answers,
+    write_responses,
+    find_new_questions,
+)
+from human_evaluation.utils.api import fetch_endpoints, get_responses
+from human_evaluation.utils.utils import (
     parse_custom_input,
     selected_questions,
     update_gform,

evaluation/pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -21,7 +21,7 @@ dependencies = { file = ["requirements.txt"] }
 optional-dependencies = { test = { file = ["requirements-test.txt"] } }

 [tool.setuptools.packages.find]
-include = ["auto_evaluation", "human_evaluation"]
+include = ["auto_evaluation", "human_evaluation", "script_based_evaluation"]

 [tool.mypy]
 python_version = "3.12"
@@ -30,6 +30,7 @@ warn_return_any = true
 warn_unused_ignores = true
 strict_optional = true
 disable_error_code = ["call-arg"]
+explicit_package_bases = true
 exclude = "src/post_install.py"

 [[tool.mypy.overrides]]

evaluation/requirements-test.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ mypy==1.10.1
 ruff==0.5.1
 types-requests==2.32.0.20240622
 google-api-python-client-stubs==1.28.0
+types-tqdm==4.67.0.20241221

evaluation/requirements.txt

Lines changed: 5 additions & 0 deletions
@@ -13,3 +13,8 @@ langchain-google-vertexai==2.0.6
 asyncio==3.4.3
 huggingface-hub==0.26.2
 instructor[vertexai]==1.5.2
+openai==1.58.1
+pydantic==2.10.4
+tqdm==4.67.1
+vertexai==1.71.1
+plotly==5.24.1
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+GOOGLE_APPLICATION_CREDENTIALS={{GOOGLE_APPLICATION_CREDENTIALS}}
+OPENAI_API_KEY={{OPENAI_API_KEY}}
Lines changed: 134 additions & 0 deletions

# ORAssistant Automated Evaluation

This project automates the evaluation of language model responses using classification-based metrics and LLMScore. It supports testing against various models, including OpenAI and Google Vertex AI. It also serves as an evaluation benchmark for comparing multiple versions of ORAssistant.
## Features

1. **Classification-based Metrics**:
   - Categorizes responses into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
   - Computes metrics such as Accuracy, Precision, Recall, and F1 Score (see the sketch after this list).

2. **LLMScore**:
   - Assigns a score between 0 and 1 by comparing the generated response against the ground truth for quality and accuracy.
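For reference, the classification metrics follow the standard confusion-matrix definitions. The sketch below is only illustrative (the function name and the example counts are hypothetical, not taken from the evaluation code):

```python
# Illustrative only: standard confusion-matrix metrics as described above.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(classification_metrics(tp=8, tn=5, fp=1, fn=2))
```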
## Setup

### Environment Variables

Create a `.env` file in the root directory with the following variables:

```plaintext
GOOGLE_APPLICATION_CREDENTIALS=path/to/secret.json
OPENAI_API_KEY=your_openai_api_key # Required if testing against OpenAI models
```
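As a rough sketch of how these variables are consumed, the evaluation code in this repository loads `.env` files with `python-dotenv`; the exact variable handling below is an assumption, not the script's actual code:

```python
# Sketch, assuming python-dotenv is used as elsewhere in this repository.
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path="/path/to/.env")  # or point the script at it via --env-path
creds_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]  # path to secret.json
openai_key = os.environ.get("OPENAI_API_KEY")  # only required for OpenAI models
```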
### Required Files

- `secret.json`: Ensure you have a Google Vertex AI subscription and the necessary credentials file.

### Data Files

- **Input File**: `data/data.csv`
  - This file contains the questions to be tested. Ensure it is formatted as a CSV file with the columns `Question` and `Answer` (a sample is shown below).

- **Output File**: `data/data_result.csv`
  - This file is generated after running the script and contains the results of the evaluation.
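A minimal example of what `data/data.csv` could look like; the rows are made up for illustration:

```plaintext
Question,Answer
"What is OpenROAD?","OpenROAD is an open-source RTL-to-GDSII flow for digital ASIC design."
"Which command runs global placement?","The global_placement command runs global placement."
```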
## How to Run

1. **Activate the virtual environment**

   From the parent directory (`evaluation`), make sure you have run `make init`
   before activating the virtual environment; it is needed for this folder to be
   recognised as a submodule.

2. **Run the Script**

   Use the following command to execute the script with customizable options
   (a sketch of how these flags might be parsed follows this list):

   ```bash
   python main.py --env-path /path/to/.env --creds-path /path/to/secret.json --iterations 10 --llms "base-gemini-1.5-flash,base-gpt-4o" --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
   ```

   - `--env-path`: Path to the `.env` file.
   - `--creds-path`: Path to the `secret.json` file.
   - `--iterations`: Number of iterations per question.
   - `--llms`: Comma-separated list of LLMs to test.
   - `--agent-retrievers`: Comma-separated list of agent-retriever names and URLs.

3. **View Results**

   Results will be saved in a CSV file named after the input data file with `_result` appended.
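The actual `main.py` is not reproduced in this commit view, but the flags above suggest an interface along these lines. The sketch below is an assumption for illustration only: the argument names match the README, everything else (defaults, helper names) is hypothetical.

```python
# Hypothetical sketch of the CLI described above, not the repository's actual main.py.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Script-based auto-evaluation")
    parser.add_argument("--env-path", default=".env", help="Path to the .env file")
    parser.add_argument("--creds-path", default="secret.json", help="Path to the Vertex AI credentials file")
    parser.add_argument("--iterations", type=int, default=5, help="Iterations per question")
    parser.add_argument("--llms", default="", help="Comma-separated list of LLMs to test")
    parser.add_argument("--agent-retrievers", default="", help="Comma-separated name=url pairs")
    return parser.parse_args()


def parse_agent_retrievers(value: str) -> dict[str, str]:
    # "v1=http://url1.com,v2=http://url2.com" -> {"v1": "http://url1.com", "v2": "http://url2.com"}
    return dict(item.split("=", 1) for item in value.split(",") if "=" in item)


if __name__ == "__main__":
    args = parse_args()
    print(args.llms.split(","), parse_agent_retrievers(args.agent_retrievers))
```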
## Basic Usage

### a. Default Usage

```bash
python main.py
```

- Uses the default `.env` file in the project root.
- Default `data/data.csv` as input.
- 5 iterations per question.
- Tests all available LLMs.
- No additional agent-retrievers.

### b. Specify .env and secret.json Paths

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json
```

### c. Customize Iterations and Select Specific LLMs

```bash
python main.py --iterations 10 --llms "base-gpt-4o,base-gemini-1.5-flash"
```

### d. Add Agent-Retrievers with Custom Names

```bash
python main.py --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### e. Full Example with All Options

```bash
python main.py \
  --env-path /path/to/.env \
  --creds-path /path/to/secret.json \
  --iterations 10 \
  --llms "base-gemini-1.5-flash,base-gpt-4o" \
  --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### f. Display Help Message

To view all available command-line options:

```bash
python main.py --help
```
### Run Analysis

After generating results, you can perform analysis using the provided `analysis.py` script. To run the analysis, execute the following command:

```bash
streamlit run analysis.py
```
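The repository's `analysis.py` is not shown in this commit view. As a rough idea of what a Streamlit analysis of `data/data_result.csv` could look like, here is a minimal sketch; the column names `llm` and `llm_score` are assumptions, not the actual result schema:

```python
# Hypothetical sketch of a Streamlit results viewer, not the repository's analysis.py.
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("ORAssistant evaluation results")

df = pd.read_csv("data/data_result.csv")  # file produced by main.py
st.dataframe(df)

# Assumed columns: "llm" (model or agent-retriever name) and "llm_score" (0-1 LLMScore).
fig = px.box(df, x="llm", y="llm_score", title="LLMScore distribution per model")
st.plotly_chart(fig)
```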
### Sample Comparison Commands

1. To compare three versions of ORAssistant, use:

   ```bash
   python main.py --agent-retrievers "orassistant-v1=http://url1.com,orassistant-v2=http://url2.com,orassistant-v3=http://url3.com"
   ```

   *Note: Each URL is the endpoint of the ORAssistant backend.*

2. To compare ORAssistant with base-gpt-4o, use:

   ```bash
   python main.py --llms "base-gpt-4o" --agent-retrievers "orassistant=http://url.com"
   ```

   *Note: The URL is the endpoint of the ORAssistant backend.*

evaluation/script_based_evaluation/__init__.py

Whitespace-only changes.
