# ORAssistant Automated Evaluation

This project automates the evaluation of language model responses using classification-based metrics and LLMScore. It supports testing against various models, including OpenAI and Google Vertex AI models, and also serves as an evaluation benchmark for comparing multiple versions of ORAssistant.

## Features

1. **Classification-based Metrics**:
   - Categorizes responses as True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN).
   - Computes Accuracy, Precision, Recall, and F1 Score from these counts (see the sketch below).

2. **LLMScore**:
   - Assigns a score between 0 and 1 by comparing the generated response against the ground truth for quality and accuracy.

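The classification metrics follow directly from the four counts. The snippet below is a minimal illustrative sketch of those standard formulas, not the project's actual implementation:

```python
# Minimal sketch of the classification-based metrics (illustrative only,
# not the project's actual code).
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(classification_metrics(tp=8, tn=5, fp=2, fn=1))
```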
## Setup

### Environment Variables

Create a `.env` file in the root directory with the following variables:

```plaintext
GOOGLE_APPLICATION_CREDENTIALS=path/to/secret.json
OPENAI_API_KEY=your_openai_api_key  # Required if testing against OpenAI models
```
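As a rough sketch (the actual loading code in `main.py` may differ), these values could be read with `python-dotenv`:

```python
# Sketch of reading the .env values with python-dotenv; the actual loading
# logic in main.py may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory by default

creds_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
openai_key = os.getenv("OPENAI_API_KEY")  # only required for OpenAI models

if not creds_path or not os.path.exists(creds_path):
    raise FileNotFoundError("GOOGLE_APPLICATION_CREDENTIALS must point to secret.json")
```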
### Required Files

- `secret.json`: Ensure you have a Google Vertex AI subscription and the necessary credentials file.

### Data Files

- **Input File**: `data/data.csv`
  - Contains the questions to be tested, formatted as a CSV file with the columns `Question` and `Answer` (see the example below).

- **Output File**: `data/data_result.csv`
  - Generated after running the script; it contains the evaluation results.

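For illustration, a minimal `data/data.csv` could look like the following; the rows are made-up examples, not part of the project:

```csv
Question,Answer
What does OpenROAD do?,OpenROAD is an open-source RTL-to-GDSII flow for digital ASIC design.
Which OpenROAD command runs global placement?,global_placement
```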
## How to Run

1. **Activate the virtual environment**

   From the parent directory (`evaluation`), make sure you have run `make init`
   before activating the virtual environment; this is needed so that this folder
   is recognised as a submodule.

2. **Run the Script**

   Use the following command to execute the script with customizable options:

   ```bash
   python main.py --env-path /path/to/.env --creds-path /path/to/secret.json --iterations 10 --llms "base-gemini-1.5-flash,base-gpt-4o" --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
   ```

   - `--env-path`: Path to the `.env` file.
   - `--creds-path`: Path to the `secret.json` file.
   - `--iterations`: Number of iterations per question.
   - `--llms`: Comma-separated list of LLMs to test.
   - `--agent-retrievers`: Comma-separated list of agent-retriever names and URLs (`name=url` pairs).

3. **View Results**

   Results will be saved in a CSV file named after the input data file with `_result` appended (see the sketch below).

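As a rough sketch of how the flags and the output path fit together (the argument parsing shown here is illustrative; the actual handling in `main.py` may differ):

```python
# Illustrative sketch of parsing the CLI flags and deriving the output file name;
# the actual handling in main.py may differ.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="ORAssistant automated evaluation")
parser.add_argument("--env-path", default=".env")
parser.add_argument("--creds-path", default="secret.json")
parser.add_argument("--iterations", type=int, default=5)
parser.add_argument("--llms", default="", help="e.g. base-gpt-4o,base-gemini-1.5-flash")
parser.add_argument("--agent-retrievers", default="", help="e.g. v1=http://url1.com,v2=http://url2.com")
args = parser.parse_args()

# Split the comma-separated values into a list and a name -> URL mapping.
llms = [name for name in args.llms.split(",") if name]
agent_retrievers = dict(pair.split("=", 1) for pair in args.agent_retrievers.split(",") if pair)

# The results file sits next to the input file, with "_result" appended to its stem.
input_file = Path("data/data.csv")
output_file = input_file.with_name(f"{input_file.stem}_result{input_file.suffix}")
print(output_file)  # data/data_result.csv
```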

## Basic Usage

### a. Default Usage

```bash
python main.py
```

- Uses the default `.env` file in the project root.
- Uses `data/data.csv` as the input file.
- Runs 5 iterations per question.
- Tests all available LLMs.
- Adds no additional agent-retrievers.

### b. Specify .env and secret.json Paths

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json
```

### c. Customize Iterations and Select Specific LLMs

```bash
python main.py --iterations 10 --llms "base-gpt-4o,base-gemini-1.5-flash"
```

### d. Add Agent-Retrievers with Custom Names

```bash
python main.py --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### e. Full Example with All Options

```bash
python main.py \
    --env-path /path/to/.env \
    --creds-path /path/to/secret.json \
    --iterations 10 \
    --llms "base-gemini-1.5-flash,base-gpt-4o" \
    --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### f. Display Help Message

To view all available command-line options:

```bash
python main.py --help
```

### Run Analysis

After generating results, you can analyze them with the provided Streamlit app, `analysis.py`:

```bash
streamlit run analysis.py
```

### Sample Comparison Commands

1. To compare three versions of ORAssistant, use:
   ```bash
   python main.py --agent-retrievers "orassistant-v1=http://url1.com,orassistant-v2=http://url2.com,orassistant-v3=http://url3.com"
   ```
   *Note: Each URL is the endpoint of the corresponding ORAssistant backend.*

2. To compare ORAssistant with base-gpt-4o, use:
   ```bash
   python main.py --llms "base-gpt-4o" --agent-retrievers "orassistant=http://url.com"
   ```
   *Note: The URL is the endpoint of the ORAssistant backend.*