
LLM-Eval

Ref: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.3.0. The repository contains a set of evaluation tasks for assessing different capabilities of large language models. All the tasks are present under the lm_eval/tasks directory.

Setting up the environment

  1. Create a conda env and install the dependencies from requirements.txt, for example:
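
A minimal sketch of the setup; the environment name and Python version are assumptions, not pinned by the repo:

# env name and Python version below are assumptions
conda create -n llm-eval python=3.10 -y
conda activate llm-eval
pip install -r requirements.txt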

Tasks to be run for evaluations

  1. English tasks: ["mmlu", "race", "hellaswag", "piqa", "boolq", "siqa", "arc_challenge", "openbookqa", "winogrande", "truthfulqa", "crowspairs"]
  2. Arabic tasks: ["exams_ar", "mmlu_hu_ar", "mmlu_ar", "digitised_ar", "hellaswag_ar", "piqa_ar", "boolq_ar", "siqa_ar", "arc_challenge_ar", "openbookqa_ar", "truthfulqa_mc_ar", "crowspairs_ar"]
  3. Hindi tasks: ["mmlu_hi","hellaswag_hi","arc_hi","truthfulqa_hi"]
  4. All the datasets are either downloaded directly from Hugging Face or available under the datasets directory.

All tasks are run 0-shot.

Run a single evaluation task

  1. cd to the repository root.
  2. Execute the following command:
python main.py \
    --model hf-causal-experimental \
    --model_args use_accelerate=True,pretrained=<model_path> \
    --tasks <task_name> \
    --num_fewshot 0 \
    --output_path output<task>.json \
    --device cuda

Some details about the parameters passed to execute the code:

  • model_args: additional arguments passed when loading the model; more details can be found in the upstream lm-evaluation-harness repository.
  • tasks: the name of the task to run; multiple comma-separated tasks can be passed to run them all in a single invocation (see the example after this list).
  • num_fewshot: the number of in-context examples in the n-shot setting; the default is 0. All of our evaluations are 0-shot, so this can be left at the default.
  • output_path: the file where the final evaluation results are written.
  • More details about the other available parameters can be found in the upstream lm-evaluation-harness repository.
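
For example, to run several of the English tasks above in a single invocation (the model path is a placeholder and the output file name is arbitrary):

python main.py \
    --model hf-causal-experimental \
    --model_args use_accelerate=True,pretrained=<model_path> \
    --tasks hellaswag,piqa,boolq \
    --num_fewshot 0 \
    --output_path output_en_subset.json \
    --device cuda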

Run GPT4-as-Judge evals

  • These evals compare the generations of two models on the 80 Vicuna questions, in both Arabic and English. GPT4 is used as the judge: it compares and scores the outputs of both models.
  • To run them (an OpenAI API key is required):
cd gpt4_eval/
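
# assumptions: the OpenAI key is read from the OPENAI_API_KEY environment
# variable, and the model name/path variables used below are placeholders
# you must set yourself
export OPENAI_API_KEY=<your_openai_key>
model1_name=<model1_name>; model1_path=/path/to/model1
model2_name=<model2_name>; model2_path=/path/to/model2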

# generate responses for two models on Vicuna 80 questions, please change output paths in utils.py
python model_text_gen.py --model_name $model1_name --model_path $model1_path --task vicuna --lang ar
python model_text_gen.py --model_name $model1_name --model_path $model1_path --task vicuna --lang en

python model_text_gen.py --model_name $model2_name --model_path $model2_path --task vicuna --lang ar
python model_text_gen.py --model_name $model2_name --model_path $model2_path --task vicuna --lang en

# once the above generations are complete, you may run the GPT4 comparisons in parallel
python3 compare_models.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna --lang en &
p1=$!
python3 compare_models.py --model1 $model2_name --model2 $model1_name --gpt4_prompt_type generic --task vicuna --lang en &
p2=$!
python3 compare_models.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna --lang ar &
p3=$!
python3 compare_models.py --model1 $model2_name --model2 $model1_name --gpt4_prompt_type generic --task vicuna --lang ar &
p4=$!

wait $p1 $p2 $p3 $p4

# Save a win rate figure
python gpt4_eval_scores.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna

------------------------------------------

Run batch evaluations

This section describes how we run our evaluations on Slurm clusters.

  1. cd to the scripts directory.
  2. Add the model info to model_name_to_path.jsonl (see the example entry after this list).
    • model_name can be any suitable identifier; the result files are named using it.
  3. In batch_run.py, make the following changes:
    1. Add the model to the model_to_be_evaluated list.
    2. For GPT4 evaluations (tasks such as 'vicuna'), a second model is needed to compare the current model against; put its name in model_to_compare_name.
    3. Add the tasks to be run to tasks_to_be_run (the basic LM-harness task groups are base_tasks_en and base_tasks_ar).
  4. Execute python batch_run.py.
  5. This triggers jobs on the cluster with the required resources reserved. You can verify that everything is working by checking that the jobs appear in the squeue output.
  6. While these evaluations are running, you can check for missing results by running find_missing.py with the model name: python find_missing.py --file_starts <model_name>
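
A minimal sketch of a model_name_to_path.jsonl entry; the field names are assumptions inferred from the file name, so adjust them to whatever batch_run.py actually expects:

# hypothetical entry: field names and values are illustrative only
echo '{"model_name": "my-model", "model_path": "/path/to/my-model"}' >> model_name_to_path.jsonl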

Check results

  1. All the lm-harness results are stored under the output_0shot directory, one jsonl file per task, named with the <model_name> prefix and the task name (see the sketch after this list).
  2. Run python print_results.py. This consolidates the results of the different tasks and generates lm_harness_results.xlsx, which can be used to update llm_eval.xlsx.
  3. Results from the GPT4 evaluations are stored in the gpt4_eval/fig directory, which contains the final comparison results for both models.
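
A quick way to eyeball the raw outputs before consolidating them; the exact file-naming pattern is an assumption based on the description above:

# list this model's per-task result files, then inspect one (pattern is an assumption)
ls output_0shot/ | grep '^<model_name>'
cat output_0shot/<model_name>*<task>*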
