Ref: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.3.0

The repository contains a set of evaluation tasks with which we can evaluate large language models and assess their different capabilities. All the tasks are present under the `lm_eval/tasks` directory.
- Create a conda env and use `requirements.txt` to install the dependencies (see the example below).
- English tasks: ["mmlu", "race", "hellaswag", "piqa", "boolq", "siqa", "arc_challenge", "openbookqa", "winogrande", "truthfulqa", "crowspairs"]
- Arabic tasks: ["exams_ar", "mmlu_hu_ar", "mmlu_ar", "digitised_ar", "hellaswag_ar", "piqa_ar", "boolq_ar", "siqa_ar", "arc_challenge_ar", "openbookqa_ar", "truthfulqa_mc_ar", "crowspairs_ar"]
- Hindi tasks: ["mmlu_hi", "hellaswag_hi", "arc_hi", "truthfulqa_hi"]
- All the datasets are either downloaded directly from Huggingface or available under the `datasets` directory.

All tasks are run 0-shot.
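For example, the environment can be set up as follows (the environment name and Python version are placeholders; use whatever matches your setup):

```bash
# Create and activate a fresh conda environment (name and Python version are illustrative)
conda create -n lm_eval python=3.10 -y
conda activate lm_eval

# Install the dependencies listed in the repository
pip install -r requirements.txt
```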
- `cd` to this directory.
- Execute the following command:
```bash
python main.py \
    --model hf-causal-experimental \
    --model_args use_accelerate=True,pretrained=<model_path> \
    --tasks <task_name> \
    --num_fewshot 0 \
    --output_path output<task>.json \
    --device cuda
```

Some details about the parameters passed to execute the code:
- `model_args`: multiple parameters can be passed to the model and are used to load it; more details can be found here.
- `tasks`: the name of the task to run; multiple comma-separated tasks can be passed to run all of them in a single run (see the example after this list).
- `num_fewshot`: sets the number of examples used in the n-shot setting; the default value is 0. In our evaluations we have been using 0-shot evaluations, so this can be ignored.
- `output_path`: the result file where the final evaluations are written.
- More details about the other available parameters can be found here.
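For example, to run several of the English tasks in a single run (the model path and output file name are placeholders):

```bash
python main.py \
    --model hf-causal-experimental \
    --model_args use_accelerate=True,pretrained=/path/to/model \
    --tasks hellaswag,piqa,boolq,arc_challenge \
    --num_fewshot 0 \
    --output_path output_en_tasks.json \
    --device cuda
```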
- These evals compare the generations of two models on the Vicuna 80 questions, in Arabic as well as English. GPT-4 is used as the judge: it compares and scores the outputs of both models.
- To run them (an OpenAI API key is required):
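Before running, make your OpenAI key available to the scripts. This assumes the key is read from the standard `OPENAI_API_KEY` environment variable (the OpenAI Python client's default); if the scripts expect it elsewhere, adjust accordingly:

```bash
# Assumption: the scripts pick up the key from the environment, as the OpenAI client does by default
export OPENAI_API_KEY=<your_openai_key>
```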
```bash
cd gpt4_eval/

# Generate responses for the two models on the Vicuna 80 questions
# (please change the output paths in utils.py first)
python model_text_gen.py --model_name $model1_name --model_path $model1_path --task vicuna --lang ar
python model_text_gen.py --model_name $model1_name --model_path $model1_path --task vicuna --lang en
python model_text_gen.py --model_name $model2_name --model_path $model2_path --task vicuna --lang ar
python model_text_gen.py --model_name $model2_name --model_path $model2_path --task vicuna --lang en

# Once the above generations are complete, you may run the GPT-4 comparisons
# (each pair is compared in both orderings, for both languages)
python3 compare_models.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna --lang en &
p1=$!
python3 compare_models.py --model1 $model2_name --model2 $model1_name --gpt4_prompt_type generic --task vicuna --lang en &
p2=$!
python3 compare_models.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna --lang ar &
p3=$!
python3 compare_models.py --model1 $model2_name --model2 $model1_name --gpt4_prompt_type generic --task vicuna --lang ar &
p4=$!
wait $p1 $p2 $p3 $p4

# Save a win-rate figure
python gpt4_eval_scores.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna
```

The following describes how we run our evaluations on slurm clusters.
- `cd` to the `scripts` directory.
- Add the model info in `model_name_to_path.jsonl`. The `model_name` can be any suitable name you want to give; the result files are generated using this name (see the example entry after this list).
- In `batch_run.py`, make the following changes (a sketch follows this list):
  - Add this model to the list `model_to_be_evaluated`.
  - For GPT-4 evaluations (tasks such as `vicuna`), we need a second model with which the current model is compared. The name of the second model should be put in `model_to_compare_name`.
  - Add the tasks to be run in `tasks_to_be_run` (the basic LM harness tasks are `base_tasks_en` and `base_tasks_ar`).
- Execute `python batch_run.py`. It will trigger jobs on the cluster while reserving the required resources. You can check that everything is working by confirming the jobs show up in the `squeue` output.
- In between these evaluation runs, you can run `find_missing.py` by providing the model name: `python find_missing.py --file_starts <model_name>`
- All the lm-harness results are stored in a single jsonl file per task under the `output_0shot` directory, named with the `<model_name>` prefix and the task name.
- Run `python print_results.py`. This consolidates the results of the different tasks and generates `lm_harness_results.xlsx`, which can be used to update `llm_eval.xlsx`.
- Results from the GPT-4 evaluations are stored in the `gpt4_eval/fig` directory, where the final comparison results for both models are found.
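For reference, a hypothetical entry in `model_name_to_path.jsonl` might look like the line below. The field names here are an assumption made for illustration; follow the format of the entries already present in the file.

```json
{"model_name": "my_model_7b", "model_path": "/path/to/my_model_7b"}
```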
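And a minimal sketch of the `batch_run.py` edits described above, assuming `model_to_be_evaluated` and `tasks_to_be_run` are plain Python lists and `base_tasks_en` / `base_tasks_ar` are task lists already defined in that file; the values shown are placeholders.

```python
# Sketch of the edits in batch_run.py (variable names are from this README; values are placeholders)
model_to_be_evaluated = [
    "my_model_7b",                        # same name used in model_name_to_path.jsonl
]
model_to_compare_name = "baseline_model"  # second model, only needed for GPT-4 tasks such as 'vicuna'
tasks_to_be_run = base_tasks_en + base_tasks_ar  # predefined basic LM harness task lists
```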