
LLM-Eval

Ref: https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.3.0. The repository contains a set of evaluation tasks for assessing different capabilities of large language models. All the tasks are present under the lm_eval/tasks directory.

Setting up the environment

  1. Create a conda env and install the dependencies from requirements.txt, for example:
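
A minimal sketch of the setup; the environment name and Python version are assumptions, not pinned by the repo:

# env name and Python version below are assumptions
conda create -n llm-eval python=3.10 -y
conda activate llm-eval
pip install -r requirements.txt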

Tasks to be run for evaluations

  1. English tasks: ["mmlu", "race", "hellaswag", "piqa", "boolq", "siqa", "arc_challenge", "openbookqa", "winogrande", "truthfulqa", "crowspairs"]
  2. Arabic tasks: ["exams_ar", "mmlu_hu_ar", "mmlu_ar", "digitised_ar", "hellaswag_ar", "piqa_ar", "boolq_ar", "siqa_ar", "arc_challenge_ar", "openbookqa_ar", "truthfulqa_mc_ar", "crowspairs_ar"]
  3. Hindi tasks: ["mmlu_hi","hellaswag_hi","arc_hi","truthfulqa_hi"]
  4. All the datasets are either downloaded directly from Hugging Face or available under the datasets directory.

All tasks are run 0-shot.

Run a single evaluation task

  1. cd to the repository root.
  2. Execute the following command:
python main.py \
    --model hf-causal-experimental \
    --model_args use_accelerate=True,pretrained=<model_path> \
    --tasks <task_name> \
    --num_fewshot 0 \
    --output_path output<task>.json \
    --device cuda

Some details about the parameters passed to execute the code:

  • model_args: additional arguments passed when loading the model; more details can be found in the upstream lm-evaluation-harness repository.
  • tasks: the name of the task to run; multiple comma-separated tasks can be passed to run them all in a single invocation (see the example after this list).
  • num_fewshot: the number of in-context examples in the n-shot setting; the default is 0. All of our evaluations are 0-shot, so this can be left at the default.
  • output_path: the file where the final evaluation results are written.
  • More details about the other available parameters can be found in the upstream lm-evaluation-harness repository.
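
For example, to run several of the English tasks above in a single invocation (the model path is a placeholder and the output file name is arbitrary):

python main.py \
    --model hf-causal-experimental \
    --model_args use_accelerate=True,pretrained=<model_path> \
    --tasks hellaswag,piqa,boolq \
    --num_fewshot 0 \
    --output_path output_en_subset.json \
    --device cuda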

Run GPT4-as-Judge evals

  • These evals compare the generations of two models on the 80 Vicuna questions, in both Arabic and English. GPT4 is used as the judge: it compares and scores the outputs of both models.
  • To run them (an OpenAI API key is required):
cd gpt4_eval/
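
# assumptions: the OpenAI key is read from the OPENAI_API_KEY environment
# variable, and the model name/path variables used below are placeholders
# you must set yourself
export OPENAI_API_KEY=<your_openai_key>
model1_name=<model1_name>; model1_path=/path/to/model1
model2_name=<model2_name>; model2_path=/path/to/model2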

# generate responses for two models on Vicuna 80 questions, please change output paths in utils.py
python model_text_gen.py --model_name $model1_name --model_path $model1_path --task vicuna --lang ar
python model_text_gen.py --model_name $model1_name --model_path $model1_path --task vicuna --lang en

python model_text_gen.py --model_name $model2_name --model_path $model2_path --task vicuna --lang ar
python model_text_gen.py --model_name $model2_name --model_path $model2_path --task vicuna --lang en

# once the above generations are complete, you may run the GPT4 comparisons in parallel
python3 compare_models.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna --lang en &
p1=$!
python3 compare_models.py --model1 $model2_name --model2 $model1_name --gpt4_prompt_type generic --task vicuna --lang en &
p2=$!
python3 compare_models.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna --lang ar &
p3=$!
python3 compare_models.py --model1 $model2_name --model2 $model1_name --gpt4_prompt_type generic --task vicuna --lang ar &
p4=$!

wait $p1 $p2 $p3 $p4

# Save a win rate figure
python gpt4_eval_scores.py --model1 $model1_name --model2 $model2_name --gpt4_prompt_type generic --task vicuna

------------------------------------------

Run batch evaluations

This section describes how we run our evaluations on Slurm clusters.

  1. cd to the scripts directory.
  2. Add the model info to model_name_to_path.jsonl (see the example entry after this list).
    • model_name can be any suitable identifier; the result files are named using it.
  3. In batch_run.py, make the following changes:
    1. Add the model to the model_to_be_evaluated list.
    2. For GPT4 evaluations (tasks such as 'vicuna'), a second model is needed to compare the current model against; put its name in model_to_compare_name.
    3. Add the tasks to be run to tasks_to_be_run (the basic LM-harness task groups are base_tasks_en and base_tasks_ar).
  4. Execute python batch_run.py.
  5. This triggers jobs on the cluster with the required resources reserved. You can verify that everything is working by checking that the jobs appear in the squeue output.
  6. While these evaluations are running, you can check for missing results by running find_missing.py with the model name: python find_missing.py --file_starts <model_name>
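
A minimal sketch of a model_name_to_path.jsonl entry; the field names are assumptions inferred from the file name, so adjust them to whatever batch_run.py actually expects:

# hypothetical entry: field names and values are illustrative only
echo '{"model_name": "my-model", "model_path": "/path/to/my-model"}' >> model_name_to_path.jsonl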

Check results

  1. All the lm-harness results are stored under the output_0shot directory, one jsonl file per task, named with the <model_name> prefix and the task name (see the sketch after this list).
  2. Run python print_results.py. This consolidates the results of the different tasks and generates lm_harness_results.xlsx, which can be used to update llm_eval.xlsx.
  3. Results from the GPT4 evaluations are stored in the gpt4_eval/fig directory, which contains the final comparison results for both models.
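
A quick way to eyeball the raw outputs before consolidating them; the exact file-naming pattern is an assumption based on the description above:

# list this model's per-task result files, then inspect one (pattern is an assumption)
ls output_0shot/ | grep '^<model_name>'
cat output_0shot/<model_name>*<task>*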
