This repo contains the code for the ACL 2026 paper CUB: Benchmarking Context Utilisation Techniques for Language Models.
As described in the paper, the work consists of three main steps:
- Data preparation for CUB
- Implementation of context usage manipulation techniques (CMTs)
- Evaluation on CUB and results analysis
We use uv to set up the environment for the repo. Follow the instructions in the uv documentation to install uv if you do not already have it installed. Then run the following:
uv sync

This command reads the package dependencies from the pyproject.toml and uv.lock files and sets up the virtual environment. To run a Python script with the configured environment, use uv run python <your-python-script>. uv also creates the virtual environment under .venv in the repo root; alternatively, activate it with source .venv/bin/activate before running any Python scripts. This latter approach is used in many of the bash scripts in this repo.
To add packages to the environment, simply use uv add <your-package>, or edit the pyproject.toml file and then run uv sync. uv add automatically updates the pyproject.toml.
For some specific CMT implementations, you may need to further configure the virtual environment; instructions are given where this is needed.
CUB is built upon three datasets: CounterFact, NQ and DRUID. The setup of each dataset is described below. For the setup of CUB we also collect 'regular mode predictions' (model predictions when no CMT is applied) and split the benchmarking datasets into validation and test sets; these steps are also described below.
The result of the steps described in this section is three datasets ready for context utilisation benchmarking. They can be found on Hugging Face Datasets: CounterFact, NQ and DRUID. On these pages you can also find more details about the datasets and what they contain.
The CounterFact dataset is based on the exact fact recall set for Pythia 6.9B from the PRISM approach (Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion). The predictions on these samples are more likely to correspond to memory recall, and no prompt completion is necessary.
To get the evaluation dataset from the PRISM dataset, use notebooks/data_preparation/get_counterfact_samples.ipynb.
Use notebooks/data_preparation/get_nq_samples.ipynb to obtain the NQ dataset.
Use notebooks/get_druid_samples.ipynb to obtain the DRUID dataset.
All models under consideration are evaluated in "regular mode" on the same data samples, produced above.
Run the scripts under scripts/eval to collect these regular mode predictions for all evaluated LMs (both open-source and API-based).
The regular mode model predictions are split into a validation and a test set using notebooks/data_preparation/get_dev_test_splits.ipynb. The resulting files (one dev and one test split per evaluated model and evaluation dataset) are then uploaded to Hugging Face Datasets.
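For intuition, a deterministic split can be sketched as follows in Python. This is only an illustration, assuming predictions stored as JSONL and a 20% dev fraction; the notebook above is the authoritative implementation and the paths here are hypothetical.

import json
import random

# Hypothetical input path; the real paths are set inside the notebook.
with open("regular_mode_predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

random.Random(42).shuffle(records)  # fixed seed for reproducibility
n_dev = int(0.2 * len(records))     # assumed dev fraction
dev, test = records[:n_dev], records[n_dev:]

for split_name, split in [("dev", dev), ("test", test)]:
    with open(f"{split_name}.jsonl", "w") as f:
        for record in split:
            f.write(json.dumps(record) + "\n")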
We implement different CMTs and evaluate them on CUB. This section describes the implementation and results collection for each of the following CMTs: PH3, prompting, fine-tuning (KAFT), the multi-agent method, ACD and COIECD.
The results are then plotted and analysed in the next main section.
We leverage PH3 to perform mechanistic interventions that improve or suppress context utilisation of the LMs evaluated on CUB.
General approach of PH3 per LM:
- Identify salient heads of the LM via path patching on CounterFact
- Tune the head configuration for each of the two PH3 settings (+context or +memory) on the CUB validation dataset
- Evaluate the model on the CUB test dataset with the tuned head configuration
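For intuition about what intervening on heads means in practice, here is a minimal sketch using TransformerLens-style hooks. It is illustrative only (the repo's implementation lives under src/ph3 and scripts/ph3); the model, prompt and (layer, head) pairs are placeholders.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small example model

# Hypothetical (layer, head) pairs, e.g. the top-ranked heads from path patching.
HEADS_TO_PRUNE = [(9, 6), (10, 0), (11, 10)]

def zero_pruned_heads(z, hook):
    # z has shape [batch, pos, n_heads, d_head]; zero out the pruned heads.
    for layer, head in HEADS_TO_PRUNE:
        if hook.layer() == layer:
            z[:, :, head, :] = 0.0
    return z

fwd_hooks = [
    (f"blocks.{layer}.attn.hook_z", zero_pruned_heads)
    for layer in sorted({layer for layer, _ in HEADS_TO_PRUNE})
]
logits = model.run_with_hooks("The capital of France is", fwd_hooks=fwd_hooks)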
We use CounterFact samples corresponding to exact fact recall in the model (the exact fact recall partition from PRISM) as the diagnostic set for attention head detection. Motivation: a preliminary study indicated issues with the CounterFact and World Capital datasets as used by Ortu et al. and PH3, and we obtained more stable results on the CounterFact PRISM data.
Follow the steps described in the PRISM repo to generate the exact fact recall partitions of CounterFact for each LM that PH3 is applied to. The per-model PRISM datasets are then reformatted as described below:
GPT-2 XL (732 samples; only keeping samples for which the top-1 prediction is correct):
# <load your Python environment here>
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/gpt2_xl.jsonl"Pythia 6.9B 899 samples (only keeping samples for which the top-1 prediction is correct)
# <load your Python environment here>
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/pythia_6_9.jsonl"Qwens
# <load your Python environment here>
# Qwen 1.5B
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_1_5.jsonl"
# Qwen 1.5B Instruct
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_1_5_instruct.jsonl"
# Qwen 7B
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_7.jsonl"
# Qwen 7B Instruct
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_7_instruct.jsonl"Use the scripts under scripts/path_patching to identify attention heads for each of the LMs PH3 is applied to. The output gives a ranked list of the most important heads for context or memory usage. Use notebooks/ph3/get_attention_head_plots.ipynb to plot the heads.
The original PH3 work tuned the number of heads to intervene on using a development set for each evaluation dataset. We likewise evaluate models on the validation splits of the evaluation datasets to identify the best head configuration.
Use the scripts in e.g. scripts/ph3/gpt2_xl/top_n_heads (run as a SLURM array) to get the results for each n-head setting of PH3, i.e. pruning the 1, 3, 5, 7, 10, 12, 15, 17, 20, 23 or 25 most important heads; each array task handles one of these settings, as sketched below.
Note: The scripts need to be adapted per model. Modify the script for each model so that it contains that model's 1, 3, 5, ..., and 25 most important heads, as obtained in the previous step via notebooks/ph3/get_attention_head_plots.ipynb.
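Conceptually, each SLURM array task maps its index to one head budget. A minimal Python sketch of that mapping (the array bounds and budgets mirror the example submission below; the variable names are illustrative):

import os

# The head budgets evaluated for PH3; one SLURM array task per budget.
TOP_N_OPTIONS = [1, 3, 5, 7, 10, 12, 15, 17, 20, 23, 25]

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))  # 0..10 for --array=0-10
top_n = TOP_N_OPTIONS[task_id]
print(f"Evaluating PH3 with the {top_n} most important heads pruned")

Example submission: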
sbatch --array=0-10 scripts/ph3/gpt2_xl/top_n_heads/nq.sh

Then, to identify the best n-head setting for each of the CUB datasets, use the notebooks notebooks/ph3/tune_ph3_counterfact.ipynb, notebooks/ph3/tune_ph3_nq.ipynb and notebooks/ph3/tune_ph3_druid.ipynb. These indicate the best PH3 hyperparameter setting for each model and dataset in CUB.
Note: The notebooks also need to be adapted per model. Modify the notebook for each model so that it contains that model's 1, 3, 5, ..., and 25 most important heads, as obtained via notebooks/ph3/get_attention_head_plots.ipynb.
Now we know what PH3 head setting works best for each model and dataset. Update the scripts in e.g. scripts/ph3/gpt2_xl/top_config and scripts/ph3/pythia_6_9/top_config with the tuned head configurations and run them to collect the PH3 results for CUB.
The prompting CMT is implemented as follows for CUB:
- Curate the prompts: Add the prompts to be experimented with for each dataset under src/prompt_tuning/<CUB-dataset>_prompts.py (see the sketch after this list). As described in the paper, we use different prompts depending on dataset and model type; the method for selecting the prompts is also described in the paper.
- Evaluate the prompts: Use the scripts under scripts/prompt_tuning to measure the performance of each prompt on the validation splits of CUB. Record the best-performing prompt for each setting.
- Benchmark on CUB: Run the best prompts on the test splits of CUB using scripts/prompt_tuning/<CUB-dataset>/<model>_tuned.sh. Note: You need to specify the optimal prompt before running these scripts; this has already been filled in for the experiments in the paper.
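For reference, a prompts file might look like the following. This is a hypothetical sketch of the structure only, not the actual contents of any src/prompt_tuning/<CUB-dataset>_prompts.py file; the templates and helper are illustrative.

# Hypothetical example of a prompts file, e.g. src/prompt_tuning/nq_prompts.py.
# Each template receives the retrieved context and the question.
PROMPTS = [
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "Answer the question based only on the given context.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "Use the context if it is relevant, otherwise rely on your own knowledge.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:",
]

def format_prompt(template: str, context: str, question: str) -> str:
    return template.format(context=context, question=question)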
We follow the method by Li et al. (2023) to perform knowledge-aware fine-tuning (KAFT). KAFT fine-tunes LMs to generate answers that align with the provided context, even when the context conflicts with the model's parametric knowledge, and to be robust to irrelevant context. To this end, we need to sample a custom fine-tuning dataset that reflects information already encoded in the model, together with gold, conflicting and irrelevant contexts.
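For intuition, the target-selection logic behind KAFT can be sketched as follows. This is an illustrative reading of the method, not the repo's exact code; all names and the context-type labels are hypothetical.

from dataclasses import dataclass

@dataclass
class KaftExample:
    question: str
    context: str
    target: str

def make_example(question, context, context_answer, parametric_answer, context_type):
    # context_type is one of "gold", "conflicting", "irrelevant" (illustrative labels).
    if context_type in ("gold", "conflicting"):
        # Train the model to follow the context, even when it conflicts
        # with what the model already "knows".
        target = context_answer
    else:
        # Irrelevant context: train the model to ignore it and answer
        # from parametric knowledge.
        target = parametric_answer
    return KaftExample(question, context, target)

example = make_example(
    "Who wrote Pride and Prejudice?",
    "Pride and Prejudice was written by Charlotte Brontë.",  # conflicting context
    context_answer="Charlotte Brontë",
    parametric_answer="Jane Austen",
    context_type="conflicting",
)
assert example.target == "Charlotte Brontë"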
The fine-tuning CMT thus involves four main steps: 1) prepare the dataset sources for fine-tuning, 2) sample the fine-tuning dataset from the sources, 3) fine-tune the selected models on the dataset, and 4) evaluate the fine-tuned models on CUB.
To ensure that the fine-tuning CMT works well across different domains and task setups we curate a fine-tuning dataset based on four different data sources: SQuAD 2.0, TriviaQA, AVeriTeC and DYNAMICQA. More details on the data sources can be found in the appendix of the paper (Implementation Details of Fine-tuning).
python -m src.finetuning.data_prep.triviaqa_label_relevance

The original TriviaQA samples do not include relevance labels; we therefore label each question-context pair by checking for an exact match of the answer within the context. The script creates two files, triviaqa_irrelevance.json and triviaqa_relevance.json, which are used during fine-tuning.
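A minimal sketch of this exact-match labelling (illustrative; the normalisation details in the script may differ):

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def is_relevant(answer: str, context: str) -> bool:
    # A context is labelled relevant iff the answer string appears
    # verbatim (after light normalisation) within the context.
    return normalise(answer) in normalise(context)

assert is_relevant("Paris", "The capital of France is Paris.")
assert not is_relevant("Paris", "The capital of Italy is Rome.")

Next, collect the models' parametric (closed-book) predictions: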
python -m src.finetuning.data_prep.query_parametric \
--model_name <model-name-on> \
--dataset <dataset>

Possible values for dataset: ['squad', 'triviaqa', 'averitec', 'dynamicqa'].
The model names are the same as those on Hugging Face. For each dataset, the script creates a JSON file containing the indices of the samples that the model answers correctly without the context.
To run this script, you need access to a GPU (e.g. an Nvidia CUDA device) and to install vLLM via uv add vllm.
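A rough sketch of what such closed-book querying can look like with vLLM. This is illustrative only; the prompts, parsing and matching in src.finetuning.data_prep.query_parametric may differ, and the model name is just an example.

from vllm import LLM, SamplingParams

# Example model name; any HF causal LM supported by vLLM works here.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)

questions = ["Who wrote Pride and Prejudice?"]
answers = ["Jane Austen"]

# Query the model closed-book, i.e. without any context.
outputs = llm.generate([f"Question: {q}\nAnswer:" for q in questions], params)

# Keep the indices of the questions the model answers correctly on its own.
correct_idx = [
    i for i, out in enumerate(outputs)
    if answers[i].lower() in out.outputs[0].text.lower()
]
print(correct_idx)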
To ensure that all models are trained on the same data points in the same order, we sample from the full datasets according to the mixing rates. The details of the mixing rates can be found in the paper (Table 17). The script below creates the sampled_train_data.json file.

python -m src.finetuning.data_prep.create_train_sequence
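A sketch of the idea behind deterministic mixed sampling. The mixing rates, total sample count and pool file paths below are hypothetical; the real rates are in Table 17 of the paper.

import json
import random

# Hypothetical mixing rates; see Table 17 in the paper for the real ones.
MIX = {"squad": 0.4, "triviaqa": 0.3, "averitec": 0.2, "dynamicqa": 0.1}
TOTAL = 10_000  # assumed total number of training samples

rng = random.Random(42)  # fixed seed => same data points in the same order
train = []
for name, rate in MIX.items():
    with open(f"data/{name}_pool.json") as f:  # hypothetical pool files
        pool = json.load(f)
    train.extend(rng.sample(pool, int(rate * TOTAL)))

rng.shuffle(train)
with open("sampled_train_data.json", "w") as f:
    json.dump(train, f)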
As the final step, we perform the fine-tuning on the curated dataset. The setup is adapted to the type of LM being fine-tuned (instruct or non-instruct). After this step, the saved fine-tuned model checkpoints are ready for benchmarking.
python -m src.finetuning.fine_tune
python -m src.finetuning.finetune_instruct
The fine-tuned models are then evaluated on CUB. See the evaluation scripts for Qwen 32B under scripts/finetuning/qwen_32 for an example of what the scripts look like for evaluating the fine-tuned models. For other LMs, only the --model_name and --saved_model arguments need to be changed.
The multi-agent method requires the full generation output of the Regular open-book baseline, not just the first token prediction.
- Full generation outputs for closed-source LMs are already available in the CUB Hugging Face datasets.
- For open-source LMs, run the script below to generate full open-book (naive_w_context) outputs.
CUDA_VISIBLE_DEVICES=$1 python -m src.multi_agent.prepare_open_book \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name $DATASET \
--data_split "test" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "naive_w_context" Run the following two expert modules. Set MODEL_TYPE to "open" for open-source LMs and "closed" for closed-source LMs that require API-based inference (OpenAI).
CUDA_VISIBLE_DEVICES=$CUDA python -m src.multi_agent.multi_agent \
--model_name $MODEL_NAME \
--data_name $DATASET \
--model_path $MODEL_PATH \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--model_type $MODEL_TYPE \
--expert "context_faithfulness"CUDA_VISIBLE_DEVICES=$CUDA python -m src.multi_agent.multi_agent \
--model_name $MODEL_NAME \
--data_name $DATASET \
--model_path $MODEL_PATH \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--model_type $MODEL_TYPE \
--expert "relevance_assessment" The judge module makes the final decision based on the expert modules and refines the final answer.
Set MODEL_TYPE to "open" for open-source LMs and "closed" for closed-source LMs that require API-based inference (OpenAI).
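Conceptually, the judge can be thought of as combining the two expert verdicts roughly as follows. This is a purely illustrative sketch, not the logic implemented in src.multi_agent.judge; all names are hypothetical.

def judge(open_book_answer, closed_book_answer,
          context_is_relevant, answer_is_faithful):
    # context_is_relevant: verdict of the relevance_assessment expert.
    # answer_is_faithful: verdict of the context_faithfulness expert.
    if not context_is_relevant:
        # Irrelevant context: fall back to the model's parametric answer.
        return closed_book_answer
    if answer_is_faithful:
        # Relevant context and a faithful open-book answer: keep it.
        return open_book_answer
    # Relevant context but an unfaithful answer: the judge refines the
    # answer (in the real pipeline, by re-prompting the model).
    return open_book_answer  # placeholder for the refined answer

The judge is then invoked as follows: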
CUDA_VISIBLE_DEVICES=$CUDA python -m src.multi_agent.judge \
--model_name $MODEL_NAME \
--data_name $DATASET \
--model_path $MODEL_PATH \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--model_type $MODEL_TYPE \
--method_name "multi_agent" \
--expert_list "context_faithfulness"The raw output files from the previous steps need to be reformatted to work with the CUB analysis. Use the notebook below to standardize the output format and naming convention. Run notebooks/multi_agent/reformat_multi_agent_results.ipynb to reformat the output fields and save the final result files.
To evaluate the model on CUB with ACD, run the script below:
CUDA_VISIBLE_DEVICES=$1 python -m src.eval_cd \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name $DATASET \
--data_split "test" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "acd" COIECD requires a hyperparameter search phase to find the optimal settings for each dataset, followed by evaluation with those settings.
COIECD requires a hyperparameter search phase to find the optimal settings for each dataset, followed by evaluation with those settings. Following the COIECD paper, the optimal hyperparameters (lambda and alpha) are selected by evaluating on the validation split of the NQ dataset using gold context.
for lambda in 0.1 0.25 0.5 1.0
do
for alpha in 0.0 0.5 1.0 1.5 2.0
do
CUDA_VISIBLE_DEVICES=$1 python -m src.eval_cd \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name "copenlu/cmt-benchmark-nq" \
--data_split "validation" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "coiecd" \
--threshold $lambda \
--alpha $alpha
done
done

Run notebooks/coiecd/coiecd_hp_search.ipynb to identify the best hyperparameter combination.
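Conceptually, the notebook picks the (lambda, alpha) pair with the best validation accuracy. A toy sketch of that selection, assuming hypothetical per-setting result files (the actual paths and metrics are set in the notebook):

import json
from itertools import product

best = None
for lam, alpha in product([0.1, 0.25, 0.5, 1.0], [0.0, 0.5, 1.0, 1.5, 2.0]):
    # Hypothetical result-file naming convention.
    with open(f"results/coiecd_nq_val_l{lam}_a{alpha}.json") as f:
        acc = json.load(f)["accuracy"]
    if best is None or acc > best[0]:
        best = (acc, lam, alpha)

print(f"best accuracy={best[0]:.3f} at lambda={best[1]}, alpha={best[2]}")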
Once the optimal lambda and alpha values have been identified from the hyperparameter search, use them to evaluate the model on the test split of CUB.
CUDA_VISIBLE_DEVICES=$1 python -m src.eval_cd \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name $DATASET \
--data_split "test" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "coiecd" \
--threshold $lambda \
--alpha $alpha

Once all results have been collected on CUB, we analyse them and create the plots for the paper using the following notebooks:
- notebooks/get_eval_plots.ipynb
  - General results analysis.
  - Also creates tables with detailed results for the appendix.
- notebooks/get_eval_plots_model_reg_diff.ipynb
  - Creates a plot showing the model-averaged relative performance of each CMT compared to Regular across datasets and context types.
- notebooks/explain_results.ipynb
  - Performs a correlation analysis of the results on CUB.
- notebooks/analysis_pareto.ipynb
  - Finds Pareto-optimal CMTs considering both faithfulness (gold and conflicting contexts) and robustness (irrelevant contexts) of methods and models.
- notebooks/multi_agent/get_eval_relevance_expert.ipynb
  - Evaluates the accuracy of the relevance expert of the multi-agent method.
We also analyse the CUB datasets using the notebook below:
- notebooks/check_data_stats.ipynb
  - Measures context lengths.
Please cite our paper as follows:
TBD
