This repo contains the code for the ACL 2026 paper CUB: Benchmarking Context Utilisation Techniques for Language Models.
As described in the paper, the work consists of three main steps:
- Data preparation for CUB
- Implementation of context usage manipulation techniques (CMTs)
- Evaluation on CUB and results analysis
We use uv to set up the environment for the repo. Follow the instructions in the uv documentation to install uv if you do not already have it installed. Then run the following:
uv sync

This command reads the package dependencies from the pyproject.toml and uv.lock files and sets up the virtual environment. To run a Python script with the configured environment, use uv run python <your-python-script>. uv also creates the virtual environment under .venv in the repo root; alternatively, activate it with source .venv/bin/activate before running any Python scripts. This latter approach is used in many of the bash scripts in this repo.
To add packages to the environment, simply use uv add <your-package>, or edit the pyproject.toml file and then run uv sync. uv add automatically updates the pyproject.toml.
For some specific CMT implementations, you may need to further configure the virtual environment; instructions are given where this is needed.
CUB is built upon three datasets: CounterFact, NQ and DRUID. The setup of each dataset is described below. For the setup of CUB we also collect 'regular mode predictions' (model predictions when no CMT is applied) and split the benchmarking datasets into validation and test sets; these steps are also described below.
The result of the steps described in this section is three datasets ready for context utilisation benchmarking. They can be found on Hugging Face Datasets: CounterFact, NQ and DRUID. On these pages you can also find more details about the datasets and what they contain.
The CounterFact dataset is based on the exact fact recall set for Pythia 6.9B from the PRISM approach (Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion). The predictions on these samples are more likely to correspond to memory recall, and no prompt completion is necessary.
To get the evaluation dataset from the PRISM dataset, use notebooks/data_preparation/get_counterfact_samples.ipynb.
Use notebooks/data_preparation/get_nq_samples.ipynb to obtain the NQ dataset.
Use notebooks/get_druid_samples.ipynb to obtain the DRUID dataset.
All models under consideration are evaluated in "regular mode" on the same data samples, produced above.
Run the scripts under scripts/eval to collect these regular mode predictions for all evaluated LMs (both open-source and API-based).
The regular mode model predictions are split into a validation and a test set using notebooks/data_preparation/get_dev_test_splits.ipynb. The resulting files (one dev and one test split per evaluated model and evaluation dataset) are then uploaded to Hugging Face Datasets.
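For intuition, a deterministic split can be sketched as follows in Python. This is only an illustration, assuming predictions stored as JSONL and a 20% dev fraction; the notebook above is the authoritative implementation and the paths here are hypothetical.

import json
import random

# Hypothetical input path; the real paths are set inside the notebook.
with open("regular_mode_predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

random.Random(42).shuffle(records)  # fixed seed for reproducibility
n_dev = int(0.2 * len(records))     # assumed dev fraction
dev, test = records[:n_dev], records[n_dev:]

for split_name, split in [("dev", dev), ("test", test)]:
    with open(f"{split_name}.jsonl", "w") as f:
        for record in split:
            f.write(json.dumps(record) + "\n")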
We implement different CMTs and evaluate them on CUB. This section describes the implementation and results collection for each of the following CMTs: PH3, prompting, fine-tuning (KAFT), the multi-agent method, ACD and COIECD.
The results are then plotted and analysed in the next main section.
We leverage PH3 to perform mechanistic interventions that improve or suppress context utilisation of the LMs evaluated on CUB.
General approach of PH3 per LM:
- Identify salient heads of the LM via path patching on CounterFact
- Tune the head configuration for each of the two PH3 settings (+context or +memory) on the CUB validation dataset
- Evaluate the model on the CUB test dataset with the tuned head configuration
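For intuition about what intervening on heads means in practice, here is a minimal sketch using TransformerLens-style hooks. It is illustrative only (the repo's implementation lives under src/ph3 and scripts/ph3); the model, prompt and (layer, head) pairs are placeholders.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small example model

# Hypothetical (layer, head) pairs, e.g. the top-ranked heads from path patching.
HEADS_TO_PRUNE = [(9, 6), (10, 0), (11, 10)]

def zero_pruned_heads(z, hook):
    # z has shape [batch, pos, n_heads, d_head]; zero out the pruned heads.
    for layer, head in HEADS_TO_PRUNE:
        if hook.layer() == layer:
            z[:, :, head, :] = 0.0
    return z

fwd_hooks = [
    (f"blocks.{layer}.attn.hook_z", zero_pruned_heads)
    for layer in sorted({layer for layer, _ in HEADS_TO_PRUNE})
]
logits = model.run_with_hooks("The capital of France is", fwd_hooks=fwd_hooks)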
We use CounterFact samples corresponding to exact fact recall in the model (the exact fact recall partition from PRISM) as the diagnostic set for attention head detection. Motivation: a preliminary study indicated issues with the CounterFact and World Capital datasets as used by Ortu et al. and PH3, and we obtained more stable results on the CounterFact PRISM data.
Follow the steps described in the PRISM repo to generate the exact fact recall partitions of CounterFact for each LM that PH3 is applied to. The per-model PRISM datasets are then reformatted as described below:
GPT-2 XL (732 samples; only keeping samples for which the top-1 prediction is correct):
# <load your Python environment here>
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/gpt2_xl.jsonl"Pythia 6.9B 899 samples (only keeping samples for which the top-1 prediction is correct)
# <load your Python environment here>
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/pythia_6_9.jsonl"Qwens
# <load your Python environment here>
# Qwen 1.5B
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_1_5.jsonl"
# Qwen 1.5B Instruct
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_1_5_instruct.jsonl"
# Qwen 7B
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_7.jsonl"
# Qwen 7B Instruct
python -m src.ph3.reformat_prism_dataset_to_ph3 \
--input_file <path-to-prism-exact-fact-recall-data-for-the-model> \
--lama_folder <path-to-srcdir_trex-described-in-prism-repo> \
--output_file "data/prism_counterfact/qwen_7_instruct.jsonl"Use the scripts under scripts/path_patching to identify attention heads for each of the LMs PH3 is applied to. The output gives a ranked list of the most important heads for context or memory usage. Use notebooks/ph3/get_attention_head_plots.ipynb to plot the heads.
The original PH3 work tuned the number of heads to intervene on using a development set for each evaluation dataset. We likewise evaluate models on the validation splits of the evaluation datasets to identify the best head configuration.
Use the scripts in e.g. scripts/ph3/gpt2_xl/top_n_heads (run as a SLURM array) to get the results for each n-head setting of PH3, i.e. pruning the 1, 3, 5, 7, 10, 12, 15, 17, 20, 23 or 25 most important heads; each array task handles one of these settings, as sketched below.
Note: The scripts need to be adapted per model. Modify the script for each model so that it contains that model's 1, 3, 5, ..., and 25 most important heads, as obtained in the previous step via notebooks/ph3/get_attention_head_plots.ipynb.
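Conceptually, each SLURM array task maps its index to one head budget. A minimal Python sketch of that mapping (the array bounds and budgets mirror the example submission below; the variable names are illustrative):

import os

# The head budgets evaluated for PH3; one SLURM array task per budget.
TOP_N_OPTIONS = [1, 3, 5, 7, 10, 12, 15, 17, 20, 23, 25]

task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))  # 0..10 for --array=0-10
top_n = TOP_N_OPTIONS[task_id]
print(f"Evaluating PH3 with the {top_n} most important heads pruned")

Example submission: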
sbatch --array=0-10 scripts/ph3/gpt2_xl/top_n_heads/nq.sh

Then, to identify the best n-head setting for each of the CUB datasets, use the notebooks notebooks/ph3/tune_ph3_counterfact.ipynb, notebooks/ph3/tune_ph3_nq.ipynb and notebooks/ph3/tune_ph3_druid.ipynb. These indicate the best PH3 hyperparameter setting for each model and dataset in CUB.
Note: The notebooks also need to be adapted per model. Modify the notebook for each model so that it contains that model's 1, 3, 5, ..., and 25 most important heads, as obtained via notebooks/ph3/get_attention_head_plots.ipynb.
Now we know what PH3 head setting works best for each model and dataset. Update the scripts in e.g. scripts/ph3/gpt2_xl/top_config and scripts/ph3/pythia_6_9/top_config with the tuned head configurations and run them to collect the PH3 results for CUB.
The prompting CMT is implemented as follows for CUB:
- Curate the prompts: Add the prompts to be experimented with for each dataset under src/prompt_tuning/<CUB-dataset>_prompts.py (see the sketch after this list). As described in the paper, we use different prompts depending on dataset and model type; the method for selecting the prompts is also described in the paper.
- Evaluate the prompts: Use the scripts under scripts/prompt_tuning to measure the performance of each prompt on the validation splits of CUB. Record the best-performing prompt for each setting.
- Benchmark on CUB: Run the best prompts on the test splits of CUB using scripts/prompt_tuning/<CUB-dataset>/<model>_tuned.sh. Note: You need to specify the optimal prompt before running these scripts; this has already been filled in for the experiments in the paper.
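For reference, a prompts file might look like the following. This is a hypothetical sketch of the structure only, not the actual contents of any src/prompt_tuning/<CUB-dataset>_prompts.py file; the templates and helper are illustrative.

# Hypothetical example of a prompts file, e.g. src/prompt_tuning/nq_prompts.py.
# Each template receives the retrieved context and the question.
PROMPTS = [
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "Answer the question based only on the given context.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:",
    "Use the context if it is relevant, otherwise rely on your own knowledge.\n"
    "Context: {context}\nQuestion: {question}\nAnswer:",
]

def format_prompt(template: str, context: str, question: str) -> str:
    return template.format(context=context, question=question)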
We follow the method by Li et al. (2023) to perform knowledge-aware fine-tuning (KAFT). KAFT fine-tunes LMs to generate answers that align with the provided context, even when the context conflicts with the model's parametric knowledge, and to be robust to irrelevant context. To this end, we need to sample a custom fine-tuning dataset that reflects information already encoded in the model, together with gold, conflicting and irrelevant contexts.
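For intuition, the target-selection logic behind KAFT can be sketched as follows. This is an illustrative reading of the method, not the repo's exact code; all names and the context-type labels are hypothetical.

from dataclasses import dataclass

@dataclass
class KaftExample:
    question: str
    context: str
    target: str

def make_example(question, context, context_answer, parametric_answer, context_type):
    # context_type is one of "gold", "conflicting", "irrelevant" (illustrative labels).
    if context_type in ("gold", "conflicting"):
        # Train the model to follow the context, even when it conflicts
        # with what the model already "knows".
        target = context_answer
    else:
        # Irrelevant context: train the model to ignore it and answer
        # from parametric knowledge.
        target = parametric_answer
    return KaftExample(question, context, target)

example = make_example(
    "Who wrote Pride and Prejudice?",
    "Pride and Prejudice was written by Charlotte Brontë.",  # conflicting context
    context_answer="Charlotte Brontë",
    parametric_answer="Jane Austen",
    context_type="conflicting",
)
assert example.target == "Charlotte Brontë"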
The fine-tuning CMT thus involves four main steps: 1) prepare the dataset sources for fine-tuning, 2) sample the fine-tuning dataset from the sources, 3) fine-tune the selected models on the dataset, and 4) evaluate the fine-tuned models on CUB.
To ensure that the fine-tuning CMT works well across different domains and task setups we curate a fine-tuning dataset based on four different data sources: SQuAD 2.0, TriviaQA, AVeriTeC and DYNAMICQA. More details on the data sources can be found in the appendix of the paper (Implementation Details of Fine-tuning).
python -m src.finetuning.data_prep.triviaqa_label_relevance

The original TriviaQA samples do not include relevance labels; we therefore label each question-context pair by checking for an exact match of the answer within the context. The script creates two files, triviaqa_irrelevance.json and triviaqa_relevance.json, which are used during fine-tuning.
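A minimal sketch of this exact-match labelling (illustrative; the normalisation details in the script may differ):

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def is_relevant(answer: str, context: str) -> bool:
    # A context is labelled relevant iff the answer string appears
    # verbatim (after light normalisation) within the context.
    return normalise(answer) in normalise(context)

assert is_relevant("Paris", "The capital of France is Paris.")
assert not is_relevant("Paris", "The capital of Italy is Rome.")

Next, collect the models' parametric (closed-book) predictions: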
python -m src.finetuning.data_prep.query_parametric \
--model_name <model-name-on> \
--dataset <dataset>

Possible values for dataset: ['squad', 'triviaqa', 'averitec', 'dynamicqa'].
The model names are the same as those on Hugging Face. For each dataset, the script creates a JSON file containing the indices of the samples that the model answers correctly without the context.
To run this script, you need access to a GPU (e.g. an Nvidia CUDA device) and to install vLLM via uv add vllm.
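A rough sketch of what such closed-book querying can look like with vLLM. This is illustrative only; the prompts, parsing and matching in src.finetuning.data_prep.query_parametric may differ, and the model name is just an example.

from vllm import LLM, SamplingParams

# Example model name; any HF causal LM supported by vLLM works here.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)

questions = ["Who wrote Pride and Prejudice?"]
answers = ["Jane Austen"]

# Query the model closed-book, i.e. without any context.
outputs = llm.generate([f"Question: {q}\nAnswer:" for q in questions], params)

# Keep the indices of the questions the model answers correctly on its own.
correct_idx = [
    i for i, out in enumerate(outputs)
    if answers[i].lower() in out.outputs[0].text.lower()
]
print(correct_idx)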
To ensure that all models are trained on the same data points in the same order, we sample from the full datasets according to the mixing rates. The details of the mixing rates can be found in the paper (Table 17). The script below creates the sampled_train_data.json file.

python -m src.finetuning.data_prep.create_train_sequence
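A sketch of the idea behind deterministic mixed sampling. The mixing rates, total sample count and pool file paths below are hypothetical; the real rates are in Table 17 of the paper.

import json
import random

# Hypothetical mixing rates; see Table 17 in the paper for the real ones.
MIX = {"squad": 0.4, "triviaqa": 0.3, "averitec": 0.2, "dynamicqa": 0.1}
TOTAL = 10_000  # assumed total number of training samples

rng = random.Random(42)  # fixed seed => same data points in the same order
train = []
for name, rate in MIX.items():
    with open(f"data/{name}_pool.json") as f:  # hypothetical pool files
        pool = json.load(f)
    train.extend(rng.sample(pool, int(rate * TOTAL)))

rng.shuffle(train)
with open("sampled_train_data.json", "w") as f:
    json.dump(train, f)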
As the final step, we perform the fine-tuning on the curated dataset. The setup is adapted to the type of LM being fine-tuned (instruct or non-instruct). After this step, the saved fine-tuned model checkpoints are ready for benchmarking.
python -m src.finetuning.fine_tune
python -m src.finetuning.finetune_instruct
The fine-tuned models are then evaluated on CUB. See the evaluation scripts for Qwen 32B under scripts/finetuning/qwen_32 for an example of what the scripts look like for evaluating the fine-tuned models. For other LMs, only the --model_name and --saved_model arguments need to be changed.
The multi-agent method requires the full generation output of the Regular open-book baseline, not just the first token prediction.
- Full generation outputs for closed-source LMs are already available in the CUB Hugging Face datasets.
- For open-source LMs, run the script below to generate full open-book (naive_w_context) outputs.
CUDA_VISIBLE_DEVICES=$1 python -m src.multi_agent.prepare_open_book \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name $DATASET \
--data_split "test" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "naive_w_context" Run the following two expert modules. Set MODEL_TYPE to "open" for open-source LMs and "closed" for closed-source LMs that require API-based inference (OpenAI).
CUDA_VISIBLE_DEVICES=$CUDA python -m src.multi_agent.multi_agent \
--model_name $MODEL_NAME \
--data_name $DATASET \
--model_path $MODEL_PATH \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--model_type $MODEL_TYPE \
--expert "context_faithfulness"CUDA_VISIBLE_DEVICES=$CUDA python -m src.multi_agent.multi_agent \
--model_name $MODEL_NAME \
--data_name $DATASET \
--model_path $MODEL_PATH \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--model_type $MODEL_TYPE \
--expert "relevance_assessment" The judge module makes the final decision based on the expert modules and refines the final answer.
Set MODEL_TYPE to "open" for open-source LMs and "closed" for closed-source LMs that require API-based inference (OpenAI).
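Conceptually, the judge can be thought of as combining the two expert verdicts roughly as follows. This is a purely illustrative sketch, not the logic implemented in src.multi_agent.judge; all names are hypothetical.

def judge(open_book_answer, closed_book_answer,
          context_is_relevant, answer_is_faithful):
    # context_is_relevant: verdict of the relevance_assessment expert.
    # answer_is_faithful: verdict of the context_faithfulness expert.
    if not context_is_relevant:
        # Irrelevant context: fall back to the model's parametric answer.
        return closed_book_answer
    if answer_is_faithful:
        # Relevant context and a faithful open-book answer: keep it.
        return open_book_answer
    # Relevant context but an unfaithful answer: the judge refines the
    # answer (in the real pipeline, by re-prompting the model).
    return open_book_answer  # placeholder for the refined answer

The judge is then invoked as follows: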
CUDA_VISIBLE_DEVICES=$CUDA python -m src.multi_agent.judge \
--model_name $MODEL_NAME \
--data_name $DATASET \
--model_path $MODEL_PATH \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--model_type $MODEL_TYPE \
--method_name "multi_agent" \
--expert_list "context_faithfulness"The raw output files from the previous steps need to be reformatted to work with the CUB analysis. Use the notebook below to standardize the output format and naming convention. Run notebooks/multi_agent/reformat_multi_agent_results.ipynb to reformat the output fields and save the final result files.
To evaluate the model on CUB with ACD, run the script below:
CUDA_VISIBLE_DEVICES=$1 python -m src.eval_cd \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name $DATASET \
--data_split "test" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "acd" COIECD requires a hyperparameter search phase to find the optimal settings for each dataset, followed by evaluation with those settings.
COIECD requires a hyperparameter search phase to find the optimal settings for each dataset, followed by evaluation with those settings. Following the COIECD paper, the optimal hyperparameters (lambda and alpha) are selected by evaluating on the validation split of the NQ dataset using gold context.
for lambda in 0.1 0.25 0.5 1.0
do
for alpha in 0.0 0.5 1.0 1.5 2.0
do
CUDA_VISIBLE_DEVICES=$1 python -m src.eval_cd \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name "copenlu/cmt-benchmark-nq" \
--data_split "validation" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "coiecd" \
--threshold $lambda \
--alpha $alpha
done
done

Run notebooks/coiecd/coiecd_hp_search.ipynb to identify the best hyperparameter combination.
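Conceptually, the notebook picks the (lambda, alpha) pair with the best validation accuracy. A toy sketch of that selection, assuming hypothetical per-setting result files (the actual paths and metrics are set in the notebook):

import json
from itertools import product

best = None
for lam, alpha in product([0.1, 0.25, 0.5, 1.0], [0.0, 0.5, 1.0, 1.5, 2.0]):
    # Hypothetical result-file naming convention.
    with open(f"results/coiecd_nq_val_l{lam}_a{alpha}.json") as f:
        acc = json.load(f)["accuracy"]
    if best is None or acc > best[0]:
        best = (acc, lam, alpha)

print(f"best accuracy={best[0]:.3f} at lambda={best[1]}, alpha={best[2]}")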
Once the optimal lambda and alpha values have been identified from the hyperparameter search, use them to evaluate the model on the test split of CUB.
CUDA_VISIBLE_DEVICES=$1 python -m src.eval_cd \
--model_name $MODEL_NAME \
--model_path $MODEL_PATH \
--data_name $DATASET \
--data_split "test" \
--data_path $DATA_PATH \
--result_path $RESULT_PATH \
--method_name "coiecd" \
--threshold $lambda \
--alpha $alpha

Once all results have been collected on CUB, we analyse them and create the plots for the paper using the following notebooks:
- notebooks/get_eval_plots.ipynb
  - General results analysis.
  - Also creates tables with detailed results for the appendix.
- notebooks/get_eval_plots_model_reg_diff.ipynb
  - Creates a plot showing the model-averaged relative performance of each CMT compared to Regular across datasets and context types.
- notebooks/explain_results.ipynb
  - Performs a correlation analysis of the results on CUB.
- notebooks/analysis_pareto.ipynb
  - Finds Pareto-optimal CMTs considering both faithfulness (gold and conflicting contexts) and robustness (irrelevant contexts) of methods and models.
- notebooks/multi_agent/get_eval_relevance_expert.ipynb
  - Evaluates the accuracy of the relevance expert of the multi-agent method.
We also analyse the CUB datasets using the notebook below:
- notebooks/check_data_stats.ipynb
  - Measures context lengths.
Please cite our paper as follows:
TBD
