Repository for the research project "Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models".
SemanticQA supports 18 tasks across four phrase types: the 12 tasks listed below plus 6 sequential (multi-step) tasks.
| Task | Abbr. | Eval Metrics | Phrase Type |
|---|---|---|---|
| Idiomatic Expression Detection | IED | MCQ Accuracy | Idiom |
| Idiomatic Expression Extraction | IEE | Exact Match | Idiom |
| Idiomatic Expression Interpretation | IEI | ROUGE-L, BERTScore-F1, METEOR, BLEU | Idiom |
| Noun Compound Compositionality | NCC | MCQ Accuracy | Noun Compound |
| Noun Compound Extraction | NCE | Exact Match | Noun Compound |
| Noun Compound Interpretation | NCI | ROUGE-L, BERTScore-F1, METEOR, BLEU | Noun Compound |
| Lexical Collocation Categorization | LCC | Accuracy, Macro/Micro/Weighted F1 | Collocation |
| Lexical Collocation Extraction | LCE | Exact Match | Collocation |
| Lexical Collocation Interpretation | LCI | ROUGE-L, BERTScore-F1, METEOR, BLEU | Collocation |
| Collocate Retrieval | CR | Exact Match | Collocation |
| Collocation Identification | CI | Accuracy | Collocation |
| Verbal Multiword Expression Extraction | VMWE | Exact Match | Verbal MWE |
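The scoring conventions in the table split into choice-based metrics (MCQ accuracy) and span-based metrics (exact match). As a minimal illustrative sketch — not the repository's actual implementation in `eval.py` — the two can be computed like this:

```python
def mcq_accuracy(predictions, answers):
    """Fraction of multiple-choice predictions matching the gold answer."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def exact_match(predicted_spans, gold_spans):
    """Fraction of extracted spans that string-match the gold span,
    after lowercasing and collapsing whitespace (a common convention;
    the repository's normalization may differ)."""
    def norm(s):
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(g) for p, g in zip(predicted_spans, gold_spans))
    return hits / len(gold_spans)
```

The interpretation tasks instead use generation metrics (ROUGE-L, BERTScore-F1, METEOR, BLEU), which compare model output against reference paraphrases rather than requiring an exact string match.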
Sequential (multi-step) tasks combine extraction with judgment or interpretation:
- Idiom / Collocation / Noun Compound Extraction + Judgment
- Idiom / Collocation / Noun Compound Extraction + Interpretation
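A sequential task chains the two steps: the model first extracts the target phrase, then judges or interprets what it extracted. A hypothetical sketch of that pipeline (the prompt wording and helper name are illustrative, not the repository's templates):

```python
def run_sequential_task(sentence, query_model):
    """Two-step extraction + interpretation pipeline.

    query_model: any callable mapping a prompt string to a model response
    (e.g., a wrapper around one of the provider APIs listed below).
    """
    # Step 1: extract the target phrase from the sentence.
    phrase = query_model(f"Extract the idiomatic expression in: {sentence}")
    # Step 2: interpret the extracted phrase in its original context.
    meaning = query_model(f"Paraphrase the meaning of '{phrase}' in context: {sentence}")
    return {"phrase": phrase, "interpretation": meaning}
```

Because step 2 consumes step 1's output, extraction errors propagate, which is what makes the sequential settings harder than their single-step counterparts.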
| Provider | Models |
|---|---|
| OpenAI | GPT-3.5-Turbo (0301/0613/1106), GPT-4 (0314/0613/Turbo/4o), GPT-5, o3, text-davinci-003 |
| Anthropic | Claude Instant, Claude 2 (2.0/2.1), Claude 3 (Sonnet/Opus), Claude Sonnet 4.5 |
| Google | Gemini Pro (1.5/latest), Gemini 2.5 Pro, Gemma 3 27B IT |
| DeepSeek | DeepSeek-Chat, DeepSeek-R1 |
| Zhipu AI | GLM-4.6 |
| Alibaba | Qwen3 (8B/14B/32B/235B-A22B) |
| Moonshot | Kimi K2 Instruct |
| Open-weight | Llama 2 (7B/13B/70B), Vicuna (7B/13B), Mistral 7B, Mixtral 8x7B, ChatGLM (2/3-6B), Yi-6B |
```
SemanticQA/
├── resources/                       # External datasets and source repositories
│   ├── dataset.zip                  # Prepared benchmark data (unzip to use)
│   ├── AStitchInLanguageModels/     # Idiom & noun compound datasets
│   ├── ID10M/                       # Idiom extraction dataset
│   ├── CollFrEn/                    # French-English collocation data
│   ├── lexcomp/                     # Lexical compositionality classifiers
│   ├── lexfunc/                     # Lexical function data
│   ├── lexicalcollocations/         # Collocation datasets
│   ├── LexNET/                      # Lexical network corpus
│   ├── noun-compound-interpretation/ # NC interpretation data
│   ├── pronci/                      # NC interpretation with transformers
│   └── graph-aware-collocation-recognition/
├── scripts/                         # Data preparation & utility scripts
│   ├── data.py                      # Data preparation for all tasks
│   ├── download_dataset.sh          # Download raw datasets from sources
│   ├── calc_mean_sd.py              # Compute mean & std for BERT results
│   └── tsv2xlsx.py                  # Convert TSV results to XLSX
├── semantic_qa/                     # Main source code
│   ├── main.py                      # Entry point for running evaluations
│   ├── args.py                      # CLI argument definitions
│   ├── eval.py                      # Evaluation metrics implementation
│   ├── utils.py                     # I/O and prompt utilities
│   ├── data_utils.py                # Data loading & preprocessing
│   ├── model/                       # Model query interfaces (OpenAI, Claude, Gemini, local)
│   ├── prompts/                     # Zero-shot & few-shot prompt templates
│   ├── taxonomy/                    # Semantic relation taxonomies (8/16 categories)
│   ├── training/                    # Fine-tuning scripts (encoder LCC, T5 paraphrasing)
│   ├── type/                        # Lexical function category mappings
│   ├── tests/                       # Test scripts
│   └── results/                     # Output directory
└── environment.yml                  # Conda environment config
```
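`scripts/calc_mean_sd.py` aggregates scores from repeated runs into a mean and standard deviation. An illustrative stand-in using Python's `statistics` module (the scores are made-up numbers, not results from the benchmark):

```python
from statistics import mean, stdev

# Hypothetical BERTScore-F1 results from three runs of the same task.
scores = [61.2, 59.8, 60.5]

# Report as mean ± sample standard deviation.
print(f"{mean(scores):.2f} ± {stdev(scores):.2f}")  # → 60.50 ± 0.70
```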
Download raw datasets:

```bash
# Available: asilm, id10m, pie, ncc, nci, nce, vmwe
./scripts/download_dataset.sh asilm
./scripts/download_dataset.sh pie
./scripts/download_dataset.sh vmwe
```

Or unzip the prepared benchmark data:

```bash
unzip resources/dataset.zip -d SemanticQA/
```

Set up the environment:

```bash
conda env create -f environment.yml
conda activate lexbench
cd semantic_qa
pip install -r requirements.txt
```

All evaluations are launched from `semantic_qa/` via `main.py`.
Example — idiom interpretation with gpt-5 (zero-shot):

```bash
python main.py \
  --task idiom-paraphrase \
  --api_key <YOUR_API_KEY> \
  --model gpt-5 \
  --prompt_path prompts/idiom_paraphrase_zeroshot.txt \
  --example_path dataset/idiom_paraphrase/prepared/examples.tsv \
  --input_path dataset/idiom_paraphrase/prepared/idiom_paraphrase_prepared.tsv \
  --output_path results/idiom-paraphrase_0-shot_gpt-5.json \
  --evaluate \
  --shot_num 0 \
  --max_query 1000 \
  --max_tokens 128 \
  --temperature 0 \
  --presence_penalty 0 \
  --frequency_penalty 0
```

Key arguments:
| Argument | Description |
|---|---|
| `--task` | Task name (see supported tasks above) |
| `--model` | Model identifier |
| `--api_key` | API key for the model provider |
| `--base_url` | Custom API base URL (optional) |
| `--prompt_path` | Path to prompt template file |
| `--example_path` | Path to few-shot examples (optional) |
| `--taxonomy_path` | Path to taxonomy file (for LCC tasks) |
| `--shot_num` | Number of few-shot examples |
| `--evaluate` | Run evaluation after generation |
| `--oracle_prompt` | Use oracle prompt with ground truth hints |
| `--max_query` | Maximum number of API requests |
| `--max_tokens` | Max generated tokens per request |
| `--temperature` | Sampling temperature (0–2) |
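The actual argument definitions live in `semantic_qa/args.py`; as a rough sketch of how a subset of this CLI surface could be declared with `argparse` (defaults and types here are assumptions, not the repository's exact definitions):

```python
import argparse

def build_parser():
    # Illustrative subset of the arguments documented above.
    p = argparse.ArgumentParser(prog="main.py")
    p.add_argument("--task", required=True, help="task name, e.g. idiom-paraphrase")
    p.add_argument("--model", required=True, help="model identifier, e.g. gpt-5")
    p.add_argument("--api_key", help="API key for the model provider")
    p.add_argument("--base_url", help="custom API base URL (optional)")
    p.add_argument("--shot_num", type=int, default=0, help="number of few-shot examples")
    p.add_argument("--temperature", type=float, default=0.0, help="sampling temperature")
    p.add_argument("--evaluate", action="store_true", help="run evaluation after generation")
    return p

args = build_parser().parse_args(
    ["--task", "idiom-paraphrase", "--model", "gpt-5", "--shot_num", "0", "--evaluate"]
)
```

Note that `--evaluate` is a boolean flag (present or absent), while the numeric options take explicit values.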
Run collocation categorization across different taxonomy sizes (1/2/4/8/16 categories):

```bash
cd semantic_qa
./run_lcc_scaling.sh
```

Evaluate existing result files directly:
```bash
python eval.py --task lcc --result_file_path results/lcc_result.json
python eval.py --task vmwe --result_file_path results/vmwe_result.json
python eval.py --task iep --result_file_path results/idiom_paraphrase_result.json
```

Fine-tuning scripts are available under `semantic_qa/training/`:
- `collocation-categorization/` — fine-tune encoder models for LCC
- `semantic-phrases-interpretation/` — fine-tune T5 for paraphrasing tasks
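The LCC evaluation above reports macro, micro, and weighted F1, which differ in how per-class scores are averaged. A minimal pure-Python sketch of the distinction (not the repository's implementation in `eval.py`):

```python
from collections import Counter

def f1_scores(golds, preds):
    """Per-class F1 plus macro, micro, and weighted averages
    for single-label multiclass predictions."""
    labels = sorted(set(golds) | set(preds))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(golds, preds):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    per_class = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    support = Counter(golds)
    macro = sum(per_class.values()) / len(labels)          # unweighted class mean
    weighted = sum(per_class[c] * support[c] for c in labels) / len(golds)
    micro = sum(tp.values()) / len(golds)                  # == accuracy in this setting
    return per_class, macro, micro, weighted
```

Macro F1 treats rare and frequent collocation categories equally, so it diverges from micro/weighted F1 as the taxonomy grows and class imbalance increases — which is why the LCC scaling experiment reports all three.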
```bibtex
@article{liu2024revisiting,
  title={Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models},
  author={Liu, Yang and Qin, Melissa Xiaohui and Li, Hongming and Huang, Chao},
  journal={arXiv preprint arXiv:2405.02861},
  year={2024}
}
```

MIT License - see LICENSE for details.