This repository contains the evaluation code for the DIWALI (Diversity and Inclusivity aWare cuLture specific Items for India) project, which assesses Large Language Models (LLMs) for cultural text adaptation in the Indian context.
This codebase evaluates LLMs on their ability to perform culturally-aware text adaptation for the Indian context. It includes:
- Culturally adapted versions of GSM-8K math word problems
- Concept extraction and matching for cultural items
- Evaluation across multiple state-of-the-art LLMs
- Support for both English and Bengali language adaptations
- Integration with the CANDLE dataset for cultural concept analysis
```
culture-evaluation/
├── adaptations/                  # Culturally adapted test sets
│   └── gsm-8k/                   # GSM-8K adaptations for various models
│       ├── bengali_prompt/
│       └── *_gsm_8k_test.json
├── CANDLE/                       # CANDLE dataset processing
│   ├── concept_extraction.py
│   ├── concepts.jsonl
│   └── concepts_bengali.jsonl
├── concept_matching/             # Concept matching evaluation scripts
│   └── concept_matching_*.py     # Scripts for different models
├── concept_matching_bengali/     # Bengali concept matching
├── data/                         # Data loading utilities
│   └── data.py
├── dataset/                      # Base datasets
│   └── gsm-8k/
│       ├── train.jsonl
│       └── test.jsonl
├── MGSM/                         # Multilingual GSM-8K (Bengali)
│   ├── mgsm_bn.tsv
│   ├── concept_matching_bn_prompt/
│   └── output_bn_prompt/
├── model/                        # Model inference scripts
│   ├── offline_copy_replaced_words.py
│   ├── mgsm_eng_propmpt.py
│   ├── mgsm_ben_propmpt.py
│   └── scoring-*.py              # Scoring scripts for different models
├── our_csis/                     # Custom culture-specific items for India
│   ├── concept_matching/
│   ├── csis/
│   └── india_map/
├── output/                       # Evaluation results
├── run.sh                        # Main evaluation script (English)
└── run_bengali.sh                # Bengali evaluation script
```
- Python 3.8+
- CUDA-capable GPU (recommended)
- Required Python packages (see installation section)
- Clone the repository:

  ```bash
  git clone https://github.com/pramitsahoo/culture-evaluation.git
  cd culture-evaluation
  ```

- Install dependencies:

  ```bash
  pip install torch transformers pandas datasets
  ```

- Download the DIWALI dataset from HuggingFace:

  ```bash
  # The dataset will be automatically downloaded when running evaluation scripts
  # Or manually download from: https://huggingface.co/datasets/nlip/DIWALI
  ```

- (Optional) Download the CANDLE dataset:

  ```bash
  # Download from: https://www.mpi-inf.mpg.de/fileadmin/inf/d5/research/candle/candle_dataset_v1.jsonl.zip
  # Extract to CANDLE/ directory
  ```
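The DIWALI dataset can also be pulled programmatically. The sketch below uses the `datasets` library; the configuration and split names are assumptions, so check the dataset card at https://huggingface.co/datasets/nlip/DIWALI for the actual layout:

```python
# Minimal sketch: load the DIWALI dataset from the HuggingFace Hub.
# NOTE: split/config names are assumptions; consult the dataset card
# (https://huggingface.co/datasets/nlip/DIWALI) for the actual layout.
from datasets import load_dataset

diwali = load_dataset("nlip/DIWALI")       # downloads and caches the dataset
print(diwali)                              # lists available splits and columns
first_split = next(iter(diwali.values()))  # take whichever split exists
print(first_split[0])                      # inspect one example record
```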
Evaluate a model on culturally adapted GSM-8K problems with English prompts:

```bash
./run.sh <DEVICE_ID> <MODEL_NAME>
```

Example:

```bash
./run.sh 0 llama-3
```

Evaluate a model with Bengali prompts:

```bash
./run_bengali.sh <DEVICE_ID> <MODEL_NAME>
```

Example:

```bash
./run_bengali.sh 0 llama-3
```

The framework supports evaluation of the following models:
- `llama-2` (LLaMA-2 7B)
- `llama-3` (LLaMA-3 8B)
- `llama-3.2-1b` (LLaMA-3.2 1B)
- `llama-3.2-3b` (LLaMA-3.2 3B)
- `mistral` (Mistral 7B)
- `gemma-2-2b` (Gemma-2 2B)
- `gemma-2-9b` (Gemma-2 9B)
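To load one of these models outside the provided scripts, a rough sketch of a mapping from the short names above to HuggingFace checkpoints is shown below. The checkpoint identifiers are assumptions and may differ from the ones used in the scripts under `model/`; gated checkpoints (LLaMA, Gemma) also require HuggingFace authentication:

```python
# Sketch: map the framework's short model names to plausible HuggingFace
# checkpoints and load one with transformers. The checkpoints are assumptions;
# verify them against the scripts in model/ before relying on this mapping.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_MAP = {
    "llama-2":      "meta-llama/Llama-2-7b-hf",
    "llama-3":      "meta-llama/Meta-Llama-3-8B",
    "llama-3.2-1b": "meta-llama/Llama-3.2-1B",
    "llama-3.2-3b": "meta-llama/Llama-3.2-3B",
    "mistral":      "mistralai/Mistral-7B-v0.1",
    "gemma-2-2b":   "google/gemma-2-2b",
    "gemma-2-9b":   "google/gemma-2-9b",
}

def load_model(short_name):
    """Load tokenizer and model for one of the supported short names."""
    checkpoint = MODEL_MAP[short_name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
    return tokenizer, model
```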
To evaluate concept matching performance:

```bash
cd concept_matching
python concept_matching_<model_name>.py
```

Example:

```bash
python concept_matching_llama2.py
```
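Conceptually, concept matching checks whether an adapted text actually mentions Indian culture-specific items. The toy sketch below illustrates only that idea with simple substring matching; the per-model scripts in `concept_matching/` are more involved, and the CSI list and sentence here are invented for illustration:

```python
# Toy sketch of concept matching: find which culture-specific items (CSIs)
# appear in an adapted text. The scripts in concept_matching/ are model-specific
# and more sophisticated; this only illustrates the underlying idea.
def match_concepts(adapted_text, csis):
    """Return the CSIs that occur (case-insensitively) in the adapted text."""
    text = adapted_text.lower()
    return [csi for csi in csis if csi.lower() in text]

# Hypothetical example; the CSI list and sentence are invented for illustration.
csis = ["rasgulla", "diya", "kurta", "mandi"]
adapted = "Ravi bought 12 rasgullas at the mandi and shared them with 3 friends."
matched = match_concepts(adapted, csis)
print(f"matched {len(matched)}/{len(csis)} CSIs: {matched}")
```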
To run custom evaluations, modify the parameters in the run scripts or directly call the Python scripts:

```bash
python model/offline_copy_replaced_words.py \
    --dataset gsm_8k \
    --split test \
    --model llama-3 \
    --output_dir ./output/custom_evaluation/
```

Evaluation results are saved in the `output/` directory with the following structure:
```
output/
├── new_prompt_english/        # English prompt results
│   └── <model_name>/
│       ├── predictions.json
│       └── scores.txt
├── bengali/                   # Bengali prompt results
└── LLM_Scores_*/              # Detailed scoring results
```
Each output file contains:
- Model predictions
- Cultural adaptation scores
- Concept matching accuracy
- Detailed explanations (if enabled)
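As a quick way to inspect a finished run, you can load a model's predictions file directly. The sketch below assumes `predictions.json` holds a list of prediction records and `scores.txt` is a plain-text summary; the exact schema written by the scoring scripts may differ:

```python
# Sketch: inspect evaluation output for one model. Assumes predictions.json is
# a JSON list of prediction records and scores.txt is a plain-text summary;
# the actual schema produced by the scoring scripts may differ.
import json
from pathlib import Path

model_name = "llama-3"  # one of the supported short names
run_dir = Path("output/new_prompt_english") / model_name

with open(run_dir / "predictions.json", encoding="utf-8") as f:
    predictions = json.load(f)

print(f"loaded {len(predictions)} prediction records")
print((run_dir / "scores.txt").read_text(encoding="utf-8"))  # summary scores
```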
The DIWALI dataset is available on HuggingFace: nlip/DIWALI
It contains:
- Culturally adapted math word problems
- Culture-specific items (CSIs) for India
- Mappings between source and target cultural concepts
- Human annotations for quality assessment
This evaluation uses culturally adapted versions of the GSM-8K dataset, where original cultural references are replaced with Indian cultural equivalents while maintaining mathematical complexity and reasoning requirements.
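For intuition, an adapted item keeps the arithmetic of the source problem and swaps only the culture-specific references. The pair below is invented purely for illustration and is not taken from the dataset:

```python
# Hypothetical (invented) example of a cultural adaptation; not from the dataset.
original = (
    "Emma buys 3 boxes of cupcakes for a Thanksgiving party. "
    "Each box costs $4. How much does she spend?"
)
adapted = (
    "Ananya buys 3 boxes of laddoos for a Diwali party. "
    "Each box costs 4 rupees. How much does she spend?"
)
# Both versions require the same reasoning: 3 * 4 = 12.
```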
- Cultural Adaptation: Systematic replacement of culture-specific items with Indian equivalents
- Multilingual Support: Evaluation in both English and Bengali
- Multiple LLMs: Support for various state-of-the-art language models
- Concept Matching: Evaluation of models' understanding of cultural concepts
- CANDLE Integration: Leverages the CANDLE dataset for cultural concept extraction
- Comprehensive Metrics: Detailed scoring and explanation generation
Detailed results and analysis can be found in the paper. The evaluation shows varying performance across different LLMs in handling culturally adapted content, with implications for developing more inclusive AI systems.
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is licensed under the terms specified in the LICENSE file.
For questions or issues, please:
- Open an issue on GitHub
- Contact the authors through the project page
For more details, please refer to the paper and project page.
If you use the dataset or this codebase, please cite:

```bibtex
@inproceedings{sahoo-etal-2025-diwali,
    title = "{DIWALI} - Diversity and Inclusivity a{W}are cu{L}ture specific Items for {I}ndia: Dataset and Assessment of {LLM}s for Cultural Text Adaptation in {I}ndian Context",
    author = "Sahoo, Pramit and
      Brahma, Maharaj and
      Desarkar, Maunendra Sankar",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1706/",
    pages = "33587--33614",
    ISBN = "979-8-89176-332-6",
    abstract = "Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment (CITATION) and produce biased generations (CITATION) due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, belonging to 17 cultural facets. The dataset comprises {\textasciitilde}8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM as Judge, and human evaluations from diverse socio-demographic region. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available here: https://huggingface.co/datasets/nlip/DIWALI, project webpage, and our codebase with model outputs can be found here: https://github.com/pramitsahoo/culture-evaluation."
}
```