This repository contains implementations for analyzing toxicity detection models using concept-based interpretability techniques. The code is organized into two separate experiments for the Civil Comments and HateXplain datasets.
.
├── Artifacts/           # Output directory for both experiments
├── backup-patches/      # Backup files
├── CivilComments-EXP/   # Civil Comments experiment
├── HateXplain-EXP/      # HateXplain experiment
├── .gitignore
└── README.md
Navigate to the CivilComments-EXP directory:
cd CivilComments-EXP

Train models:
python target_model.py # Train toxicity detection model
python concept_model.py # Train concept prediction model

Run analysis:
python cg_apply_automated.py # Run concept gradient analysis
python tcav_apply.py # Run TCAV analysis
python get_stats.py # Generate statistics

Concepts analyzed:
- Obscene
- Threat
- Sexual Explicit
- Insult
- Identity Attack
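To make the tcav_apply.py step above more concrete, below is a minimal, generic sketch of how a TCAV-style score for one of these concepts can be computed from layer activations and gradients. It is illustrative only: the activation extraction, the classifier choice, and all function names are assumptions for this sketch, not the repository's actual implementation.

```python
# Illustrative TCAV-style sketch, assuming activations and target-logit gradients
# have already been extracted from a chosen RoBERTa layer. All names are hypothetical.
import numpy as np

def learn_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Fit a linear separator between concept and random activations and return the
    Concept Activation Vector (CAV): the unit normal pointing toward the concept."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    for _ in range(500):                      # tiny logistic regression, no extra deps
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= 0.1 * (X.T @ (p - y)) / len(y)
    return w / (np.linalg.norm(w) + 1e-12)

def tcav_score(target_grads: np.ndarray, cav: np.ndarray) -> float:
    """Fraction of inputs whose directional derivative along the CAV is positive,
    i.e. how often moving toward the concept increases the toxicity logit."""
    return float((target_grads @ cav > 0).mean())

# Hypothetical usage with random arrays standing in for real activations/gradients.
rng = np.random.default_rng(0)
cav = learn_cav(rng.normal(1.0, 1.0, (64, 768)), rng.normal(0.0, 1.0, (64, 768)))
print("TCAV score:", tcav_score(rng.normal(0.5, 1.0, (128, 768)), cav))
```

A score near 1 means that, for most inputs, moving the layer activations toward the concept direction increases the predicted toxicity.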
Navigate to the HateXplain-EXP directory:
cd HateXplain-EXP

Train models:
python target_model.py # Train hate speech detection model
python concept_model.py # Train concept prediction model

Run analysis:
python cg_apply_automated.py # Run concept gradient analysis
python tcav_apply.py # Run TCAV analysis
python get_stats.py # Generate statistics

Concepts analyzed:
- Race
- Religion
- Gender
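The cg_apply_automated.py step estimates how the target prediction changes with each concept. One common formulation of concept-gradient attribution (stated here as a general sketch, not a transcription of the repository's code) chain-rules the target gradient through the concept gradient at a shared layer; for a single scalar concept this reduces to projecting the target gradient onto the concept gradient:

```python
# Generic concept-gradient sketch, assuming both gradients are taken with respect to
# the activations of the same RoBERTa layer. Shapes and names are hypothetical.
import torch

def concept_gradient(target_grad: torch.Tensor, concept_grad: torch.Tensor) -> torch.Tensor:
    """Approximate dy/dc per example as (grad_a y . grad_a c) / ||grad_a c||^2,
    i.e. the target gradient projected onto the concept direction."""
    num = (target_grad * concept_grad).sum(dim=-1)
    denom = concept_grad.pow(2).sum(dim=-1) + 1e-12
    return num / denom

tg = torch.randn(8, 768)   # d(hate/toxicity logit)/d(activations), hypothetical batch
cg = torch.randn(8, 768)   # d(concept logit)/d(activations), same layer
print(concept_gradient(tg, cg))
```

Positive values indicate that increasing the concept (e.g. Race, Religion, or Gender cues) increases the model's hate prediction for that example.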
Install dependencies:
pip install torch transformers datasets pandas numpy matplotlib seaborn wordcloud tqdm captum

For each experiment, prepare your data in the following structure:
dataset/
├── train.csv
├── dev.csv
└── test.csv
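As a quick sanity check of this layout, the splits can be read with pandas. The column names below ("text", "toxic") are placeholders for illustration; the actual column names expected by the scripts may differ.

```python
# Minimal loading sketch; column names are assumptions, not the scripts' contract.
import pandas as pd

splits = {name: pd.read_csv(f"dataset/{name}.csv") for name in ("train", "dev", "test")}
print({name: len(df) for name, df in splits.items()})

train = splits["train"]
texts = train["text"].tolist()                    # input comments (assumed column)
toxic_labels = train["toxic"].astype(int).values  # binary target (assumed column)
```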
Both experiments use:
- Base Architecture: RoBERTa
- Target Model: Binary classification (toxic/non-toxic or hate/normal)
- Concept Model: Multi-label concept classification
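Below is a minimal sketch of those two heads on top of RoBERTa, assuming the "roberta-base" checkpoint and pooling on the <s> token; the actual target_model.py and concept_model.py may use a different checkpoint, pooling, or head design.

```python
# Sketch of the target and concept classifiers described above (assumptions noted).
import torch.nn as nn
from transformers import AutoModel

class RobertaClassifier(nn.Module):
    def __init__(self, num_outputs: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_outputs)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # logits from the <s> token representation

# Target model: 2 logits (toxic/non-toxic or hate/normal), trained with cross-entropy.
target_model = RobertaClassifier(num_outputs=2)
# Concept model: one logit per concept, trained multi-label (e.g. BCEWithLogitsLoss).
concept_model = RobertaClassifier(num_outputs=5)  # e.g. the 5 Civil Comments concepts
```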
Results are saved in the Artifacts directory:
Artifacts/
├── CivilComments/
│   ├── word_clouds/   # Word cloud visualizations
│   ├── plots/         # Concept attribution plots
│   └── csv_dumps/     # Analysis results
│
└── HateXplain/
    ├── word_clouds/   # Word cloud visualizations
    ├── plots/         # Concept attribution plots
    └── csv_dumps/     # Analysis results