Concept-Based Interpretability for Toxicity Detection

This repository contains implementations for analyzing toxicity detection models using concept-based interpretability techniques. The code is organized into two separate experiments for the Civil Comments and HateXplain datasets.

Repository Structure

.
├── Artifacts/                    # Output directory for both experiments
├── backup-patches/               # Backup files
├── CivilComments-EXP/            # Civil Comments experiment
├── HateXplain-EXP/               # HateXplain experiment
├── .gitignore
└── README.md

Dataset-Specific Instructions

Civil Comments Experiment

Navigate to the CivilComments-EXP directory:

cd CivilComments-EXP
  1. Train models:
python target_model.py         # Train toxicity detection model
python concept_model.py        # Train concept prediction model
  2. Run analysis:
python cg_apply_automated.py   # Run concept gradient analysis (sketched below)
python tcav_apply.py           # Run TCAV analysis
python get_stats.py            # Generate statistics
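
For orientation, here is a minimal sketch of the concept-gradient idea behind the cg_apply_automated.py step: attribute the target logit to concepts by projecting its gradient through the pseudo-inverse of the concept Jacobian at a shared representation. The encoder, target_head, and concept_head names are placeholders, and the actual script may compute attributions differently.

import torch

def concept_gradient(encoder, target_head, concept_head, x):
    # Shared hidden representation, shape (d,); gradients stop at this layer
    h = encoder(x).detach().requires_grad_(True)

    # Gradient of the single target (toxicity) logit w.r.t. the representation, shape (d,)
    target_logit = target_head(h).squeeze()
    grad_f = torch.autograd.grad(target_logit, h)[0]

    # Jacobian of the concept logits w.r.t. the same representation, shape (k, d)
    jac_g = torch.autograd.functional.jacobian(concept_head, h)

    # Concept attributions: grad_f @ pinv(J_g) gives one score per concept, shape (k,)
    return grad_f @ torch.linalg.pinv(jac_g)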

Concepts analyzed (label construction is sketched after the list):

  • Obscene
  • Threat
  • Sexual Explicit
  • Insult
  • Identity Attack
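
The concept model treats these five signals as a multi-label target. Below is a minimal sketch of how such labels could be built from the standard Civil Comments auxiliary columns (obscene, threat, sexual_explicit, insult, identity_attack), assuming a 0.5 threshold and a text column; the repository's actual preprocessing may differ.

import pandas as pd

CONCEPT_COLUMNS = ["obscene", "threat", "sexual_explicit", "insult", "identity_attack"]

def build_concept_labels(csv_path, threshold=0.5):
    # Binarize the continuous annotator scores into multi-label concept targets
    df = pd.read_csv(csv_path)
    df[CONCEPT_COLUMNS] = (df[CONCEPT_COLUMNS] >= threshold).astype(int)
    return df[["text"] + CONCEPT_COLUMNS]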

HateXplain Experiment

Navigate to the HateXplain-EXP directory:

cd HateXplain-EXP
  1. Train models:
python target_model.py         # Train hate speech detection model
python concept_model.py        # Train concept prediction model
  2. Run analysis:
python cg_apply_automated.py   # Run concept gradient analysis
python tcav_apply.py           # Run TCAV analysis (sketched below)
python get_stats.py            # Generate statistics
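
Both experiments use the same analysis scripts. As a rough illustration of what the TCAV step measures: fit a linear probe (the concept activation vector, CAV) that separates concept examples from random examples in a chosen layer's activation space, then report how often the target-class gradient points along that direction. This is a manual sketch of the standard TCAV score using scikit-learn for the probe; it is not necessarily how tcav_apply.py is implemented.

import numpy as np
from sklearn.linear_model import LogisticRegression

def tcav_score(concept_acts, random_acts, layer_grads):
    # concept_acts, random_acts: (n, d) activations of the probed layer
    # layer_grads: (m, d) gradients of the target logit w.r.t. that layer
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])

    # The CAV is the normal of the linear decision boundary, pointing toward the concept
    cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]

    # Fraction of examples whose directional derivative along the CAV is positive
    return float((layer_grads @ cav > 0).mean())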

Concepts analyzed (a possible mapping from target communities is sketched after the list):

  • Race
  • Religion
  • Gender
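
These concepts are coarse groupings of the target-community annotations in HateXplain. A hypothetical mapping is shown below; the community names and their grouping are illustrative, so check the dataset's actual target labels before reusing it.

# Hypothetical grouping of HateXplain target communities into the three concepts
CONCEPT_GROUPS = {
    "Race":     {"African", "Arab", "Asian", "Caucasian", "Hispanic"},
    "Religion": {"Islam", "Christian", "Jewish", "Hindu", "Buddhism"},
    "Gender":   {"Women", "Men"},
}

def concept_labels(target_communities):
    # Binary multi-label vector [race, religion, gender] for one post
    return [int(any(t in group for t in target_communities))
            for group in CONCEPT_GROUPS.values()]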

Setup Requirements

Install dependencies:

pip install torch transformers datasets pandas numpy matplotlib seaborn wordcloud tqdm captum

Data Organization

For each experiment, prepare your data in the following structure:

dataset/
    ├── train.csv
    ├── dev.csv
    └── test.csv
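
A minimal loading sketch for these splits, assuming hypothetical text and label column names (adjust them to match the CSVs you export):

import pandas as pd

# Load the three splits; the column names here are assumptions, not fixed by the repo
splits = {name: pd.read_csv(f"dataset/{name}.csv") for name in ("train", "dev", "test")}
train_texts = splits["train"]["text"].tolist()
train_labels = splits["train"]["label"].tolist()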

Model Architecture

Both experiments use the following setup (a minimal instantiation is sketched after the list):

  • Base Architecture: RoBERTa
  • Target Model: Binary classification (toxic/non-toxic or hate/normal)
  • Concept Model: Multi-label concept classification
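
Here is a minimal sketch of how the two models could be instantiated with Hugging Face Transformers, assuming roberta-base and a multi-label head for the concepts (five for Civil Comments, three for HateXplain); the heads and training setup in target_model.py and concept_model.py may differ.

from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Target model: binary toxic / non-toxic (or hate / normal) classifier
target_model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Concept model: multi-label concept classifier (example: 5 Civil Comments concepts)
concept_model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=5, problem_type="multi_label_classification"
)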

Output Structure

Results are saved in the Artifacts directory:

Artifacts/
    ├── CivilComments/
    │   ├── word_clouds/        # Word cloud visualizations
    │   ├── plots/              # Concept attribution plots
    │   └── csv_dumps/          # Analysis results
    │
    └── HateXplain/
        ├── word_clouds/        # Word cloud visualizations
        └── plots/              # Concept attribution plots
