A comprehensive tool for removing refusal directions from Large Language Models (LLMs), making them more helpful and less likely to refuse valid requests.
This script combines all the functionality from the Abliteration codebase into a single, easy-to-use file. It allows you to:
- Calculate refusal directions: Identify the most significant refusal directions in a model using harmful and harmless prompts
- Remove refusal directions: Abliterate (remove) these directions from the model weights
- Chat with the model: Test the original or abliterated model through an interactive chat interface
- Compare models: Analyze differences between original and abliterated models
Abliteration works by:
- Identifying the "refusal direction" in the model's activation space by comparing activations on harmful vs. harmless prompts
- Removing this direction from key weight matrices in the model's transformer layers
- Preserving the model's general capabilities while reducing its tendency to refuse valid requests
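At its core, the procedure is a single rank-1 projection. Below is a minimal, self-contained sketch of the two steps on stand-in tensors; all names and shapes here are illustrative, not the script's actual API:

```python
import torch

hidden_size, num_prompts, in_features = 64, 8, 64
harmful_acts = torch.randn(num_prompts, hidden_size)   # stand-in activations
harmless_acts = torch.randn(num_prompts, hidden_size)
W = torch.randn(hidden_size, in_features)              # stand-in weight matrix

# Step 1: refusal direction = difference of mean activations, unit-normalized.
r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
r = r / r.norm()

# Step 2: remove it from the weight, W' = W - r r^T W, so nothing this
# matrix writes to the residual stream can point along r anymore.
W_abliterated = W - torch.outer(r, r @ W)

# Sanity check: the edited weight has no component along r.
assert torch.allclose(r @ W_abliterated, torch.zeros(in_features), atol=1e-5)
```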
Requirements:
- Python 3.9+
- PyTorch
- Transformers
- Datasets
- tqdm
- pandas
- psutil
Install dependencies:
```bash
pip install transformers datasets torch tqdm pandas psutil
```

Abliterate a model:

```bash
python abliterate_all_in_one.py abliterate -m <model_path> -o <output_dir>
```

Example:

```bash
python abliterate_all_in_one.py abliterate -m meta-llama/Llama-3.2-3B-Instruct -o llama3.2-3b-abliterated
```

Chat with a model:

```bash
python abliterate_all_in_one.py chat -m <model_path>
```

Example:

```bash
python abliterate_all_in_one.py chat -m llama3.2-3b-abliterated
```

Compare models:

```bash
python abliterate_all_in_one.py compare --original <original_model_path> --abliterated <abliterated_model_path>
```

Example:

```bash
python abliterate_all_in_one.py compare --original meta-llama/Llama-3.2-3B-Instruct --abliterated llama3.2-3b-abliterated --output-file comparison_report.md
```

Abliteration options:
- `--skip-begin`: Number of layers to skip at the beginning (default: 1)
- `--skip-end`: Number of layers to skip at the end (default: 0)
- `--scale-factor`: Scale factor for ablation (default: 1.0)
- `--top-refusal-layers`: Only abliterate the N layers with the highest refusal factors
- `--specific-layers`: Comma-separated list of specific layer indices to abliterate
- `--proportional-scaling`: Scale abliteration proportionally to each layer's refusal factor
- `--force-abliteration`: Force abliteration even when refusal factors are negligible

Device options:
- `--device`: Target device (auto, cuda, cpu, last-gpu)
- `--gpu-id`: Specific GPU ID to use
- `--multi-gpu`: Distribute the model across multiple GPUs
- `--max-memory`: Maximum memory to use per GPU

Precision options:
- `--precision`: Precision to use (fp16, bf16, fp32)
- `--load-in-4bit`: Load the model in 4-bit precision
- `--load-in-8bit`: Load the model in 8-bit precision
- `--flash-attn`: Use Flash Attention 2

Data options:
- `--data-harmful`: Custom harmful prompts file
- `--data-harmless`: Custom harmless prompts file
- `--num-harmful`: Number of harmful calibration prompts to randomly select
- `--num-harmless`: Number of harmless calibration prompts to randomly select
The script performs the following key operations:
Refusal Direction Calculation:
- Computes representations for harmful and harmless prompts
- Calculates the difference vector (the refusal direction)
- Normalizes this direction for consistent application
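One plausible way to collect those representations with `transformers` is to take the hidden state of each prompt's final token at a chosen layer. The helper below is a hypothetical sketch, not the script's actual extraction code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def collect_hidden_states(model, tokenizer, prompts, layer: int) -> torch.Tensor:
    """Stack the hidden state of each prompt's final token at `layer`."""
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model(**inputs, output_hidden_states=True)
        # hidden_states[layer]: (batch=1, seq_len, hidden_size); keep last token
        states.append(outputs.hidden_states[layer][0, -1])
    return torch.stack(states)
```

Running this once over the harmful prompts and once over the harmless prompts yields the two activation sets whose normalized mean difference is the refusal direction.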
Layer Analysis:
- Analyzes each layer to determine its contribution to refusal behavior
- Ranks layers by their refusal factors
- Allows targeting specific layers for abliteration
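As a rough illustration, one plausible scoring heuristic is the magnitude of the mean harmful/harmless gap at each layer; the script's exact refusal-factor definition may differ:

```python
import torch

def rank_layers_by_refusal(harmful_by_layer, harmless_by_layer):
    """Rank layers by the magnitude of their mean harmful/harmless gap.

    Inputs are sequences of (num_prompts, hidden_size) tensors, one entry
    per layer. Returns (layer_index, factor) pairs, strongest first.
    """
    factors = [
        (i, (h.mean(dim=0) - b.mean(dim=0)).norm().item())
        for i, (h, b) in enumerate(zip(harmful_by_layer, harmless_by_layer))
    ]
    return sorted(factors, key=lambda pair: pair[1], reverse=True)
```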
Tensor Modification:
- Applies a projection-based modification to weight matrices
- Removes components aligned with the refusal direction
- Preserves other capabilities of the model
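A minimal sketch of that modification, assuming Llama-style module names (`self_attn.o_proj`, `mlp.down_proj`) for the matrices that write into the residual stream; `scale` plays the role of the `--scale-factor` option:

```python
import torch

@torch.no_grad()
def abliterate_layer(layer, direction: torch.Tensor, scale: float = 1.0) -> None:
    """Project the refusal direction out of a decoder layer's output matrices."""
    r = direction / direction.norm()
    # o_proj and down_proj write back into the residual stream; the attribute
    # names follow Llama-style transformers models and may differ elsewhere.
    for module in (layer.self_attn.o_proj, layer.mlp.down_proj):
        W = module.weight.data                          # (hidden_size, in_features)
        r_local = r.to(device=W.device, dtype=W.dtype)
        # W' = W - scale * r r^T W: subtract the rank-1 component along r.
        W -= scale * torch.outer(r_local, r_local @ W)
```

With `scale=1.0` the component along the refusal direction is removed entirely; smaller values attenuate it instead, which is how proportional or partial abliteration can be expressed.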
Model Comparison:
- Provides detailed analysis of changes between original and abliterated models
- Reports on parameter changes at layer and component levels
- Generates comprehensive comparison reports
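A minimal sketch of such a comparison, diffing the two checkpoints tensor by tensor; the paths and the relative-change metric are illustrative, and both models are loaded in full precision, which is memory-hungry:

```python
import torch
from transformers import AutoModelForCausalLM

def compare_checkpoints(original_path: str, abliterated_path: str) -> None:
    """Print the relative weight change for every tensor that moved."""
    original = AutoModelForCausalLM.from_pretrained(
        original_path, torch_dtype=torch.float32).state_dict()
    abliterated = AutoModelForCausalLM.from_pretrained(
        abliterated_path, torch_dtype=torch.float32).state_dict()
    for name, w in original.items():
        delta = (abliterated[name] - w).norm() / (w.norm() + 1e-12)
        if delta > 1e-8:  # skip untouched tensors (embeddings, norms, ...)
            print(f"{name}: relative change {delta:.3e}")
```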
This script is based on the Abliteration project, which pioneered the technique of removing refusal directions from LLMs; the all-in-one script consolidates the original repository's functionality into one file.
This project is released under the same license as the original Abliteration repository.