Skip to content

Add Claude MD file #2311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
126 changes: 126 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

torchao is PyTorch's official Architecture Optimization library that accelerates PyTorch models through advanced quantization and sparsification techniques. It provides optimization for weights, gradients, activations, and more for both inference and training with minimal code changes.

## Development Commands

### Installation & Build
```bash
# Development install (Python-only mode, fastest for development)
USE_CPP=0 python setup.py develop

# Full build with C++/CUDA extensions
python setup.py develop

# Install specific version of ruff for linting
pip install ruff==0.11.6
```

### Testing
```bash
# Run specific test files
pytest test/float8/test_base.py
pytest test/quantization/test_quant_api.py
pytest test/dtypes/test_affine_quantized.py

# Run comprehensive float8 tests
./test/float8/test_everything.sh

# Run all tutorials
./tutorials/run_all.sh
```

### Linting & Formatting
```bash
# Install pre-commit hooks (one-time setup)
pre-commit install

# Run all pre-commit checks
pre-commit run --all-files

# Run pre-commit on staged files only
pre-commit run
```

## Architecture Overview

### Core Components

**torchao/quantization/** - Primary quantization APIs
- `quant_api.py` - Main `quantize_()` function for one-line model quantization
- `autoquant.py` - Automatic quantization selection
- Weight-only quantization (INT4/INT8), dynamic quantization, QAT support

**torchao/dtypes/** - Custom tensor subclasses with layout and dispatch registration
- `AffineQuantizedTensor` - Base quantized tensor class
- `nf4tensor.py` - NF4 (4-bit normal float) implementation for QLoRA
- `uintx/floatx/` - Unsigned integer and floating-point quantized tensors

**torchao/float8/** - High-performance float8 training
- Delivers up to 1.5x speedup on 512 GPU clusters
- `convert_to_float8_training()` - Main entry point
- Full `torch.compile` and FSDP2 compatibility

**torchao/sparsity/** - Structured and unstructured sparsity
- 2:4 semi-structured sparsity with up to 2.4x throughput improvements
- `sparse_api.py` - Main sparsity functions
- Wanda pruning, block-sparse operations

**torchao/optim/** - Memory-efficient optimizers
- `AdamW8bit`, `AdamW4bit`, `AdamWFp8` - Quantized optimizers (2-4x memory reduction)
- `CPUOffloadOptimizer` - 60% VRAM reduction via CPU offloading

**torchao/csrc/** - Custom CUDA/CPU kernels
- CUTLASS-based implementations for maximum performance
- ROCm support for AMD GPUs
- CPU kernels with AVX512 optimizations

### Key Design Principles

**Composability**: All custom dtypes work with `torch.compile`, FSDP2, and tensor parallel out-of-the-box

**Subclass Architecture**: Tensor subclasses handle layout, dispatch, and kernel registration automatically

**Hardware Optimization**: Architecture-specific optimizations (CUDA, ROCm, CPU, MPS) with automatic detection

## Build Configuration

The build system uses environment variables for configuration:

**Core Controls:**
- `USE_CPP=0|1` - Skip C++/CUDA extensions (default: 1, set to 0 for fastest dev setup)
- `USE_CPU_KERNELS=0|1` - Enable optimized CPU kernels (Linux only, default: 0)
- `DEBUG=0|1` - Debug build mode

**Experimental Features:**
- `BUILD_TORCHAO_EXPERIMENTAL=1` - Enable experimental cmake builds
- `TORCHAO_BUILD_CPU_AARCH64=1` - ARM64 CPU kernels (auto-enabled on Apple Silicon)
- `TORCHAO_BUILD_KLEIDIAI=1` - Kleidi AI library integration
- `TORCHAO_BUILD_EXPERIMENTAL_MPS=1` - MPS acceleration (macOS only)

## Integration Points

- **HuggingFace Transformers**: Built-in backend via `TorchAoConfig`
- **vLLM/SGLang**: LLM serving integration
- **TorchTune**: QLoRA and QAT recipes
- **torch.compile**: Full compiler compatibility
- **FSDP2**: Distributed training support

## Common Development Tasks

**Adding a new quantization technique:** Implement as tensor subclass in `torchao/dtypes/`, register dispatch kernels, add to `quant_api.py`

**Performance optimization:** Custom kernels go in `torchao/csrc/`, with separate extensions for different GPU architectures (SM90a, SM100a)

**Testing:** Follow existing patterns in `test/` directory, use `pytest` for individual tests

## Important Notes

- Always run `pre-commit run --all-files` before committing
- Use `USE_CPP=0` for faster iteration during Python-only development
- CUTLASS kernels have architecture-specific builds (SM90a, SM100a) based on CUDA version
- Git submodules (CUTLASS) are automatically initialized during build
Loading