CGGR - Confidence-Gated Gradient Routing

Warning

CGGR is still experimental. The core selective-loss path is usable for research and prototyping, but large parts of the repo remain exploratory.

Selective loss computation for Transformer-style training. The core idea is to score examples by difficulty, then spend backward-pass work on the hardest part of the batch.

Status

Stable Core

The most credible, currently maintained surface is:

CGGRLoss
CGGRModel
CGGRScorer
create_truncated_router
PyTorch fallback kernels in triton_kernels.py

Experimental / Best-Effort

The following modules exist, but should be treated as research code rather than stable product surface:

cggr_async.py
cggr_checkpointing.py
cggr_dataloader.py
cggr_flash.py
persistent_cggr_kernels.py
benchmark and race scripts under benchmarks/

Current Behavior Notes

CGGRLoss selects tokens for loss computation.
CGGRModel currently scores token difficulty but performs its second training pass on the hardest sequences, not arbitrary token slices.
Architecture auto-detection works best on common Hugging Face causal LM layouts. Non-causal and less common architectures should be treated as best-effort until they are covered by dedicated tests.

Installation

pip install cggr

For optional CUDA acceleration with Triton kernels:

pip install cggr[cuda]

Platform Compatibility

Platform	Triton Kernels	PyTorch Fallback	Status
CUDA (Linux)	✓	✓	Best-supported path
CUDA (Windows)	✓	✓	Best-effort
ROCm (AMD)	✗	✓	Fallback-only
MPS (Apple Silicon)	✗	✓	Fallback-only
CPU	✗	✓	Fallback-only

Model Architecture Support

Architecture	Auto-Detect	Notes
Llama/Mistral/Qwen/Gemma/Phi-3	✓	Primary target, `model.layers` style
GPT-2/GPT-J/Falcon/GPT-NeoX	✓	Implemented, `transformer.h` style
BERT/RoBERTa	Best-effort	Not yet covered by dedicated tests
Mamba/SSM	Best-effort	Not yet covered by dedicated tests
Other	Passthrough	Uses full model as router, usually slower

Flash Attention Utilities

The repo includes experimental helpers for enabling Flash Attention / SDPA on Hugging Face models:

pip install cggr[flash]

from cggr_flash import load_model_with_flash_attention, enable_flash_attention

# Option 1: Load model with Flash Attention
model = load_model_with_flash_attention("microsoft/phi-2")

# Option 2: Enable on existing model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("...")
model = enable_flash_attention(model)  # Auto-selects best backend

Backend	Requirements	Notes
`flash_attention_2`	flash-attn library	Experimental helper path
`sdpa`	PyTorch 2.0+	Experimental helper path
`eager`	None	Baseline

What CGGR Tries To Improve

Metric	Standard Training	CGGR Goal	Caveat
Backward compute	Full batch/tokens	Selective loss on hardest part of batch	Savings depend on wrapper and routing ratio
Forward overhead	Single pass	Router pass + selective main pass	Can erase gains on some models/hardware
Memory use	Full graph	Smaller active backward graph	Most visible with `CGGRModel`
Throughput	Baseline	Potential improvement	Must be benchmarked per setup
Quality	Baseline	Similar or better with good routing	Not yet established as a universal result

Benchmark Status

Checked-in benchmark artifacts are exploratory, not canonical proof of the method. They are useful for local investigation, but they should not be read as a production benchmark suite yet.

The current repository contains:

one canonical synthetic throughput/memory entry point: benchmarks/canonical_benchmark.py
one canonical dataset-backed quality entry point: benchmarks/canonical_quality_benchmark.py
one reusable single-GPU CUDA wrapper: scripts/run_canonical_l40s.sh
additional exploratory benchmark scripts under benchmarks/
one checked-in benchmark report under benchmark_results/
early experiments across throughput, memory, and convergence behavior

Phase 0 hardening rule: treat benchmark claims as setup-specific unless they are backed by a reproducible command, fixed config, and matching artifact in the repo.

Current hardening benchmark convention:

local MPS/CPU outputs are pipeline-validation artifacts only
CUDA outputs with explicit artifact tags are the canonical evidence path for this repo
dataset-backed canonical quality runs need meaningful host RAM in Slurm (--mem=16G minimum, --mem=32G recommended)

See benchmarks/README.md for the canonical vs experimental split. The latest checked-in single-L40S canonical interpretation is in benchmark_results/cuda_single_l40s_smollm135_summary.md. Release gating checklist is in RELEASE_CHECKLIST.md. Compatibility policy is in COMPATIBILITY.md.

Quick Start

1. Batch Splitting (Recommended Core Path)

The main end-to-end wrapper is CGGRModel. It uses a lightweight router to score difficulty, then trains on the hardest sequences in the batch.

from cggr import CGGRModel, create_truncated_router
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("...").cuda()

# Create lightweight router (shares weights, 0 extra memory)
router = create_truncated_router(model, num_layers=4)

# Wrap model
cggr_model = CGGRModel(
    model, 
    router=router, 
    min_tokens_ratio=0.25
)

# Train
loss = cggr_model(input_ids, labels=labels)
loss.backward()

2. Manual Integration (CGGRLoss)

If you cannot use CGGRModel (e.g. specialized architectures), you can use CGGRLoss manually.

from cggr import CGGRLoss

criterion = CGGRLoss(
    scoring='combined',      # 'entropy', 'margin', 'loss', 'combined'
    selection='stratified',  # 'topk', 'stratified', 'sequence_aware'
    min_tokens_ratio=0.25,
    warmup_steps=1000,
)

for batch in dataloader:
    logits = model(input_ids)
    loss = criterion(logits, targets)  # Only hard tokens
    loss.backward()
    optimizer.step()
    criterion.step()

3. Native Integration (e.g. MoE/SRDE)

For architectures like SRDE that require static tensor shapes, you can use CGGRScorer to generate a routing mask instead of splitting the batch.

from cggr import CGGRScorer

# 1. Initialize Scorer
self.scorer = CGGRScorer(router, min_tokens_ratio=0.5)

# 2. Get Mask
difficulty, mask, info = self.scorer(input_ids)

# 3. Apply Mask (Null Routing)
# mask is Boolean: True=Hard (Route to Expert), False=Easy (Skip/Null)
expert_output = expert_layer(x) * mask.unsqueeze(-1)

Scoring Strategies

Strategy	Description	Best For
`entropy`	High entropy = hard	General training
`margin`	Small top-2 margin = hard	Classification
`loss`	Intended to use per-token loss	Experimental semantics
`combined`	Mix of uncertainty signals	Default research setting

Selection Strategies

Strategy	Description	Benefit
`topk`	Top-k hardest tokens	Simple, fast
`stratified`	Sample from difficulty buckets	Prevents forgetting
`sequence_aware`	Ensure coverage per sequence	Preserves structure

Dynamic Thresholding

Automatically adjusts token ratio based on batch confidence:

Low confidence → more tokens (model is learning)
High confidence → fewer tokens (model has converged)

CGGRLoss(dynamic_threshold=True, threshold_sensitivity=0.5)

Full API

CGGRLoss(
    # Scoring
    scoring='combined',
    
    # Selection
    selection='topk',
    num_strata=4,                  # For stratified
    min_tokens_per_sequence=1,     # For sequence_aware
    
    # Thresholding
    dynamic_threshold=True,
    threshold_sensitivity=0.5,
    
    # Curriculum
    min_tokens_ratio=0.25,
    warmup_steps=1000,
)

Performance Notes

For CGGRLoss, selective loss reduces how many tokens contribute to the final loss tensor, but it does not automatically guarantee end-to-end wall-clock speedup.

For CGGRModel, the intended win comes from training the main model on a smaller subset of the batch after a router pass. Whether that produces a net gain depends on:

model architecture
batch shape
routing ratio
GPU characteristics
router cost vs main-model savings

Persistent CGGR Kernels (Advanced, Experimental)

For Token-Routed MLP and Mixture-of-Experts architectures, the repo includes experimental persistent-kernel work:

Persistent Kernels - Keep SM threads active across expert batches
Cooperative Thread Groups - Better SM utilization through cooperative scheduling
Warp-Specialized Streaming - Overlap memory loads with computation
Software Pipelining - Multi-stage async prefetching for latency hiding

from cggr import PersistentTRMLP, create_persistent_tr_mlp

# Create MoE layer with persistent optimizations
moe_layer = create_persistent_tr_mlp(
    hidden_dim=1024,
    intermediate_dim=4096,
    num_experts=8,
    top_k=2,
)

# Forward pass (routes tokens to experts)
output, aux_loss = moe_layer(hidden_states)
total_loss = language_modeling_loss + 0.01 * aux_loss

This code path is source-tree experimental work. It is not part of the stable Phase 0 surface and should be validated independently before use.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
.github		.github
benchmark_results		benchmark_results
benchmarks		benchmarks
examples		examples
scripts		scripts
tests		tests
.gitignore		.gitignore
CGGR_WHITEPAPER.tex		CGGR_WHITEPAPER.tex
COMPATIBILITY.md		COMPATIBILITY.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
RELEASE_CHECKLIST.md		RELEASE_CHECKLIST.md
bench_speed.py		bench_speed.py
cggr.py		cggr.py
cggr_async.py		cggr_async.py
cggr_checkpointing.py		cggr_checkpointing.py
cggr_dataloader.py		cggr_dataloader.py
cggr_flash.py		cggr_flash.py
docs.md		docs.md
persistent_cggr_kernels.py		persistent_cggr_kernels.py
quick_benchmark.py		quick_benchmark.py
setup.py		setup.py
train_experts.py		train_experts.py
triton_kernels.py		triton_kernels.py
verify_persistent.py		verify_persistent.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CGGR - Confidence-Gated Gradient Routing

Status

Stable Core

Experimental / Best-Effort

Current Behavior Notes

Installation

Platform Compatibility

Model Architecture Support

Flash Attention Utilities

What CGGR Tries To Improve

Benchmark Status

Quick Start

1. Batch Splitting (Recommended Core Path)

2. Manual Integration (CGGRLoss)

3. Native Integration (e.g. MoE/SRDE)

Scoring Strategies

Selection Strategies

Dynamic Thresholding

Full API

Performance Notes

Persistent CGGR Kernels (Advanced, Experimental)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CGGR - Confidence-Gated Gradient Routing

Status

Stable Core

Experimental / Best-Effort

Current Behavior Notes

Installation

Platform Compatibility

Model Architecture Support

Flash Attention Utilities

What CGGR Tries To Improve

Benchmark Status

Quick Start

1. Batch Splitting (Recommended Core Path)

2. Manual Integration (CGGRLoss)

3. Native Integration (e.g. MoE/SRDE)

Scoring Strategies

Selection Strategies

Dynamic Thresholding

Full API

Performance Notes

Persistent CGGR Kernels (Advanced, Experimental)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages