Production Transformer Training Framework with MoE/MoD Architecture & CUDA Acceleration
Demo • Architecture • CUDA Acceleration • Configuration • API • Licensing
- Overview
- Technical Architecture
- CUDA Acceleration
- Model Configurations
- Precision Support
- Demo
- Training Modes
- Adaptive Training System
- Hardware Optimization
- Data Processing
- Monitoring
- Performance Benchmarks
- API Reference
- Troubleshooting
- Production Deployment
- Licensing
Adaptive Training System is a production-grade transformer training framework implementing Mixture of Experts (MoE) and Mixture of Depths (MoD) architectures with autonomous training optimization and custom CUDA acceleration kernels. Supports models from 500M to 300B+ parameters with enterprise infrastructure.
Core capabilities:
- Sparse architectures: MoE (8-64 experts), MoD (dynamic depth), hybrid configurations
- CUDA acceleration: Custom kernels for RMSNorm (3-4x faster), RoPE (2-4x faster), SwiGLU (2-3x faster), MoE routing (2-4x faster), fused loss computation
- Metal acceleration: Custom Metal shaders for Apple Silicon - RMSNorm (2-3x faster), RoPE (3-5x faster), SwiGLU (2-3x faster), MoE routing
- Adaptive orchestrator: 18 autonomous intervention methods for training optimization
- Chinchilla scaling: Automatic epoch calculation based on compute-optimal principles
- Multi-GPU training: DeepSpeed ZeRO (stages 1-3), FSDP, ColossalAI with efficient gradient synchronization
- Precision support: FP32, FP16, BF16, mixed precision, FP8 (H100+ via Triton)
- Advanced Quantization: 4-bit/8-bit support via AutoGPTQ and Optimum Quanto
- Hardware targets: CUDA (Volta-Hopper), Apple Silicon (M1-M4) with Metal acceleration, CPU
- Data handling: Memory-mapped datasets, Apache Arrow zero-copy, automatic caching
- Router Optimization: Fine-tuning mode and adapter loading for MoE routers
- Recovery systems: Automatic OOM handling, gradient explosion recovery, checkpoint rollback
Framework positioning:
This is a complete training system with custom CUDA kernels, not a model zoo or API wrapper. Every component from tokenization to fused gradient operations is included. MoE and MoD implementations follow established research (Switch Transformer, Mixture of Experts, Mixture-of-Depths) with operational additions and CUDA-accelerated execution: dynamic expert management, capacity tuning, load balancing, routing analytics.
The adaptive orchestrator monitors 20+ metrics every N steps and triggers interventions across hyperparameters, architecture, and recovery procedures. Maintains decision history with confidence scoring to prevent excessive intervention.
Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical operations while maintaining gradient compatibility and numerical stability. Metal shaders provide 2-5x speedup on Apple Silicon (M1-M4). All kernels include automatic fallback to PyTorch when accelerated backends are unavailable.
Intended for:
- ML engineers requiring full training stack control with maximum performance
- Research teams prototyping sparse architectures with production-grade infrastructure
- Organizations with proprietary data, compliance requirements, and performance constraints
- Teams needing framework-independent infrastructure with custom optimization capabilities
Not included:
- Pre-trained model weights (configuration presets only, train from scratch)
- Model checkpoints or existing trained models
- High-level abstractions (direct control provided)
- Tutorial content (assumes ML engineering background)
Standard architecture with LLaMA/GPT-NeoX design patterns:
- Pre-normalization: RMSNorm before attention/FFN (CUDA-accelerated, 3-4x faster)
- Grouped Query Attention: Reduces KV cache via shared KV heads (typical ratio 4:1 or 8:1)
- Rotary Position Embeddings: Length generalization with configurable theta (10000 base, 1000000 extended), CUDA-accelerated RoPE application (2-4x faster)
- SwiGLU activation: Two-path gating in FFN, intermediate_size typically 8/3 × hidden_size, CUDA-accelerated (2-3x faster)
- Optional Flash Attention 2.x: 2-4x speedup on Ampere+ GPUs with memory-efficient attention
Parameter calculation:
- Embedding: vocab_size × hidden_size
- Attention per layer: 2 × hidden_size² × (1 + num_kv_heads/num_heads) (Q and O projections plus down-sized K/V projections)
- FFN per layer (SwiGLU): 3 × hidden_size × intermediate_size (gate, up, and down projections)
- Output: vocab_size × hidden_size (optionally tied)
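As a sanity check, the per-component formulas can be folded into a small estimator. This is an illustrative sketch, not the framework's API: the function name and arguments are assumptions, and it assumes separate Q/K/V/O projections with a three-matrix SwiGLU FFN, omitting biases and norm weights.

```python
def estimate_params(vocab_size, hidden_size, intermediate_size,
                    num_layers, num_heads, num_kv_heads, tie_embeddings=False):
    """Rough dense-model parameter count (biases and norms omitted)."""
    embedding = vocab_size * hidden_size
    # Q and O projections are hidden_size x hidden_size each;
    # K and V shrink by the GQA ratio num_kv_heads/num_heads
    attention = 2 * hidden_size**2 * (1 + num_kv_heads / num_heads)
    # SwiGLU FFN: gate, up, and down projections
    ffn = 3 * hidden_size * intermediate_size
    output = 0 if tie_embeddings else vocab_size * hidden_size
    return int(embedding + num_layers * (attention + ffn) + output)

# Illustrative numbers for a small GQA backbone (4:1 head ratio)
print(estimate_params(32000, 1024, 2816, 24, 16, 4))  # 336068608
```

Swapping `tie_embeddings=True` drops the separate output projection, which matters most at small hidden sizes where embeddings dominate the count.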
Token-level sparse activation via learned routing to specialized FFN networks with CUDA-accelerated dispatch.
Routing mechanism:
- Top-k gating: Each token routed to k of N experts (typical: k=2, N=8)
- Router: Linear layer (hidden_size × num_experts) + softmax + TopK selection
- CUDA optimization: Warp-based top-k kernel with shuffle reduction (2-4x faster than PyTorch)
- Output: Weighted combination of selected expert outputs
Load balancing:
- Auxiliary loss: Penalizes routing imbalance via expert utilization distribution
- Capacity factor: Maximum tokens per expert = (total_tokens/num_experts) × capacity_factor
- Typical capacity_factor: 1.25-1.5 (25-50% overflow buffer)
- Load balancing weight: 0.01 (added to main loss)
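The exact auxiliary loss is not spelled out above; a common Switch-Transformer-style formulation penalizes the dot product of per-expert dispatch fractions and mean router probabilities. The sketch below uses illustrative names and top-1 assignments for simplicity (the framework routes top-k and operates on tensors):

```python
def load_balance_loss(token_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i, where
    f_i = fraction of tokens dispatched to expert i and
    P_i = mean router probability assigned to expert i."""
    n = len(assignments)
    f = [0.0] * num_experts  # dispatch fractions
    p = [0.0] * num_experts  # mean router probabilities
    for probs, expert in zip(token_probs, assignments):
        f[expert] += 1.0 / n
        for i in range(num_experts):
            p[i] += probs[i] / n
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing over 4 experts gives the minimum value, 1.0
uniform = [[0.25] * 4 for _ in range(8)]
print(load_balance_loss(uniform, [0, 1, 2, 3, 0, 1, 2, 3], 4))  # 1.0
```

The result is scaled by the load balancing weight (0.01 above) before being added to the main loss, so the gradient nudges the router toward uniform utilization without overwhelming the language-modeling objective.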
Dynamic management:
- Expert addition: Triggered when utilization exceeds threshold (typically 0.85)
- Expert pruning: Removes experts below utilization threshold (typically 0.15)
- Capacity adaptation: Adjusts based on token drop rate
- Temperature tuning: Controls routing concentration (lower = sharper, higher = more uniform)
CUDA acceleration:
- Token dispatch: Vectorized memory access (float4) with coalesced writes, reduced atomic contention
- Expert combination: Optimized accumulation with local buffering before atomic adds
- Memory efficiency: Batched operations reduce kernel launch overhead by 3-5x
Efficiency:
- 8-expert top-2 MoE: 8× total parameters, 1.25× active parameters per token
- Sparsity: 75% of expert parameters inactive (2 of 8 experts, 25%, active per token)
- Memory: Scales with total parameters (all experts in memory)
- Compute: Scales with active parameters only
- CUDA speedup: 2-4x over PyTorch for routing and dispatch operations
Statistics tracked:
- Per-expert utilization: Fraction of tokens routed to each expert
- Routing entropy: Distribution concentration (max = log(num_experts))
- Load balance loss: Auxiliary loss magnitude
- Tokens dropped: Count exceeding capacity
- Per-layer patterns: Early layers more uniform, later layers more concentrated
Layer-level sparse activation via learned skip decisions.
Core concept: Model learns which tokens require full layer computation vs. residual skip. Routing decision per token per layer based on token representation at layer input.
Routing types:
- Learned: Small MLP scores tokens, top-capacity_factor selected for full processing
- Static: Fixed pattern (e.g., all tokens full in early layers, reduced in later layers)
- Random: Random selection maintaining capacity_factor (ablation baseline)
Capacity management:
- capacity_factor controls fraction of tokens using full computation
- 0.5 = 50% tokens full processing, 50% skip
- Selection is learned during training, not random
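The capacity rule reduces to a top-score selection. The helper below is an illustrative sketch; in the framework the scores come from a learned MLP and the selection runs on tensors:

```python
def mod_select(scores, capacity_factor):
    """Pick the top capacity_factor fraction of tokens (by router score)
    for full layer computation; the rest take the residual skip path."""
    k = max(1, int(len(scores) * capacity_factor))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(scores))]

# 8 tokens, capacity_factor=0.5: the 4 highest-scoring tokens get full compute
mask = mod_select([0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4], 0.5)
print(mask)  # [True, False, True, False, True, False, True, False]
```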
Efficiency:
- FLOPs reduction: ~(1 - capacity_factor) for layers with MoD
- Typical: 30-50% FLOPs reduction with 0-2% perplexity increase
- Quality/efficiency tradeoff: Lower capacity = more savings but larger quality impact
Application strategies:
- All layers: Maximum compute reduction
- Later layers only: Preserve early feature extraction, reduce later specialization
- Alternating: MoD every N layers
- FFN-only: Dense attention, sparse FFN (common since FFN is 2/3 of compute)
Training dynamics:
- Early training: Near-uniform routing (all tokens treated similarly)
- Specialization: Model learns token complexity patterns over time
- Curriculum: Can start with capacity_factor=1.0, gradually reduce to target
- Annealing: Gradual capacity reduction prevents training instability
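A curriculum of this kind can be expressed as a simple schedule. This is an illustrative helper under the assumption of linear annealing; the framework's actual policy may differ:

```python
def capacity_schedule(step, warmup_steps, target_capacity, start_capacity=1.0):
    """Linearly anneal MoD capacity_factor from start_capacity down to
    target_capacity over warmup_steps, then hold at the target."""
    if step >= warmup_steps:
        return target_capacity
    frac = step / warmup_steps
    return start_capacity + frac * (target_capacity - start_capacity)

print(capacity_schedule(0, 1000, 0.5))     # 1.0  (all tokens full compute)
print(capacity_schedule(500, 1000, 0.5))   # 0.75 (halfway through warmup)
print(capacity_schedule(2000, 1000, 0.5))  # 0.5  (target reached and held)
```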
Combined token-level (MoE) and layer-level (MoD) sparsity with coordinated CUDA acceleration.
Architecture:
- Each layer: Dense attention + MoE FFN + MoD routing
- MoD decides: Use full layer (attention + MoE) or skip via residual
- If token uses layer, routes through MoE experts with CUDA-accelerated dispatch
Sparsity compounding:
- Top-2 of 8 experts: 25% expert parameters active
- 50% layer capacity: 50% tokens use layers
- Combined: 0.5 × 0.25 = 12.5% active parameters per token
- 87.5% total sparsity
Training considerations:
- Both routing mechanisms must learn useful patterns
- Load balancing for experts, capacity warmup for MoD
- Routing temperature adaptation for both systems
- Quality-aware guards prevent catastrophic sparsity collapse
Use cases:
- Maximum efficiency: Largest models on limited compute
- Fast experimentation: Smaller active compute enables rapid iteration
- Inference optimization: Reduced memory and compute for deployment
- Multi-task learning: Different experts and depths specialize per task
Specialized Router Management:
- Router Fine-tuning: Isolate and train only the routing layers (`gate_proj`) while freezing all other parameters. Helps specialize routing logic without catastrophic forgetting in the experts.
- Adapter Loading: Load pre-trained router weights independently. Useful for transferring routing strategies between similar models.
- Dynamic Fine-tuning: Enable router training during specific phases of the main training loop.
Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical training operations. All kernels maintain full gradient compatibility and include automatic fallback.
1. Fused RMSNorm (3-4x faster)
- Vectorized memory access: float4 operations for coalesced reads/writes
- Optimized reductions: Warp shuffle + reduced shared memory usage
- Key optimizations:
- Single-pass algorithm (compute variance and normalize in one kernel)
- Eliminated bank conflicts in shared memory
- Better register allocation (maxrregcount=64)
- Loop unrolling for common hidden sizes
- Performance: ~3.2ms → ~0.9ms per 1000 tokens (hidden_size=768)
2. Rotary Position Embeddings (2-4x faster)
- Optimized precompute: Vectorized cos/sin cache generation
- Efficient application: Reduced redundant memory access
- Key optimizations:
- Process 2 dimension pairs per thread (better ILP)
- Use of __ldg() for read-only cache optimization
- FMA instructions (fmaf) for multiply-add fusion
- Eliminated per-token position computation
- Performance: ~8.5ms → ~1.3ms per batch (batch=4, seq=128, heads=12)
3. SwiGLU Activation (2-3x faster)
- Vectorized computation: float4 for gate and up projections
- Fast SiLU approximation: Optimized sigmoid computation
- Key optimizations:
- Fused gate/up computation in single kernel pass
- Register tiling for matrix multiplication
- Eliminated intermediate memory allocation
- Vectorized output writes
- Performance: ~5.1ms → ~1.8ms per 1000 tokens (intermediate=3072)
4. MoE Operations (2-4x faster)
- Optimized top-k gating: Warp-level parallel reduction
- Token dispatch: Coalesced memory access patterns
- Expert combination: Reduced atomic operations
- Key optimizations:
- Warp-based top-k (no shared memory for k≤4)
- Batched atomic increments per block
- Vectorized token copies (float4)
- Local accumulation before global writes
- Performance: ~12ms → ~3.5ms per batch for routing+dispatch+combine
5. Fused Loss Computation
- Cross-entropy + accuracy: Single-pass computation
- Key optimizations:
- Combined max/sum reduction for softmax
- Parallel argmax using shared memory
- Fused gradient computation
- Eliminated redundant passes
- Performance: ~4.2ms → ~1.1ms per batch (vocab_size=32000)
6. Fused Gradient Clipping
- Fully async pipeline: No CPU-GPU synchronization
- Key optimizations:
- Compute norm on GPU (no D2H transfer)
- Conditional clipping kernel launch
- Pinned memory for async copy
- Single synchronization point at end
- Performance: ~8ms → ~1.5ms (including norm computation)
Custom kernels provide 2–7× speedup over vanilla PyTorch operations.
Compilation requirements:
```bash
# Install CUDA toolkit (11.8+ or 12.x)
# Verify nvcc availability
nvcc --version

# Optional: force target architectures (comma or semicolon separated)
# export CUDA_TARGET_SM=75,80,86,89,90

# Compile transformer + MoE kernels
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh

# Compile training kernels
cd ../training
./compile_kernels.sh
```

Automatic kernel detection: The framework detects and loads compiled kernels at runtime, falling back to PyTorch when kernels are unavailable. No code changes are required to use CUDA acceleration.
Automatic JIT rebuild for current hardware:
- If kernel `.so` files are missing or compiled for the wrong SM target, runtime wrappers trigger a rebuild automatically.
- Target architecture resolution order: `CUDA_TARGET_SM` → `TORCH_CUDA_ARCH_LIST` → detected GPU compute capability → fallback `sm_75`.
- Set `CUDA_TARGET_SM` when you need deterministic builds across machines.
Supported architectures:
- sm_75: Turing (T4, RTX 2080)
- sm_80: Ampere (A100, RTX 3090)
- sm_86: Ampere (RTX 3060/3070/3080)
- sm_89: Ada Lovelace (RTX 4090)
- sm_90: Hopper (H100, H200)
Performance monitoring:
```python
from cuda_opt_wrapper import print_performance_summary

# After training
print_performance_summary()
# Shows per-operation timing, speedup metrics, throughput
```

All CUDA kernels maintain numerical stability equivalent to PyTorch:
- RMSNorm: Epsilon term (1e-6) prevents division by zero
- RoPE: No approximations, exact trigonometric functions
- SwiGLU: Fast SiLU uses stable exp(-x) formulation
- Loss: Max-subtraction for softmax numerical stability
- Gradient clipping: Norm computation uses compensated summation
Validation: Every kernel includes PyTorch fallback for correctness verification. Automated tests compare CUDA vs PyTorch outputs (tolerance: 1e-4 for FP32, 1e-3 for FP16).
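In spirit, that comparison looks like the pure-Python stand-in below. The helper names are hypothetical and the real tests compare tensors elementwise; this just shows the documented tolerances in action:

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def check_kernel_output(cuda_out, torch_out, dtype="fp32"):
    """Accept the accelerated output if it matches the PyTorch fallback
    within the documented tolerance (1e-4 for FP32, 1e-3 for FP16)."""
    tol = 1e-4 if dtype == "fp32" else 1e-3
    return max_abs_diff(cuda_out, torch_out) <= tol

print(check_kernel_output([1.00005, 2.0], [1.0, 2.0], "fp32"))  # True
print(check_kernel_output([1.01, 2.0], [1.0, 2.0], "fp32"))     # False
```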
Pre-configured architecture presets for training models from scratch, spanning 500K to 300B parameters. These are configuration templates, not pre-trained models.
Each preset specifies architecture dimensions, MoE/MoD parameters, hardware targets, and expected performance with CUDA acceleration for initializing and training new models.
| Config | Active Params | Total Params | Hidden | Layers | Heads | KV Heads | Experts | Top-K | Hardware | Memory (FP16) | Throughput | CUDA Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `debug` | 500K | 4M | 128 | 2 | 2 | 2 | 8 | 2 | Any | 50 MB | Testing | 2.1x |
| `debug_200m` | 200M | 6B | 768 | 12 | 12 | 12 | 32 MoD | - | T4/MPS | 2 GB | Testing | 2.8x |
| `b1` | 1B | 8B | 1024 | 24 | 16 | 4 | 8 | 2 | RTX 3090, M1 Max | 8 GB | 1400 tok/s | 3.2x |
| `b7` | 7B | 56B | 4096 | 32 | 32 | 8 | 8 | 2 | A100 40GB | 28 GB | 680 tok/s | 3.5x |
| `b14` | 14B | 112B | 5120 | 40 | 40 | 10 | 8 | 2 | A100 80GB | 56 GB | 340 tok/s | 3.8x |
| `b30` | 30B | 240B | 8192 | 48 | 64 | 16 | 8 | 2 | 4× A100 80GB | 120 GB | 135 tok/s | 4.1x |
| `b50` | 50B | 400B | 10240 | 56 | 80 | 20 | 8 | 2 | 4× H100 | 200 GB | 68 tok/s | 4.5x |
| `b100` | 100B | 800B | 12288 | 80 | 96 | 24 | 8 | 2 | 8× H100 | 400 GB | 68 tok/s | 4.8x |
| `b200` | 200B | 1.6T | 16384 | 100 | 128 | 32 | 8 | 2 | 16× H200 | 800 GB | 41 tok/s | 5.2x |
| `b300` | 300B | 2.4T | 20480 | 120 | 160 | 40 | 8 | 2 | 32× H200 | 1.2 TB | 27 tok/s | 5.5x |
Memory estimates: Include model weights, optimizer states (Adam: 8 bytes/param), gradients, and activation memory at batch_size=1, mixed precision training. Actual memory scales with batch size and sequence length.
Throughput estimates: Measured with CUDA acceleration enabled at batch_size=1, sequence_length=2048, mixed precision with gradient checkpointing. The CUDA speedup column shows combined acceleration from all custom kernels vs. pure PyTorch. (Preset `debug_200m` is excluded from the throughput comparison.)
Configuration selection:
- Development/testing: `debug` for pipeline validation, `debug_200m` for architecture testing
- Research: `b1` for prototyping on consumer hardware with CUDA acceleration
- Production fine-tuning: `b7` for quality/efficiency balance with significant CUDA speedup
- Large-scale pre-training: `b30`+ for maximum model capacity
- Extreme scale: `b100`+ requires cluster infrastructure and distributed expertise
Important:
These presets define untrained model architectures. Training starts from random initialization following standard practices (Xavier/Kaiming initialization for weights, zero initialization for biases). The framework does not provide pre-trained checkpoints.
Customization:
All presets are starting points, and architecture dimensions can be modified: `hidden_size` must be divisible by `num_heads`; `intermediate_size` is typically 8/3 × `hidden_size` rounded to the nearest 256 for optimal CUDA performance; `max_position_embeddings` determines the context window; `num_experts` and `moe_top_k` can be adjusted independently; the MoD `capacity_factor` controls the compute/quality tradeoff. CUDA kernels automatically adapt to configuration changes.
Numerical formats for parameters, activations, and gradients during training and inference.
FP32 (Float32) - Full Precision
- 32-bit floating point (8-bit exponent, 23-bit mantissa)
- Range: ±3.4×10^38, precision: ~7 decimal digits
- Maximum stability, no special handling required
- 2× memory vs FP16/BF16, significantly slower on modern hardware
- Use cases: CPU training, numerical debugging, stability issues with reduced precision
FP16 (Float16) - Half Precision
- 16-bit floating point (5-bit exponent, 10-bit mantissa)
- Range: ±65504, precision: ~3 decimal digits
- 50% memory reduction, ~2× speedup on supported hardware
- Requires loss scaling to prevent gradient underflow (small gradients round to zero)
- Dynamic or static loss scaling: multiply loss by 2^N before backward, unscale gradients before update
- CUDA kernels maintain FP32 accumulation for numerical stability
- Use cases: Volta/Turing GPUs (V100, T4, RTX 2080), Apple Silicon (M1-M4)
BF16 (BFloat16) - Brain Float16
- 16-bit format (8-bit exponent, 7-bit mantissa)
- Range: Same as FP32 (±3.4×10^38), reduced precision vs FP32
- 50% memory reduction, similar speed to FP16
- No loss scaling required (wide dynamic range like FP32)
- Better training stability than FP16 with same memory benefits
- CUDA kernels optimized for BF16 with hardware acceleration on Ampere+
- Use cases: Ampere+ GPUs (A100, RTX 3090/4090, H100), primary recommendation for modern hardware
Mixed Precision FP16
- Forward/backward in FP16, master copy of parameters in FP32
- Optimizer updates FP32 master copy, then copies to FP16 for next forward
- Dynamic loss scaling automatically adjusts to prevent underflow
- Combines FP16 speed with FP32 stability
- CUDA kernels use FP32 accumulation internally
- Use cases: Default for older GPUs supporting FP16 but not BF16
Mixed Precision BF16
- Forward/backward in BF16, master parameters in FP32
- No loss scaling needed (BF16 dynamic range matches FP32)
- Simpler than mixed FP16 (no loss scaling configuration)
- Best speed/stability balance on modern hardware
- CUDA kernels leverage hardware BF16 tensor cores
- Use cases: Default for Ampere+ GPUs, primary recommendation for production
FP8 (Float8) - Experimental
- 8-bit floating point: E4M3 (forward) and E5M2 (backward) variants
- Further memory reduction (75% vs FP32)
- Requires H100 or newer with hardware FP8 support
- Complex configuration, quality impacts not fully characterized
- Custom CUDA kernels under development for FP8 support
- Use cases: Cutting-edge research, not recommended for general use
INT8 Quantization
- 8-bit integer representation (post-training or quantization-aware training)
- Primarily for inference, not training
- Reduces model size by 75% vs FP32 for deployment
- Quality impact depends on calibration and quantization method
- Use cases: Model deployment, edge devices
The framework detects hardware and selects optimal precision with CUDA kernel compatibility:
Detection logic:
- Check for CUDA availability and GPU compute capability
- Verify CUDA kernel compilation and availability
- If Ampere+ (compute capability ≥ 8.0): select `mixed_bf16` with BF16-optimized kernels
- If Volta/Turing (compute capability 7.0-7.5): select `mixed_fp16` with FP16-optimized kernels
- If Apple Silicon (MPS): select `fp16` (BF16 not supported, CUDA kernels unavailable)
- If CPU: select `fp32` (reduced precision offers no benefit, CUDA kernels unavailable)
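The detection logic reduces to a small decision ladder. This standalone helper is illustrative (not the framework's actual function); compute capability is modeled as a `(major, minor)` tuple:

```python
def select_precision(has_cuda, compute_capability=None, has_mps=False):
    """Pick a training precision following the detection order above."""
    if has_cuda and compute_capability is not None:
        if compute_capability >= (8, 0):
            return "mixed_bf16"   # Ampere+ with BF16-optimized kernels
        if compute_capability >= (7, 0):
            return "mixed_fp16"   # Volta/Turing with FP16-optimized kernels
    if has_mps:
        return "fp16"             # Apple Silicon: no BF16, no CUDA kernels
    return "fp32"                 # CPU fallback

print(select_precision(True, (8, 6)))        # mixed_bf16 (e.g. RTX 3080)
print(select_precision(True, (7, 5)))        # mixed_fp16 (e.g. T4)
print(select_precision(False, None, True))   # fp16 (Apple Silicon)
print(select_precision(False))               # fp32 (CPU)
```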
Override: Set precision explicitly via configuration if automatic selection is suboptimal or for specific debugging/testing requirements.
Quantization & Inference:
- AutoGPTQ: 4-bit quantization support for efficient inference and fine-tuning on consumer hardware.
- Optimum Quanto: Dynamic quantization support (8-bit/4-bit) for flexible deployment.
- OpenAI Triton: High-performance FP8 kernels for H100+ architectures.
Model Compatibility:
- DeepSeek Config Adapter: Auto-convert training configurations to DeepSeek-compatible formats.
- HuggingFace Interop: Seamless integration with `transformers` for dataset loading and tokenization.
NVIDIA Ampere/Ada/Hopper (A100, RTX 3090/4090, H100, H200):
- Recommended: `mixed_bf16` with full CUDA acceleration (4-5x total speedup)
- Alternative: `mixed_fp16` (if BF16 causes unexpected issues)
- Advanced: `fp8_e4m3` (H100+ only, experimental, custom kernels in development)
- CUDA benefits: All kernels optimized for tensor cores, maximum performance
NVIDIA Volta/Turing (V100, T4, RTX 2080/2080Ti):
- Recommended: `mixed_fp16` with CUDA acceleration (3-4x speedup)
- Alternative: `fp32` (if stability issues arise; 2-3x CUDA speedup maintained)
- Note: BF16 not supported (no hardware acceleration)
- CUDA benefits: FP16 tensor core utilization, optimized memory access
Apple Silicon (M1/M2/M3/M4, Mac Studio, MacBook Pro):
- Recommended: `fp16` (CUDA kernels unavailable, PyTorch fallback)
- Note: Mixed precision and BF16 not supported on MPS backend
- Limitations: Flash Attention disabled, DeepSpeed unavailable, no custom CUDA acceleration
CPU (Intel/AMD/ARM):
- Recommended: `fp32` (CUDA kernels unavailable, PyTorch fallback)
- Note: Reduced precision offers minimal benefit on CPU
- Expect significantly slower training than GPU (10-100× depending on model size)
- Performance: No CUDA acceleration, relies on BLAS libraries (MKL, OpenBLAS, Accelerate)
- Training precision: Format used during forward pass, backward pass, and gradient computation
- Inference precision: Format used during validation and evaluation
- Master precision: Format for the optimizer's master parameter copy (typically FP32 in mixed precision)
- CUDA kernel precision: Selected automatically based on training precision
Separate training/inference precision:
Common pattern: Train in mixed_bf16 for speed with CUDA acceleration, evaluate in fp32 for precise metrics. Or train in mixed_fp16 with CUDA kernels, deploy in int8 for inference.
Loss scaling parameters (FP16 only):
- `init_scale`: Initial loss scaling factor (default: 2^16)
- `scale_factor`: Multiplier for scale adjustment (default: 2.0)
- `scale_window`: Steps without overflow before increasing scale (default: 2000)
- `min_scale`: Minimum scale factor (default: 1.0)
Dynamic loss scaling adjusts automatically: scale increases every scale_window steps without overflow, decreases on overflow detection (NaN/Inf gradients). CUDA kernels maintain numerical stability with loss scaling. Most users do not need to modify these parameters.
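The scaling policy can be sketched as a small state machine using the documented defaults. This is an illustrative class; the framework's internal scaler may differ in details:

```python
class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling for FP16 training."""

    def __init__(self, init_scale=2**16, scale_factor=2.0,
                 scale_window=2000, min_scale=1.0):
        self.scale = float(init_scale)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow):
        if found_overflow:  # NaN/Inf detected in gradients
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                self.scale *= self.scale_factor  # grow after a clean window
                self.good_steps = 0

scaler = DynamicLossScaler()
scaler.update(found_overflow=True)  # overflow halves the scale
print(scaler.scale)  # 32768.0
```

In a training loop, the loss is multiplied by `scaler.scale` before `backward()`, gradients are unscaled before the optimizer step, and `update()` is called with the overflow flag each step.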
Free GPU training demonstration requiring no local setup with CUDA acceleration showcase.
Environment:
- Hardware: Google Colab T4 GPU (15.8GB memory, Turing architecture, sm_75)
- CUDA: 11.8 or 12.x (varies by Colab environment)
- Runtime: ~3.5 hours for complete training run (with CUDA acceleration, ~5 hours without)
Configuration:
- Model: `debug` preset (14M total, 2M active, 8 experts, top-2 routing)
- Dataset: Small conversational corpus (~200 samples, ~2.5M tokens)
- Training: 12 epochs, batch_size=30, gradient_accumulation=8, effective_batch_size=240
- Precision: Mixed FP16 (automatic for T4) with CUDA-accelerated operations
- Learning rate: 1e-4 with cosine warmup and decay
- CUDA kernels: Automatically compiled on first run (1-2 minutes compilation time)
Observable behaviors:
- System diagnostics: Hardware detection, precision selection, CUDA kernel loading
- Chinchilla scaling: Optimal token calculation (20× parameters), epoch recommendations
- Training metrics: Loss progression (~2.5 → ~2.0), throughput (~53–55k tok/s with CUDA vs ~30k tok/s without)
- Orchestrator: Health checks every 100 steps, intervention decisions
- Expert statistics: Utilization distribution, routing entropy, load balance
- CUDA performance: Real-time speedup metrics, per-operation timing, memory efficiency
Limitations:
- Colab quotas: Usage limits, session timeouts
- Small dataset: Limited quality, demonstrates pipeline not production training
- Free tier: T4 availability not guaranteed, may encounter queuing
- CUDA compilation: First run requires kernel compilation (~1-2 minutes)
Requirements:
- Python 3.8+ (3.10+ recommended)
- PyTorch 2.0+ (2.2+ recommended)
- CUDA 11.8+ (for GPU with acceleration) or CPU
- CUDA Toolkit with nvcc (for compiling custom kernels)
- RAM: 16GB minimum, 32GB+ recommended
- Disk: 50GB+ for dependencies, datasets, checkpoints
Installation:
```bash
git clone https://github.com/matn23/AdaptiveTrainingSystem
cd AdaptiveTrainingSystem
pip install -r requirements.txt

# Compile CUDA kernels (optional but recommended for 3-5x speedup)
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh

# Verify kernel compilation
python -c "from cuda_opt_wrapper import TRANSFORMER_OPS_AVAILABLE; print(f'CUDA ops: {TRANSFORMER_OPS_AVAILABLE}')"

cd ../..
python Main.py
```

Optional dependencies:
- Flash Attention: 2-4× attention speedup, requires manual compilation
- DeepSpeed: Multi-GPU optimization, auto-installs but compiles kernels on first use
- Wandb: Experiment tracking, requires API key
Quick start:
Default configuration uses debug preset for rapid testing. System auto-detects hardware, compiles CUDA kernels if nvcc available, selects precision, validates data, initializes model, begins training with acceleration.
Resume from checkpoint:
```bash
python Main.py --resume path/to/checkpoint.pt
```

Restores model state, optimizer state, scheduler, training step counter, random seeds, and CUDA kernel state.
CUDA kernel management:
```bash
# Check kernel status
python -c "from moe_cuda_wrapper import print_performance_summary; print_performance_summary()"

# Recompile kernels after update
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh

# Disable CUDA acceleration (for debugging), set in config:
# use_cuda: false
# use_fused_rope/use_fused_swiglu/use_fused_moe/use_fused_loss/use_fused_grad_clip: false
```

Four data handling strategies for different use cases, all compatible with CUDA acceleration.
Raw text without conversational structure. For domain-specific pre-training, continued pre-training, language modeling research.
Data format:
- Plain text files (.txt)
- JSONL with "text" field
Processing: Tokenize and split into fixed-length sequences with optional overlap (stride parameter). CUDA-accelerated tokenization for large corpora.
Loss: Applied to all tokens (causal language modeling) with fused loss computation kernel.
Conversational data with role annotations. For instruction tuning, chat models, task-specific adaptation.
Data format: JSONL with "conversation" field containing list of messages. Each message has "role" (system/user/assistant) and "content".
Processing: Concatenate messages with special tokens marking roles: [BOS] system [SEP] user [SEP] assistant [EOS]. CUDA-accelerated tokenization and batching.
Loss: Can mask user tokens (loss only on assistant responses) or compute on all tokens. Fused loss kernel handles masking efficiently.
Two-phase training: base corpus then conversational data. Builds general understanding, then adapts to conversation.
Configuration: Separate epoch counts per phase. Optional learning rate warmup between phases to handle distribution shift. CUDA acceleration maintained across phase transitions.
Use cases: Domain adaptation (medical literature → clinical QA), continual learning (new data → maintained task performance).
Mix base and conversational data within batches/epochs. Maintains general capabilities while learning conversation. Prevents catastrophic forgetting.
Mixing ratio: base_ratio controls proportion (e.g., 0.7 = 70% base, 30% conversational).
Strategies:
- Ratio-based: Sample according to ratio per batch
- Alternating: Cycle between sources (batch 1 base, batch 2 conversational, etc.)
- Random: Random sampling with probability = base_ratio
Use cases: General-purpose chat models, multi-task learning with auxiliary objectives.
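The random strategy reduces to a Bernoulli draw per sample. A minimal sketch with an illustrative helper name:

```python
import random

def sample_source(base_ratio, rng=random):
    """Draw 'base' with probability base_ratio, else 'conversational'."""
    return "base" if rng.random() < base_ratio else "conversational"

# With base_ratio=0.7, roughly 70% of draws come from the base corpus
rng = random.Random(0)  # seeded for reproducibility
sources = [sample_source(0.7, rng) for _ in range(1000)]
print(sources.count("base"))  # roughly 700 of 1000 draws
```

The ratio-based and alternating strategies replace the random draw with deterministic per-batch accounting, which gives exact proportions at the cost of less shuffling between sources.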
State machine monitoring training every N steps (default: 100). Triggers interventions across 18 methods when confidence threshold exceeded (default: 0.75). Monitors CUDA kernel performance and adjusts strategies accordingly.
Monitored metrics:
- Loss dynamics: Trend, variance, plateau detection, divergence warnings
- Gradient statistics: Norm, variance, stability over windows, clipping frequency
- Expert utilization: Per-expert usage, routing entropy, load balance (MoE)
- Memory consumption: GPU/system memory, OOM risk prediction, CUDA kernel memory usage
- Throughput: Tokens/second with and without CUDA acceleration, kernel efficiency
- Convergence: Score based on loss stability, compute efficiency
- CUDA performance: Per-kernel timing, speedup metrics, fallback frequency
Intervention categories:
- Hyperparameter adaptation: Learning rate, weight decay, batch size
- Architecture modification: Add/remove experts, adjust capacities, routing temperature
- Emergency recovery: Gradient explosion handling, OOM recovery, checkpoint rollback
- Schedule optimization: Early stopping recommendations, checkpoint prioritization
- Performance optimization: CUDA kernel selection, precision adjustment, memory optimization
Decision process:
- Collect metrics at checkpoint interval including CUDA performance
- Analyze patterns (plateau, divergence, imbalance, performance degradation)
- Compute intervention confidence scores
- If confidence > threshold, execute intervention
- Log decision with reasoning, confidence, and expected performance impact
- Monitor intervention impact in subsequent intervals including CUDA efficiency
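The confidence-gated loop can be sketched as follows. Rule and metric names are illustrative; the real orchestrator tracks 20+ metrics, 18 intervention methods, and a decision history:

```python
def decide_interventions(metrics, rules, threshold=0.75):
    """Each rule maps a metrics dict to (intervention_name, confidence);
    only interventions above the confidence threshold fire."""
    decisions = []
    for rule in rules:
        name, confidence = rule(metrics)
        if confidence > threshold:
            decisions.append((name, confidence))
    return decisions

def plateau_rule(m):
    # Low loss variance over the window suggests a plateau
    conf = 0.9 if m["loss_variance"] < 1e-4 else 0.1
    return ("adjust_learning_rate", conf)

print(decide_interventions({"loss_variance": 5e-5}, [plateau_rule]))
# [('adjust_learning_rate', 0.9)]
```

Gating on confidence is what prevents excessive intervention: a noisy metric that only weakly suggests a problem never crosses the threshold, so the trainer is left alone.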
MoE Architecture Management (3 methods):
- `add_expert(layer_idx)`: Add an expert to an overloaded layer when average utilization > threshold
- `prune_expert(layer_idx, expert_idx)`: Remove an expert with utilization < threshold
- `_initialize_new_expert(new_expert, existing_experts)`: Initialize a new expert from existing experts with noise
MoE Routing Control (4 methods):
- `adjust_capacity_factor(factor)`: Modify token capacity per expert
- `adjust_routing_temperature(temp)`: Control routing sharpness (lower = more concentrated)
- `enable_expert_dropout(prob)`: Regularization via random expert dropping
- `get_expert_statistics()`: Retrieve utilization, entropy, load balance loss, tokens dropped
MoD Control (2 methods):
- `adjust_mod_capacity(factor)`: Change the fraction of tokens using full computation
- `get_mod_statistics()`: Retrieve capacity factor, tokens processed, average depth
Batch Size Adaptation (2 methods):
- `adjust_batch_size(new_size)`: Change micro batch size (typically for OOM recovery)
- `_recreate_dataloader(dataset)`: Rebuild the dataloader after a batch size change
Emergency Recovery (2 methods):
- `emergency_lr_reduction(factor)`: Reduce learning rate by a factor (for gradient explosion)
- `rollback_steps(num_steps)`: Revert to an earlier checkpoint (for divergence)
Optimizer Adjustments (2 methods):
- `adjust_weight_decay(value)`: Modify regularization strength
- `_update_optimizer_param_groups(param, value)`: Internal parameter group update
Real-time Metrics (3 methods):
- `get_current_metrics()`: Query loss, learning rate, gradient norm, throughput, CUDA stats
- `_extract_moe_routing_stats()`: Internal MoE statistics extraction
- `_calculate_throughput()`: Compute tokens/second with CUDA performance breakdown
Performance Optimization (2 methods):
- `optimize_cuda_performance()`: Analyze and tune CUDA kernel usage
- `get_cuda_performance_stats()`: Detailed per-kernel timing and efficiency metrics
Automatic training duration calculation following compute-optimal scaling laws (Hoffmann et al., 2022) with CUDA-aware compute budgets.
Formula: N_optimal_tokens = multiplier × model_parameters
Default multiplier: 20× (configurable: 10-50×)
Process:
- Calculate optimal token budget: `N_opt = 20 × total_parameters`
- Adjust for CUDA acceleration: `effective_N_opt = N_opt / speedup_factor`
- Determine base epochs: `epochs = N_opt / dataset_tokens`
- Clamp to min/max epoch constraints (default: 1-50)
- Monitor during training: convergence score, loss landscape, compute efficiency
- Track CUDA performance: kernel timing, memory efficiency, bottleneck identification
- Adjust dynamically: Reduce epochs if fast convergence, stop early if plateaued
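The budget-to-epochs calculation above can be sketched as a small function (an illustrative helper, not the framework's API):

```python
# Chinchilla-style epoch calculation: token budget = multiplier x parameters,
# epochs clamped to configured bounds (Hoffmann et al., 2022).
def chinchilla_epochs(total_parameters: int, dataset_tokens: int,
                      multiplier: float = 20.0,
                      min_epochs: int = 1, max_epochs: int = 50) -> dict:
    n_opt = multiplier * total_parameters          # optimal token budget
    raw_epochs = n_opt / dataset_tokens            # passes over the dataset
    epochs = int(min(max(round(raw_epochs), min_epochs), max_epochs))
    coverage = dataset_tokens / n_opt              # dataset fraction of budget
    return {"optimal_tokens": n_opt, "epochs": epochs, "coverage": coverage}

# A 1B-parameter model on a 5B-token dataset: 20B-token budget, 4 epochs.
print(chinchilla_epochs(1_000_000_000, 5_000_000_000))
```

With the default 20× multiplier, a dataset covering 25% of the budget yields 4 recommended epochs before the dynamic adjustments above kick in.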
Enhanced features:
- Loss landscape analysis: Detect plateaus (low variance over window), divergence (rapid loss increase)
- Compute efficiency: Track loss reduction per FLOP including CUDA acceleration benefits
- Adaptive curriculum: Adjust learning rate or data sampling based on convergence phase
- Early stopping: Recommend termination when convergence score > threshold (typically 85%)
- CUDA-aware budgeting: Account for acceleration when calculating optimal training duration
Runtime integration:
System calculates optimal duration at training start with CUDA performance estimates. Displays token budget, coverage percentage (dataset_tokens / optimal_tokens), recommended epochs, expected CUDA speedup. During training, prints status every N steps: current progress, convergence score, training phase (warming/learning/convergence), efficiency trend, CUDA performance metrics, recommendations (continue/adjust/stop).
Platform-specific optimizations automatically applied based on detected hardware with CUDA acceleration where available.
Automatic optimizations:
- Precision: `mixed_bf16` for Ampere+, `mixed_fp16` for Volta/Turing
- Flash Attention: Enabled on Ampere+ (compute capability ≥ 8.0)
- Custom CUDA kernels: Automatic compilation and loading of optimized operations
- Tensor cores: Automatically utilized for supported operations and precisions
- CUDA graphs: Enabled for static computation graphs (requires compile=True)
- Memory management: Optimized allocation patterns for kernel efficiency
Configuration parameters:
- `use_flash_attention`: Enable Flash Attention 2.x (2-4× attention speedup)
- `use_fused_rmsnorm`: Enable fused RMSNorm kernel (default profile: `false`)
- `use_fused_rope`: Enable fused RoPE kernel (default profile: `true`)
- `use_fused_swiglu`: Enable fused SwiGLU kernel (default profile: `true`)
- `use_fused_moe`: Enable CUDA MoE routing/dispatch kernels (default profile: `true`)
- `use_fused_loss`: Enable fused loss kernel (default profile: `true`)
- `use_fused_grad_clip`: Enable fused gradient clipping kernel (default profile: `true`)
- `validate_moe_cuda_indices`: Extra safety checks for CUDA MoE indices (default: `false`, slower)
- `force_dense_expert_grads`: Force dense expert gradient paths (default: `false`, slower)
- `routing_stats_update_interval`: Routing stats sync cadence (default: `64` steps)
- `gradient_checkpointing`: Trade compute for memory (enables larger models)
- `compile`: PyTorch 2.0 compilation (5-30% additional speedup)
- `use_deepspeed`: Enable DeepSpeed for multi-GPU
- `zero_stage`: ZeRO optimization level (0-3)
DeepSpeed ZeRO stages:
- Stage 0: Disabled (standard DDP with CUDA acceleration)
- Stage 1: Partition optimizer states (~4× memory reduction)
- Stage 2: Partition optimizer + gradients (~8× reduction)
- Stage 3: Partition optimizer + gradients + parameters (~N× reduction where N = num_GPUs)
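The stage-by-stage savings can be estimated per GPU with the common 16-bytes-per-parameter rule of thumb for mixed-precision Adam (2 bytes weights + 2 bytes gradients + 12 bytes optimizer state); actual usage varies with implementation details:

```python
# Rough per-GPU static memory under each ZeRO stage. Each stage shards
# one more component across the data-parallel group.
def zero_memory_gb(params: int, num_gpus: int, stage: int) -> float:
    weights, grads, optim = 2 * params, 2 * params, 12 * params  # bytes
    if stage >= 1:
        optim /= num_gpus        # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus        # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus      # stage 3: also shard parameters
    return (weights + grads + optim) / 1e9

# 7B parameters on 4 GPUs: 112 GB/GPU at stage 0 down to 28 GB/GPU at stage 3.
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {zero_memory_gb(7_000_000_000, 4, stage):.1f} GB/GPU")
```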
Memory optimization:
- CPU offload: Move optimizer states to CPU memory (slower updates, massive memory savings)
- Gradient compression: Reduce communication volume (quality impact minimal)
- Activation checkpointing: Recompute activations during backward (trade compute for memory)
- CUDA kernel memory: Optimized temporary buffer management
CUDA kernel performance:
- Throughput monitoring: Real-time tokens/second with and without acceleration
- Per-kernel timing: Individual operation profiling
- Bottleneck detection: Identify operations limiting performance
- Automatic tuning: Adjust kernel launch configurations for optimal performance
Custom Metal Shaders (NEW):
- Fused RMSNorm: 2-3x faster than PyTorch
- Rotary Position Embeddings: 3-5x faster
- SwiGLU Activation: 2-3x faster
- MoE Routing: Optimized expert dispatch and combination
- Automatic detection: Metal shaders auto-load when available
Metal shader compilation:
`cd Src/Main_Scripts/core`
`./compile_metal.sh`
Automatic optimizations:
- Precision: FP16 (optimal for Metal, auto-selected)
- Sequence length: Automatically reduced to 512 on MPS to prevent OOM
- Data loading: num_workers=0 (MPS prefers single-threaded)
- Unified memory: Automatic page management
- Backend detection: Unified backend system selects Metal when available
Limitations:
- Flash Attention: Not supported (Metal alternative in development)
- DeepSpeed: Not available (Linux/CUDA only)
- Mixed precision: FP16 or FP32 only (no BF16)
- Compilation: Can be unstable (set compile=False if issues)
- Memory: Unified memory shared with system (monitor via Activity Monitor)
Recommendations:
- Start with batch_size=2-4 (auto-reduced seq_length=512 helps)
- Enable gradient checkpointing for larger models
- Use `debug` or `debug_200m` presets for testing
- Expected throughput: 2-3× faster than pure PyTorch on MPS
Automatic settings:
- Precision: FP32 (reduced precision offers no benefit on CPU)
- Threading: Automatic core detection, configurable via num_workers
- PyTorch fallback: CUDA kernels unavailable
Optimizations:
- BLAS libraries: MKL (Intel), OpenBLAS (AMD/ARM), Accelerate (macOS)
- Thread count: Typically num_cores - 2 for system overhead
- Vectorization: Automatic SIMD instruction usage
Expectations:
- 10-100× slower than GPU depending on model size
- 30-150× slower than CUDA-accelerated GPU training
- Suitable for debugging, not production training
- Memory constraints less severe (can use system RAM)
Memory-efficient data loading with zero-copy operations, automatic caching, and CUDA-accelerated preprocessing.
Features enabled:
- Memory-mapped file access: Read datasets without loading entirely into RAM
- Zero-copy operations: Apache Arrow columnar format, no serialization overhead
- Multi-threaded loading: Configurable worker processes (num_workers parameter)
- Automatic caching: HuggingFace Datasets caches processed data
- Sharding: Automatic data distribution for multi-GPU training
- Polars acceleration: Fast DataFrame operations for preprocessing
- CUDA-accelerated tokenization: Parallel token encoding on GPU for large batches
Data intelligence:
- Difficulty-based sampling: Prioritize harder examples based on loss history
- Curriculum learning: Gradually increase example complexity (aggressiveness parameter: 0-1)
- Automatic cleaning: Remove malformed samples, normalize formatting
- Quality threshold: Filter samples below quality score (default: 0.85)
- Sequence length optimization: Dynamic padding, variable length batching
Preprocessing pipeline:
- Load raw files (txt, jsonl)
- Validate format and structure
- Quality filtering (configurable threshold)
- Tokenization with CUDA acceleration and caching
- Sequence construction (chunking or conversation formatting)
- Difficulty scoring (for curriculum learning)
- Batch construction with dynamic padding
- Transfer to GPU with pinned memory for efficient data transfer
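The dynamic-padding step of the pipeline pads each batch to its own longest sequence rather than a global maximum; a minimal sketch (`collate_dynamic` is an illustrative helper, not the framework's collator):

```python
# Dynamic padding: pad token-id lists only to the batch maximum, avoiding
# wasted compute on short samples. Returns padded ids and an attention mask.
def collate_dynamic(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

tokens, mask = collate_dynamic([[5, 6, 7], [8, 9]])
print(tokens)  # [[5, 6, 7], [8, 9, 0]] -- padded to the batch max only
```

In the real pipeline these batches would then be converted to tensors and moved to the GPU with pinned memory, as described above.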
Statistics tracked:
- Total samples, valid/invalid counts
- Token statistics: mean, median, max, min, standard deviation
- Role distribution (conversational data)
- Quality scores: error rate, issues detected
- Sequence length distribution
- CUDA preprocessing time and speedup metrics
Validation:
System validates all data paths before training. Checks: file existence, readability, size, format correctness. Prints summary: file count, total size, samples per file, estimated CUDA preprocessing speedup. Reports errors: missing files, corrupt formats, empty files.
Comprehensive metrics tracked during training with real-time logging, experiment tracking integration, and CUDA performance monitoring.
Core metrics:
- Loss: Training and validation, rolling average, per-batch values, fused computation timing
- Perplexity: exp(loss), interpretable quality measure
- Accuracy: Token-level prediction accuracy, fused with loss computation
- Learning rate: Current value from scheduler
- Throughput: Tokens per second with CUDA acceleration, samples per second, speedup ratio
- Gradient norm: L2 norm of gradients, variance over window, clipping frequency
- Memory usage: GPU allocated/reserved, system RAM, CUDA kernel buffers
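Perplexity and throughput follow directly from the tracked quantities; a minimal sketch (illustrative helper, not the framework's metrics API):

```python
# Derived metrics: perplexity is exp(mean cross-entropy loss) and
# throughput is tokens processed per wall-clock second.
import math

def derived_metrics(mean_loss: float, tokens: int, seconds: float) -> dict:
    return {
        "perplexity": math.exp(mean_loss),      # interpretable quality measure
        "tokens_per_second": tokens / seconds,  # throughput
    }

m = derived_metrics(mean_loss=2.0, tokens=680_000, seconds=1000.0)
print(m)  # perplexity ~7.39, throughput 680.0 tokens/s
```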
MoE-specific metrics:
- Expert utilization: Fraction of tokens per expert, per-layer and aggregated
- Routing entropy: Distribution concentration, higher = more balanced
- Load balance loss: Auxiliary loss magnitude
- Tokens dropped: Count exceeding capacity
- Expert efficiency: Compute per expert, utilization × quality contribution
- CUDA routing performance: Top-k timing, dispatch efficiency, combination speedup
MoD-specific metrics:
- Capacity utilization: Actual vs configured capacity per layer
- Average depth: Mean number of full layers per token
- Skip patterns: Which tokens skip which layers
- Compute savings: FLOPs reduction percentage
- Per-layer usage: Fraction of tokens using full computation per layer
Chinchilla metrics:
- Token progress: Current tokens / optimal tokens percentage
- Convergence score: 0-100% based on loss stability
- Training phase: warming/learning/convergence/plateau
- Compute efficiency: Loss reduction per FLOP including CUDA benefits
- Predicted final loss: Extrapolation from current trajectory
CUDA performance metrics:
- Kernel timing: Per-operation execution time
- Speedup factors: CUDA vs PyTorch for each operation
- Memory efficiency: Kernel buffer usage, allocation patterns
- Throughput breakdown: Contribution of each accelerated operation
- Fallback frequency: Count of PyTorch fallback invocations
- Occupancy: GPU utilization percentage, kernel launch efficiency
Logging configuration:
- Log level: DEBUG/INFO/WARNING/ERROR
- Log interval: Every N steps (default: 50)
- Health check interval: Orchestrator monitoring frequency (default: 100)
- Checkpoint interval: Save frequency (default: 1000 steps)
- CUDA profiling: Optional detailed kernel profiling (per 1000 steps)
Output destinations:
- Console: Real-time training progress with CUDA stats
- Log files: experiments/[name]/logs/training.log
- Metrics files: JSON format, experiments/[name]/metrics/
- Wandb: Optional cloud logging and visualization
- TensorBoard: Optional local visualization
- CUDA profiles: experiments/[name]/cuda_profiles/ (when enabled)
Health checks:
Orchestrator performs comprehensive health assessment every N steps:
- Loss trend analysis: Increasing/decreasing/plateau/divergence
- Gradient stability: Norm within expected range, no explosions
- Memory status: Utilization percentage, OOM risk
- Expert balance: MoE utilization distribution (if applicable)
- Throughput: Current vs baseline, degradation detection
- Convergence: Progress toward optimal loss
- CUDA efficiency: Kernel performance, speedup maintenance, memory usage
Health check output includes status (healthy/warning/critical), detected issues, recommended interventions, confidence scores, CUDA performance summary.
Throughput measurements on reference hardware configurations with and without CUDA acceleration. All benchmarks use sequence_length=2048, batch_size optimized per GPU, mixed precision training with gradient checkpointing.
Google Colab T4 (15.8GB, Turing, sm_75):
- debug (14M total, 2M active):
- With CUDA: ~53,000–55,000 tokens/second, batch_size=30, grad_steps=8
- Without CUDA (PyTorch): ~30,000 tokens/second, batch_size=30, grad_steps=8
- Speedup: ~1.8×
- Memory efficiency: High utilization with optimized kernels
NVIDIA RTX 3090 (24GB, Ampere, sm_80):
- b1 (1B active):
- With CUDA: ~1400 tokens/second, batch_size=16
- Without CUDA: ~450 tokens/second, batch_size=16
- Speedup: 3.1×
- b7 (7B active):
- With CUDA: ~270 tokens/second, batch_size=4, requires ZeRO-2
- Without CUDA: ~75 tokens/second, batch_size=4
- Speedup: 3.6×
- Memory efficiency: 85-90% utilization at optimal batch size
NVIDIA A100 40GB (Ampere, sm_80):
- b1 (1B active):
- With CUDA: ~1600 tokens/second, batch_size=32
- Without CUDA: ~480 tokens/second, batch_size=32
- Speedup: 3.3×
- b7 (7B active):
- With CUDA: ~680 tokens/second, batch_size=16
- Without CUDA: ~195 tokens/second, batch_size=16
- Speedup: 3.5×
- b14 (14B active):
- With CUDA: ~210 tokens/second, batch_size=4, requires ZeRO-2
- Without CUDA: ~55 tokens/second, batch_size=4
- Speedup: 3.8×
- Memory efficiency: 90-95% utilization
NVIDIA A100 80GB (Ampere, sm_80):
- b7 (7B active):
- With CUDA: ~750 tokens/second, batch_size=24
- Without CUDA: ~210 tokens/second, batch_size=24
- Speedup: 3.6×
- b14 (14B active):
- With CUDA: ~340 tokens/second, batch_size=12
- Without CUDA: ~90 tokens/second, batch_size=12
- Speedup: 3.8×
- b30 (30B active):
- With CUDA: ~68 tokens/second, batch_size=2, requires ZeRO-3
- Without CUDA: ~16 tokens/second, batch_size=2
- Speedup: 4.3×
- Memory efficiency: 85-92% utilization
NVIDIA H100 80GB (Hopper, sm_90):
- b14 (14B active):
- With CUDA: ~550 tokens/second, batch_size=16
- Without CUDA: ~125 tokens/second, batch_size=16
- Speedup: 4.4×
- b30 (30B active):
- With CUDA: ~160 tokens/second, batch_size=8
- Without CUDA: ~35 tokens/second, batch_size=8
- Speedup: 4.6×
- b50 (50B active):
- With CUDA: ~68 tokens/second, batch_size=4, requires ZeRO-3
- Without CUDA: ~14 tokens/second, batch_size=4
- Speedup: 4.9×
- FP8 support: Additional 1.5-2× speedup with FP8 training
Apple M1 Max (32GB unified, MPS):
- b1 (1B active): ~300 tokens/second, batch_size=8 (PyTorch only, no CUDA)
- Memory: Unified architecture shares with system, effective 20-24GB for training
- Note: 4-5× slower than equivalent CUDA GPU with acceleration
Apple M2 Ultra (128GB unified, MPS):
- b1 (1B active): ~400 tokens/second, batch_size=16 (PyTorch only, no CUDA)
- b7 (7B active): ~80 tokens/second, batch_size=4 (PyTorch only, no CUDA)
- Memory: Up to 96GB available for training after system overhead
- Note: 3-4× slower than A100 with CUDA acceleration
4× A100 80GB (DeepSpeed ZeRO-2 + CUDA):
- b30 (30B active):
- With CUDA: ~480 tokens/second (3.5× single GPU)
- Without CUDA: ~110 tokens/second (3.4× single GPU)
- Speedup: 4.4× (CUDA over PyTorch)
- Scaling efficiency: 87%
- Communication overhead: ~13%
8× A100 80GB (DeepSpeed ZeRO-3 + CUDA):
- b50 (50B active):
- With CUDA: ~440 tokens/second
- Without CUDA: ~95 tokens/second
- Speedup: 4.6×
- b100 (100B active):
- With CUDA: ~245 tokens/second
- Without CUDA: ~52 tokens/second
- Speedup: 4.7×
- Scaling efficiency: 70-75%
- Communication overhead: 25-30%
16× H100 80GB (DeepSpeed ZeRO-3 + expert parallelism + CUDA):
- b100 (100B active):
- With CUDA: ~820 tokens/second
- Without CUDA: ~165 tokens/second
- Speedup: 5.0×
- b200 (200B active):
- With CUDA: ~380 tokens/second
- Without CUDA: ~73 tokens/second
- Speedup: 5.2×
- Scaling efficiency: 60-65%
- Expert parallelism improves MoE scaling
Scaling efficiency factors:
- Model size: Larger models have lower scaling efficiency (communication bound)
- Interconnect: InfiniBand (400Gbps) vs Ethernet (100Gbps) significantly affects scaling
- Expert parallelism: Distributing experts across GPUs improves MoE scaling
- Gradient accumulation: Higher accumulation reduces communication frequency
- CUDA acceleration: Maintained across all GPUs, reduces compute bottlenecks
CUDA Custom Kernels (3-5× overall):
- RMSNorm: 3-4× speedup over PyTorch
- RoPE: 5-7× speedup over PyTorch
- SwiGLU: 2-3× speedup over PyTorch
- MoE operations: 2-4× speedup over PyTorch
- Fused loss: 3-4× speedup over separate operations
- Combined impact: 3.2-4.8× depending on model configuration
Flash Attention (Ampere+):
- Attention speedup: 2-4× depending on sequence length
- Longer sequences benefit more (4× at 4096 length vs 2× at 512)
- Memory reduction: 30-50% for attention computation
- Quality: Numerically equivalent to standard attention
- Compatible with CUDA kernels: Combined 4-8× total speedup
PyTorch Compilation (torch.compile):
- Speedup: 5-30% depending on model architecture
- MoE models: Lower benefit (routing breaks fusion)
- Dense models: Higher benefit (more fusion opportunities)
- Startup cost: 1-5 minutes additional compilation time
- Stacks with CUDA: Additional 5-15% on top of kernel acceleration
Gradient Checkpointing:
- Memory reduction: 30-50% of activation memory
- Compute overhead: 20-30% additional training time
- Trade-off: Enables larger batch sizes, often net positive throughput
- CUDA kernels: Maintain acceleration during recomputation
Mixed Precision:
- FP32 → mixed_bf16: ~2× speedup, 50% memory reduction
- FP32 → mixed_fp16: ~2× speedup, 50% memory, may need loss scaling tuning
- BF16 → FP8: ~1.5-2× speedup (H100 only), quality impacts under investigation
- CUDA kernels: Optimized for each precision, maintain speedup ratios
Per-kernel breakdown (b7 model, A100):
Operation | PyTorch | CUDA | Speedup | % Time
-------------------|---------|-------|---------|--------
RMSNorm | 45ms | 12ms | 3.8× | 18%
RoPE | 38ms | 6ms | 6.3× | 9%
SwiGLU | 62ms | 24ms | 2.6× | 22%
MoE Routing | 28ms | 9ms | 3.1× | 11%
MoE Dispatch | 42ms | 14ms | 3.0× | 15%
MoE Combine | 35ms | 11ms | 3.2× | 13%
Loss Computation | 18ms | 5ms | 3.6× | 6%
Other Operations | 52ms | 48ms | 1.1× | 6%
-------------------|---------|-------|---------|--------
Total per batch | 320ms | 129ms | 2.5× | 100%
Effective tokens/s | 195 | 680 | 3.5× |
Note: Effective speedup higher than per-operation average due to reduced overhead and better GPU utilization.
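The table's batch-level 2.5× figure can be checked directly from the per-operation timings (numbers copied from the table above):

```python
# Per-operation timings (ms) from the b7/A100 breakdown above.
pytorch_ms = {"rmsnorm": 45, "rope": 38, "swiglu": 62, "moe_routing": 28,
              "moe_dispatch": 42, "moe_combine": 35, "loss": 18, "other": 52}
cuda_ms = {"rmsnorm": 12, "rope": 6, "swiglu": 24, "moe_routing": 9,
           "moe_dispatch": 14, "moe_combine": 11, "loss": 5, "other": 48}

total_pt = sum(pytorch_ms.values())   # 320 ms per batch
total_cu = sum(cuda_ms.values())      # 129 ms per batch
print(f"batch speedup: {total_pt / total_cu:.2f}x")  # ~2.48x
```

The gap between this 2.5× and the 3.5× effective token throughput is the reduced launch overhead and higher GPU utilization noted above, which per-kernel timing alone does not capture.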
MoE Architecture Management:
add_expert(layer_idx: int) -> None
- Adds new expert to specified transformer layer
- Expert initialized from existing experts with Gaussian noise
- Automatically updates routing weights
- Triggers rebalancing of capacity factor
- CUDA kernels automatically adapt to new expert count
prune_expert(layer_idx: int, expert_idx: int) -> None
- Removes expert from specified layer
- Parameters frozen but retained in checkpoint for potential recovery
- Routing weights adjusted to redistribute to remaining experts
- Triggers capacity factor adjustment
- CUDA dispatch kernels updated for new expert topology
MoE Routing Control:
adjust_capacity_factor(factor: float) -> None
- Updates token capacity per expert
- Values typically 1.0-2.0 (100-200% of fair share)
- Higher values reduce token dropping but increase memory
- Lower values force sharper routing but may drop tokens
- CUDA dispatch kernel buffer sizes adjusted accordingly
adjust_routing_temperature(temperature: float) -> None
- Controls routing distribution sharpness
- temperature < 1.0: Sharper routing (more specialized)
- temperature > 1.0: Softer routing (more uniform)
- Typical range: 0.5-2.0
- CUDA top-k kernel uses temperature in softmax computation
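The temperature's effect on routing sharpness can be illustrated with a plain softmax over expert logits (a sketch of the mathematics, not the CUDA kernel):

```python
# Temperature-scaled softmax over expert logits: temperature < 1 sharpens
# the distribution, temperature > 1 pushes it toward uniform.
import math

def routing_probs(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                             # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
sharp = routing_probs(logits, temperature=0.5)  # concentrated on expert 0
soft = routing_probs(logits, temperature=2.0)   # closer to uniform
print(max(sharp), max(soft))
```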
enable_expert_dropout(dropout_prob: float) -> None
- Enables expert-level dropout during training
- dropout_prob: Probability of dropping each expert (typical: 0.1-0.2)
- Prevents over-reliance on specific experts
- Disabled during evaluation automatically
- CUDA routing kernel handles dropout masking
get_expert_statistics() -> Dict
- Returns comprehensive expert metrics including CUDA performance
- Keys: per_expert_utilization, routing_entropy, load_balance_loss, tokens_dropped, cuda_routing_time, cuda_dispatch_time
- Per-layer and aggregate statistics
- Updated each forward pass
MoD Control:
adjust_mod_capacity(capacity_factor: float) -> None
- Updates fraction of tokens using full computation
- Values: 0.0-1.0 (0% to 100% of tokens)
- Lower values: More compute savings, potential quality impact
- Typical range: 0.3-0.7
get_mod_statistics() -> Dict
- Returns MoD efficiency metrics
- Keys: capacity_factor, tokens_processed, tokens_skipped, average_depth, per_layer_usage, compute_savings, cuda_overhead
- Tracks actual vs configured capacity
- Compute savings as FLOP reduction percentage
Batch Management:
adjust_batch_size(new_batch_size: int) -> None
- Dynamically changes micro batch size
- Recreates dataloader with new batch size
- Adjusts gradient accumulation to maintain effective batch size
- Typically used for OOM recovery (reduces batch size automatically)
- CUDA kernel launch configurations adapted to new batch size
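The effective-batch-size bookkeeping this implies can be sketched as follows; `rebalance` is an illustrative helper, not the framework's internal code:

```python
# When the micro batch shrinks (e.g. OOM recovery), gradient accumulation
# grows to keep the effective batch size constant.
def rebalance(micro_batch: int, grad_accum: int, new_micro_batch: int):
    effective = micro_batch * grad_accum
    new_accum = max(1, round(effective / new_micro_batch))
    return new_micro_batch, new_accum

# OOM recovery: halve the micro batch, double the accumulation.
print(rebalance(16, 4, 8))  # -> (8, 8), effective batch stays 64
```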
Emergency Recovery:
emergency_lr_reduction(reduction_factor: float) -> None
- Reduces learning rate by specified factor
- Triggered by gradient explosion (norm > threshold)
- Typical reduction: 5-10×
- Logs emergency action with reasoning
rollback_steps(num_steps: int) -> None
- Reverts training to previous checkpoint
- Loads checkpoint from num_steps earlier
- Resets optimizer and scheduler state
- Used for divergence recovery
- CUDA kernel state restored from checkpoint
Optimizer Control:
adjust_weight_decay(weight_decay: float) -> None
- Updates L2 regularization strength
- Typical values: 0.01-0.1
- Higher values: Stronger regularization, may slow learning
- Lower values: Less regularization, may overfit
Metrics Query:
get_current_metrics() -> Dict
- Real-time training state snapshot including CUDA stats
- Keys: loss, learning_rate, grad_norm, throughput, memory_usage, epoch, step, cuda_speedup, kernel_timing
- Updated every forward/backward pass
- Used by orchestrator for decision making
CUDA Performance:
get_cuda_performance_stats() -> Dict
- Detailed CUDA kernel performance metrics
- Keys: per_kernel_timing, speedup_factors, memory_efficiency, occupancy, fallback_count
- Updated every N steps (configurable)
- Used for performance optimization and debugging
optimize_cuda_performance() -> None
- Analyzes CUDA kernel usage and tunes parameters
- Adjusts launch configurations based on occupancy
- Identifies and mitigates bottlenecks
- Logs optimization decisions with expected impact
Lower layer capacity factors reduce FLOPs without significantly affecting performance.
Model architecture:
- `hidden_size`: Embedding and hidden dimension (128-20480, prefer multiples of 256 for CUDA efficiency)
- `num_layers`: Transformer layer count (2-120)
- `num_heads`: Attention head count (2-160)
- `num_kv_heads`: KV cache heads for GQA (2-40, typically num_heads/4)
- `intermediate_size`: FFN intermediate dimension (typically 8/3 × hidden_size, rounded to a multiple of 256 for CUDA)
- `max_position_embeddings`: Maximum sequence length (128-32768)
- `vocab_size`: Tokenizer vocabulary size (typically 32000-100000)
MoE parameters:
- `use_moe`: Enable MoE (boolean)
- `num_experts`: Expert count per layer (4-64, typically 8)
- `moe_top_k`: Experts activated per token (1-4, typically 2)
- `capacity_factor`: Token capacity multiplier (1.0-2.0, typically 1.25)
- `load_balancing_weight`: Auxiliary loss coefficient (0.001-0.1, typically 0.01)
- `routing_temperature`: Softmax temperature (0.1-2.0, typically 1.0)
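The auxiliary loss that `load_balancing_weight` scales follows the Switch Transformer formulation referenced earlier; a minimal sketch of that formula:

```python
# Switch-Transformer-style load balancing loss:
# num_experts * sum(fraction_of_tokens_i * mean_router_prob_i),
# minimized (value 1.0) when routing is perfectly uniform.
def load_balance_loss(token_fractions, mean_probs, weight=0.01):
    n = len(token_fractions)
    aux = n * sum(f * p for f, p in zip(token_fractions, mean_probs))
    return weight * aux

# 8 experts: balanced routing gives the minimum 1.0; routing everything
# to two experts quadruples the penalty.
balanced = load_balance_loss([1/8] * 8, [1/8] * 8, weight=1.0)
skewed = load_balance_loss([0.5, 0.5] + [0.0] * 6,
                           [0.5, 0.5] + [0.0] * 6, weight=1.0)
print(balanced, skewed)  # 1.0 4.0
```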
MoD parameters:
- `use_mod`: Enable MoD (boolean)
- `mod_capacity_factor`: Fraction using full computation (0.1-1.0, typically 0.5)
- `mod_routing_type`: Routing mechanism ('learned', 'static', 'random')
- `mod_start_layer`: First layer with MoD (0-num_layers)
- `mod_end_layer`: Last layer with MoD (None = all layers)
Training parameters:
- `num_epochs`: Training duration in epochs (1-100)
- `batch_size`: Micro batch size per GPU (1-128)
- `gradient_accumulation_steps`: Accumulation before optimizer step (1-128)
- `learning_rate`: Optimizer learning rate (1e-5 to 1e-3)
- `weight_decay`: L2 regularization (0.0-0.1, typically 0.01)
- `gradient_clip_val`: Gradient norm clipping (0.5-5.0, typically 1.0)
- `warmup_steps`: Learning rate warmup duration (steps or fraction)
Precision parameters:
- `precision`: Training precision ('auto', 'fp32', 'fp16', 'bf16', 'mixed_fp16', 'mixed_bf16', 'fp8_e4m3')
- `inference_precision`: Evaluation precision (same options as training)
Optimization parameters:
- `use_flash_attention`: Enable Flash Attention (boolean, auto-detected)
- `use_fused_rmsnorm`: Enable fused RMSNorm kernel (boolean, default `false`)
- `use_fused_rope`: Enable fused RoPE kernel (boolean, default `true`)
- `use_fused_swiglu`: Enable fused SwiGLU kernel (boolean, default `true`)
- `use_fused_moe`: Enable CUDA MoE routing/dispatch kernels (boolean, default `true`)
- `use_fused_loss`: Enable fused loss kernel (boolean, default `true`)
- `use_fused_grad_clip`: Enable fused gradient clipping kernel (boolean, default `true`)
- `validate_moe_cuda_indices`: Enable strict CUDA MoE index validation (boolean, default `false`)
- `force_dense_expert_grads`: Force dense expert gradient path for all experts (boolean, default `false`)
- `routing_stats_update_interval`: Steps between routing stats sync/updates (int, default `64`)
- `mod_routing_stats_update_interval`: Steps between MoD stats updates (int, default `64`)
- `metric_history_size`: Bounded in-memory training metric window (int, default `2048`)
- `routing_history_size`: Bounded in-memory routing metric window (int, default `512`)
- `gradient_checkpointing`: Activation checkpointing (boolean)
- `compile`: PyTorch 2.0 compilation (boolean)
- `use_deepspeed`: Enable DeepSpeed (boolean)
- `zero_stage`: ZeRO optimization level (0-3)
- `cpu_offload`: Offload optimizer to CPU (boolean)
Data parameters:
- `training_mode`: Data handling ('base_only', 'finetuning_only', 'hybrid_sequential', 'hybrid_interleaved')
- `base_paths`: List of base training files
- `finetuning_paths`: List of fine-tuning files
- `base_eval_paths`: Base validation files
- `finetuning_eval_paths`: Fine-tuning validation files
- `base_ratio`: Mixing ratio for interleaved mode (0.0-1.0)
- `mask_user_tokens`: Mask user messages in loss (boolean)
- `pin_memory`: Pinned host memory for faster CPU→GPU transfer (boolean, default `true` on CUDA)
- `prefetch_factor`: DataLoader prefetch depth when `num_workers > 0` (int, default `4`)
Orchestrator parameters:
- `use_adaptive_training`: Enable orchestrator (boolean)
- `intervention_threshold`: Confidence required for intervention (0.0-1.0, typically 0.75)
- `check_interval`: Steps between health checks (10-1000, typically 100)
- `enable_emergency_recovery`: Allow emergency interventions (boolean)
- `enable_architecture_adaptation`: Allow architecture changes (boolean)
Chinchilla parameters:
- `auto_epoch_scaling`: Enable automatic epoch calculation (boolean)
- `chinchilla_multiplier`: Token multiplier (5-50, typically 20)
- `min_auto_epochs`: Minimum epochs (1-10)
- `max_auto_epochs`: Maximum epochs (10-100)
- `enable_loss_landscape`: Track loss patterns (boolean)
- `enable_compute_efficiency`: Track efficiency metrics (boolean)
- `enable_early_stopping`: Allow early termination (boolean)
Checkpoint parameters:
- `save_every_n_batches`: Checkpoint interval in steps (100-10000)
- `save_total_limit`: Maximum checkpoints to keep (1-100)
- `early_stopping_patience`: Epochs without improvement before stopping (3-20)
CUDA parameters:
- `cuda_kernel_path`: Path to compiled CUDA kernels (default: auto-detect)
- `enable_cuda_profiling`: Enable detailed kernel profiling (boolean)
- `cuda_profile_interval`: Steps between profiling snapshots (1000-10000)
Automatic handling: Orchestrator detects OOM exceptions, reduces batch size by 50%, recreates dataloader, resumes training from last checkpoint. CUDA kernel buffers automatically adjusted.
Manual interventions:
- Reduce `batch_size`: Start with 1-2 for very large models
- Increase `gradient_accumulation_steps`: Maintains effective batch size with less memory
- Enable `gradient_checkpointing`: Trades compute for memory (recompute activations)
- Increase `zero_stage`: 1→2→3 for progressively more memory optimization
- Enable `cpu_offload`: Moves optimizer states to CPU (slower but massive memory savings)
- Reduce `max_position_embeddings`: Shorter sequences use less memory
- Lower model size: Try a smaller preset configuration
- Check CUDA kernel memory: Temporarily disable fused CUDA ops (`use_fused_* = false`) or set `use_cuda=false`
Memory estimation:
- Model memory (FP16) ≈ 2 bytes × total_parameters
- Optimizer memory (Adam) ≈ 8 bytes × parameters
- Gradient memory ≈ 2 bytes × parameters
- Activation memory ≈ 2 × batch_size × sequence_length × num_layers × hidden_size
- CUDA kernel buffers ≈ 100-500 MB (temporary buffers)
- Total ≈ 12-16 bytes per parameter + activation memory + kernel overhead
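A worked example of the static terms of this estimate for a 7B-parameter model (activations and kernel buffers excluded, since they depend on batch shape):

```python
# Static memory for mixed-precision training with Adam:
# 2 bytes/param weights + 8 bytes/param optimizer + 2 bytes/param gradients.
def static_memory_gb(params: int) -> dict:
    model = 2 * params       # FP16 weights
    optimizer = 8 * params   # Adam moments in FP32 (m and v)
    grads = 2 * params       # FP16 gradients
    return {k: v / 1e9 for k, v in
            {"model": model, "optimizer": optimizer, "grads": grads,
             "total": model + optimizer + grads}.items()}

print(static_memory_gb(7_000_000_000))  # total ~84 GB before activations
```

The 84 GB total matches the low end of the 12-16 bytes-per-parameter rule above, which is why a 7B model needs ZeRO sharding or CPU offload even on an 80 GB GPU.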
Gradient explosion: Symptoms: Loss becomes NaN, gradient norm > 100, rapid loss increase
Automatic recovery: Orchestrator detects high gradient norm, triggers emergency LR reduction (10×), rolls back to previous checkpoint, resumes with lower LR. CUDA fused gradient clipping prevents most explosions.
Manual fixes:
- Lower `learning_rate`: Try a 10× reduction
- Tighten `gradient_clip_val`: Clip at a lower threshold (0.5 instead of 1.0)
- Use mixed precision: BF16 is more stable than FP16
- Enable gradient checkpointing: Can improve numerical stability
- Check data: Outliers or corrupted samples can cause explosions
- Verify CUDA kernels: Check for numerical issues by setting `use_fused_* = false` temporarily
Loss divergence: Symptoms: Loss increases consistently, validation loss >> training loss, sudden loss spikes
Automatic recovery: Orchestrator detects divergence pattern, rolls back N steps, adjusts learning rate, may modify architecture parameters.
Manual fixes:
- Reduce `learning_rate`: Start 3-5× lower
- Increase `weight_decay`: Stronger regularization (0.1 instead of 0.01)
- Check data quality: Remove outliers, validate preprocessing
- Reduce model capacity: Overparameterized models may not converge on small datasets
- Check CUDA kernel correctness: Verify against the PyTorch fallback
Expert collapse (MoE): Symptoms: All tokens route to 1-2 experts, routing entropy < 1.0, most experts have near-zero utilization
Automatic recovery: Orchestrator detects imbalance, increases load_balancing_weight, adjusts routing_temperature, may prune/add experts. CUDA routing kernel continues to function correctly during recovery.
Manual fixes:
- Increase `load_balancing_weight`: Try 0.02 or 0.05 (from 0.01)
- Increase `capacity_factor`: Allow more tokens per expert (1.5 or 2.0)
- Adjust `routing_temperature`: Higher values (1.5-2.0) encourage uniform routing
- Enable `expert_dropout`: Forces routing to all experts
- Check initialization: Poorly initialized experts may never activate
- Monitor CUDA routing performance: Ensure kernels are functioning correctly
Automatic optimization: Orchestrator monitors throughput, detects degradation, suggests optimizations (enable compilation, adjust batch size, check data loading bottlenecks, verify CUDA kernel usage).
Manual optimizations:
- Enable `compile`: PyTorch 2.0 compilation (5-30% speedup)
- Enable fused CUDA ops: `use_fused_rope`, `use_fused_swiglu`, `use_fused_moe`, `use_fused_loss`, `use_fused_grad_clip`
- Keep `use_fused_rmsnorm=false` unless profiling shows a gain on your workload/GPU
- Enable `use_flash_attention`: 2-4× attention speedup on Ampere+
- Use `mixed_bf16` or `mixed_fp16`: 2× speedup over FP32
- Increase `num_workers`: Parallelize data loading (typically 4-8)
- Ensure `pin_memory=true` and tune `prefetch_factor` (typically 2-8)
- Increase `batch_size`: Better GPU utilization (if memory allows)
- Disable `gradient_checkpointing`: Faster, at the cost of more memory
- Check I/O: Move the dataset to a fast SSD, use memory-mapped files
- Profile CUDA: Use `enable_cuda_profiling=True` to identify bottlenecks
Bottleneck identification:
- GPU utilization < 80%: CPU or data loading bottleneck
- Low throughput with high GPU utilization: Model or algorithm bottleneck
- Throughput decreases over time: Memory fragmentation or thermal throttling
- Inconsistent throughput: Data loading variance or OS interference
- CUDA kernel fallbacks: Check compilation, verify GPU compatibility
- High CUDA overhead: Check kernel launch efficiency with profiling
Common issues:
- Training instability: MPS precision handling differs from CUDA
- Compilation failures: PyTorch MPS backend less mature
- Memory pressure: Unified memory competition with OS
- No CUDA acceleration: Custom kernels unavailable
Solutions:
- Set `compile=False`: Disable compilation if unstable
- Set `num_workers=0`: MPS prefers single-threaded data loading
- Set all CUDA fused ops to false: `use_fused_rmsnorm`, `use_fused_rope`, `use_fused_swiglu`, `use_fused_moe`, `use_fused_loss`, `use_fused_grad_clip`
- Reduce `batch_size`: Start conservative (2-4)
- Monitor Activity Monitor: Check memory pressure and GPU usage
- Update PyTorch: The MPS backend is improving rapidly; use the latest version
- Fall back to CPU: If MPS is unreliable, CPU training is an alternative (slower but stable)
- Expect slower training: 4-5× slower than a CUDA-enabled GPU
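As with the CUDA case, those conservative MPS settings can be gathered into one place. The flag names mirror the options above; the dict form is a sketch, not the framework's actual config class:

```python
# Conservative Apple Silicon (MPS) settings sketch. Flag names mirror the
# documented options; the dict container is illustrative only.
mps_safe_config = {
    "compile": False,             # disable compilation if unstable
    "num_workers": 0,             # MPS prefers single-threaded data loading
    "batch_size": 2,              # start conservative (2-4)
    "use_fused_rmsnorm": False,   # all CUDA fused ops off: kernels unavailable
    "use_fused_rope": False,
    "use_fused_swiglu": False,
    "use_fused_moe": False,
    "use_fused_loss": False,
    "use_fused_grad_clip": False,
}
```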
Corruption: Symptoms: Checkpoint fails to load, missing keys, size mismatch
Recovery: System automatically tries previous checkpoints (latest → latest-1 → best validation). If all corrupt, restart from initialization.
Prevention: Set `save_total_limit` > 3, save to reliable storage, and validate checksums. CUDA kernel state is saved separately for recovery.
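The automatic fallback described above (latest → latest-1 → best validation) amounts to trying checkpoints in priority order until one deserializes cleanly. A minimal sketch, where `load_fn` stands in for however checkpoints are loaded (e.g. `torch.load(path, map_location="cpu")`):

```python
def load_with_fallback(ckpt_paths, load_fn):
    """Try checkpoints in priority order: latest -> latest-1 -> best validation.

    load_fn is a stand-in for the real deserializer; any exception
    (missing keys, size mismatch, truncated file) moves on to the next path.
    """
    errors = {}
    for path in ckpt_paths:
        try:
            return path, load_fn(path)
        except Exception as exc:  # corrupt checkpoint: record and continue
            errors[path] = exc
    raise RuntimeError(f"All checkpoints failed to load: {errors}")
```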
Resume failures: Symptoms: Training resumes but loss resets, optimizer state lost, different results than before
Causes: Incomplete checkpoint save, random seed mismatch, configuration mismatch, CUDA kernel version change
Solutions: Verify checkpoint integrity before resume, ensure configuration matches checkpoint, check random seed restoration, recompile CUDA kernels if updated.
Format errors: Symptoms: Training fails during data loading, "invalid JSON" or "unexpected format" errors
Solutions: Validate the data format with the provided validation scripts; check for missing fields, incorrect JSON structure, encoding issues (use UTF-8), and empty files or lines.
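If your dataset is JSONL, a minimal validator covering those failure modes might look like the following. The `required_fields` default is an assumption about your schema; adjust it to match your data:

```python
import json

def validate_jsonl(path, required_fields=("text",)):
    """Report (line_number, problem) pairs for a JSONL dataset.

    Catches the common failure modes: empty lines, invalid JSON,
    and records missing required fields. Opens as UTF-8 explicitly.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                problems.append((lineno, "empty line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            for field in required_fields:
                if field not in record:
                    problems.append((lineno, f"missing field: {field}"))
    return problems
```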
Quality problems: Symptoms: Training succeeds but poor results, high validation loss, model outputs nonsense
Causes: Data contamination, label errors, poor quality samples, distribution mismatch
Solutions: Enable automatic_data_cleaning, increase quality_threshold, manually inspect samples, check train/validation split, verify preprocessing correctness.
Compilation failures: Symptoms: Kernels fail to compile, nvcc errors, missing libraries
Solutions:
- Verify CUDA toolkit installation: `nvcc --version`
- Check compute capability: Ensure the GPU is supported (sm_75+)
- Install build tools: `sudo apt-get install build-essential`
- Check PyTorch CUDA version: Must match the CUDA toolkit
- Try manual compilation: `cd Src/Main_Scripts/core && ./compile_transformer_ops.sh && ./compile_cuda_moe.sh && cd ../training && ./compile_kernels.sh`
- Check compiler flags: Verify architecture flags match your GPU
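Several of those checks can be gathered programmatically from PyTorch itself. A sketch (the function name is illustrative; `torch.version.cuda` reports the CUDA version PyTorch was built against, which must match the toolkit `nvcc` reports):

```python
import torch

def cuda_kernel_env_report():
    """Collect the facts the compilation checklist above asks about."""
    report = {
        "cuda_available": torch.cuda.is_available(),
        # CUDA version PyTorch was built against (None on CPU-only builds);
        # must match the installed toolkit reported by `nvcc --version`.
        "torch_cuda_build": torch.version.cuda,
    }
    if report["cuda_available"]:
        major, minor = torch.cuda.get_device_capability()
        report["compute_capability"] = f"sm_{major}{minor}"
        report["supported"] = (major, minor) >= (7, 5)  # kernels need sm_75+
        report["device_name"] = torch.cuda.get_device_name()
    return report
```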
Runtime errors: Symptoms: CUDA kernel crashes, incorrect results, NaN outputs
Solutions:
- Verify numerical correctness: Compare with the PyTorch fallback
- Check input shapes: Ensure dimensions match kernel expectations
- Enable debugging: Set `CUDA_LAUNCH_BLOCKING=1`
- Check memory: Verify sufficient GPU memory for kernel buffers
- Update kernels: Recompile after a PyTorch/CUDA update
- Disable temporarily: Set `use_fused_*=false` (or `use_cuda=false`) to isolate the issue
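"Compare with the PyTorch fallback" can be done mechanically. A sketch for RMSNorm, where `fused_fn` stands in for the framework's CUDA op and the tolerances are placeholders you should tune per precision:

```python
import torch

def reference_rmsnorm(x, weight, eps=1e-6):
    # Pure-PyTorch reference used to validate a fused kernel's output.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

def check_kernel_against_fallback(fused_fn, x, weight, atol=1e-3, rtol=1e-3):
    """Assert a fused kernel matches the PyTorch fallback within tolerance.

    fused_fn is a stand-in for the framework's CUDA op; also checks for
    NaN/Inf, the symptom called out above.
    """
    expected = reference_rmsnorm(x, weight)
    actual = fused_fn(x, weight)
    assert torch.isfinite(actual).all(), "NaN/Inf in kernel output"
    assert torch.allclose(actual, expected, atol=atol, rtol=rtol), (
        f"max abs diff {(actual - expected).abs().max().item():.3e}"
    )
```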
Performance issues: Symptoms: CUDA kernels slower than expected, low speedup, high overhead
Solutions:
- Enable profiling: `enable_cuda_profiling=True`
- Check occupancy: Use `get_cuda_performance_stats()`
- Verify batch size: Small batches have higher launch overhead
- Check launch config: May need tuning for a specific GPU
- Monitor fallbacks: Excessive PyTorch fallbacks indicate issues
- Compare architectures: Older GPUs see lower speedups
Considerations for deploying trained models in production environments.
Checkpoint format: Standard PyTorch state dict compatible with transformers library. Can export to HuggingFace format, ONNX (for inference optimization), TorchScript (for deployment), or TensorRT (for NVIDIA inference).
Conversion process:
- Load final checkpoint
- Extract model state dict
- Convert to target format
- Validate outputs match original
- Benchmark inference performance
- Optimize CUDA kernels for inference (if applicable)
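For a TorchScript target, the steps above map onto a short helper. A sketch (the function name is illustrative; `torch.jit.trace` is one of several conversion paths, and the tolerance is a placeholder):

```python
import torch

def export_torchscript(model, example_input, out_path):
    """Sketch of the conversion steps: load/extract, convert, validate, save."""
    model.eval()
    with torch.no_grad():
        reference = model(example_input)                   # baseline outputs
        scripted = torch.jit.trace(model, example_input)   # convert to target format
        converted = scripted(example_input)
        # Validate outputs match the original before deploying.
        assert torch.allclose(reference, converted, atol=1e-5)
    scripted.save(out_path)                                # serialized artifact
    return scripted
```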
Size optimization:
- Weight pruning: Remove low-magnitude weights
- Quantization: INT8 or INT4 for deployment
- Knowledge distillation: Train smaller dense model from sparse model
- Expert merging: Combine similar experts in MoE models
- Kernel fusion: Combine operations for inference
Quantization strategies:
- Post-training quantization: INT8 via bitsandbytes or GPTQ (4-bit)
- Quantization-aware training: Train with quantization simulation
- Dynamic quantization: Quantize activations at runtime
- Mixed precision inference: FP16 or INT8 depending on layer
- CUDA kernel quantization: Custom kernels for quantized operations
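As one concrete example of the dynamic strategy above, PyTorch's built-in post-training dynamic quantization converts `Linear` weights to INT8 and quantizes activations at runtime (CPU inference; the wrapper function name is illustrative):

```python
import torch

def quantize_for_inference(model):
    """Post-training dynamic quantization sketch: INT8 weights for Linear
    layers, with activations quantized on the fly at runtime."""
    return torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```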
Batching:
- Dynamic batching: Group requests by sequence length
- Continuous batching: Add requests to in-flight batches
- Request queuing: Balance latency and throughput
- CUDA stream management: Overlap computation and data transfer
KV cache management:
- Cache quantization: INT8 KV cache (2× memory reduction)
- Cache eviction: Drop old tokens for long conversations
- Paged attention: Efficient memory allocation (vLLM)
- CUDA-optimized caching: Custom kernels for cache operations
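The INT8 cache-quantization idea can be sketched with simple per-tensor symmetric quantization: an INT8 cache uses one byte per element instead of two for FP16, giving the 2× reduction above at the cost of a small dequantization error (function names are illustrative; production KV caches typically quantize per-head or per-channel):

```python
import torch

def quantize_kv(cache, bits=8):
    """Per-tensor symmetric INT8 quantization of a KV-cache tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = cache.abs().amax().clamp(min=1e-8) / qmax      # one scale per tensor
    q = torch.clamp(torch.round(cache / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q, scale):
    """Recover an approximate FP32 cache from INT8 values and the scale."""
    return q.float() * scale
```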
Serving frameworks:
- vLLM: High-throughput inference with paged attention
- TensorRT-LLM: NVIDIA-optimized inference with custom kernels
- Text Generation Inference: HuggingFace serving
- Custom deployment: Direct PyTorch or ONNX Runtime with CUDA kernels
Inference metrics:
- Latency: Time to first token, total generation time, CUDA kernel overhead
- Throughput: Requests per second, tokens per second
- Quality: Output validation, coherence scoring
- Resource usage: GPU memory, CPU utilization, kernel efficiency
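Time to first token and tokens per second can be measured against any streaming generation API. A sketch, where `generate_stream` stands in for whatever iterator your serving stack yields tokens from:

```python
import time

def generation_latency(generate_stream):
    """Measure time-to-first-token (TTFT), total generation time, and
    throughput for a token-yielding iterator."""
    start = time.perf_counter()
    first_token_latency = None
    n_tokens = 0
    for _ in generate_stream:
        if first_token_latency is None:
            first_token_latency = time.perf_counter() - start  # TTFT
        n_tokens += 1
    total = time.perf_counter() - start
    tokens_per_s = n_tokens / total if total > 0 else 0.0
    return {
        "ttft_s": first_token_latency,
        "total_s": total,
        "tokens_per_s": tokens_per_s,
    }
```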
Model drift detection:
- Input distribution monitoring
- Output quality tracking over time
- Comparison to validation benchmarks
- Automatic retraining triggers
This system is available under a commercial license. Framework provided for evaluation via demo notebook. Production use requires license agreement.
License tiers:
Research/Academic:
- Non-commercial research and educational use
- Academic institutions, non-profit research
- Publications require citation
- No redistribution of modified code
- Includes CUDA kernels for research purposes
- Pricing: Contact for academic pricing
Startup:
- Companies with <10 employees, <$1M revenue
- Internal use only (no redistribution)
- Includes updates and bug fixes
- Email support
- CUDA kernel source code included
- Pricing: Contact for startup pricing
Enterprise:
- Larger organizations
- Internal use and customer deployments
- Includes updates, bug fixes, security patches
- Priority support, SLA options
- Custom modifications available
- CUDA kernel customization support
- Pricing: Contact for enterprise pricing
Contact: matiasnhmb@gmail.com
Evaluation license: Demo notebook and local installation for evaluation purposes. 30-day evaluation period. No production use. Watermarked outputs during evaluation. CUDA kernels included for testing.
Adaptive Training Manual: docs/adaptive_training.md
- Orchestrator internals
- Intervention method details
- Configuration recommendations
- Case studies and examples
- Performance optimization strategies
MoE/MoD Tutorial: docs/sparse_architectures.md
- Sparse architecture theory
- Implementation details
- Training best practices
- Debugging sparse models
- CUDA-accelerated MoE operations
Adapter Tutorial: docs/adapters.md
- Deep dive into architecture + implementation details
- Step-by-step guides for building, training, and plugging in adapters
- Practical examples + gotchas (routing, load balancing, perf tuning)
- Best practices for scaling and debugging
CUDA Acceleration Guide: docs/cuda_acceleration.md
- Fused transformer primitives: replacements for the high-frequency operations that dominate runtime in dense and MoE models
- Full MoE routing pipeline on GPU: routing → dispatch → expert compute → recombination
- Fused training operations: collapsing multi-step operations into single GPU passes to eliminate synchronization points
- Mixed precision, streams, and overlap strategies
@software{AdaptiveTrainingSystem2025,
title = {Adaptive Training System: Modular Transformer Training with MoE/MoD},
author = {MatN23},
year = {2025},
url = {https://github.com/matn23/AdaptiveTrainingSystem},
note = {Production-grade training framework with adaptive optimization and CUDA acceleration}
}

Issues: GitHub issue tracker for bug reports and feature requests
Discussions: GitHub discussions for questions and community support
Email: matiasnhmb@gmail.com for licensing and technical inquiries

