Production Transformer Training Framework with MoE/MoD Architecture & CUDA Acceleration
Demo • Architecture • CUDA Acceleration • Configuration • API • Licensing
- Overview
- Technical Architecture
- CUDA Acceleration
- Model Configurations
- Precision Support
- Demo
- Training Modes
- Adaptive Training System
- Hardware Optimization
- Data Processing
- Monitoring
- Performance Benchmarks
- API Reference
- Troubleshooting
- Production Deployment
- Licensing
Adaptive Training System is a production-grade transformer training framework implementing Mixture of Experts (MoE) and Mixture of Depths (MoD) architectures with autonomous training optimization and custom CUDA acceleration kernels. Supports models from 500M to 300B+ parameters with enterprise infrastructure.
Core capabilities:
- Sparse architectures: MoE (8-64 experts), MoD (dynamic depth), hybrid configurations
- CUDA acceleration: Custom kernels for RMSNorm (3-4x faster), RoPE (2-4x faster), SwiGLU (2-3x faster), MoE routing (2-4x faster), fused loss computation
- Metal acceleration: Custom Metal shaders for Apple Silicon - RMSNorm (2-3x faster), RoPE (3-5x faster), SwiGLU (2-3x faster), MoE routing
- Adaptive orchestrator: 18 autonomous intervention methods for training optimization
- Chinchilla scaling: Automatic epoch calculation based on compute-optimal principles
- Multi-GPU training: DeepSpeed ZeRO (stages 1-3), FSDP, ColossalAI with efficient gradient synchronization
- Precision support: FP32, FP16, BF16, mixed precision, FP8 (H100+ via Triton)
- Advanced Quantization: 4-bit/8-bit support via AutoGPTQ and Optimum Quanto
- Hardware targets: CUDA (Volta-Hopper), Apple Silicon (M1-M4) with Metal acceleration, CPU
- Data handling: Memory-mapped datasets, Apache Arrow zero-copy, automatic caching
- Router Optimization: Fine-tuning mode and adapter loading for MoE routers
- Recovery systems: Automatic OOM handling, gradient explosion recovery, checkpoint rollback
Framework positioning:
This is a complete training system with custom CUDA kernels, not a model zoo or API wrapper. Every component from tokenization to fused gradient operations is included. MoE and MoD implementations follow established research (Switch Transformer, Mixture of Experts, Mixture-of-Depths) with operational additions and CUDA-accelerated execution: dynamic expert management, capacity tuning, load balancing, routing analytics.
The adaptive orchestrator monitors 20+ metrics every N steps and triggers interventions across hyperparameters, architecture, and recovery procedures. Maintains decision history with confidence scoring to prevent excessive intervention.
Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical operations while maintaining gradient compatibility and numerical stability. Metal shaders provide 2-5x speedup on Apple Silicon (M1-M4). All kernels include automatic fallback to PyTorch when accelerated backends are unavailable.
Intended for:
- ML engineers requiring full training stack control with maximum performance
- Research teams prototyping sparse architectures with production-grade infrastructure
- Organizations with proprietary data, compliance requirements, and performance constraints
- Teams needing framework-independent infrastructure with custom optimization capabilities
Not included:
- Pre-trained model weights (configuration presets only, train from scratch)
- Model checkpoints or existing trained models
- High-level abstractions (direct control provided)
- Tutorial content (assumes ML engineering background)
Standard architecture with LLaMA/GPT-NeoX design patterns:
- Pre-normalization: RMSNorm before attention/FFN (CUDA-accelerated, 3-4x faster)
- Grouped Query Attention: Reduces KV cache via shared KV heads (typical ratio 4:1 or 8:1)
- Rotary Position Embeddings: Length generalization with configurable theta (10000 base, 1000000 extended), CUDA-accelerated RoPE application (2-4x faster)
- SwiGLU activation: Two-path gating in FFN, intermediate_size typically 8/3 × hidden_size, CUDA-accelerated (2-3x faster)
- Optional Flash Attention 2.x: 2-4x speedup on Ampere+ GPUs with memory-efficient attention
Parameter calculation:
- Embedding: vocab_size × hidden_size
- Attention per layer: 2 × hidden_size² × (1 + num_kv_heads/num_heads) (Q and O projections plus down-sized K/V projections)
- FFN per layer (SwiGLU): 3 × hidden_size × intermediate_size (gate, up, and down projections)
- Output: vocab_size × hidden_size (optionally tied)
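As a sanity check, the per-component formulas can be folded into a small estimator. This is an illustrative sketch, not the framework's API: the function name and arguments are assumptions, and it assumes separate Q/K/V/O projections with a three-matrix SwiGLU FFN, omitting biases and norm weights.

```python
def estimate_params(vocab_size, hidden_size, intermediate_size,
                    num_layers, num_heads, num_kv_heads, tie_embeddings=False):
    """Rough dense-model parameter count (biases and norms omitted)."""
    embedding = vocab_size * hidden_size
    # Q and O projections are hidden_size x hidden_size each;
    # K and V shrink by the GQA ratio num_kv_heads/num_heads
    attention = 2 * hidden_size**2 * (1 + num_kv_heads / num_heads)
    # SwiGLU FFN: gate, up, and down projections
    ffn = 3 * hidden_size * intermediate_size
    output = 0 if tie_embeddings else vocab_size * hidden_size
    return int(embedding + num_layers * (attention + ffn) + output)

# Illustrative numbers for a small GQA backbone (4:1 head ratio)
print(estimate_params(32000, 1024, 2816, 24, 16, 4))  # 336068608
```

Swapping `tie_embeddings=True` drops the separate output projection, which matters most at small hidden sizes where embeddings dominate the count.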
Token-level sparse activation via learned routing to specialized FFN networks with CUDA-accelerated dispatch.
Routing mechanism:
- Top-k gating: Each token routed to k of N experts (typical: k=2, N=8)
- Router: Linear layer (hidden_size × num_experts) + softmax + TopK selection
- CUDA optimization: Warp-based top-k kernel with shuffle reduction (2-4x faster than PyTorch)
- Output: Weighted combination of selected expert outputs
Load balancing:
- Auxiliary loss: Penalizes routing imbalance via expert utilization distribution
- Capacity factor: Maximum tokens per expert = (total_tokens/num_experts) × capacity_factor
- Typical capacity_factor: 1.25-1.5 (25-50% overflow buffer)
- Load balancing weight: 0.01 (added to main loss)
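The exact auxiliary loss is not spelled out above; a common Switch-Transformer-style formulation penalizes the dot product of per-expert dispatch fractions and mean router probabilities. The sketch below uses illustrative names and top-1 assignments for simplicity (the framework routes top-k and operates on tensors):

```python
def load_balance_loss(token_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i, where
    f_i = fraction of tokens dispatched to expert i and
    P_i = mean router probability assigned to expert i."""
    n = len(assignments)
    f = [0.0] * num_experts  # dispatch fractions
    p = [0.0] * num_experts  # mean router probabilities
    for probs, expert in zip(token_probs, assignments):
        f[expert] += 1.0 / n
        for i in range(num_experts):
            p[i] += probs[i] / n
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing over 4 experts gives the minimum value, 1.0
uniform = [[0.25] * 4 for _ in range(8)]
print(load_balance_loss(uniform, [0, 1, 2, 3, 0, 1, 2, 3], 4))  # 1.0
```

The result is scaled by the load balancing weight (0.01 above) before being added to the main loss, so the gradient nudges the router toward uniform utilization without overwhelming the language-modeling objective.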
Dynamic management:
- Expert addition: Triggered when utilization exceeds threshold (typically 0.85)
- Expert pruning: Removes experts below utilization threshold (typically 0.15)
- Capacity adaptation: Adjusts based on token drop rate
- Temperature tuning: Controls routing concentration (lower = sharper, higher = more uniform)
CUDA acceleration:
- Token dispatch: Vectorized memory access (float4) with coalesced writes, reduced atomic contention
- Expert combination: Optimized accumulation with local buffering before atomic adds
- Memory efficiency: Batched operations reduce kernel launch overhead by 3-5x
Efficiency:
- 8-expert top-2 MoE: 8× total parameters, 1.25× active parameters per token
- Sparsity: 75% of expert parameters inactive (2 of 8 experts, 25%, active per token)
- Memory: Scales with total parameters (all experts in memory)
- Compute: Scales with active parameters only
- CUDA speedup: 2-4x over PyTorch for routing and dispatch operations
Statistics tracked:
- Per-expert utilization: Fraction of tokens routed to each expert
- Routing entropy: Distribution concentration (max = log(num_experts))
- Load balance loss: Auxiliary loss magnitude
- Tokens dropped: Count exceeding capacity
- Per-layer patterns: Early layers more uniform, later layers more concentrated
Layer-level sparse activation via learned skip decisions.
Core concept: Model learns which tokens require full layer computation vs. residual skip. Routing decision per token per layer based on token representation at layer input.
Routing types:
- Learned: Small MLP scores tokens, top-capacity_factor selected for full processing
- Static: Fixed pattern (e.g., all tokens full in early layers, reduced in later layers)
- Random: Random selection maintaining capacity_factor (ablation baseline)
Capacity management:
- capacity_factor controls fraction of tokens using full computation
- 0.5 = 50% tokens full processing, 50% skip
- Selection is learned during training, not random
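The capacity rule reduces to a top-score selection. The helper below is an illustrative sketch; in the framework the scores come from a learned MLP and the selection runs on tensors:

```python
def mod_select(scores, capacity_factor):
    """Pick the top capacity_factor fraction of tokens (by router score)
    for full layer computation; the rest take the residual skip path."""
    k = max(1, int(len(scores) * capacity_factor))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])
    return [i in selected for i in range(len(scores))]

# 8 tokens, capacity_factor=0.5: the 4 highest-scoring tokens get full compute
mask = mod_select([0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4], 0.5)
print(mask)  # [True, False, True, False, True, False, True, False]
```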
Efficiency:
- FLOPs reduction: ~(1 - capacity_factor) for layers with MoD
- Typical: 30-50% FLOPs reduction with 0-2% perplexity increase
- Quality/efficiency tradeoff: Lower capacity = more savings but larger quality impact
Application strategies:
- All layers: Maximum compute reduction
- Later layers only: Preserve early feature extraction, reduce later specialization
- Alternating: MoD every N layers
- FFN-only: Dense attention, sparse FFN (common since FFN is 2/3 of compute)
Training dynamics:
- Early training: Near-uniform routing (all tokens treated similarly)
- Specialization: Model learns token complexity patterns over time
- Curriculum: Can start with capacity_factor=1.0, gradually reduce to target
- Annealing: Gradual capacity reduction prevents training instability
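A curriculum of this kind can be expressed as a simple schedule. This is an illustrative helper under the assumption of linear annealing; the framework's actual policy may differ:

```python
def capacity_schedule(step, warmup_steps, target_capacity, start_capacity=1.0):
    """Linearly anneal MoD capacity_factor from start_capacity down to
    target_capacity over warmup_steps, then hold at the target."""
    if step >= warmup_steps:
        return target_capacity
    frac = step / warmup_steps
    return start_capacity + frac * (target_capacity - start_capacity)

print(capacity_schedule(0, 1000, 0.5))     # 1.0  (all tokens full compute)
print(capacity_schedule(500, 1000, 0.5))   # 0.75 (halfway through warmup)
print(capacity_schedule(2000, 1000, 0.5))  # 0.5  (target reached and held)
```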
Combined token-level (MoE) and layer-level (MoD) sparsity with coordinated CUDA acceleration.
Architecture:
- Each layer: Dense attention + MoE FFN + MoD routing
- MoD decides: Use full layer (attention + MoE) or skip via residual
- If token uses layer, routes through MoE experts with CUDA-accelerated dispatch
Sparsity compounding:
- Top-2 of 8 experts: 25% expert parameters active
- 50% layer capacity: 50% tokens use layers
- Combined: 0.5 × 0.25 = 12.5% active parameters per token
- 87.5% total sparsity
Training considerations:
- Both routing mechanisms must learn useful patterns
- Load balancing for experts, capacity warmup for MoD
- Routing temperature adaptation for both systems
- Quality-aware guards prevent catastrophic sparsity collapse
Use cases:
- Maximum efficiency: Largest models on limited compute
- Fast experimentation: Smaller active compute enables rapid iteration
- Inference optimization: Reduced memory and compute for deployment
- Multi-task learning: Different experts and depths specialize per task
Specialized Router Management:
- Router Fine-tuning: Isolate and train only the routing layers (`gate_proj`) while freezing all other parameters. Helps specialize routing logic without catastrophic forgetting in the experts.
- Adapter Loading: Load pre-trained router weights independently. Useful for transferring routing strategies between similar models.
- Dynamic Fine-tuning: Enable router training during specific phases of the main training loop.
Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical training operations. All kernels maintain full gradient compatibility and include automatic fallback.
1. Fused RMSNorm (3-4x faster)
- Vectorized memory access: float4 operations for coalesced reads/writes
- Optimized reductions: Warp shuffle + reduced shared memory usage
- Key optimizations:
- Single-pass algorithm (compute variance and normalize in one kernel)
- Eliminated bank conflicts in shared memory
- Better register allocation (maxrregcount=64)
- Loop unrolling for common hidden sizes
- Performance: ~3.2ms → ~0.9ms per 1000 tokens (hidden_size=768)
2. Rotary Position Embeddings (2-4x faster)
- Optimized precompute: Vectorized cos/sin cache generation
- Efficient application: Reduced redundant memory access
- Key optimizations:
- Process 2 dimension pairs per thread (better ILP)
- Use of __ldg() for read-only cache optimization
- FMA instructions (fmaf) for multiply-add fusion
- Eliminated per-token position computation
- Performance: ~8.5ms → ~1.3ms per batch (batch=4, seq=128, heads=12)
3. SwiGLU Activation (2-3x faster)
- Vectorized computation: float4 for gate and up projections
- Fast SiLU approximation: Optimized sigmoid computation
- Key optimizations:
- Fused gate/up computation in single kernel pass
- Register tiling for matrix multiplication
- Eliminated intermediate memory allocation
- Vectorized output writes
- Performance: ~5.1ms → ~1.8ms per 1000 tokens (intermediate=3072)
4. MoE Operations (2-4x faster)
- Optimized top-k gating: Warp-level parallel reduction
- Token dispatch: Coalesced memory access patterns
- Expert combination: Reduced atomic operations
- Key optimizations:
- Warp-based top-k (no shared memory for k≤4)
- Batched atomic increments per block
- Vectorized token copies (float4)
- Local accumulation before global writes
- Performance: ~12ms → ~3.5ms per batch for routing+dispatch+combine
5. Fused Loss Computation
- Cross-entropy + accuracy: Single-pass computation
- Key optimizations:
- Combined max/sum reduction for softmax
- Parallel argmax using shared memory
- Fused gradient computation
- Eliminated redundant passes
- Performance: ~4.2ms → ~1.1ms per batch (vocab_size=32000)
6. Fused Gradient Clipping
- Fully async pipeline: No CPU-GPU synchronization
- Key optimizations:
- Compute norm on GPU (no D2H transfer)
- Conditional clipping kernel launch
- Pinned memory for async copy
- Single synchronization point at end
- Performance: ~8ms → ~1.5ms (including norm computation)
Custom kernels provide 2–7× speedup over vanilla PyTorch operations.
Compilation requirements:
```bash
# Install CUDA toolkit (11.8+ or 12.x)
# Verify nvcc availability
nvcc --version

# Optional: force target architectures (comma or semicolon separated)
# export CUDA_TARGET_SM=75,80,86,89,90

# Compile transformer + MoE kernels
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh

# Compile training kernels
cd ../training
./compile_kernels.sh
```

Automatic kernel detection: The framework detects and loads compiled kernels at runtime, falling back to PyTorch when kernels are unavailable. No code changes are required to use CUDA acceleration.
Automatic JIT rebuild for current hardware:
- If kernel `.so` files are missing or compiled for the wrong SM target, runtime wrappers trigger a rebuild automatically.
- Target architecture resolution order: `CUDA_TARGET_SM` → `TORCH_CUDA_ARCH_LIST` → detected GPU compute capability → fallback `sm_75`.
- Set `CUDA_TARGET_SM` when you need deterministic builds across machines.
Supported architectures:
- sm_75: Turing (T4, RTX 2080)
- sm_80: Ampere (A100, RTX 3090)
- sm_86: Ampere (RTX 3060/3070/3080)
- sm_89: Ada Lovelace (RTX 4090)
- sm_90: Hopper (H100, H200)
Performance monitoring:
```python
from cuda_opt_wrapper import print_performance_summary

# After training
print_performance_summary()
# Shows per-operation timing, speedup metrics, throughput
```

All CUDA kernels maintain numerical stability equivalent to PyTorch:
- RMSNorm: Epsilon term (1e-6) prevents division by zero
- RoPE: No approximations, exact trigonometric functions
- SwiGLU: Fast SiLU uses stable exp(-x) formulation
- Loss: Max-subtraction for softmax numerical stability
- Gradient clipping: Norm computation uses compensated summation
Validation: Every kernel includes PyTorch fallback for correctness verification. Automated tests compare CUDA vs PyTorch outputs (tolerance: 1e-4 for FP32, 1e-3 for FP16).
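In spirit, that comparison looks like the pure-Python stand-in below. The helper names are hypothetical and the real tests compare tensors elementwise; this just shows the documented tolerances in action:

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def check_kernel_output(cuda_out, torch_out, dtype="fp32"):
    """Accept the accelerated output if it matches the PyTorch fallback
    within the documented tolerance (1e-4 for FP32, 1e-3 for FP16)."""
    tol = 1e-4 if dtype == "fp32" else 1e-3
    return max_abs_diff(cuda_out, torch_out) <= tol

print(check_kernel_output([1.00005, 2.0], [1.0, 2.0], "fp32"))  # True
print(check_kernel_output([1.01, 2.0], [1.0, 2.0], "fp32"))     # False
```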
Pre-configured architecture presets for training models from scratch, spanning 500K to 300B parameters. These are configuration templates, not pre-trained models.
Each preset specifies architecture dimensions, MoE/MoD parameters, hardware targets, and expected performance with CUDA acceleration for initializing and training new models.
| Config | Active Params | Total Params | Hidden | Layers | Heads | KV Heads | Experts | Top-K | Hardware | Memory (FP16) | Throughput | CUDA Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `debug` | 500K | 4M | 128 | 2 | 2 | 2 | 8 | 2 | Any | 50 MB | Testing | 2.1x |
| `debug_200m` | 200M | 6B | 768 | 12 | 12 | 12 | 32 MoD | - | T4/MPS | 2 GB | Testing | 2.8x |
| `b1` | 1B | 8B | 1024 | 24 | 16 | 4 | 8 | 2 | RTX 3090, M1 Max | 8 GB | 1400 tok/s | 3.2x |
| `b7` | 7B | 56B | 4096 | 32 | 32 | 8 | 8 | 2 | A100 40GB | 28 GB | 680 tok/s | 3.5x |
| `b14` | 14B | 112B | 5120 | 40 | 40 | 10 | 8 | 2 | A100 80GB | 56 GB | 340 tok/s | 3.8x |
| `b30` | 30B | 240B | 8192 | 48 | 64 | 16 | 8 | 2 | 4× A100 80GB | 120 GB | 135 tok/s | 4.1x |
| `b50` | 50B | 400B | 10240 | 56 | 80 | 20 | 8 | 2 | 4× H100 | 200 GB | 68 tok/s | 4.5x |
| `b100` | 100B | 800B | 12288 | 80 | 96 | 24 | 8 | 2 | 8× H100 | 400 GB | 68 tok/s | 4.8x |
| `b200` | 200B | 1.6T | 16384 | 100 | 128 | 32 | 8 | 2 | 16× H200 | 800 GB | 41 tok/s | 5.2x |
| `b300` | 300B | 2.4T | 20480 | 120 | 160 | 40 | 8 | 2 | 32× H200 | 1.2 TB | 27 tok/s | 5.5x |
Memory estimates: Include model weights, optimizer states (Adam: 8 bytes/param), gradients, and activation memory at batch_size=1, mixed precision training. Actual memory scales with batch size and sequence length.
Throughput estimates: Measured with CUDA acceleration enabled at batch_size=1, sequence_length=2048, mixed precision with gradient checkpointing. The CUDA speedup column shows combined acceleration from all custom kernels vs. pure PyTorch. (Preset `debug_200m` is excluded from the throughput comparison.)
Configuration selection:
- Development/testing: `debug` for pipeline validation, `debug_200m` for architecture testing
- Research: `b1` for prototyping on consumer hardware with CUDA acceleration
- Production fine-tuning: `b7` for quality/efficiency balance with significant CUDA speedup
- Large-scale pre-training: `b30`+ for maximum model capacity
- Extreme scale: `b100`+ requires cluster infrastructure and distributed expertise
Important:
These presets define untrained model architectures. Training starts from random initialization following standard practices (Xavier/Kaiming initialization for weights, zero initialization for biases). The framework does not provide pre-trained checkpoints.
Customization:
All presets are starting points, and architecture dimensions can be modified: `hidden_size` must be divisible by `num_heads`; `intermediate_size` is typically 8/3 × `hidden_size` rounded to the nearest 256 for optimal CUDA performance; `max_position_embeddings` determines the context window; `num_experts` and `moe_top_k` can be adjusted independently; the MoD `capacity_factor` controls the compute/quality tradeoff. CUDA kernels automatically adapt to configuration changes.
Numerical formats for parameters, activations, and gradients during training and inference.
FP32 (Float32) - Full Precision
- 32-bit floating point (8-bit exponent, 23-bit mantissa)
- Range: ±3.4×10^38, precision: ~7 decimal digits
- Maximum stability, no special handling required
- 2× memory vs FP16/BF16, significantly slower on modern hardware
- Use cases: CPU training, numerical debugging, stability issues with reduced precision
FP16 (Float16) - Half Precision
- 16-bit floating point (5-bit exponent, 10-bit mantissa)
- Range: ±65504, precision: ~3 decimal digits
- 50% memory reduction, ~2× speedup on supported hardware
- Requires loss scaling to prevent gradient underflow (small gradients round to zero)
- Dynamic or static loss scaling: multiply loss by 2^N before backward, unscale gradients before update
- CUDA kernels maintain FP32 accumulation for numerical stability
- Use cases: Volta/Turing GPUs (V100, T4, RTX 2080), Apple Silicon (M1-M4)
BF16 (BFloat16) - Brain Float16
- 16-bit format (8-bit exponent, 7-bit mantissa)
- Range: Same as FP32 (±3.4×10^38), reduced precision vs FP32
- 50% memory reduction, similar speed to FP16
- No loss scaling required (wide dynamic range like FP32)
- Better training stability than FP16 with same memory benefits
- CUDA kernels optimized for BF16 with hardware acceleration on Ampere+
- Use cases: Ampere+ GPUs (A100, RTX 3090/4090, H100), primary recommendation for modern hardware
Mixed Precision FP16
- Forward/backward in FP16, master copy of parameters in FP32
- Optimizer updates FP32 master copy, then copies to FP16 for next forward
- Dynamic loss scaling automatically adjusts to prevent underflow
- Combines FP16 speed with FP32 stability
- CUDA kernels use FP32 accumulation internally
- Use cases: Default for older GPUs supporting FP16 but not BF16
Mixed Precision BF16
- Forward/backward in BF16, master parameters in FP32
- No loss scaling needed (BF16 dynamic range matches FP32)
- Simpler than mixed FP16 (no loss scaling configuration)
- Best speed/stability balance on modern hardware
- CUDA kernels leverage hardware BF16 tensor cores
- Use cases: Default for Ampere+ GPUs, primary recommendation for production
FP8 (Float8) - Experimental
- 8-bit floating point: E4M3 (forward) and E5M2 (backward) variants
- Further memory reduction (75% vs FP32)
- Requires H100 or newer with hardware FP8 support
- Complex configuration, quality impacts not fully characterized
- Custom CUDA kernels under development for FP8 support
- Use cases: Cutting-edge research, not recommended for general use
INT8 Quantization
- 8-bit integer representation (post-training or quantization-aware training)
- Primarily for inference, not training
- Reduces model size by 75% vs FP32 for deployment
- Quality impact depends on calibration and quantization method
- Use cases: Model deployment, edge devices
The framework detects hardware and selects optimal precision with CUDA kernel compatibility:
Detection logic:
- Check for CUDA availability and GPU compute capability
- Verify CUDA kernel compilation and availability
- If Ampere+ (compute capability ≥ 8.0): select `mixed_bf16` with BF16-optimized kernels
- If Volta/Turing (compute capability 7.0-7.5): select `mixed_fp16` with FP16-optimized kernels
- If Apple Silicon (MPS): select `fp16` (BF16 not supported, CUDA kernels unavailable)
- If CPU: select `fp32` (reduced precision offers no benefit, CUDA kernels unavailable)
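The detection logic reduces to a small decision ladder. This standalone helper is illustrative (not the framework's actual function); compute capability is modeled as a `(major, minor)` tuple:

```python
def select_precision(has_cuda, compute_capability=None, has_mps=False):
    """Pick a training precision following the detection order above."""
    if has_cuda and compute_capability is not None:
        if compute_capability >= (8, 0):
            return "mixed_bf16"   # Ampere+ with BF16-optimized kernels
        if compute_capability >= (7, 0):
            return "mixed_fp16"   # Volta/Turing with FP16-optimized kernels
    if has_mps:
        return "fp16"             # Apple Silicon: no BF16, no CUDA kernels
    return "fp32"                 # CPU fallback

print(select_precision(True, (8, 6)))        # mixed_bf16 (e.g. RTX 3080)
print(select_precision(True, (7, 5)))        # mixed_fp16 (e.g. T4)
print(select_precision(False, None, True))   # fp16 (Apple Silicon)
print(select_precision(False))               # fp32 (CPU)
```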
Override: Set precision explicitly via configuration if automatic selection is suboptimal or for specific debugging/testing requirements.
Quantization & Inference:
- AutoGPTQ: 4-bit quantization support for efficient inference and fine-tuning on consumer hardware.
- Optimum Quanto: Dynamic quantization support (8-bit/4-bit) for flexible deployment.
- OpenAI Triton: High-performance FP8 kernels for H100+ architectures.
Model Compatibility:
- DeepSeek Config Adapter: Auto-convert training configurations to DeepSeek-compatible formats.
- HuggingFace Interop: Seamless integration with `transformers` for dataset loading and tokenization.
NVIDIA Ampere/Ada/Hopper (A100, RTX 3090/4090, H100, H200):
- Recommended: `mixed_bf16` with full CUDA acceleration (4-5x total speedup)
- Alternative: `mixed_fp16` (if BF16 causes unexpected issues)
- Advanced: `fp8_e4m3` (H100+ only, experimental, custom kernels in development)
- CUDA benefits: All kernels optimized for tensor cores, maximum performance
NVIDIA Volta/Turing (V100, T4, RTX 2080/2080Ti):
- Recommended: `mixed_fp16` with CUDA acceleration (3-4x speedup)
- Alternative: `fp32` (if stability issues arise; 2-3x CUDA speedup maintained)
- Note: BF16 not supported (no hardware acceleration)
- CUDA benefits: FP16 tensor core utilization, optimized memory access
Apple Silicon (M1/M2/M3/M4, Mac Studio, MacBook Pro):
- Recommended: `fp16` (CUDA kernels unavailable, PyTorch fallback)
- Note: Mixed precision and BF16 not supported on MPS backend
- Limitations: Flash Attention disabled, DeepSpeed unavailable, no custom CUDA acceleration
CPU (Intel/AMD/ARM):
- Recommended: `fp32` (CUDA kernels unavailable, PyTorch fallback)
- Note: Reduced precision offers minimal benefit on CPU
- Expect significantly slower training than GPU (10-100× depending on model size)
- Performance: No CUDA acceleration, relies on BLAS libraries (MKL, OpenBLAS, Accelerate)
- Training precision: Format used during forward pass, backward pass, and gradient computation
- Inference precision: Format used during validation and evaluation
- Master precision: Format for the optimizer's master parameter copy (typically FP32 in mixed precision)
- CUDA kernel precision: Selected automatically based on training precision
Separate training/inference precision:
Common pattern: Train in mixed_bf16 for speed with CUDA acceleration, evaluate in fp32 for precise metrics. Or train in mixed_fp16 with CUDA kernels, deploy in int8 for inference.
Loss scaling parameters (FP16 only):
- `init_scale`: Initial loss scaling factor (default: 2^16)
- `scale_factor`: Multiplier for scale adjustment (default: 2.0)
- `scale_window`: Steps without overflow before increasing scale (default: 2000)
- `min_scale`: Minimum scale factor (default: 1.0)
Dynamic loss scaling adjusts automatically: scale increases every scale_window steps without overflow, decreases on overflow detection (NaN/Inf gradients). CUDA kernels maintain numerical stability with loss scaling. Most users do not need to modify these parameters.
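The scaling policy can be sketched as a small state machine using the documented defaults. This is an illustrative class; the framework's internal scaler may differ in details:

```python
class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling for FP16 training."""

    def __init__(self, init_scale=2**16, scale_factor=2.0,
                 scale_window=2000, min_scale=1.0):
        self.scale = float(init_scale)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow):
        if found_overflow:  # NaN/Inf detected in gradients
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                self.scale *= self.scale_factor  # grow after a clean window
                self.good_steps = 0

scaler = DynamicLossScaler()
scaler.update(found_overflow=True)  # overflow halves the scale
print(scaler.scale)  # 32768.0
```

In a training loop, the loss is multiplied by `scaler.scale` before `backward()`, gradients are unscaled before the optimizer step, and `update()` is called with the overflow flag each step.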
Free GPU training demonstration requiring no local setup with CUDA acceleration showcase.
Environment:
- Hardware: Google Colab T4 GPU (15.8GB memory, Turing architecture, sm_75)
- CUDA: 11.8 or 12.x (varies by Colab environment)
- Runtime: ~3.5 hours for complete training run (with CUDA acceleration, ~5 hours without)
Configuration:
- Model: `debug` preset (14M total, 2M active, 8 experts, top-2 routing)
- Dataset: Small conversational corpus (~200 samples, ~2.5M tokens)
- Training: 12 epochs, batch_size=30, gradient_accumulation=8, effective_batch_size=240
- Precision: Mixed FP16 (automatic for T4) with CUDA-accelerated operations
- Learning rate: 1e-4 with cosine warmup and decay
- CUDA kernels: Automatically compiled on first run (1-2 minutes compilation time)
Observable behaviors:
- System diagnostics: Hardware detection, precision selection, CUDA kernel loading
- Chinchilla scaling: Optimal token calculation (20× parameters), epoch recommendations
- Training metrics: Loss progression (~2.5 → ~2.0), throughput (~53–55k tok/s with CUDA vs ~30k tok/s without)
- Orchestrator: Health checks every 100 steps, intervention decisions
- Expert statistics: Utilization distribution, routing entropy, load balance
- CUDA performance: Real-time speedup metrics, per-operation timing, memory efficiency
Limitations:
- Colab quotas: Usage limits, session timeouts
- Small dataset: Limited quality, demonstrates pipeline not production training
- Free tier: T4 availability not guaranteed, may encounter queuing
- CUDA compilation: First run requires kernel compilation (~1-2 minutes)
Requirements:
- Python 3.8+ (3.10+ recommended)
- PyTorch 2.0+ (2.2+ recommended)
- CUDA 11.8+ (for GPU with acceleration) or CPU
- CUDA Toolkit with nvcc (for compiling custom kernels)
- RAM: 16GB minimum, 32GB+ recommended
- Disk: 50GB+ for dependencies, datasets, checkpoints
Installation:
```bash
git clone https://github.com/matn23/AdaptiveTrainingSystem
cd AdaptiveTrainingSystem
pip install -r requirements.txt

# Compile CUDA kernels (optional but recommended for 3-5x speedup)
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh

# Verify kernel compilation
python -c "from cuda_opt_wrapper import TRANSFORMER_OPS_AVAILABLE; print(f'CUDA ops: {TRANSFORMER_OPS_AVAILABLE}')"

cd ../..
python Main.py
```

Optional dependencies:
- Flash Attention: 2-4× attention speedup, requires manual compilation
- DeepSpeed: Multi-GPU optimization, auto-installs but compiles kernels on first use
- Wandb: Experiment tracking, requires API key
Quick start:
Default configuration uses debug preset for rapid testing. System auto-detects hardware, compiles CUDA kernels if nvcc available, selects precision, validates data, initializes model, begins training with acceleration.
Resume from checkpoint:
```bash
python Main.py --resume path/to/checkpoint.pt
```

Restores model state, optimizer state, scheduler, training step counter, random seeds, and CUDA kernel state.
CUDA kernel management:
```bash
# Check kernel status
python -c "from moe_cuda_wrapper import print_performance_summary; print_performance_summary()"

# Recompile kernels after update
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh

# Disable CUDA acceleration (for debugging), set in config:
# use_cuda: false
# use_fused_rope/use_fused_swiglu/use_fused_moe/use_fused_loss/use_fused_grad_clip: false
```

Four data handling strategies for different use cases, all compatible with CUDA acceleration.
Raw text without conversational structure. For domain-specific pre-training, continued pre-training, language modeling research.
Data format:
- Plain text files (.txt)
- JSONL with "text" field
Processing: Tokenize and split into fixed-length sequences with optional overlap (stride parameter). CUDA-accelerated tokenization for large corpora.
Loss: Applied to all tokens (causal language modeling) with fused loss computation kernel.
Conversational data with role annotations. For instruction tuning, chat models, task-specific adaptation.
Data format: JSONL with "conversation" field containing list of messages. Each message has "role" (system/user/assistant) and "content".
Processing: Concatenate messages with special tokens marking roles: [BOS] system [SEP] user [SEP] assistant [EOS]. CUDA-accelerated tokenization and batching.
Loss: Can mask user tokens (loss only on assistant responses) or compute on all tokens. Fused loss kernel handles masking efficiently.
Two-phase training: base corpus then conversational data. Builds general understanding, then adapts to conversation.
Configuration: Separate epoch counts per phase. Optional learning rate warmup between phases to handle distribution shift. CUDA acceleration maintained across phase transitions.
Use cases: Domain adaptation (medical literature → clinical QA), continual learning (new data → maintained task performance).
Mix base and conversational data within batches/epochs. Maintains general capabilities while learning conversation. Prevents catastrophic forgetting.
Mixing ratio: base_ratio controls proportion (e.g., 0.7 = 70% base, 30% conversational).
Strategies:
- Ratio-based: Sample according to ratio per batch
- Alternating: Cycle between sources (batch 1 base, batch 2 conversational, etc.)
- Random: Random sampling with probability = base_ratio
Use cases: General-purpose chat models, multi-task learning with auxiliary objectives.
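The random strategy reduces to a Bernoulli draw per sample. A minimal sketch with an illustrative helper name:

```python
import random

def sample_source(base_ratio, rng=random):
    """Draw 'base' with probability base_ratio, else 'conversational'."""
    return "base" if rng.random() < base_ratio else "conversational"

# With base_ratio=0.7, roughly 70% of draws come from the base corpus
rng = random.Random(0)  # seeded for reproducibility
sources = [sample_source(0.7, rng) for _ in range(1000)]
print(sources.count("base"))  # roughly 700 of 1000 draws
```

The ratio-based and alternating strategies replace the random draw with deterministic per-batch accounting, which gives exact proportions at the cost of less shuffling between sources.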
State machine monitoring training every N steps (default: 100). Triggers interventions across 18 methods when confidence threshold exceeded (default: 0.75). Monitors CUDA kernel performance and adjusts strategies accordingly.
Monitored metrics:
- Loss dynamics: Trend, variance, plateau detection, divergence warnings
- Gradient statistics: Norm, variance, stability over windows, clipping frequency
- Expert utilization: Per-expert usage, routing entropy, load balance (MoE)
- Memory consumption: GPU/system memory, OOM risk prediction, CUDA kernel memory usage
- Throughput: Tokens/second with and without CUDA acceleration, kernel efficiency
- Convergence: Score based on loss stability, compute efficiency
- CUDA performance: Per-kernel timing, speedup metrics, fallback frequency
Intervention categories:
- Hyperparameter adaptation: Learning rate, weight decay, batch size
- Architecture modification: Add/remove experts, adjust capacities, routing temperature
- Emergency recovery: Gradient explosion handling, OOM recovery, checkpoint rollback
- Schedule optimization: Early stopping recommendations, checkpoint prioritization
- Performance optimization: CUDA kernel selection, precision adjustment, memory optimization
Decision process:
- Collect metrics at checkpoint interval including CUDA performance
- Analyze patterns (plateau, divergence, imbalance, performance degradation)
- Compute intervention confidence scores
- If confidence > threshold, execute intervention
- Log decision with reasoning, confidence, and expected performance impact
- Monitor intervention impact in subsequent intervals including CUDA efficiency
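The confidence-gated loop can be sketched as follows. Rule and metric names are illustrative; the real orchestrator tracks 20+ metrics, 18 intervention methods, and a decision history:

```python
def decide_interventions(metrics, rules, threshold=0.75):
    """Each rule maps a metrics dict to (intervention_name, confidence);
    only interventions above the confidence threshold fire."""
    decisions = []
    for rule in rules:
        name, confidence = rule(metrics)
        if confidence > threshold:
            decisions.append((name, confidence))
    return decisions

def plateau_rule(m):
    # Low loss variance over the window suggests a plateau
    conf = 0.9 if m["loss_variance"] < 1e-4 else 0.1
    return ("adjust_learning_rate", conf)

print(decide_interventions({"loss_variance": 5e-5}, [plateau_rule]))
# [('adjust_learning_rate', 0.9)]
```

Gating on confidence is what prevents excessive intervention: a noisy metric that only weakly suggests a problem never crosses the threshold, so the trainer is left alone.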
MoE Architecture Management (3 methods):
- `add_expert(layer_idx)`: Add an expert to an overloaded layer when average utilization > threshold
- `prune_expert(layer_idx, expert_idx)`: Remove an expert with utilization < threshold
- `_initialize_new_expert(new_expert, existing_experts)`: Initialize a new expert from existing experts with noise
MoE Routing Control (4 methods):
- `adjust_capacity_factor(factor)`: Modify token capacity per expert
- `adjust_routing_temperature(temp)`: Control routing sharpness (lower = more concentrated)
- `enable_expert_dropout(prob)`: Regularization via random expert dropping
- `get_expert_statistics()`: Retrieve utilization, entropy, load balance loss, tokens dropped
MoD Control (2 methods):
- `adjust_mod_capacity(factor)`: Change the fraction of tokens using full computation
- `get_mod_statistics()`: Retrieve capacity factor, tokens processed, average depth
Batch Size Adaptation (2 methods):
- `adjust_batch_size(new_size)`: Change micro batch size (typically for OOM recovery)
- `_recreate_dataloader(dataset)`: Rebuild the dataloader after a batch size change
Emergency Recovery (2 methods):
- `emergency_lr_reduction(factor)`: Reduce learning rate by a factor (for gradient explosion)
- `rollback_steps(num_steps)`: Revert to an earlier checkpoint (for divergence)
Optimizer Adjustments (2 methods):
- `adjust_weight_decay(value)`: Modify regularization strength
- `_update_optimizer_param_groups(param, value)`: Internal parameter group update
Real-time Metrics (3 methods):
- `get_current_metrics()`: Query loss, learning rate, gradient norm, throughput, CUDA stats
- `_extract_moe_routing_stats()`: Internal MoE statistics extraction
- `_calculate_throughput()`: Compute tokens/second with CUDA performance breakdown
Performance Optimization (2 methods):
- `optimize_cuda_performance()`: Analyze and tune CUDA kernel usage
- `get_cuda_performance_stats()`: Detailed per-kernel timing and efficiency metrics
Automatic training duration calculation following compute-optimal scaling laws (Hoffmann et al., 2022) with CUDA-aware compute budgets.
Formula: N_optimal_tokens = multiplier × model_parameters
Default multiplier: 20× (configurable: 10-50×)
Process:
- Calculate optimal token budget: `N_opt = 20 × total_parameters`
- Adjust for CUDA acceleration: `effective_N_opt = N_opt / speedup_factor`
- Determine base epochs: `epochs = N_opt / dataset_tokens`
- Clamp to min/max epoch constraints (default: 1-50)
- Monitor during training: convergence score, loss landscape, compute efficiency
- Track CUDA performance: kernel timing, memory efficiency, bottleneck identification
- Adjust dynamically: Reduce epochs if fast convergence, stop early if plateaued
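The budget-to-epochs calculation above can be sketched as a small function (an illustrative helper, not the framework's API):

```python
# Chinchilla-style epoch calculation: token budget = multiplier x parameters,
# epochs clamped to configured bounds (Hoffmann et al., 2022).
def chinchilla_epochs(total_parameters: int, dataset_tokens: int,
                      multiplier: float = 20.0,
                      min_epochs: int = 1, max_epochs: int = 50) -> dict:
    n_opt = multiplier * total_parameters          # optimal token budget
    raw_epochs = n_opt / dataset_tokens            # passes over the dataset
    epochs = int(min(max(round(raw_epochs), min_epochs), max_epochs))
    coverage = dataset_tokens / n_opt              # dataset fraction of budget
    return {"optimal_tokens": n_opt, "epochs": epochs, "coverage": coverage}

# A 1B-parameter model on a 5B-token dataset: 20B-token budget, 4 epochs.
print(chinchilla_epochs(1_000_000_000, 5_000_000_000))
```

With the default 20× multiplier, a dataset covering 25% of the budget yields 4 recommended epochs before the dynamic adjustments above kick in.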
Enhanced features:
- Loss landscape analysis: Detect plateaus (low variance over window), divergence (rapid loss increase)
- Compute efficiency: Track loss reduction per FLOP including CUDA acceleration benefits
- Adaptive curriculum: Adjust learning rate or data sampling based on convergence phase
- Early stopping: Recommend termination when convergence score > threshold (typically 85%)
- CUDA-aware budgeting: Account for acceleration when calculating optimal training duration
Runtime integration:
System calculates optimal duration at training start with CUDA performance estimates. Displays token budget, coverage percentage (dataset_tokens / optimal_tokens), recommended epochs, expected CUDA speedup. During training, prints status every N steps: current progress, convergence score, training phase (warming/learning/convergence), efficiency trend, CUDA performance metrics, recommendations (continue/adjust/stop).
Platform-specific optimizations automatically applied based on detected hardware with CUDA acceleration where available.
Automatic optimizations:
- Precision: `mixed_bf16` for Ampere+, `mixed_fp16` for Volta/Turing
- Flash Attention: Enabled on Ampere+ (compute capability ≥ 8.0)
- Custom CUDA kernels: Automatic compilation and loading of optimized operations
- Tensor cores: Automatically utilized for supported operations and precisions
- CUDA graphs: Enabled for static computation graphs (requires compile=True)
- Memory management: Optimized allocation patterns for kernel efficiency
Configuration parameters:
- `use_flash_attention`: Enable Flash Attention 2.x (2-4× attention speedup)
- `use_fused_rmsnorm`: Enable fused RMSNorm kernel (default profile: `false`)
- `use_fused_rope`: Enable fused RoPE kernel (default profile: `true`)
- `use_fused_swiglu`: Enable fused SwiGLU kernel (default profile: `true`)
- `use_fused_moe`: Enable CUDA MoE routing/dispatch kernels (default profile: `true`)
- `use_fused_loss`: Enable fused loss kernel (default profile: `true`)
- `use_fused_grad_clip`: Enable fused gradient clipping kernel (default profile: `true`)
- `validate_moe_cuda_indices`: Extra safety checks for CUDA MoE indices (default: `false`, slower)
- `force_dense_expert_grads`: Force dense expert gradient paths (default: `false`, slower)
- `routing_stats_update_interval`: Routing stats sync cadence (default: `64` steps)
- `gradient_checkpointing`: Trade compute for memory (enables larger models)
- `compile`: PyTorch 2.0 compilation (5-30% additional speedup)
- `use_deepspeed`: Enable DeepSpeed for multi-GPU
- `zero_stage`: ZeRO optimization level (0-3)
DeepSpeed ZeRO stages:
- Stage 0: Disabled (standard DDP with CUDA acceleration)
- Stage 1: Partition optimizer states (~4× memory reduction)
- Stage 2: Partition optimizer + gradients (~8× reduction)
- Stage 3: Partition optimizer + gradients + parameters (~N× reduction where N = num_GPUs)
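The stage-by-stage savings can be estimated per GPU with the common 16-bytes-per-parameter rule of thumb for mixed-precision Adam (2 bytes weights + 2 bytes gradients + 12 bytes optimizer state); actual usage varies with implementation details:

```python
# Rough per-GPU static memory under each ZeRO stage. Each stage shards
# one more component across the data-parallel group.
def zero_memory_gb(params: int, num_gpus: int, stage: int) -> float:
    weights, grads, optim = 2 * params, 2 * params, 12 * params  # bytes
    if stage >= 1:
        optim /= num_gpus        # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus        # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus      # stage 3: also shard parameters
    return (weights + grads + optim) / 1e9

# 7B parameters on 4 GPUs: 112 GB/GPU at stage 0 down to 28 GB/GPU at stage 3.
for stage in (0, 1, 2, 3):
    print(f"stage {stage}: {zero_memory_gb(7_000_000_000, 4, stage):.1f} GB/GPU")
```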
Memory optimization:
- CPU offload: Move optimizer states to CPU memory (slower updates, massive memory savings)
- Gradient compression: Reduce communication volume (quality impact minimal)
- Activation checkpointing: Recompute activations during backward (trade compute for memory)
- CUDA kernel memory: Optimized temporary buffer management
CUDA kernel performance:
- Throughput monitoring: Real-time tokens/second with and without acceleration
- Per-kernel timing: Individual operation profiling
- Bottleneck detection: Identify operations limiting performance
- Automatic tuning: Adjust kernel launch configurations for optimal performance
Custom Metal Shaders (NEW):
- Fused RMSNorm: 2-3x faster than PyTorch
- Rotary Position Embeddings: 3-5x faster
- SwiGLU Activation: 2-3x faster
- MoE Routing: Optimized expert dispatch and combination
- Automatic detection: Metal shaders auto-load when available
Metal shader compilation:
`cd Src/Main_Scripts/core`
`./compile_metal.sh`
Automatic optimizations:
- Precision: FP16 (optimal for Metal, auto-selected)
- Sequence length: Automatically reduced to 512 on MPS to prevent OOM
- Data loading: num_workers=0 (MPS prefers single-threaded)
- Unified memory: Automatic page management
- Backend detection: Unified backend system selects Metal when available
Limitations:
- Flash Attention: Not supported (Metal alternative in development)
- DeepSpeed: Not available (Linux/CUDA only)
- Mixed precision: FP16 or FP32 only (no BF16)
- Compilation: Can be unstable (set compile=False if issues)
- Memory: Unified memory shared with system (monitor via Activity Monitor)
Recommendations:
- Start with batch_size=2-4 (auto-reduced seq_length=512 helps)
- Enable gradient checkpointing for larger models
- Use `debug` or `debug_200m` presets for testing
- Expected throughput: 2-3× faster than pure PyTorch on MPS
Automatic settings:
- Precision: FP32 (reduced precision offers no benefit on CPU)
- Threading: Automatic core detection, configurable via num_workers
- PyTorch fallback: CUDA kernels unavailable
Optimizations:
- BLAS libraries: MKL (Intel), OpenBLAS (AMD/ARM), Accelerate (macOS)
- Thread count: Typically num_cores - 2 for system overhead
- Vectorization: Automatic SIMD instruction usage
Expectations:
- 10-100× slower than GPU depending on model size
- 30-150× slower than CUDA-accelerated GPU training
- Suitable for debugging, not production training
- Memory constraints less severe (can use system RAM)
Memory-efficient data loading with zero-copy operations, automatic caching, and CUDA-accelerated preprocessing.
Features enabled:
- Memory-mapped file access: Read datasets without loading entirely into RAM
- Zero-copy operations: Apache Arrow columnar format, no serialization overhead
- Multi-threaded loading: Configurable worker processes (num_workers parameter)
- Automatic caching: HuggingFace Datasets caches processed data
- Sharding: Automatic data distribution for multi-GPU training
- Polars acceleration: Fast DataFrame operations for preprocessing
- CUDA-accelerated tokenization: Parallel token encoding on GPU for large batches
Data intelligence:
- Difficulty-based sampling: Prioritize harder examples based on loss history
- Curriculum learning: Gradually increase example complexity (aggressiveness parameter: 0-1)
- Automatic cleaning: Remove malformed samples, normalize formatting
- Quality threshold: Filter samples below quality score (default: 0.85)
- Sequence length optimization: Dynamic padding, variable length batching
Preprocessing pipeline:
- Load raw files (txt, jsonl)
- Validate format and structure
- Quality filtering (configurable threshold)
- Tokenization with CUDA acceleration and caching
- Sequence construction (chunking or conversation formatting)
- Difficulty scoring (for curriculum learning)
- Batch construction with dynamic padding
- Transfer to GPU with pinned memory for efficient data transfer
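The dynamic-padding step of the pipeline pads each batch to its own longest sequence rather than a global maximum; a minimal sketch (`collate_dynamic` is an illustrative helper, not the framework's collator):

```python
# Dynamic padding: pad token-id lists only to the batch maximum, avoiding
# wasted compute on short samples. Returns padded ids and an attention mask.
def collate_dynamic(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

tokens, mask = collate_dynamic([[5, 6, 7], [8, 9]])
print(tokens)  # [[5, 6, 7], [8, 9, 0]] -- padded to the batch max only
```

In the real pipeline these batches would then be converted to tensors and moved to the GPU with pinned memory, as described above.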
Statistics tracked:
- Total samples, valid/invalid counts
- Token statistics: mean, median, max, min, standard deviation
- Role distribution (conversational data)
- Quality scores: error rate, issues detected
- Sequence length distribution
- CUDA preprocessing time and speedup metrics
Validation:
System validates all data paths before training. Checks: file existence, readability, size, format correctness. Prints summary: file count, total size, samples per file, estimated CUDA preprocessing speedup. Reports errors: missing files, corrupt formats, empty files.
Comprehensive metrics tracked during training with real-time logging, experiment tracking integration, and CUDA performance monitoring.
Core metrics:
- Loss: Training and validation, rolling average, per-batch values, fused computation timing
- Perplexity: exp(loss), interpretable quality measure
- Accuracy: Token-level prediction accuracy, fused with loss computation
- Learning rate: Current value from scheduler
- Throughput: Tokens per second with CUDA acceleration, samples per second, speedup ratio
- Gradient norm: L2 norm of gradients, variance over window, clipping frequency
- Memory usage: GPU allocated/reserved, system RAM, CUDA kernel buffers
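Perplexity and throughput follow directly from the tracked quantities; a minimal sketch (illustrative helper, not the framework's metrics API):

```python
# Derived metrics: perplexity is exp(mean cross-entropy loss) and
# throughput is tokens processed per wall-clock second.
import math

def derived_metrics(mean_loss: float, tokens: int, seconds: float) -> dict:
    return {
        "perplexity": math.exp(mean_loss),      # interpretable quality measure
        "tokens_per_second": tokens / seconds,  # throughput
    }

m = derived_metrics(mean_loss=2.0, tokens=680_000, seconds=1000.0)
print(m)  # perplexity ~7.39, throughput 680.0 tokens/s
```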
MoE-specific metrics:
- Expert utilization: Fraction of tokens per expert, per-layer and aggregated
- Routing entropy: Distribution concentration, higher = more balanced
- Load balance loss: Auxiliary loss magnitude
- Tokens dropped: Count exceeding capacity
- Expert efficiency: Compute per expert, utilization × quality contribution
- CUDA routing performance: Top-k timing, dispatch efficiency, combination speedup
MoD-specific metrics:
- Capacity utilization: Actual vs configured capacity per layer
- Average depth: Mean number of full layers per token
- Skip patterns: Which tokens skip which layers
- Compute savings: FLOPs reduction percentage
- Per-layer usage: Fraction of tokens using full computation per layer
Chinchilla metrics:
- Token progress: Current tokens / optimal tokens percentage
- Convergence score: 0-100% based on loss stability
- Training phase: warming/learning/convergence/plateau
- Compute efficiency: Loss reduction per FLOP including CUDA benefits
- Predicted final loss: Extrapolation from current trajectory
CUDA performance metrics:
- Kernel timing: Per-operation execution time
- Speedup factors: CUDA vs PyTorch for each operation
- Memory efficiency: Kernel buffer usage, allocation patterns
- Throughput breakdown: Contribution of each accelerated operation
- Fallback frequency: Count of PyTorch fallback invocations
- Occupancy: GPU utilization percentage, kernel launch efficiency
Logging configuration:
- Log level: DEBUG/INFO/WARNING/ERROR
- Log interval: Every N steps (default: 50)
- Health check interval: Orchestrator monitoring frequency (default: 100)
- Checkpoint interval: Save frequency (default: 1000 steps)
- CUDA profiling: Optional detailed kernel profiling (per 1000 steps)
Output destinations:
- Console: Real-time training progress with CUDA stats
- Log files: experiments/[name]/logs/training.log
- Metrics files: JSON format, experiments/[name]/metrics/
- Wandb: Optional cloud logging and visualization
- TensorBoard: Optional local visualization
- CUDA profiles: experiments/[name]/cuda_profiles/ (when enabled)
Health checks:
Orchestrator performs comprehensive health assessment every N steps:
- Loss trend analysis: Increasing/decreasing/plateau/divergence
- Gradient stability: Norm within expected range, no explosions
- Memory status: Utilization percentage, OOM risk
- Expert balance: MoE utilization distribution (if applicable)
- Throughput: Current vs baseline, degradation detection
- Convergence: Progress toward optimal loss
- CUDA efficiency: Kernel performance, speedup maintenance, memory usage
Health check output includes status (healthy/warning/critical), detected issues, recommended interventions, confidence scores, CUDA performance summary.
Throughput measurements on reference hardware configurations with and without CUDA acceleration. All benchmarks use sequence_length=2048, batch_size optimized per GPU, mixed precision training with gradient checkpointing.
Google Colab T4 (15.8GB, Turing, sm_75):
- debug (14M total, 2M active):
- With CUDA: ~53,000–55,000 tokens/second, batch_size=30, grad_steps=8
- Without CUDA (PyTorch): ~30,000 tokens/second, batch_size=30, grad_steps=8
- Speedup: ~1.8×
- Memory efficiency: High utilization with optimized kernels
NVIDIA RTX 3090 (24GB, Ampere, sm_80):
- b1 (1B active):
- With CUDA: ~1400 tokens/second, batch_size=16
- Without CUDA: ~450 tokens/second, batch_size=16
- Speedup: 3.1×
- b7 (7B active):
- With CUDA: ~270 tokens/second, batch_size=4, requires ZeRO-2
- Without CUDA: ~75 tokens/second, batch_size=4
- Speedup: 3.6×
- Memory efficiency: 85-90% utilization at optimal batch size
NVIDIA A100 40GB (Ampere, sm_80):
- b1 (1B active):
- With CUDA: ~1600 tokens/second, batch_size=32
- Without CUDA: ~480 tokens/second, batch_size=32
- Speedup: 3.3×
- b7 (7B active):
- With CUDA: ~680 tokens/second, batch_size=16
- Without CUDA: ~195 tokens/second, batch_size=16
- Speedup: 3.5×
- b14 (14B active):
- With CUDA: ~210 tokens/second, batch_size=4, requires ZeRO-2
- Without CUDA: ~55 tokens/second, batch_size=4
- Speedup: 3.8×
- Memory efficiency: 90-95% utilization
NVIDIA A100 80GB (Ampere, sm_80):
- b7 (7B active):
- With CUDA: ~750 tokens/second, batch_size=24
- Without CUDA: ~210 tokens/second, batch_size=24
- Speedup: 3.6×
- b14 (14B active):
- With CUDA: ~340 tokens/second, batch_size=12
- Without CUDA: ~90 tokens/second, batch_size=12
- Speedup: 3.8×
- b30 (30B active):
- With CUDA: ~68 tokens/second, batch_size=2, requires ZeRO-3
- Without CUDA: ~16 tokens/second, batch_size=2
- Speedup: 4.3×
- Memory efficiency: 85-92% utilization
NVIDIA H100 80GB (Hopper, sm_90):
- b14 (14B active):
- With CUDA: ~550 tokens/second, batch_size=16
- Without CUDA: ~125 tokens/second, batch_size=16
- Speedup: 4.4×
- b30 (30B active):
- With CUDA: ~160 tokens/second, batch_size=8
- Without CUDA: ~35 tokens/second, batch_size=8
- Speedup: 4.6×
- b50 (50B active):
- With CUDA: ~68 tokens/second, batch_size=4, requires ZeRO-3
- Without CUDA: ~14 tokens/second, batch_size=4
- Speedup: 4.9×
- FP8 support: Additional 1.5-2× speedup with FP8 training
Apple M1 Max (32GB unified, MPS):
- b1 (1B active): ~300 tokens/second, batch_size=8 (PyTorch only, no CUDA)
- Memory: Unified architecture shares with system, effective 20-24GB for training
- Note: 4-5× slower than equivalent CUDA GPU with acceleration
Apple M2 Ultra (128GB unified, MPS):
- b1 (1B active): ~400 tokens/second, batch_size=16 (PyTorch only, no CUDA)
- b7 (7B active): ~80 tokens/second, batch_size=4 (PyTorch only, no CUDA)
- Memory: Up to 96GB available for training after system overhead
- Note: 3-4× slower than A100 with CUDA acceleration
4× A100 80GB (DeepSpeed ZeRO-2 + CUDA):
- b30 (30B active):
- With CUDA: ~480 tokens/second (3.5× single GPU)
- Without CUDA: ~110 tokens/second (3.4× single GPU)
- Speedup: 4.4× (CUDA over PyTorch)
- Scaling efficiency: 87%
- Communication overhead: ~13%
8× A100 80GB (DeepSpeed ZeRO-3 + CUDA):
- b50 (50B active):
- With CUDA: ~440 tokens/second
- Without CUDA: ~95 tokens/second
- Speedup: 4.6×
- b100 (100B active):
- With CUDA: ~245 tokens/second
- Without CUDA: ~52 tokens/second
- Speedup: 4.7×
- Scaling efficiency: 70-75%
- Communication overhead: 25-30%
16× H100 80GB (DeepSpeed ZeRO-3 + expert parallelism + CUDA):
- b100 (100B active):
- With CUDA: ~820 tokens/second
- Without CUDA: ~165 tokens/second
- Speedup: 5.0×
- b200 (200B active):
- With CUDA: ~380 tokens/second
- Without CUDA: ~73 tokens/second
- Speedup: 5.2×
- Scaling efficiency: 60-65%
- Expert parallelism improves MoE scaling
Scaling efficiency factors:
- Model size: Larger models have lower scaling efficiency (communication bound)
- Interconnect: InfiniBand (400Gbps) vs Ethernet (100Gbps) significantly affects scaling
- Expert parallelism: Distributing experts across GPUs improves MoE scaling
- Gradient accumulation: Higher accumulation reduces communication frequency
- CUDA acceleration: Maintained across all GPUs, reduces compute bottlenecks
CUDA Custom Kernels (3-5× overall):
- RMSNorm: 3-4× speedup over PyTorch
- RoPE: 5-7× speedup over PyTorch
- SwiGLU: 2-3× speedup over PyTorch
- MoE operations: 2-4× speedup over PyTorch
- Fused loss: 3-4× speedup over separate operations
- Combined impact: 3.2-4.8× depending on model configuration
Flash Attention (Ampere+):
- Attention speedup: 2-4× depending on sequence length
- Longer sequences benefit more (4× at 4096 length vs 2× at 512)
- Memory reduction: 30-50% for attention computation
- Quality: Numerically equivalent to standard attention
- Compatible with CUDA kernels: Combined 4-8× total speedup
PyTorch Compilation (torch.compile):
- Speedup: 5-30% depending on model architecture
- MoE models: Lower benefit (routing breaks fusion)
- Dense models: Higher benefit (more fusion opportunities)
- Startup cost: 1-5 minutes additional compilation time
- Stacks with CUDA: Additional 5-15% on top of kernel acceleration
Gradient Checkpointing:
- Memory reduction: 30-50% of activation memory
- Compute overhead: 20-30% additional training time
- Trade-off: Enables larger batch sizes, often net positive throughput
- CUDA kernels: Maintain acceleration during recomputation
Mixed Precision:
- FP32 → mixed_bf16: ~2× speedup, 50% memory reduction
- FP32 → mixed_fp16: ~2× speedup, 50% memory, may need loss scaling tuning
- BF16 → FP8: ~1.5-2× speedup (H100 only), quality impacts under investigation
- CUDA kernels: Optimized for each precision, maintain speedup ratios
Per-kernel breakdown (b7 model, A100):
Operation | PyTorch | CUDA | Speedup | % Time
-------------------|---------|-------|---------|--------
RMSNorm | 45ms | 12ms | 3.8× | 18%
RoPE | 38ms | 6ms | 6.3× | 9%
SwiGLU | 62ms | 24ms | 2.6× | 22%
MoE Routing | 28ms | 9ms | 3.1× | 11%
MoE Dispatch | 42ms | 14ms | 3.0× | 15%
MoE Combine | 35ms | 11ms | 3.2× | 13%
Loss Computation | 18ms | 5ms | 3.6× | 6%
Other Operations | 52ms | 48ms | 1.1× | 6%
-------------------|---------|-------|---------|--------
Total per batch | 320ms | 129ms | 2.5× | 100%
Effective tokens/s | 195 | 680 | 3.5× |
Note: Effective speedup higher than per-operation average due to reduced overhead and better GPU utilization.
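The table's batch-level 2.5× figure can be checked directly from the per-operation timings (numbers copied from the table above):

```python
# Per-operation timings (ms) from the b7/A100 breakdown above.
pytorch_ms = {"rmsnorm": 45, "rope": 38, "swiglu": 62, "moe_routing": 28,
              "moe_dispatch": 42, "moe_combine": 35, "loss": 18, "other": 52}
cuda_ms = {"rmsnorm": 12, "rope": 6, "swiglu": 24, "moe_routing": 9,
           "moe_dispatch": 14, "moe_combine": 11, "loss": 5, "other": 48}

total_pt = sum(pytorch_ms.values())   # 320 ms per batch
total_cu = sum(cuda_ms.values())      # 129 ms per batch
print(f"batch speedup: {total_pt / total_cu:.2f}x")  # ~2.48x
```

The gap between this 2.5× and the 3.5× effective token throughput is the reduced launch overhead and higher GPU utilization noted above, which per-kernel timing alone does not capture.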
MoE Architecture Management:
add_expert(layer_idx: int) -> None
- Adds new expert to specified transformer layer
- Expert initialized from existing experts with Gaussian noise
- Automatically updates routing weights
- Triggers rebalancing of capacity factor
- CUDA kernels automatically adapt to new expert count
prune_expert(layer_idx: int, expert_idx: int) -> None
- Removes expert from specified layer
- Parameters frozen but retained in checkpoint for potential recovery
- Routing weights adjusted to redistribute to remaining experts
- Triggers capacity factor adjustment
- CUDA dispatch kernels updated for new expert topology
MoE Routing Control:
adjust_capacity_factor(factor: float) -> None
- Updates token capacity per expert
- Values typically 1.0-2.0 (100-200% of fair share)
- Higher values reduce token dropping but increase memory
- Lower values force sharper routing but may drop tokens
- CUDA dispatch kernel buffer sizes adjusted accordingly
adjust_routing_temperature(temperature: float) -> None
- Controls routing distribution sharpness
- temperature < 1.0: Sharper routing (more specialized)
- temperature > 1.0: Softer routing (more uniform)
- Typical range: 0.5-2.0
- CUDA top-k kernel uses temperature in softmax computation
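The temperature's effect on routing sharpness can be illustrated with a plain softmax over expert logits (a sketch of the mathematics, not the CUDA kernel):

```python
# Temperature-scaled softmax over expert logits: temperature < 1 sharpens
# the distribution, temperature > 1 pushes it toward uniform.
import math

def routing_probs(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                             # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
sharp = routing_probs(logits, temperature=0.5)  # concentrated on expert 0
soft = routing_probs(logits, temperature=2.0)   # closer to uniform
print(max(sharp), max(soft))
```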
enable_expert_dropout(dropout_prob: float) -> None
- Enables expert-level dropout during training
- dropout_prob: Probability of dropping each expert (typical: 0.1-0.2)
- Prevents over-reliance on specific experts
- Disabled during evaluation automatically
- CUDA routing kernel handles dropout masking
get_expert_statistics() -> Dict
- Returns comprehensive expert metrics including CUDA performance
- Keys: per_expert_utilization, routing_entropy, load_balance_loss, tokens_dropped, cuda_routing_time, cuda_dispatch_time
- Per-layer and aggregate statistics
- Updated each forward pass
MoD Control:
adjust_mod_capacity(capacity_factor: float) -> None
- Updates fraction of tokens using full computation
- Values: 0.0-1.0 (0% to 100% of tokens)
- Lower values: More compute savings, potential quality impact
- Typical range: 0.3-0.7
get_mod_statistics() -> Dict
- Returns MoD efficiency metrics
- Keys: capacity_factor, tokens_processed, tokens_skipped, average_depth, per_layer_usage, compute_savings, cuda_overhead
- Tracks actual vs configured capacity
- Compute savings as FLOP reduction percentage
Batch Management:
adjust_batch_size(new_batch_size: int) -> None
- Dynamically changes micro batch size
- Recreates dataloader with new batch size
- Adjusts gradient accumulation to maintain effective batch size
- Typically used for OOM recovery (reduces batch size automatically)
- CUDA kernel launch configurations adapted to new batch size
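The effective-batch-size bookkeeping this implies can be sketched as follows; `rebalance` is an illustrative helper, not the framework's internal code:

```python
# When the micro batch shrinks (e.g. OOM recovery), gradient accumulation
# grows to keep the effective batch size constant.
def rebalance(micro_batch: int, grad_accum: int, new_micro_batch: int):
    effective = micro_batch * grad_accum
    new_accum = max(1, round(effective / new_micro_batch))
    return new_micro_batch, new_accum

# OOM recovery: halve the micro batch, double the accumulation.
print(rebalance(16, 4, 8))  # -> (8, 8), effective batch stays 64
```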
Emergency Recovery:
emergency_lr_reduction(reduction_factor: float) -> None
- Reduces learning rate by specified factor
- Triggered by gradient explosion (norm > threshold)
- Typical reduction: 5-10×
- Logs emergency action with reasoning
rollback_steps(num_steps: int) -> None
- Reverts training to previous checkpoint
- Loads checkpoint from num_steps earlier
- Resets optimizer and scheduler state
- Used for divergence recovery
- CUDA kernel state restored from checkpoint
Optimizer Control:
adjust_weight_decay(weight_decay: float) -> None
- Updates L2 regularization strength
- Typical values: 0.01-0.1
- Higher values: Stronger regularization, may slow learning
- Lower values: Less regularization, may overfit
Metrics Query:
get_current_metrics() -> Dict
- Real-time training state snapshot including CUDA stats
- Keys: loss, learning_rate, grad_norm, throughput, memory_usage, epoch, step, cuda_speedup, kernel_timing
- Updated every forward/backward pass
- Used by orchestrator for decision making
CUDA Performance:
get_cuda_performance_stats() -> Dict
- Detailed CUDA kernel performance metrics
- Keys: per_kernel_timing, speedup_factors, memory_efficiency, occupancy, fallback_count
- Updated every N steps (configurable)
- Used for performance optimization and debugging
optimize_cuda_performance() -> None
- Analyzes CUDA kernel usage and tunes parameters
- Adjusts launch configurations based on occupancy
- Identifies and mitigates bottlenecks
- Logs optimization decisions with expected impact
Lower layer capacity factors reduce FLOPs without significantly affecting performance.
Model architecture:
- `hidden_size`: Embedding and hidden dimension (128-20480, prefer multiples of 256 for CUDA efficiency)
- `num_layers`: Transformer layer count (2-120)
- `num_heads`: Attention head count (2-160)
- `num_kv_heads`: KV cache heads for GQA (2-40, typically num_heads/4)
- `intermediate_size`: FFN intermediate dimension (typically 8/3 × hidden_size, rounded to a multiple of 256 for CUDA)
- `max_position_embeddings`: Maximum sequence length (128-32768)
- `vocab_size`: Tokenizer vocabulary size (typically 32000-100000)
MoE parameters:
- `use_moe`: Enable MoE (boolean)
- `num_experts`: Expert count per layer (4-64, typically 8)
- `moe_top_k`: Experts activated per token (1-4, typically 2)
- `capacity_factor`: Token capacity multiplier (1.0-2.0, typically 1.25)
- `load_balancing_weight`: Auxiliary loss coefficient (0.001-0.1, typically 0.01)
- `routing_temperature`: Softmax temperature (0.1-2.0, typically 1.0)
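The auxiliary loss that `load_balancing_weight` scales follows the Switch Transformer formulation referenced earlier; a minimal sketch of that formula:

```python
# Switch-Transformer-style load balancing loss:
# num_experts * sum(fraction_of_tokens_i * mean_router_prob_i),
# minimized (value 1.0) when routing is perfectly uniform.
def load_balance_loss(token_fractions, mean_probs, weight=0.01):
    n = len(token_fractions)
    aux = n * sum(f * p for f, p in zip(token_fractions, mean_probs))
    return weight * aux

# 8 experts: balanced routing gives the minimum 1.0; routing everything
# to two experts quadruples the penalty.
balanced = load_balance_loss([1/8] * 8, [1/8] * 8, weight=1.0)
skewed = load_balance_loss([0.5, 0.5] + [0.0] * 6,
                           [0.5, 0.5] + [0.0] * 6, weight=1.0)
print(balanced, skewed)  # 1.0 4.0
```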
MoD parameters:
- `use_mod`: Enable MoD (boolean)
- `mod_capacity_factor`: Fraction using full computation (0.1-1.0, typically 0.5)
- `mod_routing_type`: Routing mechanism ('learned', 'static', 'random')
- `mod_start_layer`: First layer with MoD (0-num_layers)
- `mod_end_layer`: Last layer with MoD (None = all layers)
Training parameters:
- `num_epochs`: Training duration in epochs (1-100)
- `batch_size`: Micro batch size per GPU (1-128)
- `gradient_accumulation_steps`: Accumulation before optimizer step (1-128)
- `learning_rate`: Optimizer learning rate (1e-5 to 1e-3)
- `weight_decay`: L2 regularization (0.0-0.1, typically 0.01)
- `gradient_clip_val`: Gradient norm clipping (0.5-5.0, typically 1.0)
- `warmup_steps`: Learning rate warmup duration (steps or fraction)
Precision parameters:
- `precision`: Training precision ('auto', 'fp32', 'fp16', 'bf16', 'mixed_fp16', 'mixed_bf16', 'fp8_e4m3')
- `inference_precision`: Evaluation precision (same options as training)
Optimization parameters:
- `use_flash_attention`: Enable Flash Attention (boolean, auto-detected)
- `use_fused_rmsnorm`: Enable fused RMSNorm kernel (boolean, default `false`)
- `use_fused_rope`: Enable fused RoPE kernel (boolean, default `true`)
- `use_fused_swiglu`: Enable fused SwiGLU kernel (boolean, default `true`)
- `use_fused_moe`: Enable CUDA MoE routing/dispatch kernels (boolean, default `true`)
- `use_fused_loss`: Enable fused loss kernel (boolean, default `true`)
- `use_fused_grad_clip`: Enable fused gradient clipping kernel (boolean, default `true`)
- `validate_moe_cuda_indices`: Enable strict CUDA MoE index validation (boolean, default `false`)
- `force_dense_expert_grads`: Force dense expert gradient path for all experts (boolean, default `false`)
- `routing_stats_update_interval`: Steps between routing stats sync/updates (int, default `64`)
- `mod_routing_stats_update_interval`: Steps between MoD stats updates (int, default `64`)
- `metric_history_size`: Bounded in-memory training metric window (int, default `2048`)
- `routing_history_size`: Bounded in-memory routing metric window (int, default `512`)
- `gradient_checkpointing`: Activation checkpointing (boolean)
- `compile`: PyTorch 2.0 compilation (boolean)
- `use_deepspeed`: Enable DeepSpeed (boolean)
- `zero_stage`: ZeRO optimization level (0-3)
- `cpu_offload`: Offload optimizer to CPU (boolean)
Data parameters:
- `training_mode`: Data handling ('base_only', 'finetuning_only', 'hybrid_sequential', 'hybrid_interleaved')
- `base_paths`: List of base training files
- `finetuning_paths`: List of fine-tuning files
- `base_eval_paths`: Base validation files
- `finetuning_eval_paths`: Fine-tuning validation files
- `base_ratio`: Mixing ratio for interleaved mode (0.0-1.0)
- `mask_user_tokens`: Mask user messages in loss (boolean)
- `pin_memory`: Pinned host memory for faster CPU→GPU transfer (boolean, default `true` on CUDA)
- `prefetch_factor`: DataLoader prefetch depth when `num_workers > 0` (int, default `4`)
Orchestrator parameters:
- `use_adaptive_training`: Enable orchestrator (boolean)
- `intervention_threshold`: Confidence required for intervention (0.0-1.0, typically 0.75)
- `check_interval`: Steps between health checks (10-1000, typically 100)
- `enable_emergency_recovery`: Allow emergency interventions (boolean)
- `enable_architecture_adaptation`: Allow architecture changes (boolean)
Chinchilla parameters:
- `auto_epoch_scaling`: Enable automatic epoch calculation (boolean)
- `chinchilla_multiplier`: Token multiplier (5-50, typically 20)
- `min_auto_epochs`: Minimum epochs (1-10)
- `max_auto_epochs`: Maximum epochs (10-100)
- `enable_loss_landscape`: Track loss patterns (boolean)
- `enable_compute_efficiency`: Track efficiency metrics (boolean)
- `enable_early_stopping`: Allow early termination (boolean)
Checkpoint parameters:
- `save_every_n_batches`: Checkpoint interval in steps (100-10000)
- `save_total_limit`: Maximum checkpoints to keep (1-100)
- `early_stopping_patience`: Epochs without improvement before stopping (3-20)
CUDA parameters:
- `cuda_kernel_path`: Path to compiled CUDA kernels (default: auto-detect)
- `enable_cuda_profiling`: Enable detailed kernel profiling (boolean)
- `cuda_profile_interval`: Steps between profiling snapshots (1000-10000)
Automatic handling: Orchestrator detects OOM exceptions, reduces batch size by 50%, recreates dataloader, resumes training from last checkpoint. CUDA kernel buffers automatically adjusted.
Manual interventions:
- Reduce `batch_size`: Start with 1-2 for very large models
- Increase `gradient_accumulation_steps`: Maintains effective batch size with less memory
- Enable `gradient_checkpointing`: Trades compute for memory (recompute activations)
- Increase `zero_stage`: 1→2→3 for progressively more memory optimization
- Enable `cpu_offload`: Moves optimizer states to CPU (slower but massive memory savings)
- Reduce `max_position_embeddings`: Shorter sequences use less memory
- Lower model size: Try a smaller preset configuration
- Check CUDA kernel memory: Temporarily disable fused CUDA ops (`use_fused_* = false`) or set `use_cuda=false`
Memory estimation:
- Model memory (FP16) ≈ 2 bytes × total_parameters
- Optimizer memory (Adam) ≈ 8 bytes × parameters
- Gradient memory ≈ 2 bytes × parameters
- Activation memory ≈ 2 × batch_size × sequence_length × num_layers × hidden_size
- CUDA kernel buffers ≈ 100-500 MB (temporary buffers)
- Total ≈ 12-16 bytes per parameter + activation memory + kernel overhead
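A worked example of the static terms of this estimate for a 7B-parameter model (activations and kernel buffers excluded, since they depend on batch shape):

```python
# Static memory for mixed-precision training with Adam:
# 2 bytes/param weights + 8 bytes/param optimizer + 2 bytes/param gradients.
def static_memory_gb(params: int) -> dict:
    model = 2 * params       # FP16 weights
    optimizer = 8 * params   # Adam moments in FP32 (m and v)
    grads = 2 * params       # FP16 gradients
    return {k: v / 1e9 for k, v in
            {"model": model, "optimizer": optimizer, "grads": grads,
             "total": model + optimizer + grads}.items()}

print(static_memory_gb(7_000_000_000))  # total ~84 GB before activations
```

The 84 GB total matches the low end of the 12-16 bytes-per-parameter rule above, which is why a 7B model needs ZeRO sharding or CPU offload even on an 80 GB GPU.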
Gradient explosion: Symptoms: Loss becomes NaN, gradient norm > 100, rapid loss increase
Automatic recovery: Orchestrator detects high gradient norm, triggers emergency LR reduction (10×), rolls back to previous checkpoint, resumes with lower LR. CUDA fused gradient clipping prevents most explosions.
Manual fixes:
- Lower `learning_rate`: Try a 10× reduction
- Tighten `gradient_clip_val`: Clip at a lower threshold (0.5 instead of 1.0)
- Use mixed precision: BF16 is more stable than FP16
- Enable gradient checkpointing: Can improve numerical stability
- Check data: Outliers or corrupted samples can cause explosions
- Verify CUDA kernels: Check for numerical issues by setting `use_fused_* = false` temporarily
Loss divergence: Symptoms: Loss increases consistently, validation loss >> training loss, sudden loss spikes
Automatic recovery: Orchestrator detects divergence pattern, rolls back N steps, adjusts learning rate, may modify architecture parameters.
Manual fixes:
- Reduce `learning_rate`: Start 3-5× lower
- Increase `weight_decay`: Stronger regularization (0.1 instead of 0.01)
- Check data quality: Remove outliers, validate preprocessing
- Reduce model capacity: Overparameterized models may not converge on small datasets
- Check CUDA kernel correctness: Verify against the PyTorch fallback
Expert collapse (MoE): Symptoms: All tokens route to 1-2 experts, routing entropy < 1.0, most experts have near-zero utilization
Automatic recovery: Orchestrator detects imbalance, increases load_balancing_weight, adjusts routing_temperature, may prune/add experts. CUDA routing kernel continues to function correctly during recovery.
Manual fixes:
- Increase `load_balancing_weight`: Try 0.02 or 0.05 (from 0.01)
- Increase `capacity_factor`: Allow more tokens per expert (1.5 or 2.0)
- Adjust `routing_temperature`: Higher values (1.5-2.0) encourage uniform routing
- Enable `expert_dropout`: Forces routing to all experts
- Check initialization: Poorly initialized experts may never activate
- Monitor CUDA routing performance: Ensure kernels are functioning correctly
Automatic optimization: Orchestrator monitors throughput, detects degradation, suggests optimizations (enable compilation, adjust batch size, check data loading bottlenecks, verify CUDA kernel usage).
Manual optimizations:
- Enable `compile`: PyTorch 2.0 compilation (5-30% speedup)
- Enable fused CUDA ops: `use_fused_rope`, `use_fused_swiglu`, `use_fused_moe`, `use_fused_loss`, `use_fused_grad_clip`
- Keep `use_fused_rmsnorm=false` unless profiling shows a gain on your workload/GPU
- Enable `use_flash_attention`: 2-4× attention speedup on Ampere+
- Use `mixed_bf16` or `mixed_fp16`: 2× speedup over FP32
- Increase `num_workers`: Parallelize data loading (typically 4-8)
- Ensure `pin_memory=true` and tune `prefetch_factor` (typically 2-8)
- Increase `batch_size`: Better GPU utilization (if memory allows)
- Disable `gradient_checkpointing`: Faster, at the cost of more memory
- Check I/O: Move the dataset to a fast SSD, use memory-mapped files
- Profile CUDA: Use `enable_cuda_profiling=True` to identify bottlenecks
Bottleneck identification:
- GPU utilization < 80%: CPU or data loading bottleneck
- Low throughput with high GPU utilization: Model or algorithm bottleneck
- Throughput decreases over time: Memory fragmentation or thermal throttling
- Inconsistent throughput: Data loading variance or OS interference
- CUDA kernel fallbacks: Check compilation, verify GPU compatibility
- High CUDA overhead: Check kernel launch efficiency with profiling
Common issues:
- Training instability: MPS precision handling differs from CUDA
- Compilation failures: PyTorch MPS backend less mature
- Memory pressure: Unified memory competition with OS
- No CUDA acceleration: Custom kernels unavailable
Solutions:
- Set `compile=False`: Disable compilation if unstable
- Set `num_workers=0`: MPS prefers single-threaded data loading
- Set all CUDA fused ops to false: `use_fused_rmsnorm`, `use_fused_rope`, `use_fused_swiglu`, `use_fused_moe`, `use_fused_loss`, `use_fused_grad_clip`
- Reduce `batch_size`: Start conservative (2-4)
- Monitor Activity Monitor: Check memory pressure and GPU usage
- Update PyTorch: The MPS backend is improving rapidly; use the latest version
- Fall back to CPU: If MPS is unreliable, CPU training is an alternative (slower but stable)
- Expect slower training: 4-5× slower than a CUDA-enabled GPU
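As with the CUDA case, those conservative MPS settings can be gathered into one place. The flag names mirror the options above; the dict form is a sketch, not the framework's actual config class:

```python
# Conservative Apple Silicon (MPS) settings sketch. Flag names mirror the
# documented options; the dict container is illustrative only.
mps_safe_config = {
    "compile": False,             # disable compilation if unstable
    "num_workers": 0,             # MPS prefers single-threaded data loading
    "batch_size": 2,              # start conservative (2-4)
    "use_fused_rmsnorm": False,   # all CUDA fused ops off: kernels unavailable
    "use_fused_rope": False,
    "use_fused_swiglu": False,
    "use_fused_moe": False,
    "use_fused_loss": False,
    "use_fused_grad_clip": False,
}
```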
Corruption: Symptoms: Checkpoint fails to load, missing keys, size mismatch
Recovery: System automatically tries previous checkpoints (latest → latest-1 → best validation). If all corrupt, restart from initialization.
Prevention: Set `save_total_limit` > 3, save to reliable storage, and validate checksums. CUDA kernel state is saved separately for recovery.
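The automatic fallback described above (latest → latest-1 → best validation) amounts to trying checkpoints in priority order until one deserializes cleanly. A minimal sketch, where `load_fn` stands in for however checkpoints are loaded (e.g. `torch.load(path, map_location="cpu")`):

```python
def load_with_fallback(ckpt_paths, load_fn):
    """Try checkpoints in priority order: latest -> latest-1 -> best validation.

    load_fn is a stand-in for the real deserializer; any exception
    (missing keys, size mismatch, truncated file) moves on to the next path.
    """
    errors = {}
    for path in ckpt_paths:
        try:
            return path, load_fn(path)
        except Exception as exc:  # corrupt checkpoint: record and continue
            errors[path] = exc
    raise RuntimeError(f"All checkpoints failed to load: {errors}")
```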
Resume failures: Symptoms: Training resumes but loss resets, optimizer state lost, different results than before
Causes: Incomplete checkpoint save, random seed mismatch, configuration mismatch, CUDA kernel version change
Solutions: Verify checkpoint integrity before resume, ensure configuration matches checkpoint, check random seed restoration, recompile CUDA kernels if updated.
Format errors: Symptoms: Training fails during data loading, "invalid JSON" or "unexpected format" errors
Solutions: Validate the data format with the provided validation scripts; check for missing fields, incorrect JSON structure, encoding issues (use UTF-8), and empty files or lines.
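If your dataset is JSONL, a minimal validator covering those failure modes might look like the following. The `required_fields` default is an assumption about your schema; adjust it to match your data:

```python
import json

def validate_jsonl(path, required_fields=("text",)):
    """Report (line_number, problem) pairs for a JSONL dataset.

    Catches the common failure modes: empty lines, invalid JSON,
    and records missing required fields. Opens as UTF-8 explicitly.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                problems.append((lineno, "empty line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            for field in required_fields:
                if field not in record:
                    problems.append((lineno, f"missing field: {field}"))
    return problems
```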
Quality problems: Symptoms: Training succeeds but poor results, high validation loss, model outputs nonsense
Causes: Data contamination, label errors, poor quality samples, distribution mismatch
Solutions: Enable automatic_data_cleaning, increase quality_threshold, manually inspect samples, check train/validation split, verify preprocessing correctness.
Compilation failures: Symptoms: Kernels fail to compile, nvcc errors, missing libraries
Solutions:
- Verify CUDA toolkit installation: `nvcc --version`
- Check compute capability: Ensure the GPU is supported (sm_75+)
- Install build tools: `sudo apt-get install build-essential`
- Check PyTorch CUDA version: Must match the CUDA toolkit
- Try manual compilation: `cd Src/Main_Scripts/core && ./compile_transformer_ops.sh && ./compile_cuda_moe.sh && cd ../training && ./compile_kernels.sh`
- Check compiler flags: Verify architecture flags match your GPU
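Several of those checks can be gathered programmatically from PyTorch itself. A sketch (the function name is illustrative; `torch.version.cuda` reports the CUDA version PyTorch was built against, which must match the toolkit `nvcc` reports):

```python
import torch

def cuda_kernel_env_report():
    """Collect the facts the compilation checklist above asks about."""
    report = {
        "cuda_available": torch.cuda.is_available(),
        # CUDA version PyTorch was built against (None on CPU-only builds);
        # must match the installed toolkit reported by `nvcc --version`.
        "torch_cuda_build": torch.version.cuda,
    }
    if report["cuda_available"]:
        major, minor = torch.cuda.get_device_capability()
        report["compute_capability"] = f"sm_{major}{minor}"
        report["supported"] = (major, minor) >= (7, 5)  # kernels need sm_75+
        report["device_name"] = torch.cuda.get_device_name()
    return report
```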
Runtime errors: Symptoms: CUDA kernel crashes, incorrect results, NaN outputs
Solutions:
- Verify numerical correctness: Compare with the PyTorch fallback
- Check input shapes: Ensure dimensions match kernel expectations
- Enable debugging: Set `CUDA_LAUNCH_BLOCKING=1`
- Check memory: Verify sufficient GPU memory for kernel buffers
- Update kernels: Recompile after a PyTorch/CUDA update
- Disable temporarily: Set `use_fused_*=false` (or `use_cuda=false`) to isolate the issue
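"Compare with the PyTorch fallback" can be done mechanically. A sketch for RMSNorm, where `fused_fn` stands in for the framework's CUDA op and the tolerances are placeholders you should tune per precision:

```python
import torch

def reference_rmsnorm(x, weight, eps=1e-6):
    # Pure-PyTorch reference used to validate a fused kernel's output.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

def check_kernel_against_fallback(fused_fn, x, weight, atol=1e-3, rtol=1e-3):
    """Assert a fused kernel matches the PyTorch fallback within tolerance.

    fused_fn is a stand-in for the framework's CUDA op; also checks for
    NaN/Inf, the symptom called out above.
    """
    expected = reference_rmsnorm(x, weight)
    actual = fused_fn(x, weight)
    assert torch.isfinite(actual).all(), "NaN/Inf in kernel output"
    assert torch.allclose(actual, expected, atol=atol, rtol=rtol), (
        f"max abs diff {(actual - expected).abs().max().item():.3e}"
    )
```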
Performance issues: Symptoms: CUDA kernels slower than expected, low speedup, high overhead
Solutions:
- Enable profiling: `enable_cuda_profiling=True`
- Check occupancy: Use `get_cuda_performance_stats()`
- Verify batch size: Small batches have higher launch overhead
- Check launch config: May need tuning for a specific GPU
- Monitor fallbacks: Excessive PyTorch fallbacks indicate issues
- Compare architectures: Older GPUs see lower speedups
Considerations for deploying trained models in production environments.
Checkpoint format: Standard PyTorch state dict compatible with transformers library. Can export to HuggingFace format, ONNX (for inference optimization), TorchScript (for deployment), or TensorRT (for NVIDIA inference).
Conversion process:
- Load final checkpoint
- Extract model state dict
- Convert to target format
- Validate outputs match original
- Benchmark inference performance
- Optimize CUDA kernels for inference (if applicable)
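For a TorchScript target, the steps above map onto a short helper. A sketch (the function name is illustrative; `torch.jit.trace` is one of several conversion paths, and the tolerance is a placeholder):

```python
import torch

def export_torchscript(model, example_input, out_path):
    """Sketch of the conversion steps: load/extract, convert, validate, save."""
    model.eval()
    with torch.no_grad():
        reference = model(example_input)                   # baseline outputs
        scripted = torch.jit.trace(model, example_input)   # convert to target format
        converted = scripted(example_input)
        # Validate outputs match the original before deploying.
        assert torch.allclose(reference, converted, atol=1e-5)
    scripted.save(out_path)                                # serialized artifact
    return scripted
```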
Size optimization:
- Weight pruning: Remove low-magnitude weights
- Quantization: INT8 or INT4 for deployment
- Knowledge distillation: Train smaller dense model from sparse model
- Expert merging: Combine similar experts in MoE models
- Kernel fusion: Combine operations for inference
Quantization strategies:
- Post-training quantization: INT8 via bitsandbytes or GPTQ (4-bit)
- Quantization-aware training: Train with quantization simulation
- Dynamic quantization: Quantize activations at runtime
- Mixed precision inference: FP16 or INT8 depending on layer
- CUDA kernel quantization: Custom kernels for quantized operations
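As one concrete example of the dynamic strategy above, PyTorch's built-in post-training dynamic quantization converts `Linear` weights to INT8 and quantizes activations at runtime (CPU inference; the wrapper function name is illustrative):

```python
import torch

def quantize_for_inference(model):
    """Post-training dynamic quantization sketch: INT8 weights for Linear
    layers, with activations quantized on the fly at runtime."""
    return torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```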
Batching:
- Dynamic batching: Group requests by sequence length
- Continuous batching: Add requests to in-flight batches
- Request queuing: Balance latency and throughput
- CUDA stream management: Overlap computation and data transfer
KV cache management:
- Cache quantization: INT8 KV cache (2× memory reduction)
- Cache eviction: Drop old tokens for long conversations
- Paged attention: Efficient memory allocation (vLLM)
- CUDA-optimized caching: Custom kernels for cache operations
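The INT8 cache-quantization idea can be sketched with simple per-tensor symmetric quantization: an INT8 cache uses one byte per element instead of two for FP16, giving the 2× reduction above at the cost of a small dequantization error (function names are illustrative; production KV caches typically quantize per-head or per-channel):

```python
import torch

def quantize_kv(cache, bits=8):
    """Per-tensor symmetric INT8 quantization of a KV-cache tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = cache.abs().amax().clamp(min=1e-8) / qmax      # one scale per tensor
    q = torch.clamp(torch.round(cache / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q, scale):
    """Recover an approximate FP32 cache from INT8 values and the scale."""
    return q.float() * scale
```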
Serving frameworks:
- vLLM: High-throughput inference with paged attention
- TensorRT-LLM: NVIDIA-optimized inference with custom kernels
- Text Generation Inference: HuggingFace serving
- Custom deployment: Direct PyTorch or ONNX Runtime with CUDA kernels
Inference metrics:
- Latency: Time to first token, total generation time, CUDA kernel overhead
- Throughput: Requests per second, tokens per second
- Quality: Output validation, coherence scoring
- Resource usage: GPU memory, CPU utilization, kernel efficiency
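Time to first token and tokens per second can be measured against any streaming generation API. A sketch, where `generate_stream` stands in for whatever iterator your serving stack yields tokens from:

```python
import time

def generation_latency(generate_stream):
    """Measure time-to-first-token (TTFT), total generation time, and
    throughput for a token-yielding iterator."""
    start = time.perf_counter()
    first_token_latency = None
    n_tokens = 0
    for _ in generate_stream:
        if first_token_latency is None:
            first_token_latency = time.perf_counter() - start  # TTFT
        n_tokens += 1
    total = time.perf_counter() - start
    tokens_per_s = n_tokens / total if total > 0 else 0.0
    return {
        "ttft_s": first_token_latency,
        "total_s": total,
        "tokens_per_s": tokens_per_s,
    }
```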
Model drift detection:
- Input distribution monitoring
- Output quality tracking over time
- Comparison to validation benchmarks
- Automatic retraining triggers
This system is available under a commercial license. Framework provided for evaluation via demo notebook. Production use requires license agreement.
License tiers:
Research/Academic:
- Non-commercial research and educational use
- Academic institutions, non-profit research
- Publications require citation
- No redistribution of modified code
- Includes CUDA kernels for research purposes
- Pricing: Contact for academic pricing
Startup:
- Companies with <10 employees, <$1M revenue
- Internal use only (no redistribution)
- Includes updates and bug fixes
- Email support
- CUDA kernel source code included
- Pricing: Contact for startup pricing
Enterprise:
- Larger organizations
- Internal use and customer deployments
- Includes updates, bug fixes, security patches
- Priority support, SLA options
- Custom modifications available
- CUDA kernel customization support
- Pricing: Contact for enterprise pricing
Contact: matiasnhmb@gmail.com
Evaluation license: Demo notebook and local installation for evaluation purposes. 30-day evaluation period. No production use. Watermarked outputs during evaluation. CUDA kernels included for testing.
Adaptive Training Manual: docs/adaptive_training.md
- Orchestrator internals
- Intervention method details
- Configuration recommendations
- Case studies and examples
- Performance optimization strategies
MoE/MoD Tutorial: docs/sparse_architectures.md
- Sparse architecture theory
- Implementation details
- Training best practices
- Debugging sparse models
- CUDA-accelerated MoE operations
Adapter Tutorial: docs/adapters.md
- Deep dive into architecture + implementation details
- Step-by-step guides for building, training, and plugging in adapters
- Practical examples + gotchas (routing, load balancing, perf tuning)
- Best practices for scaling and debugging
CUDA Acceleration Guide: docs/cuda_acceleration.md
- Fused transformer primitives: replacements for the high-frequency operations that dominate runtime in dense and MoE models
- Full MoE routing pipeline on GPU: routing → dispatch → expert compute → recombination
- Fused training operations: collapsing multi-step operations into single GPU passes to eliminate synchronization points
- Mixed precision, streams, and overlap strategies
@software{AdaptiveTrainingSystem2025,
title = {Adaptive Training System: Modular Transformer Training with MoE/MoD},
author = {MatN23},
year = {2025},
url = {https://github.com/matn23/AdaptiveTrainingSystem},
note = {Production-grade training framework with adaptive optimization and CUDA acceleration}
}

Issues: GitHub issue tracker for bug reports and feature requests
Discussions: GitHub discussions for questions and community support
Email: matiasnhmb@gmail.com for licensing and technical inquiries

