
Adaptive Training System

Production Transformer Training Framework with MoE/MoD Architecture & CUDA Acceleration






Overview

Adaptive Training System is a production-grade transformer training framework implementing Mixture of Experts (MoE) and Mixture of Depths (MoD) architectures with autonomous training optimization and custom CUDA acceleration kernels. Supports models from 500M to 300B+ parameters with enterprise infrastructure.

Core capabilities:

  • Sparse architectures: MoE (8-64 experts), MoD (dynamic depth), hybrid configurations
  • CUDA acceleration: Custom kernels for RMSNorm (3-4x faster), RoPE (2-4x faster), SwiGLU (2-3x faster), MoE routing (2-4x faster), fused loss computation
  • Metal acceleration: Custom Metal shaders for Apple Silicon - RMSNorm (2-3x faster), RoPE (3-5x faster), SwiGLU (2-3x faster), MoE routing
  • Adaptive orchestrator: 18 autonomous intervention methods for training optimization
  • Chinchilla scaling: Automatic epoch calculation based on compute-optimal principles
  • Multi-GPU training: DeepSpeed ZeRO (stages 1-3), FSDP, ColossalAI with efficient gradient synchronization
  • Precision support: FP32, FP16, BF16, mixed precision, FP8 (H100+ via Triton)
  • Advanced Quantization: 4-bit/8-bit support via AutoGPTQ and Optimum Quanto
  • Hardware targets: CUDA (Volta-Hopper), Apple Silicon (M1-M4) with Metal acceleration, CPU
  • Data handling: Memory-mapped datasets, Apache Arrow zero-copy, automatic caching
  • Router Optimization: Fine-tuning mode and adapter loading for MoE routers
  • Recovery systems: Automatic OOM handling, gradient explosion recovery, checkpoint rollback

Framework positioning:

This is a complete training system with custom CUDA kernels, not a model zoo or API wrapper. Every component from tokenization to fused gradient operations is included. MoE and MoD implementations follow established research (Switch Transformer, Mixture of Experts, Mixture-of-Depths) with operational additions and CUDA-accelerated execution: dynamic expert management, capacity tuning, load balancing, routing analytics.

The adaptive orchestrator monitors 20+ metrics every N steps and triggers interventions across hyperparameters, architecture, and recovery procedures. Maintains decision history with confidence scoring to prevent excessive intervention.

Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical operations while maintaining gradient compatibility and numerical stability. Metal shaders provide 2-5x speedup on Apple Silicon (M1-M4). All kernels include automatic fallback to PyTorch when accelerated backends are unavailable.

Intended for:

  • ML engineers requiring full training stack control with maximum performance
  • Research teams prototyping sparse architectures with production-grade infrastructure
  • Organizations with proprietary data, compliance requirements, and performance constraints
  • Teams needing framework-independent infrastructure with custom optimization capabilities

Not included:

  • Pre-trained weights or checkpoints (configuration presets only; all training starts from scratch)
  • High-level abstractions (direct control provided)
  • Tutorial content (assumes ML engineering background)

Technical Architecture

Dense Transformers

Standard architecture with LLaMA/GPT-NeoX design patterns:

  • Pre-normalization: RMSNorm before attention/FFN (CUDA-accelerated, 3-4x faster)
  • Grouped Query Attention: Reduces KV cache via shared KV heads (typical ratio 4:1 or 8:1)
  • Rotary Position Embeddings: Length generalization with configurable theta (10000 base, 1000000 extended), CUDA-accelerated RoPE application (2-4x faster)
  • SwiGLU activation: Two-path gating in FFN, intermediate_size typically 8/3 × hidden_size, CUDA-accelerated (2-3x faster)
  • Optional Flash Attention 2.x: 2-4x speedup on Ampere+ GPUs with memory-efficient attention

Parameter calculation:

  • Embedding: vocab_size × hidden_size
  • Attention per layer: 2 × hidden_size² × (1 + num_kv_heads/num_heads), covering Q/output projections plus GQA-reduced K/V projections
  • FFN per layer: 3 × hidden_size × intermediate_size (SwiGLU gate, up, and down projections)
  • Output: vocab_size × hidden_size (optionally tied)
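As a sanity check, these formulas can be combined into a quick estimator. This is an illustrative sketch: the function name and the tied-embedding default are ours, not part of the framework's API.

```python
def estimate_dense_params(vocab_size, hidden_size, num_layers,
                          num_heads, num_kv_heads, intermediate_size,
                          tie_embeddings=True):
    """Rough parameter count for a dense LLaMA-style transformer."""
    embedding = vocab_size * hidden_size
    # Q and output projections are hidden^2 each; K/V shrink with GQA.
    attn_per_layer = 2 * hidden_size**2 * (1 + num_kv_heads / num_heads)
    # SwiGLU has three projections: gate and up (h -> i), down (i -> h).
    ffn_per_layer = 3 * hidden_size * intermediate_size
    output = 0 if tie_embeddings else vocab_size * hidden_size
    return int(embedding + num_layers * (attn_per_layer + ffn_per_layer) + output)
```

For example, hidden_size=1024, 24 layers, 16 heads with 4 KV heads, and intermediate_size=2816 comes out to roughly 300M parameters.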

Mixture of Experts (MoE)

Token-level sparse activation via learned routing to specialized FFN networks with CUDA-accelerated dispatch.

Routing mechanism:

  • Top-k gating: Each token routed to k of N experts (typical: k=2, N=8)
  • Router: Linear layer (hidden_size × num_experts) + softmax + TopK selection
  • CUDA optimization: Warp-based top-k kernel with shuffle reduction (2-4x faster than PyTorch)
  • Output: Weighted combination of selected expert outputs

Load balancing:

  • Auxiliary loss: Penalizes routing imbalance via expert utilization distribution
  • Capacity factor: Maximum tokens per expert = (total_tokens/num_experts) × capacity_factor
  • Typical capacity_factor: 1.25-1.5 (25-50% overflow buffer)
  • Load balancing weight: 0.01 (added to main loss)
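A minimal sketch of the top-k gating and a Switch-style auxiliary loss described above. Plain Python for clarity; the function names are illustrative, and real implementations operate on batched tensors.

```python
import math

def top_k_route(logits, k=2):
    """Softmax over expert logits, keep the top-k experts, and
    renormalize their probabilities to sum to 1."""
    m = max(logits)
    exp = [math.exp(x - m) for x in logits]     # max-subtraction for stability
    z = sum(exp)
    probs = [e / z for e in exp]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # (expert_idx, weight)

def load_balance_loss(assignments, probs_per_token, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_e f_e * P_e, where
    f_e is the fraction of tokens routed to expert e and P_e is the mean
    router probability for expert e. Equals 1.0 at perfect balance."""
    n = len(assignments)
    f = [assignments.count(e) / n for e in range(num_experts)]
    p = [sum(pt[e] for pt in probs_per_token) / n for e in range(num_experts)]
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))
```

The auxiliary loss is then scaled by the load balancing weight (0.01) and added to the main loss.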

Dynamic management:

  • Expert addition: Triggered when utilization exceeds threshold (typically 0.85)
  • Expert pruning: Removes experts below utilization threshold (typically 0.15)
  • Capacity adaptation: Adjusts based on token drop rate
  • Temperature tuning: Controls routing concentration (lower = sharper, higher = more uniform)

CUDA acceleration:

  • Token dispatch: Vectorized memory access (float4) with coalesced writes, reduced atomic contention
  • Expert combination: Optimized accumulation with local buffering before atomic adds
  • Memory efficiency: Batched operations reduce kernel launch overhead by 3-5x

Efficiency:

  • 8-expert top-2 MoE: 8× total expert parameters, with only the 2 selected experts active per token
  • Expert sparsity: 75% (2 of 8 experts means 25% of expert parameters active)
  • Memory: Scales with total parameters (all experts in memory)
  • Compute: Scales with active parameters only
  • CUDA speedup: 2-4x over PyTorch for routing and dispatch operations

Statistics tracked:

  • Per-expert utilization: Fraction of tokens routed to each expert
  • Routing entropy: Distribution concentration (max = log(num_experts))
  • Load balance loss: Auxiliary loss magnitude
  • Tokens dropped: Count exceeding capacity
  • Per-layer patterns: Early layers more uniform, later layers more concentrated

Mixture of Depths (MoD)

Layer-level sparse activation via learned skip decisions.

Core concept: Model learns which tokens require full layer computation vs. residual skip. Routing decision per token per layer based on token representation at layer input.

Routing types:

  • Learned: Small MLP scores tokens, top-capacity_factor selected for full processing
  • Static: Fixed pattern (e.g., all tokens full in early layers, reduced in later layers)
  • Random: Random selection maintaining capacity_factor (ablation baseline)

Capacity management:

  • capacity_factor controls fraction of tokens using full computation
  • 0.5 = 50% tokens full processing, 50% skip
  • Selection is learned during training, not random

Efficiency:

  • FLOPs reduction: ~(1 - capacity_factor) for layers with MoD
  • Typical: 30-50% FLOPs reduction with 0-2% perplexity increase
  • Quality/efficiency tradeoff: Lower capacity = more savings but larger quality impact
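The capacity selection and FLOPs arithmetic can be sketched as follows (illustrative helpers, not framework API):

```python
import math

def mod_select(scores, capacity_factor):
    """Keep the highest-scoring ceil(capacity_factor * n) tokens for full
    processing; the rest take the residual skip path."""
    k = max(1, math.ceil(capacity_factor * len(scores)))
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

def mod_flops_fraction(capacity_factor, mod_layers, total_layers):
    """Approximate fraction of baseline FLOPs when MoD is applied to
    mod_layers out of total_layers (dense layers keep full cost)."""
    dense = total_layers - mod_layers
    return (dense + mod_layers * capacity_factor) / total_layers
```

With capacity_factor=0.5 on every layer this gives the ~50% FLOPs reduction quoted above; applying MoD to only half the layers gives ~25%.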

Application strategies:

  • All layers: Maximum compute reduction
  • Later layers only: Preserve early feature extraction, reduce later specialization
  • Alternating: MoD every N layers
  • FFN-only: Dense attention, sparse FFN (common since FFN is 2/3 of compute)

Training dynamics:

  • Early training: Near-uniform routing (all tokens treated similarly)
  • Specialization: Model learns token complexity patterns over time
  • Curriculum: Can start with capacity_factor=1.0, gradually reduce to target
  • Annealing: Gradual capacity reduction prevents training instability

Hybrid (MoE + MoD)

Combined token-level (MoE) and layer-level (MoD) sparsity with coordinated CUDA acceleration.

Architecture:

  • Each layer: Dense attention + MoE FFN + MoD routing
  • MoD decides: Use full layer (attention + MoE) or skip via residual
  • If token uses layer, routes through MoE experts with CUDA-accelerated dispatch

Sparsity compounding:

  • Top-2 of 8 experts: 25% expert parameters active
  • 50% layer capacity: 50% tokens use layers
  • Combined: 0.5 × 0.25 = 12.5% active parameters per token
  • 87.5% total sparsity

Training considerations:

  • Both routing mechanisms must learn useful patterns
  • Load balancing for experts, capacity warmup for MoD
  • Routing temperature adaptation for both systems
  • Quality-aware guards prevent catastrophic sparsity collapse

Use cases:

  • Maximum efficiency: Largest models on limited compute
  • Fast experimentation: Smaller active compute enables rapid iteration
  • Inference optimization: Reduced memory and compute for deployment
  • Multi-task learning: Different experts and depths specialize per task

Router Optimization

Specialized Router Management:

  • Router Fine-tuning: Isolate and train only the routing layers (gate_proj) while freezing other parameters. Helps specialize routing logic without catastrophic forgetting in experts.
  • Adapter Loading: Load pre-trained router weights independently. Useful for transferring routing strategies between similar models.
  • Dynamic Fine-tuning: Enable router training during specific phases of the main training loop.
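Router fine-tuning amounts to freezing everything except parameters whose names match the router pattern. A hypothetical sketch over parameter names (the gate_proj pattern comes from above; the helper itself is ours):

```python
def split_router_params(param_names, router_pattern="gate_proj"):
    """Partition parameter names into (trainable router params, frozen rest).
    With a PyTorch model you would apply the same test to named_parameters():
        for name, p in model.named_parameters():
            p.requires_grad = router_pattern in name
    """
    train = [n for n in param_names if router_pattern in n]
    frozen = [n for n in param_names if router_pattern not in n]
    return train, frozen
```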

CUDA Acceleration

Custom CUDA kernels provide 2-7x speedup over PyTorch implementations for critical training operations. All kernels maintain full gradient compatibility and include automatic fallback.

Optimized Operations

1. Fused RMSNorm (3-4x faster)

  • Vectorized memory access: float4 operations for coalesced reads/writes
  • Optimized reductions: Warp shuffle + reduced shared memory usage
  • Key optimizations:
    • Single-pass algorithm (compute variance and normalize in one kernel)
    • Eliminated bank conflicts in shared memory
    • Better register allocation (maxrregcount=64)
    • Loop unrolling for common hidden sizes
  • Performance: ~3.2ms → ~0.9ms per 1000 tokens (hidden_size=768)
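The fused kernel computes the same math as this single-pass reference, written in plain Python for readability (the kernel vectorizes the same two steps, using the eps=1e-6 noted under Numerical Stability):

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """y_i = w_i * x_i / sqrt(mean(x^2) + eps). No mean subtraction,
    unlike LayerNorm, which is what makes single-kernel fusion simple."""
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [w * v * inv for w, v in zip(weight, x)]
```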

2. Rotary Position Embeddings (2-4x faster)

  • Optimized precompute: Vectorized cos/sin cache generation
  • Efficient application: Reduced redundant memory access
  • Key optimizations:
    • Process 2 dimension pairs per thread (better ILP)
    • Use of __ldg() for read-only cache optimization
    • FMA instructions (fmaf) for multiply-add fusion
    • Eliminated per-token position computation
  • Performance: ~8.5ms → ~1.3ms per batch (batch=4, seq=128, heads=12)

3. SwiGLU Activation (2-3x faster)

  • Vectorized computation: float4 for gate and up projections
  • Fast SiLU approximation: Optimized sigmoid computation
  • Key optimizations:
    • Fused gate/up computation in single kernel pass
    • Register tiling for matrix multiplication
    • Eliminated intermediate memory allocation
    • Vectorized output writes
  • Performance: ~5.1ms → ~1.8ms per 1000 tokens (intermediate=3072)

4. MoE Operations (2-4x faster)

  • Optimized top-k gating: Warp-level parallel reduction
  • Token dispatch: Coalesced memory access patterns
  • Expert combination: Reduced atomic operations
  • Key optimizations:
    • Warp-based top-k (no shared memory for k≤4)
    • Batched atomic increments per block
    • Vectorized token copies (float4)
    • Local accumulation before global writes
  • Performance: ~12ms → ~3.5ms per batch for routing+dispatch+combine

5. Fused Loss Computation

  • Cross-entropy + accuracy: Single-pass computation
  • Key optimizations:
    • Combined max/sum reduction for softmax
    • Parallel argmax using shared memory
    • Fused gradient computation
    • Eliminated redundant passes
  • Performance: ~4.2ms → ~1.1ms per batch (vocab_size=32000)
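Per token, the math the kernel fuses is the stable log-sum-exp form of cross-entropy plus an argmax hit (reference sketch; names illustrative):

```python
import math

def cross_entropy_and_correct(logits, target):
    """Numerically stable cross-entropy plus an argmax hit for one token,
    mirroring the kernel's combined max/sum reduction."""
    m = max(logits)                                   # max-subtraction
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    loss = lse - logits[target]                       # -log softmax[target]
    correct = logits.index(max(logits)) == target
    return loss, correct
```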

6. Fused Gradient Clipping

  • Fully async pipeline: No CPU-GPU synchronization
  • Key optimizations:
    • Compute norm on GPU (no D2H transfer)
    • Conditional clipping kernel launch
    • Pinned memory for async copy
    • Single synchronization point at end
  • Performance: ~8ms → ~1.5ms (including norm computation)

CUDA Kernel Speedups

Summary: custom kernels provide 2–7× speedup over vanilla PyTorch operations.

Compilation and Usage

Compilation requirements:

# Install CUDA toolkit (11.8+ or 12.x)
# Verify nvcc availability
nvcc --version

# Optional: force target architectures (comma or semicolon separated)
# export CUDA_TARGET_SM=75,80,86,89,90

# Compile transformer + MoE kernels
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh

# Compile training kernels
cd ../training
./compile_kernels.sh

Automatic kernel detection: Framework automatically detects and loads compiled kernels at runtime. Falls back to PyTorch if kernels unavailable. No code changes required to use CUDA acceleration.

Automatic JIT rebuild for current hardware:

  • If kernel .so files are missing or compiled for the wrong SM target, runtime wrappers trigger a rebuild automatically.
  • Target architecture resolution order: CUDA_TARGET_SM → TORCH_CUDA_ARCH_LIST → detected GPU compute capability → fallback sm_75.
  • Set CUDA_TARGET_SM when you need deterministic builds across machines.
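The resolution order might look like this in code (a sketch; the actual wrapper's parsing details may differ):

```python
def resolve_target_sm(env, detected_cc=None, fallback="75"):
    """CUDA_TARGET_SM -> TORCH_CUDA_ARCH_LIST -> detected capability -> sm_75."""
    if env.get("CUDA_TARGET_SM"):
        # comma or semicolon separated, e.g. "75,80,86"
        return env["CUDA_TARGET_SM"].replace(";", ",")
    if env.get("TORCH_CUDA_ARCH_LIST"):
        # PyTorch-style "8.0;8.6" -> "80,86"
        parts = env["TORCH_CUDA_ARCH_LIST"].replace(";", " ").split()
        return ",".join(p.replace(".", "") for p in parts)
    if detected_cc:  # e.g. (8, 6) from torch.cuda.get_device_capability()
        return f"{detected_cc[0]}{detected_cc[1]}"
    return fallback
```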

Supported architectures:

  • sm_75: Turing (T4, RTX 2080)
  • sm_80: Ampere (A100, RTX 3090)
  • sm_86: Ampere (RTX 3060/3070/3080)
  • sm_89: Ada Lovelace (RTX 4090)
  • sm_90: Hopper (H100, H200)

Performance monitoring:

from cuda_opt_wrapper import print_performance_summary

# After training
print_performance_summary()
# Shows per-operation timing, speedup metrics, throughput

Numerical Stability

All CUDA kernels maintain numerical stability equivalent to PyTorch:

  • RMSNorm: Epsilon term (1e-6) prevents division by zero
  • RoPE: No approximations, exact trigonometric functions
  • SwiGLU: Fast SiLU uses stable exp(-x) formulation
  • Loss: Max-subtraction for softmax numerical stability
  • Gradient clipping: Norm computation uses compensated summation
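Compensated summation is worth spelling out, since it is why the fused gradient norm stays accurate when squaring mixes very large and very small values. A plain-Python sketch of Kahan summation over squares:

```python
def kahan_sum_of_squares(values):
    """Compensated (Kahan) summation of squares, as used for the gradient
    norm: the compensation term c recaptures low-order bits lost when
    adding a small square to a large running total."""
    total, c = 0.0, 0.0
    for v in values:
        y = v * v - c
        t = total + y
        c = (t - total) - y
        total = t
    return total
```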

Validation: Every kernel includes PyTorch fallback for correctness verification. Automated tests compare CUDA vs PyTorch outputs (tolerance: 1e-4 for FP32, 1e-3 for FP16).


Model Configuration Presets

Pre-configured architecture presets for training models from scratch, spanning 500K to 300B parameters. These are configuration templates, not pre-trained models.

Each preset specifies architecture dimensions, MoE/MoD parameters, hardware targets, and expected performance with CUDA acceleration for initializing and training new models.

| Config | Active Params | Total Params | Hidden | Layers | Heads | KV Heads | Experts | Top-K | Hardware | Memory (FP16) | Throughput | CUDA Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| debug | 500K | 4M | 128 | 2 | 2 | 2 | 8 | 2 | Any | 50 MB | Testing | 2.1x |
| debug_200m | 200M | 6B | 768 | 12 | 12 | 12 | 32 | MoD | T4/MPS | 2 GB | Testing | 2.8x |
| b1 | 1B | 8B | 1024 | 24 | 16 | 4 | 8 | 2 | RTX 3090, M1 Max | 8 GB | 1400 tok/s | 3.2x |
| b7 | 7B | 56B | 4096 | 32 | 32 | 8 | 8 | 2 | A100 40GB | 28 GB | 680 tok/s | 3.5x |
| b14 | 14B | 112B | 5120 | 40 | 40 | 10 | 8 | 2 | A100 80GB | 56 GB | 340 tok/s | 3.8x |
| b30 | 30B | 240B | 8192 | 48 | 64 | 16 | 8 | 2 | 4× A100 80GB | 120 GB | 135 tok/s | 4.1x |
| b50 | 50B | 400B | 10240 | 56 | 80 | 20 | 8 | 2 | 4× H100 | 200 GB | 68 tok/s | 4.5x |
| b100 | 100B | 800B | 12288 | 80 | 96 | 24 | 8 | 2 | 8× H100 | 400 GB | 68 tok/s | 4.8x |
| b200 | 200B | 1.6T | 16384 | 100 | 128 | 32 | 8 | 2 | 16× H200 | 800 GB | 41 tok/s | 5.2x |
| b300 | 300B | 2.4T | 20480 | 120 | 160 | 40 | 8 | 2 | 32× H200 | 1.2 TB | 27 tok/s | 5.5x |

Memory estimates: Include model weights, optimizer states (Adam: 8 bytes/param), gradients, and activation memory at batch_size=1, mixed precision training. Actual memory scales with batch size and sequence length.


Throughput estimates: With CUDA acceleration enabled at batch_size=1, sequence_length=2048, mixed precision with gradient checkpointing. CUDA speedup column shows combined acceleration from all custom kernels vs. pure PyTorch.

Configuration selection:

  • Development/testing: debug for pipeline validation, debug_200m for architecture testing
  • Research: b1 for prototyping on consumer hardware with CUDA acceleration
  • Production fine-tuning: b7 for quality/efficiency balance with significant CUDA speedup
  • Large-scale pre-training: b30+ for maximum model capacity
  • Extreme scale: b100+ requires cluster infrastructure and distributed expertise

Important:

These presets define untrained model architectures. Training starts from random initialization following standard practices (Xavier/Kaiming initialization for weights, zero initialization for biases). The framework does not provide pre-trained checkpoints.

Customization:

All presets are starting points. When modifying architecture dimensions:

  • hidden_size must be divisible by num_heads
  • intermediate_size: typically 8/3 × hidden_size, rounded to the nearest multiple of 256 for optimal CUDA performance
  • max_position_embeddings determines the context window
  • num_experts and moe_top_k can be adjusted independently
  • MoD capacity_factor controls the compute/quality tradeoff
  • CUDA kernels automatically adapt to configuration changes
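The divisibility and rounding rules can be checked mechanically. A hypothetical helper, shown with the common 8/3 heuristic:

```python
def default_intermediate_size(hidden_size, num_heads):
    """Validate head divisibility, then round 8/3 * hidden_size to the
    nearest multiple of 256."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return 256 * round(hidden_size * 8 / 3 / 256)
```

For instance, default_intermediate_size(4096, 32) returns 11008, the familiar LLaMA-7B value.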


Precision Support

Numerical formats for parameters, activations, and gradients during training and inference.

Supported Precisions

FP32 (Float32) - Full Precision

  • 32-bit floating point (8-bit exponent, 23-bit mantissa)
  • Range: ±3.4×10^38, precision: ~7 decimal digits
  • Maximum stability, no special handling required
  • 2× memory vs FP16/BF16, significantly slower on modern hardware
  • Use cases: CPU training, numerical debugging, stability issues with reduced precision

FP16 (Float16) - Half Precision

  • 16-bit floating point (5-bit exponent, 10-bit mantissa)
  • Range: ±65504, precision: ~3 decimal digits
  • 50% memory reduction, ~2× speedup on supported hardware
  • Requires loss scaling to prevent gradient underflow (small gradients round to zero)
  • Dynamic or static loss scaling: multiply loss by 2^N before backward, unscale gradients before update
  • CUDA kernels maintain FP32 accumulation for numerical stability
  • Use cases: Volta/Turing GPUs (V100, T4, RTX 2080), Apple Silicon (M1-M4)

BF16 (BFloat16) - Brain Float16

  • 16-bit format (8-bit exponent, 7-bit mantissa)
  • Range: Same as FP32 (±3.4×10^38), reduced precision vs FP32
  • 50% memory reduction, similar speed to FP16
  • No loss scaling required (wide dynamic range like FP32)
  • Better training stability than FP16 with same memory benefits
  • CUDA kernels optimized for BF16 with hardware acceleration on Ampere+
  • Use cases: Ampere+ GPUs (A100, RTX 3090/4090, H100), primary recommendation for modern hardware

Mixed Precision FP16

  • Forward/backward in FP16, master copy of parameters in FP32
  • Optimizer updates FP32 master copy, then copies to FP16 for next forward
  • Dynamic loss scaling automatically adjusts to prevent underflow
  • Combines FP16 speed with FP32 stability
  • CUDA kernels use FP32 accumulation internally
  • Use cases: Default for older GPUs supporting FP16 but not BF16

Mixed Precision BF16

  • Forward/backward in BF16, master parameters in FP32
  • No loss scaling needed (BF16 dynamic range matches FP32)
  • Simpler than mixed FP16 (no loss scaling configuration)
  • Best speed/stability balance on modern hardware
  • CUDA kernels leverage hardware BF16 tensor cores
  • Use cases: Default for Ampere+ GPUs, primary recommendation for production

FP8 (Float8) - Experimental

  • 8-bit floating point: E4M3 (forward) and E5M2 (backward) variants
  • Further memory reduction (75% vs FP32)
  • Requires H100 or newer with hardware FP8 support
  • Complex configuration, quality impacts not fully characterized
  • Custom CUDA kernels under development for FP8 support
  • Use cases: Cutting-edge research, not recommended for general use

INT8 Quantization

  • 8-bit integer representation (post-training or quantization-aware training)
  • Primarily for inference, not training
  • Reduces model size by 75% vs FP32 for deployment
  • Quality impact depends on calibration and quantization method
  • Use cases: Model deployment, edge devices

Automatic Precision Selection

The framework detects hardware and selects optimal precision with CUDA kernel compatibility:

Detection logic:

  1. Check for CUDA availability and GPU compute capability
  2. Verify CUDA kernel compilation and availability
  3. If Ampere+ (compute capability ≥ 8.0): Select mixed_bf16 with BF16-optimized kernels
  4. If Volta/Turing (compute capability 7.0-7.5): Select mixed_fp16 with FP16-optimized kernels
  5. If Apple Silicon (MPS): Select fp16 (BF16 not supported, CUDA kernels unavailable)
  6. If CPU: Select fp32 (reduced precision offers no benefit, CUDA kernels unavailable)
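Sketched as a function (illustrative only; the framework's detection also checks kernel compilation, which is omitted here):

```python
def select_precision(backend, compute_capability=None):
    """Map (backend, compute capability) to a training precision."""
    if backend == "cuda" and compute_capability is not None:
        if compute_capability >= (8, 0):
            return "mixed_bf16"        # Ampere+: BF16 tensor cores
        if compute_capability >= (7, 0):
            return "mixed_fp16"        # Volta/Turing: FP16 with loss scaling
        return "fp32"                  # older GPUs: stay in full precision
    if backend == "mps":
        return "fp16"                  # Apple Silicon: no BF16 on MPS
    return "fp32"                      # CPU fallback
```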

Override: Set precision explicitly via configuration if automatic selection is suboptimal or for specific debugging/testing requirements.

Supported Frameworks & Integrations

Quantization & Inference:

  • AutoGPTQ: 4-bit quantization support for efficient inference and fine-tuning on consumer hardware.
  • Optimum Quanto: Dynamic quantization support (8-bit/4-bit) for flexible deployment.
  • OpenAI Triton: High-performance FP8 kernels for H100+ architectures.

Model Compatibility:

  • DeepSeek Config Adapter: Auto-convert training configurations to DeepSeek-compatible formats.
  • HuggingFace Interop: Seamless integration with transformers for dataset loading and tokenization.

Hardware-Specific Recommendations

NVIDIA Ampere/Ada/Hopper (A100, RTX 3090/4090, H100, H200):

  • Recommended: mixed_bf16 with full CUDA acceleration (4-5x total speedup)
  • Alternative: mixed_fp16 (if BF16 causes unexpected issues)
  • Advanced: fp8_e4m3 (H100+ only, experimental, custom kernels in development)
  • CUDA benefits: All kernels optimized for tensor cores, maximum performance

NVIDIA Volta/Turing (V100, T4, RTX 2080/2080Ti):

  • Recommended: mixed_fp16 with CUDA acceleration (3-4x speedup)
  • Alternative: fp32 (if stability issues, 2-3x CUDA speedup maintained)
  • Note: BF16 not supported (no hardware acceleration)
  • CUDA benefits: FP16 tensor core utilization, optimized memory access

Apple Silicon (M1/M2/M3/M4, Mac Studio, MacBook Pro):

  • Recommended: fp16 (CUDA kernels unavailable, PyTorch fallback)
  • Note: Mixed precision and BF16 not supported on MPS backend
  • Limitations: Flash Attention disabled, DeepSpeed unavailable, no custom CUDA acceleration

CPU (Intel/AMD/ARM):

  • Recommended: fp32 (CUDA kernels unavailable, PyTorch fallback)
  • Note: Reduced precision offers minimal benefit on CPU
  • Expect significantly slower training than GPU (10-100× depending on model size)
  • Performance: No CUDA acceleration, relies on BLAS libraries (MKL, OpenBLAS, Accelerate)

Precision Configuration Parameters

  • Training precision: format used during forward pass, backward pass, and gradient computation
  • Inference precision: format used during validation and evaluation
  • Master precision: format for the optimizer's master parameter copy (typically FP32 in mixed precision)
  • CUDA kernel precision: selected automatically based on training precision

Separate training/inference precision: Common pattern: Train in mixed_bf16 for speed with CUDA acceleration, evaluate in fp32 for precise metrics. Or train in mixed_fp16 with CUDA kernels, deploy in int8 for inference.

Loss scaling parameters (FP16 only):

  • init_scale: Initial loss scaling factor (default: 2^16)
  • scale_factor: Multiplier for scale adjustment (default: 2.0)
  • scale_window: Steps without overflow before increasing scale (default: 2000)
  • min_scale: Minimum scale factor (default: 1.0)

Dynamic loss scaling adjusts automatically: scale increases every scale_window steps without overflow, decreases on overflow detection (NaN/Inf gradients). CUDA kernels maintain numerical stability with loss scaling. Most users do not need to modify these parameters.
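The scale adjustment logic amounts to a small state machine using the defaults above (a sketch; the framework's scaler may differ in details such as overflow detection):

```python
class DynamicLossScaler:
    """Minimal dynamic loss scaling state machine (FP16 only):
    halve on overflow, double after scale_window clean steps."""
    def __init__(self, init_scale=2**16, scale_factor=2.0,
                 scale_window=2000, min_scale=1.0):
        self.scale = float(init_scale)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self._clean_steps = 0

    def update(self, found_overflow):
        if found_overflow:                       # NaN/Inf in gradients
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self._clean_steps = 0
            return False                         # skip this optimizer step
        self._clean_steps += 1
        if self._clean_steps >= self.scale_window:
            self.scale *= self.scale_factor
            self._clean_steps = 0
        return True
```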


Demo

Colab Notebook

Free GPU training demonstration requiring no local setup with CUDA acceleration showcase.


Environment:

  • Hardware: Google Colab T4 GPU (15.8GB memory, Turing architecture, sm_75)
  • CUDA: 11.8 or 12.x (varies by Colab environment)
  • Runtime: ~3.5 hours for complete training run (with CUDA acceleration, ~5 hours without)

Configuration:

  • Model: debug preset (14M total, 2M active, 8 experts, top-2 routing)
  • Dataset: Small conversational corpus (~200 samples, ~2.5M tokens)
  • Training: 12 epochs, batch_size=30, gradient_accumulation=8, effective_batch_size=240
  • Precision: Mixed FP16 (automatic for T4) with CUDA-accelerated operations
  • Learning rate: 1e-4 with cosine warmup and decay
  • CUDA kernels: Automatically compiled on first run (1-2 minutes compilation time)

Observable behaviors:

  • System diagnostics: Hardware detection, precision selection, CUDA kernel loading
  • Chinchilla scaling: Optimal token calculation (20× parameters), epoch recommendations
  • Training metrics: Loss progression (~2.5 → ~2.0), throughput (~53–55k tok/s with CUDA vs ~30k tok/s without)
  • Orchestrator: Health checks every 100 steps, intervention decisions
  • Expert statistics: Utilization distribution, routing entropy, load balance
  • CUDA performance: Real-time speedup metrics, per-operation timing, memory efficiency

Limitations:

  • Colab quotas: Usage limits, session timeouts
  • Small dataset: Limited quality, demonstrates pipeline not production training
  • Free tier: T4 availability not guaranteed, may encounter queuing
  • CUDA compilation: First run requires kernel compilation (~1-2 minutes)

Local Installation

Requirements:

  • Python 3.8+ (3.10+ recommended)
  • PyTorch 2.0+ (2.2+ recommended)
  • CUDA 11.8+ (for GPU with acceleration) or CPU
  • CUDA Toolkit with nvcc (for compiling custom kernels)
  • RAM: 16GB minimum, 32GB+ recommended
  • Disk: 50GB+ for dependencies, datasets, checkpoints

Installation:

git clone https://github.com/matn23/AdaptiveTrainingSystem
cd AdaptiveTrainingSystem
pip install -r requirements.txt

# Compile CUDA kernels (optional but recommended for 3-5x speedup)
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh

# Verify kernel compilation
python -c "from cuda_opt_wrapper import TRANSFORMER_OPS_AVAILABLE; print(f'CUDA ops: {TRANSFORMER_OPS_AVAILABLE}')"

cd ../..
python Main.py

Optional dependencies:

  • Flash Attention: 2-4× attention speedup, requires manual compilation
  • DeepSpeed: Multi-GPU optimization, auto-installs but compiles kernels on first use
  • Wandb: Experiment tracking, requires API key

Quick start: Default configuration uses debug preset for rapid testing. System auto-detects hardware, compiles CUDA kernels if nvcc available, selects precision, validates data, initializes model, begins training with acceleration.

Resume from checkpoint:

python Main.py --resume path/to/checkpoint.pt

Restores model state, optimizer state, scheduler, training step counter, random seeds, CUDA kernel state.

CUDA kernel management:

# Check kernel status
python -c "from moe_cuda_wrapper import print_performance_summary; print_performance_summary()"

# Recompile kernels after update
cd Src/Main_Scripts/core
./compile_transformer_ops.sh
./compile_cuda_moe.sh
cd ../training
./compile_kernels.sh

# Disable CUDA acceleration (for debugging)
# Set in config:
# use_cuda: false
# use_fused_rope/use_fused_swiglu/use_fused_moe/use_fused_loss/use_fused_grad_clip: false

Training Modes

Four data handling strategies for different use cases, all compatible with CUDA acceleration.

Base/Pre-training Only

Raw text without conversational structure. For domain-specific pre-training, continued pre-training, language modeling research.

Data format:

  • Plain text files (.txt)
  • JSONL with "text" field

Processing: Tokenize and split into fixed-length sequences with optional overlap (stride parameter). CUDA-accelerated tokenization for large corpora.

Loss: Applied to all tokens (causal language modeling) with fused loss computation kernel.

Fine-tuning Only

Conversational data with role annotations. For instruction tuning, chat models, task-specific adaptation.

Data format: JSONL with "conversation" field containing list of messages. Each message has "role" (system/user/assistant) and "content".

Processing: Concatenate messages with special tokens marking roles: [BOS] system [SEP] user [SEP] assistant [EOS]. CUDA-accelerated tokenization and batching.

Loss: Can mask user tokens (loss only on assistant responses) or compute on all tokens. Fused loss kernel handles masking efficiently.

Sequential Hybrid

Two-phase training: base corpus then conversational data. Builds general understanding, then adapts to conversation.

Configuration: Separate epoch counts per phase. Optional learning rate warmup between phases to handle distribution shift. CUDA acceleration maintained across phase transitions.

Use cases: Domain adaptation (medical literature → clinical QA), continual learning (new data → maintained task performance).

Interleaved Hybrid

Mix base and conversational data within batches/epochs. Maintains general capabilities while learning conversation. Prevents catastrophic forgetting.

Mixing ratio: base_ratio controls proportion (e.g., 0.7 = 70% base, 30% conversational).

Strategies:

  • Ratio-based: Sample according to ratio per batch
  • Alternating: Cycle between sources (batch 1 base, batch 2 conversational, etc.)
  • Random: Random sampling with probability = base_ratio
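A deterministic variant of the ratio-based strategy can use error diffusion rather than random draws, so the realized mix tracks base_ratio exactly (hypothetical helper):

```python
def ratio_schedule(base_ratio, num_batches):
    """Deterministic ratio-based mixing: emit 'base' whenever the
    accumulated ratio crosses 1.0, else 'conversational'."""
    acc, plan = 0.0, []
    for _ in range(num_batches):
        acc += base_ratio
        if acc >= 1.0:
            plan.append("base")
            acc -= 1.0
        else:
            plan.append("conversational")
    return plan
```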

Use cases: General-purpose chat models, multi-task learning with auxiliary objectives.


Adaptive Training System

Orchestrator Architecture

State machine monitoring training every N steps (default: 100). Triggers interventions across 18 methods when confidence threshold exceeded (default: 0.75). Monitors CUDA kernel performance and adjusts strategies accordingly.

Monitored metrics:

  • Loss dynamics: Trend, variance, plateau detection, divergence warnings
  • Gradient statistics: Norm, variance, stability over windows, clipping frequency
  • Expert utilization: Per-expert usage, routing entropy, load balance (MoE)
  • Memory consumption: GPU/system memory, OOM risk prediction, CUDA kernel memory usage
  • Throughput: Tokens/second with and without CUDA acceleration, kernel efficiency
  • Convergence: Score based on loss stability, compute efficiency
  • CUDA performance: Per-kernel timing, speedup metrics, fallback frequency

Intervention categories:

  1. Hyperparameter adaptation: Learning rate, weight decay, batch size
  2. Architecture modification: Add/remove experts, adjust capacities, routing temperature
  3. Emergency recovery: Gradient explosion handling, OOM recovery, checkpoint rollback
  4. Schedule optimization: Early stopping recommendations, checkpoint prioritization
  5. Performance optimization: CUDA kernel selection, precision adjustment, memory optimization

Decision process:

  1. Collect metrics at checkpoint interval including CUDA performance
  2. Analyze patterns (plateau, divergence, imbalance, performance degradation)
  3. Compute intervention confidence scores
  4. If confidence > threshold, execute intervention
  5. Log decision with reasoning, confidence, and expected performance impact
  6. Monitor intervention impact in subsequent intervals including CUDA efficiency
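In outline, the confidence gate looks like this. A simplified sketch: the cooldown between interventions is our illustrative stand-in for the "prevent excessive intervention" logic, not the framework's actual mechanism.

```python
class InterventionGate:
    """Confidence-gated intervention decision with logged history
    (threshold default 0.75, matching the orchestrator's default)."""
    def __init__(self, threshold=0.75, cooldown=3):
        self.threshold = threshold
        self.cooldown = cooldown        # checkpoints to wait after acting
        self.history = []
        self._since_last = cooldown

    def decide(self, name, confidence, reason=""):
        self._since_last += 1
        act = confidence > self.threshold and self._since_last > self.cooldown
        self.history.append({"intervention": name, "confidence": confidence,
                             "reason": reason, "executed": act})
        if act:
            self._since_last = 0
        return act
```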

Intervention Methods

MoE Architecture Management (3 methods):

  • add_expert(layer_idx): Add expert to underutilized layer when average utilization > threshold
  • prune_expert(layer_idx, expert_idx): Remove expert with utilization < threshold
  • _initialize_new_expert(new_expert, existing_experts): Initialize from existing with noise

MoE Routing Control (4 methods):

  • adjust_capacity_factor(factor): Modify token capacity per expert
  • adjust_routing_temperature(temp): Control routing sharpness (lower = more concentrated)
  • enable_expert_dropout(prob): Regularization via random expert dropping
  • get_expert_statistics(): Retrieve utilization, entropy, load balance loss, tokens dropped

MoD Control (2 methods):

  • adjust_mod_capacity(factor): Change fraction of tokens using full computation
  • get_mod_statistics(): Retrieve capacity factor, tokens processed, average depth

Batch Size Adaptation (2 methods):

  • adjust_batch_size(new_size): Change micro batch size (typically for OOM recovery)
  • _recreate_dataloader(dataset): Rebuild dataloader after batch size change

Emergency Recovery (2 methods):

  • emergency_lr_reduction(factor): Reduce learning rate by factor (for gradient explosion)
  • rollback_steps(num_steps): Revert to earlier checkpoint (for divergence)

Optimizer Adjustments (2 methods):

  • adjust_weight_decay(value): Modify regularization strength
  • _update_optimizer_param_groups(param, value): Internal parameter group update

Real-time Metrics (3 methods):

  • get_current_metrics(): Query loss, LR, gradient norm, throughput, CUDA stats
  • _extract_moe_routing_stats(): Internal MoE statistics extraction
  • _calculate_throughput(): Compute tokens/second with CUDA performance breakdown

Performance Optimization (2 methods):

  • optimize_cuda_performance(): Analyze and tune CUDA kernel usage
  • get_cuda_performance_stats(): Detailed per-kernel timing and efficiency metrics

Chinchilla Scaling

Automatic training duration calculation following compute-optimal scaling laws (Hoffmann et al., 2022) with CUDA-aware compute budgets.

Formula: N_optimal_tokens = multiplier × model_parameters

Default multiplier: 20× (configurable: 10-50×)

Process:

  1. Calculate optimal token budget: N_opt = 20 × total_parameters
  2. Adjust for CUDA acceleration: effective_N_opt = N_opt / speedup_factor
  3. Determine base epochs: epochs = N_opt / dataset_tokens
  4. Clamp to min/max epoch constraints (default: 1-50)
  5. Monitor during training: convergence score, loss landscape, compute efficiency
  6. Track CUDA performance: kernel timing, memory efficiency, bottleneck identification
  7. Adjust dynamically: Reduce epochs if fast convergence, stop early if plateaued
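Steps 1, 3, and 4 in code. A sketch only: the CUDA adjustment (step 2) and the dynamic monitoring steps are omitted:

```python
# Chinchilla-style duration calculation: token budget, base epochs, clamp.
def optimal_epochs(total_params, dataset_tokens, multiplier=20,
                   min_epochs=1, max_epochs=50):
    """Return (optimal token budget, clamped epoch count)."""
    n_opt = multiplier * total_params                          # step 1: token budget
    epochs = n_opt / dataset_tokens                            # step 3: base epochs
    epochs = max(min_epochs, min(max_epochs, round(epochs)))   # step 4: clamp
    return n_opt, epochs

# 1B-parameter model on a 5B-token dataset: 20B-token budget, 4 epochs
n_opt, epochs = optimal_epochs(1_000_000_000, 5_000_000_000)
print(n_opt, epochs)  # 20000000000 4
```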

Enhanced features:

  • Loss landscape analysis: Detect plateaus (low variance over window), divergence (rapid loss increase)
  • Compute efficiency: Track loss reduction per FLOP including CUDA acceleration benefits
  • Adaptive curriculum: Adjust learning rate or data sampling based on convergence phase
  • Early stopping: Recommend termination when convergence score > threshold (typically 85%)
  • CUDA-aware budgeting: Account for acceleration when calculating optimal training duration

Runtime integration:

System calculates optimal duration at training start with CUDA performance estimates. Displays token budget, coverage percentage (dataset_tokens / optimal_tokens), recommended epochs, expected CUDA speedup. During training, prints status every N steps: current progress, convergence score, training phase (warming/learning/convergence), efficiency trend, CUDA performance metrics, recommendations (continue/adjust/stop).


Hardware Optimization

Platform-specific optimizations automatically applied based on detected hardware with CUDA acceleration where available.

NVIDIA CUDA

Automatic optimizations:

  • Precision: mixed_bf16 for Ampere+, mixed_fp16 for Volta/Turing
  • Flash Attention: Enabled on Ampere+ (compute capability ≥ 8.0)
  • Custom CUDA kernels: Automatic compilation and loading of optimized operations
  • Tensor cores: Automatically utilized for supported operations and precisions
  • CUDA graphs: Enabled for static computation graphs (requires compile=True)
  • Memory management: Optimized allocation patterns for kernel efficiency

Configuration parameters:

  • use_flash_attention: Enable Flash Attention 2.x (2-4× attention speedup)
  • use_fused_rmsnorm: Enable fused RMSNorm kernel (default profile: false)
  • use_fused_rope: Enable fused RoPE kernel (default profile: true)
  • use_fused_swiglu: Enable fused SwiGLU kernel (default profile: true)
  • use_fused_moe: Enable CUDA MoE routing/dispatch kernels (default profile: true)
  • use_fused_loss: Enable fused loss kernel (default profile: true)
  • use_fused_grad_clip: Enable fused gradient clipping kernel (default profile: true)
  • validate_moe_cuda_indices: Extra safety checks for CUDA MoE indices (default: false, slower)
  • force_dense_expert_grads: Force dense expert gradient paths (default: false, slower)
  • routing_stats_update_interval: Routing stats sync cadence (default: 64 steps)
  • gradient_checkpointing: Trade compute for memory (enables larger models)
  • compile: PyTorch 2.0 compilation (5-30% additional speedup)
  • use_deepspeed: Enable DeepSpeed for multi-GPU
  • zero_stage: ZeRO optimization level (0-3)

DeepSpeed ZeRO stages:

  • Stage 0: Disabled (standard DDP with CUDA acceleration)
  • Stage 1: Partition optimizer states (~4× memory reduction)
  • Stage 2: Partition optimizer + gradients (~8× reduction)
  • Stage 3: Partition optimizer + gradients + parameters (~N× reduction where N = num_GPUs)
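As a rough guide, per-GPU bytes per parameter under each stage can be estimated. The constants are an assumption (mixed-precision Adam: 2 B weights + 2 B gradients + 12 B optimizer states, i.e. FP32 master copy, momentum, variance), not measured values:

```python
# Back-of-envelope per-GPU memory (bytes per parameter) for each ZeRO stage.
def bytes_per_param(stage, num_gpus):
    params, grads, optim = 2, 2, 12
    if stage >= 1:
        optim /= num_gpus      # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus      # stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus     # stage 3: also shard parameters
    return params + grads + optim

for stage in range(4):
    print(f"ZeRO-{stage}: {bytes_per_param(stage, 8):.2f} B/param on 8 GPUs")
```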

Memory optimization:

  • CPU offload: Move optimizer states to CPU memory (slower updates, massive memory savings)
  • Gradient compression: Reduce communication volume (quality impact minimal)
  • Activation checkpointing: Recompute activations during backward (trade compute for memory)
  • CUDA kernel memory: Optimized temporary buffer management

CUDA kernel performance:

  • Throughput monitoring: Real-time tokens/second with and without acceleration
  • Per-kernel timing: Individual operation profiling
  • Bottleneck detection: Identify operations limiting performance
  • Automatic tuning: Adjust kernel launch configurations for optimal performance

Apple Silicon (MPS) + Metal Acceleration

Custom Metal Shaders (NEW):

  • Fused RMSNorm: 2-3x faster than PyTorch
  • Rotary Position Embeddings: 3-5x faster
  • SwiGLU Activation: 2-3x faster
  • MoE Routing: Optimized expert dispatch and combination
  • Automatic detection: Metal shaders auto-load when available

Metal shader compilation:

cd Src/Main_Scripts/core
./compile_metal.sh

Automatic optimizations:

  • Precision: FP16 (optimal for Metal, auto-selected)
  • Sequence length: Automatically reduced to 512 on MPS to prevent OOM
  • Data loading: num_workers=0 (MPS prefers single-threaded)
  • Unified memory: Automatic page management
  • Backend detection: Unified backend system selects Metal when available

Limitations:

  • Flash Attention: Not supported (Metal alternative in development)
  • DeepSpeed: Not available (Linux/CUDA only)
  • Mixed precision: FP16 or FP32 only (no BF16)
  • Compilation: Can be unstable (set compile=False if issues)
  • Memory: Unified memory shared with system (monitor via Activity Monitor)

Recommendations:

  • Start with batch_size=2-4 (auto-reduced seq_length=512 helps)
  • Enable gradient checkpointing for larger models
  • Use debug or debug_200m presets for testing
  • Expected throughput: 2-3x faster than pure PyTorch on MPS

CPU

Automatic settings:

  • Precision: FP32 (reduced precision offers no speed benefit on CPU)
  • Threading: Automatic core detection, configurable via num_workers
  • PyTorch fallback: CUDA kernels unavailable

Optimizations:

  • BLAS libraries: MKL (Intel), OpenBLAS (AMD/ARM), Accelerate (macOS)
  • Thread count: Typically num_cores - 2 for system overhead
  • Vectorization: Automatic SIMD instruction usage

Expectations:

  • 10-100× slower than GPU depending on model size
  • 30-150× slower than CUDA-accelerated GPU training
  • Suitable for debugging, not production training
  • Memory constraints less severe (can use system RAM)

Data Processing

Memory-efficient data loading with zero-copy operations, automatic caching, and CUDA-accelerated preprocessing.

Features enabled:

  • Memory-mapped file access: Read datasets without loading entirely into RAM
  • Zero-copy operations: Apache Arrow columnar format, no serialization overhead
  • Multi-threaded loading: Configurable worker processes (num_workers parameter)
  • Automatic caching: HuggingFace Datasets caches processed data
  • Sharding: Automatic data distribution for multi-GPU training
  • Polars acceleration: Fast DataFrame operations for preprocessing
  • CUDA-accelerated tokenization: Parallel token encoding on GPU for large batches

Data intelligence:

  • Difficulty-based sampling: Prioritize harder examples based on loss history
  • Curriculum learning: Gradually increase example complexity (aggressiveness parameter: 0-1)
  • Automatic cleaning: Remove malformed samples, normalize formatting
  • Quality threshold: Filter samples below quality score (default: 0.85)
  • Sequence length optimization: Dynamic padding, variable length batching

Preprocessing pipeline:

  1. Load raw files (txt, jsonl)
  2. Validate format and structure
  3. Quality filtering (configurable threshold)
  4. Tokenization with CUDA acceleration and caching
  5. Sequence construction (chunking or conversation formatting)
  6. Difficulty scoring (for curriculum learning)
  7. Batch construction with dynamic padding
  8. Transfer to GPU with pinned memory for efficient data transfer
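Step 7, dynamic padding, in minimal form: each batch is padded only to its own longest sequence, not a global maximum. A standalone sketch of the idea, not the framework's collator:

```python
# Batch construction with dynamic padding and an attention mask.
def pad_batch(sequences, pad_id=0):
    """Pad a list of token-id lists to the batch's max length; return
    (padded_batch, attention_mask)."""
    max_len = max(len(s) for s in sequences)
    batch, mask = [], []
    for s in sequences:
        pad = max_len - len(s)
        batch.append(s + [pad_id] * pad)
        mask.append([1] * len(s) + [0] * pad)
    return batch, mask

batch, mask = pad_batch([[5, 6, 7], [8, 9]])
print(batch)  # [[5, 6, 7], [8, 9, 0]]
print(mask)   # [[1, 1, 1], [1, 1, 0]]
```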

Statistics tracked:

  • Total samples, valid/invalid counts
  • Token statistics: mean, median, max, min, standard deviation
  • Role distribution (conversational data)
  • Quality scores: error rate, issues detected
  • Sequence length distribution
  • CUDA preprocessing time and speedup metrics

Validation:

System validates all data paths before training. Checks: file existence, readability, size, format correctness. Prints summary: file count, total size, samples per file, estimated CUDA preprocessing speedup. Reports errors: missing files, corrupt formats, empty files.
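A minimal sketch of these checks (existence, readability, non-empty size); the error strings are illustrative, not the framework's actual output:

```python
# Pre-training data path validation: collect human-readable errors.
import os
import tempfile

def validate_paths(paths):
    errors = []
    for p in paths:
        if not os.path.exists(p):
            errors.append(f"missing file: {p}")
        elif not os.access(p, os.R_OK):
            errors.append(f"unreadable file: {p}")
        elif os.path.getsize(p) == 0:
            errors.append(f"empty file: {p}")
    return errors

# usage: one valid file, one missing file
with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as f:
    f.write(b'{"text": "hello"}\n')
print(validate_paths([f.name, "does_not_exist.jsonl"]))
```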


Monitoring

Comprehensive metrics tracked during training with real-time logging, experiment tracking integration, and CUDA performance monitoring.

Core metrics:

  • Loss: Training and validation, rolling average, per-batch values, fused computation timing
  • Perplexity: exp(loss), interpretable quality measure
  • Accuracy: Token-level prediction accuracy, fused with loss computation
  • Learning rate: Current value from scheduler
  • Throughput: Tokens per second with CUDA acceleration, samples per second, speedup ratio
  • Gradient norm: L2 norm of gradients, variance over window, clipping frequency
  • Memory usage: GPU allocated/reserved, system RAM, CUDA kernel buffers

MoE-specific metrics:

  • Expert utilization: Fraction of tokens per expert, per-layer and aggregated
  • Routing entropy: Distribution concentration, higher = more balanced
  • Load balance loss: Auxiliary loss magnitude
  • Tokens dropped: Count exceeding capacity
  • Expert efficiency: Compute per expert, utilization × quality contribution
  • CUDA routing performance: Top-k timing, dispatch efficiency, combination speedup

MoD-specific metrics:

  • Capacity utilization: Actual vs configured capacity per layer
  • Average depth: Mean number of full layers per token
  • Skip patterns: Which tokens skip which layers
  • Compute savings: FLOPs reduction percentage
  • Per-layer usage: Fraction of tokens using full computation per layer

Chinchilla metrics:

  • Token progress: Current tokens / optimal tokens percentage
  • Convergence score: 0-100% based on loss stability
  • Training phase: warming/learning/convergence/plateau
  • Compute efficiency: Loss reduction per FLOP including CUDA benefits
  • Predicted final loss: Extrapolation from current trajectory

CUDA performance metrics:

  • Kernel timing: Per-operation execution time
  • Speedup factors: CUDA vs PyTorch for each operation
  • Memory efficiency: Kernel buffer usage, allocation patterns
  • Throughput breakdown: Contribution of each accelerated operation
  • Fallback frequency: Count of PyTorch fallback invocations
  • Occupancy: GPU utilization percentage, kernel launch efficiency

Logging configuration:

  • Log level: DEBUG/INFO/WARNING/ERROR
  • Log interval: Every N steps (default: 50)
  • Health check interval: Orchestrator monitoring frequency (default: 100)
  • Checkpoint interval: Save frequency (default: 1000 steps)
  • CUDA profiling: Optional detailed kernel profiling (per 1000 steps)

Output destinations:

  • Console: Real-time training progress with CUDA stats
  • Log files: experiments/[name]/logs/training.log
  • Metrics files: JSON format, experiments/[name]/metrics/
  • Wandb: Optional cloud logging and visualization
  • TensorBoard: Optional local visualization
  • CUDA profiles: experiments/[name]/cuda_profiles/ (when enabled)

Health checks:

Orchestrator performs comprehensive health assessment every N steps:

  • Loss trend analysis: Increasing/decreasing/plateau/divergence
  • Gradient stability: Norm within expected range, no explosions
  • Memory status: Utilization percentage, OOM risk
  • Expert balance: MoE utilization distribution (if applicable)
  • Throughput: Current vs baseline, degradation detection
  • Convergence: Progress toward optimal loss
  • CUDA efficiency: Kernel performance, speedup maintenance, memory usage

Health check output includes status (healthy/warning/critical), detected issues, recommended interventions, confidence scores, CUDA performance summary.
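A condensed sketch of how a few of the checks above might fold into a single verdict. Thresholds and issue strings are illustrative, not the orchestrator's actual values:

```python
# Health-status classification: gather issues, then grade severity.
def assess_health(grad_norm, memory_util, loss_trend):
    issues = []
    if grad_norm > 100:
        issues.append("gradient explosion risk")
    if memory_util > 0.95:
        issues.append("OOM risk")
    if loss_trend == "divergence":
        issues.append("loss divergence")
    if any("explosion" in i or "divergence" in i for i in issues):
        status = "critical"
    elif issues:
        status = "warning"
    else:
        status = "healthy"
    return status, issues

print(assess_health(grad_norm=1.2, memory_util=0.97, loss_trend="decreasing"))
# ('warning', ['OOM risk'])
```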


Performance Benchmarks

Throughput measurements on reference hardware configurations with and without CUDA acceleration. All benchmarks use sequence_length=2048, batch_size optimized per GPU, mixed precision training with gradient checkpointing.

Single GPU Performance

Google Colab T4 (15.8GB, Turing, sm_75):

  • debug (14M total, 2M active):
    • With CUDA: ~53,000–55,000 tokens/second, batch_size=30, grad_steps=8
    • Without CUDA (PyTorch): ~30,000 tokens/second, batch_size=30, grad_steps=8
    • Speedup: ~1.8×
  • Memory efficiency: High utilization with optimized kernels

NVIDIA RTX 3090 (24GB, Ampere, sm_80):

  • b1 (1B active):
    • With CUDA: ~1400 tokens/second, batch_size=16
    • Without CUDA: ~450 tokens/second, batch_size=16
    • Speedup: 3.1×
  • b7 (7B active):
    • With CUDA: ~270 tokens/second, batch_size=4, requires ZeRO-2
    • Without CUDA: ~75 tokens/second, batch_size=4
    • Speedup: 3.6×
  • Memory efficiency: 85-90% utilization at optimal batch size

NVIDIA A100 40GB (Ampere, sm_80):

  • b1 (1B active):
    • With CUDA: ~1600 tokens/second, batch_size=32
    • Without CUDA: ~480 tokens/second, batch_size=32
    • Speedup: 3.3×
  • b7 (7B active):
    • With CUDA: ~680 tokens/second, batch_size=16
    • Without CUDA: ~195 tokens/second, batch_size=16
    • Speedup: 3.5×
  • b14 (14B active):
    • With CUDA: ~210 tokens/second, batch_size=4, requires ZeRO-2
    • Without CUDA: ~55 tokens/second, batch_size=4
    • Speedup: 3.8×
  • Memory efficiency: 90-95% utilization

NVIDIA A100 80GB (Ampere, sm_80):

  • b7 (7B active):
    • With CUDA: ~750 tokens/second, batch_size=24
    • Without CUDA: ~210 tokens/second, batch_size=24
    • Speedup: 3.6×
  • b14 (14B active):
    • With CUDA: ~340 tokens/second, batch_size=12
    • Without CUDA: ~90 tokens/second, batch_size=12
    • Speedup: 3.8×
  • b30 (30B active):
    • With CUDA: ~68 tokens/second, batch_size=2, requires ZeRO-3
    • Without CUDA: ~16 tokens/second, batch_size=2
    • Speedup: 4.3×
  • Memory efficiency: 85-92% utilization

NVIDIA H100 80GB (Hopper, sm_90):

  • b14 (14B active):
    • With CUDA: ~550 tokens/second, batch_size=16
    • Without CUDA: ~125 tokens/second, batch_size=16
    • Speedup: 4.4×
  • b30 (30B active):
    • With CUDA: ~160 tokens/second, batch_size=8
    • Without CUDA: ~35 tokens/second, batch_size=8
    • Speedup: 4.6×
  • b50 (50B active):
    • With CUDA: ~68 tokens/second, batch_size=4, requires ZeRO-3
    • Without CUDA: ~14 tokens/second, batch_size=4
    • Speedup: 4.9×
  • FP8 support: Additional 1.5-2× speedup with FP8 training

Apple M1 Max (32GB unified, MPS):

  • b1 (1B active): ~300 tokens/second, batch_size=8 (PyTorch only, no CUDA)
  • Memory: Unified architecture shares with system, effective 20-24GB for training
  • Note: 4-5× slower than equivalent CUDA GPU with acceleration

Apple M2 Ultra (128GB unified, MPS):

  • b1 (1B active): ~400 tokens/second, batch_size=16 (PyTorch only, no CUDA)
  • b7 (7B active): ~80 tokens/second, batch_size=4 (PyTorch only, no CUDA)
  • Memory: Up to 96GB available for training after system overhead
  • Note: 3-4× slower than A100 with CUDA acceleration

Multi-GPU Scaling

4× A100 80GB (DeepSpeed ZeRO-2 + CUDA):

  • b30 (30B active):
    • With CUDA: ~480 tokens/second (3.5× single GPU)
    • Without CUDA: ~110 tokens/second (3.4× single GPU)
    • Speedup: 4.4× (CUDA over PyTorch)
  • Scaling efficiency: 87%
  • Communication overhead: ~13%

8× A100 80GB (DeepSpeed ZeRO-3 + CUDA):

  • b50 (50B active):
    • With CUDA: ~440 tokens/second
    • Without CUDA: ~95 tokens/second
    • Speedup: 4.6×
  • b100 (100B active):
    • With CUDA: ~245 tokens/second
    • Without CUDA: ~52 tokens/second
    • Speedup: 4.7×
  • Scaling efficiency: 70-75%
  • Communication overhead: 25-30%

16× H100 80GB (DeepSpeed ZeRO-3 + expert parallelism + CUDA):

  • b100 (100B active):
    • With CUDA: ~820 tokens/second
    • Without CUDA: ~165 tokens/second
    • Speedup: 5.0×
  • b200 (200B active):
    • With CUDA: ~380 tokens/second
    • Without CUDA: ~73 tokens/second
    • Speedup: 5.2×
  • Scaling efficiency: 60-65%
  • Expert parallelism improves MoE scaling

Scaling efficiency factors:

  • Model size: Larger models have lower scaling efficiency (communication bound)
  • Interconnect: InfiniBand (400Gbps) vs Ethernet (100Gbps) significantly affects scaling
  • Expert parallelism: Distributing experts across GPUs improves MoE scaling
  • Gradient accumulation: Higher accumulation reduces communication frequency
  • CUDA acceleration: Maintained across all GPUs, reduces compute bottlenecks

Optimization Impact

CUDA Custom Kernels (3-5× overall):

  • RMSNorm: 3-4× speedup over PyTorch
  • RoPE: 5-7× speedup over PyTorch
  • SwiGLU: 2-3× speedup over PyTorch
  • MoE operations: 2-4× speedup over PyTorch
  • Fused loss: 3-4× speedup over separate operations
  • Combined impact: 3.2-4.8× depending on model configuration

Flash Attention (Ampere+):

  • Attention speedup: 2-4× depending on sequence length
  • Longer sequences benefit more (4× at 4096 length vs 2× at 512)
  • Memory reduction: 30-50% for attention computation
  • Quality: Numerically equivalent to standard attention
  • Compatible with CUDA kernels: Combined 4-8× total speedup

PyTorch Compilation (torch.compile):

  • Speedup: 5-30% depending on model architecture
  • MoE models: Lower benefit (routing breaks fusion)
  • Dense models: Higher benefit (more fusion opportunities)
  • Startup cost: 1-5 minutes additional compilation time
  • Stacks with CUDA: Additional 5-15% on top of kernel acceleration

Gradient Checkpointing:

  • Memory reduction: 30-50% of activation memory
  • Compute overhead: 20-30% additional training time
  • Trade-off: Enables larger batch sizes, often net positive throughput
  • CUDA kernels: Maintain acceleration during recomputation

Mixed Precision:

  • FP32 → mixed_bf16: ~2× speedup, 50% memory reduction
  • FP32 → mixed_fp16: ~2× speedup, 50% memory, may need loss scaling tuning
  • BF16 → FP8: ~1.5-2× speedup (H100 only), quality impacts under investigation
  • CUDA kernels: Optimized for each precision, maintain speedup ratios

CUDA Kernel Profiling

Per-kernel breakdown (b7 model, A100):

Operation          | PyTorch | CUDA  | Speedup | % Time
-------------------|---------|-------|---------|--------
RMSNorm            | 45ms    | 12ms  | 3.8×    | 18%
RoPE               | 38ms    | 6ms   | 6.3×    | 9%
SwiGLU             | 62ms    | 24ms  | 2.6×    | 22%
MoE Routing        | 28ms    | 9ms   | 3.1×    | 11%
MoE Dispatch       | 42ms    | 14ms  | 3.0×    | 15%
MoE Combine        | 35ms    | 11ms  | 3.2×    | 13%
Loss Computation   | 18ms    | 5ms   | 3.6×    | 6%
Other Operations   | 52ms    | 48ms  | 1.1×    | 6%
-------------------|---------|-------|---------|--------
Total per batch    | 320ms   | 129ms | 2.5×    | 100%
Effective tokens/s | 195     | 680   | 3.5×    |

Note: Effective speedup higher than per-operation average due to reduced overhead and better GPU utilization.
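The table's bottom line can be checked directly: the overall speedup is the ratio of total batch times, not the mean of per-kernel speedups, which is why un-accelerated "Other Operations" drag the total below the per-kernel figures:

```python
# Recomputing the totals from the per-kernel table above.
pytorch_ms = {"rmsnorm": 45, "rope": 38, "swiglu": 62, "moe_routing": 28,
              "moe_dispatch": 42, "moe_combine": 35, "loss": 18, "other": 52}
cuda_ms    = {"rmsnorm": 12, "rope": 6,  "swiglu": 24, "moe_routing": 9,
              "moe_dispatch": 14, "moe_combine": 11, "loss": 5,  "other": 48}

total_pt, total_cuda = sum(pytorch_ms.values()), sum(cuda_ms.values())
print(total_pt, total_cuda, round(total_pt / total_cuda, 1))  # 320 129 2.5
```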


API Reference

Trainer Methods

MoE Architecture Management:

add_expert(layer_idx: int) -> None

  • Adds new expert to specified transformer layer
  • Expert initialized from existing experts with Gaussian noise
  • Automatically updates routing weights
  • Triggers rebalancing of capacity factor
  • CUDA kernels automatically adapt to new expert count

prune_expert(layer_idx: int, expert_idx: int) -> None

  • Removes expert from specified layer
  • Parameters frozen but retained in checkpoint for potential recovery
  • Routing weights adjusted to redistribute to remaining experts
  • Triggers capacity factor adjustment
  • CUDA dispatch kernels updated for new expert topology

MoE Routing Control:

adjust_capacity_factor(factor: float) -> None

  • Updates token capacity per expert
  • Values typically 1.0-2.0 (100-200% of fair share)
  • Higher values reduce token dropping but increase memory
  • Lower values force sharper routing but may drop tokens
  • CUDA dispatch kernel buffer sizes adjusted accordingly

adjust_routing_temperature(temperature: float) -> None

  • Controls routing distribution sharpness
  • temperature < 1.0: Sharper routing (more specialized)
  • temperature > 1.0: Softer routing (more uniform)
  • Typical range: 0.5-2.0
  • CUDA top-k kernel uses temperature in softmax computation
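The effect of temperature is visible in a plain temperature-scaled softmax. A standalone sketch of the math, not the CUDA kernel itself:

```python
# Temperature-scaled routing softmax: lower temperature sharpens the
# distribution, higher temperature flattens it.
import math

def routing_softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
sharp = routing_softmax(logits, temperature=0.5)
soft = routing_softmax(logits, temperature=2.0)
print(round(max(sharp), 3), round(max(soft), 3))  # top expert gets more mass when sharp
```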

enable_expert_dropout(dropout_prob: float) -> None

  • Enables expert-level dropout during training
  • dropout_prob: Probability of dropping each expert (typical: 0.1-0.2)
  • Prevents over-reliance on specific experts
  • Disabled during evaluation automatically
  • CUDA routing kernel handles dropout masking

get_expert_statistics() -> Dict

  • Returns comprehensive expert metrics including CUDA performance
  • Keys: per_expert_utilization, routing_entropy, load_balance_loss, tokens_dropped, cuda_routing_time, cuda_dispatch_time
  • Per-layer and aggregate statistics
  • Updated each forward pass

MoD Control:

adjust_mod_capacity(capacity_factor: float) -> None

  • Updates fraction of tokens using full computation
  • Values: 0.0-1.0 (0% to 100% of tokens)
  • Lower values: More compute savings, potential quality impact
  • Typical range: 0.3-0.7

get_mod_statistics() -> Dict

  • Returns MoD efficiency metrics
  • Keys: capacity_factor, tokens_processed, tokens_skipped, average_depth, per_layer_usage, compute_savings, cuda_overhead
  • Tracks actual vs configured capacity
  • Compute savings as FLOP reduction percentage

Batch Management:

adjust_batch_size(new_batch_size: int) -> None

  • Dynamically changes micro batch size
  • Recreates dataloader with new batch size
  • Adjusts gradient accumulation to maintain effective batch size
  • Typically used for OOM recovery (reduces batch size automatically)
  • CUDA kernel launch configurations adapted to new batch size

Emergency Recovery:

emergency_lr_reduction(reduction_factor: float) -> None

  • Reduces learning rate by specified factor
  • Triggered by gradient explosion (norm > threshold)
  • Typical reduction: 5-10×
  • Logs emergency action with reasoning

rollback_steps(num_steps: int) -> None

  • Reverts training to previous checkpoint
  • Loads checkpoint from num_steps earlier
  • Resets optimizer and scheduler state
  • Used for divergence recovery
  • CUDA kernel state restored from checkpoint

Optimizer Control:

adjust_weight_decay(weight_decay: float) -> None

  • Updates L2 regularization strength
  • Typical values: 0.01-0.1
  • Higher values: Stronger regularization, may slow learning
  • Lower values: Less regularization, may overfit

Metrics Query:

get_current_metrics() -> Dict

  • Real-time training state snapshot including CUDA stats
  • Keys: loss, learning_rate, grad_norm, throughput, memory_usage, epoch, step, cuda_speedup, kernel_timing
  • Updated every forward/backward pass
  • Used by orchestrator for decision making

CUDA Performance:

get_cuda_performance_stats() -> Dict

  • Detailed CUDA kernel performance metrics
  • Keys: per_kernel_timing, speedup_factors, memory_efficiency, occupancy, fallback_count
  • Updated every N steps (configurable)
  • Used for performance optimization and debugging

optimize_cuda_performance() -> None

  • Analyzes CUDA kernel usage and tunes parameters
  • Adjusts launch configurations based on occupancy
  • Identifies and mitigates bottlenecks
  • Logs optimization decisions with expected impact

FLOPs Reduction by Layer Capacity

Figure: FLOPs reduction vs layer capacity. Lower layer capacity factors reduce FLOPs without significantly affecting performance.

Configuration Parameters

Model architecture:

  • hidden_size: Embedding and hidden dimension (128-20480, prefer multiples of 256 for CUDA efficiency)
  • num_layers: Transformer layer count (2-120)
  • num_heads: Attention head count (2-160)
  • num_kv_heads: KV cache heads for GQA (2-40, typically num_heads/4)
  • intermediate_size: FFN intermediate dimension (typically 8/3 × hidden_size, round to 256 for CUDA)
  • max_position_embeddings: Maximum sequence length (128-32768)
  • vocab_size: Tokenizer vocabulary size (typically 32000-100000)

MoE parameters:

  • use_moe: Enable MoE (boolean)
  • num_experts: Expert count per layer (4-64, typically 8)
  • moe_top_k: Experts activated per token (1-4, typically 2)
  • capacity_factor: Token capacity multiplier (1.0-2.0, typically 1.25)
  • load_balancing_weight: Auxiliary loss coefficient (0.001-0.1, typically 0.01)
  • routing_temperature: Softmax temperature (0.1-2.0, typically 1.0)
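An illustrative configuration using these names and typical values, plus the resulting per-expert token capacity. The dict shape is a sketch, not the framework's exact config API:

```python
# MoE configuration sketch with the typical values listed above.
moe_config = {
    "use_moe": True,
    "num_experts": 8,             # experts per layer
    "moe_top_k": 2,               # experts activated per token
    "capacity_factor": 1.25,      # 125% of fair-share token capacity
    "load_balancing_weight": 0.01,
    "routing_temperature": 1.0,
}

# Fair-share capacity per expert for a batch of 4096 tokens:
tokens = 4096
per_expert_capacity = int(tokens * moe_config["moe_top_k"]
                          / moe_config["num_experts"]
                          * moe_config["capacity_factor"])
print(per_expert_capacity)  # 1280
```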

MoD parameters:

  • use_mod: Enable MoD (boolean)
  • mod_capacity_factor: Fraction using full computation (0.1-1.0, typically 0.5)
  • mod_routing_type: Routing mechanism ('learned', 'static', 'random')
  • mod_start_layer: First layer with MoD (0-num_layers)
  • mod_end_layer: Last layer with MoD (None = all layers)

Training parameters:

  • num_epochs: Training duration in epochs (1-100)
  • batch_size: Micro batch size per GPU (1-128)
  • gradient_accumulation_steps: Accumulation before optimizer step (1-128)
  • learning_rate: Optimizer learning rate (1e-5 to 1e-3)
  • weight_decay: L2 regularization (0.0-0.1, typically 0.01)
  • gradient_clip_val: Gradient norm clipping (0.5-5.0, typically 1.0)
  • warmup_steps: Learning rate warmup duration (steps or fraction)

Precision parameters:

  • precision: Training precision ('auto', 'fp32', 'fp16', 'bf16', 'mixed_fp16', 'mixed_bf16', 'fp8_e4m3')
  • inference_precision: Evaluation precision (same options as training)

Optimization parameters:

  • use_flash_attention: Enable Flash Attention (boolean, auto-detected)
  • use_fused_rmsnorm: Enable fused RMSNorm kernel (boolean, default false)
  • use_fused_rope: Enable fused RoPE kernel (boolean, default true)
  • use_fused_swiglu: Enable fused SwiGLU kernel (boolean, default true)
  • use_fused_moe: Enable CUDA MoE routing/dispatch kernels (boolean, default true)
  • use_fused_loss: Enable fused loss kernel (boolean, default true)
  • use_fused_grad_clip: Enable fused gradient clipping kernel (boolean, default true)
  • validate_moe_cuda_indices: Enable strict CUDA MoE index validation (boolean, default false)
  • force_dense_expert_grads: Force dense expert gradient path for all experts (boolean, default false)
  • routing_stats_update_interval: Steps between routing stats sync/updates (int, default 64)
  • mod_routing_stats_update_interval: Steps between MoD stats updates (int, default 64)
  • metric_history_size: Bounded in-memory training metric window (int, default 2048)
  • routing_history_size: Bounded in-memory routing metric window (int, default 512)
  • gradient_checkpointing: Activation checkpointing (boolean)
  • compile: PyTorch 2.0 compilation (boolean)
  • use_deepspeed: Enable DeepSpeed (boolean)
  • zero_stage: ZeRO optimization level (0-3)
  • cpu_offload: Offload optimizer to CPU (boolean)

Data parameters:

  • training_mode: Data handling ('base_only', 'finetuning_only', 'hybrid_sequential', 'hybrid_interleaved')
  • base_paths: List of base training files
  • finetuning_paths: List of fine-tuning files
  • base_eval_paths: Base validation files
  • finetuning_eval_paths: Fine-tuning validation files
  • base_ratio: Mixing ratio for interleaved mode (0.0-1.0)
  • mask_user_tokens: Mask user messages in loss (boolean)
  • pin_memory: Pinned host memory for faster CPU→GPU transfer (boolean, default true on CUDA)
  • prefetch_factor: DataLoader prefetch depth when num_workers > 0 (int, default 4)

Orchestrator parameters:

  • use_adaptive_training: Enable orchestrator (boolean)
  • intervention_threshold: Confidence required for intervention (0.0-1.0, typically 0.75)
  • check_interval: Steps between health checks (10-1000, typically 100)
  • enable_emergency_recovery: Allow emergency interventions (boolean)
  • enable_architecture_adaptation: Allow architecture changes (boolean)

Chinchilla parameters:

  • auto_epoch_scaling: Enable automatic epoch calculation (boolean)
  • chinchilla_multiplier: Token multiplier (5-50, typically 20)
  • min_auto_epochs: Minimum epochs (1-10)
  • max_auto_epochs: Maximum epochs (10-100)
  • enable_loss_landscape: Track loss patterns (boolean)
  • enable_compute_efficiency: Track efficiency metrics (boolean)
  • enable_early_stopping: Allow early termination (boolean)

Checkpoint parameters:

  • save_every_n_batches: Checkpoint interval in steps (100-10000)
  • save_total_limit: Maximum checkpoints to keep (1-100)
  • early_stopping_patience: Epochs without improvement before stopping (3-20)

CUDA parameters:

  • cuda_kernel_path: Path to compiled CUDA kernels (default: auto-detect)
  • enable_cuda_profiling: Enable detailed kernel profiling (boolean)
  • cuda_profile_interval: Steps between profiling snapshots (1000-10000)

Troubleshooting

Out of Memory (OOM)

Automatic handling: Orchestrator detects OOM exceptions, reduces batch size by 50%, recreates dataloader, resumes training from last checkpoint. CUDA kernel buffers automatically adjusted.

Manual interventions:

  • Reduce batch_size: Start with 1-2 for very large models
  • Increase gradient_accumulation_steps: Maintains effective batch size with less memory
  • Enable gradient_checkpointing: Trades compute for memory (recompute activations)
  • Increase zero_stage: 1→2→3 for progressively more memory optimization
  • Enable cpu_offload: Moves optimizer states to CPU (slower but massive memory savings)
  • Reduce max_position_embeddings: Shorter sequences use less memory
  • Lower model size: Try smaller preset configuration
  • Check CUDA kernel memory: Temporarily disable fused CUDA ops (use_fused_* = false) or set use_cuda=false

Memory estimation:

  • Model memory (FP16) ≈ 2 bytes × total_parameters
  • Optimizer memory (Adam) ≈ 8 bytes × parameters
  • Gradient memory ≈ 2 bytes × parameters
  • Activation memory ≈ 2 × batch_size × sequence_length × num_layers × hidden_size
  • CUDA kernel buffers ≈ 100-500 MB (temporary buffers)
  • Total ≈ 12-16 bytes per parameter + activation memory + kernel overhead
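The estimate can be turned into a quick calculator. A sketch: the activation term assumes no gradient checkpointing, and attention buffers are ignored:

```python
# Rough training-memory estimate from the formulas above
# (FP16 weights/gradients, Adam optimizer states).
def estimate_memory_gb(params, batch_size, seq_len, num_layers, hidden_size,
                       kernel_overhead_gb=0.5):
    model = 2 * params                 # FP16 weights
    optimizer = 8 * params             # Adam moments
    grads = 2 * params                 # FP16 gradients
    activations = 2 * batch_size * seq_len * num_layers * hidden_size
    total_bytes = model + optimizer + grads + activations
    return total_bytes / 1024**3 + kernel_overhead_gb

# 7B parameters, batch 4, seq 2048, 32 layers, hidden 4096
print(round(estimate_memory_gb(7e9, 4, 2048, 32, 4096), 1))
```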

Training Instabilities

Gradient explosion: Symptoms: Loss becomes NaN, gradient norm > 100, rapid loss increase

Automatic recovery: Orchestrator detects high gradient norm, triggers emergency LR reduction (10×), rolls back to previous checkpoint, resumes with lower LR. CUDA fused gradient clipping prevents most explosions.

Manual fixes:

  • Lower learning_rate: Try 10× reduction
  • Lower gradient_clip_val: Clip at a stricter threshold (0.5 instead of 1.0)
  • Use mixed precision: BF16 more stable than FP16
  • Enable gradient checkpointing: Can improve numerical stability
  • Check data: Outliers or corrupted samples can cause explosions
  • Verify CUDA kernels: Check for numerical issues by setting use_fused_* = false temporarily

Loss divergence: Symptoms: Loss increases consistently, validation loss >> training loss, sudden loss spikes

Automatic recovery: Orchestrator detects divergence pattern, rolls back N steps, adjusts learning rate, may modify architecture parameters.

Manual fixes:

  • Reduce learning_rate: Start 3-5× lower
  • Increase weight_decay: Stronger regularization (0.1 instead of 0.01)
  • Check data quality: Remove outliers, validate preprocessing
  • Reduce model capacity: Overparameterized models may not converge on small datasets
  • Check CUDA kernel correctness: Verify against PyTorch fallback
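
A divergence pattern like the one described above can be detected with a simple moving-window heuristic; window size and factor here are illustrative, not the orchestrator's actual thresholds:

```python
def diverging(losses, window=50, factor=1.2):
    """Flag divergence when the recent mean loss is noticeably above the
    mean of the preceding window."""
    if len(losses) < 2 * window:
        return False                      # not enough history yet
    prev = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return recent > factor * prev
```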

Expert collapse (MoE): Symptoms: All tokens route to 1-2 experts, routing entropy < 1.0, most experts have near-zero utilization

Automatic recovery: Orchestrator detects imbalance, increases load_balancing_weight, adjusts routing_temperature, may prune/add experts. CUDA routing kernel continues to function correctly during recovery.

Manual fixes:

  • Increase load_balancing_weight: Try 0.02 or 0.05 (from 0.01)
  • Increase capacity_factor: Allow more tokens per expert (1.5 or 2.0)
  • Adjust routing_temperature: Higher values (1.5-2.0) encourage uniform routing
  • Enable expert_dropout: Forces routing to all experts
  • Check initialization: Poorly initialized experts may never activate
  • Monitor CUDA routing performance: Ensure kernels functioning correctly
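
The routing-entropy symptom above is easy to check from expert utilization counts. A balanced 8-expert router sits near log(8) ≈ 2.08 nats; values below 1.0 match the collapse symptom:

```python
import math

def routing_entropy(expert_counts):
    """Shannon entropy (nats) of the expert load distribution.
    Near log(num_experts) = balanced; < 1.0 suggests expert collapse."""
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

balanced = routing_entropy([100] * 8)                      # ~2.08
collapsed = routing_entropy([760, 40, 0, 0, 0, 0, 0, 0])   # ~0.20, well below 1.0
```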

Slow Training

Automatic optimization: Orchestrator monitors throughput, detects degradation, suggests optimizations (enable compilation, adjust batch size, check data loading bottlenecks, verify CUDA kernel usage).

Manual optimizations:

  • Enable compile: PyTorch 2.0 compilation (5-30% speedup)
  • Enable fused CUDA ops: use_fused_rope, use_fused_swiglu, use_fused_moe, use_fused_loss, use_fused_grad_clip
  • Keep use_fused_rmsnorm=false unless profiling shows a gain on your workload/GPU
  • Enable use_flash_attention: 2-4× attention speedup on Ampere+
  • Use mixed_bf16 or mixed_fp16: 2× speedup over FP32
  • Increase num_workers: Parallelize data loading (typically 4-8)
  • Ensure pin_memory=true and tune prefetch_factor (typically 2-8)
  • Increase batch_size: Better GPU utilization (if memory allows)
  • Disable gradient_checkpointing: Faster, but uses more memory
  • Check I/O: Move dataset to fast SSD, use memory-mapped files
  • Profile CUDA: Use enable_cuda_profiling=True to identify bottlenecks
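
The data-loading settings above translate directly into standard PyTorch DataLoader arguments. A minimal sketch (exact best values depend on CPU count, storage speed, and batch size; profile rather than guess):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 16))  # stand-in for a real dataset
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # parallel loading; typically 4-8
    pin_memory=True,          # faster host-to-GPU transfer
    prefetch_factor=4,        # batches prefetched per worker (2-8)
    persistent_workers=True,  # avoid re-spawning workers each epoch
)
```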

Bottleneck identification:

  • GPU utilization < 80%: CPU or data loading bottleneck
  • Low throughput with high GPU utilization: Model or algorithm bottleneck
  • Throughput decreases over time: Memory fragmentation or thermal throttling
  • Inconsistent throughput: Data loading variance or OS interference
  • CUDA kernel fallbacks: Check compilation, verify GPU compatibility
  • High CUDA overhead: Check kernel launch efficiency with profiling

Apple Silicon (MPS) Issues

Common issues:

  • Training instability: MPS precision handling differs from CUDA
  • Compilation failures: PyTorch MPS backend less mature
  • Memory pressure: Unified memory competition with OS
  • No CUDA acceleration: Custom kernels unavailable

Solutions:

  • Set compile=False: Disable compilation if unstable
  • Set num_workers=0: MPS prefers single-threaded data loading
  • Set all CUDA fused ops to false (use_fused_rmsnorm/use_fused_rope/use_fused_swiglu/use_fused_moe/use_fused_loss/use_fused_grad_clip)
  • Reduce batch_size: Start conservative (2-4)
  • Monitor Activity Monitor: Check memory pressure, GPU usage
  • Update PyTorch: MPS backend rapidly improving, use latest version
  • Fall back to CPU: If MPS is unreliable, CPU training is a stable (though slower) alternative
  • Expect slower training: Typically 4-5× slower than a comparable CUDA-enabled GPU
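
Device selection following the fallback order above (CUDA, then MPS, then CPU) is a few lines of standard PyTorch; the guard handles older builds without an `mps` backend:

```python
import torch

def pick_device():
    """CUDA if available, then MPS, then CPU as the stable fallback."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
```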

Checkpoint Issues

Corruption: Symptoms: Checkpoint fails to load, missing keys, size mismatch

Recovery: The system automatically tries previous checkpoints (latest → latest-1 → best validation). If all are corrupt, restart from initialization.

Prevention: Enable save_total_limit > 3, save to reliable storage, validate checksums. CUDA kernel state saved separately for recovery.

Resume failures: Symptoms: Training resumes but loss resets, optimizer state lost, different results than before

Causes: Incomplete checkpoint save, random seed mismatch, configuration mismatch, CUDA kernel version change

Solutions: Verify checkpoint integrity before resume, ensure configuration matches checkpoint, check random seed restoration, recompile CUDA kernels if updated.
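
Verifying checkpoint integrity before resume can be as simple as loading on CPU and checking the expected keys. The key names below are illustrative, not this framework's actual save format:

```python
import os
import tempfile
import torch

def validate_checkpoint(path, expected_keys=("model_state_dict",
                                             "optimizer_state_dict", "step")):
    """Load a checkpoint on CPU and verify the keys a clean resume needs."""
    try:
        ckpt = torch.load(path, map_location="cpu")
    except Exception:
        return False                     # unreadable or corrupt file
    return isinstance(ckpt, dict) and all(k in ckpt for k in expected_keys)

# Demo: write a well-formed checkpoint and validate it.
path = os.path.join(tempfile.gettempdir(), "ckpt_demo.pt")
torch.save({"model_state_dict": {}, "optimizer_state_dict": {}, "step": 0}, path)
ok = validate_checkpoint(path)
missing = validate_checkpoint(path + ".does_not_exist")
```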

Data Issues

Format errors: Symptoms: Training fails during data loading, "invalid JSON" or "unexpected format" errors

Solutions: Validate data format with provided validation scripts, check for: missing fields, incorrect JSON structure, encoding issues (use UTF-8), empty files or lines.
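
A per-line JSONL check covering the failure modes above (invalid JSON, missing fields, empty lines) looks roughly like this; the `text` field name is a placeholder for your dataset's schema:

```python
import json

def validate_jsonl_line(line, required_fields=("text",)):
    """Return None if the line is valid, else a short error description."""
    line = line.strip()
    if not line:
        return "empty line"
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return f"invalid JSON: {e.msg}"
    if not isinstance(record, dict):
        return "not a JSON object"
    missing = [f for f in required_fields if not record.get(f)]
    return f"missing fields: {missing}" if missing else None

ok = validate_jsonl_line('{"text": "hello"}')   # None means valid
bad = validate_jsonl_line('{"text": ')          # truncated JSON
```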

Quality problems: Symptoms: Training succeeds but poor results, high validation loss, model outputs nonsense

Causes: Data contamination, label errors, poor quality samples, distribution mismatch

Solutions: Enable automatic_data_cleaning, increase quality_threshold, manually inspect samples, check train/validation split, verify preprocessing correctness.

CUDA Kernel Issues

Compilation failures: Symptoms: Kernels fail to compile, nvcc errors, missing libraries

Solutions:

  • Verify CUDA toolkit installation: nvcc --version
  • Check compute capability: Ensure GPU supported (sm_75+)
  • Install build tools: sudo apt-get install build-essential
  • Check PyTorch CUDA version: Must match CUDA toolkit
  • Try manual compilation: cd Src/Main_Scripts/core && ./compile_transformer_ops.sh && ./compile_cuda_moe.sh && cd ../training && ./compile_kernels.sh
  • Check compiler flags: Verify architecture flags match GPU
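
The version and capability facts the checklist asks for can be collected in one place; compare `torch.version.cuda` against `nvcc --version` by hand. A sketch:

```python
import torch

def cuda_build_report():
    """Gather PyTorch/CUDA version info and check the sm_75+ requirement."""
    report = {
        "torch": torch.__version__,
        "torch_cuda": torch.version.cuda,        # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        major, minor = torch.cuda.get_device_capability()
        report["compute_capability"] = f"sm_{major}{minor}"
        report["supported"] = (major, minor) >= (7, 5)   # sm_75+ per above
    return report

report = cuda_build_report()
```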

Runtime errors: Symptoms: CUDA kernel crashes, incorrect results, NaN outputs

Solutions:

  • Verify numerical correctness: Compare with PyTorch fallback
  • Check input shapes: Ensure dimensions match kernel expectations
  • Enable debugging: Set CUDA_LAUNCH_BLOCKING=1
  • Check memory: Verify sufficient GPU memory for kernel buffers
  • Update kernels: Recompile after PyTorch/CUDA update
  • Disable temporarily: set use_fused_* = false (or use_cuda=false) to isolate issue
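
Numerical verification against the PyTorch fallback is a one-liner with `torch.allclose`. The helper below is a generic sketch; for illustration it compares two reference RMSNorm implementations rather than a real fused kernel:

```python
import torch

def kernels_match(fused_fn, reference_fn, *inputs, rtol=1e-3, atol=1e-5):
    """Compare a fused kernel against its PyTorch fallback on the same inputs.
    Tolerances are loose to absorb FP16-style rounding differences."""
    return torch.allclose(fused_fn(*inputs), reference_fn(*inputs),
                          rtol=rtol, atol=atol)

def rmsnorm(x, eps=1e-6):
    # Reference RMSNorm (no learned scale) standing in for both sides.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

x = torch.randn(4, 8)
match = kernels_match(rmsnorm, lambda t: rmsnorm(t.clone()), x)
```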

Performance issues: Symptoms: CUDA kernels slower than expected, low speedup, high overhead

Solutions:

  • Enable profiling: enable_cuda_profiling=True
  • Check occupancy: Use get_cuda_performance_stats()
  • Verify batch size: Small batches have higher overhead
  • Check launch config: May need tuning for specific GPU
  • Monitor fallbacks: Excessive PyTorch fallbacks indicate issues
  • Compare architectures: Older GPUs have lower speedup

Production Deployment

Considerations for deploying trained models in production environments.

Model Export

Checkpoint format: Standard PyTorch state dict compatible with transformers library. Can export to HuggingFace format, ONNX (for inference optimization), TorchScript (for deployment), or TensorRT (for NVIDIA inference).

Conversion process:

  1. Load final checkpoint
  2. Extract model state dict
  3. Convert to target format
  4. Validate outputs match original
  5. Benchmark inference performance
  6. Optimize CUDA kernels for inference (if applicable)
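
Steps 1-4 can be sketched with a tiny stand-in module; the real model class and checkpoint keys come from your training run:

```python
import os
import tempfile
import torch

model = torch.nn.Linear(8, 8)                          # stand-in for the trained model
path = os.path.join(tempfile.gettempdir(), "export_demo.pt")
torch.save({"model_state_dict": model.state_dict()}, path)

ckpt = torch.load(path, map_location="cpu")            # 1. load final checkpoint
state = ckpt["model_state_dict"]                       # 2. extract model state dict
exported = torch.nn.Linear(8, 8)
exported.load_state_dict(state)                        # 3. rebuild in target form

x = torch.randn(2, 8)                                  # 4. validate outputs match
with torch.no_grad():
    matches = torch.allclose(model(x), exported(x))
```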

Size optimization:

  • Weight pruning: Remove low-magnitude weights
  • Quantization: INT8 or INT4 for deployment
  • Knowledge distillation: Train smaller dense model from sparse model
  • Expert merging: Combine similar experts in MoE models
  • Kernel fusion: Combine operations for inference

Inference Optimization

Quantization strategies:

  • Post-training quantization: INT8 via bitsandbytes or GPTQ (4-bit)
  • Quantization-aware training: Train with quantization simulation
  • Dynamic quantization: Quantize activations at runtime
  • Mixed precision inference: FP16 or INT8 depending on layer
  • CUDA kernel quantization: Custom kernels for quantized operations
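
Post-training dynamic quantization (quantizing activations at runtime, weights to INT8) is available directly in PyTorch; a minimal sketch on a toy module, where a real deployment would target the trained transformer's linear layers:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8   # quantize only Linear layers
)
out = quantized(torch.randn(1, 16))               # runs with INT8 weights on CPU
```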

Batching:

  • Dynamic batching: Group requests by sequence length
  • Continuous batching: Add requests to in-flight batches
  • Request queuing: Balance latency and throughput
  • CUDA stream management: Overlap computation and data transfer

KV cache management:

  • Cache quantization: INT8 KV cache (2× memory reduction)
  • Cache eviction: Drop old tokens for long conversations
  • Paged attention: Efficient memory allocation (vLLM)
  • CUDA-optimized caching: Custom kernels for cache operations

Serving frameworks:

  • vLLM: High-throughput inference with paged attention
  • TensorRT-LLM: NVIDIA-optimized inference with custom kernels
  • Text Generation Inference: HuggingFace serving
  • Custom deployment: Direct PyTorch or ONNX Runtime with CUDA kernels

Monitoring Production Models

Inference metrics:

  • Latency: Time to first token, total generation time, CUDA kernel overhead
  • Throughput: Requests per second, tokens per second
  • Quality: Output validation, coherence scoring
  • Resource usage: GPU memory, CPU utilization, kernel efficiency

Model drift detection:

  • Input distribution monitoring
  • Output quality tracking over time
  • Comparison to validation benchmarks
  • Automatic retraining triggers

Licensing

This system is available under a commercial license. The framework is provided for evaluation via the demo notebook; production use requires a license agreement.

License tiers:

Research/Academic:

  • Non-commercial research and educational use
  • Academic institutions, non-profit research
  • Publications require citation
  • No redistribution of modified code
  • Includes CUDA kernels for research purposes
  • Pricing: Contact for academic pricing

Startup:

  • Companies with <10 employees, <$1M revenue
  • Internal use only (no redistribution)
  • Includes updates and bug fixes
  • Email support
  • CUDA kernel source code included
  • Pricing: Contact for startup pricing

Enterprise:

  • Larger organizations
  • Internal use and customer deployments
  • Includes updates, bug fixes, security patches
  • Priority support, SLA options
  • Custom modifications available
  • CUDA kernel customization support
  • Pricing: Contact for enterprise pricing

Contact: matiasnhmb@gmail.com

Evaluation license: Demo notebook and local installation for evaluation purposes. 30-day evaluation period. No production use. Watermarked outputs during evaluation. CUDA kernels included for testing.


Documentation

Adaptive Training Manual: docs/adaptive_training.md

  • Orchestrator internals
  • Intervention method details
  • Configuration recommendations
  • Case studies and examples
  • Performance optimization strategies

MoE/MoD Tutorial: docs/sparse_architectures.md

  • Sparse architecture theory
  • Implementation details
  • Training best practices
  • Debugging sparse models
  • CUDA-accelerated MoE operations

Adapter Guide: docs/adapters.md

  • Deep dive into architecture + implementation details
  • Step-by-step guides for building, training, and plugging in adapters
  • Practical examples + gotchas (routing, load balancing, perf tuning)
  • Best practices for scaling and debugging

CUDA Acceleration Guide: docs/cuda_acceleration.md

  • Transformer primitives: kernels replacing the high-frequency operations that dominate runtime in dense and MoE models
  • Full MoE routing pipeline on GPU: routing → dispatch → expert compute → recombination
  • Fused training operations: multi-step operations collapsed into single GPU passes, eliminating synchronization points
  • Mixed precision, streams, and overlap strategies

Citation

@software{AdaptiveTrainingSystem2025,
  title = {Adaptive Training System: Modular Transformer Training with MoE/MoD},
  author = {MatN23},
  year = {2025},
  url = {https://github.com/matn23/AdaptiveTrainingSystem},
  note = {Production-grade training framework with adaptive optimization and CUDA acceleration}
}

Support

Issues: GitHub issue tracker for bug reports and feature requests

Discussions: GitHub discussions for questions and community support

Email: matiasnhmb@gmail.com for licensing and technical inquiries



About

A PyTorch framework for training transformer language models with Mixture of Experts (MoE) architecture support, Mixture of Depths (MoD), and DeepSpeed integration. Implements models from 70M to 300B parameters with automatic dataset processing, distributed training, and memory management.

