
This commit adds detailed educational materials mapping DeepSpeed ZeRO
optimization stages to code and configuration:

**Annotated Scripts (4 files):**
- 01_hello_deepspeed_annotated.py - Basic ZeRO-1 with CPU offload
- 02_cifar10_annotated.py - Configurable ZeRO stages (0-3)
- 03_superoffload_zero3_annotated.py - ZeRO-3 with detailed parameter lifecycle
- 04_zenflow_zero2_annotated.py - ZeRO-2 with sparse optimizer updates

**Annotated Configurations (3 files):**
- zero3_nvme_offload_annotated.json - NVMe offloading with AIO
- zero3_cpu_offload_annotated.json - CPU offloading configuration
- zero2_zenflow_annotated.json - ZenFlow sparse optimization

**Comprehensive Guides (2 files):**
- ZeRO3_Concept_to_Code.md - Maps ZeRO-3 theory to DeepSpeed source code
- Distributed_Training_Guide.md - Complete data flow for a gradient step

**Key Features:**
- Line-by-line annotations explaining distributed training mechanics
- Explicit mapping to DeepSpeed source code (stage3.py, partition_parameters.py)
- Memory breakdown examples and performance comparisons
- Communication pattern diagrams and optimization strategies
- Detailed explanation of All-Gather, Reduce-Scatter operations
- Parameter lifecycle through forward/backward/optimizer steps

All materials placed in claude_tutorials/ directory for easy access.
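
For orientation, a minimal sketch of the pattern the ZeRO-3 and CPU-offload annotations walk through, assuming a recent DeepSpeed release where `deepspeed.initialize` accepts the config as a Python dict; the model and batch sizes are placeholders, not the tutorial's code:

```python
# Minimal, illustrative only: ZeRO-3 with parameter and optimizer offload to CPU.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real model

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # partition params, grads, optimizer states
        "offload_param": {"device": "cpu"},       # hold partitioned params in host RAM
        "offload_optimizer": {"device": "cpu"},   # hold optimizer states in host RAM
    },
}

# The engine gathers and releases parameter shards around forward/backward,
# which is the lifecycle the annotated script explains line by line.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```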

Complete the full set of 8 annotated training examples:

**Script 5 - DeepSpeed-Chat SFT:**
- Production RLHF pipeline (Step 1: Supervised Fine-Tuning)
- Dynamic DeepSpeed config generation pattern
- LoRA integration for parameter-efficient training
- Conditional optimizer selection (CPU vs GPU), sketched after this list
- ZeRO-3 model saving utilities
- Distributed evaluation with metric aggregation
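
A hedged sketch of the conditional optimizer selection pattern; the function name and `offload_to_cpu` flag are illustrative, not the script's actual identifiers. When optimizer states live on the CPU, `DeepSpeedCPUAdam` is the usual choice; otherwise a fused GPU Adam:

```python
# Illustrative: pick the optimizer implementation based on offload placement.
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam

def build_optimizer(model, lr, weight_decay, offload_to_cpu: bool):
    params = [p for p in model.parameters() if p.requires_grad]
    # DeepSpeedCPUAdam runs the update on the host, next to offloaded optimizer
    # states; FusedAdam runs a fused CUDA kernel when states stay on the GPU.
    optimizer_cls = DeepSpeedCPUAdam if offload_to_cpu else FusedAdam
    return optimizer_cls(params, lr=lr, betas=(0.9, 0.95), weight_decay=weight_decay)
```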

**Script 6 - Domino + Megatron:**
- 3D Parallelism (Tensor + Pipeline + Data), with a small arithmetic sketch after this list
- Megatron-LM integration with DeepSpeed
- Tensor parallelism within nodes (NVLink)
- Pipeline parallelism across nodes (InfiniBand)
- Interleaved pipeline scheduling
- Communication group explanations
- Record GPT-3 training implementation
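
As quick orientation for the 3D parallelism bullet, the three degrees simply factor the total GPU count; the numbers below are illustrative, not from the script:

```python
# Illustrative arithmetic: tensor x pipeline x data parallel degrees factor the world size.
world_size = 64                  # e.g. 8 nodes x 8 GPUs
tensor_parallel = 8              # within a node, over NVLink
pipeline_parallel = 4            # across nodes, over InfiniBand
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # = 2 replicas
assert tensor_parallel * pipeline_parallel * data_parallel == world_size
```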

**Script 7 - Tensor Parallelism:**
- Tensor parallelism with transformers library
- ZeRO-1 + Tensor Parallel combination
- Layer-wise model splitting
- All-Reduce communication patterns (sketched after this list)
- Comparison with data parallelism
- Optimal configuration guidelines
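
To make the All-Reduce bullet concrete, a framework-agnostic sketch of a row-parallel linear layer; the class name and sharding are illustrative, and the annotated script itself uses the transformers/DeepSpeed integration rather than a hand-rolled layer:

```python
# Each rank holds a slice of the weight, computes a partial output from its input
# shard, and the partial outputs are summed across the tensor-parallel group.
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, tp_group=None):
        super().__init__()
        tp_size = dist.get_world_size(group=tp_group)
        self.tp_group = tp_group
        self.weight = torch.nn.Parameter(torch.empty(out_features, in_features // tp_size))
        torch.nn.init.xavier_uniform_(self.weight)

    def forward(self, x_shard):
        partial = torch.nn.functional.linear(x_shard, self.weight)           # local partial result
        dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=self.tp_group)  # sum over TP ranks
        return partial
```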

**Script 8 - Bing BERT:**
- Production-scale BERT pre-training (44-min record)
- Gradient accumulation boundaries (training loop sketched after this list)
- Custom dataset provider with prefetching
- Multi-phase training strategy (128→512 tokens)
- LAMB optimizer for large batches
- Production monitoring and checkpointing
- 1024-GPU scaling patterns
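
A hedged sketch of the gradient-accumulation-boundary pattern the Bing BERT annotations highlight, continuing from an initialized `model_engine` as in the earlier sketches; it assumes the wrapped model returns its loss, and `log_metrics` is a hypothetical hook:

```python
# backward() on every micro-batch; the optimizer step fires only when the
# engine reaches a gradient accumulation boundary.
for batch in dataloader:
    loss = model_engine(batch)                        # assumes the model returns the loss
    model_engine.backward(loss)                       # gradients accumulate across micro-batches
    if model_engine.is_gradient_accumulation_boundary():
        log_metrics(loss)                             # hypothetical: act only on real optimizer steps
    model_engine.step()                               # no-op except at accumulation boundaries
```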

All scripts include:
- Line-by-line annotations of distributed mechanisms
- Communication pattern diagrams
- Memory breakdown examples
- Production best practices
- Usage examples and configurations

Total: 8 comprehensive annotated scripts covering all major
DeepSpeed features and production patterns.

This commit adds comprehensive tooling and documentation:

**Benchmarking Suite** (claude_tutorials/benchmarks/):
- zero_stage_comparison.py: Benchmark ZeRO stages 0-3 with detailed metrics
- offload_comparison.py: Compare CPU and NVMe offloading strategies
- README.md: Complete guide to interpreting benchmarks and choosing optimal config

**Troubleshooting Guide** (claude_tutorials/guides/):
- Troubleshooting_Guide.md: 20 common issues with detailed solutions
  - OOM errors, NCCL timeouts, NaN losses, checkpoint issues
  - Configuration errors, multi-node problems, offloading issues
  - Debugging tools and quick reference table (environment-variable sketch after this list)
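
As a taste of the debugging-tools section, a few environment knobs commonly used when chasing NCCL timeouts and hangs; the guide goes further, and the values here are illustrative:

```python
# Set before torch.distributed / DeepSpeed initialization.
import os
os.environ["NCCL_DEBUG"] = "INFO"                  # verbose NCCL logging per rank
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"   # extra collective-mismatch checks
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"      # surface NCCL errors instead of hanging
```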

**Migration Guides** (claude_tutorials/migrations/):
- Migration_from_PyTorch_DDP.md: Migrate from PyTorch DDP to DeepSpeed (training-loop sketch below)
- Migration_from_HF_Trainer.md: Enable DeepSpeed in HuggingFace Trainer
- Migration_from_FSDP.md: Migrate from PyTorch FSDP to DeepSpeed

Each migration guide includes:
- Side-by-side code comparisons
- Feature mapping tables
- Step-by-step migration checklist
- Common issues and solutions
- Performance benchmarks
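
For the DDP migration in particular, the core change is that the DeepSpeed engine takes over the backward/step/zero_grad sequence. A self-contained toy sketch, where the model, data, and ZeRO-1 config are placeholders rather than the guide's exact example:

```python
import torch
import deepspeed

model = torch.nn.Linear(512, 512)                 # stand-in model
data = [torch.randn(8, 512) for _ in range(10)]   # stand-in batches

ds_config = {"train_batch_size": 8, "zero_optimization": {"stage": 1}}
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for x in data:
    x = x.to(model_engine.device)
    loss = model_engine(x).pow(2).mean()   # toy loss
    model_engine.backward(loss)            # replaces loss.backward()
    model_engine.step()                    # replaces optimizer.step() + optimizer.zero_grad()
```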

Total additions: ~8,000 lines of documentation and tools

This commit adds comprehensive advanced tutorials and automation tools:

**Advanced Feature Guides** (claude_tutorials/guides/):
- MoE_Tutorial.md: Complete Mixture of Experts training guide (1,500 lines)
  - Expert Parallelism (EP) implementation and optimization (layer sketch after this list)
  - Load balancing strategies and capacity tuning
  - Switch Transformer and GPT-MoE examples
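
A hedged sketch of the expert-parallel building block the tutorial covers, using DeepSpeed's `MoE` layer inside a distributed run; the expert count, EP degree, and capacity factor are illustrative:

```python
# Wrap a standard FFN block as one expert and shard 8 experts over 4 ranks.
import torch
from deepspeed.moe.layer import MoE

hidden = 1024
expert_ffn = torch.nn.Sequential(
    torch.nn.Linear(hidden, 4 * hidden),
    torch.nn.GELU(),
    torch.nn.Linear(4 * hidden, hidden),
)
moe_ffn = MoE(
    hidden_size=hidden,
    expert=expert_ffn,
    num_experts=8,          # total experts across the expert-parallel group
    ep_size=4,              # expert-parallel degree (experts sharded over 4 ranks)
    k=1,                    # top-1 gating, Switch Transformer style
    capacity_factor=1.25,   # slack per expert before tokens are dropped or rerouted
)
```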

- Compression_Tutorial.md: Gradient compression for multi-node (1,200 lines)
  - 1-bit Adam and 1-bit LAMB optimizers (config sketch after this list)
  - 8-bit quantization techniques
  - Communication reduction (32× compression)
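
The 1-bit Adam setup is driven from the DeepSpeed config; a hedged sketch with illustrative values, with field names following the upstream 1-bit Adam tutorial:

```python
# Compressed-communication optimizer section of a DeepSpeed config.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,            # warm-up steps with uncompressed Adam
            "comm_backend_name": "nccl",    # NCCL-based compressed all-reduce
        },
    },
}
```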

- Inference_Optimization.md: DeepSpeed-Inference guide (1,500 lines)
  - Kernel injection and fusion (sketched after this list)
  - INT8/FP16 quantization
  - Tensor parallelism for inference
  - Production deployment patterns
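
A hedged sketch of kernel injection plus tensor-parallel inference along the lines the guide covers; argument names have shifted across DeepSpeed releases, so treat this as indicative rather than canonical:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in model
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,               # FP16 inference
    replace_with_kernel_inject=True,   # swap in fused inference kernels
    tensor_parallel={"tp_size": 2},    # split attention/MLP weights across 2 GPUs
)
# engine.module is the injected model; use it like the original HF model.
```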

- Custom_Kernels.md: Writing CUDA kernels for DeepSpeed (1,000 lines)
  - OpBuilder system for JIT compilation
  - Kernel fusion and optimization techniques
  - Memory coalescing and shared memory
  - Tensor Core utilization

- Visual_Guide.md: Architecture diagrams and visualizations (800 lines)
  - ASCII diagrams of ZeRO stages 0-3
  - Memory layout comparisons
  - Communication pattern visualizations
  - Pipeline and tensor parallelism diagrams

**Configuration Tools** (claude_tutorials/tools/):
- config_generator.py: Interactive config generator (600 lines)
  - CLI tool for generating optimized DeepSpeed configs
  - Automatic ZeRO stage selection based on model size
  - Memory requirement estimation (rule-of-thumb sketch after this list)
  - Command-line and interactive modes
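
The memory estimate behind the automatic stage selection follows the ZeRO paper's rule of thumb: 16 bytes per parameter for mixed-precision Adam (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), divided across ranks according to the stage. A hypothetical helper, ignoring activations and fragmentation:

```python
# Per-GPU model-state footprint under each ZeRO stage (bytes per parameter):
#   stage 0: 16            stage 1: 4 + 12/N
#   stage 2: 2 + 14/N      stage 3: 16/N        (N = data-parallel world size)
def model_states_gb(num_params: float, zero_stage: int, world_size: int) -> float:
    per_param = {
        0: 16,
        1: 4 + 12 / world_size,
        2: 2 + 14 / world_size,
        3: 16 / world_size,
    }[zero_stage]
    return num_params * per_param / 1024**3

print(f"7B params, ZeRO-3, 8 GPUs: {model_states_gb(7e9, 3, 8):.1f} GB per GPU")  # ~13 GB
```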

- config_optimizer.py: Auto-tuning via benchmarks (500 lines)
  - Automated configuration optimization
  - Grid search over ZeRO stages, batch sizes, comm settings
  - Performance tracking and best config selection
  - Goal-based optimization (speed/memory/balanced)

Total additions: 7 files, ~6,100 lines

These tutorials cover advanced DeepSpeed features for production deployment,
research experimentation, and performance optimization.

Tier 3 Progress (Part 1/2):
- Multi-node training guide with SSH, NCCL, SLURM setup
- Cost optimization strategies across cloud providers
- Cost calculator tool for training estimates
- 13 production-ready model configurations:
  * LLaMA: 7B, 13B, 70B, LoRA fine-tuning
  * GPT: GPT-2, GPT-J 6B, GPT-NeoX 20B
  * BERT: Base fine-tuning, Large pre-training
  * T5: Small, Base, Large, XL configurations

Files: 16, Lines: 2,847

Tier 3 Progress (Part 2/2 - FINAL):
- Model-Specific Configuration Guide (1,301 lines)
  * Comprehensive guide for using 13 production-ready configs
  * Covers LLaMA, GPT, BERT, T5 models
  * Includes customization tips and troubleshooting

- Framework Comparison Guides (3,176 lines total):
  * DeepSpeed vs PyTorch FSDP (1,288 lines)
  * DeepSpeed vs Megatron-LM (984 lines)
  * DeepSpeed vs HF Accelerate (904 lines)
  * Performance benchmarks, code examples, use cases

- Framework Comparison Tool (720 lines):
  * Benchmark DeepSpeed, FSDP, Accelerate
  * Measure throughput, memory, scaling efficiency
  * Generate comparison tables and reports

Files: 5, Lines: 5,197
Total Tier 3: 22 files, 8,044 lines

COMPLETE PROJECT SUMMARY:
- Tier 0: 14 files, 6,002 lines ✅
- Tier 1: 7 files, 5,492 lines ✅
- Tier 2: 7 files, 5,467 lines ✅
- Tier 3: 22 files, 8,044 lines ✅
GRAND TOTAL: 50 files, 25,005 lines