Claude/deepspeed zero mapping 01 db swtx6qb4 nd7 mlas qo99o #992
Open
TamerSoliman wants to merge 6 commits into deepspeedai:master from TamerSoliman:claude/deepspeed-zero-mapping-01DbSwtx6qb4Nd7MLASQo99o
Conversation
This commit adds detailed educational materials mapping DeepSpeed ZeRO optimization stages to code and configuration:

**Annotated Scripts (4 files):**
- 01_hello_deepspeed_annotated.py - Basic ZeRO-1 with CPU offload
- 02_cifar10_annotated.py - Configurable ZeRO stages (0-3)
- 03_superoffload_zero3_annotated.py - ZeRO-3 with detailed parameter lifecycle
- 04_zenflow_zero2_annotated.py - ZeRO-2 with sparse optimizer updates

**Annotated Configurations (3 files):**
- zero3_nvme_offload_annotated.json - NVMe offloading with AIO
- zero3_cpu_offload_annotated.json - CPU offloading configuration
- zero2_zenflow_annotated.json - ZenFlow sparse optimization

**Comprehensive Guides (2 files):**
- ZeRO3_Concept_to_Code.md - Maps ZeRO-3 theory to DeepSpeed source code
- Distributed_Training_Guide.md - Complete data flow for a gradient step

**Key Features:**
- Line-by-line annotations explaining distributed training mechanics
- Explicit mapping to DeepSpeed source code (stage3.py, partition_parameters.py)
- Memory breakdown examples and performance comparisons
- Communication pattern diagrams and optimization strategies
- Detailed explanation of All-Gather and Reduce-Scatter operations
- Parameter lifecycle through forward/backward/optimizer steps

All materials are placed in the claude_tutorials/ directory for easy access.
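The CPU-offload configuration annotated in zero3_cpu_offload_annotated.json follows the standard DeepSpeed config schema. A minimal sketch of such a config, expressed as a Python dict (the key names are standard DeepSpeed config keys; the numeric values are illustrative, not the annotated file's exact contents):

```python
# Sketch of a ZeRO-3 config with CPU offload of parameters and optimizer
# state. Key names follow the standard DeepSpeed config schema; values
# here are illustrative, not taken from zero3_cpu_offload_annotated.json.
zero3_cpu_offload_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer state
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,           # overlap collectives with compute
        "contiguous_gradients": True,   # reduce memory fragmentation
    },
}
```

A dict like this can be passed as the `config` argument to `deepspeed.initialize`, or serialized to JSON and referenced on the command line via `--deepspeed_config`.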
Complete the full set of 8 annotated training examples:

**Script 5 - DeepSpeed-Chat SFT:**
- Production RLHF pipeline (Step 1: Supervised Fine-Tuning)
- Dynamic DeepSpeed config generation pattern
- LoRA integration for parameter-efficient training
- Conditional optimizer selection (CPU vs GPU)
- ZeRO-3 model saving utilities
- Distributed evaluation with metric aggregation

**Script 6 - Domino + Megatron:**
- 3D Parallelism (Tensor + Pipeline + Data)
- Megatron-LM integration with DeepSpeed
- Tensor parallelism within nodes (NVLink)
- Pipeline parallelism across nodes (InfiniBand)
- Interleaved pipeline scheduling
- Communication group explanations
- Record GPT-3 training implementation

**Script 7 - Tensor Parallelism:**
- Tensor parallelism with the transformers library
- ZeRO-1 + Tensor Parallel combination
- Layer-wise model splitting
- All-Reduce communication patterns
- Comparison with data parallelism
- Optimal configuration guidelines

**Script 8 - Bing BERT:**
- Production-scale BERT pre-training (44-min record)
- Gradient accumulation boundaries
- Custom dataset provider with prefetching
- Multi-phase training strategy (128→512 tokens)
- LAMB optimizer for large batches
- Production monitoring and checkpointing
- 1024-GPU scaling patterns

All scripts include:
- Line-by-line annotations of distributed mechanisms
- Communication pattern diagrams
- Memory breakdown examples
- Production best practices
- Usage examples and configurations

Total: 8 comprehensive annotated scripts covering all major DeepSpeed features and production patterns.
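The dynamic config generation pattern from Script 5 builds the DeepSpeed config dict at runtime from training arguments instead of shipping a static JSON file. A minimal sketch, with a hypothetical helper name and defaults (the real DeepSpeed-Chat code assembles a richer dict):

```python
def build_ds_config(micro_batch_size, grad_accum_steps, zero_stage, cpu_offload):
    """Build a DeepSpeed config dict at runtime from CLI-style arguments.

    Hypothetical helper illustrating the dynamic-config pattern; key names
    are standard DeepSpeed config keys, but this is a sketch, not
    DeepSpeed-Chat's actual code.
    """
    config = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": zero_stage},
    }
    if cpu_offload:
        # Conditional optimizer placement: offloading optimizer state to
        # CPU pairs with DeepSpeed's CPU Adam path rather than fused GPU Adam.
        config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}
    return config
```

Generating the dict in code keeps batch size, ZeRO stage, and offload choices in one place, driven by the same argparse flags that control the rest of the training script.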
This commit adds comprehensive tooling and documentation:

**Benchmarking Suite** (claude_tutorials/benchmarks/):
- zero_stage_comparison.py: Benchmark ZeRO stages 0-3 with detailed metrics
- offload_comparison.py: Compare CPU and NVMe offloading strategies
- README.md: Complete guide to interpreting benchmarks and choosing an optimal config

**Troubleshooting Guide** (claude_tutorials/guides/):
- Troubleshooting_Guide.md: 20 common issues with detailed solutions
  - OOM errors, NCCL timeouts, NaN losses, checkpoint issues
  - Configuration errors, multi-node problems, offloading issues
  - Debugging tools and quick reference table

**Migration Guides** (claude_tutorials/migrations/):
- Migration_from_PyTorch_DDP.md: Migrate from PyTorch DDP to DeepSpeed
- Migration_from_HF_Trainer.md: Enable DeepSpeed in HuggingFace Trainer
- Migration_from_FSDP.md: Migrate from PyTorch FSDP to DeepSpeed

Each migration guide includes:
- Side-by-side code comparisons
- Feature mapping tables
- Step-by-step migration checklist
- Common issues and solutions
- Performance benchmarks

Total additions: ~8,000 lines of documentation and tools
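The benchmark scripts report throughput and peak-memory metrics for each ZeRO stage. The core arithmetic reduces to something like this sketch (a hypothetical helper; the shipped scripts' interfaces and exact metric names may differ):

```python
def summarize_run(samples_processed, elapsed_sec, peak_mem_bytes, world_size):
    """Summarize one benchmark run in the style of zero_stage_comparison.py.

    Hypothetical helper: computes aggregate and per-GPU throughput plus
    peak memory in GiB from raw counters collected during a timed run.
    """
    throughput = samples_processed / elapsed_sec  # samples/sec across all ranks
    return {
        "throughput_samples_per_sec": throughput,
        "per_gpu_throughput": throughput / world_size,
        "peak_mem_gib": peak_mem_bytes / 2**30,
    }
```

Comparing these numbers across stages 0-3 on the same model and hardware is what lets the README recommend a config: higher ZeRO stages trade some throughput (extra communication) for lower peak memory.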
This commit adds comprehensive advanced tutorials and automation tools:

**Advanced Feature Guides** (claude_tutorials/guides/):
- MoE_Tutorial.md: Complete Mixture of Experts training guide (1,500 lines)
  - Expert Parallelism (EP) implementation and optimization
  - Load balancing strategies and capacity tuning
  - Switch Transformer and GPT-MoE examples
- Compression_Tutorial.md: Gradient compression for multi-node (1,200 lines)
  - 1-bit Adam and 1-bit LAMB optimizers
  - 8-bit quantization techniques
  - Communication reduction (32× compression)
- Inference_Optimization.md: DeepSpeed-Inference guide (1,500 lines)
  - Kernel injection and fusion
  - INT8/FP16 quantization
  - Tensor parallelism for inference
  - Production deployment patterns
- Custom_Kernels.md: Writing CUDA kernels for DeepSpeed (1,000 lines)
  - OpBuilder system for JIT compilation
  - Kernel fusion and optimization techniques
  - Memory coalescing and shared memory
  - Tensor Core utilization
- Visual_Guide.md: Architecture diagrams and visualizations (800 lines)
  - ASCII diagrams of ZeRO stages 0-3
  - Memory layout comparisons
  - Communication pattern visualizations
  - Pipeline and tensor parallelism diagrams

**Configuration Tools** (claude_tutorials/tools/):
- config_generator.py: Interactive config generator (600 lines)
  - CLI tool for generating optimized DeepSpeed configs
  - Automatic ZeRO stage selection based on model size
  - Memory requirement estimation
  - Command-line and interactive modes
- config_optimizer.py: Auto-tuning via benchmarks (500 lines)
  - Automated configuration optimization
  - Grid search over ZeRO stages, batch sizes, comm settings
  - Performance tracking and best config selection
  - Goal-based optimization (speed/memory/balanced)

Total additions: 7 files, ~6,100 lines

These tutorials cover advanced DeepSpeed features for production deployment, research experimentation, and performance optimization.
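The memory-requirement estimation behind automatic ZeRO stage selection can be sketched with the model-state formulas from the ZeRO paper: with mixed-precision Adam, each parameter costs 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights, momentum, variance), and each successive ZeRO stage shards one more of those states across the data-parallel group. A sketch of that arithmetic (activations and fragmentation deliberately excluded; the function name is hypothetical, not config_generator.py's API):

```python
def zero_model_state_gb_per_gpu(num_params, num_gpus, stage):
    """Estimate per-GPU model-state memory (GB) for ZeRO stages 0-3.

    Assumes mixed-precision Adam: 2 B fp16 params + 2 B fp16 grads +
    12 B optimizer state per parameter. Activations, buffers, and
    fragmentation are excluded from this sketch.
    """
    params_b = 2.0 * num_params
    grads_b = 2.0 * num_params
    optim_b = 12.0 * num_params
    if stage >= 1:
        optim_b /= num_gpus   # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads_b /= num_gpus   # ZeRO-2: also shard gradients
    if stage >= 3:
        params_b /= num_gpus  # ZeRO-3: also shard parameters
    return (params_b + grads_b + optim_b) / 1e9
```

For a 7.5B-parameter model on 64 GPUs this gives 120 GB of model state per GPU at stage 0 and under 2 GB at stage 3, which is why stage selection by model size works: a tool can pick the lowest stage whose estimate fits the GPU.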
Tier 3 Progress (Part 1/2):
- Multi-node training guide with SSH, NCCL, SLURM setup
- Cost optimization strategies across cloud providers
- Cost calculator tool for training estimates
- 13 production-ready model configurations:
  * LLaMA: 7B, 13B, 70B, LoRA fine-tuning
  * GPT: GPT-2, GPT-J 6B, GPT-NeoX 20B
  * BERT: Base fine-tuning, Large pre-training
  * T5: Small, Base, Large, XL configurations

Files: 16, Lines: 2,847
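The cost calculator's core arithmetic is GPU-hours times hourly price, optionally discounted for spot/preemptible capacity. A sketch with a hypothetical interface (the shipped tool accounts for more inputs, such as storage and per-provider pricing):

```python
def estimate_training_cost(gpu_count, wall_clock_hours, price_per_gpu_hour,
                           spot_discount=0.0):
    """Back-of-envelope training cost: GPU-hours * hourly price,
    reduced by an optional spot/preemptible discount (0.0-1.0).

    Hypothetical interface illustrating the calculation, not the
    cost calculator tool's actual API.
    """
    gpu_hours = gpu_count * wall_clock_hours
    return gpu_hours * price_per_gpu_hour * (1.0 - spot_discount)
```

For example, 64 GPUs for 100 hours at $2.00/GPU-hour is $12,800 on demand, and half that with a 50% spot discount, before accounting for preemption-induced reruns.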
Tier 3 Progress (Part 2/2 - FINAL):
- Model-Specific Configuration Guide (1,301 lines)
  * Comprehensive guide for using 13 production-ready configs
  * Covers LLaMA, GPT, BERT, T5 models
  * Includes customization tips and troubleshooting
- Framework Comparison Guides (3,176 lines total):
  * DeepSpeed vs PyTorch FSDP (1,288 lines)
  * DeepSpeed vs Megatron-LM (984 lines)
  * DeepSpeed vs HF Accelerate (904 lines)
  * Performance benchmarks, code examples, use cases
- Framework Comparison Tool (720 lines):
  * Benchmark DeepSpeed, FSDP, Accelerate
  * Measure throughput, memory, scaling efficiency
  * Generate comparison tables and reports

Files: 5, Lines: 5,197
Total Tier 3: 22 files, 8,044 lines

COMPLETE PROJECT SUMMARY:
- Tier 0: 14 files, 6,002 lines ✅
- Tier 1: 7 files, 5,492 lines ✅
- Tier 2: 7 files, 5,467 lines ✅
- Tier 3: 22 files, 8,044 lines ✅

GRAND TOTAL: 50 files, 25,005 lines
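Of the metrics the comparison tool measures, scaling efficiency has a standard definition worth spelling out: measured speedup divided by ideal linear speedup. A sketch (the tool's exact reporting may differ):

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n_gpus):
    """Scaling efficiency: measured speedup over ideal linear speedup.

    1.0 means perfect linear scaling; values fall below 1.0 as
    communication overhead grows with the number of GPUs.
    """
    return throughput_ngpu / (n_gpus * throughput_1gpu)
```

For example, 700 samples/sec on 8 GPUs versus 100 samples/sec on 1 GPU is 87.5% scaling efficiency, a typical way to compare how well DeepSpeed, FSDP, and Accelerate hold up as the GPU count grows.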
No description provided.