Add the new FPDT feature (microsoft#441) #72
Draft
* Pass `batch_dim_idx` to the DeepSpeed sequence-parallel distributed attention, to support batch sizes larger than 1
* Add FPDT support; add Ulysses rotary position embedding support
* Remove unnecessary files
* Set the warmup length to the FPDT chunk size if FPDT is enabled

Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
* [tools] GQA convert support
* Fix readme
Previously, `deepspeed_to_megatron.py` raised an import error because of a relative import. This commit fixes the issue by switching to an absolute import, as already done in `deepspeed_to_transformers.py`.
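The failure mode can be reproduced in isolation: a relative import only resolves when the file runs as part of a package, so executing such a script directly raises `ImportError`. The module name below is illustrative, not necessarily the repo's actual import:

```python
# Minimal reproduction of the bug this commit fixes. When a file is
# executed as a top-level script, __package__ is empty, so a relative
# import has no parent package to resolve against and raises
# ImportError; an absolute import does not have this problem.
src = "from . import deepspeed_checkpoint"  # relative-import form

try:
    # Simulate running the file directly as a script (no parent package).
    exec(compile(src, "deepspeed_to_megatron.py", "exec"),
         {"__name__": "__main__", "__package__": ""})
    relative_import_ok = True
except ImportError:
    relative_import_ok = False
```

Rewriting the line as an absolute import (`import deepspeed_checkpoint`-style) sidesteps the problem regardless of how the script is invoked.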
…into microsoft-main-fpdt
from upstream:
(Meta-) Review
Caution
Note: the Copilot summary below fails to capture the changes in `megatron/model/transformer.py`, which unfortunately affect `--use-flash-attn-builder` on Intel XPU.
Copilot Summary
This pull request includes several updates to improve the fine-tuning process and configuration for DeepSpeed and Megatron-LM models. The most important changes include adding new conversion commands, updating configuration files, and introducing new arguments and logic for sequence parallelism with FPDT.
Updates to fine-tuning process and configuration:
* `examples_deepspeed/finetune_hf_llama/README.md`: Updated the conversion command to include `convert_hf2mds` and added information about `convert_mds2hf` for converting models between Hugging Face and Megatron-DeepSpeed formats.
* `examples_deepspeed/finetune_hf_llama/ds_config.json`: Modified the configuration to include `zero_optimization` and `bf16` settings, and updated `steps_per_print` to 100.
* `examples_deepspeed/finetune_hf_llama/finetune_llama.sh`: Added logic to select the appropriate configuration files based on the conversion command, and updated arguments for fine-tuning. [1] [2] [3]

Introduction of new sequence parallelism with FPDT:
* `megatron/arguments.py`: Added new arguments for DeepSpeed sequence parallelism with FPDT, including `ds-sequence-parallel-fpdt`, `ds-sequence-parallel-fpdt-chunk-size`, and `ds-sequence-parallel-fpdt-offloading`.
* `megatron/initialize.py`: Updated the warmup function to handle the FPDT sequence length and avoid OOM issues. [1] [2]
* `megatron/model/gpt_model.py`: Integrated the FPDT logits loss in the post-language-model processing function. [1] [2]

Enhancements to rotary position embeddings:
* `megatron/model/rotary_pos_embedding.py`: Modified the `RotaryEmbedding` class to use the current device, and updated the forward method to return the cosine and sine components separately. [1] [2] [3]

Miscellaneous changes:
* `pretrain_gpt.py`: Added support for FPDT input construction in the `get_batch` function. [1] [2]
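For orientation, the three FPDT flags listed in the sequence-parallelism section above could be registered in Megatron's argparse style roughly as follows. The flag names come from the PR description; the defaults, help strings, and the `add_fpdt_args` helper name are illustrative assumptions, not the actual code:

```python
# Sketch only: how megatron/arguments.py might expose the new FPDT
# flags. Defaults and help text are assumptions.
import argparse

def add_fpdt_args(parser):
    group = parser.add_argument_group(title='DeepSpeed FPDT sequence parallelism')
    group.add_argument('--ds-sequence-parallel-fpdt', action='store_true',
                       help='Enable Fully Pipelined Distributed Transformer (FPDT).')
    group.add_argument('--ds-sequence-parallel-fpdt-chunk-size', type=int,
                       default=65536,
                       help='FPDT chunk size; also used as the warmup length '
                            'so warmup cannot OOM on long sequences.')
    group.add_argument('--ds-sequence-parallel-fpdt-offloading', action='store_true',
                       help='Offload FPDT chunks to host memory to save GPU memory.')
    return parser

parser = add_fpdt_args(argparse.ArgumentParser())
args = parser.parse_args(['--ds-sequence-parallel-fpdt',
                          '--ds-sequence-parallel-fpdt-chunk-size', '4096'])
```

Note how the chunk-size flag doubles as the warmup length, matching the commit "set the warmup length to be FPDT chunk size if enabled".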
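The rotary change — having the forward method return the cosine and sine components separately — can be sketched without the actual Megatron code. NumPy stands in for torch here, and the class interface and shapes are assumptions:

```python
# Sketch of a RotaryEmbedding-style forward that returns cos and sin
# separately (rather than a single fused embedding), so callers such
# as FPDT chunks can slice just the positions they need.
import numpy as np

class RotaryEmbedding:
    def __init__(self, dim, base=10000):
        # One inverse frequency per pair of channels.
        self.inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))

    def forward(self, seq_len):
        t = np.arange(seq_len)
        freqs = np.outer(t, self.inv_freq)             # [seq_len, dim/2]
        emb = np.concatenate((freqs, freqs), axis=-1)  # [seq_len, dim]
        return np.cos(emb), np.sin(emb)

cos, sin = RotaryEmbedding(8).forward(16)
```

Returning the two components separately avoids recomputing the embedding per chunk: each FPDT chunk can index `cos[start:end]` and `sin[start:end]` directly.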