Support FlagScale-FlagCX co-tuning #957
…s-ai#524)
### Description
<img width="1173" alt="image" src="https://github.com/user-attachments/assets/a446ee20-38e7-49fa-b8f5-76e1d43cec2f" />

**How to use:** Set the environment variable `SCHEDULING_STRATEGY` to choose a load-balancing strategy. The currently supported strategies are `slo`, `robin`, and `random`; the default is `slo`.

**Todo List:**
- Support offline profiling to obtain the compute capacity ratio across different machines
- Support auto-tuning to determine the optimal P/D ratio and deployment distribution
- Enable disaggregated P/D deployment on heterogeneous machines
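For illustration, a minimal sketch of switching the strategy via the environment variable described above; the serve launch command mirrors the one used elsewhere in this thread, and the config path is illustrative only.

```bash
# Choose the round-robin load-balancing strategy instead of the default `slo`.
export SCHEDULING_STRATEGY=robin

# Launch the serve task (config path shown for illustration).
python run.py --config-path ./examples/deepseek_r1/conf --config-name serve
```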
…-ai#534) This PR adds the ability to specify a commit for the submodule and optionally apply the FlagScale adaptation. Usage is as follows: `python tools/patch/unpatch.py --backend vllm --backend-commit <vllm_commit> --no-fs-extension`. `--backend-commit` specifies the commit to which the submodule is automatically reset after a submodule update. `--no-fs-extension` means the FlagScale adaptation will not be applied and the submodule will remain in its original state. Co-authored-by: caozhou <[email protected]>
…agos-ai#532) When running the tests, compile and install vLLM so that code changes are synchronized and the latest vLLM code is tested.
Add functional test, qwen3 inference test
Modify the vllm source code to call FlagGems. The added inference tests using FlagGems include: 1. deepseek_gems 2. qwen3_gems
…agos-ai#539)
## Description
Stop a running serve task with **action=stop**:
`python run.py --config-path ./examples/deepseek_r1/conf --config-name serve action=stop`
Usage: `export USE_FLAGGEMS=True`, then run `<commands>`.
Modify the vllm source code to call FlagGems. The added inference tests using FlagGems include: 1. deepseek_flaggems 2. qwen3_flaggems
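A small sketch combining the two notes above: enable FlagGems through the environment variable before launching whatever FlagScale command you want to test; the serve command and config path below are illustrative only.

```bash
# Route supported operators through FlagGems for the command that follows.
export USE_FLAGGEMS=True

# Any FlagScale command can follow; a serve launch is shown for illustration.
python run.py --config-path ./examples/deepseek_r1/conf --config-name serve
```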
1. Support serve with SGLang.
2. Support auto-tune with SGLang.
3. Adapt serve and auto-tune for SGLang; delete 6 common args.
4. Add a profile report for qwen3-0.6b with 3 backends.

Co-authored-by: MC952-arch <[email protected]>
Add the flagscale test instruction set. After installing flagscale via
pip, you can use the flagscale command to run the following tests:
Unit Testing
1. flagscale test --unit --backend ${BACKEND} --subset ${SUBSET}
2. flagscale test --unit-all
Functional testing
1. flagscale test --functional --type ${TYPE} --task ${TASK}
2. flagscale test --functional-all
All testing
flagscale test
Please read tests/README.md for help on:
1. How to set the above parameters.
2. Data files that need to be configured in advance for functional
testing.
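For example, concrete invocations of the commands above might look like the following; the backend, subset, and task values are placeholders, so check tests/README.md for the values that are actually supported.

```bash
# Unit tests for a single backend/subset (values are illustrative).
flagscale test --unit --backend vllm --subset <subset>

# A single functional test (type/task values are illustrative).
flagscale test --functional --type serve --task base

# Everything.
flagscale test
```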
**Updates:**
1. **Environment Installation Path:**
- Run the environment installation from the root directory instead of
the install folder.
2. **Foolproof Design for Image Building:**
- Require explicit specification of the FlagScale version in the
Dockerfile to avoid interference from the Docker cache mechanism and to
maintain compatibility with the existing environment (a hypothetical
build invocation is sketched after this list).
- Update the input instructions for image building parameters.
3. **TransformerEngine Version:**
- Update TransformerEngine to commit `5bee81e`.
4. **Patch Mechanism:**
- Add the necessary Python packages for the patch mechanism.
- Add unpatch action in the installation step.
- Update the path of requirements according to the patch mechanism.
5. **Customization of Torch Code:**
- Update the method for customizing torch code.
6. **FlagGems:**
- Add the Python package that FlagGems depends on.
- Add the installation process for FlagGems.
7. **Megatron Training Requirements:**
- Add the Python packages required for Megatron training.
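Referring to item 2 above, the image build now expects the FlagScale version to be passed in explicitly. A hypothetical invocation is sketched below; the build-argument name and tag are placeholders, and the real parameter names are defined by the project's Dockerfile and build instructions.

```bash
# Hypothetical build command: pin the FlagScale revision explicitly so the Docker
# cache cannot silently reuse a stale checkout (the argument name is a placeholder).
docker build \
  --build-arg FLAGSCALE_COMMIT=<flagscale_commit> \
  -t flagscale:custom \
  -f Dockerfile .
```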
This PR fixes the bug of selecting the newest commit when the backend commits are different. Co-authored-by: caozhou <[email protected]>
This PR fixes the patch error when multiple backends are `FlagScale xxx`. Co-authored-by: caozhou <[email protected]>
## Description
Automatically executes the inference testing pipeline after downloading new
model checkpoints, verifying that the inference service can start
properly and generate results.
## Usage
```bash
tests/scripts/functional_tests/test_task.sh \
--type inference-pipeline \
--task <model_name> \
[--hardware <hardware_type>] \
[--flaggems <enable/disable>]
```
### Parameters
| Parameter | Required | Default | Options | Description |
|-----------|----------|---------|---------|-------------|
| `--type` | Yes | None | `inference-pipeline` | Specify test type as inference pipeline |
| `--task` | Yes | None | Any model ckpt name (e.g. Qwen3-4B) | Target model name to test |
| `--hardware` | No | `nvidia` | `nvidia`/`bi_v150`/`cambricon_mlu` | Hardware platform for testing |
| `--flaggems` | No | `disable` | `enable`/`disable` | Whether to enable flaggems |
### Example
```bash
# Test Qwen3-4B model on BI_V150 hardware with flaggems enabled
tests/scripts/functional_tests/test_task.sh \
--type inference-pipeline \
--task Qwen3-4B \
--hardware bi_v150 \
--flaggems enable
```
Update engine args in the multiple-instance case
This PR updates the installation mechanism of FlagScale. Usage is as follows: `cd FlagScale` `PYTHONPATH=./:$PYTHONPATH pip install . --config-settings=backend=<backend> --config-settings=device=<device> --verbose --no-build-isolation` The `backend` parameter can accept multiple values, separated by commas, such as "vllm,sglang,llama.cpp" --------- Co-authored-by: caozhou <[email protected]>
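As a concrete sketch of the install command above, the comma-separated backend list might be used like this; `<device>` remains a placeholder whose accepted values are documented in the installation README.

```bash
# Install FlagScale with the vLLM and SGLang backends enabled.
cd FlagScale
PYTHONPATH=./:$PYTHONPATH pip install . \
  --config-settings=backend=vllm,sglang \
  --config-settings=device=<device> \
  --verbose --no-build-isolation
```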
This PR updates readme of installation. Co-authored-by: caozhou <[email protected]>
Update and optimize the Qwen2.5-VL model.
1. Support Qwen2.5-VL-32B and add the sample configuration.
2. Optimize the data processing to reduce the time consumed.
3. Other optimizations, such as memory cache, recompute strategy, distributed parallelism, and so on.

🚀 **QuickStart Available**: Start your SFT training quickly by referring to the [QuickStart Guide](https://github.com/FlagOpen/FlagScale/pull/572/files#diff-1dc2d875e1726124f3b3ba085e5ed5332d68cb088e8d3473d59d55c35bd725f1). Based on FlagScale, BAAI has developed [RoboBrain2.0](https://superrobobrain.github.io/), the most powerful open-source embodied brain model to date.

PS: Adapted from https://github.com/alibaba/Pai-Megatron-Patch/tree/4c305eff70c5c1d30d65ceed2713acabff48ef87/examples/qwen2_5_vl and https://github.com/huggingface/transformers/tree/10627c1a0f6877ce6715b9537afe7fafb2a89edd/src/transformers/models/qwen2_5_vl

Co-authored-by: lizhiyu <[email protected]>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi> Co-authored-by: caozhou <[email protected]>
Modify YAML configs. Co-authored-by: Hengchi <[email protected]>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
[metax] upload llama3-70b patch
This PR adds the patch history YAML for hardware; use `python tools/patch/merge.py --backend vllm FlagScale --task inference --device-type <device> --commit <merged_in_flagscale_commit>` to add the information. Co-authored-by: caozhou <[email protected]>
### PR Category
Train
### PR Types
Others
### PR Description
Support training Qwen2.5-VL on Huawei NPU.

Co-authored-by: lzy <[email protected]>
### PR Category
Serve
### PR Types
Improvements
### PR Description
Request args now support general standard types, including builtin types and typing types such as: `str`, `int`, `bool`, `float`, `Optional[str]`, `Optional[int]`, `List[str]`, `List[int]`, `List[Dict[str, str]]`, `Dict[str, str]`, `Dict[str, List[int]]`, `Tuple[int, str]`.

CI result:
```shell
tests/scripts/functional_tests/test_task.sh --type serve --task base
```
<img width="542" height="22" alt="image" src="https://github.com/user-attachments/assets/c14d7629-1ffd-468a-8400-06cd5b6ded2a" />
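A purely hypothetical request body illustrating the kinds of value types that can now be parsed; the endpoint, port, and field names below are invented for illustration and are not part of the FlagScale serve API.

```bash
# Hypothetical request: string, int, bool, float, list, and dict values all map
# onto request args of the types listed above.
curl -X POST http://localhost:8000/v1/dummy_endpoint \
  -H "Content-Type: application/json" \
  -d '{
        "name": "qwen3",
        "max_tokens": 128,
        "stream": false,
        "temperature": 0.7,
        "stop": ["\n\n", "###"],
        "metadata": {"user": "demo"},
        "shape": [55, 73]
      }'
```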
### PR Category
Inference
### PR Types
New Models
### PR Description
Add vLLM backend support to enable efficient inference for Emu3.5 AR. New features include:
- Batch scheduler between cond_input and uncond_input.
- Customized logits processor: ClassifierFreeGuidanceLogitsForVisualTokenProcessor.

DONE List:
- [x] Update FlagScale vllm backend to tag-0.11.0
- [x] Support Emu3.5 offline inference with v1 engine

Example:
```
[v1/core/sched/batch_manager.py:247] Set customized hw: 55 x 73 for request 0_cfg_0
[v1/core/sched/batch_manager.py:247] Set customized hw: 55 x 73 for request 1_cfg_0
INFO 11-19 15:56:22 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 21.7 tokens/s, Avg generation throughput: 103.4 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:32 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.7%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:42 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.3 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:52 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.1 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 0.0%
INFO 11-19 15:57:02 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 122.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 0.0%
INFO 11-19 15:57:12 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 122.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%
Processed prompts:  50%|█████     | 1/2 [01:06<01:06, 66.29s/it, est. speed input: 3.70 toks/s, output: 61.50 toks/s]
Processed prompts: 100%|██████████| 2/2 [01:06<00:00, 66.29s/it, est. speed input: 3.94 toks/s, output: 123.01 toks/s]
Processed prompts: 100%|██████████| 2/2 [01:06<00:00, 33.14s/it, est. speed input: 3.94 toks/s, output: 123.01 toks/s]
[2025-11-19 15:57:18,505 FlagScale logger.py:25 INFO] ----------------------------------------
[2025-11-19 15:57:19,608 FlagScale logger.py:25 INFO] >>> 📷[OUTPUT-0][image]: saved to outputs/emu3p5_one_image_generation/000/output_000_0.png
```
<img width="1168" height="880" alt="output_000_0" src="https://github.com/user-attachments/assets/80debc5e-06f8-4d12-87e7-a426654403d6" />

To-do / Verification List:
- [ ] DataParallel
- [ ] ChunkPrefill
- [ ] Continue Batching
- [ ] Add emu3 v1 version
### PR Category
Train
### PR Types
New Model
### PR Description
- Robobrain-X0 training support
- Add example configs and a README file

Co-authored-by: lzy <[email protected]>
### PR Category
Serve
### PR Types
New Model
### PR Description
This update introduces support for the Emu3_5 model. Note that this feature depends on the Emu3_5 project: [Emu3.5 Project](https://github.com/baaivision/Emu3.5). Additionally, the models rely on Emu3_5, available at [ModelScope - Emu3.5](https://www.modelscope.cn/models/BAAI/Emu3.5/files), and Emu3.5-VisionTokenizer, accessible at [ModelScope - Emu3.5-VisionTokenizer](https://www.modelscope.cn/models/BAAI/Emu3.5-VisionTokenizer/files).

**Highlights:**
1. Support for multiple instances
2. Capability for cross-node operation
3. Instance autoscaling support

Co-authored-by: MC952-arch <[email protected]>
Co-authored-by: cyber-pioneer <[email protected]>
Co-authored-by: chenzhuo <[email protected]>
### PR Category
Serve
### PR Types
Bug Fixes
### PR Description
Fix the config for Qwen2.5-VL.
### PR Category
Train
### PR Types
New Features
### PR Description
Add qwen2.5-10b and qwen3-10b model YAMLs.
### PR Category
Serve
### PR Types
Bug Fixes
### PR Description
Fix the bug where vLLM in FlagScale cannot run model inference. The results of the Qwen3-8B and Emu3.5 models have already been verified.
### PR Category
Others
### PR Types
Bug Fixes
### PR Description
Remove third_party/lerobot; `git submodule` now works properly.
…flagos-ai#917)
### PR Category
Train
### PR Types
New Features
### PR Description
**Integrate TransformerEngine-FL into Megatron-LM for a unified training backend**

[TransformerEngine-FL](flagos-ai/TransformerEngine-FL#1)

**Background**: Megatron-LM natively relies on NVIDIA TransformerEngine for distributed training, wherein its core operators (including GEMM, LayerNorm, Attention, and communication primitives) are monolithically encapsulated within the proprietary NCCL+cuBLAS stack. This initiative refactors TransformerEngine to construct a unified distributed training backend, leveraging FlagGems and FlagCX as foundational components.

**Primary Roadmap:**
- Initial Development: Leveraging FlagOS (FlagGems & FlagCX), implement core operators, including Linear (column-wise/row-wise parallel), DotProductAttention (FlashAttn), and RMSNorm, to facilitate end-to-end training of models such as Qwen3, ensuring correct convergence and performance alignment with expectations.
- Performance Optimization: Establish and realize diverse computation-communication (Comp/Comm) overlap optimization schemes, such as GEMM+SP Comm Overlap and FlashAttn+CP, for iterative performance refinement of TransformerEngine-FL.
- Hardware Ecosystem Compatibility: Enable adaptation across multiple hardware vendors and execute architecture-specific operator optimizations to enhance end-to-end model training performance.

**Current Progress:**
- Development of the Linear, DotProductAttention, RMSNorm, and AdamW operators has been completed.
- End-to-end convergence, as illustrated in the figure below, demonstrates alignment with TransformerEngine.
<img width="985" height="589" alt="截屏2025-11-25 14 21 16" src="https://github.com/user-attachments/assets/2819e011-30af-4b7b-98ab-0804d7a0ae3b" />
- The operator inventory is enumerated below.
```
Time (%)  Total Time (ns)  Instances  Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  -----------  ----------  --------  ---------  -----------  ----
23.8      73434757203      123324     595462.0     565601.0    172896    4311494    459319.4     mm_kernel_general
20.7      63740396844      64878      982465.5     366879.5    97248     425961771  5327781.4    ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
18.1      55932338886      41336      1353114.4    365569.0    304385    496029996  8331941.6    ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
9.3       28677100782      2720       10543051.8   8746512.5   5851277   72716183   5706959.2    ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
6.9       21159530707      10240      2066360.4    2044545.0   2038149   2157983    38519.5      _attn_bwd
6.7       20597525353      160        128734533.5  18290506.5  45856     529057641  180623640.3  ncclDevKernel_AllReduce_Sum_u8_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
2.6       8102566635       160        50641041.5   33241872.5  49408     99884105   32285638.0   ncclDevKernel_AllReduce_Sum_f32_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
2.1       6385626731       80         79820334.1   74899619.5  22688     316885564  76672028.6   ncclDevKernel_AllReduce_Sum_u32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
2.0       6155190811       228706     26913.1      7424.0      1344      585313     52799.6      add_func_kernel_rank_1
1.2       3586448520       10108      354812.9     353185.0    351361    370112     3977.2       _attn_fwd
1.0       3040437164       156992     19366.8      3200.0      1312      530017     43798.4      mul_func_scalar_kernel_rank_1
0.7       2193838986       1520       1443315.1    42928.0     13985     34803310   5967409.2    ncclDevKernel_AllReduce_Sum_f32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
0.5       1529975869       111980     13662.9      10816.0     4255      33856      7757.2       void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
0.3       862768767        26100      33056.3      16064.0     2048      297248     47648.7      l2_norm_kernel_1
0.3       848179397        10240      82830.0      82304.0     80992     92736      1810.9       triton_poi_fused_cat_0
0.3       841353024        41280      20381.6      15520.0     8096      48288      11868.9      rms_norm_grad_dw_kernel
0.3       838766524        42080      19932.7      2048.0      1312      231072     30059.9      true_div_func_tensor_scalar_kernel_rank_1
0.2       721676008        41280      17482.5      15328.0     6080      42560      9245.4       rms_norm_grad_dx_kernel
0.2       690642939        61792      11176.9      2880.0      1536      345825     33153.8      void at::native::vectorized_elementwise_kernel<(int)4, at::native::bfloat16_copy_kernel_cuda(at::Te…
0.2       662149297        1570       421751.1     26784.0     9504      6105775    1036900.3    ncclDevKernel_Broadcast_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
0.2       656237310        40696      16125.4      15328.0     14720     24096      1740.9       mul_func_kernel_rank_4
0.2       614338150        21440      28653.8      1984.0      1344      332161     44474.2      true_div_func_kernel_rank_1
0.2       591549664        62776      9423.2       3424.0      1312      229856     19177.0      mul_func_kernel_rank_1
0.2       562876515        40752      13812.2      8768.0      5600      43200      9282.8       rms_norm_kernel
0.2       526119217        80         6576490.2    7069194.0   24640     9801797    2067934.3    ncclDevKernel_AllReduce_Sum_u32_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
0.2       512128718        74112      6910.2       6720.0      2208      24544      4225.8       cat_copy_func_kernel_4
0.2       494684161        61840      7999.4       4608.0      2624      26656      5036.1       sum_dim_kernel_non_inner
0.1       423398358        10114      41862.6      40608.0     39904     50048      2419.5       triton_poi_fused_mul_silu_0
0.1       395244884        20720      19075.5      7136.0      1344      229152     30226.3      sqrt_func_kernel_rank_1
0.1       389856610        20720      18815.5      6688.0      1344      229761     30029.7      add_func_tensor_scalar_kernel_rank_1
0.1       374265252        80         4678315.7    4669215.0   4654854   5058286    60766.1      fill_scalar_func_kernel_rank_1
0.1       267491494        320        835910.9     873089.5    311104    1093409    197953.7     embedding_backward_kernel
0.1       252687527        11690      21615.7      16000.0     2208      134912     18665.5      count_nonzero_kernel_1
0.1       248953630        40696      6117.4       7904.0      2624      15776      3187.6       neg_func_kernel_rank_4
0.1       199178669        10240      19451.0      18464.0     17664     28512      2021.3       _attn_bwd_preprocess
0.1       178310444        33130      5382.1       2240.0      1215      223393     19263.9      zeros_kernel
0.1       177752928        20222      8790.1       7776.0      6976      17824      1846.9       triton_poi_fused_add_0
0.1       172528842        10240      16848.5      15968.0     15616     24640      1773.1       add_func_kernel_rank_4
0.0       129051736        960        134428.9     4832.0      1984      397346     185091.0     void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0       128656001        320        402050.0     402481.0    388192    410145     3474.4       true_div_func_kernel_rank_3
0.0       128654046        320        402043.9     401905.0    389121    415617     4864.4       sub_func_kernel_rank_3
0.0       126241110        320        394503.5     394496.0    391328    397889     1096.0       mul_func_kernel_rank_3
0.0       117415251        320        366922.7     366832.0    361025    371265     1334.3       void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
0.0       115360242        20216      5706.4       5039.5      4544      12288      1565.5       cos_func_kernel_rank_1
0.0       90851159         20216      4494.0       3872.0      3680      11552      1483.1       sin_func_kernel_rank_1
0.0       69295147         26100      2655.0       2080.0      1376      12416      1516.5       l2_norm_kernel_2
0.0       59334903         320        185421.6     185407.5    180864    191808     1901.7       max_kernel
0.0       58171124         320        181784.8     181728.5    179232    184672     969.6        sum_dim_kernel_inner
0.0       51856723         634        81792.9      80448.0     80064     88064      2535.4       masked_fill_kernel_kernel_rank_3
0.0       19589908         12010      1631.1       1568.0      1312      2560       203.9        sub_func_scalar_tensor_kernel_rank_1
0.0       8699435          3840       2265.5       2304.0      2080      2560       60.4         nonzero_kernel
0.0       8570665          314        27295.1      26336.0     23616     39840      3406.7       embedding_kernel
0.0       7695050          3040       2531.3       1472.0      1344      10496      2085.6       isinf_func_kernel_rank_1
0.0       7239914          3040       2381.6       2016.0      1536      8096       994.7        isnan_func_kernel_rank_1
0.0       6594691          640        10304.2      10416.0     9280      11616      636.8        void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0       5971307          3840       1555.0       1568.0      1439      2048       53.2         reduce_then_scan_root_scan_kernel_row
0.0       5650567          3840       1471.5       1472.0      1344      1728       44.1         gt_func_scalar_kernel_rank_1
0.0       2415396          320        7548.1       7552.0      7328      7712       57.3         void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0       2041922          960        2127.0       2240.0      1376      2784       336.3        sum_kernel_1
0.0       2033763          954        2131.8       2144.0      1952      2464       57.3         masked_fill_kernel_kernel_rank_1
0.0       1907392          634        3008.5       2688.0      2272      8000       1012.8       lt_func_scalar_kernel_rank_1
0.0       1642143          960        1710.6       1728.0      1376      1984       90.8         sum_kernel_2
0.0       1623040          634        2560.0       2144.0      2080      6752       1133.9       ge_func_scalar_kernel_rank_1
0.0       1422273          634        2243.3       1952.0      1856      6144       893.7        bitwise_or_func_kernel_rank_1
0.0       1356771          640        2120.0       2112.0      1952      2464       116.1        sub_func_kernel_rank_1
0.0       1327523          634        2093.9       2080.0      1952      2272       47.7         sub_func_tensor_scalar_kernel_rank_1
0.0       1131074          640        1767.3       1760.0      1728      1824       28.5         arange_func
0.0       754400           320        2357.5       2336.0      2016      2816       114.7        void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0       736416           320        2301.3       2304.0      2240      2496       51.9         log_func_kernel_rank_1
0.0       478783           320        1496.2       1504.0      1344      1536       29.7         clamp_func_min_kernel_rank_1
0.0       437536           320        1367.3       1376.0      1280      1440       48.7         ones_kernel
0.0       182208           80         2277.6       2272.0      2112      2464       78.2         vstack_kernel
0.0       182080           80         2276.0       2336.0      1824      2688       203.8        pow_func_tensor_scalar_kernel_rank_1
```
### PR Category
Train
### PR Types
New Features
### PR Description
This PR introduces the capability to dynamically tune the FlagCX communication config in FlagScale based on end-to-end training performance. FlagCX tuning is triggered by FlagScale after it finds the best parallelization strategy; the goal is to find the best communication config for the parallelization strategy that FlagScale uses during training.
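A minimal sketch of the co-tuning idea described above, assuming the parallelization strategy has already been fixed by FlagScale's auto-tuner: sweep candidate FlagCX communication configs, run a short training job for each, and keep the one with the best end-to-end throughput. The config file names, the `FLAGCX_TUNING_CONFIG` variable, the training config path, and the log format below are hypothetical placeholders, not the actual FlagScale/FlagCX interface.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: for a parallelization strategy already chosen by FlagScale's
# auto-tuner, sweep candidate FlagCX communication configs and keep the best one.
set -uo pipefail

# Candidate communication configs (file names are placeholders).
CANDIDATES=("comm_config_a.yaml" "comm_config_b.yaml" "comm_config_c.yaml")
best_cfg=""
best_tps=0

for cfg in "${CANDIDATES[@]}"; do
  # FLAGCX_TUNING_CONFIG is a placeholder for however the comm config is injected;
  # it is not a documented FlagCX variable.
  export FLAGCX_TUNING_CONFIG="$cfg"

  # Short training run with the fixed parallelization strategy (config path is illustrative).
  python run.py --config-path ./examples/qwen3/conf --config-name train \
    > "tune_${cfg%.yaml}.log" 2>&1 || continue

  # Assume the training log reports "tokens/s: <value>"; take the last reported value.
  tps=$(grep -oE 'tokens/s: [0-9.]+' "tune_${cfg%.yaml}.log" | tail -n1 | awk '{print $2}')
  echo "config=$cfg throughput=${tps:-0} tokens/s"

  # Keep the config with the highest end-to-end throughput.
  if awk -v a="${tps:-0}" -v b="$best_tps" 'BEGIN{exit !(a > b)}'; then
    best_tps="${tps:-0}"
    best_cfg="$cfg"
  fi
done

echo "Best communication config: $best_cfg (${best_tps} tokens/s)"
```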