@mikethegoblin
PR Category

Train

PR Types

New Features

PR Description

This PR introduces the capability to dynamically tune the FlagCX communication config in FlagScale based on end-to-end training performance. FlagCX tuning is triggered after FlagScale finds the best parallelization strategy, and its goal is to find the best communication config for that strategy during training.

cyber-pioneer and others added 30 commits May 20, 2025 20:37
…s-ai#524)

### Description
<img width="1173" alt="image"
src="https://github.com/user-attachments/assets/a446ee20-38e7-49fa-b8f5-76e1d43cec2f"
/>



**How to use:**
Set the environment variable `SCHEDULING_STRATEGY` to choose a load-balance strategy. The currently supported strategies are `slo`, `robin`, and `random`; the default is `slo`.
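
For example, a minimal sketch (the serve launch itself is only a placeholder here, not a literal command from this PR):

```bash
# Pick the round-robin load-balance strategy; `slo` is used when the variable is unset.
export SCHEDULING_STRATEGY=robin
# ...then launch the FlagScale serve command as usual...
```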

**Todo List:**

- Support offline profiling to obtain the compute capacity ratio across
different machines

- Support auto-tuning to determine the optimal P/D ratio and deployment
distribution

- Enable disaggregated P/D deployment on heterogeneous machines
…-ai#534)

This PR adds the ability to specify a commit for the submodule and optionally apply the FlagScale adaptation. Usage is as follows:
`python tools/patch/unpatch.py --backend vllm --backend-commit <vllm_commit> --no-fs-extension`
`--backend-commit` indicates the commit to which the submodule is automatically reset after a submodule update. `--no-fs-extension` means the FlagScale adaptation is not applied and the submodule remains in its original state.

Co-authored-by: caozhou <[email protected]>
…agos-ai#532)

When running the test, compile and install vllm to synchronize code changes and ensure that the latest vllm code is tested.
The vllm source code is modified to call FlagGems.

The added inference tests using FlagGems include:
1. deepseek_gems
2. qwen3_gems
…agos-ai#539)

## Description
`python run.py --config-path ./examples/deepseek_r1/conf --config-name serve action=stop`
Usage: `export USE_FLAGGEMS=True <commands>`
The vllm source code is modified to call FlagGems.

The added inference tests using FlagGems include:
1. deepseek_flaggems
2. qwen3_flaggems
1. Support serve with SGLang
2. Support auto-tune with SGLang
3. Do some adaptation of serve and auto-tune for SGLang; delete 6 common args
4. Add profile report for qwen3-0.6b with 3 backends

---------

Co-authored-by: MC952-arch <[email protected]>
Add the flagscale test instruction set. After installing FlagScale using pip, you can use the `flagscale` command to run the following tests:

Unit Testing

    1. flagscale test --unit --backend ${BACKEND} --subset ${SUBSET}
    2. flagscale test --unit-all

Functional testing

    1. flagscale test --functional --type ${TYPE} --task ${TASK}
    2. flagscale test --functional-all

All testing

    flagscale test


Please read tests/README.md for help on:
1. How to set the above parameters.
2. Data files that need to be configured in advance for functional testing.
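
For example, invocations might look like the following (the backend, subset, type, and task values are illustrative placeholders only; see tests/README.md for the valid options):

```bash
# Run the unit tests of one backend subset (values are illustrative)
flagscale test --unit --backend megatron --subset runner

# Run one functional test (values are illustrative)
flagscale test --functional --type train --task aquila
```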
**This update covers:**

1. **Environment Installation Path:**  
- Run the environment installation from the root directory instead of
the install folder.

2. **Foolproof Design for Image Building:**
   - Require explicit specification of the FlagScale version in the Dockerfile to avoid interference from the Docker cache mechanism and maintain compatibility with the existing environment.
   - Update the input instructions for image building parameters.

3. **TransformerEngine Version:**  
   - Update TransformerEngine to commit `5bee81e`.

4. **Patch Mechanism:**  
   - Add the necessary Python packages for the patch mechanism.
   - Add unpatch action in the installation step.
   - Update the path of requirements according to the patch mechanism.

5. **Customization of Torch Code:**  
   - Update the method for customizing torch code.

6. **FlagGems:**  
   - Add the Python package that FlagGems depends on.
   - Add the installation process for FlagGems.

7. **Megatron Training Requirements:**  
    - Add the Python packages required for Megatron training.
This PR fixes the bug of selecting the newest commit when the backend commits are different.

Co-authored-by: caozhou <[email protected]>
This PR fixes the patch error when multiple backends are `FlagScale xxx`.

Co-authored-by: caozhou <[email protected]>
## Description
Automatically executes the inference testing pipeline after downloading new model checkpoints, verifying that the inference service can start properly and generate results.

## Usage
```bash
tests/scripts/functional_tests/test_task.sh \
    --type inference-pipeline \
    --task <model_name> \
    [--hardware <hardware_type>] \
    [--flaggems <enable/disable>]
```

### Parameters
| Parameter | Required | Default | Options | Description |
|-----------|----------|---------|---------|-------------|
| `--type` | Yes | None | `inference-pipeline` | Specify test type as inference pipeline |
| `--task` | Yes | None | Any model ckpt name (e.g. Qwen3-4B) | Target model name to test |
| `--hardware` | No | `nvidia` | `nvidia`/`bi_v150`/`cambricon_mlu` | Hardware platform for testing |
| `--flaggems` | No | `disable` | `enable`/`disable` | Whether to enable flaggems |

### Example
```bash
# Test Qwen3-4B model on BI_V150 hardware with flaggems enabled
tests/scripts/functional_tests/test_task.sh \
    --type inference-pipeline \
    --task Qwen3-4B \
    --hardware bi_v150 \
    --flaggems enable
```
Update engine args in multiple instance case
This PR updates the installation mechanism of FlagScale. Usage is as follows:
`cd FlagScale`
`PYTHONPATH=./:$PYTHONPATH pip install . --config-settings=backend=<backend> --config-settings=device=<device> --verbose --no-build-isolation`
The `backend` parameter can accept multiple values, separated by commas, such as "vllm,sglang,llama.cpp".

---------

Co-authored-by: caozhou <[email protected]>
This PR updates the installation README.

Co-authored-by: caozhou <[email protected]>
Update and optimize the Qwen2.5-VL model.
1. Support Qwen2.5-VL-32B and add the sample configuration.
2. Optimize the data processing pipeline to decrease the time consumed.
3. Other optimizations, such as memory cache, recompute strategy,
distributed parallelism, and so on.

🚀 **QuickStart Available**:  
Start your SFT training quickly by referring to the [QuickStart
Guide](https://github.com/FlagOpen/FlagScale/pull/572/files#diff-1dc2d875e1726124f3b3ba085e5ed5332d68cb088e8d3473d59d55c35bd725f1).

Based on FlagScale, BAAI has built [RoboBrain2.0](https://superrobobrain.github.io/), the most powerful open-source embodied brain model to date.
PS: Adapted from
https://github.com/alibaba/Pai-Megatron-Patch/tree/4c305eff70c5c1d30d65ceed2713acabff48ef87/examples/qwen2_5_vl
and
https://github.com/huggingface/transformers/tree/10627c1a0f6877ce6715b9537afe7fafb2a89edd/src/transformers/models/qwen2_5_vl

---------

Co-authored-by: lizhiyu <[email protected]>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
Co-authored-by: caozhou <[email protected]>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
[metax] upload llama3-70b patch
This PR adds the patch history yaml for hardware and uses `python tools/patch/merge.py --backend vllm FlagScale --task inference --device-type <device> --commit <merged_in_flagscale_commit>` to add the information.
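
A filled-in invocation might look like the following (the device type and commit hash are illustrative placeholders only):

```bash
# Record that the vLLM inference patch for the bi_v150 device was merged at a given FlagScale commit
python tools/patch/merge.py --backend vllm FlagScale --task inference \
    --device-type bi_v150 --commit 0123abc
```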

Co-authored-by: caozhou <[email protected]>
heavyrain-lzy and others added 23 commits November 19, 2025 10:20
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Train
### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
Others
### PR Description
<!-- Describe what you’ve done -->
Support training Qwen2.5-VL on Huawei NPU.

---------

Co-authored-by: lzy <[email protected]>
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Serve

### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
Improvements

### PR Description
<!-- Describe what you’ve done -->
Request args now support general standard types, including builtin and `typing` types such as:
`str`, `int`, `bool`, `float`, `Optional[str]`, `Optional[int]`, `List[str]`, `List[int]`, `List[Dict[str, str]]`, `Dict[str, str]`, `Dict[str, List[int]]`, `Tuple[int, str]`

CI result:
```shell
tests/scripts/functional_tests/test_task.sh --type serve --task base
``` 
<img width="542" height="22" alt="image"
src="https://github.com/user-attachments/assets/c14d7629-1ffd-468a-8400-06cd5b6ded2a"
/>
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Inference
### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
New Models
### PR Description
<!-- Describe what you’ve done -->

Add vLLM backend support to enable efficient inference for Emu3.5 AR.
New features include:
- Batch scheduler between cond_input and uncond_input.
- Customized logits processor:
ClassifierFreeGuidanceLogitsForVisualTokenProcessor.

DONE List:
- [x] Update FlagScale vllm backend to tag-0.11.0
- [x] support Emu3.5 offline inference with v1 engine

Example:
```
[v1/core/sched/batch_manager.py:247] Set customized hw: 55 x 73 for request 0_cfg_0
[v1/core/sched/batch_manager.py:247] Set customized hw: 55 x 73 for request 1_cfg_0
INFO 11-19 15:56:22 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 21.7 tokens/s, Avg generation throughput: 103.4 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:32 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.7%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:42 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.3 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:52 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.1 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 0.0%
INFO 11-19 15:57:02 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 122.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 0.0%
INFO 11-19 15:57:12 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 122.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%


Processed prompts:  50%|█████     | 1/2 [01:06<01:06, 66.29s/it, est. speed input: 3.70 toks/s, output: 61.50 toks/s]
Processed prompts: 100%|██████████| 2/2 [01:06<00:00, 66.29s/it, est. speed input: 3.94 toks/s, output: 123.01 toks/s]
Processed prompts: 100%|██████████| 2/2 [01:06<00:00, 33.14s/it, est. speed input: 3.94 toks/s, output: 123.01 toks/s]
[2025-11-19 15:57:18,505 FlagScale logger.py:25 INFO] ----------------------------------------
[2025-11-19 15:57:19,608 FlagScale logger.py:25 INFO] >>> 📷[OUTPUT-0][image]: saved to outputs/emu3p5_one_image_generation/000/output_000_0.png
```

<img width="1168" height="880" alt="output_000_0"
src="https://github.com/user-attachments/assets/80debc5e-06f8-4d12-87e7-a426654403d6"
/>


To-do / Verification List:
- [ ] DataParallel
- [ ] Chunked Prefill
- [ ] Continuous Batching
- [ ] add emu3 v1 version
### PR Category
Train 

### PR Types
New Model

### PR Description
- Robobrain-X0 training support
- Add example configs and a README file

---------

Co-authored-by: lzy <[email protected]>
### PR Category
Serve

### PR Types
New Model

### PR Description
This update introduces support for the Emu3.5 model. Note that this feature depends on the [Emu3.5 Project](https://github.com/baaivision/Emu3.5). Additionally, the models rely on [Emu3.5](https://www.modelscope.cn/models/BAAI/Emu3.5/files) and [Emu3.5-VisionTokenizer](https://www.modelscope.cn/models/BAAI/Emu3.5-VisionTokenizer/files), both available on ModelScope.

**Highlights:**
1. Support for multiple instances
2. Capability for cross-node operation
3. Instance autoscaling support

---------

Co-authored-by: MC952-arch <[email protected]>
Co-authored-by: cyber-pioneer <[email protected]>
Co-authored-by: chenzhuo <[email protected]>
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Serve

### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
Bug Fixes

### PR Description
<!-- Describe what you’ve done -->
Fix the config for qwen2.5vl.
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Train
### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
New Features
### PR Description
<!-- Describe what you’ve done -->
Add qwen2.5-10b and qwen3-10b model YAML configs.
### PR Category
 Serve

### PR Types
 Bug Fixes

### PR Description
Fix the bug where vllm in FlagScale cannot run model inference.
Already verified with the Qwen3-8B and Emu3.5 models.
### PR Category
Others

### PR Types
Bug Fixes

### PR Description
Remove `third_party/lerobot`; now `git submodule` works properly.
…flagos-ai#917)

### PR Category
Train
### PR Types
New Features

### PR Description
**Integrate TransformerEngine-FL into Megatron-LM for a unified training backend**

**[TransformerEngine-FL](flagos-ai/TransformerEngine-FL#1)**

**Background**:
Megatron-LM natively relies on NVIDIA TransformerEngine for distributed training, wherein its core operators—including GEMM, LayerNorm, Attention, and communication primitives—are monolithically encapsulated within the proprietary NCCL+cuBLAS stack. This initiative refactors TransformerEngine to construct a unified distributed training backend, leveraging FlagGems and FlagCX as foundational components.

**Primary Roadmap:**

- Initial Development: Leveraging FlagOS (FlagGems & FlagCX), implement core operators—including Linear (column-wise/row-wise parallel), DotProductAttention (FlashAttn), and RMSNorm—to facilitate end-to-end training of models such as Qwen3, ensuring correct convergence and performance alignment with expectations.
- Performance Optimization: Establish and realize diverse
Computation-Communication (Comp/Comm) overlap optimization schemes, such
as GEMM+SP Comm Overlap and FlashAttn+CP, for iterative performance
refinement of TransformerEngine-FL.
- Hardware Ecosystem Compatibility: Enable adaptation across multiple
hardware vendors and execute architecture-specific operator
optimizations to enhance end-to-end model training performance.

**Current Progress:**

- Development of Linear, DotProductAttention, RMSNorm, AdamW operators
has been completed.
- End-to-end convergence, as illustrated in the figure below,
demonstrates alignment with TransformerEngine.

<img width="985" height="589" alt="截屏2025-11-25 14 21 16"
src="https://github.com/user-attachments/assets/2819e011-30af-4b7b-98ab-0804d7a0ae3b"
/>


- The operator inventory is enumerated below.


Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ---------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
23.8 73434757203 123324 595462.0 565601.0 172896 4311494 459319.4 mm_kernel_general
20.7 63740396844 64878 982465.5 366879.5 97248 425961771 5327781.4 ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
18.1 55932338886 41336 1353114.4 365569.0 304385 496029996 8331941.6 ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
9.3 28677100782 2720 10543051.8 8746512.5 5851277 72716183 5706959.2 ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
6.9 21159530707 10240 2066360.4 2044545.0 2038149 2157983 38519.5 _attn_bwd
6.7 20597525353 160 128734533.5 18290506.5 45856 529057641 180623640.3 ncclDevKernel_AllReduce_Sum_u8_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
2.6 8102566635 160 50641041.5 33241872.5 49408 99884105 32285638.0 ncclDevKernel_AllReduce_Sum_f32_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
2.1 6385626731 80 79820334.1 74899619.5 22688 316885564 76672028.6 ncclDevKernel_AllReduce_Sum_u32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
2.0 6155190811 228706 26913.1 7424.0 1344 585313 52799.6 add_func_kernel_rank_1
1.2 3586448520 10108 354812.9 353185.0 351361 370112 3977.2 _attn_fwd
1.0 3040437164 156992 19366.8 3200.0 1312 530017 43798.4 mul_func_scalar_kernel_rank_1
0.7 2193838986 1520 1443315.1 42928.0 13985 34803310 5967409.2 ncclDevKernel_AllReduce_Sum_f32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
0.5 1529975869 111980 13662.9 10816.0 4255 33856 7757.2 void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
0.3 862768767 26100 33056.3 16064.0 2048 297248 47648.7 l2_norm_kernel_1
0.3 848179397 10240 82830.0 82304.0 80992 92736 1810.9 triton_poi_fused_cat_0
0.3 841353024 41280 20381.6 15520.0 8096 48288 11868.9 rms_norm_grad_dw_kernel
0.3 838766524 42080 19932.7 2048.0 1312 231072 30059.9 true_div_func_tensor_scalar_kernel_rank_1
0.2 721676008 41280 17482.5 15328.0 6080 42560 9245.4 rms_norm_grad_dx_kernel
0.2 690642939 61792 11176.9 2880.0 1536 345825 33153.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::bfloat16_copy_kernel_cuda(at::Te…
0.2 662149297 1570 421751.1 26784.0 9504 6105775 1036900.3 ncclDevKernel_Broadcast_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
0.2 656237310 40696 16125.4 15328.0 14720 24096 1740.9 mul_func_kernel_rank_4
0.2 614338150 21440 28653.8 1984.0 1344 332161 44474.2 true_div_func_kernel_rank_1
0.2 591549664 62776 9423.2 3424.0 1312 229856 19177.0 mul_func_kernel_rank_1
0.2 562876515 40752 13812.2 8768.0 5600 43200 9282.8 rms_norm_kernel
0.2 526119217 80 6576490.2 7069194.0 24640 9801797 2067934.3 ncclDevKernel_AllReduce_Sum_u32_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
0.2 512128718 74112 6910.2 6720.0 2208 24544 4225.8 cat_copy_func_kernel_4
0.2 494684161 61840 7999.4 4608.0 2624 26656 5036.1 sum_dim_kernel_non_inner
0.1 423398358 10114 41862.6 40608.0 39904 50048 2419.5 triton_poi_fused_mul_silu_0
0.1 395244884 20720 19075.5 7136.0 1344 229152 30226.3 sqrt_func_kernel_rank_1
0.1 389856610 20720 18815.5 6688.0 1344 229761 30029.7 add_func_tensor_scalar_kernel_rank_1
0.1 374265252 80 4678315.7 4669215.0 4654854 5058286 60766.1 fill_scalar_func_kernel_rank_1
0.1 267491494 320 835910.9 873089.5 311104 1093409 197953.7 embedding_backward_kernel
0.1 252687527 11690 21615.7 16000.0 2208 134912 18665.5 count_nonzero_kernel_1
0.1 248953630 40696 6117.4 7904.0 2624 15776 3187.6 neg_func_kernel_rank_4
0.1 199178669 10240 19451.0 18464.0 17664 28512 2021.3 _attn_bwd_preprocess
0.1 178310444 33130 5382.1 2240.0 1215 223393 19263.9 zeros_kernel
0.1 177752928 20222 8790.1 7776.0 6976 17824 1846.9 triton_poi_fused_add_0
0.1 172528842 10240 16848.5 15968.0 15616 24640 1773.1 add_func_kernel_rank_4
0.0 129051736 960 134428.9 4832.0 1984 397346 185091.0 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0 128656001 320 402050.0 402481.0 388192 410145 3474.4 true_div_func_kernel_rank_3
0.0 128654046 320 402043.9 401905.0 389121 415617 4864.4 sub_func_kernel_rank_3
0.0 126241110 320 394503.5 394496.0 391328 397889 1096.0 mul_func_kernel_rank_3
0.0 117415251 320 366922.7 366832.0 361025 371265 1334.3 void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
0.0 115360242 20216 5706.4 5039.5 4544 12288 1565.5 cos_func_kernel_rank_1
0.0 90851159 20216 4494.0 3872.0 3680 11552 1483.1 sin_func_kernel_rank_1
0.0 69295147 26100 2655.0 2080.0 1376 12416 1516.5 l2_norm_kernel_2
0.0 59334903 320 185421.6 185407.5 180864 191808 1901.7 max_kernel
0.0 58171124 320 181784.8 181728.5 179232 184672 969.6 sum_dim_kernel_inner
0.0 51856723 634 81792.9 80448.0 80064 88064 2535.4 masked_fill_kernel_kernel_rank_3
0.0 19589908 12010 1631.1 1568.0 1312 2560 203.9 sub_func_scalar_tensor_kernel_rank_1
0.0 8699435 3840 2265.5 2304.0 2080 2560 60.4 nonzero_kernel
0.0 8570665 314 27295.1 26336.0 23616 39840 3406.7 embedding_kernel
0.0 7695050 3040 2531.3 1472.0 1344 10496 2085.6 isinf_func_kernel_rank_1
0.0 7239914 3040 2381.6 2016.0 1536 8096 994.7 isnan_func_kernel_rank_1
0.0 6594691 640 10304.2 10416.0 9280 11616 636.8 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 5971307 3840 1555.0 1568.0 1439 2048 53.2 reduce_then_scan_root_scan_kernel_row
0.0 5650567 3840 1471.5 1472.0 1344 1728 44.1 gt_func_scalar_kernel_rank_1
0.0 2415396 320 7548.1 7552.0 7328 7712 57.3 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 2041922 960 2127.0 2240.0 1376 2784 336.3 sum_kernel_1
0.0 2033763 954 2131.8 2144.0 1952 2464 57.3 masked_fill_kernel_kernel_rank_1
0.0 1907392 634 3008.5 2688.0 2272 8000 1012.8 lt_func_scalar_kernel_rank_1
0.0 1642143 960 1710.6 1728.0 1376 1984 90.8 sum_kernel_2
0.0 1623040 634 2560.0 2144.0 2080 6752 1133.9 ge_func_scalar_kernel_rank_1
0.0 1422273 634 2243.3 1952.0 1856 6144 893.7 bitwise_or_func_kernel_rank_1
0.0 1356771 640 2120.0 2112.0 1952 2464 116.1 sub_func_kernel_rank_1
0.0 1327523 634 2093.9 2080.0 1952 2272 47.7 sub_func_tensor_scalar_kernel_rank_1
0.0 1131074 640 1767.3 1760.0 1728 1824 28.5 arange_func
0.0 754400 320 2357.5 2336.0 2016 2816 114.7 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0 736416 320 2301.3 2304.0 2240 2496 51.9 log_func_kernel_rank_1
0.0 478783 320 1496.2 1504.0 1344 1536 29.7 clamp_func_min_kernel_rank_1
0.0 437536 320 1367.3 1376.0 1280 1440 48.7 ones_kernel
0.0 182208 80 2277.6 2272.0 2112 2464 78.2 vstack_kernel
0.0 182080 80 2276.0 2336.0 1824 2688 203.8 pow_func_tensor_scalar_kernel_rank_1
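
The kernel summary above has the shape of an Nsight Systems GPU kernel summary; the PR does not state how it was produced, so the following is only a hedged sketch of one way such a table can be collected (the launch script name is a placeholder, and the report name differs across nsys versions):

```bash
# Profile a training run and dump the per-kernel summary (placeholder script name).
nsys profile -o te_fl_profile bash train_qwen3_te_fl.sh
# Recent nsys releases use the report name cuda_gpu_kern_sum (older ones use gpukernsum).
nsys stats --report cuda_gpu_kern_sum te_fl_profile.nsys-rep
```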