@mikethegoblin
PR Category

Train

PR Types

New Features

PR Description

This PR introduces the capability to dynamically tune the FlagCX communication config in FlagScale based on end-to-end training performance. FlagCX tuning is triggered after FlagScale finds the best parallelization strategy, and its goal is to find the best communication config for that strategy during training.

cyber-pioneer and others added 30 commits May 20, 2025 20:37
…s-ai#524)

### Description
<img width="1173" alt="image"
src="https://github.com/user-attachments/assets/a446ee20-38e7-49fa-b8f5-76e1d43cec2f"
/>



**How to use:**
Set the environment variable `SCHEDULING_STRATEGY` to choose a load-balance strategy. The currently supported strategies are `slo`, `robin`, and `random`; the default is `slo`.
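
For example, a minimal sketch (the serve launch itself is only a placeholder here, not a literal command from this PR):

```bash
# Pick the round-robin load-balance strategy; `slo` is used when the variable is unset.
export SCHEDULING_STRATEGY=robin
# ...then launch the FlagScale serve command as usual...
```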

**Todo List:**

- Support offline profiling to obtain the compute capacity ratio across
different machines

- Support auto-tuning to determine the optimal P/D ratio and deployment
distribution

- Enable disaggregated P/D deployment on heterogeneous machines
…-ai#534)

This PR adds the ability to specify a commit for the submodule and optionally apply the FlagScale adaptation. Usage is as follows:
`python tools/patch/unpatch.py --backend vllm --backend-commit <vllm_commit> --no-fs-extension`
`--backend-commit` indicates the commit to which the submodule is automatically reset after a submodule update. `--no-fs-extension` means the FlagScale adaptation is not applied and the submodule remains in its original state.

Co-authored-by: caozhou <[email protected]>
…agos-ai#532)

When running the test, compile and install vllm to synchronize code changes and ensure that the latest vllm code is tested.
The vllm source code is modified to call FlagGems.

The added inference tests using FlagGems include:
1. deepseek_gems
2. qwen3_gems
…agos-ai#539)

## Description
`python run.py --config-path ./examples/deepseek_r1/conf --config-name serve action=stop`
Usage: `export USE_FLAGGEMS=True <commands>`
The vllm source code is modified to call FlagGems.

The added inference tests using FlagGems include:
1. deepseek_flaggems
2. qwen3_flaggems
1. Support serve with SGLang
2. Support auto-tune with SGLang
3. Do some adaptation of serve and auto-tune for SGLang; delete 6 common args
4. Add profile report for qwen3-0.6b with 3 backends

---------

Co-authored-by: MC952-arch <[email protected]>
Add the flagscale test instruction set. After installing FlagScale using pip, you can use the `flagscale` command to run the following tests:

Unit Testing

    1. flagscale test --unit --backend ${BACKEND} --subset ${SUBSET}
    2. flagscale test --unit-all

Functional testing

    1. flagscale test --functional --type ${TYPE} --task ${TASK}
    2. flagscale test --functional-all

All testing

    flagscale test


Please read tests/README.md for help on:
1. How to set the above parameters.
2. Data files that need to be configured in advance for functional testing.
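
For example, invocations might look like the following (the backend, subset, type, and task values are illustrative placeholders only; see tests/README.md for the valid options):

```bash
# Run the unit tests of one backend subset (values are illustrative)
flagscale test --unit --backend megatron --subset runner

# Run one functional test (values are illustrative)
flagscale test --functional --type train --task aquila
```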
**This update covers:**

1. **Environment Installation Path:**  
- Run the environment installation from the root directory instead of
the install folder.

2. **Foolproof Design for Image Building:**
   - Require explicit specification of the FlagScale version in the Dockerfile to avoid interference from the Docker cache mechanism and maintain compatibility with the existing environment.
   - Update the input instructions for image building parameters.

3. **TransformerEngine Version:**  
   - Update TransformerEngine to commit `5bee81e`.

4. **Patch Mechanism:**  
   - Add the necessary Python packages for the patch mechanism.
   - Add unpatch action in the installation step.
   - Update the path of requirements according to the patch mechanism.

5. **Customization of Torch Code:**  
   - Update the method for customizing torch code.

6. **FlagGems:**  
   - Add the Python package that FlagGems depends on.
   - Add the installation process for FlagGems.

7. **Megatron Training Requirements:**  
    - Add the Python packages required for Megatron training.
This PR fixes the bug of selecting the newest commit when the backend commits are different.

Co-authored-by: caozhou <[email protected]>
This PR fixes the patch error when multiple backends are `FlagScale xxx`.

Co-authored-by: caozhou <[email protected]>
## Description
Automatically executes the inference testing pipeline after downloading new model checkpoints, verifying that the inference service can start properly and generate results.

## Usage
```bash
tests/scripts/functional_tests/test_task.sh \
    --type inference-pipeline \
    --task <model_name> \
    [--hardware <hardware_type>] \
    [--flaggems <enable/disable>]
```

### Parameters
| Parameter | Required | Default | Options | Description |
|-----------|----------|---------|---------|-------------|
| `--type` | Yes | None | `inference-pipeline` | Specify test type as inference pipeline |
| `--task` | Yes | None | Any model ckpt name (e.g. Qwen3-4B) | Target model name to test |
| `--hardware` | No | `nvidia` | `nvidia`/`bi_v150`/`cambricon_mlu` | Hardware platform for testing |
| `--flaggems` | No | `disable` | `enable`/`disable` | Whether to enable flaggems |

### Example
```bash
# Test Qwen3-4B model on BI_V150 hardware with flaggems enabled
tests/scripts/functional_tests/test_task.sh \
    --type inference-pipeline \
    --task Qwen3-4B \
    --hardware bi_v150 \
    --flaggems enable
```
Update engine args in multiple instance case
This PR updates the installation mechanism of FlagScale. Usage is as follows:
`cd FlagScale`
`PYTHONPATH=./:$PYTHONPATH pip install . --config-settings=backend=<backend> --config-settings=device=<device> --verbose --no-build-isolation`
The `backend` parameter can accept multiple values, separated by commas, such as "vllm,sglang,llama.cpp".

---------

Co-authored-by: caozhou <[email protected]>
This PR updates the installation README.

Co-authored-by: caozhou <[email protected]>
Update and optimize the Qwen2.5-VL model.
1. Support Qwen2.5-VL-32B and add the sample configuration.
2. Optimize the data processing pipeline to decrease the time consumed.
3. Other optimizations, such as memory cache, recompute strategy,
distributed parallelism, and so on.

🚀 **QuickStart Available**:  
Start your SFT training quickly by referring to the [QuickStart
Guide](https://github.com/FlagOpen/FlagScale/pull/572/files#diff-1dc2d875e1726124f3b3ba085e5ed5332d68cb088e8d3473d59d55c35bd725f1).

Based on FlagScale, BAAI has built [RoboBrain2.0](https://superrobobrain.github.io/), the most powerful open-source embodied brain model to date.
PS: Adapted from
https://github.com/alibaba/Pai-Megatron-Patch/tree/4c305eff70c5c1d30d65ceed2713acabff48ef87/examples/qwen2_5_vl
and
https://github.com/huggingface/transformers/tree/10627c1a0f6877ce6715b9537afe7fafb2a89edd/src/transformers/models/qwen2_5_vl

---------

Co-authored-by: lizhiyu <[email protected]>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
Co-authored-by: caozhou <[email protected]>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
Co-authored-by: haoranhuang-mt <haoran.huang@mthreads-gmi>
[metax] upload llama3-70b patch
This PR adds the patch history yaml for hardware and uses `python tools/patch/merge.py --backend vllm FlagScale --task inference --device-type <device> --commit <merged_in_flagscale_commit>` to add the information.
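
A filled-in invocation might look like the following (the device type and commit hash are illustrative placeholders only):

```bash
# Record that the vLLM inference patch for the bi_v150 device was merged at a given FlagScale commit
python tools/patch/merge.py --backend vllm FlagScale --task inference \
    --device-type bi_v150 --commit 0123abc
```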

Co-authored-by: caozhou <[email protected]>
heavyrain-lzy and others added 23 commits November 19, 2025 10:20
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Train
### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
Others
### PR Description
<!-- Describe what you’ve done -->
Support training Qwen2.5-VL on Huawei NPU.

---------

Co-authored-by: lzy <[email protected]>
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Serve

### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
Improvements

### PR Description
<!-- Describe what you’ve done -->
Request args now support general standard types, including builtin and `typing` types such as:
`str`, `int`, `bool`, `float`, `Optional[str]`, `Optional[int]`, `List[str]`, `List[int]`, `List[Dict[str, str]]`, `Dict[str, str]`, `Dict[str, List[int]]`, `Tuple[int, str]`

CI result:
```shell
tests/scripts/functional_tests/test_task.sh --type serve --task base
``` 
<img width="542" height="22" alt="image"
src="https://github.com/user-attachments/assets/c14d7629-1ffd-468a-8400-06cd5b6ded2a"
/>
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Inference
### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
New Models
### PR Description
<!-- Describe what you’ve done -->

Add vLLM backend support to enable efficient inference for Emu3.5 AR.
New features include:
- Batch scheduler between cond_input and uncond_input.
- Customized logits processor:
ClassifierFreeGuidanceLogitsForVisualTokenProcessor.

DONE List:
- [x] Update FlagScale vllm backend to tag-0.11.0
- [x] support Emu3.5 offline inference with v1 engine

Example:
```
[v1/core/sched/batch_manager.py:247] Set customized hw: 55 x 73 for request 0_cfg_0
[v1/core/sched/batch_manager.py:247] Set customized hw: 55 x 73 for request 1_cfg_0
INFO 11-19 15:56:22 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 21.7 tokens/s, Avg generation throughput: 103.4 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:32 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.7%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:42 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.3 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
INFO 11-19 15:56:52 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 123.1 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 0.0%
INFO 11-19 15:57:02 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 122.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 0.0%
INFO 11-19 15:57:12 [v1/metrics/loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 122.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.7%, Prefix cache hit rate: 0.0%


Processed prompts:  50%|█████     | 1/2 [01:06<01:06, 66.29s/it, est. speed input: 3.70 toks/s, output: 61.50 toks/s]
Processed prompts: 100%|██████████| 2/2 [01:06<00:00, 66.29s/it, est. speed input: 3.94 toks/s, output: 123.01 toks/s]
Processed prompts: 100%|██████████| 2/2 [01:06<00:00, 33.14s/it, est. speed input: 3.94 toks/s, output: 123.01 toks/s]
[2025-11-19 15:57:18,505 FlagScale logger.py:25 INFO] ----------------------------------------
[2025-11-19 15:57:19,608 FlagScale logger.py:25 INFO] >>> 📷[OUTPUT-0][image]: saved to outputs/emu3p5_one_image_generation/000/output_000_0.png
```

<img width="1168" height="880" alt="output_000_0"
src="https://github.com/user-attachments/assets/80debc5e-06f8-4d12-87e7-a426654403d6"
/>


To-do / Verification List:
- [ ] DataParallel
- [ ] Chunked Prefill
- [ ] Continuous Batching
- [ ] add emu3 v1 version
### PR Category
Train 

### PR Types
New Model

### PR Description
- Robobrain-X0 training support
- Add example configs and a README file

---------

Co-authored-by: lzy <[email protected]>
### PR Category
Serve

### PR Types
New Model

### PR Description
This update introduces support for the Emu3.5 model. Note that this feature depends on the [Emu3.5 Project](https://github.com/baaivision/Emu3.5). Additionally, the models rely on [Emu3.5](https://www.modelscope.cn/models/BAAI/Emu3.5/files) and [Emu3.5-VisionTokenizer](https://www.modelscope.cn/models/BAAI/Emu3.5-VisionTokenizer/files), both available on ModelScope.

**Highlights:**
1. Support for multiple instances
2. Capability for cross-node operation
3. Instance autoscaling support

---------

Co-authored-by: MC952-arch <[email protected]>
Co-authored-by: cyber-pioneer <[email protected]>
Co-authored-by: chenzhuo <[email protected]>
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Serve

### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
Bug Fixes

### PR Description
<!-- Describe what you’ve done -->
Fix the config for qwen2.5vl.
### PR Category
<!-- One of [ Train | Inference | Compress | Serve | RL | Core |
Hardware | CICD | Tools | Others ] -->
Train
### PR Types
<!-- One of [ User Experience | New Features | Bug Fixes | Improvements
| Performance | Breaking Change| Deprecations | Test Case | Docs |
Others ] -->
New Features
### PR Description
<!-- Describe what you’ve done -->
Add qwen2.5-10b and qwen3-10b model YAML configs.
### PR Category
 Serve

### PR Types
 Bug Fixes

### PR Description
Fix the bug where vllm in FlagScale cannot run model inference.
Already verified with the Qwen3-8B and Emu3.5 models.
### PR Category
Others

### PR Types
Bug Fixes

### PR Description
Remove `third_party/lerobot`; now `git submodule` works properly.
…flagos-ai#917)

### PR Category
Train
### PR Types
New Features

### PR Description
**Integrate TransformerEngine-FL into Megatron-LM for a unified training backend**

**[TransformerEngine-FL](flagos-ai/TransformerEngine-FL#1)**

**Background**:
Megatron-LM natively relies on NVIDIA TransformerEngine for distributed training, wherein its core operators—including GEMM, LayerNorm, Attention, and communication primitives—are monolithically encapsulated within the proprietary NCCL+cuBLAS stack. This initiative refactors TransformerEngine to construct a unified distributed training backend, leveraging FlagGems and FlagCX as foundational components.

**Primary Roadmap:**

- Initial Development: Leveraging FlagOS (FlagGems & FlagCX), implement core operators—including Linear (column-wise/row-wise parallel), DotProductAttention (FlashAttn), and RMSNorm—to facilitate end-to-end training of models such as Qwen3, ensuring correct convergence and performance alignment with expectations.
- Performance Optimization: Establish and realize diverse
Computation-Communication (Comp/Comm) overlap optimization schemes, such
as GEMM+SP Comm Overlap and FlashAttn+CP, for iterative performance
refinement of TransformerEngine-FL.
- Hardware Ecosystem Compatibility: Enable adaptation across multiple
hardware vendors and execute architecture-specific operator
optimizations to enhance end-to-end model training performance.

**Current Progress:**

- Development of Linear, DotProductAttention, RMSNorm, AdamW operators
has been completed.
- End-to-end convergence, as illustrated in the figure below,
demonstrates alignment with TransformerEngine.

<img width="985" height="589" alt="截屏2025-11-25 14 21 16"
src="https://github.com/user-attachments/assets/2819e011-30af-4b7b-98ab-0804d7a0ae3b"
/>


- The operator inventory is enumerated below.


Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ---------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
23.8 73434757203 123324 595462.0 565601.0 172896 4311494 459319.4 mm_kernel_general
20.7 63740396844 64878 982465.5 366879.5 97248 425961771 5327781.4 ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
18.1 55932338886 41336 1353114.4 365569.0 304385 496029996 8331941.6 ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
9.3 28677100782 2720 10543051.8 8746512.5 5851277 72716183 5706959.2 ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
6.9 21159530707 10240 2066360.4 2044545.0 2038149 2157983 38519.5 _attn_bwd
6.7 20597525353 160 128734533.5 18290506.5 45856 529057641 180623640.3 ncclDevKernel_AllReduce_Sum_u8_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
2.6 8102566635 160 50641041.5 33241872.5 49408 99884105 32285638.0 ncclDevKernel_AllReduce_Sum_f32_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
2.1 6385626731 80 79820334.1 74899619.5 22688 316885564 76672028.6 ncclDevKernel_AllReduce_Sum_u32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
2.0 6155190811 228706 26913.1 7424.0 1344 585313 52799.6 add_func_kernel_rank_1
1.2 3586448520 10108 354812.9 353185.0 351361 370112 3977.2 _attn_fwd
1.0 3040437164 156992 19366.8 3200.0 1312 530017 43798.4 mul_func_scalar_kernel_rank_1
0.7 2193838986 1520 1443315.1 42928.0 13985 34803310 5967409.2 ncclDevKernel_AllReduce_Sum_f32_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
0.5 1529975869 111980 13662.9 10816.0 4255 33856 7757.2 void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
0.3 862768767 26100 33056.3 16064.0 2048 297248 47648.7 l2_norm_kernel_1
0.3 848179397 10240 82830.0 82304.0 80992 92736 1810.9 triton_poi_fused_cat_0
0.3 841353024 41280 20381.6 15520.0 8096 48288 11868.9 rms_norm_grad_dw_kernel
0.3 838766524 42080 19932.7 2048.0 1312 231072 30059.9 true_div_func_tensor_scalar_kernel_rank_1
0.2 721676008 41280 17482.5 15328.0 6080 42560 9245.4 rms_norm_grad_dx_kernel
0.2 690642939 61792 11176.9 2880.0 1536 345825 33153.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::bfloat16_copy_kernel_cuda(at::Te…
0.2 662149297 1570 421751.1 26784.0 9504 6105775 1036900.3 ncclDevKernel_Broadcast_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
0.2 656237310 40696 16125.4 15328.0 14720 24096 1740.9 mul_func_kernel_rank_4
0.2 614338150 21440 28653.8 1984.0 1344 332161 44474.2 true_div_func_kernel_rank_1
0.2 591549664 62776 9423.2 3424.0 1312 229856 19177.0 mul_func_kernel_rank_1
0.2 562876515 40752 13812.2 8768.0 5600 43200 9282.8 rms_norm_kernel
0.2 526119217 80 6576490.2 7069194.0 24640 9801797 2067934.3 ncclDevKernel_AllReduce_Sum_u32_TREE_LL(ncclDevComm *, unsigned long, ncclWork *)
0.2 512128718 74112 6910.2 6720.0 2208 24544 4225.8 cat_copy_func_kernel_4
0.2 494684161 61840 7999.4 4608.0 2624 26656 5036.1 sum_dim_kernel_non_inner
0.1 423398358 10114 41862.6 40608.0 39904 50048 2419.5 triton_poi_fused_mul_silu_0
0.1 395244884 20720 19075.5 7136.0 1344 229152 30226.3 sqrt_func_kernel_rank_1
0.1 389856610 20720 18815.5 6688.0 1344 229761 30029.7 add_func_tensor_scalar_kernel_rank_1
0.1 374265252 80 4678315.7 4669215.0 4654854 5058286 60766.1 fill_scalar_func_kernel_rank_1
0.1 267491494 320 835910.9 873089.5 311104 1093409 197953.7 embedding_backward_kernel
0.1 252687527 11690 21615.7 16000.0 2208 134912 18665.5 count_nonzero_kernel_1
0.1 248953630 40696 6117.4 7904.0 2624 15776 3187.6 neg_func_kernel_rank_4
0.1 199178669 10240 19451.0 18464.0 17664 28512 2021.3 _attn_bwd_preprocess
0.1 178310444 33130 5382.1 2240.0 1215 223393 19263.9 zeros_kernel
0.1 177752928 20222 8790.1 7776.0 6976 17824 1846.9 triton_poi_fused_add_0
0.1 172528842 10240 16848.5 15968.0 15616 24640 1773.1 add_func_kernel_rank_4
0.0 129051736 960 134428.9 4832.0 1984 397346 185091.0 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0 128656001 320 402050.0 402481.0 388192 410145 3474.4 true_div_func_kernel_rank_3
0.0 128654046 320 402043.9 401905.0 389121 415617 4864.4 sub_func_kernel_rank_3
0.0 126241110 320 394503.5 394496.0 391328 397889 1096.0 mul_func_kernel_rank_3
0.0 117415251 320 366922.7 366832.0 361025 371265 1334.3 void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
0.0 115360242 20216 5706.4 5039.5 4544 12288 1565.5 cos_func_kernel_rank_1
0.0 90851159 20216 4494.0 3872.0 3680 11552 1483.1 sin_func_kernel_rank_1
0.0 69295147 26100 2655.0 2080.0 1376 12416 1516.5 l2_norm_kernel_2
0.0 59334903 320 185421.6 185407.5 180864 191808 1901.7 max_kernel
0.0 58171124 320 181784.8 181728.5 179232 184672 969.6 sum_dim_kernel_inner
0.0 51856723 634 81792.9 80448.0 80064 88064 2535.4 masked_fill_kernel_kernel_rank_3
0.0 19589908 12010 1631.1 1568.0 1312 2560 203.9 sub_func_scalar_tensor_kernel_rank_1
0.0 8699435 3840 2265.5 2304.0 2080 2560 60.4 nonzero_kernel
0.0 8570665 314 27295.1 26336.0 23616 39840 3406.7 embedding_kernel
0.0 7695050 3040 2531.3 1472.0 1344 10496 2085.6 isinf_func_kernel_rank_1
0.0 7239914 3040 2381.6 2016.0 1536 8096 994.7 isnan_func_kernel_rank_1
0.0 6594691 640 10304.2 10416.0 9280 11616 636.8 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 5971307 3840 1555.0 1568.0 1439 2048 53.2 reduce_then_scan_root_scan_kernel_row
0.0 5650567 3840 1471.5 1472.0 1344 1728 44.1 gt_func_scalar_kernel_rank_1
0.0 2415396 320 7548.1 7552.0 7328 7712 57.3 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 2041922 960 2127.0 2240.0 1376 2784 336.3 sum_kernel_1
0.0 2033763 954 2131.8 2144.0 1952 2464 57.3 masked_fill_kernel_kernel_rank_1
0.0 1907392 634 3008.5 2688.0 2272 8000 1012.8 lt_func_scalar_kernel_rank_1
0.0 1642143 960 1710.6 1728.0 1376 1984 90.8 sum_kernel_2
0.0 1623040 634 2560.0 2144.0 2080 6752 1133.9 ge_func_scalar_kernel_rank_1
0.0 1422273 634 2243.3 1952.0 1856 6144 893.7 bitwise_or_func_kernel_rank_1
0.0 1356771 640 2120.0 2112.0 1952 2464 116.1 sub_func_kernel_rank_1
0.0 1327523 634 2093.9 2080.0 1952 2272 47.7 sub_func_tensor_scalar_kernel_rank_1
0.0 1131074 640 1767.3 1760.0 1728 1824 28.5 arange_func
0.0 754400 320 2357.5 2336.0 2016 2816 114.7 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0 736416 320 2301.3 2304.0 2240 2496 51.9 log_func_kernel_rank_1
0.0 478783 320 1496.2 1504.0 1344 1536 29.7 clamp_func_min_kernel_rank_1
0.0 437536 320 1367.3 1376.0 1280 1440 48.7 ones_kernel
0.0 182208 80 2277.6 2272.0 2112 2464 78.2 vstack_kernel
0.0 182080 80 2276.0 2336.0 1824 2688 203.8 pow_func_tensor_scalar_kernel_rank_1
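
The kernel summary above has the shape of an Nsight Systems GPU kernel summary; the PR does not state how it was produced, so the following is only a hedged sketch of one way such a table can be collected (the launch script name is a placeholder, and the report name differs across nsys versions):

```bash
# Profile a training run and dump the per-kernel summary (placeholder script name).
nsys profile -o te_fl_profile bash train_qwen3_te_fl.sh
# Recent nsys releases use the report name cuda_gpu_kern_sum (older ones use gpukernsum).
nsys stats --report cuda_gpu_kern_sum te_fl_profile.nsys-rep
```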