Releases: MooreThreads/torch_musa

torch_musa Release v2.9.0

17 Mar 06:52
1dc7872

Release Note

Hi all, torch_musa v2.9.0 is now available. Along with torch 2.9.0, we have enhanced the user experience and added a number of new features. This release supports Context Parallel in FSDP2, sparse-related operators, and the "reduce-overhead" mode for torch.compile. Since torch_musa 2.9.0, GEMM kernels are computed in FP32 by default; users can set the environment variable TORCH_ALLOW_TF32_MUBLAS_OVERRIDE=1 or the Python global setting 'torch.backends.musa.matmul.allow_tf32 = True' to enable TF32 computation.
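The two TF32 toggles above can be sketched as follows. The environment-variable path is plain Python; the `torch.backends.musa` flag assumes torch and torch_musa are installed, so it is shown commented out:

```python
import os

# Option 1: environment variable, picked up when muBLAS initializes.
# Set it before torch_musa is imported.
os.environ["TORCH_ALLOW_TF32_MUBLAS_OVERRIDE"] = "1"

# Option 2: the Python global flag (requires torch + torch_musa installed):
#   import torch, torch_musa
#   torch.backends.musa.matmul.allow_tf32 = True
```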

We also vendor kineto as a third_party repository of torch_musa; note that this is not the official kineto but a musified fork.

Please build torch_musa v2.9.0 on the MUSA platform with MUSA SDK >= 4.3.2.

Enhancements

Operators

  • Support torch.arange with Double dtype
  • Fix BatchNorm producing NaN outputs
  • Optimize performance of embedding_bag
  • Support complex dtypes for index_select and index_put
  • Support some Sparse Tensor operators
  • Support some special operators
  • Fix empty tensor creation error with pin_memory=True
  • Add W8A8 matmul kernel

New Features

  • Support torch.compile with mode="reduce-overhead"
  • Support Context Parallel (Ulysses) in FSDP2
  • Support DLPack for torch.Tensor, enabling zero-copy exchange with other libraries

Known and Blocked Issues

  • Kernels generated by torch.compile may perform worse than in torch_musa v2.7.0

Please feel free to contact us with any issues or questions.

torch_musa Release v2.7.1

19 Jan 12:21
0bc05bf

torch_musa v2.7.1 bug fix release

torch_musa v2.7.1 is now available. This is an enhanced version of v2.7.0, aimed at fixing issues, adding more operators, and optimizing FSDP performance.

Enhancements

Operators:

  • Fix error of BCELoss with non-contiguous inputs;
  • Fix error when a tensor is divided by scalar 1;
  • Fix torch.conj running into an infinite loop;
  • Fix empty param_group when the optimizer was initialized with CPU tensors;
  • Many more operators are supported; check ops_list.md for details;

Features:

  • Configurable overlap strategies in FSDP2. We've introduced the TORCH_MUSA_FSDP2_OVERLAP_LEVEL environment variable to let you control how communication overlaps with computation, enabling explicit trade-offs between memory usage and performance. Available overlap strategies are listed below:
    • 0 (NO_OVERLAP): No overlap; mostly for experimental usage
    • 1 (OVERLAP_FSDP_COMM_ONLY): Overlap FSDP collectives only, with lower memory usage
    • 2 (OVERLAP_FSDP_COMM_COPY_IN_WITH_COPY_OUT): Overlap communication/input copies with computation
    • 3 (OVERLAP_FSDP_COMM_COPY_IN_WITH_COMM): Overlap communication and input copies, but overlapping the inter-node all-reduce with computation is disabled
    • 4 (OVERLAP_HSDP_COMM): Maximum communication overlap, which is PyTorch's default setting
  • Expose StreamContext in torch_musa;
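Selecting an overlap level is a matter of setting the environment variable before torch_musa initializes FSDP2; a sketch (level 2 chosen here purely for illustration):

```python
import os

# Choose an overlap level from the table above (0-4); level 2 overlaps
# communication and input copies with computation.
os.environ["TORCH_MUSA_FSDP2_OVERLAP_LEVEL"] = "2"

level = int(os.environ["TORCH_MUSA_FSDP2_OVERLAP_LEVEL"])
assert level in range(5), "valid levels are 0..4"
```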

Enjoy.

torch_musa Release v2.7.0

20 Nov 05:57
7a6f07a

Release Note

We are excited to announce the release of torch_musa v2.7.0, based on PyTorch v2.7.1. Along with torch v2.7.1, we support more features, such as Dynamic Double Casting and Distributed Checkpointing. We have isolated the torchvision kernels from torch_musa; users who want torchvision should install it from the repo we have musified. See the README for more details.

New Features

Dynamic Double Casting

We support dynamic casting for some float64 operators. Previously, few operators supported the float64 dtype; now one can set the environment variable "export TORCH_USE_MUSA_DOUBLE_CAST=1" and torch_musa will use float32 as the compute dtype.
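A sketch of opting in; the flag must be set before torch_musa is imported so it is read at initialization time (the device usage shown in comments assumes torch_musa is installed):

```python
import os

# Enable dynamic double casting: float64 ops then run with a float32
# compute dtype on MUSA. Set before importing torch_musa.
os.environ["TORCH_USE_MUSA_DOUBLE_CAST"] = "1"

# With torch_musa installed, a float64 op would then execute in float32:
#   import torch, torch_musa
#   x = torch.randn(8, dtype=torch.float64, device="musa")
#   y = x.sum()
```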

Distributed Checkpointing

We enable Distributed Checkpointing, including asynchronous checkpoint save, which supports loading and saving models from multiple ranks in parallel. It can significantly accelerate the saving and loading of checkpoints.

MUSAExtension 'load'

We support the "load" method for compiling MUSA extensions on the fly, which is quite useful for third-party libraries that can be installed on many platforms; at execution time, kernels are compiled (or not) depending on the platform environment.

Enhancements

Operators

  • We added Poisson, binomial, _standard_gamma, _sample_dirichlet, vdot, upsample (1d, 2d, 3d, with aa), flash_attention, transformer_encoder_layer, and more; the number of supported MUSA-specific operators now exceeds 1050;
  • We improved profiler (kineto) stability and upgraded the musified kineto to version 2.7.0;
  • We optimized memory usage for pipeline parallelism in FSDP2;
  • We supported more quantized operators, usable in our model compression toolkit (to be released soon);

Features

  • torch.compile and AOTInductor are both enhanced through the torch upgrade;
  • TF32 is enabled by default;
  • We keep improving the stability of torch_musa by fixing potential bugs in some MUSA kernels;

Known Issues

  • Some FFT operators are worked around by offloading to the CPU; this will be fixed in the next release.

Enjoy.

torch_musa Release v2.5.0

21 Oct 08:05
0dbf6f1

Release Note

torch_musa v2.5.0 is now available. We now match torch_musa's version with PyTorch's, integrate the muSolver and muFFT libraries into torch_musa, and support UMM for Unified Memory devices. We keep improving compatibility with the latest MUSA SDK, so this release of torch_musa can be built with MUSA SDK 4.2.0-4.3.0 and later versions. The number of supported operators in torch_musa has increased to over 1000.

New Features

Support UMM for M1000

The Arm architecture employs a UMA (Unified Memory Addressing) design, enabling both the GPU and CPU to access a single, shared physical memory space. To optimize memory consumption during model execution on M1000, this implementation enables:

  • Elimination of duplicate memory allocation on GPU
  • Reduction of memory copy between host and device
  • Direct GPU access to memory originally allocated by CPU allocator

We propose Unified Memory Management support for the MUSA backend, which avoids GPU memory allocation in torch.load(map_location="musa"). This feature can be enabled by setting the environment variable: export PYTORCH_MUSA_ALLOC_CONF="cpu:unified".
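The same opt-in expressed from Python; as with the other allocator settings, it must happen before torch_musa initializes:

```python
import os

# Opt in to unified memory on UMA devices (e.g. M1000): torch.load with
# map_location="musa" then reuses the CPU allocation directly instead of
# allocating and copying into separate GPU memory.
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "cpu:unified"
```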

Enhancements

Operators

  • Support ilshift, irshift, replication_pad1d_bwd, angle, ctcLossTensor, ctcLossTensorBwd, logit, amin/amax/prod.dim_int, glu_bwd, etc.;
  • Support some basic Sparse (CSR) operations;
  • Add support for more quantized operators;
  • Fix torch.norm shape error;
  • Support reduce_sum with uint8 input and int64 output dtypes;
  • Support tensor.is_musa() in C++ extensions;
  • Fix argmax/argmin with empty input;

Performances

  • Optimize performance of var/std, pad, convolution3d, and layer_norm;

Functionality

  • Enable torch.musa.mccl.version();
  • Support getCurrentMUSABlasHandle and getCurrentMUSABlasLtHandle;
  • Optimize memory consumption of FSDP2 pipeline parallelism;

Known Issues

  • Complex dtype operators are not fully supported yet; some operators are worked around on the CPU.

Enjoy.

torch_musa Release v2.1.1

09 Sep 13:19
973ed69

torch_musa v2.1.1 bug fix release

torch_musa v2.1.1 is now available. This is an enhanced version of v2.1.0, aimed at fixing issues discovered during projects and improving core features. Despite some known issues, complete functional/integration tests have passed on MUSA 4.2.0. The number of natively supported operators has increased to over 948.

New Features

  • Support the musagraphs backend for torch.compile, reducing host overhead and delivering end-to-end acceleration from MUSA graphs.
  • muSolver has been integrated into the backend of several linalg operators, including lu_factor_ex, lu_solve, solve_ex, cholesky_ex, and more.
  • FusedAdamW/FusedAdam on MUSA are available on DTensor and other Tensor variants based on the torch_dispatch mechanism.
  • The benchmark module has been expanded to include more operator cases.
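The backend is selected through the usual torch.compile entry point. A device-free sketch of the call shape (using the built-in "eager" backend so it runs without a GPU; per the note above, on MUSA hardware one would pass backend="musagraphs" instead):

```python
import torch

# torch.compile call shape; swap backend="eager" for backend="musagraphs"
# on MUSA hardware to capture kernels into a MUSA graph.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
compiled = torch.compile(model, backend="eager")
out = compiled(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8])
```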

Enhancements

  • Fixed the occurrence of 0 values in exponential, inspired by Intel MKL vRngExponential(...)
  • Ensured early return for some 0-numel op cases
  • Optimized one-hot by eliminating redundant preprocessing logic
  • Added rrelu_with_noise/nansum; RoPE supports multi-latent
  • Extended SDPA to no-batch inputs; enabled mask-grad only for the math backend
  • Fixed scatter_reduce crash and cross-entropy with none-mode cases
  • Improved bandwidth of binary ops when the RHS is not last-dim-contiguous

torch_musa Release v2.1.0

17 Jul 11:40
8ee39bb

Release Note

We are excited to announce the release of torch_musa v2.1.0, based on PyTorch v2.5.0. This release delivers optimized performance and flexibility across key PyTorch components on the MUSA platform.
We support AOTInductor and FSDP2, adapted them to our Memory Management and Triton-MUSA, and improved the performance of many operators. The number of supported operators in torch_musa has increased to over 930. We've also simplified MUSA integration with automatic torch_musa loading; users are no longer required to call "import torch_musa" in Python scripts.

New Features

AOT Inductor

MUSA-backend support is now integrated into AOTInductor, enabling models to be ahead-of-time compiled for MUSA devices. This allows seamless inference acceleration via both C++ and Python runtimes, streamlining deployment on MUSA hardware.

FSDP2

FSDP2 features DTensor-based per-parameter sharding with Moore Threads GPU optimizations, enabling hardware-accelerated distributed training through custom sharding strategies and native mixed precision for large models.

Memory Management

We are pleased to introduce a pluggable memory allocator backend for MUSA, providing greater flexibility and customization for memory management in your applications.

Triton-MUSA (reland)

Reintroduces the MUSA integration with TorchInductor, based on PyTorch 2.5, with reduced device-specific code.

Enhancements

Operators

We keep adding more operators and dtypes to expand our capability to support more types of DL models. We currently support more than 930 operators, which lets us deploy most DL models from both industry and academia.

  • Math Ops: _masked_softmax, tril_indices, triu_indices, trace, ...
  • Statistical: nanmedian, normal, huber_loss, cauchy, log_normal,...
  • NN Ops: native_batch_norm, reflection_pad, fractional_max_pool, ...
  • Advanced Math: cosh, erfc, lgamma, digamma, polygamma,...

Performances

We've optimized quantization operators and enhanced the split and chunk operators. We added a fused cross-entropy loss implementation, which helps reduce peak memory usage. And many more, too numerous to list individually here.

Build

The MUSA backend now automatically initializes with torch, with no manual imports or environment setup required. We also revamped the CMake build system to seamlessly integrate MUSA-accelerated Torch libraries into C++ projects through modern target-based dependency management.

Enjoy.

torch_musa Release v2.0.1

26 Jun 04:56
008913f

torch_musa v2.0.1 Release, bug fix release

Enhancements and bug fixes, including:

  1. Fixed device index error of aten::_scaled_mm
  2. Fixed runtime error of aten::all.dim
  3. Cherry-picked the security enhancement for torch.load(*, weights_only=True)
  4. Deprecated porting PyTorch headers to MUSA_PORT_xxx
  5. Added support for more operators

torch_musa Release v2.0.0

27 Apr 09:24
8c8e412

Release Note

We are excited to announce the release of torch_musa v2.0.0, based on PyTorch v2.2.0.

In this release, we support MUSA virtual memory management, torch.compile + TorchInductor with the Triton backend, fused modules with higher performance such as SwiGLU and RoPE, and MUSAGraph for architectures newer than QY2, and we have improved the performance of many operators. The number of supported operators in torch_musa has increased to over 760.

With these basic features and operators, torch_musa could support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.

New Features

VMM (Virtual Memory Management)

We have implemented the ExpandableSegment memory allocator based on the MUSA VMM API, which effectively mitigates GPU memory fragmentation and reduces peak memory consumption during model training, especially in LLMs training scenarios such as using FSDP, DeepSpeed and Megatron-LM.

MUSAGraph

We have implemented the MUSAGraph interface, which is consistent with CUDAGraph. It captures a sequence of MUSA kernels into a graph and provides a mechanism to launch those kernels through a single CPU operation, thereby reducing launch overhead. NOTE: it currently supports computational logic only (no MCCL support) and is still an experimental feature in the MUSA runtime.
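Since MUSAGraph is stated to be consistent with CUDAGraph, the capture/replay idiom can be sketched with the CUDA-side API (guarded so it only executes on a GPU; on MUSA the analogous classes would be the torch_musa counterparts):

```python
import torch

# Graph capture records kernels without executing them; replay() then
# launches the whole recorded sequence with a single CPU call.
if torch.cuda.is_available():
    x = torch.zeros(4, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = x * 2 + 1            # recorded into the graph, not run yet
    x.copy_(torch.ones(4, device="cuda"))
    g.replay()                   # one launch; y now holds x * 2 + 1
    print(y)
```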

torch.compile for MUSA

We have integrated the triton_musa backend into TorchInductor and implemented partial adaptations for TorchDynamo, enabling users to accelerate both model training and inference through PyTorch's torch.compile interface.

Fused modules & functionals

We support the customized fused modules torch.nn.RoPE, torch.nn.SwishGLU, and FusedCrossEntropy, which can be used in LLMs to accelerate training and inference.

FP8 support

We support the FP8 dtype for matmul and distributed communication in torch_musa, for architectures newer than QY2.

Enhancements

Operators

We keep adding more operators and dtypes to expand our capability to support more types of DL models. We currently support more than 760 operators, which lets us deploy most widely used DL models.

Build

We support multi-arch compilation: one can build torch_musa on any arch of the MTGPU platform and then run it on other platforms.

Enjoy.

torch_musa Release v1.3.2

23 Apr 03:26
af1592c

Release Notes

We are excited to release torch_musa v1.3.2 based on PyTorch v2.2.0!

In this release, we support torch_musa running on multiple archs and introduced FP8 matmul, as well as torch.compile on the MUSA backend, both useful for accelerating training/inference tasks. Another highlight is that users can implement their own customized operators using torch.library through the Python frontend; with the support of triton_musa, this gives more flexibility to implement high-efficiency operators. For training tasks, we support FusedAdam, which is highly recommended for LLM training. In addition, we have now adapted more than 700 operators.

With these basic features and operators, torch_musa could support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.

Features

New features

  • Support torch_musa on multiple archs, and optimize compiler flags for better performance;
  • Support torch.library: users can now implement their own kernels and operators using torch.library, which is compatible with triton_musa;
  • Support FP8 matmul;
  • Support the MUSA backend for torch.compile and TorchInductor, a highly recommended feature of PyTorch that we now have on MUSA;
  • Support the FusedAdam optimizer, which performs better than the original, with some custom optimizations included;
  • Support TCPStore with the libuv backend;
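The torch.library point above can be sketched with the Python-only registration API available in the PyTorch 2.2 era (the "demo" namespace and op name are made up for illustration); a MUSA or triton_musa kernel would be registered the same way under its own dispatch key:

```python
import torch

# Define a new operator schema in a custom namespace and register a CPU
# implementation; the dispatcher then routes torch.ops.demo.scaled_add.
lib = torch.library.Library("demo", "DEF")
lib.define("scaled_add(Tensor a, Tensor b, float alpha) -> Tensor")

def scaled_add_cpu(a, b, alpha):
    return a + alpha * b

lib.impl("scaled_add", scaled_add_cpu, "CPU")

out = torch.ops.demo.scaled_add(torch.ones(3), torch.ones(3), 2.0)
print(out)  # tensor([3., 3., 3.])
```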

Operator support

  • New operators: torch.std, rmsnorm.out, reflection_pad, torch.mish, torch.logsigmoid
  • New dtypes supported:
    • int input with float output for torch.sum
    • Long for torch.histc
    • Bool for torch.index_select
    • Bool for torch.add
    • Int and float dtypes for torch.masked_select and torch.masked_scatter

Bugs fixed & Enhancements

  • Fix ARM platform failing to link libmusa_kernels.so
  • Fix error in the indexing kernel with negative indices
  • Fix missing dtype support in MCCL
  • Fix math SDPA with the ComputeMode setting
  • Fix low performance of torch.gather
  • Update AMP for better PrivateUse1 compatibility
  • Fix misaligned shared-memory pointer on S5000
  • Fix error of clamp with different input dtypes
  • Optimize compilation steps of torch_musa

torch_musa Release v1.3.0

05 Nov 03:44
73c9f5b

Highlights

We are excited to release torch_musa v1.3.0 based on PyTorch v2.2.0. In this release, we support FSDP (Fully Sharded Data Parallel) for large model training, as well as improve the stability and efficiency of different operators. In general, we added more operators and support more Tensor dtypes for many operators on our MUSA backend.

With torch_musa v1.3.0, users can utilize most features released in PyTorch v2.2.0 on MUSA GPU, and gain more stable training and inference for many kinds of models in various fields, including the recently popular large language models.

The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.

Enhancements

FSDP

We recommend users refer to the official FSDP documentation for usage details; torch_musa provides the same experience as the original.

Operators support

1. Support operators including torch.conv_transpose_3d, torch.fmod, torch.fmax, torch.fmin, etc.

2. Support more dtypes for torch.sort, torch.unique, etc.

Documentation

We provide developer documentation describing the development environment preparation and development steps in detail.

Dockers

We provide release and development Docker images.