Update dependency torch to v2.9.1 #292
This PR contains the following updates:
torch: ==2.7.0 -> ==2.9.1
Release Notes
pytorch/pytorch (torch)
v2.9.1: PyTorch 2.9.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
Tracked Regressions
Significant Memory Regression in F.conv3d with bfloat16 Inputs in PyTorch 2.9.0 (#166643)
This release provides a workaround for this issue. If you are impacted, please install the nvidia-cudnn package version 9.15+ from PyPI. (#166480) (#167111)
Torch.compile
Fix Inductor bug when compiling Gemma (#165601)
Fix InternalTorchDynamoError in bytecode_transformation (#166036)
Fix silent correctness error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586)
Improve performance by avoiding recompilation with mark_static_address with cudagraphs (#162208)
Improve performance by caching get_free_symbol_uses in torch inductor (#166338)
Fix registration design for inductor graph partition for vLLM (#166458) (#165815) (#165514)
Fix warning spamming in torch.compile (#166993)
Fix exception related to uninitialized tracer_output variable (#163169)
Fix crash in torch.bmm and torch.compile with PyTorch release 2.9.0 (#166457)
Other
Fix warning spamming on new APIs to control TF32 behavior (#166956)
Fix distributed crash with non-contiguous gather inputs (#166181)
Fix indexing on large tensor causes invalid configuration argument (#166974)
Fix numeric issue in CUDNN_ATTENTION (#166912) (#166570)
Fix symmetric memory issue with fused_scaled_matmul_reduce_scatter (#165086)
Improve libtorch stable ABI documentation (#163899)
Fix image display on pypi project description section (#166404)
v2.9.0: 2.9 Release Notes
PyTorch 2.9.0 Release Notes
Highlights
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Backwards Incompatible Changes
Min supported Python version is now 3.10 (#162310)
The minimum version of Python required for PyTorch 2.9.0 is 3.10. We also have 3.14 and 3.14t available as preview with this release.
Undefined behavior when an output of a custom operator shares storage with an input
This is a reminder that outputs of PyTorch custom operators (registered using the torch.library or TORCH_LIBRARY APIs) are not allowed to return Tensors that share storage with input tensors. Violating this condition leads to undefined behavior: sometimes the result will be correct, sometimes it will be garbage.
After #163227, custom operators that violated this condition and previously returned correct results under torch.compile may now return silently incorrect results under torch.compile. Because this changes the behavior of undefined behavior, we do not consider it a bug, but we document it here as a potentially unexpected behavior change.
This is one of the conditions checked by torch.library.opcheck and is mentioned in The Custom Operators Manual.
More details
Outputs of PyTorch custom operators are not allowed to return Tensors that share storage with input tensors
For example, the following two custom operators are not valid custom operators:
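A minimal sketch of two invalid operators, using the torch.library.custom_op decorator with hypothetical op names under a mylib namespace:

```python
import torch
from torch import Tensor

# INVALID: the output IS the input tensor
@torch.library.custom_op("mylib::identity_bad", mutates_args=())
def identity_bad(x: Tensor) -> Tensor:
    return x

# INVALID: the output is a view sharing storage with the input
@torch.library.custom_op("mylib::flatten_bad", mutates_args=())
def flatten_bad(x: Tensor) -> Tensor:
    return x.view(-1)
```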
The easiest workaround is to add an extra .clone() to the outputs, as in the first sketch below. A common way to get into this situation is for a user to want to create a custom operator that sometimes mutates the input in-place and sometimes returns a new Tensor, like in the second sketch below.
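A hedged sketch of the .clone() fix, reusing the hypothetical ops from above:

```python
import torch
from torch import Tensor

# VALID: .clone() ensures the output owns fresh storage
@torch.library.custom_op("mylib::identity_ok", mutates_args=())
def identity_ok(x: Tensor) -> Tensor:
    return x.clone()

@torch.library.custom_op("mylib::flatten_ok", mutates_args=())
def flatten_ok(x: Tensor) -> Tensor:
    return x.view(-1).clone()
```

And a sketch of the unsupported dynamic pattern (hypothetical op name):

```python
import torch
from torch import Tensor

# INVALID: sometimes returns the mutated input, sometimes a fresh Tensor
@torch.library.custom_op("mylib::maybe_inplace", mutates_args=["x"])
def maybe_inplace(x: Tensor, inplace: bool) -> Tensor:
    if inplace:
        x.add_(1)
        return x      # shares storage with the input
    return x + 1      # fresh Tensor
```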
This dynamism is not supported and leads to undefined behavior. The workaround is to split the custom operator into two custom operators, one that always mutates the input in-place, and another that always returns a new Tensor.
Build Metal kernels for MacOS-14+ and remove all pre-MacOS-14 specific logic; MacOS-14+ is required going forward (#159733, #159912)
PyTorch MPS is only supported on MacOS-14 or later. If you need to use MPS on MacOS Ventura, please avoid updating to PyTorch 2.9 or above.
Upgrade to DLPack 1.0 (#145000)
This upgrade makes the same BC-breaking changes as the DLPack 1.0 release. Objects in torch.utils.dlpack have been updated to reflect these changes, such as DLDeviceType. See the PR for details on the exact changes and how to update your code.
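A minimal round-trip sketch; torch.from_dlpack is the stable entry point, while the updated objects such as DLDeviceType live in torch.utils.dlpack per the note above:

```python
import torch

t = torch.arange(4)
u = torch.from_dlpack(t)  # zero-copy exchange via the DLPack protocol
assert u.data_ptr() == t.data_ptr()  # same storage, no copy
```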
Raise appropriate errors in torch.cat (#158249)
torch.cat now raises ValueError, IndexError, or TypeError where appropriate, instead of the generic RuntimeError. If your code was catching these errors, update it to catch the new error types; see the sketch below.
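A hedged migration sketch; catching the old RuntimeError alongside the new types keeps the helper working across versions:

```python
import torch

def cat_or_none(tensors, dim=0):
    # Pre-2.9: a generic RuntimeError; 2.9+: ValueError, IndexError,
    # or TypeError where appropriate.
    try:
        return torch.cat(tensors, dim=dim)
    except (ValueError, IndexError, TypeError, RuntimeError) as e:
        print(f"cat failed: {type(e).__name__}: {e}")
        return None
```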
Default to dynamo=True for ONNX exporter (#159646, #162726)
Previously, torch.onnx.export(...) used the legacy TorchScript exporter if no arguments were provided. The ONNX exporter now uses the newer torch.export.export pipeline by default (dynamo=True). This change improves graph fidelity and future-proofs exports, but may surface graph-capture errors that were previously masked or handled differently.
Previously in torch 2.8.0:
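A sketch with a placeholder model:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.8 default: legacy TorchScript exporter (dynamo=False)
torch.onnx.export(model, args, "model.onnx")
```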
Now in torch 2.9.0:
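The same placeholder call under the new default:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.9 default: torch.export-based path (dynamo=True);
# pass dynamo=False to fall back to the TorchScript exporter
torch.onnx.export(model, args, "model.onnx")
```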
Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream.
Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter.
Switch off runtime asserts by default in Export in favor of a shape guards function (#160111, #161178, #161794)
To enable runtime asserts, use export(..., prefer_deferred_runtime_asserts_over_guards=True). This also removes the allow_complex_guards_as_runtime_asserts flag, merging it into the former option.
Additionally, exported_program.module() will generate a call to a _guards_fn submodule that runs additional checks on inputs. Users who do not want this behavior can either remove this call from the graph or call exported_program.module(check_guards=False) to avoid the generation.
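A minimal sketch of both knobs, with a toy module standing in for a real model:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(
    M(),
    (torch.randn(4),),
    prefer_deferred_runtime_asserts_over_guards=True,  # opt back in to runtime asserts
)
m = ep.module(check_guards=False)  # skip the generated _guards_fn input checks
```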
Set default opset to 20 in ONNX (#158802)
Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23.
Previously in torch 2.8.0:
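Sketch of the old default with a placeholder model:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.8: exports targeted opset 18 by default
torch.onnx.export(model, args, "model.onnx")
```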
Now in torch 2.9.0:
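Sketch of the new default, pinning the opset where a runtime requires it:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.9: opset 20 by default; pin explicitly for older runtimes
torch.onnx.export(model, args, "model.onnx", opset_version=18)
```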
Drop draft_export in exporter API (#161454, #162225)
Remove implicit draft tracing from the default exporter path, achieving clearer behavior and faster failures.
The expensive torch.export.draft_export diagnostic path is no longer auto-invoked (it could take hours on large models). You can still opt in for deep diagnostics:
Previously in torch 2.8.0:
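A sketch of the old behavior; the implicit fallback is an assumption based on the note above:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.8: a failing graph capture could silently fall back to the
# expensive draft_export diagnostic path
torch.onnx.export(model, args, "model.onnx", dynamo=True)
```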
Now in torch 2.9.0:
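A sketch of the explicit opt-in; the exact hand-off of the draft program to the exporter may differ:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.9: invoke the deep-diagnostics path explicitly when you want it
ep = torch.export.draft_export(model, args)
torch.onnx.export(ep, args, "model.onnx")
```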
Remove torch.onnx.dynamo_export and the onnxrt torch.compile backend (#158130, #158258)
torch.onnx.dynamo_export is removed. Please use torch.onnx.export instead. The experimental ONNX Runtime compile backend (torch.compile(backend="onnxrt")) is no longer supported.
Remove torch.onnx.enable_fake_mode (#161222)
The dynamo=True mode uses FakeTensors by default, which is memory efficient.
Some public-facing ONNX utility APIs for the TorchScript-based exporter are now private (#161323)
Deprecated members in torch.onnx.verification are removed. Previously private torch.onnx.symbolic_opsets* functions will no longer be accessible. Consider making a copy of the source code if you need any private functions for compatibility with the TorchScript-based exporter.
Remove torch.onnx.symbolic_caffe2 (#157102)
Support for caffe2 in the ONNX exporter has ended and is removed.
Remove /d2implyavx512upperregs flag that slows build (#159431)
/d2implyavx512upperregsflag that slows build (#159431)Re-introduced AVX512 optimizations for Windows VS2022 builds, may cause issues with specific versions of VS2022, see #145702
Add ScalarType to shim conversion and stable::Tensor.scalar_type (#160557)
Before, user extensions could only pass around obfuscated dtypes appearing as int32_ts. Now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.
This change adds ScalarType support for user extensions and is only narrowly BC-breaking for unpopular dtypes (quint*s, qint*s, Bits*, dummy_uint*s, dummy_int*s, Float8_e8m0fnu, and Float4_e2m1fn_x2) in the use case where an extension retrieves a Tensor dtype of the above and passes it into aoti_torch_call_dispatcher.
Deprecations
Deprecate pin_memory_device param in torch.utils.data.DataLoader (#158323)
We move enabling pin_memory back inside BaseDataLoaderIter. This is required for StatefulDataLoader, which leveraged BaseDataLoaderIter directly rather than the DataLoader class init.
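A minimal migration sketch; the deprecated keyword is commented out in favor of the plain flag:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(8, 3))

# Deprecated: dl = DataLoader(ds, pin_memory=True, pin_memory_device="cuda")
dl = DataLoader(ds, pin_memory=True)  # device is handled internally now
```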
Deprecate torch.export.export_for_training API in favor of the equivalent torch.export.export API (#158203)
torch.export.export_for_training existed because we couldn't migrate internal usages of export to the final IR. Now that the migration is complete, this API is deprecated and deleted.
New Features
Python Frontend
__torch_function__ handler to be triggered by elements within a list (#160256)
torch.hash_tensor reduction function (#154149)
FX
is_fx_symbolic_tracing flag (#161385)
Dynamo
Introduces an API for annotating dynamic integer inputs and attributes for torch.compile, by wrapping plain ints with DynamicInt(). DynamicInt objects also work in eager mode, acting as their underlying values when passed as scalar inputs; see the sketch below.
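A minimal sketch of the intended usage; the import path is an assumption and may differ in your build:

```python
import torch
from torch._dynamo import DynamicInt  # assumed location; check your build

@torch.compile
def scale(x, k):
    return x * k

k = DynamicInt(4)              # mark the integer input as dynamic
out = scale(torch.ones(3), k)  # in eager mode, k acts like plain 4
```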
Optimizer
Profiler
Inductor
Export
AOTDispatcher
Quantization
ONNX
C++ Extensions
torch/csrc/stable/ops.h: amax, narrow, new_empty + new_zeros dtype variant, pad (#159328, #158974, #159508, #161597, #160214)
torch::stable::Tensor() default constructor, is_cpu, and get_device_index (#159507, #160212, #160143)
torch::stable::accelerator with support for DeviceGuard and Stream (#159679, #160453)
torch/headeronly: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half, #158035, #158365, #157912, #158377, #159302, #159414, #159412, #159415, #159411, #159911)
TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
wheel from build requirements (#158027)
TORCH_STABLE_ONLY is defined in TensorBase.h (#161658)
Build Frontend
torch/csrc/stable (#158160)
zero_() and empty_like(t) to torch/csrc/stable/ops.h (#158866)
Release Engineering
Add support for CUDA 13.0 in CI/CD builds. Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds (#160956, #161073, #161257, #161663, #161316, #160201, #160770, #161013, #161916, #162268, #162322, #162383, #161833)
Enable CUDA 12.6, 12.8 and 13.0 support for Linux ARM64 CD builds (#162364, #160720, #159481)
Add support for Python 3.14 in CI/CD builds (#156889, #157559, #159261, #159869, #160593, #160788, #161255, #159725)
Enable NVSHMEM integration (#151261, #153010, #154538, #155506, #156685, #158938, #161321, #160778, #159907, #160465)
CUDA
cudnn_batch_norm_out kernel to replace the autogen approach (#123020)
CPU
MPS
avg_pool3d, max_unpool1d/2d/3d, max_pool3d, max_pool3d bwd pass, and avg_pool3d bwd pass for MPS (#158877, #159789, #156467, #157498, #159089)
ROCm
XPU
FlexAttention on Intel GPU (#143553)
Improvements
Python Frontend
torch.load under FakeTensorMode by reducing random reads (#157931)
torch.utils.benchmark.utils.timer accelerator agnostic (#157131)
torch.nn
register_buffer with Tensor-like objects (#159455)
NLLLoss (#161412)
Optimizer
requires_grad=True to a scalar (#160389)
SequentialLR deprecation warning about invoking step(epoch) (#149392)
Autograd
torch.nn.Upsample mode="trilinear" backward (#154239)
Distributed
c10d
ProcessGroupNCCL (#156748)
ProcessGroupNCCL (#156790)
ProcessGroupGloo (#158128)
TCPStore (#159165)
all_gather (#149913)
unsafe_get_ptr for dist.ProcessGroupNCCL.NCCLConfig (#161136)
send/recv_object_list (#160342)
ProcessGroupGloo (#156633)
work.isStarted (#160398)
DistributedDataParallel (DDP)
DTensor
device_mesh argument constraint in local_map (#157049)
DTensor slice (#157953)
histc op (#158298)
propagate_tensor_meta function that skips cache if _are_we_tracing (#161334)
local_map as a decorator (#161353)
Device Mesh
init_device_mesh (#159371)
FullyShardedDataParallel2 (FSDP2)
all_gather and reduce_scatter comms (#155189)
set_allocate_memory_from_process_group if used together with custom comm hooks (#157487)
reduceOp Sum when world size is 1 (#157529)
allgather when world size is 1 (#160135)
post_reduce_stream.record_event() on hsdp+cpuoffload (#160481)
Tensor Parallel (TP)
parallelize_module API to support more cases (#157182)
TensorPipe
TorchElastic
torchrun (#149334)
Pipeline Parallelism (PP)
eval() API to schedule (#157795)
OVERLAP_F_B computation type (#158978)
DualPipeV schedule (#159591)
Linear Algebra Frontend
Profiler
FX
is_impure() (#151524, #157981)
self.module_stack in ModuleStackTracer (#159956)
node_name_match to subgraph rewriter (#157574)
Dynamo
lists (e.g. #153969)
sets (e.g. #153150)
dicts (e.g. #154794)
iter (e.g. #156371)
itertools (e.g. #159693)
collections (e.g. #159365)
collections.NamedTuple (#159367)
dataclasses.dataclass (#159529)
TorchDispatchMode to ignore torch.compile internals (#161648)
Inductor
Ahead-Of-Time Inductor (AOTI)
{non-inf, > int32_max} upper bound is provided (#159433)
Export
None & ellipsis slicing/select in non-strict (#157821)
triton_kernel_wrapper_functional HOP (#161314)
while_loop HOP subgraphs (#158467)
AOTDispatcher
aot_export_joint_with_descriptors and aot_compile_joint_with_descriptors (#158715)
prepare_aot_module_simplified for use in next PR (#158319)
aot_export_joint_with_descriptors (#159814)
Composability
aten.add.Scalar (#161332)
aten.expand_copy decomp (#161688)
aten.linalg_vector_norm (#155111)
aten.complex (#160894)
Quantization
HistogramObserver (#156457)
bias=None for fbgemm_linear_fp16_weight CPU op (#158535)
wrapped_fbgemm_linear_fp16_weight for Sigmoid (#160451)
Nested Tensor (NJT)
log_softmax() support (#159662)
Foreach
vector.reserve() consistently for non-inplace foreach operations (#161128)
has_integral_tensor() (#161042)
ONNX
torch.tensor warning in ONNX symbolic_opset10 export (#158835)
C++ Frontend
AllocatorConfig to be device-agnostic via new AcceleratorAllocatorConfig (#149601, #150312)
Scalar::isUnsigned() method (#159877)
ModelRunner from nativert as public (#159989)
torch.binomial enforcing float inputs (#157658)
Build Frontend
Dependencies.cmake (#159702)
libtorch without NVSHMEM (#160910)
Release Engineering
CUDA
repeat_interleave kernel (#157996)
kernelLaunchCheck to print help string (#158896)
Wunused-but-set-variable (#159276)
libc++ (#161101)
CPU (AArch64)
MPS
shifted_chebyshev_polynomial_[tuvw], igamma/igammac, grid_sampler_3d, native_dropout/native_dropout_backward (#157488, #161927, #160541, #162108)
index_put to complex types (#160159)
addmm to integral types (#160270)
kthvalue (#161817)
logcumsumexp metal kernel (#156858)
dlpack integration (#158888)
avg_pool2d to use Metal kernel when ceil_mode=True (#161011)
ROCm
composable_kernel (CK) backend user interface to improve user experience (#152951)
rocSOLVER for Cholesky inversion (#157154)
torch.backends.miopen.immediate to toggle MIOpen Immediate Mode instead of relying on deterministic=True and benchmark=False (#158951)
reshape_ or unexpectedly change memory formats (#161687)
XPU
device_id to Intel GPU properties to distinguish iGPUs with identical names (#156481)
Bug Fixes
Python Frontend
torch.utils.cpp_extension.load_inline to override gencode (#156850)
max_width computation in Tensor printing (#126859)
pin_memory error message on CPU-only systems (#159994)
F.embedding DTensor-aware (#162117)
Autograd
torch.autograd.Function memory leak due to torch.utils.checkpoint early stopping (#161171)
torch.autograd.graph.GradientEdge for torch.autograd.Function (#160098)
Distributed
c10d
setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)
batch_isend_irecv with 2D tensor views by making P2P tensors dense (#163719)
allgather/reducescatter inputs (#163712)
Device Mesh
DistributedDataParallel (DDP)
DDPOptimizer and donated buffers (#160745)
DTensor
OpSchema equality check (#161231)
grouped_mm strategy for invalid stride cases (#158245)
F.one_hot in DTensor (#162307)
ShardingPropagation cache if compiling (#156868)
FullyShardedDataParallel (FSDP)
pin_memory (#157147)
NO_SHARD correctly by flattening tensors before copying (#154369)
FullyShardedDataParallel2 (FSDP2)
fsdp_pre_all_gather (#160817)
set_reduce_scatter_divide_factor errors and MixedPrecisionPolicy (#155964)
Pipeline Parallelism (PP)
no_grad() (#159293)
eval() (#159475)
TensorPipe
import torch if compiled without TensorPipe (#159461)
TorchElastic
torch.distributed.elastic.multiprocessing.start_processes() (#160396)
Linear Algebra Frontend
Profiler
FX
split_module with symint (#160093)
getattr_recursive with ModuleList (#161204)
torch.Tensor (#162224)
Dynamo
torch.compiler.reset() (#156527)
torch.nn
FlexAttention (#163677)
return_lse warning message in FlexAttention (#163578)
FlexAttention head broadcast (#163426)
Inductor
constant_pad_nd (#159878)
MutationOutput Buffer (#162020)
NaN behavior (#159308)
dtype consistency (#160851)
FallbackKernel alias function to avoid incorrect aliasing for custom ops (#163227)
Ahead-Of-Time Inductor (AOTI)
load_constants (#161887)
gen_aoti_c_shim (#159904)
aoti_torch_as_strided (#162118)
wait_tensor returned tensor (#159502)
all_reduce (#159818)
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
To execute skipped test pipelines, write the comment /ok-to-test.
Documentation
Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.