Update dependency torch to v2.9.1 #292
This PR contains the following updates:
torch: ==2.7.0 -> ==2.9.1
Release Notes
pytorch/pytorch (torch)
v2.9.1: PyTorch 2.9.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
Tracked Regressions
Significant Memory Regression in F.conv3d with bfloat16 Inputs in PyTorch 2.9.0 (#166643)
This release provides a workaround for this issue. If you are impacted, please install the nvidia-cudnn package version 9.15+ from PyPI. (#166480) (#167111)
Torch.compile
Fix Inductor bug when compiling Gemma (#165601)
Fix InternalTorchDynamoError in bytecode_transformation (#166036)
Fix silent correctness error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586)
Improve performance by avoiding recompilation with mark_static_address with cudagraphs (#162208)
Improve performance by caching get_free_symbol_uses in torch inductor (#166338)
Fix registration design for inductor graph partition for vLLM (#166458) (#165815) (#165514)
Fix warning spamming in torch.compile (#166993)
Fix exception related to uninitialized tracer_output variable (#163169)
Fix crash in torch.bmm and torch.compile with PyTorch release 2.9.0 (#166457)
Other
Fix warning spamming on new APIs to control TF32 behavior (#166956)
Fix distributed crash with non-contiguous gather inputs (#166181)
Fix indexing on large tensor causes invalid configuration argument (#166974)
Fix numeric issue in CUDNN_ATTENTION (#166912) (#166570)
Fix symmetric memory issue with fused_scaled_matmul_reduce_scatter (#165086)
Improve libtorch stable ABI documentation (#163899)
Fix image display on pypi project description section (#166404)
v2.9.0: 2.9 Release Notes
PyTorch 2.9.0 Release Notes
Highlights
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Backwards Incompatible Changes
Min supported Python version is now 3.10 (#162310)
The minimum version of Python required for PyTorch 2.9.0 is 3.10. We also have 3.14 and 3.14t available as preview with this release.
Undefined behavior when an output of a custom operator shares storage with an input
This is a reminder that outputs of PyTorch custom operators (registered using the torch.library or TORCH_LIBRARY APIs) are not allowed to return Tensors that share storage with input tensors. Violating this condition leads to undefined behavior: sometimes the result will be correct, sometimes it will be garbage.
After #163227, custom operators that violated this condition and previously returned correct results under torch.compile may now return silently incorrect results under torch.compile. Because this changes the behavior of undefined behavior, we do not consider it a bug, but we document it here as a potentially unexpected behavior change.
This is one of the conditions checked by torch.library.opcheck and is mentioned in The Custom Operators Manual.
More details
Outputs of PyTorch custom operators are not allowed to return Tensors that share storage with input tensors
For example, the following two custom operators are not valid custom operators:
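A minimal sketch of two invalid operators, using the torch.library.custom_op decorator with hypothetical op names under a mylib namespace:

```python
import torch
from torch import Tensor

# INVALID: the output IS the input tensor
@torch.library.custom_op("mylib::identity_bad", mutates_args=())
def identity_bad(x: Tensor) -> Tensor:
    return x

# INVALID: the output is a view sharing storage with the input
@torch.library.custom_op("mylib::flatten_bad", mutates_args=())
def flatten_bad(x: Tensor) -> Tensor:
    return x.view(-1)
```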
The easiest workaround is to add an extra .clone() to the outputs, as in the first sketch below. A common way to get into this situation is for a user to want to create a custom operator that sometimes mutates the input in-place and sometimes returns a new Tensor, like in the second sketch below.
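A hedged sketch of the .clone() fix, reusing the hypothetical ops from above:

```python
import torch
from torch import Tensor

# VALID: .clone() ensures the output owns fresh storage
@torch.library.custom_op("mylib::identity_ok", mutates_args=())
def identity_ok(x: Tensor) -> Tensor:
    return x.clone()

@torch.library.custom_op("mylib::flatten_ok", mutates_args=())
def flatten_ok(x: Tensor) -> Tensor:
    return x.view(-1).clone()
```

And a sketch of the unsupported dynamic pattern (hypothetical op name):

```python
import torch
from torch import Tensor

# INVALID: sometimes returns the mutated input, sometimes a fresh Tensor
@torch.library.custom_op("mylib::maybe_inplace", mutates_args=["x"])
def maybe_inplace(x: Tensor, inplace: bool) -> Tensor:
    if inplace:
        x.add_(1)
        return x      # shares storage with the input
    return x + 1      # fresh Tensor
```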
This dynamism is not supported and leads to undefined behavior. The workaround is to split the custom operator into two custom operators, one that always mutates the input in-place, and another that always returns a new Tensor.
Build Metal kernels for MacOS-14+ and remove all pre-MacOS-14 specific logic; MacOS-14+ is required going forward (#159733, #159912)
PyTorch MPS is only supported on MacOS-14 or later. If you need to use MPS on MacOS Ventura, please avoid updating to PyTorch 2.9 or above.
Upgrade to DLPack 1.0 (#145000)
This upgrade makes the same BC-breaking changes as the DLPack 1.0 release. Objects in torch.utils.dlpack have been updated to reflect these changes, such as DLDeviceType. See the PR for details on the exact changes and how to update your code.
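A minimal round-trip sketch; torch.from_dlpack is the stable entry point, while the updated objects such as DLDeviceType live in torch.utils.dlpack per the note above:

```python
import torch

t = torch.arange(4)
u = torch.from_dlpack(t)  # zero-copy exchange via the DLPack protocol
assert u.data_ptr() == t.data_ptr()  # same storage, no copy
```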
Raise appropriate errors in torch.cat (#158249)
torch.cat now raises ValueError, IndexError, or TypeError where appropriate, instead of the generic RuntimeError. If your code was catching these errors, update it to catch the new error types; see the sketch below.
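A hedged migration sketch; catching the old RuntimeError alongside the new types keeps the helper working across versions:

```python
import torch

def cat_or_none(tensors, dim=0):
    # Pre-2.9: a generic RuntimeError; 2.9+: ValueError, IndexError,
    # or TypeError where appropriate.
    try:
        return torch.cat(tensors, dim=dim)
    except (ValueError, IndexError, TypeError, RuntimeError) as e:
        print(f"cat failed: {type(e).__name__}: {e}")
        return None
```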
Default to dynamo=True for ONNX exporter (#159646, #162726)
Previously, torch.onnx.export(...) used the legacy TorchScript exporter if no arguments were provided. The ONNX exporter now uses the newer torch.export.export pipeline by default (dynamo=True). This change improves graph fidelity and future-proofs exports, but may surface graph-capture errors that were previously masked or handled differently.
Previously in torch 2.8.0:
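A sketch with a placeholder model:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.8 default: legacy TorchScript exporter (dynamo=False)
torch.onnx.export(model, args, "model.onnx")
```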
Now in torch 2.9.0:
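The same placeholder call under the new default:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.9 default: torch.export-based path (dynamo=True);
# pass dynamo=False to fall back to the TorchScript exporter
torch.onnx.export(model, args, "model.onnx")
```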
Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream.
Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter.
Switch off runtime asserts by default in Export in favor of a shape guards function (#160111, #161178, #161794)
To enable runtime asserts, use export(..., prefer_deferred_runtime_asserts_over_guards=True). This also removes the allow_complex_guards_as_runtime_asserts flag, merging it into the former option.
Additionally, exported_program.module() will generate a call to a _guards_fn submodule that runs additional checks on inputs. Users who do not want this behavior can either remove this call from the graph or call exported_program.module(check_guards=False) to avoid the generation.
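A minimal sketch of both knobs, with a toy module standing in for a real model:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(
    M(),
    (torch.randn(4),),
    prefer_deferred_runtime_asserts_over_guards=True,  # opt back in to runtime asserts
)
m = ep.module(check_guards=False)  # skip the generated _guards_fn input checks
```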
Set default opset to 20 in ONNX (#158802)
Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23.
Previously in torch 2.8.0:
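Sketch of the old default with a placeholder model:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.8: exports targeted opset 18 by default
torch.onnx.export(model, args, "model.onnx")
```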
Now in torch 2.9.0:
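Sketch of the new default, pinning the opset where a runtime requires it:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.9: opset 20 by default; pin explicitly for older runtimes
torch.onnx.export(model, args, "model.onnx", opset_version=18)
```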
Drop draft_export in exporter API (#161454, #162225)
Remove implicit draft tracing from the default exporter path, achieving clearer behavior and faster failures.
The expensive torch.export.draft_export diagnostic path is no longer auto-invoked (it could take hours on large models). You can still opt in for deep diagnostics:
Previously in torch 2.8.0:
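A sketch of the old behavior; the implicit fallback is an assumption based on the note above:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.8: a failing graph capture could silently fall back to the
# expensive draft_export diagnostic path
torch.onnx.export(model, args, "model.onnx", dynamo=True)
```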
Now in torch 2.9.0:
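A sketch of the explicit opt-in; the exact hand-off of the draft program to the exporter may differ:

```python
import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)
# torch 2.9: invoke the deep-diagnostics path explicitly when you want it
ep = torch.export.draft_export(model, args)
torch.onnx.export(ep, args, "model.onnx")
```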
Remove torch.onnx.dynamo_export and the onnxrt torch.compile backend (#158130, #158258)
torch.onnx.dynamo_export is removed. Please use torch.onnx.export instead. The experimental ONNX Runtime compile backend (torch.compile(backend="onnxrt")) is no longer supported.
Remove torch.onnx.enable_fake_mode (#161222)
The dynamo=True mode uses FakeTensors by default, which is memory efficient.
Some public-facing ONNX utility APIs for the TorchScript-based exporter are now private (#161323)
Deprecated members in torch.onnx.verification are removed. Previously private torch.onnx.symbolic_opsets* functions will no longer be accessible. Consider making a copy of the source code if you need any private functions for compatibility with the TorchScript-based exporter.
Remove torch.onnx.symbolic_caffe2 (#157102)
Support for caffe2 in the ONNX exporter has ended and is removed.
Remove /d2implyavx512upperregs flag that slows build (#159431)
/d2implyavx512upperregsflag that slows build (#159431)Re-introduced AVX512 optimizations for Windows VS2022 builds, may cause issues with specific versions of VS2022, see #145702
Add ScalarType to shim conversion and stable::Tensor.scalar_type (#160557)
Before, user extensions could only pass around obfuscated dtypes appearing as int32_ts. Now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.
This change adds ScalarType support for user extensions and is only narrowly BC-breaking for unpopular dtypes (quint*s, qint*s, Bits*, dummy_uint*s, dummy_int*s, Float8_e8m0fnu, and Float4_e2m1fn_x2) in the use case where an extension retrieves a Tensor dtype of the above and passes it into aoti_torch_call_dispatcher.
Deprecations
Deprecate pin_memory_device param in torch.utils.data.DataLoader (#158323)
We move enabling pin_memory back inside BaseDataLoaderIter. This is required for StatefulDataLoader, which leveraged BaseDataLoaderIter directly rather than the DataLoader class init.
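A minimal migration sketch; the deprecated keyword is commented out in favor of the plain flag:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(8, 3))

# Deprecated: dl = DataLoader(ds, pin_memory=True, pin_memory_device="cuda")
dl = DataLoader(ds, pin_memory=True)  # device is handled internally now
```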
Deprecate torch.export.export_for_training API in favor of the equivalent torch.export.export API (#158203)
torch.export.export_for_training existed because we couldn't migrate internal usages of export to the final IR. Now that the migration is complete, this API is deprecated and deleted.
New Features
Python Frontend
__torch_function__ handler to be triggered by elements within a list (#160256)
torch.hash_tensor reduction function (#154149)
FX
is_fx_symbolic_tracing flag (#161385)
Dynamo
Introduces an API for annotating dynamic integer inputs and attributes for torch.compile, by wrapping plain ints with DynamicInt(). DynamicInt objects also work in eager mode, acting as their underlying values when passed as scalar inputs; see the sketch below.
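A minimal sketch of the intended usage; the import path is an assumption and may differ in your build:

```python
import torch
from torch._dynamo import DynamicInt  # assumed location; check your build

@torch.compile
def scale(x, k):
    return x * k

k = DynamicInt(4)              # mark the integer input as dynamic
out = scale(torch.ones(3), k)  # in eager mode, k acts like plain 4
```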
Optimizer
Profiler
Inductor
Export
AOTDispatcher
Quantization
ONNX
C++ Extensions
torch/csrc/stable/ops.h: amax, narrow, new_empty + new_zeros dtype variant, pad (#159328, #158974, #159508, #161597, #160214)
torch::stable::Tensor() default constructor, is_cpu, and get_device_index (#159507, #160212, #160143)
torch::stable::accelerator with support for DeviceGuard and Stream (#159679, #160453)
torch/headeronly: c10 Macros, STD_TORCH_CHECK, ScalarTypes (like BFloat16 and Half, #158035, #158365, #157912, #158377, #159302, #159414, #159412, #159415, #159411, #159911)
TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
wheel from build requirements (#158027)
TORCH_STABLE_ONLY is defined in TensorBase.h (#161658)
Build Frontend
torch/csrc/stable (#158160)
zero_() and empty_like(t) to torch/csrc/stable/ops.h (#158866)
Release Engineering
Add support for CUDA 13.0 in CI/CD builds. Enable CUDA compression mode for binary size reduction for CUDA 13.0 builds (#160956, #161073, #161257, #161663, #161316, #160201, #160770, #161013, #161916, #162268, #162322, #162383, #161833)
Enable CUDA 12.6, 12.8 and 13.0 support for Linux ARM64 CD builds (#162364, #160720, #159481)
Add support for Python 3.14 in CI/CD builds (#156889, #157559, #159261, #159869, #160593, #160788, #161255, #159725)
Enable NVSHMEM integration (#151261, #153010, #154538, #155506, #156685, #158938, #161321, #160778, #159907, #160465)
CUDA
cudnn_batch_norm_out kernel to replace the autogen approach (#123020)
CPU
MPS
avg_pool3d, max_unpool1d/2d/3d, max_pool3d, max_pool3d bwd pass, and avg_pool3d bwd pass for MPS (#158877, #159789, #156467, #157498, #159089)
ROCm
XPU
FlexAttention on Intel GPU (#143553)
Improvements
Python Frontend
torch.load under FakeTensorMode by reducing random reads (#157931)
torch.utils.benchmark.utils.timer accelerator agnostic (#157131)
torch.nn
register_buffer with Tensor-like objects (#159455)
NLLLoss (#161412)
Optimizer
requires_grad=True to a scalar (#160389)
SequentialLR deprecation warning about invoking step(epoch) (#149392)
Autograd
torch.nn.Upsample mode="trilinear" backward (#154239)
Distributed
c10d
ProcessGroupNCCL (#156748)
ProcessGroupNCCL (#156790)
ProcessGroupGloo (#158128)
TCPStore (#159165)
all_gather (#149913)
unsafe_get_ptr for dist.ProcessGroupNCCL.NCCLConfig (#161136)
send/recv_object_list (#160342)
ProcessGroupGloo (#156633)
work.isStarted (#160398)
DistributedDataParallel (DDP)
DTensor
device_mesh argument constraint in local_map (#157049)
DTensor slice (#157953)
histc op (#158298)
propagate_tensor_meta function that skips cache if _are_we_tracing (#161334)
local_map as a decorator (#161353)
Device Mesh
init_device_mesh (#159371)
FullyShardedDataParallel2 (FSDP2)
all_gather and reduce_scatter comms (#155189)
set_allocate_memory_from_process_group if used together with custom comm hooks (#157487)
reduceOp Sum when world size is 1 (#157529)
allgather when world size is 1 (#160135)
post_reduce_stream.record_event() on hsdp+cpuoffload (#160481)
Tensor Parallel (TP)
parallelize_module API to support more cases (#157182)
TensorPipe
TorchElastic
torchrun (#149334)
Pipeline Parallelism (PP)
eval() API to schedule (#157795)
OVERLAP_F_B computation type (#158978)
DualPipeV schedule (#159591)
Linear Algebra Frontend
Profiler
FX
is_impure() (#151524, #157981)
self.module_stack in ModuleStackTracer (#159956)
node_name_match to subgraph rewriter (#157574)
Dynamo
lists (e.g. #153969)
sets (e.g. #153150)
dicts (e.g. #154794)
iter (e.g. #156371)
itertools (e.g. #159693)
collections (e.g. #159365)
collections.NamedTuple (#159367)
dataclasses.dataclass (#159529)
TorchDispatchMode to ignore torch.compile internals (#161648)
Inductor
Ahead-Of-Time Inductor (AOTI)
{non-inf, > int32_max} upper bound is provided (#159433)
Export
None & ellipsis slicing/select in non-strict (#157821)
triton_kernel_wrapper_functional HOP (#161314)
while_loop HOP subgraphs (#158467)
AOTDispatcher
aot_export_joint_with_descriptors and aot_compile_joint_with_descriptors (#158715)
prepare_aot_module_simplified for use in next PR (#158319)
aot_export_joint_with_descriptors (#159814)
Composability
aten.add.Scalar (#161332)
aten.expand_copy decomp (#161688)
aten.linalg_vector_norm (#155111)
aten.complex (#160894)
Quantization
HistogramObserver (#156457)
bias=None for fbgemm_linear_fp16_weight CPU op (#158535)
wrapped_fbgemm_linear_fp16_weight for Sigmoid (#160451)
Nested Tensor (NJT)
log_softmax() support (#159662)
Foreach
vector.reserve() consistently for non-inplace foreach operations (#161128)
has_integral_tensor() (#161042)
ONNX
torch.tensor warning in ONNX symbolic_opset10 export (#158835)
C++ Frontend
AllocatorConfig to be device-agnostic via new AcceleratorAllocatorConfig (#149601, #150312)
Scalar::isUnsigned() method (#159877)
ModelRunner from nativert as public (#159989)
torch.binomial enforcing float inputs (#157658)
Build Frontend
Dependencies.cmake (#159702)
libtorch without NVSHMEM (#160910)
Release Engineering
CUDA
repeat_interleave kernel (#157996)
kernelLaunchCheck to print help string (#158896)
Wunused-but-set-variable (#159276)
libc++ (#161101)
CPU (AArch64)
MPS
shifted_chebyshev_polynomial_[tuvw], igamma/igammac, grid_sampler_3d, native_dropout/native_dropout_backward (#157488, #161927, #160541, #162108)
index_put to complex types (#160159)
addmm to integral types (#160270)
kthvalue (#161817)
logcumsumexp metal kernel (#156858)
dlpack integration (#158888)
avg_pool2d to use Metal kernel when ceil_mode=True (#161011)
ROCm
composable_kernel (CK) backend user interface to improve user experience (#152951)
rocSOLVER for Cholesky inversion (#157154)
torch.backends.miopen.immediate to toggle MIOpen Immediate Mode instead of relying on deterministic=True and benchmark=False (#158951)
reshape_ or unexpectedly change memory formats (#161687)
XPU
device_id to Intel GPU properties to distinguish iGPUs with identical names (#156481)
Bug Fixes
Python Frontend
torch.utils.cpp_extension.load_inline to override gencode (#156850)
max_width computation in Tensor printing (#126859)
pin_memory error message on CPU-only systems (#159994)
F.embedding DTensor-aware (#162117)
Autograd
torch.autograd.Function memory leak due to torch.utils.checkpoint early stopping (#161171)
torch.autograd.graph.GradientEdge for torch.autograd.Function (#160098)
Distributed
c10d
setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)
batch_isend_irecv with 2D tensor views by making P2P tensors dense (#163719)
allgather/reducescatter inputs (#163712)
Device Mesh
DistributedDataParallel (DDP)
DDPOptimizer and donated buffers (#160745)
DTensor
OpSchema equality check (#161231)
grouped_mm strategy for invalid stride cases (#158245)
F.one_hot in DTensor (#162307)
ShardingPropagation cache if compiling (#156868)
FullyShardedDataParallel (FSDP)
pin_memory (#157147)
NO_SHARD correctly by flattening tensors before copying (#154369)
FullyShardedDataParallel2 (FSDP2)
fsdp_pre_all_gather (#160817)
set_reduce_scatter_divide_factor errors and MixedPrecisionPolicy (#155964)
Pipeline Parallelism (PP)
no_grad() (#159293)
eval() (#159475)
TensorPipe
import torch if compiled without TensorPipe (#159461)
TorchElastic
torch.distributed.elastic.multiprocessing.start_processes() (#160396)
Linear Algebra Frontend
Profiler
FX
split_module with symint (#160093)
getattr_recursive with ModuleList (#161204)
torch.Tensor (#162224)
Dynamo
torch.compiler.reset() (#156527)
torch.nn
FlexAttention (#163677)
return_lse warning message in FlexAttention (#163578)
FlexAttention head broadcast (#163426)
Inductor
constant_pad_nd (#159878)
MutationOutput Buffer (#162020)
NaN behavior (#159308)
dtype consistency (#160851)
FallbackKernel alias function to avoid incorrect aliasing for custom ops (#163227)
Ahead-Of-Time Inductor (AOTI)
load_constants (#161887)
gen_aoti_c_shim (#159904)
aoti_torch_as_strided (#162118)
wait_tensor returned tensor (#159502)
all_reduce (#159818)
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
To execute skipped test pipelines, write the comment /ok-to-test.
Documentation
Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.