
PyTorch is a widely used, open source deep learning platform for easily writing neural network layers in Python, enabling a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by researchers around the world and is now fully adopted by Facebook.
The newest stable release of PyTorch, version 1.11.0, has a number of new highlights including TorchData, functorch, Distributed Data Parallel (DDP) static graph optimizations, and more!
PyTorch 1.11.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
Highlights
The new PyTorch 1.11.0 release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, the PyTorch team released beta versions of TorchData and functorch. Here's a quick summary:
- TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. View it on GitHub.
- functorch, a library that adds composable function transforms to PyTorch, is now available in beta. View it on GitHub.
- Distributed Data Parallel (DDP) static graph optimizations available in stable.
You can check the blogpost that shows the new features here.
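As a rough illustration of what the functorch beta provides, the sketch below uses its grad and vmap transforms. It assumes functorch is installed alongside PyTorch 1.11; the function and tensors are made up for the example.

```
import torch
from functorch import grad, vmap

def loss(x):
    # functorch.grad differentiates a function that returns a scalar
    return torch.sin(x).sum()

x = torch.randn(3)
print(grad(loss)(x))            # elementwise cos(x)

# vmap turns a per-sample function into a batched one
batched_dot = vmap(torch.dot)   # (B, N) x (B, N) -> (B,)
a, b = torch.randn(8, 5), torch.randn(8, 5)
print(batched_dot(a, b).shape)  # torch.Size([8])
```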
Backwards Incompatible changes
Python API
Fixed python deepcopy to correctly copy all attributes on Tensor objects (#65584)
This change ensures that the deepcopy operation on Tensor properly copies all the attributes (and not just the plain Tensor properties).
| 1.10.2 | 1.11.0 |
|---|---|
a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# Raise AttributeError: "Tensor" object has no attribute "foo"
| a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# 3
|
steps argument is no longer optional in torch.linspace and torch.logspace
This argument used to default to 100 in PyTorch 1.10.2, but was deprecated (previously you would see a deprecation warning if you didn’t explicitly pass in steps). In PyTorch 1.11, it is no longer optional.
| 1.10.2 | 1.11.0 |
|---|---|
# Works, but raises a deprecation warning
# Steps defaults to 100
a = torch.linspace(1, 10)
# UserWarning: Not providing a value for linspace's steps is deprecated
# and will throw a runtime error in a future release.
# This warning will appear only once per process.
# (Triggered internally at ../aten/src/ATen/native/RangeFactories.cpp:19
| # In 1.11, you must specify steps
a = torch.linspace(1, 10, steps=100)
|
Remove torch.hub.import_module function that was mistakenly public (#67990)
This function is not intended for public use. If you have existing code that relies on it, you can find an equivalent function at torch.hub._import_module.
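For illustration only, a minimal sketch of the suggested replacement; it assumes the private torch.hub._import_module helper keeps the same (name, path) signature as the removed public function, and the hubconf.py path below is hypothetical.

```
import torch

# Load a Python module from an arbitrary file path, as the old public helper did.
hubconf = torch.hub._import_module("hubconf", "/path/to/hubconf.py")
```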
C++ API
We’ve cleaned up many of the headers in the C++ frontend to only include the subset of aten operators that they actually used (#68247, #68687, #68688, #68714, #68689, #68690, #68697, #68691, #68692, #68693, #69840)
When you #include a header from the C++ frontend, you can no longer assume that every aten operator is transitively included. You can work around this by directly adding #include <ATen/ATen.h> in your file, which will maintain the old behavior of including every aten operator.
Custom implementation for c10::List and c10::Dict move constructors have been removed (#69370)
The semantics have changed from "make the moved-from List/Dict empty" to "keep the moved-from List/Dict unchanged"
| 1.10.2 | 1.11.0 |
|---|---|
c10::List list1({"3", "4"});
c10::List list2(std::move(list1));
std::cout << list1.size() // 0
| c10::List list1({"3", "4"});
c10::List list2(std::move(list1)); // calls the copy constructor
std::cout << list1.size() // 2
|
CUDA
Removed THCeilDiv function and corresponding THC/THCDeviceUtils.cuh header (#65472)
As part of cleaning up TH from the codebase, the THCeilDiv function has been removed. Instead, please use at::ceil_div, and include the corresponding ATen/ceil_div.h header
Removed THCudaCheck (#66391)
You can replace it with C10_CUDA_CHECK, which has been available since at least PyTorch 1.4, so a simple replacement is enough even if you support older versions.
Removed THCudaMalloc(), THCudaFree(), THCThrustAllocator.cuh (#65492)
If your extension is using THCThrustAllocator.cuh, please replace it with ATen/cuda/ThrustAllocator.h and corresponding APIs (see examples in this PR).
This PR also removes THCudaMalloc/THCudaFree calls. Please use c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr), or, preferably, switch to c10::cuda::CUDACachingAllocator::allocate, which manages deallocation. Caching allocator APIs have been available since PyTorch 1.2, so simply replacing them is enough even if you support older versions of PyTorch.
Build
Stopped building shared library for AOT Compiler, libaot_compiler.so (#66227)
Building aot_compiler.cpp as a separate library is not necessary, as it’s already included in libtorch.so.
You can update your build system to only dynamically link libtorch.so.
Mobile
Make typing.Union type unsupported for mobile builds (#65556)
typing.Union support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and the increase in binary size it caused for PyTorch Mobile builds.
Distributed
torch.distributed.rpc: Final Removal of ProcessGroup RPC backend (#67363)
ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely.
The backend type “PROCESS_GROUP” is now deprecated, e.g. torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
and should be replaced with: torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)
Quantization
Disabled the support for getitem in FX Graph Mode Quantization (#66647)
getitem used to be quantized in FX Graph Mode Quantization, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.
| 1.10.2 | 1.11.0 |
|---|---|
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear = torch.nn.Linear(5, 5)
def forward(self, x):
x = self.linear(x)
y = torch.stack([x], 0)
return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
# (linear): QuantizedLinear(in_features=5, out_features=5,
# scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
# linear_input_scale_0 = self.linear_input_scale_0
# linear_input_zero_point_0 = self.linear_input_zero_point_0
# quantize_per_tensor = torch.quantize_per_tensor(x,
# linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
# x = linear_input_scale_0 = linear_input_zero_point_0 = None
# linear = self.linear(quantize_per_tensor)
# quantize_per_tensor = None
# stack = torch.stack([linear], 0); linear = None
# getitem = stack[0]; stack = None
# dequantize_2 = getitem.dequantize(); getitem = None
# return dequantize_2
| from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear = torch.nn.Linear(5, 5)
def forward(self, x):
x = self.linear(x)
y = torch.stack([x], 0)
return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
# (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0,
#   zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
# linear_input_scale_0 = self.linear_input_scale_0
# linear_input_zero_point_0 = self.linear_input_zero_point_0
# quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0,
#   linear_input_zero_point_0, torch.quint8)
# x = linear_input_scale_0 = linear_input_zero_point_0 = None
# linear = self.linear(quantize_per_tensor); quantize_per_tensor = None
# stack = torch.stack([linear], 0); linear = None
# dequantize_2 = stack.dequantize(); stack = None
# getitem = dequantize_2[0]; dequantize_2 = None
# return getitem
|
Users should now use fuse_modules for PTQ fusion and fuse_modules_qat for QAT fusion (#69878, #71956)
There are two types of fusion supported by the fuse_modules API: PTQ fusion and QAT fusion. Previously we relied on module.training to decide which mode the user wanted, but this was a misuse of the training attribute, since that is not its intended purpose. This change removes the dependency on module.training and uses separate APIs to make the requested fusion explicit.
Previously, fuse_modules supported both cases and distinguished PTQ/QAT fusion based on module.training; now fuse_modules only supports PTQ fusion. If a user wants QAT fusion, they need to call fuse_modules_qat instead of fuse_modules; otherwise they would silently get unwanted fusion results (PTQ fusion), or, if the model is in training mode, an error.
Note: Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.
| 1.10.2 | 1.11.0 |
|---|---|
import torch
from torch.ao.quantization import fuse_modules
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.conv = torch.nn.Conv2d(3, 3, 3)
self.bn = torch.nn.BatchNorm2d(3)
def forward(self, x):
return self.bn(self.conv(x))
m = M().train()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
<class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
<class 'torch.nn.modules.conv.Conv2d'>
| import torch
from torch.ao.quantization import fuse_modules
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.conv = torch.nn.Conv2d(3, 3, 3)
self.bn = torch.nn.BatchNorm2d(3)
def forward(self, x):
return self.bn(self.conv(x))
m = M().train()
# For Quantization Aware Training, use fuse_modules_qat()
m = fuse_modules_qat(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
# Result (doesn't change):
<class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
<class 'torch.nn.modules.conv.Conv2d'>
|
ONNX
Removed f arg from onnx.export_to_pretty_string (#69546)
The arg has always been ignored. Simply remove it from your code.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export_to_pretty_string(model, inputs, "file_name")
| torch.onnx.export_to_pretty_string(model, inputs)
|
Removed use_external_data_format arg from onnx.export (#67809)
The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export(model, inputs, f_name, use_external_data_format=True)
| torch.onnx.export(model, inputs, f_name)
|
Removed example_outputs arg from torch.onnx.export (#67809)
The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export(model, inputs, f_name, example_outputs=(foo,))
| torch.onnx.export(model, inputs, f_name)
|
Removed enable_onnx_checker arg from onnx.export (#67276)
The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, onnx.CheckerError will be raised. Users can catch and ignore that exception.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False)
| try:
torch.onnx.export(model, inputs, f_name)
except torch.onnx.CheckerError:
pass # ignore error
|
Moved and renamed onnx.utils.ONNXCheckerError to onnx.CheckerError (#66644)
Previously the documentation was incorrect and stated ONNXCheckerError was in the onnx module, so this moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.
| 1.10.2 | 1.11.0 |
|---|---|
except torch.onnx.utils.ONNXCheckerError:
| except torch.onnx.CheckerError:
|
Removed _retain_param_name arg from onnx.export (#67276)
The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.
| 1.10.2 | 1.11.0 |
|---|---|
# NOTE: No way to get same behavior as _retain_param_name=False.
torch.onnx.export(model, inputs, f_name, _retain_param_name=True)
| torch.onnx.export(model, inputs, f_name)
|
Deprecations
Python API
Deprecated x.T on tensors of dimension other than 0 or 2 (#64180)
x.T is intended only for tensors with 0 or 2 dimensions. Calling x.T on tensors with a different number of dimensions is now deprecated and emits a warning.
| 1.10.2 | 1.11.0 |
|---|---|
a = torch.ones(2, 3, 4)
a.T.size()
# torch.Size([4, 3, 2])
| a = torch.ones(2, 3, 4)
a.T.size()
# UserWarning: The use of `x.T` on tensors of dimension other than 2
# to reverse their shape is deprecated and it will throw an error in a future release.
# Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))`
# to reverse the dimensions of a tensor. (Triggered internally at
# aten/src/ATen/native/TensorShape.cpp:2386.)
# torch.Size([4, 3, 2])
|
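A small sketch of the replacements suggested by the warning above (both x.mT and the permute pattern are available in 1.11):

```
import torch

a = torch.ones(2, 3, 4)

# Transpose only the last two dimensions (batches of matrices):
print(a.mT.size())                                           # torch.Size([2, 4, 3])

# Reverse all dimensions, matching the old x.T behavior on N-D tensors:
print(a.permute(*torch.arange(a.ndim - 1, -1, -1)).size())   # torch.Size([4, 3, 2])
```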
Quantization
torch.ao.quantization.QConfigDynamic is deprecated and will be removed in the next release; please use torch.ao.quantization.QConfig instead (#69875, #69864)
| 1.10.2 | 1.11.0 |
|---|---|
qconfig = torch.ao.quantization.QConfigDynamic(...)
| qconfig = torch.ao.quantization.QConfig(...)
|
New features
Python API
- Added set_deterministic_debug_mode and get_deterministic_debug_mode (#67778, #66233)
- Added n-dimensional Hermitian FFT: torch.fft.ifftn and torch.fft.hfftn (#63890)
- Added Wishart distribution to torch.distributions (#70377)
- Preliminary support for the Python Array API standard has been added to the torch and torch.linalg modules. PyTorch implements over 90% of the operators defined by the Python Array API, including the torch.from_dlpack operation for improved DLPack support (#60627)
- Moved torch.testing from prototype to beta (#69668)
Autograd
- Added new torch.utils.checkpoint implementation that does not use reentrant autograd (can be toggled with the new use_reentrant flag) (#69508)
- Added batched_grad parameter to autograd.grad to allow batched gradient computation (#65564)
- Forward mode AD (a short sketch follows this list):
  - Added support for most ops (and many of their backwards as well) (#71026, #69956, #70355, #71901, #69908, #69884, #67837, #68566, #69661, #69384, #68631, #70468, #70460, #67820, #70460, #65546, #67043, #67268, #67837, #69727)
  - Check the following issue (#71117) to see the list of ops that do not yet support forward AD. Please comment there if you run into any ops that don't support forward AD that you want prioritized or are missing from that list.
  - Added ctx.save_for_forward function to autograd.Function (#71569)
  - autograd.forward_ad.unpack_dual returns a named tuple instead of a plain tuple (#68062, #68628)
- Linear algebra operation support:
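Returning to the forward mode AD items above, here is a minimal sketch of computing a Jacobian-vector product with the torch.autograd.forward_ad API (the function and tensors are arbitrary examples):

```
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.randn(3)   # direction for the Jacobian-vector product

with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)
    out = torch.sin(dual)
    y, jvp = fwAD.unpack_dual(out)   # named tuple of (primal, tangent)

print(jvp)   # equals torch.cos(primal) * tangent
```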
Build
- Added FlexiBLAS build support (#64815)
- Added IS_LINUX and IS_MACOS global vars for cpp extensions building (#69093)
- Added ARC for iOS CMake builds (#67884)
- Added support for IBM z14/15 SIMD (#66407)
Complex Numbers
Dataloader
- TorchData library is going to provide modular data loading primitives for easily constructing flexible and performant data pipelines. Beta release will be provided after the release of PyTorch Core (https://github.com/pytorch/data)
LinAlg
- Added an experimental flag that allows specifying a preferred linear algebra library (see the docs here) (#67980)
- Added the linalg.matrix_exp operation (see the docs here) (#62715)
- Added the linalg.cross operation (see the docs here) (#63285)
- Added the linalg.diagonal operation, an alias for torch.diagonal (see the docs here) (#70599)
- Added the linalg.lu_factor operation (see the docs here) (#66933); a short sketch follows this list
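A short, hedged sketch of two of the new operations (matrices below are arbitrary; linalg.lu_factor pairs naturally with torch.lu_solve for reusing a factorization):

```
import torch

A = torch.randn(4, 4)
b = torch.randn(4, 2)

# Factor once, then solve for one or more right-hand sides.
LU, pivots = torch.linalg.lu_factor(A)
x = torch.lu_solve(b, LU, pivots)
print(torch.allclose(A @ x, b, atol=1e-5))

# linalg.cross computes 3-D cross products along a dimension.
u, v = torch.randn(5, 3), torch.randn(5, 3)
print(torch.linalg.cross(u, v, dim=-1).shape)   # torch.Size([5, 3])
```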
torch.nn
- Added torch.nn.utils.rnn.{unpack_sequence,unpad_sequence} functions (#66550)
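A minimal sketch of the new unpad_sequence helper, which inverts pad_sequence (the sequences below are arbitrary):

```
import torch
from torch.nn.utils.rnn import pad_sequence, unpad_sequence

seqs = [torch.arange(5), torch.arange(3), torch.arange(7)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs)                      # shape (7, 3)
restored = unpad_sequence(padded, lengths)       # list of the original tensors

print(all(torch.equal(a, b) for a, b in zip(seqs, restored)))   # True
```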
Sparse
- Added torch.sparse.sampled_addmm for CSR Tensors on GPU (#68007)
CUDA
- The Jiterator - enables compiling rarely used CUDA kernels at runtime (#69439)
- Low precision supported for jiterator (#70157) - enables runtime-compilation of ops on low precision tensors (half and bfloat16)
- Enable cpu scalar arguments for jiterator (#69861) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime
- The Cacherator (#71350) - caches the jit-compiled kernels on disk, so that they can be reused between different processes
- Added complex support for Jiterator, port sinc to Jiterator (#71577)
- Jiterates lcm, i0e, i1e, ndtri, erfcx, digamma, trigamma, lgamma (#70663)
- Jiterates exp2, erfc, erfinv and entr (#71295)
- Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
- Jiterates polygamma (#71162)
- Added cuSPARSE descriptors and updated CSR addmm (#60838)
- Sparse CSR CUDA: added addmv_out (#61407)
- Added nvidia-smi memory and utilization as native Python API (#69104)
Vulkan
- Added Vulkan support for several torch operators:
- Added the vulkan_perf_test benchmark binary to benchmark Vulkan ops under various input conditions (#67230)
Mobile
- Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model’s binary size by only including the operators that the model uses.
- Build tracer for tracing based workflow (#66267)
- Used operator.yaml to build LibTorch library (#66237)
- Unified tracer between internal and external (#64152)
- Reorganized model tracer dependency (#63421)
- Added support for the bool and int dtypes in the copy kernel by default when using Tracing Based Selective Build (#69106, #69297)
- Generic build features for selective build (#67817)
- Made more classes selective (#67397)
- Added custom classes to selective build and compatibility APIs (#67004, #66972, #67340)
Distributed
FullyShardedDataParallel - FSDP is a type of data-parallel training, but unlike traditional data parallelism, it shards the model's parameters, gradients and optimizer states across data-parallel workers and can optionally offload the sharded model parameters to the CPUs. This new API can help users scale their large model training with minimal code change when switching from DDP to FSDP. (#63881, #64964, #66578, #66904, #66956, #66957, #67117, #67292, #67249, #67135, #67813, #68308, #68155, #68417, #68776, #69356, #69357, #69358, #70340, #71803, #71804, #70341, #70235, #72084)
DistributedDataParallel
TorchScript
- Enabled running torch.jit.freeze() and torch.jit.optimize_for_inference on functions that are not forward (#68668, #69367)
- Enabled torch.jit.freeze to work for sparse COO tensors (#69614)
- Enabled torch.jit.script(), torch.jit.freeze() and serialization for tensors in Compressed Sparse Row (CSR) format (#69555)
- Allowed users to set the fusion strategy for torch.jit.fuser through the now-public torch.jit.set_fusion_strategy (#72937)
- Enabled Dynamic Shape Fusion for GPU & CPU, configurable via torch.jit.set_fusion_strategy (#72036); a short sketch follows this list
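As mentioned in the list above, fusion behavior is configured through torch.jit.set_fusion_strategy; a minimal sketch (the (kind, depth) pairs below are illustrative values, not recommendations):

```
import torch

# Each pair is (kind, depth): "STATIC" compiles shape-specialized fusions,
# "DYNAMIC" compiles shape-agnostic ones; depth caps the number of specializations.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])

@torch.jit.script
def fused(x, y):
    return (x * y).relu() + y
```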
Quantization
- Added bilinear quantized implementation of the torch.nn.functional.grid_sample 2d operator (#66879)
- Added the torch.quantize_per_tensor_dynamic operator (#68004)
- Added Quantization Aware Training support for torch.nn.Embedding and torch.nn.EmbeddingBag
  - Added basic EmbeddingBag QAT fakeQuant workflow (#65443)
  - Added support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
  - Eager mode QAT for Embeddings (#66429)
  - Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
  - Supported Embedding QAT via FX API (#69333)
  - Add FX support for QAT EmbeddingBag (#69334)
- Added support for depthwise quantized torch.nn.Conv3d in qnnpack, for use in quantization
  - Depthwise Conv3d Indirection Buffer Setup (#69311)
  - Depthwise Conv3d Weight Packing (#69312)
  - Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
  - Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
  - Tightened Step Height for Indirection Buffers (#70530)
  - Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
  - Implemented 3d convolution in qnnpack (#66350)
ONNX
- Supports opset version 15 (#67805)
- Supports exporting nn.Module calls as ONNX local functions (#66140, #67803)
- Support for exporting new ops:
- Added BFloat16 type support (#66788)
- Supports exporting with Apex O2 (#66700)
Infra (Releng)
- Added support for ROCm 4.3.1 (#65624)
- Added support for ROCm 4.5.2 (#71064)
- Added support for CUDA 11.5 (#69262)
- Added support for CUDA enabled Bazel builds (#66241)
- Added support for Python 3.10 (#71132, #71419)
Improvements
Python API
- NumPy compatibility:
  - Improved torch.Tensor.view(dtype): enable all dtype combinations (#66493)
  - Improved torch.diff by adding support for n greater than 1 (#67260)
  - Improved torch.movedim to handle scalar as no-op (#69537)
  - Improved cartesian_prod: fixed a warning in the docs example (#68753)
  - Improved error messages for max_unpool{}d operators (#67328)
- torch.distributions
  - Implemented positive-semidefinite constraint in torch.distributions (#71375)
  - Implemented Entropy methods for Binomial and Multinomial distributions (#67609)
  - Implemented support for non-negative constraint in exponential distribution (allowing it to include zero) (#67184)
  - Implemented kl divergence between normal and laplace distribution (#68807)
- Improved meta tensor support for operators:
- Added support for torch.Tensor.real for real-valued tensors (#71718)
- torch.logaddexp, torch.logaddexp2, torch.remainder: added BFloat16 support on CPU (#63621)
- torch.bucketize and searchsorted: added Half precision support (#67077)
- Added new torch.slice_scatter, torch.select_scatter, torch.diagonal_scatter ops (#64430); a short sketch follows this list
- Made torch.scatter_reduce a public API (#68580, #73125)
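As referenced in the list above, the new *_scatter ops return an updated copy instead of mutating in place; a small sketch:

```
import torch

base = torch.zeros(4, 4)
src = torch.ones(2, 4)

# Embed src into rows 1..2 of base without modifying base itself.
out = torch.slice_scatter(base, src, dim=0, start=1, end=3)
print(out)

# Write a vector along the main diagonal, again out of place.
d = torch.diagonal_scatter(torch.zeros(3, 3), torch.tensor([1., 2., 3.]))
print(d)
```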
C++ API
- Added C++ API and docs for hfftn (#66127)
- Added support for MaybeOwned<IValue> (#68157)
- Added set_to_none option for zero_grad() to C++ API (#68801)
- Added an environment variable, TORCH_CPP_LOG_LEVEL, that you can use to toggle the log level in the c10 library (#71746)
Autograd
- Added nesting support for torch.autograd.graph.saved_tensor_hooks (#70932)
- Delayed all warnings encountered during the backward pass until the end of backward execution (#66235)
- Added complex autograd support to torch.{col2im,im2col} (#68199)
- Added new reduce options and autograd support for torch.scatter_reduce (#71788)
- Added derivatives wrt the second argument for torch.{remainder,fmod} (#69908)
- Added new strategy flag to autograd.functional.{Jacobian, Hessian} to enable vectorized computation (#67041, #66292); a short sketch follows this list
- Added check_backward_ad flag to torch.autograd.gradcheck to be able to skip backward mode AD checks (#65040)
- Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 (#66294)
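As noted in the list above, autograd.functional.jacobian gained a strategy flag; a hedged sketch (forward-mode currently requires vectorize=True):

```
import torch
from torch.autograd.functional import jacobian

def f(x):
    return x.exp().sum(dim=0)

x = torch.randn(5, 3)
J_fwd = jacobian(f, x, vectorize=True, strategy="forward-mode")
J_rev = jacobian(f, x, strategy="reverse-mode")   # the default
print(torch.allclose(J_fwd, J_rev))
```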
Build
- Improved incremental build times of PyTorch core by removing a dependency on native_functions.yaml in many core files (#64499, #66914, #64172, #64171, #66620, #66793, #66913, #66794, #64169, #64173, #64170, #67735)
- Enabled bazel build without glog and gflags (#70850)
- Added support for C++ frontend wrapper on Linux (#69094)
- Added support for dynamic codegen outputs in CMake (#68246)
- Max CMake version is now used by default with setup.py (#69355)
- Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
- Code base should now be -Wno-unused-variable compliant (#66041)
- Added lazy import for packaging in torch_version (#71345)
Dataloader
- Support custom Sequence and Mapping for utils.data.default_collate (#68779)
- Allowed specifying num_samples to RandomSampler when replacement is False (#71568)
- Fixed the warning of shape inconsistency in utils.data.default_collate (#71065)
ForEach
- Implemented ForEach L1 & L2 norm (#62646)
LinAlg
- The linalg.matrix_rank (docs) and linalg.pinv (docs) operations now support specifying absolute and relative tolerances for better handling of singular values (#63102); a short sketch follows
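A small sketch of the new tolerance arguments (the matrix is a contrived example; the resulting rank depends on the tolerances you pick):

```
import torch

A = torch.diag(torch.tensor([1.0, 1e-4, 1e-9]))

print(torch.linalg.matrix_rank(A))              # default tolerance
print(torch.linalg.matrix_rank(A, atol=1e-6))   # treat tiny singular values as zero
print(torch.linalg.pinv(A, rtol=1e-5).diag())
```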
torch.nn
- Added channels_last support for ChannelShuffle (#50247)
- Added no-batch-dim support for nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer} (#69054, #69539, #70506, #71055, #70092, #64909, #69732, #69783, #70236, #65323, #71056, #64975, #67176, #70590, #65690, #70977, #70597, #70322, #69291)
- Added BFloat16 support on CPU to nn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d} (#56902, #66929, #66927, #56903)
- Added maximize support to optim.{Adam, AdamW, SGD} (#68164, #70146, #67847, #68733, #71023)
- F.interpolate: Added nearest-exact mode to fix off-by-one error in nearest mode (#64501)
- F.interpolate: Added support for anti-aliasing to bilinear and bicubic algorithms (#70930, #68819, #65142, #69318); a short sketch follows this list
- F.interpolate: Improved error message for invalid shapes (#66417)
- nn.Conv*d: Accepts 0-sized channel inputs (#66256)
- nn.LogSigmoid: Used log1p for improved precision (#66441)
- nn.Module: Added flag for removing duplicates from parameters (#71542)
- nn.Module: Added register_module alias for registering a sub-module (#65174)
- nn.ModuleList: Supported concatenation (#70887)
- nn.MultiheadAttention: Added flag to optionally average output attention weights across heads (#70055)
- nn.ParameterDict: Supported full set of dict methods (#69403)
- nn.{RNN, GRU}: Allowed hidden_size to be 0 (#70556)
- nn.Sequential: Added append method (#71326)
- nn.Upsample: Exposed recompute_scale_factor (#66419)
- nn.ZeroPad2d: Added extra_repr for printing purposes (#69206)
- optim.{ChainedScheduler, SequentialLR}: Added optimizer attribute (#67406, #69817)
- optim.swa_utils.AveragedModel: Added use_buffers flag for averaging buffers in addition to parameters (#65921, #71763)
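As referenced in the list above, a brief sketch of two of these additions, nn.Sequential.append and anti-aliased interpolation (the shapes and modules are arbitrary):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
model.append(nn.Linear(16, 2))          # new append method
print(model(torch.randn(4, 8)).shape)   # torch.Size([4, 2])

img = torch.rand(1, 3, 64, 64)
small = F.interpolate(img, size=(32, 32), mode="bilinear",
                      align_corners=False, antialias=True)
print(small.shape)                       # torch.Size([1, 3, 32, 32])
```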
torch.fx
- Improved the customizability of fx.Graph's code generation function, including support for setting a breakpoint in the generated code (#67139)
- Supported printing inplace operators in FX (#71887)
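For context on the code-generation item above, a minimal sketch of inspecting what fx.Graph generates for a traced module:

```
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

gm = fx.symbolic_trace(M())
print(gm.graph)   # the FX intermediate representation
print(gm.code)    # the Python source generated from the graph
```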
Sparse
- Add CSR support for several operators (a short sketch follows this list):
  - torch.triangular_solve, torch.addmv, torch.addmm, torch.add for all arguments on CPU (#62180, #61536, #65606, #64391)
  - torch.triangular_solve, torch.addmv, torch.addmm, torch.add for all arguments on GPU (#61407, #61858, #63511, #63948)
  - zero-preserving unary functions (#68123, #69292)
  - torch.empty, torch.resize_, torch.copy_, torch.randn_like, torch.clone (#63509, #63510, #68083, #70581)
  - transpose (#70582)
- Added torch.sparse_coo Layout support to zeros_like (#68108)
- Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU (#59980)
- Added support for conversion of CSR to COO Tensor to to_sparse (#66774)
- Added support for empty COO Tensors to sparse.sum (#71091)
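A hedged sketch of working with the CSR layout referenced above (it assumes the CPU addmm path listed in the first item; the tensors are tiny illustrative examples):

```
import torch

crow = torch.tensor([0, 2, 3])   # row pointers
col = torch.tensor([0, 2, 1])    # column indices
val = torch.tensor([1., 2., 3.])
csr = torch.sparse_csr_tensor(crow, col, val, size=(2, 3))

dense = torch.randn(3, 4)
print(torch.addmm(torch.zeros(2, 4), csr, dense))   # CSR addmm on CPU

# Round-trip between dense and CSR layouts:
print(csr.to_dense().to_sparse_csr().layout)        # torch.sparse_csr
```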
AMD
- Added sparse mappings for CUDA->HIP translation (#67323)
- Enabled frexp support for ROCm builds (#67226)
- Used hipCUB/rocPRIM scan algorithms for large index support (#68487)
CUDA
- Allows external CUDA streams to be set as current (#66324)
- Added an option to disable reduced precision reductions for FP16 GEMM (#67946)
- Improved CUDA memory usage of nanmedian result (#68591)
- Reduced number of igamma kernel instantiations (#70666)
- Reduced number of compare kernels by unifying them (#69111)
- Reduced number of bernoulli tensor-tensor kernel instantiations (#70169)
- Used cub::FutureValue to simplify 64bit indexing split of cub scan (#66711)
- Added hascuSOLVER flag to Context (#69825)
- Improved error message from CUDACachingAllocator (#69174)
- Fixed masked_softmax perf when element_size is not 8 (#70271)
- Reduced binary size of TensorCompare.cu (#68835)
- Improved error message for interpolation (#72066)
- Doesn't compile pow kernels for non-existent case (#70017)
Profiler
- Added flop count formulas for bmm and baddbmm (#66636)
Vulkan
- Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference (#66477, #66478)
- Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects (#67733, #69576)
Mobile
- Introduced multiple improvements for NNAPI
  - Added converters for torchscript ops quantized::mul and quantized::convtranspose2d to the converter (torch.backends._nnapi.prepare.convert_model_to_nnapi) (#63913, #63914)
  - Supported int32 and qint16 type in Torchscript expressions (#70197, #70621)
  - Supported runtime flexible shapes and return shapes (#70334)
- Improved Model Tracer Coverage and Selective Metal Ops (#68134, #69492, #69328)
- Introduced multiple improvements for CoreML
- Type Support in Mobile Lite Interpreter
Distributed
- torch.distributed
  - Improvements to error handling in TCPStore's socket implementation (#68225)
  - Enabled ncclAvg for reductions (#62835)
  - Init dummy NCCL comms in constructor (#65173, #66393)
  - Added pybind trampoline for ProcessGroup and Work (#66338)
  - Setup c10d extension Backend class attr the same way as builtin ones (#66991)
  - Added barrier to ProcessGroup trampoline (#67236)
  - Raised warning when calling collectives on non-member group objects (#67639)
  - Patched bfloat16 support for NCCL (#67843)
  - Fixed c10d TCP store race condition with mutex (#68499)
  - Surfaced ncclUniqueId store broadcast error (#68597)
  - Checks for file existence before invoking cleanup logic in FileStore destructor (#68603)
  - Implemented gather primitive for ProcessGroupNCCL (#66745)
  - Implemented scatter primitive for ProcessGroupNCCL (#70029)
  - Enabled gather_object on NCCL (#71623)
  - Implemented allreduce_coalesced for ProcessGroupNCCL (#62140)
  - Set non-default backend names to lower case (#69400)
  - Added support for deleteKey for FileStore (#69953)
  - Fixed TSAN issue in TCPStore (#69590)
- DistributedDataParallel
- torch.distributed.rpc
- torch.distributed.autograd
  - Made Kineto + distributed a warning rather than an error (#71120)
- torch.distributed.elastic
  - Added ability to override sys.executable for torch.distributed.run (#66179)
TorchScript
- Several improvements to NVFuser, which is an optimization that speeds up JIT graphs with CUDA Tensors on Nvidia GPUs. This includes extending fusing support to normalization and reduction kernels, enabling multiple kernel launches for a single CudaFusionGroup, and the addition of a graph segmentation cache to the hierarchical caching system. (#63745, #65137)
- Enabled profile_ivalue to convert dynamic scalars into compile-time constants in NVFuser (e.g. reduction axes) (#63745, #65137)
- Added support in torch.jit.trace for tracing already JITted subgraphs (#59949)
- We now provide full types on graph inputs when tracing graphs that are already JITted (#67424)
- torch.jit.freeze can now preserve attributes of submodules - previously, it was only possible to prevent inlining of attributes of the top-level module (#66102)
- The peephole optimizer, which is used in torch.jit.freeze, now coalesces consecutive calls to torch.concat into a single call (#67000)
- Added ability for Torch JIT C dispatch to convert Python None into an undefined Tensor (#67793)
- torch.jit.script now recognizes a union of scalars as a JIT NumberType (#66591)
- No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes (#71170)
- In torch.jit.optimize_for_inference, there is a new graph pass to precompute transposes for linear layers (#65631, #68024)
- In torch.jit.freeze, there is a new pass that concats together multiple linear layers with the same input Tensor (different weight/bias) (#63198, #68024)
- Added support for normalizing torch.Tensor.__rsub__ in the normalize_ops JIT pass (#65014)
Quantization
- Quantized op improvements
  - torch.ao.FakeQuantize now supports fp32/fp16 zero_point (#65836)
  - torch.ops.quantized.add now supports broadcasting (#66049)
  - torch.Tensor.dequantize now supports fp16 + cuda (#67234)
  - Added quantized CPU support for torch.nn.GELU (#69968)
  - torch.nn.quantized.functional.hardsigmoid supports an inplace flag (#65740)
- Workflow improvements
  - FX graph mode quantization: enable torch.nn.Linear + torch.nn.BatchNorm1d fusion for PTQ (#66484)
  - Added an option in torch.ao.quantization.quantize_fx.convert_fx to accept qconfig_dict to skip quantization (#66878)
  - Added torch.nn.qat.dynamic.modules.Linear module (#67325)
  - Added torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}d fusion support (#70022)
  - Extended torch.ao.quantization.prepare_qat with allow_list argument, to allow custom mapping and custom QAT module (#65119)
  - Added torch.ao.quantization.default_replay_qconfig which allows observer reuse for torch.reshape in FX graph mode quantization (#69249)
ONNX
- Set ir_version of the exported model based on opset_version. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. (#67803)
- Preserved op input names when op just passes through the input to the output (#67275)
- Shape inference improvements:
- Included op type in exported models' input and output names (#68976)
- Supports Conv-BatchNorm fusion inside blocks (#67272)
- Exported torch.reciprocal to ONNX Reciprocal operator instead of Div(1, x) (#67271)
- Supports beta != 1 in softplus (#66146)
- Added warning for inplace updates on tensor.shape in tracing mode (#66142)
- Supports instance_norm in training mode (#64375)
- Allow registration of custom symbolics for ops specifying aten namespace (i.e. aten::foo is allowed as well as "foo") (#67810)
- Allow registration of custom symbolics for prim namespace (#66139)
- Supports dynamic inputs for OneHot, bool for Einsum (#66147)
Infra (Releng)
- Build with BUILD_SPLIT_CUDA for all 11.X Windows builds (#70899)
torch.package
- Add ability to retrieve the dependency graph via the all_path function (#65602)
- Add support for pickle v4 (#70642)
- Add better testing support for Package Exporter (#70641)
Bug fixes
Python API
- Fixed scalar inputs for aliased binary ops {multiply, subtract, divide} (#65937)
- Fixed torch.save when saving storages that view same data with different type (#66949)
- Fixed torch.save error if storages are unallocated (#68787)
- Fixed k out-of-bounds in torch.kthvalue (cpu kernel) (#68863)
- Fixed inference_mode decorator: with inference_mode(mode=False) used to ignore the mode argument and always set inference mode (#68617)
- Fixed cdist_backward in the case when cdist inputs are not contiguous (#70016)
- Fixed cdist error message typo (#70178)
- Fixed scatter for empty indexes (#70662)
- Fixed torch.{unique, unique_consecutive} out of bound (#71540)
- Fixed torch.isin in the case when inputs are non-contiguous on CPU (#70659)
- Fixed hsplit, vsplit, dsplit crash when section is 0 (#69342)
- Fixed: torch.gradient ignores dim argument when checking edge_order (#67926)
- Fixed: TransformedDistribution.icdf should perform validation after applying the inverse transformation rather than before (#71393)
- Fixed torch.all and torch.any internal assert error with requires_grad=True (#65714)
- Fixed torch.logsumexp type promotion: promote integral inputs to floating point (#63393)
C++ API
- Fixed libtorch at::Tensor::print() linking error (#69615)
- Avoided UB when indexing into size-0 tensors (#65878)
- Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 (#65655)
Autograd
- Fixed autocast state propagation in the torch.utils.checkpoint API (#71169)
- Fixed torch.nn.functional.conv_transpose3d backward when grad_out is non-contiguous (#67829)
- Forward mode AD:
  - Fixed a case where forward AD in-place-over-view silently copies the view (#67816)
  - Fixed deadlock in forward AD for functions that return multiple outputs (#67995)
  - Fixed forward AD codegen for functions that have multiple formulas (#68535)
  - Fixed deadlock when forward and backward AD are used at the same time (#67360)
- Fixed Tensor.copy_ forward AD to handle broadcasting (#69592)
- Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function (#66926)
- Fixed autograd.Function when non-Tensor argument precedes tensor argument (#71530)
- Fixed autograd.Function forward AD when forward is a no-op to no longer raise an internal error (#71531)
Build
- Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels (#66703)
- Disabled SVE when cross-compiling for M1 (#67114)
- Added failure if pocketfft is not found and at_mkl is not enabled (#67909)
- Fixed clang issues when compiling with _GLIBCXX_USE_CXX11_ABI (#72081)
Complex Numbers
- Fixed torch.autograd.gradcheck to generate valid inputs for forward AD computation for complex functions (#68001)
- Fixed torch.Tensor.copy_ transpose path for tensors with conjugate or negative bit set (#69026)
- Fixed torch.Tensor.copy_ behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other (#68963)
Dataloader
- Made ProcessException picklable (#70118)
- Fixed persistent worker exiting before pin_memory_thread (#71579)
torch.nn
- nn.AdaptiveAvgPool*d: Throws an error for negative output_size (#70488)
- nn.Conv1d: Fixed for 1D convolution on MKL-DNN backend (#68166)
- nn.CrossEntropyLoss: Fixed for usage of weight, ignore_index, and label_smoothing together (#69511)
- nn.Fold: Checked that block height and width are positive (#69048)
- nn.LayerNorm: Fixed incorrect result on CUDA when gamma or bias are missing (#69210)
- nn.LayerNorm: Avoided overflow by doing computation in float for half (#66920)
- nn.Module: Throws a proper error message from load_state_dict for non-tensor values (#70596)
- nn.ModuleList: Fixed incorrect return type in __getitem__ (#69083)
- nn.MultiheadAttention: Used query dtype for mask type (#68077)
- nn.NLLLoss: Fixed backward computation with negative weights (#64572)
- nn.{RNN, GRU}: Fixed RNN modules with input shapes containing 0 in CUDA (#71696)
- nn.utils.rnn.pad_sequence: Fix regression to support tuples for padding (#72436)
- optim._LrScheduler: Fixed print formatting (#68338)
- optim.ChainedScheduler: Fixed get_last_lr() (#69112)
- optim.CosineAnnealingWarmRestarts: Fixed ordering bug when last_epoch > 0 (#64758)
- optim.SequentialLR: Updated _last_lr on step (#70558)
torch.fx
- Supported torch.layout as arg (#66048)
- Specified a default value when possible for placeholders created from concrete_args (#59569)
- Fixed issue where GraphModule.delete_all_unused_submodules deletes submodules from called leaf modules (#66430)
- Fixed torch.fx.subgraph_rewriter.replace_pattern mechanism so that multiple one-liner instances of the pattern are captured correctly (#66442)
- Fixed bug in graph matcher that caused certain nodes to be matched twice (#69238)
- Ensured node stack trace survives copying (#69368)
- Fixed to_folder not saving dtype (#69983)
- Added a default_value arg to fx.Graph.placeholder and fix split_module (#71016)
Sparse
- Fixed CSR storage access to throw when used (#70072)
- Fixed multiplication of 0-D sparse tensors (#70749)
- Fixed result dtype for neg if given sparse Tensor (#68885)
CUDA
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (#66790)
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
- Fixed error in warning about unsupported GPU (#67900)
- Disabled TF32 in pinv_jvp and pinv_backward (#67948)
- Fixed DLPack CUDA stream convention (#67618)
- Sets device guard in _cudnn_impl functions (#70406)
- Fixed mem_get_info when querying on a device other than the current device (#69640)
Benchmark
- Fixed divide-by-zero errors in torch.utils.benchmark.Timer (#70050)
Dispatcher
- Added explicit OperatorHandle destructor, so that the symbol shows up in Windows builds (#70033)
Profiler
Visualization
- Fixed torch.utils.tensorboard parsing JIT graph incorrectly (#65692)
Vulkan
- Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator (#69088)
- Addressed several warnings raised by the Vulkan Validation layers:
Mobile
- Fixed quantized logistic converter for NNAPI (#70847)
- Fixed potential crash if MTLCreateSystemDefaultDevice returns nil (#66859)
- Used full name to look for the promoted prim operator table (#66081)
- Fixed function name bug in mobile export (#66915)
- Fixed issues with irange not having a header included in Metal (#66877)
- Fixed backward compatibility issue for UnionType on mobile in type_parser (#71341)
- Fixed forward flatbuffer type handling with dynamic type in flatbuffer_loader (#71500)
- Fixed type equalities issue in pytorch_jni_common (#71508)
- Fixed missing properties to the executor in CoreML (#67737)
- Fixed memory computation when both constants and data tensors are present in model_dump (#66006)
- Ensured that functions participating in bundled inputs have their "name" attribute set (#65856)
Distributed
- torch.distributed
  - Fixed bug on empty GLOO_SOCKET_IFNAME_ENV (#68933)
- DistributedDataParallel
  - Fixed “Cannot modify in-place due to DDPSink” (#66015)
- torch.distributed.elastic
  - Fixed scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749)
TorchScript
- Fixed a race condition in the JIT interpreter when unpickling source ranges (5525e9a591)
- Fixed a ref counting loop for CompilationUnit, resulting in memory leaks when class objects were in JIT graphs (#65442)
- Fixed bug where output type was discarded after calling SubgraphRewriter in C++ (#65453)
- Fixed bug where torch.jit.optimize_for_inference did not torch.jit.freeze a module when passed a non-frozen module (#71436)
- Fixed bug where running module.forward() on a module frozen with torch.jit.freeze ran the wrong graph (#68316)
- Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of torch.split, resulting in invalid optimizations in various JIT optimization passes (#69745)
- Fixed places where using torch.autocast together with autodiff (module.backwards()) in a JIT graph had the wrong number of arguments and would error out (#67648)
- Forbid propagating gradients through views in JIT graphs, as it is currently broken (#67732)
- Fixed bug where graph input types were incorrect after running torch.jit.trace (#68242)
- Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack when torch.jit.freeze ops are converted to MKLDNN (#66628)
- Raised error instead of segfaulting when passing None into torch.jit.Graph.create (#68253)
- Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python pickle version (#69807)
- Fixed bug where torch.jit.script fails when comments in a function have less indent than the surrounding code (#70227)
- Fixed incorrect device type when torch.device is called inside scripted (torch.jit.script) code (#69645)
- Fixed warning: overloaded virtual function torch::jit::Function::call is only partially overridden in class torch::jit::GraphFunction (4bf1be898d)
Quantization
- Fixed applying non-zero offset 1 to null pointer in torch.nn.functional.interpolate for quantized tensors (#65570)
- Doesn't assume bias is a keyword argument to torch.nn.Conv{n}d (#61647, #71426)
- Made error message when trying to use torch.quantize_per_tensor on non floats more specific (#66050)
- Quantized torch.nn.Embedding conversion with unsupported dtype: make error message clearer (#66051)
- Fixed torch.nn.qat.EmbeddingBag from_float error message (#66989)
- Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in torch.nn.Embedding QAT (#68852)
- Fixed scale+zp serialization of torch.nn.quantized.BatchNorm{2|3}d (#70432)
- Fixed torch.nn.Dropout in FX graph mode quantization (#71043, #71438)
- Fixed qconfig setting for fused modules in FX graph mode quantization (#71254)
- Removed assumption number of rows is in 32 bit in fbgemm (#69066)
- Fixed reduce_range warning when using default observers (#71027)
ONNX
- Doesn't create invalid index_select op when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. (#68493)
- Shape inference:
  - Fixed inplace fill_ dtype export mismatch (#64580)
  - Fixed remainder (#64578)
  - Fixed reciprocal when input is not floating point (#67808)
  - Fixed new_full and full_like for Python 3.9 (#67806)
  - Fixed reduce ops on binary_cross_entropy_with_logits (#67805)
- Propagated node metadata across passes (#45256)
- Ensured outputs don't have the same name (#66137)
- Fixed pad with sequence inputs (#64377)
- Fixed instance_norm with track_running_stats=True (#64375)
- Fixed all and any with dim arg (#67270)
- Allows autograd functions (prim::PythonOp) to be exported with OperatorExportTypes.ONNX_FALLTHROUGH (#67273)
torch.package
- Prevent import race condition that leaves torch.package.PackagePickler with unwanted dispatch table entries (#71025)
Performance
Python API
- Speed up pickling for torch.dtype (#65182)
- Speed up histogram: avoid index_put_ overhead in histogram kernel's inner loop (#67815)
- Speed up torch.topk with sort for some cases (#68632)
- Speed up torch.stack: don't unsqueeze every stack arg if possible (#70288)
- Speed up LayerNorm by 4-5% (#71423)
- Speed up structured kernels: fix some unnecessary refcount bumps (#71140)
- Speed up indexing functions: release GIL in a few places (#71728)
- Speed up torch.empty a bit: define check_sizes_nonnegative as inline (#71640)
- Speed up XLA tensor printing by reducing compilations (#71147)
C++ API
- Updated c10::SmallVector from LLVM (#69110)
- Reduced some framework overhead in at::copy_() (#68950)
- Reduced some overhead in StorageImpl::set_data_ptr (#65432)
- Improved IValue performance for tuples by inlining tuple storage (#64066)
Autograd
- Stopped materializing Tensors full of 0s in forward AD when possible (#64837)
- Rewrote the backward of linalg.lu and linalg.lu_solve to use linalg_solve_triangular (#63569)
- Updated nn.functional.grid_sample backward to compute input gradient only if required (#66069, #66070)
- Stopped erroneously saving the output of torch.softplus for backward (#70296)
Complex Numbers
- Release GIL when assigning to real or imaginary components of a complex tensor (#71747)
- Restored conjugate and negative bits of a tensor when calling repeat_interleave (#68523)
CUDA
- Used a better hash table in CUDACachingAllocator (#71667)
- TopK CUDA Optimization: used multiple block per slice (#71081)
- Removed sync in Embedding caused by unique (#66091)
- EmbeddingBackward exclusive_scan thrust->cub (#66566)
- sort_out_cuda: Used custom kernels to fill index tensors (#66668)
- masked_scatter: fuse mask count check into one kernel (#66871)
- Enabled better depthwise conv perf on cudnn 8.2+ (#58749)
- Improved native layer_norm forward perf (#67977)
- Improved native layer_norm backward perf (#68238)
- Fast path for size 0 GPU host malloc (#68532)
- Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability (#69299)
- Used legacy unrolled kernel for non-trivial offset calc cases (#71710)
- Removed call_once from CUDACachingAllocator (#71668)
- Reworked stat collection in CUDACachingAllocator (#71669)
- Fixed CUDA LpNormFunctor (#70601)
Dispatcher
- Made c10::KernelFunction struct smaller, which should reduce some memory usage by the dispatcher (#65618)
torch.fx
- Made torch.fx.symbolic_trace reuse buffers if they're the same (#66211)
Profiler
Mobile
- Reduced PyTorch Library startup time by 40% for mobile and edge deployments (#65735, #65732, #65939, #66112, #66064, #66131)
- Reduced PyTorch Library heap memory utilization by 40% for mobile and edge deployments (#65732, #66112, #66064, #66131)
- Improved efficiency of IValue and reduced overhead in code paths that use IValue and perform Type Parsing (#65710, #64278, #66717, #65381, #66134, #65951, #70477)
TorchScript
- Improved performance of autodiff on small JIT graphs (#71666)
- Enabled autocasting of tensors between fp16, bfloat16 and fp32 in TorchScript models (#63939, #67707)
- Enabled optimizations in more gradSumToSize cases in the JIT autograd support (#63941)
- When unpickling a JIT graph, avoid reading the file from a stream for 0-byte tensor storage (#67787)
Quantization
- Sped up quantized torch.nn.functional.interpolate for channels last (#66525)
- Sped up torch.nn.functional.upsample for channels last (#70903)
- Parallelized computation in torch.quantize_per_tensor_affine and torch.dequantize (#65845)
Documentation
Python API
- Added docs for torch.adjoint (#68869)
- Clarified difference in behavior of empty_strided and as_strided (#64568)
- Added some missing generated doc entries (torch.select, torch.slice_scatter, torch.diagonal_scatter, torch.select_scatter) (#69030), histogramdd (#68273)
- Typo and formatting fixes: LinearLR (#67840), torch.any (#65310, #70187), torch.futures (#70630), jit docs (#68557), Tensor.type (#67019), torch.lobpcg (#71464), Tensor.triu(), Tensor.tril(), Tensor.ravel() (#71057), torch.acosh (#66814), (#70439)
- General doc improvements for individual ops: torch.finfo (mention torch.bfloat16) (#68496), torch.quantile interpolation kwarg (#70637), from_dlpack and to_dlpack (#70437), set_printoptions added examples (#68324), index_add (#65806), topk doc (#65938), unique (#66132), chi2 (#67379), torch.histc (#64191), empty and empty_like (#68874), torch.cholesky_inverse (#69069), torch.dsplit (#70557)
- Changed README getting started link to explicit instructions (#66828)
- Modernized and clarified docs for torch.tensor and torch.as_tensor (#63308)
- Improved torchhub docs (#69970)
- Updated docs for torch.Tensor.real to indicate that it's supported for real tensors (#71962)
C++ API
- Fixed typos in ATen README (#69170)
- Mentioned TORCH_SHOW_CPP_STACKTRACES in Contributing.md docs (#64052)
- Added docs for Visual Studio extension (#63944)
- Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows (#73013)
Autograd
- Updated docs for forward AD and made them public (#71643, #71159)
- Updated “Extending PyTorch” doc to cover forward AD (#66962)
- Fixed broken code syntax in autograd.rst (#69362)
- Fixed incorrect variable in autograd docs (#70884)
- Fixed typo in torch.autograd.Function docs that prevented it from compiling (#66754)
Dataloader
- Added docstring for default_collate and default_convert (#69862)
- Updated the documentation for AMP with DataParallel (#69218)
torch.nn
- F.binary_cross_entropy: Updated examples to avoid deprecated calls (#69816)
- F.linear: Fixed shape docs to indicate no-batch-dim support (#66884)
- F.max_pool*d: Added functional docs (#63264)
- F.multilabel_soft_margin_loss: Added reduction args to signature (#70420)
- nn.AdaptiveLogSoftmaxWithLoss: Fixed typo in log_prob name (#68926)
- nn.{BatchNorm1d, InstanceNorm1d}: Fixed input shape notation inconsistencies (#71371)
- nn.CrossEntropyLoss: Corrected typo in formula for class probability targets (#70220)
- nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}: Made first line of docstring readable for overview docs (#70574, #71012, #70987, #71100, #70576, #70577)
- nn.Flatten: Simplified example code (#67472)
- nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}: Added activation function images (#65415)
- nn.KLDivLoss: Fixed rendering of reduction arg (#66583)
- nn.KLDivLoss: Rewrote docs to clarify math (#67443)
- nn.MaxUnpool2d: Changed misleading example to better demonstrate output_size usage (#68936)
- nn.Module: Added note describing required super().__init__() call (#66909)
- nn.Module: Changed super() usage to Python 3 syntax in example (#65748)
- nn.Module: Fixed formatting for named_modules() (#70491)
- nn.NLLLoss: Corrected default value for reduce (#68426)
- nn.SmoothL1Loss: Clarified equivalence with nn.L1Loss when beta == 0 (#70673)
- nn.{TransformerDecoderLayer, TransformerEncoderLayer}: Clarified default batch_first=False dimension format (#66574)
- nn.Upsample: Indicated that align_corners takes effect in bicubic mode (#66756)
- nn.utils.clip_grad_norm_: Fixed rendering of parameters in error_if_nonfinite arg docs (#69958)
- optim.Adam: Fixed formatting (#70387)
- optim.AdamW: Fixed formula (#68587)
- optim.RAdam: Corrected default value of lr arg (#69186)
- Removed orphan from cuDNN persistent note (#65160)
- Updated link to tutorial on defining NN modules (#65534)
- nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, F.{conv3d, conv_transpose3d, fold, linear}: Fix doc formatting regressions from no-batch-dim support (#73014)
torch.fx
- Fixed the retracing documentation, which would break for n-ary operators (#71599)
- Updated torch.fx.passes.split_module docstring (#65542)
- Updated fx.rst example outputs (#68043)
- Added document gotcha about training flag (#68915)
- Defined get_dot_graph to match documentation (#70541)
Sparse
- Updated sparse.rst to warn about _values() (#71088)
CUDA
- Updated Stream wait documentation to reference the underlying cudaStreamWaitEvent call (#67973)
- Documented torch.cuda.ExternalStream, torch.cuda.caching_allocator_alloc and torch.cuda.caching_allocator_delete (#70126)
- Updated CUDA Graphs docs: Fixed make_graphed_callables example typos (#69379)
Mobile
- Added user facing documentation for tracing-based selective build mobile interpreter in Android and iOS (#1709)
- Added recipe for bundled inputs in TorchScript models (#1524)
Distributed
- DistributedDataParallel
- torch.distributed
- torch.distributed.elastic
  - Made --max_restarts explicit in the quickstart and runner docs (#65838)
- torch.distributed.optim
  - Rendered torch.distributed.optim members (#67885)
- torch.distributed.rpc
  - Deleted distributed optimizer section from RPC and add reference to namespace docs page (#68068)
TorchScript
- Added typing.Union to supported types in documentation (#68435)
- Added documentation to torch.jit.is_tracing() (#67326)
- Fixed typos in jit_language_reference.rst (#68706)
Quantization
- Added documentation for quantized model save/load instructions (#69789)
- Updated link to qnnpack in quantization doc. (#66226)
- Improved quantization API docs (#66379)
- Quantization docs: add pages for Numeric Suite (Eager and FX) (#66380)
- Documented the quantization custom module APIs (#67449)
- Improved quantization documentation (#68907)
ONNX
- Improved documentation of operator_export_type and opset_version args (#69549)
- Fixed documentation for do_constant_folding arg default (#71348)
- Documented ExportTypes, CheckerError, and unregister_custom_op_symbolic (#68489)
- Fixed link to ONNX Runtime custom op documentation (#67944)
- Added section “Discovering all unconvertible ATen ops at once” (#66143)
- Fixed typos (#66090)
- Documented work-arounds for indexing export limitations, and improve error messages (#64579)
torch.package
- Add some docs describing how to debug torch.package dependencies (#65704)
Download Release
This release has 2 assets:
- pytorch-v1.11.0.tar.gz
- Source code (zip)
- Source code (tar.gz)
Visit the release page to download them.
Have any questions?
Contact Exxact Today

Build
Stopped building shared library for AOT Compiler, libaot_compiler.so (#66227)
Building aot_compiler.cpp as a separate library is not necessary, as it’s already included in libtorch.so.
You can update your build system to only dynamically link libtorch.so.
Mobile
Make typing.Union type unsupported for mobile builds (#65556)
typing.Union support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and increase in binary size of PyTorch for Mobile builds.
Distributed
torch.distributed.rpc: Final Removal of ProcessGroup RPC backend (#67363)
ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely.
The backend type “PROCESS_GROUP” is now deprecated, e.g.torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
and should be replaced with:torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)
Quantization
Disabled the support for getitem in FX Graph Mode Quantization (#66647)
getitem used to be quantized in FX Graph Mode Quantization, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.
| 1.10.2 | 1.11.0 |
|---|---|
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear = torch.nn.Linear(5, 5)
def forward(self, x):
x = self.linear(x)
y = torch.stack([x], 0)
return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
# (linear): QuantizedLinear(in_features=5, out_features=5,
# scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
# linear_input_scale_0 = self.linear_input_scale_0
# linear_input_zero_point_0 = self.linear_input_zero_point_0
# quantize_per_tensor = torch.quantize_per_tensor(x,
# linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
# x = linear_input_scale_0 = linear_input_zero_point_0 = None
# linear = self.linear(quantize_per_tensor)
# quantize_per_tensor = None
# stack = torch.stack([linear], 0); linear = None
# getitem = stack[0]; stack = None
# dequantize_2 = getitem.dequantize(); getitem = None
# return getitem
| from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear = torch.nn.Linear(5, 5)
def forward(self, x):
x = self.linear(x)
y = torch.stack([x], 0)
return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
# (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0,
zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
# linear_input_scale_0 = self.linear_input_scale_0
# linear_input_zero_point_0 = self.linear_input_zero_point_0
# quantize_per_tensor = torch.quantize_per_tensor(x, linear_input_scale_0,
linear_input_zero_point_0, torch.quint8)
# x = linear_input_scale_0 = linear_input_zero_point_0 = None
# linear = self.linear(quantize_per_tensor); quantize_per_tensor = None
# stack = torch.stack([linear], 0); linear = None
# dequantize_2 = stack.dequantize(); stack = None
# getitem = dequantize_2[0]; dequantize_2 = None
# return getitem
|
Users should now use fuse_modules for PTQ fusion and fuse_modules_qat for QAT fusion (#69878, #71956)
There are two types of fusion supported by fuse_modules api: PTQ and QAT fusion. Previously we relied on module.training to decide which mode user wanted, but this was a misuse of the training attribute since that is not the intended purpose. This PR removes the dependency on module.training and uses separate APIs to make the fusion requested by the user explicit.
Previously, fuse_module used to support both cases and distinguished PTQ/QAT fusion based on module.training, but now fuse_module only supports the PTQ fusion. So, in the case when user wants to do QAT fusion, they need to change the call to fuse_modules_qat, instead of using fuse_modules, otherwise, they would silently get unwanted fusion results (PTQ fusion), or if the model is in training mode, it might result in error.
Note: Currently it is still enforced that if the model is in eval mode, only PTQ fusion can be used; if the model is in training mode, then only QAT fusion can be used. In the future this constraint will be relaxed.
| 1.10.2 | 1.11.0 |
|---|---|
import torch
from torch.ao.quantization import fuse_modules
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.conv = torch.nn.Conv2d(3, 3, 3)
self.bn = torch.nn.BatchNorm2d(3)
def forward(self, x):
return self.bn(self.conv(x))
m = M().train()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
<class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
<class 'torch.nn.modules.conv.Conv2d'>
| import torch
from torch.ao.quantization import fuse_modules
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.conv = torch.nn.Conv2d(3, 3, 3)
self.bn = torch.nn.BatchNorm2d(3)
def forward(self, x):
return self.bn(self.conv(x))
m = M().train()
# For Quantization Aware Training, use fuse_modules_qat()
m = fuse_modules_qat(m, ["conv", "bn"])
print(type(m.conv))
m = M().eval()
m = fuse_modules(m, ["conv", "bn"])
print(type(m.conv))
# Result (doesn't change):
<class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>
<class 'torch.nn.modules.conv.Conv2d'>
|
ONNX
Removed f arg from onnx.export_to_pretty_string (#69546)
The arg has always been ignored. Simply remove it from your code.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export_to_pretty_string(model, inputs, "file_name")
| torch.onnx.export_to_pretty_string(model, inputs)
|
Removed use_external_data_format arg from onnx.export (#67809)
The arg has been deprecated and ignored since 1.10. The external data format is now used automatically if and only if the exported file would exceed protocol buffer’s file size limit. Simply remove it from your code.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export(model, inputs, f_name, use_external_data_format=True)
| torch.onnx.export(model, inputs, f_name)
|
Removed example_outputs arg from torch.onnx.export (#67809)
The arg has been deprecated and ignored since 1.10. The provided model is instead executed once to produce example outputs. Simply remove it from your code.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export(model, inputs, f_name, exaple_outputs=(foo,))
| torch.onnx.export(model, inputs, f_name)
|
Removed enable_onnx_checker arg from onnx.export (#67276)
The arg has been deprecated and ignored since 1.10. The ONNX checker is always enabled. If it fails, onnx.CheckerError will be raised. Users can catch and ignore that exception.
| 1.10.2 | 1.11.0 |
|---|---|
torch.onnx.export(model, inputs, f_name, enable_onnx_checker=False)
| try:
torch.onnx.export(model, inputs, f_name)
except torch.onnx.CheckerError:
pass # ignore error
|
Moved and renamed onnx.utils.ONNXCheckerError to onnx.CheckerError (#66644)
Previously the documentation was incorrect and stated ONNXCheckerError was in the onnx module, so this moves the class to the originally intended module and brings the code in line with the documentation. The new name is shorter and less redundant with the module name.
| 1.10.2 | 1.11.0 |
|---|---|
except torch.onnx.utils.ONNXCheckerError:
| except torch.onnx.CheckerError:
|
Removed _retain_param_name arg from onnx.export (#67276)
The arg has been deprecated and ignored since 1.10. Param names are now always retained. Simply remove it from your code. If you want to remove param names, you can do so by editing the exported ONNX model.
| 1.10.2 | 1.11.0 |
|---|---|
# NOTE: No way to get same behavior as _retain_param_name=False.
torch.onnx.export(model, inputs, f_name, _retain_param_name=True)
| torch.onnx.export(model, inputs, f_name)
|
Deprecations
Python API
Deprecated x.T on tensors of dimension other than 0 or 2 (#64180)
x.T only accepts tensors with 0 or 2 dimensions. Calling x.T on tensors with a different number of dimensions has been deprecated.
| 1.10.2 | 1.11.0 |
|---|---|
a = torch.ones(2, 3, 4)
a.T.size()
# torch.Size([4, 3, 2])
| a = torch.ones(2, 3, 4)
a.T.size()
# UserWarning: The use of `x.T` on tensors of dimension other than 2
# to reverse their shape is deprecated and it will throw an error in a future release.
# Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))`
# to reverse the dimensions of a tensor. (Triggered internally at
# aten/src/ATen/native/TensorShape.cpp:2386.)
# torch.Size([4, 3, 2])
|
Quantization
torch.ao.quantization.QConfigDynamic is deprecated and going to be removed in next the release, please use torch.ao.quantization.QConfig instead (#69875, #69864)
| 1.10.2 | 1.11.0 |
|---|---|
qconfig = torch.ao.quantization.QConfigDynamic(...)
| qconfig = torch.ao.quantization.QConfig(...)
|
New features
Python API
- Added
set_deterministic_debug_modeandget_deterministic_debug_mode(#67778, #66233) - Added n-dimensional Hermitian FFT:
torch.fft.ifftnandtorch.fft.hfftn(#63890) - Added
Wishartdistribution totorch.distributions(#70377) - Preliminary support for the Python Array API standard has been added to the
torchandtorch.linalgmodules. PyTorch implements over 90% of the operators defined by the Python Array API, including thetorch.from_dlpackoperation for improved DLPack support (#60627) - Moved
torch.testingfrom prototype to beta (#69668)
Autograd
- Added new
torch.utils.checkpointimplementation that does not use reentrant autograd (can be toggled with the newuse_reentrantflag) (#69508) - Added
batched_gradparameter toautograd.gradto allow batched gradient computation (#65564) - Forward mode AD:
- Added support for most ops (and many of their backwards as well) (#71026, #69956, #70355, #71901, #69908, #69884, #67837, #68566, #69661, #69384, #68631, #70468, #70460, #67820, #70460, #65546, #67043, #67268, #67837, #69727)
- Check the following issue (#71117) to see the list of ops that do not yet support forward AD. Please comment there if you run into any ops that don’t support forward AD that you want prioritized or are missing from that list.
- Added
ctx.save_for_forwardfunction toautograd.Function(#71569) autograd.forward_ad.unpack_dualreturns a named tuple instead of plain tuple (#68062, #68628)
- Added support for most ops (and many of their backwards as well) (#71026, #69956, #70355, #71901, #69908, #69884, #67837, #68566, #69661, #69384, #68631, #70468, #70460, #67820, #70460, #65546, #67043, #67268, #67837, #69727)
- Linear algebra operation support:
Build
- Added FlexiBLAS build support (#64815)
- Added
IS_LINUXandIS_MACOSglobal vars for cpp extensions building (#69093) - Added ARC for iOS CMake builds (#67884)
- Added support for IBM z14/15 SIMD (#66407)
Complex Numbers
Dataloader
- TorchData library is going to provide modular data loading primitives for easily constructing flexible and performant data pipelines. Beta release will be provided after the release of PyTorch Core (https://github.com/pytorch/data)
LinAlg
- Added an experimental flag that allows specifying a preferred linear algebra library (see the docs here) (#67980)
- Added the
linalg.matrix_expoperation (see the docs here) (#62715) - Added the
linalg.crossoperation (see the docs here) (#63285) - Added the
linalg.diagonaloperation, an alias for torch.diagonal (see the docs here) (#70599) - Added the
linalg.lu_factoroperation (see the docs here) (#66933)
torch.nn
- Added
torch.nn.utils.rnn.{unpack_sequence,unpad_sequence}functions (#66550)
Sparse
- Added
torch.sparse.sampled_addmmfor CSR Tensors on GPU (#68007)
CUDA
- The Jiterator - enables compiling rarely used CUDA kernels at runtime (#69439)
- Low precision supported for jiterator (#70157) - enables runtime-compilation of ops on low precision tensors (half and bfloat16)
- Enable cpu scalar arguments for jiterator (#69861) - enables passing cpu scalars as an argument to the jit-compiled kernels at runtime
- The Cacherator (#71350) - caches the jit-compiled kernels on disk, so that they can be reused between different processes
- Added complex support for Jiterator, port sinc to Jiterator (#71577)
- Jiterates
lcm,i0e,i1e,ndtri,efcx,digamma,trigamma,lgamma(#70663) - Jiterates
exp2,erfc,erfinvandentr(#71295) - Fixes jiterator cache macro include + updates CUDA note with cache variables (#71452)
- Jiterates
polygamma(#71162)
- Added cuSPARSE descriptors and updated CSR addmm (#60838)
- Sparse CSR CUDA: added
addmv_out(#61407) - Added nvidia-smi memory and utilization as native Python API (#69104)
Vulkan
- Added Vulkan support for several torch operators:
- Added the
vulkan_perf_testbenchmark binary to benchmark Vulkan ops under various input conditions. (#67230)
Mobile
- Tracing Based Selective Build (PyTorch Mobile Build Size Reduction) is a new feature that reduces a mobile model’s binary size by only including the operators that the model uses.
- Build tracer for tracing based workflow (#66267)
- Used operator.yaml to build LibTorch library (#66237)
- Unified tracer between internal and external (#64152)
- Reorganized model tracer dependency (#63421)
- Added support for the
boolandintdtypes in the copy kernel by default when using Tracing Based Selective Build (#69106, #69297) - Generic build features for selective build (#67817)
- Made more classes selective (#67397)
- Added custom classes to selective build and compatibility APIs (#67004, #66972, #67340)
Distributed
FullyShardedDataParallel- FSDP is a type of data-parallel training but unlike traditional data-parallel it shards model’s parameters, gradients and optimizer states across data parallel workers and can optionally offload the sharded model parameters to the CPUs. This new API can help users to scale their large model training with minimal code change when switching from DDP to FSDP. (#63881, #64964, #66578, #66904, #66956, #66957, #67117, #67292, #67249, #67135, #67813, #68308, #68155, #68417, #68776, #69356, #69357, #69358, #70340, #71803, #71804, #70341, #70235, #72084)
DistributedDataParallel
TorchScript
- Enabled running
torch.jit.freeze()andtorch.jit.optimize_for_inferenceon functions that are not forward (#68668, #69367) - Enabled
torch.jit.freezeto work on for sparse COO tensors (#69614) - Enabled
torch.jit.script(),torch.jit.freeze()and serialization for tensors in Compressed Sparse Row (CSR) format (#69555) - Allowed users to set the fusion strategy for
torch.jit.fuserthrough the now publictorch.jit.set_fusion_strategy. (#72937) - Enabled Dynamic Shape Fusion For GPU & CPU, configurable via
torch.jit.set_fusion_strategy(#72036)
Quantization
- Added bilinear quantized implementation of
torch.nn.functional.grid_sample2d operator (#66879) - Added the
torch.quantize_per_tensor_dynamicoperator (#68004) - Added Quantization Aware Training support for
torch.nn.Embeddingandtorch.nn.EmbeddingBag- Added basic EmbeddingBag QAT fakeQuant workflow (#65443)
- Added support for quantization of Embedding{Bag} in dynamic quant APIs (#65674)
- Eager mode QAT for Embeddings (#66429)
- Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
- Supported Embedding QAT via FX API (#69333)
- Add FX support for QAT EmbeddingBag (#69334)
- Added support for depthwise quantized
torch.nn.Conv3din qnnpack, for use in quantization- Depthwise Conv3d Indirection Buffer Setup (#69311)
- Depthwise Conv3d Weight Packing (#69312)
- Depthwise Conv3d mp8x27 (per channel) Neon Kernel (#69313)
- Depthwise Conv3d mp8x27 (per-channel) Sse2 Kernel (#69314)
- Tightened Step Height for Indirection Buffers (#70530)
- Enabled Depthwise Specific Conv3d Kernel for Kernel Size 3x3x3 (#69315)
- Implemented 3d convolution in qnnpack (#66350)
ONNX
- Supports opset version 15 (#67805)
- Supports exporting
nn.Modulecalls as ONNX local functions (#66140, #67803) - Supports for exporting new ops:
- Added BFloat16 type support (#66788)
- Supports exporting with Apex O2 (#66700)
Infra (Releng)
- Added support for ROCm 4.3.1 (#65624)
- Added support for ROCm 4.5.2 (#71064)
- Added support for CUDA 11.5 (#69262)
- Added support for CUDA enabled Bazel builds (#66241)
- Added support for Python 3.10 (#71132, #71419)
Improvements
Python API
- NumPy compatibility:
- Improved
torch.Tensor.view(dtype): enable all dtype combinations (#66493) - Improved
torch.diffby adding support for n greater than 1 (#67260) - Improved
torch.movedimto handle scalar as no-op (#69537) - Improved
cartesian_prod: fixed a warning in the docs example (#68753) - Improved error messages for
max_unpool{}doperators (#67328) torch.distributions- Implemented positive-semidefinite constraint in
torch.distributions(#71375) - Implemented Entropy methods for Binomial and Multinomial distributions (#67609)
- Implemented support for
non-negativeconstraint in exponential distribution (allowing it to include zero). (#67184) - Implemented
kl divergencebetweennormalandlaplacedistribution. (#68807)
- Implemented positive-semidefinite constraint in
- Improved meta tensor support for operators:
- Added support for
torch.Tensor.realfor real-valued tensors (#71718) torch.logaddexp, torch.logaddexp2, torch.remainder: added BFloat16 support on CPU (#63621)torch.bucketizeandsearchsorted: added Half precision support (#67077)- Added new
torch.slice_scatter,torch.select_scatter,torch.diagonal_scatterops (#64430) - Made
torch.scatter_reducea public API (#68580, #73125)
C++ API
- Added C++ API and docs for
hfftn(#66127) - Added support for
MaybeOwned<IValue>(#68157) - Added
set_to_noneoption forzero_grad()to C++ API (#68801) - Added an environment variable,
TORCH_CPP_LOG_LEVEL, that you can use to toggle the log level in the c10 library (#71746)
Autograd
- Added nesting support for
torch.autograd.graph.saved_tensor_hooks(#70932) - Delayed all warnings encountered during the backward pass until the end of backward execution (#66235)
- Added complex autograd support to
torch.{col2im,im2col}(#68199) - Added new reduce options and autograd support for
torch.scatter_reduce(#71788) - Added derivatives wrt the second argument for
torch.{remainder,fmod}(#69908) - Added new
strategyflag toautograd.functional.{Jacobian, Hessian}to enable vectorized computation (#67041, #66292) - Added
check_backward_adflag totorch.autograd.gradcheckto be able to skip backward mode AD checks (#65040) - Relaxed forward AD layout check to allow primal and tangent stride to differ when their size is 1 (#66294)
Build
- Improved incremental build times of PyTorch core by removing a dependency on
native_functions.yamlin many core files (#64499, #66914, #64172, #64171, #66620, #66793, #66913, #66794, #64169, #64173, #64170, #67735) - Enabled bazel build without glog and gflags (#70850)
- Added support for C++ frontend wrapper on Linux (#69094)
- Added support for dynamic codegen outputs in CMake (#68246)
- Max CMake version is now used by default with setup.py (#69355)
- Upgraded oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
- Code base should now be
-Wno-unused-variablecompliant (#66041) - Added lazy import for
packagingintorch_version(#71345)
Dataloader
- Support custom
SequenceandMappingforutils.data.default_collate(#68779) - Allowed specifying
num_samplestoRandomSamplerwhenreplacementisFalse(#71568) - Fixed the warning of shape inconsistency
utils.data.default_collate(#71065)
ForEach
- Implemented
ForEachL1 & L2 norm (#62646)
LinAlg
- The
linalg.matrix_rank(docs) andlinalg.pinv(docs) operations now support specifying absolute and relative tolerances for better handling of singular values (#63102)
torch.nn
- Added
channels_lastsupport forChannelShuffle(#50247) - Added no-batch-dim support for
nn.{AdaptiveLogSoftmaxWithLoss, Bilinear, Conv*d, ConvTranspose*d, CrossEntropyLoss, CTCLoss, Fold, FractionalMaxPool3d, GaussianNLLLoss, GRU, GRUCell, InstanceNorm*d, LSTM, LSTMCell, MarginRankingLoss, MultiheadAttention, MultiLabelSoftMarginLoss, RNN, RNNCell, Transformer, TransformerDecoderLayer, TransformerEncoderLayer}(#69054, #69539, #70506, #71055, #70092, #64909, #69732, #69783, #70236, #65323, #71056, #64975, #67176, #70590, #65690, #70977, #70597, #70322, #69291) - Added
BFloat16support on CPU tonn.{AdaptiveAvgPool2d, AdaptiveMaxPool2d, AvgPool2d, MaxPool2d}(#56902, #66929, #66927, #56903) - Added
maximizesupport tooptim.{Adam, AdamW, SGD}(#68164, #70146, #67847, #68733, #71023) F.interpolate: Addnearest-exactmode to fix off-by-one error innearestmode (#64501)F.interpolate: Added support for anti-aliasing to bilinear and bicubic algorithms (#70930, #68819, #65142, #69318)F.interpolate: Improved error message for invalid shapes (#66417)nn.Conv*d: Accepts 0-sized channel inputs (#66256)nn.LogSigmoid: Usedlog1pfor improved precision (#66441)nn.Module: Added flag for removing duplicates from parameters (#71542)nn.Module: Addedregister_modulealias for registering a sub-module (#65174)nn.ModuleList: Supported concatenation (#70887)nn.MultiheadAttention: Added flag to optionally average output attention weights across heads (#70055)nn.ParameterDict: Supported full set ofdictmethods (#69403)nn.{RNN, GRU}: Allowedhidden_sizeto be 0 (#70556)nn.Sequential: Addedappendmethod (#71326)nn.Upsample: Exposedrecompute_scale_factor(#66419)nn.ZeroPad2d: Addedextra_reprfor printing purposes (#69206)optim.{ChainedScheduler, SequentialLR}: Addedoptimizerattribute (#67406, #69817)optim.swa_utils.AveragedModel: Addeduse_buffersflag for averaging buffers in addition to parameters (#65921, #71763)
torch.fx
- Improved the customizability of
fx.Graph’s code generation function, including support for setting a breakpoint in the generated code (#67139) - Supported printing inplace operators in FX (#71887)
Sparse
- Add CSR support for several operators:
torch.triangular_solve,torch.addmv,torch.addmm,torch.addfor all arguments on CPU (#62180, #61536, #65606, #64391)torch.triangular_solve,torch.addmv,torch.addmm,torch.addfor all arguments on GPU (#61407, #61858, #63511, #63948)- zero-preserving unary functions (#68123, #69292)
torch.empty,torch.resize_,torch.copy_,torch.randn_like,torch.clone(#63509, #63510, #68083, #70581)transpose(#70582)
- Added torch.sparse_coo Layout support to
zeros_like(#68108) - Added Half, BFloat16, and Complex dtype support for matrix multiplication of two COO Tensors on GPU (#59980)
- Added support for conversion of CSR to COO Tensor to
to_sparse(#66774) - Added support for empty COO Tensors to sparse.sum (#71091)
AMD
- Added sparse mappings for CUDA->HIP translation (#67323)
- Enabled frexp support for ROCm builds (#67226)
- Used hipCUB/rocPRIM scan algorithms for large index support (#68487)
CUDA
- Allows external CUDA streams to be set as current (#66324)
- Added an option to disable reduced precision reductions for FP16 GEMM (#67946)
- Improved CUDA memory usage of
nanmedianresult (#68591) - Reduced number of
igammakernel instantiations (#70666) - Reduced number of
comparekernels by unifying them (#69111) - Reduced number of
bernoullitensor tensor kernel instantiations (#70169) - Used
cub::FutureValueto simplify 64bit indexing split of cub scan (#66711) - Added
hascuSOLVERflag to Context (#69825) - Improved error message from
CUDACachingAllocator(#69174) - Fixed
masked_softmaxperf for element_size is not 8 (#70271) - Reduced binary size of
TensorCompare.cu(#68835) - Improved error message for
interpolation(#72066) - Doesn't compile
powkernels for non-existent case (#70017)
Profiler
- Added flop count formulas for
bmmandbaddbmm(#66636)
Vulkan
- Allowed Vulkan models to return multiple outputs by improving Vulkan tensor lifecycle management to release GPU resources when the tensor is destroyed, instead of being released at the end of every inference (#66477, #66478)
- Enabled multiple Vulkan models to execute concurrently in parallel threads, by moving components of the Vulkan global context into thread local objects (#67733, #69576)
Mobile
- Introduced multiple improvements for
NNAPI- Added converters for torchscript ops
quantized::mulandquantized::convtranspose2dto converter (torch.backends._nnapi.prepare.convert_model_to_nnapi) (#63913, #63914) - Supported
int32andqint16type in Torchscript expressions (#70197, #70621) - Supported runtime flexible shapes and return shapes (#70334)
- Added converters for torchscript ops
- Improved Model Tracer Coverage and Selective Metal Ops (#68134, #69492, #69328)
- Introduced multiple improvements for
CoreML - Type Support in Mobile Lite Interpreter
Distributed
torch.distributed- Improvements to error handling in
TCPStore’s socket implementation (#68225) - Enabled
ncclAvgfor reductions (#62835) - Init dummy
NCCLcomms in constructor (#65173, #66393) - Added pybind trampoline for
ProcessGroupandWork(#66338) - Setup
c10dextension Backend class attr the same way as builtin ones (#66991) - Added barrier to
ProcessGrouptrampoline (#67236) - Raised warning when calling collectives on non-member group objects (#67639)
- Patched
bfloat16support for NCCL (#67843) - Fixed
c10dTCP store race condition with mutex (#68499) - Surfaced
ncclUniqueIdstore broadcast error (#68597) - Checks for file existence before invoking cleanup logic in
FileStoredestructor (#68603) - Implemented gather primitive for
ProcessGroupNCCL(#66745) - Implemented scatter primitive for
ProcessGroupNCCL(#70029) - Enabled
gather_objectonNCCL(#71623) - Implemented
allreduce_coalescedforProcessGroupNCCL(#62140) - Set non-default backend names to lower case (#69400)
- Added support for
deleteKeyforFileStore(#69953) - Fixed
TSANissue inTCPStore(#69590)
- Improvements to error handling in
DistributedDataParalleltorch.distributed.rpctorch.distributed.autograd- Made Kineto + distributed a warning rather than an error (#71120)
torch.distributed.elastic- Added ability to override sys.executable for
torch.distributed.run(#66179)
- Added ability to override sys.executable for
TorchScript
- Several improvements to NVFuser, which is an optimization that speeds up all JIT graphs with a CUDA Tensors on Nvidia GPUs. This includes extending fusing support to normalization and reduction kernels, enabling multiple kernel launch for single
CudaFusionGroup, and addition of a graph segmentation cache to the hierarchical caching system. (#63745, #65137, #63745, #65137) - Enabled
profile_ivalueto convert dynamic scalar into compile time constants in NVFuser. (e.g. reduction axes). (#63745, #65137) - Added support in
torch.jit.tracefor tracing already JITted subgraphs(#59949) - We now provide full types on graph inputs when tracing graphs that are already JITted(#67424)
torch.jit.freezenow can preserve attributes of submodules - previously, it was only possible to prevent inlining of attributes of the top level module.(#66102)- The peephole optimizer, which is used in
torch.jit.freezenow coalesces consecutive calls totorch.concatinto a single call (#67000) - Added ability for Torch.JIT C dispatch to convert python
Noneinto an undefined Tensor(#67793) torch.jit.scriptnow recognizes union of scalars as a JIT NumberType (#66591)- No longer adds a tensor in a returned list to the wildcard alias set in AliasDB, allowing for additional optimizations in JIT optimization passes. (#71170)
- In
torch.jit.optimize_for_inference, there is a new graph pass to precompute transposes for linear layers. (#65631, 68024) - In
torch.jit.freeze, there is a new pass where we concat together multiple linear layers with same input Tensor (different weight/bias) (#63198, #68024) - Added support for normalizing
torch.Tensor.__rsub__innormalize_opsJIT pass(#65014)
Quantization
- Quantized op improvements
torch.ao.FakeQuantizenow supportsfp32/fp16zero_point. (#65836)torch.ops.quantized.addnow supports broadcasting (#66049)torch.Tensor.dequantizenow supports fp16 + cuda (#67234)- Added quantized CPU support for
torch.nn.GELU(#69968) torch.nn.quantized.functional.hardsigmoidsupports aninplaceflag (#65740)
- Workflow improvements
- FX graph mode quantization: enable
torch.nn.Linear + torch.nn.BatchNorm1dfusion for PTQ (#66484) - Added an option in
torch.ao.quantization.quantize_fx.convert_fxto acceptqconfig_dictto skip quantization (#66878) - Added
torch.nn.qat.dynamic.modules.Linearmodule (#67325) - Added
torch.nn.ConvTranspose{n}d + torch.nn.BatchNorm{n}dfusion support (#70022) - Extended
torch.ao.quantization.prepare_qatwithallow_listargument, to allow custom mapping and custom QAT module (#65119) - Added
torch.ao.quantization.default_replay_qconfigwhich allows observer reuse fortorch.reshapein FX graph mode quantization (#69249)
- FX graph mode quantization: enable
ONNX
- Set
ir_versionof the exported model based onopset_version. This increases the odds that the exported ONNX model will be usable. Before this change, we were setting the IR version to a hard-coded value which may be higher than what the model consumer supports. (#67803) - Preserved op input names when op just passes through the input to the output (#67275)
- Shape inference improvements:
- Included op type in exported models’ input and output names (#68976)
- Supports Conv-BatchNorm fusion inside blocks (#67272)
- Exported
torch.reciprocalto ONNX Reciprocal operator instead ofDiv(1, x)(#67271) - Supports
beta!=1in softplus (#66146) - Added warning for inplace updates on
tensor.shapein tracing mode (#66142) - Supports
instance_normin training mode (#64375) - Allow registration of custom symbolics for ops specifying aten namespace (i.e.
aten::foois allowed as well as “foo”). (#67810) - Allow registration of custom symbolics for
primnamespace (#66139) - Supports dynamic inputs for
OneHot, bool forEinsum(#66147)
Infra (Releng)
- Build with BUILD_SPLIT_CUDA for all 11.X Windows builds (#70899)
torch.package
- Add ability to retrieve the dependency graph via
all_pathfunction(#65602) - Add support for pickle v4 (#70642)
- Add better testing support for Package Exporter (#70641)
Bug fixes
Python API
- Fixed scalar inputs for aliased binary ops {
multiply,subtract,divide} (#65937) - Fixed
torch.savewhen saving storages that view same data with different type (#66949) - Fixed
torch.saveerror if storages are unallocated (#68787) - Fixed
kout-of-bounds intorch.kthvalue(cpu kernel) (#68863) - Fixed
inference_modedecorator:with inference_mode(mode=False)used to ignore themodeargument and always set inference mode. (#68617) - Fixed
cdist_backwardin the case whencdistinputs are not contiguous (#70016) - Fixed
cdisterror message typo (#70178) - Fixed
scatterfor empty indexes (#70662) - Fixed
torch.{unique, unique_consecutive}out of bound (#71540) - Fixed
torch.isinin the case when inputs are non-contiguous on CPU (#70659) - Fixed
hsplit vsplit dsplitcrash when section is 0 (#69342) - Fixed:
torch.gradientignores dim argument when checking edge_order (#67926) - Fixed:
TransformedDistribution.icdfshould perform validation after applying the inverse transformation rather than before. (#71393) - Fixed
torch.all and torch.anyinternal assert error with requires_grad=True (#65714) - Fixed
torch.logsumexptype promotion: promote integral inputs to floating for(#63393)
C++ API
- Fixed libtorch
at::Tensor::print()linking error (#69615) - Avoided UB when indexing into size-0 tensors (#65878)
- Fixed an ICE when compiling PyTorch from source on MacOS with clang-1300 (#65655)
Autograd
- Fixed autocast state propagation in the
torch.utils.checkpointAPI (#71169) - Fixed
torch.nn.functional.conv_transpose3dbackward when grad_out is non-contiguous (#67829) - Forward mode AD:
- Fixed a case where forward AD in-place-over-view silently copies the view (#67816)
- Fixed deadlock in forward AD for functions that return multiple outputs (#67995)
- Fixed forward AD codegen for functions that have multiple formulas (#68535)
- Fixed deadlock when forward and backward AD are used at the same time (#67360)
- Fixed
Tensor.copy_forward AD to handle broadcasting (#69592) - Do not generate not_implemented error for forward AD when input with tangent passed to non-differentiable function (#66926)
- Fixed
autograd.Functionwhen non-Tensor argument precedes tensor argument (#71530) - Fixed
autograd.Functionforward AD when forward is a no-op to no longer raise an internal error (#71531)
Build
- Stopped reporting CPU Capability as AVX512 on machines with AVX512 support but without AVX512 kernels (#66703)
- Disabled SVE when cross-compiling for M1 (#67114)
- Added failure if
pocketfftis not found andat_mklis not enabled (#67909) - Fixed clang issues when compiling with
_GLIBCXX_USE_CXX11_ABI(#72081)
Complex Numbers
- Fixed
torch.autograd.gradcheckto generate valid inputs for forward AD computation for complex functions (#68001) - Fixed
torch.Tensor.copy_transpose path for tensors with conjugate or negative bit set (#69026) - Fixed
torch.Tensor.copy_behavior for the case when two conjugated or negated tensors of the same dtype (one or both of which are non-contiguous) are copied into each other (#68963)
Dataloader
- Made
ProcessExceptionpicklable (#70118) - Fixed persistent worker exiting before
pin_memory_thread(#71579)
torch.nn
nn.AdaptiveAvgPool*d: Throws an error for negativeoutput_size(#70488)nn.Conv1d: Fixed for 1D convolution on MKL-DNN backend (#68166)nn.CrossEntropyLoss: Fixed for usage ofweight,ignore_index, andlabel_smoothingtogether (#69511)nn.Fold: Checked that block height and width are positive (#69048)nn.LayerNorm: Fixed incorrect result on CUDA whengammaorbiasare missing (#69210)nn.LayerNorm: Avoided overflow by doing computation infloatforhalf(#66920)nn.Module: Throws a proper error message fromload_state_dictfor non-tensor values (#70596)nn.ModuleList: Fixed incorrect return type in__getitem__(#69083)nn.MultiheadAttention: Used query dtype for mask type (#68077)nn.NLLLoss: Fixed backward computation with negative weights (#64572)nn.{RNN, GRU}: Fixed RNN modules with input shapes containing-0 in CUDA (#71696)nn.utils.rnn.pad_sequence: Fix regression to support tuples for padding (#72436)optim._LrScheduler: Fixed print formatting (#68338)optim.ChainedScheduler: Fixedget_last_lr()(#69112)optim.CosineAnnealingWarmRestarts: Fixed ordering bug whenlast_epoch > 0(#64758)optim.SequentialLR: Updated_last_lron step (#70558)
torch.fx
- Supported
torch.layoutas arg (#66048) - Specified a default value when possible for placeholders created from
concrete_args(#59569) - Fixed issue where
GraphModule.delete_all_unused_submodulesdeletes submodules from called leaf modules (#66430) - Fixed
torch.fx.subgraph_rewriter.replace_patternmechanism so that multiple one-liner instances of the pattern are captured correctly (#66442) - Fixed bug in graph matcher that caused certain nodes to be matched twice (#69238)
- Ensured node stack trace survives copying (#69368)
- Fixed
to_foldernot saving dtype (#69983) - Added a
default_valuearg tofx.Graph.placeholderand fixsplit_module(#71016)
Sparse
- Fixed CSR storage access to throw when used (#70072)
- Fixed multiplication of 0-D sparse tensors (#70749)
- Fixed result dtype for neg if given sparse Tensor (#68885)
CUDA
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (#66790)
- Fixed CUDA vs CPU consistency for index_put_ when accumulating (part 2) (#67189)
- Fixed error in warning about unsupported GPU (#67900)
- Disabled TF32 in
pinv_jvpandpinv_backward(#67948) - Fixed DLPack CUDA stream convention (#67618)
- Sets device guard in
_cudnn_implfunctions (#70406) - Fixed
mem_get_infowhen querying on a device other than the current device (#69640)
Benchmark
- Fixed divide-by-zero errors in
torch.utils.benchmark.Timer(#70050)
Dispatcher
- Added explicit
OperatorHandledestructor, so that the symbol shows up in windows builds (#70033)
Profiler
Visualization
- Fixed
torch.utils.tensorboardparsing JIT graph incorrectly (#65692)
Vulkan
- Greatly reduced memory usage of the Vulkan backend by updating the configuration of the Vulkan Memory Allocator (#69088)
- Addressed several warnings raised by the Vulkan Validation layers:
Mobile
- Fixed quantized logistic converter for
NNAPI(#70847) - Fixed potential crash if
MTLCreateSystemDefaultDevicereturns nil (#66859) - Used full name to look for the promoted prim operator table (#66081)
- Fixed function name bug in mobile export (#66915)
- Fixed issues with
irangenot having a header included inMetal(#66877) - Fixed backward compatibility issue for UnionType on mobile in
type_parser. (#71341) - Fixed forward flatbuffer type handling with dynamic type in
flatbuffer_loader. (#71500) - Fixed type equalities issue in
pytorch_jni_common(#71508) - Fixed missing properties to the executor in
CoreML(#67737) - Fixed memory computation when both constants and data tensors are present in model_dump (#66006)
- Ensured that function participating in bundled inputs have their “name" attribute set (#65856)
Distributed
torch.distributed- Fixed bug on empty
GLOO_SOCKET_IFNAME_ENV(#68933)
- Fixed bug on empty
DistributedDataParallel- Fixed “Cannot modify in-place due to DDPSink” (#66015)
torch.distributed.elastic- Fixed scale down bug caused by calling
rdzv_handler.shutdown()on premature agent failures (#67749)
- Fixed scale down bug caused by calling
TorchScript
- Fixed a race condition in the JIT interpreter when unpickling source ranges (5525e9a591)
- Fixed a ref counting loop for
CompilationUnit, resulting in memory leaks when class objects were in JIT graphs. (#65442) - Fixed bug where output type was discarded after calling SubgraphRewriter in C++ (#65453)
- Fixed bug where
torch.jit.optimize_for_inferencedid nottorch.jit.freezea module when passed a a non-frozen module (#71436) - Fixed bug where running module.forward() on a
torch.jit.freezeed module ran the wrong graph (#68316) - Fixed bug where alias analysis in the JIT optimizer was incorrect for the int[] version of
torch.split, resulting in invalid optimizations in various JIT optimization passes (#69745) - Fixed places where using
torch.autocasttogether with autodiff (module.backwards()) in a JIT graph had the wrong number of arguments and would error out. (#67648) - Forbid propagating gradients through views in JIT graphs as currently it is broken (#67732)
- Fixed bug where graph input types were incorrect after running
torch.jit.trace(#68242) - Fixed case where BroadcastMKLDNN breaks the stack invariant by pushing more than 2 tensors to the stack for when
torch.jit.freezeops are converted to MKLDNN(#66628) - Raised error instead of segfaulting when passing None into torch.jit.Graph.create (#68253)
- Raised error instead of crashing when a JIT ScriptFunction is pickled with an incompatible Python
pickleversion.(#69807) - Fixed bug where
torch.jit.scriptfails when comments in function has less indent than surrounding code (#70227) - Fixed incorrect device type when torch.device is called inside scripted (
torch.jit.script) code (#69645) - Fixed warning: overloaded virtual function
torch::jit::Function::callis only partially overridden in classtorch::jit::GraphFunction(4bf1be898d)
Quantization
- Fixed applying non-zero offset 1 to null pointer in
torch.nn.functional.interpolatefor quantized tensors (#65570) - Doesn't assume bias is a keyword argument to
torch.nn.Conv{n}d(#61647, #71426) - Made error message when trying to use
torch.quantize_per_tensoron non floats more specific (#66050) - Quantized
torch.nn.Embeddingconversion with unsupported dtype: make error message clearer (#66051) - Fixed
torch.nn.qat.EmbeddingBagfrom_float error message (#66989) - Fixed bug enforcing quant_min <= zero_point <= quant_max for float zeropoint in
torch.nn.EmbeddingQAT (#68852) - Fixed scale+zp serialization of
torch.nn.quantized.BatchNorm{2|3}d(#70432) - Fixed
torch.nn.Dropoutin FX graph mode quantization (#71043, #71438) - Fixed
qconfigsetting for fused modules in FX graph mode quantization (#71254) - Removed assumption number of rows is in 32 bit in fbgemm (#69066)
- Fixed
reduce_rangewarning when using default observers (#71027)
ONNX
- Doesn’t create invalid
index_selectop when constant folding through ONNX Gather with indices rank > 1. Fixes export of some uses of Embedding. (#68493) - Shape inference:
- Fixed inplace
fill_dtype export mismatch (#64580) - Fixed
remainder(#64578) - Fixed
reciprocalwhen input is not floating point (#67808) - Fixed
new_fullandfull_likefor Python 3.9 (#67806) - Fixed reduce ops on
binary_cross_entropy_with_logits(#67805) - Propagated node metadata across passes (#45256)
- Ensured outputs don’t have the same name (#66137)
- Fixed
padwith sequence inputs (#64377) - Fixed
instance_normwithtrack_running_stats=True(#64375) - Fixed
allandanywithdimarg (#67270) - Allows autograd functions (
prim::PythonOp) to be exported withOperatorExportTypes.ONNX_FALLTHROUGH(#67273)
torch.package
- Prevent import race condition that leaves
torch.package.PackagePicklerwith unwanted dispatch table entries. (#71025)
Performance
Python API
- Speed up pickling for
torch.dtype(#65182) - Speed up
histogram: avoid index_put_ overhead in histogram kernel's inner loop (#67815) - Speed up
torch.topkwith sort for some cases (#68632) - Speed up
torch.stack: don't unsqueeze every stack arg if possible (#70288) - Speed up
LayerNorm4-5% (#71423) - Speed up structured kernels: fix some unnecessary refcount bumps (#71140)
- Speed up
indexingfunctions: release GIL in a few places (#71728) - Speed up
torch.emptya bit: define check_sizes_nonnegative as inline (#71640) - Speed up
XLAtensor printing by reducing compilations (#71147)
C++ API
- Updated
c10::SmallVectorfrom LLVM (#69110) - Reduced some framework overhead in
at::copy_()(#68950) - Reduced some overhead in
StorageImpl::set_data_ptr(#65432) - Improved
IValueperformance for tuples by inlining tuple storage (#64066)
Autograd
- Stopped materializing Tensors full of 0s in forward AD when possible (#64837)
- Rewrote the backward of
linalg.luandlinalg.lu_solveto uselinalg_solve_triangular(#63569) - Updated
nn.functional.grid_samplebackward to compute input gradient only if required (#66069, #66070) - Stopped erroneously saving the output of
torch.softplusfor backward (#70296)
Complex Numbers
- Release GIL when assigning to real or imaginary components of a complex tensor (#71747)
- Restored conjugate and negative bits of a tensor when calling
repeat_interleave(#68523)
CUDA
- Used a better hash table in
CUDACachingAllocator(#71667) TopKCUDA Optimization: used multiple block per slice (#71081)- Removed sync in
Embeddingcaused byunique(#66091) EmbeddingBackwardexclusive_scan thrust->cub (#66566)sort_out_cuda: Used custom kernels to fill index tensors (#66668)masked_scatter: fuse mask count check into one kernel (#66871)- Enabled better depthwise conv perf on cudnn 8.2+ (#58749)
- Improved native
layer_normforward perf (#67977) - Improved native
layer_normbackward perf (#68238) - Fast path for size 0 GPU host malloc (#68532)
- Alternative implementation of CUDA pinned memory allocator focusing on multi-threaded scalability (#69299)
- Used legacy unrolled kernel for non-trivial offset calc cases (#71710)
- Removed
call_oncefromCUDACachingAllocator(#71668) - Reworked stat collection in
CUDACachingAllocator(#71669) - Fixed CUDA
LpNormFunctor(#70601)
Dispatcher
- Made
c10::KernelFunctionstruct smaller, which should reduce some memory usage by the dispatcher (#65618)
torch.fx
- Made
torch.fx.symbolic_tracereuse buffers if they're the same (#66211)
Profiler
Mobile
- Reduced PyTorch Library startup time by 40% for mobile and edge deployments(#65735, #65732, #65939, #66112, #66064, #66131)
- Reduced PyTorch Library heap memory utilization by 40% for mobile and edge deployments(#65732, #66112, #66064, #66131)
- Improve efficiency of IValue and reduce overhead in code paths that use IValue and perform Type Parsing (#65710, #64278, #66717, #65381, #66134, #65951, #70477)
TorchScript
- Improved performance of autodiff on small JIT graphs (#71666)
- Enabled autocasting of tensors between fp16, bfloat 16 and fp32 in torchscript models (#63939, #67707)
- Enables optimizations in more gradSumToSize cases in the JIT Autograd support(#63941)
- In Unpickling a JIT graph, avoid reading file from a stream for 0 byte tensor storage(#67787)
Quantization
- Sped up quantized
torch.nn.functional.interpolatefor channels last (#66525) - Sped up
torch.nn.functional.upsamplefor channels last (#70903) - Parallelized computation in
torch.quantize_per_tensor_affineandtorch.dequantize(#65845)
Documentation
Python API
- Added docs for
torch.adjoint. (#68869) - Clarified difference in behavior of
empty_stridedandas_strided(#64568) - Added some missing generated doc entries (
torch.select,torch.slice_scatter,torch.diagonal_scatter,torch.select_scatter) (#69030),histogramdd(#68273) - Typo and formatting fixes.
LinearLR(#67840),torch.any(#65310, #70187),torch.futures(#70630), jit docs (#68557),Tensor.type(#67019),torch.lobpcg(#71464),Tensor.triu(),Tensor.tril(),Tensor.ravel(). (#71057),torch.acosh(#66814), (#70439) - General Doc improvements for individual ops.
torch.finfo(mentiontorch.bfloat16) (#68496),torch.quantileinterpolation kwarg (#70637),from_dlpackandto_dlpack(#70437),set_printoptionsadded examples (#68324),index_add(#65806), topk doc (#65938),unique(#66132),chi2(#67379),torch.histc(#64191),emptyandempty_like(#68874),torch.cholesky_inverse(#69069),torch.dsplit(#70557) - Changed README getting started link to explicit instructions (#66828)
- Modernized and clarified docs for
torch.tensorandtorch.as_tensor(#63308) - Improved
torchhubdocs (#69970) - Updated docs for
torch.Tensor.realto indicate that it's supported for real tensors (#71962)
C++ API
- Fixed typos in ATen README (#69170)
- Mentioned
TORCH_SHOW_CPP_STACKTRACESinContributing.mddocs (#64052) - Updated link to C++ frontend examples (#66095)
- Added docs for Visual Studio extension (#63944)
- Added docs about an issue with compiling C++ extensions with CUDA 11.5 and Windows (#73013)
Autograd
- Updated docs for forward AD and make them public (#71643, #71159)
- Updated “Extending PyTorch” doc to cover forward AD (#66962)
- Fixed broken code syntax in autograd.rst (#69362)
- Fixed incorrect variable in autograd docs (#70884)
- Fixed typo in
torch.autograd.Functiondocs that prevented it from compiling (#66754)
Dataloader
- Added docstring for
default_collateanddefault_convert(#69862) - Updated the documentation for AMP with DataParallel (#69218)
torch.nn
F.binary_cross_entropy: Updated examples to avoid deprecated calls (#69816)F.linear: Fixed shape docs to indicate no-batch-dim support (#66884)F.max_pool*d: Added functional docs (#63264)F.multilabel_soft_margin_loss: Added reduction args to signature (#70420)nn.AdaptiveLogSoftmaxWithLoss: Fixed typo inlog_probname (#68926)nn.{BatchNorm1d, InstanceNorm1d}: Fixed input shape notation inconsistencies (#71371)nn.CrossEntropyLoss: Corrected typo in formula for class probability targets (#70220)nn.{ELU, Hardshrink, Hardsigmoid, MultiHeadAttention, Softplus, Tanh}: Made first line of docstring readable for overview docs (#70574, #71012, #70987, #71100, #70576, #70577)nn.Flatten: Simplified example code (#67472)nn.{Hardsigmoid, Hardswish, Mish, RReLU, SiLU}: Added activation function images (#65415)nn.KLDivLoss: Fixed rendering ofreductionarg (#66583)nn.KLDivLoss: Rewrote docs to clarify math (#67443)nn.MaxUnpool2d: Changed misleading example to better demonstrateoutput_sizeusage (#68936)nn.Module: Added note describing requiredsuper().__init__()call (#66909)nn.Module: Changedsuper()usage to Python 3 syntax in example (#65748)nn.Module: Fixed formatting fornamed_modules()(#70491)nn.NLLLoss: Corrected default value forreduce(#68426)nn.SmoothL1Loss: Clarified equivalence withnn.L1Losswhenbeta == 0(#70673)nn.{TransformerDecoderLayer, TransformerEncoderLayer}: Clarified defaultbatch_first=Falsedimension format (#66574)nn.Upsample: Indicated thatalign_cornerstakes effect inbicubicmode (#66756)nn.utils.clip_grad_norm_: Fixed rendering ofparametersinerror_if_nonfinitearg docs (#69958)optim.Adam: Fixed formatting (#70387)optim.AdamW: Fixed formula (#68587)optim.RAdam: Corrected default value oflrarg (#69186)- Removed orphan from cuDNN persistent note (#65160)
- Updated link to tutorial on defining NN modules (#65534)
nn.{AvgPool1d, AdaptiveAvgPool3d, MultiMarginLoss, PairwiseDistance, TripletMarginLoss}, ``F.{conv3d, conv_transpose3d, fold, linear}: Fix doc formatting regressions from no-batch-dim support (#73014)
torch.fx
- Fixed for retracing documentation which would break for n-ary operators (#71599)
- Updated
torch.fx.passes.split_moduledocstring (#65542) - Updated
fx.rstexample outputs (#68043) - Added document gotcha about training flag (#68915)
- Defined
get_dot_``graphto match documentation (#70541)
Sparse
- Updated sparse.rst to warn about _values() (#71088)
CUDA
- Updated Stream
waitdocumentation to reference underlyingcudaStreamWaitEventcall (#67973) - Documented
torch.cuda.ExternalStream,torch.cuda.caching_allocator_allocandtorch.cuda.caching_allocator_delete(#70126) - Updated
CUDA Graphsdocs: Fixedmake_graphed_callablesexample typos (#69379)
Mobile
- Added user facing documentation for tracing-based selective build mobile interpreter in Android and iOS (#1709)
- Added recipe for bundled inputs in TorchScript models (#1524)
Distributed
DistributedDataParalleltorch.distributedtorch.distributed.elastic- Made --max_restarts explicit in the quickstart and runner docs (#65838)
torch.distributed.optim- Rendered
torch.distributed.optimmembers (#67885)
- Rendered
torch.distributed.rpc- Deleted distributed optimizer section from RPC and add reference to namespace docs page (#68068)
TorchScript
- Added
typing.Unionto supported types in documentation (#68435) - Added documentation to
torch.jit.is_tracing()(#67326) - Fixed typos in
jit_language_reference.rst(#68706)
Quantization
- Added documentation for quantized model save/load instructions (#69789)
- Updated link to qnnpack in quantization doc. (#66226)
- Improved quantization API docs (#66379)
- Quantization docs: add pages for Numeric Suite (Eager and FX) (#66380)
- Documented the quantization custom module APIs (#67449)
- Improved quantization documentation (#68907)
ONNX
- Improved documentation of
operator_export_typeandopset_versionargs (#69549) - Fixed documentation for
do_constant_foldingarg default (#71348) - Documented
ExportTypes,CheckerError, andunregister_custom_op_symbolic(#68489) - Fixed link to ONNX Runtime custom op documentation (#67944)
- Added section “Discovering all unconvertible ATen ops at once” (#66143)
- Fixed typos (#66090)
- Documented work-arounds for indexing export limitations, and improve error messages (#64579)
torch.package
- Add some docs describing how to debug
torch.packagedependencies (#65704)
Download Release
This release has 2 assets:
- pytorch-v1.11.0.tar.gz
- Source code (zip)
- Source code (tar.gz)
Visit the release page to download them.
Have any questions?
Contact Exxact Today



.jpg?format=webp)