Deep Learning

PyTorch 1.10.0 Now Available

October 21, 2021

171 min read

Introducing PyTorch 1.10.0

PyTorch is a widely used, open source deep learning platform used for easily writing neural network layers in Python enabling a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by esteemed researchers around the world, and now adopted fully by Facebook.

The newest stable release of PyTorch, version 1.10.0, has a number of new highlights including CUDA Graphs APIs, Frontend and compiler improvements.

PyTorch 1.10.0 Release Notes

Highlights
Backwards Incompatible Change
New Features
Improvements
Performance
Documentation

Highlights

The new PyTorch 1.10.0 release is composed of over 3,400 commits since 1.9, made by 426 contributors.

PyTorch 1.10 updates are focused on improving training and performance of PyTorch, and developer usability. Highlights include:

CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads.
Several frontend APIs such as FX, torch.special, and nn.Module Parametrization, have moved from beta to stable.
Support for automatic fusion in JIT Compiler expands to CPUs in addition to GPUs.
Android NNAPI support is now available in beta.

You can check the blogpost that shows the new features here.

Backwards Incompatible changes

Python API

torch.any/torch.all behavior changed slightly to be more consistent for zero-dimension, uint8 tensors. (#64642)

These two functions match the behavior of NumPy, returning an output dtype of bool for all support dtypes, except for uint8 (in which case they return a 1 or a 0, but with uint8 dtype). In some cases with 0-dim tensor inputs, the returned uint8 value could mistakenly take on a value > 1. This has now been fixed.

1.9.1

1.10.0

>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(42, dtype=torch.uint8) # wrong, old behavior

>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(1, dtype=torch.uint8) # new, corrected and consistent behavior

Remove deprecated torch.{is,set}_deterministic (#62158)

This is the end of the deprecation cycle for both of these functions. You should be using torch.use_deterministic_algorithms , torch.are_deterministic_algorithms_enabled instead.

Complex Numbers

Conjugate View: tensor.conj() now returns a view tensor that aliases the same memory and has conjugate bit set (#54987, #60522, #66082, #63602).

This means that .conj() is now an O(1) operation and returns a tensor that views the same memory as tensor and has conjugate bit set. This notion of conjugate bit enables fusion of operations with conjugation which gives a lot of performance benefit for operations like matrix multiplication. All out-of-place operations will have the same behavior as before, but an in-place operation on a conjugated tensor will additionally modify the input tensor.

1.9.1

1.10.0

>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([1.+2.j])

>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([3.-2.j])

Note: You can verify if the conj bit is set by calling tensor.is_conj(). The conjugation can be resolved, i.e., you can obtain a new tensor that doesn’t share storage with the input tensor at any time by calling conjugated_tensor.clone() or conjugated_tensor.resolve_conj() .

Note that these conjugated tensors behave differently from the corresponding numpy arrays obtained from np.conj() when an in-place operation is performed on them (similar to the example shown above).

Negative View: tensor.conj().neg() returns a view tensor that aliases the same memory as both tensor and tensor.conj() and has a negative bit set (#56058).

conjugated_tensor.neg() continues to be an O(1) operation, but the returned tensor shares memory with both tensor and conjugated_tensor.

1.9.1

1.10.0

>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> z.add_(2)
>>> print(x)
tensor([1.+2.j])

>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> print(z.is_neg())
True
>>> z.add_(2)
>>> print(x)
tensor([1.-0.j])

tensor.numpy() now throws RuntimeError when called on a tensor with conjugate or negative bit set (#61925).

Because the notion of conjugate bit and negative bit doesn’t exist outside of PyTorch, calling operations that return a Python object viewing the same memory as input like .numpy() would no longer work for tensors with conjugate or negative bit set.

1.9.1

1.10.0

>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
[2.]

>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
RuntimeError: Can't call numpy() on Tensor that has negative
bit set. Use tensor.resolve_neg().numpy() instead.

Autograd

Raise TypeError instead of RuntimeError when assigning to a Tensor’s grad field with wrong type (#64876)

Setting the .grad field with a non-None and non-Tensor object used to return a RuntimeError but it now properly returns a TypeError. If your code was catching this error, you should simply update it to catch a TypeError instead of a RuntimeError.

1.9.1	1.10.0
try: # Assigning an int to a Tensor's grad field a.grad = 0 except RuntimeError as e: pass	try: a.grad = 0 except TypeError as e: pass

Raise error when inputs to autograd.grad are empty (#52016)

Calling autograd.grad with an empty list of inputs used to do the same as backward. To reduce confusion, it now raises the expected error. If you were relying on this, you can simply update your code as follows:

1.9.1	1.10.0
grad = autograd.grad(out, tuple()) assert grad == tuple()	out.backward()

Optional arguments to autograd.gradcheck and autograd.gradgradcheck are now kwarg-only (#65290)

These two functions now have a significant number of optional arguments controlling what they do (i.e., eps, atol, rtol, raise_exception, etc.). To improve readability, we made these arguments kwarg-only. If you are passing these arguments to autograd.gradcheck or autograd.gradgradcheck as positional arguments, you can update your code as follows:

1.9.1	1.10.0
torch.autograd.gradcheck(fn, x, 1e-6)	torch.autograd.gradcheck(fn, x, eps=1e-6)

In-place detach (detach_) now errors for views that return multiple outputs (#58285)

This change is finishing the deprecation cycle for the inplace-over-view logic. In particular, a few things that were warning are updated:

* `detach_` will now raise an error when invoked on any view created by `split`, `split_with_sizes`, or `chunk`. You should use the non-inplace `detach` instead.
* The error message for when an in-place operation (that is not detach) is performed on a view created by `split`, `split_with_size`, and `chunk` has been changed from "This view is an output of a function..." to "This view is the output of a function...".

1.9.1	1.10.0
b = a.split(1)[0] b.detach_()	b = a.split(1)[0] c = b.detach()

Fix saved variable unpacking version counter (#60195)

In-place on the unpacked SavedVariables used to be ignored. They are now properly detected which can lead to errors saying that a variable needed for backward was modified in-place.
This is a valid error and the user should fix this by cloning the unpacked saved variable before using it.

No internal formula will trigger this, but it might be triggered by user custom autograd.Function if the backward modifies a saved Tensor inplace and you do multiple backwards. This used to silently return the wrong result and will now raise the expected error.

torch.nn

Added optional tensor arguments to __torch_function__ handling checks (#63967)

This fixes the has_torch_function*() checks throughout torch.nn.functional to correctly pass in optional tensor arguments; prior to this fix, handle_torch_function() was not called for these optional tensor arguments. Previously, passing a tensor-like object into a function that accepts an optional tensor might not trigger that object's __torch_function__. Now, the object's __torch_function__ will be triggered as expected.

1.9.1

1.10.0

import torch
import torch.nn.functional as F
class TestTensor(object):
    def __init__(self, weight):
        self.weight = weight
    def __torch_function__(self, func, _, args=(), kwargs=None):
        print(func)
        print(func == F.group_norm)
# Call F.group_norm with a custom Tensor as the non-optional arg 'features'
features = TestTensor(torch.randn(3,3))
F.group_norm(features, 3)
# ...prints "group_norm" and True
# Call F.group_norm with a custom Tensor as the optional arg 'weight'
features = torch.randn(3,3)
weight = TestTensor(torch.randn(3))
F.group_norm(features, 3, weight=weight)
# ...prints "group_norm" and False because weight's __torch_function__ is
# called with func as torch.group_norm instead of F.group_norm

import torch
import torch.nn.functional as F
class TestTensor(object):
    def __init__(self, weight):
        self.weight = weight
    def __torch_function__(self, func, _, args=(), kwargs=None):
        print(func)
        print(func == F.group_norm)
# Call F.group_norm with a custom Tensor as the non-optional arg 'features'
features = TestTensor(torch.randn(3,3))
F.group_norm(features, 3)
# ...prints "group_norm" and True
# Call F.group_norm with a custom Tensor as the optional arg 'weight'
features = torch.randn(3,3)
weight = TestTensor(torch.randn(3))
F.group_norm(features, 3, weight=weight)
# ...prints "group_norm" and True

CUDA

Removed post-backward syncs on default stream (#60421)

Calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:

with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()# no sync
use grads

but a more benign-looking pattern was unsafe:

with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads

Note: this change makes it so that backward() has same user-facing stream semantics as any cuda op.** In other words, the weird pattern is unsafe, and the benign-looking pattern is safe. Implementation-wise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams. This PR deletes syncs on the default stream.

torch.package

Removed verbose mode from PackageExporter (#61145)

PackageExporter is losing “verbose” mode argument as we have found it is not useful and sometimes confusing. See following examples on how to modify your code to accommodate this change.

1.9.1

1.10.0

with PackageExporter(buffer, verbose=False) as e:
    e.intern("**")
    e.save_pickle("res", "mod1.pkl", mod1)
    e.save_pickle("res", "mod2.pkl", mod2)

with PackageExporter(buffer) as e:
    e.intern("**")
    e.save_pickle("res", "mod1.pkl", mod1)
    e.save_pickle("res", "mod2.pkl", mod2)

Quantization

Added extra observer/fake_quant (the same observer/fake_quant instance as the input) for some operators in prepare_fx, e.g. maxpool, add_scalar and mul_scalar (#61687, #61859)

Previously the way we insert observers/fake_quants are specific to fbgemm/qnnpack backend, as we work on making FX Graph Mode Quantization extensible to custom backends, we are changing some behaviors for the fbgemm/qnnpack path as well. The above changes are adding extra observer/fake_quant to the output of some operators to make sure we model the quantized operator more accurately in quantization aware training, the comprehensive list of operators where the behavior changes are the following:

modules: torch.nn.MaxPool1d, torch.nn.MaxPool2d, torch.nn.MaxPool3d, torch.nn.Identity
torch functions: torch.nn.functional.max_pool1d, torch.nn.functional.max_pool2d, torch.nn.functional.max_pool3d, torch.chunk, torch.flatten, torch.transpose, torch.repeat_interleave, torch.sort, torch.squeeze, torch.stack, torch.unsqueeze, operator.getitem,
Tensor methods: chunk, contiguous, detach, detach_, numel, permute, repeat, repeat_interleave, reshape, resize_, shape, size, squeeze, squeeze_, transpose, unsqueeze, unsqueeze_, view
Tensor operations: add scalar and mul scalar (add/mul with a Tensor and a Scalar input)

We will show an example with torch.nn.MaxPool2d:

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool2d = torch.nn.MaxPool2d(kernel_size=3)

    def forward(self, x):
        x = self.maxpool2d(x)
        return x
m = M().eval()        
m = prepare_fx(m, {"": torch.quantization.default_qconfig})
print(m.code)

1.9.1

1.10.0

def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x); x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0); x_activation_post_process_0 = None
    return maxpool2d

def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x); x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0); x_activation_post_process_0 = None
    maxpool2d_activation_post_process_0 = self.maxpool2d_activation_post_process_0(maxpool2d); maxpool2d = None
    return maxpool2d_activation_post_process_0

Note that self.maxpool2d_activation_post_process_0 and self.x_activation_post_process_0 will refer to the same observer/fake_quant instance, this is to simulate the numerics for the quantized maxpool implementation, where the output would reuse the quantization parameter of the input. Simple illustration with graph:

Before:

observer_0 - maxpool - ...

After:

observer_0 - maxpool - observer_0 (same observer instance as input observer) - ...

ONNX

Removed aten arg from torch.onnx.export(). (#62759)

The new OperatorExportTypes.ONNX removes the need for an explicit aten argument. If Pytorch was built with -DPYTORCH_ONNX_CAFFE2_BUNDLE the a None value means OperatorExportTypes.ONNX_ATEN_FALLBACK

1.9.1	1.10.0
torch.onnx.export(..., aten=True)	torch.onnx.export(..., operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN)

Deprecations

Python API

Deprecate __torch_function__ as a plain methods (#64843)

The __torch_function__ function used to create Tensor like objects did not have any constraint whether it should be a method, class method or static method.

To make it compatible with newer features on Tensor-like objects, we are deprecating setting it as a plain method. You can define it as a class method to get the current class and scan the argument list if you need an object that is an instance of this class.

Mobile

Removed API torch.utils.bundled_inputs.run_on_bundled_input (#58344)

This API caused many issues and is not really necessary. The functionality (run model with bundled input) can be achieved by using get_all_bundled_inputs. For example:

1.9.1:

model.run_on_bundled_input(0)

1.10.0:

model(*model.get_all_bundled_inputs()[0])

Distributed

torch.distributed.rpc: Removed ProcessGroup RPC backend (#62411 , #62985)

ProcessGroup RPC backend has been deprecated and 1.9 was the last release which carried it. The default RPC backend is TensorPipe which is the recommended backend for RPC. Users who use

torch.distributed.rpc.BackendType.PROCESS_GROUP

will be given an error message to switch to:

torch.distributed.rpc.BackendType.TENSORPIPE

ONNX

Removed following arguments in torch.onnx.export(): enable_onnx_checker, strip_doc_string, _retain_param_name (#64369, #64371, #64370)

enable_onnx_checker argument is removed. ONNX checker will now always run by default. Users can catch exceptions to ignore raised failures. strip_doc_string has been rolled into the verbose arg in torch.onnx.export(). _retain_param_name argument has been removed in torch.onnx.export() will default to True . There is no way to get the old behavior of _retain_param_name=False. Users should stop setting this arg.

1.9.1:

torch.onnx.export(..., enable_onnx_checker=False, strip_doc_string=False)

1.10.0:

try:
    torch.onnx.export(verbose=True)
except torch.onnx.utils.ONNXCheckerError:
   pass

Infra (Releng)

Disable ParallelTBB (#65092)

ParallelTBB config/codepath is no longer actively tested by PyTorch CI and as result is subject to code/functionality degradation

New features

Python API

Added new functions:

torch.isin() (#53125), torch.bitwise_{left/right}_shift, __rlshift__, __rrshift__ (#59544), torch.Tensor.{__rand__, __ror__,__rxor__} (#59240), torch.aminmax (#62401), torch.new_ones (#58405)
For numpy compatibility torch.cov (#58311), torch.frombuffer (#59077), torch.corrcoef (#60420), torch.nanmean (#62671), torch.cumulative_trapezoid (#61615)

The torch.special module is now stable! This module, consistent with SciPy’s special module, has 30 operations including the Hurwitz zeta function and various gamma functions. (#59623, #56352, #58126, #59141, #59143, #58650, #55878, #58838, #60512, #60641, #61633, #60519, #59691, #58194)
Added support for slots and subclass magic getstate/setstate method for Tensor serialization (#62745)
torch.optim:

Added Nesterov Adam as NAdam (#59009)
Added lr_scheduler.ChainedScheduler (#63491, #63457, #65034))
Added lr_scheduler.SequentialLR (#64037, #65035)
Added lr_scheduler.{ConstantLR,LinearLR} (#64395)

torch.cpu.amp.autocast: enable new API for CPU autocast (#57386, #63534)
Added BFloat16 support for torch.{cross, tril, triu, tril_indices, triu_indices, cumsum, cummax, cummin, median, kthvalue, nansum, nextafter, range, sinh, cosh, frexp, nan_to_num, sigmoid, sigmoid_backward, tanh_backward, addcmul, addcdiv, bucketize, bernoulli, dropout, fold, unfold, MaxPool2D, AdaptiveAvgPool2D, topk} on CPU (#62454, #63307, #55210, #60074, #61083, #61829, #55221, #61826, #55588, #56372, #62880, #55202, #59547)
Added BFloat16 support for torch.{ceil, floor, frac, round, trunc, sort, topk, aminmax, cumsum, logcumsumexp, cumprod, cummin, cummax} on CUDA (#57910, #58196, #59977, #62767, #57904).
Added torch.cuda.is_bf16_supported (#63798)
Added zero rate to Poisson distribution (#61511)
Added torch.segment_reduce (#59951, #60018, #61141, #61266, #59521, #60379, #60379)
Added boolean support to torch.isclose (#61271)
Added torch.trapezoid (#61475).
Added torch.gradient support for second order central differences (edge_order=2) (#58165)
torch.sigmoid: CUDA support and complex autograd support (#48647)
Added channels-last support for torch.bilinear and torch.nn,MaxUnpool2d (#56322, #49984)

Autograd

[Experimental] Forward mode AD:

NOTE: In addition to operators listed below, many simple ops are already supported. If you encounter an operator that does not have a forward-mode AD formula implemented, please file an issue. As a workaround, you can use custom <i>autograd.Function</i> to implement your own forward-mode-AD-supported operator.
Added forward-mode AD support for custom autograd.Function (#64061, #63434)
Added forward-mode AD support for torch.{acos, add, addbmm, addcdiv, addcmul, addmm, addmv, addr, angle, acosh, asinh, atanh, asin, atan, conj, baddbmm, bmm, cat, ceil, clamp, clamp_min, clamp_max, complex, copy_sign, cos, cosh, cross, cumprod, cumsum, cummax, cummin, deg2rad, div, dot, vdot, exp, exp2, expm1, expand, floor, frac, frexp, gather, hardswish, hstack, hypot, index_add_, index_copy_, index_put_, index_select, kthvalue, lerp, lgamma, digamma, polygamma, log, log10, log1p, log2, logaddexp, logaddexp2, xlogy, masked_fill_, masked_fill_, masked_scatter_, masked_select, max, maximum, fmax, mean, min, mininum, fmin, mm, mode, mul, lu, lu_solve, vstack} (#57768, #57863 #59711, #64742)
Added Forward AD support for the following element-wise and linear operators torch.{mvlgamma, nan_to_num, permute, pow, reciprocal, remainder, repeat, round, rsqrt, sigmoid, logit, sign, sgn, sin, sinc, sinh, sqrt, squeeze, sub, sum, t, flip, roll, rot90, take, tan, tanh, trace, transpose, tril, triu, trunc, unfold, unsqueeze, view, zero_, hardshrink} (#59993)
Added Forward AD support for torch.special.{xlog1py, entr} (#59711, #59993)
Added forward AD support for torch.linalg.{cholesky, cholesky_ex, eigh, inv, inv_ex, solve} (#62160, #64646, #62163, #62159)
Added forward AD support for torch.functional.leak_relu (#59993)

Added saved tensor hooks to customize packing/unpacking behavior of tensors saved for backward (#60685, #60663, #62564, #60975, #62909, #62717)
Exposed raw saved tensors for custom autograd.Function to use with the saved tensor hooks (#60551)
Added default saved tensor hooks (#61834, #62563, #62361)
Added context manager using default saved tensor hooks to automatically move saved tensors on CPU and back (#61928, #62410)
Added C++ and python bindings for .is_inference() method (#58729)
torch.lu_solve: Implement support for backward AD (#61681).

torch.nn

Added new modules: nn.{ReflectionPad3d, LazyInstanceNorm*d} (#59791, #60837, #61308, #60982)
nn.CrossEntropyLoss: Added support for class probability targets (#61044)
nn.CrossEntropyLoss: Added support for label smoothing (#63122)
nn.Module: Added support for arbitrary objects in state_dicts via get_extra_state() / set_extra_state() (#62976)
nn.utils.skip_init(): Added function to skip module parameter / buffer initialization (#57555)

Profiler

Added profiler support for mobile (#62419, #62418, #62417,#62228,#62191,#61792)
Ported Nvtx support to new profiler (#61634)
Added Tensor core usage stats and recommendations in Tensorboard (#364,#368,#383, #422)

CUDA

Allow enabling warnings on CUDA synchronization (#62092)
Added CUDA graph Prototype API and documentation (#63269)
Make stream semantics of backward calls consistent with other cuda ops (#57833, #60230, #60127)
Enabled autocast support for user-specified device and dtype (#61002, #63416)

C++ API

Added C++ API for meta functions. They are available in the at::meta:: namespace (#58570)
Exposed interface to set grain size on cpu_kernel, cpu_kernel_vec and cpu_kernel_multiple_outputs (#58949)
Added at::native::resize_bytes_cpu to resize Storage in ATen (#60324)
Added transpose to PackedTensorAccessor (#61114)
Added torch::linalg::qr as the C++ API (#60529)
Exposed amin and amax to aten symbols (#61550)
Added support to invoke callable activation function for Transformer modules (#62342)
Added support for c10::optional to compare with different but comparable types (#62890)
Added a unified API c10::util::check_env to check environment variable (#59052)

TorchScript

Added reference semantics to TorchScript classes (#44324)
Conservatively moved all suitable prim ops from full-jit to mobile, and make them selective. (#58353)
Added change to predicate uses of RPC APIs on torch.distributed.rpc.is_available() (#58887)
Added a phase to perform inplace<->functional conversion for activation operators (#57477)
Enabled Profile-Directed Typing in torch.jit.script (#62420)
Introduced enhancement for smart serialization for operator schemas with out arg (#63096)
Added a pass to transform better handle concatenation ops (#59881)
Added a new operator for concat that takes in variadic parameters (#59880)
Added support for union in TorchScript (#64234)

torch.package

Added basic tooling to enable users to see what is inside of a PackageExporter (#61147)
Added hasattr to torch::deploy C++ API (#62669)
Added support to re-save a PackageImporter module (#65101)
Added support to make frozen symbol name customizable in torch::deploy. (#63817)

Mobile

Built lite interpreter the default for Android and iOS (0c3db1cb33, b5a834a739)
Enabled kineto profiler on mobile via EdgeKinetoProfiler (#62419)
Added support of loading lite interpreter module from assets in Android (#61609)
Enabled tracing based selective build (#63421, #64087, #66237, #66395)

built tracer in OSS (#64087)
used operator.yaml to build libtorch library (#66237)
Built tracer and enabled tracing-based build with tracer output (#66395)

NNAPI

Android NNAPI delegate implementation of runtime initialization (compilation) and execution (#62272)
Added
```
aten::{avgpool2d,softmax,to,div,flatten,detach,slice,log_softmax,conv2d_transpose}
```
to NNAPI converter (#58538, #58539, #58540, #58541, #60885, #58543, #59364, #61378, #59529
Added Int32 support for NNAPI (#59365)
Made nnapi aten::{conv2d,linear,cat,flatten} converter accept flexible batch (#61021, #61022, 76c0f223d3, #61024)
Added option to specify custom NNAPI serializer (#61025)
Made Android NNAPI preprocess to accept both single Tensor inputs and Tensor List inputs (#61752)
Added a few improvements in NNAPI delegation (#63489)
Added support const values in binary ops (2d58f3f56d)

Added unary/binary ops necessary and more shape functions for mobilenet (#56828, #58932)
Added aten::{hardswish,tanh,clamp} for iOS Metal (#64588, #61383)
Added CoreML support (#64521, #64522, #64523)
Added compatibility API (#61477, #57501)
Added support operators with default argument in front of out argument (#63651, #63540)

Distributed

DistributedDataParallel

Local SGD and variants for DDP communication optimization (#60303, #60320, #60632, #60891, #61206, #61207, #62105, #62111, #62131, #62132, #62392, #63277, #63340, #64885, #65197)
Provided a noop hook for performance debugging (#64344, #64352)
Implemented BF16 allreduce gradient communication hook (#63260)
Allowed retrieval of model parameters in communication hook (#61637)

torch.distributed

Added a function to create new subgroups of a given size (#59111)
Introduced a new torchrun entry point for elastic (#64049)

torch.fx

Added APIs to mutate specific args/kwargs (#58571)
Introduced EngineHolder for serializing and running TRT Engines with PyTorch (06399d441d)
Introduced __fx_create_arg__ dunder method for controlling custom classes are handled as node args (#61780)
Added autowrap_functions kwarg to Tracer (#62106)
Gradual typing

Added type annotation field to nodes (#60621)
Added experimental gradual typechecker (#60805)
Extended all experimental type-checking operations to support conv2d, BatchNorm2D, ReLU, maxpool2D, AdaptiveAvgPooling2D, flatten (#61093, #61012, #61150, #61188, #61239, #61265)
Added experimental refinement types and unification for symbolic shape inference (#61776)
Changed output node handling for typechecker to deal with tuples (#62582)
Added handle of get_attr operations in typechecker (#62682)
Added equality constraints for some acc operations for symbolic inference (#63689)
Added inference for algebraic expressions (#63822)

Provided function interface for remove_duplicate_output_args (#65134)
Introduced helper function to generate an unique name for an attr in a module (#64970)

ONNX

Added support for ONNX op set 14 (#59486)
Added support for GRU RNNs with packed input in scripting mode (#58691)
Enhanced shape inference (#64585)
Added support for torch.{linspace, new_ones, nn.LSTMCell, bernoulli, dot, nn.utils.spectral_norm,bernoulli, distributions.normal.Normal, roll} (#58854, #59255, #62757, #62765, #59536,#61560,#58697)

Infra (Releng)

Default Linux/Windows testing workflows were migrated to GitHub Actions. PyTorch Probot has been extended to support new set of rerun command with new set of labels that one can use to opt in and opt out of certain types of CI. More information can be found on Continuous Integration wiki page
Overall statistics and health of PyTorch CI/CD system can be viewed at https://metrics.pytorch.org (#65157, #61389, #62217, #64948, #60026, #61071, #64303)
Improved mechanism for disabling tests via issues. Creating an issue which title begins with “DISABLED” followed by the test name will disable the test in question for all platforms, which could be refined by explicitly specifying list of platforms in the issue body. Comment from @pytorch-probot would indicate that issue format was recognized by the CI system and test is now disabled. Closing the issue re-enabled the specified test in CI. Disabled tests will be temporarily re-enabled while running CI for PR marked as fixing it (#61427)
New documentation preview and new artifacts frontend. Using https://hud.pytorch.org, one can get an overview of PR/commit CI status, download build artifacts as well as read documentation associated with this build. See Using HUD wiki page for more information (#60711, #60792, #60893)

Misc

Added support for torch.fft. operators on ARM-based platforms using pocket FFT (#60976, #62222, #63714)
torch.einsum: added support for the “sublist” format (#56625)
torch.linalg.det: added support for complex autograd (#58195)
Added autograd support for Tensor.to_sparse (#58413)
Added more CUDA support for CSR layout: constructors (#59010), sparse_to_dense/add_sparse_csr (#59011), addmm/matvec (#59012)
Vulkan: Added support for max_pool2d, tanh, hardshrink, log_softmax, leaky_relu, softmax (#58806, #60695, #62870, #63193, #62239)
Enabled local run of clang-tidy and clang-format lint workflows (#61121, #61797, #60745)

Improvements

Python API

Added clearer stack trace for torch.floor_divide deprecation warning (#64034)
Use cascade-summation algorithm to improve torch.nansum accuracy (#61082)
torch.i0: now promote integer inputs to float (#52735)
torch.kthvalue: added change to adjust output dim size for numpy compatibility (#59214)
Added reduce variants for torch.s``catter operation. (#57015)
Added support for quantized tensors in torch.testing.assert_close (#58926)
Improved error message for invalid value input to Distribution methods (#61056)
torch.isclose upcast to most precise dtype within their category before the comparison (#60536)
Added change to cast alpha to acc_type for torch.add and torch.sub (#60227)
Fixed dimension in the error message for CUDA torch.cat shape check and removed unnecessary offending index information (#64556).
Improved DLPack support (#57110).
Added change to raise an error when empty index tensor is passed to torch.gather (#65006).
Added change to store float64 in tensorboard instead of float32 (#59435).
Added use_strict_trace to tensorboard add_graph method (#63120).
Add option to skip GH validation for torch.hub (#62139)
Added a new kwarg output_size to tensor.repeat_interleave(#58881)
Add support for torch.isclose (#63571)
Make the behavior of torch.{testting.assert_close,is_close} consistent with numpy (#63841)

Autograd

Added warning about memory leak when .backward() is called with create_graph=True (#59412)
Added warning when accessing Tensor::grad() on a non-leaf Tensor in the C++ API (#59362)
Fixed error message formatting in grad_output creation for .backward() and autograd.grad() (#59532)
Added change to raise NotImplementedError for forward and backward-mode AD formulas that are not implemented (#59482, #59483)
Reduced memory usage for torch.relu for common use cases (#63089)
Added support for non-leaf inputs for autograd.backward() function inputs argument (#60521)
Improved error message when a tensor with requires_grad=True is passed to a non-differentiable function (#60610)
Made binary_cross_entropy differentiable w.r.t. target (#59447)

torch.nn

Added support for inputs with no batch dimensions for nn.{AdaptiveAvgPool*d, AdaptiveMaxPool*d, AvgPool*d, CosineEmbeddingLoss, Dropout, FractionalMaxPool2d, Linear, LPPool1d, MaxPool*d, MaxUnpool*d, NLLLoss, PairwiseDistance, ReflectionPad*d, ReplicationPad*d, TripletMarginLoss, ZeroPad*d}, most other loss modules, and all activation modules (#61264, #61847, #61860, #64590, #61911, #62490, #60992, #62190, #62206, #61984, #61310, #62651, #64882, #62183, #61060, #61262, #62729, #61300, #61461, #62726)
Added support for inputs with 0 batch size for nn.{AdaptiveAvgPool*d, AdaptiveMaxPool*d, Bilinear, FractionalMaxPool*d, LocalResponseNorm, MaxPool*d, MaxUnpool*d, TransformerDecoder, TransformerDecoderLayer, TransformerEncoder, TransformerEncoderLayer} (#62025, #62088, #47106, #62083, #62801, #64082, #62800)
Parametrization: Added support for nested parametrizations, parametrizations depending on several inputs, resizing of parametrized tensors, and the orthogonal parametrization (#65167, #60530, #60418, #62089)
nn.AvgPool2d: Added channels_last support on CPU (#58725)
nn.BatchNorm: Use resize_output and empty instead of empty_like to improve flexibility in output memory format choice (#63084)
nn.Bilinear: Added support for non-contiguous tensor inputs (#38409)
nn.GELU: Added support for fp32/bfloat16 in CPU path using mkldnn implementation (#58525)
nn.GroupNorm: Improved numerical stability by using the Welford algorithm and cascade summation (#54921)
nn.LayerNorm: Improved numerical stability by using the Welford algorithm and pairwise sums (#59987)
nn.NLLLoss: Added support for target of dtype byte (#60308, #60650)
nn.SmoothL1Loss: Added support for integral target within the backward pass (#61112)
nn.Transformer: Added configurable pre/post LayerNorm placement (#60593, #61692)
Added check to verify non-zero sequence length for nn.{RNN, LSTM, GRU} (#60269)
Added support for bfloat16 in CPU path to nn.{LeakyReLU, RReLU} (#61514)
Added support for channels_last memory format in nn.{AdaptiveMaxPool2d, GroupNorm} (#48920, #49821)
Added callable activation function support to nn.{MultiheadAttention, Transformer, TransformerDecoderLayer, TransformerEncoderLayer} (#61355, #62342)

Profiler

Changed profiler.profile argument with_flops when set to True to report total FLOPs rather than FLOP/s, and support more operators (#62779, #61895)
Improved memory profiling and Tensorboard memory view, enabling better understanding of memory usage by showing active memory allocations at various points of your program run as well as a memory usage trend chart. (#61282, #361, #404,#416,#421)
Added flow arrows between ops in the forward pass and the corresponding ops in the backward pass in the trace view (#62553, #372)
Increased profiling coverage of backward pass (#63619)
Made threads and GPU streams appear in a consistent sorted order in the trace view (#399)
Added shapes and reg usage to the GPU kernel view (#351)

Dataloader

Properly delegated indices called by Subset to dataset (#59513)
Removed the restriction that input datasets in ConcatDataset must be Sized (#64114)
Allowed annotation of IterableDataset to accept keyword-only arguments and abc class (#58450)
Changed annotation of DataLoader to accept non-integer Sampler as input(#63500)

CUDA

Include function name in the error message for inputs being on different devices (#58502)
Fix MAGMA initialization (#58521)
Updated NCCL to 2.9.8 (#58667)
Added deterministic path for torch.scatter_add for 1D tensors (#58761)
Added CUDA support for mean reduction (#59543)
Add missing CUDA kernel launch check (#60114)
Improved CUDA extension building error/warning messages (#59665, #60592)
Added change to compute CUDA reduction buffer size in elements (#63969)

TorchScript

Added change to simplify pass on arithmetic expressions for integers. (#61444)
Set future's error to current exception as is when
```
--torch_jit_enable_rethrow_caught_exception=true
```
(#63348)
Improved TorchScript module getattr() to be same as python class getattr() method (#61599)
Improved slicing for scripted version of torch.nn.ModuleList to support arbitrary step size (#58361)
Added parsing logic for Tuple[()] annotation (#58340)
Changed list striding kernel implementation to handle optional integers (#58536)
Added support for torch.nn.Parameter type for Profile-Directed-Typing (#59249)
Added change to annotate NoneType as Optional[type] (#60383)
Added support for default values on NamedTuple fields (#54682)
Improved JIT support for torch.einsum (#59265)
Added change to allow for heterogenous List and Dict values + Improve container typing algorithm (#57137)
Added support for eager mode use of torch.jit.isinstance with multiple types (#60465)
Allowed uncompiled strings as input to checkScriptRaisesRegex (#63901)
Introduced more robust check of whether a class is defined in torch (#64083)
Added change to preserve types during empty container assignment (#58911)
Made JIT not assume that the device is CUDA. (#54238)
Updated optimize_for_mobile to preserve nodes’ debug information (#63106)
Added support for device as Dict key (#65079)
Added support for Python C extension modules in torch::deploy (#58117)
Added a flag to suppress stacktrace in exception messages(#63073)
Added API to change logging levels for JIT (#58821)
Provided API to preserve source range and callstack information during graph rewrite (#58300)
Re-enabled BatchNorm autodiff (#57321)
Extracted element-wise ops supported by JIT fuser into a separate list (#59579)
Reworked requires_grad on DifferentiableGraphOp (#57575)

torch.package

Unified three categories of dependency handling error (broken, denied, unhandled) into a single "error" field in the node, with optional context (#58572)
Renamed MockZipReader into DirectoryReader (#59107)
Added change to silently skip cases where the **import** statement cannot be parsed (#61148)
Make torch::deploy work with or without cuda (#58493)

Mobile

Added check to ensure op name does not contain open parenthesis (#58687)
Added handles and symbolicate exception callstack thrown from backend (#55462, #57441, #57481)
Enabled implicit operator versioning via number of arguments (#58852)
Cleaned up unused APIs and improve debugging experience for iOS GPU (#60280, #60281,#60282)
Added debug information to track memory allocation exception for Metal (#59112)
Added print of IValue type name in error message for Android (#64602)
Added print of error message when failing to load model file (#63404)
Introduced multiple improvements in torch.utils.model_dump APIs:

Make stdout argument for main kwarg-only (#60699)
Implement "Hider" properly (#57654)
Handle torch.device objects (#57656)
Handle dict rendering (#57657)
Add a section that summarizes tensor memory usage (#57658)
Handle invalid UTF-8 in pickles (#57661)

Quantization

Added out variant for int8 quantized::linear (#58282) and quantized::embedding_bag_byte_prepack (#64081)
FX graph mode quantization: improve qconfig_dict argument handling (#59605, #58566)
Added support to embedding trained in FP16 (#60736)
Added support for torch.index_select on quantized tensors (#61406)
Added a new fused MovingAvg Obs + FakeQuant operator (#61570, #61589, #61691, #62346, #62863, #62702, #63043, #64829)
Added support for dynamic linear + relu fusion (INT8) (#63799,#63826)
Enabled JIT tracing on quantizable LSTM (#64438)

Distributed

DistributedDataParallel

Added error logging to DDP logging API (#59281, #59284, #59351,#65023)
Added NCCL_ASYNC_ERROR_HANDLING environment variable to control NCCL error handling (#59109)
Communication hook APIs to always return single tensor (#62074, #62389, #62457)
Added DDP bucket sizes in DDP logging API (#62229, #62232, #62231, #62625,
Improved rebuilding buckets logic (#62279, #58097)
Allowed DDP uneven inputs work with communication hooks (#61017, #61018, #61019, #61020)
Added logging if graph is static at end of training (#61871)
Added logging of unused param names under DETAIL debug mode. (#62209)
Allowed tuning of first bucket in DDP (#62748)
Added gradient ready order, host-side timestamps, and bucket indices to DDP logging (#62751, #62770)
Added a debug check in C++ fp16 gradient hook (#63379)
Added a fallback to use mul and copy_ instead of mul’s out= variant when gradient tensor requires grad in DDP (#63831)
Used Tensor.set_ instead of directory assigning data in model averaging (#63895)
Added more iterations for DDP logging (#64071, #64411)

torch.distributed

Introduced ProcessGroup wrapper and use it in debug mode(#58224, #58281, #60237)
Made a small change for torch.distributed launcher (#59152)
Added complex number support for all_to_all/scatter (#61299)
Made gloo communication profiling more accurate (#61342)
Used generator instead of list to save memory in scatter (#62516)
Provided failure reason from ProcessGroup when aborting NCCL communicator (#64241)
Introduced error raised when capturing uncapturable NCCL in CUDA graphs. (#64440)
Added Single-Machine Model Parallel Support to
```
torch.distributed.optim.ZeroRedundancyOptimizer
```
(#61370)

torch.distributed.nn.RemoteModule

Supported creating a RemoteModule by RRef (#59242)
Supported switching RemoteModule between train/eval (#59026)

torch.distributed.elastic

Added minor logging and error formatting improvements (#63214, #62823)
Improved process termination logic (#61602)
Added fqdn hostname to error printout (#66662)

torch.distributed.rpc

Fix RPC initialization to avoid shutdown timeout (#59801)
Supported RRefs that contain threading.Locks (#57943), torch.cuda.Event (#61354)
Updated rpc tensorpipe logic for sparse tensors (#64575)
Added rpc sparse tensor fix (#59609, #62794)
Added change to ensure that future completion doesn't swallow exception. (#61094)
Set streams when invoking UDFs (#59210)
Set and propagate devices in RRef completion Future (#59211)
Made TensorPipe agent use streams from Future when sending response (#59212)
Added change to leverage TensorPipe's automatic SHM address selection (#63028)
Made Future store Storages instead of references to DataPtrs (#60470, #60943)
Added change to avoid re-doing CUDA stream sync in OwnerRRef (#57355)

torch.distributed.Store

Enhanced connect timeout error message (#61390)
Added minor fixes in c10d for Windows (#62953)

torch.distributed.pipeline

Supported non-tensor inputs in pipeline parallel API (#55441, #57226, #57325)
Added a WithDevice wrapper to specify device execution for a module. (#65190)

torch.fx

Added users of a node to the serialized JSON (#59357)
Added requires_grad to TensorMetadata (#60972)
Added change to swap out Python's AnnAssign with an Assign node where the annotation function is called (#60622)
Added type annotations for the torch.nn.Module constructor (#61334)
Enabled torch.deploy for GraphModules with non-torch dependencies (#61680)
Added change to allow FX tracer to trace control flow (if/while) statements when parameter shapes are in the conditionals (#61820)
Added torch.memory_format as a BaseArgumentType (#62593)
Added backwards compatibility guarantees for 1.10 (#63888)

Renamed reduce functions back to their old, public names (#64324)
Added change to ensure BC coverage for all of torch.fx passes (#65081)

Add __matmul__ to the magic methods for FX tracing (#64512)

Composability

Added meta tensor support for

torch.{any, all, fmax, fmin, remainder, glu, argmax, argmin, avg_pool3d_backward, isposinf, isneginf, fmod, fmin,      signbit, slow_conv_transpose2d, nll_loss_backward, cumprod, aminmax, addcmul, addcdiv, gather, hardshrink_backward, softshrink_backward, hardshrink, gelu, gelu_backward, avg_pool2d, avg_pool2d_backward, avg_pool3d, reflection_pad1d_backward, all, any, silu_backward, sgn, softplus, leaky_relu_backward, hardsigmoid_backward, elu_backward, eq, xlogy, ne, lt, gt, le, ge, sigmoid_backward, tanh_backward, logit_backward, bitwise_or, bitwise_xor, bitwise_and, nll_loss_forward, log_softmax, log_softmax_backward_data, prod, norm, sum.dim_IntList, clamp}

(#64642, #58458,#58732, #61800, #60363, #60364, #59084, #60633, #60809, #60810, #57936, #55503, #62144, #61899, #62401, #62318, #62319, #63312, #58662, #58663, #58664, #58665, #58987, #59082, #59083, #59103, #60360, #60361, #58661, #58197, #58482, #58483, #58484, #58660, #60177, #60814, #60942, #60815, #60816, #60817, #60811, #60812, #60813, #61443, #57374, #62372, #62024, #62711, #61642, #61361)

PyObject preservation: Previously, tensors in python that no longer had any python-side references (but still had references in C++, e.g. if it’s saved for autograd) would get deallocated, and we would create a new Python object to replace it next time it passes from C++ to Python. We now preserve the PyObject as long as there are any references on either the python or C++ side. This ensures that any metadata on the original python object is preserved. For example, tensor subclasses that were saved for autograd now get properly preserved. (#56017)

Build_Frontend

Added a new include directory in BLIS search path (#58166)
Added print to show full Python version in torch.utils.collect_env (#59632)
Added change to respect CMAKE_PREFIX_PATH choice set by caller (#61904)
Dropped incremental linking on Windows when REL_WITH_DEB_INFO=1. (#64892)
Enabled kineto build for ROCm platform (#58401)
Added support to system-provided Intel TBB (#61934)
Added Pytorch build support with Newlib c library (#60345, #60052)
Imrpove torch.__version__ comparisons (#61556, #64565, #63848)
CMake: added optional precompiled header support (#61940)
Removed unnecessary Ubuntu version checks (#61738)
Added GPU support to bazel builds (#63604)

Infra (Releng)

Improved automated test sharding. (#59727, #60206)
Added change to strictly type everything in .github and tools (#59117)
Upgraded Windows CI Python to 3.8 (#59729) and CUDA to 10.2 (#65080)
Made change to use expecttest from PyPI (#60658, #63320)
Added option to run specified tests option to run_test.py (#59649)
Enabled Metal in PyTorch MacOS/iOS nightly builds (#63718, #65075)
Added retries to flaky CI steps. (#65013, #65104, #64120, #60216, #63319)
Allowed Docker build on macOS (#60375)

Misc

Added support for MIOpen channel last convolution (#63617)
Enabled kernel asserts on rocm (#49624)
Added bool, float16, bfloat16 and complex support for to_dense for CSR sparse Tensors (#60657)
Added complex dtype support for matrix multiplication of two COO sparse Tensors on CPU (#59554)
Added the “upper” kwarg to torch.linalg.cholesky (#62434)
Improved error message in ONNX when attempting to export dict modification (#58696)
Migrated THAllocator to MapAllocator in ATen (#60325)
Converted input type of TensorOptions.device_index from int16_t to to c10::DeviceIndex (#60412)

Bug fixes

Python API

Added fix to recognize transposed dense tensors as a form of partial overlap (#59014)
Fixed torch.polygamma incorrect behavior at infinites when n>=1 (#61641)
Fixed for non-contiguous inputs for torch.{sort,topk} on CUDA (#63029), torch.tensor_split indices(#63390)
Fixed legacy constructor torch.Tensor when given a scalar Tensor (#58885)
Added change to not wrap Tensor.{grad,_base} by default for Tensor-like objects(#60464)
Fixed torch.angle on aarch64 (#59832)
Fixed specialized convolution kernel on arm64 (#60460)
torch.normal: fixed RuntimeError when standard deviation named arg is torch.empty (#66524)
Fixed random sampling on SGX platforms (#60368)
Fixed testing when Scipy is not available (#61699)
Fixed torch.Tensor.copy_ when using large inputs and broadcasting (#64425)
Fixed broadcasting behavior for torch.trapezoid (#64054).
Fixed dtype check of comparison ops (#64267).
Fixed torch.median crash on empty tensor (#61698)
Fixed missing lazy initialization in torch.get_num_threads (#64486)
Fixed check for empty named dims list to torch.flatten (#61953)
Fixed torch.hub.{list,help} functions for Windows (#63773)
Fixed torch.{istft,rfft} errors for special inputs (#63469, #63327)
Fixed type annotation

optim.lr_scheduler.CosineAnnealingWarmRestart

(#61106)

Fixed type annotation of torch.hub.load (#63755)

x[index] = value no longer results in a RuntimeError if x and value are different devices.
(#61612)
Fixed crash while creating new tensor if NumPy is not available (#66433)
Handle exceptions from THPModule_setQEngine (#60073)
Fixed torch.Tensor.cauchy_ on CUDA for inf values (#60186)

Autograd

torch.{signbit,``isin} no longer raise an error when passed a tensor that requires grad (#62529)
Fixed sub-gradient for torch.a{max,min} (#59669)
Fixed segfaults when a tensor hook removes itself (#61250)
Fixed double backward for binary_cross_entropy loss function when reduction=sum. (#59479)
Made sure that TLS (grad mode, inference mode, dispatcher state, etc) are properly set in hooks being called during the backward pass (#60067)

torch.nn

nn.AdaptiveAvgPool2d: Correctly dispatch to CUDA implementation (#61851)
nn.AdaptiveAvgPool3d: Fixed gradient computation (#60630)
nn.BatchNorm: Fixed mixed precision usage when affine=False (#61962)
nn.BatchNorm2d: Fixed issue when input is non-contiguous (#63392)
Fixed batch_norm() to preserve output memory layout based on input (#62773)
nn.MaxPool2d: Use channels_last memory format for output and indices when input is channels_last (#61245)
nn.Module: Fixed full backward hook when grad is disabled (#65335)
nn.Module: Fixed get_buffer() to check buffers by name instead of value (#61429)
nn.Module: Fixed pre-forward hooks for Lazy modules (#60517)
nn.Softmax: Improve numerical stability by subtracting max value in vectorized CPU implementation (#63132)
F.cosine_similarity: Fixed type promotion behavior and added input validation checks (#62054, #66191, #62912, #58559)
F.embedding: Added check to validate that weights are 2D (#59314)
F.interpolate: Fixed output for edge case of single pixel without align_corners (#61166)
F.nll_loss: Fixed regression for gradient computation (#64203)
F.pad: Fixed type of default pad value to be floating point (#62095)
Fixed issues with printing torch._ops.ops.{atan, quantized} modules (#62447)
```
torch.nn.utils.parametrizations.spectral_norm
```
Fixed, so that it can be used twice in the same forward pass (#62293)
Disabled cuDNN persistent RNN on A30 to avoid exceptions from hard-to-detect edge cases (#59830)

Dataloader

Fixed IterableFecher to stop fetching data after StopIterator (#59313)
Fixed ExceptionWrapper to re-raise Exception with multiple args (#58131)

AMD

Fix ROCm compilation by properly marking c++ functions as CPU only (#62628)
Fixed torch.{i1,i1e} ROCm failure: mark array as const so that it is available for host and device (#59187)

CUDA

Fixed to not use deprecated data accessor in IndexKernel.cu (#62268)
Fixed sign comparison (#62194, #62483)
Fixed torch.manual_seed{_all} memory leak (#62534)
Fixed CUDA_KERNEL_ASSERT ambiguous symbol in NDEBUG mode (#62527)
Changed to use long index type for torch.index_add deterministic implementation (#59254)
Fixed illegal memory access on NHWC BN kernel (#59981)
Fixed typo in Normalization.cu (#62515)
Added change to ignore and clear errors related to cuda not being ready yet (#61554)
Fixed segmentation fault due to access to destroyed global IPC variable(#56141)
Fixed reduction launch config (#64304)
Fixed typo embedding_renorm_ cuda implementation (#64542)
Added missing kernel checks (#60635)
CUDA graphs: made sure graph mempool malloc counter pairs with frees for all allocations (#61567)
Fix bug where some kernels would not properly call cuda lazy initialization (#61882)
Added check for contiguous to dispatch to NHWC CUDA template (#62839)
Moved grid_sampler to autocast promote list (#58618)
Added check for memory overlap in sort for large input sizes (#58327)

C++ API

Fixed map function for vec256 to accept const pointer to function (#59957)
Added supports_as_strided method to Device and fixed indices of to_sparse() contiguous on all devices (#59370)
Removed redundant bitwise-and op in MT19937RNGEngine (#63219)
Fixed subprocess encoding for cpp extension on Windows (#63756)
Define the SYCL device version __assert_fail when the NDEBUG defined. (#58906)

TorchScript

Fixed inconsistency between Python and JIT power operation (#62842)
Added change to convert __constants__ attribute in model to a set to be consistent (#60003)
Added change to Ignore unsupported attribute checker pass for torch.jit.trace (#60200)
Fixed missing element types and shapes when torch.``autograd.Function has multiple tensor outputs (#57966)
Fixed Tensor.``to schema to reflect that the output may alias input (#60001)
Added change to turn off layer norm in jit symbolic differentiation (#63816)
Fixed name conflict by using a more specific prefix for lowered module name. (#61007)
Added change to allow disabling cache in autocast (automatic mixed precision) (#63552)
Fixed concat optimization to handle cases when input list is mutated after cat using AliasDb (#60774)
Fixed symbolic derivative of hardswish (#59405)

torch.package

Fixed a bug when using importlib.resources.path for python <3.8.8 (#58718)
Fixed bugs when using os and os.path (#60276)
Fixed storage serialization collision when saving a ScriptModule and then saving a Tensor owned by it. (#61806)
Fixed use-after-free during autograd shutdown (#64620)
Fixed non-determinism in naming scheme of serialized storages in export code paths and ABA ABA storage identity problem during serialization for torch.package (#59735)
Fixed GIL issue when acquiring multiple sessions. (#58584)

Mobile

Fixed Nnapi backend dangling pointer bug (#63092)
Fixed missing constants archive in torchscript model after backport (#58892)
Fixed type hints in optimize_for_mobile to be consistent with the default(#59282)
Fixed xnnpack hardswish memory issue (#59577, #61622)
Fixed the issue that model_dump didn’t work with delegate models (#61043)
Fixed concat shaders didn’t work for certain iOS devices (#61074)
Fixed the Metal torch.clamp shader function for x86_64 (#63062)
Fixed callstack pointer serialization bug (#63576)
Fixed model loading error for Vulkan backend in Java API (#63402)
Fixed the issue that sub modules with same names are not serialized correctly in bytecode format (#61933)

Quantization

Fixed crash when model outputs dicts or lists (#58416)
QAT: Fixed the runtime run cannot resize variables that require grad (#57068)
Fixed support for custom module (#59041)
Fixed the "tensors to be on the same device" error in HistogramObserver (#59234)
Fixed dimension for output of batchnorm 1d (#59264)
Fixed quantized mean operator in QNNPACK backend (#59761)
Fixed a bug in .to for qtensors so scale/zp move too (#61576)
Fixed quantized Conv1d module parameters (#62356)
Fixed quantization for tuple arguments (#63376)
Fixed fuse qconfig comparison (#63384)
Fixed the conversion of the quantizable RNN (#63879)
Fixed quantization for sub_scalar (#64603)
Fixed a bug for sub (#65109)
Add change to ensure qconfig works for QAT with multiple modules (#63343)

Distributed

DistributedDataParallel

Fixed Pipe + DDP for unused parameters, static graph (#60118)
Fixed case where new tensors with no grad_fn are returned in DDP forward. (#60882)
Re-enabled the optimization of fusing copy and division when no comm hook is specified for both dense and sparse tensors (#61379, #61814)
Fixed fp16 C++ DDP gradient communication hook (#63375)
Added change to ensure buffers are broadcasted properly when they are reassigned in module (#64776)
Fixed GradBucket.is_last() logic (#63768)

torch.distributed.Store

torch.distributed and RPC cannot both be initialized with the same host:port pair (#58328, #58329, #58330, #58331)

torch.distributed.rpc

Added change to run dist_autograd backward RPCs on appropriate CUDA streams. (#60606)
Fixed race condition in TensorPipe agent (#58753)
Fixed issue when some gradients are None for distributed optimizers (#62249)

torch.distributed.elastic

Added change to ensure rendezvous timeout does not get overwritten (#61471)
Fixed the edge case when no node is alive (#59663)
Added change to cast timestamp type to int (#59712)
Added properly formatted traceback on error (#65041)

torch.distributed.autograd

Updated GraphTask::owner_ in a single thread for DistEngine. (#58625)
Introduced the deadlock fix (#61588, #61593)

torch.distributed

Fixed the slowdown of _object_to_tensor since 1.9 (#65721) (#65721)

torch.fx

Fixed retracing wrapped functions (#58061)
Added override for call_function so that wrapped functions stay wrapped (#60057)
Added fix to retain node.meta after normalizing args (#60449)
Added change to skip the output nodes but process possible nodes after it, when creating a single partition (#60370)
Fixed fx patch module name (#61062)
Fixed graph copy.deepcopy to propagate output type (#61747)
Added change to allow starter nodes to depend on get_attr node (#62234)
Added change to prevent implicit submodule inlining when submodule is a GraphModule (#62436)
Added change to persist tracer_cls on fx.Graph when deep copying (#63353)
Fixed GraphModule deepcopy to use deepcopied graph (#63090)
Fixed constant folding for attrs in submodule hierarchies (#64342)
Fixed some const fold cases with deep model hierarchy (#64945)
Fixed tracing of bitwise and/or (#65196)

ONNX

Added shape type inference fixes for control flow (#60248)
Fixed sum export with attribute keepdims (#60245)
Fixed shape inference for large model (#60244)
Fixed split export in op set 13 (#57605)
Fixed control-flow shape inference with contrib op (#62762)
Updated instance_norm2d export to handle track_running_stats=True (#58690)
Fixed the issue of converting empty list to sequence(#61558)
Fixed sum could not be exported for empty tensor (#59537)
Fixed an issue that optimizations might adjust graph inputs unexpectedly (#62763)

Vulkan

Fixed an issue where comparing equivalent descriptors would evaluate to false (#60199)
Fixed asserts in Vulkan JIT passes to actually throw an exception (#61495)

Performance_as_a_product

Added fix to ensure number of thread utilities are initialized before getting the number of threads (#60185)
Added fix to ensure thread id is valid in nested parallel regions (#60183)
Fixed parallel tbb build (#60532)
Added change to make flags in the pytorch managed thread pool atomic. (#58457)
Set mkl thread locally (#62891)

Composability

Added a fix to ensure that the C++ API’s that skip the dispatcher (such as at::cpu::{op} and at::cuda::{op} get external linkage, so they can be used outside of libtorch (#58569)
Fixed bug where shared memory tensor file names can collide (#60978)

Build_Frontend

Fixed binary building without python (#66031)
Fixed Windows ninja builds when MAX_JOBS is specified (#65444)
Skipped Bfloat16 support when building for VSX (#61630)
Made change to use python3 alias in Makefile (#58786)
Made change to use pybind11 from third_party folder by default (#58951)
Made change to ensure FindLAPACK finds the same BLAS library (#49647)
Improved Python package detection in torch.utils.collect_env (#63321)
Skipped SVE acceleration on M1 machine (#58785)
Made SciPy dependency optional in PyTorch unary operators tests (#59304)
Fixed error-handling when Python executable can not be found (#61230)
Fixed setup.py re-run incremental build logic on Windows (#59689)
Reduced binary size for CUDA-split build by establishing correct linking order (#58287)
Fixed torch.utils.cpp_extension behavior when older setuptools are used (#61484)

Infra (Releng)

Fixed windows ci squid env (#62353)
Introduced CI dependency pinning: (#64922, #65017)
Fixed breakpad build and add to more images (#59236)
Updated certificate trust chain CI to depend on the linked commits (#65934, #66004)

LinAlg_Frontend

Fixed an issue where the “info” tensor returned by torch.linalg.inv_ex could sometimes be on the wrong device (#59223)
Fixed an issue where torch.linalg.norm could return tensors with the wrong shape in some edge cases (#60273)
Fixed an issue where torch.linalg.svd could return tensors with the wrong shape in some edge cases (#62022)
Fixed an issue where torch.matmul would throw an error when attempting to multiply certain empty tensors (#63359)

Sparse_Frontend

Fixed dtype inference in sparse_csr_tensor_ctor (#58631)
Fixed addmm failure for CSR Tensors when MKL is not available (#58768)
Fixed overflow of numel for sparse COO tensors after calling coalesce (#57492)
Fixed multiplication of 0-dim Tensor and COO sparse Tensor and improved Error message for multiplication of dense and sparse COO tensor (#61723)
Fixed internal assert error for CSR tensors crow_/col_indices methods in Debug build (#63176)
Fixed support of torch.conj for zero-dimensional sparse COO Tensors (#59553)

Misc

Added change to increase warmup for better steady state measurements. (#58801)
Fixed bad use of channels last kernel in sync batch norm backward (#64100)

Performance

Python API

torch.special.{'i0', 'i0e', 'i1', 'i1e'}: converted floating-point constants to input type in Bessel functions (#59416)
Added change to speed up torch.unique_consecutive() (#64835)
Made sure all graphs tests call torch.cuda.empty_cache() before capture to fix flaky tests (#59233)
torch.flip : improved performance via TensorIterator (#59509)
Added change to parallelize torch.gelu via tensoriterator (#58950)
torch.sum: added change to accumulate 16-bit float sums in 32-bit accumulators for improved precision and performance (#60387)
Added fast path for conjugated tensors for torch.{dot, vdot, mm, addmm, bmm, baddbmm} (#62915, #59380)

Autograd

Faster torch.cum{sum,``prod} backward formulas (#60642)
Reduced overhead from reshape call if the tensor already has the right shape (#61466)
Added change to speed up saving variables for backward (#59837, #61927)
Reduced number of TLS access when deciding if an op needs to be tracked by autograd or not (#60740)
Improved code that detect when it is valid to re-use existing Tensors during the backward pass (#59817)

torch.nn

nn.utils.clip_grad_norm_: Removed device syncs (#61042)
nn.BatchNorm2d: Optimized performance for channels_last on CPU (#59286)
nn.Softmax: Vectorized softmax calculation for the non-last-dimension case (#59195, #60371)
nn.Transformer: Faster generate_square_subsequent_mask (#60631)

CUDA

Updated launch bounds for trilinear 3d (#59999)
Migrated Embedding thrust sort to cub sort (#62495)
Make unique call in embedding use cub instead of thrust (#63042)
Migrated masked_scatter to use cub instead of thrust (#56750)
Reverted D28547564: [pytorch][PR] masked_scatter thrust→cub (9e261de630)
Make sort in EmbeddingBag use cub instead of thrust (#64498)
Migrated Embedding thrust sort to cub sort (#63806)
Removed cat, equal, and stack from autocast promote list (#59497)
Add cublas and cusolver paths for LU solve (#59148)
Fixed launch bounds for gathertopk kernel (#60314)
Changed launch bounds, unrolled for loop for grid sampler 2d fwd and bwd (#60405)
Changed launch bound to fix col2im kernel (#60315)
Fixed launch bounds for grid sampler 3d (#60385)
CUDA graphs: added change to not sync between replays for CUDA driver version 11.4+ (#61063)
Changed launch bounds for upsample_linear1d fwd, bwd from 1024 to 512 (#61307)
Added change to reduce max_num_threads for complex double ops in reduce_kernel (#61438)
Added change to use fastAtomicAdd in EmbeddingBag (mode "max") backward (#63298)
Added change to use multi-dimensional cuFFT transforms to improve FFT performance (#61203)
F.avg_pool3d CUDA backward: use fast atomic adds (#63387)
Add cuSOLVER path for LU factorization in CUDA. (#56887)
Reverted launch bounds change in topK that induced a regression in perf (#63431)
Added change to bring back old algorithm for sorting on small number of segments (#64127)

Mobile

Added change to use channel-last to transform the weights for Metal (#59113)
Implemented RoIAlign in Metal shaders using Sampler (#56075)
Added cache operator lambda during model loading (#61996)
Added Operator Call De-dup at TorchScript Serialization Level (#64269)
Added change to speed up model loading by 1directly calling the C file API from FileAdapter (#61997)
Moved from input ivalues in ByteCodeDeserializer (#64029)
Fixed MobileDebugInfo vector copy (#64030)
Added change to gate tls_local_dispatch_key_set off on iOS too (#64753)
Added change to not store multiple kernels per key on mobile (#64447)
Added OpCode cache in ByteCodeDeserializer (#64110)
Reduced mobile model size by reusing constant and bump bytecode to v5 (#59722)

Distributed

torch.distributed: replaced all_gather with more efficient collective api _all_gather_base (#57769)

torch.distributed.optim.ZeroRedundancyOptimizer:

Sorted params by size (decreasing) (#59586)

Vulkan

Improved the performance of pointwise convolutions by having each shader invocation calculate a 4x4 output tile (#60760)
Implemented a simple scheme to set the local work group size adaptively (#61170)

Performance_as_a_product

TensorIterator: added change to reduce serial_for_each static overhead (#58909)
Added change to avoid using std::regex for device string parsing (#63204)

Composability

Introduced some perf improvements for reduction ops (#58655)
Added optimization to some internal representations of sizes (#59333)
Reduced the number of tensor refcount bumps in many existing kernels (#58303, #59827, #58273, #58272, #58276, #58277, #58279, #60546, #58280)
Added micro-optimizations to improve the time it takes to load pytorch (#64784, #64820, #64821, #64822, #64838, #64678, #64682, #64670)

Build_Frontend

Compiled BatchLinearAlgebra CUDA integration routines with host compiler (#64146)
Sped-up compilation by splitting autogenerated files into smaller ones (#62186)
Allowed ninja-build to dynamically pick best parallel build option (#64733, #65162)

Infra (Releng)

.github: upload /download large artifacts to s3 (#58506)
Made change to only run mem leak check on master (#60023)
Enabled parallel clang-tidy on ec2 runner (#60870)
Made change to skip magma library installation for Windows CPU builds (#59619)

Sparse_Frontend

Sped up conversion of COO to CSR Tensor to_sparse_csr by writing custom CPU/GPU kernels (#61340, #61838)
Slightly sped up calculation of number of dense entries for sparse softmax via c10::multiply_integers for COO Tensors (#60872)
Slightly sped up sparse softmax for COO Tensors by improve usage of std::vector (#60873)
Sped up index_select for sparse COO Tensor (#63008)

Misc

Greatly reduced the post-processing time of the profiler (#60432)
Saved some little memory in default_collate (#61424)
Added new ops to the operator microbenchmark: gelu, bmm, mm, einsum, log1p (#59334, #59595, #63654, #64647, #64032, #64205)
Added AVX512 support in ATen & remove AVX support (#61903)

You can also find the dev specific and documentation related changes in the forum post here.

Download Release

This release has 2 assets:

Source code (zip)
Source code (tar.gz)

Visit the release page to download them.

Have any questions?
Contact Exxact Today

Topics

Have any questions?

Deep Learning