
PyTorch is a widely used open source deep learning platform for writing neural network layers in Python, with a seamless workflow from research to production. Based on Torch, PyTorch has become a powerful machine learning framework favored by researchers around the world.
Here is the newest PyTorch release v1.5.0 featuring Stable C++ frontend, distributed RPC framework, new experimental higher-level autograd API, Channels Last memory format, and more.
Repository: pytorch/pytorch · Tag: v1.5.0 · Commit: 4ff3872 · Released by: zou3519
PyTorch 1.5.0 Release Notes
The PyTorch v1.5.0 release is now available.
- Highlights
- Known Issues
- Backwards Incompatible Changes
- Python
- C++ API
- JIT
- Quantization
- RPC
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
- Deprecations
- Python
- C++ API
- Miscellaneous
Highlights
This release includes several major new API additions and improvements. These include new APIs for autograd allowing for easy computation of hessians and jacobians, a significant update to the C++ frontend, ‘channels last’ memory format for more performant
computer vision models, a stable release of the distributed RPC framework used for model parallel training, and a new API that allows for the creation of Custom C++ Classes that was inspired by PyBind. Additionally torch_xla 1.5 is now
available and tested with the PyTorch 1.5 release providing a mature Cloud TPU experience.
C++ Frontend API [Now Stable]
The C++ frontend API is now at parity with Python, and the feature overall has been moved to ‘stable’ (previously tagged as experimental). Some of the major highlights include:
- C++ torch::nn module/functional are now at ~100% parity with Python API, with appropriate documentation. Now users can easily translate their model from Python API to C++ API, making the model authoring experience much smoother.
- C++ optimizers now behave identically to the Python API. In the past, optimizers in C++ had deviated from the Python equivalent: C++ optimizers couldn’t take parameter groups as input while the Python ones could. Also step function implementations were not exactly the same. With the 1.5 release, C++ optimizers will always behave the same as the Python equivalent.
- New C++ tensor multi-dim indexing API which looks and behaves similarly to the Python API. The previous workaround was to use a combination of narrow / select / index_select / masked_select, which is clunky and error-prone compared to the Python API’s elegant tensor[:, 0, ..., mask] syntax. With the 1.5 release, users can use tensor.index({Slice(), 0, "...", mask}) to achieve the same result.
Channels last memory format for Computer Vision models [Experimental]
Channels Last memory format is an alternative way of ordering NCHW tensors in memory while preserving the NCHW semantic dimensions ordering. Channels Last tensors are ordered in memory in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).
Channels Last memory format unlocks the ability to use performance efficient convolution algorithms and hardware (NVidia’s Tensor Cores, FBGEMM, QNNPACK). Additionally it was designed to automatically propagate through the operators, which allows easy switching between memory layouts.
Learn more here on how to write memory format aware operators.
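As a rough illustration of the layout described above, the sketch below converts an NCHW tensor to Channels Last and compares the strides. The tensor shape is an arbitrary assumption chosen for the example; the logical shape and the values stay the same, only the memory order changes.

```python
import torch

# Arbitrary example shape: N=2, C=3, H=4, W=5.
x = torch.randn(2, 3, 4, 5)

# Same logical NCHW shape, but channels become the densest dimension in memory.
y = x.contiguous(memory_format=torch.channels_last)

print(x.stride())  # (60, 20, 5, 1)
print(y.stride())  # (60, 1, 15, 3)

# The semantic dimension order and the values are unchanged.
same_shape = y.shape == x.shape
same_values = torch.equal(x, y)
print(same_shape, same_values)
```

A stride of 1 on the channel dimension is what "channels become the densest dimension" means in practice.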
Custom C++ Classes [Experimental]
This release adds a new API for binding custom C++ classes into TorchScript and Python simultaneously. This API is almost identical in syntax to pybind11. It allows users to expose their C++ class and its methods to the TorchScript type system and runtime system such that they can instantiate and manipulate arbitrary C++ objects from TorchScript and Python. An example C++ binding:
template <class T>
struct MyStackClass : torch::CustomClassHolder {
  std::vector<T> stack_;
  MyStackClass(std::vector<T> init) : stack_(std::move(init)) {}

  void push(T x) {
    stack_.push_back(x);
  }
  T pop() {
    auto val = stack_.back();
    stack_.pop_back();
    return val;
  }
};

static auto testStack =
  torch::class_<MyStackClass<std::string>>("myclasses", "MyStackClass")
      .def(torch::init<std::vector<std::string>>())
      .def("push", &MyStackClass<std::string>::push)
      .def("pop", &MyStackClass<std::string>::pop)
      // The lambda must name the concrete instantiation of the class template.
      .def("size", [](const c10::intrusive_ptr<MyStackClass<std::string>>& self) {
        return self->stack_.size();
      });
Which exposes a class you can use in Python and TorchScript like so:
@torch.jit.script
def do_stacks(s : torch.classes.myclasses.MyStackClass):
    s2 = torch.classes.myclasses.MyStackClass(["hi", "mom"])
    print(s2.pop()) # "mom"
    s2.push("foobar")
    return s2 # ["hi", "foobar"]
You can try it out in the tutorial here.
Distributed RPC framework APIs [Now Stable]
The torch.distributed.rpc package aims at supporting a wide range of distributed training paradigms that do not fit into DistributedDataParallel. Examples include parameter server training, distributed model parallelism, and distributed
pipeline parallelism. Features in the torch.distributed.rpc package can be categorized into four main sets of APIs.
- The RPC API allows running a function on a specified destination worker with given arguments and fetches the return value or creates a distributed reference to the return value.
- The RRef (Remote REFerence) serves as a reference to an object on another worker. A worker holding an RRef can explicitly request copies of the object, and it can also share the light-weight RRef with other workers without worrying about reference counting. This is especially useful when multiple workers need to repeatedly access different versions of the same remote object.
- With Distributed Autograd, applications can automatically compute gradients even if a model is split on multiple workers using RPC. This is achieved by stitching together local autograd graphs at RPC boundaries in the forward pass and reaching out to participants to transparently launch local autograd in the backward pass.
- The Distributed Optimizer uses gradients computed by Distributed Autograd to update model parameters. Its constructor takes a local optimizer (e.g., SGD, Adagrad, etc.) and a list of parameter RRefs, and its step() function automatically uses the local optimizer to update parameters on all distinct RRef owner workers.
Learn more here.
torch_xla 1.5 now available
torch_xla is a Python package that uses the XLA linear algebra compiler to accelerate the PyTorch deep learning framework on Cloud TPUs and Cloud TPU Pods. torch_xla aims to give PyTorch users the ability to do everything they can do on GPUs on Cloud TPUs as well, while minimizing changes to the user experience. This release of torch_xla is aligned and tested with PyTorch 1.5 to reduce friction for developers and to provide a stable and mature PyTorch/XLA stack for training models using Cloud TPU hardware. You can try it for free in your browser on an 8-core Cloud TPU device with Google Colab, and you can use it at a much larger scale on Google Cloud.
See the full torch_xla release notes here and the full docs here.
New High level autograd API [Experimental]
PyTorch 1.5 brings new functions including jacobian, hessian, jvp, vjp, hvp and vhp to the torch.autograd.functional.* submodule. This feature builds on the current autograd API and allows the user to compute these quantities easily.
See the full docs here.
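A minimal sketch of two of these functions follows. The function f and the input values are hypothetical, chosen so the results can be checked by hand: for f(x) = sum(x^3) the gradient is 3x^2 and the Hessian is diag(6x).

```python
import torch
from torch.autograd.functional import hessian, jacobian

def f(x):
    # Hypothetical scalar-valued function used only for illustration.
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0])

J = jacobian(f, x)  # shape (2,): equals 3 * x**2
H = hessian(f, x)   # shape (2, 2): equals diag(6 * x)

print(J)  # tensor([ 3., 12.])
print(H)  # tensor([[ 6.,  0.], [ 0., 12.]])
```

For a scalar output, jacobian returns a tensor with the same shape as the input, so it doubles as a gradient helper here.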
Python 2 no longer supported
For PyTorch 1.5.0 we will no longer support Python 2, specifically version 2.7. Going forward support for Python will be limited to Python 3, specifically Python 3.5, 3.6, 3.7 and 3.8 (first enabled in PyTorch 1.4.0).
Known Issues
torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode.
DistributedDataParallel (DDP) used to support two modes:
- Single-Process Multi-GPU (SPMG): In this mode, each DDP process replicates the input module to all specified devices and trains on all module replicas. This mode is enabled when the application passes in a device_ids argument that contains multiple devices. If device_ids is not present, DDP will try to use all available devices.
- Multi-Process Single-GPU (MPSG): This is the recommended mode, as it is faster than SPMG. In this mode, each DDP process directly works on the provided module without creating additional replicas. This mode is enabled when device_ids only contains a single device or if there is only one visible device (e.g., by setting CUDA_VISIBLE_DEVICES).
A recent change (#33907) in torch.nn.parallel.replicate breaks DDP’s assumption on replicated modules and leads to failures in the SPMG mode. However, since SPMG is known to be slower due to
GIL contention and additional overhead caused by scattering input and gathering output, we are planning to retire this mode in future releases and make MPSG the only supported mode in DDP. The code below shows an example of the recommended way to
construct DDP.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# use "cuda:1" as the target device
target_device = 1
local_model = torch.nn.Linear(2, 2).to(target_device)
ddp_model = DDP(local_model, device_ids=[target_device])
See #36268 for more discussion.
Tensor.exponential_(0) used to return Inf, now it incorrectly returns 0
Previously, in 1.4, x.exponential_(0) gave a tensor full of inf. In 1.5.0, it wrongly gives a tensor full of zeros.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.randn(3).exponential_(0)`<br>`tensor([inf, inf, inf])` | `>>> torch.randn(3).exponential_(0)  # This is wrong!`<br>`tensor([0., 0., 0.])` |
See #36798 for more details
Backwards Incompatible Changes
Python
Tensor.clone, Tensor.to, Tensor.empty_like, and similar functions preserve stride information instead of returning contiguous tensors
clone, to, type, cuda, cpu, byte, char, double, bool, half, int, long, short, float, bfloat16,
empty_like, full_like, ones_like, zeros_like, rand_like, randn_like, randint_like operators now propagate memory format (roughly, the strides) of the input tensor to the
output tensor.
Since PyTorch operators generally support non-contiguous tensors, this should have no functional effect on most PyTorch programs.
The most common incompatibility with Python programs is with the view operator, which has specific stride requirements. If these requirements are no longer met as a result of this change, you will get an error message indicating that you should
use reshape instead, i.e. “RuntimeError: view size is not compatible with input tensor’s size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(…) instead.”
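The view incompatibility described above can be sketched as follows. The tensor is a hypothetical example: a transposed matrix whose clone keeps the transposed strides from 1.5.0 on, so a flat view is no longer possible and reshape is used as the fallback.

```python
import torch

# A transposed tensor is non-contiguous: shape (3, 2), strides (1, 3).
x = torch.arange(6.0).reshape(2, 3).t()

# From 1.5.0 on, clone() keeps the input strides instead of returning a contiguous copy.
y = x.clone()

try:
    flat = y.view(6)      # raises when the strides are not compatible with the view
except RuntimeError:
    flat = y.reshape(6)   # reshape falls back to a copy when a view is impossible

print(y.stride())
print(flat)
```

Calling y.contiguous().view(6) is an equivalent fix when a contiguous copy is wanted anyway.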
Another possible incompatibility arises if you have an operator implementation (usually in C++) that works directly on memory (i.e. calls data_ptr and relies on the strides being contiguous).
In the following example, we go through the implementation of a simple clone operation and see how it needs to change between versions.
// Version 1.4.0
Tensor simple_clone(const Tensor& input) {
  TORCH_CHECK(input.dim() == 1);
  auto output = at::empty_like(input);
  auto input_stride = input.strides()[0];
  auto* output_ptr = output.data_ptr<float>();
  auto* input_ptr = input.data_ptr<float>();
  // Before 1.5.0, the result of `empty_like` is always contiguous.
  for (int64_t idx = 0; idx < input.size(0); idx++) {
    output_ptr[idx] = input_ptr[idx * input_stride];
  }
  return output;
}
// Version 1.5.0
Tensor simple_clone(const Tensor& input) {
  TORCH_CHECK(input.dim() == 1);
  // From 1.5.0 on, the result of `empty_like` may not be contiguous.
  auto output = at::empty_like(input);
  // As a result, we need to keep track of the output stride.
  auto input_stride = input.strides()[0];
  auto output_stride = output.strides()[0];
  auto* output_ptr = output.data_ptr<float>();
  auto* input_ptr = input.data_ptr<float>();
  for (int64_t idx = 0; idx < input.size(0); idx++) {
    output_ptr[idx * output_stride] = input_ptr[idx * input_stride];
  }
  return output;
}
The inferred dtype of np.float_, np.float64 scalars in tensor constructors (e.g. torch.tensor(…), torch.as_tensor(…)) is now torch.float64 instead of the default dtype (usually torch.float32). (#30486)
Please explicitly pass in the desired dtype when constructing tensors with NumPy float64 scalars to get the old behavior.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `# Old behavior: return torch.float32 tensor (by default)`<br>`>>> torch.tensor(np.float64(0))`<br>`tensor(0.)` | `# To keep the old behavior, please explicitly pass the dtype`<br>`>>> torch.tensor(np.float64(0), dtype=torch.get_default_dtype())`<br>`tensor(0.)` |
This change can cause your program to execute in torch.float64, which may slow it down, or lead to errors for operators that don’t support torch.float64 or mixed dtypes.
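A short sketch of the dtype inference described in this section, assuming NumPy is installed; the variable names are illustrative only.

```python
import numpy as np
import torch

scalar = np.float64(0)

# From 1.5.0 on, the dtype is inferred from the NumPy scalar.
inferred = torch.tensor(scalar)

# Passing the dtype explicitly restores the pre-1.5.0 result.
explicit = torch.tensor(scalar, dtype=torch.get_default_dtype())

print(inferred.dtype)  # torch.float64
print(explicit.dtype)  # the default dtype, usually torch.float32
```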
numpy integer scalars are now treated as integers for the purposes of type promotion (#30486)
Previously, in 1.4.0, they were mistakenly treated as floats (so, for example, torch.ones(3) * np.int64(3) would return a float32 tensor). In 1.5.0, we’ve fixed that behavior; torch.ones(3) * np.int64(3) returns an int32 tensor.
This can cause your code to fail if you performed operations between PyTorch tensors and numpy scalars and then passed the result into an operation that does not support integral types or mixed types. To fix your code, please cast the resulting tensor to the desired dtype.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.ones(3) * np.int64(3)`<br>`tensor([3., 3., 3.])` | `>>> (torch.ones(3) * np.int64(3)).float()`<br>`tensor([3., 3., 3.])` |
torch.autograd.Function: dropped support for old-style Functions (#33956).
In previous versions of PyTorch, there were two ways to write autograd Functions. We deprecated one of them in 1.3.0 and dropped support for it entirely in 1.5.0. Old-style autograd Functions will no longer work in user code.
These Functions can be identified by not having staticmethod forward and backward functions (see the example below). Please see the current documentation for how to write new-style Functions.
# Version 1.4.0
class Exp(torch.autograd.Function):
    def forward(self, i):
        result = i.exp()
        self.save_for_backward(result)
        return result

    def backward(self, grad_output):
        result, = self.saved_tensors
        return grad_output * result

Exp()(torch.tensor(1.))

# Version 1.5.0
class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

Exp.apply(torch.tensor(1.))
torch.optim optimizers changed to fix in-place checks for the changes made by the optimizer (#33640, #34211)
If this causes your code to fail, there are two possible reasons:
Reason 1: The value of that parameter was actually saved and used, and we were computing incorrect gradients in previous versions of PyTorch. This would result in an error message mentioning incorrect version numbers. You should replace code that uses self.my_param with self.my_param.clone() to make sure the saved version is different from the one that is modified by the optimizer. For example:
Before 1.5.0, the following may have worked.
import torch
from torch import optim

def model(input, target, param):
    return (input * param ** 2 - target).norm()

param = torch.randn(2, requires_grad=True)
input = torch.randn(2)
target = torch.randn(2)
sgd = optim.SGD([param], lr=0.001)
loss = model(input, target, param)
loss.backward(retain_graph=True)
sgd.step()
loss.backward()
param.grad
If after upgrading to 1.5.0, the above fails due to a version counter error, then that means the gradient computed was incorrect. To remedy this, clone param before using it in the model:
import torch
from torch import optim

def model(input, target, param):
    return (input * param ** 2 - target).norm()

param = torch.randn(2, requires_grad=True)
input = torch.randn(2)
target = torch.randn(2)
sgd = optim.SGD([param], lr=0.001)
loss = model(input, target, param.clone())
loss.backward(retain_graph=True)
sgd.step()
loss.backward()
param.grad
Reason 2: You know what you’re doing and change the values back to the right thing before the next backward. However, you’re running into an error because the version counter cannot be decremented. Open an issue with your particular use case and we will help you to work around the version counter issue.
utils.cpp_extensions now use ninja as the default compilation backend (#32495)
ninja enables parallel compilation of your C++ extension, greatly speeding up compilation. This change will not break most user code; if you do not have ninja installed, we fall back to the old distutils backend.
However, if you do have ninja installed, it is possible that this change will cause your C++ extension build to fail by oversubscribing your system with too many worker processes. There are two potential workarounds to this.
Method 1: If a previously succeeding python setup.py install now fails, try setting the MAX_JOBS environment variable.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `python setup.py install` | `MAX_JOBS=2 python setup.py install` |
Method 2: Switch back to the old distutils backend inside your setup.py
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `cmdclass={'clean': clean,`<br>`'build_ext': BuildExtension},` | `cmdclass={'clean': clean,`<br>`'build_ext': BuildExtension.with_options(use_ninja=False)},` |
torch.optim.Adam, torch.optim.SGD changed to not modify gradients in-place (#30257)
In previous versions of PyTorch, the Adam and SGD optimizers modified gradients (e.g. param.grad) in-place via in-place addition of param.grad += weight_decay * param. To make this consistent with the behavior of other optimizers and to prevent surprises about the behavior, we’ve changed them to stop modifying gradients in-place.
This should not have an effect on most PyTorch programs unless they relied on this behavior. The easiest way to replicate the old behavior is to create a custom optimizer that implements it.
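The sketch below checks the new behavior with hand-picked, hypothetical numbers: after an SGD step with weight decay, the parameter is updated using grad + weight_decay * param, while param.grad itself is left as it was.

```python
import torch

weight_decay = 0.1
param = torch.ones(2, requires_grad=True)
param.grad = torch.full((2,), 0.5)  # pretend backward() produced this gradient

opt = torch.optim.SGD([param], lr=0.1, weight_decay=weight_decay)
grad_before = param.grad.clone()
opt.step()

# Update used 0.5 + 0.1 * 1.0 = 0.6, so param = 1.0 - 0.1 * 0.6 = 0.94.
grad_untouched = torch.equal(param.grad, grad_before)
print(param.detach(), grad_untouched)
```

A program that relied on the old side effect could apply param.grad.add_(param.detach(), alpha=weight_decay) itself before reading the gradient; that line is a suggestion, not something taken from these notes.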
torch.masked_select now always returns a 1D tensor (#29923)
The behavior of torch.masked_select when both “self” and “mask” are 0-dimensional was changed. In previous versions of PyTorch, this would return a 0-dimensional tensor. Now, we return a 1-dimensional tensor to be consistent with other input sizes and our documentation.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.masked_select(torch.tensor(0), torch.tensor(True))`<br>`tensor(0)` | `>>> torch.masked_select(torch.tensor(0), torch.tensor(True))`<br>`tensor([0])` |
torch.index_select on a 0-d tensor now returns a 0-d tensor. (#30790)
In previous versions of PyTorch, the output of torch.index_select on a 0D input tensor produced a 1D tensor. This was inconsistent with our documentation on it, which stated “The returned tensor has the same number of dimensions as the original tensor (input).” Now, we return a 0D tensor.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.index_select(torch.tensor(5), 0, torch.tensor([0]))`<br>`tensor([5])` | `>>> torch.index_select(torch.tensor(5), 0, torch.tensor([0]))`<br>`tensor(5)` |
nn.MultiLabelMarginLoss: ‘none’ reduction on 1D tensor now returns a 0D tensor (#30768)
In previous versions of PyTorch, the output of nn.MultiLabelMarginLoss on 1D and 0D tensors incorrectly produced 1D tensors. Now, those cases return a 0D tensor to be consistent with the 2D tensor case.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> nn.MultiLabelMarginLoss(reduction='none')(torch.randn(3), torch.zeros(3, dtype=torch.long))`<br>`tensor([0.2959])` | `>>> nn.MultiLabelMarginLoss(reduction='none')(torch.randn(3), torch.zeros(3, dtype=torch.long))`<br>`tensor(0.2959)` |
nn.MultiMarginLoss: ‘none’ reduction on 1D target now returns a 1D tensor (#30826)
In previous versions of PyTorch, the output of nn.MultiMarginLoss on a 1D target tensor produced a 0D output. We changed this to return a 1D tensor to make it consistent with other input sizes, which return an output that matches the target shape.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> nn.MultiMarginLoss(reduction='none')(torch.tensor([1.]), torch.tensor([0]))`<br>`tensor(0.)` | `>>> nn.MultiMarginLoss(reduction='none')(torch.tensor([1.]), torch.tensor([0]))`<br>`tensor([0.])` |
Tensor.exponential_(lambda) no longer supports lambda < 0 (#32501)
lambda, the rate parameter of the exponential distribution, mathematically should be greater than 0. We’ve disabled support for lambda < 0 to be mathematically correct; most users will not have used a lambda less than zero.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `tensor = torch.empty(3).exponential_(-1.5)` | `# Negative lambda not supported!` |
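
A small sketch of what this looks like in practice. The sample size and rate are arbitrary assumptions; the exact error message is not reproduced here because it may differ between versions.

```python
import torch

# A positive rate works as before; with rate 2.0 the samples average about 0.5.
samples = torch.empty(10000).exponential_(2.0)

# A negative rate is rejected.
try:
    torch.empty(3).exponential_(-1.5)
    rejected = False
except RuntimeError as err:
    rejected = True
    print("rejected:", err)

print(samples.mean())
```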
nn.BCELoss, nn.functional.binary_cross_entropy no longer accept inputs with the same number of elements that are not broadcastable (#31365)
Previously, we supported accepting inputs with the same number of elements. However, this behavior was deprecated and we removed it in 1.5.0. In order to replicate the old behavior, please explicitly reshape your input and target tensors to have the same shape.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> input = torch.rand(3, 3)`<br>`>>> target = torch.randn(9)`<br>`>>> torch.nn.functional.binary_cross_entropy(input, target)` | `>>> input = torch.rand(3, 3)`<br>`>>> target = torch.randn(9)`<br>`>>> torch.nn.functional.binary_cross_entropy(input, target.reshape_as(input))` |
torch.normal out argument is now required to have the same size as the computed output (#32031)
Previously, on CPU devices, torch.normal(mean, std, out=out) would resize out to the correct size. To be consistent with the CUDA implementation, we’ve changed it so that out must either already have the correct size or be an empty tensor with size [0]. To work around this, please ensure that your out tensor has the correct size.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.normal(torch.zeros(3), torch.ones(3), out=torch.randn(2))`<br>`tensor([ 0.0300, 0.7830, -1.3579])` | `>>> torch.normal(torch.zeros(3), torch.ones(3), out=torch.randn(2))`<br>`RuntimeError: inconsistent tensor, output size ([2]) is not the same as broadcasted mean and std size (3)` |
Tensor.geometric_ no longer supports integral Tensors (#31878)
Previously, on CPU devices, Tensor.geometric_ supported Tensors with integral dtype. Now, it only supports floating point. We removed support for this because it doesn’t make sense for geometric_ to operate on integral dtypes.
Changed torch.floor_divide input positional argument name to self (#34552)
Before PyTorch 1.5, torch.floor_divide took two positional arguments: torch.floor_divide(input, other). We’ve changed the name of the input argument to self; this will break code that called torch.floor_divide via
keyword argument. For example:
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `torch.floor_divide(input=x, other=y)` | `# Either of the following works.`<br>`torch.floor_divide(self=x, other=y)`<br>`torch.floor_divide(x, y)` |
C++ API
RNN / GRU / LSTM layers (#34322)
- Instead of returning RNNOutput, RNN / GRU forward method now returns std::tuple<Tensor, Tensor>, and LSTM forward method now returns std::tuple<Tensor, std::tuple<Tensor, Tensor>>, matching Python API.
- LSTM forward method’s hidden state parameter now has type torch::optional<std::tuple<Tensor, Tensor>>, matching Python API.
- RNN / LSTM / GRU layers now have forward_with_packed_input method which accepts PackedSequence as input and optionally hidden state, matching the forward(PackedSequence, ...) variant in Python API.
- RNN / LSTM / GRU layers no longer have these fields: w_ih / w_hh / b_ih / b_hh. Instead, to access the weights and biases of the gates, users should do e.g. rnn->named_parameters()["weight_ih_l0"], which mirrors the Python API rnn.weight_ih_l0.
- In RNNOptions:
  - tanh() / relu() / activation are removed. Instead, nonlinearity is added which takes either torch::kTanh or torch::kReLU
  - layers is renamed to num_layers
  - with_bias is renamed to bias
- In LSTMOptions:
  - layers is renamed to num_layers
  - with_bias is renamed to bias
- In GRUOptions:
  - layers is renamed to num_layers
  - with_bias is renamed to bias
Upsample layer / F::interpolate function (#35025)
- There are changes to UpsampleOptions and InterpolateFuncOptions:
  - size is changed from std::vector<int64_t> to c10::optional<std::vector<int64_t>>. If you want to pass a list of int64_t to this argument, you must pass it as std::vector<int64_t>.
  - scale_factor is changed from std::vector<double> to c10::optional<std::vector<double>>. If you want to pass a list of double to this argument, you must pass it as std::vector<double>.
- F::multilabel_margin_loss / F::multilabel_soft_margin_loss functions (#35163)
  - MultiLabelMarginLossFuncOptions is renamed to MultilabelMarginLossFuncOptions
  - MultiLabelSoftMarginLossFuncOptions is renamed to MultilabelSoftMarginLossFuncOptions
- The deprecated torch::nn::BatchNorm is removed in favor of torch::nn::BatchNorm{1,2,3}d
- The deprecated torch::nn::FeatureDropout is removed in favor of torch::nn::Dropout{2,3}d
- The deprecated torch::nn::modules_ordered_dict is removed. User should do Sequential sequential({{"m1", MyModule(1)}, {"m2", MyModule(2)}}) instead.
- The deprecated torch::nn::init::Nonlinearity is removed, in favor of these enums: torch::kLinear / torch::kConv1D / torch::kConv2D / torch::kConv3D / torch::kConvTranspose1D / torch::kConvTranspose2D / torch::kConvTranspose3D / torch::kSigmoid / torch::kTanh / torch::kReLU / torch::kLeakyReLU
- The deprecated torch::nn::init::FanMode is removed, in favor of these enums: torch::kFanIn / torch::kFanOut
Optimizers
Optimizer::step now accepts closure function as optional input and returns a tensor, and LossClosureOptimizer is removed (#34790) (#34957). If you had a custom optimizer class defined as:
struct MyOptimizer : Optimizer {
  using Optimizer::Optimizer;
  void step() override {...}
};
you would need to update your optimizer class definition as follows:
struct MyOptimizer : Optimizer {
  using Optimizer::Optimizer;
  torch::Tensor step(LossClosure closure = nullptr) override {
    ...
    // return `torch::Tensor()` if `closure` is nullptr
    // (i.e. we are not computing the loss)
    return torch::Tensor();
  }
};
- Adagrad (#29335)
  - In AdagradOptions, learning_rate is renamed to lr.
  - In Adagrad, sum_buffers and step_buffers are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

auto& param_state = static_cast<AdagradParamState&>(
    *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.sum()
// param_state.step()
- SGD (#32592)
  - In SGDOptions, learning_rate is renamed to lr.
  - In SGD, momentum_buffers is now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

auto& param_state = static_cast<SGDParamState&>(
    *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.momentum_buffer()
- Adam (#33730)
  - In AdamOptions:
    - learning_rate is renamed to lr
    - beta1 and beta2 are replaced by a tuple betas
  - In Adam, step_buffers, exp_average_buffers, exp_average_sq_buffers and max_exp_average_sq_buffers are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

auto& param_state = static_cast<AdamParamState&>(
    *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.step()
// param_state.exp_avg()
// param_state.exp_avg_sq()
// param_state.max_exp_avg_sq()
- RMSprop (#33450)
  - In RMSpropOptions:
    - learning_rate is renamed to lr
  - In RMSprop, square_average_buffers, momentum_buffers and grad_average_buffers are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

auto& param_state = static_cast<RMSpropParamState&>(
    *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.square_avg()
// param_state.momentum_buffer()
// param_state.grad_avg()
- LBFGS (#34564) (#34957)
  - In LBFGSOptions:
    - learning_rate is renamed to lr
    - max_eval’s type is changed from int64_t to c10::optional<int64_t>
    - tolerance_grad’s type is changed from float to double
    - tolerance_change’s type is changed from float to double
    - history_size’s type is changed from size_t to int64_t
  - In LBFGS, d, H_diag, prev_flat_grad, t, prev_loss, ro, al, old_dirs, old_stps, func_evals and state_n_iter are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

auto& param_state = static_cast<LBFGSParamState&>(
    *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.d()
// param_state.H_diag()
// param_state.prev_flat_grad()
// param_state.t()
// param_state.prev_loss()
// param_state.ro()
// param_state.al()
// param_state.old_dirs()
// param_state.old_stps()
// param_state.func_evals()
// param_state.n_iter()
Removed AutoGIL/AutoNoGIL in favor of pybind11::gil_scoped_* functions (#34301)
If your code released or acquired the GIL via AutoNoGIL or AutoGIL, please change the invocations to pybind11::gil_scoped_release or pybind11::gil_scoped_acquire, respectively.
Others
- torch::tensor(floating-point values) will always produce tensor of default dtype, and torch::tensor(integer values) will always produce tensor of torch::kLong dtype, matching Python API behavior (#32367).
- torch::Tensor::base() is renamed to torch::Tensor::_base(), matching Python API. (#33316)
- Renamed TensorTypeId to DispatchKey (#32154)
- Throw an error if nbytes is called on a sparse tensor. (#33897)
JIT
Simple Executor Is Now On By Default
The simple executor skips a number of fusion-related passes and analyses that are very time-consuming. Disabling these optimizations fixes pathologically long compilation times. Users who rely on GPU fusion for their desired performance profile should turn on the profiling executor. We provide C++ and Python APIs to enable the profiling executor:
- In Python, call torch._C._jit_set_profiling_mode(True) before you call your model for the first time.
- In C++, include #include <torch/csrc/jit/runtime/graph_executor.h> and set getProfilingMode() = true before you invoke your model for the first time.
Quantization
Remove qconfig_dict in top level eager mode quantization API (#31972).
In eager mode quantization, one needs to manually insert quant and dequant stubs in a model to specify where activations are quantized. Having a qconfig_dict that specifies the quantization configuration for each module is not useful as one needs to manually modify the model with quant/dequant stubs. The new API makes it explicit that the model needs to be manually modified for quantization.
# previously qconfig_dict was an optional argument to prepare
def prepare(model, qconfig_dict=None, inplace=False):

# now replaced with
def prepare(model, inplace=False):
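The workflow described above can be sketched roughly as follows. SmallNet and its layer sizes are hypothetical, and default_qconfig is just one possible configuration; the point is that the stubs are placed by hand and the configuration is attached to the model rather than passed as qconfig_dict.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        # Stubs mark where activations enter and leave the quantized region.
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(4, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = SmallNet().eval()
model.qconfig = torch.quantization.default_qconfig  # set on the model itself

# prepare() inserts observers; with inplace=False the original model is untouched.
prepared = torch.quantization.prepare(model, inplace=False)

# Running data through the prepared model lets the observers collect statistics.
out = prepared(torch.randn(8, 4))
print(out.shape)
```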
RPC
Functional API for Distributed Autograd and Distributed Optimizer
Callers must now pass context_id explicitly to torch.distributed.autograd.backward() and to the step() method of torch.distributed.optim.DistributedOptimizer. (#33711)
# Before
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()
    # Backward pass.
    dist_autograd.backward([loss.sum()])
    # Optimizer.
    dist_optim = DistributedOptimizer(
        optim.SGD,
        [rref1, rref2],
        lr=0.05,
    )
# After
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer
with dist_autograd.context() as context_id:
# Forward pass.
rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
loss = rref1.to_here() + rref2.to_here()
# Backward pass.
dist_autograd.backward(context_id, [loss.sum()])
# Optimizer.
dist_optim = DistributedOptimizer(
optim.SGD,
[rref1, rref2],
lr=0.05,
)
dist_optim.step(context_id)Disallow sending CUDA tensors over RPC
The motivation is to prevent potential invalid device errors when the number of devices on the sender and the receiver does not match. Applications can, however, always move CUDA tensors to CPU before sending (#33604).
# Version 1.4.0
import torch
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
x = torch.zeros(2, device=0)
ret = rpc.rpc_sync("worker1", torch.add, args=(x, 3))
rpc.shutdown()

# Version 1.5.0
import torch
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
x = torch.zeros(2, device=0)
# Move the CUDA tensor to CPU before sending it over RPC.
ret = rpc.rpc_sync("worker1", torch.add, args=(x.cpu(), 3))
rpc.shutdown()
New Features
Python
Added new functional autograd API (#34066)
- See Highlights for more details
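As a small illustration of the jacobian and hessian helpers, the sketch below uses `torch.autograd.functional.jacobian` and `torch.autograd.functional.hessian` on two toy functions. The function names `elementwise_square` and `sum_of_squares` are made up for this example.

```python
import torch
from torch.autograd.functional import hessian, jacobian


def elementwise_square(x):
    # Vector-valued function: its Jacobian is diag(2 * x).
    return x ** 2


def sum_of_squares(x):
    # Scalar-valued function: its Hessian is 2 * I.
    return (x ** 2).sum()


x = torch.tensor([1.0, 2.0, 3.0])
jac = jacobian(elementwise_square, x)
hess = hessian(sum_of_squares, x)
```

Both results are ordinary tensors of shape (3, 3) here, so they can be inspected or compared like any other tensor.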
New __torch_function__ API Override Mechanism (#30730, #32194, #32799, #34240, #34303).
We introduced __torch_function__, an API override mechanism for subclassing torch.Tensor in Python. This is useful for creating custom objects that implement the torch.* APIs. It currently supports overriding most torch.* and torch.nn.functional APIs; future support for subclassing torch.Tensor is also planned (see tracking issue #22402).
New Operators
- torch.logical_and and torch.logical_or operations added (#30521).
- torch.square added (#30719).
- torch.bitwise_and added (#31104).
- torch.cummax, torch.cummin added (#32169, #32238, #32537, #33492).
- torch.floor_divide, Tensor.floor_divide added (#30493, #34552).
- torch.true_divide, Tensor.true_divide added, analogous to Python's and NumPy's (true) division (#34236, #34794).
- nn.functional.hardsigmoid added (#34545).
- Added PCA and SVD for low-rank matrices (torch.pca_lowrank, torch.svd_lowrank), and torch.lobpcg for positive-definite generalized eigenvalue problems (#34721).
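A few of the operators listed above can be tried on small tensors, as in this minimal sketch. The input values are arbitrary and chosen to be non-negative so that floor division is unambiguous.

```python
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
both = torch.logical_and(a, b)
either = torch.logical_or(a, b)

ints = torch.tensor([7, 8, 9])
squared = torch.square(ints)
masked = torch.bitwise_and(ints, torch.tensor([3, 3, 3]))

# cummax returns the running maximum and the index where it was found.
values, indices = torch.cummax(torch.tensor([1, 3, 2, 5, 4]), dim=0)

# floor_divide keeps the integer dtype; true_divide always returns floats.
floored = torch.floor_divide(ints, 2)
true_quotient = torch.true_divide(ints, 2)
```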
Distributions
- distributions.von_mises added (#33418).
- distributions.mixture_same_family: Added support for mixture distributions (#22742, #33408).
- distributions.transforms.TanhTransform added (#19785).
- distributions.continuous_bernoulli added (#34619).
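For illustration, here is a sketch of a two-component Gaussian mixture built with `MixtureSameFamily`, plus a `VonMises` distribution. The parameter values are arbitrary, and the sketch assumes both classes are importable from `torch.distributions`.

```python
import torch
from torch import distributions as D

# Equal-weight mixture of two 1-D Gaussians centered at -2 and +2.
mixing = D.Categorical(probs=torch.tensor([0.5, 0.5]))
components = D.Normal(
    loc=torch.tensor([-2.0, 2.0]),
    scale=torch.tensor([0.5, 0.5]),
)
mixture = D.MixtureSameFamily(mixing, components)

samples = mixture.sample((1000,))
log_density = mixture.log_prob(torch.tensor([-2.0, 0.0, 2.0]))

# Von Mises distribution on the circle, concentrated around angle 0.
von_mises = D.VonMises(loc=torch.tensor(0.0), concentration=torch.tensor(4.0))
angles = von_mises.sample((5,))
```

The mixture is symmetric around zero, so the log density at -2 and +2 should match and exceed the value at 0.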
C++ API
- NN modules / functionals
- C++ tensor indexing (#30424, #32841, #30427, #34255)
- Please see docs: https://pytorch.org/cppdocs/notes/tensor_indexing.html
- Operators
  - C++ API parity: isinf (#31099).
- Autograd
  - Add at::Tensor::retain_grad API (#33349).
- C++ extensions
Distributed
- Allows Python applications to create subclasses of C++ c10d.Store using a pybind11 trampoline class (#30415).
Mobile
Quantization
- qnnpack TanH (#31013).
- Adding quantized clamp kernel (#30541).
- Quantized H Tangent function (#31031).
- QNNPACK: Add support for dynamic quantization. (#31896).
- Add operator support for dynamic quant on mobile (#32479).
- Adding native qconcat (#32252).
- FP16 dynamic quantized Linear (#32331).
- Add support for Dynamic LSTM quantization on Mobile (#32757).
- Quantized sigmoid function (#31851).
- Quantized leaky relu (#33004).
- Add a quantized batch_norm operator (#33080).
- Add Quantized BatchNorm2d module (#33109).
- Add the 3d avg pool for video related model (#33339).
- Add quantized_hardtanh (#34097).
- Add quantized ELU activation (#34267).
- Add the 3d upsample quantized op for video model (#34594).
- Add the quantized batch_norm3d and also batch_norm3d fused with relu operators (#34702).
- Add quantized implementation of hard sigmoid (#34607).
RPC
- [Experimental] Enable autograd profiler to work with RPC (#31381, #34398, #30677, #31346, #31380).
- [Experimental] Allow calling remote TorchScript functions using RPC (#32466, #33190, #32990, #32959, #33526, #33992, #33582, #32197, #33329, #34183).
Improvements
AMD/ROCm
- nn.RNN: Ensure MIOpen is called on same stream as operator (#30672).
- Fixed asserts in CUDA kernels (#31276, #31297).
- Enable BFloat16 support for convolutions (#30948).
- Abstract atomic add calls (#31992).
- Install complete set of headers for ROCm build (#32076).
- Adjust elementwise_kernel settings on ROCm (#32609).
- nn.BatchNorm{1,2,3}d: Use C10_WARP_SIZE to fix functionality on HIP vs CUDA for gradient computation (#33098).
- Enabled BFloat16 type for activation functions and batch_norm (#32065).
- Added ability to enable/disable MIOpen at runtime (#33118).
- Enable BFloat16 type for pooling ops (#34166).
- torch.pdist: improved precision by enabling double __shfl_down (#34103).
- Enabled BFloat16 type for loss functions and a few misc ops required for resnet50 (#34469).
- Enabled BFloat16 type for EmbeddingBag, Index, and Sigmoid ops (#34630).
- Enabled 3D batch norms through MIOpen (#33262).
- Enabled 3D convolutions through ROCm (#33067).
- nn.RNN: Check if weights need to be flattened (#34265).
C++ API
- NN modules / functionals
  - Allow skipping default arguments in module's forward method when module is used in torch::nn::Sequential (#33027) (#33718)
  - Make torch::nn::Sequential::push_back(AnyModule) methods public (#34208).
  - Refactor RNN / GRU / LSTM layers to match Python API (#34322).
  - For Conv{1,2,3}d, padding_mode now accepts torch::kZeros / torch::kReflect / torch::kReplicate / torch::kCircular, matching Python API behavior. (#35023)
  - Fix F::interpolate and torch::nn::Upsample implementation to match Python API behavior (#35025) (#36274)
  - Renaming: MultiLabelMarginLossFuncOptions -> MultilabelMarginLossFuncOptions, MultiLabelSoftMarginLossFuncOptions -> MultilabelSoftMarginLossFuncOptions (#35163)
- Optimizers
  - All existing optimizers in the C++ API (Adagrad / SGD / Adam / RMSprop / LBFGS) have the following changes to achieve parity with the Python API: (#29335) (#30739) (#32592) (#33730) (#33450) (#34790) (#34564) (#34957) (#35001) (#36033) (#36245)
    - step function implementation is changed to behave the same as Python equivalent
    - Constructor now accepts std::vector<OptimizerParamGroup> as input
    - optimizer.add_param_group(...) can be used to add parameter group to an existing optimizer
    - optimizer.state() should be used to access parameter state
- autograd
  - Renamed at::Tensor::base() to _base(), matching Python API (#33316)
Distributed
- Allow TCPStore to pick a port to bind to (#31674).
- Enhance NCCL watchdog to actively abort communicators for timed out ops (#32338).
- Adding DDP Design Note (#32158).
- Recommend using DDP over DataParallel (#35063)
Distributions
- distributions.independent: added explicit string representation (#33676).
- categorical.sample: Reduced memory overhead (#34900).
- distributions.MultivariateNormal: improved numeric stability and performance (#32092).
Mobile
- Add module level qpl logging. (#30906).
- Expose setNumThreads to android api (#31033).
- remove unused SparseCPUType from mobile build (#33517).
- make sure mobile build work with dynamic dispatch (#34038).
- support for custom mobile build with dynamic dispatch (#34055).
- Add watchOS support (#33318).
- speed_benchmark_torch switch to log latency from dataset level to row level (#34598).
ONNX
Exporting More Torch Operators to ONNX
In PyTorch 1.5, we have added support for 10 additional operators and also enhanced support for another set of 10+ existing operators. We have also added support for exporting large models (> 2GB) to ONNX. Additionally, we have made enhancements and optimizations to the export of ScriptModules and will continue to do that in the next release. We have also made improvements to the custom op export experience.
- Export dynamic unbind, split and getitem (#29136).
- Export torch.new_zeros (#34077).
- Export Im2col (#30972).
- Export bitwise_not for bool (#28439).
- Export logsoftmax with dim != -1 (#30433).
- Export einsum (#32716).
- Export aten::copy_ and aten::index_put to ONNX opset 11 (#26941).
- Export floor_divide (#31081).
- Export one_hot (#34454).
- Export torch.take (#33061).
- Export bool type index mask (#32445).
- Export split with list of sizes (#33161).
- Export scalar tensor for split (#32493).
- Export flatten to accept negative indices in opset 11 (#30751).
- Export sort with negative axes (#31971).
- Export Interpolate to support scale (#28324, #31526, #32554).
- Export quantized concat (#30887).
Enhancing the Support for ScriptModule
- Fixed access to element in size tensor for scripting (#32652).
- Export Conv in TorchScript module (#30618).
- Export Dim operation in TorchScript module (#31928).
- Export randnlike in TorchScript module (#32830).
- Partially support tensor lists in loop/concat/stack (#30126)
Enhancing Existing Export Logic
- Updating ONNX checker logic. (#33522).
- Adding ONNX large model export support in exporter (#33062).
- Extend op registration (#32943).
- Support op registration if name starts with underscore (#32017).
Optimizing Exported ONNX Graph
- Try exporting ONNX with force_outplace=False (#29466).
- Enable constant folding (#29834).
- Added constant folding for ONNX mul, div, sqrt ops (#32077).
- Enable constant folding for Reshape (#31054).
Adding Utility Functions and Refactoring
- Added ONNX model checker to ONNX export (#32298).
- Export custom ops (#29752).
- Upgrade exported ONNX IR version to 6 (#31025).
- Provide names for operator nodes in ONNX exported graph (#27342).
- Update ONNX landing page since 1.3 (#32805).
- Turn ONNX_ML into a proper build option (#33424).
Operator Benchmark
- Added small input shapes to test operator overhead (#30617).
- Added binary_test to benchmark binary ops (#31326).
- Added Tensor.copy_ operator (#31327).
- Removed option to wipe cache because it did not help with variance (#31334).
- Added torch.diag (#32597).
Quantization
- Guard against copying from quantized Tensor to non-quantized Tensor (#29660).
- Add assert for min, max, qmin, qmax for ChooseQuantizationParams (#32739).
- Support broadcast for quantized mul kernel (#30442).
- Make FakeQuant use REGISTER_DISPATCH (#33682).
- Set alias analysis kind to FROM_SCHEMA for qadd, qmul, qclamp, qconcat (#33359).
- Migrate fake_quant_slice to TensorIterator (#33744).
- Parallelize quantize and dequantize (#33765).
- Make FP16 RNN use new prepack op (#34339).
- Refactor QAT Conv module for better extensibility (#30362).
- Use non-inplace for insert observer pass (#34190).
RPC
- Add default arguments for init_method (#30208).
- By default ignore RRef leaks during shutdown (#30217).
- Robustify rpc_agent handlers with generic Future (#31224).
- Fix error message in incorrect rref.localValue() call (#31199).
- Add RpcAgent::getWorkerInfos() API to return all WorkerInfos in the group (#30241).
- Add local shutdown to process group agent (#30330).
- Add RRef.str() API to return a string representation of the RRef (#30609).
- Adding Debug Info for RRef Context (#30610).
- Add get_metrics and get_debug_info to RPC agent (#30833).
- Adding debugging metrics to process group agent (#30884).
- Add glue code to collect debug info from all components (#30888).
- Make RRef leak detection always print a warning log (#31922).
- Allow multiple backward passes to accumulate gradients. (#32506).
- Allow RRef local creation with IValue objects (#33263).
- Improve ProcessGroup RpcBackendOptions Constructor API (#34081).
- Enhanced Error Reporting in Dist Autograd/RPC (#34179).
- Delete all user forks tracked in RRefContext before graceful shutdown (#31893).
- Best-effort Error Detection for Using Deleted UserRRefs (#34673).
- Don’t run user function until all UserRRefs in the args are confirmed (#34497).
- Support using self as the destination in rpc.remote for builtin operators (#34931).
- Add debug info API for distributed autograd. (#30642).
- Propagate errors in clearAndWaitForOutstandingRpcsAsync. (#32952).
Type Hints
- DataLoader default_collate type hint added (#28935).
- Tensor.rsub, Tensor.rpow, Tensor.rtruediv, Tensor.map_ type hints were added (#30576).
- torch.optim: added more missing type hints (#31130).
- nn.functional.grid_sample, nn.functional.affine_grid: added missing align_corners annotation (#32492).
- torch.nn.Parameter constructor type hint was fixed (#32617).
- nn.MultiheadAttention, nn.Transformer: added type hints (#28396).
- torch.optim.LambdaLR constructor type hint was fixed (#33271).
- torch.optim: added missing default value for LRScheduler.step() (#32411).
- Make type of Tensor.type() more specific (#32353).
- torch.optim.optimizer.Optimizer type hints were fixed (#32900).
- optim.AdamW type hints were fixed (#34299).
- torch.utils.data.Sampler subclasses type hints were added (#33679).
- nn.Sequential, nn.ModuleList, nn.ParameterList, nn.ParameterDict type hints were fixed (#33686).
- Tensor.bfloat16() type hint was added (#33747).
- Binary operator type hints were fixed (#33748).
- torch.bfloat16, nn.Module.training, Tensor.cuda, and tens of other type hints added (#33762).
- torch.add type hint was fixed (#33935).
- Tensor.shape type hint was fixed (#34595).
- Fixed utils.data imports (#33543).
- Tensor.__radd__ type hint was fixed (#35231)
Other
- autograd.detect_anomaly: added support for Sparse Tensors (#29803).
- autograd.detect_anomaly: Error messages now print the current Node name (#33875).
- autograd.profiler: added better error message when crashing while profiling multi-worker DataLoader (#31473).
- autograd.profiler: Enable using torch.autograd.profiler.record_function as decorator (#30861).
- autograd.profiler: Speed up export_chrome_trace by up to 4x (#30724).
- torch.autograd: added better error message when attempting to fork (#33885).
- torch.cuda.memory.caching_allocator_alloc, torch.cuda.memory.caching_allocator_delete exposed in Python API (#33860).
- torch.roll: added bool tensor support (#31194).
- torch.flip: added support for bool tensors (#31267).
- torch.equal: added support for bfloat16 CPU scalar types (#30817).
- torch.save, torch.load: added error message for minimum dill version support (#30985).
- torch.diagonal: added named tensor support (#30193).
- torch.linspace: added support for integral types on CPU (#32218).
- torch.eig: Added autograd support in the case where eigenvalues are real (#33090).
- torch.mvlgamma: improved error message (#32665).
- torch.no_grad, torch.enable_grad: added support for decorating generator functions (#31792).
- torch.narrow: added Tensor overload for start (#34317).
- Tensor.random_: enabled support for half on CPU (#34030).
- Tensor.grad: added warnings when accessing it if it won’t be populated for known reasons (#30531).
- torch.cuda.comm.gather: improved error message (#27456).
- nn.functional.max_pool{1,2,3}d: added named tensor support (#31669).
- nn.Module.load_state_dict: Include the contents of the exception in error messages (#32693).
- nn.MultiheadAttention: add support for 3D attention mask (#31996).
- nn.MSELoss: Added performance warning for using CPU Half (#33021).
- nn.ModuleList, nn.ParameterDict: added more descriptive error messages when attempting to call these like Modules (#29991).
- nn.init.dirac_: Added groups option for compatibility with initializing group convolutions (#32825).
- Added error message to indicate that reduction operations are not supported for dim >= 64 (#31476).
- Type Promotion: added support for sparse tensors and arithmetic operations (#30429).
- Enabled indexing for bfloat16 tensors (#31692).
- Add 64-bit indexing support for CUDA Tensors (#33405).
- Added warning when converting a read-only NumPy array to torch.Tensor (#33615).
- Set rpath for JNI library on Mac (#32247).
- Updated MAGMA to 2.5.2 for Windows (#30513, #34205).
- Marked PyTorch incompatible with Python-3.6.0 (#34724).
- Consider hub_dir alongside TORCH_HOME env variable for storing hub models (#32844).
- Improved dll loading logic on Windows (#33856).
- Error out if legacy Tensor.new is called on alternate layouts or dtypes (#31485).
- utils.checkpoint.checkpoint_sequential: Removed deprecated variadic arguments behavior (#25985).
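One item from the list above, decorating generator functions with torch.no_grad, can be sketched as follows. The generator `running_totals` is a hypothetical example and not part of PyTorch.

```python
import torch


@torch.no_grad()
def running_totals(weight, steps):
    # Hypothetical generator: yields weight, 2 * weight, 3 * weight, ...
    total = torch.zeros_like(weight)
    for _ in range(steps):
        total = total + weight
        yield total


w = torch.ones(2, requires_grad=True)
outputs = list(running_totals(w, 3))

# None of the yielded tensors should be tracked by autograd.
tracked = [t.requires_grad for t in outputs]
```

Gradient tracking applies again as soon as control returns to the caller, so code outside the generator is unaffected.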
Bug Fixes
C++ API
- NN modules / functionals
  - output_ratio for FractionalMaxPool{2,3}d module and fractional_max_pool{2,3}d functional should accept double as data type (#33304)
  - For AdaptiveAvgPool{2,3}d and AdaptiveMaxPool{2,3}d, output_size is changed to accept c10::nullopt in its elements, matching Python API behavior. (#35022)
  - Fix bug in fractional_max_pool3d_with_indices implementation (#35024)
  - Remove namespace F = torch::nn::functional from torch/nn/modules/batchnorm.h, so that people don’t have to use F to alias torch::nn::functional if they don’t want to (#30684)
- autograd
  - For AutogradContext, get_dirty() is removed and get_and_bump_dirty() is added, and the latter always bumps the version counter of the returned tensors (#33068)
  - Fix allow_unused checking for C++ API (#34035)
  - Remove using namespace torch::autograd from torch/csrc/api/include/torch/nn/modules/_functions.h (#34423)
- Operators
  - torch::tensor(floating-point values) will always produce a tensor of default dtype, and torch::tensor(integer values) will always produce a tensor of torch::kLong dtype, matching Python API behavior (#32367)
  - Fix torch::allclose to handle std::numeric_limits::lowest() for integral types (#32978)
  - Switch torch::empty_like to use merge_in to process TensorOptions (#33505)
Distributed
- Allow DDP to detect globally unused parameters (#28883).
- Accept url query when rank or world_size is specified in Process Group init_method URL (#32016).
- Add ability to abort NCCL communicators from the store. (#32895).
- Fix timeout support when initializing process group with TCP store (#33434).
- Abort NCCL communicators before throwing operation timed out (#31128).
- Fix logging for aborted communicators in ProcessGroupNCCL (#33147).
- Fix handling of replica parameters in DataParallel (#33907).
- Specify requires_grad for Parameter replica so it’s not always set to True by default (#32356)
- Put sparse allreduce results to input tensors (#32226)
- Issue a warning when zero_grad is used in DataParallel (#33064)
JIT
- TorchScript compilation fixed for (#33783):
  - torch.stft
  - torch.lu, torch.lu_unpack
  - torch.cdist
  - torch.norm
- tensor.tolist() compilation now supported, requires output type annotation (#33472)

import torch
from typing import List, Tuple

@torch.jit.script
def foo(float_matrix, scalar_ten):
    # type: (Tensor, Tensor) -> Tuple[List[List[float]], bool]
    # The result of tolist() must be annotated so TorchScript knows its type.
    out1 : List[List[float]] = float_matrix.tolist()
    out2 = torch.jit.annotate(bool, scalar_ten.tolist())
    return out1, out2

- torch.rand_like and other _like constructors no longer require additional arguments in TorchScript
- Compilation for nn.Module APIs added (#29495):
  - children
  - named_children
  - modules
  - named_modules
- Support for ModuleList Indexing with Integer Literal (#29236)
- Fixed flipped outputs for PackedSequence (#32955)
- Support index and type properties on Device (#32953)
  - device.index
  - device.type
- Add remaining Tensor properties (#33906)
  - tensor.ndim
  - tensor.T
  - tensor.name
  - tensor.is_leaf
- Fix augmented assignment to non-tensor attributes (#32993)
- Fixed type resolution for function arguments (#29623)
  - Previously we resolved types by parsing their names directly, but now TorchScript uses the value of the type directly from Python
  - This allows types like torch.device to be used
- len on tuples containing different types (#35768)
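The newly supported Device and Tensor properties can be exercised from a scripted function, as in this minimal sketch. The function name `describe` is made up for illustration, and the example assumes a CPU tensor.

```python
from typing import Tuple

import torch


@torch.jit.script
def describe(x: torch.Tensor) -> Tuple[int, str, bool]:
    # Reads tensor.ndim, device.type and tensor.is_leaf inside TorchScript.
    return x.ndim, x.device.type, x.is_leaf


info = describe(torch.zeros(2, 3))
```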
Mobile
- Fix exception message in Java Tensor (#30205).
- Fix the crashes for c++ not able to find java class through Jni (#30390).
- Add @DoNotStrip to nativeNewTensor method. (#30472).
- GenericDict/List type use unshapedType() (#30428).
- Support tensors with a storage offset in Java (#31584).
- Fix SIGABORT caused by double exception in PyTorchStreamReader when file not found. (#33243).
- Fix SELECTED_OP_LIST file path issue (#33942).
- Fix for handling batch size 0. (#34599).
- Fixed AutoGradMode/AutoNonVariableTypeMode uses for mobile callsites
- Use gettimeofday on iOS (#30361).
ONNX
- Fix weight_norm export for dim=0 (#31015).
- Fix for constant folding flaky tests (#32546).
- Fix export for avg_pool with default stride (#33017).
- Fix ONNX CI by moving test data to aws (#33200).
- Fix for random generators export (#33789).
- Fix export of index_put (#31552).
- Fix for expand -1 dim value (#34069).
- Reduce ONNX test time on CI (#33242).
- ONNX Error Message on Missing Op (#33593).
- Fix exporting copy_ with index as tensor input (#32801).
- Fix for rand_like as well (#33095).
- Added torchvision tests as part of ORT tests (#31835).
- Remove non-ascii character from torch/onnx/symbolic_opset11.py (#31814).
- Add flag to enable script tests (#32654).
- Skip same tests in ONNX Python3 CI as in Python2 (#31827).
- Fixed torch.mm export (#34794)
- Fixed aten::size for opset 11 (#35984)
Quantization
- Bug fix: Handle missing keys in observer state dict during load (#30357).
- Fix BC for quantized linear (#30481).
- Fix mapping white list to avoid attaching qconfig for DeQuantStub (#30636).
- Fix default instantiation of dynamic quantized LSTM (#31433).
- Use default scale/zero_point in fake_quantize module instead of None (#32318).
- Fix ASAN / potential segfault in quantized Tensor memory allocations. (#29882).
- Don’t serialize None values in observer (#32733).
- Enable inplace relu fusion for training (#33105).
- Bug fix in dynamic quantization kernels + better test coverage. (#33320).
- Run weight_post_process for QAT (#33852).
- Fix histogram observer to work with QAT on GPU (#34232).
- Fix the quantized batchnorm2d (#34579).
- Move QScheme ops to c10 (#30134).
- Remove incorrect fp16 dynamic linear/relu op (#32774).
RPC
- Fix serialization memory lifetime issue. (#30603).
- Don’t crash callee when function does not exist on it, instead return an Exception (#32726).
- Throw the correct Exception on local client based on the RemoteException (#32936).
- Attach autograd edges only for tensors requiring grad. (#30904).
- WireSerializer should check has_storage() (#34626).
- Fixed potential deadlock in python exception handling (#35283)
Other
- torch.split: Fixed incorrect gradient computation that assumed the output was not a view (#32044).
- Allowed numpy integer types to be used where we accept Python integers (#30486).
- torch.unique, torch.unique_consecutive: fixed bug with zero-element input support (#31211).
- Tensor.to_sparse: fixed backward in the non-contiguous tensor case (#31223).
- torch.index_put: Added error checks for input tensors’ devices (#31280).
- Ensure we switch the CUDA stream correctly in CUDA operations (#31537, #31538, #31541).
- torch.SparseTensor: ensure the legacy sparse constructor doesn’t interpret Python data as tensor data. (#31490).
- torch.argmax, torch.argmin: Fixed incorrect behavior on large tensors (#33310).
- torch.div: Fixed to throw an error when dividing by integer zero on CPU (#32629).
- torch.cos: Fixed incorrect gradient computation caused by not properly initializing temporary vectors in avx2 code (#32722, #34281).
- torch.logspace: Added missing integer dtype support, fixed precision issues in floating-point implementation (#32744).
- torch.prod: Fixed behavior when passed a torch.half input tensor and torch.float output tensor (#32831).
- torch.max, torch.min: Fixed NaN handling (#32541).
- torch.max, torch.min: Added error check that operand and outputs are on the same device type (#32862).
- torch.stack: Added missing input size checks (#32931).
- torch.add: Fixed memory leak on certain platforms (#32478).
- torch.normal: Fixed shape checks (#33050).
- torch.cumsum: fixed to handle inputs with zero-sized dimensions correctly (#31694).
- torch.device: Disallow incorrectly formatted device strings (#29087).
- torch.cat: Disallow passing out as one of the input tensors (#30577).
- torch.pdist: Added support for large batch sizes (#31593).
- torch.stft: Fixed crash when used with nn.DataParallel (#31861).
- torch.autograd: Ensure the original grad mode is restored during backward (#31884).
- torch.autograd: Fixed a race condition by locking graph_task before writing leaf_streams. (#31995).
- torch.tensordot: Fixed support for negative dimensions (#31954).
- torch.cumprod: Fixed to handle inputs with zero-sized dimensions correctly (#32070).
- torch.pow: Fixed the gradient computation when the base is a Tensor or Scalar of zeros (#32062, #32063).
- torch.baddbmm: Fixed bug in corner case (#33538).
- torch.where: Added check for consistent devices (#33432).
- torch.cdist: Fixed gradient computation for p=2 and large inputs (#31167).
- torch.mv: Fixed NaN handling (#31666).
- torch.index_put: Added handling for large input tensors (#33753).
- torch.addmm: Fixed incorrect output when using BLAS backend (#33819).
- torch.topk: fixed double backward when input has non-finite values (#35253)
- torch.load: Avoid problematic pickle usages on Python 3.8.0 and 3.8.1 (#33824).
- Tensor.to: Fixed race condition for gradient computation that spans CUDA devices (#31930).
- Tensor.random_: added check that from and to are within the Tensor’s dtype bounds (#34033).
- Tensor.copy_: Fixed memory overlap check and allowed outputs to be zero-strided tensors if the size is <= 1 along that dimension (#34100).
- nn.BatchNorm{1,2,3}d: fixed gradient computation for empty inputs (#32820).
- nn.BatchNorm: Fixed behavior for inputs with large batch sizes (#32763).
- nn.Conv2d: Fixed 5d weight handling with MKLDNN backend (#34115).
- nn.Conv3d: Fixed unstable gradient computation (#34358).
- nn.Conv{1,2,3}d: added support for empty batch size (#32709).
- nn.Conv{1,2,3}d: fixed CUDNN_STATUS_NOT_SUPPORTED errors by trying multiple algorithms (#33073).
- nn.Conv{1,2,3}d: fixed padding mode support and added additional padding modes (reflection and replication) (#31784).
- nn.Conv2d, nn.Conv3d, nn.Conv1d, nn.ConvTranspose2d: Fixed support for batch sizes greater than 2^32 (#31383, #31379, #31889, #34407, #31510).
- nn.InstanceNorm, nn.GroupNorm: Added error check for input with exactly one element (#29082).
- nn.RNN: Fixed moving RNNs to a device after applying weight norm (#32563, #32989).
- nn.MultiLabelMarginLoss: added support for 0-d tensors (#30765).
- nn.GroupNorm: added support for empty batch (#32401).
- nn.NLLLoss: fixed to support empty tensors on CUDA (#31491).
- nn.GroupNorm: corrected input size check (#33008)
- nn.MultiLabelMarginLoss: fixed memory leak on CUDA (#30767).
- nn.MultiMarginLoss: fixed error checking on CUDA for the 1D case. (#30825).
- nn.Softmax: Fixed half->float case of softmax backward (#30838).
- nn.Softshrink: Added check that lambda is no less than zero (#33201).
- nn.functional.interpolate: added support for empty batch size input for interpolate. (#32400).
- nn.functional.pad: Also return a new tensor instead of sometimes returning a view (#32350).
- nn.functional.grid_sample: Fixed gradient computation at image borders (#32829).
- nn.functional.leaky_relu_: disabled incorrect leaky_relu_ negative slope backward calculation (#33639).
- optim.LambdaLR: removed unintentional side effects (#32848).
- optim.Adam, optim.AdamW: Added missing weight_decay parameter validation (#33126).
- optim.MultiStepLR: Fix “unbound local variable” error by removing return value for __exit__ (#32997).
- optim.MultiStepLR: Fixed broken step() method (#33356).
- torch.autograd: added new error message if incorrect usage would cause a deadlock (#32295).
- torch.autograd: Prohibited copying autograd engines (#34567).
- torch.autograd: Fixed incorrect handling of functions that return multiple views (#32790).
- autograd.Function: Fixed error if Function returned a view in a torch.no_grad block (#33896).
- autograd.Function: Added more error checks for incorrect behavior (#33069).
- autograd.Function: Added nice error message if missing overrides (#33142).
- autograd.Function: Fixed version check for grad_fn for views (#34145).
- autograd.profiler: Fix incorrect chrome trace formatting output for CUDA traces (#33987).
- multiprocessing.util.register_after_fork: fixed crash on Windows (#30809).
- utils.data.DataLoader: Fixed potential hang when exiting main process (#33721).
- utils.tensorboard.SummaryWriter: fixed scale_factor calculation for uint8 tensor (#31778).
- utils.tensorboard: Fix for when PyTorch model trace has RecursiveScriptModules (#30430).
- Fixed CPU_INTEL flag error on Windows (#30564).
- Don’t use RTLD_GLOBAL to load _C, resolving a multitude of weird segfaults and crashes when PyTorch is imported along with other packages (#31162).
- Fixed dll load logic for Python 3.8 on Windows (#32215).
- quasirandom.SobolEngine: Fixed crash when default tensor type is CUDA (#32496).
- Fixed error message when converting NumPy array with negative strides to a torch.Tensor (#33254).
- Fixed crash when indexing a torch.Tensor with a single-element array (#33456).
- Fixed crash when converting CUDA tensors and non-strided tensors to NumPy arrays (#33612).
- Prevented crash on exit from static destructor race on Windows (#33955).
- Fixed uncaught std::domain_error on macOS (#34301).
- Don’t reset worker affinity when using operators that call into OpenMP (#29006).
- torch.backends.mkldnn: changed to be usable without import (#32055).
Performance
Mobile
- Java Tensor hybrid, owns at::Tensor, no memcopy for java outputs. (#30501).
- Tensor prep from image in native (#31426).
- Pass to remove prepacking ops. (#34319).
Quantization
- Per channel quantization performance improvement (#33772).
- Speed up per-channel min-max observer (#34118).
- Vectorized qmul and more methods on qint data types (#34376).
RPC
- Improve ProcessGroupAgent serialization speed (#29785).
- Avoid sending large unneeded data over wire in ProcessGroupAgent. (#31357).
- Integrate async mode for autograd engine with distributed autograd. (#31508).
- Make handling of FORWARD_AUTOGRAD_REQ in request_callback_impl nonblocking (#32476).
Other
- Major multithreaded performance regression when doing operator calls resolved (#30333)
- Improved performance of comparison ops on CUDA (#29743).
- Tensor.view improved performance (#30554).
- Improved tensor creation overhead (#30452, #30709)
- nn.SmoothL1Loss: vectorized gradient computation on CPU. (#30046).
- nn.EmbeddingBag: improved performance on CPU (#30701, #27477).
- nn.LayerNorm: optimized with explicit vectorization using Vec256 (#31127).
- Tensor.copy_: fixed kernel speed regression introduced in #29631 (#31279).
- Moved a number of debug asserts to not compile in release builds (#31240).
- Tensor::has_names sped up for unnamed tensors (#31436).
- torch.index_select: optimized performance on CPU (#30598).
- nn.Conv{1,2,3}d: Improved performance by refactoring bias handling for cuDNN backend (#31524).
- torch.norm: Optimized case where p = 2 (#31903).
- nn.utils.clip_grad_norm_: Refactored the computation for more performance (#32020).
- Made an assert on a hotpath trigger only in DEBUG mode (#32117).
- First steps toward TensorIterator unrolling and vectorized load (#31974).
- nn.functional.normalize: changed to use clamp_min_ (#32360).
- Stopped refreshing numel on a stride update (#32116).
- nn.functional.softplus: vectorized operator and gradient computation on CPU (#32944).
- torch.gather regression fixed by not materializing loop vars in error message (#33108).
- nn.ELU forward and backward vectorized on CPU (#32985, #32986)
- torch.cat: optimized performance on CPU (#30806, #33534).
- torch.conv3d: optimized Unfold3d to improve performance (#33191).
- Workaround performance bug and memory leak in GOMP for AMD CPUs (#32875).
- Improved TensorIterator overhead (#33165).
- torch.conv3d: optimized Unfold3dAcc to improve gradient computation performance (#33317).
- torch.roll improved performance (#33623).
- Bounds checking for functor execution in vectorized/unrolled kernels (#33642).
- nn.EmbeddingBag: improved performance on CUDA (#33589).
- Remove unnecessary tensor copies while calling operators (#33732).
- clang intrinsics targeting on Windows (#33958).
- nn.Dropout: added vectorized CUDA implementation (#33879).
- nn.UpSampleNearest{1, 2, 3}d performance on CPU optimized (#31452).
- Remove cudaMemcpy on full memory overlap (#34548).
- CUDA Loops: move address computation into policy, make policy.load load all arguments (#33720).
- nn.BatchNorm{1, 2, 3}d contiguous case’s performance improved (#34530).
- Add the build for runtime dispatch for AVX, AVX2 instruction set (#26125).
- nn.RReLU performance improved up to 5x for inference on CPU (#31094).
- nn.LogSigmoid performance improved up to 10x on CPU (#30958).
- torch.dist performance improved up to 2x (#29714).
- torch.max, torch.min performance improved up to 1.5x on CPU (#33936).
- nn.GLU performance improved up to 1.5x on CPU (#33179).
- nn.LeakyReLU performance improved up to 4x (#29899).
- nn.HardTanh performance improved up to 5x (#30152).
Documentation
Python
- Added documentation for nn.functional.softplus (#30055, #32945).
- torch.max: Added warning about different, nondeterministic behavior on CPU and CUDA (#31115).
- Clarified the documentation for nn.NLLLoss (#31488).
- Exclude generated source docs from Google search indexing (#31484).
- torch.poisson docstring added to documentation (#31667).
- torch.eq fixed incorrect examples in documentation (#32399).
- torch.load: added warning regarding pickle insecurity (#32593).
- optim.CosineAnnealingLR: fixed the usage in examples (#31358).
- Added doc previewing instructions (#31905).
- Removed legacy .data usages from the torch.nn documentation (#31481).
- Fixed description of convolution modules (#30079).
- Tensor.t(), Tensor.permute(), Tensor.unfold(), and Tensor.select() clarified to note that they return views (#32512).
- torch.multiprocessing: Updated documentation indicating that start_method is ignored for mp.spawn() (#33070).
- Improved CPU threading documentation (#33083).
- nn.BCELoss: documented how it avoids infinite results (#33160).
- nn.utils.rnn.pack_padded_sequence: Improved the description of enforce_sorted (#33617).
- nn.utils.pad_packed_sequence: doc improvement (#33768).
- nn.LPPool{1,2}d: removed nonexistent parameter (#33714).
- Created a Tensor View documentation page that documents all PyTorch operations that return views (#32560).
- Added grad context manager doc to top level torch module. (#33877).
- Enhanced reproducibility documentation (#33795).
- Numerous typo fixes (#30448, #30518, #30614, #30464, #30608, #24335, #34581, #34624, #34008, #31395, #31677, #31617, #31973, #32068, #33689, #30385, #32003, #31682, #30846, #33478, #33549, #32307, #33144, #33805, #33836, #34053).
- Numerous formatting and/or rendering fixes (#30377, #30779, #32667, #34027, #32911, #30814, #30815, #31760, #34503).
C++ API
- Fix at::Tensor docs generation and make it accessible again at https://pytorch.org/cppdocs/api/classat_1_1_tensor.html (#34467)
- Add docs for all torch::nn modules and functionals (#34522) (#34688) (#34752)
- Improve C++ autograd and tensor indexing docs (#35919)
- Fix example in torch::nn::ModuleList docs (#34463)
RPC
- Reorganize RPC API doc and add introduction (#30491, #35109).
- Make doc source format consistent in rpc/init.cpp (#30515).
- Add examples to RRef doc (#30516).
- Add more details to explain rpc_backend_options arg in init_rpc (#30855).
- Fix examples in API doc (#30856).
- Fix examples in RRef API doc (#30857).
- Document WorkerInfo and RpcBackendOptions structures in RPC docs. (#31077).
- Explain RPC behavior when using Tensor as arg or return value (#31968).
- Update RPC docs to reflect correct use of dist_autograd backwards and dist_optim step() (#34670).
- Minor doc tweak to use mp.spawn in example (#30381).
- Update distributed autograd note (#34657).
Mobile
- Add info about transitive dependencies in case of using local aars (#30128).
- Update Docs for building PyTorch for Android. (#32578).
- Javadoc changes (#31956).
Quantization
- Updates to quantization documentation (#30288).
- Fix docs so that the example works (#30120).
- Add the explicit per-tensor/per-channel quant info when we print the module (#30591).
- Fixed typos in quantization docs / docstrings (#34182).
- Docs entry for is_quantized (#32075).
Deprecations
Python
How to figure out which line in your code is raising a warning
Attempting to use deprecated behavior will raise warnings. Unfortunately, it is sometimes not obvious which line of code a warning corresponds to, especially if the warning comes from our C++ backend. For example, with a file named foo.py with the following contents,
import torch
# This is newly deprecated behavior, see the next section
torch.tensor(1) / torch.tensor(2)
running it doesn’t give us the location of the warning:
> python foo.py
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
We can use the warnings module to tell us where the warning is by asking it to treat warnings as errors:
import torch
import warnings
warnings.filterwarnings('error', message='Integer division')
# This is newly deprecated behavior, see the next section
torch.tensor(1) / torch.tensor(2)
Running the file now tells us exactly where the warning is:
> python foo.py
Traceback (most recent call last):
  File "foo.py", line 5, in <module>
    torch.tensor(1) / torch.tensor(2)
UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
Deprecated torch.div and torch.addcdiv integer floor division behavior (#34570)
In 1.5.0 and older PyTorch releases torch.div and the / operator perform integer floor division. In a future PyTorch release, torch.div (including the / operator) will perform “true” division as in Python 3 and NumPy.
To floor divide integer tensors, please use torch.floor_divide instead.
| Before | After |
|---|---|
| `>>> torch.tensor(3) / torch.tensor(2)`<br>`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`<br>`tensor(1)` | `>>> # NB: the following is equivalent to torch.floor_divide(torch.tensor(3), torch.tensor(2))`<br>`>>> torch.tensor(3) // torch.tensor(2)`<br>`tensor(1)` |
The fix for torch.addcdiv is similar.
| Before | After |
|---|---|
| `>>> input = torch.tensor(0)`<br>`>>> tensor = torch.tensor(1)`<br>`>>> other = torch.tensor(3)`<br>`>>> value = 1`<br>`>>> torch.addcdiv(input, tensor, other, value=value)`<br>`../aten/src/ATen/native/PointwiseOps.cpp:81: UserWarning: Integer division with addcdiv is deprecated, and in a future release addcdiv will perform a true division of tensor1 and tensor2. The current addcdiv behavior can be replicated using floor_divide for integral inputs (self + value * tensor1 // tensor2) and division for float inputs (self + value * tensor1 / tensor2). The new addcdiv behavior can be implemented with true_divide (self + value * torch.true_divide(tensor1, tensor2).`<br>`tensor(0)` | `>>> input = torch.tensor(0)`<br>`>>> tensor = torch.tensor(1)`<br>`>>> other = torch.tensor(3)`<br>`>>> value = 1`<br>`>>> (input + torch.floor_divide(value * tensor, other))`<br>`tensor(0)` |
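As a minimal sketch of the migration, the snippet below contrasts the two explicit division functions named in the warning and writes out the addcdiv replacement by hand. The tensor values and variable names are illustrative assumptions, and positive operands are used on purpose so that floor and truncating integer division agree.

```python
import torch

a = torch.tensor(3)
b = torch.tensor(2)

# Explicit floor division: the result stays an integer tensor.
floored = torch.floor_divide(a, b)

# Explicit true division: the result is a floating-point tensor.
true_quotient = torch.true_divide(a, b)

# addcdiv-style update spelled out with explicit floor division,
# mirroring the formula quoted in the deprecation warning.
inp = torch.tensor(0)
tensor1 = torch.tensor(1)
tensor2 = torch.tensor(3)
value = 1
updated = inp + torch.floor_divide(value * tensor1, tensor2)

print(floored, true_quotient, updated)
```

Choosing one of the two functions explicitly keeps the code's meaning stable regardless of what the / operator does in a given release.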
Deprecated torch.full returning float tensors if no dtype is specified (#34709).
In a future PyTorch release, torch.full will infer its dtype from its fill value when the optional dtype and out parameters are unspecified, matching NumPy’s inference for numpy.full. For example, torch.full(size, 1) will
return a tensor of torch.long dtype, unlike today where it returns a tensor of torch.float dtype.
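One way to make code independent of this inference change is to always pass dtype when the fill value is an integer. The following is a small sketch under that assumption; the shapes and names are illustrative.

```python
import torch

# Passing dtype explicitly pins the result type, regardless of how
# torch.full infers a dtype from an integer fill value.
ones_float = torch.full((2, 3), 1, dtype=torch.float32)
ones_long = torch.full((2, 3), 1, dtype=torch.long)

print(ones_float.dtype, ones_long.dtype)
```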
Deprecated torch.nn.modules.conv._ConvTransposeMixin (#31784).
This is an internal-facing class that is not a part of our public API. We’ve refactored some PyTorch internals to work without it and will remove it in a future release.
Deprecated positional args in multiple torch function signatures (#32009, #33428)
Below please find a list of deprecated signatures and what to change them to.
- torch.add(self: Tensor, alpha: Scalar, other: Tensor), torch.sub(self: Tensor, alpha: Scalar, other: Tensor): please use alpha as a keyword-only arg instead of a positional arg.
- torch.addbmm(beta: Scalar, self: Tensor, alpha: Scalar, batch1: Tensor, batch2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.addcdiv(self: Tensor, value: Scalar, tensor1: Tensor, tensor2: Tensor), torch.addcmul(self: Tensor, value: Scalar, tensor1: Tensor, tensor2: Tensor): please use value as a keyword-only arg.
- torch.addmm(beta: Scalar, self: Tensor, alpha: Scalar, mat1: Tensor, mat2: Tensor), torch.sspaddmm(beta: Scalar, self: Tensor, alpha: Scalar, mat1: Tensor, mat2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.addmv(beta: Scalar, self: Tensor, alpha: Scalar, mat: Tensor, vec: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.addr(beta: Scalar, self: Tensor, alpha: Scalar, vec1: Tensor, vec2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.baddbmm(beta: Scalar, self: Tensor, alpha: Scalar, batch1: Tensor, batch2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
| Before | After |
|---|---|
| `>>> torch.zeros(2,3).add(2, torch.ones(2, 3))`<br>`../torch/csrc/utils/python_arg_parser.cpp:750: UserWarning: This overload of add is deprecated:`<br>`add(Number alpha, Tensor other)`<br>`Consider using one of the following signatures instead:`<br>`add(Tensor other, Number alpha)`<br>`tensor([[2., 2., 2.],`<br>`[2., 2., 2.]])` | `>>> torch.zeros(2, 3).add(torch.ones(2, 3), alpha=2)`<br>`tensor([[2., 2., 2.],`<br>`[2., 2., 2.]])` |
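The same keyword-only pattern applies to the other functions in the list. Below is a short sketch for torch.addmm and torch.addcdiv; the input values and variable names are illustrative assumptions chosen so the results are easy to check by hand.

```python
import torch

M = torch.ones(2, 2)
mat1 = torch.eye(2)
mat2 = torch.full((2, 2), 3.0)

# beta scales the input M, alpha scales the matrix product mat1 @ mat2.
# Both are passed by keyword, after the tensor arguments.
out = torch.addmm(M, mat1, mat2, beta=0.5, alpha=2.0)

# value is passed by keyword for the pointwise op: 0 + 0.5 * (4 / 2).
t = torch.addcdiv(
    torch.zeros(2), torch.full((2,), 4.0), torch.full((2,), 2.0), value=0.5
)

print(out)
print(t)
```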
Deprecated modifying in-place a view returned by a custom autograd Function (#32839).
Modifying in-place a view that was created by a custom Function leads to the custom backward not being called or being called with a partial gradient. This behavior will be removed in 1.6.
Please clone() the output of the Function to avoid incorrect gradient computation.
import torch
from torch.autograd import Function

class Id(Function):
    @staticmethod
    def forward(ctx, input):
        return input.view_as(input)

    @staticmethod
    def backward(ctx, grad_input):
        return grad_input

| Version 1.5.0 (deprecated usage) | Version 1.5.0 (recommended) |
|---|---|
| `>>> input = torch.randn(3, requires_grad=True)`<br>`>>> other = torch.randn(3)`<br>`>>> output = Id.apply(input)`<br>`>>> output.copy_(other)`<br>`# Warning: Incorrect gradients` | `>>> input = torch.randn(3, requires_grad=True)`<br>`>>> other = torch.randn(3)`<br>`>>> output = Id.apply(input).clone()`<br>`>>> output.copy_(other)` |
Deprecated modifying in-place a view created inside a no_grad block (#32839)
Modifying in-place a view created inside a no_grad block is ambiguous and error-prone so we have deprecated it.
Here is an example of some code that we’ve deprecated. In previous versions of PyTorch, the following code throws a non-descriptive error message, but we’ve added a deprecation in 1.5.0.
base = torch.rand(10, requires_grad=True)
var = torch.rand([], requires_grad=True)
with torch.no_grad():
    view = base[1]
view.copy_(var)
torch.autograd.grad(base.sum(), var)
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is deprecated and will be forbidden starting 1.6 (see https://github.com/pytorch/pytorch/pull/32839 for more details about this). You can clarify your code and remove this warning by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked).
If you want to differentiate, you should change the above code to
base = torch.rand(10, requires_grad=True)
var = torch.rand([], requires_grad=True)
view = base[1]
view.copy_(var)
torch.autograd.grad(base.sum(), var)
(tensor(1.),)
If you don’t want to differentiate, you should change it to
base = torch.rand(10, requires_grad=True)
var = torch.rand([], requires_grad=True)
with torch.no_grad():
    view = base[1]
    view.copy_(var)
C++ API
Deprecated Tensor.type() (#30281)
Please use Tensor.options() instead.
Miscellaneous
- Part of an automated mixed-precision solution (#33366, #33832).
Highlights
This release includes several major new API additions and improvements. These include new APIs for autograd allowing for easy computation of hessians and jacobians, a significant update to the C++ frontend, ‘channels last’ memory format for more performant
computer vision models, a stable release of the distributed RPC framework used for model parallel training, and a new API that allows for the creation of Custom C++ Classes that was inspired by PyBind. Additionally torch_xla 1.5 is now
available and tested with the PyTorch 1.5 release providing a mature Cloud TPU experience.
C++ Frontend API [Now Stable]
The C++ frontend API is now at parity with Python, and the feature overall has been moved to ‘stable’ (previously tagged as experimental). Some of the major highlights include:
- C++ torch::nn module/functional are now at ~100% parity with Python API, with appropriate documentation. Now users can easily translate their model from Python API to C++ API, making the model authoring experience much smoother.
- C++ optimizers now behave identically to the Python API. In the past, optimizers in C++ had deviated from the Python equivalent: C++ optimizers couldn’t take parameter groups as input while the Python ones could. Also step function implementations were not exactly the same. With the 1.5 release, C++ optimizers will always behave the same as the Python equivalent.
- New C++ tensor multi-dim indexing API which looks and behaves similarly to the Python API. The previous workaround was to use a combination of narrow / select / index_select / masked_select, which is clunky and error-prone compared to the Python API’s elegant tensor[:, 0, ..., mask] syntax. With the 1.5 release users can use tensor.index({Slice(), 0, "...", mask}) to achieve the same result.
Channels last memory format for Computer Vision models [Experimental]
Channels Last memory format is an alternative way of ordering NCHW tensors in memory while preserving the NCHW semantic dimensions ordering. Channels Last tensors are ordered in memory in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).
Channels Last memory format unlocks the ability to use performance efficient convolution algorithms and hardware (NVIDIA’s Tensor Cores, FBGEMM, QNNPACK). Additionally it was designed to automatically propagate through the operators, which allows easy switching between memory layouts.
Learn more here on how to write memory format aware operators.
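To make the "same shape, different strides" idea concrete, here is a small sketch that converts an NCHW tensor to Channels Last and inspects its strides. The tensor sizes are arbitrary assumptions chosen to keep the stride arithmetic easy to follow.

```python
import torch

# A batch of 2 images with 3 channels and 4x5 spatial size (NCHW).
x = torch.randn(2, 3, 4, 5)

# Convert to channels last: the shape is unchanged, only the strides differ.
y = x.contiguous(memory_format=torch.channels_last)

print(y.shape)     # torch.Size([2, 3, 4, 5])
print(x.stride())  # (60, 20, 5, 1): W is the densest dimension
print(y.stride())  # (60, 1, 15, 3): C is the densest dimension
```

Because the logical dimension order stays NCHW, indexing code such as y[n, c, h, w] does not need to change.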
Custom C++ Classes [Experimental]
This release adds a new API for binding custom C++ classes into TorchScript and Python simultaneously. This API is almost identical in syntax to pybind11. It allows users to expose their C++ class and its methods to the TorchScript type system and runtime system such that they can instantiate and manipulate arbitrary C++ objects from TorchScript and Python. An example C++ binding:
#include <string>
#include <vector>
#include <torch/custom_class.h>

template <class T>
struct MyStackClass : torch::CustomClassHolder {
  std::vector<T> stack_;
  MyStackClass(std::vector<T> init) : stack_(std::move(init)) {}

  // Push a value onto the top of the stack.
  void push(T x) {
    stack_.push_back(x);
  }

  // Remove and return the value on top of the stack.
  T pop() {
    auto val = stack_.back();
    stack_.pop_back();
    return val;
  }
};

static auto testStack =
    torch::class_<MyStackClass<std::string>>("myclasses", "MyStackClass")
        .def(torch::init<std::vector<std::string>>())
        .def("push", &MyStackClass<std::string>::push)
        .def("pop", &MyStackClass<std::string>::pop)
        // The template argument must be spelled out for the self pointer.
        .def("size", [](const c10::intrusive_ptr<MyStackClass<std::string>>& self) {
          return self->stack_.size();
        });
Which exposes a class you can use in Python and TorchScript like so:
@torch.jit.script
def do_stacks(s : torch.classes.myclasses.MyStackClass):
    s2 = torch.classes.myclasses.MyStackClass(["hi", "mom"])
    print(s2.pop())  # "mom"
    s2.push("foobar")
    return s2  # ["hi", "foobar"]
You can try it out in the tutorial here.
Distributed RPC framework APIs [Now Stable]
The torch.distributed.rpc package aims at supporting a wide range of distributed training paradigms that do not fit into DistributedDataParallel. Examples include parameter server training, distributed model parallelism, and distributed
pipeline parallelism. Features in the torch.distributed.rpc package can be categorized into four main sets of APIs.
- The RPC API allows running a function on a specified destination worker with given arguments and fetches the return value or creates a distributed reference to the return value.
- The RRef (Remote REFerence) serves as a reference to an object on another worker. A worker holding an RRef can explicitly request copies of the object, and it can also share the light-weight RRef with other workers without worrying about reference counting. This is especially useful when multiple workers need to repeatedly access different versions of the same remote object.
- With Distributed Autograd, applications can automatically compute gradients even if a model is split on multiple workers using RPC. This is achieved by stitching together local autograd graphs at RPC boundaries in the forward pass and reaching out to participants to transparently launch local autograd in the backward pass.
- The Distributed Optimizer uses gradients computed by Distributed Autograd to update model parameters. Its constructor takes a local optimizer (e.g., SGD, Adagrad, etc.) and a list of parameter RRefs, and its step() function automatically uses the local optimizer to update parameters on all distinct RRef owner workers.
Learn more here.
torch_xla 1.5 now available
torch_xla is a Python package that uses the XLA linear algebra compiler to accelerate the PyTorch deep learning framework on Cloud TPUs and Cloud TPU Pods. torch_xla aims to give PyTorch users the ability to do everything they can do on GPUs on Cloud TPUs as well while minimizing changes to the user experience. This release of torch_xla is aligned and tested with PyTorch 1.5 to reduce friction for developers and to provide a stable and mature PyTorch/XLA stack for training models using Cloud TPU hardware. You can try it for free in your browser on an 8-core Cloud TPU device with Google Colab, and you can use it at a much larger scale on Google Cloud.
See the full torch_xla release notes here and the full docs here.
New High level autograd API [Experimental]
PyTorch 1.5 brings new functions including jacobian, hessian, jvp, vjp, hvp and vhp to the torch.autograd.functional.* submodule. This feature builds on the current API and allows users to easily compute these quantities.
See the full docs here.
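As a brief illustration of two of these functions, the sketch below computes a Jacobian and a Hessian for simple closed-form functions. The helper functions squares and sum_of_squares are hypothetical examples chosen because their derivatives are known analytically.

```python
import torch
from torch.autograd.functional import hessian, jacobian


def squares(x):
    # Elementwise square; its Jacobian is diag(2 * x).
    return x ** 2


def sum_of_squares(x):
    # Scalar-valued; its Hessian is 2 * identity.
    return (x ** 2).sum()


x = torch.tensor([1.0, 2.0, 3.0])

J = jacobian(squares, x)
H = hessian(sum_of_squares, x)

print(J)
print(H)
```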
Python 2 no longer supported
For PyTorch 1.5.0 we will no longer support Python 2, specifically version 2.7. Going forward support for Python will be limited to Python 3, specifically Python 3.5, 3.6, 3.7 and 3.8 (first enabled in PyTorch 1.4.0).
Known Issues
torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode.
DistributedDataParallel (DDP) used to support two modes:
- Single-Process Multi-GPU (SPMG): In this mode, each DDP process replicates the input module to all specified devices and trains on all module replicas. This mode is enabled when the application passes in a device_ids argument that contains multiple devices. If device_ids is not present, DDP will try to use all available devices.
- Multi-Process Single-GPU (MPSG): This is the recommended mode, as it is faster than SPMG. In this mode, each DDP process directly works on the provided module without creating additional replicas. This mode is enabled when device_ids only contains a single device or if there is only one visible device (e.g., by setting CUDA_VISIBLE_DEVICES).
A recent change (#33907) in torch.nn.parallel.replicate breaks DDP’s assumption on replicated modules and leads to failures in the SPMG mode. However, since SPMG is known to be slower due to
GIL contention and additional overhead caused by scattering input and gathering output, we are planning to retire this mode in future releases and make MPSG the only supported mode in DDP. The code below shows an example of the recommended way to
construct DDP.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# use "cuda:1" as the target device
target_device = 1
local_model = torch.nn.Linear(2, 2).to(target_device)
ddp_model = DDP(local_model, device_ids=[target_device])
See #36268 for more discussion.
Tensor.exponential_(0) used to return Inf, now it incorrectly returns 0
Previously, in 1.4, x.exponential_(0) gave a tensor full of inf. On 1.5.0, it wrongly gives a tensor full of zeros.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.randn(3).exponential_(0)`<br>`tensor([inf, inf, inf])` | `>>> torch.randn(3).exponential_(0) # This is wrong!`<br>`tensor([0., 0., 0.])` |
See #36798 for more details.
Backwards Incompatible Changes
Python
Tensor.clone, Tensor.to, Tensor.empty_like, and similar functions preserve stride information instead of returning contiguous tensors
clone, to, type, cuda, cpu, byte, char, double, bool, half, int, long, short, float, bfloat16,
empty_like, full_like, ones_like, zeros_like, rand_like, randn_like, randint_like operators now propagate memory format (roughly, the strides) of the input tensor to the
output tensor.
Since PyTorch operators generally support non-contiguous tensors, this should have no functional effect on most PyTorch programs.
The most common incompatibility with Python programs is with the view operator, which has specific stride requirements. If these requirements are no longer met as a result of this change, you will get an error message indicating that you should
use reshape instead, i.e. “RuntimeError: view size is not compatible with input tensor’s size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(…) instead.”
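The view versus reshape difference can be reproduced with any non-contiguous tensor. The sketch below uses a transposed tensor, which is non-contiguous independently of the 1.5.0 change, purely as an illustrative assumption; the variable names are hypothetical.

```python
import torch

# Transposing yields a non-contiguous tensor that shares storage with base.
base = torch.arange(6).reshape(2, 3)
x = base.t()

try:
    x.view(-1)  # the stride requirements of view are not met
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape falls back to a copy when a view is impossible.
flat = x.reshape(-1)

print(view_failed, flat)
```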
Another possible incompatibility arises if you have a (usually C++) operator implementation that works directly on memory (i.e. calls data_ptr and relies on the strides being contiguous).
In the following example, we go through the implementation of a simple clone operation and see how it needs to change between versions.
// Version 1.4.0
Tensor simple_clone(const Tensor& input) {
  TORCH_CHECK(input.dim() == 1);
  auto output = at::empty_like(input);
  auto input_stride = input.strides()[0];
  auto* output_ptr = output.data_ptr<float>();
  auto* input_ptr = input.data_ptr<float>();
  // Before 1.5.0, the result of `empty_like` is always contiguous.
  for (int64_t idx = 0; idx < input.size(0); idx++) {
    output_ptr[idx] = input_ptr[idx * input_stride];
  }
  return output;
}
// Version 1.5.0
Tensor simple_clone(const Tensor& input) {
  TORCH_CHECK(input.dim() == 1);
  // From 1.5.0 on, the result of `empty_like` may not be contiguous.
  auto output = at::empty_like(input);
  // As a result, we need to keep track of the output stride.
  auto input_stride = input.strides()[0];
  auto output_stride = output.strides()[0];
  auto* output_ptr = output.data_ptr<float>();
  auto* input_ptr = input.data_ptr<float>();
  for (int64_t idx = 0; idx < input.size(0); idx++) {
    output_ptr[idx * output_stride] = input_ptr[idx * input_stride];
  }
  return output;
}
The inferred dtype of np.float_, np.float64 scalars in tensor constructors (e.g. torch.tensor(…), torch.as_tensor(…)) is now torch.float64 instead of the default dtype (usually torch.float32) (#30486).
Please explicitly pass in the desired dtype when constructing tensors with NumPy float64 scalars to get the old behavior.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `# Old behavior: return torch.float32 tensor (by default)`<br>`>>> torch.tensor(np.float64(0))`<br>`tensor(0.)` | `# To keep the old behavior, please explicitly pass the dtype`<br>`>>> torch.tensor(np.float64(0), dtype=torch.get_default_dtype())`<br>`tensor(0.)` |
This change can cause your program to execute in torch.float64, potentially slowing it down, or lead to errors for operators that don’t support torch.float64 or mixed dtypes.
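A quick way to check which dtype a program ends up with is to inspect the constructed tensors directly. The sketch below assumes NumPy is installed alongside PyTorch 1.5.0 or later; the variable names are illustrative.

```python
import numpy as np
import torch

scalar = np.float64(0)

# The dtype follows the NumPy scalar.
inferred = torch.tensor(scalar)

# Passing dtype explicitly restores the default-dtype behavior.
pinned = torch.tensor(scalar, dtype=torch.get_default_dtype())

print(inferred.dtype, pinned.dtype)
```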
numpy integer scalars are now treated as integers for the purposes of type promotion (#30486)
Previously, in 1.4.0, they were mistakenly treated as floats (so for example, torch.ones(3) * np.int64(3) would return a float32 tensor). In 1.5.0, we’ve fixed that behavior; torch.ones(3) * np.int64(3) returns an int32 tensor.
This can cause your code to fail if you performed operations between PyTorch tensors and numpy scalars and then passed the result into an operation that does not support integral types or mixed types. To fix your code, please cast the resulting tensor to the desired dtype.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.ones(3) * np.int64(3)`<br>`tensor([3., 3., 3.])` | `>>> (torch.ones(3) * np.int64(3)).float()`<br>`tensor([3., 3., 3.])` |
torch.autograd.Function: dropped support for old-style Functions (#33956).
In previous versions of PyTorch, there were two ways to write autograd Functions. We deprecated one of them in 1.3.0 and dropped support for it entirely in 1.5.0. Old-style autograd Functions will no longer work in user code.
These Functions can be identified by not having staticmethod forward and backward functions (see the example below). Please see the current documentation for how to write new-style Functions.
# Version 1.4.0
class Exp(torch.autograd.Function):
    def forward(self, i):
        result = i.exp()
        self.save_for_backward(result)
        return result

    def backward(self, grad_output):
        result, = self.saved_tensors
        return grad_output * result

Exp()(torch.tensor(1.))
# Version 1.5.0
class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

Exp.apply(torch.tensor(1.))
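When porting an old-style Function, it can help to confirm that the new backward still matches the forward. One possible check, sketched here as a suggestion rather than a required migration step, is torch.autograd.gradcheck, which compares analytical and numerical gradients. The tolerances below are illustrative assumptions.

```python
import torch
from torch.autograd import gradcheck


class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result


# gradcheck expects double-precision inputs that require grad.
x = torch.randn(4, dtype=torch.double, requires_grad=True)
ok = gradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```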
torch.optim optimizers changed to fix in-place checks for the changes made by the optimizer (#33640, #34211)
If this causes your code to fail, there are two possible reasons:
Reason 1: The value of that parameter was actually saved and used and we were computing incorrect gradients in previous versions of PyTorch. This would result in an error message mentioning incorrect version numbers. You should replace code that uses self.my_param by self.my_param.clone() to make sure the saved version is different from the one that is modified by the optimizer. For example:
Before 1.5.0, the following may have worked.
import torch
from torch import optim

def model(input, target, param):
    return (input * param ** 2 - target).norm()

param = torch.randn(2, requires_grad=True)
input = torch.randn(2)
target = torch.randn(2)
sgd = optim.SGD([param], lr=0.001)
loss = model(input, target, param)
loss.backward(retain_graph=True)
sgd.step()
loss.backward()
param.grad
If after upgrading to 1.5.0, the above fails due to a version counter error, then that means the gradient computed was incorrect. To remedy this, clone param before using it in the model:
def model(input, target, param):
    return (input * param ** 2 - target).norm()

param = torch.randn(2, requires_grad=True)
input = torch.randn(2)
target = torch.randn(2)
sgd = optim.SGD([param], lr=0.001)
# Cloning keeps the tensor saved for backward separate from the one the optimizer updates.
loss = model(input, target, param.clone())
loss.backward(retain_graph=True)
sgd.step()
loss.backward()
param.grad
Reason 2: You know what you’re doing and change the values back to the right thing before the next backward. However, you’re running into an error because the version counter cannot be decremented. Open an issue with your particular use case and we will help you to work around the version counter issue.
utils.cpp_extensions now use ninja as the default compilation backend (#32495)
ninja enables parallel compilation of your C++ extension, greatly speeding up compilation. This change will not break most user code; if you do not have ninja installed, we fall back to the old distutils backend.
However, if you do have ninja installed, it is possible that this change will cause your C++ extension build to fail by oversubscribing your system with too many worker processes. There are two potential workarounds to this.
Method 1: If a previously succeeding python setup.py install now fails, try setting the MAX_JOBS environment variable.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `python setup.py install` | `MAX_JOBS=2 python setup.py install` |
Method 2: Switch back to the old distutils backend inside your setup.py
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `cmdclass={'clean': clean,`<br>`'build_ext': BuildExtension},` | `cmdclass={'clean': clean,`<br>`'build_ext': BuildExtension.with_options(use_ninja=False)},` |
torch.optim.Adam, torch.optim.SGD changed to not modify gradients in-place (#30257)
In previous versions of PyTorch, the Adam and SGD optimizers modified gradients (e.g. param.grad) in-place via in-place addition of param.grad += weight_decay * param. To make this consistent with the behavior of other optimizers and to prevent surprises about the behavior, we’ve changed them to stop modifying gradients in-place.
This should not have an effect on most PyTorch programs unless they relied on this behavior. The easiest way to replicate the old behavior is to create a custom optimizer that implements it.
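A custom optimizer that replicates the old in-place behavior could look roughly like the sketch below. LegacyWeightDecaySGD is a hypothetical name, and the class is a deliberately stripped-down assumption (plain SGD with weight decay only, no momentum, dampening or Nesterov), not the pre-1.5.0 implementation.

```python
import torch
from torch.optim import Optimizer


class LegacyWeightDecaySGD(Optimizer):
    """Hypothetical plain SGD that folds weight decay into param.grad in-place."""

    def __init__(self, params, lr=0.01, weight_decay=0.0):
        super().__init__(params, dict(lr=lr, weight_decay=weight_decay))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                if group["weight_decay"] != 0:
                    # Old behavior: the stored gradient itself is modified.
                    p.grad.add_(p, alpha=group["weight_decay"])
                p.add_(p.grad, alpha=-group["lr"])


# Usage: the gradient of (2 * p).sum() is 2 for every element.
p = torch.ones(2, requires_grad=True)
(p * 2).sum().backward()
opt = LegacyWeightDecaySGD([p], lr=0.1, weight_decay=0.5)
opt.step()
print(p.grad)  # gradient now includes the weight decay term
print(p)
```

After the step, p.grad holds 2 + 0.5 * 1 = 2.5 per element, which is exactly the side effect that the built-in optimizers no longer produce.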
torch.masked_select now always returns a 1D tensor (#29923)
The behavior of torch.masked_select when both “self” and “mask” are 0-dimensional was changed. In previous versions of PyTorch, this would return a 0-dimensional tensor. Now, we return a 1-dimensional tensor to be consistent with other input sizes and our documentation.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.masked_select(torch.tensor(0), torch.tensor(True))`<br>`tensor(0)` | `>>> torch.masked_select(torch.tensor(0), torch.tensor(True))`<br>`tensor([0])` |
torch.index_select on a 0-d tensor now returns a 0-d tensor. (#30790)
In previous versions of PyTorch, the output of torch.index_select on a 0D input tensor produced a 1D tensor. This was inconsistent with our documentation on it, which stated “The returned tensor has the same number of dimensions as the original tensor (input).” Now, we return a 0D tensor.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| `>>> torch.index_select(torch.tensor(5), 0, torch.tensor([0]))`<br>`tensor([5])` | `>>> torch.index_select(torch.tensor(5), 0, torch.tensor([0]))`<br>`tensor(5)` |
nn.MultiLabelMarginLoss: ‘none’ reduction on 1D tensor now returns a 0D tensor (#30768)
In previous versions of PyTorch, the output of nn.MultiLabelMarginLoss on 1D and 0D tensors incorrectly produced 1-D tensors. Now, those cases return a 0D tensor to be consistent with the 2-D tensor case.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| >>> nn.MultiLabelMarginLoss(reduction='none')(torch.randn(3), torch.zeros(3, dtype=torch.long)) tensor([0.2959]) | >>> nn.MultiLabelMarginLoss(reduction='none')(torch.randn(3), torch.zeros(3, dtype=torch.long)) tensor(0.2959) |
nn.MultiMarginLoss: ‘none’ reduction on 1D target now returns a 1D tensor (#30826)
In previous versions of PyTorch, the output of nn.MultiMarginLoss on a 1D target tensor produced a 0D output. We changed this to return a 1D tensor to make it consistent with other input sizes, which return an output
that matches the target shape.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| >>> nn.MultiMarginLoss(reduction='none')(torch.tensor([1.]), torch.tensor([0])) tensor(0.) | >>> nn.MultiMarginLoss(reduction='none')(torch.tensor([1.]), torch.tensor([0])) tensor([0.]) |
Tensor.exponential_(lambda) no longer supports lambda < 0 (#32501)
lambda, the rate parameter of the exponential distribution, mathematically should be greater than 0. We’ve disabled support for lambda < 0 to be mathematically correct; most users will not have used a lambda less than zero.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| tensor = torch.empty(3).exponential_(-1.5) | # Negative lambda not supported! |
nn.BCELoss, nn.functional.binary_cross_entropy no longer accept inputs with the same number of elements that are not broadcastable (#31365)
Previously, we supported accepting inputs with the same number of elements. However, this behavior was deprecated and we removed it in 1.5.0. In order to replicate the old behavior, please explicitly reshape your input and target tensors
to have the same shape.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| >>> input = torch.rand(3, 3) >>> target = torch.randn(9) >>> torch.nn.functional.binary_cross_entropy(input, target) | >>> input = torch.rand(3, 3) >>> target = torch.randn(9) >>> torch.nn.functional.binary_cross_entropy(input, target.reshape_as(input)) |
torch.normal out argument is now required to have the same size as the computed output (#32031)
Previously, on CPU devices, torch.normal(mean, std, out=out) would resize out to the correct size. To be consistent with the CUDA implementation, we’ve changed it so that out must either already have the correct size,
or be an empty tensor with size [0]. To work around this, please ensure that your out tensor has the correct size.
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| >>> torch.normal(torch.zeros(3), torch.ones(3), out=torch.randn(2)) tensor([ 0.0300, 0.7830, -1.3579]) | >>> torch.normal(torch.zeros(3), torch.ones(3), out=torch.randn(2)) RuntimeError: inconsistent tensor, output size ([2]) is not the same as broadcasted mean and std size (3) |
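The two accepted forms of out described above can be sketched as follows. The variable names are illustrative only.

```python
import torch

mean = torch.zeros(3)
std = torch.ones(3)

# Option 1: out already has the broadcasted size of mean and std.
out_sized = torch.empty(3)
result_sized = torch.normal(mean, std, out=out_sized)

# Option 2: out is an empty tensor of size [0]; it is resized to fit.
out_empty = torch.empty(0)
result_empty = torch.normal(mean, std, out=out_empty)
```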
Tensor.geometric_ no longer supports integral Tensors (#31878)
Previously, on CPU devices, Tensor.geometric_ supported Tensors with integral dtype. Now, it only supports floating point. We removed support for this because it doesn’t make sense for geometric_ to operate on integral dtypes.
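If existing code filled an integral tensor with geometric_, one possible migration (a sketch, not the only option) is to sample into a floating-point tensor and convert afterwards:

```python
import torch

# Sample into a floating-point tensor, which is the supported case in 1.5.
samples = torch.empty(5, dtype=torch.float32).geometric_(0.5)

# Convert to an integral dtype afterwards if downstream code expects integers.
samples_int = samples.to(torch.int64)
```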
Changed torch.floor_divide input positional argument name to self (#34552)
Before PyTorch 1.5, torch.floor_divide took two positional arguments: torch.floor_divide(input, other). We’ve changed the name of the input argument to self; this will break code that called torch.floor_divide via
keyword argument. For example:
| Version 1.4.0 | Version 1.5.0 |
|---|---|
| torch.floor_divide(input=x, other=y) | # Either of the following works. torch.floor_divide(self=x, other=y) torch.floor_divide(x, y) |
C++ API
RNN / GRU / LSTM layers (#34322)
- Instead of returning RNNOutput, RNN / GRU forward method now returns std::tuple<Tensor, Tensor>, and LSTM forward method now returns std::tuple<Tensor, std::tuple<Tensor, Tensor>>, matching Python API.
- LSTM forward method’s hidden state parameter now has type torch::optional<std::tuple<Tensor, Tensor>>, matching Python API.
- RNN / LSTM / GRU layers now have forward_with_packed_input method which accepts PackedSequence as input and optionally hidden state, matching the forward(PackedSequence, ...) variant in Python API.
- RNN / LSTM / GRU layers no longer have these fields: w_ih / w_hh / b_ih / b_hh. Instead, to access the weights and biases of the gates, users should do e.g. rnn->named_parameters()["weight_ih_l0"], which mirrors the Python API rnn.weight_ih_l0.
- In RNNOptions:
  - tanh() / relu() / activation are removed. Instead, nonlinearity is added which takes either torch::kTanh or torch::kReLU
  - layers is renamed to num_layers
  - with_bias is renamed to bias
- In LSTMOptions:
  - layers is renamed to num_layers
  - with_bias is renamed to bias
- In GRUOptions:
  - layers is renamed to num_layers
  - with_bias is renamed to bias
Upsample layer / F::interpolate function (#35025)
- There are changes to UpsampleOptions and InterpolateFuncOptions:
  - size is changed from std::vector<int64_t> to c10::optional<std::vector<int64_t>>. If you want to pass a list of int64_t to this argument, you must pass it as std::vector<int64_t>.
  - scale_factor is changed from std::vector<double> to c10::optional<std::vector<double>>. If you want to pass a list of double to this argument, you must pass it as std::vector<double>.
- F::multilabel_margin_loss / F::multilabel_soft_margin_loss functions (#35163)
  - MultiLabelMarginLossFuncOptions is renamed to MultilabelMarginLossFuncOptions
  - MultiLabelSoftMarginLossFuncOptions is renamed to MultilabelSoftMarginLossFuncOptions
- The deprecated torch::nn::BatchNorm is removed in favor of torch::nn::BatchNorm{1,2,3}d
- The deprecated torch::nn::FeatureDropout is removed in favor of torch::nn::Dropout{2,3}d
- The deprecated torch::nn::modules_ordered_dict is removed. User should do Sequential sequential({{"m1", MyModule(1)}, {"m2", MyModule(2)}}) instead.
- The deprecated torch::nn::init::Nonlinearity is removed, in favor of these enums: torch::kLinear / torch::kConv1D / torch::kConv2D / torch::kConv3D / torch::kConvTranspose1D / torch::kConvTranspose2D / torch::kConvTranspose3D / torch::kSigmoid / torch::kTanh / torch::kReLU / torch::kLeakyReLU
- The deprecated torch::nn::init::FanMode is removed, in favor of these enums: torch::kFanIn / torch::kFanOut
Optimizers
- Optimizer::step now accepts closure function as optional input and returns a tensor, and LossClosureOptimizer is removed (#34790) (#34957). If you had a custom optimizer class defined as:
struct MyOptimizer : Optimizer {
  using Optimizer::Optimizer;
  void step() override {...}
};
you would need to update your optimizer class definition as follows:
struct MyOptimizer : Optimizer {
  using Optimizer::Optimizer;
  torch::Tensor step(LossClosure closure = nullptr) override {
    ...
    // return `torch::Tensor()` if `closure` is nullptr
    // (i.e. we are not computing the loss)
    return torch::Tensor();
  }
};
- Adagrad (#29335)
  - In AdagradOptions, learning_rate is renamed to lr.
  - In Adagrad, sum_buffers and step_buffers are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:
auto& param_state = static_cast<AdagradParamState&>(
  *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.sum()
// param_state.step()
- SGD (#32592)
  - In SGDOptions, learning_rate is renamed to lr.
  - In SGD, momentum_buffers is now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:
auto& param_state = static_cast<SGDParamState&>(
  *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.momentum_buffer()
- Adam (#33730)
  - In AdamOptions:
    - learning_rate is renamed to lr
    - beta1 and beta2 are replaced by a tuple betas
  - In Adam, step_buffers, exp_average_buffers, exp_average_sq_buffers and max_exp_average_sq_buffers are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:
auto& param_state = static_cast<AdamParamState&>(
  *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.step()
// param_state.exp_avg()
// param_state.exp_avg_sq()
// param_state.max_exp_avg_sq()
- RMSprop (#33450)
  - In RMSpropOptions:
    - learning_rate is renamed to lr
  - In RMSprop, square_average_buffers, momentum_buffers and grad_average_buffers are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:
auto& param_state = static_cast<RMSpropParamState&>(
  *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.square_avg()
// param_state.momentum_buffer()
// param_state.grad_avg()
- LBFGS (#34564) (#34957)
  - In LBFGSOptions:
    - learning_rate is renamed to lr
    - max_eval’s type is changed from int64_t to c10::optional<int64_t>
    - tolerance_grad’s type is changed from float to double
    - tolerance_change’s type is changed from float to double
    - history_size’s type is changed from size_t to int64_t
  - In LBFGS, d, H_diag, prev_flat_grad, t, prev_loss, ro, al, old_dirs, old_stps, func_evals and state_n_iter are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:
auto& param_state = static_cast<LBFGSParamState&>(
  *optimizer.state()[c10::guts::to_string(parameter.unsafeGetTensorImpl())]);
// Use the following to access parameter state:
//
// param_state.d()
// param_state.H_diag()
// param_state.prev_flat_grad()
// param_state.t()
// param_state.prev_loss()
// param_state.ro()
// param_state.al()
// param_state.old_dirs()
// param_state.old_stps()
// param_state.func_evals()
// param_state.n_iter()
Removed AutoGIL/AutoNoGIL in favor of pybind11::gil_scoped_* functions (#34301)
If your code released or acquired the GIL via AutoNoGIL or AutoGIL, please change the invocations to pybind11::gil_scoped_release or pybind11::gil_scoped_acquire, respectively.
Others
- torch::tensor(floating-point values) will always produce tensor of default dtype, and torch::tensor(integer values) will always produce tensor of torch::kLong dtype, matching Python API behavior (#32367).
- torch::Tensor::base() is renamed to torch::Tensor::_base(), matching Python API. (#33316)
- Renamed TensorTypeId to DispatchKey (#32154)
- Throw an error if nbytes is called on a sparse tensor. (#33897)
JIT
Simple Executor Is Now On By Default
The simple executor skips a number of fusion-related passes and analyses that are very time-consuming. Disabling these optimizations fixes pathologically long compilation times. Users who rely on GPU fusion for their desired performance profile should turn on the profiling executor. We provide C++ and Python APIs to enable the profiling executor:
- In Python, call torch._C._jit_set_profiling_mode(True) before you call your model for the first time.
- In C++, add #include <torch/csrc/jit/runtime/graph_executor.h> and set getProfilingMode() = true before you invoke your model for the first time.
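A minimal Python sketch of the first option follows. torch._C._jit_set_profiling_mode is a private, version-dependent switch taken from the note above; the traced function scaled_sum is a hypothetical stand-in for a real model.

```python
import torch

# Enable the profiling executor before the compiled function is first run.
# This is a private API and may change between releases.
torch._C._jit_set_profiling_mode(True)


def scaled_sum(x, y):
    # Hypothetical elementwise workload standing in for a real model.
    return (x * 2.0 + y).relu()


traced = torch.jit.trace(scaled_sum, (torch.ones(4), torch.ones(4)))
result = traced(torch.ones(4), torch.ones(4))
```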
Quantization
Remove qconfig_dict in top level eager mode quantization API (#31972).
In eager mode quantization, one needs to manually insert quant and dequant stubs in a model to specify where activations are quantized. Having a qconfig_dict that specifies the quantization configuration for each module is not useful as one needs to manually modify the model with quant/dequant stubs. The new API makes it explicit that the model needs to be manually modified for quantization.
# previously qconfig_dict was an optional argument to prepare
def prepare(model, qconfig_dict=None, inplace=False):
# now replaced with
def prepare(model, inplace=False):
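The manual workflow described above might look roughly like this. SmallNet is a hypothetical model, and attaching the configuration through the model's qconfig attribute with torch.quantization.default_qconfig is one common choice rather than the only one.

```python
import torch
import torch.nn as nn
import torch.quantization as tq


class SmallNet(nn.Module):
    """Hypothetical model with manually inserted quant/dequant stubs."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # marks where activations become quantized
        self.fc = nn.Linear(4, 2)
        self.dequant = tq.DeQuantStub()  # marks where they return to float

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        return self.dequant(x)


model = SmallNet().eval()
model.qconfig = tq.default_qconfig           # configuration lives on the model
prepared = tq.prepare(model, inplace=False)  # no qconfig_dict argument
out = prepared(torch.randn(3, 4))            # calibration pass through the observers
```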
RPC
Functional API for Distributed Autograd and Distributed Optimizer
More specifically, callers must pass context_id to torch.distributed.autograd.backward() and to the step() method of torch.distributed.optim.DistributedOptimizer. (#33711)
# Before
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer
with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()
    # Backward pass.
    dist_autograd.backward([loss.sum()])
    # Optimizer.
    dist_optim = DistributedOptimizer(
        optim.SGD,
        [rref1, rref2],
        lr=0.05,
    )
# After
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer
with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()
    # Backward pass.
    dist_autograd.backward(context_id, [loss.sum()])
    # Optimizer.
    dist_optim = DistributedOptimizer(
        optim.SGD,
        [rref1, rref2],
        lr=0.05,
    )
    dist_optim.step(context_id)
Disallow sending CUDA tensors over RPC
The motivation is to prevent potential invalid device errors when the number of devices on the sender and the receiver does not match. However, applications can always move CUDA tensors to CPU before sending (#33604).
# Version 1.4.0
import torch
import torch.distributed.rpc as rpc
rpc.init_rpc("worker0", rank=0, world_size=2)
x = torch.zeros(2, device=0)
ret = rpc.rpc_sync("worker1", torch.add, args=(x, 3))
rpc.shutdown()
# Version 1.5.0
import torch
import torch.distributed.rpc as rpc
rpc.init_rpc("worker0", rank=0, world_size=2)
x = torch.zeros(2, device=0)
ret = rpc.rpc_sync("worker1", torch.add, args=(x.cpu(), 3))
rpc.shutdown()
New Features
Python
Added new functional autograd API (#34066)
- See Highlights for more details
New __torch_function__ API Override Mechanism (#30730, #32194, #32799, #34240, #34303).
We introduced __torch_function__, an API override mechanism for subclassing torch.Tensor in Python. This is useful for creating custom objects that implement the torch.* APIs. It currently supports overriding most torch.*
and torch.nn.functional APIs; we’ve also planned future support for subclassing torch.Tensor (see tracking issue #22402).
New Operators
- torch.logical_and and torch.logical_or operations added (#30521).
- torch.square added (#30719).
- torch.bitwise_and added (#31104).
- torch.cummax, torch.cummin added (#32169, #32238, #32537, #33492).
- torch.floor_divide, Tensor.floor_divide added (#30493, #34552).
- torch.true_divide, Tensor.true_divide added, analogous to Python’s and NumPy’s (true) division (#34236, #34794)
- nn.functional.hardsigmoid added (#34545).
- Added PCA and SVD for low-rank matrices (torch.pca_lowrank, torch.svd_lowrank), torch.lobpcg for positive-definite generalized eigenvalue problem (#34721).
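A short sketch exercising several of the operators listed above. The inputs are arbitrary illustrative values, and the integer division example uses positive operands only.

```python
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
both = torch.logical_and(a, b)      # elementwise AND
either = torch.logical_or(a, b)     # elementwise OR

x = torch.tensor([1.0, 3.0, 2.0, 5.0, 4.0])
squared = torch.square(x)
# cummax returns the running maximum and the index where it was attained.
running_max, running_argmax = torch.cummax(x, dim=0)

n = torch.tensor([7, 8, 9])
d = torch.tensor([2, 2, 2])
floored = torch.floor_divide(n, d)  # integer result for integer inputs
true_q = torch.true_divide(n, d)    # always a floating-point result

masked = torch.bitwise_and(torch.tensor([6, 5]), torch.tensor([3, 4]))
```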
Distributions
- distributions.von_mises added (#33418).
- distributions.mixture_same_family: Added support for mixture distributions (#22742, #33408).
- distributions.transforms.TanhTransform added (#19785).
- distributions.continuous_bernoulli added (#34619).
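As an illustration of two of the additions above, the sketch below builds a two-component Gaussian mixture and a von Mises distribution. The parameter values are arbitrary.

```python
import torch
from torch.distributions import Categorical, Normal
from torch.distributions.mixture_same_family import MixtureSameFamily
from torch.distributions.von_mises import VonMises

# Two-component 1D Gaussian mixture with weights 0.3 and 0.7.
mixing = Categorical(probs=torch.tensor([0.3, 0.7]))
components = Normal(loc=torch.tensor([-2.0, 2.0]), scale=torch.tensor([0.5, 0.5]))
gmm = MixtureSameFamily(mixing, components)

samples = gmm.sample((1000,))
log_p = gmm.log_prob(torch.tensor(0.0))
mixture_mean = gmm.mean  # weighted mean of the component means

# Circular distribution centred at angle 0.
vm = VonMises(loc=torch.tensor(0.0), concentration=torch.tensor(4.0))
angles = vm.sample((5,))
```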
C++ API
- NN modules / functionals
- C++ tensor indexing (#30424, #32841, #30427, #34255)
  - Please see docs: https://pytorch.org/cppdocs/notes/tensor_indexing.html
- Operators
  - C++ API parity: isinf (#31099).
- Autograd
  - Add at::Tensor::retain_grad API (#33349).
- C++ extensions
Distributed
- Allows Python application to create subclass of C++ c10d.Store using pybind11 trampoline class (#30415).
Mobile
Quantization
- qnnpack TanH (#31013).
- Adding quantized clamp kernel (#30541).
- Quantized H Tangent function (#31031).
- QNNPACK: Add support for dynamic quantization. (#31896).
- Add operator support for dynamic quant on mobile (#32479).
- Adding native qconcat (#32252).
- FP16 dynamic quantized Linear (#32331).
- Add support for Dynamic LSTM quantization on Mobile (#32757).
- Quantized sigmoid function (#31851).
- Quantized leaky relu (#33004).
- Add a quantized batch_norm operator (#33080).
- Add Quantized BatchNorm2d module (#33109).
- Add the 3d avg pool for video related model (#33339).
- Add quantized_hardtanh (#34097).
- Add quantized ELU activation (#34267).
- Add the 3d upsample quantized op for video model (#34594).
- Add the quantized batch_norm3d and also batch_norm3d fused with relu operators (#34702).
- Add quantized implementation of hard sigmoid (#34607).
RPC
- [Experimental] Enable autograd profiler to work with RPC (#31381, #34398, #30677, #31346, #31380).
- [Experimental] Allow calling remote TorchScript functions using RPC (#32466, #33190, #32990, #32959, #33526, #33992, #33582, #32197, #33329, #34183).
Improvements
AMD/ROCm
- nn.RNN: Ensure MIOpen is called on same stream as operator (#30672)
- Fixed asserts in CUDA kernels (#31276, #31297).
- Enable BFloat16 support for convolutions (#30948).
- Abstract atomic add calls (#31992).
- Install complete set of headers for ROCm build (#32076).
- Adjust elementwise_kernel settings on ROCm (#32609).
- nn.BatchNorm{1,2,3}d: Use C10_WARP_SIZE to fix functionality on HIP vs CUDA for gradient computation (#33098).
- Enabled Bfloat16 type for activation functions and batch_norm (#32065).
- Added ability to enable/disable MIOpen at runtime (#33118).
- Enable BFloat16 type for pooling ops (#34166).
- torch.pdist: improved precision by enabling double __shfl_down (#34103).
- Enabled BFloat16 type for loss functions and few misc ops required for resnet50 (#34469).
- Enabled BFloat16 type for EmbeddingBag, Index, and Sigmoid ops (#34630).
- Enabled 3D batch norms through MIOpen (#33262).
- Enabled 3D convolutions through ROCm (#33067).
- nn.RNN: Check if weights need to be flattened (#34265).
C++ API
- NN modules / functionals
  - Allow skipping default arguments in module’s forward method when module is used in torch::nn::Sequential (#33027) (#33718)
  - Make torch::nn::Sequential::push_back(AnyModule) methods public (#34208).
  - Refactor RNN / GRU / LSTM layers to match Python API (#34322).
  - For Conv{1,2,3}d, padding_mode now accepts torch::kZeros / torch::kReflect / torch::kReplicate / torch::kCircular, matching Python API behavior. (#35023)
  - Fix F::interpolate and torch::nn::Upsample implementation to match Python API behavior (#35025) (#36274)
  - Renaming: MultiLabelMarginLossFuncOptions -> MultilabelMarginLossFuncOptions, MultiLabelSoftMarginLossFuncOptions -> MultilabelSoftMarginLossFuncOptions (#35163)
- Optimizers
  - All existing optimizers in the C++ API (Adagrad / SGD / Adam / RMSprop / LBFGS) have the following changes to achieve parity with the Python API: (#29335) (#30739) (#32592) (#33730) (#33450) (#34790) (#34564) (#34957) (#35001) (#36033) (#36245)
    - step function implementation is changed to behave the same as Python equivalent
    - Constructor now accepts std::vector<OptimizerParamGroup> as input
    - optimizer.add_param_group(...) can be used to add parameter group to an existing optimizer
    - optimizer.state() should be used to access parameter state
- autograd
  - Renamed at::Tensor::base() to _base(), matching Python API (#33316)
Distributed
- Allow TCPStore to pick a port to bind to (#31674).
- Enhance NCCL watchdog to actively abort communicators for timed out ops (#32338).
- Adding DDP Design Note (#32158).
- Recommend using DDP over DataParallel (#35063)
Distributions
- distributions.independent: added explicit string representation (#33676).
- categorical.sample: Reduced memory overhead (#34900).
- distributions.MultivariateNormal: improved numeric stability and performance (#32092).
Mobile
- Add module level qpl logging. (#30906).
- Expose setNumThreads to android api (#31033).
- remove unused SparseCPUType from mobile build (#33517).
- make sure mobile build work with dynamic dispatch (#34038).
- support for custom mobile build with dynamic dispatch (#34055).
- Add watchOS support (#33318).
- speed_benchmark_torch switch to log latency from dataset level to row level (#34598).
ONNX
Exporting More Torch Operators to ONNX
In PyTorch 1.5, we have added support for 10 additional operators and also enhanced support for another set of 10+ existing operators. We have also added support for exporting large models (> 2GB) to ONNX. Additionally, we have made enhancements and optimizations to the export of ScriptModules and will continue to do that in the next release. We have also made improvements to the custom op export experience.
- Export dynamic unbind, split and getitem (#29136).
- Export torch.new_zeros (#34077).
- Export Im2col (#30972).
- Export bitwise_not for bool (#28439).
- Export logsoftmax with dim != -1 (#30433).
- Export einsum (#32716).
- Export aten::copy_ and aten::index_put to ONNX opset 11 (#26941).
- Export floor_divide (#31081).
- Export one_hot (#34454).
- Export torch.take (#33061).
- Export bool type index mask (#32445).
- Export split with list of sizes (#33161).
- Export scalar tensor for split (#32493).
- Export flatten to accept negative indices in opset 11 (#30751).
- Export sort with negative axes (#31971).
- Export Interpolate to support scale (#28324, #31526, #32554).
- Export quantized concat (#30887).
Enhancing the Support for ScriptModule
- Fixed access to element in size tensor for scripting (#32652).
- Export Conv in TorchScript module (#30618).
- Export Dim operation in TorchScript module (#31928).
- Export randnlike in TorchScript module (#32830).
- Partially support tensor lists in loop/concat/stack (#30126)
Enhancing Existing Export Logic
- Updating ONNX checker logic. (#33522).
- Adding ONNX large model export support in exporter (#33062).
- Extend op registration (#32943).
- Support op registration if name starts with underscore (#32017).
Optimizing Exported ONNX Graph
- Try exporting ONNX with force_outplace=False (#29466).
- Enable constant folding (#29834).
- Added constant folding for ONNX mul, div, sqrt ops (#32077).
- Enable constant folding for Reshape (#31054).
Adding Utility Functions and Refactoring
- Added ONNX model checker to ONNX export (#32298).
- Export custom ops (#29752).
- Upgrade exported ONNX IR version to 6 (#31025).
- Provide names for operator nodes in ONNX exported graph (#27342).
- Update ONNX landing page since 1.3 (#32805).
- Turn ONNX_ML into a proper build option (#33424).
Operator Benchmark
- Added small input shapes to test operator overhead (#30617).
- Added binary_test to benchmark binary ops (#31326).
- Added Tensor.copy_ operator (#31327).
- Removed option to wipe cache because it did not help with variance (#31334).
- Added torch.diag (#32597).
Quantization
- Guard against copying from quantized Tensor to non-quantized Tensor (#29660).
- Add assert for min, max, qmin, qmax for ChooseQuantizationParams (#32739).
- Support broadcast for quantized mul kernel (#30442).
- Make FakeQuant use REGISTER_DISPATCH (#33682).
- Set alias analysis kind to FROM_SCHEMA for qadd, qmul, qclamp, qconcat (#33359).
- Migrate fake_quant_slice to TensorIterator (#33744).
- Parallelize quantize and dequantize (#33765).
- Make FP16 RNN use new prepack op (#34339).
- Refactor QAT Conv module for better extensibility (#30362).
- Use non-inplace for insert observer pass (#34190).
RPC
- Add default arguments for init_method (#30208).
- By default ignore RRef leaks during shutdown (#30217).
- Robustify rpc_agent handlers with generic Future (#31224).
- Fix error message in incorrect rref.localValue() call (#31199).
- Add RpcAgent::getWorkerInfos() API to return all WorkerInfos in the group (#30241).
- Add local shutdown to process group agent (#30330).
- Add RRef.str() API to return a string representation of the RRef (#30609).
- Adding Debug Info for RRef Context (#30610).
- Add get_metrics and get_debug_info to RPC agent (#30833).
- Adding debugging metrics to process group agent (#30884).
- Add glue code to collect debug info from all components (#30888).
- Make RRef leak detection always print a warning log (#31922).
- Allow multiple backward passes to accumulate gradients. (#32506).
- Allow RRef local creation with IValue objects (#33263).
- Improve ProcessGroup RpcBackendOptions Constructor API (#34081).
- Enhanced Error Reporting in Dist Autograd/RPC (#34179).
- Delete all user forks tracked in RRefContext before graceful shutdown (#31893).
- Best-effort Error Detection for Using Deleted UserRRefs (#34673).
- Don’t run user function until all UserRRefs in the args are confirmed (#34497).
- Support using self as the destination in rpc.remote for builtin operators (#34931).
- Add debug info API for distributed autograd. (#30642).
- Propagate errors in clearAndWaitForOutstandingRpcsAsync. (#32952).
Type Hints
- DataLoader default_collate type hint added (#28935).
- Tensor.rsub, Tensor.rpow, Tensor.rtruediv, Tensor.map_ type hints were added (#30576).
- torch.optim: added more missing type hints (#31130).
- nn.functional.grid_sample, nn.functional.affine_grid: added missing align_corners annotation (#32492).
- torch.nn.Parameter constructor type hint was fixed (#32617).
- nn.MultiheadAttention, nn.Transformer: added type hints (#28396).
- torch.optim.LambdaLR constructor type hint was fixed (#33271).
- torch.optim: added missing default value for LRScheduler.step() (#32411).
- Make type of Tensor.type() more specific (#32353).
- torch.optim.optimizer.Optimizer type hints were fixed (#32900).
- optim.AdamW type hints were fixed (#34299).
- torch.utils.data.Sampler subclasses type hints were added (#33679).
- nn.Sequential, nn.ModuleList, nn.ParameterList, nn.ParameterDict type hints were fixed (#33686).
- Tensor.bfloat16() type hint was added (#33747).
- Binary operator type hints were fixed (#33748).
- torch.bfloat16, nn.Module.training, Tensor.cuda, and 10s of other type hints added (#33762).
- torch.add type hint was fixed (#33935).
- Tensor.shape type hint was fixed (#34595).
- Fixed utils.data imports (#33543).
- Tensor.__radd__ type hint was fixed (#35231)
Other
- autograd.detect_anomaly: added support for Sparse Tensors (#29803).
- autograd.detect_anomaly: Error messages now print the current Node name (#33875).
- autograd.profiler: added better error message when crashing while profiling multi-worker DataLoader (#31473).
- autograd.profiler: Enable using torch.autograd.profiler.record_function as decorator (#30861).
- autograd.profiler: Speed up export_chrome_trace by up to 4x (#30724).
- torch.autograd: added better error message when attempting to fork (#33885).
- torch.cuda.memory.caching_allocator_alloc, torch.cuda.memory.caching_allocator_delete exposed in Python API (#33860).
- torch.roll: added bool tensor support (#31194).
- torch.flip: added support for bool tensors (#31267).
- torch.equal: added support for bfloat16 CPU scalar types (#30817).
- torch.save, torch.load: added error message for minimum dill version support (#30985).
- torch.diagonal: added named tensor support (#30193).
- torch.linspace: added support for integral types on CPU (#32218).
- torch.eig: Added autograd support in the case where eigenvalues are real (#33090).
- torch.mvlgamma: improved error message (#32665).
- torch.no_grad, torch.enable_grad: added support for decorating generator functions (#31792).
- torch.narrow: added Tensor overload for start (#34317).
- Tensor.random_: enabled support for half on CPU (#34030).
- Tensor.grad: added warnings when accessing it if it won’t be populated for known reasons (#30531).
- torch.cuda.comm.gather: improved error message (#27456).
- nn.functional.max_pool{1,2,3}d: added named tensor support (#31669).
- nn.Module.load_state_dict: Include the contents of the exception in error messages (#32693).
- nn.MultiheadAttention: add support for 3D attention mask (#31996).
- nn.MSELoss: Added performance warning for using CPU Half (#33021).
- nn.ModuleList, nn.ParameterDict, nn.ParameterDict: added more descriptive error messages when attempting to call these like Modules (#29991).
- nn.init.dirac_: Added groups option for compatibility with initializing group convolutions (#32825).
- Added error message to indicate that reduction operations are not supported for dim >= 64 (#31476).
- Type Promotion: added support for sparse tensors and arithmetic operations (#30429).
- Enabled indexing for bfloat16 tensors (#31692).
- Add 64-bit indexing support for CUDA Tensors (#33405).
- Added warning when converting a read-only NumPy array to torch.Tensor (#33615).
- Set rpath for JNI library on Mac (#32247).
- Updated MAGMA to 2.5.2 for Windows (#30513, #34205).
- Marked PyTorch incompatible with Python-3.6.0 (#34724).
- Consider hub_dir alongside TORCH_HOME env variable for storing hub models (#32844).
- Improved dll loading logic on Windows (#33856).
- Error out if legacy Tensor.new is called on alternate layouts or dtypes (#31485).
- utils.checkpoint.checkpoint_sequential: Removed deprecated variadic arguments behavior (#25985).
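Two of the items above (decorating generator functions with torch.no_grad, and passing a Tensor as start to torch.narrow) can be sketched as follows. The function and variable names are illustrative.

```python
import torch


@torch.no_grad()
def doubled(values):
    # The generator body runs with gradient tracking disabled.
    for v in values:
        yield v * 2


w = torch.ones(2, requires_grad=True)
outs = list(doubled([w, w + 1]))

# torch.narrow accepts a 0-dimensional integral Tensor as `start`.
x = torch.arange(5)
start = torch.tensor(1)
window = torch.narrow(x, 0, start, 2)
```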
Bug Fixes
C++ API
- NN modules / functionals
  - output_ratio for FractionalMaxPool{2,3}d module and fractional_max_pool{2,3}d functional should accept double as data type (#33304)
  - For AdaptiveAvgPool{2,3}d and AdaptiveMaxPool{2,3}d, output_size is changed to accept c10::nullopt in its elements, matching Python API behavior. (#35022)
  - Fix bug in fractional_max_pool3d_with_indices implementation (#35024)
  - Remove namespace F = torch::nn::functional from torch/nn/modules/batchnorm.h, so that people don’t have to use F to alias torch::nn::functional if they don’t want to (#30684)
- autograd
  - For AutogradContext, get_dirty() is removed and get_and_bump_dirty() is added, and the latter always bumps the version counter of the returned tensors (#33068)
  - Fix allow_unused checking for C++ API (#34035)
  - Remove using namespace torch::autograd from torch/csrc/api/include/torch/nn/modules/_functions.h (#34423)
- Operators
  - torch::tensor(floating-point values) will always produce tensor of default dtype, and torch::tensor(integer values) will always produce tensor of torch::kLong dtype, matching Python API behavior (#32367)
  - Fix torch::allclose to handle std::numeric_limits::lowest() for integral types (#32978)
  - Switch torch::empty_like to use merge_in to process TensorOptions (#33505)
Distributed
- Allow DDP to detect globally unused parameters (#28883).
- Accept url query when rank or world_size is specified in Process Group init_method URL (#32016).
- Add ability to abort NCCL communicators from the store. (#32895).
- Fix timeout support when initializing process group with TCP store (#33434).
- Abort NCCL communicators before throwing operation timed out (#31128).
- Fix logging for aborted communicators in ProcessGroupNCCL (#33147).
- Fix handling of replica parameters in DataParallel (#33907).
- Specify requires_grad for Parameter replica so it’s not always set to True by default (#32356)
- Put sparse allreduce results to input tensors (#32226)
- Issue a warning when zero_grad is used in DataParallel (#33064)
JIT
- TorchScript compilation fixed for (#33783):
torch.stfttorch.lu,torch.lu_unpacktorch.cdisttorch.norm
tensor.tolist()compilation now supported, requires output type annotation (#33472)
def foo(float_matrix, scalar_ten):
# type: (Tensor, Tensor) -> Tuple[List[List[float]], bool]
out1 : List[List[float]] = float_matrix.tolist()
out2 = torch.jit.annotate(bool, scalar_ten.tolist())
return out1, out2torch.rand_likeand other_likeconstructors no longer require additional arguments in TorchScript- Compilation for
nn.ModuleAPIs added (#29495):childrennamed_childrenmodulesnamed_modules
- Support for ModuleList Indexing with Integer Literal (#29236)
- Fixed flipped outputs for PackedSequence (#32955)
- Support index and type properties on Device (#32953): device.index, device.type
- Add remaining Tensor properties (#33906): tensor.ndim, tensor.T, tensor.name, tensor.is_leaf
- Fix augmented assignment to non-tensor attributes (#32993)
- Fixed type resolution for function arguments (#29623)
- Previously we resolved types by parsing their names directly, but now TorchScript uses the value of the type directly from Python
- This allows types like torch.device to be used
- len on tuples containing different types (#35768)
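As a rough illustration of the Tensor and Device properties listed above, the sketch below scripts a small function that reads a few of them. The function name describe is hypothetical, and the snippet assumes a PyTorch build in which these properties remain scriptable.

```python
from typing import Tuple

import torch


@torch.jit.script
def describe(x: torch.Tensor) -> Tuple[int, str, int]:
    # tensor.ndim and device.type are read inside TorchScript;
    # tensor.T transposes the 2-D input before its first size is taken.
    return x.ndim, x.device.type, x.T.size(0)


# A 2x3 CPU tensor: two dimensions, and three rows after transposing.
ndim, device_type, transposed_rows = describe(torch.zeros(2, 3))
```

The same function body also runs in eager mode, which can help when checking whether a failure comes from TorchScript compilation or from the code itself.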
Mobile
- Fix exception message in Java Tensor (#30205).
- Fix the crashes for c++ not able to find java class through Jni (#30390).
- Add @DoNotStrip to nativeNewTensor method. (#30472).
- GenericDict/List type use unshapedType() (#30428).
- Support tensors with a storage offset in Java (#31584).
- Fix SIGABORT caused by double exception in PyTorchStreamReader when file not found. (#33243).
- Fix SELECTED_OP_LIST file path issue (#33942).
- Fix for handling batch size 0. (#34599).
- Fixed AutoGradMode/AutoNonVariableTypeMode uses for mobile callsites
- Use gettimeofday on iOS (#30361).
ONNX
- Fix weight_norm export for dim=0 (#31015).
- Fix for constant folding flaky tests (#32546).
- Fix export for avg_pool with default stride (#33017).
- Fix ONNX CI by moving test data to aws (#33200).
- Fix for random generators export (#33789).
- Fix export of index_put (#31552).
- Fix for expand -1 dim value (#34069).
- Reduce ONNX test time on CI (#33242).
- ONNX Error Message on Missing Op (#33593).
- Fix exporting copy_ with index as tensor input (#32801).
- Fix for rand_like as well (#33095).
- Added torchvision tests as part of ORT tests (#31835).
- Remove non-ascii character from torch/onnx/symbolic_opset11.py (#31814).
- Add flag to enable script tests (#32654).
- Skip same tests in ONNX Python3 CI as in Python2 (#31827).
- Fixed torch.mm export (#34794)
- Fixed aten::size for opset 11 (#35984)
Quantization
- Bug fix: Handle missing keys in observer state dict during load (#30357).
- Fix BC for quantized linear (#30481).
- Fix mapping white list to avoid attaching qconfig for DeQuantStub (#30636).
- Fix default instantiation of dynamic quantized LSTM (#31433).
- Use default scale/zero_point in fake_quantize module instead of None (#32318).
- Fix ASAN / potential segfault in quantized Tensor memory allocations. (#29882).
- Don’t serialize None values in observer (#32733).
- Enable inplace relu fusion for training (#33105).
- Bug fix in dynamic quantization kernels + better test coverage. (#33320).
- Run weight_post_process for QAT (#33852).
- Fix histogram observer to work with QAT on GPU (#34232).
- Fix the quantized batchnorm2d (#34579).
- Move QScheme ops to c10 (#30134).
- Remove incorrect fp16 dynamic linear/relu op (#32774).
RPC
- Fix serialization memory lifetime issue. (#30603).
- Don’t crash callee when function does not exist on it, instead return an Exception (#32726).
- Throw the correct Exception on local client based on the RemoteException (#32936).
- Attach autograd edges only for tensors requiring grad. (#30904).
- WireSerializer should check has_storage() (#34626).
- Fixed potential deadlock in python exception handling (#35283)
Other
- torch.split: Fixed incorrect gradient computation that assumed the output was not a view (#32044).
- Allowed numpy integer types to be used where we accept Python integers (#30486).
- torch.unique, torch.unique_consecutive: fixed bug with zero-element input support (#31211).
- Tensor.to_sparse: fixed backward in the non-contiguous tensor case (#31223).
- torch.index_put: Added error checks for input tensors’ devices (#31280).
- Ensure we switch the CUDA stream correctly in CUDA operations (#31537, #31538, #31541).
- torch.SparseTensor: ensure the legacy sparse constructor doesn’t interpret Python data as tensor data. (#31490).
- torch.argmax, torch.argmin: Fixed incorrect behavior on large tensors (#33310).
- torch.div: Fixed to throw an error when dividing by integer zero on CPU (#32629).
- torch.cos: Fixed incorrect gradient computation caused by not properly initializing temporary vectors in avx2 code (#32722, #34281).
- torch.logspace: Added missing integer dtype support, fixed precision issues in floating-point implementation (#32744).
- torch.prod: Fixed behavior when passed a torch.half input tensor and torch.float output tensor (#32831).
- torch.max, torch.min: Fixed NaN handling (#32541).
- torch.max, torch.min: Added error check that operand and outputs are on the same device type (#32862).
- torch.stack: Added missing input size checks (#32931).
- torch.add: Fixed memory leak on certain platforms (#32478).
- torch.normal: Fixed shape checks (#33050).
- torch.cumsum: fixed to handle inputs with zero-sized dimensions correctly (#31694).
- torch.device: Disallow incorrectly formatted device strings (#29087).
- torch.cat: Disallow passing out as one of the input tensors (#30577).
- torch.pdist: Added support for large batch sizes (#31593).
- torch.stft: Fixed crash when used with nn.DataParallel (#31861).
- torch.autograd: Ensure the original grad mode is restored during backward (#31884).
- torch.autograd: Fixed a race condition by locking graph_task before writing leaf_streams. (#31995).
- torch.tensordot: Fixed support for negative dimensions (#31954).
- torch.cumprod: Fixed to handle inputs with zero-sized dimensions correctly (#32070).
- torch.pow: Fixed the gradient computation when the base is a Tensor or Scalar of zeros (#32062, #32063).
- torch.baddbmm: Fixed bug in corner case (#33538).
- torch.where: Added check for consistent devices (#33432).
- torch.cdist: Fixed gradient computation for p=2 and large inputs (#31167).
- torch.mv: Fixed NaN handling (#31666).
- torch.index_put: Added handling for large input tensors (#33753).
- torch.addmm: Fixed incorrect output when using BLAS backend (#33819).
- torch.topk: fixed double backward when input has non-finite values (#35253)
- torch.load: Avoid problematic pickle usages on Python 3.8.0 and 3.8.1 (#33824).
- Tensor.to: Fixed race condition for gradient computation that spans CUDA devices (#31930).
- Tensor.random_: added check that from and to are within the Tensor’s dtype bounds (#34033).
- Tensor.copy_: Fixed memory overlap check and allowed outputs to be zero-strided tensors if the size is <= 1 along that dimension (#34100).
- nn.BatchNorm{1,2,3}d: fixed gradient computation for empty inputs (#32820).
- nn.BatchNorm: Fixed behavior for inputs with large batch sizes (#32763).
- nn.Conv2d: Fixed 5d weight handling with MKLDNN backend (#34115).
- nn.Conv3d: Fixed unstable gradient computation (#34358).
- nn.Conv{1,2,3}d: added support for empty batch size (#32709).
- nn.Conv{1,2,3}d: fixed CUDNN_STATUS_NOT_SUPPORTED errors by trying multiple algorithms (#33073).
- nn.Conv{1,2,3}d: fixed padding mode support and added additional padding modes (reflection and replication) (#31784).
- nn.Conv2d, nn.Conv3d, nn.Conv1d, nn.ConvTranspose2d: Fixed support for batch sizes greater than 2^32 (#31383, #31379, #31889, #34407, #31510).
- nn.InstanceNorm, nn.GroupNorm: Added error check for input with exactly one element (#29082).
- nn.RNN: Fixed moving RNNs to a device after applying weight norm (#32563, #32989).
- nn.MultiLabelMarginLoss: added support for 0-d tensors (#30765).
- nn.GroupNorm: added support for empty batch (#32401).
- nn.NLLLoss: fixed to support empty tensors on CUDA (#31491).
- nn.GroupNorm: corrected input size check (#33008)
- nn.MultiLabelMarginLoss: fixed memory leak on CUDA (#30767).
- nn.MultiMarginLoss: fixed error checking on CUDA for the 1D case. (#30825).
- nn.Softmax: Fixed half->float case of softmax backward (#30838).
- nn.Softshrink: Added check that lambda is no less than zero (#33201).
- nn.functional.interpolate: added support for empty batch size input for interpolate. (#32400).
- nn.functional.pad: Always return a new tensor instead of sometimes returning a view (#32350).
- nn.functional.grid_sample: Fixed gradient computation at image borders (#32829).
- nn.functional.leaky_relu_: disabled incorrect leaky_relu_ negative slope backward calculation (#33639).
- optim.LambdaLR: removed unintentional side effects (#32848).
- optim.Adam, optim.AdamW: Added missing weight_decay parameter validation (#33126).
- optim.MultiStepLR: Fix “unbound local variable” error by removing return value for __exit__ (#32997).
- optim.MultiStepLR: Fixed broken step() method (#33356).
- torch.autograd: added new error message if incorrect usage would cause a deadlock (#32295).
- torch.autograd: Prohibited copying autograd engines (#34567).
- torch.autograd: Fixed incorrect handling of functions that return multiple views (#32790).
- autograd.Function: Fixed error if Function returned a view in a torch.no_grad block (#33896).
- autograd.Function: Added more error checks for incorrect behavior (#33069).
- autograd.Function: Added nice error message if missing overrides (#33142).
- autograd.Function: Fixed version check for grad_fn for views (#34145).
- autograd.profiler: Fix incorrect chrome trace formatting output for CUDA traces (#33987).
- multiprocessing.util.register_after_fork: fixed crash on Windows (#30809).
- utils.data.DataLoader: Fixed potential hang when exiting main process (#33721).
- utils.tensorboard.SummaryWriter: fixed scale_factor calculation for uint8 tensor (#31778).
- utils.tensorboard: Fix for when PyTorch model trace has RecursiveScriptModules (#30430).
- Fixed CPU_INTEL flag error on Windows (#30564).
- Don’t use RTLD_GLOBAL to load _C, resolving a multitude of weird segfaults and crashes when PyTorch is imported along with other packages (#31162).
- Fixed dll load logic for Python 3.8 on Windows (#32215).
- quasirandom.SobolEngine: Fixed crash when default tensor type is CUDA (#32496).
- Fixed error message when converting NumPy array with negative strides to a torch.Tensor (#33254).
- Fixed crash when indexing a torch.Tensor with a single-element array (#33456).
- Fixed crash when converting CUDA tensors and non-strided tensors to NumPy arrays (#33612).
- Prevented crash on exit from static destructor race on Windows (#33955).
- Fixed uncaught std::domain_error on macOS (#34301).
- Don’t reset worker affinity when using operators that call into OpenMP (#29006).
- torch.backends.mkldnn: changed to be usable without import (#32055).
Performance
Mobile
- Java Tensor hybrid, owns at::Tensor, no memcopy for java outputs. (#30501).
- Tensor prep from image in native (#31426).
- Pass to remove prepacking ops. (#34319).
Quantization
- Per channel quantization performance improvement (#33772).
- Speed up per-channel min-max observer (#34118).
- Vectorized qmul and more methods on qint data types (#34376).
RPC
- Improve ProcessGroupAgent serialization speed (#29785).
- Avoid sending large unneeded data over wire in ProcessGroupAgent. (#31357).
- Integrate async mode for autograd engine with distributed autograd. (#31508).
- Make handling of FORWARD_AUTOGRAD_REQ in request_callback_impl nonblocking (#32476).
Other
- Major multithreaded performance regression when doing operator calls resolved (#30333)
- Improved performance of comparison ops on CUDA (#29743).
- Tensor.view improved performance (#30554).
- Improved tensor creation overhead (#30452, #30709)
- nn.SmoothL1Loss: vectorized gradient computation on CPU. (#30046).
- nn.EmbeddingBag: improved performance on CPU (#30701, #27477).
- nn.LayerNorm: optimized with explicit vectorization using Vec256 (#31127).
- Tensor.copy_: fixed kernel speed regression introduced in #29631 (#31279).
- Moved a number of debug asserts to not compile in release builds (#31240).
- Tensor::has_names sped up for unnamed tensors (#31436).
- torch.index_select: optimized performance on CPU (#30598).
- nn.Conv{1,2,3}d: Improved performance by refactoring bias handling for cuDNN backend (#31524).
- torch.norm: Optimized case where p = 2 (#31903).
- nn.utils.clip_grad_norm_: Refactored the computation for more performance (#32020).
- Made an assert on a hotpath trigger only in DEBUG mode (#32117).
- First steps toward TensorIterator unrolling and vectorized load (#31974).
- nn.functional.normalize: changed to use clamp_min_ (#32360).
- Stopped refreshing numel on a stride update (#32116).
- nn.functional.softplus: vectorized operator and gradient computation on CPU (#32944).
- torch.gather regression fixed by not materializing loop vars in error message (#33108).
- nn.ELU forward and backward vectorized on CPU (#32985, #32986)
- torch.cat: optimized performance on CPU (#30806, #33534).
- torch.conv3d: optimized Unfold3d to improve performance (#33191).
- Workaround performance bug and memory leak in GOMP for AMD CPUs (#32875).
- Improved TensorIterator overhead (#33165).
- torch.conv3d: optimized Unfold3dAcc to improve gradient computation performance (#33317).
- torch.roll improved performance (#33623).
- Bounds checking for functor execution in vectorized/unrolled kernels (#33642).
- nn.EmbeddingBag: improved performance on CUDA (#33589).
- Remove unnecessary tensor copies while calling operators (#33732).
- clang intrinsics targeting on Windows (#33958).
- nn.Dropout: added vectorized CUDA implementation (#33879).
- nn.UpSampleNearest{1, 2, 3}d performance on CPU optimized (#31452).
- Remove cudaMemcpy on full memory overlap (#34548).
- CUDA Loops: move address computation into policy, make policy.load load all arguments (#33720).
- nn.BatchNorm{1, 2, 3}d contiguous case’s performance improved (#34530).
- Add the build for runtime dispatch for AVX, AVX2 instruction set (#26125).
- nn.RReLU performance improved up to 5x for inference on CPU (#31094).
- nn.LogSigmoid performance improved up to 10x on CPU (#30958).
- torch.dist performance improved up to 2x (#29714).
- torch.max, torch.min performance improved up to 1.5x on CPU (#33936).
- nn.GLU performance improved up to 1.5x on CPU (#33179).
- nn.LeakyReLU performance improved up to 4x (#29899).
- nn.HardTanh performance improved up to 5x (#30152).
Documentation
Python
- Added documentation for nn.functional.softplus (#30055, #32945).
- torch.max: Added warning about different, nondeterministic behavior on CPU and CUDA (#31115).
- Clarified the documentation for nn.NLLLoss (#31488).
- Exclude generated source docs from Google search indexing (#31484).
- torch.poisson docstring added to documentation (#31667).
- torch.eq fixed incorrect examples in documentation (#32399).
- torch.load: added warning regarding pickle insecurity (#32593).
- optim.CosineAnnealingLR: fixed the usage in examples (#31358).
- Added doc previewing instructions (#31905).
- Removed legacy .data usages from the torch.nn documentation (#31481).
- Fixed description of convolution modules (#30079).
- Tensor.t(), Tensor.permute(), Tensor.unfold(), and Tensor.select() clarified to note that they return views (#32512).
- torch.multiprocessing: Updated documentation indicating that start_method is ignored for mp.spawn() (#33070).
- Improved CPU threading documentation (#33083).
- nn.BCELoss: documented how it avoids infinite results (#33160).
- nn.utils.rnn.pack_padded_sequence: Improved the description of enforce_sorted (#33617).
- nn.utils.pad_packed_sequence: doc improvement (#33768).
- nn.LPPool{1,2}d: removed nonexistent parameter (#33714).
- Created a Tensor View documentation page that documents all PyTorch operations that return views (#32560).
- Added grad context manager doc to top level torch module. (#33877).
- Enhanced reproducibility documentation (#33795).
- Numerous typo fixes (#30448, #30518, #30614, #30464, #30608, #24335, #34581, #34624, #34008, #31395, #31677, #31617, #31973, #32068, #33689, #30385, #32003, #31682, #30846, #33478, #33549, #32307, #33144, #33805, #33836, #34053).
- Numerous formatting and/or rendering fixes (#30377, #30779, #32667, #34027, #32911, #30814, #30815, #31760, #34503).
C++ API
- Fix at::Tensor docs generation and make it accessible again at https://pytorch.org/cppdocs/api/classat_1_1_tensor.html (#34467)
- Add docs for all torch::nn modules and functionals (#34522) (#34688) (#34752)
- Improve C++ autograd and tensor indexing docs (#35919)
- Fix example in torch::nn::ModuleList docs (#34463)
RPC
- Reorganize RPC API doc and add introduction (#30491, #35109).
- Make doc source format consistent in rpc/init.cpp (#30515).
- Add examples to RRef doc (#30516).
- Add more details to explain rpc_backend_options arg in init_rpc (#30855).
- Fix examples in API doc (#30856).
- Fix examples in RRef API doc (#30857).
- Document WorkerInfo and RpcBackendOptions structures in RPC docs. (#31077).
- Explain RPC behavior when using Tensor as arg or return value (#31968).
- Update RPC docs to reflect correct use of dist_autograd backwards and dist_optim step() (#34670).
- Minor doc tweak to use mp.spawn in example (#30381).
- Update distributed autograd note (#34657).
Mobile
- Add info about transitive dependencies in case of using local aars (#30128).
- Update Docs for building PyTorch for Android. (#32578).
- Javadoc changes (#31956).
Quantization
- Updates to quantization documentation (#30288).
- Fix docs so that the example works (#30120).
- Add the explicit per-tensor/per-channel quant info when we print the module (#30591).
- Fixed typos in quantization docs / docstrings (#34182).
- Docs entry for is_quantized (#32075).
Deprecations
Python
How to figure out which line in your code is raising a warning
Attempting to use deprecated behavior will raise warnings. Unfortunately, sometimes it is not entirely obvious what line of code the warning corresponds to, especially if the warning comes from our C++ backend. For example, with a file named foo.py with the following contents,
import torch
# This is newly deprecated behavior, see the next section
torch.tensor(1) / torch.tensor(2)
running it doesn’t give us the location of the warning:
> python foo.py
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
We can use the warnings module to tell us where the warning is by asking it to treat warnings as errors:
import torch
import warnings
warnings.filterwarnings('error', message='Integer division')
# This is newly deprecated behavior, see the next section
torch.tensor(1) / torch.tensor(2)
Running the file now tells us exactly where the warning is:
> python foo.py
Traceback (most recent call last):
  File "foo.py", line 5, in <module>
    torch.tensor(1) / torch.tensor(2)
UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
Deprecated torch.div and torch.addcdiv integer floor division behavior (#34570)
In 1.5.0 and older PyTorch releases torch.div and the / operator perform integer floor division. In a future PyTorch release, torch.div (including the / operator) will perform “true” division as in Python3 and NumPy.
To floor divide integer tensors, please use torch.floor_divide instead.
| Before | After |
|---|---|
>>> torch.tensor(3) / torch.tensor(2)
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
tensor(1) |
>>> # NB: the following is equivalent to `torch.floor_divide(torch.tensor(3), torch.tensor(2))`
>>> torch.tensor(3) // torch.tensor(2)
tensor(1) |
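The difference between the two kinds of division can be sketched as follows. This is a minimal illustration that assumes a PyTorch version providing both torch.floor_divide and torch.true_divide; the variable names are only for demonstration.

```python
import torch

a = torch.tensor(3)
b = torch.tensor(2)

# Floor division keeps the integer dtype; // is the operator form.
floored = torch.floor_divide(a, b)
floored_op = a // b

# True division produces a floating-point result for integer inputs.
true_quotient = torch.true_divide(a, b)
```

Spelling out which division is intended keeps the result stable regardless of how the / operator behaves in the installed release.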
The fix for torch.addcdiv is similar.
| Before | After |
|---|---|
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> torch.addcdiv(input, tensor, other, value=value)
../aten/src/ATen/native/PointwiseOps.cpp:81: UserWarning: Integer division with addcdiv is deprecated, and in a future release addcdiv will perform a true division of tensor1 and tensor2. The current addcdiv behavior can be replicated using floor_divide for integral inputs (self + value * tensor1 // tensor2) and division for float inputs (self + value * tensor1 / tensor2). The new addcdiv behavior can be implemented with true_divide (self + value * torch.true_divide(tensor1, tensor2).
tensor(0) |
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> (input + torch.floor_divide(value * tensor, other))
tensor(0) |
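The two replacement formulas quoted in the addcdiv warning can be compared side by side. The sketch below is illustrative only: it avoids calling torch.addcdiv on integer tensors, and the variable names are hypothetical.

```python
import torch

inp = torch.tensor(0)
tensor1 = torch.tensor(1)
tensor2 = torch.tensor(3)
value = 1

# Replicates the integer floor-division behavior described for 1.5.0.
floor_style = inp + torch.floor_divide(value * tensor1, tensor2)

# Replicates the announced future behavior based on true division.
true_style = inp + value * torch.true_divide(tensor1, tensor2)
```

Here floor_style stays an integer tensor while true_style is floating point, which is the visible difference a caller would need to plan for.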
Deprecated torch.full returning float tensors if no dtype is specified (#34709).
In a future PyTorch release, torch.full will infer its dtype from its fill value when the optional dtype and out parameters are unspecified, matching NumPy’s inference for numpy.full. For example, torch.full(size, 1) will return a tensor of torch.long dtype, unlike today where it returns a tensor of torch.float dtype.
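One way to stay independent of the inference rule is to pass dtype explicitly. The following is a small sketch of that approach; the size and fill values are arbitrary.

```python
import torch

size = (2, 3)

# An explicit dtype gives the same result whichever inference rule
# the installed PyTorch version applies to the fill value.
as_long = torch.full(size, 1, dtype=torch.long)
as_float = torch.full(size, 1.0, dtype=torch.float)
```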
Deprecated torch.nn.modules.conv._ConvTransposeMixin (#31784).
This is an internal-facing class that is not a part of our public API. We’ve refactored some PyTorch internals to work without it and will remove it in a future release.
Deprecated positional args in multiple torch function signatures (#32009, #33428)
Below is a list of deprecated signatures and what to change them to.
- torch.add(self: Tensor, alpha: Scalar, other: Tensor), torch.sub(self: Tensor, alpha: Scalar, other: Tensor): please use alpha as a keyword-only arg instead of a positional arg.
- torch.addbmm(beta: Scalar, self: Tensor, alpha: Scalar, batch1: Tensor, batch2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.addcdiv(self: Tensor, value: Scalar, tensor1: Tensor, tensor2: Tensor), torch.addcmul(self: Tensor, value: Scalar, tensor1: Tensor, tensor2: Tensor): please use value as a keyword-only arg.
- torch.addmm(beta: Scalar, self: Tensor, alpha: Scalar, mat1: Tensor, mat2: Tensor), torch.sspaddmm(beta: Scalar, self: Tensor, alpha: Scalar, mat1: Tensor, mat2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.addmv(beta: Scalar, self: Tensor, alpha: Scalar, mat: Tensor, vec: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.addr(beta: Scalar, self: Tensor, alpha: Scalar, vec1: Tensor, vec2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
- torch.baddbmm(beta: Scalar, self: Tensor, alpha: Scalar, batch1: Tensor, batch2: Tensor): please use alpha and beta as keyword-only args instead of positional args.
| Before | After |
|---|---|
>>> torch.zeros(2,3).add(2, torch.ones(2, 3))
../torch/csrc/utils/python_arg_parser.cpp:750: UserWarning: This overload of add is deprecated:
add(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add(Tensor other, Number alpha)
tensor([[2., 2., 2.],
[2., 2., 2.]]) |
>>> torch.zeros(2, 3).add(torch.ones(2, 3), alpha=2)
tensor([[2., 2., 2.],
[2., 2., 2.]]) |
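The same keyword-only pattern applies to the other functions in the list. As a hedged sketch, the snippet below repeats the add call and adds an addmm call with beta and alpha passed by keyword; the input values are arbitrary.

```python
import torch

# add: alpha scales `other` before the addition.
added = torch.zeros(2, 3).add(torch.ones(2, 3), alpha=2)

# addmm: beta scales the input, alpha scales mat1 @ mat2.
inp = torch.ones(2, 2)
mat1 = torch.eye(2)
mat2 = torch.ones(2, 2)
combined = torch.addmm(inp, mat1, mat2, beta=0.5, alpha=2)
```

With these inputs every element of combined is 0.5 * 1 + 2 * 1, since the identity matrix leaves mat2 unchanged.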
Deprecate modifying in-place a view that was returned by a custom autograd Function (#32839).
Modifying in-place a view that was created by a custom Function leads to the custom backward not being called or being called with a partial gradient. This behavior will be removed in 1.6.
Please clone() the output of the Function to avoid incorrect gradient computation.
import torch
from torch.autograd import Function
class Id(Function):
    @staticmethod
    def forward(ctx, input):
        # Returning a view of the input is what triggers the deprecated behavior
        return input.view_as(input)
    @staticmethod
    def backward(ctx, grad_input):
        return grad_input
| Version 1.5.0 | Version 1.5.0 |
|---|---|
>>> input = torch.randn(3, requires_grad=True)
>>> other = torch.randn(3)
>>> output = Id.apply(input)
>>> output.copy_(other)
# Warning: Incorrect gradients |
>>> input = torch.randn(3, requires_grad=True)
>>> other = torch.randn(3)
>>> output = Id.apply(input).clone()
>>> output.copy_(other) |
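To see that the clone() workaround keeps the custom backward in the graph, one can run a backward pass through it. This is a minimal sketch with illustrative names; it uses mul_ as the in-place operation so that the gradient reaching the input is easy to predict.

```python
import torch
from torch.autograd import Function


class Id(Function):
    @staticmethod
    def forward(ctx, inp):
        # Returns a view of the input, as in the example above.
        return inp.view_as(inp)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


x = torch.randn(3, requires_grad=True)

# clone() yields a regular tensor, so the in-place op no longer
# touches the view produced by the custom Function.
out = Id.apply(x).clone()
out.mul_(2)
out.sum().backward()
```

After the backward pass, x.grad holds 2 for every element, which is the gradient of sum(2 * x).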
Deprecate modifying in-place a view created inside a no_grad block (#32839)
Modifying in-place a view created inside a no_grad block is ambiguous and error-prone so we have deprecated it.
Here is an example of some code that we’ve deprecated. In previous versions of PyTorch, the following code throws a non-descriptive error message, but we’ve added a deprecation in 1.5.0.
base = torch.rand(10, requires_grad=True)
var = torch.rand([], requires_grad=True)
with torch.no_grad():
    view = base[1]
view.copy_(var)
torch.autograd.grad(base.sum(), var)
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is deprecated and will be forbidden starting 1.6 (see https://github.com/pytorch/pytorch/pull/32839 for more details about this). You can clarify your code and remove this warning by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked).
If you want to differentiate, you should change the above code to
base = torch.rand(10, requires_grad=True)
var = torch.rand([], requires_grad=True)
view = base[1]
view.copy_(var)
torch.autograd.grad(base.sum(), var)
(tensor(1.),)
If you don’t want to differentiate, you should change it to
base = torch.rand(10, requires_grad=True)
var = torch.rand([], requires_grad=True)
with torch.no_grad():
    view = base[1]
    view.copy_(var)
C++ API
Deprecated Tensor.type() (#30281)
Please use Tensor.options() instead.
Miscellaneous
- Part of an automated mixed-precision solution (#33366, #33832).



