The 1.10 release includes quite a few commits that are interesting for people who develop within PyTorch.
You can find a curated list of these changes below:
Developers
Python API
- Generic test parametrization functionality (#60753) (see the parametrization sketch after this list)
- Ensure NativeFunctions.h codegen output is deterministic (#58889)
- hide top-level test functions from pytest’s traceback (#58915)
- remove pytest.UsageError (#58916)
- introduce TestingErrorMeta for internal use (#58917)
- make `assert_equal` an example of how to partially apply `torch.testing.assert_close` (#58918)
- split TestAsserts by functionality (#58919)
- add support for constant (#60166)
- ensure xml report paths are relative to */pytorch/test (#60380)
- Set TORCH_WARN_ONCE to always warn inside of assertNotWarn (#60020)
- Print stdout and stderr to console on parallel runs (#60869)
- Fix several test_ops cuda dtypes tests (#60922)
- remove UsageError in favor of ValueError (#61031)
- Remove test_out, test_variant_consistency_eager skips for `addmv`; fixed before (#61579)
- [OpInfo] Added ReductionOpInfo subclass of OpInfo and ported sum test (#62737)
- Added API tests to ReductionOpInfo and ported amax/amin/nansum tests (#62899)
- Revert "Revert D30558877: Ported std/var to ReductionOpInfo (#65262)
- Cleanup functional.py after lu_unpack was removed (#58669)
- Fix `torch.finfo.bits` typo in stub (#58819)
- masked_scatter thrust->cub (#58865)
- improve broadcasts_input error message (#58295)
- OpInfo
  - Adding coverage for: `t`, `split`/`split_with_sizes`, `kthvalue` tests, `diag_embed`/`diagonal`, `fmod`/`remainder`, `tril`, `triu`, `renorm`, `tensor_split`, `true_divide`, `div`, `log_softmax`, `torch.{lin, log}space`, `clone`, `contiguous`, `__rmod__`, `where`, `fill_`, `resize_`, `resize_as_`, `zero_`, `norm`, `to_sparse`, `addbmm`, `bitwise_and`, `fmod`, `remainder`, `conv_transpose2d`, `adaptive_avg_pool2d`, `torch.nn.functional.normalize`, `torch.nn.functional.pad`, `torch.meshgrid`, `torch.nn.functional.dropout`, `torch.nn.functional.conv2d` (#59442, #58184, #58654, #58642, #57941, #59145, #59079, #59133, #59154, #59173, #59336, #53685, #58390, #58476, #58349, #59138, #59176, #58731, #59259, #59445, #61832, #61349, #61527, #63389, #62935, #62635, #62814, #62720, #62315, #63517)
  - Add expected_failure kwarg to SkipInfo (#62963)
  - Launch OpInfoHelper tool (#58698)
  - Unify OpInfo dtype tests (#60157)
  - Add an OpInfo note (#57428)
  - Adding reference tests to the `OpInfo` class (#59369)
  - Test shape analysis with opinfos (#59814)
- Adding coverage for ATen migration: `nonzero`, `renorm`, `torch.lstsq`, `crossKernel`, `glu`, `nll_loss_forward` (#59149, #59250, #59400, #60039, #61153, #60097)
- Move sharding to after all tests have been excluded (#59583)
- Fix Python 3.8 expecttest (#60044)
- Improve error messages of `torch.testing.assert_close` in case of mismatching values (#60091) (see the `assert_close` sketch after this list)
- update docstring examples of `torch.testing.assert_close` (#60163)
- tests for diagnostics in callable `msg` in `torch.testing.assert_close` (#60254)
- Increase tolerance for test_grad_scaling_clipping (#60458)
- Add a test case for findDanglingImpls (#61104)
- Don’t check stride by default in Torch testing (#60637)
- Fix ignoring Tensor properties in torch.overrides (#60050)
- mvlgamma: int → float type promotion (#59934)
- Use python3.6 compatible APIs in clang_tidy.py (#60659)
- correct filename issue for test_cpp_extensions_aot (#60604)
- First step to rearrange files in tools folder (#60473)
- Add exclusion list to _check_kernel_launches.py (#60562)
- use explicitly non-returning GPU atomics (#60607)
- Test parametrization for instantiated device-specific tests (#60233)
- Use maximum of tolerances in case of mismatching dtypes (#60636)
- remove `randn?` from the `torch.testing` namespace (#61840)
- Parity tests for functional optimizer step_param (#61756)
- Functional Optim: Test kwargs parity for SGD (#62078)
- Make SkipInfo with a `unittest` skip (#63481)
- Modify compare scalars testing error message when atol=0 and rtol=0 (#60897)
- Added reference tests to ReductionOpInfo (#64273)
- Remove opinfo warning from floor_divide (#58682)
- Support `torch.concat` alias, add `cat` OpInfo & remove OpInfo test_out skips for {cat, stack, hstack, vstack, dstack} (#62560)
- Remove `run_functional_checks` from `test_autograd` and create necessary OpInfos (#64993)
- Fixing user inputs for low, high in `make_tensor` (#61108)
- kill SkipInfo (#64878)
- Add `skipIfTBB` decorator (#64942)
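As a quick orientation for the generic test parametrization entries above, here is a minimal sketch of how the new decorator can be used; it assumes the helpers live in `torch.testing._internal.common_utils` and is not taken verbatim from the test suite.

```python
# Minimal sketch of the generic test parametrization functionality.
# Assumes parametrize/instantiate_parametrized_tests live in
# torch.testing._internal.common_utils.
import torch
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)


class TestAddScalar(TestCase):
    # One test is generated per value of `scalar`.
    @parametrize("scalar", [1, 2.5, -3])
    def test_add_scalar(self, scalar):
        t = torch.ones(3)
        self.assertEqual(t + scalar, torch.ones(3) + scalar)


# Materializes the parametrized variants as concrete test methods.
instantiate_parametrized_tests(TestAddScalar)

if __name__ == "__main__":
    run_tests()
```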
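Several entries above touch `torch.testing.assert_close`; as background, a minimal usage sketch (values and tolerances are illustrative, and only documented keyword arguments are used):

```python
# Minimal sketch of torch.testing.assert_close; values are illustrative.
import torch

actual = torch.tensor([1.0, 2.0, 3.0])
expected = torch.tensor([1.0, 2.0, 3.0 + 1e-6])

# Passes: the mismatch is within the default dtype-based rtol/atol.
torch.testing.assert_close(actual, expected)

# With rtol=0 and atol=0 the comparison must be exact, so this fails;
# a custom msg replaces the generated diagnostics.
try:
    torch.testing.assert_close(actual, expected, rtol=0, atol=0, msg="tensors diverged")
except AssertionError as exc:
    print(exc)
```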
Complex Number
- Skip conjugate and negate fallback for view ops and their in-place versions (#64392).
- Remove component wise comparison of complex values in TestCase.assertEqual (#63572).
Autograd
- Added autograd not implemented boxed fallback (#63458)
- Added support for using gradients named for outputs in `derivatives.yaml` (#63947)
- Forward AD now being tested using OpInfo (#58304, #60498)
- Added change to extract `TestAutogradComplex` into its own test file (#63400)
- No longer skip gradgrad checks now that fast gradcheck has made them faster (#60435) (see the sketch after this list)
- Improve debug-only autograd checks to ensure ops behave as view and in-place ops as advertised in their schema (#60286)
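For context, the "fast gradcheck" referenced above is the `fast_mode` flag of `torch.autograd.gradcheck`/`gradgradcheck`; a minimal sketch (the function and inputs are illustrative):

```python
# Minimal sketch of gradcheck/gradgradcheck with fast_mode enabled
# (the "fast gradcheck" referenced above). Inputs should be double
# precision and require grad.
import torch
from torch.autograd import gradcheck, gradgradcheck

x = torch.randn(3, dtype=torch.double, requires_grad=True)

def fn(t):
    return (t * t).sum()

# fast_mode checks a reduced set of directions instead of the full Jacobian.
assert gradcheck(fn, (x,), fast_mode=True)
assert gradgradcheck(fn, (x,), fast_mode=True)
```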
torch.nn
- Testing
  - Added `OpInfo` entries for `nn.functional.{avg_pool2d, conv2d, conv_transpose2d, cosine_similarity, grid_sample, interpolate, layer_norm, linear, mse_loss, one_hot, relu, softmax, softplus, unfold}` (#62455, #65233, #62882, #62959, #62311, #61956, #63276, #61971, #62254, #62253, #62076, #62077, #62317, #62705)
  - Added `ModuleInfo`-based testing for forward reference function comparison, module string representation, pickling, in-place vs. out-of-place comparisons, and factory kwarg usage (#61935, #63737, #63736, #63739, #62340, #62999)
  - `nn.AvgPool2d`: Empty caching allocator before large subtest in `test_avg_pool2d` (#63528)
  - `nn.Conv2d`: Disabled test `test_Conv2d_groups_nobias` for ROCm (#59158)
  - Tweaked TF32 tolerance thresholds for `nn.{Conv2d, Conv3d}` tests (#60209, #60451)
  - `nn.LayerNorm`: Restored deleted numerics test (#64385)
  - `nn.LayerNorm`: Set behavior to always use fast gradcheck for test variant `3d_no_affine_large_feature` (#61848)
  - Remove using directive from header to fix Sandcastle tests (#59728)
  - Fixed logic to correctly avoid TF32 operations for tests that explicitly disable TF32 (#59624)
- Migrated `{multilabel_margin_loss, nll_loss2d, nll_loss2d_backward, rrelu_with_noise, thnn_conv2d, thnn_conv_depthwise2d}` from TH / THC to ATen (#60708, #62826, #60299, #57864, #62006, #63428, #62281)
- Removed dispatch from parallel regions for `nn.{BatchNorm*d, Embedding, CTCLoss}` (#60596, #60597, #60599)
- Back out “[pytorch][PR] Adds dtype to nn.functional.one_hot” (#59080)
- Removed unused variables (D29994983, D30000830)
- `nn.Module`: Added option to pass module instance to `_load_state_dict_pre_hooks` (#62070)
- `nn.EmbeddingBag`: Replace thrust with cub for sorting (#64498)
Dataloader
- Fixed `test_fd_limit_exceeded` to use `fork` explicitly as the multiprocessing method, since the default method differs across Python versions on macOS (#60868) (see the sketch after this list)
- Fixed `test_ind_worker_queue` to set `max_num_worker` based on system resources (#63779)
- Fixed `test_multiprocessing_contexts` because Jetson devices don't support sharing CUDA Tensors (#64757)
- Fixed `TestSetAffinity` to use a CPU ID based on the hardware (#65042)
- Added test to validate that the generator attached to a `Sampler` is created lazily (#63646)
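For context on the `fork` fix above, a DataLoader's worker start method can be pinned explicitly via `multiprocessing_context`; a minimal sketch (the dataset and sizes are illustrative):

```python
# Minimal sketch: pin the DataLoader worker start method explicitly
# instead of relying on the platform default (which differs across
# Python versions on macOS). Dataset and sizes are illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(8, dtype=torch.float32).reshape(8, 1))
    loader = DataLoader(
        dataset,
        batch_size=2,
        num_workers=2,
        multiprocessing_context="fork",  # explicit start method
    )
    for (batch,) in loader:
        print(batch.shape)
```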
AMD
- Updated ROCm PyTorch persons of interest (#55206)
- Make Jeff and Jithun .circleci/docker code owners (#60958)
- Skipped test_masked_scatter_large_tensor_cuda (#61313)
- Fixed bug in #60313 (#61073)
- Updated CI images for ROCm 4.3.1 (#64610)
- Added ROCm as a platform for which tests can be disabled (#63813)
- Added a skip for ROCm for a flaky test (#62664)
- Added distributed/_sharded_tensor/test_sharded_tensor to ROCM_BLOCKLIST (#63508)
- Disabled test test_Conv2d_groups_nobias_v2 for ROCm (#58701)
- Disable CircleCI ROCm build (#64434)
- [ROCm] allow user to override PYTORCH_ROCM_ARCH (#60602)
- [ROCM] enable fft tests (#60313)
CUDA
- Added change to wrap cudaStreamSynchronize calls (#61889)
- Fix formatting repeat_interleave op files (#58313)
- Added kernel launch checks after each kernel launch to silence the check (#58432)
- Added change to move code to Blas.cpp, clean up THC magma (#58526)
- Introduced step 0 of cuDNN v8 convolution API integration (#51390)
- Added `#pragma once` to CUDA foreach headers (#58209)
- Removed THCReduce.cuh (#59431)
- Fixed const correctness and loop variable type in CUDACachingAllocator (#59819)
- Fixed bad change in a CUDACachingAllocator loop (#59903)
- Fixed compile failure on CUDA92 (#60017)
- Added a follow-up fix for compilation error on CUDA92 (#60287)
- Fixed version comparison for defining CUDA11OrLater (#60010)
- Moved remaining Sort in `THC` to `ATen` (#58953)
- Fixed kernel launch check in cross kernel (7bf195f360)
- Removed some unnecessary functions from CUDAHooks (#59655)
- Added change to use make_unique instead of std::unique_ptr in CudaCachingAllocator (#61272)
- Added change to split zeta_kernel out of BinaryMiscOpsKernel.cu (#62261)
- Added change to cast the return of cudaGetLastError() to void when it is discarded (#62518)
- Replaced hardcoded values in IndexKernel.cu (#63372)
- Killed THCUNN (#63429)
- Migrated legacy lstsq from THC to ATen (CUDA) (#63504)
- Migrated THCTensor_copyIgnoringOverlaps to ATen (#63505)
- Added change to convert mul to use opmath_gpu_kernel_with_scalars (#64019)
- Added acc_gpu_kernel_with_scalars and port add to use it (#63884)
- CUDA graphs: added hotfix for test_graph_ (#64339)
- Migrated uses of THCReduceApplyUtils to cuda_utils::BlockReduce (#64713)
- Removed 9.2 related macros for CONSTEXPR (#65066)
- Removed CUDA 9.2 references, conditionals, and workarounds (#65070)
- Makes a streaming backward test try gradient stealing more directly (#60065)
C++ API
- Made it easy to grep out variant of unary/binary op kernels in codebase (#60128)
- Enhanced comparison tests for `c10::optional` (#62887)
TorchScript
- Removed metadata.pkl file from PyTorch model (#61760)
torch.package
- Added a hack that allows typing.io/re use case (#60666)
- Added change to wrap torch::deploy API functions in safe rethrow macros (#58412)
- Added proper handling of re-packaging mocked modules (#61434)
- Added support in Digraph to track predecessors (#61146)
- Added `register_module_source`, a helper function to get some python source code loaded on each interpreter (#58290)
Distributed
DistributedDataParallel
- Retired _module_copies field in DDP reducer. (#59094)
- Merged work and future_work in reducer (#59574)
- Removed legacy code from DDP internals (#58592, #58595, #58603)
- Removed unneeded code around DDP buffer synchronization logic (#64777, #64473)
torch.distributed
- Added a split for custom class bindings out of python binding code (#58992)
- Added the use of irange in a few places (#55325)
- Corrected launcher tests (#59635)
- Added the change to use TORCH_CHECK instead of std::runtime_error (#59667, #59683, #59918, #63928)
- Moved c10d to libtorch(_cuda) (#59561, #59562, #59696, #59697, #59563)
- Enabled ncclAvg for reductions (#62303)
- Made JIT operation call to accept Stack& instead of Stack* (#63414)
- Cleaned up autograd/distributed autograd engine thread state (#63115)
torch.distributed.rpc
- Introduced new rpc.barrier API (#53423)
- Added some TORCH_API annotations to RPC (#60169)
- Moved RPC agents to libtorch (#60170)
- Moved torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
- Added RPC Parameter Server Benchmark (#58003, #60784, #60785)
- Added helpers to manipulate futures (#57846)
- Made RequestCallback collect Futures from methods, rather than providing them (#57847)
- Added change to centralize setting messageId in RequestCallback (#57848)
- Made some methods of RequestCallback return void instead of bool (#57849)
- Unified invoking JIT operands (#57850)
- Unified invoking JIT functions (#57851)
- Unified async execution for JIT functions (#57852)
- Added change to simplify OwnerRRef completion (#57853)
- Removed getScriptRemoteCallType (#57854)
- Unified assignment of OwnerRRef result (#57856)
- Added change to simplify process(Script|Python)(Remote)?Call (#57857)
- Deduplicated Python object serialization (#57858)
- Unified fetching RRefs (#57859)
- Made remaining RRef methods return futures (#57860)
- Made remaining autograd methods return futures (#57861)
- Added change to always use intrusive_ptr for Message (1 out of 2) (#59205)
- Added change to always use intrusive_ptr for Message (2 out of 2) (#59206)
- Provided pre-extracted DataPtrs when completing a Future with a Message (#59208)
- Created CUDA-aware futures in RequestCallback (#59209)
- Removed LazyStreamContext (#59298, #59299)
- Allowed Future::then to return pre-extracted DataPtrs (#59207)
- Merged TensorPipe’s CPU and CUDA channel registry (#59375)
- Made CUDA serde support for TensorPipe agent pluggable (#59376)
- Moved CUDA-related stuff of TensorPipe agent to separate file (#59377)
- Prepared for TensorPipe separating its CUDA-specific headers (#59788)
- Added change to reduce overhead when Future invokes callbacks inline (#57638)
- Added change to prevent using anything other than intrusive_ptr for Future (#58421)
- Allowed Future::then to return pre-extracted DataPtrs (#58424)
torch.distributed.optim.ZeroRedundancyOptimizer
- Cleaned up ZeRO (#60285)
- Fixed ZeRO sort to be by numel (#60556)
- Refactored non-joined process computation (#61555)
- Made broadcast_object_list accept a device parameter (#61305) (see the sketch after this list)
- Made _Join, _Joinable, _JoinHook public (#62605)
- Refactored commonalities between two approaches (#62624)
- Add change to gate DistributedOptimizers on RPC availability (#62937)
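A minimal sketch of the new `device` parameter of `torch.distributed.broadcast_object_list`; it assumes a process group has already been initialized, and the payload and device choice are illustrative:

```python
# Minimal sketch of broadcast_object_list with the newly accepted device
# parameter. Assumes dist.init_process_group() has already run on every
# rank; payload and device choice are illustrative.
import torch
import torch.distributed as dist

def broadcast_config(config):
    # Rank 0 supplies the object; other ranks pass a placeholder of the
    # same length to receive into.
    obj_list = [config] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(
        obj_list,
        src=0,
        device=torch.device("cpu"),  # device used for the serialized payload
    )
    return obj_list[0]
```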
torch.fx
- Introduced prototype for guarding against mutable operations in tracing (#64295)
Benchmark
- PyTorch core should not require NumPy: `torch.utils.benchmark` (#60564)
- PyTorch core should not require NumPy: Operator Microbenchmarks (#64707)
Performance_as_a_product
- OpenMP: Refactored parallel_reduce to share code with parallel_for (#60184)
- Removed many unnecessary constructor calls of Vectorized (#58875)
- Fixed arange functions for VSX specializations of Vec256 (#58553)
Composability
- Added query() and synchronize() to `c10::Stream`, increasing parity with `at::cuda::CUDAStream` (#59560)
- Added change to make PyObject_FastGetAttrString accept const char* (#59758)
- Added change to tag PyObject on TensorImpl per torchdeploy interpreter (#57985)
- Added ExclusivelyOwned and MaybeOwned, which can be used to help reduce the number of tensor refcount bumps in some parts of the codebase (#59419, #63450)
- Improved CI testing parity across pytorch/pytorch and pytorch/xla (#59888, #59989)
- Added a boxed CPU fallback kernel (#58065)
- Added an optional Device parameter to pin_memory/is_pinned that does nothing (#60201)
- Fixed Dispatching not considering List[Optional[Tensor]] for dispatch (#60787)
- Added support for registering boxed functors to the dispatcher (#62658)
- Structured kernels: added the ability to precompute values in the meta function that can be reused later in the impl function (#61746)
- Improved build times by factoring out core parts of the tensor class from `TensorBody.h` (which is used as an input to codegen) to `TensorBase.h` (which is not) (#63612)
- Added basic support for meta tensor serialization (#62192) (see the sketch after this list)
- Added a new per-operator C++ API that can be accessed through `ATEN_FN2(add, Tensor)` (with `add.Tensor` as an example). This API exposes a struct that contains some compile-time constant info about each operator, such as its name, overload name, and schema. (#60214)
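A minimal sketch of what the meta tensor serialization entry above enables, assuming a plain `torch.save`/`torch.load` round trip is the supported surface:

```python
# Minimal sketch of meta tensor serialization (assumes a plain
# torch.save/torch.load round trip is what the entry above covers).
import io
import torch

t = torch.empty(2, 3, device="meta")  # carries shape/dtype, no storage

buf = io.BytesIO()
torch.save(t, buf)
buf.seek(0)

loaded = torch.load(buf)
print(loaded.shape, loaded.device)  # torch.Size([2, 3]) meta
```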
Build_Frontend
- Builds for CUDA 9.2 or older are no longer supported (#65065, #61462, #65024)
- Update CMake minimum version to 3.10 (#63660)
Foreach_Frontend
- Foreach Binary Test Refactor now supports complex dtypes via binary ops with one tensorlist and one scalarlist (#59907) (see the sketch after this list)
- Foreach Test Refactor: Pointwise, Min/Maximum (#61327)
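For context on the foreach test refactors, a minimal sketch of a binary op taking one tensorlist and one scalarlist, including a complex dtype; `torch._foreach_add` is a private API and the inputs are illustrative:

```python
# Minimal sketch of a foreach binary op with one tensorlist and one
# scalarlist, including a complex dtype. torch._foreach_* is private API;
# inputs are illustrative.
import torch

tensors = [
    torch.ones(2, dtype=torch.float32),
    torch.ones(2, dtype=torch.complex64),
]
scalars = [2.0, 1 + 1j]

# A single call instead of a Python loop over (tensor, scalar) pairs.
out = torch._foreach_add(tensors, scalars)
print([t.dtype for t in out])  # [torch.float32, torch.complex64]
```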
Sparse_Frontend
- Setup CI to run test_sparse_csr tests (#58666)
- Remove some uses of deprecated Tensor methods (#58990)
- Improved test for unsupported backwards for sparse COO Tensor (#59971)
- Increased dtype test coverage for sparse CSR tensors (#60656)
- Improved the representative power of random sparse CSR tensors used by tests (#60283)
- Removed dispatch in parallel regions for sparse COO CPU kernels (#60598)
- Fixed variable initialization issues in C++ code (#60896)
- Improved error messages of `torch.testing.assert_close` for sparse inputs (#61583)
- Moved some .cu files that do not rely on CUDA to .cpp (#63894)
- Add support for sparse Tensors to `torch.testing.assert_close` (#58844) (see the sketch below)
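A minimal sketch of the sparse support in `torch.testing.assert_close` referenced above (the tensors are illustrative):

```python
# Minimal sketch: comparing sparse COO tensors with
# torch.testing.assert_close. Tensors are illustrative.
import torch

dense = torch.tensor([[1.0, 0.0], [0.0, 2.0]])
a = dense.to_sparse()
b = dense.clone().to_sparse()

torch.testing.assert_close(a, b)                 # sparse vs. sparse
torch.testing.assert_close(a.to_dense(), dense)  # dense comparison unchanged
```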