The PyTorch 1.9 release contains quite a few commits that are not user facing but are interesting to people compiling from source or developing low-level extensions for PyTorch. Here is a non-exhaustive list of the most important ones.
Python API
- Added `cpu_kernel_multiple_outputs` to help developers conveniently implement new torch functions that return two or more tensors (#51097)
- Support auto generation of device check (#56872)
- Fix bug in self.assertExpectedInline (#55149)
- Protect destructors of Python bindings that can be kept alive by C++ objects (#57488)
Testing related commits:
- See the Developer Wiki article “Writing tests in PyTorch 1.9” for details on significant testing improvements
- A prototype torch.testing module (see its documentation here) has been added to facilitate testing libraries built using PyTorch
  - It currently has one function, `torch.testing.assert_close`, which can be useful when comparing PyTorch tensors (add your feedback to its RFC here!); a short usage sketch follows this list
  - Request more features by filing issues on PyTorch’s Github
- OpInfo coverage and testing continue to expand!
  - test_ops.py now verifies the out= argument works correctly for operators with OpInfos (#53259)
  - OpInfos now support sample inputs with tensorlist arguments (#54922)
  - OpInfos can now be wrapped in a lambda for grad and gradgrad checks (#54914)
  - OpInfos can now handle sample inputs where the input is broadcast (#55771)
- [ROCm] Setting TEST_WITH_ROCM now skips tests that don’t use GPUs (#55069)
- A new test case method, `assertWarnsOnceRegex`, can be used to test warnings that are usually thrown only once per process (#52387)
- `make_tensor()`, the test suite’s go-to mechanism for constructing a random tensor, now supports a `discontiguous` kwarg (#51985)
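Since `torch.testing.assert_close` is still a prototype, its exact behavior may evolve; the following is a minimal sketch of comparing tensors with default and with explicit tolerances (the tensor values are illustrative, not from the release notes):

```python
import torch
import torch.testing

# Two tensors that agree up to floating-point rounding error.
actual = torch.tensor([1.0, 2.0, 3.0]) / 3
expected = torch.tensor([1.0 / 3, 2.0 / 3, 1.0])

# Passes: values match within the default tolerances chosen for float32.
torch.testing.assert_close(actual, expected)

# Raises AssertionError: the mismatch exceeds the explicit tolerances.
try:
    torch.testing.assert_close(actual, expected + 1e-2, rtol=0.0, atol=1e-4)
except AssertionError as exc:
    print(exc)
```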
Distributed
- `torch.distributed.rpc`: Add a parameter server benchmark for RPC to torch/benchmarks/distributed (#57454)
- `torch.distributed.nn.RemoteModule`: Improve typing for RemoteModule (#58012)
- `torch.distributed.rpc`: Assert that GIL is not held in blocking destructors in RPC (#57030)
- Add logging when `store_based_barrier` succeeds (#57711)
- `torch.distributed.rpc`: Allow specifying a set of devices for CUDAFuture (#56515)
- `torch.distributed.nn.RemoteModule`: Replace Python Pickler with internal RPC pickler for RemoteModule (#58019)
- `torch.distributed.rpc`: Make CUDAFuture handle any kind of device type (#57051)
- `torch.distributed`: Remove deprecated use of torch.LongTensor, torch.ByteTensor in distributed APIs (#55861)
- `torch.distributed`: Join work clean up thread before aborting communicators (#55444)
- `DistributedDataParallel`: Log use of uneven inputs API (#54919)
- `DistributedDataParallel`: Deduplicate shared params before constructing Reducer in DDP (#53279)
- `torch.distributed`: Log nccl_async_error_handling (#52965)
- `torch.distributed.rpc`: Reduce logging verbosity in tensorpipe agent (#51784, #51785)
- `torch.distributed`: Log nccl debug level in ProcessGroupNCCL (#52803)
- `torch.distributed.rpc`: Make pickler/unpickler pluggable in RPC (#53050)
- `torch.distributed`: Make the pickler in distributed_c10d pluggable (#53060)
- `DistributedDataParallel`: Log newly added construction and runtime stats at randomly selected iterations (#51394)
- `torch.distributed.rpc`: Fix flaky TestTrainingLoop - TestE2ETensorPipe (#51939)
- `DistributedDataParallel`: Ensure local_used_maps_tmp is distinct from local_used_maps_[i] (#54474)
- `DistributedDataParallel`: Declare NamedTuple at top level to fix typing (#53273)
- `torch.distributed`: Combine backtrace print in test logging into one string to avoid interleaving (#56961)
torch.nn
- Reenable `test_nn` tests for Windows (#52051)
- Replace `type().backend()` with `device()` (#52558)
- Remove annoying warnings from `common_nn.py` (#55982)
- Fix `__torch_function__` tests (#54492)
- Fix new tf32 failures in `test_nn.py` (#52871)
- Enable test cases in `test_nn.py` for ROCm (#52836)
- Fix compiler warnings from conv.h (#56181)
- Update upsample tests in `test_nn.py` to test for memory_format (#53665)
- Lower NLLLoss/CrossEntropyLoss to ATen code (#53789)
- Refactor `multi_head_attention_forward` (#56674)
- Convert type annotations in `nn/functional.py` to py3 syntax (#53656)
- Migrate some of `test_nn.py` from `assertEqualIgnoreTypes` to `assertEqual` (#57642)
- Remove unused `RReLU` code (#57672)
- Disable `TestComplexity.test_nn_module_test` in `fbcode` (#56677)
- Make `convolution_overrideable` default implementation raise NotImplementedError (#54707)
- Remove `ddp_gpu_size` field from SyncBatchNorm (#55946)
- Remove `_specify_ddp_gpu_num` method from SyncBatchNorm (#56425)
- Check exception messages in embedding_bag_proxy unit test (5a1191d050)
- Remove legacy constructor calls from the `torch` folder (#53889)
C++ Frontend
- Add NoOpDeviceGuardImpl (#53142)
- Lower ReLU6 to ATen (#52723)
- Prevent VS from emitting ambiguous symbol errors (#53490)
- Devirtualize TensorImpl is_contiguous (#55333)
- Update expand_size API to match expand_inplace (#55246)
- Put llvmMathExtras in c10 namespace (#55886)
- Move `flatten_dense_tensors` and `unflatten_dense_tensors` to Native (#58006)
Autograd
- Forward AD: Added systematic testing via gradcheck and OpInfos (#57633, #57701)
- `torch.autograd.gradcheck`: `fast_mode` is now enabled by default for tests (#55699, #55237); see the sketch after this list
- Update autograd kernels and tracing codegen to use redispatch API (#51363, #52009)
- Move view handling logic to gen_inplace_or_view_type.py (#53341)
- Add getters for attributes on autograd Node (#55225, #53205, #56499, #52451)
- Eliminate global usage of torch.set_default_dtype in test_autograd (#56446)
- Use _WeakTensorRef over weakref in test_autograd.py (#55726)
- Move view and inplace handling to a separate key (#53342)
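For context on the `fast_mode` change above, here is a minimal sketch of invoking `gradcheck` with the fast algorithm explicitly enabled; note that the change only flips the default inside PyTorch's own test suite, and the function and shapes below are illustrative:

```python
import torch
from torch.autograd import gradcheck

# gradcheck compares analytical gradients with numerical finite differences;
# double-precision inputs keep the numerical estimate accurate.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 3, dtype=torch.double, requires_grad=True)

def fn(x, w):
    return torch.mm(x, w).sin()

# fast_mode=True opts into the faster gradient-checking algorithm explicitly.
assert gradcheck(fn, (x, w), fast_mode=True)
```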
Complex Numbers
- Added complex support for `torch.testing.assert_(equal|close)` (#57162)
- Fixed NVCC-related build warnings for complex operations in PyTorch (#55142)
- Add eager and jit variant consistency tests for `torch.cfloat` tensor type (#54854)
- Fixed complex mean and reduction tests that weren’t being properly run (#55640)
- [ROCm] Added missing template declarations for complex BLAS (#52472).
CUDA
- Kernel launch checks for aten/src/ATen (#52185)
- Add more kernel launch checks (#53286)
- Final kernel launch checks (#54214)
- Fix nvcc warnings (#55367)
- Use irange in Indexing.cu (#57479)
- Reduce the number of randperm template instantiations (#58362)
- Enforce kernel launch checks (#58116)
- Fix comments in ATenNVRTC.h (#57318)
AMD
- Generalize HIP-specific launch bounds to apply to CUDA (#56143)
Composability
- Dispatcher passes computed dispatch keys to kernels (#49354)
- Add `TORCH_CHECK_NOT_IMPLEMENTED`/`c10::NotImplementedError`; make dispatch use it (#53377)
- Refactor `tensor_new.cpp` to use `TensorOptions` instead of `DispatchKey` (#54034)
- Add `Tensor::is_cpu`, genericize `TensorIterator` (#54079)
- Migrate about 100 kernels to the C10 full dispatcher (#54109)
- Rename `XPLAT_MOBILE_BUILD` to `TEMPLATE_SELECTIVE_BUILD` (#54217)
- Migrate kernels with `Tensor?` to the C10 full dispatcher (#54263)
- Delete all unnecessary singular `Math` entries (#54436)
- Rename `Math` to `CompositeImplicitAutograd` (#54466)
- Rename `DefaultBackend` to `CompositeExplicitAutograd` (#54470)
- Migrate kernels with `TensorOptions` to the C10 full dispatcher (#54539)
- Expose ops present in the dispatcher via `Dispatcher::getAllOpNames()` (#54791)
- Make redispatch functions callable from out-of-tree extensions (#54966)
- Remove the `use_c10_dispatcher` option (#54969)
- Provide a method `ObservedOperators::getUnobservedOperatorList()` so that the model tracer can empty it out during tracing (#55017)
- Support `needsOutputs` for `RecordFunction` and `ObserverUtil` improvements (#55012)
- Strictly typecheck all files in `tools/codegen` (#55227)
- Add `MaybeOwned::operator*() &&` (#55244)
- Allow copy operations on `MaybeOwned` (#55419)
- Remove the non-const `TensorIterator::tensor()` method (#55420)
- Make `as_strided_` use const ref for mutable tensors (#55875)
- Generate xla codegen in-tree (#56601)
- HABANA Device registration key and Autograd key addition (#57094)
- Refactor autocast to be extensible for devices (#57104)
- Add pybind type caster for `c10::Device` (#57292)
- Make `c10::TempFile` non-copyable but movable (#57308)
- Fix `string_view::equals_` compilation by CUDA 11.3 (#57322)
- Delete move constructor on TensorImpl (#58048)
- Structured kernels: error check when structured_delegate is not marked structured (#52227)
- Fix RegistrationDeclarations.yaml, now that we codegen composite kernels for structured functional/inplace ops (#56307)
TorchScript
- Remove `output_args` from `ReduceOp` (#52187)
- Remove `ReduceOp::accumulator` (#52196)
- Fix memory dependencies computation to not look at reduction output args (#52170)
- Update `rfactor` to not use `ReduceOp->output_args()` (#52177)
- Add an initialization expression to `Reduce()` (#53751)
- Add IRVerifier (#52901)
- Add index verifier for `Store` (#53137)
- Add new APIs to get loops corresponding to a `Buf` (#53778)
- Remove Dropout during frozen optimization (#51589)
- Add pure list-producing ops to alias analysis (#51999)
- Remove `DepTracker` from LoopNest (#52405)
- Use graph executor to run `forward` on a gradient (#52136)
- Support `casted_batch_one_hot_lengths` with 4-arg `to` (#53215)
- Enable `ClipRangesGatherRangesX2SigridHash` fusion for `SigridHashPrecompute` (#53324)
- Convert `to` to `to_copy` (#53524)
- Fuse SigridTransforms + ListUnpack (#53920)
- Use reshape when possible in broadcasting (#53326)
- Lazily initialize `AliasDb` in constant prop (#54640)
- Fix freezing with MKLDNN tensors (#54632)
- Add `EliminateExceptions` pass (#54730)
- Lazily initialize `AliasDb` and add `changed` status to CSE (#54776)
- Make transformations return whether the graph is modified (#54777)
- Change `resize_as_` to `resize_` (#55098)
- Update to short forms of `splitWithTail`/`splitWithMask` (#55542)
- Patch `requires_grad` on `DifferentiableGraph` (#55701)
- Replace `AutoNonVariableTypeMode` with `InferenceMode` in static runtime (#55731)
- Move tensor implicit conversions to `test_builtins.py` (#55532)
- Redesign the `Rfactor` loopnest transformation (#55324)
- Remove mask field from `Load` and `Store` classes (#55825)
- Switch type of `tensors_` from `Tensor` to `Buf` (#56318)
- Merge `ivalue::Future`'s `markCompleted` and `markCompletedWithDataPtrs` (#56512)
- Don’t lift tensor constants from fusion groups (#56756)
- Use `c10::ScalarType` instead of `tensorexpr::ScalarType` (#56825)
- Use JIT Plug-in for coverage to cover JIT’d functions and methods (#56310)
- Add all pools, Batchnorm and Tanh (i.e. all ideeped MKLDNN ops) to MKLDNNFuser (#56541)
- Inline hooks in `ivalue::Future` (#57354)
- Add a pass for annotating a graph with input types derived from sample inputs (#57076)
- Add a pass for removing the first (self) argument from a graph if it is unused (#57169)
- Remove `dtype_` and add `buf_` fields to `CodeGen::BufferArg` (#57382)
- Add tests for custom state_dict `save`/`load` methods in TorchScript (#57886)
- Add schema check to `aten::repeat` and `fb::fast_gather` (#58106)
- Rename `Tensor::call` to `Tensor::load` to be consistent with `Buf` and `Placeholder` (#55826)
Mobile
- Check in Gradle wrapper for easier pytorch_android builds (#51067)
torch.fx
- Hoist custom class .so loading into setUp (#52883)
- Test forward reference annotations (#53713)
- Add TestConstFold coverage to test_fx (#54072)
- Fix logic in TestFX.test_get_torch_func_signature_exhaustive (#54510)
- Test tracing into all the standard torch.nn.functional (#55550); a tracing sketch follows this list
- Add more model symbolic tracing tests from torchvision (#55744)
- Make stack trace testing less strict (#58088)
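As context for the symbolic tracing tests above, here is a minimal, illustrative sketch of tracing a module that calls into torch.nn.functional; `TinyNet` and its shapes are made up for this example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import fx

class TinyNet(nn.Module):
    """Toy module whose forward calls into torch.nn.functional."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        return F.relu(self.linear(x))

traced = fx.symbolic_trace(TinyNet())
print(traced.graph)                     # recorded ops, including a call_function node for F.relu
print(traced(torch.randn(2, 8)).shape)  # the traced GraphModule is still callable: torch.Size([2, 4])
```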
Quantization
- Make bundled inputs work with quantized zero inputs (#47407)
- Call native resize_/resize_as_ as much as possible (#53425)
- Use `expect_contiguous` in `quantized::linear` fbgemm version (#58221)
- Add pass in convert to fold quant-dequant sequence (#54860)
- Add support for one value being quantized with different qconfigs (#53586)
- Store dtype, axis as literals in the graph (#54624)
- Add `_remove_qconfig` flag to convert_fx (#53166)
- Get first linear use of quantize_per_tensor for FQN (#54859)
- Factoring out the list of no_observers (#50459)
- Enable test for non quantized input for add/mul (#52412)
- Guard the supported quantization type for add/mul (#52413)
- Enable test for non quantized input for cat (#52414)
- Merge add and mul handler (#52651)
- Refactoring binary op tests to split int8 and float16 tests (#52807)
- Refactoring binary op tests to split int8 and float16 tests (#52807) (#53020)
- Remove redundant code (#54073)
- Change activation_post_process_map to track the observer name instead (#54643)
- Separate handling Copy operator to a helper function (#54644)
- Factor out insert_observers_for_model to a separate function (#54733)
- Factor out insert_observers_for_model to a separate function (#54733) (#55307)
- Separate handling Copy operator to a helper function (#54644) (#55429)
- Add shape to nontensor op list (#55529)
- fx quant:
- clean up nit in insert_observer (#57367)
- readability improvements on observer functions (#57368)
- move output obs logic to QuantizeHandler (#57377)
- move input_output_observed to qhandler (#57388)
- remove FixedQParamsOpQuantizeHandler from quantize.py (#57393)
- remove unnecessary quants arguments (#57399)
- remove `find_quants` from convert (#57402)
- refactor observer insertion (4f50fdc2a3)
- remove matching hack for binary qhandler (#57470)
- clean up names of quantize handlers (#53614)
- Benchmark for torch.ops.quantized.linear_prepack_fp16 operator (#52229)
- Add Per Tensor Quantization Support to FXIRImporter (#55405)
- Hide warnings for deprecated quantization APIs (#56291)
- Remove “Sparsity” from the function names (#56555)
ONNX
- cmake: fix ONNX_NAMESPACE if USE_SYSTEM_ONNX (#54973)
- Link onnx_library when BUILD_TEST=0 for Windows (#51937)
- Fix onnx/constant_fold.cpp compilation on Windows (#55770) (#56167)