PyTorch 1.9 dev release notes

The PyTorch 1.9 release contains quite a few commits that are not user-facing but are of interest to people compiling from source or developing low-level extensions for PyTorch. Here is a non-exhaustive list of the most important ones.

Python API

  • Added cpu_kernel_multiple_outputs to help developers implement new torch functions that return two or more tensors conveniently (#51097)
  • Support auto generation of device check (#56872)
  • Fix bug in self.assertExpectedInline (#55149)
  • Protect destructors of Python bindings that can be kept alive by C++ objects (#57488)
  • Testing related commits:
    • See the Developer Wiki article “Writing tests in PyTorch 1.9” for details on significant testing improvements
      • A prototype torch.testing module (see its documentation here) has been added to facilitate testing libraries built using PyTorch
        • It currently has one function, torch.testing.assert_close, which can be useful when comparing PyTorch tensors (add your feedback to its RFC here!)
        • Request more features by filing issues on PyTorch’s GitHub
    • OpInfo coverage and testing continues to expand!
      • test_ops.py now verifies the out= argument works correctly for operators with OpInfos (#53259)
      • OpInfos now support sample inputs with tensorlist arguments (#54922)
      • OpInfos can now be wrapped in a lambda for grad and gradgrad checks (#54914)
      • OpInfos can now handle sample inputs where the input is broadcast (#55771)
    • [ROCm] Setting TEST_WITH_ROCM now skips tests that don’t use GPUs (#55069)
    • A new test case method, assertWarnsOnceRegex, can be used to test warnings that are usually thrown only once per process (#52387)
    • make_tensor(), the test suite’s go-to mechanism for constructing a random tensor, now supports a discontiguous kwarg (#51985)
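
As a rough illustration of how torch.testing.assert_close can be used in a test, the sketch below wraps it in a small helper. The helper name tensors_match and the tolerance values are assumptions chosen for this example, not part of the release; only torch.testing.assert_close itself comes from the notes above.

```python
import torch


def tensors_match(actual, expected, rtol=1e-5, atol=1e-8):
    """Hypothetical helper: True if assert_close accepts the pair, else False."""
    try:
        # rtol and atol are passed together; the values here are illustrative.
        torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
        return True
    except AssertionError:
        return False


expected = torch.tensor([1.0, 2.0, 3.0])
identical = expected.clone()   # same values, so the check should pass
shifted = expected + 1e-2      # far outside the tolerances above

close_result = tensors_match(identical, expected)
far_result = tensors_match(shifted, expected)
```

When the check fails, assert_close raises an AssertionError describing the mismatch, which is usually what a test wants to surface directly instead of converting it to a boolean as done here.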

Distributed

  • torch.distributed.rpc: Adds a parameter server benchmark for RPC to torch/benchmarks/distributed. (#57454)
  • torch.distributed.nn.RemoteModule: Improve typing for RemoteModule (#58012)
  • torch.distributed.rpc: Assert that GIL is not held in blocking destructors in RPC (#57030)
  • Add logging when store_based_barrier succeeds (#57711)
  • torch.distributed.rpc: Allow specifying a set of devices for CUDAFuture (#56515)
  • torch.distributed.nn.RemoteModule: Replace Python Pickler with internal RPC pickler for RemoteModule (#58019)
  • torch.distributed.rpc: Make CUDAFuture handle any kind of device type (#57051)
  • torch.distributed: Remove deprecated use of torch.LongTensor, torch.ByteTensor in distributed APIs (#55861)
  • torch.distributed: Join work clean up thread before aborting communicators (#55444)
  • DistributedDataParallel: Log use of uneven inputs API (#54919)
  • DistributedDataParallel: Deduplicate shared params before constructing Reducer in DDP (#53279)
  • torch.distributed: Log nccl_async_error_handling (#52965)
  • torch.distributed.rpc: Reduce logging verbosity in tensorpipe agent (#51784, #51785)
  • torch.distributed: Log nccl debug level in ProcessGroupNCCL (#52803)
  • torch.distributed.rpc: make pickler/unpickler pluggable in RPC (#53050)
  • torch.distributed: make the pickler in distributed_c10d pluggable (#53060)
  • DistributedDataParallel: log newly added construction and runtime stats at randomly selected iterations (#51394)
  • torch.distributed.rpc: Fix flaky TestTrainingLoop - TestE2ETensorPipe (#51939)
  • DistributedDataParallel: Ensure local_used_maps_tmp is distinct from local_used_maps_[i] (#54474)
  • DistributedDataParallel: Declare NamedTuple at top level to fix typing (#53273)
  • torch.distributed: Combine backtrace print in test logging into one string to avoid interleaving (#56961)
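
To give a feel for what a "pluggable" pickler means in the items above, here is a minimal standard-library sketch. All names in it (set_pickler, dumps, loads, CountingPickler) are hypothetical and chosen for illustration; the actual hooks inside torch.distributed are internal and may look different.

```python
import io
import pickle

# Hypothetical module-level hooks; the defaults are the standard classes.
_pickler_cls = pickle.Pickler
_unpickler_cls = pickle.Unpickler


def set_pickler(pickler_cls, unpickler_cls):
    """Swap in custom pickler/unpickler classes for later dumps/loads calls."""
    global _pickler_cls, _unpickler_cls
    _pickler_cls, _unpickler_cls = pickler_cls, unpickler_cls


def dumps(obj):
    """Serialize obj with whichever pickler class is currently plugged in."""
    buf = io.BytesIO()
    _pickler_cls(buf).dump(obj)
    return buf.getvalue()


def loads(data):
    """Deserialize bytes with whichever unpickler class is plugged in."""
    return _unpickler_cls(io.BytesIO(data)).load()


class CountingPickler(pickle.Pickler):
    """Example plug-in that counts how many top-level objects were dumped."""
    dumped = 0

    def dump(self, obj):
        CountingPickler.dumped += 1
        super().dump(obj)


set_pickler(CountingPickler, pickle.Unpickler)
round_tripped = loads(dumps({"rank": 0, "payload": [1, 2, 3]}))
```

The point of such a design is that callers keep using the same dumps/loads entry points while the serialization strategy is replaced behind them.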

torch.nn

  • Reenable test_nn tests for Windows (#52051)
  • Replace type().backend() with device() (#52558)
  • Remove annoying warnings from common_nn.py (#55982)
  • Fix __torch_function__ tests. (#54492)
  • Fixes new tf32 failures in test_nn.py (#52871)
  • Enable test cases in test_nn.py for ROCm (#52836)
  • Fix compiler warnings from conv.h (#56181)
  • Update upsample tests in test_nn.py to test for memory_format (#53665)
  • Lowering NLLLoss/CrossEntropyLoss to ATen code (#53789)
  • Refactor multi_head_attention_forward (#56674)
  • Convert type annotations in nn/functional.py to py3 syntax (#53656)
  • Migrates some of test_nn.py from assertEqualIgnoreTypes to assertEqual (#57642)
  • Removes unused RReLU code (#57672)
  • Disable TestComplexity.test_nn_module_test in fbcode (#56677)
  • Make convolution_overrideable default implementation raise NotImplementedError (#54707)
  • Remove ddp_gpu_size field from SyncBatchNorm (#55946)
  • Remove _specify_ddp_gpu_num method from SyncBatchNorm (#56425)
  • Check exception messages in embedding_bag_proxy unit test (5a1191d050)
  • Remove legacy constructor calls from _torch_ folder. (#53889)
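
For readers unfamiliar with the annotation change mentioned above, the sketch below contrasts comment-style type hints with Python 3 annotation syntax. The function scale is a made-up example and is not taken from nn/functional.py.

```python
from typing import Optional


def scale_legacy(x, factor=None):
    # type: (float, Optional[float]) -> float
    # Comment-style hints are visible to type checkers but not at runtime.
    return x * (1.0 if factor is None else factor)


def scale_py3(x: float, factor: Optional[float] = None) -> float:
    # Python 3 syntax stores the hints in the function's __annotations__.
    return x * (1.0 if factor is None else factor)


legacy_hints = scale_legacy.__annotations__
py3_hints = scale_py3.__annotations__
```

Both functions behave identically; only the second exposes its types to tools that inspect the function object at runtime.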

C++ Frontend

  • Add NoOpDeviceGuardImpl (#53142)
  • Lower ReLU6 to ATen (#52723)
  • Prevent VS from emitting ambiguous symbol errors (#53490)
  • Devirtualize TensorImpl is_contiguous (#55333)
  • Update expand_size API to match expand_inplace (#55246)
  • Put llvmMathExtras in c10 namespace (#55886)
  • Move flatten_dense_tensors and unflatten_dense_tensors to Native (#58006)

Autograd

  • Forward AD: Added systematic testing via gradcheck and OpInfos (#57633, #57701)
  • torch.autograd.gradcheck: fast_mode is now enabled by default for tests (#55699, #55237)
  • Update autograd kernels and tracing codegen to use redispatch API (#51363, #52009)
  • Move view handling logic to gen_inplace_or_view_type.py (#53341)
  • Add getters for attributes on autograd Node (#55225, #53205, #56499, #52451)
  • Eliminate global usage of torch.set_default_dtype in test_autograd (#56446)
  • Use _WeakTensorRef over weakref in test_autograd.py (#55726)
  • Move view and inplace handling to a separate key (#53342)
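
The fast_mode option of torch.autograd.gradcheck mentioned above can also be passed explicitly. The following is a small sketch; the function cubic and the input size are arbitrary choices for illustration.

```python
import torch


def cubic(x):
    # Simple differentiable function with a scalar output.
    return (x ** 3).sum()


torch.manual_seed(0)
# gradcheck compares numerical and analytical gradients, so double
# precision inputs are used to keep finite-difference error small.
x = torch.randn(4, dtype=torch.double, requires_grad=True)

slow_ok = torch.autograd.gradcheck(cubic, (x,), fast_mode=False)
fast_ok = torch.autograd.gradcheck(cubic, (x,), fast_mode=True)
```

Both calls return True when the gradients agree; the fast variant is intended to do less work per check, which matters most for functions with many inputs and outputs.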

Complex Numbers

  • Added complex support for torch.testing.assert_(equal|close) (#57162).
  • Fixed NVCC related build warnings for complex operations in PyTorch (#55142).
  • Add eager and jit variant consistency tests for torch.cfloat tensor type (#54854).
  • Fixed complex mean and reduction tests that weren’t being properly run (#55640).
  • [ROCm] Added missing template declarations for complex BLAS (#52472).

CUDA

  • Kernel launch checks for aten/src/ATen (#52185)
  • Add more kernel launch checks (#53286)
  • Final kernel launch checks (#54214)
  • Fix nvcc warnings (#55367)
  • irange for Indexing.cu (#57479)
  • reduce number of randperm template instantiations (#58362)
  • Enforce kernel launch checks (#58116)
  • fix comments in ATenNVRTC.h (#57318)

AMD

  • Generalize HIP-specific launch bounds to apply to CUDA (#56143)

Composability

  • Dispatcher passes computed dispatch keys to kernels (#49354)
  • Add TORCH_CHECK_NOT_IMPLEMENTED/c10::NotImplementedError; make dispatch use it (#53377)
  • Refactor tensor_new.cpp to use TensorOptions instead of DispatchKey (#54034)
  • Add Tensor::is_cpu, genericize TensorIterator (#54079)
  • Migrate about 100 kernels to C10 full dispatcher (#54109)
  • Rename XPLAT_MOBILE_BUILD to TEMPLATE_SELECTIVE_BUILD (#54217)
  • Migrate kernels with Tensor? to C10 full dispatcher (#54263)
  • Delete all unnecessary singular Math entries (#54436)
  • Rename Math to CompositeImplicitAutograd (#54466)
  • Rename DefaultBackend to CompositeExplicitAutograd (#54470)
  • Migrate kernels with TensorOptions to C10 full dispatcher (#54539)
  • Expose ops present in dispatcher via Dispatcher::getAllOpNames() (#54791)
  • Make redispatch functions callable from out of tree extensions (#54966)
  • Remove use_c10_dispatcher option (#54969)
  • Provide a method ObservedOperators::getUnobservedOperatorList() so that model tracer can empty it out during tracing (#55017)
  • Support needsOutputs for RecordFunction and ObserverUtil improvements (#55012)
  • Strict typecheck all files in tools/codegen (#55227)
  • Add MaybeOwned::operator*() && (#55244)
  • Allow copy operations on MaybeOwned (#55419)
  • Remove non-const TensorIterator::tensor() method (#55420)
  • Make as_strided_ use const ref for mutable tensors (#55875)
  • Generate xla codegen in-tree (#56601)
  • HABANA Device registration key and Autograd key addition (#57094)
  • Refactor autocast to be extensible for devices (#57104)
  • Add pybind type caster for c10::Device (#57292)
  • Make c10::TempFile non-copyable but movable (#57308)
  • Fix string_view::equals_ compilation by CUDA-11.3 (#57322)
  • Delete move constructor on TensorImpl (#58048)
  • structured kernels - error check when structured_delegate is not marked structured (#52227)
  • fix RegistrationDeclarations.yaml, now that we codegen composite kernels for structured functional/inplace ops (#56307)

TorchScript

  • Remove output_args from ReduceOp (#52187)
  • Remove ReduceOp::accumulator (#52196)
  • Fix memory dependencies computation to not look at reduction output args (#52170)
  • Update rfactor to not use ReduceOp->output_args() (#52177)
  • Add an initialization expression to Reduce() (#53751)
  • Add IRVerifier (#52901)
  • Add index verifier for Store (#53137)
  • Add new APIs to get loops corresponding to a Buf (#53778)
  • Remove Dropout during frozen optimization (#51589)
  • Add pure list-producing ops to alias analysis (#51999)
  • Remove DepTracker from LoopNest (#52405)
  • Use graph executor to run forward on a gradient (#52136)
  • Support casted_batch_one_hot_lengths with 4-arg to (#53215)
  • Enable ClipRangesGatherRangesX2SigridHash fusion for SigridHashPrecompute (#53324)
  • Convert to to to_copy (#53524)
  • Fuse SigridTransforms + ListUnpack (#53920)
  • Use reshape when possible in broadcasting (#53326)
  • Lazily initialize AliasDb constant prop (#54640)
  • Fix freezing with MKLDNN tensors (#54632)
  • Add EliminateExceptions pass (#54730)
  • Lazily initialize AliasDb and add changed status to CSE (#54776)
  • Make transformations return whether graph is modified (#54777)
  • Change resize_as_ to resize_ (#55098)
  • Update to short forms of splitWithTail / splitWithMask (#55542)
  • Patch requires_grad on DifferentiableGraph (#55701)
  • Replace AutoNonVariableTypeMode with InferenceMode in static runtime (#55731)
  • Move tensor implicit conversions to test_builtins.py (#55532)
  • Redesign Rfactor loopnest transformation. (#55324)
  • Remove mask field from Load and Store classes (#55825)
  • Switch type of tensors_ from Tensor to Buf (#56318)
  • Merge ivalue::Future's markCompleted and markCompletedWithDataPtrs (#56512)
  • Don’t lift tensor constants from fusion groups (#56756)
  • Use c10::ScalarType instead of tensorexpr::ScalarType (#56825)
  • Use JIT Plug-in for coverage to cover JIT’d functions and methods (#56310)
  • Add all pools, Batchnorm and Tanh (i.e. all ideeped MKLDNN ops) to MKLDNNFuser (#56541)
  • Inline hooks in ivalue::Future (#57354)
  • Add a pass for annotating a graph with input types derived from sample inputs (#57076)
  • Add a pass for removing a first (self) argument from a graph if it is unused (#57169)
  • Remove dtype_ and add buf_ fields to CodeGen::BufferArg. (#57382)
  • Add tests for custom state_dict save/load methods in TorchScript (#57886)
  • Add schema check to aten::repeat and fb::fast_gather (#58106)
  • Rename Tensor::call to Tensor::load to be consistent with Buf and Placeholder. (#55826)
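
One reason it helps for transformations to report whether they modified the graph is that a driver can rerun passes until nothing changes. The toy sketch below shows that pattern on a plain list of op names; the passes and the driver are hypothetical and are not the TorchScript implementation.

```python
def remove_identity(ops):
    """Toy pass: drop 'identity' ops in place; report whether anything changed."""
    before = len(ops)
    ops[:] = [op for op in ops if op != "identity"]
    return len(ops) != before


def fold_double_neg(ops):
    """Toy pass: remove the first adjacent 'neg', 'neg' pair, if any."""
    for i in range(len(ops) - 1):
        if ops[i] == "neg" and ops[i + 1] == "neg":
            del ops[i:i + 2]
            return True
    return False


def run_until_fixed_point(ops, passes):
    """Rerun all passes until a full round reports no modification."""
    rounds = 0
    changed = True
    while changed:
        changed = False
        for run_pass in passes:
            # Call the pass first so it always runs, then merge its flag.
            changed = run_pass(ops) or changed
        rounds += 1
    return rounds


graph = ["neg", "identity", "neg", "relu", "identity"]
rounds = run_until_fixed_point(graph, [fold_double_neg, remove_identity])
```

Here removing the identity op exposes a double negation that the other pass can fold in the next round, which is exactly the kind of interaction a "changed" flag lets the driver detect.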

Mobile

  • Check in Gradle wrapper for easier pytorch_android builds. (#51067)

torch.fx

  • Hoist custom class .so loading into setUp (#52883)
  • Test forward reference annotations (#53713)
  • Add TestConstFold coverage to test_fx (#54072)
  • Fix logic in TestFX.test_get_torch_func_signature_exhaustive (#54510)
  • Test tracing into all the standard torch.nn.functional (#55550)
  • Add more model symbolic tracing tests from torchvision (#55744)
  • Make stack trace testing less strict (#58088)
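
As a small illustration of what tracing into torch.nn.functional looks like, the sketch below symbolically traces a function that calls F.relu and inspects the recorded call targets. The function relu_plus_one is a made-up example, not one of the tests added in this release.

```python
import torch
import torch.fx
import torch.nn.functional as F


def relu_plus_one(x):
    return F.relu(x) + 1


# symbolic_trace records the operations performed on a proxy input.
traced = torch.fx.symbolic_trace(relu_plus_one)

# Collect the callables recorded as call_function nodes in the graph.
call_targets = [
    node.target for node in traced.graph.nodes if node.op == "call_function"
]

x = torch.tensor([-1.0, 0.5])
same_output = torch.equal(traced(x), relu_plus_one(x))
```

The traced module should compute the same result as the original function, and the functional call appears in the graph as a node whose target is F.relu.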

Quantization

  • Make bundled inputs work with quantized zero inputs (#47407)
  • Call native resize_/resize_as_ as much as possible (#53425)
  • Use expect_contiguous in quantized::linear fbgemm version (#58221)
  • Add pass in convert to fold quant-dequant sequence (#54860)
  • Add support for one value being quantized with different qconfigs (#53586)
  • Store dtype, axis as literals in the graph (#54624)
  • add _remove_qconfig flag to convert_fx (#53166)
  • Get first linear use of quantize_per_tensor for FQN (#54859)
  • Factoring out the list of no_observers (#50459)
  • Enable test for non quantized input for add/mul (#52412)
  • Guard the supported quantization type for add/mul (#52413)
  • Enable test for non quantized input for cat (#52414)
  • Merge add and mul handler (#52651)
  • Refactoring binary op tests to split int8 and float16 tests (#52807, #53020)
  • Remove redundant code (#54073)
  • Change activation_post_process_map to track the observer name instead (#54643)
  • Factor out insert_observers_for_model to a separate function (#54733, #55307)
  • Separate handling Copy operator to a helper function (#54644, #55429)
  • Add shape to nontensor op list (#55529)
  • fx quant:
    • clean up nit in insert_observer (#57367)
    • readability improvements on observer functions (#57368)
    • move output obs logic to QuantizeHandler (#57377)
    • move input_output_observed to qhandler (#57388)
    • remove FixedQParamsOpQuantizeHandler from quantize.py (#57393)
    • remove unnecessary quants arguments (#57399)
    • remove find_quants from convert (#57402)
    • refactor observer insertion (4f50fdc2a3)
    • remove matching hack for binary qhandler (#57470)
    • clean up names of quantize handlers (#53614)
  • Benchmark for torch.ops.quantized.linear_prepack_fp16 operator (#52229)
  • Add Per Tensor Quantization Support to FXIRImporter (#55405)
  • Hide warnings for deprecated quantization APIs (#56291)
  • Remove “Sparsity” from the function names (#56555)

ONNX

  • cmake: fix ONNX_NAMESPACE if USE_SYSTEM_ONNX (#54973)
  • Link onnx_library when BUILD_TEST=0 for Windows (#51937)
  • Fix onnx/constant_fold.cpp compilation on Windows (#55770) (#56167)

Misc

  • Updated PyBind to official v2.6.2 tag (#52304)
  • Added gdb special command to print tensors (#54339)
  • Numpy dependency is now only checked when using Numpy features (#52794)
  • don’t set the same C++ and C standards twice (#51832)
  • Fix cmake_minimum_required in libshm (#58306)
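
The idea of checking an optional dependency only when a feature needs it can be sketched with the standard library alone. The helper names below (require_optional, mean_plain, mean_with_optional) are hypothetical and only illustrate the pattern; they are not how PyTorch performs its NumPy check.

```python
import importlib


def require_optional(module_name):
    """Import an optional module on demand, with a clear error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is required for this feature"
        ) from exc


def mean_plain(values):
    # Needs no optional dependency, so it never triggers the check.
    return sum(values) / len(values)


def mean_with_optional(values, module_name="numpy"):
    # The dependency is looked up only when this feature is actually used.
    mod = require_optional(module_name)
    return float(mod.mean(values))


plain_result = mean_plain([1.0, 2.0, 3.0])
```

Code paths that never call mean_with_optional work even when the optional module is not installed.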