State of symbolic shapes branch

State of symbolic shapes branch: Oct 2 edition

The symbolic shapes branches (PyTorch: Symbolic shapes by ezyang · Pull Request #84246 · pytorch/pytorch · GitHub ; torchdynamo: [WIP branch] symbolic shape hacking by Chillee · Pull Request #1180 · pytorch/torchdynamo · GitHub ) are long-running branches in PyTorch/torchdynamo containing a large number of features and bug fixes related to dynamic shapes support in PyTorch. Previous update: State of symbolic shapes branch

Commit IDs at time of writing: pytorch eea837d4fa2f287c6d8633387c5fa7e7ed8dc9e9; torchdynamo 3ddb46e873c2bdd1c59217a128b9b2b7af8696fe (unchanged)

Executive summary

We made a lot of progress:

Runbooks:

Previous branch diff: 60 files changed, 1464 insertions(+), 389 deletions(-)
Current branch diff: 41 files changed, 1044 insertions(+), 448 deletions(-)

Retrospective on merge to master

What were unexpected complications when merging PRs to master?

  • Enable convolution_backward with bias and symints required adding support for another SymInt type (SymInt[]?). Fortunately, this was relatively symmetric with the other type support we’ve added, but it took Nick a day or so to do the first pieces, and there were two more subtle bits that he didn’t get right on the first try. First, aten/src/ATen/core/jit_type.h needs to be handled with care: whenever a SymInt-including type (or container type) would be written for getTypePtr_, you must instead implement it on getMaybeFakeTypePtr_ and ensure you either recursively pass on the fake template argument, or resolve it to int (if fake is true) or SymInt (if fake is false); a toy sketch of this pattern appears after this list. This lets us maintain the illusion for TorchScript that SymInt C++ types turn into conventional int-only schemas. Second, aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h needed another case to allow an OptionalArray to be provided by an int-list IValue (covariant typing). We hope that this is the very last C++ type we need to add for SymInts: we now have SymInt, SymInt[], SymInt? and SymInt[]?, which covers all used permutations of optional and array types on SymInt (there are no uses of int?[] in native_functions.yaml).
  • Consistent compute numel/contiguous strategy with SymInts involved refactoring some code, and the refactor introduced an unsigned underflow bug: previously we had the expression int64_t i = dim() - 1, where dim() returned an int64_t; this was refactored into int64_t i = sizes.size() - 1, where sizes.size() returns a size_t (an unsigned type). When the size is zero, the subtraction underflows to a very large unsigned number, which then triggers signed conversion UB on assignment (the very large unsigned value is not representable as a signed integer). The UB appears to resolve to a negative number as desired on most platforms, but not on Android. It is not clear why UBSAN did not catch the error on conventional platforms. This suggests we need more unit testing of TensorImpl (not done yet). A minimal repro sketch follows this list.
  • Revert “Revert “Symintifying slice ops (#85196)”” was reverted for breaking Executorch fbcode-only builds. This is because Executorch was translating the SymInt JIT type into the SymInt C++ type, but the Executorch runtime does not (and should not) support SymInt. This was fixed by refactoring the codegen to offer an “int-only codegen mode” (by passing a symint=False argument to the relevant functions) and then having Executorch opt out of SymInt codegen.
  • Revert “Revert “Symintified mmm/addmm derivative formulas (#85794)”” was reverted because it broke Mac AOTAutograd tests. It was resolved by marking linalg.lu_solve as flaky, but it’s not clear what the root cause was.
  • removed compile cache and static argnums didn’t cause any problems on the way in, but we got complaints from users that they were relying on the cache, and so removing it broke their code, e.g., Functorch memory_efficient_fusion gives wrong output batch size · Issue #86020 · pytorch/pytorch · GitHub
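To make the jit_type.h rule above concrete, here is a toy, self-contained sketch of the pattern. The name getMaybeFakeTypePtr_ comes from the discussion above, but the TypePtr machinery is replaced with plain strings; treat it as an illustration of the recursion on the fake flag, not as the actual PyTorch code:

```cpp
// Toy model of the getMaybeFakeTypePtr_ pattern (assumed structure, not the
// real aten/src/ATen/core/jit_type.h implementation).
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct SymInt {};  // stand-in for c10::SymInt

// Primary template: map a C++ type to a schema type name, parameterized by
// a `fake` flag.
template <typename T, bool fake>
struct getMaybeFakeTypePtr_;

// Ordinary types ignore the flag.
template <bool fake>
struct getMaybeFakeTypePtr_<int64_t, fake> {
  static std::string call() { return "int"; }
};

// SymInt is where the flag matters: when fake, TorchScript sees a plain
// int-only schema; when not fake, it sees SymInt.
template <bool fake>
struct getMaybeFakeTypePtr_<SymInt, fake> {
  static std::string call() { return fake ? "int" : "SymInt"; }
};

// Container types must recursively forward `fake` to their element type --
// this is the subtle bit called out in the bullet above.
template <typename T, bool fake>
struct getMaybeFakeTypePtr_<std::vector<T>, fake> {
  static std::string call() {
    return getMaybeFakeTypePtr_<T, fake>::call() + "[]";
  }
};

int main() {
  std::cout << getMaybeFakeTypePtr_<std::vector<SymInt>, true>::call() << "\n";   // int[]
  std::cout << getMaybeFakeTypePtr_<std::vector<SymInt>, false>::call() << "\n";  // SymInt[]
}
```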
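And here is a minimal, self-contained repro of the unsigned underflow described in the numel/contiguous bullet. It assumes the general shape of the refactored loop initializer; it is not the actual TensorImpl code:

```cpp
// Minimal sketch of the size()/dim() refactoring bug.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  std::vector<int64_t> sizes;  // zero-dimensional tensor: no sizes

  // Old pattern: dim() returned int64_t, so 0 - 1 == -1 and the backwards
  // loop over dimensions simply never executes.
  int64_t dim = static_cast<int64_t>(sizes.size());
  for (int64_t i = dim - 1; i >= 0; i--) {
    std::cout << "old loop visits " << i << "\n";
  }

  // Refactored pattern: sizes.size() is size_t (unsigned), so size() - 1
  // wraps around to SIZE_MAX before the assignment; converting that
  // out-of-range value back to int64_t is what behaved differently across
  // platforms (it happened to come out as -1 on most, but not on Android).
  int64_t i_refactored = sizes.size() - 1;
  std::cout << "refactored initializer: " << i_refactored << "\n";

  // One portable fix: convert to a signed type before subtracting.
  int64_t i_fixed = static_cast<int64_t>(sizes.size()) - 1;
  std::cout << "fixed initializer: " << i_fixed << "\n";
  return 0;
}
```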

Though not directly caused by the work on symbolic shapes, Edward had to spend some time this week fixing an iOS SEV caused by insufficient CI for the Metal backend: Insufficient CI for metal backend · Issue #84172 · pytorch/pytorch · GitHub. This is relevant to us because we caused a similar iOS SEV by changing the signature of C++ functions; however, we have since made changes that no longer require us to update C++ signatures, reducing the likelihood of a repeat SEV along these lines. There is also now a PR up to test Metal in OSS CI.

What should I expect to work?

The end-to-end command for BERT_pytorch is unchanged:

TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 python benchmarks/torchbench.py --only BERT_pytorch --accuracy --backend aot_eager --training

We recommend running with TORCH_SHOW_CPP_STACKTRACES=1. The other models listed in Operators that need symbolic support - Google Sheets are expected to work, but we do not have CI at the moment, so some minor regressions may occur during development.

What’s new in the branch this week?

As always, PRs that have been merged to master are listed only in the master section.

What’s made it to master this week?

PRs marked with X have discussion in “Retrospective on merge to master”.

What’s coming next?

  • Improved testing
    • Make sure we are testing meta functions that aren’t registered to the dispatcher
    • Improve stride testing coverage
  • More merging to master.
  • Get min-cut partitioner working with symbolic shapes (we are in better shape thanks to @wconstab but there are still problems)
  • Full operator coverage for all benchmark models (as infra is nearly all landed on master, it is a good time to start also merging these operators into master)
  • Consolidated guard strategy across torchdynamo/AOTAutograd/torchinductor (@Chillee is probably going to work on this)
  • Fallback implementation for custom operators without symbolic shape propagation, inferred by running fallback on real operators (can be done later)