State of symbolic shapes branch: Dec 1 edition (eve of PyTorch Conference)
The symbolic-shapes branch (PyTorch: Symbolic shapes by ezyang · Pull Request #84246 · pytorch/pytorch · GitHub) is a long-running branch containing a large number of features and bugfixes related to dynamic shapes support in PyTorch. Previous update: State of symbolic shapes branch - #18 by ezyang
Commit ID at time of writing: a05b7b1c73247ff562a82aac0edca79bbaebc2bd
Executive summary
It is the eve of the PyTorch Conference and we have been busy getting things ready for some big announcements. Before and after Thanksgiving, many folks involved with dynamic shapes were deputized to help fix some major release blockers in the general compiler workstream; Brian and Jane landed all of the pieces needed to properly update batch norm running stats, and Alban and Edward found and fixed some more fairly major AOTAutograd bugs. On the dynamic shapes front, Voz has been steadily working on getting all of the Dynamo changes passing CI on master; half of the preparatory changes have been landed so far, and the branch has been resync’ed after those merges. There is some regression in the aot_eager pass rate as we remove hacks and redo fixes properly.
- Lazily guarding for duck sizing and views - Google Docs is our plan (Voz + Edward, with some assistance from Horace) for dealing with the major outstanding soundness bug in dynamic shapes: we are not guarding on duck sizing (see the toy sketch after this list). Edward’s previous attempt, Total revamp of how ShapeEnv symbol allocation works by ezyang · Pull Request #89695 · pytorch/pytorch · GitHub, is hopelessly broken on master, so we are redoing it more incrementally with the new game plan. This has involved quite a lot of Dynamo refactoring.
- Batch norm running stats are now properly updated after first draft of input mutation handling for aot autograd and fixes for inductor <> batch norm. Mark Saroufim has confirmed this resolves training instability when using inductor (turns out, batch norm running stats matter, whodathought); a short illustration of why those stats count as mutated graph inputs follows this list. Alban posted a really nice fix, Fix CopySlices logic to ensure wrapped node runs properly., for a bug that took Voz and Ed two days to track down. But there are still more bugs: AOTAutograd and internal grad_fn structure - Google Docs
- Model training status on symbolic-shapes. See also Symbolic shapes work items tracker - Google Sheets
- OpInfo tests on symbolic shapes.
  - pytest test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - TODO
  - pytest test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - TODO
- Previous branch diff: 68 files changed, 2612 insertions(+), 554 deletions(-)
- Current branch diff: 68 files changed, 1440 insertions(+), 290 deletions(-)
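To make the duck sizing soundness problem concrete, here is a toy sketch. `ToyShapeEnv` and its methods are made up for illustration and are not the real ShapeEnv API (which lives in torch.fx.experimental.symbolic_shapes); the point is only that reusing a symbol for two dimensions that happen to have the same concrete size is an assumption the compiled code depends on, and without a guard that assumption goes unchecked on later calls.

```python
import sympy

# Toy illustration only -- not the real ShapeEnv.
class ToyShapeEnv:
    def __init__(self):
        self.size_to_symbol = {}  # duck sizing: same concrete size -> same symbol
        self.guards = []          # equalities the compiled code now depends on

    def create_symbol(self, source, hint):
        if hint in self.size_to_symbol:
            # Duck sizing reuses the existing symbol. The traced graph treats the
            # two dimensions as interchangeable, so we must guard that they are
            # actually equal the next time the compiled code runs.
            sym, orig_source = self.size_to_symbol[hint]
            self.guards.append(f"{source} == {orig_source}")
            return sym
        sym = sympy.Symbol(f"s{len(self.size_to_symbol)}", positive=True, integer=True)
        self.size_to_symbol[hint] = (sym, source)
        return sym

env = ToyShapeEnv()
s_x = env.create_symbol("x.size(0)", 8)   # fresh symbol s0
s_y = env.create_symbol("y.size(0)", 8)   # duck-sized to s0, guard recorded
assert s_x is s_y
print(env.guards)  # ['y.size(0) == x.size(0)'] -- the guard the branch is currently missing
```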
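And as a reminder of why the batch norm fix matters for a compiler: in training mode, BatchNorm mutates its `running_mean`/`running_var` buffers as a side effect of the forward pass, and (roughly speaking) those buffers show up as just more inputs to the functionalized graph AOTAutograd traces, so the mutation has to be written back to the original tensors or eval-mode behavior silently diverges from eager. A minimal repro of the eager-mode behavior, using plain PyTorch with no compiler involved:

```python
import torch

bn = torch.nn.BatchNorm1d(4)          # fresh module: training mode, running_mean == 0
before = bn.running_mean.clone()

x = torch.ones(8, 4)
bn(x)  # training-mode forward mutates running_mean / running_var in place

assert not torch.allclose(bn.running_mean, before)
# If a compiled, functionalized graph never writes this update back to the
# original buffers, eval mode (which uses the running stats) quietly drifts
# away from eager -- the training instability Mark observed.
```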
What’s new on the branch these past two weeks?
Metas/decompositions
- Don’t decompose copy (sic) ezyang
Infrastructure
- Make aten.copy preserve strides (hf_Longformer) ezyang
- Retrace backwards in new shape environment ezyang
- Handle some edge cases regarding constant nodes and None SymInt tangents ezyang
- Suppress guards on as_strided call only. ezyang
Debug interpreter
- Make DebugInterpreter work with varying dynamic shapes ezyang
- Properly handle out of order bindings ezyang
- Minor QOL improvement for deferred equality tests ezyang
Dynamo
- Add guard_source for RandomValueSource ezyang
- Don’t use explain() for --explain; instead read it off the counters ezyang
- Remove fake_tensors_available ezyang
- Cond capture with fake tensors actually works; don’t raise in this case ezyang
- Support unspecialized integers with dynamic shapes ezyang
- Easy: These tests work with fake_tensor_propagation on ezyang
- Force test_rng_state to run with fake tensor prop ezyang (nb: was reverted)
- Run optimizer tests with fake tensors ezyang
- Reenable fake_tensor_propagation on test_cudnn_rnn ezyang
- Graph break on torch.tensor failure, allowing maml to run with fake t… ezyang
- Remove fake_tensor_propagation ezyang
- Delay verify correctness wrapping to call site. ezyang
- Don’t support kwargs at runtime in aot_module_simplified ezyang
- Simplify aot_module_simplified by removing top_args/top_kwargs ezyang
- Change aot_module_simplified to take take arguments directly ezyang
- Make aot_module_simplified accept fake tensors ezyang
- Use isinstance test rather than exact type test for wrap to fake ezyang
- Disable cache to restore accuracy ezyang
Inductor
- Sufficient to get inductor working on BERT_pytorch again ezyang
- [UPDATED PROTOTYPE] Use dynamo fake tensor mode in aot_autograd, move… voz/ezyang
- Restore the base fix, which fixes most of the missing symbol errors ezyang
- Restore enable_python_dispatcher on has_mutation analysis ezyang
- Restore RANDOM_VALUE fix ezyang
- Restore TENSOR_MATCH fix ezyang
- Restore stack tracking for sympy symbols ezyang
QOL
- Make log_extract.py able to deal with NotImplementedError ezyang
- print graph breaks by default ezyang
- Dashboard runner cmd anijain
Merge to master retrospective
- Reland “Add single process version of dynamo distributed hf_Bert tests (#89721)” - this got bounced because not enough tests ran on the PR. We added more files that automatically trigger inductor tests.
- Refactor how AOTAutograd backends are defined - this is just an example of a few cases where folks ran inductor CI, got an accuracy failure on a model, and then spent a bunch of time trying to debug what had happened, when in fact the failure was a preexisting master failure. It is not easy to identify these because ciflow/inductor does not run on every master commit.
- Change aot_module_simplified to take take arguments directly - this broke a timm model, and led us on a pretty big chase that eventually revealed that the example inputs being passed to backends did not have the correct requires_grad because they were being cloned (illustrated in the first sketch after this list). This was fixed by refactoring the AOTAutograd-Dynamo integration to not clone example inputs.
- Remove fake_tensor_propagation - this nearly got bounced because it broke some internal users who didn’t have fake tensor support for some operations. Averted because (1) their tests weren’t in CI and (2) it turned out to be pretty easy to add meta tensor support (see the second sketch after this list).
- Don’t unsafely clone autograd meta - this couldn’t be landed because it broke an inductor model, causing it to raise an error where previously it passed. This led to a very long debugging session by Alban until we finally nailed the problem.
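A hedged illustration of the requires_grad failure mode (plain PyTorch semantics, not the exact Dynamo code path): a defensive copy made via `detach()` drops `requires_grad`, and a backend like AOTAutograd decides whether to build a backward graph based on the `requires_grad` of the example inputs it is handed, so unfaithful example inputs can silently produce a forward-only compiled artifact.

```python
import torch

x = torch.randn(4, requires_grad=True)

faithful = x.clone()             # clone preserves requires_grad
unfaithful = x.detach().clone()  # detach-style copying drops it

print(faithful.requires_grad)    # True
print(unfaithful.requires_grad)  # False

# A backend that inspects example inputs to decide whether gradients flow
# would, given `unfaithful`, skip building the backward graph entirely --
# exactly the kind of silent wrong-graph behavior that made this hard to debug.
```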
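For context on why adding meta tensor support tends to be easy: a meta kernel only has to describe the output (shape, dtype, device) without touching data, which is all fake tensor propagation needs since it runs ops on meta-device tensors under the hood. Below is a minimal sketch using a made-up custom op `mylib::twice`; the actual fix registered meta implementations for existing internal aten operators, but the mechanism is the same.

```python
import torch

# Hypothetical custom op used purely for illustration.
lib = torch.library.Library("mylib", "DEF")
lib.define("twice(Tensor x) -> Tensor")

def twice_cpu(x):
    return x * 2

def twice_meta(x):
    # Meta kernel: only describes the output, no data computation.
    return torch.empty_like(x)

lib.impl("twice", twice_cpu, "CPU")
lib.impl("twice", twice_meta, "Meta")

# With a Meta kernel registered, shape propagation on meta/fake tensors can
# trace through the op instead of erroring out.
out = torch.ops.mylib.twice(torch.empty(3, device="meta"))
print(out.shape, out.device)  # torch.Size([3]) meta
```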
What’s made it to master this week?
ezyang
- Add manual meta implementations to quantize_per_tensor.tensor and co
- Guarantee symbol allocation for all sizes/strides/storage offset
- Add definitely_not_01 set to ShapeEnv.
- Reland “Add single process version of dynamo distributed hf_Bert tests (#89721)”
- Refactor how AOTAutograd backends are defined
- Don’t unsafely clone autograd meta
- Implement guard_source on RandomValueSource
- Beef up AOTAutograd logging with aot_id and input descriptions
- Use isinstance test rather than exact type test for wrap to fake
- Make aot_module_simplified accept fake tensors
- Change aot_module_simplified to take take arguments directly
- Simplify aot_module_simplified by removing top_args/top_kwargs
- Don’t support kwargs at runtime in aot_module_simplified
- Delay verify correctness wrapping to call site.
- Ablate _torchdynamo_orig_callable in wrap_compiler_fn
- Don’t suppress exceptions from backends
- Remove fake_tensor_propagation
- Support unspecialized integers with dynamic shapes
- Remove fake_tensors_available
- Access named parameters/buffers/etc via getattr rather than index
- Suppress guards on as_strided call only.
- Don’t use explain() for --explain; instead read it off the counters
- Add crossref debug mode for functionalization, catches stride errors
- Make aten.copy preserve strides (hf_Longformer)
- Don’t decompose copy (sic)
- Bind DispatchKey.Functionalize in pybind11
- Suppress guards when creating fake tensors
- When dealing with dupe arguments, prefer leafifying if possible
- Add debug asserts to AOTAutograd for input consistency with compilation
- Factor input deduplication into a separate function
bdhirsh
- don’t run input mutation analysis in dynamo
- fixes for inductor <> batch norm
- first draft of input mutation handling for aot autograd
anjali411
nkaretnikov
- [primTorch] Unify checks for embedding
- [primTorch] Add decomp for embedding_renorm_
- Symintify embedding
- Symintify select
voz
- Add simple assert to detect fake tensors on modules
- Fix try/except flow where DataDependentOutputException is getting wrapped in a RuntimeError
albanD
- Fix CopySlices logic to ensure wrapped node runs properly. and beef up inplace/view note on copy slices
What’s coming next?
- Land fake tensor propagation from Dynamo to AOTAutograd (voz)
- ShapeEnv revamp to get guards for duck sizing (ezyang)
- GuardEnv for non-shape related extra guards produced by AOTAutograd (voz)
- Address CI comments for AOTAutograd input mutation, factoring it to be more modular (bdhirsh)
- Proper inductor integration (Chillee didn’t end up working on it, unallocated; mildly blocked on ShapeEnv revamp)
Our north star:
- All benchmark models are passing aot_eager and inductor training on branch
- Fallback implementation for custom operators without symbolic shape propagation, inferred by running fallback on real operators
- All OpInfo tests passing
- Dynamic shapes on by default for developers / users