State of symbolic shapes branch: Oct 30 edition
The symbolic-shapes branch (PyTorch: Symbolic shapes by ezyang · Pull Request #84246 · pytorch/pytorch · GitHub) is a long-running branch containing a large number of features and bugfixes related to dynamic shapes support in PyTorch. Previous update: State of symbolic shapes branch - #8 by ezyang
Commit ID at time of writing: 121e8ebcc2fc50e5ca28cfb3ad437596084424e1
Executive summary
- Voz enabled propagation of symbolic shapes in dynamo (previously, symbolic shapes were only propagated in AOTAutograd), and this uncovered a large number of previously unknown bugs and coverage problems in TORCHDYNAMO_DYNAMIC_SHAPES=1 itself. The team plans to pivot to working on the torchdynamo codebase to help fix these problems.
- @SherlockNoMad found in Meta OpInfo Test for stride correctness by SherlockNoMad · Pull Request #87849 · pytorch/pytorch · GitHub that there are ~70 aten ops whose meta functions produce stride values that don’t match eager’s implementation. Usually this is the result of an incomplete Python meta function or decomposition (our unit tests were not asserting on stride correctness, and we didn’t have enough test cases for strided inputs). Sherlock compiled all the failures into this tracker sheet: Stride Mismatch Tracker - Google Sheets. Bugs on this sheet are open season for burn down (a minimal sketch of the kind of meta-vs-eager stride check involved appears after this list).
- Model training status on symbolic-shapes. (@ezyang) See also Operators that need symbolic support - Google Sheets (out of date).
- Model inference status on symbolic-shapes. (@ezyang)
  - inductor, with TORCHDYNAMO_DYNAMIC_SHAPES=1: 69 out of 177 (new) - logs
- Model inference status on master. (@ezyang)
  - This week is really bad, as we haven’t gotten all of the branch fixes onto master after the dynamo symbolic shapes fallout. With TORCHDYNAMO_DYNAMIC_SHAPES=1: 4 out of 177 (-23 WoW) - logs
- OpInfo tests on symbolic-shapes. (@ezyang)
  - pytest -v test/test_proxy_tensor.py -k test_make_fx_symbolic_exhaustive - 347 passed (+8 WoW), 374 failed (-7 WoW), 497 skipped (+1 WoW)
  - pytest -v test/functorch/test_aotdispatch.py -k test_aot_autograd_symbolic_exhaustive - 238 passed (+19 WoW), 246 failed (-18 WoW), 125 skipped (+0 WoW)
- Previous branch diff: 70 files changed, 1681 insertions(+), 430 deletions(-)
- Current branch diff: 110 files changed, 2788 insertions(+), 2114 deletions(-)
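To make the stride burn down concrete, here is a minimal sketch (not the actual harness from Pull Request #87849) of the kind of meta-vs-eager comparison involved; check_meta_strides is a hypothetical helper, not an existing API:

```python
# Minimal sketch: compare the metadata a meta kernel produces against eager
# for the same op. Non-contiguous inputs are what usually expose mismatches,
# because hand-written meta functions often assume contiguous outputs.
import torch

def check_meta_strides(op, *args):
    real_out = op(*args)
    meta_args = [a.to("meta") if isinstance(a, torch.Tensor) else a for a in args]
    meta_out = op(*meta_args)
    assert meta_out.shape == real_out.shape, (meta_out.shape, real_out.shape)
    assert meta_out.stride() == real_out.stride(), (meta_out.stride(), real_out.stride())
    assert meta_out.dtype == real_out.dtype

# aten.add is structured, so its meta kernel should agree with eager even for
# transposed (non-contiguous) inputs; many ops on the tracker sheet do not.
x = torch.randn(4, 6).t()
check_meta_strides(torch.ops.aten.add.Tensor, x, x)
```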
Notable bug fixes
- Dynamo’s dynamic shapes support is seriously buggy, and it doesn’t seem like there is a clear enough conceptual framework for how things should be implemented to make bug fixes simple. For example, voz in Symbolic shapes by ezyang · Pull Request #84246 · pytorch/pytorch · GitHub needed to write a page of code just to get dynamic size(i) method calls working. We intend to have a knowledge sharing session with Voz on Tuesday to get the team up to speed on how to approach dynamo bugs.
- We’re still fixing incorrect stride bugs. Delete incorrect and duplicate meta_add_ in particular was a symbolic-shapes-branch-only howler. It is still quite difficult to diagnose these without a minifier; ezyang proposed a mode where aot_eager checks that the real tensors have metadata consistent with the trace, but this isn’t implemented yet (a rough sketch of what such a check could look like appears after this list). Sherlock has a workstream for fixing strides based on the test suite: Stride Mismatch Tracker - Google Sheets. There is still a live problem with softmax: minimal repro gist:54f03e02fd36069bf9693ae2ab707d10 · GitHub
- In parallel to Sherlock’s burn down of stride bugs, @ezyang is investigating whether we can make incorrect stride bugs less severe by preserving reshapes during tracing. This is tricky to do, because we still have to do autograd, and autograd doesn’t directly support reshape (it sometimes returns a view and sometimes returns a fresh tensor). Our plan is to transform reshape into an always-copying operation (a minimal sketch of this idea appears after this list) and add enough extra sanity checking in functionalization to detect when this is not semantics preserving. To do this, we need to run functionalization and autograd at the same time; thus functionalize and compute joint simultaneously. Implementing this was a doozy, as it triggered multiple functionalization bugs:
- A FunctionalTensorWrapper that directly wraps a fake tensor didn’t report its device correctly, and so dispatched to the wrong kernels; this was diagnosed by code reading after staring at the backtrace. We had to both forward device calls to inner tensor in FunctionalTensorWrapper and update functionalization dispatch key strategy so that FunctionalTensorWrapper appropriately pretends to be a Dense tensor.
- We made a LOT of quality of life improvements to the branch this week. A lot of it was simply dogfooding the software and making adjustments when we noticed things could be improved. These range from paper cuts (spammy warnings, overly verbose exceptions) to important debugging tools (printing more program state on failure, e.g., the incomplete make_fx traced graph or the strides of variables in a graph) to important unblocking features (adding a timeout so that sympy infinite loops don’t block model sweeps).
- Use functionalize in symbolic exhaustive tests, not sound otherwise is a howler; someone intentionally turned off functionalization on the test suite, so many tests were failing because AOTAutograd without functionalization isn’t actually sound. Fixing this helped a number of test cases pass. The moral of the story is: don’t give users configuration knobs that are known to be unsound, unless you make it really obvious (or at least, obvious enough that a core dev doesn’t turn that knob on by default in tests, and another core dev approves it in CR).
- Fix bernoulli functionalization was an annoying typofix in the bernoulli implementation for a longstanding problem (long enough that XLA had manually worked around it). It had to be diagnosed manually (by looking at the graphs and noticing some operators were missing after DCE); it could have been caught immediately by testing that no mutating ops remain after functionalization (a sketch of this kind of assert appears after this list), but the PR that implements this is still not landed: aot_autograd: add assert for functional-only graph by bdhirsh · Pull Request #85681 · pytorch/pytorch · GitHub (I must emphasize how important it is to actually land the PRs you write!)
- A runtime error complaining about a shape mismatch in backwards turned out to be a functionalization bug, where we were not copying enough metadata when doing a shallow copy and detach of FunctionalTensorWrapper (which happens when a variable is saved for backwards). Both bdhirsh and ezyang root-caused the problem at about the same time. Fixed in functionalization: fix detach() and tested by Saved tensor that is view test case functionalization.
- This is not a bug fix per se, but an infrastructure improvement: basic bin_op(symint, symint) support was initially done in a brute-force way, adding each overload for mixed operations one by one, but on subsequent code review we didn’t like it. Unify SymIntNode and SymFloatNode into SymNode instead removed the static types from C++, which means you don’t have to add overloads at all. This produced a patch that was functionally equivalent, but a net decrease in total LOC. https://twitter.com/ezyang/status/1585373693852934144
- Prevent use of as_strided in backwards was discovered when inspecting some backward graphs by hand and noticing they had a lot of as_strided calls in them. For reference, typical user programs don’t really ever use as_strided, and neither do most of our autograd derivative formulas, so it is weird that they were showing up. In fact, this is due to how our view mutation logic works, which by default converts all views into AsStrided operations so you don’t have to track the provenance of any single view. This is a nice performance optimization for eager, but it generates code that’s worse for compilers, and in fact XLA already had a way of disabling this shortcut and maintaining the views by hand. Turning this on for PT2 eliminates these as_strided calls. BTW, as_strided calls are preferably avoided in IR because they are extremely sensitive to input strides: if the incoming tensor has different strides, as_strided will silently corrupt your data (a small demonstration of this hazard appears after this list). This would be less of a problem with a “compositional” variant of as_strided that respects input strides (e.g., if a dim already had stride 2, restriding it by 2 would result in stride 2 * 2).
- We have a lot of sympy infinite loops. Fix another sympy infinite loop whackamoles one of them, but there are still more. @Chillee we need to figure out a more sustainable strategy for these.
- Disable aot autograd cache is a nice stopgap: some of our models are now failing because they do exercise dynamic shapes, but our guards are insufficient (because AOTAutograd guards aren’t propagated up to torchdynamo yet). Disabling the AOTAutograd cache means we always recompile at AOTAutograd until this can be fixed.
- Unify meta tensor and fake tensor converter conversion broke a number of inductor model runs on master (the non-dynamic shapes configuration). I was able to fix this by disabling the meta tensor converter’s view preservation (its logic for remaking a tensor into a view if it was originally a view). But in principle it should be OK to do this, which is very perplexing. Some of the current investigation notes are at AOTAutograd has soundness bug when passing views as inputs · Issue #1815 · pytorch/torchdynamo · GitHub
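Below are the sketches referenced in the bullets above. First, the proposed (not yet implemented) aot_eager mode that checks real tensors against the trace: this is only a rough illustration, assuming the fake tensor recorded at trace time is stashed in node.meta["val"]; MetadataCheckInterpreter is a hypothetical name, not an existing API.

```python
import torch
import torch.fx

class MetadataCheckInterpreter(torch.fx.Interpreter):
    """Run a traced GraphModule on real tensors and compare each intermediate
    against the (fake) tensor metadata recorded on the node at trace time."""
    def run_node(self, n):
        result = super().run_node(n)
        expected = n.meta.get("val")  # assumed: FakeTensor saved during tracing
        if isinstance(expected, torch.Tensor) and isinstance(result, torch.Tensor):
            # (With symbolic shapes the expected sizes/strides hold SymInts and
            # would need to be evaluated to concrete hints before comparing.)
            assert result.shape == expected.shape, f"{n.format_node()}: shape mismatch"
            assert result.stride() == expected.stride(), f"{n.format_node()}: stride mismatch"
            assert result.dtype == expected.dtype, f"{n.format_node()}: dtype mismatch"
        return result

# Hypothetical usage: MetadataCheckInterpreter(traced_gm).run(*real_inputs)
```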
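Second, the “reshape always copies” idea from the reshape-preservation item. The branch implements this inside tracing/functionalization; the snippet below is only a behavioral sketch of the intended semantics, with reshape_as_copy as a hypothetical helper:

```python
import torch

def reshape_as_copy(t: torch.Tensor, shape):
    # Unlike aten.reshape (view if possible, copy otherwise), always produce a
    # fresh contiguous tensor, so downstream code never has to reason about
    # whether the result aliases the input or what the input's strides were.
    return t.clone(memory_format=torch.contiguous_format).view(shape)

x = torch.randn(2, 3, 4).transpose(0, 1)        # non-contiguous input
y = reshape_as_copy(x, (3, 8))
assert torch.equal(y, x.reshape(3, 8))          # same values as reshape...
assert y.data_ptr() != x.data_ptr()             # ...but never a view of the input
```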
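Third, an approximation of the “no mutating ops after functionalization” assert that would have caught the bernoulli bug immediately. The real check lives in the not-yet-landed PR #85681; this sketch just inspects operator schemas for written-to arguments:

```python
import torch
import torch.fx

def assert_functional_graph(gm: torch.fx.GraphModule):
    # After functionalization, no node in the graph should mutate its inputs.
    # Detect mutation by looking for schema arguments marked as written to.
    for node in gm.graph.nodes:
        if node.op == "call_function" and isinstance(node.target, torch._ops.OpOverload):
            schema = node.target._schema
            mutated = [a.name for a in schema.arguments
                       if a.alias_info is not None and a.alias_info.is_write]
            assert not mutated, (
                f"mutating op {node.target} (writes to {mutated}) survived functionalization"
            )
```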
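Finally, a small demonstration of why baked-in as_strided calls are hazardous in IR: strides recorded for one input layout silently read the wrong storage locations when the input shows up with a different layout.

```python
import torch

def transpose_via_as_strided(x):
    # A "transpose" expressed with as_strided, whose strides were baked in
    # assuming x is a contiguous 2x3 tensor (strides (3, 1)).
    return x.as_strided((3, 2), (1, 3))

a = torch.arange(6.0).reshape(2, 3)                      # contiguous: matches the assumption
assert torch.equal(transpose_via_as_strided(a), a.t())   # correct

b = a.t().contiguous().t()                               # same shape and values as a, but strides (1, 2)
# No error is raised, but the baked-in strides now index b's storage in the
# wrong order, so the result is silently wrong data rather than b.t().
assert not torch.equal(transpose_via_as_strided(b), b.t())
```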
What’s new on the branch this week?
This time, instead of doing commits chronologically, I’ve tried grouping them by theme.
Meta support
- Fix meta for meta_fill_ SherlockNoMad
- Delete incorrect and duplicate meta_add_ ezyang/Chillee
- meta for topk ezyang
- Logical binary inplace ezyang
- Implement scalar_tensor meta ezyang
- fix meta defaults in meta__fused_moving_avg_obs_fq_helper bdhirsh
SymInt support
- Symintify tensor_split ezyang
- SymInt padding/output padding on convolution; default arg codegen ezyang
- adaptive_avg_pool2d symintification continued anjali411
- Some more meta support for unit tests test_aotdispatch ezyang
- More progress on quantized ezyang
- pass on aten._fused_moving_avg_obs_fq_helper_functional.default ezyang
- SymIntify resize_ ezyang
- SymIntify _copy functionalization kernels (and _copy_out too) ezyang
- Sym support for torch.numel ezyang
Quality of life
- [HACK] Stop rethrowing exceptions ezyang
- [HACK] Make it easier to correlate what you ran and the status ezyang
- enable pybind debugging by default albanD
- [HACK] Don’t rethrow TorchRuntimeError ezyang (plus bug fix Fix missing re-raise in exception fallthrough )
- Improve broadcast_shapes error printing ezyang
- [HACK] Print size hints, not expressions ezyang
- [HACK] Make logging less spammy ezyang
- Make FX collected traceback more useful by stripping format_stack() f… ezyang
- [HACK] Record stack traces on make_fx by default. ezyang
- [HACK] Print incomplete graph upon failure ezyang
- Added stride printing to gm.print_readable() and removed add_meta met… Chillee + update print_readable anjali411
- Ignore log files ezyang
- Add a 5min timeout ezyang
- Distinguish timeout and regular fail, go to stderr ezyang
- [HACK] Disable High loss value assert ezyang
- Report what node failed when running an operation ezyang
- Beefier error message ezyang
- Be a little more accurate in the report ezyang
Dynamo
- Partial cherry pick from vision_maskrcnn bugfix ezyang
- [Dynamo] Symbolic shape guards voz
- Misc dynamo fixes wip cleanups rm comment voz
- [dynamo] Refactor DynamicShapeVariable to be a top level VariableTracker voz (voz/fix_vars branch merge)
- Remove some unnecessary BC code ezyang
- Enable Python dispatcher when running fake tensors in dynamo ezyang
- Fix floor guards in torchdynamo guard codegen ezyang
- Fix slice voz
- Delete questionable symintification of constant ezyang
- Fix dynamo not recording slices in graphs. This led to issues with mi… voz
- Specialize on dynamic shape control flow instead of graph breaking voz
- Fix math.sqrt and size(i) handling in dynamo voz
Functionalization
- Use functionalize in symbolic exhaustive tests, not sound otherwise ezyang
- Fix bernoulli functionalization. ezyang
- Forward device calls to inner tensor in FunctionalTensorWrapper ezyang
- Update functionalization dispatch key strategy ezyang
- Functionalize and compute joint simultaneously ezyang
- Save and restore reapply views TLS correctly ezyang
- Saved tensor that is view test case functionalization ezyang
- functionalization: fix detach() bdhirsh
Infrastructure
- basic bin_op(symint, symint) support and more basic bin_op(symint, symint) support bdhirsh, subsumed by Unify SymIntNode and SymFloatNode into SymNode ezyang
- [HACK] Prevent use of as_strided in backwards ezyang
- Do not use unsafe restriding for subclasses ezyang
- Add get_guard_expr to symbolic_shapes which returns all guards in a s… Chillee
- totally untested sym_sqrt support ezyang
- Get the magic method try reverse protocol correct ezyang
- Fix accuracy minifier ezyang
- Fix another sympy infinite loop Chillee
- disable aot autograd cache anjali411
- Properly pass on shape_env to view base ezyang
- Force people to call from_meta_and_device directly ezyang
- Convert MetaConverter’s tensor memo into a weak value dictionary. ezyang
- Unify meta tensor and fake tensor converter conversion ezyang
Merge to master retrospective
- Many symintifications was reverted because it broke the internal Executorch build, due to an Executorch-only YAML file defining an out variant of an operator (Redirecting...). The fbcode change ended up being trivial, so we relanded the PR by landing it GH1, and then ninja’ing the fbcode fix after it was imported in the diff train. @albanD was initially concerned about merge conflicts between fbcode master and GH master because this was a large patch, but this strategy neatly sidestepped the problem (the biggest annoyance being ensuring the diff train was sufficiently imported to import this diff).
- Fix bernoulli functionalization. required non-trivial XLA changes, but at time of writing @ezyang wasn’t able to compile XLA successfully. Fortunately, JackCaoG helped do the XLA side patch (which was relatively long, but not too difficult; just undoing XLA’s bernoulli hack.)
- Unify meta tensor and fake tensor converter conversion was reverted on master because inductor tests were not run on PR. This will be fixed by enabling inductor CI on all fake tensor changes.
What’s made it to master this week?
- albanD (merge captain)
- ezyang
  - Unify SymIntNode and SymFloatNode into SymNode + Fix pybind11 problems with c10::SymInt unregistered
  - Convert MetaConverter’s tensor memo into a weak value dictionary.
  - Force people to call from_meta_and_device directly
  - Fix bernoulli functionalization.
  - Improve argument printing
  - Fix accuracy minifier
  - as_strided_scatter storage offset defaults to None not 0
- bdhirsh
- anjali411
- Chillee
- voz
What’s coming next?
- Educate the team on how torchdynamo dynamic shapes is supposed to work, and spend a lot of time fixing issues here
- E2E training on master with inductor.
- Hook up torchdynamo’s symbolic shapes to AOT Autograd’s symbolic shapes (check Sym Shapes, Control Flow in Dynamo - RFC + Plan of Record - Google Docs for design details)
- Resolve strategy for sharing ShapeEnv between forward and backwards (@ezyang’s take: don’t return SymInts from forward pass)
- Reshape unification
- All benchmark models passing aot_eager training on the branch; tracked at Operators that need symbolic support - Google Sheets
- Fallback implementation for custom operators without symbolic shape propagation, inferred by running fallback on real operators
- All OpInfo tests passing