State of PT2: Sep 23, 2023 edition
Previous update: State of symbolic shapes branch - #69 by ezyang
Executive summary
Dynamo
- Yanbo has been making good progress on understanding the state of our skipfiles/allowlist situation. Here is my attempt to record what he described to me in our 1:1.
- First, what do these things do? For any given frame, we can make one of three decisions about it: inline (the default), where we trace into the frame; skip, where we never start Dynamo on the frame and induce a graph break instead of inlining into it (BUT, skipped functions may have overrides in Dynamo that let us avoid a graph break); and allow in graph, where we don’t inline into the function but instead put it directly into the graph (and run it to do fake tensor propagation). Skipfiles and the allowlist control whether or not we do something different from the default decision. (A concrete sketch follows below.)
- Yanbo’s theory is that the allowlist should be explicitly enumerated, function by function. This makes sense: there’s a fixed set of operations we can actually put in the graph (coinciding with Torch IR; see composability sync), and they have to be audited to ensure they don’t do naughty stuff like mutate Python state.
- Suppose that we didn’t care about compile time or Dynamo bugs at all. In theory, no skip list would be necessary, because you’d expect Dynamo to independently work out that something can’t be compiled and graph break there. There is a big nuance here though: the torch module is skipped! Most of the time, this skip is bypassed for other reasons, e.g., a torch function is allowed in graph, or a submodule is explicitly allowed for inlining. But by default we won’t actually compile anything in torch (and this can be quite surprising for PyTorch devs!)
- Chesterton’s fence rules everything around me. Sometimes we have manual implementations of functions (like nn.Module.parameters) that are unnecessary: they were added back when Dynamo’s Python language support was not so good, and now we can just inline into those functions. But other seemingly benign skip rules are load-bearing, and removing them causes problems. So many of Yanbo’s initial refactors will be oriented toward preserving semantics as much as possible while improving code organization.
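To make the three decisions concrete, here is a minimal sketch using the public torch._dynamo decorators that correspond to them. This is my own toy example (the helper names are made up), assuming torch._dynamo.disable and torch._dynamo.allow_in_graph map onto the skip and allow-in-graph decisions as described above:

```python
import torch
import torch._dynamo

def inlined_helper(x):
    # Default decision: Dynamo inlines into this frame and traces it.
    return x * 2

@torch._dynamo.disable  # "skip": never run Dynamo on this frame; graph break instead
def skipped_helper(x):
    print("side effect!")  # untraceable Python is fine here, at the cost of a graph break
    return x + 1

@torch._dynamo.allow_in_graph  # "allow in graph": the call goes directly into the graph
def allowed_helper(x):
    # Dynamo does not inline this; it only runs it for fake tensor propagation,
    # so it must not mutate Python state.
    return torch.sin(x)

@torch.compile
def f(x):
    return allowed_helper(inlined_helper(x)) + skipped_helper(x)

f(torch.randn(3))
```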
- Jason, Edward, Animesh and Voz got together to discuss some design questions about guard trees raised last week. The conclusion was that we are NOT going to do combined guard tries; Animesh’s plan, as originally scoped, stands. One interesting thing I learned from this discussion was that source-based guard structures deal poorly with guards that mention two sources, but Jason proposed a way to handle this case: instead of directly having a guard like `x.size(0) == x.size(1)`, have assignment statements like `let s0 = x.size(0)` and `let s1 = x.size(1)`, and then an extra guard that only makes reference to this local scratch space: `s0 == s1`. Each such extra guard gets run immediately once all of its free variables have been assigned. Jason’s argument is that size guards can be very fast to run if we compile them, so it doesn’t matter if they get run unnecessarily early. (A sketch of this scheme appears below.) Some very rough meeting notes: https://docs.google.com/document/d/1EbrR9o7Loi_fU1MHNJAxxCOItn0dv44hLF3NB2pZ1nE/edit#heading=h.o7t8ttlom4nx
- Lazos suffered a bit from some bikeshedding about how he should write some new VariableTrackers, but hey, at least we got a doc out of it: Which VariableTracker should I use - Google Docs
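Here is the guard-restructuring scheme above as a tiny illustrative sketch. This is my own pseudocode for the idea, not the actual guard implementation, and all names are hypothetical:

```python
import torch

# Two-source guards become (1) assignments that each read a single source and
# bind a local, plus (2) a derived guard over those locals, which runs as soon
# as all of its free variables have been bound.

assignments = {
    "s0": lambda env: env["x"].size(0),
    "s1": lambda env: env["x"].size(1),
}

# The derived guard references only the local scratch space.
derived_guards = [
    (("s0", "s1"), lambda scratch: scratch["s0"] == scratch["s1"]),
]

def check_guards(env):
    scratch = {}
    for name, compute in assignments.items():
        scratch[name] = compute(env)
        # Eagerly run any derived guard whose free variables are all bound;
        # running it "too early" is fine if compiled guards are cheap.
        for free_vars, guard in derived_guards:
            if all(v in scratch for v in free_vars) and not guard(scratch):
                return False  # guard failed -> recompile
    return True

print(check_guards({"x": torch.randn(4, 4)}))  # True: s0 == s1
print(check_guards({"x": torch.randn(4, 5)}))  # False
```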
- PSA: when you print Dynamo logs, they come with a `[0/0]` marker that says which frame you are compiling, and which recompile of that frame you are on. `[19/0]` means you are compiling the 19th interesting frame for the first time, while `[1/20]` means you are recompiling the 1st frame for the 20th time (probably bad!) Occasionally, you will see `[0/0_1]`; this means that we restarted analysis on 0/0 for some reason, so this is the second time around compiling it (the suffix is zero-indexed).
- I mentioned to jansel that there are a number of Dynamo refactors I’d kind of like to firm up: mutable VariableTrackers, a more accurate Python object model, VariableTracker cleanup (we have a big VariableTracker subclass hierarchy), and more metacircularity (so that constant folding is a lot easier to implement). Hopefully we can hit some of these during the offsite.
Composability
- We had a very spicy session at composability sync this week on Torch IR: https://youtu.be/FSPNXppkcjg Composability meeting notes - Google Docs https://docs.google.com/document/d/17O1R57oOZp_fK4dRf83UiM4fH6h6nblxjizTrbHP8BY/edit The crux of the matter is what to do about “Torch IR”, which is conceptually a PyTorch program capture representation produced by fx.symbolic_trace: an unnormalized format that contains precisely the torch API operations that are part of PyTorch’s public API. It is a coherent concept that folks use today, but its denormalization makes it difficult to write sound analyses/passes on top of it. Some argued that because it’s so difficult to use, we shouldn’t expose it; others argued that the horse has already escaped from the barn. We were able to agree in the meeting on what Torch IR is and what guarantees you should expect from it, but how it should relate to export is still an ongoing discussion.
- Zachary DeVito has been working on a single-controller distributed paradigm for PyTorch, where we issue commands consisting of entire Dynamo-traced graphs for external nodes to run. This is being done via a tensor subclass, but it is a bit unusual in that it doesn’t match other tensor subclass applications: here we don’t actually want to trace into the subclass itself, we just want to trace tensor operations on it “as if it were a normal tensor.” (The usual pattern it diverges from is sketched below.)
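For context, the usual tensor subclass pattern, the one PT2 does want to trace into, looks roughly like this `__torch_dispatch__` wrapper sketch (a generic illustration, not Zach’s code):

```python
import torch
from torch.utils._pytree import tree_map

class LoggingTensor(torch.Tensor):
    # A typical wrapper subclass: intercept every aten op, do something extra
    # (here: log it), then run the op on the wrapped inner tensor.
    @staticmethod
    def __new__(cls, elem):
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), dtype=elem.dtype, device=elem.device,
            requires_grad=elem.requires_grad)
        r.elem = elem
        return r

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        def unwrap(t):
            return t.elem if isinstance(t, LoggingTensor) else t

        def wrap(t):
            return LoggingTensor(t) if isinstance(t, torch.Tensor) else t

        print(f"op: {func}")
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
        return tree_map(wrap, out)

x = LoggingTensor(torch.randn(3))
y = x + 1  # logs aten.add.Tensor, then computes on the inner tensor
```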
- Apparently optree GitHub - metaopt/optree: OpTree: Optimized PyTree Utilities is slowly making progress toward becoming an optional dependency for PyTorch: if it is installed, the PyTorch pytree APIs will transparently make use of it, for hefty speedups. Pytrees are a big compile-time cost for PT2, so it will be nice for this to land; so nice that Richard Zou has been wondering whether we shouldn’t just make it a required dependency. (The pytree API in question is sketched below.)
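For reference, here is a tiny usage example of the existing torch.utils._pytree API that optree would transparently replace:

```python
import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

# PT2 flattens nested containers of tensors like this constantly while
# tracing, which is why a faster pytree implementation helps compile time.
inputs = {"weights": [torch.randn(2, 2), torch.randn(2)], "step": 3}
leaves, spec = tree_flatten(inputs)      # flat list of leaves + structure spec
rebuilt = tree_unflatten(leaves, spec)   # reassembles the original structure
```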
- COW tensor is alive again, thanks to Kurt Mohler accepting a mission to finish off Mikey Dagitses’ work. Implement Copy-on-write (COW) tensors · Issue #109833 · pytorch/pytorch · GitHub
- Discussions on internal reliability continue. It’s going to be some sort of multi-pronged approach, where we negotiate with PGs about what testing we absolutely must do, while at the same time improving testing on poorly exercised parts of PT2 (e.g., PT2+FSDP) and working out periodic tests that are covered similarly to how we cover benchmark results in OSS.
Dynamic shapes
- PSA: We’ve covered this before, but someone asked me about it, so I’ll repeat it: our plan for PT2 support for table-batched embeddings is that we will eventually support capturing embeddings both with and without the fusion preapplied (e.g., by module swapping). It’s then your choice whether to reuse preexisting optimization passes or write a new PT2-based optimization.
- PSA: When registering C++ implementations (meta or composite) that take SymInt, make sure to pass SymInt by value and not by const reference. This applies to composite types like `optional<SymInt>` too. Unfortunately, we don’t check this properly right now: https://github.com/pytorch/pytorch/pull/109727 is wending its way in, but I have to fix all the people who did it wrong first lol.
- Adnan Akhundov has been working on PT2 compiling https://docs.google.com/document/d/1q1Rccii_A1xRsZETGPmrOebjKPOhbtWZ4bz9XrI6AdA/edit#heading=h.cxi2qmayvqhr (Meta only) and triggering a lot of missing unbacked SymInt features that we’ve been meaning to patch in but haven’t gotten around to yet. We landed a lot of improvements this week driven by this workstream (a small unbacked SymInt example follows below):
- Use constrain_range_as_size for nonzero/repeat_interleave
- Allow inferring size-nature from sizes passed to empty constructor
- Handle unbacked symints in Triton size hints
- Handle unbacked symints in buffer reuse calculation
- In progress: Add support for item() and nonzero() codegen in Inductor - just needs deps
- There’s still a lot more to do: Inductor makes heavy use of size hints, which needs to be rewritten to handle the unbacked case, where no hint is available. Another unusual thing about Adnan’s setup is that he is using AOTInductor, so we’re also getting a lot of extra asserts from add_runtime_assertions_for_constraints_pass.py, which also don’t compile atm.
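To illustrate what “unbacked” means here: a data-dependent op like nonzero() has an output size with no example value to take hints from. A minimal sketch, assuming the torch._dynamo.config.capture_dynamic_output_shape_ops flag (this is not Adnan’s actual workload):

```python
import torch
import torch._dynamo

# Opt in to capturing data-dependent output shapes instead of graph breaking.
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    nz = torch.nonzero(x)  # output size is an unbacked SymInt: no hint available
    return nz.sum()

print(f(torch.tensor([0.0, 1.0, 0.0, 2.0])))
```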
- Jeffrey Wan has worked out that we should represent singleton SymInts (to be used to represent ragged dimensions) as sympy Atoms, which compare equal only to themselves. This is better than Symbol because this way they don’t show up as free symbols. (A sketch of the idea is below.)
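A minimal sympy sketch of the idea; the class name and details here are illustrative, not Jeffrey’s actual implementation:

```python
import sympy

class SingletonInt(sympy.AtomicExpr):
    # An atomic expression standing in for a ragged dimension. It is an Atom
    # but not a Symbol, so it never appears in free_symbols, and it compares
    # equal only to a SingletonInt with the same identity.
    def __new__(cls, ident):
        obj = super().__new__(cls)
        obj._ident = ident
        return obj

    def _hashable_content(self):
        # sympy derives __eq__ and __hash__ from this.
        return (self._ident,)

j0, j1 = SingletonInt(0), SingletonInt(1)
assert j0 == SingletonInt(0) and j0 != j1
assert j0.free_symbols == set()  # a sympy.Symbol would show up here
```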
- Notable bug fixes:
- Notable new bug reports:
- basic_gnn_gcn: ERROR:common:TypeError: object of type ‘GreaterThan’ has no len()
- masked_select for meta backend
- Cannot use constrain_as_size from fake tensor implementations: RuntimeError: tried to get Int out of SymInt
- add_runtime_assertions_for_constraints_pass adds redundant asserts
- [dynamo] torch._dynamo.exc.Unsupported: call_function BuiltinVariable(float) [TensorVariable()] {}
- [dynamo][jagged tensor] Slow compilation time for a helper function of jagged tensor