State of symbolic shapes branch

ezyang · October 8, 2023, 8:16pm

State of PT2: Oct 8, 2023 edition

Previous update: State of symbolic shapes branch - #70 by ezyang

We were on break last week as I was on vacation.

Executive summary

Compiler/distributed offsite was last week! PyTorch Conference talk slides are due to Linux Foundation end of this week!

Dynamo

Our initial take on mutable variable trackers was “well, it is probably technically feasible, but it’d be a lot of work and the ROI is not obviously there.” It came up again this week, though, for perf reasons: RFC / Discussion - Mutable Variable Trackers - Google Docs from Voz
We discussed an accurate Python object model: definitely something we should do for user defined objects, maybe Fidget-Spinner will work on it. We have had some Dynamo bugs recently relating to classes with nontrivial metaclasses (like abc) and multiple inheritance.
We have a proposal for guarding on Dynamo configuration, which should make it a lot easier to tweak config options: Dynamo guard on global configuration · Issue #110682 · pytorch/pytorch · GitHub One notable choice we make is that outer-most torch.compile config wins; if this would be annoying for you please comment on the issue.

Tracing FSDP

We talked about the relative importance of landing tracing FSDP quickly during the offsite. The general consensus was that, while this is an important capability, the more time pressing problems are optimizing tensor parallel compute (as it’s harder to manually get optimal overlapping in this regime) and tracing DDP (which Chien-Chin is working on.)
During the offsite, we came up with a full plan for Dynamo-level support for propagating hooks to backwards. The primary complication is that, in full generality, a backward hook installed in a Dynamo compiled region may be arbitrary Python code that would vary from run to run, but we emphatically do not want to guard on it (nor can we, since we didn’t inline into the function.) In the simple case, the function is constant from iteration to iteration and we can bake it into the backwards graph (this is what is currently implemented); in the complicated case, Dynamo must construct the residual function, and then somehow pass it to the AOTAutograd compiled function, so AOTAutograd knows that this particular function is what should be invoked when backward rolls around. This can be done but it’s all quite fiddly. For FSDP we don’t need it in full generality because it’s a constant function.
More folks are collaborating on Voz’s experimental FSDP tracing branch: https://github.com/pytorch/pytorch/tree/voz/fsdp_autograd3 To run things on the branch just say torchrun --standalone --nproc_per_node=2 fsdp.py (will run with compiled forwards, but NOT compiled backwards). Current status is that compiled forwards works, compiled backwards does not. The problem is that compiled autograd has to do a pre-pass with fake tensors to construct the FX graph, but during this pre-pass it is unable to run hooks, and that means parameters aren’t the sizes it is expecting.
Not quite FSDP, but putting it here: on the subject of single controller, Haiping Zhao is also looking at this problem space, much more from the distributed side. He, Zach and Horace have been chatting.

Core

We spent a bit of time talking about optimizer in the offsite. @janeyx99 summarized the discussion at Meta only: https://fb.workplace.com/groups/pytorch.oss.dev/posts/1750253475399186/ and Meta only: https://docs.google.com/document/d/1JJhRCl8F51nH_Ke8Yd_BV3scAmv5V4eSVnle5D8__po/edit My brief summary: we’re going to make optimizer support taking parameters in arbitrary pytree structure, rather than forcing just a list of parameters (which gives you the awful integer indexed structure where you have to reverse engineer which parameter is what.) It’s not BC-breaking, but people who use this API will get a state dictionary that is much easier to work with.
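As a rough illustration of why a pytree-structured parameter set gives a friendlier state dictionary, here is a pure-Python sketch. It does not use the optimizer API or `torch.utils._pytree`; `flatten_with_paths`, the example `params` tree, and the `{"step": 0}` state are all made up for the example.

```python
def flatten_with_paths(tree, prefix=()):
    """Yield (path, leaf) pairs for nested dicts, lists, and tuples."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            yield from flatten_with_paths(value, prefix + (key,))
    elif isinstance(tree, (list, tuple)):
        for index, value in enumerate(tree):
            yield from flatten_with_paths(value, prefix + (index,))
    else:
        yield prefix, tree


# Strings stand in for parameter tensors in this sketch.
params = {
    "encoder": {"weight": "W_enc", "bias": "b_enc"},
    "head": ["W_head"],
}

leaves = list(flatten_with_paths(params))

# List-of-parameters style: state is keyed by position, so you have to
# reverse engineer which integer corresponds to which parameter.
list_state = {i: {"step": 0} for i, _ in enumerate(leaves)}

# Pytree style: state is keyed by the path into the structure.
tree_state = {path: {"step": 0} for path, _ in leaves}
```

Here `list_state` has the keys `0`, `1`, `2`, while `tree_state` is keyed by paths such as `("encoder", "weight")`, which is the readability difference described above.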
Composability sync this week was all about quantization https://www.youtube.com/watch?v=7WhgpAIvxHU Composability meeting notes - Google Docs The resolutions:
- uint2/uint3/uint4 support in core to be prototyped as Python subclass by torchrec folks (led by Ivan Kobzarev)
- Decent chance we are going to get dequantize operators that can show up in export IR
- We will have a pattern matcher that will let you compile regular Python calls in PyTorch IR into appropriate ATen matchers, mirroring how Inductor’s pattern match infra works. This will support just returning fx.Nodes to you so you can do arbitrary transformations, instead of just doing a replacement.
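For readers unfamiliar with sub-byte dtypes, the sketch below shows one common way to store uint4 values: two per byte. This is an assumption about the general storage idea only. It is not the planned tensor subclass, and `pack_uint4` and `unpack_uint4` are hypothetical helper names.

```python
def pack_uint4(values):
    """Pack integers in [0, 15] two per byte, low nibble first."""
    values = list(values)
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("uint4 values must be in the range 0..15")
    if len(values) % 2:
        values.append(0)  # pad odd-length input with a zero nibble
    return bytes(
        low | (high << 4)
        for low, high in zip(values[0::2], values[1::2])
    )


def unpack_uint4(packed, count):
    """Recover the first `count` uint4 values from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out[:count]
```

A subclass prototype would presumably wrap a packed buffer like this behind a tensor interface, but the actual design is in the meeting notes rather than in this sketch.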
Richard Zou has been trying to convince people to use the new operator registration API, but he has been noticing that people really like the old fashioned autograd.Function API, because it doesn’t require them to do work for things they don’t care about (e.g., supporting other transforms). Since we need to support this anyway, we are going to make sure Dynamo’s support in this regime is good.
AOTDispatch subclass PR is approved, close to landing! https://github.com/pytorch/pytorch/pull/104483

Inductor

AOTInductor is currently working hard on GPU model support, but some folks have been poking at it for CPU, overhead sensitive workflows. There will likely be some work done in this area, cool increase in scope.
Inductor strategy will be presented to upper leadership soon. Meta-only slides: https://docs.google.com/presentation/d/1M6W5YuXhfCkngmjPC8a_EfdMs_P6wXzsAbr_LKkuisU/edit#slide=id.g2887f2cdaba_0_23 (They’re pretty interesting, I recommend reading them if you have access.)

Export

Export input/output matching is being a problem again. This is the AssertionError: traced result #1 (<class 'torch.Tensor'>) is not among graph-captured output. Someone should rewrite this code.

Dynamic shapes

ysiraichi is transitioning to PyTorch XLA, so he will have less time to work on dynamic shapes specifically.
There was a torchrec_dlrm update last week; the main change is that I redid the torchrec changes assuming variable batches, and after fixing bugs it all worked out smoothly. Still blocked on complicated autograd.Function support. https://docs.google.com/document/d/1VTGEh0MqadAsuRy0s5u39wQhNwMSVgCgYewivMcBbuU/edit#heading=h.34z2pradlobb
Notable bug fixes:
Notable new bug reports:

Numbers

I guess I’m doing these monthly now.

Training. 1b34238d67 dashboard

nanogpt, stable_diffusion_text_encoder, stable_diffusion_unet newly added torchbench models
mobilevit_s now passing timm_models
Not sure what’s going on with blueberries readout lol.
Compile time does seem to have gotten worse. @Chillee has been complaining about compile time, although a lot of it is in Dynamo tracing. It is hard to see the effect of Dynamo in our current benchmark suite because it is heavily Inductor biased. Some improvement from guard hashing, but Horace says it’s only 1-2 seconds.
Unattributed speedup on timm_efficientdet

Inference. 1b34238d67 dashboard

Lots of enablement in aot inductor, I like to see that pass rate go up
Some speedups are attributable to nanogpt being added
1% improvement in HF, Horace updated some loading logic
HF inference: better flash attention matching at low inference +19%, some of this is also FlashAttention v2
Some ups and downs with Yanbo’s for equiv invocation, letting us hit baddmm
