State of symbolic shapes branch

State of PT2: Oct 8, 2023 edition

Previous update: State of symbolic shapes branch - #70 by ezyang

We were on break last week as I was on vacation.

Executive summary

Compiler/distributed offsite was last week! PyTorch Conference talk slides are due to Linux Foundation end of this week!

Dynamo

  • Our initial take on mutable variable trackers was “well, it is probably technically feasible, but it’d be a lot of work and the ROI is not obviously there.” It came up again this week, though, for perf reasons: RFC / Discussion - Mutable Variable Trackers - Google Docs from Voz
  • We discussed an accurate Python object model: this is definitely something we should do for user-defined objects, and Fidget-Spinner may work on it. We have had some Dynamo bugs recently relating to classes with nontrivial metaclasses (like abc) and multiple inheritance.
  • We have a proposal for guarding on Dynamo configuration, which should make it a lot easier to tweak config options: Dynamo guard on global configuration · Issue #110682 · pytorch/pytorch · GitHub. One notable choice we make is that the outermost torch.compile config wins; if this would be annoying for you, please comment on the issue.
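
To make the "outermost config wins" rule concrete, here is a toy sketch in plain Python. It is not Dynamo code: toy_compile, _config, _cache and compile_log are hypothetical names invented for illustration, and the cache key stands in for a guard on configuration. The only behavior taken from the proposal is that a nested compiled call uses the config of the outermost one.

```python
# Hypothetical sketch: none of these names are real Dynamo APIs.
_config = {"dynamic_shapes": False}   # stand-in for a global config object
_active_snapshot = None               # config of the outermost call in flight
_cache = {}                           # (fn name, config snapshot) -> "compiled" fn
compile_log = []                      # one entry per (re)compilation


def toy_compile(fn):
    # Snapshot the config when toy_compile is called, as a hashable tuple.
    snapshot = tuple(sorted(_config.items()))

    def wrapper(*args):
        global _active_snapshot
        outermost = _active_snapshot is None
        # Outermost config wins: nested calls reuse the active snapshot.
        effective = snapshot if outermost else _active_snapshot
        key = (fn.__name__, effective)
        if key not in _cache:         # the "guard" on configuration missed
            compile_log.append(key)
            _cache[key] = fn
        if outermost:
            _active_snapshot = effective
        try:
            return _cache[key](*args)
        finally:
            if outermost:
                _active_snapshot = None

    return wrapper


def inner(x):
    return x + 1


inner_static = toy_compile(inner)     # compiled while dynamic_shapes=False

_config["dynamic_shapes"] = True


def outer(x):
    return inner_static(x) * 2


outer_dynamic = toy_compile(outer)    # compiled while dynamic_shapes=True
result = outer_dynamic(1)             # inner runs under outer's config
```

In this toy, calling inner_static directly afterwards triggers a separate "compilation" under its own snapshot, which is the kind of surprise the issue asks people to comment on.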

Tracing FSDP

  • We talked about the relative importance of landing tracing FSDP quickly during the offsite. The general consensus was that, while this is an important capability, the more time-pressing problems are optimizing tensor parallel compute (as it’s harder to manually get optimal overlapping in this regime) and tracing DDP (which Chien-Chin is working on).
  • During the offsite, we came up with a full plan for Dynamo-level support for propagating hooks to backwards. The primary complication is that, in full generality, a backward hook installed in a Dynamo compiled region may be arbitrary Python code that would vary from run to run, but we emphatically do not want to guard on it (nor can we, since we didn’t inline into the function). In the simple case, the function is constant from iteration to iteration and we can bake it into the backwards graph (this is what is currently implemented); in the complicated case, Dynamo must construct the residual function and then somehow pass it to the AOTAutograd compiled function, so that AOTAutograd knows that this particular function is what should be invoked when backward rolls around. This can be done, but it’s all quite fiddly. For FSDP we don’t need it in full generality because it’s a constant function.
  • More folks are collaborating on Voz’s experimental FSDP tracing branch (https://github.com/pytorch/pytorch/tree/voz/fsdp_autograd3). To run things on the branch, just say torchrun --standalone --nproc_per_node=2 fsdp.py (this will run with compiled forwards, but NOT compiled backwards). Current status is that compiled forwards works, compiled backwards does not. The problem is that compiled autograd has to do a pre-pass with fake tensors to construct the FX graph, but during this pre-pass it is unable to run hooks, and that means parameters aren’t the sizes it is expecting.
  • Not quite FSDP, but putting it here: on the subject of single controller, Haiping Zhao is also looking at this problem space, much more from the distributed side. He, Zach and Horace have been chatting.
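
The difference between the simple and the complicated hook case above can be sketched with a toy model. This uses plain Python callables in place of real Dynamo and AOTAutograd graphs; compile_with_baked_hook and compile_with_residual_hook are hypothetical names, and the real mechanism is considerably more involved.

```python
# Toy model only: plain callables stand in for compiled backward graphs.

def compile_with_baked_hook(hook):
    """Simple case: the hook seen at trace time is baked into the backward."""
    def backward(grad):
        return hook(grad)      # always the hook captured during tracing
    return backward


def compile_with_residual_hook():
    """General case: the hook is passed to the backward as a runtime input."""
    def backward(grad, hook):
        return hook(grad)      # whichever hook this particular run supplied
    return backward


def double(grad):
    return grad * 2.0


def triple(grad):
    return grad * 3.0


baked = compile_with_baked_hook(double)   # traced while the hook was `double`
residual = compile_with_residual_hook()

# Second run: the user now installs `triple` instead of `double`.
baked_out = baked(1.0)                # still applies `double`, without a guard
residual_out = residual(1.0, triple)  # applies the hook from this run
```

Baking is only sound when the hook is the same function on every iteration, which is why it is enough for FSDP but not in full generality.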

Core

  • We spent a bit of time talking about optimizer in the offsite. @janeyx99 summarized the discussion at Meta only: https://fb.workplace.com/groups/pytorch.oss.dev/posts/1750253475399186/ and Meta only: https://docs.google.com/document/d/1JJhRCl8F51nH_Ke8Yd_BV3scAmv5V4eSVnle5D8__po/edit My brief summary: we’re going to make optimizer support taking parameters in an arbitrary pytree structure, rather than forcing just a list of parameters (which gives you the awful integer-indexed structure where you have to reverse engineer which parameter is what). It’s not BC-breaking, but people who use this API will have a state dictionary that is much easier to work with.
  • Composability sync this week was all about quantization: https://www.youtube.com/watch?v=7WhgpAIvxHU (Composability meeting notes - Google Docs). The resolutions:
    • uint2/uint3/uint4 support in core to be prototyped as Python subclass by torchrec folks (led by Ivan Kobzarev)
    • Decent chance we are going to get dequantize operators that can show up in export IR
    • We will have a pattern matcher that will let you compile regular Python calls in PyTorch IR into appropriate ATen matchers, mirroring how Inductor’s pattern match infra works. This will support just returning fx.Nodes to you so you can do arbitrary transformations, instead of just doing a replacement.
  • Richard Zou has been trying to convince people to use the new operator registration API, but he has been noticing that people really like the old fashioned autograd.Function API, because it doesn’t require them to do work for things they don’t care about (e.g., supporting other transforms). Since we need to support this anyway, we are going to make sure Dynamo’s support in this regime is good.
  • AOTDispatch subclass PR is approved, close to landing! https://github.com/pytorch/pytorch/pull/104483
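
On the optimizer point above, the following sketch shows why a pytree of parameters gives a friendlier state dictionary than a flat list. It is plain Python with strings standing in for tensors; flatten_with_paths and the "/"-joined key format are assumptions made for illustration, not the planned API or key scheme.

```python
def flatten_with_paths(tree, prefix=()):
    """Yield (path, leaf) pairs for nested dicts, lists and tuples."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            yield from flatten_with_paths(value, prefix + (key,))
    elif isinstance(tree, (list, tuple)):
        for index, value in enumerate(tree):
            yield from flatten_with_paths(value, prefix + (index,))
    else:
        yield prefix, tree


# Strings stand in for parameter tensors.
params_tree = {
    "encoder": {"weight": "W_enc", "bias": "b_enc"},
    "head": ["W_head"],
}
params_list = ["W_enc", "b_enc", "W_head"]

# Flat list: state is keyed by position, so the meaning of "1" must be
# reverse engineered from the order in which parameters were passed.
list_state = {i: {"step": 0} for i, _ in enumerate(params_list)}

# Pytree: state can be keyed by the path to each leaf instead.
tree_state = {
    "/".join(str(part) for part in path): {"step": 0}
    for path, _ in flatten_with_paths(params_tree)
}
```

Here list_state has keys 0, 1, 2, while tree_state has keys such as encoder/weight that identify the parameter directly.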

Inductor

Export

  • Export input/output matching is being a problem again. This is the error “AssertionError: traced result #1 (<class 'torch.Tensor'>) is not among graph-captured output”. Someone should rewrite this code.

Dynamic shapes

Numbers

I guess I’m doing these monthly now.

Training. 1b34238d67 dashboard

  • nanogpt, stable_diffusion_text_encoder, stable_diffusion_unet newly added torchbench models
  • mobilevit_s now passing timm_models
  • Not sure what’s going on with blueberries readout lol.
  • Compile time does seem to have gotten worse. @Chillee has been complaining about compile time, although a lot of it is in Dynamo tracing. It is hard to see the effect of Dynamo in our current benchmark suite because it is heavily Inductor biased. Some improvement from guard hashing, but Horace says it’s only 1-2 seconds.
  • Unattributed speedup on timm_efficientdet

Inference. 1b34238d67 dashboard

  • Lots of enablement in aot inductor, I like to see that pass rate go up
  • Some speedups are attributable to nanogpt being added
  • 1% improvement in HF, Horace updated some loading logic
  • HF inference: better flash attention matching at low inference +19%, some of this is also FlashAttention v2
  • Some ups and downs with Yanbo’s for equiv invocation, letting us hit baddmm