State of symbolic shapes: Jun 3 edition
Previous update: State of symbolic shapes branch - #56 by ezyang
Executive summary
- This update covers two weeks, on account of the Memorial Day holiday, and also because most of the dynamic shapes crew was working on FSDP tracing.
- PT Core Libraries offsite. PyTorch Core Libraries had an offsite. The big dynamic shapes relevant conversation we had was with @jbschlosser on nested tensor support in PT2. There will be an announcement post coming soon about this working E2E; now we need to roll our sleeves up and put it into core PyTorch. One big resolution from our discussion was that it is not necessary to model jagged dimensions in our lowering stack: while a jagged dimension, e.g., (B, H, W) where height and width can vary per sample, is intuitive for end users, during lowering it is acceptable to represent this as an ordinary 1D dense tensor (B*H*W,) carrying extra metadata that says how to reconstruct the jagged structure (see the toy sketch at the end of this summary). This is because passes like autograd do not care about the jagged structure of the tensor. Additionally, one theme was that a lot of PyTorch library developers really like doing development in PT2 now, so there is a lot of interest in improving the "I want to add a new feature to PyTorch, and I will use PT2 so I don't have to write a CUDA kernel" workflow. Dynamic shapes is pretty essential for kernel writers!
- Drowning in bugs. Two broken models that are not in the benchmark suite which I've been eyeballing: Fine-tuning HuggingFace wav2vec 2.0 with `torch.compile` · Issue #101160 · pytorch/pytorch · GitHub (this one is broken in a lot of non-dynamic-shape-related ways) and [torch.compile] Name 'Ne' is not defined (Stable Diffusion) · Issue #101228 · pytorch/pytorch · GitHub (there are a bunch of potential small changes which could fix it).
- A reach out from FT users. Some folks using FT (FasterTransformer) to do LLM inference are interested in what the long-term state of dynamic shapes and PT2 will be; will PT2 be a viable alternative to FT? Today, our gap with FT is moderately significant, especially because we cannot use CUDA graphs with dynamic shapes. However, we hope that (1) with things like kvcache, you do not actually need dynamic shapes, and (2) with continual improvements to PT2, we will become a competitive and much more user-friendly alternative to FT. Hopefully, with our increasing focus on LLMs (thanks @drisspg and the rest of the blueberries folks), we should continue to make progress on this front.
- Notable new issues.
- Dynamo should only unroll loops by a preset factor (unless otherwise explicitly instructed) · Issue #102839 · pytorch/pytorch · GitHub - Now that we're running torch.compile on end-to-end autoregressive generation, this is showing up, especially with nanogpt_generate (which is simple enough that there are no graph breaks). Driss volunteered to look into it.
- mark_dynamic may error too aggressively · Issue #102814 · pytorch/pytorch · GitHub - mark_dynamic is kind of hard to use. A stopgap may be to have a variant that doesn't error if the marked dimension fails to be dynamic; a minimal usage sketch is below.
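For context, here is a minimal sketch of the usage pattern and the failure mode in question (the function and shapes are illustrative, not taken from the issue):

```python
import torch
import torch._dynamo as dynamo

def f(x):
    # Branching on the size forces dim 0 to specialize, which conflicts
    # with the mark_dynamic annotation below and makes compilation fail.
    if x.size(0) == 8:
        return x * 2
    return x + 1

x = torch.randn(8, 16)
dynamo.mark_dynamic(x, 0)  # declare that dim 0 must be treated as dynamic

try:
    torch.compile(f)(x)
except Exception as err:  # currently surfaces as a ConstraintViolationError
    print(f"mark_dynamic constraint was violated: {err}")
```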
- Notable bug fixes.
- Improve repeat_interleave with scalar repeat value by peterbell10; we no longer specialize if you call repeat_interleave with a scalar repeat value
- Graph break on differentiable boolean mask setitem
- RelaxUnspecConstraint some more - this makes mark_dynamic no longer complain if you mark a 0/1 size dim dynamic (it will specialize, but no big deal)
CI skips. -1, -1, -1, -2 (+1, 0, -1, 0). hf_T5_generate and cm3leon_generate are now passing (though hf_T5_generate in a somewhat hacky way). The new failure is nanogpt_generate, which was previously failing even with static shapes; a new work item for us.
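As promised above, here is a toy sketch of the jagged lowering idea (illustrative only, not the actual nested tensor implementation; names like `values`, `offsets`, and `sample` are made up): a jagged (B, H, W) batch is carried as one flat dense buffer plus metadata for reconstructing each sample.

```python
import torch
from itertools import accumulate

sizes = [(2, 3), (4, 4), (1, 5)]                    # per-sample (H_i, W_i)
numels = [h * w for h, w in sizes]
values = torch.randn(sum(numels))                   # flat dense buffer of size sum(H_i * W_i)
offsets = [0] + list(accumulate(numels))            # metadata to locate each sample in the buffer

def sample(i):
    # Rebuild the i-th (H_i, W_i) view from the flat buffer.
    h, w = sizes[i]
    return values[offsets[i]:offsets[i + 1]].view(h, w)

# Pointwise ops (and autograd through them) never need to know the jagged
# structure; they can operate directly on the flat buffer.
doubled = values * 2
print(sample(1).shape)  # torch.Size([4, 4])
```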
The dashboard (as of 8215468870). This fortnight on HUD
There is a discontinuity in speedup, due to a change in how we compute it: we now (1) clamp per-model speedups to 1x (previously, a PT2-caused slowdown could depress the overall speedup; this is why torchbench was revised from 1.12 to 1.15), and (2) include models that fail accuracy in the geomean speedup as 1x (this depresses the geomean speedup; e.g., HuggingFace was revised from 1.48 to 1.45).
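For illustration, the new accounting works roughly like this (made-up numbers, not the dashboard's actual data):

```python
import math

results = [
    ("model_a", 1.40, True),   # (name, speedup over eager, passes accuracy)
    ("model_b", 0.85, True),   # PT2-caused slowdown: clamped to 1x
    ("model_c", 1.25, False),  # accuracy failure: counted as 1x
]

# Clamp slowdowns to 1x, count accuracy failures as 1x, then take the geomean.
clamped = [max(speedup, 1.0) if ok else 1.0 for _, speedup, ok in results]
geomean = math.exp(sum(math.log(s) for s in clamped) / len(clamped))
print(f"geomean speedup: {geomean:.2f}x")
```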
| Metric | Torchbench | Huggingface | TIMM models |
| --- | --- | --- | --- |
| Passrate | 88%, 56/64 | 98%, 44/45 | 100%, 60/60 |
| Speedup | 1.15x → 1.16x | 1.45x → 1.53x | 1.20x → 1.22x |
| Comptime | 79s | 100s → 103s | 134s → 135s |
| Memory | 0.93x → 0.94x | 0.97x → 1.00x | 1.01x |
Notes:
- HuggingFace and torchbench improvements are due to several broad-based optimizations that affected all configurations; these include "inductor: eliminate meaningless copy" and "squash xblock for persistent inner reduction" (the latter disproportionately affects transformer models and accounts for all of the torchbench improvement).
- The 2% TIMM models geomean increase is attributable to a big improvement in lcnet_050. It is not entirely clear which PR booked this win, but my guess is the Triton pin update.
- There is a huge performance win for TIMM models landed via convolution layout optimization, but it is disabled for dynamic shapes, so we didn't see any benefit from it.
I probably ought to report inference numbers too, but they are only being run twice a week and our latest set of improvements is not in an official benchmark run.
What’s next?
- Voz: working on tracing FSDP
- Edward: fixing bugs