State of symbolic shapes: Jun 3 edition
Previous update: State of symbolic shapes branch - #56 by ezyang
Executive summary
- This update covers two weeks, on account of the Memorial Day holiday, and also because most of the dynamic shapes crew was working on FSDP tracing.
- PT Core Libraries offsite. PyTorch Core Libraries had an offsite. The big dynamic shapes relevant conversation we had was with @jbschlosser on nested tensor support in PT2. There will be an announcement post coming soon about this working E2E; now we need to roll our sleeves up and put it into core PyTorch. One big resolution from our discussion was that it is not necessary to model jagged dimensions in our lowering stack: while a jagged dimension, e.g., (B, H, W) where height and width can vary per sample, is intuitive for end users, during lowering it is acceptable to represent this as an ordinary 1D dense tensor (B*H*W,) carrying extra metadata that says how to reconstruct the jagged structure (see the toy sketch at the end of this summary). This is because passes like autograd do not care about the jagged structure of the tensor. Additionally, one theme was that a lot of PyTorch library developers really like doing development in PT2 now, so there is a lot of interest in improving the "I want to add a new feature to PyTorch, and I will use PT2 so I don't have to write a CUDA kernel" workflow. Dynamic shapes is pretty essential for kernel writers!
- Drowning in bugs. Two broken models that are not in the benchmark suite which I've been eyeballing: Fine-tuning HuggingFace wav2vec 2.0 with `torch.compile` · Issue #101160 · pytorch/pytorch · GitHub (this one is broken in a lot of non-dynamic-shape-related ways) and [torch.compile] Name 'Ne' is not defined (Stable Diffusion) · Issue #101228 · pytorch/pytorch · GitHub (there are a bunch of potential small changes which could fix it).
- A reach out from FT users. Some folks using FT (FasterTransformer) to do LLM inference are interested in what the long-term state of dynamic shapes and PT2 will be; will PT2 be a viable alternative to FT? Today, our gap with FT is moderately significant, especially because we cannot use CUDA graphs with dynamic shapes. However, we hope that (1) with things like kvcache, you do not actually need dynamic shapes, and (2) with continual improvements to PT2, we will become a competitive and much more user-friendly alternative to FT. Hopefully, with our increasing focus on LLMs (thanks @drisspg and the rest of the blueberries folks), we should continue to make progress on this front.
- Notable new issues.
- Dynamo should only unroll loops by a preset factor (unless otherwise explicitly instructed) · Issue #102839 · pytorch/pytorch · GitHub - Now that we're running torch.compile on end-to-end autoregressive generation, this is showing up, especially with nanogpt_generate (which is simple enough that there are no graph breaks). Driss volunteered to look into it.
- mark_dynamic may error too aggressively · Issue #102814 · pytorch/pytorch · GitHub - mark_dynamic is kind of hard to use. A stopgap may be to have a variant that doesn't error if the marked dimension fails to be dynamic; a minimal usage sketch is below.
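For context, here is a minimal sketch of the usage pattern and the failure mode in question (the function and shapes are illustrative, not taken from the issue):

```python
import torch
import torch._dynamo as dynamo

def f(x):
    # Branching on the size forces dim 0 to specialize, which conflicts
    # with the mark_dynamic annotation below and makes compilation fail.
    if x.size(0) == 8:
        return x * 2
    return x + 1

x = torch.randn(8, 16)
dynamo.mark_dynamic(x, 0)  # declare that dim 0 must be treated as dynamic

try:
    torch.compile(f)(x)
except Exception as err:  # currently surfaces as a ConstraintViolationError
    print(f"mark_dynamic constraint was violated: {err}")
```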
- Notable bug fixes.
- Improve repeat_interleave with scalar repeat value by peterbell10; we no longer specialize if you call repeat_interleave with a scalar repeat value
- Graph break on differentiable boolean mask setitem
- RelaxUnspecConstraint some more - this makes mark_dynamic no longer complain if you mark a 0/1 size dim dynamic (it will specialize, but no big deal)
CI skips. -1, -1, -1, -2 (+1, 0, -1, 0). hf_T5_generate and cm3leon_generate are now passing (though hf_T5_generate in a somewhat hacky way). The new failure is nanogpt_generate, which was previously failing even with static shapes; a new work item for us.
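As promised above, here is a toy sketch of the jagged lowering idea (illustrative only, not the actual nested tensor implementation; names like `values`, `offsets`, and `sample` are made up): a jagged (B, H, W) batch is carried as one flat dense buffer plus metadata for reconstructing each sample.

```python
import torch
from itertools import accumulate

sizes = [(2, 3), (4, 4), (1, 5)]                    # per-sample (H_i, W_i)
numels = [h * w for h, w in sizes]
values = torch.randn(sum(numels))                   # flat dense buffer of size sum(H_i * W_i)
offsets = [0] + list(accumulate(numels))            # metadata to locate each sample in the buffer

def sample(i):
    # Rebuild the i-th (H_i, W_i) view from the flat buffer.
    h, w = sizes[i]
    return values[offsets[i]:offsets[i + 1]].view(h, w)

# Pointwise ops (and autograd through them) never need to know the jagged
# structure; they can operate directly on the flat buffer.
doubled = values * 2
print(sample(1).shape)  # torch.Size([4, 4])
```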
The dashboard (as of 8215468870). This fortnight on HUD
There is a discontinuity in speedup, due to a change in how we compute it: we now (1) clamp per-model speedups to 1x (previously, a PT2-caused slowdown could depress the overall speedup; this is why torchbench was revised from 1.12 to 1.15), and (2) include models that fail accuracy in the geomean speedup as 1x (this depresses the geomean speedup; e.g., HuggingFace was revised from 1.48 to 1.45).
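For illustration, the new accounting works roughly like this (made-up numbers, not the dashboard's actual data):

```python
import math

results = [
    ("model_a", 1.40, True),   # (name, speedup over eager, passes accuracy)
    ("model_b", 0.85, True),   # PT2-caused slowdown: clamped to 1x
    ("model_c", 1.25, False),  # accuracy failure: counted as 1x
]

# Clamp slowdowns to 1x, count accuracy failures as 1x, then take the geomean.
clamped = [max(speedup, 1.0) if ok else 1.0 for _, speedup, ok in results]
geomean = math.exp(sum(math.log(s) for s in clamped) / len(clamped))
print(f"geomean speedup: {geomean:.2f}x")
```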
| Metric | Torchbench | Huggingface | TIMM models |
| --- | --- | --- | --- |
| Passrate | 88%, 56/64 | 98%, 44/45 | 100%, 60/60 |
| Speedup | 1.15x → 1.16x | 1.45x → 1.53x | 1.20x → 1.22x |
| Comptime | 79s | 100s → 103s | 134s → 135s |
| Memory | 0.93x → 0.94x | 0.97x → 1.00x | 1.01x |
Notes:
- HuggingFace and torchbench improvements are due to several broad-based optimizations that affected all configurations; these include "inductor: eliminate meaningless copy" and "squash xblock for persistent inner reduction" (the latter disproportionately affects transformer models and accounts for all of the torchbench improvement).
- The 2% TIMM models geomean increase is attributable to a big improvement in lcnet_050. It is not entirely clear which PR booked this win, but my guess is the Triton pin update.
- There is a huge performance win for TIMM models landed via convolution layout optimization, but it is disabled for dynamic shapes, so we didn't see any benefit from it.
I probably ought to report inference numbers too, but they are only being run twice a week and our latest set of improvements is not in an official benchmark run.
What’s next?
- Voz: working on tracing FSDP
- Edward: fixing bugs