State of symbolic shapes: Apr 1 edition
Previous update: State of symbolic shapes branch - #48 by ezyang
Executive summary
- Some bug fixes:
- Propagate inductor guards to ShapeEnv - this should have been fixed long ago, but c’est la vie. Figuring out that this mattered in practice was a doozy; check out aviros’s writeup at Debugging story: The case of the garbage text generation
- Graph break on operators that fake tensor doesn’t support - this fixes models that use nms and dynamic shapes (e.g., maskrcnn)
- Dynamic shapes hits CI. Dynamic shapes is now on the new experimental performance dashboard; thanks to the new CI-based GCP A100 infrastructure, it was really easy to add a new configuration. We’re switching our weekly stats reporting to it. One thing to note: this runs dynamic shapes in a non-standard configuration where ONLY the batch size is treated as dynamic. Performance and compile time are quite a bit worse if you do “YOLO everything dynamic shapes” with dynamic=True (see the sketch below). By the way, there’s also now a script to summarize performance from a CI performance run if you want to do a one-off experiment.
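To make the distinction between the two configurations concrete, here is a minimal sketch (not the actual CI runner setup, and the model is a made-up toy): only the batch dimension is marked dynamic via torch._dynamo.mark_dynamic in the first case, versus dynamic=True for everything in the second. This assumes a build recent enough that mark_dynamic is honored by default.

```python
import torch
import torch._dynamo as dynamo


class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x @ x.transpose(-1, -2))


opt_model = torch.compile(Tiny())

# "Only batch size dynamic" (roughly what the CI configuration measures):
# mark dim 0 of the input as dynamic so the compiled graph keeps a symbolic
# batch dimension, while every other dimension stays static.
x = torch.randn(8, 32, 32)
dynamo.mark_dynamic(x, 0)
opt_model(x)
opt_model(torch.randn(16, 32, 32))  # should reuse the same graph for the new batch size

# "YOLO everything dynamic shapes": ask for every dimension to be symbolic up
# front. More general, but compile time and runtime are currently worse.
opt_model_dyn = torch.compile(Tiny(), dynamic=True)
opt_model_dyn(torch.randn(8, 32, 32))
```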
- Multi-level cache / shape env. Voz, Natalia and I have begun discussing in earnest how exactly to set up multi-level cache and shape env. Here is our current thinking:
- It’s not just a two-level cache. We need the ability to use guards to switch between several possible compiled artifacts, in several situations: (1) in backwards, in case compiling the backwards graph results in more guards than were initially installed in forwards, (2) in backwards, in case the passed-in gradient is not contiguous (e.g., channels-last gradient) and changes the validity of the compiled backwards kernel, (3) in forwards, when compiling graphs with operators that have data-dependent output shapes. Originally, our dogma was that such operators should result in graph breaks until inductor learned how to compile them (much later in the future). However, graph breaks inside deeply inlined functions are extremely bad for performance, since if you have the call stack f1-f2-f3 and f3 graph breaks, you end up having to compile six separate graphs: f1-pre, f2-pre, f3-pre, f3-post, f2-post and f1-post. There is no opportunity to fuse among these graphs. A post-dynamo, pre-inductor notion of graph break would allow inductor to guard on the specifics of the data-dependent shape, while still permitting these fusion opportunities (a toy illustration follows this list).
- No replacements after the first level. A big complication with a generalized multi-stage ShapeEnv is that we have a large number of SymNodes floating around, whose saved sympy expressions automatically get updated if we discover a new replacement (e.g., mapping s0 to 3). Although there are some complicated schemes that can handle this in full generality, our current thinking is that we will restrict replacements to the first level. For later levels (e.g., in backwards), we can only add guards; we will never add new replacements. In practice, this should work well, as backward shapes match their forwards, so the only new guards will just be Inductor-level guards on alignment, etc. (see the sympy sketch below).
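As a toy illustration of the inlining problem mentioned above: in the sketch below the data-dependent operator sits three frames deep, with torch.nonzero standing in for any operator whose output shape depends on data (the f1/f2/f3 functions are hypothetical).

```python
import torch

def f3(x):
    # torch.nonzero has a data-dependent output shape, so dynamo currently
    # graph-breaks here instead of tracing it into the symbolic graph.
    idx = torch.nonzero(x > 0)
    return idx.shape[0]

def f2(x):
    return f3(x * 2)

def f1(x):
    return f2(x + 1)

opt_f1 = torch.compile(f1)
print(opt_f1(torch.randn(8)))

# Because the break happens three frames deep (f1 -> f2 -> f3), dynamo compiles
# separate "pre" and "post" fragments for each inlined frame, and nothing can be
# fused across them. A post-dynamo, pre-inductor graph break would instead keep
# f1/f2/f3 together and let inductor guard on the data-dependent size.
```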
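And a toy sympy sketch (not the real ShapeEnv code) of the replacement-vs-guard distinction: a replacement rewrites every saved expression in place, whereas a later-level guard only records a condition and leaves the expressions alone.

```python
import sympy

s0, s1 = sympy.symbols("s0 s1", positive=True, integer=True)

# An expression some SymNode might be holding onto.
expr = s0 * s1 + s0

# First level: discovering s0 == 3 is applied as a *replacement*, rewriting
# every saved expression so that s0 disappears from it entirely.
print(expr.subs(s0, 3))   # 3*s1 + 3

# Later levels (e.g., backwards): we only record a *guard* such as Eq(s0, 3)
# and leave the saved expressions untouched, so outstanding SymNodes never
# need to be updated after the fact.
print(sympy.Eq(s0, 3))
```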
- State of real world model enablement.
- torchbench pin is updated, but many of the new models we were interested in are failing in CI. LLaMA is failing due to missing complex32 support (see also [Inductor] Complex multiplication fails in inductor with KeyError: torch.complex64 · Issue #97562 · pytorch/pytorch · GitHub); gat, gcn and sage are failing due to only supporting CPU (weird, this sounds like a misconfiguration); InstructPix2Pix is currently stuck in the canary model set until we fix it not to download model weights.
- detectron2 - I did some initial investigation on detectron2_fasterrcnn_r_101_c4 and noticed it is failing on translation to torch._assert. (cc Tugsuu)
- An internal multimodal model (Meta only) is 5x slower on Azure A100 but 5x faster on devgpu A100. Weird! It also seems to infinite loop on the devgpu A100 if you remove the graph breaks. I have the go-ahead from the research scientist to publish a sanitized version of the model without weights, coming soon.
The numbers:
- Model status on master.
- CI skips (aot_eager inference/training; inductor inference/training): -1, -2, -7, -1 (all unchanged)
- Perf passrate (torchbench, huggingface, timm_models): 54/63, 41/45, 61/62 (new!) (delta: -2, -3, -1)
- Geomean speedup: 1.09x, 1.28x, 1.17x (new!) (delta: -.06x, -.11x, -.18x)
- Mean compilation time: 84s, 104s, 141s (new!) (delta: +48s, +39s, +36s)
- Peak memory footprint compression ratio: 0.76x, 0.99x, 0.91x (new!) (delta: -.16x, -.05x, -.10x)
What’s coming next?
- Voz: Multi-level cache/shape env
- Avik: Ostensibly unblocked on dynamic_dim(…) <= 2 constraints API
- Edward: Some CM3Leon, some understanding why things are slow
- Horace: 1. Dynamic shape minifier, 2. Some shape padding stuff, 3. Pre-autograd make_fx