State of symbolic shapes branch

State of symbolic shapes: Apr 1 edition

Previous update: State of symbolic shapes branch - #48 by ezyang

Executive summary

  • Some bug fixes.
  • Dynamic shapes hits CI. Dynamic shapes is now on the new experimental performance dashboard; thanks to the new CI-based GCP A100 infrastructure, it was really easy to add a new configuration. We’re switching our weekly stats reporting to it. One thing to note: this runs dynamic shapes in a non-standard configuration where ONLY the batch size is treated as dynamic. Performance and compile time are quite a bit worse if you do “YOLO everything dynamic shapes” with dynamic=True (the difference between the two modes is sketched in the first code block after this list). By the way, there’s also now a script to summarize performance from a CI performance run if you want to do a one-off experiment.
  • Multi-level cache / shape env. Voz, Natalia and I have begun discussing in earnest how exactly to set up the multi-level cache and shape env. Here is our current thinking:
    • It’s not just a two-level cache. We need the ability to use guards to switch between several possible artifacts in several situations: (1) in backwards, in case compiling the backwards graph results in more guards than were initially installed in forwards; (2) in backwards, in case the passed-in gradient is not contiguous (e.g., a channels-last gradient) and changes the validity of the compiled backwards kernel; (3) in forwards, when compiling graphs with operators that have data-dependent output shapes. Originally, our dogma was that such operators should result in graph breaks until inductor learned how to compile them (much later in the future). However, graph breaks inside deeply inlined functions are extremely bad for performance: if you have the call stack f1-f2-f3 and f3 graph breaks, you end up having to compile six separate graphs (f1-pre, f2-pre, f3-pre, f3-post, f2-post and f1-post), with no opportunity to fuse among them (this blowup is sketched as code after this list). A post-dynamo, pre-inductor notion of graph break would allow inductor to guard on the specifics of the data-dependent shape, while still permitting these fusion opportunities.
    • No replacements after the first level. A big complication with a generalized multi-stage ShapeEnv is that we have a large number of SymNodes floating around, whose saved sympy expressions automatically get updated whenever we discover a new replacement (e.g., mapping s0 to 3). Although there are some complicated schemes that can handle this in full generality, our current thinking is to restrict replacements to the first level. At later levels (e.g., in backwards), we can only add guards; we will never add new replacements. In practice, this should work well: backward shapes match their forwards, so the only new guards will be Inductor-level guards on alignment, etc. (a minimal sketch of this policy follows the list).
  • State of real-world model enablement.
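
To make the two benchmark configurations concrete, here is a minimal sketch. The tiny model and input are made up for illustration; torch._dynamo.mark_dynamic and the dynamic=True flag to torch.compile are the actual knobs involved.

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in model, for illustration only
inp = torch.randn(8, 16)

# CI dashboard configuration: ONLY the batch dimension (dim 0) is dynamic.
torch._dynamo.mark_dynamic(inp, 0)
opt_model = torch.compile(model)
opt_model(inp)

# “YOLO everything dynamic shapes”: every dimension of every input is
# treated as dynamic, which currently costs noticeably more in both
# performance and compile time.
opt_model_yolo = torch.compile(model, dynamic=True)
opt_model_yolo(inp)
```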
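
Next, the graph-break blowup sketched as runnable code. The functions f1/f2/f3 are hypothetical; nonzero() stands in for any operator with a data-dependent output shape that forces a graph break today.

```python
import torch

def f3(x):
    # Data-dependent output shape: today this forces a graph break.
    return x.nonzero().float().sum()

def f2(x):
    return f3(x * 2) + 1

def f1(x):
    return f2(x.relu()) * 3

# With f1-f2-f3 inlined and f3 breaking, Dynamo ends up compiling six
# fragments (f1-pre, f2-pre, f3-pre, f3-post, f2-post, f1-post), with no
# fusion across them. A post-dynamo, pre-inductor graph break would keep
# the surrounding code in one graph and let inductor guard on the
# data-dependent shape instead.
out = torch.compile(f1)(torch.randn(4, 4))
```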
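
Finally, a minimal sketch of the “no replacements after the first level” policy. This is emphatically not the real ShapeEnv API, just an illustration of the invariant: the first level may simplify symbols via replacements, while later levels (e.g., backwards) may only accumulate guards.

```python
import sympy

class MiniShapeEnv:
    """Toy stand-in for ShapeEnv, illustrating the leveling policy only."""

    def __init__(self, level=0, parent=None):
        self.level = level
        self.parent = parent
        self.replacements = {}  # e.g., {s0: 3}; first level only
        self.guards = []

    def add_replacement(self, sym, val):
        if self.level > 0:
            # Later levels never rewrite saved sympy expressions, so
            # existing SymNodes stay valid.
            raise RuntimeError("replacements are only allowed at the first level")
        self.replacements[sym] = val

    def add_guard(self, expr):
        # Guards (e.g., Inductor-level alignment checks) are fine at any level.
        self.guards.append(expr)

s0 = sympy.Symbol("s0")
root = MiniShapeEnv()
root.add_replacement(s0, 3)               # OK: first level
backward = MiniShapeEnv(level=1, parent=root)
backward.add_guard(sympy.Eq(s0 % 16, 0))  # OK: guards only at later levels
```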

The numbers:

  • Model status on master.
    • CI skips (aot_eager inference/training; inductor inference/training): -1, -2, -7, -1 (all unchanged)
    • Perf passrate (torchbench, huggingface, timm_models): 54/63, 41/45, 61/62 (new!) (delta: -2, -3, -1)
    • Geomean speedup: 1.09x, 1.28x, 1.17x (new!) (delta: -.06x, -.11x, -.18x)
    • Mean compilation time: 84s, 104s, 141s (new!) (delta: +48s, +39s, +36s)
    • Peak memory footprint compression ratio: 0.76x, 0.99x, 0.91x (new!) (delta: -.16x, -.05x, -.10x)

What’s coming next?

  • Voz: Multi-level cache/shape env
  • Avik: Ostensibly unblocked on dynamic_dim(…) <= 2 constraints API
  • Edward: Some CM3Leon, some understanding why things are slow
  • Horace: 1. Dynamic shape minifier, 2. Some shape padding stuff, 3. Pre-autograd make_fx