State of symbolic shapes branch

State of symbolic shapes: Jun 19 edition

Previous update: State of symbolic shapes branch - #58 by ezyang

Executive summary

  • Dynamic and blueberries are now in the benchmark suite as model sets. A model set (notated with square brackets) is a subset of models from our existing benchmarks which we aggregate separately to track something we care about. The Dynamic model set covers models for which we expect dynamic shapes support to be relevant. Here is the current list, along with some potential threats to validity that need follow up:
    // _generate variants are good; they do E2E autoregressive
    // generation and will induce varying context length.
    cm3leon_generate
    nanogpt_generate
    hf_T5_generate
    // detection models are ok-ish; the good news is they call
    // nonzero internally and exercise dynamic shapes that way,
    // the bad news is we may not run enough iterations with
    // varying data to get varying numbers of bounding boxes.
    detectron2_fcos_r_50_fpn
    vision_maskrcnn
    // this recommendation model internally uses sparse tensors,
    // but once again it's not clear that dynamic shapes are
    // exercised on this sparsity
    dlrm
    // these language models are only running a single next
    // word prediction, so we're NOT testing dynamic sequence
    // length performance
    llama
    BERT_pytorch
    hf_T5
    // the GNN benchmarks only run one batch, so you
    // aren't actually triggering dynamism (and we didn't
    // explicitly mark anything as dynamic)
    basic_gnn_edgecnn
    basic_gnn_gcn
    basic_gnn_gin
    basic_gnn_sage
    
    The blueberries set is meant to capture important LLM models, but it is very much a WIP right now.
  • Dynamic shapes by default. We made a lot of progress. Phase 1 is completely landed in master; Phase 2 has a PR open that is passing all CI tests: Enable automatic_dynamic_shapes by default by ezyang · Pull Request #103623 · pytorch/pytorch · GitHub. After discussion with CK/Xiaodong we're also going to try YOLO'ing internal enablement, after I add instrumentation that will help us detect when automatic_dynamic_shapes triggers. I also promised gchanan that I would rename automatic_dynamic_shapes to something clearer, maybe automatic_dynamic_on_recompile. PSA: you probably don't want dynamic=True, especially if you're running into bugs; use automatic_dynamic_shapes=True (see the first sketch after this list)!
  • How to deal with dynamic shapes when your feature doesn't support them. So you want to add a new feature to PT2, but it doesn't work with dynamic shapes. What can you do?
    • Force specialization when it applies. All backends (e.g., inductor) are permitted to force extra specializations that are not strictly necessary. So if you know that you absolutely want your feature to apply, you can simply specialize (e.g., by int()'ing a SymInt). With dynamic shapes, you may end up with some extra int inputs in your FX graph that are actually static, but these are easy enough to ignore by testing whether each input is a Tensor or not (sketched after this list). This is what we did for CUDA graphs.
    • Test whether there are any torch.fx.experimental.symbolic_shapes.free_symbols. If everything is static, then there are no free symbols. This works best when you're in a local situation where you need to decide what to do with a single tensor, but it's also doable if you're doing analysis on an FX graph (you just may need to check multiple nodes); see the last sketch after this list. This is what we did for layout optimization.
  • Notable bug fixes.
  • Notable new issues.
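
Since the PSA above keeps coming up, here is a minimal sketch of the recommended configuration (flag spellings as of today; automatic_dynamic_shapes may get renamed as mentioned above):

    import torch
    import torch._dynamo.config as dynamo_config

    # Preferred: compile statically at first, and only mark a dimension as
    # dynamic when a recompile is triggered by a size change on that dimension.
    dynamo_config.automatic_dynamic_shapes = True

    def f(x):
        return x * 2

    compiled = torch.compile(f)      # note: NOT torch.compile(f, dynamic=True)
    compiled(torch.randn(4, 8))      # first call: static shapes
    compiled(torch.randn(7, 8))      # size change: recompile with dim 0 dynamic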
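For the "force specialization" option, here is a minimal sketch of the two tricks it mentions; force_static and tensor_inputs_only are hypothetical helper names, not an existing API:

    import torch

    def force_static(size):
        # `size` may be a plain int or a torch.SymInt.  Calling int() on a
        # SymInt installs a guard on the value actually seen and specializes
        # the compiled graph to it, so the feature can assume a static size.
        return int(size)

    def tensor_inputs_only(example_inputs):
        # Under dynamic shapes the FX graph may gain extra scalar (SymInt)
        # inputs alongside the tensors; a backend feature that only cares
        # about tensors can simply skip the non-Tensor inputs.
        return [x for x in example_inputs if isinstance(x, torch.Tensor)]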
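And for the free_symbols check, a minimal sketch; is_static and graph_is_static are hypothetical names, and the node.meta["val"] convention assumes an AOTAutograd/export-style graph where each node carries a fake-tensor example value:

    import torch
    from torch.fx.experimental.symbolic_shapes import free_symbols

    def is_static(t):
        # Local check: a tensor (or SymInt) is fully static iff it mentions
        # no free shape symbols, so a static-only optimization is safe here.
        return len(free_symbols(t)) == 0

    def graph_is_static(gm: torch.fx.GraphModule):
        # Whole-graph check (e.g. for a layout optimization pass): look at
        # the example value recorded on each node and bail out if any of
        # them involve a free symbol.
        for node in gm.graph.nodes:
            val = node.meta.get("val")
            if val is not None and len(free_symbols(val)) > 0:
                return False
        return True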

CI skips. -3, -1, -1, -2 (-2, 0, 0, 0 WoW). The regression is dlrm and hf_T5_generate failing after the switch of inference benchmarking from float32 to bfloat16, tracked at dlrm and hf_T5_generate fails aot_eager with bfloat16+dynamic_shapes · Issue #103760 · pytorch/pytorch · GitHub.

Training dashboard (as of 7b3242d5f7). This week on HUD

Metric   | Torchbench    | Huggingface | TIMM models   | Dynamic
Passrate | 89%, 57/64    | 98%, 45/46  | 100%, 60/60   | 88%, 7/8
Speedup  | 1.13x → 1.11x | 1.59x       | 1.18x → 1.19x | 1.29x → 1.30x
Comptime | 79s → 67s     | 103s → 99s  | 136s → 110s   | 33s → 31s
Memory   | 0.93x → 0.94x | 1.00x       | 1.01x         | 1.59x

Not much to report. The torchbench speedup decrease appears to be due to a clear 10% regression on timm_efficientdet. However, it's unclear how real this regression is, because this model has always failed accuracy. The TIMM change is within noise.

Inference dashboard (as of 7b3242d5f7). This week on HUD

Inference was swapped to bfloat16, so… we don't really have any historical point of comparison, because previously we were only running AMP. Here's a snapshot of the data from the most recent run.

Metric   | Torchbench  | Huggingface | TIMM models | Dynamic
Passrate | 88%, 63/72  | 100%, 46/46 | 100%, 60/60 | 58%, 7/12
Speedup  | 1.52x       | 1.64x       | 1.72x       | 1.92x
Comptime | 24s         | 38s         | 30s         | 45s
Memory   | 0.82x       | 1.15x       | 1.06x       | 1.11x

Some thoughts from an apples-to-oranges comparison:

  • In absolute terms, the torchbench pass rate went down, but two models (I cannot easily tell which ones from the display) were removed from the suite entirely.
  • PT2 is more beneficial on bfloat16 than AMP, which is expected!
  • Memory compression is extremely bad. We still need to figure this out.

What’s next?

  • Edward: PSC, dynamic by default last mile and internal telemetry, maybe bug fixing if I can squeeze it in