State of symbolic shapes branch

State of symbolic shapes: Mar 12 edition

Previous update: State of symbolic shapes branch - #45 by ezyang

Executive summary

For your information:

  • Training support is now properly fixed on master; modifying functorch config is no longer necessary. Some Inductor bugs were fixed, some are still pending.
  • specialize_int = False (aka --unspecialize-int) is now the default in CI (with some regressions), and will soon be the default for regular users too.
  • Dynamic shapes are working for whole-graph inference (e.g., BERT), but we often overspecialize when there are graph breaks. Two weeks ago we fixed overspecialization on size ints carried across graph breaks (when specialize_int=False); a similar fix for torch.Size carried across graph breaks is pending at Don’t specialize torch.Size with specialize_int = False.
  • A reminder: if you are debugging overspecialization problems, you can slap a torch._dynamo.mark_dynamic_dim(tensor, dim) on the dimension you expect to be dynamic to see whether it actually is. You’ll still have to diagnose the problem yourself; we’ve recently been having success with extra logging on ShapeEnv, c.f. debug shape guards by avikchaudhuri · Pull Request #95848 · pytorch/pytorch · GitHub

Stuff that happened:

The numbers:

  • Model status on master. See also Symbolic shapes work items tracker - Google Sheets
    • aot_eager inference: -1 (-1 WoW). The regression is a new sympy RecursionError in vision_maskrcnn induced by specialize_int=False when running reshape(torch.empty(s1, (s0 + 1)//2, 2), (s1, s0)). Logs
    • aot_eager training: -2 (-2 WoW). The regressions are also two sympy RecursionErrors induced by specialize_int=False. The botnet26t_256 failure looks like it has the same cause (reshape) as vision_maskrcnn, but eca_botnext26ts_256 looks like some sort of modulus problem. Logs
    • inductor inference: -4 (+6 WoW, or unchanged, depending on how you count it). We regressed this stat with specialize_int = False, but we fixed most of the regression in time for the report. We traded one failure for another: volo_d1_224 is now fixed, but convit_base is failing with a new error, “TypeError: Cannot convert symbols to int”.
    • inductor training: -9 (NEW!). Training is enabled in CI! The bulk of the current failures are 'float' object has no attribute '_has_symbolic_sizes_strides' errors (due to AOTAutograd sending graphs with SymFloat inputs, contrary to Inductor’s input contract). There is one accuracy failure with rexnet_100; however, this model is known to have flaky accuracy with static shapes too.
  • Opinfo tests on symbolic shapes.
    • pytest test/ -k test_make_fx_symbolic_exhaustive - 566 passed (+4 WoW), 524 skipped (+1 WoW), 192 xfailed (-3 WoW)
    • pytest test/functorch/ -k test_aot_autograd_symbolic_exhaustive -
  • Graph breaks on master. 0ish (unchanged). hf_Longformer and AllenaiLongformerBase are still diverging intermittently. Graph breaks will be in CI and we will resolve this one way or another.
  • Tracing cost of enabling dynamic shapes (aot_eager). Mean: 15s (-5s), Max: 168s (-72s WoW). Not really sure where the speedups are coming from, but we’ll take it!
    • Repro command: benchmarks/dynamo/ --backend aot_eager --devices cuda --cold-start-latency --ci
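
For intuition on the vision_maskrcnn reshape regression above: flattening (s1, (s0 + 1)//2, 2) into (s1, s0) is only size-preserving when 2*((s0 + 1)//2) == s0, i.e. when s0 is even, so the shape environment has to reason about floor division rather than discharge a trivial equality. A hedged sketch of the arithmetic in plain sympy (not the actual ShapeEnv code):

```python
import sympy

# Symbolic size, as ShapeEnv would model an unbacked input dimension.
s0 = sympy.Symbol("s0", integer=True, positive=True)

# Element count along the trailing dims of empty(s1, (s0 + 1)//2, 2):
# 2 * floor((s0 + 1) / 2).
numel = 2 * sympy.floor((s0 + 1) / 2)

# The reshape to (s1, s0) needs numel == s0, which holds only for even
# s0; for odd s0 it overshoots by one, so no symbolic simplification
# can collapse the expression unconditionally.
assert numel.subs(s0, 4) == 4   # even: sizes match
assert numel.subs(s0, 5) == 6   # odd: mismatch, reshape would be invalid
```

Expressions mixing floor division like this are exactly where sympy's simplifier can end up recursing, which is consistent with the RecursionErrors reported in the aot_eager rows above.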

Known problems

Unchanged from State of symbolic shapes branch - #45 by ezyang

What’s coming next?

  • ezyang: burning down the biggest blocker bugs (right now, that’s float handling). Also need to set up perf CI.
  • Chillee: unclear
  • bdhirsh: per-dispatch key mode stacks and then torchquant/DTensor PT2 support
  • voz: finishing up in flight work
  • jbschlosser: enabling dynamic shapes for nested tensor