State of symbolic shapes branch

State of PT2: Sep 15, 2023 edition

Previous update: State of symbolic shapes branch - #69 by ezyang

Executive summary

Dynamo

Inductor

Composability sync hit a lot of topics this week (Composability meeting notes - Google Docs). Topics that weren’t otherwise covered in this doc:

  • Elias told us how the SDPA patterns in the pattern matcher (and others; both inference and training patterns are supported) are now compiled ahead of time, making it much cheaper to support lots of patterns. We took advantage of that to add more patterns matching other SDPA variants. Add Python serialization to Pattern Matcher patterns by eellison · Pull Request #108894 · pytorch/pytorch · GitHub (A sketch of the kind of pattern involved is given after this list.)
  • Chien-Chin told us about the new PT2 DDP plans. We cannot directly trace DDP because it is implemented in C++, and we cannot easily port it to Python because the implementation is complicated by bucketing. So the idea is to implement a non-bucketed DDP in Python and rely on compile to recover the bucketing optimization. (See the non-bucketed DDP sketch after this list.)
  • Horace told us about developments in LLMs. One thing he wants is dequant primitives in PT2: a way to take int3/int4 packed values and unpack them into a larger tensor, with the idea that PT2 would compile away the memory traffic. In general he doesn’t think we should do this directly in PT, as there are so many quantization formats. (A sketch of int4 unpacking is included after this list.)
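
To illustrate the pattern-matching point, here is a minimal sketch (not the actual Inductor pattern definitions) of the kind of decomposed SDPA computation the pattern matcher searches for and the fused op it rewrites it to; with the PR above, the traced search patterns are serialized to Python ahead of time instead of being re-traced on every compile.

```python
import math
import torch

def sdpa_decomposed(q, k, v):
    # The "search" side: softmax(q @ k^T / sqrt(d)) @ v, as attention looks
    # after being decomposed into primitive ops in a captured graph.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def sdpa_fused(q, k, v):
    # The "replacement" side: the fused SDPA kernel.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))
torch.testing.assert_close(sdpa_decomposed(q, k, v), sdpa_fused(q, k, v),
                           rtol=1e-4, atol=1e-4)
```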
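
The non-bucketed DDP idea, roughly: allreduce each parameter’s gradient as soon as it is accumulated, and let the compiler fuse and overlap the communication. A minimal sketch, assuming a process group is already initialized and using the post-accumulate-grad hook added in recent PyTorch; the actual PT2 DDP design may differ.

```python
import torch
import torch.distributed as dist

def naive_ddp(module: torch.nn.Module, world_size: int) -> torch.nn.Module:
    # Non-bucketed DDP: one allreduce per parameter, issued as soon as that
    # parameter's gradient is accumulated. The hope is that compile recovers
    # the bucketing/overlap that the C++ DDP implements by hand.
    for param in module.parameters():
        def allreduce_grad(p):
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
        param.register_post_accumulate_grad_hook(allreduce_grad)
    return module
```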
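
For the dequant primitive, the core operation is just unpacking: e.g. two int4 values stored per uint8 byte get expanded into a tensor twice as large (and then scaled). A minimal sketch of int4 unpacking in plain PyTorch; the point is that PT2 could fuse this into the consumer and never materialize the unpacked tensor.

```python
import torch

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    # packed: uint8 tensor holding two 4-bit values per byte.
    # Returns a tensor twice as long on the last dim, with values in [0, 15].
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=-1).flatten(-2)

packed = torch.tensor([0x21, 0x43], dtype=torch.uint8)
print(unpack_int4(packed))  # tensor([1, 2, 3, 4], dtype=torch.uint8)
```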

Dynamic shapes

  • Last week I mentioned opcheck testing is usable, but Richard Zou is still evolving it based on user feedback. A recent change puts the xfails into a JSON file so it can easily be updated automatically. However, folks still complain that it’s too hard to understand what went wrong when a test crashes. Richard is going to investigate a two-stage process, whereby we separate generating test inputs from actually running the tests. To keep the generated test inputs up to date, we only need a single new test that runs all of the tests in the test file in one go and cross-references which tests were exercised against what we have recorded.
  • Horace wants a version of Tensor where some of the sizes are stored on device. This would let you perform a data-dependent operation without synchronizing, while still saving on memory traffic, because kernels would mask out memory loads that go out of bounds of the dynamic shape. In some sense, this is a specialization of jagged tensor where everything in the jagged dimension has the same size. (A rough illustration of the masking idea follows this list.)
  • Notable bug fixes:
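
A rough illustration of the masking idea Horace described: keep the dynamic length as a device tensor and mask out-of-bounds elements instead of slicing, so no host synchronization is needed. This is a hypothetical example in eager PyTorch, not an existing PT2 API.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A buffer allocated at the maximum size, plus a length that lives on device
# (e.g. produced by a data-dependent op such as a nonzero/unique count).
max_len = 1024
buf = torch.randn(max_len, device=device)
length = torch.tensor(700, device=device)  # never copied back to the host

# Instead of buf[:length] (which requires a host sync to know the size),
# compute over the full buffer and mask out-of-bounds positions.
mask = torch.arange(max_len, device=device) < length
masked_sum = (buf * mask).sum()
```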

Numbers

This is nearly a month’s worth of numbers!

Training. 34ddf08f27 dashboard

Inference. 34ddf08f27 dashboard

  • A lot of torchbench improvement: detectron2_fcos_r_50_fpn, doctr_reco_predictor, drq, llama, pyhpc_turbulent_kinetic_energy all now pass accuracy.
  • cudagraphs freezing accuracy improvement in timm models, likely from some major bugfixes for freezing
  • pytorch_stargan had a huge perf improvement (c2ac0da445cfe3d848342926f9cd4422bd35bfe2…781b7ebe912ec24cbd917cd548b748b1650ab6a2)
  • HuggingFace regression due to pin update: Problems hit when upgrading the version of HF used in CI · Issue #108145 · pytorch/pytorch · GitHub
  • Fairly large AOT Inductor regression due to ABI changes.