A small inductor optimization ablation study

Some optimizations performed by inductor cannot be straightforwardly done with dynamic shapes, as they require us to make assumptions about input shapes. I was curious how much performance we were losing by not performing these optimizations. Before running this experiment, Natalia Gimelshein gave me some idea of what to expect, as she was responsible for most of these optimizations and remembered roughly what percentage improvement each of them gave when it originally landed.

Here are the optimizations I ablated:

  • Divisible by 16 annotations - When generating Triton code, there is an undocumented feature where you can hint to Triton that an input axis is guaranteed to be divisible by 16 (search for divisible_by_16); this helps Triton know that, e.g., outer dimensions are aligned. Alignment is very important, and Natalia expected this to have the most impact. Added in https://github.com/pytorch/torchdynamo/pull/1338 and [inductor] Fix alignment issue with new runtime by jansel · Pull Request #1412 · pytorch/torchdynamo · GitHub
  • Split reductions - Split reductions are a performance optimization for compiling reductions where the non-reduced axis is too small to fully occupy the GPU if we simply assign one reduction per block. When the number of elements to reduce is large, we split each large reduction into smaller partial reductions for better occupancy. Natalia told me split reductions substantially improve performance when they apply, especially in HuggingFace models. Added in [inductor] Improve handling of reductions by jansel · Pull Request #404 · pytorch/torchdynamo · GitHub
  • Persistent reductions - When performing multiple reductions over the same data (e.g., as seen in softmax), the default, non-persistent reduction code re-loads the data from memory on each reduction loop. With persistent reductions, we instead keep the data in shared memory as we perform the multiple reductions. This results in much shorter Triton code, at the cost of more shared memory pressure. These originally reported a 1-5% performance improvement across a variety of models. Added in https://github.com/pytorch/pytorch/pull/92267
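
The split-reduction idea can be sketched on CPU with NumPy. The function name and the `num_splits` knob here are illustrative only (inductor chooses the split factor with its own heuristics); the point is the two-stage structure, where stage 1's partial reductions can all run in parallel:

```python
import numpy as np

def split_reduce_sum(x, num_splits):
    """Two-stage sum over the last axis, mimicking a split reduction.

    Stage 1: each of num_splits chunks is reduced independently (these
    would run on separate GPU blocks); stage 2: the small tensor of
    partial results is reduced to the final answer.
    """
    rows, n = x.shape
    assert n % num_splits == 0  # the real thing handles ragged tails; elided here
    chunk = n // num_splits
    partials = x.reshape(rows, num_splits, chunk).sum(axis=2)  # stage 1
    return partials.sum(axis=1)                                # stage 2

# A "few rows, huge reduction axis" shape: the case split reductions target.
x = np.random.rand(4, 1 << 20).astype(np.float32)
assert np.allclose(split_reduce_sum(x, num_splits=256), x.sum(axis=1), rtol=1e-4)
```

With only 4 rows, a one-block-per-row strategy would leave almost the entire GPU idle; splitting each row into 256 partial reductions gives the scheduler 1024 independent pieces of work.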
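
The persistent-reduction idea can likewise be illustrated with a CPU analogy for softmax, where "shared memory" is stood in for by a local copy of the row. This is purely illustrative of the memory-traffic difference; the real optimization happens in the generated Triton kernel:

```python
import numpy as np

def softmax_nonpersistent(x, block=1024):
    """Non-persistent: each reduction pass re-reads the row block by block."""
    n = x.shape[0]
    m = -np.inf
    for i in range(0, n, block):          # pass 1: max (re-loads x)
        m = max(m, x[i:i + block].max())
    s = 0.0
    for i in range(0, n, block):          # pass 2: sum of exp (re-loads x)
        s += np.exp(x[i:i + block] - m).sum()
    out = np.empty_like(x)
    for i in range(0, n, block):          # pass 3: normalize (re-loads x)
        out[i:i + block] = np.exp(x[i:i + block] - m) / s
    return out

def softmax_persistent(x):
    """Persistent: load the row once, keep it resident for all reductions."""
    row = x.copy()                        # single load into "shared memory"
    row = np.exp(row - row.max())
    return row / row.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
assert np.allclose(softmax_nonpersistent(x), softmax_persistent(x))
```

The persistent version is shorter and touches memory once, but it only works if the whole row fits in the on-chip budget, which is exactly the shared-memory-pressure tradeoff mentioned above.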

For the ablation, I tested the impact of removing any single one of these optimizations, as well as what would happen if I disabled all of them at once. My experiment setup is at [skip ci] Guarded inductor optimization ablation by ezyang · Pull Request #98466 · pytorch/pytorch · GitHub . I ran the experiment on our GCP A100 runners using the instructions at Interpreting PT2 benchmark reports - Google Docs

Without further ado, here are the results:

# cuda amp training performance results
## Geometric mean speedup
                                     huggingface    timm_models    torchbench
---------------------------------  -------------  -------------  ------------
inductor                                    1.4            1.35          1.15
inductor_no_all                             1.33           1.11          1.08
inductor_no_divisible_by_16                 1.4            1.33          1.15
inductor_no_persistent_reductions           1.4            1.34          1.15
inductor_no_split_reductions                1.37           1.2           1.1

Some things that popped out to me:

  • Split reductions matter the most. Re-enabling them will be a priority for us in the dynamic shapes workstream.
  • The impact of these optimizations is not additive. For example, on huggingface, removing divisible_by_16 or persistent_reductions individually seemingly has no impact on performance… but removing everything together costs another 4% beyond what split reductions alone account for!
  • The different benchmark suites have different sensitivity to these changes. Your workload matters! For example, timm_models is much more sensitive to alignment than huggingface or torchbench.

A general theme in the dynamic shapes workstream is that we will have to make tradeoffs between kernel generality and performance; for example, should we generate an extra kernel for when a dynamic dimension is divisible by 16, versus when it is not? On a T5 model (T5 model taking too long with torch compile. · Issue #98102 · pytorch/pytorch · GitHub) it turns out not to be worth it: compile time quadruples, and the (seemingly) increased guard overhead makes overall execution slower.
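
To make the tradeoff concrete, here is a toy pure-Python sketch of guard-based specialization. Every name here is hypothetical (this is not inductor's actual API); it just shows where the extra compile time and the per-call guard overhead come from:

```python
def compile_sum_kernel(divisible_by_16):
    """'Compile' a sum kernel, optionally specialized for 16-divisible sizes."""
    if divisible_by_16:
        def kernel(xs):
            # Specialized: assumes len(xs) % 16 == 0, so the loop can be
            # tiled in aligned chunks of 16 (a stand-in for vectorized loads).
            total = 0.0
            for i in range(0, len(xs), 16):
                total += sum(xs[i:i + 16])
            return total
    else:
        def kernel(xs):
            # General: no divisibility assumption.
            return float(sum(xs))
    return kernel

# Generating both variants multiplies compile work (with two dynamic
# dimensions, as in the T5 case, you get 2 x 2 = 4 kernels)...
fast = compile_sum_kernel(divisible_by_16=True)
general = compile_sum_kernel(divisible_by_16=False)

def run(xs):
    # ...and every single call pays for a guard like this one.
    return fast(xs) if len(xs) % 16 == 0 else general(xs)

assert run(list(range(32))) == sum(range(32))
assert run(list(range(33))) == sum(range(33))
```

Both paths compute the same answer; whether the specialized path's speedup outweighs the guard and compile-time costs is exactly the empirical question the T5 issue answers in the negative.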