Inductor updates

We started this year at approximately a 1.5x speedup on HuggingFace with cudagraphs, and as of today we stand at 1.70x. There have been several improvements to Inductor over this timeframe, not all of them focused on performance:

  1. Cudagraphs now use a new implementation that brings their memory usage on par with non-cudagraph runs, allowing us to enable cudagraphs much more readily (sketch below).

  2. We have robust utilities that allow us to measure and analyze kernel performance.

  3. We have a coordinate descent max-autotuning algorithm that lets us explore a wider search space and compile faster kernels (sketch below).

  4. We are now regularly updating the timm and HuggingFace suites to make sure we are benchmarking the latest versions of the models. Upstream frequently updates model definitions, and in many cases those changes are beneficial for performance, e.g. they remove graph breaks or use the fast scaled dot product attention implementation from core. These changes are now reflected in the benchmarks.

  5. We no longer graph-break on custom autograd functions (this is a Dynamo improvement, not an Inductor one, but it’s critical for performance). This has been an often-requested feature, and it brings dramatic (2x) speedups to Deberta models (sketch below).

  6. We are working on automatically applying a few optimizations that users could achieve by changing their models themselves; since users are usually reluctant to make those changes, out-of-the-box performance is worse than it could be:

  • We are working towards automatic replacement of attention patterns with the efficient scaled dot product attention implementation from core (sketch below)

    • We’ve implemented a flexible pattern-matching infrastructure that can replace patterns in forward-backward graphs

    • We’ve fixed dynamic shapes tracing and cudagraphs compatibility issues for the fast SDPA implementation in core

    • We’ve landed an RNG refactor that makes it easier to match the dropout operations that are part of the pattern

    • We’ve added the attention patterns encountered in real models to the pattern matcher

    • The remaining step is constant folding, which would allow us to remove the all-zero mask that the efficient implementation doesn’t support and thus complete the pattern match

  • Automatic padding of matmul shapes brings an approximately 3% speedup across HF models. It uses the pattern-matcher infrastructure mentioned above.

  • Automatic conversion to channels-last layout is expected to bring significant improvements to the timm benchmark suite (sketch below).

  7. We worked around a Triton perf bug to improve HF performance by 4%.

  8. We’ve enabled checks for out-of-bounds accesses in indexing ops, which previously resulted in either silent wrong results or illegal memory accesses (IMAs). Since indexing ops are also widely used in decompositions of pooling/upsampling operations, where indices are guaranteed to be in bounds, we make sure not to generate the checks for ops that don’t require them, limiting the perf impact.

  9. Quansight contributors implemented a flexible tl.reduce in Triton, which allowed us to implement min/max/argmin/argmax operations that match eager semantics; previously we were incorrectly propagating NaNs and not returning the same values when there were multiple min/max candidates (sketch below). A one-pass Welford variance calculation using tl.reduce is in the works.

  10. We’ve enabled codegen for foreach ops, which will allow us to generate fully fused optimizers. Currently only fully fused Adam and AdamW optimizers are supported in core; we will make this support much more flexible with torch.compile (sketch below).

  11. We are improving the capabilities of AOT Inductor to support real models and dynamic shapes. Low-latency AOT Inductor will be very important for LLM inference, and I’m glad we are making progress here.
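
Below are a few minimal sketches for the items referenced above; they are illustrative, hedged examples rather than the exact code paths described. For item 1, `mode="reduce-overhead"` is what turns cudagraphs on in torch.compile; the new implementation is what keeps the memory cost close to a plain compiled run:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()
x = torch.randn(32, 512, device="cuda")

# "reduce-overhead" enables cudagraphs in Inductor; with the new
# implementation the extra memory cost is close to a non-cudagraph run.
compiled = torch.compile(model, mode="reduce-overhead")

for _ in range(3):  # early iterations warm up and record; later calls replay the graph
    out = compiled(x)
```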
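
For item 3, a sketch of how the wider search is exercised; treat the exact config flag name (`coordinate_descent_tuning`) as my assumption of the Inductor setting involved:

```python
import torch

# Max autotuning benchmarks several Triton configs per kernel; coordinate
# descent tuning then walks the config space one parameter at a time
# looking for a better point.
torch._inductor.config.coordinate_descent_tuning = True  # flag name assumed

def matmul_relu(a, b):
    return torch.relu(a @ b)

compiled = torch.compile(matmul_relu, mode="max-autotune")
out = compiled(torch.randn(1024, 1024, device="cuda"),
               torch.randn(1024, 1024, device="cuda"))
```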
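
For item 5, a toy custom autograd Function; real models such as Deberta use far more involved ones, but the point is that Dynamo now traces through `Function.apply` instead of graph-breaking:

```python
import torch

class Scale(torch.autograd.Function):
    # A custom autograd Function of the kind that used to force a graph break.
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.scale, None

# fullgraph=True makes torch.compile raise if a graph break remains, so it
# doubles as a check that the Function really is traced through.
@torch.compile(fullgraph=True)
def f(x):
    return Scale.apply(x, 2.0).sum()

x = torch.randn(8, requires_grad=True)
f(x).backward()
print(x.grad)  # all 2.0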
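
For the attention-replacement work under item 6, an illustration of the shape of the rewrite: the decomposed pattern the matcher looks for, and the core scaled dot product kernel it is replaced with. The actual registered patterns cover more variants (masks, dropout placement, reshapes), so this is only indicative:

```python
import math
import torch
import torch.nn.functional as F

def decomposed_attention(q, k, v, dropout_p=0.0):
    # The decomposed pattern as it shows up in traced forward/backward graphs.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    attn = F.dropout(torch.softmax(scores, dim=-1), p=dropout_p)
    return attn @ v

def fused_attention(q, k, v, dropout_p=0.0):
    # The replacement: the fast scaled dot product attention kernel in core.
    return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))
print((decomposed_attention(q, k, v) - fused_attention(q, k, v)).abs().max())
```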
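
For the channels-last bullet under item 6, the manual opt-in that the automatic conversion is meant to make unnecessary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

# Today a user gets the layout benefit by opting in explicitly; the goal is
# to pick channels-last automatically for convolution-heavy models.
model = model.to(memory_format=torch.channels_last)
x = x.contiguous(memory_format=torch.channels_last)

out = torch.compile(model)(x)
```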
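
For item 9, the eager semantics that the tl.reduce-based kernels now reproduce (a GPU is needed to actually hit the Triton path):

```python
import torch

def stats(t):
    # max/min propagate NaNs; argmax/argmin tie-breaking follows eager.
    return torch.max(t), torch.argmax(t)

compiled = torch.compile(stats)

x = torch.tensor([1.0, float("nan"), 3.0], device="cuda")
y = torch.tensor([2.0, 5.0, 5.0, 1.0], device="cuda")
print(stats(x), compiled(x))  # max is nan in both eager and compiled
print(stats(y), compiled(y))  # the tied argmax resolves to the same index in both
```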
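
For item 10, a rough sketch of the direction. Compiling the optimizer step is how the foreach codegen gets exercised; compiled-optimizer support was still settling at the time of writing, so treat this as an assumed workflow rather than guaranteed current behavior:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512, device="cuda")
# foreach=True uses the horizontally batched _foreach_* ops; fused=True is
# the hand-written fully fused kernel currently limited to Adam/AdamW.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, foreach=True)

model(torch.randn(64, 512, device="cuda")).sum().backward()

# Compiling the step lets Inductor turn the foreach ops into fused kernels.
compiled_step = torch.compile(opt.step)
compiled_step()
```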
