We started this year at approximately a 1.5x speedup on HuggingFace with cudagraphs, and as of today we stand at 1.70x. There have been several improvements to inductor over this timeframe, not all of them focused solely on performance:
- Cudagraphs now use a new implementation that brings cudagraph memory usage on par with non-cudagraph runs, which lets us enable cudagraphs much more readily (see the sketch after this list)
- We have robust utilities that allow us to measure and analyze kernel performance
- We have a coordinate descent max-autotuning algorithm that explores a wider search space to compile faster kernels (sketch below)
- We now regularly update the timm and HuggingFace suites to make sure we are benchmarking the latest versions of the models. Upstream frequently updates model definitions, and in many cases those changes are beneficial for performance, e.g. they remove graph breaks or use the fast scaled dot product attention implementation from core. These changes are now reflected in the benchmarks
- We no longer graph-break on custom autograd functions (this is a dynamo improvement, not an inductor one, but it's critical for performance). This has been an often-requested feature and it brings dramatic (2x) speedups to Deberta models (sketch below)
- We are working on automatically applying a few optimizations that users could achieve by changing their models; since users are usually reluctant to make those changes, out-of-the-box performance is worse than it could be:
  - We are working towards automatic replacement of attention patterns with the efficient implementation from core (sketch below):
    - We've implemented flexible pattern matching infrastructure that can replace patterns in forward-backward graphs
    - We've fixed dynamic shapes tracing and cudagraphs compatibility issues for the fast SDPA attention implementation in core
    - We've landed an RNG refactor that makes it easier to match the dropout operations that are part of the pattern
    - We've added the attention patterns encountered in real models to the pattern matcher
    - The remaining step is constant folding, which would let us remove the all-zero mask that the efficient implementation doesn't support and thus complete the pattern match
  - Automatic padding of matmul shapes brings approximately 3% across HF models. It uses the pattern matcher infrastructure mentioned above (sketch below)
  - Automatic conversion to the channels-last layout is expected to bring significant improvements to the timm benchmark suite (sketch below)
- We worked around a Triton perf bug to improve HF performance by 4%
- We've enabled checks for out-of-bounds accesses in indexing ops, which previously resulted in either silently wrong results or IMAs (illegal memory accesses). Since indexing ops are also widely used in decompositions of pooling/upsampling operations where indices are guaranteed to be in bounds, we make sure not to generate the checks for ops that don't require them, to limit the perf impact (sketch below)
- Quansight contributors implemented a flexible tl.reduce in Triton, which allowed us to implement min/max/argmin/argmax operations that match eager semantics; previously we were handling NaN propagation incorrectly and not returning the same values as eager when there were multiple min/max candidates (sketch below). A 1-pass Welford variance calculation using tl.reduce is in the works
- We've enabled codegen for foreach ops, which will allow us to generate fully fused optimizers. Currently only the fully fused Adam and AdamW optimizers are supported in core; we will make this support much more flexible with torch.compile (sketch below)
- We are improving the capabilities of AOT inductor to support real models and dynamic shapes. Low-latency AOT inductor will be very important for LLM inference, and I'm glad we are making progress here.
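
A few illustrative sketches for the items above follow; all of them are hand-written approximations under stated assumptions, not inductor's actual output. First, the cudagraphs item: cudagraphs are typically enabled through the `reduce-overhead` compile mode, and with the new implementation the memory cost of doing so is much closer to the non-cudagraph run.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

# "reduce-overhead" runs the compiled graph under CUDA graphs; with the new
# cudagraphs implementation this no longer carries a large memory penalty.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(64, 1024, device="cuda")
for _ in range(3):  # the first couple of iterations warm up and record the graph
    out = compiled(x)
```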
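
For the coordinate descent max-autotuning item, a minimal sketch of turning it on; the `coordinate_descent_tuning` knob lives in `torch._inductor.config` in current builds, but treat the exact name as an assumption that may change between releases.

```python
import torch
import torch._inductor.config as inductor_config

# max-autotune benchmarks several candidate Triton configs per kernel; coordinate
# descent tuning then walks outward from the best candidate one parameter at a
# time (block sizes, num_warps, ...) to cover a wider search space.
inductor_config.coordinate_descent_tuning = True

@torch.compile(mode="max-autotune")
def f(a, b):
    return (a @ b).relu()

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
f(a, b)  # autotuning happens during this first compilation
```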
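
For the custom autograd function item, a small example of the kind of user code that used to cause a graph break at the `.apply()` call and is now captured; the function itself is made up for illustration.

```python
import torch

class ScaleAndClamp(torch.autograd.Function):
    """A user-defined autograd op of the kind dynamo used to graph-break on."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return (x * scale).clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * ctx.scale * ((x * ctx.scale) > 0), None

@torch.compile
def f(x):
    # No graph break at the .apply() call anymore.
    return ScaleAndClamp.apply(x, 2.0).sum()

x = torch.randn(8, requires_grad=True)
f(x).backward()
```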
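
For the attention pattern replacement work, the rewrite the pattern matcher is aiming for looks roughly like going from the first function below to the second; the shapes and the exact set of matched patterns are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    # The "written out" attention that shows up in HF model code and that the
    # forward-backward pattern matcher tries to recognize.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # The replacement target: the efficient implementation from core.
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
torch.testing.assert_close(
    manual_attention(q, k, v), fused_attention(q, k, v), atol=1e-2, rtol=1e-2
)
```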
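
For the matmul padding item, the transformation amounts to padding an awkwardly sized GEMM dimension up to a friendlier multiple and letting the zero padding contribute nothing to the result; the multiple-of-8 alignment below is an illustrative choice, not the exact heuristic.

```python
import torch
import torch.nn.functional as F

def padded_matmul(a, b, align=8):
    # Pad the shared K dimension up to a multiple of `align`; the zeros added to
    # both operands do not change the result, but the aligned GEMM can be faster.
    pad = (-a.shape[-1]) % align
    if pad == 0:
        return a @ b
    a_p = F.pad(a, (0, pad))        # pad columns of a
    b_p = F.pad(b, (0, 0, 0, pad))  # pad rows of b
    return a_p @ b_p

a = torch.randn(1024, 1023, device="cuda", dtype=torch.float16)
b = torch.randn(1023, 1024, device="cuda", dtype=torch.float16)
torch.testing.assert_close(padded_matmul(a, b), a @ b, atol=1e-2, rtol=1e-2)
```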
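
For the channels-last item, the automatic conversion targets the same effect as the manual workflow below, which timm users already apply by hand for convolution-heavy models.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
).cuda().half()

# Convolutions on tensor cores are generally faster in NHWC (channels-last),
# but today the user has to opt in explicitly for both weights and inputs.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(32, 3, 224, 224, device="cuda", dtype=torch.float16)
x = x.to(memory_format=torch.channels_last)

out = torch.compile(model)(x)
```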
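
For the out-of-bounds checks item, a Triton-flavored sketch of the kind of guard this refers to; the kernel is hand-written for illustration (not inductor's codegen), and `tl.device_assert` only fires when device-side assertions are enabled for the kernel.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gather_kernel(out_ptr, src_ptr, idx_ptr, n_idx, src_numel, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_idx
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    # The kind of bounds check now generated for user-visible indexing ops;
    # it is skipped for pooling/upsampling decompositions whose indices are
    # known to be in bounds, to avoid paying for it where it is not needed.
    tl.device_assert((idx >= 0) & (idx < src_numel), "index out of bounds")
    val = tl.load(src_ptr + idx, mask=mask, other=0.0)
    tl.store(out_ptr + offs, val, mask=mask)

src = torch.randn(4096, device="cuda")
idx = torch.randint(0, 4096, (1000,), device="cuda")
out = torch.empty(1000, device="cuda")
gather_kernel[(triton.cdiv(1000, 256),)](out, src, idx, 1000, src.numel(), BLOCK=256)
```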
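
For the tl.reduce item, what the new primitive buys us is a user-defined combine function, which is what makes eager-matching NaN propagation (and tie-breaking for argmin/argmax) expressible. The kernel below is an illustrative row-wise max, not what inductor actually emits.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def max_propagate_nan(a, b):
    # Combine function for tl.reduce: any NaN wins, matching eager torch.max.
    return tl.where(a != a, a, tl.where(b != b, b, tl.maximum(a, b)))

@triton.jit
def rowwise_max_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + row * n_cols + cols, mask=cols < n_cols, other=float("-inf"))
    tl.store(out_ptr + row, tl.reduce(x, 0, max_propagate_nan))

x = torch.randn(4, 256, device="cuda")
x[1, 7] = float("nan")
out = torch.empty(4, device="cuda")
rowwise_max_kernel[(4,)](x, out, x.shape[1], BLOCK=256)
print(out)                  # row 1 is NaN ...
print(x.max(dim=1).values)  # ... matching eager semantics
```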
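
Finally, for the foreach codegen item, a sketch of the user-facing direction: foreach ops already batch the per-parameter work horizontally, and compiling the optimizer step is what lets inductor fuse the whole update, so that optimizers beyond the hand-written fused Adam/AdamW can end up fully fused. Whether a given optimizer step compiles cleanly today depends on the PyTorch version.

```python
import torch

params = [torch.randn(1024, 1024, device="cuda", requires_grad=True) for _ in range(8)]
for p in params:
    p.grad = torch.randn_like(p)

# foreach=True routes the update through torch._foreach_* ops: one horizontally
# batched kernel per elementwise op instead of one kernel per parameter.
opt = torch.optim.AdamW(params, lr=1e-3, foreach=True)

@torch.compile
def step():
    # With inductor codegen for foreach ops, the whole update can be fused into
    # a handful of kernels, similar to the hand-written fused optimizers in core.
    opt.step()

step()
```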