RFC: Polyhedral Optimization Pass for PyTorch Inductor
This RFC proposes adding an optional polyhedral optimization pass to PyTorch Inductor to enable fusion of operation sequences that the current fusion heuristics cannot handle. The initial test case demonstrates a ~1.4x speedup on the RMSNorm + chunking + gating pattern (found in SwiGLU) used in architectures like Llama.
Motivation
PyTorch Inductor currently relies on pattern matching and heuristic-based fusion strategies. While effective for many cases, these approaches fail to fuse certain operation sequences where fusion is mathematically valid and beneficial. For a case study demonstrating the speedup on SwiGLU, see "[Example] Benefits of Polyhedral Optimization" (morrison-turnansky/pytorch, Pull Request #3) on GitHub.
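For concreteness, the target pattern can be sketched in plain PyTorch. The function name and shapes below are illustrative only; the actual benchmark lives in the linked PR.

```python
import torch
import torch.nn.functional as F

def rmsnorm_swiglu(x, weight, proj, eps=1e-6):
    """Illustrative sketch of the RMSNorm + chunking + gating pattern."""
    # RMSNorm: scale by the reciprocal root-mean-square over the last dim.
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight
    # Project, then chunk the result into gate and value halves (SwiGLU).
    gate, value = proj(x).chunk(2, dim=-1)
    # Gating: SiLU(gate) elementwise-multiplied with the value half.
    return F.silu(gate) * value
```

Current fusion heuristics break this sequence into multiple kernels; the proposed pass aims to emit it as one.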
High-Level Design
Add a new opt-in compilation pass that users can enable via compilation flags. The pass would target Inductor's loop-level IR and apply a subset of polyhedral optimization specifically tailored to tensor workflows.
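Intended usage might look like the sketch below. The flag name is hypothetical; it does not exist today and would be settled during review.

```python
import torch

# Hypothetical opt-in flag (illustrative only; not an existing config option):
# torch._inductor.config.polyhedral_fusion = True

@torch.compile
def gated(x):
    # Stand-in workload; the pass would apply to patterns like SwiGLU.
    gate, value = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * value
```

With the flag left at its default (off), compilation behaves exactly as it does today.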
Scope Constraints (Initial Release)
- Inference Only: Training support is deferred to avoid the complexity of the backward pass.
  - These techniques can ultimately be extended to support both training and inference, but we want to defer training-specific complexities such as synchronization issues.
- Subset of Polyhedral: Use a simplified form of polyhedral analysis focused on identifying fusion opportunities, thereby minimizing compilation overhead.
  - We will focus on identifying fusion opportunities via loop-level dependency analysis. Initially, we will follow a general heuristic that fusion will be profitable (see Testing Strategy).
  - We can later expand this to use the cost functions standard in polyhedral analysis to rigorously determine whether fusion will be profitable.
  - Refer to the references for an example of lightweight polyhedral analysis.
- Opt-In: Disabled by default to ensure zero impact on existing workflows.
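As a toy model of the loop-level dependency analysis described above (not Inductor's actual IR or API), a fusion-legality check can be sketched as: two loop nests of the same rank may fuse when every read of the producer's output occurs at dependence distance zero, i.e. there is no loop-carried dependence.

```python
def can_fuse(producer_rank, consumer_rank, read_offsets):
    """Toy fusion-legality check over per-dimension access offsets.

    read_offsets: for each read of the producer's output in the consumer,
    a tuple of index offsets per loop dimension (e.g. (0, 1) means the
    consumer at (i, j) reads the producer's output at (i, j + 1)).
    """
    # Iteration spaces must have the same rank to align the loop nests.
    if producer_rank != consumer_rank:
        return False
    # All reads at distance zero => no loop-carried dependence => fusible.
    zero = (0,) * consumer_rank
    return all(offset == zero for offset in read_offsets)

# An element-wise consumer reading the producer at the same index is fusible;
# a stencil-like shifted read introduces a loop-carried dependence and is not.
elementwise_ok = can_fuse(2, 2, [(0, 0)])
stencil_blocked = can_fuse(2, 2, [(0, 1)])
```

The real pass would operate on Inductor's loop-level IR rather than explicit offset tuples, but the legality question it answers is the same.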
Dynamic vs Static Shapes
A primary benefit of polyhedral analysis is that it handles dynamic shapes by design: loop-level dependency analysis depends on the rank of a tensor, not its specific shape. The end goal is full dynamic-shape support.
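Because the analysis is rank-based, the same fused kernel should serve any concrete size along a dynamic dimension. A sketch using PyTorch's existing dynamic-shape support:

```python
import torch

# dynamic=True asks the compiler to generalize over symbolic sizes; the
# rank-based dependency analysis is unaffected by the concrete shape.
@torch.compile(dynamic=True)
def gated_dynamic(x):
    gate, value = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * value
```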
Breaking Changes
None. This is an additive feature:
- Default behavior is unchanged
- Requires explicit opt-in via a compilation flag
- Falls back to standard Inductor fusion when polyhedral analysis is unavailable or unprofitable
Testing Strategy
There are two areas to test at the function level:
- Fusion: does our pass result in the generated kernel being fused?
  - This can be verified by a test similar to test_fusion_codegen in the example, which follows the pattern of the tests in pytorch/test/inductor/test_loop_ordering.py.
- Numerical Accuracy
  - This should be treated like the existing eager/Inductor comparison tests, with eager mode as the source of truth.

Beyond function-level tests:
- End-to-End Performance Analysis
  - Verify that a set of desired/common models actually sees a performance increase, and monitor for performance regressions.

We propose that for each graph where we expect the polyhedral pass to make substantial changes (e.g., SwiGLU, among others), we add a test verifying that fusion occurred and a numerical-accuracy test.
References
- Simplified polyhedral form, a precedent for applying a subset of polyhedral optimizations to tensor workflows: https://mlir.llvm.org/docs/Rationale/RationaleSimplifiedPolyhedralForm/
- High-level background on polyhedral optimization and a discussion of its wide usage in compilers: http://polyhedral.info/
- A detailed discussion of polyhedral compilation: https://pliss2019.github.io/albert_cohen_slides.pdf