Does `inductor_cudagraphs` mean that it generates CUDA graphs with kernels generated by Triton?
Correct. Triton has high CPU overheads, so cudagraphs helps a lot and is needed. There is an upstream fix coming in Triton that allows AOT kernel generation and will make cudagraphs less important.
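A minimal sketch of how to try this, assuming a recent PyTorch release where the cudagraphs path is exposed through `torch.compile`'s "reduce-overhead" mode (the exact entry point has changed over time):

```python
import torch

def f(x, y):
    return torch.relu(x @ y) + y

# "reduce-overhead" wraps the Triton kernels that TorchInductor
# generates in CUDA graphs to hide the per-kernel CPU launch cost.
compiled_f = torch.compile(f, mode="reduce-overhead")

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
out = compiled_f(x, y)
```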
How much does TorchInductor generate kernels “from scratch” with triton/etc. vs reusing the existing Torch kernels?
TorchInductor generates nearly all of its kernels automatically from scratch based on its IR.
The two exceptions are matmul/conv where it has a template with auto-generated epilogue fusions. In the current numbers these are disabled by config, and we are just using aten. I’m expecting another 10-20% speedup from enabling this.
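As a hedged illustration (mode and flag names have changed across versions), in recent releases the templated matmul/conv codegen with epilogue fusion is opt-in via the autotune mode, roughly:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).cuda()

# "max-autotune" asks TorchInductor to benchmark Triton matmul/conv
# templates (with fused epilogues) against the aten kernels and pick
# the fastest; the default mode keeps the aten fallback.
compiled_model = torch.compile(model, mode="max-autotune")
out = compiled_model(torch.randn(64, 1024, device="cuda"))
```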
There is also a small list of kernels we haven’t implemented yet, where we use aten fallbacks: TorchInductor missing ops tracker · Issue #93757 · pytorch/pytorch · GitHub. Eventually, though, we want to codegen everything.
Can you talk about how op fusion works? E.g. can your `inner_fn` in the post be automatically fused with other “Pointwise” ops or even used as a fused activation function?
Yes, it will be automatically fused. Pointwise ops can be fused with: other pointwise ops; reduction ops; and matmul/conv templates. It also supports fusing multiple reductions/broadcasts together.
The key functions here are `can_fuse`, which tests if two nodes can be fused together, and `score_fusion`, which gives a priority that controls the order in which fusions happen. Since some fusions can block other fusions, order matters.
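A simplified sketch of what a priority-ordered fusion pass looks like (`can_fuse` and `score_fusion` are the real function names mentioned above; the loop itself is an illustration, not the actual TorchInductor scheduler):

```python
import itertools

def greedy_fuse(nodes, can_fuse, score_fusion, fuse):
    """Greedy, priority-ordered fusion pass (simplified illustration).

    can_fuse(a, b)     -> bool: is fusing these two nodes legal?
    score_fusion(a, b) -> sortable score: how profitable is the fusion?
    fuse(a, b)         -> the new, combined node.

    Because one fusion can make another fusion illegal, the highest-
    scoring legal candidate is applied first, then candidates are
    recomputed against the updated node set.
    """
    nodes = list(nodes)
    while True:
        candidates = [
            (a, b)
            for a, b in itertools.combinations(nodes, 2)
            if can_fuse(a, b)
        ]
        if not candidates:
            return nodes
        # Pick the most profitable fusion; order matters since some
        # fusions block others.
        a, b = max(candidates, key=lambda pair: score_fusion(*pair))
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(fuse(a, b))
```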
In `inner_fn` – where is `size1` defined?
There is a per-graph database of symbolic size variables defined in terms of the shapes of the inputs. This is handled in sizevars.py and uses sympy. For clarity, it is basically just:
size1 = sympy.Symbol("size1")
The symbol names are all allocated based on the inputs to the graph, so `size1` might be `input[0].size(2)`.
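A toy sketch of that idea (not the real sizevars.py code, just the basic mechanism of handing out per-dimension symbols):

```python
import sympy
import torch

class ToySizeVars:
    """Toy per-graph database of symbolic size variables.

    Each dimension of each graph input is assigned a sympy Symbol
    (size0, size1, ...); those symbols then appear in the index
    arithmetic of the generated kernels.
    """

    def __init__(self):
        self.symbols = []   # all allocated symbols, in order
        self.origin = {}    # symbol -> (input index, dim) it came from

    def allocate(self, graph_inputs):
        for i, tensor in enumerate(graph_inputs):
            for dim in range(tensor.dim()):
                sym = sympy.Symbol(f"size{len(self.symbols)}")
                self.symbols.append(sym)
                self.origin[sym] = (i, dim)
        return self.symbols

# In this toy version size1 maps to graph_inputs[0].size(1); the real
# allocator's numbering depends on how the graph uses the inputs.
sizevars = ToySizeVars()
print(sizevars.allocate([torch.randn(8, 16, 32)]))   # [size0, size1, size2]
```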
When you say that TorchInductor “is able to represent aliasing and mutation” – what does that mean?
What I’ve found is that in practice most backends need to go through a purely functional form and then rely on a buffer allocation pass to make optimal decisions about whether things should reuse buffers or be views/etc. (the way the user wrote it is not necessarily optimal).
TorchInductor is “mostly functional,” but not purely functional. There isn’t a good way to represent scatter operations (which show up a lot in backwards) functionally while maintaining good performance. It is really easy to turn O(n) stuff into O(n^2) by trying to functionalize a chain of scatters that only mutate a small fraction of the elements of a tensor. There is also stuff like input mutation, where you don’t control the storage being mutated. The IR directly supports mutation and scatter, though we do make use of dispatcher-level functionalization.
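To make the complexity argument concrete, here is a minimal sketch (not TorchInductor code) of why naively functionalizing a chain of small scatters blows up:

```python
import torch

n = 1024
indices = [torch.tensor([i]) for i in range(n)]
values = [torch.tensor([float(i)]) for i in range(n)]

# Mutating form: each scatter_ touches a single element, so the whole
# chain is O(n) work.
buf = torch.zeros(n)
for idx, val in zip(indices, values):
    buf.scatter_(0, idx, val)

# Naively functionalized form: every out-of-place scatter materializes
# a fresh copy of the full tensor, so n scatters become O(n^2) work and
# memory traffic unless a later pass turns the copies back into mutation.
out = torch.zeros(n)
for idx, val in zip(indices, values):
    out = torch.scatter(out, 0, idx, val)

assert torch.equal(buf, out)
```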