Does `inductor_cudagraphs` mean that it generates CUDA graphs with kernels generated by Triton?
Correct. Triton has high CPU overheads, so cudagraphs helps a lot and is needed. There is an upstream fix coming in Triton that allows AOT kernel generation and will make cudagraphs less important.
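A minimal sketch of how to try this, assuming a recent PyTorch release where the cudagraphs path is exposed through `torch.compile`'s "reduce-overhead" mode (the exact entry point has changed over time):

```python
import torch

def f(x, y):
    return torch.relu(x @ y) + y

# "reduce-overhead" wraps the Triton kernels that TorchInductor
# generates in CUDA graphs to hide the per-kernel CPU launch cost.
compiled_f = torch.compile(f, mode="reduce-overhead")

x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
out = compiled_f(x, y)
```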
How much does TorchInductor generate kernels “from scratch” with triton/etc. vs reusing the existing Torch kernels?
TorchInductor generates nearly all of its kernels automatically from scratch based on its IR.
The two exceptions are matmul/conv where it has a template with auto-generated epilogue fusions. In the current numbers these are disabled by config, and we are just using aten. I’m expecting another 10-20% speedup from enabling this.
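As a hedged illustration (mode and flag names have changed across versions), in recent releases the templated matmul/conv codegen with epilogue fusion is opt-in via the autotune mode, roughly:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).cuda()

# "max-autotune" asks TorchInductor to benchmark Triton matmul/conv
# templates (with fused epilogues) against the aten kernels and pick
# the fastest; the default mode keeps the aten fallback.
compiled_model = torch.compile(model, mode="max-autotune")
out = compiled_model(torch.randn(64, 1024, device="cuda"))
```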
There is also a small list of kernels we haven’t implemented yet, where we use aten fallbacks: TorchInductor missing ops tracker · Issue #93757 · pytorch/pytorch · GitHub. Eventually, though, we want to codegen everything.
Can you talk about how op fusion works? E.g. can your `inner_fn` in the post be automatically fused with other “Pointwise” ops or even used as a fused activation function?
Yes, it will be automatically fused. Pointwise ops can be fused with: other pointwise ops; reduction ops; and matmul/conv templates. It also supports fusing multiple reductions/broadcasts together.
The key functions here are `can_fuse`, which tests if two nodes can be fused together, and `score_fusion`, which gives a priority that controls the order in which fusions happen. Since some fusions can block other fusions, order matters.
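A simplified sketch of what a priority-ordered fusion pass looks like (`can_fuse` and `score_fusion` are the real function names mentioned above; the loop itself is an illustration, not the actual TorchInductor scheduler):

```python
import itertools

def greedy_fuse(nodes, can_fuse, score_fusion, fuse):
    """Greedy, priority-ordered fusion pass (simplified illustration).

    can_fuse(a, b)     -> bool: is fusing these two nodes legal?
    score_fusion(a, b) -> sortable score: how profitable is the fusion?
    fuse(a, b)         -> the new, combined node.

    Because one fusion can make another fusion illegal, the highest-
    scoring legal candidate is applied first, then candidates are
    recomputed against the updated node set.
    """
    nodes = list(nodes)
    while True:
        candidates = [
            (a, b)
            for a, b in itertools.combinations(nodes, 2)
            if can_fuse(a, b)
        ]
        if not candidates:
            return nodes
        # Pick the most profitable fusion; order matters since some
        # fusions block others.
        a, b = max(candidates, key=lambda pair: score_fusion(*pair))
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(fuse(a, b))
```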
In `inner_fn` – where is `size1` defined?
There is a per-graph database of symbolic size variables defined in terms of the shapes of the inputs. This is handled in sizevars.py and uses sympy. For clarity, it is basically just:
size1 = sympy.Symbol("size1")
The symbol names are all allocated based on the inputs to the graph, so `size1` might be `input[0].size(2)`.
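A toy sketch of that idea (not the real sizevars.py code, just the basic mechanism of handing out per-dimension symbols):

```python
import sympy
import torch

class ToySizeVars:
    """Toy per-graph database of symbolic size variables.

    Each dimension of each graph input is assigned a sympy Symbol
    (size0, size1, ...); those symbols then appear in the index
    arithmetic of the generated kernels.
    """

    def __init__(self):
        self.symbols = []   # all allocated symbols, in order
        self.origin = {}    # symbol -> (input index, dim) it came from

    def allocate(self, graph_inputs):
        for i, tensor in enumerate(graph_inputs):
            for dim in range(tensor.dim()):
                sym = sympy.Symbol(f"size{len(self.symbols)}")
                self.symbols.append(sym)
                self.origin[sym] = (i, dim)
        return self.symbols

# In this toy version size1 maps to graph_inputs[0].size(1); the real
# allocator's numbering depends on how the graph uses the inputs.
sizevars = ToySizeVars()
print(sizevars.allocate([torch.randn(8, 16, 32)]))   # [size0, size1, size2]
```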
When you say that TorchInductor “is able to represent aliasing and mutation” – what does that mean?
What I’ve found is that in practice most backends need to go through a purely functional form and then rely on a buffer allocation pass to make optimal decisions about whether things should reuse buffers or be views/etc. (the way the user wrote it is not necessarily optimal).
TorchInductor is “mostly functional,” but not purely functional. There isn’t a good way to represent scatter operations (which show up a lot in backwards) functionally while maintaining good performance. It is really easy to turn O(n) stuff into O(n^2) by trying to functionalize a chain of scatters that only mutate a small fraction of the elements of a tensor. There is also stuff like input mutation, where you don’t control the storage being mutated. The IR directly supports mutation and scatter, though we do make use of dispatcher-level functionalization.
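To make the complexity argument concrete, here is a minimal sketch (not TorchInductor code) of why naively functionalizing a chain of small scatters blows up:

```python
import torch

n = 1024
indices = [torch.tensor([i]) for i in range(n)]
values = [torch.tensor([float(i)]) for i in range(n)]

# Mutating form: each scatter_ touches a single element, so the whole
# chain is O(n) work.
buf = torch.zeros(n)
for idx, val in zip(indices, values):
    buf.scatter_(0, idx, val)

# Naively functionalized form: every out-of-place scatter materializes
# a fresh copy of the full tensor, so n scatters become O(n^2) work and
# memory traffic unless a later pass turns the copies back into mutation.
out = torch.zeros(n)
for idx, val in zip(indices, values):
    out = torch.scatter(out, 0, idx, val)

assert torch.equal(buf, out)
```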