Question regarding horizontal fusion

I am trying to get my head around Inductor. From inductor/fx_passes/ and inductor/codegen it seems like there is support for both vertical and horizontal fusion. By vertical fusion, I assume what is meant is kernel fusion where sequential (producer/consumer) ops are fused together.

For horizontal fusion, if I understand the design correctly: when you have op(A, B) and op(A, C) and can_fuse_horizontal=True, you can do something like g(op(A, f(B, C))), where f and g are op-specific transforms that let you fuse the two ops into one and then split the result. Is this correct? Or am I misunderstanding?
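To make the idea I am asking about concrete, here is a minimal pure-Python sketch. It assumes op is a matmul, f concatenates B and C column-wise, and g splits the result back apart. The helper names (matmul, concat_cols, split_cols) are hypothetical and only illustrate my mental model; this is not taken from Inductor.

```python
def matmul(x, y):
    """Naive matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def concat_cols(b, c):
    """f: place B and C side by side so a single matmul covers both."""
    return [row_b + row_c for row_b, row_c in zip(b, c)]


def split_cols(m, width):
    """g: split the fused result back into the two original outputs."""
    return [row[:width] for row in m], [row[width:] for row in m]


A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]
C = [[2], [3]]

# One matmul over the concatenated operand instead of two separate matmuls.
fused_ab, fused_ac = split_cols(matmul(A, concat_cols(B, C)), width=len(B[0]))

assert fused_ab == matmul(A, B)
assert fused_ac == matmul(A, C)
```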

And from the code it seems like CUDA and CUTLASS are not yet supported and only Triton is?

So if the codegen defaults to using the CUDA ops then horizontal fusion won’t be possible (for now), but if it chooses Triton then it potentially might be used? Is my understanding correct?

Both horizontal and vertical fusion are supported for pointwise and reduction ops when the shapes are compatible. This is true for the Triton, C++, Halide, and Metal backends on all device types. The fusion is done in the Inductor scheduler.
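As a rough illustration of the two kinds of fusion for pointwise ops, here is a pure-Python sketch. It only models the loop structure: the function names are made up for this example, and the scheduler's real decisions depend on shapes and the profitability heuristics. Vertical fusion merges a producer with its consumer so the intermediate buffer is never materialized; horizontal fusion merges independent ops that read the same input so that input is loaded once.

```python
import math


def unfused(a, b):
    # Three separate "kernels": each loop reads its inputs and writes a full buffer.
    tmp = [x + y for x, y in zip(a, b)]
    out1 = [math.sqrt(t) for t in tmp]  # consumer of tmp
    out2 = [x * 2.0 for x in a]         # independent of tmp, but also reads a
    return out1, out2


def fused(a, b):
    # A single "kernel": one pass over the data.
    out1, out2 = [], []
    for x, y in zip(a, b):
        # Vertical fusion: the add feeds sqrt directly, tmp is never stored.
        out1.append(math.sqrt(x + y))
        # Horizontal fusion: an independent op reuses the already-loaded x.
        out2.append(x * 2.0)
    return out1, out2


a = [1.0, 4.0, 9.0]
b = [3.0, 5.0, 7.0]
assert fused(a, b) == unfused(a, b)
```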

There are heuristics to decide when these fusions are profitable, some of which can be controlled via configs such as:

and:

and many more (search that file for "fusion").

For matmul/conv/etc. we support epilogue fusion when in max-autotune mode and a Triton/CUTLASS template is selected. There is also prologue fusion in some quantization cases.
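For readers unfamiliar with the term, epilogue fusion means applying the pointwise ops that follow a matmul to each output element before it is stored, instead of in a second pass. The pure-Python sketch below shows only that idea, assuming a bias add followed by ReLU as the epilogue; the function names are hypothetical and it says nothing about how the Triton/CUTLASS templates are actually written.

```python
def matmul_then_bias_relu(x, w, bias):
    # Unfused: the matmul writes its full output, then a second pass re-reads it.
    y = [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]
    return [[max(v + bias[j], 0.0) for j, v in enumerate(row)] for row in y]


def matmul_with_epilogue(x, w, bias):
    # Fused: the epilogue runs on each accumulator right before the store.
    out = []
    for row in x:
        out_row = []
        for j, col in enumerate(zip(*w)):
            acc = sum(a * b for a, b in zip(row, col))
            out_row.append(max(acc + bias[j], 0.0))
        out.append(out_row)
    return out


x = [[1.0, -2.0], [3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, 0.5]
assert matmul_with_epilogue(x, w, bias) == matmul_then_bias_relu(x, w, bias)
```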

Thank you, this answers all my questions.

Out of curiosity I also searched for the roadmap and if I’m understanding correctly, horizontal fusion of matmuls is a pending item. Very cool.

KR3.2 Explicit API for horizontal fusion, foreach_map, that includes grouped gemm. SOTA perf on grouped linear MOE.