Questions about Inductor code generation for GEMM on NVIDIA devices

I am trying to figure out how Inductor deals with the matmul operation, so I simply tested the torch.addmm(c, a, b) operator to see what happens. Here are several questions:
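Something like the following is what I mean (the dtype and exact tensor values are just an illustration; the shapes match the autotune log below):

```python
import torch
import torch._dynamo
import torch._inductor.config as inductor_config

# Shapes match the autotune log below; float16 is an illustrative choice.
a = torch.randn(131072, 147, device="cuda", dtype=torch.float16)
b = torch.randn(147, 64, device="cuda", dtype=torch.float16)
c = torch.randn(131072, 64, device="cuda", dtype=torch.float16)

def fn(c, a, b):
    return torch.addmm(c, a, b)

# Default path (max_autotune=False): the generated wrapper calls extern_kernels.addmm(...).
# Running with TORCH_LOGS=output_code prints the generated code.
compiled = torch.compile(fn)
compiled(c, a, b)

# With max_autotune=True, Inductor benchmarks its Triton mm templates against the extern kernel.
torch._dynamo.reset()  # clear the previous compilation before changing the config
inductor_config.max_autotune = True
compiled_tuned = torch.compile(fn)
compiled_tuned(c, a, b)
```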

  1. The default setting of the max_autotune flag is False, which generates an extern_kernels.addmm(arg0_1, arg1_1, arg2_1, alpha=1, beta=1, out=buf0) call. What is actually being called by this function? Is it a cuBLAS op?

  2. When I set max_autotune=True, I noticed that the op was autotuned and the results were ranked:

AUTOTUNE addmm(131072x64, 131072x147, 147x64)
  triton_mm_10 0.1935s 100.0%
  triton_mm_0 0.2028s 95.5%
  triton_mm_11 0.2222s 87.1%
  addmm 0.2222s 87.1%
  triton_mm_4 0.2345s 82.5%
  triton_mm_2 0.2345s 82.5%
  triton_mm_5 0.3092s 62.6%
  triton_mm_6 0.3123s 62.0%
  triton_mm_8 0.3318s 58.3%
  triton_mm_1 0.3820s 50.7%

Sub-question one: Is the addmm here the same as extern_kernels.addmm(...)?

Sub-question two: I found that triton_mm_# is a 12-candidate search space, and from my observations all candidates run the full Triton GEMM program on the full-size inputs to measure real performance. Compared with TVM's strategy for constructing a search space, this approach appears to be much lighter. However, I have a concern: could this potentially lead to missed optimization opportunities?

Sub-question one: Is the addmm here the same as extern_kernels.addmm(...)?

Yes

Sub-question two: I found that triton_mm_# is a 12-candidate search space, and from my observations all candidates run the full Triton GEMM program on the full-size inputs to measure real performance. Compared with TVM's strategy for constructing a search space, this approach appears to be much lighter. However, I have a concern: could this potentially lead to missed optimization opportunities?

Correct, it is only searching 12 different Triton configs, which can be found here:
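To give a feel for the shape of that list, each candidate is a Triton config combining tile sizes, pipeline stages, and warp count, roughly like this (values here are illustrative, not the exact list shipped in Inductor):

```python
import triton

# Illustrative Inductor-style mm candidates (made-up values, for illustration only).
# Each one is compiled and benchmarked on the real input sizes, alongside ATen addmm.
mm_configs = [
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}, num_stages=2, num_warps=4),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_stages=3, num_warps=4),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32}, num_stages=3, num_warps=8),
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 64}, num_stages=4, num_warps=8),
    # ... roughly a dozen entries in total.
]
```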

One of the big advantages of Triton versus TVM is that Triton is usually able to beat the performance of TVM with an order-of-magnitude smaller search space. Searching a larger search space isn't inherently good; it is only good if it lets you find a faster configuration. If you search a bigger search space and end up slower, you have just wasted a lot of time.

Note that the upstream Triton matmul uses a larger search space than Inductor does, partly because it supports SPLIT_K while Inductor's template does not. Especially for large-K and memory-bound matmuls, there are likely speedups to be had in Inductor by expanding the search space.
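For context on split-K: the K (reduction) dimension is partitioned across SPLIT_K programs that each compute a partial sum, and the partial results are then combined (e.g. with atomics or a second reduction pass). In the upstream Triton matmul this shows up as one extra meta-parameter in the autotuned configs, roughly like this (illustrative values):

```python
import triton

# Split-K style config: SPLIT_K programs each reduce a slice of K.
# Values are illustrative, not copied from the upstream Triton matmul.
split_k_config = triton.Config(
    {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32, "SPLIT_K": 4},
    num_stages=2,
    num_warps=4,
)
```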


Thanks for your reply!
I understand your point that Triton can usually achieve good performance with a much smaller search space than TVM, but I'm unsure how that conclusion was reached. I haven't found any charts in the Triton paper or on triton-lang.org that demonstrate the relationship between search space size and performance. Have I missed something?

More generally, I am curious how the Inductor team made this decision during the design phase: to construct a search space of 12 Triton candidates plus the ATen native operator, and to be confident that this performance would meet the requirements. I also wonder whether the team plans to do more work on optimizing matrix multiplication performance in the future.

The configs were initially copied from the upstream Triton matmul. We started with the ones labeled # basic configs for compute-bound matmuls. There was then a small amount of manual tweaking based on benchmarking. The goal of those configs wasn’t to cover every use case, mainly just the more important compute bound ones. I still think there is room to improve that search space, and we will likely look into it more in the future. Contributions are also welcome there!

The best way to get a sense of performance is to try things out firsthand. I've been super impressed with Triton from the first time I tried it. You write extremely simple code (our matmul is ~40 lines) with a tiny search space and get awesome performance. In your example above, triton_mm_10 is 13% faster than cuBLAS (addmm).
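To make the "~40 lines" point concrete, here is a rough sketch of a bare-bones Triton matmul kernel in the style of the Triton tutorial (this is not Inductor's actual template, and it omits autotuning and epilogue fusion):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16), mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def matmul(a, b):
    # Assumes float16 inputs on CUDA; block sizes are fixed here instead of autotuned.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```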

In my experience TVM is not close to state of the art on modern GPUs, though your experience will definitely vary based on your application. There have been many cases where I let it autotune for many days (compared to seconds for Triton), and it ended up being slower than eager PyTorch. There is a TVM backend for torch.compile, so you can always try it out for yourself.
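If you want to run that comparison yourself, switching backends is a one-line change (this assumes TVM and its Python bindings are installed; torch._dynamo.list_backends() shows what is registered in your build):

```python
import torch

def fn(c, a, b):
    return torch.addmm(c, a, b)

# Same code, two backends: the default Inductor backend vs. the TVM backend.
compiled_inductor = torch.compile(fn)
compiled_tvm = torch.compile(fn, backend="tvm")  # requires a working TVM install
```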

Aha, thanks. I think the key is to enable Inductor to achieve fast performance within an acceptable search time; it is not wise to sacrifice a significant amount of compilation time just to achieve somewhat better performance.