I am trying to figure out how Inductor handles the matmul operation, so I ran a simple test of the torch.addmm(c, a, b) operator to see what happens. Here are several questions:
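For context, here is roughly how I run the test (a minimal sketch; the shapes come from the AUTOTUNE line further down, and the exact tensor values don't matter for the codegen question):

```python
import torch

def fn(c, a, b):
    return torch.addmm(c, a, b)

# Default Inductor backend; max_autotune is False unless changed.
compiled = torch.compile(fn)

c = torch.randn(131072, 64, device="cuda")
a = torch.randn(131072, 147, device="cuda")
b = torch.randn(147, 64, device="cuda")
out = compiled(c, a, b)

# On recent PyTorch versions, running with TORCH_LOGS="output_code" (or
# TORCH_COMPILE_DEBUG=1) prints the generated wrapper code, which is where
# the extern_kernels.addmm(...) call below shows up.
```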
- The default setting of the `max_autotune` flag is False, which generates an `extern_kernels.addmm(arg0_1, arg1_1, arg2_1, alpha=1, beta=1, out=buf0)` call in the output code. What is actually being called by this function? Is it a cuBLAS op?
- When I set `max_autotune=True`, I noticed that the op was autotuned and the results were ranked (a sketch of how I enable this follows the ranking below):
AUTOTUNE addmm(131072x64, 131072x147, 147x64)
triton_mm_10 0.1935s 100.0%
triton_mm_0 0.2028s 95.5%
triton_mm_11 0.2222s 87.1%
addmm 0.2222s 87.1%
triton_mm_4 0.2345s 82.5%
triton_mm_2 0.2345s 82.5%
triton_mm_5 0.3092s 62.6%
triton_mm_6 0.3123s 62.0%
triton_mm_8 0.3318s 58.3%
triton_mm_1 0.3820s 50.7%
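For reference, this is roughly how I enable autotuning (assuming the torch._inductor.config flag; the "max-autotune" compile mode also turns this flag on, possibly along with other options):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True  # the flag discussed above
compiled = torch.compile(lambda c, a, b: torch.addmm(c, a, b))

# or, on recent versions:
# compiled = torch.compile(fn, mode="max-autotune")
```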
Sub-question one: Is the `addmm` entry here the same as `extern_kernels.addmm(...)`?
Sub-question two: I found that `triton_mm_#` comes from a 12-candidate search space, and from my observations every candidate runs the entire Triton GEMM program on the full-size inputs to measure real performance. Compared with TVM's strategy of constructing a much larger search space, this approach appears to be much lighter. However, I have a concern: could it potentially miss optimization opportunities?