I am trying to figure out how Inductor handles the matmul operation, so I simply tested the `torch.addmm(c, a, b)` operator to see what happens. Here are several questions:
The default setting of the flag `max_autotune` is False, which generates an `extern_kernels.addmm(arg0_1, arg1_1, arg2_1, alpha=1, beta=1, out=buf0)` kernel. What is actually being called by this function? Is it a cuBLAS op?
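For reference, the generated `extern_kernels.addmm` call appears to mirror the eager `torch.addmm` signature. A small CPU sketch of what that call computes (the shapes here are made up; the original run used CUDA tensors of roughly (131072, 147) @ (147, 64) with a bias of length 64):

```python
import torch

# Toy CPU shapes standing in for the post's CUDA tensors.
a = torch.randn(8, 4)
b = torch.randn(4, 3)
c = torch.randn(3)          # bias, broadcast over the 8 rows

# Same argument pattern as the generated extern_kernels.addmm call.
buf0 = torch.empty(8, 3)
torch.addmm(c, a, b, alpha=1, beta=1, out=buf0)

# addmm computes beta * c + alpha * (a @ b); with alpha=beta=1 this is
# simply a matmul with a fused bias add.
reference = c + a @ b
```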
When I set `max_autotune=True`, I noticed that it was autotuned and the results were ranked:
```
AUTOTUNE addmm(131072x64, 131072x147, 147x64)
  triton_mm_10 0.1935s 100.0%
  triton_mm_0  0.2028s  95.5%
  triton_mm_11 0.2222s  87.1%
  addmm        0.2222s  87.1%
  triton_mm_4  0.2345s  82.5%
  triton_mm_2  0.2345s  82.5%
  triton_mm_5  0.3092s  62.6%
  triton_mm_6  0.3123s  62.0%
  triton_mm_8  0.3318s  58.3%
  triton_mm_1  0.3820s  50.7%
```
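For completeness, these are the two ways I know of to flip this flag (this is my understanding of the current Inductor API, so treat it as a sketch):

```python
import torch
import torch._inductor.config as inductor_config

def f(c, a, b):
    return torch.addmm(c, a, b)

# Option 1: per-call, via the compile mode.
compiled = torch.compile(f, mode="max-autotune")

# Option 2: globally, via the Inductor config flag (default is False).
inductor_config.max_autotune = True
```

Note that `torch.compile` is lazy: no autotuning happens until `compiled` is actually called with inputs.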
Sub-question one: is the `addmm` candidate here the same as the `extern_kernels.addmm` kernel generated in the default setting?
Sub-question two: I found that `triton_mm_#` comes from a 12-candidate search space, and based on my observations, every candidate runs the entire Triton GEMM program with the full-size input to obtain its real performance. Compared to TVM's strategy of constructing a much larger search space, this approach appears to be much lighter. However, I have a concern: could this potentially lead to missed optimization opportunities?
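To make the comparison concrete, the ranking above is conceptually just "time each candidate on the real input and sort". A toy model of that profiling loop (this is not Inductor's actual code; the candidates here are made-up stand-ins for the `triton_mm_#` configs):

```python
import time
import torch

def benchmark(fn, *args, iters: int = 20) -> float:
    """Average wall-clock time of fn(*args) after one warm-up run."""
    fn(*args)                          # warm-up / compile
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Stand-ins for the candidate kernels: different ways to compute a @ b.
candidates = {
    "mm":     lambda a, b: torch.mm(a, b),
    "einsum": lambda a, b: torch.einsum("ik,kj->ij", a, b),
    "matmul": lambda a, b: a @ b,
}

a, b = torch.randn(64, 32), torch.randn(32, 16)

# Run every candidate on the real (full-size) input and rank by time.
ranking = sorted((benchmark(f, a, b), name) for name, f in candidates.items())
best_time, best_name = ranking[0]
```

The key design point mirrored here is that ranking uses measured runtime on the actual input, not a cost model over a large search space.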