Is there any documentation on the different CUDA kernel backends that inductor selects from (cutlass, triton, etc.)?
More specifically, for each backend I would like to understand how modules / layers / ops are mapped to concrete kernel implementations. E.g., for triton, the autotuning / heuristic selection process, JIT compilation, and stitching of the generated kernel back into the graph. For cutlass, the heuristics used for templated kernel generation.
Thanks!