Is there any documentation on the different CUDA kernel backends that Inductor selects from (CUTLASS, Triton, etc.)?
More specifically, for each backend I would like to understand how modules / layers / ops are mapped to concrete kernel implementations. E.g., for Triton: the autotuning / heuristic selection process, JIT compilation, and stitching of the generated kernel back into the graph. For CUTLASS: the heuristics used for templated kernel generation.
Thanks!