User-defined Kernels vs. `torch.library` custom op

If I have a user-defined triton kernel as part of a torch.nn.Module, what is the difference between:

  1. Registering the user-defined triton kernel using the torch.library API and compiling the module
  2. Calling torch.compile on the module and relying on the inductor backend to inline the user-defined kernel?
  • What performance differences should be expected between the two approaches? Are there guidelines for when to use one API over the other?
  • Will registering a custom op overcome the current limitations of user-defined kernels described in the official tutorial? Specifically, one often needs to define heuristics after autotuning.
  • More generally, are there sharp edges to watch out for when compiling a function or module that includes a user-defined kernel, which can result in worse performance than running it uncompiled?

@oulgen @zou3519

The rule of thumb for picking between raw triton kernels and custom ops is generally about composability requirements. If your code needs to compose with DTensor or other tensor subclasses, use custom ops. Otherwise, use raw triton kernels. The raw version provides a simpler user experience: there is no registration step, and code that works without compile works with compile.

In terms of performance, dynamo/inductor have optimization passes that optimize the code around the triton kernel, so you'll likely see equal or better performance with the raw version.