User-defined Kernels vs. `torch.library` custom op

If I have a user-defined triton kernel as part of a torch.nn.Module, what is the difference between:

  1. Registering the user-defined triton kernel using the torch.library API and compiling the module
    vs.
  2. Calling torch.compile on the module and relying on the inductor backend to inline the user-defined kernel?
  • What performance differences should be expected between the two approaches? Are there guidelines for when to use one API over the other?
  • Will registering a custom op overcome the current limitations of user-defined kernels per the official tutorial? Specifically, one often needs to define heuristics after autotuning.
  • More generally, are there sharp bits to watch out for when compiling a function / module which includes a user-defined kernel that can result in worse performance than calling the function / module without compilation?

@oulgen @zou3519

The rule of thumb for picking between raw triton kernels and custom ops is generally about composability requirements. If your code needs to be composable with DTensor or other tensor subclasses, use custom ops. Otherwise, use raw triton kernels. The raw version provides a simpler user experience: there is no registration step, and code that works without compile works with compile.
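
For concreteness, here is a minimal sketch of the raw-kernel route (approach 2): a hypothetical element-wise add kernel launched directly from `forward`, with `torch.compile` left to trace and inline it. The kernel, module, and block size are illustrative, not from the original post.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-sized chunk of the flat tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

class AddModule(torch.nn.Module):
    def forward(self, x, y):
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)
        # Launch the kernel directly; under torch.compile, inductor sees this
        # launch and can optimize/fuse the surrounding ops around it.
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

# The same module works eagerly and under compile -- no registration needed.
compiled = torch.compile(AddModule())
```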

In terms of performance, Dynamo/Inductor have optimization passes that optimize the code around the triton kernel, so you will likely see equal or better performance with the raw version.
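
For comparison, a minimal sketch of the custom-op route (approach 1) using the `torch.library.custom_op` API, which hides the kernel launch behind an opaque operator so tensor subclasses such as DTensor can dispatch on it. The op name `mylib::triton_add`, the kernel, and the fake-tensor registration are illustrative assumptions, not the official tutorial's code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Register the launch as a custom op; torch.compile treats it as a black box
# rather than inlining the kernel into the inductor graph.
@torch.library.custom_op("mylib::triton_add", mutates_args=())
def triton_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
    return out

# A fake implementation so compile can propagate shapes without running the kernel.
@triton_add.register_fake
def _(x, y):
    return torch.empty_like(x)
```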