If I have a user-defined Triton kernel as part of a `torch.nn.Module`, what is the difference between:
- Registering the user-defined Triton kernel using the `torch.library` API and then compiling the module, vs.
- Calling `torch.compile` on the module and relying on the `inductor` backend to inline the user-defined kernel?
- What performance differences should be expected between the two approaches? Are there guidelines for when to use one API over the other?
- Will registering a custom op overcome the current limitations of user-defined kernels described in the official tutorial? Specifically, one often needs to define heuristics after autotuning.
- More generally, are there sharp bits to watch out for when compiling a function or module that includes a user-defined kernel, which can result in worse performance than calling it without compilation?