If I have a user-defined Triton kernel as part of a `torch.nn.Module`, what is the difference between:
- Registering the user-defined Triton kernel using the `torch.library` API and then compiling the module, vs.
- Calling `torch.compile` on the module and relying on the inductor backend to inline the user-defined kernel?
- What performance differences should be expected between the two approaches? Are there guidelines for when to prefer one API over the other?
- Will registering a custom op overcome the current limitations of user-defined kernels described in the official tutorial? Specifically, I often need to define heuristics after autotuning.
- More generally, are there sharp bits to watch out for when compiling a function or module that includes a user-defined kernel, such that the compiled version performs worse than calling it without compilation?