If I register a custom op using the torch.library.Library API that calls a triton.jit kernel, then compile a module containing this custom op with cpp_wrapper enabled, is the cubin of the triton kernel embedded in the generated CUDA extension? How does this differ from a module with only (non-custom) aten ops that is compiled with inductor, lowered into triton kernels through the inductor lowering pipeline, and then output using the cpp_wrapper mode?
Hi there! We have implemented native support for triton kernels in torch.compile, so you do not need to convert them to a custom op. In their current state, they get compiled just like the other inductor-emitted triton kernels. For AOT Inductor (the C++ version), we will emit a cubin file just like for the other kernels.
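That is, a user-defined triton.jit kernel can now be called directly from a torch.compile'd function. A minimal sketch (kernel and function names are illustrative, and it assumes triton is installed; the GPU run is guarded so the snippet is a no-op on CPU-only machines):

```python
import torch
import triton
import triton.language as tl

# A user-defined triton kernel: elementwise add
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out

# The triton kernel call is traced and compiled alongside the rest of the graph;
# no custom-op registration is needed.
compiled = torch.compile(add)

if torch.cuda.is_available():
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(compiled(a, b), a + b)
```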
We plan to support using these inside custom ops as well, but haven’t built that yet. cc @zou3519 for more information.
When is this version expected to be released?
We are targeting pytorch 2.3 for the official release. It should work on the nightlies, but we want to clean up all performance and composability problems before the official release.
Thank you for your prompt response. I appreciate your time in providing the information.