How to Access Triton Kernels from TorchInductor when running on CPU?

When running torch.compile on a PyTorch model and accessing the generated TorchInductor Python code for the graph with C++/Triton kernels, as indicated here, there appears to be no way to access the generated Triton code when using a CPU device.

It appears that all of the kernels are generated as C++ when running on CPU, whereas on CUDA they are generated as Triton. Does PyTorch / TorchInductor offer a way to access generated Triton kernels when running on a CPU device?
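
For reference, here is a minimal sketch of how I am dumping the generated code to see this, assuming the `TORCH_COMPILE_DEBUG` environment variable and `torch._logging.set_logs(output_code=True)` behave as documented; the toy function `f` is just an example:

```python
import os

# Ask Inductor to keep its debug artifacts (a torch_compile_debug/ directory
# containing output_code.py). The variable is read when torch is imported,
# so set it first.
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch

# Also print the generated wrapper code to the log as it is compiled.
torch._logging.set_logs(output_code=True)

def f(x, y):
    return (x + y).relu().sum()

compiled = torch.compile(f)

# On CPU the emitted code calls C++ kernels; on a CUDA device the same dump
# contains @triton.jit kernels instead.
compiled(torch.randn(1024), torch.randn(1024))
```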

I can take care of running Triton on a CPU backend, but I need to be able to access the Triton kernels that are generated from PyTorch first.

Hi @JoeLi12345, thanks for the question. The Triton codegen backend for CPU has not been enabled in Inductor yet; only C++ codegen is available for now.
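
If you just want to inspect what Inductor emits on CPU today, one option is the helper below (a sketch; `run_and_get_code` lives in `torch._inductor.utils` and is an internal test utility, so it may change between releases):

```python
import torch
from torch._inductor.utils import run_and_get_code  # internal helper, may change

def f(x):
    return torch.nn.functional.gelu(x) * 2

compiled = torch.compile(f)

# Returns the compiled function's result together with the source code that
# Inductor generated for it. On a CPU device this source wraps C++ kernels
# only; there are no Triton kernels to extract.
result, (code,) = run_and_get_code(compiled, torch.randn(128))
print(code)
```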