I’m trying to understand where the time is spent when using torch.compile() targeting CUDA.
The official documentation on this is very helpful, but it doesn’t cover a few aspects I’m interested in:
- The Triton → PTX → SASS part of the compilation. I suspect that part of it is likely me not knowing how to interpret the Catapult trace, although it’s not clear which range represents the Triton compiler (down to SASS). Also, is there a global tracing integration that would include Triton’s internal steps (MLIR/LLVM and PTX-to-SASS)?
- The documentation page doesn’t seem to mention caching or cold/warm profiles. In this case, I’m primarily interested in the cold path (which would include all the compilation steps). I figured out how to disable the FX graph cache (`TORCHINDUCTOR_FX_GRAPH_CACHE=0`) and, of course, `CUDA_CACHE_DISABLE=1` to disable the PTX-to-SASS cache. Is there any other cache I need to disable for cold-compilation profiling?
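For reference, here is the setup I’m currently using for cold-path runs. It’s a sketch under my assumptions: the first two variables are the ones mentioned above, and I’m additionally pointing Triton’s on-disk kernel cache (`TRITON_CACHE_DIR`) at a throwaway directory so each run starts cold at that layer too; `profile_cold_compile.py` is a hypothetical driver script name.

```shell
# Assumed cold-compile environment; there may be further caches I'm missing.
export TORCHINDUCTOR_FX_GRAPH_CACHE=0   # skip the Inductor FX graph cache
export CUDA_CACHE_DISABLE=1             # skip the CUDA driver's PTX->SASS JIT cache
export TRITON_CACHE_DIR=$(mktemp -d)    # fresh dir => Triton recompiles kernels from scratch
python profile_cold_compile.py          # hypothetical script that runs torch.compile under the profiler
```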
Thanks!