I’m trying to understand where the time is spent when using torch.compile() targeting CUDA.
The official documentation on this is very helpful, but it doesn’t cover a few aspects I’m interested in:
- The Triton → PTX → SASS part of the compilation. I suspect that part of it is likely me not knowing how to interpret the Catapult trace, although it’s not clear which range represents the Triton compiler (down to SASS). Also, is there a global tracing integration that would include Triton’s internal steps (MLIR/LLVM and PTX-to-SASS)?
- The documentation page doesn’t seem to mention caching or cold/warm profiles. In this case, I’m primarily interested in the cold path (which would include all the compilation steps). I figured out how to disable the FX graph cache (`TORCHINDUCTOR_FX_GRAPH_CACHE=0`) and, of course, `CUDA_CACHE_DISABLE=1` to disable the PTX-to-SASS cache. Is there any other cache I need to disable for cold-compilation profiling?
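For reference, here is the setup I’m currently using for cold-path runs. It’s a sketch under my assumptions: the first two variables are the ones mentioned above, and I’m additionally pointing Triton’s on-disk kernel cache (`TRITON_CACHE_DIR`) at a throwaway directory so each run starts cold at that layer too; `profile_cold_compile.py` is a hypothetical driver script name.

```shell
# Assumed cold-compile environment; there may be further caches I'm missing.
export TORCHINDUCTOR_FX_GRAPH_CACHE=0   # skip the Inductor FX graph cache
export CUDA_CACHE_DISABLE=1             # skip the CUDA driver's PTX->SASS JIT cache
export TRITON_CACHE_DIR=$(mktemp -d)    # fresh dir => Triton recompiles kernels from scratch
python profile_cold_compile.py          # hypothetical script that runs torch.compile under the profiler
```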
Thanks!