Anomalous time gaps observed when using CUDA kernels in PyTorch

The issue is that there is a noticeable time gap between the completion of one kernel and the launch of the next. From the profiler's timeline alone, I can't determine what is happening during this gap.
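For context, this is roughly how I'm collecting the trace. The model and input shapes below are placeholders standing in for my actual encoder layer, not the exact code:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input standing in for my real encoder layer.
layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
x = torch.randn(32, 128, 512, device="cuda")

# Warm up so one-time CUDA initialization doesn't pollute the trace.
for _ in range(3):
    layer(x)
torch.cuda.synchronize()

# Record both CPU and CUDA activity so host-side work shows up
# alongside the kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    layer(x)
    torch.cuda.synchronize()

# Export the timeline where I observe the gaps between kernels.
prof.export_chrome_trace("trace.json")
```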

How can I analyze what is consuming the time during these gaps? Note: the part of the computation shown in the diagram is a simple encoder layer.

I found that part of the gap is caused by host-side time spent in torch.cuda.nvtx calls, but I still can't pin down what is consuming the rest of the time. Is there any method for profiling the host time?
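This is roughly how I attributed part of the gap to the nvtx calls themselves: timing an empty range push/pop pair on the host with `time.perf_counter`. It's a crude sketch, not a precise measurement:

```python
import time
import torch

N = 10_000

# Measure the host-side cost of the nvtx range calls alone,
# with no GPU work inside the range.
start = time.perf_counter()
for _ in range(N):
    torch.cuda.nvtx.range_push("marker")
    torch.cuda.nvtx.range_pop()
elapsed = time.perf_counter() - start

print(f"nvtx push/pop pair: {elapsed / N * 1e6:.2f} us on the host")
```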