Anomalous time gaps observed when using CUDA kernels in PyTorch

The issue is that there is a noticeable time gap between the completion of one kernel and the launch of the next. From the profiler's timeline alone, I can't determine what is happening during this gap.
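For context, this is roughly how I'm collecting the trace. The model and input shapes below are placeholders standing in for my actual encoder layer, not the exact code:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input standing in for my real encoder layer.
layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
x = torch.randn(32, 128, 512, device="cuda")

# Warm up so one-time CUDA initialization doesn't pollute the trace.
for _ in range(3):
    layer(x)
torch.cuda.synchronize()

# Record both CPU and CUDA activity so host-side work shows up
# alongside the kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    layer(x)
    torch.cuda.synchronize()

# Export the timeline where I observe the gaps between kernels.
prof.export_chrome_trace("trace.json")
```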

How can I analyze what is consuming the time during these gaps? Note: the part of the computation shown in the diagram is a simple encoder layer.

I found that part of the gap is caused by host-side time spent in torch.cuda.nvtx calls, but I still can't pin down what is consuming the rest of the time. Is there any method for profiling the host time?
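This is roughly how I attributed part of the gap to the nvtx calls themselves: timing an empty range push/pop pair on the host with `time.perf_counter`. It's a crude sketch, not a precise measurement:

```python
import time
import torch

N = 10_000

# Measure the host-side cost of the nvtx range calls alone,
# with no GPU work inside the range.
start = time.perf_counter()
for _ in range(N):
    torch.cuda.nvtx.range_push("marker")
    torch.cuda.nvtx.range_pop()
elapsed = time.perf_counter() - start

print(f"nvtx push/pop pair: {elapsed / N * 1e6:.2f} us on the host")
```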