The issue is that there is a noticeable idle gap between the end of one kernel's execution and the launch of the next kernel. From the profiler timeline alone, I can't tell what is happening during this gap.
How can I find out what is consuming the time during this gap? Note: the computation shown in the diagram is a simple encoder layer.
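
To make the gap concrete, here is a minimal sketch of how it could be quantified. It assumes the kernel start and end timestamps can be exported from the profiler trace and that the kernels run one after another on a single stream. The `idle_gaps` helper, the kernel names, and the timestamps are all hypothetical and only illustrate the idea; they are not taken from the actual profile.

```python
from typing import List, Tuple

# Hypothetical kernel record: (name, start_us, end_us), e.g. exported from a profiler trace.
Kernel = Tuple[str, float, float]


def idle_gaps(kernels: List[Kernel], min_gap_us: float = 0.0) -> List[Tuple[str, str, float]]:
    """Return (previous kernel, next kernel, gap in microseconds) for each idle gap.

    Assumes the kernels are serialized on one stream, so the gap is simply
    the next kernel's start time minus the previous kernel's end time.
    """
    ordered = sorted(kernels, key=lambda k: k[1])  # sort by start time
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt[1] - prev[2]
        if gap > min_gap_us:
            gaps.append((prev[0], nxt[0], gap))
    return gaps


# Made-up timestamps, for illustration only.
trace = [
    ("layernorm", 0.0, 12.0),
    ("qkv_gemm", 40.0, 95.0),
    ("softmax", 97.0, 110.0),
]

for prev_name, next_name, gap in idle_gaps(trace, min_gap_us=5.0):
    print(f"{prev_name} -> {next_name}: {gap:.1f} us idle")
```

Listing the largest gaps this way shows which pairs of kernels to look at first when checking what the host side is doing in between.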