I’m using FSDP2 to train an LLM on 8 H200 GPUs connected with NVLink. Looking at the profiler trace, I found that whenever communication overlaps with computation, for example a GEMM overlapping with an FSDP all-gather, the GEMM’s MFU drops from roughly 75–80% down to 45–50%.
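For context, here is roughly how I shard and profile the model. This is a trimmed-down sketch, not my actual training script: the `Block` module, the sizes, and the step count are placeholders, and depending on the PyTorch version `fully_shard` may need to be imported from `torch.distributed._composable.fsdp` instead.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2; older builds: torch.distributed._composable.fsdp
from torch.profiler import ProfilerActivity, profile

# Launched with: torchrun --nproc_per_node=8 repro.py
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy stand-in for a transformer block; the real model is a standard decoder-only LLM.
class Block(nn.Module):
    def __init__(self, d=8192):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ffn(x)

model = nn.Sequential(*[Block() for _ in range(8)]).cuda()

# FSDP2-style sharding: shard each block so its all-gather is prefetched and
# overlaps with the previous block's GEMMs -- this is the overlap I see in the trace.
for block in model:
    fully_shard(block)
fully_shard(model)

opt = torch.optim.AdamW(model.parameters())

# Capture a short trace of a few steps to inspect the GEMM / all-gather overlap.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(4):
        x = torch.randn(4, 2048, 8192, device="cuda")
        model(x).sum().backward()
        opt.step()
        opt.zero_grad()
prof.export_chrome_trace(f"trace_rank{dist.get_rank()}.json")
```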
I understand that NCCL kernels use some of the GPU’s SMs, but I expected the impact to be relatively small, since communication kernels do little actual computation.
My question is: is it expected for the memory bandwidth consumed by NCCL to have such a significant negative impact on the performance of my GEMMs?
There are already some known ways to mitigate this, such as fusing communication into compute kernels, or breaking a big kernel into smaller kernels and overlapping communication with computation at a finer granularity. But in this model the communication and the computation are separate operations that simply run in parallel, so it is very hard for me to apply the optimizations above.
Given these constraints, are there any other strategies to reduce the impact of communication on computation? Would limiting the number of SMs that NCCL uses help?
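Concretely, I was considering capping NCCL’s channel count via environment variables, something like the sketch below. The exact values are guesses I would have to sweep, and I’m not sure this helps with memory-bandwidth contention at all:

```python
import os

# These must be set before the first NCCL communicator is created
# (i.e., before init_process_group / the first collective), typically in the launch script.
# Fewer channels -> fewer thread blocks per NCCL kernel -> fewer SMs taken from the GEMMs,
# at the possible cost of lower collective bandwidth.
os.environ["NCCL_MAX_NCHANNELS"] = "8"   # guess; would sweep this
os.environ["NCCL_MIN_NCHANNELS"] = "4"   # guess
os.environ["NCCL_NTHREADS"] = "256"      # threads per NCCL block; default is higher on recent GPUs
```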
I have found that many people have hit the same problem, but those posts still have no answers.