I’m using FSDP2 to train an LLM on 8 H200 GPUs connected with NVLink. Looking at the profiler trace, I found that whenever communication overlaps with computation, for example a GEMM overlapping with an FSDP all-gather, the GEMM’s MFU drops from roughly 75–80% down to 45–50%.
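For context, here is roughly how I shard and profile the model. This is a trimmed-down sketch, not my actual training script: the `Block` module, the sizes, and the step count are placeholders, and depending on the PyTorch version `fully_shard` may need to be imported from `torch.distributed._composable.fsdp` instead.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2; older builds: torch.distributed._composable.fsdp
from torch.profiler import ProfilerActivity, profile

# Launched with: torchrun --nproc_per_node=8 repro.py
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy stand-in for a transformer block; the real model is a standard decoder-only LLM.
class Block(nn.Module):
    def __init__(self, d=8192):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ffn(x)

model = nn.Sequential(*[Block() for _ in range(8)]).cuda()

# FSDP2-style sharding: shard each block so its all-gather is prefetched and
# overlaps with the previous block's GEMMs -- this is the overlap I see in the trace.
for block in model:
    fully_shard(block)
fully_shard(model)

opt = torch.optim.AdamW(model.parameters())

# Capture a short trace of a few steps to inspect the GEMM / all-gather overlap.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(4):
        x = torch.randn(4, 2048, 8192, device="cuda")
        model(x).sum().backward()
        opt.step()
        opt.zero_grad()
prof.export_chrome_trace(f"trace_rank{dist.get_rank()}.json")
```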
I understand that NCCL kernels use some of the GPU’s SMs, but I expected the impact to be relatively small, since communication kernels do little actual computation.
My question is: is it expected for the memory bandwidth consumed by NCCL to have such a significant negative impact on the performance of my GEMMs?
There are already some known ways to mitigate this, such as fusing communication into compute kernels, or breaking a big kernel into smaller kernels and overlapping communication with computation at a finer granularity. But in this model the communication and the computation are separate operations that simply run in parallel, so it is very hard for me to apply the optimizations above.
Given these constraints, are there any other strategies to reduce the impact of communication on computation? Would limiting the number of SMs that NCCL uses help?
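Concretely, I was considering capping NCCL’s channel count via environment variables, something like the sketch below. The exact values are guesses I would have to sweep, and I’m not sure this helps with memory-bandwidth contention at all:

```python
import os

# These must be set before the first NCCL communicator is created
# (i.e., before init_process_group / the first collective), typically in the launch script.
# Fewer channels -> fewer thread blocks per NCCL kernel -> fewer SMs taken from the GEMMs,
# at the possible cost of lower collective bandwidth.
os.environ["NCCL_MAX_NCHANNELS"] = "8"   # guess; would sweep this
os.environ["NCCL_MIN_NCHANNELS"] = "4"   # guess
os.environ["NCCL_NTHREADS"] = "256"      # threads per NCCL block; default is higher on recent GPUs
```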
I have found that many people have hit the same problem, but those posts still have no answers.