Overlapping device to host copy with GPU collectives

kwen2501 · June 4, 2024, 4:30pm

Thanks! I reported this to the NCCL team.
If you'd like, you can also open an issue on NCCL's GitHub repository (NVIDIA/nccl) for easier tracking.
