We have found another data point relevant to this discussion. First, we reproduced the problems detailed by @janeyx99 with FSDP and the T5 model family. Second, we observed the non-deterministic allocation only while training the T5 model family, not the GPT2 model family.
The most significant difference between them is that T5 is an encoder-decoder model, while GPT2 is decoder-only. We have no further insight into why the memory fragmentation does not show up with GPT2.
If anybody else stumbles into this thread in hopes of a quick fix: we have evaluated setting `expandable_segments:True` as a CUDA caching allocator option (via the `PYTORCH_CUDA_ALLOC_CONF` environment variable), and it helped keep the fragmented memory to a minimum. It was recently added as an officially documented parameter here.
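For reference, a minimal sketch of how we set this option. The environment variable name and value syntax follow the PyTorch CUDA memory management docs; `train.py` is just a placeholder for your own training script:

```shell
# Let the CUDA caching allocator grow existing memory segments instead of
# reserving new fixed-size ones, which reduces fragmentation over time.
# Must be set before the process makes its first CUDA allocation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

Then launch training as usual (e.g. `python train.py`); the setting applies to the whole process, including all FSDP workers spawned from it.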
For a detailed analysis, feel free to check out our blog post on this issue.