We have found another data point relevant to this discussion. First, we reproduced the problems detailed by @janeyx99 with FSDP and the T5 model family. Second, we observed the non-deterministic allocation only while training the T5 model family, not the GPT2 model family.
The most significant difference between them is that T5 is an encoder-decoder model, while GPT2 is decoder-only. We have no further insight into why the memory fragmentation does not show up with GPT2.
If anybody else stumbles into this thread in hopes of a quick fix: we have evaluated setting `expandable_segments:True` as a CUDA caching allocator option (via the `PYTORCH_CUDA_ALLOC_CONF` environment variable), and it helped keep the fragmented memory to a minimum. It was recently added as an officially documented parameter here.
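For reference, a minimal sketch of how we set this option. The environment variable name and value syntax follow the PyTorch CUDA memory management docs; `train.py` is just a placeholder for your own training script:

```shell
# Let the CUDA caching allocator grow existing memory segments instead of
# reserving new fixed-size ones, which reduces fragmentation over time.
# Must be set before the process makes its first CUDA allocation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

Then launch training as usual (e.g. `python train.py`); the setting applies to the whole process, including all FSDP workers spawned from it.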
For a detailed analysis, feel free to check out our blog post on this issue.