FSDP & CUDACachingAllocator: an outsider newb perspective

Thanks @ngimel for the comprehensive response. Another way of thinking about this that may be helpful: record_stream introduces nondeterminism, and that nondeterminism puts the onus on the CPU to deal with it. That's where the CPU sync comes in: the CPU is verifying that the memory of deleted tensors has actually been released before allocating new memory. Technically, as @ngimel mentioned, these two concepts are decoupled, and the CPU could just not "care"! By replacing the record_stream calls with stream-stream syncs, the CPU can trust that the streams will wait on each other properly, so it can be carefree and schedule whatever it wants without needing to sync. In a sense, the responsibility has shifted from the CPU to the streams.
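To make the nondeterminism concrete, here is a toy model in plain Python. It is only a sketch of my understanding, not the real CUDACachingAllocator: `ToyAllocator`, `run`, and the integer "clock" are all made up for illustration. The assumption is that a block freed after record_stream only becomes reusable once the recorded GPU work has finished, and that the allocator checks this on the CPU at the next allocation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAllocator:
    """Hypothetical stand-in for a caching allocator (illustration only)."""
    free_blocks: int = 0   # blocks that can be reused right now
    reserved: int = 0      # blocks ever requested from the device
    pending: list = field(default_factory=list)  # finish times guarding freed blocks

    def malloc(self, now):
        # On every allocation the CPU checks which recorded work has finished.
        still_pending = [t for t in self.pending if t > now]
        self.free_blocks += len(self.pending) - len(still_pending)
        self.pending = still_pending
        if self.free_blocks:
            self.free_blocks -= 1   # reuse a cached block
        else:
            self.reserved += 1      # nothing reusable: grow the pool

    def free_recorded(self, done_at):
        # record_stream path: reuse is deferred until the GPU work completes.
        self.pending.append(done_at)

    def free_synced(self):
        # stream-stream sync path: ordering is enforced on the GPU,
        # so the block goes straight back to the pool.
        self.free_blocks += 1


def run(num_iters, cpu_step, gpu_latency, use_record_stream):
    """Allocate and free one block per iteration; return blocks reserved."""
    alloc = ToyAllocator()
    for i in range(num_iters):
        now = i * cpu_step  # CPU time at which this iteration is issued
        alloc.malloc(now)
        if use_record_stream:
            alloc.free_recorded(done_at=now + gpu_latency)
        else:
            alloc.free_synced()
    return alloc.reserved


fast_cpu = run(8, cpu_step=1, gpu_latency=4, use_record_stream=True)
slow_cpu = run(8, cpu_step=5, gpu_latency=4, use_record_stream=True)
synced = run(8, cpu_step=1, gpu_latency=4, use_record_stream=False)
print(fast_cpu, slow_cpu, synced)  # prints: 4 1 1
```

With the record_stream path, the number of blocks reserved depends on how far the CPU runs ahead of the GPU (4 versus 1 here). A CPU sync effectively forces the slow-CPU case. With the stream-sync path the answer is 1 regardless of CPU timing.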

In the specific case of FSDP, removing the need for a CPU sync would require addressing the nondeterminism introduced by record_stream, and the most straightforward way to do that is to remove or replace the record_stream calls.
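As a rough sketch of what "replace record_stream with a stream-stream sync" could look like, here are the two patterns side by side. This is not FSDP's actual code: the function names are hypothetical, and `src.clone()` is just a stand-in for a buffer produced on a side stream (such as an all-gather output). It only uses `torch.cuda.Stream.wait_stream`, `torch.cuda.stream`, and `Tensor.record_stream`.

```python
import torch


def produce_consume_record_stream(src, side):
    """Buffer is allocated on `side`, consumed on the current stream."""
    main = torch.cuda.current_stream()
    side.wait_stream(main)           # src must be ready before side reads it
    with torch.cuda.stream(side):
        buf = src.clone()            # stand-in for an all-gather output
    main.wait_stream(side)           # main must not read buf before it is written
    out = buf.sum()                  # consumer kernel runs on main
    # Tell the allocator that main uses buf, so the block is not reused
    # until main's queued work has finished (checked later from the CPU).
    buf.record_stream(main)
    return out


def produce_consume_stream_sync(src, side):
    """Same data flow, but ordering is enforced by the streams themselves."""
    main = torch.cuda.current_stream()
    side.wait_stream(main)
    with torch.cuda.stream(side):
        buf = src.clone()
    main.wait_stream(side)
    out = buf.sum()
    # Later work on side (including anything that reuses buf's block)
    # is ordered after main's use of buf, so no record_stream is needed.
    side.wait_stream(main)
    return out


if torch.cuda.is_available():
    side = torch.cuda.Stream()
    src = torch.arange(1024, device="cuda", dtype=torch.float32)
    a = produce_consume_record_stream(src, side)
    b = produce_consume_stream_sync(src, side)
    torch.cuda.synchronize()
    assert torch.equal(a, b)
```

In the second version the buffer's block goes back to the side stream's pool as soon as `buf` is dropped, and correctness relies on `side.wait_stream(main)` having been issued before that point. The trade-off is that the side stream may now wait on compute it would otherwise have overlapped with.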