FSDP & CUDACachingAllocator: an outsider newb perspective

record_stream is not bad, but it has a very particular purpose that may not be well suited to the FSDP case. Suppose you are writing a third-party package (or even a PyTorch package) that returns tensors created on a side stream, and you want your users to be able to use them safely without thinking much about it. This is a typical case for communication routines or data preprocessing. Then record_stream is your friend: you create a tensor on a side stream, call record_stream with the current stream, and as far as safe usage goes the returned tensors behave just like tensors that users would naturally create themselves. You pay for this with non-determinism and potential memory spikes, but users don't have to insert additional syncs in their code. It's impossible to achieve this effect without CCA support, so it's a good thing CCA provides this mechanism.
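To make the library-author scenario concrete, here is a minimal sketch of that pattern. The function name `load_on_side_stream` is hypothetical, `torch.arange` merely stands in for a real communication or preprocessing routine, and the CPU branch is only there so the snippet runs on machines without a GPU.

```python
import torch


def load_on_side_stream(n: int) -> torch.Tensor:
    """Hypothetical library routine: build a tensor on a side stream and
    hand it back so the caller can use it like any other tensor."""
    if not torch.cuda.is_available():
        # CPU fallback so the sketch runs anywhere; no streams are involved.
        return torch.arange(n, dtype=torch.float32)

    caller_stream = torch.cuda.current_stream()
    side_stream = torch.cuda.Stream()
    with torch.cuda.stream(side_stream):
        # Allocated from the side stream's pool in the caching allocator.
        out = torch.arange(n, dtype=torch.float32, device="cuda")

    # The caller's stream must not read `out` before the side stream wrote it.
    caller_stream.wait_stream(side_stream)
    # Tell the allocator that `out` is also used on the caller's stream, so
    # that when `out` is freed its block is not reused until the work queued
    # on that stream up to that point has finished.
    out.record_stream(caller_stream)
    return out
```

The caller can then write `x = load_on_side_stream(4); y = x * 2` with no extra synchronization of their own, which is exactly the convenience being paid for with delayed block reuse.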
In the FSDP case we control everything, so we don't have to rely on record_stream, and we also don't need CCA to support recording events; all of this can be done in application code.
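As a rough illustration of doing this in application code, the sketch below replaces record_stream with an explicit event. It is not FSDP's actual implementation: `gather_then_reduce` is a hypothetical name, `shard * 2` stands in for an all-gather, and a real implementation would presumably keep one long-lived communication stream rather than creating one per call.

```python
import torch


def gather_then_reduce(shard: torch.Tensor) -> torch.Tensor:
    """Hypothetical FSDP-like step: produce a buffer on a comm stream,
    consume it on the compute stream, free it without record_stream."""
    if not shard.is_cuda:
        # CPU fallback so the sketch runs anywhere; no streams are involved.
        return (shard * 2).sum()

    compute = torch.cuda.current_stream()
    comm = torch.cuda.Stream()

    # `shard` was produced on the compute stream, so comm must wait for it.
    comm.wait_stream(compute)
    with torch.cuda.stream(comm):
        # Stand-in for the all-gather; the block belongs to comm's pool.
        buf = shard * 2

    # Compute must not read `buf` before comm has written it.
    compute.wait_stream(comm)
    result = buf.sum()  # the consumer runs on the compute stream

    # Instead of buf.record_stream(compute): order all later comm-stream
    # work after the consumer, then release the buffer right away.
    done = torch.cuda.Event()
    done.record(compute)
    comm.wait_event(done)
    del buf  # the block returns to comm's pool immediately
    return result
```

Because the ordering is enforced on the stream itself, the block can be handed back to the allocator deterministically, and anything that reuses it on the comm stream is already queued behind the consumer.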
CCA could potentially offer an option to synchronize streams when the block is freed, instead of just recording an event and checking on it later, but it probably shouldn't be the default behavior: it would stall the side stream, which is undesirable in many cases and would slow down existing code that relies on the current behavior.
