Understanding CUDAGraph Trees

Thanks @eellison for your insightful explanation.

Just to clarify this:

In the example figure above, we capture g1 → g2, then replay g1 → g2. As a result we have live tensors Output1 and Output2 in the CUDAGraph Memory Pool. (Since you said the allocations are not accounted for during graph replay, i.e. on the fast path, and the allocations appear as deallocated, I assume Output1 and Output2 are just pointers into the CUDAGraph Memory Pool, which can be thought of logically as one long stretch of memory.)
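To make sure I am picturing this correctly, here is a minimal sketch of that setup in terms of the public `torch.cuda.graph` API (CUDAGraph Trees do this capture internally under torch.compile; `static_inp`, `output1`, `output2` are just illustrative names):

```python
import torch

static_inp = torch.randn(8, device="cuda")

# Warm-up on a side stream (recommended before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = static_inp * 2
torch.cuda.current_stream().wait_stream(s)

g1 = torch.cuda.CUDAGraph()
g2 = torch.cuda.CUDAGraph()

# Capture g1, then capture g2 into the SAME memory pool as g1.
with torch.cuda.graph(g1):
    output1 = static_inp * 2        # Output1 lives in the shared pool

with torch.cuda.graph(g2, pool=g1.pool()):
    output2 = output1 + 1           # Output2 lives in the same pool

# Replay g1 -> g2. output1/output2 remain usable, but (as I understand
# it) on the CUDAGraph Trees fast path the CachingAllocator no longer
# accounts for them: they are effectively raw pointers into the pool.
g1.replay()
g2.replay()
```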

Next, if we try to reuse the same CUDAGraph Memory Pool to capture a new graph g3, the capture can simply overwrite the memory locations pointed to by Output1 and/or Output2 when placing g3's allocations: since the CachingAllocator has no accounting information for them, the live tensors Output1 and Output2 are not actually recorded as allocated. We need to preserve the state of the live tensors, and therefore we checkpoint. Right?
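If that is right, the hazard would be ordinary pointer aliasing. Here is a toy analogy (`buf` and `g3_alloc` are hypothetical names; in reality the allocator would hand g3 that region during capture):

```python
import torch

# Toy aliasing analogy: two tensors sharing the same bytes, the way
# Output1 and a fresh g3 allocation would if the allocator reused the
# region backing Output1.
buf = torch.zeros(8, device="cuda")
output1 = buf[:4]        # the "live" output, really just a pointer
g3_alloc = buf[:4]       # g3's allocation handed the same region

g3_alloc.fill_(42.0)     # g3 writes during capture/replay...
print(output1)           # ...and output1 is silently clobbered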

I think I get why we need to checkpoint. I would now like some clarification on how the checkpointing is done and how it solves the problem.

How does checkpointing the memory pool back to the state it was in at the end of graph g1 solve the issue of maintaining the live-tensor state? If we checkpoint as above, wouldn't the state of the live tensors be lost? I am not getting this part.
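Here is how I currently picture the checkpoint/restore flow, written as pseudocode with hypothetical names (not real PyTorch APIs; the real logic is in CUDACachingAllocator.cpp). Please correct whichever step I have wrong:

```python
# Pseudocode only; snapshot_allocator_state / set_allocator_state are
# hypothetical names for illustration.

# 1. At the end of capturing g1, snapshot the pool's accounting.
#    In this snapshot Output1's block IS recorded as allocated.
ckpt_g1 = pool.snapshot_allocator_state()

# ... g2 is captured and replayed; on the fast path the allocator's
# accounting drifts away from which tensors are actually live ...

# 2. To capture g3 branching off g1, restore the accounting:
pool.set_allocator_state(ckpt_g1)

# 3. Reconcile with reality: blocks that were live at the checkpoint
#    but whose tensors have since died are freed back to the pool;
#    Output1 is still live, so its block stays marked allocated and
#    g3's capture can never be handed that region.
```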

The comment in CUDACachingAllocator.cpp seems to describe the same thing, but I am unable to follow it.