Understanding CUDAGraph Trees

Thanks @eellison for your insightful explanation.

Just to clarify this:

In the example figure above, we capture g1 → g2, then replay g1 → g2. As a result we have live tensors Output1 and Output2 in the CUDAGraph Memory Pool. (Since you said the allocations are not accounted for during graph replay, i.e. on the fast path, and the allocations appear as deallocated, I assume Output1 and Output2 are just pointers into the CUDAGraph Memory Pool, which can be thought of logically as one long stretch of memory.)
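To make sure I am picturing this correctly, here is a minimal sketch of that setup in terms of the public `torch.cuda.graph` API (CUDAGraph Trees do this capture internally under torch.compile; `static_inp`, `output1`, `output2` are just illustrative names):

```python
import torch

static_inp = torch.randn(8, device="cuda")

# Warm-up on a side stream (recommended before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = static_inp * 2
torch.cuda.current_stream().wait_stream(s)

g1 = torch.cuda.CUDAGraph()
g2 = torch.cuda.CUDAGraph()

# Capture g1, then capture g2 into the SAME memory pool as g1.
with torch.cuda.graph(g1):
    output1 = static_inp * 2        # Output1 lives in the shared pool

with torch.cuda.graph(g2, pool=g1.pool()):
    output2 = output1 + 1           # Output2 lives in the same pool

# Replay g1 -> g2. output1/output2 remain usable, but (as I understand
# it) on the CUDAGraph Trees fast path the CachingAllocator no longer
# accounts for them: they are effectively raw pointers into the pool.
g1.replay()
g2.replay()
```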

Next, if we try to reuse the same CUDAGraph Memory Pool to capture a new graph g3, the capture can simply overwrite the memory locations pointed to by Output1 and/or Output2 when placing g3's allocations: since the CachingAllocator has no accounting information for them, the live tensors Output1 and Output2 are not actually recorded as allocated. We need to preserve the state of the live tensors, and therefore we checkpoint. Right?
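If that is right, the hazard would be ordinary pointer aliasing. Here is a toy analogy (`buf` and `g3_alloc` are hypothetical names; in reality the allocator would hand g3 that region during capture):

```python
import torch

# Toy aliasing analogy: two tensors sharing the same bytes, the way
# Output1 and a fresh g3 allocation would if the allocator reused the
# region backing Output1.
buf = torch.zeros(8, device="cuda")
output1 = buf[:4]        # the "live" output, really just a pointer
g3_alloc = buf[:4]       # g3's allocation handed the same region

g3_alloc.fill_(42.0)     # g3 writes during capture/replay...
print(output1)           # ...and output1 is silently clobbered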

I think I get why we need to checkpoint. I would now like some clarification on how the checkpointing is done and how it solves the problem.

How does checkpointing the memory pool back to the state it was in at the end of graph g1 solve the issue of maintaining the live-tensor state? If we checkpoint as above, wouldn't the state of the live tensors be lost? I am not getting this part.
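Here is how I currently picture the checkpoint/restore flow, written as pseudocode with hypothetical names (not real PyTorch APIs; the real logic is in CUDACachingAllocator.cpp). Please correct whichever step I have wrong:

```python
# Pseudocode only; snapshot_allocator_state / set_allocator_state are
# hypothetical names for illustration.

# 1. At the end of capturing g1, snapshot the pool's accounting.
#    In this snapshot Output1's block IS recorded as allocated.
ckpt_g1 = pool.snapshot_allocator_state()

# ... g2 is captured and replayed; on the fast path the allocator's
# accounting drifts away from which tensors are actually live ...

# 2. To capture g3 branching off g1, restore the accounting:
pool.set_allocator_state(ckpt_g1)

# 3. Reconcile with reality: blocks that were live at the checkpoint
#    but whose tensors have since died are freed back to the pool;
#    Output1 is still live, so its block stays marked allocated and
#    g3's capture can never be handed that region.
```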

The comment in CUDACachingAllocator.cpp seems to describe the same thing, but I am unable to follow it.