Understanding the difference in caching behavior between the CUDA caching allocator and a pluggable allocator

TL;DR:

CUDAPluggableAllocator does not cache memory and does not support torch.cuda.memory_stats().
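
For context, here is a minimal sketch of how a pluggable allocator is swapped in. The shared-library path (alloc.so) and the exported symbols (my_malloc / my_free) are placeholders for your own compiled implementation, not part of the original note:

```python
import torch

# Load a custom allocator from a compiled shared library and make it the
# current allocator. "alloc.so", "my_malloc", and "my_free" are placeholders
# for your own C/C++ allocation functions.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    "alloc.so", "my_malloc", "my_free"
)
torch.cuda.memory.change_current_allocator(new_alloc)

# Subsequent CUDA allocations go through my_malloc/my_free directly:
# there is no caching layer, and torch.cuda.memory_stats() is not populated.
x = torch.zeros(1024, device="cuda")
```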

The behavior of the default caching allocator:

  • When an allocation fails, PyTorch first releases its cached but unused blocks back to CUDA and then retries the allocation.
  • When memory is freed, PyTorch does not return it to CUDA; it keeps the memory reserved in its cache so later allocations can reuse it (see the sketch after this list).
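
A minimal sketch of this reserve-and-reuse behavior, using the standard memory introspection helpers (the tensor size here is only illustrative):

```python
import torch

# Allocate a 256 MiB tensor; the caching allocator requests memory from CUDA.
x = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")
print(torch.cuda.memory_allocated())  # ~256 MiB actively in use
print(torch.cuda.memory_reserved())   # ~256 MiB held by the caching allocator

# Freeing the tensor drops "allocated", but "reserved" stays the same:
# the block is kept in the cache rather than returned to CUDA.
del x
print(torch.cuda.memory_allocated())  # ~0
print(torch.cuda.memory_reserved())   # still ~256 MiB

# empty_cache() releases unused cached blocks back to the driver; the allocator
# does the same automatically when an allocation would otherwise fail.
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())   # back toward 0
```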