Asynchronous Execution and Memory Management

GPUs allow asynchronous execution, so I can enqueue all my kernels and wait for the result later. This is significant for performance.

Now the question is: how do I manage the lifetime of tensors/memory allocated for kernels that are being executed on the stream/OpenCL execution queue?

For example, I have a simple relu_:

Tensor & relu_(Tensor & self)
{
    dlprim::Tensor X = todp(self);                        // view the PyTorch tensor as a dlprim tensor
    dlprim::ExecutionContext q = getExecutionContext(self);
    dlprim::core::activation_forward(X, X, dlprim::StandardActivations::relu, q);
    q.finish();                                           // blocks until the kernel has completed
    return self;
}

If I don't call q.finish(), who promises me that the self tensor will not be deleted, given code like:

x = F.relu_(x)
x = conv(x)

x may be deleted while the kernel hasn't even executed yet. How do I handle this?
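
Concretely, the sequence I worry about looks like this in Python (assuming a CUDA device and an arbitrary conv layer, just for illustration):

    import torch
    import torch.nn.functional as F

    conv = torch.nn.Conv2d(16, 16, 3, padding=1).cuda()   # arbitrary layer, just for illustration
    x = torch.randn(1, 16, 32, 32, device="cuda")

    x = F.relu_(x)   # with an async backend the kernel is only enqueued here
    x = conv(x)      # rebinding x drops the last reference to the relu output's storage,
                     # possibly before the relu (or conv) kernel has actually executed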

Another issue: how do I optimize memory?

For example:

x = f1(x)
x = f2(x)
x = f3(x)

In synchronous execution, x gets released and its memory is freed after each step. With a static graph (which I use for the dlprimitives/OpenCL DL library) I can calculate memory reuse in advance and reuse buffers (for inference), and in backprop I can release some of the memory used for gradients.
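
To make the static-graph idea concrete, here is a toy sketch of the kind of ping-pong buffer plan that can be computed in advance for a linear chain; it is only an illustration, not the actual dlprimitives planner:

    # Toy sketch of static memory planning for a linear chain f1 -> f2 -> f3.
    buffers = ["M1", "M2"]            # two physical buffers are enough for a chain
    plan = []
    for i, op in enumerate(["f1", "f2", "f3"]):
        src = buffers[i % 2]          # input buffer
        dst = buffers[(i + 1) % 2]    # output reuses the other slot
        plan.append((op, src, dst))
    print(plan)   # [('f1', 'M1', 'M2'), ('f2', 'M2', 'M1'), ('f3', 'M1', 'M2')]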

How can this be done, or how is it done, for eager execution in PyTorch with an asynchronous GPU interface?

OK, after reading "CUDA semantics — PyTorch 1.9.1 documentation" I think I understand how it works; I just want to make sure I understand correctly.

Assumptions:

  1. GPU memory is never released back to the driver but always goes to the cache, so it remains valid (at least as long as there is something in the queues); a quick check is sketched after this list.
  2. Everything is assumed to be executed over a single queue/stream, and only over that queue.
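
A quick way to check assumption 1 on a CUDA build: memory_allocated drops when a tensor dies, memory_reserved does not.

    import torch

    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_allocated())   # bytes held by live tensors (includes x)
    print(torch.cuda.memory_reserved())    # bytes the caching allocator keeps from the driver

    del x                                   # the block goes back to the cache, not to the driver
    print(torch.cuda.memory_allocated())   # decreases
    print(torch.cuda.memory_reserved())    # stays the same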

My understanding is the following:

  1. When work is enqueued to the execution queue/stream, the tensors are allocated up front and released to the cache right after the ops are enqueued.
  2. Since GPU operations in the queue are executed in exactly the same order as they are enqueued, and the allocation is valid at enqueue time, any later reuse of a cached block is ordered after the ops that still read or write it.

Let's look at the following:

 x = op1(x)
 x = op2(x)
 x = op3(x)

Now allocations + ordering:

op1:
    x input = M1 (allocated), x output = M2 (allocated)
    GPU queue:  op1(M1 input, M2 output)
    x = result of op1 <- M2, M1 released to cache
op2:
    x input = M2, x output = allocated from cache <- M1
    GPU queue:  op2(M2 input, M1 output)
    x = result of op2 <- M1, M2 released to cache
op3:
    x input = M1, x output = allocated from cache <- M2
    GPU queue:  op3(M1 input, M2 output)
    x = result of op3 <- M2, M1 released to cache

GPU execution order:

- op1(M1,M2)
- op2(M2,M1) 
- op3(M1,M2)

So it looks like the GPU execution is valid…
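
If I want to verify this on a CUDA device, a small check with data_ptr() should show the ping-pong reuse (the allocator does not guarantee it, but on an otherwise idle allocator it typically happens):

    import torch

    x = torch.randn(1024, 1024, device="cuda")
    p1 = x.data_ptr()           # call this block M1
    x = torch.relu(x)           # output gets block M2; M1 is returned to the cache
    p2 = x.data_ptr()
    x = torch.sigmoid(x)        # output is typically served block M1 again from the cache
    p3 = x.data_ptr()
    print(p1 == p3)             # often True: M1 was reused for the third tensor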

Do I understand correctly? What am I missing? What happens with multi-threading? How do you manage more than one stream/queue? It is common to use more than one execution stream in general GPU programming…

If I understood correctly, this is probably one of the craziest dynamic optimisations I have ever seen, or one of the most genius… or most likely both.

How can this be done, or how is it done, for eager execution in PyTorch with an asynchronous GPU interface?

We have a caching allocator within PyTorch that makes allocation almost free, so we actually don't do anything special for memory allocations.

The caching allocator also records the current stream when a Tensor is created so that it knows how to sync its de-allocation. If you use the Tensor on a different stream, you can simply call t.record_stream(other_stream) (see the record_stream docs) to let the caching allocator know that it needs to wait on that other stream as well.
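
For example, a minimal sketch with a side stream:

    import torch

    s = torch.cuda.Stream()                        # a side stream
    x = torch.randn(1024, 1024, device="cuda")     # allocated on the current (default) stream

    s.wait_stream(torch.cuda.current_stream())     # make sure x is ready before s uses it
    with torch.cuda.stream(s):
        y = x.relu()                               # work that reads x is enqueued on s
        x.record_stream(s)                         # allocator waits on s before reusing x's block

    torch.cuda.current_stream().wait_stream(s)     # sync back before using y on the default stream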

What happens with multi-threading?

The memory pool is global, so nothing changes with threading.

Thanks!

Brilliant design!