Lazy Tensor Core

Lazy execution has been a dream of mine for several years. Years ago I wrote a rather efficient Caffe-style framework in Java, and I dreamed about a Torch-style dynamic framework with the same performance. (I also dreamed that it would be in Java rather than the much more unwieldy C++.)

I’ve thought about the first question in this thread, of when to submit the AST for execution. I think it should mostly be at the end of each pass through the training loop, because that exposes a lot of whole-program optimization opportunities. For example, the model might compute a loss that ends up unused in a given training run; the entire subgraph that produces it could then be dropped automatically, as in the sketch below.
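
To make this concrete, here is a minimal sketch (the `Node` class and `dead_code_eliminate` are invented for illustration, not the actual Lazy Tensor Core API) of why end-of-step submission enables this: only nodes reachable from the outputs we actually ask for survive.

```python
# Hypothetical lazy graph: ops are recorded as nodes; nothing runs until
# the step ends and we ask for concrete outputs.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)

def dead_code_eliminate(graph_nodes, live_outputs):
    """Keep only nodes reachable from the outputs we actually execute."""
    live = set()
    stack = list(live_outputs)
    while stack:
        node = stack.pop()
        if id(node) in live:
            continue
        live.add(id(node))
        stack.extend(node.inputs)
    return [n for n in graph_nodes if id(n) in live]

# Build a tiny step: two losses are computed, but only one is stepped on.
x = Node("input")
main_loss = Node("cross_entropy", [x])
aux_loss = Node("kl_div", [x])        # computed but never used this run
update = Node("sgd_step", [main_loss])

nodes = [x, main_loss, aux_loss, update]
kept = dead_code_eliminate(nodes, [update])
assert all(n.op != "kl_div" for n in kept)  # aux_loss subgraph is dropped
```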

But there are a few other reasons.

One is that I imagined the backward pass would be written differently than it is now. Today it is static: a built-in procedure that the framework performs. But in fact it can be dynamic and framework-agnostic. The optimizer (or optimizers) can request the gradient (or any other function) of whatever parameters it cares about and use it. The graph compiler would then be responsible for deduplicating the graph to get back to an efficient implementation; a sketch of that follows.
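
Here is a hedged sketch of that deduplication, with an invented `LazyGraph` API: two optimizers request the same gradient independently, and the "compiler" (here just a memo table doing common-subexpression elimination) hands back one shared subgraph.

```python
# Invented API for illustration; a real system would differentiate the
# recorded graph rather than emit a symbolic "grad" node.

class LazyGraph:
    def __init__(self):
        self._cache = {}  # (op, args) -> node: the dedup table

    def node(self, op, *args):
        key = (op, args)
        if key not in self._cache:          # CSE: reuse identical subgraphs
            self._cache[key] = (op, args)
        return self._cache[key]

    def grad(self, output, param):
        # Record a request for d(output)/d(param); duplicates collapse.
        return self.node("grad", output, param)

g = LazyGraph()
loss = g.node("loss", "w1", "w2")

# Two optimizers independently ask for the gradient of the same parameter.
grad_a = g.grad(loss, "w1")
grad_b = g.grad(loss, "w1")
assert grad_a is grad_b  # the compiler hands back one shared subgraph
```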

A more fundamental reason is memory. A lot of memory has to be allocated in the forward pass for use in the backward pass. To make the best planning decisions, the backward pass has to be in the graph. Moreover, the graph compiler could do automatic checkpointing (i.e., choose to recompute parts of the forward pass instead of spending memory on saved activations).
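
For illustration, a minimal sketch of the kind of planning decision this enables (the greedy heuristic and all names and numbers are invented): with the backward pass in the graph, the compiler can decide per activation whether to keep it resident or recompute it, trading FLOPs for peak memory.

```python
def plan_checkpointing(activations, memory_budget_bytes):
    """activations: list of (name, size_bytes, recompute_cost); greedy plan."""
    # Prefer to keep the activations that are most expensive to recompute
    # per byte of memory they occupy.
    ranked = sorted(activations, key=lambda a: a[2] / a[1], reverse=True)
    kept, recompute, used = [], [], 0
    for name, size, cost in ranked:
        if used + size <= memory_budget_bytes:
            kept.append(name)
            used += size
        else:
            recompute.append(name)   # rematerialize in the backward pass
    return kept, recompute

acts = [("attn_scores", 512 << 20, 1.0),   # big, cheap to recompute
        ("conv1_out",   64 << 20, 8.0),    # small, expensive to recompute
        ("embed_out",   128 << 20, 0.5)]
kept, recompute = plan_checkpointing(acts, memory_budget_bytes=256 << 20)
print("keep:", kept, "recompute:", recompute)
```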

To be honest, I didn’t get around to studying XLA or TorchScript IR. Parts of this discussion seem to assume that the work can be offloaded there without further thought. I’m not yet sure how that would work. This is the case with burned-in vs. symbolic tensor shapes: I’m not sure how that affects things. If tensor shapes change, then wouldn’t memory allocation have to be replanned no matter what? That is somewhat expensive (or, I suppose, arbitrarily expensive). The best thing to do is to track the maximal shapes so that training iterations can execute with a static memory layout. This also applies to computation on some of the more exotic chip architectures.
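
As a sketch of the maximal-shapes idea (the mechanics here are my assumption, not XLA’s or TorchScript’s actual behavior): pad each batch up to the largest shape seen so far, so that once shapes stabilize, every iteration reuses one compiled program and one static memory plan.

```python
import numpy as np

class MaxShapeTracker:
    def __init__(self):
        self.max_shape = None

    def pad_to_max(self, batch):
        if self.max_shape is None:
            self.max_shape = batch.shape
        # Grow the tracked maximum; a growth triggers one replan, after
        # which smaller batches reuse the same static layout.
        self.max_shape = tuple(max(m, s)
                               for m, s in zip(self.max_shape, batch.shape))
        padded = np.zeros(self.max_shape, dtype=batch.dtype)
        padded[tuple(slice(0, s) for s in batch.shape)] = batch
        return padded, batch.shape  # keep the true shape for masking

tracker = MaxShapeTracker()
for seq_len in (12, 37, 20):                 # varying sequence lengths
    batch = np.ones((8, seq_len), dtype=np.float32)
    padded, true_shape = tracker.pad_to_max(batch)
    assert padded.shape == tracker.max_shape  # static layout after growth
```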