Lazy Tensor Core

Lazy execution has been a dream of mine for several years. Years ago I wrote a rather efficient Caffe-style framework in Java, and I dreamed about a Torch-style dynamic framework with the same performance. (I also dreamed that it would be in Java rather than the much more unwieldy C++.)

I’ve thought about the first question in this thread, of when to submit the AST for execution. I think it should mostly be at the end of each pass through the training loop, because that exposes a lot of whole-program optimization opportunities. For example, the model might compute a loss that ends up unused in a given training run; the entire subgraph that produces it could then be dropped automatically, as in the sketch below.
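
To make this concrete, here is a minimal sketch (the `Node` class and `dead_code_eliminate` are invented for illustration, not the actual Lazy Tensor Core API) of why end-of-step submission enables this: only nodes reachable from the outputs we actually ask for survive.

```python
# Hypothetical lazy graph: ops are recorded as nodes; nothing runs until
# the step ends and we ask for concrete outputs.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)

def dead_code_eliminate(graph_nodes, live_outputs):
    """Keep only nodes reachable from the outputs we actually execute."""
    live = set()
    stack = list(live_outputs)
    while stack:
        node = stack.pop()
        if id(node) in live:
            continue
        live.add(id(node))
        stack.extend(node.inputs)
    return [n for n in graph_nodes if id(n) in live]

# Build a tiny step: two losses are computed, but only one is stepped on.
x = Node("input")
main_loss = Node("cross_entropy", [x])
aux_loss = Node("kl_div", [x])        # computed but never used this run
update = Node("sgd_step", [main_loss])

nodes = [x, main_loss, aux_loss, update]
kept = dead_code_eliminate(nodes, [update])
assert all(n.op != "kl_div" for n in kept)  # aux_loss subgraph is dropped
```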

But there are a few other reasons.

One is that I imagined the backward pass would be written differently than it is now. Today it is static: a built-in procedure that the framework performs. But in fact it can be dynamic and framework-agnostic. The optimizer (or optimizers) can request the gradient (or any other function) of whatever parameters it cares about and use it. The graph compiler would then be responsible for deduplicating the graph to get back to an efficient implementation; a sketch of that follows.
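
Here is a hedged sketch of that deduplication, with an invented `LazyGraph` API: two optimizers request the same gradient independently, and the "compiler" (here just a memo table doing common-subexpression elimination) hands back one shared subgraph.

```python
# Invented API for illustration; a real system would differentiate the
# recorded graph rather than emit a symbolic "grad" node.

class LazyGraph:
    def __init__(self):
        self._cache = {}  # (op, args) -> node: the dedup table

    def node(self, op, *args):
        key = (op, args)
        if key not in self._cache:          # CSE: reuse identical subgraphs
            self._cache[key] = (op, args)
        return self._cache[key]

    def grad(self, output, param):
        # Record a request for d(output)/d(param); duplicates collapse.
        return self.node("grad", output, param)

g = LazyGraph()
loss = g.node("loss", "w1", "w2")

# Two optimizers independently ask for the gradient of the same parameter.
grad_a = g.grad(loss, "w1")
grad_b = g.grad(loss, "w1")
assert grad_a is grad_b  # the compiler hands back one shared subgraph
```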

A more fundamental reason is memory. A lot of memory has to be allocated in the forward pass for use in the backward pass. To make the best planning decisions, the backward pass has to be in the graph. Moreover, the graph compiler could do automatic checkpointing (i.e., choose to recompute parts of the forward pass instead of spending memory on saved activations).
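
For illustration, a minimal sketch of the kind of planning decision this enables (the greedy heuristic and all names and numbers are invented): with the backward pass in the graph, the compiler can decide per activation whether to keep it resident or recompute it, trading FLOPs for peak memory.

```python
def plan_checkpointing(activations, memory_budget_bytes):
    """activations: list of (name, size_bytes, recompute_cost); greedy plan."""
    # Prefer to keep the activations that are most expensive to recompute
    # per byte of memory they occupy.
    ranked = sorted(activations, key=lambda a: a[2] / a[1], reverse=True)
    kept, recompute, used = [], [], 0
    for name, size, cost in ranked:
        if used + size <= memory_budget_bytes:
            kept.append(name)
            used += size
        else:
            recompute.append(name)   # rematerialize in the backward pass
    return kept, recompute

acts = [("attn_scores", 512 << 20, 1.0),   # big, cheap to recompute
        ("conv1_out",   64 << 20, 8.0),    # small, expensive to recompute
        ("embed_out",   128 << 20, 0.5)]
kept, recompute = plan_checkpointing(acts, memory_budget_bytes=256 << 20)
print("keep:", kept, "recompute:", recompute)
```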

To be honest, I didn’t get around to studying XLA or TorchScript IR. Parts of this discussion seem to assume that the work can be offloaded there without further thought. I’m not yet sure how that would work. This is the case with burned-in vs. symbolic tensor shapes: I’m not sure how that affects things. If tensor shapes change, then wouldn’t memory allocation have to be replanned no matter what? That is somewhat expensive (or, I suppose, arbitrarily expensive). The best thing to do is to track the maximal shapes so that training iterations can execute with a static memory layout. This also applies to computation on some of the more exotic chip architectures.
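
As a sketch of the maximal-shapes idea (the mechanics here are my assumption, not XLA’s or TorchScript’s actual behavior): pad each batch up to the largest shape seen so far, so that once shapes stabilize, every iteration reuses one compiled program and one static memory plan.

```python
import numpy as np

class MaxShapeTracker:
    def __init__(self):
        self.max_shape = None

    def pad_to_max(self, batch):
        if self.max_shape is None:
            self.max_shape = batch.shape
        # Grow the tracked maximum; a growth triggers one replan, after
        # which smaller batches reuse the same static layout.
        self.max_shape = tuple(max(m, s)
                               for m, s in zip(self.max_shape, batch.shape))
        padded = np.zeros(self.max_shape, dtype=batch.dtype)
        padded[tuple(slice(0, s) for s in batch.shape)] = batch
        return padded, batch.shape  # keep the true shape for masking

tracker = MaxShapeTracker()
for seq_len in (12, 37, 20):                 # varying sequence lengths
    batch = np.ones((8, seq_len), dtype=np.float32)
    padded, true_shape = tracker.pad_to_max(batch)
    assert padded.shape == tracker.max_shape  # static layout after growth
```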