How does torch.compile work with autograd?

The sin example in the slides seems to have no memory benefit, I think.

Well, the example in the slides was just about explaining what the joint graph looks like - not about what an optimized version looks like.

I’m confused by the word “partition”. By partition, we usually mean to partition the graph into disjoint subsets.

Haha, this is true. In this case, the min-cut “partitioner” is not really partitioning the graph into two disjoint subsets - the backwards pass will be recomputing significant parts of the forwards pass. I still think of it as a “partitioning” problem because we’re given a graph with signature joint(fw_inputs, bw_inputs) => (fw_outputs, bw_outputs), and we need to return two graphs forward(fw_inputs) => (fw_outputs, activations) and backward(activations, bw_inputs) => bw_outputs.
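To make the split concrete, here is a minimal sketch in plain Python, using floats in place of tensors and the sin example from the slides. All function names (joint, forward_a, backward_b, and so on) are hypothetical and chosen for illustration; this is not what AOTAutograd actually emits. It shows two valid ways to split the same joint function: one saves cos(x) as the activation, the other saves x and recomputes cos(x) in the backward.

```python
import math

# Hypothetical joint function: forward and backward of sin in one graph.
def joint(x, grad_out):
    out = math.sin(x)
    grad_x = grad_out * math.cos(x)
    return out, grad_x

# Split A: compute cos(x) during the forward and save it as the activation.
def forward_a(x):
    return math.sin(x), (math.cos(x),)

def backward_a(activations, grad_out):
    (cos_x,) = activations
    return grad_out * cos_x

# Split B: save the input x and recompute cos(x) in the backward.
def forward_b(x):
    return math.sin(x), (x,)

def backward_b(activations, grad_out):
    (x,) = activations
    return grad_out * math.cos(x)

# Running forward then backward must reproduce the joint function.
def run_partition(forward, backward, x, grad_out):
    out, activations = forward(x)
    return out, backward(activations, grad_out)
```

Both splits save exactly one value between forward and backward, which is why the sin example shows no memory difference. In split B, cos(x) is computed in the backward rather than the forward, so the two graphs are not a disjoint division of the joint graph's work as originally traced.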

I used to think AOT Autograd could discover something like an optimized sigmoid (e.g. users write eager code z = 1 / (1 + torch.exp(-x)), and we can figure out the smart backward as z * (1 - z)). Now I understand what AOT Autograd can actually do.

It’s possible we could make these kinds of decisions automatically in AOTAutograd (we have all the information), but I’m not totally sure this is even the right thing to do. In this case, we would just recompute sigmoid forwards in the backwards pass, which I think will be just as efficient.
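As a rough illustration of the two options, here is a sketch in plain Python with floats standing in for tensors. The function names are hypothetical, and backward_traced is a hand-written chain rule through the ops of 1 / (1 + exp(-x)), assumed to resemble what tracing the eager code op by op would give; it is not actual AOTAutograd output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# "Smart" backward: reuses the saved forward output z.
def backward_from_output(z, grad_out):
    return grad_out * z * (1.0 - z)

# Op-by-op backward: saves only x and recomputes the forward
# intermediates (exp, add) before applying the chain rule.
def backward_traced(x, grad_out):
    e = math.exp(-x)              # recomputed forward op
    d = 1.0 + e                   # recomputed forward op
    grad_d = grad_out * (-1.0 / (d * d))   # backward of reciprocal
    grad_e = grad_d                        # backward of add
    grad_neg_x = grad_e * e                # backward of exp
    return -grad_neg_x                     # backward of negation
```

Both functions return the same gradient. The second one pays for a few extra pointwise operations in the backward instead of relying on a hand-derived formula.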