Min-cut optimal(*) recomputation (i.e. activation checkpointing) with AOTAutograd

kelayamatoz · February 13, 2023, 7:46pm

This is a very cool feature! Formulating this problem as a min-cut is quite neat!

Quick question: Is there a way to configure recomputable_ops based on their compute / data bandwidth ratio? It looks to me that the current recomputable_ops list is hard-coded based on the types of operators, which could miss some optimization opportunities.

For example, let’s say that we are looking at A @ B with two configurations:
Config. 1: A.shape = (100, 100), B.shape = (100, 100), Dim_M = Dim_K = Dim_N = 100
Config. 2: A.shape = (100, 2), B.shape = (2, 2), Dim_M = 100, Dim_K = 2, Dim_N = 2

Config 1 and 2 are both MatMuls and should be unrecomputable_ops. However, Config 2 makes compute / memory footprint of A @ B look more like an element-wise operator in the sense that ops per Dim_M is very low (Dim_K and Dim_N << Dim_M). Therefore, Config 2 actually makes A @ B a good candidate to recompute.

If we want to make Min-cut recomputation shape-aware, what in your view could be a good way to approach this?

Thanks!

Topic		Replies	Views
Compiling the optimizer with PT2 compiler	8	3569	January 29, 2024
How does torch.compile work with autograd?	13	3656	November 21, 2023
Performance Comparison between Torch.Compile and APEX optimizers compiler	1	2033	May 1, 2024
Keeping PyTorch's Ops Maintainable: The Jiterator hardware-backends	7	1771	February 27, 2023
Lazy Tensor Core hardware-backends	20	7514	July 12, 2022

Min-cut optimal(*) recomputation (i.e. activation checkpointing) with AOTAutograd

Related topics