I shared a version of these results with PyTorch developers inside Facebook in December, but I wanted to repost it to devdiscuss now that we have this new forum for this type of content. Chillee already posted a followup to the internal version of this post that found: 1) TorchScript closes a lot of the overhead related performance gap, 2) nn.Module
is to blame for much of the overheads, and 3) PyTorch 0.4 had lower overheads.
This post is exploring GPU overheads, pointwise fusion, and an asymmetric fused Strassen algorithm. I show almost 2x speedups from overhead reduction, up to 1.24x speedups from bias/ReLU fusion, and show that a single layer of Strassen is up to a 1.53x win for batch_size>=64 and hidden_size>=1024.
The benchmark and measurement setup
This post measures GPU inference a textbook dense network defined by:
from torch.nn import Linear, ReLU, Sequential
model = []
for _ in range(num_layers):
model.append(Linear(hidden_size, hidden_size))
model.append(ReLU(inplace=True))
model = Sequential(*model)
I measure with torch.no_grad()
enabled running GPU inference on an NVIDIA GTX 1070. I ran each test 5000 times and report the median to reduce noise and show relative performance to PyTorch 1.7.1, where 1.0x is matching the speed of PyTorch and higher is better.
Overhead reduction and codegen
Prior results on CPU have shown huge speedups on small models through overhead reduction. I expected overhead to be less of an issue on GPU because it is possible to overlap host overheads with device compute. It turns out I was wrong. Overhead is an even bigger problem on GPU!
To estimate possible speedups from overhead reduction, I reimplemented this model using PyCUDA. I preallocated the GPU buffers and implemented each layer as 3 CUDA kernels:

cublasSgemm()
(PyTorch also uses cuBLAS)  A naive CUDA kernel to add bias inplace
 A naive CUDA kernel to run ReLU inplace
The code is here and results are here. The speedups are pretty huge. It is faster in all cases and 1.8x faster in the best case. On the 8layer network there is a bit of a divide at hidden_size 1024. Below 1024 the average speedup is 1.73x, while above 1024 the average speedup is 1.05x.
Even though we have seen huge overheads problems before, these results are still surprising to me. I am driving the GPU from unoptimized Python, so it shouldnâ€™t be possible to get such a big speedup this way. The speedups from this naive rewrite are bigger than the speedups in the remainder of this post. This is theoretically achievable in eager mode with a memory allocator that does dynamic buffer reuse.
Pointwise fusion
Next up, I tried combining the bias and ReLU into a single kernel. This is just a one line change to the previous code. Again, results are here.
On CPU this yielded tiny singledigit speedups, but on GPU this helps a lot more. The incremental speedup is up to 1.14x on 1layer and up to 1.24x on 8layers. Similar to overhead reduction, on the 8layer network there is a bit of a divide at hidden_size=1024. Below 1024 the average incremental speedup is 1.22x, while above 1024 the average incremental speedup is only 1.01x. Combining the speedups here with overhead reduction we are now well over a 2x average speedup over PyTorch for smaller sizes below 1024 on the 8layer network.
Asymmetric Fused Strassen
For the final experiment implemented a fused version of Strassen algorithm implemented in Python+PyCUDA. This algorithm uses an asymmetric fusion where I combine parts of the matrix multiple from one layer with parts of the matrix multiply from the next layer.
For a 4layer network there will be 9 CUDA kernels:
 layer1 Strassen prefix (pointwise)
 layer1
cublasSgemmBatched()
(sizes7 x [(batch_size/2) x (hidden_size/2) x (hidden_size/2)]
)  Pointwise fusion: layer1 Strassen suffix, layer1 bias/activation, layer2 Strassen prefix
 layer2
cublasSgemmBatched()
 Pointwise fusion: layer2 Strassen suffix, layer2 bias/activation, layer3 Strassen prefix
 layer3
cublasSgemmBatched()
 Pointwise fusion: layer3 Strassen suffix, layer3 bias/activation, layer4 Strassen prefix
 layer4
cublasSgemmBatched()
 Pointwise fusion: layer4 Strassen suffix, layer4 bias/activation
The data transferred between the CUDA kernels is shape [7, batch_size/2, hidden_size/2]
representing the 7 intermediate buffers used by Strassenâ€™s algorithm. This requires some preprocessing to preformat the weight tensors in this format. This code doesnâ€™t require any fancy codegen (just pointwise), though would rely rather heavily on graphlevel transformations to do the asymmetric fusion of the matrix multiply. This code requires an even batch size and hidden size, but could be extended to support odd numbers.
See the asymmetric fused Strassen code and again the results are here. Compared to PyTorch, fused Strassen is faster most of the time â€“ however for the smaller sizes this is all due to overhead reduction. If you compare to the pointwise fusion above, fused Strassen is an increment improvement only when batch_size>=64
and hidden_size>=1024
. The speedup is large, for those sizes an average of 1.3x on the 8layer network and up to 1.53x. It should also be possible to extend this to training by having the fused kernels have secondary outputs to be used in the backwards pass.