Overhead in `nn.Module` causing massive slowdowns compared to raw cuBLAS or TorchScript

@jansel recently found this interesting benchmark (on Colab!), which consists of 64 repeated linear layers with a batch size of 128 and a hidden size of 256. Note that although these tensors certainly aren’t massive, they’re not tiny either.
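For concreteness, a minimal reconstruction of the benchmark's shape might look like the following (this is my sketch of the setup described above, not the actual benchmark script):

```python
import torch
import torch.nn as nn

# 64 stacked linear layers, hidden size 256 (a hypothetical reconstruction
# of the benchmark shape; the real script also times cuBLAS and TorchScript).
layers = nn.Sequential(*[nn.Linear(256, 256) for _ in range(64)])

# Batch size 128.
x = torch.randn(128, 256)

with torch.no_grad():
    out = layers(x)
```

Each forward pass here goes through `nn.Module.__call__` 64 times, so any per-module Python overhead is multiplied by the depth of the stack.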

Although it’s surprising that raw cuBLAS is nearly 2x(!) faster than PyTorch here, it’s perhaps even more shocking that TorchScript is also 74% faster. There are no TorchScript optimizations going on; TorchScript simply seems to have significantly lower overhead.

@zdevito suggested that the performance differences may be due to `nn.Module` overhead and proposed some optimizations, which led to this (prototype) PR: https://github.com/pytorch/pytorch/pull/50431

There are a couple of shocking things here. First, PyTorch eager overhead (on this benchmark) has crept up massively, from 0.4 to 1.7. Second, we can recover a very significant part of this overhead by making just a couple of changes.

When I originally ran these benchmarks, we executed 218 Python opcodes per linear layer. One source of overhead was `__torch_function__` handling; this was reduced significantly by a PR (already on master) from @robieta, bringing the count down to 158 opcodes per layer. The other major sources of overhead are 1. hook checking (which accounts for about 55 opcodes), and 2. attribute lookup that checks `__dict__` and then falls back to `__getattr__` (another 36 opcodes).
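To make the opcode numbers concrete, here is one way to count opcodes per call and to see why the `__getattr__` fallback costs extra. This is my own illustration using `sys.settrace` with per-opcode tracing (available since Python 3.7); the `WithFallback` class is a hypothetical stand-in mimicking how `nn.Module` stores parameters outside the instance `__dict__`:

```python
import sys

def count_opcodes(fn):
    """Count Python bytecode instructions executed while running fn()."""
    counts = {"n": 0}

    def tracer(frame, event, arg):
        frame.f_trace_opcodes = True  # request per-opcode events for this frame
        if event == "opcode":
            counts["n"] += 1
        return tracer

    sys.settrace(tracer)
    try:
        fn()
    finally:
        sys.settrace(None)
    return counts["n"]

class Plain:
    def __init__(self):
        self.weight = 1  # lives in the instance __dict__: fast lookup

class WithFallback:
    """Toy analogue of nn.Module: parameters live in a side dict,
    so attribute access misses __dict__ and falls into __getattr__."""
    def __init__(self):
        object.__setattr__(self, "_params", {"weight": 1})

    def __getattr__(self, name):
        params = object.__getattribute__(self, "_params")
        if name in params:
            return params[name]
        raise AttributeError(name)

p, w = Plain(), WithFallback()
direct = count_opcodes(lambda: p.weight)
fallback = count_opcodes(lambda: w.weight)
# fallback executes strictly more opcodes than direct, because the
# __getattr__ frame is interpreted bytecode too
```

The exact counts depend on the Python version, but the gap between the two paths is the kind of per-access overhead being described.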

If we can find a way to make these paths fast, we can probably eliminate a significant chunk of PyTorch’s `nn.Module` overhead. I plan on exploring some of the suggestions in the linked PR, but it’s not yet clear what the right approach is.
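As one purely illustrative direction (not necessarily what the linked PR does), the hook-checking cost could be cheapened by caching a single "are any hooks registered?" flag that is updated at registration time, so the hot `__call__` path tests one boolean instead of inspecting several dicts on every forward:

```python
class FastCallModule:
    """Toy module sketching a cached fast path for hook checks.
    All names here are hypothetical, not PyTorch APIs."""

    def __init__(self):
        self._forward_pre_hooks = {}
        self._forward_hooks = {}
        self._has_hooks = False  # recomputed whenever hooks are registered

    def register_forward_hook(self, hook):
        self._forward_hooks[id(hook)] = hook
        self._has_hooks = True

    def forward(self, x):
        return x + 1

    def __call__(self, x):
        if not self._has_hooks:
            # fast path: one attribute load + one bool check
            return self.forward(x)
        # slow path: full hook machinery
        for hook in self._forward_pre_hooks.values():
            x = hook(self, x) or x
        out = self.forward(x)
        for hook in self._forward_hooks.values():
            out = hook(self, x, out) or out
        return out

m = FastCallModule()
```

The trade-off is that every hook (un)registration site must keep the cached flag coherent, which is exactly the kind of invariant that makes these "simple" fast paths tricky to land.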