The reason batch_first defaults to False is that LSTMs and RNNs used to be the norm in sequence-to-sequence modeling, and when Transformers came along, nn.MultiheadAttention was written to match the input/output format of nn.LSTM/nn.RNN.
The reason nn.LSTM/nn.RNN defaulted to (seq, batch, size) is that cuDNN's LSTM and RNN kernels used that layout. It was easier to write high-performance LSTM/RNN kernels when the outer dimension was sequence length, so cuDNN defaulted to that layout, and PyTorch adopted the same default to leverage those kernels.
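To make the two layouts concrete, here is a minimal sketch (shapes and hyperparameters are just illustrative) showing that both nn.LSTM and nn.MultiheadAttention expect (seq, batch, embed) by default, and that passing batch_first=True switches them to (batch, seq, embed):

```python
import torch
import torch.nn as nn

seq_len, batch, embed_dim = 10, 4, 32

# Default layout: (seq_len, batch, embed_dim)
x_seq_first = torch.randn(seq_len, batch, embed_dim)

lstm = nn.LSTM(input_size=embed_dim, hidden_size=embed_dim)  # batch_first=False by default
mha = nn.MultiheadAttention(embed_dim, num_heads=4)          # batch_first=False by default

lstm_out, _ = lstm(x_seq_first)                              # -> (seq_len, batch, embed_dim)
attn_out, _ = mha(x_seq_first, x_seq_first, x_seq_first)     # -> (seq_len, batch, embed_dim)

# With batch_first=True, both modules take (batch, seq_len, embed_dim) instead
x_batch_first = x_seq_first.transpose(0, 1)
mha_bf = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
out_bf, _ = mha_bf(x_batch_first, x_batch_first, x_batch_first)  # -> (batch, seq_len, embed_dim)
```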
So, in summary, there is no good reason why nn.MHA defaults to batch_first=False other than that it tried to match nn.LSTM, which in turn tried to use cuDNN optimally.