The reason batch_first defaults to False is that LSTMs and RNNs used to be the norm in sequence-to-sequence modeling, and when Transformers came along, nn.MultiheadAttention was written to match the input/output format of nn.LSTM/nn.RNN.
The reason nn.LSTM/nn.RNN defaulted to (seq, batch, size) is that cuDNN's LSTM and RNN kernels used that layout. It was easier to write high-performance LSTM/RNN kernels when the outer dimension was sequence length, so cuDNN defaulted to that layout, and PyTorch adopted the same default to leverage those kernels.
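To make the two layouts concrete, here is a minimal sketch (shapes and hyperparameters are just illustrative) showing that both nn.LSTM and nn.MultiheadAttention expect (seq, batch, embed) by default, and that passing batch_first=True switches them to (batch, seq, embed):

```python
import torch
import torch.nn as nn

seq_len, batch, embed_dim = 10, 4, 32

# Default layout: (seq_len, batch, embed_dim)
x_seq_first = torch.randn(seq_len, batch, embed_dim)

lstm = nn.LSTM(input_size=embed_dim, hidden_size=embed_dim)  # batch_first=False by default
mha = nn.MultiheadAttention(embed_dim, num_heads=4)          # batch_first=False by default

lstm_out, _ = lstm(x_seq_first)                              # -> (seq_len, batch, embed_dim)
attn_out, _ = mha(x_seq_first, x_seq_first, x_seq_first)     # -> (seq_len, batch, embed_dim)

# With batch_first=True, both modules take (batch, seq_len, embed_dim) instead
x_batch_first = x_seq_first.transpose(0, 1)
mha_bf = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
out_bf, _ = mha_bf(x_batch_first, x_batch_first, x_batch_first)  # -> (batch, seq_len, embed_dim)
```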
So, in summary, there is no good reason why nn.MHA defaults to batch_first=False other than that it tried to match nn.LSTM, which in turn tried to use cuDNN optimally.