We recently found a new issue, also filed as PyTorch Issue 54040, in which TorchScript's autodiff does not respect a tensor's requires_grad
setting when calculating gradients, so gradients that are never used are computed unnecessarily. The summary was updated. We saw this, in particular, on the mask applied to multihead attention in NLP networks.
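For context, here is a minimal sketch of the pattern involved, not a reproduction of the issue itself. The function name `masked_softmax`, the tensor shapes, and the `weights` tensor are all made up for illustration. It only shows the expected semantics: an additive attention mask is created with the default `requires_grad=False`, so no gradient should ever be needed for it. The wasted work described above would happen inside the scripted backward graph and is not visible through `.grad`.

```python
import torch


def masked_softmax(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Additive attention mask followed by a softmax over the key dimension.
    return torch.softmax(scores + mask, dim=-1)


torch.manual_seed(0)

# Hypothetical shapes: (batch, query, key). Only `scores` asks for a gradient.
scores = torch.randn(2, 4, 4, requires_grad=True)
mask = torch.zeros(4, 4)   # requires_grad defaults to False
mask[:, -1] = -1e9         # block attention to the last key position
weights = torch.randn(2, 4, 4)  # arbitrary weights so the loss is not constant

# Eager mode: autograd skips the mask because it does not require grad.
eager_out = masked_softmax(scores, mask)
(eager_out * weights).sum().backward()
eager_grad = scores.grad.clone()

# Scripted version of the same function. It is run a few times because the
# TorchScript executor may profile the first calls before optimizing.
scripted = torch.jit.script(masked_softmax)
for _ in range(3):
    scores.grad = None
    scripted_out = scripted(scores, mask)
    (scripted_out * weights).sum().backward()

print("mask.requires_grad:", mask.requires_grad)
print("mask.grad is None:", mask.grad is None)
print("scripted grad matches eager:", torch.allclose(scores.grad, eager_grad))
```

In both modes the user-visible result is the same, so a check like this cannot detect the problem. Spotting it would need a look at the backward graph or a profile of the backward pass.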