Summary of Issues Found in Adapting Models to use TorchScript

We recently found a new issue, also filed as PyTorch Issue 54040, in which TorchScript's autodiff does not respect a tensor's requires_grad setting when calculating gradients, so gradients that are never used are calculated unnecessarily. We saw this, in particular, on the mask applied to multihead attention in NLP networks. The summary has been updated accordingly.
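
For readers unfamiliar with the setup, here is a minimal sketch of the kind of pattern involved. It is not the reproducer from the issue: the function name `masked_scores`, the shapes, and the causal mask are all made up for illustration. It only shows a scripted function that takes one input that needs a gradient and a mask that does not. Whether the extra gradient work for the mask actually happens will depend on the PyTorch version and executor settings, and this snippet does not measure that.

```python
import torch


def masked_scores(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Additive attention mask followed by softmax, similar in spirit to
    # what multihead attention does (hypothetical helper, not PyTorch's code).
    return torch.softmax(scores + mask, dim=-1)


# Compile the function with TorchScript.
scripted = torch.jit.script(masked_scores)

# Attention scores are produced by trainable layers, so they need gradients.
scores = torch.randn(2, 4, 4, requires_grad=True)

# Causal mask: -inf above the diagonal, 0 elsewhere. It is a constant,
# so requires_grad stays at its default of False.
mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)

out = scripted(scores, mask)
out.sum().backward()

# Only `scores` should end up with a gradient; no gradient for `mask`
# is ever needed by the caller.
print(scores.grad.shape)
print(mask.requires_grad, mask.grad)
```

The point is that `mask.grad` is never requested, so any work autodiff does to compute a gradient for the mask is wasted.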
