Hello,
We have a model containing transformers, slicing, array-based indexing, concatenation and so on and, most awkwardly, a call to torch.autograd.grad(), which I’ve been trying to export to ONNX for a while. I’ve filed a few tickets about issues I ran into along the way, many of which are linked from this one:
Issue opened 11:08PM - 05 Mar 24 UTC (labels: module: onnx, triaged)
### 🐛 Describe the bug
I'm trying to export a model involving backpropagation to ONNX, and I've run into a number of issues. I can work around some of them, but this one seems a bit tough. You can see the kind of problem I've encountered if you run this Python code:
```python
import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.latent_dim = 256
        self.num_heads = 4
        self.ff_size = 1024
        self.dropout = 0.1
        self.activation = "gelu"
        self.num_layers = 4
        root_seqTransEncoderLayer = torch.nn.TransformerEncoderLayer(
            d_model=self.latent_dim,
            nhead=self.num_heads,
            dim_feedforward=self.ff_size,
            dropout=self.dropout,
            activation=self.activation,
        )
        self.root_seqTransEncoder = torch.nn.TransformerEncoder(
            root_seqTransEncoderLayer, num_layers=self.num_layers
        )

    def forward(self, inputs):
        xseq = inputs[0]
        # Make the input a fresh leaf so we can differentiate with respect to it.
        xseq = xseq.detach().requires_grad_()
        with torch.enable_grad():
            output = self.root_seqTransEncoder(xseq)
            loss = torch.sqrt(output).sum()
        # Return d(loss)/d(xseq); the backward pass is part of the traced graph.
        return torch.autograd.grad([loss], [xseq])[0]


# Freeze the parameters so that only the input requires a gradient.
mdl = Model()
for p in mdl.parameters():
    p.requires_grad_(False)

print("export model")
torch.onnx.export(
    mdl,  # export the instance whose parameters were frozen above
    [torch.randn([20, 2, 256]) ** 2],
    "modelthing.onnx",
    input_names=["xseq"],
    opset_version=17,
    output_names=["lossgrad"],
    verbose=True,
)
```
This probably overlaps with https://github.com/pytorch/pytorch/issues/120820 and https://github.com/pytorch/pytorch/issues/120822, as this is basically what I was trying to do when I found those bugs. Some things I can work around (e.g. the backward pass of a GELU unit is unsupported, but I can implement that myself using a torch.onnx API), but others look a lot harder.
This particular export fails with an error saying it is trying to insert a parameter as a constant when that parameter requires a gradient. I've turned requires_grad off for all the model's parameters though, so I think it's erroneously trying to insert an intermediate value as a constant, as it does in https://github.com/pytorch/pytorch/issues/120820 (I found that bug while trying to strip this one down). Fixing that issue will probably reveal the layer norm problem I reported in https://github.com/pytorch/pytorch/issues/120822, the missing backward pass for the GELU nonlinearity (at least that's something I can work around), and probably some other problems.
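For reference, the GELU workaround mentioned above comes down to expressing the derivative of GELU by hand. The following is only a sketch of that math in plain Python, assuming the exact erf-based form of GELU; the function names (`gelu`, `gelu_grad`, `gelu_backward`, `max_finite_difference_error`) are made up for illustration, and this is not the torch.onnx symbolic itself. It uses the identity d/dx GELU(x) = Φ(x) + x·φ(x), where Φ and φ are the standard normal CDF and PDF, and checks it against a central finite difference.

```python
import math


def gelu(x: float) -> float:
    """Exact (erf-based) GELU: x * Phi(x), Phi being the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """Analytic derivative of GELU: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def gelu_backward(grad_output: float, x: float) -> float:
    """Chain-rule step a hand-written backward pass would perform."""
    return grad_output * gelu_grad(x)


def max_finite_difference_error(points, eps: float = 1e-5) -> float:
    """Largest gap between the analytic gradient and a central difference."""
    worst = 0.0
    for x in points:
        numeric = (gelu(x + eps) - gelu(x - eps)) / (2.0 * eps)
        worst = max(worst, abs(numeric - gelu_grad(x)))
    return worst


if __name__ == "__main__":
    xs = [-3.0, -1.0, -0.1, 0.0, 0.5, 2.0]
    print("max abs error:", max_finite_difference_error(xs))
```

Every piece of the formula (erf, exp, multiply, add) is an elementwise operation, which is why this particular gap looks like one that can be worked around by hand.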
### Versions
Collecting environment information...
PyTorch version: 2.2.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: N/A
Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 10.0.130
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 536.23
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=3501
DeviceID=CPU0
Family=107
L2CacheSize=16384
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3501
Name=AMD Ryzen Threadripper PRO 3975WX 32-Cores
ProcessorType=3
Revision=12544
Versions of relevant libraries:
[pip3] functorch==2.0.0
[pip3] lovely-numpy==0.2.8
[pip3] numpy==1.24.3
[pip3] onnx==1.14.1
[pip3] onnx-graphsurgeon==0.3.27
[pip3] onnxconverter-common==1.13.0
[pip3] onnxruntime==1.15.1
[pip3] optree==0.10.0
[pip3] pytorch-lightning==1.4.2
[pip3] tf2onnx==1.16.0
[pip3] torch==2.2.1+cu118
[pip3] torch-cluster==1.6.1
[pip3] torch-fidelity==0.3.0
[pip3] torch-geometric==2.3.0
[pip3] torch-scatter==2.1.1
[pip3] torch-sparse==0.6.17
[pip3] torch-spline-conv==1.2.2
[pip3] torchaudio==2.2.1+cu118
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.17.1+cu118
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.6.0 hc0ea762_10 conda-forge
[conda] libblas 3.9.0 16_win64_mkl conda-forge
[conda] libcblas 3.9.0 16_win64_mkl conda-forge
[conda] liblapack 3.9.0 16_win64_mkl conda-forge
[conda] mkl 2022.1.0 h6a75c08_874 conda-forge
[conda] mkl-include 2023.2.0 intel_49496 intel
[conda] mkl-static 2023.2.0 intel_49496 intel
[conda] numpy 1.24.2 py310hd02465a_0 conda-forge
[conda] pytorch 1.12.0 py3.10_cuda11.6_cudnn8_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
I was doing this via torch.onnx.export(), as there seems to be zero support for autograd in torch.onnx.dynamo_export(). I’ve managed to brute-force it and hack together a PyTorch version in which the torch.onnx.export() pathway works. I’m actually still not 100% clear whether it was even meant to work in the first place…
I’d like to be able to do this with an official PyTorch release though, and my hacked version isn’t really suitable for contributing to the project at the moment because 1) I don’t know if people want to do further work on torch.onnx.export() anyway, and 2) some of my workarounds are pretty hacky. I could probably put together a writeup of the issues I encountered and put a branch with my workarounds on GitHub though.
There are definitely use cases for exporting models containing backprop to ONNX, e.g. diffusion models with classifier guidance, optimizing latent codes at runtime etc., so it would be good to get proper support for it one way or another.
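To make the "optimizing latent codes at runtime" case concrete, here is a minimal eager-mode sketch of the pattern. The module name `LatentRefiner`, its toy quadratic objective and its step size are all hypothetical and chosen only for illustration. It shows the same detach / enable_grad / torch.autograd.grad structure as the repro, run for a few gradient steps inside forward(). It says nothing about whether this exports; that is exactly the open question.

```python
import torch


class LatentRefiner(torch.nn.Module):
    """Hypothetical module that refines a latent by gradient descent in forward()."""

    def __init__(self, steps: int = 3, lr: float = 0.25):
        super().__init__()
        self.steps = steps
        self.lr = lr

    def objective(self, z: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Toy stand-in for a real guidance or reconstruction loss.
        return ((z - target) ** 2).sum()

    def forward(self, z: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        for _ in range(self.steps):
            # Restart from a fresh leaf so each step differentiates w.r.t. z only.
            z = z.detach().requires_grad_()
            with torch.enable_grad():
                loss = self.objective(z, target)
            (grad,) = torch.autograd.grad(loss, z)
            # Plain gradient-descent update on the latent.
            z = z - self.lr * grad
        return z.detach()


if __name__ == "__main__":
    refiner = LatentRefiner(steps=3, lr=0.25)
    refined = refiner(torch.zeros(4), torch.full((4,), 8.0))
    print(refined)
```

With this objective the gradient is 2(z - target), so a step size of 0.25 halves the distance to the target on every step.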