Debugging back-propagation results

I’m working on an OpenCL backend for PyTorch. I currently validate the standard torchvision models in forward and backward propagation. All the nets I tested work except efficientnet_bX, which gives wrong results in the backward computations.

Now, when I had similar issues with forward propagation, I copied the net and saved “checkpoints” in different places (after suspicious operations) until I found the one that generated results that didn’t match the CPU.
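This checkpoint-comparison idea can be sketched with forward hooks instead of manual copies; the helper name `compare_forward` and the tiny net are made up for illustration, and `"cpu"` stands in for whatever device string the backend registers:

```python
import torch
import torch.nn as nn

def compare_forward(model_fn, x, device):
    """Build the same net twice (same seed) on CPU and on `device`,
    record every module's output via forward hooks, and return
    (module name, max abs diff) pairs in execution order."""
    stores = ([], [])

    def make_hook(store):
        def hook(mod, inp, out):
            if isinstance(out, torch.Tensor):
                store.append((mod.__class__.__name__, out.detach().cpu()))
        return hook

    torch.manual_seed(0)
    m_cpu = model_fn()
    torch.manual_seed(0)
    m_dev = model_fn().to(device)

    handles = [m.register_forward_hook(make_hook(stores[0])) for m in m_cpu.modules()]
    handles += [m.register_forward_hook(make_hook(stores[1])) for m in m_dev.modules()]
    with torch.no_grad():
        m_cpu(x)
        m_dev(x.to(device))
    for h in handles:
        h.remove()

    return [(name, (a - b).abs().max().item())
            for (name, a), (_, b) in zip(*stores)]

# Usage: swap "cpu" for the OpenCL device to hunt the first divergence.
def tiny_net():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

for name, md in compare_forward(tiny_net, torch.randn(3, 4), "cpu"):
    print(f"{name:12s} max diff {md:.6f}")
```

The first module whose diff jumps by orders of magnitude is the prime suspect.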

I can look at the gradients generated in the parameters of layers like conv or linear, but I can’t do that for layers that don’t have parameters.

How can I do this for back-propagation, since it is all automatic? In Caffe, for example, with its static graph, I could run net.backward(from, to) and look into the intermediate layers.

Is there anything that can help me extract each and every value for the CPU and my device?

Below is an example of the runtest output, where the maximal difference and the maximal output difference are computed:

Testing  mnist_mlp
Accessing device #1:GeForce RTX 2060 SUPER on NVIDIA CUDA
    Ok       od=0.00000 md=0.00000
Testing  mnist_cnn
    Ok       od=0.00000 md=0.00000
Testing  mnist_bn
    Ok       od=0.00000 md=0.00000
Testing  alexnet
    Ok       od=0.00002 md=0.00002
Testing  resnet18
    Ok       od=0.00001 md=0.00001
Testing  vgg16
    Ok       od=0.00002 md=0.00022
squeezenet1_0 is blacklisted
Testing  densenet161
    Ok       od=0.00001 md=0.00274
Testing  inception_v3
    Ok       od=0.00002 md=0.00002
googlenet is blacklisted
Testing  shufflenet_v2_x1_0
    Ok       od=0.00002 md=0.00002
Testing  mobilenet_v2
    Ok       od=0.00002 md=0.00051
Testing  mobilenet_v3_large
    Ok       od=0.00003 md=0.00003
Testing  mobilenet_v3_small
    Ok       od=0.00003 md=0.00003
Testing  resnext50_32x4d
    Ok       od=0.00002 md=0.00711
Testing  wide_resnet50_2
    Ok       od=0.00001 md=0.00130
Testing  mnasnet1_0
    Ok       od=0.00004 md=0.00031
Testing  efficientnet_b0
    FAIL     od=0.00001 md=1.65422
Testing  efficientnet_b4
    FAIL     od=0.00001 md=0.27367
Testing  regnet_y_400mf
    Ok       od=0.00001 md=0.00259

Ok, I found the bug… mostly because I discovered that on the Intel GPU platform I had no issues, which allowed me to narrow down the case. However, the question is still very valid.

Currently the PyTorch OpenCL/dlprimitives backend passes forward/backward computation tests against the reference for the following networks:

inception_v3 (fwd only - backprop fails on cuda/cpu)
mobilenet_v3_small (fwd only - same failure on bwd on cuda) 

My admittedly very manual methods would be:

  • use retain_grad,
  • use backward hooks,
  • check individual modules,
  • check individual functions with torch.autograd.gradcheck.
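The first two items can be sketched as follows; the toy model and hook names are made up for illustration, but `retain_grad` and `register_full_backward_hook` are the actual PyTorch APIs:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 30 * 30, 10))
x = torch.randn(1, 3, 32, 32)

# 1. retain_grad: keep the gradient of a non-leaf intermediate tensor.
mid = model[0](x)          # intermediate activation (non-leaf)
mid.retain_grad()          # without this, mid.grad stays None
out = model[3](model[2](model[1](mid)))
out.sum().backward()
print(mid.grad.shape)      # gradient w.r.t. the conv output

# 2. backward hooks: observe grad_output even on parameter-free modules.
grads = {}
def save_grad(name):
    def hook(mod, grad_input, grad_output):
        grads[name] = grad_output[0].detach().clone()
    return hook

for name, mod in model.named_modules():
    if isinstance(mod, nn.ReLU):   # a layer without parameters
        mod.register_full_backward_hook(save_grad(name))

model(x).sum().backward()
print({k: tuple(v.shape) for k, v in grads.items()})
```

Dumping these tensors on both CPU and the device under test gives exactly the “checkpoints” from the forward-debugging approach, but for the backward pass.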

I do almost all my testing in Python, even when writing backend code. And of course, PyTorch comes with a fair number of tests.
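For the last item in the list above, a minimal `torch.autograd.gradcheck` sketch looks like this; `torch.sigmoid` is just an example op, and on a custom backend you would create the input tensors on that device instead of the CPU:

```python
import torch
from torch.autograd import gradcheck

# gradcheck compares analytic gradients against finite differences.
# It expects small double-precision inputs with requires_grad=True.
x = torch.randn(4, 5, dtype=torch.double, requires_grad=True)
ok = gradcheck(torch.sigmoid, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

This isolates a single function, so a backward bug in one op shows up without running a whole torchvision model.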

Best regards



That is exactly what I was looking for. I’ve read about them, and it seems they fit my needs.


You can do the same with autograd.grad(outputs, inputs, retain_graph=True).
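A minimal sketch of that call, with a made-up toy model; the intermediate activation plays the role of Caffe’s “to” layer:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(3, 4)

h = model[0](x)                      # intermediate activation
loss = model[2](model[1](h)).sum()

# Gradient of the loss w.r.t. the intermediate tensor, without touching
# .grad attributes; retain_graph=True allows further backward calls.
(g,) = torch.autograd.grad(loss, h, retain_graph=True)
print(g.shape)                       # same shape as h
```

Repeating this on CPU and on the device under test lets you diff the gradient at any point in the graph.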
