Implementing an OpenCL backend for PyTorch

Some more progress to report, along with performance numbers:

Benchmarks

All benchmarks were run on a GTX 960 (4 GB) for comparison against native CUDA speed.

Test

The test includes copying data to and from the device plus the forward computation (a rough measurement sketch is shown after the table).

| Framework      | alexnet | resnet18 | resnet50 | vgg16   | mobilenet |
|----------------|---------|----------|----------|---------|-----------|
| pytorch/cuda   | 15.253  | 38.745   | 114.348  | 169.038 | 46.110    |
| pytorch/opencl | 22.989  | 50.272   | 167.050  | 258.751 | 82.044    |
| dlprimitives   | 22.688  | 49.193   | 158.789  | 238.802 | 82.080    |
| keras/tf2-cuda | 29.104  | 74.215   | 161.704  | 158.084 | 88.851    |
| keras/plaidml  | 43.004  | 91.533   | -        | -       | 45.693    |
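
For reference, here is a minimal sketch of how such a forward-only measurement could look. The library name (`libpt_ocl.so`), the `ocl` device string, and the batch size are illustrative assumptions, not the backend's confirmed API:

```python
import time

import torch
import torchvision.models as models

# Hypothetical setup: assume the OpenCL backend is built as a shared
# library that registers an "ocl" device type when loaded.
torch.ops.load_library("libpt_ocl.so")
dev = torch.device("ocl:0")

model = models.resnet18().eval().to(dev)
batch = torch.randn(16, 3, 224, 224)  # kept on the host; the copy is timed

with torch.no_grad():
    for _ in range(5):                 # warm-up (kernel compilation etc.)
        model(batch.to(dev)).cpu()
    iters = 20
    start = time.time()
    for _ in range(iters):
        out = model(batch.to(dev))     # host->device copy + forward
        out.cpu()                      # device->host copy, forces a sync
    per_batch = (time.time() - start) / iters
print(f"forward, copies included: {per_batch * 1e3:.3f} ms/batch")
```

Swapping `dev` for `torch.device("cuda")` would give the pytorch/cuda rows of the table with the same methodology.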

Full Train

Training includes I/O to and from the device, zeroing gradients, the forward and backward passes, and the optimizer update step; Adam is used as the optimizer. A rough measurement sketch follows the notes below.

| Framework      | alexnet | resnet18 | resnet50 | vgg16    | mobilenet |
|----------------|---------|----------|----------|----------|-----------|
| pytorch/cuda   | 107.108 | 129.456  | 388.951  | N/A      | 177.434   |
| pytorch/opencl | 147.814 | 213.319  | 651.216  | N/A      | 382.590   |
| dlprimitives   | 106.033 | 198.092  | 605.541  | 1107.756 | 344.599   |
| keras/tf2-cuda | 90.005  | 183.447  | 501.362  | 550.063  | 322.416   |
| keras/plaidml  | 222.166 | 507.116  | -        | -        | 571.438   |
  • vgg16 with batch 16 failed to run due to lack of memory on pytorch.
  • Some plaidml setups were not tested due to performance/memory limitations.
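
And a corresponding sketch of one full training step as described above (zero gradients, forward, backward, Adam update, plus host/device copies). Again, the `ocl` device string and library name are assumptions for illustration:

```python
import time

import torch
import torchvision.models as models

torch.ops.load_library("libpt_ocl.so")  # same illustrative assumption as above
dev = torch.device("ocl:0")

model = models.resnet18().to(dev)
opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()

data = torch.randn(16, 3, 224, 224)    # host tensors; copies are part of the timing
labels = torch.randint(0, 1000, (16,))

def step():
    opt.zero_grad()                            # zero gradients
    loss = loss_fn(model(data.to(dev)),        # host->device copy + forward
                   labels.to(dev))
    loss.backward()                            # backward pass
    opt.step()                                 # Adam update
    return loss.detach().cpu()                 # device->host copy / sync point

for _ in range(5):                             # warm-up
    step()
iters = 20
start = time.time()
for _ in range(iters):
    step()
print(f"full train step: {(time.time() - start) / iters * 1e3:.3f} ms")
```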

Looks very nice :slight_smile:
