And now some more progress:
And performance:
## Benchmarks
All benchmarks were run on a GTX 960 (4 GB) in order to compare against native CUDA speed; lower numbers are better.
### Test

The test includes copying the data to/from the device and the forward calculations; a minimal timing sketch follows the table.
| Framework      | alexnet | resnet18 | resnet50 | vgg16   | mobilenet |
|----------------|---------|----------|----------|---------|-----------|
| pytorch/cuda   | 15.253  | 38.745   | 114.348  | 169.038 | 46.110    |
| pytorch/opencl | 22.989  | 50.272   | 167.050  | 258.751 | 82.044    |
| dlprimitives   | 22.688  | 49.193   | 158.789  | 238.802 | 82.080    |
| keras/tf2-cuda | 29.104  | 74.215   | 161.704  | 158.084 | 88.851    |
| keras/plaidml  | 43.004  | 91.533   | -        | -       | 45.693    |
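For reference, here is a minimal sketch, in PyTorch, of how this kind of forward test can be timed. The model (resnet18), batch size (16), iteration count, and the CUDA device are illustrative assumptions, not the exact configuration behind the numbers above; the OpenCL backend would use its own device type and synchronization call.

```python
import time

import torch
import torchvision.models as models

# Minimal sketch of the forward test: host->device copy, forward pass,
# device->host copy of the result. Model, batch size, and iteration
# count are illustrative assumptions, not the exact benchmark setup.
device = torch.device("cuda")
model = models.resnet18().to(device).eval()
batch = torch.randn(16, 3, 224, 224)    # input kept on the host

with torch.no_grad():
    model(batch.to(device)).cpu()       # warm-up so one-time init is not timed
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        x = batch.to(device)            # copy of data to device
        y = model(x).cpu()              # forward calculation + copy back
    torch.cuda.synchronize()
    print("ms per batch:", (time.time() - start) / iters * 1000)
```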
### Full Train
Training includes I/O to/from the device, zeroing the gradients, the forward and backward passes, and the optimizer update step. Adam is used as the optimizer. A minimal sketch of the measured step follows the notes below.
| Framework      | alexnet | resnet18 | resnet50 | vgg16    | mobilenet |
|----------------|---------|----------|----------|----------|-----------|
| pytorch/cuda   | 107.108 | 129.456  | 388.951  | N/A      | 177.434   |
| pytorch/opencl | 147.814 | 213.319  | 651.216  | N/A      | 382.590   |
| dlprimitives   | 106.033 | 198.092  | 605.541  | 1107.756 | 344.599   |
| keras/tf2-cuda | 90.005  | 183.447  | 501.362  | 550.063  | 322.416   |
| keras/plaidml  | 222.166 | 507.116  | -        | -        | 571.438   |
- vgg16 with batch size 16 failed to run due to lack of memory on pytorch.
- Some plaidml setups were not tested due to lack of performance/memory.
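And a minimal sketch of the full training step being measured, again in PyTorch. The model, batch size, learning rate, and iteration count are illustrative assumptions; only the list of measured operations (device I/O, zeroing gradients, forward, backward, Adam update) comes from the description above.

```python
import time

import torch
import torch.nn.functional as F
import torchvision.models as models

# Minimal sketch of the full-train measurement. Model, batch size,
# learning rate, and iteration count are illustrative assumptions.
device = torch.device("cuda")
model = models.resnet18().to(device).train()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

batch = torch.randn(16, 3, 224, 224)        # host-side inputs
labels = torch.randint(0, 1000, (16,))      # host-side targets

def step():
    x, t = batch.to(device), labels.to(device)  # I/O to device
    opt.zero_grad()                             # zero gradients
    loss = F.cross_entropy(model(x), t)         # forward
    loss.backward()                             # backward
    opt.step()                                  # optimizer update (Adam)

step()                                          # warm-up, not timed
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    step()
torch.cuda.synchronize()
print("ms per iteration:", (time.time() - start) / iters * 1000)
```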
Looks very nice!