OK… here's the progress:
TL;DR: I managed to run alexnet inference using an OpenCL/DLPrimitives-based pytorch backend!
Details:
- I changed the mapping of the opencl device to the PrivateUse1 dispatch key, just to be able to do anything. So far this is the only change I needed in order to start working on an out-of-tree backend. I found that, despite the suggestion to use this dispatch key, I need to have a device mapped to it, and that is impossible without modifying the pytorch sources: Set temporary opencl device to PrivateUse1 dispatch key · artyom-beilis/pytorch@eb74af1 · GitHub. The first sketch after this list illustrates the kind of change.
- Another item that is missing from the manual about out-of-tree backends (Extending dispatcher for a new backend in C++ — PyTorch Tutorials 1.9.1+cu102 documentation) is the need to implement c10::impl::DeviceGuardImplInterface and register it via c10::impl::DeviceGuardImplRegistrar; see the second sketch after this list.
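To illustrate the first point, here is a hypothetical sketch of the device-to-dispatch-key remapping. The helper name and the surrounding switch are mine, not the literal diff from the commit above; only the OPENCL case reflects the actual change.

```cpp
#include <c10/core/DeviceType.h>
#include <c10/core/DispatchKey.h>
#include <c10/util/Exception.h>

// Hypothetical helper, not actual pytorch code: translate a DeviceType
// to the DispatchKey its kernels are registered under.
c10::DispatchKey deviceTypeToDispatchKey(c10::DeviceType t) {
  switch (t) {
    case c10::DeviceType::CPU:
      return c10::DispatchKey::CPU;
    case c10::DeviceType::CUDA:
      return c10::DispatchKey::CUDA;
    case c10::DeviceType::OPENCL:
      // The essence of the one-line change: keep the existing opencl
      // device type but route it to the PrivateUse1 dispatch key that
      // is reserved for out-of-tree backends.
      return c10::DispatchKey::PrivateUse1;
    default:
      TORCH_CHECK(false, "unsupported device type ", t);
  }
}
```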
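And for the second point, a minimal sketch of the guard implementation; the class name, the single-device assumption, and the single-stream simplification are mine, not taken from pytorch_dlprim:

```cpp
#include <c10/core/impl/DeviceGuardImplInterface.h>

struct OpenCLGuardImpl final : public c10::impl::DeviceGuardImplInterface {
  // In this setup the existing opencl device type is kept; only its
  // dispatch key is remapped to PrivateUse1.
  c10::DeviceType type() const override {
    return c10::DeviceType::OPENCL;
  }
  c10::Device exchangeDevice(c10::Device d) const override {
    c10::Device old = getDevice();
    setDevice(d);
    return old;
  }
  c10::Device getDevice() const override {
    return c10::Device(c10::DeviceType::OPENCL, current_device_);
  }
  void setDevice(c10::Device d) const override {
    current_device_ = d.index(); // a real backend switches the active OpenCL context here
  }
  void uncheckedSetDevice(c10::Device d) const noexcept override {
    current_device_ = d.index();
  }
  c10::Stream getStream(c10::Device d) const noexcept override {
    return c10::Stream(c10::Stream::DEFAULT, d); // single default queue per device
  }
  c10::Stream exchangeStream(c10::Stream s) const noexcept override {
    return s; // no stream switching in this sketch
  }
  c10::DeviceIndex deviceCount() const noexcept override {
    return 1; // a real backend queries the OpenCL platform
  }

 private:
  mutable c10::DeviceIndex current_device_ = 0;
};

// C10_REGISTER_GUARD_IMPL wraps the c10::impl::DeviceGuardImplRegistrar
// mentioned above.
C10_REGISTER_GUARD_IMPL(OPENCL, OpenCLGuardImpl);
```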
So far I have implemented only a handful of ops, and mostly for forward computations: GitHub - artyom-beilis/pytorch_dlprim: DLPrimitives-OpenCL out of tree backend for pytorch. Still, I managed to run the computations and get correct results on a pretrained alexnet; the registration pattern is sketched right below, followed by the validation runs.
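For context, this is roughly how individual ops get wired to the PrivateUse1 key with TORCH_LIBRARY_IMPL. The relu placeholder body is mine; the real OpenCL kernels live in pytorch_dlprim:

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

at::Tensor opencl_relu(const at::Tensor& self) {
  // Placeholder only: a real implementation launches a
  // DLPrimitives/OpenCL kernel on self's device.
  return self;
}

// Register the kernel for the PrivateUse1 dispatch key so that tensors
// on the remapped opencl device dispatch to it.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("relu", opencl_relu);
}
```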
$ python validate_network.py --model alexnet --device cuda --benchmark *.ppm
cat.ppm,281,tabby,0.249897,-0.674675,-2.994513,-1.568204,-2.399394,3.196111,-3.784611,...
dog.ppm,207,golden retriever,-4.164610,-5.017385,3.193234,-6.757652,-1.752393,-1.439135,-3.598867,...
parrot.ppm,87,African grey,-0.318788,5.249665,-6.590664,-4.953464,-3.192156,2.550208,-2.042364,...
$ python validate_network.py --model alexnet --device opencl:1 --benchmark *.ppm
Accessing device #1:GeForce GTX 960 on NVIDIA CUDA
cat.ppm,281,tabby,0.249900,-0.674674,-2.994513,-1.568204,-2.399395,3.196111,-3.784608,...
dog.ppm,207,golden retriever,-4.164612,-5.017380,3.193236,-6.757651,-1.752393,-1.439135,-3.598869,...
parrot.ppm,87,African grey,-0.318790,5.249666,-6.590665,-4.953461,-3.192155,2.550208,-2.042365,...
Performance is not brilliant but not horrible either (also, this net is way too simple). GTX 960, alexnet, batch size 16, 224x224 images:
- Pytorch Cuda/CUDNN: 11.685 ms
- Pytorch OpenCL/DLPrimitives: 23.966 ms
- DLPrim - microframework: 22.401 ms
- Caffe/CuDNN: 16.1812 ms
- Caffe/OpenCL: 41.072 ms
- Caffe/OpenCL+DLPrimitives: 28.618 ms
- Keras/CuDNN: 23.341 ms
- Keras/PlaidML: 44.041 ms
Now, one of the issues that I currently have is synchronous execution, which adds a significant penalty to every operation. I need to understand some stuff there, and for that I'll open another thread, since it isn't directly related to opencl. The sketch below shows why per-op synchronization hurts.
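To make the penalty concrete, here is a small self-contained OpenCL host program (my own illustration, not code from the backend; error checking omitted for brevity). It times 100 launches of a trivial kernel with clFinish() after every launch, roughly what a fully synchronous backend does per op, versus a single clFinish() at the end:

```cpp
#include <CL/cl.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Trivial kernel; its content does not matter, only the launch overhead.
static const char* src =
    "__kernel void axpy(__global float* y, __global const float* x, float a) {"
    "  size_t i = get_global_id(0);"
    "  y[i] += a * x[i];"
    "}";

int main() {
  cl_int err;
  cl_platform_id plat;
  clGetPlatformIDs(1, &plat, nullptr);
  cl_device_id dev;
  clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
  clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
  cl_kernel k = clCreateKernel(prog, "axpy", &err);

  size_t n = 1 << 20;
  std::vector<float> host(n, 1.0f);
  cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), host.data(), &err);
  cl_mem y = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            n * sizeof(float), host.data(), &err);
  float a = 0.5f;
  clSetKernelArg(k, 0, sizeof(cl_mem), &y);
  clSetKernelArg(k, 1, sizeof(cl_mem), &x);
  clSetKernelArg(k, 2, sizeof(float), &a);

  auto run = [&](bool sync_every_op) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; i++) {
      clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
      if (sync_every_op)
        clFinish(q); // per-op host/device round trip -- the penalty
    }
    clFinish(q); // one final sync either way
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
  };

  std::printf("sync after every op: %8.3f ms\n", run(true));
  std::printf("sync once at end:    %8.3f ms\n", run(false));
  return 0;
}
```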