Implementing OpenCL backend for pytorch

Small update: I implemented GPU memory caching and asynchronous execution, and performance is now virtually identical to my standalone static-graph dlprimitives execution.
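The caching idea can be sketched roughly like this (an illustrative sketch only, not dlprimitives' actual allocator; `raw_alloc`/`raw_free` stand in for whatever driver calls a backend uses, e.g. OpenCL buffer creation/release): instead of returning freed device buffers to the driver, which is slow and may synchronize, keep them in per-size free lists and reuse them on the next allocation of the same size.

```python
# Illustrative sketch -- not the actual dlprimitives implementation.
from collections import defaultdict

class CachingAllocator:
    def __init__(self, raw_alloc, raw_free):
        self._alloc = raw_alloc          # hypothetical driver alloc call
        self._free = raw_free            # hypothetical driver free call
        self._pool = defaultdict(list)   # size -> list of cached buffers

    def allocate(self, size):
        cached = self._pool[size]
        if cached:
            return cached.pop()          # reuse: no driver call at all
        return self._alloc(size)         # cold path: real allocation

    def release(self, buf, size):
        # Cache the buffer instead of freeing it back to the driver.
        self._pool[size].append(buf)
```

Since deep-learning workloads allocate the same tensor sizes every iteration, after warm-up nearly every allocation hits the cache.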

It now works efficiently on all the GPUs I tested: AMD RX 6600 XT, NVIDIA GTX 960, and Intel HD 530.
I also fixed the PyTorch benchmark, which had accidentally omitted the copy-to-GPU time; run times on the 960 are now ~15 ms with PyTorch CUDA/cuDNN and ~22 ms with dlprimitives.
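The benchmark fix comes down to where the timer starts and stops. A generic sketch of the corrected shape (the `copy_to_gpu`/`run_model`/`sync` callables are placeholders for whatever the framework provides, not actual benchmark code from this project): start the clock before the host-to-device copy, and synchronize before reading it, since GPU execution is asynchronous.

```python
import time

def benchmark(copy_to_gpu, run_model, sync, iters=100):
    # Drain any pending async work so we time only our own iterations.
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        copy_to_gpu()   # must be inside the timed region
        run_model()     # enqueues work asynchronously
    sync()              # wait for the GPU before stopping the clock
    return (time.perf_counter() - start) / iters
```

Timing only `run_model()` without the copy (or without the final sync) is exactly the kind of error that made the original numbers look better than they were.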