CUDA loops case study: code generation vs templates

This is what I do in the dlprimitives OpenCL library: https://github.com/artyom-beilis/dlprimitives

Many parameters are provided as defines at kernel compilation time, so the kernels stay relatively simple, but I can reuse the same code for GEMM and convolution and fold activation and bias into the same kernel transparently. A rough sketch of the idea is below.
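
Here is a minimal sketch of that approach, not the actual dlprimitives source: the macro names (`HAS_BIAS`, `ACTIVATION_RELU`) and the kernel are illustrative only. The host passes options like `-DHAS_BIAS=1 -DACTIVATION_RELU` to `clBuildProgram`, so one source file yields many specialized kernels:

```c
// Hypothetical OpenCL kernel specialized via preprocessor defines
// supplied at build time by the host (names are illustrative).

#ifdef ACTIVATION_RELU
#  define ACTIVATION(x) max((x), 0.0f)
#else
#  define ACTIVATION(x) (x)
#endif

__kernel void apply_bias_activation(__global const float *in,
#if HAS_BIAS
                                    __global const float *bias,
#endif
                                    __global float *out,
                                    int n)
{
    int i = get_global_id(0);
    if (i >= n)
        return;

    float v = in[i];
#if HAS_BIAS
    v += bias[i];            // bias folded into the same kernel
#endif
    out[i] = ACTIVATION(v);  // activation folded in transparently
}
```

The same trick extends to GEMM and convolution tiles: tile sizes, whether a bias is applied, and which activation to use all become compile-time constants, so the compiler can unroll loops and drop dead branches while the kernel source stays generic.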

My long shot is to create an OpenCL backend for pytorch. However, at this point I started by adding dlprimitives support to Caffe, which I'm familiar with and which already has decent OpenCL support (but relatively poor performance). That way I can have a POC outside my mini-framework before I dive into pytorch's complicated internals.