Thanks. I’ll read the blog posts. At first glance they look very interesting (and complicated).
My first DL framework was Caffe, which I still like a lot for its highly readable, easy-to-modify C++ code (and, of course, its OpenCL support).
In any case, the dispatcher and the other technical parts are complicated as a system, but they are actually simple compared to optimized DL kernels.
For example, I hadn’t found a single open-source, general-purpose implementation of the Winograd algorithm in either CUDA or OpenCL (ROCm’s are actually binary blobs), and Intel’s are tightly coupled to the Intel architecture. I finally found a paper in 2020 that described what a GPU implementation of Winograd should look like.
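To give a feel for what the algorithm does, here is a minimal sketch of the smallest 1D Winograd case, F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. This is plain Python for illustration only, not a GPU kernel and not taken from any of the libraries mentioned above; the names `winograd_f23` and `direct_f23` are made up for this example.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap 1D correlation over a 4-element tile, using 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform: depends only on the filter, so it can be precomputed once.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2

    # Input transform: additions and subtractions only.
    v0 = d0 - d2
    v1 = d1 + d2
    v2 = d2 - d1
    v3 = d1 - d3

    # Element-wise products: the only 4 multiplications per tile.
    m0, m1, m2, m3 = v0 * u0, v1 * u1, v2 * u2, v3 * u3

    # Output transform.
    return [m0 + m1 + m2, m1 - m2 - m3]


def direct_f23(d, g):
    """Reference: the same two outputs computed directly with 6 multiplies."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

The hard part in a real kernel is not this arithmetic but tiling the 2D version, batching the element-wise products into matrix multiplies, and keeping the transforms in fast memory, which is where the hardware-specific tuning comes in.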
Even GEMM-based convolutions aren’t very good: CLBlast implements one, but its performance is very poor (and it implements only forward propagation).
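For reference, the GEMM-based approach itself is simple to state. Below is a rough sketch of forward convolution via im2col followed by a matrix multiply, assuming stride 1, no padding and a single image. It is plain Python for illustration and has nothing to do with how CLBlast actually implements it; `im2col`, `matmul` and `conv2d_gemm` are hypothetical helper names.

```python
def im2col(x, kh, kw):
    """Unfold x[C][H][W] into a matrix with C*kh*kw rows and out_h*out_w columns."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    oh, ow = h - kh + 1, w - kw + 1
    cols = []
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel row, kernel column) position.
                cols.append([x[ch][r + i][q + j] for r in range(oh) for q in range(ow)])
    return cols, oh, ow


def matmul(a, b):
    """Naive GEMM; a real implementation would call an optimized BLAS here."""
    inner, n = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(n)] for row in a]


def conv2d_gemm(x, weights):
    """weights[K][C][kh][kw] applied to x[C][H][W]; returns out[K][out_h][out_w]."""
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    cols, oh, ow = im2col(x, kh, kw)
    # Flatten each filter in the same (channel, row, column) order used by im2col.
    w_mat = [[v for ch in f for row in ch for v in row] for f in weights]
    out = matmul(w_mat, cols)
    return [[flat[r * ow:(r + 1) * ow] for r in range(oh)] for flat in out]
```

The im2col matrix duplicates every input pixel up to kh*kw times, so memory traffic rather than the GEMM itself tends to become the bottleneck unless the unfolding is fused into the kernel.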
So complexity is a relative thing.