OpenCL Backend: Broadcast/Reduce Ops

Yes, I’ve seen it. But there is a small yet critical difference.

Unlike the CUDA or CPU code, which is compiled in advance using templates, the OpenCL code is generated and compiled on demand, which makes it simpler to maintain.
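To make "generated and compiled on demand" concrete, here is a minimal sketch of the idea, not the actual backend code. The names `OPS` and `generate_kernel` are hypothetical, and it only builds the OpenCL C source string; a real backend would then hand that string to the OpenCL runtime to build, and cache the resulting program.

```python
from functools import lru_cache

# Hypothetical op table (an assumption for illustration): maps an op name
# to the OpenCL C expression that combines operands `a` and `b`.
OPS = {"add": "a + b", "mul": "a * b", "max": "fmax(a, b)"}


@lru_cache(maxsize=None)
def generate_kernel(op: str, dtype: str = "float") -> str:
    """Return OpenCL C source for an elementwise binary op.

    `y` is broadcast over `x` by indexing it modulo its size, which is only
    valid when y's shape matches the trailing dimensions of x.
    The cache means each (op, dtype) pair is generated once.
    """
    expr = OPS[op]
    return (
        f"__kernel void {op}_{dtype}(__global const {dtype} *x,\n"
        f"                     __global const {dtype} *y,\n"
        f"                     __global {dtype} *out,\n"
        f"                     ulong n, ulong y_size)\n"
        "{\n"
        "    size_t i = get_global_id(0);\n"
        "    if (i >= n) return;\n"
        f"    {dtype} a = x[i];\n"
        f"    {dtype} b = y[i % y_size];  /* broadcast y over x */\n"
        f"    out[i] = {expr};\n"
        "}\n"
    )


if __name__ == "__main__":
    # Print the generated source for a broadcast add.
    print(generate_kernel("add"))
```

Adding a new op in this scheme is one more entry in the table rather than another template instantiation, which is the maintenance benefit I mean.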

It is relevant to this discussion: