OpenCL Backend: Broadcast/Reduce Ops

One of the nice features of OpenCL is that you can generate kernels on the fly from source code. During development of multiple operators I noticed the following patterns:

  1. I need numpy-style broadcast operations
  2. I need reductions

And apparently I need lots of them. Many operations can be implemented easily via broadcast/reduce patterns: loss functions, elementwise functions (add, div), activations, mean, sum, batch normalization, etc. Lots of kernels I had written manually can actually be automated… So I did it.

Below are examples of use for MSELoss:

Forward op:

Preparation:

auto fwd_ = core::PointwiseOperationBroadcastReduce::create(ctx_,
        in,out, // input and output tensor specifications
        0,dtype_, // extra scalar parameters count and their type
        "y0 = x0 - x1; y0 = y0*y0; ", // actual calculation 
        "reduce_y0 = 0;", // reduce init
        "reduce_y0 += y0;"); // actual reduce
workspace_size_ = fwd_->workspace(); 

Execution:

float scale = cfg_.reduce == cfg_.reduce_mean ? 1.0f/a.shape().total_size() : 1.0f;
fwd_->enqueue({a,b},{y},workspace,{},{scale},{0},q);

Backward op (computing both gradients, with optional accumulation into existing gradients):

// inputs:  x0=dy, x1=a, x2=b, x3=da (existing), x4=db (existing)
// params:  w0=scale, w1=accum_0, w2=accum_1
// outputs: y0=da, y1=db
core::pointwise_operation_broadcast({dy,a,b,da,db},{da,db},{scale,accum_0,accum_1},
                                  R"xxx(
                                    y0 = 2*(x1 - x2)*x0*w0;
                                    y1 = -y0; 
                                    if(w1!=0)
                                        y0 += x3 * w1;
                                    if(w2!=0)
                                        y1 += x4 * w2;
                                    )xxx"
                                  ,e);

This makes it much simpler to implement lots of operators directly, including handling multiple types such as float, float16, bfloat16 and various integer types.

For example, use of broadcast (a minimal illustrative sketch; the tensors x, b and y and the execution context e are assumed names, not taken from the actual operator code):
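
// illustrative only: broadcast-add b onto x, writing to y; no scalar parameters
core::pointwise_operation_broadcast({x,b},{y},{},
                                  "y0 = x0 + x1;", // x1 is broadcast to x0's shape
                                  e);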

Use of reduction (for mean/sum), again as an illustrative sketch modeled on the MSELoss forward op above; the tensor specifications in_spec/out_spec and the tensors x, y are assumed names:
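
// illustrative only: sum all elements of x into y via the same broadcast/reduce builder
auto sum_ = core::PointwiseOperationBroadcastReduce::create(ctx_,
        in_spec,out_spec, // input and output tensor specifications
        0,dtype_,         // no extra scalar parameters
        "y0 = x0;",       // per-element value
        "reduce_y0 = 0;", // reduce init
        "reduce_y0 += y0;"); // actual reduce
workspace_size_ = sum_->workspace();
// for mean rather than sum, pass scale = 1.0f / x.shape().total_size() instead of 1.0f
sum_->enqueue({x},{y},workspace,{},{1.0f},{0},q);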

PyTorch uses TensorIterators for these operations (in aten/src/ATen/native, across several files); it might be worth looking at that to get a feel for what the scope might eventually be.

Yes I’ve seen it. But there is a small but critical difference.

Unlike CUDA or CPU code, which is compiled in advance using templates, the OpenCL code is generated and compiled on demand, which makes it simpler to maintain.
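
To make this concrete, here is a minimal stand-alone sketch of how a pointwise kernel can be assembled from a dtype string and a calculation snippet and compiled on demand. This is not the actual generator, just an illustration using the standard OpenCL C++ bindings; the function name and parameters are hypothetical:

// illustrative only: runtime generation + compilation of a pointwise kernel
// using the standard OpenCL C++ bindings (<CL/opencl.hpp> or <CL/cl2.hpp>)
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>
#include <string>

cl::Kernel make_pointwise_kernel(cl::Context &ctx,
                                 std::string const &dtype, // e.g. "float" ("half" would also need the fp16 pragma)
                                 std::string const &calc)  // e.g. "y0 = x0 + x1;"
{
    std::string src =
        "__kernel void pointwise(__global const " + dtype + " *a,\n"
        "                        __global const " + dtype + " *b,\n"
        "                        __global "       + dtype + " *out,\n"
        "                        ulong total)\n"
        "{\n"
        "    ulong i = get_global_id(0);\n"
        "    if(i >= total) return;\n"
        "    " + dtype + " x0 = a[i];\n"
        "    " + dtype + " x1 = b[i];\n"
        "    " + dtype + " y0;\n"
        "    " + calc + "\n"
        "    out[i] = y0;\n"
        "}\n";
    cl::Program prog(ctx, src);
    prog.build(); // compiled on demand, only for dtype/snippet combinations actually used
    return cl::Kernel(prog, "pointwise");
}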

It is relevant to this discussion:

Yes, I understand that, and having worked on an early generation of PyTorch fusers, I have a lot of sympathy for on-demand code generation. :slight_smile: It certainly would be neat not to need gigabytes of RAM just to get all CUDA kernels ready. TensorIterators would seem to show what you likely want to cover.