OpenCL Backend: Broadcast/Reduce Ops

artyom-beilis · November 4, 2021, 10:23pm

One of the nice features of OpenCL is that you can generate kernels on the fly from source code. During development of multiple operators I notices following patterns:

I need numpy style broadcast operations
I need reductions

And apparently I need lots of them. All these functions can be easily implemented via broadcast/reduce patterns: loss functions, elementwise functions (add, div), activations, mean, sum, batch normalization, etc. Lots of things I had written kernels manually I can actually automate… So I did it.

Below examples of use for MSELoss:

Forward op:

Preparation:

auto fwd_ = core::PointwiseOperationBroadcastReduce::create(ctx_,
        in,out, // input and output vectors tensor specifications
        0,dtype_, // extra scalar parameters count and their tope
        "y0 = x0 - x1; y0 = y0*y0; ", // actual calculation 
        "reduce_y0 = 0;", // reduce init
        "reduce_y0 += y0;"); // actual reduce
workspace_size_ = fwd_->workspace();

Execution

float scale = cfg_.reduce == cfg_.reduce_mean ? 1.0f/a.shape().total_size() : 1.0f;
fwd_->enqueue({a,b},{y},workspace,{},{scale},{0},q);

Backward op (for both gradients with accumulation of gradient):

core::pointwise_operation_broadcast({dy,a,b,da,db},{da,db},{scale,accum_0,accum_1},
                                  R"xxx(
                                    y0 = 2*(x1 - x2)*x0*w0;
                                    y1 = -y0; 
                                    if(w1!=0)
                                        y0 += x3 * w1;
                                    if(w2!=0)
                                        y1 += x4 * w2;
                                    )xxx"
                                  ,e);

It makes it much simpler to implement lots of operators directly including handling of multiple types like float, float16, bfloat16 and various integer types.

For example use of broadcast:

github.com

artyom-beilis/pytorch_dlprim/blob/master/op.cpp#L812

    
      
              return self;
          }
          
          
// {"schema": "aten::add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)", "dispatch": "True", "default": "False"}
          Tensor & add_out(const Tensor & self, const Tensor & other, const Scalar & alpha, Tensor & out)
          {
              GUARD;
              dlprim::Tensor x0=todp(self.contiguous());
              dlprim::Tensor y0=todp(out);
              double value=0;
              if(isCPUScalar(other,value)) {
                  float w0 = alpha.toDouble() * value;
                  dlprim::core::pointwise_operation({x0},{y0},{w0},
                                            "y0 = x0 + w0;",
                                            getExecutionContext(self));
              }
              else {
                  dlprim::Tensor x1=todp(other.contiguous());
                  float w0 = alpha.toDouble();
                  dlprim::core::pointwise_operation_broadcast({x0,x1},{y0},{w0},
                                            "y0 = x0 + x1 * w0;",

Use of reduction (for mean/sum):

github.com

artyom-beilis/pytorch_dlprim/blob/master/op.cpp#L1150

    
      
          dlprim::Tensor X = todp(self.contiguous());
          auto r = squeeze_dim(X.shape(),dim,keepdim);
          dlprim::Tensor Y = todp(out);
          TORCH_CHECK(r.second == Y.shape(),"Invalid output shape");
          Y.reshape(r.first);
          
          
double scale = mean ? double(Y.shape().total_size()) / double(X.shape().total_size()) : 1;
          
          
auto q = getExecutionContext(self);
          dlprim::Context ctx(q);
          auto op = dlprim::core::PointwiseOperationBroadcastReduce::create(
                  ctx,
                  {X.specs()},{Y.specs()},
                  0,dlprim::float_data,
                  "y0=x0;",
                  "reduce_y0 = 0;",
                  "reduce_y0 += y0;");
          
          
WSGuard wsg(op->workspace(),self.device());
          op->enqueue({X},{Y},wsg.ws,{},{scale},{0},q);

tom · November 5, 2021, 8:10am

PyTorch uses TensorIterators for these operations (in aten/src/ATen/native across several files), it might be worth looking at that for getting a feel of what the scope eventually might be.

artyom-beilis · November 5, 2021, 2:30pm

Yes I’ve seen it. But there is a small but critical difference.

Unlike cuda or CPU code that is compiled in-advance using templates, the OpenCL code is generated and compiled on demand. Which makes it simpler to maintain.

It is relevant to this discussion:

tom · November 5, 2021, 2:55pm

Yes, I understand that, and having worked on an early generation of PyTorch fusers, I have a lot of sympathy for on-demand code generation. It certainly would be neat to not need gigabytes of ram just to get all cuda kernels ready. TensorIterators would seem to show what you likely want to cover.

Topic		Replies	Views
Slides from Structured Kernel presentation hardware-backends	4	1577	April 8, 2021
OpenReg Dispatching Structured Ops hardware-backends	0	531	May 28, 2023
Implementing OpenCL backend for pytorch hardware-backends	14	16273	March 1, 2024
OpenCL backend dev - questions/support hardware-backends	4	302	August 29, 2024
Backend Fallbacks hardware-backends	1	3446	August 22, 2024

OpenCL Backend: Broadcast/Reduce Ops

Related topics