I correctly got self and out as OpenCL-device tensors, but other was a single-element CPU tensor:
Self:opencl:1 288
Other:cpu 1
Out:opencl:1 288
This was a surprise, since I expected to receive tensors for my device/backend only. Why wasn’t this dispatched to mul_.Scalar or some other op?
Do I need to treat a CPU tensor arriving as a parameter alongside a GPU tensor as something normal? How should I handle such a situation (besides manually checking that other is a CPU tensor of size 1)?
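For now I’m considering something like the following workaround in the Python-level wrapper of my kernel. The helper name `normalize_other` is just illustrative, not an existing API:

```python
import torch

def normalize_other(other):
    # Hypothetical helper: if `other` arrived as a 1-element CPU tensor
    # next to a device tensor, fall back to its Python scalar value so
    # the kernel can take the scalar path instead.
    if (isinstance(other, torch.Tensor)
            and other.device.type == "cpu"
            and other.numel() == 1):
        return other.item()
    return other
```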
As you note, multiplication by a scalar is handled by a different overload of mul (there is a difference between how shape-() tensors and shape-(1,) tensors are handled).
In general, I’d try to match what PyTorch does with CUDA, but I expect you are already doing that.
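To make the overload distinction concrete, here is a small sketch (on CPU tensors, for portability) that calls the two aten overloads directly. The comments about CUDA behavior reflect stock PyTorch semantics: a shape-() CPU tensor (a "wrapped number") is accepted as an operand to a GPU op, while a shape-(1,) CPU tensor mixed with a GPU tensor raises a device-mismatch error:

```python
import torch

t = torch.ones(288)             # stand-in for the opencl:1 tensor (CPU here for illustration)
wrapped = torch.tensor(2.0)     # shape () -- like a wrapped-number scalar tensor
one_elem = torch.tensor([2.0])  # shape (1,) -- a genuine 1-element tensor

# A Python number can target the Scalar overload directly:
a = torch.ops.aten.mul.Scalar(t, 2.0)

# A shape-() tensor goes through the Tensor overload; with CUDA, PyTorch
# permits this operand to stay on the CPU even when `t` is on the GPU,
# so a custom backend matching CUDA semantics must accept it too.
b = torch.ops.aten.mul.Tensor(t, wrapped)

# A shape-(1,) tensor is an ordinary tensor: with CUDA, mixing a CPU
# shape-(1,) tensor with a GPU tensor is a device-mismatch error.
c = torch.ops.aten.mul.Tensor(t, one_elem)
```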