I correctly got self and out as OpenCL-device tensors, but other was a single-element CPU tensor:
Self:opencl:1 288
Other:cpu 1
Out:opencl:1 288
This was a surprise, since I expected to receive tensors for my device/backend only. Why wasn’t this dispatched to mul_.Scalar or some other op?
Do I need to treat a CPU tensor arriving as a parameter alongside a GPU tensor as something normal? How should I handle such a situation (besides manually checking that other is a CPU tensor of size 1)?
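For now I’m considering something like the following workaround in the Python-level wrapper of my kernel. The helper name `normalize_other` is just illustrative, not an existing API:

```python
import torch

def normalize_other(other):
    # Hypothetical helper: if `other` arrived as a 1-element CPU tensor
    # next to a device tensor, fall back to its Python scalar value so
    # the kernel can take the scalar path instead.
    if (isinstance(other, torch.Tensor)
            and other.device.type == "cpu"
            and other.numel() == 1):
        return other.item()
    return other
```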
As you note, multiplication by a scalar is handled by a different overload of mul (there is a difference between how shape-() tensors and shape-(1,) tensors are handled).
In general, I’d try to match what PyTorch does with CUDA, but I expect you are already doing that.
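To make the overload distinction concrete, here is a small sketch (on CPU tensors, for portability) that calls the two aten overloads directly. The comments about CUDA behavior reflect stock PyTorch semantics: a shape-() CPU tensor (a "wrapped number") is accepted as an operand to a GPU op, while a shape-(1,) CPU tensor mixed with a GPU tensor raises a device-mismatch error:

```python
import torch

t = torch.ones(288)             # stand-in for the opencl:1 tensor (CPU here for illustration)
wrapped = torch.tensor(2.0)     # shape () -- like a wrapped-number scalar tensor
one_elem = torch.tensor([2.0])  # shape (1,) -- a genuine 1-element tensor

# A Python number can target the Scalar overload directly:
a = torch.ops.aten.mul.Scalar(t, 2.0)

# A shape-() tensor goes through the Tensor overload; with CUDA, PyTorch
# permits this operand to stay on the CPU even when `t` is on the GPU,
# so a custom backend matching CUDA semantics must accept it too.
b = torch.ops.aten.mul.Tensor(t, wrapped)

# A shape-(1,) tensor is an ordinary tensor: with CUDA, mixing a CPU
# shape-(1,) tensor with a GPU tensor is a device-mismatch error.
c = torch.ops.aten.mul.Tensor(t, one_elem)
```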