- The dispatcher will typically use the first tensor argument to decide where to dispatch to. You might find the discussion at https://github.com/pytorch/pytorch/blob/6831d8e/aten/src/ATen/native/README.md#device_guard useful.
- If you scroll down at the above link, there is a discussion of how ATen usually generates checks that all your tensors are on the same device (https://github.com/pytorch/pytorch/blob/6831d8e379392da1340a28fdb3e7e1382176d1d4/aten/src/ATen/core/op_registration/adaption.h#L48, with the calls inserted by the codegen in tools/). This would probably be the right thing to check in your custom dispatch targets, too.
- As you note, multiplication with scalars is handled by a different overload of mul (there is a difference between how shape-() tensors are handled vs. shape-(1,) tensors). In general, I'd probably try to match what PyTorch does with CUDA, but I guess you are already doing that.
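To illustrate the same-device check mentioned above, here is a minimal Python sketch of the pattern that ATen's codegen inserts (see adaption.h linked above): the first tensor argument fixes the expected device, and every later tensor argument must match it. The names `FakeTensor`, `check_and_update_common_device`, and `my_mul` are illustrative stand-ins, not the actual PyTorch API; a real custom dispatch target would of course operate on real tensors.

```python
from typing import Optional


class FakeTensor:
    """Minimal stand-in for a tensor that only carries a device string."""

    def __init__(self, device: str):
        self.device = device


def check_and_update_common_device(common_device: Optional[str],
                                   tensor: FakeTensor,
                                   method_name: str,
                                   arg_name: str) -> str:
    # The first tensor seen determines the expected device;
    # every subsequent tensor argument must be on that same device.
    if common_device is None:
        return tensor.device
    if tensor.device != common_device:
        raise RuntimeError(
            f"Expected all tensors to be on the same device, but got "
            f"{arg_name} on {tensor.device}, expected {common_device} "
            f"(while checking arguments for {method_name})")
    return common_device


def my_mul(a: FakeTensor, b: FakeTensor) -> str:
    # A custom dispatch target would run this check up front,
    # before selecting and running the device-specific kernel.
    device = None
    device = check_and_update_common_device(device, a, "my_mul", "a")
    device = check_and_update_common_device(device, b, "my_mul", "b")
    return device  # dispatch would now pick the kernel for this device
```

With this, `my_mul(FakeTensor("cuda:0"), FakeTensor("cpu"))` raises a RuntimeError with a message in the same spirit as the one PyTorch's generated checks produce.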
Best regards
Thomas