NNC walkthrough: how PyTorch ops get fused

I found one reason why NNC doesn't fuse much on CUDA with Hugging Face: some tensors live on the CPU, and even when the operation itself runs on the GPU, NNC appears to bail out if the operands are on mixed devices.
For example:

graph(%0 : Int(1, 9, strides=[9, 1], requires_grad=0, device=cuda:0),
      %1 : Long(requires_grad=0, device=cpu),
      %2 : Int(1, 9, strides=[9, 1], requires_grad=0, device=cuda:0)):
  %9 : NoneType = prim::Constant()
  %7 : bool = prim::Constant[value=0]()
  %6 : int = prim::Constant[value=4]()
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%0, %1, %3)
  %5 : Tensor = aten::mul(%4, %2)
  %10 : Tensor = aten::to(%5, %6, %7, %7, %9)
  return (%10)

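For reference, here is a minimal Python sketch that should produce a graph of roughly this shape (the function and variable names are illustrative, and the fuser toggles may not be needed on every build):

import torch

# Depending on the build, the TensorExpr (NNC) fuser may need to be enabled explicitly.
torch._C._jit_set_texpr_fuser_enabled(True)
torch._C._jit_override_can_fuse_on_gpu(True)

def f(x, step, mask):
    # x and mask live on CUDA; step is a 0-dim CPU tensor, mirroring %0, %1, %2 above.
    return ((x + step) * mask).to(torch.int64)

scripted = torch.jit.script(f)

x = torch.zeros(1, 9, dtype=torch.int32, device="cuda")
step = torch.tensor(1, dtype=torch.int64)                  # CPU operand -> mixed devices
mask = torch.ones(1, 9, dtype=torch.int32, device="cuda")

# Warm up so the profiling executor specializes the graph and runs the fuser.
for _ in range(3):
    scripted(x, step, mask)
print(torch.jit.last_executed_optimized_graph())
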
If you change input %1 to device=cuda, the TensorExprKernel graph gets all 3 instructions instead of just 2.
Adding explicit copies that move operands to the device where the operation will execute seems essentially free, since the data has to be moved anyway, and it enables fusion with the current fuser code unmodified (the manual equivalent is sketched below). Is this something you guys are considering implementing?
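As a rough illustration of the same point, doing that copy manually at the call site, continuing the sketch above, should let NNC pull the three ops into a single fusion group:

# Manually hoist the CPU operand onto the GPU so all operands share a device;
# per the observation above, the fuser can then keep add/mul/to in one TensorExprGroup.
step_cuda = step.to(x.device)
scripted2 = torch.jit.script(f)      # fresh copy so profiling is not tied to the CPU-step run
for _ in range(3):
    scripted2(x, step_cuda, mask)
print(torch.jit.last_executed_optimized_graph())  # expect a prim::TensorExprGroup covering the 3 ops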