I found one reason why NNC doesn't fuse much for CUDA with Hugging Face models: some tensors live on the CPU, and even though the operation runs on the GPU, NNC seems to bail out when the operands are on mixed devices.
e.g.:
graph(%0 : Int(1, 9, strides=[9, 1], requires_grad=0, device=cuda:0),
      %1 : Long(requires_grad=0, device=cpu),
      %2 : Int(1, 9, strides=[9, 1], requires_grad=0, device=cuda:0)):
  %9 : NoneType = prim::Constant()
  %7 : bool = prim::Constant[value=0]()
  %6 : int = prim::Constant[value=4]()
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%0, %1, %3)
  %5 : Tensor = aten::mul(%4, %2)
  %10 : Tensor = aten::to(%5, %6, %7, %7, %9)
  return (%10)
If you change input %1 to device=cuda:0, the TensorExprKernel graph contains all 3 ops instead of just 2.
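
Here is a minimal sketch of how I reproduce this, not the actual Hugging Face code. It assumes a CUDA build with the TensorExpr fuser enabled and picks shapes/dtypes to roughly match the graph above:

import torch

# make sure the TensorExpr (NNC) fuser is active on GPU; depending on the
# PyTorch version these may already be the defaults
torch._C._jit_set_texpr_fuser_enabled(True)
torch._C._jit_override_can_fuse_on_gpu(True)

def f(x, y, z):
    # add with a 0-dim tensor, mul on CUDA, then a dtype cast
    return ((x + y) * z).to(torch.int64)

scripted = torch.jit.script(f)

x = torch.ones(1, 9, dtype=torch.int32, device="cuda")
y = torch.tensor(3, dtype=torch.int64)   # 0-dim tensor left on the CPU
z = torch.ones(1, 9, dtype=torch.int32, device="cuda")

# run a few times so the profiling executor specializes and tries to fuse
for _ in range(3):
    scripted(x, y, z)

# with y on the CPU, only mul + to end up inside the TensorExprKernel;
# the add stays outside the fusion group
print(torch.jit.last_executed_optimized_graph())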
Adding explicit copies to move operands to the device where the operation will execute seems performance-neutral, since the data has to be moved anyway, and it would enable fusion with the current code unmodified. Is this something you guys are considering implementing?
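
For reference, this is the user-level workaround I can do today (building on the sketch above, just moving the operand myself); the ask is for the fuser to insert this copy automatically:

# hypothetical user-level equivalent of the proposed copy insertion:
# move the CPU operand to the GPU up front so all operands share a device
y_cuda = y.to("cuda")

for _ in range(3):
    scripted(x, y_cuda, z)

# now all three ops (add, mul, to) can land in a single TensorExprKernel
print(torch.jit.last_executed_optimized_graph())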