I found one reason why NNC doesn't fuse much for CUDA with Hugging Face models: some tensors live on the CPU, and even though the operation runs on the GPU, NNC seems to bail out when the operands are on mixed devices.
e.g.:
graph(%0 : Int(1, 9, strides=[9, 1], requires_grad=0, device=cuda:0),
      %1 : Long(requires_grad=0, device=cpu),
      %2 : Int(1, 9, strides=[9, 1], requires_grad=0, device=cuda:0)):
  %9 : NoneType = prim::Constant()
  %7 : bool = prim::Constant[value=0]()
  %6 : int = prim::Constant[value=4]()
  %3 : int = prim::Constant[value=1]()
  %4 : Tensor = aten::add(%0, %1, %3)
  %5 : Tensor = aten::mul(%4, %2)
  %10 : Tensor = aten::to(%5, %6, %7, %7, %9)
  return (%10)
If you change input %1 to device=cuda:0, the TensorExprKernel graph contains all 3 ops instead of just 2.
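
Here is a minimal sketch of how I reproduce this, not the actual Hugging Face code. It assumes a CUDA build with the TensorExpr fuser enabled and picks shapes/dtypes to roughly match the graph above:

import torch

# make sure the TensorExpr (NNC) fuser is active on GPU; depending on the
# PyTorch version these may already be the defaults
torch._C._jit_set_texpr_fuser_enabled(True)
torch._C._jit_override_can_fuse_on_gpu(True)

def f(x, y, z):
    # add with a 0-dim tensor, mul on CUDA, then a dtype cast
    return ((x + y) * z).to(torch.int64)

scripted = torch.jit.script(f)

x = torch.ones(1, 9, dtype=torch.int32, device="cuda")
y = torch.tensor(3, dtype=torch.int64)   # 0-dim tensor left on the CPU
z = torch.ones(1, 9, dtype=torch.int32, device="cuda")

# run a few times so the profiling executor specializes and tries to fuse
for _ in range(3):
    scripted(x, y, z)

# with y on the CPU, only mul + to end up inside the TensorExprKernel;
# the add stays outside the fusion group
print(torch.jit.last_executed_optimized_graph())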
Adding explicit copies to move operands to the device where the operation will execute seems performance-neutral, since the data has to be moved anyway, and it would enable fusion with the current code unmodified. Is this something you guys are considering implementing?
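
For reference, this is the user-level workaround I can do today (building on the sketch above, just moving the operand myself); the ask is for the fuser to insert this copy automatically:

# hypothetical user-level equivalent of the proposed copy insertion:
# move the CPU operand to the GPU up front so all operands share a device
y_cuda = y.to("cuda")

for _ in range(3):
    scripted(x, y_cuda, z)

# now all three ops (add, mul, to) can land in a single TensorExprKernel
print(torch.jit.last_executed_optimized_graph())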