Thanks! Now I do see the debug output. I'm on Linux, and this is what I get:
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%z.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::mul(%y.1, %y.1) # xx.py:7:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%y.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::sin(%x.1) # xx.py:6:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%x.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::mul(%b.1, %b.1) # xx.py:5:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%b.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::conv2d(%a.1, %6, %4, %3, %2, %3, %5) # xx.py:4:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
...
Investigating canFuseOnDevice, we have this:
if (device->is_cpu()) {
  // CPU fusion is only supported for single-thread.
  if (!canFuseOnCPU()) {
    return false;
  }
  if (at::get_num_threads() == 1 || texprParallelCPUEnabled()) {
    return true;
  }
  return false;
}
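As an aside: at::get_num_threads() is the intra-op thread count, which (if I'm reading it right) is the same value torch.get_num_threads() returns in Python, so you can quickly check what the fuser will see:

import torch
# CPU fusion bails out unless this is 1 (or parallel CPU fusion is enabled)
print(torch.get_num_threads())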
So CPU fusion is only enabled when running single-threaded (or when parallel CPU fusion is explicitly turned on via texprParallelCPUEnabled()). Doing export OMP_NUM_THREADS=1
did the trick. I'll try CUDA next.
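For anyone who prefers not to touch the environment, here is a rough sketch of an in-script alternative that should behave the same way, plus a check that fusion actually happened (the function and shapes are just illustrative, not my exact xx.py):

import torch

torch.set_num_threads(1)  # in-script equivalent of OMP_NUM_THREADS=1

@torch.jit.script
def f(a, w):
    b = torch.conv2d(a, w)   # conv itself is not fused
    x = b * b                # the element-wise chain below is what the TE fuser grabs
    y = torch.sin(x)
    return y * y

a = torch.randn(1, 1, 130, 130)
w = torch.randn(1, 1, 3, 3)

# the profiling executor needs a couple of warm-up runs before it specializes and fuses
for _ in range(3):
    f(a, w)

# if fusion kicked in, the optimized graph should contain a prim::TensorExprGroup node
print(f.graph_for(a, w))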
Thank you!