Thanks! Now I do see the debug output. I'm on Linux, and this is what I get:
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%z.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::mul(%y.1, %y.1) # xx.py:7:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%y.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::sin(%x.1) # xx.py:6:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%x.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::mul(%b.1, %b.1) # xx.py:5:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
[DEBUG tensorexpr_fuser.cpp:696] Considering node:%b.1 : Float(1, 1, 128, 128, strides=[16384, 16384, 128, 1], requires_grad=0, device=cpu) = aten::conv2d(%a.1, %6, %4, %3, %2, %3, %5) # xx.py:4:8
[DEBUG tensorexpr_fuser.cpp:1088] Failed cond isFusableOnDevice(node)
...
Investigating canFuseOnDevice, we have this:
if (device->is_cpu()) {
  // CPU fusion is only supported for single-thread.
  if (!canFuseOnCPU()) {
    return false;
  }
  if (at::get_num_threads() == 1 || texprParallelCPUEnabled()) {
    return true;
  }
  return false;
}
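As an aside: at::get_num_threads() is the intra-op thread count, which (if I'm reading it right) is the same value torch.get_num_threads() returns in Python, so you can quickly check what the fuser will see:

import torch
# CPU fusion bails out unless this is 1 (or parallel CPU fusion is enabled)
print(torch.get_num_threads())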
So CPU fusion is only enabled when running single-threaded (or when parallel CPU fusion is explicitly turned on via texprParallelCPUEnabled()). Doing export OMP_NUM_THREADS=1
did the trick. I'll try CUDA next.
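For anyone who prefers not to touch the environment, here is a rough sketch of an in-script alternative that should behave the same way, plus a check that fusion actually happened (the function and shapes are just illustrative, not my exact xx.py):

import torch

torch.set_num_threads(1)  # in-script equivalent of OMP_NUM_THREADS=1

@torch.jit.script
def f(a, w):
    b = torch.conv2d(a, w)   # conv itself is not fused
    x = b * b                # the element-wise chain below is what the TE fuser grabs
    y = torch.sin(x)
    return y * y

a = torch.randn(1, 1, 130, 130)
w = torch.randn(1, 1, 3, 3)

# the profiling executor needs a couple of warm-up runs before it specializes and fuses
for _ in range(3):
    f(a, w)

# if fusion kicked in, the optimized graph should contain a prim::TensorExprGroup node
print(f.graph_for(a, w))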
Thank you!