Disabling Codegen-Specific Fusions in TorchInductor for Per-Op Kernel Generation

Hi all,

I’m currently working on benchmarking TorchInductor’s optimization impact by selectively disabling various fusion and scheduling passes. So far, I have:

  1. Disabled inlining optimizations (Previous Post)
  2. Disabled scheduler fusions
  3. Disabled all registered FX-level fusion passes

With these changes, I now want to focus specifically on the C++ and Triton codegen fusions, i.e. the ones that happen during kernel generation in Inductor after the scheduler.

My goal is to:

  1. Generate one kernel per node/operation (instead of fused multi-op kernels).

  2. Understand how much these codegen-specific fusions contribute to TorchInductor’s overall performance gains, compared to other optimizations like scheduling and inlining.

Questions:

  1. How can I disable codegen fusions entirely so that TorchInductor generates one kernel per node/operation instead of fusing multiple ops? (Are there any flags?)
  2. How much do these codegen-specific fusions contribute to TorchInductor’s overall performance? (Were codegen fusions part of the “fusion” performance speedups reported in the PyTorch 2.0 paper?)

Thanks!

Shriraj

You can disable fusions by adding return False to the top of can_fuse in torch/_inductor/scheduler.py.
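If you don't want to edit the source in-tree, something like the following monkeypatch should have the same effect. This is an untested sketch that assumes Scheduler.can_fuse keeps its current name and (self, node1, node2) signature:

```python
# Untested sketch: make the scheduler reject every fusion candidate,
# equivalent to putting `return False` at the top of can_fuse.
from torch._inductor.scheduler import Scheduler

def _never_fuse(self, node1, node2):
    # Every candidate pair is rejected, so each scheduler node
    # keeps its own kernel at the scheduler level.
    return False

Scheduler.can_fuse = _never_fuse
```

Apply the patch before the first torch.compile call so the scheduler picks up the modified method.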

I’m not sure what you mean by codegen fusions versus scheduler fusions. Fusions always happen in the scheduler.

Thanks for the response! I think I may not have framed my question clearly before, so let me clarify.

I’ve already disabled scheduler fusions (e.g., via can_fuse and related heuristics) and also disabled inlining optimizations.

However, I’m noticing that even with scheduler fusions disabled, some generated kernels still contain multiple operations.

E.g., in the output_code.py file I still see a function called cpp_fused__native_batch_norm_legit_no_training_add_relu_3. This suggests that multiple ops (batch_norm, add, relu) are still being fused inside the generated C++ kernel, even though I’ve disabled all scheduler fusions.
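For reference, this is roughly how I’m producing and inspecting the generated code (a minimal sketch, not my exact benchmark; TORCH_COMPILE_DEBUG=1 dumps output_code.py under a torch_compile_debug/ directory):

```python
# Minimal repro sketch: compile a small conv + batch_norm + relu block on CPU
# and dump the generated C++ kernels for inspection.
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"  # set before importing torch/inductor

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
).eval()

compiled = torch.compile(model)  # default backend is inductor
with torch.no_grad():
    compiled(torch.randn(1, 3, 32, 32))

# The kernel names in the dumped output_code.py (e.g. cpp_fused_*) show
# which ops ended up in the same generated function.
```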

I’m trying to quantify the contribution of each optimization stage in TorchInductor, to understand its individual impact on overall performance.

So I wanted to confirm a couple of things:

  1. Are there fusions/optimizations happening in torch/_inductor/codegen/cpp.py or triton.py that are separate from scheduler fusions? Can I get down to one op per kernel?

  2. In the PyTorch 2.0 paper, were these potential fusions included when reporting the performance speedups?

Thanks!

You could try adding a call to self.flush() after this loop:

https://github.com/pytorch/pytorch/blob/57278d45f046d4f89f45d373b1af4dd56934ff24/torch/_inductor/scheduler.py#L4981

If that fixes it, then those weren’t fusions; it’s just a quirk of how the C++ backend places function boundaries to avoid creating new omp for blocks. For C++, a real fusion would produce a single loop, whereas that just puts two separate loops in one function.
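If you’d rather experiment without editing scheduler.py directly, an untested alternative with roughly the same effect is to flush the C++ backend after every node it code-gens. This assumes CppScheduling still exposes codegen_node and flush under these names:

```python
# Untested sketch: flush the C++ backend's pending kernel group after every
# scheduler node, so each node should end up in its own generated function.
from torch._inductor.codegen.cpp import CppScheduling

_orig_codegen_node = CppScheduling.codegen_node

def _codegen_node_then_flush(self, node):
    _orig_codegen_node(self, node)
    self.flush()  # emit the pending C++ function(s) immediately

CppScheduling.codegen_node = _codegen_node_then_flush
```

If the multi-op function names disappear with this in place, that would confirm they came from function-boundary batching rather than loop fusion.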