Disabling Codegen-Specific Fusions in TorchInductor for Per-Op Kernel Generation

skv66 · August 29, 2025, 10:43pm

Hi all,

I’m currently working on benchmarking TorchInductor’s optimization impact by selectively disabling various fusion and scheduling passes. So far, I have:

Disabled inlining optimizations (Previous Post)
Disabled scheduler fusions
Disabled all FX-level fusion passes registered

With these changes, I now want to focus specifically on CPP & Triton codegen fusions — the ones happening during kernel generation in Inductor after the scheduler.

My goal is to:

Generate one kernel per node/operation (instead of fused multi-op kernels).
Understand how much these codegen-specific fusions contribute to TorchInductor’s overall performance gains, compared to other optimizations like scheduling and inlining.

Questions:

How can I disable codegen fusions entirely so that TorchInductor generates one kernel per node/operation instead of fusing multiple ops? (Are there any flags?)
How much do these codegen-specific fusions contribute to TorchInductor’s overall performance? (Were codegen fusions part of the “fusion” performance speedups reported in the PyTorch 2.0 paper?)

Thanks!

Shriraj

jansel · September 4, 2025, 12:59am

You can disable fusions by adding return False to the top of:

github.com/pytorch/pytorch

torch/_inductor/choices.py

9458d1ac3


      
          def can_fuse(
              scheduler: Scheduler,
              node1: BaseSchedulerNode,
              node2: BaseSchedulerNode,
              shared_data_score: int,
          ) -> bool:
              """
              Heuristics to prevent fusion applied to both horizontal and vertical fusions.  Heuristics here should not
              be needed for correctness and tweaking them may yield additional performance.
          
              See also some related heuristics that can be changed via config:
                  - config.triton.tiling_prevents_pointwise_fusion
                  - config.triton.tiling_prevents_reduction_fusion
                  - config.aggressive_fusion (will cause this function to be called more times)
              """

I’m not sure what you mean by codegen fusions versus scheduler fusions. Fusions always happen in the scheduler.

skv66 · September 4, 2025, 2:02am

Thanks for the response! I think I may not have framed my question clearly before, so let me clarify.

I’ve already disabled scheduler fusions (e.g., via can_fuse and related heuristics) and also disabled inlining optimizations.

However, I’m noticing that even with scheduler fusions disabled, some generated kernels still contain multiple operations.

E.g. in the output_code.py file I still see a function called cpp_fused__native_batch_norm_legit_no_training_add_relu_3, this suggests that multiple ops (batch_norm, max_pool2d, relu) are still being fused inside the generated C++ kernel, even though I’ve disabled all scheduler fusions.

I’m trying to quantify the contribution of each optimization stage in TorchInductor — to understand their individual impact on TorchInductor’s performance.

So I wanted to confirm a couple of things:

Are there fusions/optimizations happening in torch/_inductor/codegen/cpp.py or triton.py that are separate from scheduler fusions? Can I get down to one OP per Kernel?
In the PyTorch 2.0 paper, were these potential fusions included when reporting the performance speedups?

Thanks!

jansel · September 4, 2025, 2:32am

You could try adding a call to self.flush() after loop:

https://github.com/pytorch/pytorch/blob/57278d45f046d4f89f45d373b1af4dd56934ff24/torch/_inductor/scheduler.py#L4981

If that fixes it then those weren’t fusions, it is just a quirk of how the C++ backend does function boundaries to avoid creating a new omp for blocks. For C++ a fusion would produce a single loop, while that would produce two.

Topic		Replies	Views
TorchInductor: a PyTorch-native Compiler with Define-by-Run IR and Symbolic Shapes compiler	46	74297	July 29, 2024
TorchInductor Update 5: CPU backend backend performance update and deep dive on key optimizations compiler	0	3477	March 9, 2023
Custom cuda extension support in Inductor compiler	8	1094	March 7, 2024
Question regarding horizontal fusion compiler	2	64	December 10, 2025
How to Access Triton Kernels from TorchInductor when running on CPU? compiler	1	969	August 12, 2024

Disabling Codegen-Specific Fusions in TorchInductor for Per-Op Kernel Generation

Related topics