Different points at which fusion occurs?

I understand that can_fuse and score_fusion are responsible for fusing buffers at the Inductor IR level. But I also see some fusions in the fx_passes that run before lowering to Inductor IR. Can someone clarify the entire flow from a fusion perspective?

Another question: is it possible to access the dict of score_fusion results, or the output of self.get_possible_fusions_with_highest_priority, i.e. a list of all possible fusions with their scores?

@jansel, any idea how I can do this?

The pattern matching in fx_passes is different from fusion and matters much less for performance. Most of what pattern matching does is graph rewrites (such as removing redundant ops), not fusing things together.

Inlining happens during lowering, which is more similar to fusions. This is controlled by buffer.realize(), where any access to a buffer that hasn’t been realized will be inlined into the consumer (possibly recomputing it in multiple consumers). An unrealized buffer is never allocated or computed on its own.
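For example, here is a minimal sketch of what inlining looks like from the user side (torch._logging.set_logs(output_code=True) just prints the generated code; whether everything lands in a single kernel depends on your PyTorch version and backend):

```python
import torch

# Print the code Inductor generates so we can see that none of the
# intermediate pointwise results is materialized as its own buffer.
torch._logging.set_logs(output_code=True)

def f(x):
    # a, b, and c are never realized, so they are inlined into the
    # consumer instead of being allocated and computed on their own
    a = x.sin()
    b = a * 2.0
    c = b + 1.0
    return c * x

compiled = torch.compile(f)
compiled(torch.randn(1024))
```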

The list of all possible fusions (ordered by score_fusion) is computed in this function:

You could call that function from a debug environment or print it out.
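For example, something along these lines (a rough sketch: get_possible_fusions and score_fusion are the method names in torch/_inductor/scheduler.py at the time of writing and may change between releases):

```python
import torch
from torch._inductor import scheduler

_orig_get_possible_fusions = scheduler.Scheduler.get_possible_fusions

def _logged_get_possible_fusions(self, *args, **kwargs):
    pairs = _orig_get_possible_fusions(self, *args, **kwargs)
    for node1, node2 in pairs:
        # score_fusion returns the tuple the scheduler uses to order candidates
        print(node1.get_name(), node2.get_name(), self.score_fusion(node1, node2))
    return pairs

scheduler.Scheduler.get_possible_fusions = _logged_get_possible_fusions

fn = torch.compile(lambda x: (x.sin() + 1).sum(-1) * x.cos().mean(-1))
fn(torch.randn(64, 256))
```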


Thanks

I was confused by the existence of a fuse_fx() function in torch/_inductor/fx_passes/pre_grad.py (pytorch/pytorch on GitHub), which includes a bunch of other fusion calls, but I guess, like you said, these are primarily just graph replacements with no performance-related impact.

The fusions in fuse_fx() are disabled by default. It only does something if you set permute_fusion=True (GPU-only, three patterns) or freezing=True (CPU-only, only one pattern). The permute_fusion stuff actually isn't even fusion; it is a set of rewrite rules to simplify cases of matmul().permute(). The freezing one is more like constant folding.
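If you want to experiment with them anyway, both are ordinary Inductor config flags (a minimal sketch; the flag names come from torch/_inductor/config.py and both default to False):

```python
import torch._inductor.config as inductor_config

# Set these before torch.compile runs so the pre-grad fx_passes see them.
inductor_config.permute_fusion = True  # GPU-only matmul()/permute() rewrite patterns
inductor_config.freezing = True        # CPU-only freezing / constant-folding-style pass
```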

Focusing on that is likely a distraction; it is inconsequential for perf unless your model has some pretty uncommon patterns in it.


Thanks

Do the pointwise ops get fused even before Inductor lowering happens? I ask because, looking at the buffers generated by Triton codegen, my manual RMSNorm implementation was already a single buffer even in the Inductor IR generated pre_fusion.

No, the opposite: before Inductor we apply decompositions to break larger ops up into more minimal pieces, to simplify lowering.
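For example, a hedged sketch of that step using make_fx with an explicit decomposition table (aten.silu is just a convenient op that has a registered decomposition in recent PyTorch versions; Inductor's real table is much larger):

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
from torch._decomp import get_decompositions

# a tiny decomposition table with a single entry, for illustration only
decomp_table = get_decompositions([torch.ops.aten.silu])

def f(x):
    return torch.nn.functional.silu(x) + 1.0

gm = make_fx(f, decomposition_table=decomp_table)(torch.randn(8))
print(gm.graph)  # silu appears broken up into sigmoid + mul, not as one op
```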

Sure, the decompositions part makes sense, but then what about these ComputedBuffers that aren't just one pointwise or reduction op, but a combination of multiple nodes?

Is this the inlining that you mentioned before?

Yes, that is the inlining / buffer.realize() thing I talked about before.
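If you want to see it directly, one way is the debug dump (a sketch; the file names like ir_pre_fusion.txt come from Inductor's debug tracing and their exact location can vary by version):

```python
import os
# must be set before compilation so Inductor's debug tracing is enabled
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch

def rms_norm(x, weight, eps=1e-6):
    variance = x.pow(2).mean(-1, keepdim=True)
    return weight * x * torch.rsqrt(variance + eps)

torch.compile(rms_norm)(torch.randn(8, 1024), torch.randn(1024))

# Look under ./torch_compile_debug/ for ir_pre_fusion.txt: the pointwise math
# is already collapsed into a small number of ComputedBuffers (the inlining
# above) before can_fuse / score_fusion run; ir_post_fusion.txt shows the
# result after the scheduler's fusion pass.
```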