For example, if I do a max operation:
```python
import torch

def fn(x):
    return torch.max(x, 1)

opt_fn = torch.compile(fn)
x = torch.randn(16, 2**20, device="cuda")
y = opt_fn(x)
```
I found that the generated Triton code uses two kernels for this reduction op: the first kernel reduces over only part of the reduction dimension, producing partial results of shape e.g. (16, 2**8), and the second kernel performs the final reduction to produce the (16, 1) result. This magic happens in code in ir.py.
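To make it concrete, here is a rough PyTorch-level sketch of what the two generated kernels effectively compute. The split factor 2**12 is purely illustrative (an assumption on my part); Inductor chooses the actual split at compile time:

```python
import torch

x = torch.randn(16, 2**20, device="cuda")

# Kernel 1: split the 2**20 reduction dim into 2**8 chunks of 2**12 each
# and reduce within each chunk, producing partial maxima.
partial = x.view(16, 2**8, 2**12).amax(dim=2)  # shape (16, 2**8)

# Kernel 2: reduce the partial results down to the final answer.
values = partial.amax(dim=1)                   # shape (16,)
```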
But could anyone help me understand why this strategy is needed? Is it because of the SMs' capacity, or for other performance considerations?
Thanks very much!