Summation bfloat16

dwidemann-lm · October 29, 2021, 3:52pm

Hi,

I’m seeing that torch’s summation of a bfloat16 tensor is more accurate than my naive summation, e.g. x.sum() is better than x[0] +…+ x[n-1]. Could somebody please point me to where the code for torch’s summation lives in the repo? Thanks!

tom · October 30, 2021, 6:18pm

sum is executed through TensorIterators, so the setup is a bit complex. This bit of code in ReduceOps.cpp plays a role.

The cause of your accuracy observation is likely that operations on 16 bit floats including bfloat16 typically use 32 bit floats as the internal computation (“accumulation”) scalar type.
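To make the accumulator effect concrete, here is a minimal sketch in plain Python that emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits (round to nearest even). The helper names `to_f32`, `to_bf16`, `sum_bf16_accumulator` and `sum_fp32_accumulator` are made up for illustration and are not PyTorch functions. It only mimics the rounding behaviour and ignores NaN/inf handling.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16: keep the top 16 bits of the float32 pattern,
    rounding to nearest even. Assumption: finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def sum_bf16_accumulator(values):
    """Naive x[0] + ... + x[n-1], rounding the running sum to bfloat16 each step."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + v)
    return acc

def sum_fp32_accumulator(values):
    """Same loop, but the running sum is kept in float32 and only the
    final result is rounded back to bfloat16."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return to_bf16(acc)

values = [to_bf16(0.01)] * 1000  # each element is 0.010009765625 in bfloat16

print(sum(values))                   # double-precision reference: 10.009765625
print(sum_bf16_accumulator(values))  # 4.0  (the running sum stops growing at 4.0)
print(sum_fp32_accumulator(values))  # 10.0 (nearest bfloat16 to the reference)
```

With a bfloat16 accumulator the running sum stalls once each addend is smaller than half the spacing between neighbouring bfloat16 values, which is why the naive loop gets stuck at 4.0 in this example.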

Best regards

Thomas

dwidemann-lm · November 1, 2021, 1:10am

Thank you for showing me that. Yes, it’s the FP32 accumulator that improves the accuracy.

Lezcano · November 4, 2021, 12:34pm

That is one part of the story. The other part is that adding lots of numbers in a stable way is not a trivial task! PyTorch implements a fairly involved algorithm based on reducing parts of the tensor and then adding the results of those reductions.
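As a rough illustration of the "reduce parts, then add the partial results" idea, here is a textbook pairwise (recursive halving) summation in plain Python. This is a swapped-in simplification, not a transcription of PyTorch's kernel, and the names `to_bf16`, `naive_sum_bf16` and `pairwise_sum_bf16` are hypothetical. To isolate the effect of summation order, every intermediate result is rounded to an emulated bfloat16, so no higher-precision accumulator is involved.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by rounding the float32 bit pattern to its top
    16 bits (round to nearest even). Assumption: finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def naive_sum_bf16(values):
    """Left-to-right sum with a bfloat16 running total."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + v)
    return acc

def pairwise_sum_bf16(values):
    """Sum each half recursively, then add the two partial sums.
    Partial sums have similar magnitude, so less is lost per addition."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    left = pairwise_sum_bf16(values[:mid])
    right = pairwise_sum_bf16(values[mid:])
    return to_bf16(left + right)

values = [to_bf16(0.01)] * 1024  # each element is 0.010009765625

print(sum(values))                # double-precision reference: 10.25
print(naive_sum_bf16(values))     # 4.0
print(pairwise_sum_bf16(values))  # 10.25
```

The exact match for the pairwise result is a special case (equal inputs, power-of-two length). In general, pairwise summation has a worst-case error bound that grows roughly with log n rather than with n as for the sequential loop.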

The code is here: SumKernel.cpp in pytorch/pytorch on GitHub, at commit bceb1db885cafa87fe8d037d8f22ae9649a1bba0.

A fantastic and up-to-date talk on the topic, covering an algorithm similar to (actually better than) the one that PyTorch currently implements, is here: a talk by Nicholas J. Higham (University of Manchester) on YouTube.

dwidemann-lm · November 4, 2021, 6:07pm

@Lezcano Thank you for sending this. I’ll check it out.
