The "Ideal" PyTorch FLOP Counter (with __torch_dispatch__)

Interesting. Do you measure memory bandwidth as well?

Some operations like convolution or GEMM are mostly FLOP-bound, but many operations are actually memory-bandwidth-bound, for example batch normalization, activations, etc. I noticed that sometimes a GPU like the 2060S, which has fewer FLOPS than a 1080, can run ResNet faster due to the large difference in memory speed.
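
To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte of memory traffic); the helper names and the simplified FLOP/byte counts are illustrative assumptions, not anything measured in the post:

```python
# Rough arithmetic-intensity sketch: FLOPs per byte of DRAM traffic.
# Counts are simplified assumptions (fp32, ideal reuse), for illustration only.

def gemm_intensity(m, n, k, bytes_per_elem=4):
    flops = 2 * m * n * k                                   # one multiply-add per inner-loop step
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem  # read A and B, write C (ideal case)
    return flops / bytes_moved

def batchnorm_intensity(numel, bytes_per_elem=4):
    flops = 2 * numel                                       # roughly a scale and a shift per element
    bytes_moved = 2 * numel * bytes_per_elem                # read input, write output
    return flops / bytes_moved

print(gemm_intensity(4096, 4096, 4096))   # ~680 FLOPs/byte -> compute-bound
print(batchnorm_intensity(4096 * 4096))   # 0.25 FLOPs/byte -> bandwidth-bound
```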

Regarding convolution, do you take into account that running, for example, a Winograd or FFT convolution can actually compute the result in fewer FLOPs than a "direct"/GEMM-based convolution?
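
For context on the Winograd point: the textbook F(2,3) algorithm computes two outputs of a 3-tap filter with 4 multiplications instead of 6 (and F(2x2, 3x3) uses 16 instead of 36). A small NumPy sketch using the standard transform matrices, just to show the classic algorithm; this is not part of the post's FLOP counter:

```python
import numpy as np

# F(2,3) 1-D Winograd: 2 outputs of a 3-tap correlation with 4 multiplies instead of 6.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                 # filter transform
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)    # output transform

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 input samples
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

winograd = A_T @ ((G @ g) * (B_T @ d))          # 4 elementwise multiplies
direct = np.array([d[0:3] @ g, d[1:4] @ g])     # 6 multiplies
print(np.allclose(winograd, direct))            # True
```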