Single-op fusion benchmarking
Following in Horace’s footsteps, I wrote a single-op “fusion” benchmark for NNC, to see how our ops perform in isolation. Obviously we’ll never generate this code “in the wild”, but lousy single-op performance could limit or reverse any wins from fusion.

The benchmark source code lives in benchmarks/cpp/tensorexpr.

(As an aside, I generally have a preference for minimalism in benchmarking; the less “framework-y” stuff there is, the fewer opportunities for surprise due to bad setup. The fanciest thing I use is timeit, but you could go really low-tech and use time.perf_counter).
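In that spirit, the whole timing harness can be a few lines of timeit. (This is an illustrative sketch, not the actual benchmark code: it times a plain Python function so it's self-contained, but the same shape works for torch ops.)

```python
import timeit

def bench(fn, *args, number=1000, repeat=5):
    # Run the op `number` times per trial, `repeat` trials; take the
    # minimum over trials to reduce noise from the OS scheduler.
    times = timeit.repeat(lambda: fn(*args), number=number, repeat=repeat)
    return min(times) / number  # seconds per call

xs = list(range(1024))
per_call = bench(sum, xs)
print(f"sum over 1024 ints: {per_call * 1e6:.2f} us/call")
```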

The bits of setup needed to conduct this benchmark are to turn on CPU fusion, remove the minimum group size, and turn off threading:
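Concretely, the setup looks something like this (a sketch: these are the torch._C knobs I believe correspond to the three settings, but the exact internal flag names vary across PyTorch versions):

```python
import torch

# 1. Turn on CPU fusion in the tensor-expression fuser.
torch._C._jit_override_can_fuse_on_cpu(True)

# 2. "Remove the minimum group size": keep single-op fusion groups
#    instead of inlining them back into the graph, so even one op
#    gets compiled by NNC.
torch._C._debug_set_fusion_group_inlining(False)

# 3. Turn off threading so we measure single-core performance.
torch.set_num_threads(1)
```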


I’m running the benchmark under numactl -C <n> to really avoid threading. (N.b.: I’m avoiding threading for now, but we need to support it in NNC for the rest of the world. Xiaoqiang is hard at work on adding this!)

numactl -C7 benchmarks/cpp/tensorexpr/bench_ops.par

Perf was initially pretty bad. I focused specifically on sigmoid because it’s very common in NNs:

op                        eager        nnc    speedup
sigmoid                   0.155      1.297       0.12

But wait, what the heck, we had fast sigmoid back in March 2020, before I went on parental leave!

A quick run with perf record shows the problem (here I edited the benchmark to run only NNC sigmoid):

% numactl -C7 perf record -g -- benchmarks/cpp/tensorexpr/bench_ops.par
% perf report
    41.20%  python3.7                  [.] __expf_finite

Oh, there’s the problem: we’re calling the un-vectorized expf from libm instead of the vectorized version from Sleef. At some point when we brought NNC into the PyTorch mainstream, we confused the build system and lost our Sleef bindings! A mere stack of 6 diffs later (#51190, if you feel like vicariously experiencing my build-related suffering :-p), Sleef is successfully bound into NNC’s kernels again, and sigmoid is fine:

op                        eager        nnc    speedup
sigmoid                   0.155      0.153       1.01

There’s still some room for improvement. Here are the latest numbers:

op                        eager        nnc    speedup
hardswish                 0.189      0.068       2.76
hardswish                 0.068      0.069       1.00
sigmoid                   0.155      0.153       1.01
reciprocal                0.071      0.072       0.99
neg                       0.034      0.035       0.98
relu                      0.035      0.034       1.02
isnan                     0.106      0.021       5.00
log                       0.108      0.259       0.42
log10                     0.139      0.270       0.51
log1p                     0.190      0.266       0.71
log2                      0.284      0.278       1.02
exp                       0.058      1.209       0.05
expm1                     0.362      0.364       0.99
erf                       0.145      0.808       0.18
erfc                      0.169      1.030       0.16
cos                       0.117      0.235       0.50
sin                       0.122      0.205       0.60
tan                       0.264      0.380       0.69
acos                      0.122      0.379       0.32
asin                      0.112      0.277       0.40
cosh                      0.428      0.399       1.07
sinh                      0.426      0.445       0.96
atan                      0.214      0.336       0.64
tanh                      0.293      0.489       0.60
sqrt                      0.043      0.067       0.63
rsqrt                     0.134      0.138       0.98
abs                       0.035      0.035       1.00
ceil                      0.035      0.034       1.04
floor                     0.035      0.035       1.00

I almost blacklisted these ops from NNC, but many of them actually provide speedups when used in a fusion group (thanks Elias for encouraging me to check that).

A good follow-up would be to extend this benchmark to cover these ops in fusion groups (which could be as simple as changing the benchmarked expression to op(x) + 1.0), and then use that data to figure out which ops to improve and/or vote off the proverbial island.
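A fusion-group variant of the sigmoid entry might look something like this (a sketch, assuming CPU fusion is enabled as in the setup above; sizes and iteration counts are illustrative, not the benchmark's actual parameters):

```python
import timeit
import torch

# Enable CPU fusion so the scripted function is compiled by NNC,
# and pin to one thread to match the single-core methodology.
torch._C._jit_override_can_fuse_on_cpu(True)
torch.set_num_threads(1)

def fused(x):
    # Two pointwise ops back to back: a minimal fusion group,
    # per the op(x) + 1.0 idea above.
    return torch.sigmoid(x) + 1.0

scripted = torch.jit.script(fused)
x = torch.rand(1 << 20)

# Warm up so the profiling runs complete and the fuser kicks in.
for _ in range(10):
    scripted(x)

eager = timeit.timeit(lambda: fused(x), number=100)
nnc = timeit.timeit(lambda: scripted(x), number=100)
print(f"eager: {eager:.3f}s   scripted: {nnc:.3f}s")
```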