NNC Per-Operator Benchmarks (on CPU)

After Mikhail landed an initial prototype of Python bindings for NNC, I thought it would be interesting to compare PyTorch single-op performance vs NNC.

For each of these pointwise ops, I benchmarked PyTorch runtime against NNC runtime across a variety of rank-2 sizes and plotted the results in a heatmap. All benchmarks were performed on CPU, and all use the out= variants of the operators, so no memory re-allocation happens inside the timed loop. Green means that NNC is faster, red means that PyTorch eager is faster, and 1.0 means that NNC and PyTorch ran at the same speed.
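A minimal sketch of the timing setup described above (the function name and loop structure are illustrative, not the actual microbenchmarks.py script): pre-allocate the output once and time only the out= call.

```python
# Illustrative sketch of per-op timing with out= variants, so that
# no allocation happens inside the timed loop. Names are made up;
# see the linked microbenchmarks.py for the real harness.
import timeit
import torch

def bench_pointwise(op, shape, iters=1000):
    a = torch.randn(shape)
    b = torch.randn(shape)
    out = torch.empty(shape)  # allocated once, reused every iteration
    with torch.no_grad():
        total = timeit.timeit(lambda: op(a, b, out=out), number=iters)
    return total / iters  # average seconds per call

t_add = bench_pointwise(torch.add, (128, 128))
```

The NNC side would be timed the same way, with the compiled kernel writing into the same pre-allocated buffer, and the heatmap cell is the ratio of the two averages.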

For single-threaded performance, NNC currently seems fairly promising for many ops, such as most binary ops as well as ReLU (where NNC is consistently and substantially faster). On the other hand, NNC substantially lags behind PyTorch on many transcendental ops. As NNC currently doesn’t support intra-op parallelism, it’s unsurprising that it tends to lag substantially behind PyTorch when it comes to multi-core performance.

These benchmarks also suggest a lot of places where we could see improvements. For example, we found out that we’re not vectorizing some intrinsics correctly.

Also, note that there are other potential gains from NNC that aren’t represented in this benchmark, such as fusing multiple pointwise ops.

Also, I didn’t do any scheduling for these operators. In my testing, it seems like LLVM already does a good enough job of vectorization and such.

Reproduction code can be found here: pytorch/microbenchmarks.py at master · pytorch/pytorch · GitHub



Thanks for sharing!
Given that you use the out= variants only, that means this is only usable for inference, right?
If so, did you properly disable all the dispatcher work related to autograd when you ran the PyTorch code?

I don’t know if I disabled all dispatcher work (I seem to remember there was some stuff deep within C++?), but I did run with torch.no_grad(): pytorch/microbenchmarks.py at master · pytorch/pytorch · GitHub

That makes sure that no graph is created, but most of the autograd logic still runs (related to view/inplace in particular).
But that would only change the fixed dispatcher overhead, so it won’t change these results much beyond making PyTorch more competitive for small sizes.
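To make the distinction above concrete, here is a small illustration (not from the benchmark script) of what torch.no_grad() does and doesn’t do: it prevents a graph from being recorded, but the call still goes through the dispatcher.

```python
# torch.no_grad() stops graph construction: the result of an op on a
# requires_grad tensor carries no grad_fn inside the context.
import torch

x = torch.randn(4, requires_grad=True)

with torch.no_grad():
    y = x * 2

# No autograd graph was recorded for y...
assert y.requires_grad is False
assert y.grad_fn is None

# ...but outside the context, the same op does record one.
z = x * 2
assert z.grad_fn is not None
```

The dispatcher-level autograd bookkeeping (e.g. view/in-place tracking) still runs in both cases, which is the fixed overhead being discussed.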

Is there a better way to reduce the autograd logic that runs? I remember there was a discussion that mentioned a C++ flag, but I can’t find it now (and don’t know if it’s exposed in python).

There is some hacky c++ way to do this.
It is not exposed to Python on purpose, as it can very easily lead to undetectable silent correctness issues.

From c++ you can do

// This can make UNRELATED gradient computations wrong!
// Only ever use that if you never use the autograd engine in this
// process!
at::AutoNonVariableTypeMode non_var_type_mode(true);