After Mikhail landed an initial prototype of Python bindings for NNC, I thought it would be interesting to compare PyTorch single-op performance vs NNC.
For each of these pointwise ops, I benchmarked PyTorch eager runtime against NNC runtime across a variety of rank-2 tensor sizes and plotted the results as a heatmap. All benchmarks were run on CPU, and all of them use the out= variants of the operators, so no memory reallocation is happening. Green means NNC is faster, red means PyTorch eager is faster, and 1.0 means NNC and PyTorch ran at the same speed.
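As a rough sketch of the measurement methodology (the helper names here are hypothetical; the real harness is in the microbenchmarks script linked below), each op can be timed over many runs and the speedup reported as a ratio of median runtimes:

```python
import time
import statistics

def median_runtime(fn, iters=100):
    """Median wall-clock time of fn() over `iters` runs.

    Using the median rather than the mean makes the measurement
    robust to occasional scheduling hiccups.
    """
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

def speedup(eager_fn, nnc_fn, iters=100):
    """Ratio > 1.0 means the NNC version is faster than eager."""
    return median_runtime(eager_fn, iters) / median_runtime(nnc_fn, iters)
```

Each cell in the heatmap is one such ratio for a particular op at a particular rank-2 size.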
For single-threaded performance, NNC currently looks fairly promising for many ops, such as most binary ops as well as ReLU (where NNC is consistently and substantially faster). However, NNC substantially lags behind PyTorch on many transcendental ops. Since NNC doesn't yet support intra-op parallelism, it's unsurprising that it also tends to lag well behind PyTorch when it comes to multi-core performance.
These benchmarks also point to a number of places where we could improve. For example, we found that we're not vectorizing some intrinsics correctly.
Note also that there are other potential gains from NNC that aren't represented in this benchmark, such as fusing multiple pointwise ops into a single kernel.
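To illustrate why fusion helps (a toy plain-Python analogy, not actual NNC output): computing relu(a + b) as two separate ops materializes an intermediate buffer and walks over memory twice, while a fused loop touches each element exactly once:

```python
def add_then_relu(a, b):
    # Unfused: two passes over the data, with an intermediate buffer
    # written out and then read back in.
    tmp = [x + y for x, y in zip(a, b)]
    return [x if x > 0 else 0 for x in tmp]

def fused_add_relu(a, b):
    # Fused: one pass, no intermediate buffer; each element is loaded
    # and stored exactly once.
    out = []
    for x, y in zip(a, b):
        s = x + y
        out.append(s if s > 0 else 0)
    return out
```

For memory-bound pointwise ops, eliminating that extra pass over memory is where most of the fusion win comes from.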
Finally, I didn't do any scheduling for these operators; in my testing, LLVM already does a good enough job of vectorization and similar optimizations on its own.
Reproduction code can be found here: pytorch/microbenchmarks.py at master · pytorch/pytorch · GitHub