NNC Per-Operator Benchmarks (on CPU)

Chillee · January 27, 2021, 10:38am

After Mikhail landed an initial prototype of Python bindings for NNC, I thought it would be interesting to compare PyTorch single-op performance vs NNC.

For each of these pointwise ops, I benchmarked PyTorch runtime and NNC runtime across a variety of rank 2 sizes, and plotted them in a heatmap. All of the benchmarks were performed on CPU. All of these use the out= variants of operators, so no memory re-allocation is happening. Green means that NNC is faster, red means that PyTorch eager is faster. 1.0 means that NNC and PyTorch ran at the same speed.
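As a rough illustration of how a grid like this can be produced, here is a minimal sketch using only the Python standard library. It is not the code from microbenchmarks.py: the helper names (time_per_call, speedup_grid) and the two pure-Python stand-in implementations are hypothetical, and the ratio convention (baseline time divided by candidate time, so values above 1.0 mean the candidate is faster) is an assumption. Both stand-ins write into a preallocated output buffer, loosely mirroring the out= variants.

```python
import timeit

def time_per_call(fn, number=200, repeat=3):
    """Best-of-`repeat` average seconds per call of fn()."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number

def speedup_grid(make_baseline, make_candidate, rows, cols):
    """grid[i][j] = baseline_time / candidate_time for shape (rows[i], cols[j]).

    Assumed convention: a value above 1.0 means the candidate is faster.
    """
    grid = []
    for r in rows:
        line = []
        for c in cols:
            baseline = make_baseline(r, c)
            candidate = make_candidate(r, c)
            line.append(time_per_call(baseline) / time_per_call(candidate))
        grid.append(line)
    return grid

# Hypothetical stand-ins for "eager" and "compiled" elementwise add.
# Inputs and the output buffer are allocated once, outside the timed call.
def make_add_loop(r, c):
    a, b, out = [1.0] * (r * c), [2.0] * (r * c), [0.0] * (r * c)
    def run():
        for i in range(len(out)):
            out[i] = a[i] + b[i]
    return run

def make_add_zip(r, c):
    a, b, out = [1.0] * (r * c), [2.0] * (r * c), [0.0] * (r * c)
    def run():
        out[:] = [x + y for x, y in zip(a, b)]
    return run

if __name__ == "__main__":
    rows, cols = [1, 8], [16, 64]
    for r, line in zip(rows, speedup_grid(make_add_loop, make_add_zip, rows, cols)):
        print(r, ["%.2f" % v for v in line])
```

Swapping the two factory functions for ones that return closures over a PyTorch op and an NNC kernel would give the same kind of per-size ratio table that the heatmaps show.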

For single-threaded performance, NNC currently seems fairly promising for many ops, such as most binary ops as well as ReLU (where NNC is consistently and substantially faster). NNC seems to lag well behind PyTorch on many transcendental ops. As NNC currently doesn’t support intra-op parallelism, it’s unsurprising that it tends to lag substantially behind PyTorch when it comes to multi-core performance.

These benchmarks also suggest a lot of places where we could see improvements. For example, we found out that we’re not vectorizing some intrinsics correctly.

Also, note that there are other potential gains from NNC that aren’t represented in this benchmark, for example from fusing multiple pointwise ops.

Also, I didn’t do any scheduling for these operators. In my testing, it seems like LLVM already does a good enough job of vectorization and such.

Reproduction code can be found here: pytorch/microbenchmarks.py at master · pytorch/pytorch · GitHub

albanD · January 27, 2021, 3:31pm

Hey!

Thanks for sharing!
Given that you use the out= variants only, that means that this is only usable for inference, right?
If so, do you properly disable all the dispatcher work related to autograd when you run the PyTorch code?

Chillee · January 27, 2021, 6:15pm

I don’t know if I disabled all dispatcher work (I seem to remember there was some stuff deep within C++?), but I did run with torch.no_grad(): pytorch/microbenchmarks.py at master · pytorch/pytorch · GitHub

albanD · January 27, 2021, 6:27pm

That makes sure that no graph is created, but most of the autograd logic still runs (related to view/inplace in particular).
But that would only change the fixed dispatcher overhead, so it won’t change these results much beyond making PyTorch more competitive for small sizes.
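To make the fixed-overhead argument concrete, here is a small arithmetic sketch. It assumes a simple cost model (time = fixed per-call overhead + per-element cost × number of elements), and every number in it is invented for illustration, not measured. It compares the eager/NNC time ratio before and after a hypothetical reduction in eager's fixed overhead.

```python
def modeled_time_ns(fixed_overhead_ns, per_element_ns, n):
    """Assumed cost model: constant per-call overhead plus linear work."""
    return fixed_overhead_ns + per_element_ns * n

def eager_over_nnc(n, eager_overhead_ns, nnc_overhead_ns=500.0,
                   eager_per_elem_ns=0.5, nnc_per_elem_ns=0.5):
    """Ratio of modeled eager time to modeled NNC time (above 1.0: NNC faster)."""
    eager = modeled_time_ns(eager_overhead_ns, eager_per_elem_ns, n)
    nnc = modeled_time_ns(nnc_overhead_ns, nnc_per_elem_ns, n)
    return eager / nnc

if __name__ == "__main__":
    # All overheads below are made-up illustrative values.
    for n in (16, 1024, 65536, 1 << 20):
        with_autograd_logic = eager_over_nnc(n, eager_overhead_ns=2000.0)
        without_autograd_logic = eager_over_nnc(n, eager_overhead_ns=1000.0)
        print(n, round(with_autograd_logic, 3), round(without_autograd_logic, 3))
```

Under this model, trimming the fixed overhead shifts the ratio noticeably for small tensors, while for large tensors both ratios converge to the per-element cost ratio, which is why the large-size cells of the heatmaps would barely move.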

Chillee · January 27, 2021, 6:33pm

Is there a better way to reduce the autograd logic that runs? I remember there was a discussion that mentioned a C++ flag, but I can’t find it now (and don’t know if it’s exposed in Python).

albanD · January 27, 2021, 6:57pm

There is a hacky C++ way to do this.
It is not exposed to Python on purpose, as it can very easily lead to undetectable silent correctness issues.

From C++ you can do:

{
  // This can make UNRELATED gradient computations wrong!
  // Only ever use that if you never use the autograd engine in this
  // process!
  at::AutoNonVariableTypeMode non_var_type_mode(true);
  your_fun();
}
