NNC Per-Operator Benchmarks (on CPU)

That makes sure no graph is created, but most of the autograd logic still runs (the view/inplace handling in particular).
But that would only change the fixed dispatcher overhead, so it won’t change these results much beyond making PyTorch more competitive for small sizes.
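
To make the "fixed overhead only matters for small sizes" point concrete, here is a rough toy cost model. It is only a sketch: the overhead and per-element numbers are made up for illustration and are not measurements from these benchmarks, and the function names are hypothetical.

```python
def op_time_us(n_elements, fixed_overhead_us, per_element_ns):
    """Toy cost model: fixed per-call overhead plus work linear in tensor size."""
    return fixed_overhead_us + n_elements * per_element_ns / 1000.0


def speedup_from_lower_overhead(n_elements, old_overhead_us, new_overhead_us,
                                per_element_ns):
    """Ratio of old to new time when only the fixed overhead is reduced."""
    old = op_time_us(n_elements, old_overhead_us, per_element_ns)
    new = op_time_us(n_elements, new_overhead_us, per_element_ns)
    return old / new


# Hypothetical numbers (assumptions, not measured): overhead drops from
# 5 us to 3 us per call, and the kernel costs 1 ns per element.
SIZES = [16, 1024, 1_000_000]
speedups = {
    n: speedup_from_lower_overhead(n, old_overhead_us=5.0,
                                   new_overhead_us=3.0, per_element_ns=1.0)
    for n in SIZES
}

for n, s in speedups.items():
    print(f"{n:>9} elements: {s:.3f}x faster")
```

Under these assumed numbers the gain is large for tiny tensors and shrinks toward 1x as the per-element work dominates, which is why shaving dispatcher overhead would mostly shift the small-size end of the comparison.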