Micro-optimizations for the most micro of benchmarks

Most of the benchmarking in PyTorch 2 has focused on large models taken from real-world applications. For this note, I want to take the completely opposite approach and instead focus on fixed overheads: basically everything except the generated kernels in PyTorch 2. Fixed overheads are important for smaller overhead-bound models, they get multiplied by graph breaks, and they will start mattering a lot more with torch.compile-backed eager mode backends that compile lots of small 1-op graphs.

So let me introduce our fixed-overhead benchmark that will be the running example in this note:

import torch

@torch.compile
def add1(x):
    return x + 1

add1(torch.tensor([1]))

This is a silly benchmark, but it is the most overhead-bound micro-benchmark I can think of, and should be a good proxy for the fixed overheads introduced by torch.compile.
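
For reference, a minimal timing harness along these lines (a sketch, not necessarily the exact one used to produce the numbers below) is enough to reproduce this kind of measurement:

import timeit

x = torch.tensor([1])
add1(x)  # warm up first, so compilation time is excluded from the measurement

# average latency per call, in microseconds
per_call = timeit.timeit(lambda: add1(x), number=100_000) / 100_000
print(f"{per_call * 1e6:.1f}us per call")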

Skipping straight to the results, this benchmark running on my local CPU (i9-11900K) takes:

  • eager (baseline): 4.5us
  • pt2 before the changes in this post: 24.4us
  • pt2 with the changes in this post: 10.3us
  • pt2 with hypothetical optimizations (upper bound): 6.8us

This note will first look at four optimizations I made in order to achieve that 14us saving. Next it will explore three ideas for possible future savings and try to estimate the impact from them.

Optimization 1: Faster Python-to-C++ bindings

Improvement: ~1.9us per call to a CPU kernel

Cumulative add1() perf: 24.4us to 22.5us

The first change I made was replacing how we called our generated C++ kernels. Previously we would use ctypes to load and call our CPU kernels from Python. Unfortunately ctypes is notoriously slow, and by instead generating our own Python bindings, I was able to make calling a 1-element kernel in a loop 5.6x faster. This has a smaller effect on our micro-benchmark, since it only calls one kernel, but every bit helps!
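
Roughly, the old path looked like this (a schematic sketch; ./kernel.so and the kernel() symbol are made-up names, not the actual TorchInductor codegen):

import ctypes

import torch

# old approach: load the compiled kernel with ctypes and call it directly;
# every call pays for ctypes' generic argument-marshalling machinery
lib = ctypes.CDLL("./kernel.so")  # hypothetical path to a compiled kernel
lib.kernel.argtypes = [ctypes.c_void_p, ctypes.c_void_p]

def call_kernel(inp: torch.Tensor, out: torch.Tensor):
    lib.kernel(inp.data_ptr(), out.data_ptr())

Generating our own Python bindings instead means the argument conversion is specialized ahead of time rather than interpreted by ctypes on every call.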

Optimization 2: A faster BACKEND_MATCH guard

Improvement: ~4.4us per graph

Cumulative add1() perf: 22.5us to 18.1us

So, based on some profiling and ablation studies, I found that this single guard, BACKEND_MATCH, was taking a lot of time. There were actually two parts to this: code to maintain a global guarded_backend_cache and thread-local current_backend that executed on every torch.compile-ed call, plus the actual guard code itself. The fix was a set of Python refactors so that we do less work on the critical path. This could likely be optimized even further by moving it to C++.

Note that BACKEND_MATCH is disproportionately represented in this micro-benchmark because it is one of only a very few guards actually needed here.
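
The shape of the fix is the usual one for this kind of overhead (a generic illustration only, not the actual Dynamo code): avoid redoing bookkeeping on every call when nothing has changed.

import threading

_local = threading.local()
_guarded_backend_cache = set()

def call_compiled_before(compiled_fn, backend, *args):
    # old shape: cache and thread-local bookkeeping runs unconditionally
    # on every torch.compile-ed call
    _guarded_backend_cache.add(backend)
    _local.current_backend = backend
    return compiled_fn(*args)

def call_compiled_after(compiled_fn, backend, *args):
    # new shape: skip the bookkeeping when the backend hasn't changed,
    # which is the common case on the critical path
    if getattr(_local, "current_backend", None) is not backend:
        _guarded_backend_cache.add(backend)
        _local.current_backend = backend
    return compiled_fn(*args)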

Optimization 3: Fewer context managers in _TorchDynamoContext

Improvement: ~5.9us per graph

Cumulative add1() perf: 18.1us to 12.2us

The next fix focused on these lines of code that used to be run on every graph in eval_frame.py:

on_enter()
…
backend_ctx = backend_ctx_ctor()
backend_ctx.__enter__()
dynamic_ctx = enable_dynamic(self.dynamic, self.export)
dynamic_ctx.__enter__()
…
dynamic_ctx.__exit__(None, None, None)
backend_ctx.__exit__(None, None, None)

In our specific micro-benchmark, all of these lines do nothing. They are just hooks to handle extension points that aren't needed most of the time. I refactored the code so that we only put these things on the critical path when they are enabled. I also refactored things to avoid contextlib, which is expensive.
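
To illustrate the contextlib point (a schematic, not the actual eval_frame.py code): even a no-op contextlib-based context manager costs a generator creation plus __enter__/__exit__ dispatch on every call, so the cheapest fix is to not enter it at all when the extension point is unused.

import contextlib

@contextlib.contextmanager
def noop_backend_ctx():
    # stand-in for an extension hook that is usually not configured
    yield

def call_before(fn, *args):
    # old shape of the hot path: always construct and enter the context manager
    with noop_backend_ctx():
        return fn(*args)

def call_after(fn, *args, backend_ctx_ctor=None):
    # new shape of the hot path: only pay for the context manager when an
    # extension point is actually configured
    if backend_ctx_ctor is None:
        return fn(*args)
    with backend_ctx_ctor():
        return fn(*args)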

Optimization 4: Faster memory allocation

Improvement: ~1.7us per memory allocation

Cumulative add1() perf: 12.2us to 10.5us

This last optimization was inspired by this PR by @swolchok. The basic idea is to replace torch.empty with at::detail::empty_strided_cpu. This bypasses the PyTorch dispatcher and Python bindings and shaves some time off of every memory allocation. While the savings here are small due to only having one allocation in the benchmark, for models with more allocations the savings will be larger.

Opportunity 1: Faster guards

Hypothetical improvement (upper bound): 2.5us on this micro-benchmark

So what else can we do? To estimate the maximum savings we could get by making guards go faster, I created a patch that deletes all the guard checks, which reduced the time from 10.5us to 8us. Faster guards won’t be able to achieve this, since they still need to check something, but this provides an upper bound on the savings possible from better guards. @anijain2305 is working on a guards refactor that could help here. Note that the possible savings from faster guards will be bigger on models that generate more guards (this one only has a few). We could also use my patch to delete guards as a way to estimate savings on larger models.
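
If you want to see which guards this benchmark actually installs, one option (assuming a build with the torch._logging artifact API; TORCH_LOGS="guards" on the command line does the same thing) is:

import torch
import torch._logging

# print the guards Dynamo installs for each compiled frame
torch._logging.set_logs(guards=True)

@torch.compile
def add1(x):
    return x + 1

add1(torch.tensor([1]))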

Guards are now the vast majority of the non-TorchInductor time since calling the generated TorchInductor code directly (without dynamo) only brings the time down from 8us to 7.5us.

Opportunity 2: Removing at::Tensor creation

Hypothetical improvement (upper bound): 1.2us per memory allocation

My next thought was that at::Tensor creation is adding too much overhead. Perhaps with something like memory planning or better caching we could speed this up. To estimate the maximum benefit here, I moved the memory allocation line from inside the generated kernel into global scope, meaning we no longer allocate a new tensor on every call. This improved things from 7.5us to 6.3us (for the TorchInductor-only version).
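
At the Python level, the experiment amounts to something like this (a schematic, not the actual generated TorchInductor wrapper):

import torch

x = torch.tensor([1])

# allocated once, in global scope, instead of on every call
out_buf = torch.empty_like(x)

def add1_preallocated(x):
    # write into the preallocated buffer rather than allocating
    # a fresh output tensor on each call
    torch.add(x, 1, out=out_buf)
    return out_buf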

Opportunity 3: C++ wrapper code

Hypothetical improvement: ???

My next thought was that maybe C++ wrapper code would speed things up, so I ran with TORCHINDUCTOR_CPP_WRAPPER=1 to check. To my surprise this was actually a regression, from 8us to 8.6us. I'm still convinced that C++ wrapper code can help, though apparently the cost to get into our current C++ wrapper code doesn't make sense for 1-kernel models. Maybe @desertfire will have some ideas here. Note that for non-trivial models the C++ wrapper code is definitely a win; this is just a tricky benchmark.
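
For anyone who wants to reproduce this, the C++ wrapper can be enabled either with the TORCHINDUCTOR_CPP_WRAPPER=1 environment variable or (assuming a build that exposes the config knob) directly from Python:

import torch
import torch._inductor.config as inductor_config

# equivalent to setting TORCHINDUCTOR_CPP_WRAPPER=1 in the environment
inductor_config.cpp_wrapper = True

@torch.compile
def add1(x):
    return x + 1

add1(torch.tensor([1]))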

In theory, I think optimized C++ wrapper code should be able to get us much closer to that 4.5us eager mode performance.

Closing thoughts

While progress has been made in optimizing the fixed overheads within PyTorch 2, there remains room for further improvements. Although the presented micro-benchmark may not be of paramount importance, delving into simpler benchmarks provides a valuable exercise in dissecting and understanding time allocation.

Also shout out to the low overheads of eager mode here! We clearly did a great job shaving those down. They are super hard to beat.
