Background: dispatcher “improvements” were regressing performance
I’ve had the feeling brewing for a few weeks now that something was wrong with the PyTorch dispatcher: I would make a simple change (e.g., my first attempt at #51165) that should clearly be a small improvement and measure a regression. Further investigation revealed that a regression in inlining decisions seemed to be responsible: some piece of dispatcher machinery that was previously inlined would stop getting inlined, and the resulting argument-passing overhead or missed optimization would cause a regression.
Investigation
I ended up doing two things to get to the bottom of this. First, I outlined the slow path of `Dispatcher::callWithDispatchKey` where `at::shouldRunRecordFunction` returns true (#51163). This should help the inliner by reducing the amount of code it has to look at.
Second, I wrote a PR (#51245) to put `C10_ALWAYS_INLINE` on the parts of the dispatcher that I repeatedly saw get outlined during experiments. This seemed to help with further improvements: on top of #51245, I measured smaller regressions or even improvements from changes like #51165 that should be clear wins.
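To make the shape of these two changes concrete, here is a minimal sketch. It assumes the real `C10_ALWAYS_INLINE`/`C10_NOINLINE`/`C10_UNLIKELY` macros from `c10/macros/Macros.h`; the function names loosely follow the dispatcher, and the bodies are placeholders, not the actual implementation.

```cpp
#include <c10/macros/Macros.h>

namespace sketch {

// Assumed declaration; the real at::shouldRunRecordFunction lives in ATen.
bool shouldRunRecordFunction(bool* pre_sampled);

// Cold path: profiling setup lives out of line so it doesn't count against
// the inlining budget of every call site. Placeholder body.
C10_NOINLINE void callWithDispatchKeySlowPath(bool pre_sampled) {
  (void)pre_sampled;
  // ... construct a RecordFunction, run callbacks, then dispatch ...
}

// Hot path: small and force-inlined into each generated operator wrapper.
C10_ALWAYS_INLINE void callWithDispatchKey() {
  bool pre_sampled = false;
  if (C10_UNLIKELY(shouldRunRecordFunction(&pre_sampled))) {
    callWithDispatchKeySlowPath(pre_sampled);
    return;
  }
  // ... fast path: look up the kernel and call it unboxed ...
}

} // namespace sketch
```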
Workflow
This work is still tricky. The workflow I’ve landed on is:

- Make a change and rebuild our benchmark.
- Measure the effect on benchmark CPU cycles and instructions executed using `perf stat --repeat 5`, with Turbo disabled on my machine to reduce noise.
- Compare `perf record` output to ensure that no unexpected outlining has taken place, and check whether any functions’ performance has moved.
- Export annotated assembly from `perf record` before and after the change, and inspect the differences in assembly to make sure that changes are having the desired effect.

I often make enough experimental changes that I have to go back and re-validate the final set of changes when I think I have something that works. Small mistakes can be punishing: for example, I didn’t originally make `Dispatcher::callWithDispatchKeySlowPath` `static`, which caused extra work to be done to forward the `this` pointer to it. I only discovered this through inspecting assembly.
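For illustration, here is a hedged sketch of that `static` point with made-up signatures; the real slow path takes many more arguments, but the register-level effect is the same.

```cpp
#include <cstdint>

// A non-static member function receives `this` as a hidden first argument,
// so every real argument shifts over by one register just to forward a
// pointer the slow path may never use. Declaring the slow path `static`
// removes the hidden argument entirely.
struct DispatcherSketch {
  int64_t memberSlowPath(int64_t a, int64_t b);         // takes (this, a, b)
  static int64_t staticSlowPath(int64_t a, int64_t b);  // takes (a, b)

  int64_t call(int64_t a, int64_t b) {
    // With the static version, the incoming argument registers can be
    // forwarded unchanged (often as a tail call).
    return staticSlowPath(a, b);
  }
};
```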
Unfortunately, performance as measured in the simple `at::empty` benchmark does not appear to be entirely stable under change. For example, #51247 (check `KernelFunction::unboxed_kernel_func_` instead of `KernelFunction::boxed_kernel_func_` in `OperatorEntry::lookup`) was a clear improvement on top of #51163, but appears neutral on top of #51165. I argue in its summary that we should commit it anyway because it makes sense in principle and appears to improve the generated assembly, but I would really prefer to have clearer wins.
Chains of inlining
I did come up with a partial explanation of why inlining for the dispatcher code is so finicky: there is more code available to inline than I originally realized. As I mentioned in a comment on #51164, there are two or three call chains where lots of inlining needs to happen for each native function.
First is the obvious one. Using `x → y` to represent “x calls y”, we have: `at::empty` → `Dispatcher::call` → `Dispatcher::callWithDispatchKey` → `KernelFunction::call` → `KernelFunction::callUnboxedKernelFunction`, which then calls into the registered unboxed function pointer for the given operator and can’t inline further. We also have `Dispatcher::callWithDispatchKey` → `OperatorEntry::lookup` → `KernelFunction::isValid`, as well as utility functions like `Dispatcher::singleton`.
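As a toy model of that first chain (hedged: the real signatures carry dispatch keys, tensors, and more), each layer just forwards to the next, so every hop has to inline for the wrapper to compile down to a single indirect call:

```cpp
#include <cstdint>

using UnboxedFn = int64_t (*)(int64_t);

struct KernelFunctionToy {
  UnboxedFn unboxed_kernel_func_;

  int64_t callUnboxedKernelFunction(int64_t arg) const {
    return unboxed_kernel_func_(arg);  // indirect call: inlining stops here
  }

  int64_t call(int64_t arg) const {
    return callUnboxedKernelFunction(arg);  // hop 4
  }
};

struct DispatcherToy {
  KernelFunctionToy kernel;

  int64_t callWithDispatchKey(int64_t arg) const {
    return kernel.call(arg);  // hop 3
  }

  int64_t call(int64_t arg) const {
    return callWithDispatchKey(arg);  // hop 2
  }
};

// Stands in for the generated at::empty wrapper.
int64_t empty_toy(const DispatcherToy& d, int64_t arg) {
  return d.call(arg);  // hop 1
}
```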
The second chain, which isn’t always relevant and could be considered part of the first chain, is argument processing to translate the `TensorOptions` argument to `at::empty` into a set of `c10::optional` objects for the c10-full calling convention. Here I am referring to the calls wrapping the arguments to `TypedOperatorHandle::call` in the following expression (found in the body of `at::empty` in the generated `Functions.cpp`):
```cpp
return op.call(
    size,
    optTypeMetaToScalarType(options.dtype_opt()),
    options.layout_opt(),
    options.device_opt(),
    options.pinned_memory_opt(),
    c10::impl::check_tensor_options_and_extract_memory_format(options, memory_format));
```
Because `Dispatcher::callWithDispatchKey` is inlined into `at::empty`, this argument processing gets intermixed with `callWithDispatchKey`.
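For a sense of what these wrappers do, here is a sketch of the dtype one, simplified from `c10::optTypeMetaToScalarType` as I understand it:

```cpp
#include <c10/core/ScalarType.h>
#include <c10/util/Optional.h>
#include <c10/util/typeid.h>

// Translate the optional TypeMeta carried by TensorOptions into the
// optional ScalarType that the c10-full calling convention expects.
inline c10::optional<c10::ScalarType> optTypeMetaToScalarTypeSketch(
    c10::optional<caffe2::TypeMeta> type_meta) {
  if (!type_meta.has_value()) {
    return c10::nullopt;
  }
  return type_meta->toScalarType();
}
```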
Finally, we have a chain that surprised me, stemming from the `TORCH_FN` and `Library::impl` calls used to register operator implementations. For example, in `RegisterBackendSelect.cpp`, you will find `m.impl("aten::empty.memory_format", TORCH_FN(empty_memory_format))`, where `empty_memory_format` is a generated wrapper that calls `c10::computeDispatchKey` and then back into `Dispatcher::callWithDispatchKey`. Here we have `Library::impl` → `CppFunction()` (the ctor taking a `CompileTimeFunctionPointer`) → `KernelFunction::makeFromUnboxedFunction` → `KernelFunction::makeFromUnboxedFunctor`, which takes the address of an instantiation of `wrap_kernel_functor_unboxed::call`. If all goes well, the generated `empty_memory_format` wrapper should get inlined directly into `wrap_kernel_functor_unboxed::call` (and indeed this is part of the point of `CompileTimeFunctionPointer`).
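Here is a hedged, simplified sketch of the trick that makes that final inlining possible; the names are mine (C++17), and the real c10 machinery is considerably more general:

```cpp
#include <utility>

// The function pointer is baked into the type as a non-type template
// parameter, so the wrapper's call targets a statically known function
// that the compiler is free to inline, rather than an opaque runtime
// function pointer.
template <auto func>
struct CompileTimeFnSketch {
  template <class... Args>
  static decltype(auto) call(Args... args) {
    return func(std::move(args)...);  // direct, inlinable call
  }
};

// Hypothetical kernel standing in for the generated empty_memory_format.
inline int twice(int x) { return 2 * x; }

// The wrapper instantiation is an ordinary function whose body the
// compiler sees through completely, so `twice` can inline into it.
int demo() { return CompileTimeFnSketch<&twice>::call(21); }
```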
Further plans
Going forward, I hope to reduce the burden on the inliner (and maybe build times too) by outlining more slow paths in dispatcher-related code. For example, does `Dispatcher::callWithDispatchKey` need to be inlined into each operator implementation? Can we avoid that and retain decent performance?
Other things that might be nice to have:
- Get rid of BackendSelect. Concretely, it seems like we could just push the `computeDispatchKey` call from `empty_memory_format` in `RegisterBackendSelect.cpp` into the `at::empty` implementation in `Functions.cpp`. I don’t know the subtleties around whether this would work in general, but it seems like it would save a dispatcher round trip for `empty` at least.
- Make the `shouldRunRecordFunction` happy path inlineable (see the sketch after this list). Argument registers are caller-saved, so calling `shouldRunRecordFunction` before we call out to the operator implementation via a function pointer means that we have to assume our incoming argument registers are all overwritten and must move everything back into place before calling the operator implementation.
- Try removing `TensorOptions` from the public API in places where we are just going to convert it to a bunch of optionals for c10-full anyway. At worst, we should end up with similar costs overall, and we may save all the time spent unpacking `TensorOptions`. IIRC, I tried simply making the `TensorOptions` version non-default and defaulting to the c10-full version instead, but the availability of implicit conversions to both `TensorOptions` and `c10::optional<caffe2::TypeMeta>` (and other c10-full argument types too) made this tricky.
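To illustrate the register-shuffling point from the `shouldRunRecordFunction` bullet, here is a hedged toy example with made-up names:

```cpp
#include <cstdint>

bool shouldRunRecordFunction(bool* pre_sampled);  // assumed out-of-line

using KernelFn = int64_t (*)(int64_t a, int64_t b);

// `kernel`, `a`, and `b` arrive in caller-saved argument registers
// (rdi/rsi/rdx on x86-64). Because shouldRunRecordFunction is an opaque
// out-of-line call, the compiler must assume it clobbers those registers,
// so it saves them (to callee-saved registers or the stack) and moves
// everything back into place before the indirect kernel call. An inlined,
// callback-free happy path would emit no call here, letting the arguments
// stay where they arrived.
int64_t callKernel(KernelFn kernel, int64_t a, int64_t b) {
  bool pre_sampled = false;
  if (shouldRunRecordFunction(&pre_sampled)) {
    // ... profiling slow path elided ...
  }
  return kernel(a, b);
}
```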