Background: dispatcher “improvements” were regressing performance
I’ve had the feeling brewing for a few weeks now that something was wrong with the PyTorch dispatcher: I would make a simple change (e.g., my first attempt at #51165) that should clearly be a small improvement and measure a regression. Further investigation revealed that a regression in inlining decisions seemed to be responsible: some piece of dispatcher machinery that was previously inlined would stop getting inlined, and the resulting argument-passing overhead or missed optimization would cause a regression.
Investigation
I ended up doing two things to get to the bottom of this. First, I outlined the slow path of `Dispatcher::callWithDispatchKey` where `at::shouldRunRecordFunction` returns true (#51163). This should help the inliner by reducing the amount of code it has to look at.
Second, I wrote a PR (#51245) to put `C10_ALWAYS_INLINE` on the parts of the dispatcher that I repeatedly saw get outlined during experiments. This seemed to help with further improvements: on top of #51245, I measured smaller regressions or even improvements from changes like #51165 that should be clear wins.
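To make the shape of these two changes concrete, here is a minimal sketch. It assumes the real `C10_ALWAYS_INLINE`/`C10_NOINLINE`/`C10_UNLIKELY` macros from `c10/macros/Macros.h`; the function names loosely follow the dispatcher, and the bodies are placeholders, not the actual implementation.

```cpp
#include <c10/macros/Macros.h>

namespace sketch {

// Assumed declaration; the real at::shouldRunRecordFunction lives in ATen.
bool shouldRunRecordFunction(bool* pre_sampled);

// Cold path: profiling setup lives out of line so it doesn't count against
// the inlining budget of every call site. Placeholder body.
C10_NOINLINE void callWithDispatchKeySlowPath(bool pre_sampled) {
  (void)pre_sampled;
  // ... construct a RecordFunction, run callbacks, then dispatch ...
}

// Hot path: small and force-inlined into each generated operator wrapper.
C10_ALWAYS_INLINE void callWithDispatchKey() {
  bool pre_sampled = false;
  if (C10_UNLIKELY(shouldRunRecordFunction(&pre_sampled))) {
    callWithDispatchKeySlowPath(pre_sampled);
    return;
  }
  // ... fast path: look up the kernel and call it unboxed ...
}

} // namespace sketch
```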
Workflow
This work is still tricky. The workflow I’ve landed on is:

- Make a change and rebuild our benchmark.
- Measure the effect on benchmark CPU cycles and instructions executed using `perf stat --repeat 5`, with Turbo disabled on my machine to reduce noise.
- Compare `perf record` output to ensure that no unexpected outlining has taken place, and check whether any functions’ performance has moved.
- Export annotated assembly from `perf record` before and after the change, and inspect the differences in assembly to make sure that changes are having the desired effect.

I often make enough experimental changes that I have to go back and re-validate the final set of changes when I think I have something that works. Small mistakes can be punishing: for example, I didn’t originally make `Dispatcher::callWithDispatchKeySlowPath` `static`, which caused extra work to be done to forward the `this` pointer to it. I only discovered this through inspecting assembly.
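For illustration, here is a hedged sketch of that `static` point with made-up signatures; the real slow path takes many more arguments, but the register-level effect is the same.

```cpp
#include <cstdint>

// A non-static member function receives `this` as a hidden first argument,
// so every real argument shifts over by one register just to forward a
// pointer the slow path may never use. Declaring the slow path `static`
// removes the hidden argument entirely.
struct DispatcherSketch {
  int64_t memberSlowPath(int64_t a, int64_t b);         // takes (this, a, b)
  static int64_t staticSlowPath(int64_t a, int64_t b);  // takes (a, b)

  int64_t call(int64_t a, int64_t b) {
    // With the static version, the incoming argument registers can be
    // forwarded unchanged (often as a tail call).
    return staticSlowPath(a, b);
  }
};
```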
Unfortunately, performance as measured in the simple `at::empty` benchmark does not appear to be entirely stable under change. For example, #51247 (check `KernelFunction::unboxed_kernel_func_` instead of `KernelFunction::boxed_kernel_func_` in `OperatorEntry::lookup`) was a clear improvement on top of #51163, but appears neutral on top of #51165. I argue in its summary that we should commit it anyway because it makes sense in principle and appears to improve the generated assembly, but I would really prefer to have clearer wins.
Chains of inlining
I did come up with a partial explanation of why inlining for the dispatcher code is so finicky: there is more code available to inline than I originally realized. As I mentioned in a comment on #51164, there are two or three call chains where lots of inlining needs to happen for each native function.
First is the obvious one. Using `x → y` to represent “x calls y”, we have: `at::empty` → `Dispatcher::call` → `Dispatcher::callWithDispatchKey` → `KernelFunction::call` → `KernelFunction::callUnboxedKernelFunction`, which then calls into the registered unboxed function pointer for the given operator and can’t inline further. We also have `Dispatcher::callWithDispatchKey` → `OperatorEntry::lookup` → `KernelFunction::isValid`, as well as utility functions like `Dispatcher::singleton`.
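As a toy model of that first chain (hedged: the real signatures carry dispatch keys, tensors, and more), each layer just forwards to the next, so every hop has to inline for the wrapper to compile down to a single indirect call:

```cpp
#include <cstdint>

using UnboxedFn = int64_t (*)(int64_t);

struct KernelFunctionToy {
  UnboxedFn unboxed_kernel_func_;

  int64_t callUnboxedKernelFunction(int64_t arg) const {
    return unboxed_kernel_func_(arg);  // indirect call: inlining stops here
  }

  int64_t call(int64_t arg) const {
    return callUnboxedKernelFunction(arg);  // hop 4
  }
};

struct DispatcherToy {
  KernelFunctionToy kernel;

  int64_t callWithDispatchKey(int64_t arg) const {
    return kernel.call(arg);  // hop 3
  }

  int64_t call(int64_t arg) const {
    return callWithDispatchKey(arg);  // hop 2
  }
};

// Stands in for the generated at::empty wrapper.
int64_t empty_toy(const DispatcherToy& d, int64_t arg) {
  return d.call(arg);  // hop 1
}
```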
The second chain, which isn’t always relevant and could be considered part of the first chain, is argument processing to translate the `TensorOptions` argument to `at::empty` into a set of `c10::optional` objects for the c10-full calling convention. Here I am referring to the calls wrapping the arguments to `TypedOperatorHandle::call` in the following expression (found in the body of `at::empty` in the generated `Functions.cpp`):
```cpp
return op.call(
    size,
    optTypeMetaToScalarType(options.dtype_opt()),
    options.layout_opt(),
    options.device_opt(),
    options.pinned_memory_opt(),
    c10::impl::check_tensor_options_and_extract_memory_format(options, memory_format));
```
Because `Dispatcher::callWithDispatchKey` is inlined into `at::empty`, this argument processing gets intermixed with `callWithDispatchKey`.
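For a sense of what these wrappers do, here is a sketch of the dtype one, simplified from `c10::optTypeMetaToScalarType` as I understand it:

```cpp
#include <c10/core/ScalarType.h>
#include <c10/util/Optional.h>
#include <c10/util/typeid.h>

// Translate the optional TypeMeta carried by TensorOptions into the
// optional ScalarType that the c10-full calling convention expects.
inline c10::optional<c10::ScalarType> optTypeMetaToScalarTypeSketch(
    c10::optional<caffe2::TypeMeta> type_meta) {
  if (!type_meta.has_value()) {
    return c10::nullopt;
  }
  return type_meta->toScalarType();
}
```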
Finally, we have a chain that surprised me, stemming from the `TORCH_FN` and `Library::impl` calls used to register operator implementations. For example, in `RegisterBackendSelect.cpp`, you will find `m.impl("aten::empty.memory_format", TORCH_FN(empty_memory_format))`, where `empty_memory_format` is a generated wrapper that calls `c10::computeDispatchKey` and then back into `Dispatcher::callWithDispatchKey`. Here we have `Library::impl` → `CppFunction()` (the ctor taking a `CompileTimeFunctionPointer`) → `KernelFunction::makeFromUnboxedFunction` → `KernelFunction::makeFromUnboxedFunctor`, which takes the address of an instantiation of `wrap_kernel_functor_unboxed::call`. If all goes well, the generated `empty_memory_format` wrapper should get inlined directly into `wrap_kernel_functor_unboxed::call` (and indeed this is part of the point of `CompileTimeFunctionPointer`).
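Here is a hedged, simplified sketch of the trick that makes that final inlining possible; the names are mine (C++17), and the real c10 machinery is considerably more general:

```cpp
#include <utility>

// The function pointer is baked into the type as a non-type template
// parameter, so the wrapper's call targets a statically known function
// that the compiler is free to inline, rather than an opaque runtime
// function pointer.
template <auto func>
struct CompileTimeFnSketch {
  template <class... Args>
  static decltype(auto) call(Args... args) {
    return func(std::move(args)...);  // direct, inlinable call
  }
};

// Hypothetical kernel standing in for the generated empty_memory_format.
inline int twice(int x) { return 2 * x; }

// The wrapper instantiation is an ordinary function whose body the
// compiler sees through completely, so `twice` can inline into it.
int demo() { return CompileTimeFnSketch<&twice>::call(21); }
```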
Further plans
Going forward, I hope to reduce the burden on the inliner (and maybe build times too) by outlining more slow paths in dispatcher-related code. For example, does `Dispatcher::callWithDispatchKey` need to be inlined into each operator implementation? Can we avoid that and retain decent performance?
Other things that might be nice to have:
- Get rid of BackendSelect. Concretely, it seems like we could just push the `computeDispatchKey` call from `empty_memory_format` in `RegisterBackendSelect.cpp` into the `at::empty` implementation in `Functions.cpp`. I don’t know the subtleties around whether this would work in general, but it seems like it would save a dispatcher round trip for `empty` at least.
- Make the `shouldRunRecordFunction` happy path inlineable (see the sketch after this list). Argument registers are caller-saved, so calling `shouldRunRecordFunction` before we call out to the operator implementation via a function pointer means that we have to assume our incoming argument registers are all overwritten and must move everything back into place before calling the operator implementation.
- Try removing `TensorOptions` from the public API in places where we are just going to convert it to a bunch of optionals for c10-full anyway. At worst, we should end up with similar costs overall, and we may save all the time spent unpacking `TensorOptions`. IIRC, I tried simply making the `TensorOptions` version non-default and defaulting to the c10-full version instead, but the availability of implicit conversions to both `TensorOptions` and `c10::optional<caffe2::TypeMeta>` (and other c10-full argument types too) made this tricky.
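To illustrate the register-shuffling point from the `shouldRunRecordFunction` bullet, here is a hedged toy example with made-up names:

```cpp
#include <cstdint>

bool shouldRunRecordFunction(bool* pre_sampled);  // assumed out-of-line

using KernelFn = int64_t (*)(int64_t a, int64_t b);

// `kernel`, `a`, and `b` arrive in caller-saved argument registers
// (rdi/rsi/rdx on x86-64). Because shouldRunRecordFunction is an opaque
// out-of-line call, the compiler must assume it clobbers those registers,
// so it saves them (to callee-saved registers or the stack) and moves
// everything back into place before the indirect kernel call. An inlined,
// callback-free happy path would emit no call here, letting the arguments
// stay where they arrived.
int64_t callKernel(KernelFn kernel, int64_t a, int64_t b) {
  bool pre_sampled = false;
  if (shouldRunRecordFunction(&pre_sampled)) {
    // ... profiling slow path elided ...
  }
  return kernel(a, b);
}
```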