Background: dispatcher “improvements” were regressing performance
I’ve had the feeling brewing for a few weeks now that something was wrong with the PyTorch dispatcher: I would make a simple change (e.g., my first attempt at #51165) that should clearly be a small improvement and measure a regression. Further investigation revealed that a regression in inlining decisions seemed to be responsible: some piece of dispatcher machinery that was previously inlined would stop getting inlined, and the resulting argument-passing overhead or missed optimization would cause a regression.
I ended up doing two things to get to the bottom of this. First, I outlined the slow path that runs when at::shouldRunRecordFunction returns true (#51163). This should help the inliner by reducing the amount of code it has to look at.
Second, I wrote a PR (#51245) to put
C10_ALWAYS_INLINE on the parts of the dispatcher that I repeatedly saw get outlined during experiments. This seemed to help with further improvements – on top of #51245, I measured smaller regressions or even improvements from changes like #51165 that should be clear wins.
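To make the pattern concrete, here is a minimal sketch of force-inlining a hot path while outlining its rarely-taken slow path. The macro spellings and function names below are hypothetical stand-ins, not the actual c10 definitions:

```cpp
// Sketch only: MY_ALWAYS_INLINE / MY_NEVER_INLINE stand in for the real
// C10_ALWAYS_INLINE-style macros; dispatchOp stands in for dispatcher code.
#if defined(__GNUC__) || defined(__clang__)
#define MY_ALWAYS_INLINE inline __attribute__((always_inline))
#define MY_NEVER_INLINE __attribute__((noinline))
#else
#define MY_ALWAYS_INLINE inline
#define MY_NEVER_INLINE
#endif

// Outlined slow path: large, rarely executed, kept out of the hot function.
MY_NEVER_INLINE void recordSlowPath(int op);

// Hot path: small enough that the compiler can honor the inline request.
MY_ALWAYS_INLINE int dispatchOp(int op, bool shouldRecord) {
  if (shouldRecord) {
    recordSlowPath(op);  // rare: profiling machinery lives out of line
  }
  return op * 2;  // stand-in for the real dispatch work
}

void recordSlowPath(int op) { (void)op; /* imagine RecordFunction setup */ }
```

The idea is that the force-inlined hot function stays tiny, so the inliner's cost model keeps saying yes, while the outlined slow path costs only one (rarely-taken) call.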
This work is still tricky. The workflow I’ve landed on is to:

- make a change and rebuild our benchmark;
- measure the effect on benchmark CPU cycles and instructions executed using perf stat --repeat 5, with Turbo disabled on my machine to reduce noise;
- compare perf record output to ensure that no unexpected outlining has taken place, and check whether any functions’ performance has moved;
- export annotated assembly from perf record before and after the change; and
- inspect differences in the assembly to make sure that changes are having the desired effect.

I often make enough experimental changes that I have to go back and re-validate the final set of changes when I think I have something that works. Small mistakes can be punishing: for example, I originally failed to mark one helper function static, which caused extra work to be done to forward the this pointer to it. I only discovered this through inspecting assembly.
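The static-member mistake looks roughly like the following sketch (the type and method names here are hypothetical, not the actual dispatcher code):

```cpp
// A helper that never touches member state but isn't marked static still
// takes a hidden `this` argument, which callers must keep live and forward.
struct OperatorEntryLike {
  // Bad: receives a hidden `this` pointer it never uses.
  bool checkKeyMember(int key) const { return key >= 0; }

  // Good: static, so no `this` is materialized or forwarded at call sites.
  static bool checkKeyStatic(int key) { return key >= 0; }
};
```

At the call site, the non-static version forces the caller to have an object address in a register, which shows up as extra moves in the generated assembly even though the function body is identical.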
Unfortunately, performance as measured in the simple
at::empty benchmark does not appear to be entirely stable under change. For example, #51247 (check
KernelFunction::unboxed_kernel_func_ instead of
OperatorEntry::lookup) was a clear improvement on top of #51163, but appears neutral on top of #51165. I argue in its summary that we should commit it anyway because it makes sense in principle and appears to improve generated assembly, but I would really prefer to have clearer wins.
Chains of inlining
I did come up with a partial explanation of why inlining for the dispatcher code is so finicky: there is more code available to inline than I originally realized. As I mentioned in a comment on #51164, there are two or three call chains where lots of inlining needs to happen for each native function.
First is the obvious one. Using x -> y to represent “x calls y”, we have roughly at::empty -> Dispatcher::callWithDispatchKey -> KernelFunction::callUnboxedKernelFunction, which then calls into the registered unboxed function pointer for the given operator and can’t inline further. Along the way there are also smaller pieces, such as KernelFunction::isValid and other utility functions.
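A minimal sketch of this chain shape, with hypothetical names mirroring the ones above: each wrapper is trivial and must inline, and the chain bottoms out at an indirect call the compiler cannot see through.

```cpp
// Stack of tiny wrappers ending at an indirect call through a registered
// function pointer (hypothetical simplification, not the real dispatcher).
using UnboxedKernel = int (*)(int);

inline int callUnboxedKernelFunction(UnboxedKernel k, int x) {
  return k(x);  // indirect call: inlining cannot proceed past this point
}
inline int callWithDispatchKey(UnboxedKernel k, int x) {
  return callUnboxedKernelFunction(k, x);
}
inline int op_call(UnboxedKernel k, int x) {
  // If everything inlines, this collapses to a single indirect call.
  return callWithDispatchKey(k, x);
}

int doubleKernel(int x) { return 2 * x; }  // example registered kernel
```

If any link in the chain fails to inline, the call sequence accumulates argument-shuffling overhead at every hop, which matches the regressions described above.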
The second chain, which isn’t always relevant and could be considered part of the first chain, is argument processing to translate the
TensorOptions argument to
at::empty into a set of
c10::optional objects for the c10-full calling convention. Here I am referring to the calls wrapping the arguments to
TypedOperatorHandle::call in the following expression (found in the body of
at::empty in the generated Functions.cpp):
return op.call(size, optTypeMetaToScalarType(options.dtype_opt()), options.layout_opt(), options.device_opt(), options.pinned_memory_opt(), c10::impl::check_tensor_options_and_extract_memory_format(options, memory_format));
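A simplified sketch of this aggregate-to-optionals translation, using hypothetical stand-in types rather than the real TensorOptions and c10 machinery:

```cpp
#include <optional>

// Stand-in for TensorOptions: one aggregate carrying several optional fields.
struct FakeOptions {
  std::optional<int> dtype;
  std::optional<int> layout;
  std::optional<bool> pin_memory;
};

// c10-full style callee: each field arrives as its own optional argument.
inline int call_c10_full(std::optional<int> dtype,
                         std::optional<int> layout,
                         std::optional<bool> pin_memory) {
  return dtype.value_or(0) + layout.value_or(0) +
         (pin_memory.value_or(false) ? 1 : 0);
}

// Mirrors the generated at::empty body: unpack the aggregate and forward.
inline int call_with_options(const FakeOptions& o) {
  return call_c10_full(o.dtype, o.layout, o.pin_memory);
}
```

Each accessor call in the real expression (options.dtype_opt(), options.layout_opt(), and so on) is another small function the inliner has to digest before the call site looks cheap.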
When Dispatcher::callWithDispatchKey is inlined into at::empty, this argument processing gets intermixed with the dispatch logic itself.
Finally, we have a chain that surprised me, stemming from
Library::impl calls used to register operator implementations. For example, in
RegisterBackendSelect.cpp, you will find
m.impl("aten::empty.memory_format", TORCH_FN(empty_memory_format)), where
empty_memory_format is a generated wrapper that calls
c10::computeDispatchKey and then back into
Dispatcher::callWithDispatchKey. Here we have the CppFunction constructor calling KernelFunction::makeFromUnboxedFunctor, which takes the address of an instantiation of wrap_kernel_functor_unboxed::call. If all goes well, the generated empty_memory_format wrapper should get inlined directly into wrap_kernel_functor_unboxed::call (and indeed this is part of the point of this machinery).
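The shape of that registration pattern, sketched with hypothetical names: the raw function is baked into a template as a compile-time constant, and the address that gets registered is the wrapper's ::call, into which the wrapped function can inline.

```cpp
// Hypothetical simplification of the makeFromUnboxedFunctor-style pattern.
using UnboxedKernel = int (*)(int);

template <UnboxedKernel f>
struct WrapKernelFunction {
  static int call(int x) {
    return f(x);  // f is a template argument, so this call can be inlined
  }
};

// Stand-in for the generated empty_memory_format wrapper.
int empty_memory_format_impl(int x) { return x + 1; }

// What registration stores: a pointer to the instantiation's call.
UnboxedKernel registered = &WrapKernelFunction<&empty_memory_format_impl>::call;
```

Because the wrapped function is a template argument rather than a runtime value, the compiler sees a direct call inside ::call, so the wrapper body can fold away entirely, provided the inliner cooperates.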
Going forward, I hope to reduce the burden on the inliner (and maybe build times too) by outlining more slow paths in dispatcher-related code. For example, does
Dispatcher::callWithDispatchKey need to be inlined into each operator implementation? Can we avoid that and retain decent performance?
Other things that might be nice to have:
- Get rid of BackendSelect. Concretely, it seems like we could just push the
empty_memory_format in RegisterBackendSelect.cpp into the
at::empty implementation in Functions.cpp. I don’t know the subtleties around whether this would work in general, but it seems like it would save a dispatcher round trip for these operators.
- Make the
shouldRunRecordFunction happy path inlineable. Argument registers are caller-saved, so calling
shouldRunRecordFunction before we call out to the operator implementation via a function pointer means that we have to assume our incoming argument registers are all overwritten and move everything back into place before calling the operator implementation.
- Try removing
TensorOptions from the public API in places where we are just going to convert it to a bunch of optionals for c10-full anyway. At worst, we should end up with similar costs overall, and we may save all the time spent unpacking
TensorOptions. IIRC, I tried simply making the
TensorOptions version non-default and defaulting to the c10-full version instead, but the availability of implicit conversions to both
c10::optional<caffe2::TypeMeta> (and other c10-full argument types too) made this tricky.
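The overload-resolution trap here can be sketched with hypothetical stand-in types: when an argument is implicitly convertible to both the TensorOptions-style parameter and the c10-full-style parameter, the call becomes ambiguous.

```cpp
// Stand-in for c10::optional<caffe2::TypeMeta>-like c10-full argument types.
struct FakeScalarTypeOpt {
  FakeScalarTypeOpt(int) {}
};
// Stand-in for TensorOptions, also implicitly constructible from the same type.
struct FakeTensorOptions {
  FakeTensorOptions(int) {}
};

int make_empty(FakeScalarTypeOpt) { return 1; }  // c10-full style overload
int make_empty(FakeTensorOptions) { return 2; }  // TensorOptions overload

// make_empty(3);  // error: ambiguous, 3 converts to both parameter types
```

Neither conversion is better than the other, so overload resolution fails for exactly the call sites you were hoping would silently migrate to the c10-full overload.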