Backend Fallbacks

TL;DR: Backend fallbacks (sometimes called “boxed fallbacks”) have landed and now work across all ops. Their performance will need to improve for some of the use cases we have in mind, but other use cases can benefit from them already.

What are backend fallbacks?

Backend fallbacks are a concept similar to Smalltalk/Ruby’s method_missing. A backend can specify a fallback function that is called whenever an op is called with a Tensor from that backend and the backend doesn’t have an implementation for that op. Since this one fallback function is called for different ops with different signatures, it has to be written in an op-signature-independent way, i.e. it has to be boxed: it receives its arguments and returns its results on a stack of IValues instead of as typed C++ parameters.
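To make the method_missing analogy concrete, here is a minimal sketch of the idea using only the C++ standard library. It is a toy model, not PyTorch's dispatcher: ToyDispatcher, Stack (a vector of doubles standing in for a stack of IValues), BoxedKernel, and the backend names "cpu" and "my_backend" are all made up for illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these names are PyTorch APIs.
struct ToyDispatcher;
using Stack = std::vector<double>;  // boxed arguments and results
using BoxedKernel =
    std::function<void(ToyDispatcher&, const std::string& op, Stack&)>;

struct ToyDispatcher {
  // (backend, op) -> kernel registered for exactly that op
  std::map<std::pair<std::string, std::string>, BoxedKernel> kernels;
  // backend -> one fallback shared by every op
  std::map<std::string, BoxedKernel> fallbacks;
  // ops that reached a fallback, recorded for inspection
  std::vector<std::string> fallback_log;

  void register_kernel(const std::string& backend, const std::string& op,
                       BoxedKernel kernel) {
    kernels[std::make_pair(backend, op)] = std::move(kernel);
  }

  void call(const std::string& backend, const std::string& op, Stack& stack) {
    auto k = kernels.find(std::make_pair(backend, op));
    if (k != kernels.end()) { k->second(*this, op, stack); return; }
    auto f = fallbacks.find(backend);
    if (f != fallbacks.end()) { f->second(*this, op, stack); return; }
    throw std::runtime_error("no kernel for '" + op + "' on '" + backend + "'");
  }
};

ToyDispatcher make_toy_dispatcher() {
  ToyDispatcher d;
  d.register_kernel("cpu", "add",
                    [](ToyDispatcher&, const std::string&, Stack& s) {
    assert(s.size() >= 2);
    double b = s.back(); s.pop_back();
    double a = s.back(); s.pop_back();
    s.push_back(a + b);
  });
  d.register_kernel("cpu", "neg",
                    [](ToyDispatcher&, const std::string&, Stack& s) {
    assert(!s.empty());
    s.back() = -s.back();
  });
  // "my_backend" has no kernels of its own, only one fallback for all ops.
  // The fallback never looks at the op's signature; it only sees the stack.
  d.fallbacks["my_backend"] =
      [](ToyDispatcher& self, const std::string& op, Stack& s) {
    self.fallback_log.push_back(op);
    self.call("cpu", op, s);  // redispatch to the next backend in line
  };
  return d;
}
```

Calling `call("my_backend", "add", stack)` with a stack holding 2 and 3 leaves 5 on the stack and records "add" in `fallback_log`. The same fallback serves "neg", which takes a different number of arguments, because it only ever touches the stack.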

What could we use backend fallbacks for?

  1. Backends that only support a subset of operators: XLA (Ailing Zhang) only supports a subset of operators and currently codegens an XLA kernel for each unsupported op that just calls back into the CPU op. There are plans to remove this codegen and use a backend fallback instead. The new lazy backend that Alex Suhan and Will Constable are working on could also potentially benefit from this. Furthermore, we’re evaluating whether backend fallbacks might be useful for other backends that only support a subset of operators, e.g. hardware accelerators (Martin Yuan, Jiakai Liu). For most of these backends, the performance of the fallback isn’t very important.
  2. Error message customization: If a backend doesn’t have a backend fallback, and there is an operator that the backend doesn’t have a kernel for, but somebody calls it with a tensor from that backend, then this will cause a cryptic error message in the dispatcher saying that there was no kernel found for this dispatch key. Backends can now add backend fallbacks to avoid those errors and instead throw a custom error in the backend fallback. Nested Tensors (Christian Puhrsch) is one of the projects interested in this, but this is going to be useful to many backends.
  3. Autograd: We currently codegen a VariableType kernel for each operator, which does some common logic (i.e. sets up the autograd graph) and then calls into the actual op kernel. This could be implemented with one backend fallback kernel instead of having a codegenerated kernel per op. Doing it this way would make the system simpler and would decrease binary size significantly. This is, however, likely the most difficult project on this list. Due to intricacies and corner cases in autograd, it would require serious engineering effort to come up with a generic implementation of it. Also, this is a use case that is in the hot path and would heavily depend on performance. Nobody is actively looking at this just yet, but it is definitely on our mind.
  4. Tracing: Similar to autograd, we’re codegenerating a tracing kernel for each operator, and similarly we could replace it with a backend fallback for a simpler system and a reduction in binary size. For Tracing, this is likely easier to do than for Autograd, since there are fewer corner cases (although still some) and performance doesn’t matter as much. Brian Hirsh has done some evaluation of the possibility of doing this.
  5. ZeroTensor optimization: We could imagine a world where a “ZeroTensor” is its own backend that can be optimized away from computations by having the ZeroTensor backend implement some optimized kernels while passing other operators to the CPU backend via a backend fallback. This is a mostly exploratory idea that we’re not actively pursuing at the moment. There’s more design needed here. This would also be a use case that heavily depends on performance.
  6. Your project: If you think you have a use case for backend fallbacks, please reach out to us. Backend fallbacks are an incredibly flexible concept, likely with lots of use cases that we haven’t thought about yet. But we definitely want to hear about them.
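The error message customization use case (item 2) can be sketched in a few lines. This is a toy model built on the standard library only, not PyTorch code: ToyBackend, make_toy_nested_backend, error_for, and both error strings are hypothetical and only illustrate how a fallback can replace a generic "no kernel" error with a backend-specific one.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Toy model only: none of these names are PyTorch APIs.
using Kernel = std::function<double(double)>;

struct ToyBackend {
  std::string name;
  std::map<std::string, Kernel> kernels;
  // Optional fallback, called for every op without a kernel.
  std::function<double(const std::string& op, double)> fallback;

  double call(const std::string& op, double x) const {
    assert(!op.empty());
    auto it = kernels.find(op);
    if (it != kernels.end()) return it->second(x);
    if (fallback) return fallback(op, x);
    // Generic error, analogous to the cryptic dispatcher message.
    throw std::runtime_error("no kernel found for dispatch key " + name);
  }
};

ToyBackend make_toy_nested_backend() {
  ToyBackend b;
  b.name = "ToyNested";
  b.kernels["neg"] = [](double x) { return -x; };
  // The fallback does no work; it only produces a more helpful error.
  b.fallback = [](const std::string& op, double) -> double {
    throw std::runtime_error("ToyNested does not support '" + op +
                             "' yet; convert to a regular value first");
  };
  return b;
}

// Returns the error text produced by calling `op`, or "" on success.
std::string error_for(const ToyBackend& b, const std::string& op) {
  try {
    b.call(op, 1.0);
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}
```

With the fallback installed, an unsupported op such as "acos" reports an error that names the op and suggests a workaround, while a backend without a fallback still reports the generic message.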

How can I use backend fallbacks?

It’s actually very easy. Here’s the code you need to write a backend fallback that just always calls back into the dispatcher to the next backend in line:

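#include <torch/library.h>

// Boxed fallback: called for every op that MyBackendKey has no kernel for.
// MyBackendKey is a placeholder for your backend's dispatch key.
void backend_fallback(const c10::OperatorHandle& op,
                      c10::DispatchKeySet dispatch_keys,
                      torch::jit::Stack* stack) {
  // Remove our own key so the dispatcher picks the next backend in line.
  op.redispatchBoxed(
      dispatch_keys.remove(c10::DispatchKey::MyBackendKey),
      stack);
}

// The "_" namespace registers the fallback for all operators.
TORCH_LIBRARY_IMPL(_, MyBackendKey, m) {
  m.fallback(
    torch::CppFunction::makeFromBoxedFunction<
      &backend_fallback>());
}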

Backend fallbacks with fallthrough

Backend fallbacks can also be combined with fallthrough kernels. If you do this, then instead of calling a manually supplied fallback function, the dispatcher will just move on to the next backend and call that whenever your backend doesn’t have a more specific kernel for an op. This is actually even easier to do:

TORCH_LIBRARY_IMPL(_, MyBackendKey, m) {
    m.fallback(torch::CppFunction::makeFallthrough());
}

This is faster than having a manually implemented backend fallback, because it is handled inside the dispatcher when we generate the dispatch table. There isn’t any runtime overhead added by this compared to just directly calling the next backend. See the section on performance below for the measurements verifying this.

Using backend fallbacks with a fallthrough kernel is an easy way to implement backends that want to intercept some ops but not others and do so with a minimal impact on performance.
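The reason a fallthrough can be free at call time is that it can be resolved once, when the dispatch table is built. The following is a minimal sketch of that idea under simplifying assumptions: it is not how PyTorch's dispatcher is implemented, and ToyBackend, Entry, build_table, and make_toy_table are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model only: these types are illustrative, not PyTorch APIs.
using Kernel = std::function<double(double)>;

struct ToyBackend {
  std::string name;
  std::map<std::string, Kernel> kernels;  // ops this backend implements
  bool fallthrough = false;               // models a fallthrough fallback
};

struct Entry {
  std::string backend;  // backend whose kernel was selected
  Kernel kernel;
};

// Resolve every op once. Backends with a fallthrough are skipped here,
// so a later call is a single table lookup with no extra hop.
std::map<std::string, Entry> build_table(
    const std::vector<ToyBackend>& priority_order,
    const std::vector<std::string>& ops) {
  assert(!priority_order.empty());
  std::map<std::string, Entry> table;
  for (const std::string& op : ops) {
    for (const ToyBackend& b : priority_order) {
      auto it = b.kernels.find(op);
      if (it != b.kernels.end()) {
        table.emplace(op, Entry{b.name, it->second});
        break;
      }
      if (!b.fallthrough) break;  // no kernel, no fallthrough: unresolved
    }
  }
  return table;
}

std::map<std::string, Entry> make_toy_table() {
  ToyBackend mine;
  mine.name = "my_backend";
  mine.fallthrough = true;
  mine.kernels["neg"] = [](double x) { return -x; };  // intercepted op

  ToyBackend cpu;
  cpu.name = "cpu";
  cpu.kernels["neg"] = [](double x) { return -x; };
  cpu.kernels["square"] = [](double x) { return x * x; };

  std::vector<ToyBackend> order;
  order.push_back(mine);
  order.push_back(cpu);
  std::vector<std::string> ops;
  ops.push_back("neg");
  ops.push_back("square");
  return build_table(order, ops);
}
```

In the resulting table, "neg" points at the kernel from "my_backend" and "square" points directly at the "cpu" kernel, so calling "square" costs the same as if "my_backend" did not exist.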

How performant are backend fallbacks?

If you use a fallthrough kernel to delegate to a different backend, there is no overhead. If you specify a manual fallback function that does some work and then redispatches to a different backend, expect a 10-20% regression in dispatcher overhead. I haven’t analyzed why it is so expensive, but this path incurs both boxing (to call into the backend fallback) and unboxing (to call from the backend fallback back into the dispatcher), so that could be the culprit. There are likely ways to greatly reduce the overhead. Note that these numbers are dispatch overhead only: the actual impact on models would be much lower, and the overhead is only that large if the backend fallback actually calls back into the dispatcher to redispatch.

Measurements:

I compared 4 experiments.

  • base: The base case, no backend fallback involved. Dispatcher calls directly to the CPU kernel.
  • per-op codegen: This is what XLA does today. The dispatcher calls into an op-specific unboxed backend kernel that then calls into the dispatcher to redispatch to the CPU kernel.
  • fallthrough: Use a generic backend fallback, but instead of implementing it manually, use a fallthrough.
  • backend fallback: Implement a boxed backend fallback that is generic, i.e. works for all ops, and calls the dispatcher to redispatch to the CPU kernel.

I ran those experiments on 2 operators: aten::acos and aten::acos_out.

  • aten::acos_out somewhat underestimates the dispatch overhead because we don’t actually unbox the result argument from the stack and instead just return a reference to the input argument for out overloads.
  • On the other hand, aten::acos heavily overestimates the overhead because it internally calls aten::empty, which goes through the fallback too, so it is closer to an upper bound on 2x the dispatcher overhead.

| aten::acos_out     | instructions  | relative     |
|--------------------|---------------|--------------|
| base               | 3471021       |              |
| per-op codegen     | 3526021       | base + 1.6%  |
| fallthrough        | 3472021       | base + 0%    |
| backend fallback   | 3950021       | base + 13.8% |

| aten::acos         | instructions  | relative     |
|--------------------|---------------|--------------|
| base               | 5230173       |              |
| per-op codegen     | 5389933       | base + 3.1%  |
| fallthrough        | 5232173       | base + 0%    |
| backend fallback   | 7435264       | base + 42.2% |
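The "relative" column is the extra instruction count as a percentage of the base case. A small sketch of that arithmetic, in case you want to check the numbers or apply it to your own measurements (the function names are mine, not from any benchmark tooling):

```cpp
#include <cassert>
#include <cmath>

// Extra instructions of `measured` over `base`, as a percentage of `base`.
double relative_overhead_percent(long long base, long long measured) {
  assert(base > 0);
  return 100.0 * static_cast<double>(measured - base) /
         static_cast<double>(base);
}

// Round to one decimal place, matching the precision used in the tables.
double round1(double x) {
  return std::round(x * 10.0) / 10.0;
}
```

For example, `round1(relative_overhead_percent(3471021, 3950021))` gives 13.8 for the aten::acos_out backend fallback row, and `round1(relative_overhead_percent(5230173, 7435264))` gives 42.2 for aten::acos. If aten::acos is read as roughly two trips through the fallback, half of that is about 21%, which sits at the upper end of the 10-20% estimate.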

Stability concerns

Backend fallbacks have been working in some capacity since the c10-full migration landed in December, but there were some corner case ops they didn’t work with. Since any one backend fallback could potentially be called for every op, this meant we couldn’t land any backend fallbacks without breaking those ops. The stack at PR 55321 fixed all of the corner cases, and backend fallbacks can now be used.

However, we don’t have any backend fallbacks yet and they are not tested by CI. They will be implicitly tested once we actually have a first backend fallback. Until then, PR 53660 is a good way to test whether the system supports backend fallbacks. It just adds a backend fallback to the call path of every operator, and then the whole CI test suite can be run on it. For obvious perf reasons, we don’t want to land a dummy backend fallback that is called in the hot path for every op, so we’ll have to use this PR for testing until a real backend fallback has landed.

BC Concerns

Most of the corner case ops mentioned above followed the same pattern - they relied on a Tensor being passed by mutable reference, the op kernel then switched out the TensorImpl of that tensor for a new TensorImpl, and the caller expected to observe the change. This worked, because they called the op in an unboxed way with a mutable Tensor reference, and the dispatcher passed through that Tensor reference directly to the unboxed kernel. But now, with backend fallbacks, this goes through a boxing+unboxing step in the middle. This pattern cannot survive boxing/unboxing because IValue doesn’t preserve the reference to the Tensor object. It only preserves the TensorImpl below it, but it copies or moves the Tensor object to/from the stack.

Most ops that were fixed (reduction ops, aten::kron_out) actually followed a simpler variant of this pattern: they had out overloads that allowed the out argument to be an undefined tensor. If it was, the op would allocate an out tensor for you, and the caller would expect to observe it in their copy of the Tensor since they passed the Tensor in via a mutable reference. But there was also _linalg_solve_out_helper_cuda, which was not passed an undefined tensor. Instead, the kernel sometimes noticed that it actually wanted the out argument (which was passed in as a GPU Tensor) to live on the CPU, so it created a new TensorImpl and assigned it to the out tensor reference.

This pattern will not work in a world with backend fallbacks. While we have now (hopefully) fixed all internal use cases (well, the ones that had unit tests in CI), external operator libraries or custom ops might follow the same pattern, and they will break once we introduce a backend fallback.
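If you maintain custom ops and want to check whether you rely on this pattern, the following toy model shows the failure mode. It uses only the standard library: ToyTensor stands in for a Tensor handle, ToyImpl for a TensorImpl, and a std::vector of ToyTensor for the boxed stack. None of this is the real at::Tensor or IValue machinery, and all names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy model only: not the real at::Tensor, TensorImpl, or IValue.
struct ToyImpl {
  int value = 0;
};

struct ToyTensor {
  std::shared_ptr<ToyImpl> impl;  // null models an undefined tensor
  bool defined() const { return impl != nullptr; }
};

// Problematic pattern: the kernel swaps in a new impl behind `out`.
void toy_kernel_swap(ToyTensor& out) {
  out.impl = std::make_shared<ToyImpl>();
  out.impl->value = 42;
}

// Safe pattern: the kernel writes into the impl that `out` already has.
void toy_kernel_inplace(ToyTensor& out) {
  assert(out.defined());
  out.impl->value = 42;
}

// Unboxed call: the caller's reference reaches the kernel directly.
void call_unboxed_swap(ToyTensor& out) {
  toy_kernel_swap(out);
}

// Boxed call: the handle is copied onto a stack and the kernel only
// sees that copy, so swapping the impl is invisible to the caller.
void call_boxed_swap(ToyTensor& out) {
  std::vector<ToyTensor> stack;
  stack.push_back(out);
  toy_kernel_swap(stack.back());
}

// Boxed call with the safe kernel: the copy shares the same impl, so
// the write is visible through the caller's handle.
void call_boxed_inplace(ToyTensor& out) {
  std::vector<ToyTensor> stack;
  stack.push_back(out);
  toy_kernel_inplace(stack.back());
}
```

After `call_unboxed_swap` on an undefined ToyTensor, the caller's tensor is defined and holds 42. After `call_boxed_swap`, it is still undefined. Writing into an already allocated impl, as in `call_boxed_inplace`, survives the round trip through the stack.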
