Skipping Dispatcher with LazyTensor

Overview

We wanted to measure the overhead of the dispatcher and its impact on lazy tensor core (LTC). To that end, we implemented LTC tracing as close to the Python boundary as we could.

We made a patch that skips the entire dispatching process and directly calls LazyNativeFunctions from the THPVariable_* methods/functions. This patch gives a speedup of about 10% in our microbenchmarks when not executing the graph. See the results below.

Please note that this is a simple proof-of-concept patch, so the benchmark results should be read as an upper bound on the achievable speedup. We did not implement support for other dispatch keys, such as autograd, autodiff, etc.
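To make the idea concrete, here is a toy, PyTorch-free sketch of what "skipping the dispatcher" means: instead of resolving the kernel through a key-based table lookup on every call, the Python binding branches once and calls the lazy kernel directly. All names here are hypothetical stand-ins; the real dispatcher is far more involved.

```python
# Toy model: key-based dispatch vs. a direct call from the binding.
# All names are hypothetical; the real PyTorch dispatcher is far more involved.

def eager_add(a, b):            # stand-in for the eager CPU kernel
    return a + b

def lazy_add(a, b):             # stand-in for LazyNativeFunctions-style tracing
    return ("lazy_add", a, b)   # record the op instead of executing it

DISPATCH_TABLE = {"CPU": eager_add, "Lazy": lazy_add}

def dispatched_add(key, a, b):
    # Normal path: resolve the kernel for the active dispatch key on every call.
    return DISPATCH_TABLE[key](a, b)

def bypass_add(is_lazy, a, b):
    # Patched path: branch once in the binding and call the kernel directly.
    return lazy_add(a, b) if is_lazy else eager_add(a, b)

assert dispatched_add("Lazy", 1, 2) == bypass_add(True, 1, 2)
assert dispatched_add("CPU", 1, 2) == bypass_add(False, 1, 2) == 3
```

Both paths produce the same result; the bypass simply removes the per-call lookup (and, in the real patch, everything else the dispatcher does on the way down).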

Details

To inject code in the THPVariable_* functions, we patched the autogen template of gen_python_functions.py. For example, the generated THPVariable_add function looks as follows:

// add
static PyObject * THPVariable_add(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  // ... (argument parsing; `self` is unpacked from `self_`)
  switch (_r.idx) {
    case 0: {
    // [deprecated] ...
    }
    case 1: {
      auto dispatch_add = [](const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) -> at::Tensor {
        pybind11::gil_scoped_release no_gil;
        if (self.is_lazy() && other.is_lazy()) {
          return torch::lazy::lazy_add(self, other, alpha);
        }
        return self.add(other, alpha);
      };
      return wrap(dispatch_add(self, _r.tensor(0), _r.scalar(1)));
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}

The injected lazy_add calls the corresponding LazyNativeFunctions function directly, thus skipping the dispatcher altogether.

Full Benchmark Results

The benchmarks we ran are from the Torchy repository. They execute sequences of fusible operations with varying tensor sizes: combinations of add/sub/mul/div, 8 to 32 times per iteration, with execution time compared against the original LTC implementation. We also varied the input size (square matrices with n = 100/1k/10k) to see its effect.
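The shape of such a harness can be sketched as follows. This is a hypothetical, PyTorch-free skeleton: `run_op_sequence` stands in for the chain of fusible add/sub/mul/div ops, and the real benchmark runs them on lazy tensors.

```python
# Hypothetical benchmark skeleton; the real harness runs lazy-tensor ops.
import time

def run_op_sequence(n, n_ops):
    # Stand-in for a chain of n_ops fusible add/sub/mul/div ops
    # on an n x n input (n is unused in this toy version).
    x = 1.0
    for i in range(n_ops):
        x = x + 1.0 if i % 2 == 0 else x * 0.5
    return x

def bench(n, n_ops, iters=100):
    # Average wall-clock time per iteration.
    start = time.perf_counter()
    for _ in range(iters):
        run_op_sequence(n, n_ops)
    return (time.perf_counter() - start) / iters

# Sweep the same grid as the text: 8-32 ops, n = 100 / 1k / 10k.
results = {(n, n_ops): bench(n, n_ops)
           for n in (100, 1_000, 10_000)
           for n_ops in (8, 16, 32)}
```

The same grid is then repeated for the patched and unpatched builds, and the per-cell ratio gives the speedup.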

The raw data is in this link: LazyTensor Comparison

Results without graph execution

In this benchmark, we only generated the LTC graph and did not execute the operations themselves. The results show that bypassing the dispatcher gives a consistent 10-12% speedup. There is a slight performance degradation in the large-size/low-iteration case, but it is within the noise.

Results with graph execution

For this test, we forced the execution of the graph by using result.to(device='cpu') at the end of each iteration.

Even though the speedups are lower than in the previous case, we find that the bypass is still effective to a degree. The speedup shrinks as the input size grows, which is expected, since the execution time of the operations themselves starts to dominate the total time.
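The trend follows from Amdahl-style arithmetic: the dispatch overhead per op is roughly fixed, so as kernel time grows with the input, the fraction of time saved by removing that overhead shrinks. A small worked example (the numbers are made up for illustration):

```python
def speedup(kernel_us, dispatch_us):
    # Speedup from removing a fixed per-op dispatch overhead:
    # time before = kernel + dispatch, time after = kernel only.
    return (kernel_us + dispatch_us) / kernel_us

# Hypothetical numbers: 1 us of dispatch overhead per op.
small = speedup(kernel_us=9.0, dispatch_us=1.0)   # cheap kernels: ~1.11x
large = speedup(kernel_us=99.0, dispatch_us=1.0)  # expensive kernels: ~1.01x
assert small > large
```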

Note: benchmarks run with PyTorch 1.12-dev (git hash fc66521ebddeb2f0cf711a0bddabae412bf92923).

Todos and Conclusion

In conclusion, we observe that the dispatcher imposes a non-trivial overhead on LTC. Thus, we suggest that the LTC process should not depend on the current dispatcher, which was designed for eager tensors.

As we are skipping the entire dispatching process, we have missed some features like autocast, debugging and logging. Nevertheless, this benchmark shows the overhead of the dispatcher, and we believe the missing features can be implemented with further work.

We would appreciate any suggestions, feedback, etc. Also, would there be interest in merging a feature-complete version of this patch at some point?

Thank you!

Thanks a lot for sharing this! I’m excited to see the data that shows dispatcher’s role in the trace overhead.

We would appreciate any suggestions, feedback, etc. Also, would there be interest in merging a feature-complete version of this patch at some point?

I would consider two points before trying to merge this.

  1. Is this going to be useful / add additional value on top of using torchdynamo + lazytensor? The current proposal (though it is early) is to support a mode where dynamo associates a lazily traced program with its own guards, making it safe to skip lazy tracing on iteration 2 and jump directly to the compiled computation as long as dynamo's guards pass. With this approach, we can skip not only the dispatcher overhead but all the trace overhead, and even some overhead originating in Python.

  2. What would it take to make this 100% safe/consistent with eager pytorch behavior? Mainly, there are probably cases with non-trivial code between the THPVariable_foo binding and the underlying foo operator. In these cases, jumping directly to the lazy trace could make the lazy tensor behave differently from eager. Can we avoid this and keep things consistent?

Thanks Will!
The integration with dynamo is a good point. If it manages to bundle a few dozen instructions together on average, then the overhead may not be noticeable anymore. At least not with the current crop of models dominated mostly by big operations.

Is this something we can already run? Or is it under development right now? We would be happy to contribute to that effort if it makes sense.

Regarding your 2nd question, I don’t know the answer. But any weird behavior should probably be fixed. I share your concerns, but with enough fuzzing I think we can get rid of all the issues, or at least have an idea of how big the issue is.

I believe that with functorch's aot_autograd, the forward and backward dispatch of an nn.Module can be traced into fx graphs, and the traced graphs can be wrapped in a custom torch.autograd.Function to integrate with PyTorch's native autograd (eager mode). When the compiled graphs get called, we completely skip the dispatcher inside the graph.

I am just wondering when it will be able to work with LTC :slight_smile:

@shunting and Jack Cao (google) are working on prototypes of LTC/XLA with TorchDynamo. Dynamo allows skipping the python/lazy-tracing path for subsequent iterations, so in a way it is similar to what you are talking about. Would you be interested in trying that instead?

Thanks for the heads up, Will!
I’ve been experimenting with Dynamo a lot lately. Seems quite stable. Doesn’t handle branching, though. So, there’s certainly an opportunity for LTC there (and/or combine with speculative execution in Dynamo).
Hoyoung (@MerHS) is playing with something else at the moment (besides attending PhD classes), but we will get back to you once he’s a bit more free.
Thanks!

@nunoplopes Out of curiosity, what do you mean by “doesn’t handle branching”? And how would LazyTensor resolve this problem?

Dynamo doesn’t handle branching in the sense that it doesn’t create frames that span ‘if’ statements, etc. Instead, it handles control flow by breaking execution into multiple frames.
Right now, there’s no optimization across frames. NNC doesn’t support control flow well anyway, so it’s not a big issue for it. But maybe other backends support control flow better (I don’t know; I’m most familiar with NNC).

LazyTensor allows you to delay compilation until it is strictly needed, which means it can transparently traverse control-flow if possible.
If you combine both Dynamo & LTC, you end up with macro blocks with control-flow between them. Dynamo feeds these large blocks to LTC as a single instruction, and LTC can then merge multiple of these across control-flow if possible.

This issue can potentially be “fixed” in Dynamo through, e.g., optimistic trace inlining (guided by statistics, etc.). But still, if a model is very dynamic, LTC is the way to go.
I see a clear benefit in merging Dynamo & LTC at some point. Very dynamic models aren’t common yet, but they may well be in the future. They don’t exist right now because the performance isn’t good, but history shows that better compilers enable crazy new applications. See what happened with Web 2.0, which was enabled by better JavaScript compilers.

I feel there’s a bit too much stuff going on right now, so it’s hard to jump in and help out. When the train stabilizes, we are more than happy to help out.

Ok, sorry for the long text. Got too excited :slight_smile:

But in LazyTensor, isn’t control flow (on data-dependent values) when “it is strictly needed”?

Yep, data-dependent branching forces compilation, at least with how things work currently.
But not all branches are data dependent. For example, some optimizers execute different branches based on the iteration number.
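For example (a hypothetical, PyTorch-free sketch), a learning-rate warmup schedule branches on the step counter, which is a plain Python int, not tensor data, so the branch direction is known without forcing graph execution:

```python
# Hypothetical example: a branch on the iteration number, not on tensor data.
def lr_at_step(step, base_lr=0.1, warmup_steps=5):
    if step < warmup_steps:
        # Linear warmup branch, taken only for the first few iterations.
        return base_lr * (step + 1) / warmup_steps
    return base_lr

assert lr_at_step(0) == 0.02   # warmup branch
assert lr_at_step(10) == 0.1   # steady-state branch
```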

@nunoplopes I think in many of the cases where we have branches, Dynamo would also avoid graph breaking. I think there are some issues around how we handle iteration-number-style ints, but those should be resolvable as well.

I agree that there are certainly some cases where Dynamo would be forced to pessimistically graph break where LazyTensor wouldn’t, but I think those cases are relatively few.