Skipping Dispatcher with LazyTensor

Overview

We wanted to measure the overhead of the dispatcher and its impact on lazy tensor core (LTC). To do so, we implemented tracing for LTC as close to Python as we could.

We made a patch that skips the entire dispatching process and directly calls LazyNativeFunctions from the THPVariable_* methods/functions. This patch gives a speedup of about 10% in our microbenchmarks when not executing the graph. See the results below.

Please note that this is a simple proof-of-concept patch, so the benchmark results likely represent the highest achievable speedup. We did not implement support for other dispatch keys, such as autograd, autodiff, etc.

Details

To inject code into the THPVariable_* functions, we patched the code-generation template in gen_python_functions.py. For example, the generated THPVariable_add function looks as follows:

// add
static PyObject * THPVariable_add(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  // ... (HANDLE_TH_ERRORS, argument parsing into _r, unpacking self from self_)
  switch (_r.idx) {
    case 0: {
      // [deprecated] ...
    }
    case 1: {
      auto dispatch_add = [](const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) -> at::Tensor {
        pybind11::gil_scoped_release no_gil;
        if (self.is_lazy() && other.is_lazy()) {
          // Injected fast path: call the lazy kernel directly, bypassing the dispatcher.
          return torch::lazy::lazy_add(self, other, alpha);
        }
        // Regular path: goes through the c10 dispatcher.
        return self.add(other, alpha);
      };
      return wrap(dispatch_add(self, _r.tensor(0), _r.scalar(1)));
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}

The injected lazy_add helper calls the corresponding LazyNativeFunctions kernel directly, and thus skips the dispatcher altogether.
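
For reference, here is a minimal sketch of what such a helper could look like. This is hypothetical: the real code is emitted by the patched template, and we assume the codegen’d torch::lazy::LazyNativeFunctions::add entry point for add.Tensor.

#include <ATen/ATen.h>
// Generated at build time by the lazy tensor codegen.
#include <torch/csrc/lazy/generated/LazyNativeFunctions.h>

namespace torch {
namespace lazy {

at::Tensor lazy_add(const at::Tensor& self, const at::Tensor& other,
                    const at::Scalar& alpha) {
  // Call the lazy kernel directly instead of going through the c10
  // dispatcher; this only records the op into the in-progress LTC graph.
  return LazyNativeFunctions::add(self, other, alpha);
}

} // namespace lazy
} // namespace torch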

Full Benchmark Results

The benchmarks we ran are from the Torchy repository. They execute sequences of fusible operations with varying tensor sizes. We ran combinations of add/sub/mul/div, repeated 8 to 32 times per iteration, and compared the execution time against the original LTC implementation. We also varied the input size (square matrices with n = 100/1k/10k) to measure its effect.
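
To illustrate the pattern, one possible benchmark iteration roughly amounts to the following. This is a hypothetical C++ sketch, not the actual Torchy benchmark code; run_iteration, a, b, and k are made-up names.

#include <torch/torch.h>

// Hypothetical sketch of one benchmark iteration: a chain of fusible
// element-wise ops on n x n tensors, repeated k times (k = 8..32 here).
at::Tensor run_iteration(const at::Tensor& a, const at::Tensor& b, int k) {
  at::Tensor r = a;
  for (int i = 0; i < k; ++i) {
    r = r.add(b).sub(b).mul(b).div(b);
  }
  return r;  // with lazy inputs, this only records an LTC graph
}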

The raw data is in this link: LazyTensor Comparison

Results without graph execution

In this benchmark, we only built the LTC graph and did not execute the operations themselves. The results show that bypassing the dispatcher gives a consistent 10-12% speedup. There is a slight performance degradation in the large-size/low-iteration case, but it is within the noise.

Results with graph execution

For this test, we forced the execution of the graph by using result.to(device='cpu') at the end of each iteration.
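
In C++ terms (again a hypothetical sketch; force_execution is a made-up name), this corresponds to copying the lazy result back to the CPU device:

#include <torch/torch.h>

// Hypothetical sketch: copying the lazy result to the CPU device forces
// LTC to compile and execute the recorded graph.
at::Tensor force_execution(const at::Tensor& lazy_result) {
  return lazy_result.to(at::kCPU);
}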

Even though the speedups are lower than in the previous case, the bypass is still effective to a degree. The speedup shrinks as the input size grows, which is expected, since the execution time of the operations themselves starts to dominate the total time.

Note: benchmarks run with PyTorch 1.12-dev (git hash fc66521ebddeb2f0cf711a0bddabae412bf92923).

Todos and Conclusion

In conclusion, we observe that the dispatcher imposes a non-trivial overhead on LTC. Thus, we suggest that LTC tracing should not depend on the current dispatcher, which was designed for eager tensors.

As we skip the entire dispatching process, we lose some features, such as autocast, debugging, and logging. Nevertheless, this benchmark shows the overhead of the dispatcher, and we believe the missing features can be added with further work.

We would appreciate any suggestions, feedback, etc. Also, would there be interest in merging a feature-complete version of this patch at some point?

Thank you!


Thanks a lot for sharing this! I’m excited to see data that shows the dispatcher’s role in the trace overhead.

We would appreciate any suggestions, feedback, etc. Also, would there be interest in merging a feature-complete version of this patch at some point?

I would consider two points before trying to merge this.

  1. Is this going to be useful / add additional value on top of using torchdynamo + lazytensor? The current proposal (though it is early) is to support a mode where dynamo associates a lazily traced program with its own guards, making it safe to skip lazy tracing on iteration 2 and jump directly to the compiled computation as long as dynamo’s guards pass. With this approach, we can skip not only the dispatcher overhead but all the trace overhead, and even some overhead originating in Python.

  2. What would it take to make this 100% safe/consistent with eager PyTorch behavior? Mainly, there are probably cases where there is non-trivial code between THPVariable_foo and the underlying foo operator. In these cases, jumping directly to the lazy trace could make the lazy tensor behave differently from eager. Can we avoid this and keep things consistent?

Thanks Will!
The integration with dynamo is a good point. If it manages to bundle a few dozen instructions together on average, then the overhead may not be noticeable anymore, at least not with the current crop of models, which are dominated mostly by big operations.

Is this something we can already run, or is it still under development? We would be happy to contribute to that effort if it makes sense.

Regarding your 2nd question, I don’t know the answer. But any weird behavior should probably be fixed. I share your concerns, but with enough fuzzing I think we can get rid of all the issues, or at least have an idea of how big the issue is.