Overview
We wanted to measure the overhead of the dispatcher and its impact on Lazy Tensor Core (LTC). To that end, we implemented LTC tracing as close to Python as possible: a patch that skips the entire dispatching process and calls LazyNativeFunctions directly from the THPVariable_* methods. This patch gives a speedup of about 10% in our microbenchmarks when the graph is not executed. See the results below.
Please note that this is a simple proof-of-concept patch, so the benchmark results should be read as an upper bound on the speedup. We did not implement support for other dispatch keys, such as autograd, autodiff, etc.
Details
To inject code into the THPVariable_* functions, we patched the autogen template in gen_python_functions.py. For example, the generated THPVariable_add function looks as follows:
// add
static PyObject * THPVariable_add(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  // ...
  switch (_r.idx) {
    case 0: {
      // [deprecated] ...
    }
    case 1: {
      auto dispatch_add = [](const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) -> at::Tensor {
        pybind11::gil_scoped_release no_gil;
        if (self.is_lazy() && other.is_lazy()) {
          return torch::lazy::lazy_add(self, other, alpha);
        }
        return self.add(other, alpha);
      };
      return wrap(dispatch_add(self, _r.tensor(0), _r.scalar(1)));
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
The injected lazy_add calls the corresponding LazyNativeFunctions function directly, and thus skips the dispatcher altogether.
Full Benchmark Results
The benchmarks we ran are from the Torchy repository. They execute sequences of fusible operations with varying tensor sizes. We executed combinations of add/sub/mul/div, 8 to 32 times per iteration, and compared the execution time against the original LTC implementation. We also varied the input size (square matrices with n=100/1k/10k) to see its effect.
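A minimal sketch of this methodology (the names below are ours, not from the Torchy repository): each iteration runs a chain of fusible elementwise ops, the two implementations are timed, and the relative speedup is reported.

```python
import timeit

def op_chain(x, y, n_iters=8):
    """One benchmark iteration: an add/sub/mul/div sequence, as in the text.
    In the real benchmark x and y would be lazy tensors, not scalars."""
    for _ in range(n_iters):
        x = (x + y - y) * y / y
    return x

def speedup_pct(t_baseline, t_patched):
    """Relative speedup of the patched path over the baseline, in percent."""
    return (t_baseline - t_patched) / t_baseline * 100.0

# Illustrative timing of one configuration (32 ops per iteration).
t_base = timeit.timeit(lambda: op_chain(1.5, 2.0, 32), number=10_000)

# With hypothetical timings, a patched run ~11% faster than baseline:
print(round(speedup_pct(1.00, 0.89), 1))  # -> 11.0
```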
The raw data is in this link: LazyTensor Comparison
Results without graph execution
In this benchmark, we only built the LTC graph and did not execute the operations themselves. The results show that bypassing the dispatcher gives a consistent 10-12% speedup. There is a slight performance degradation in the large-size/low-iteration case, but it is within the noise.
Results with graph execution
For this test, we forced execution of the graph with result.to(device='cpu') at the end of each iteration.
Although the speedups are lower than in the previous case, the bypass is still effective to a degree. The speedup shrinks as the input size grows, which is expected: the execution time of the operations themselves starts to dominate the total time.
Note: benchmarks run with PyTorch 1.12-dev (git hash fc66521ebddeb2f0cf711a0bddabae412bf92923).
Todos and Conclusion
In conclusion, we observe that the dispatcher imposes a non-trivial overhead on LTC. Thus, we suggest that LTC should not depend on the current dispatcher, which was designed for eager tensors.
Because we skip the entire dispatching process, we lose some features such as autocast, debugging, and logging. Nevertheless, this benchmark demonstrates the overhead of the dispatcher, and we believe the missing features can be implemented with further work.
We would appreciate any suggestions, feedback, etc. Also, would there be interest in merging a feature-complete version of this patch at some point?
Thank you!