Back in September we posted about TorchDynamo: An Experiment in Dynamic Python Bytecode Transformation. To recap, TorchDynamo hooks into the frame evaluation API in CPython to dynamically modify Python bytecode right before it is executed. It rewrites Python bytecode to extract sequences of PyTorch operations into an FX Graph, which is then just-in-time compiled with a user-defined compiler. It creates this FX Graph through bytecode analysis, not tracing, and is designed to generate smaller graph fragments that can be mixed with Python execution.
Since the last post we have grown the benchmark set to all of TorchBench (47 models), increased the operator coverage metric from the 60% reported last time to 80%, and built a proof-of-concept backend to demonstrate performance improvements with this approach.
For any new frontend like this, it is vitally important to demonstrate performance improvements to validate the approach. To this end, we decided to start with CPU inference; however, the approach generalizes to training and other devices. The backend works as follows:
- Given an FX graph by TorchDynamo, first hash that graph into a graph key
- If the graph key is not in the subgraph database, add it to the subgraph database and run the graph eagerly
- If the graph key is in the subgraph database and has been optimized, use the backend and schedule stored in the subgraph database
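The lookup logic above can be sketched in plain Python. The class and method names here are hypothetical illustrations, not TorchDynamo's actual internals, and the graph is stood in for by a string where the real system hashes an FX graph:

```python
import hashlib


class SubgraphDatabase:
    """Hypothetical sketch of the graph-key dispatch described above."""

    def __init__(self):
        # graph key -> optimized callable, or None if not yet autotuned
        self.entries = {}

    def key(self, graph_repr: str) -> str:
        # hash the (serialized) graph into a stable graph key
        return hashlib.sha256(graph_repr.encode()).hexdigest()

    def dispatch(self, graph_repr, eager_fn, *args):
        k = self.key(graph_repr)
        if k not in self.entries:
            # unseen graph: record it for offline autotuning, run eagerly
            self.entries[k] = None
            return eager_fn(*args)
        optimized = self.entries[k]
        if optimized is not None:
            # graph has been autotuned: use the stored backend
            return optimized(*args)
        # known but not yet optimized: still run eagerly
        return eager_fn(*args)
```

The key point of the design is that the hot path is a dictionary lookup; all expensive tuning happens offline.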
An offline autotuning step then iterates through the subgraph database and optimizes the graphs found there. The offline autotuner currently has 6 backends:
- TorchScript Runtime
- Static Runtime
- ONNX Runtime
- TVM (with and without autotuning)
- IPEX (on AVX512 machines)
It operates by running each of those backends (and in some cases calling other autotuners) and selecting the fastest one for each subgraph. It performs validation and correctness checking, since some backends produce incorrect results or crash. We observe that each backend is good in different places and there is no one clear winner; combining the backends can lead to better results than any one backend alone. Note that most of these backends require shape specialization; however, TorchDynamo supports both shape specialization and dynamic shapes based on a flag.
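A minimal sketch of this select-the-fastest loop, with correctness checking against the eager result. The function names are hypothetical, and the scalar comparison is a simplification (the real autotuner compares tensors):

```python
import time


def autotune(compile_backends, subgraph, example_inputs, reference_fn, tol=1e-4):
    """Run each backend on the subgraph, validate its output against the
    eager (reference) result, and return the fastest correct backend.
    Falls back to the eager function if no backend passes."""
    expected = reference_fn(*example_inputs)
    best_fn, best_time = reference_fn, float("inf")
    for name, compile_fn in compile_backends.items():
        try:
            compiled = compile_fn(subgraph)
            # correctness check: some backends produce wrong results
            if abs(compiled(*example_inputs) - expected) > tol:
                continue
            start = time.perf_counter()
            for _ in range(10):  # time repeated runs to reduce noise
                compiled(*example_inputs)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_fn, best_time = compiled, elapsed
        except Exception:
            continue  # a crashing backend is simply skipped
    return best_fn
```

Because validation happens per subgraph, a backend that is broken on one graph fragment can still win on another, which is how combining backends beats any single one.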
The attached figure shows speedup over eager mode for CPU inference measured on an Intel Core i7-8086K. Each result is the median of 100 runs to reduce noise.
TorchDynamo provides a 1.48x geomean speedup (and a 3.73x mean speedup), compared to 1.31x geomean (3.39x mean) for optimize_for_inference. More importantly, TorchDynamo runs correctly on all 47 models, while optimize_for_inference errors on more than half of the models. These errors represent usability issues that force users to change their models and slow down development.
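As an aside on the reporting: the geometric mean is the appropriate average for speedup ratios, which is why we quote it alongside the arithmetic mean. The numbers below are illustrative, not the benchmark data:

```python
import math
import statistics


def geomean(speedups):
    # geometric mean: exp of the arithmetic mean of the logs
    return math.exp(statistics.fmean(math.log(s) for s in speedups))
```

A few large per-model speedups inflate the arithmetic mean well above the geomean, which is why the two numbers differ so much.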
It is still extremely early for TorchDynamo, and it is not ready for production usage; however, these early results are very promising. They show that the best of both worlds is possible: we can support the full dynamism and user experience of PyTorch and Python while still getting performance similar to or better than more restrictive graph-mode runtimes.