Since September 2021, we have working on an experimental project called TorchDynamo. TorchDynamo is a Python-level JIT compiler designed to make unmodified PyTorch programs faster. TorchDynamo hooks into the frame evaluation API in CPython to dynamically modify Python bytecode right before it is executed. It rewrites Python bytecode in order to extract sequences of PyTorch operations into an FX Graph which is then just-in-time compiled with an ensemble of different backends and autotuning. It creates this FX Graph through bytecode analysis, not tracing, and is designed to generating smaller graph fragments that can be mixed with Python execution.
Our first post, TorchDynamo: An Experiment in Dynamic Python Bytecode Transformation, introduced the concept and the approach.
Our second post, TorchDynamo Update: 1.48x geomean speedup on TorchBench CPU Inference, shared the first performance results and introduced our ensemble-based proof of concept backend.
This post is primarily a performance update, as we have continued to develop on the strategy described in the earlier posts. Notable changes since then:
- Added GPU support.
- Increased support of Python bytecodes.
- Added new backends, including: nvfuser, cudagraphs, onnxruntime-gpu, tensorrt (fx2trt/torch2trt/onnx2trt), and tensorflow/xla (via onnx).
- Imported new benchmarks added to TorchBenchmark, including 2 that TorchDynamo fails on, which should be fixed soon.
- Switch to measuring on different machines, which made GPU possible and also changed the CPU results due to a jump from 12 to 96 threads and adding AVX512 hardware.
Performance Results
With that, on to the numbers! Attached you will find updated performance for both GPU and CPU inference. This includes the following baselines for comparison:
-
NNC: TorchScript with
fuser1
enabled -
NNC + OFI: TorchScript with
fuser1
and optimize_for_inference -
nvFuser: TorchScript with
fuser2
enabled -
nvFuser + OFI: TorchScript with
fuser2
and optimize_for_inference
Each number is the median of 100 measurements and is normalized to speedup over eager mode.
The first thing that still jumps out at me is the difference is model coverage between TorchScript based backends and TorchDynamo/Eager. Except for eager mode (100%) and TorchDynamo (96%), no backend works on more than half of models. This reflects a massive usability gap we have between eager mode and existing graph mode backends.
TorchDynamo provides larger average speedups than the other backends shown. On GPU TorchDynamo provides a 1.29x geometric mean speedup and on CPU TorchDynamo provides a 1.71x geometric mean speedup. These results show TorchDynamo is faster on average while maintaining high model coverage.
For a bit more raw data to figure out what TorchDynamo is doing, the following are the counts of how often TorchDynamo used each GPU backend:
('eager', 323)
('cudagraphs', 161)
('nvfuser', 62)
('ofi', 58)
('nnc', 50)
('tensorrt', 36)
('onnx2tf', 2)
('onnxrt_cuda', 1)
And here are the same counts for CPU backends:
('eager', 314)
('ts', 157)
('ofi', 149)
('onnxrt_cpu', 24)
('onnx2tf', 24)
One should take these numbers with a grain of salt. The size of these graphs vary dramatically and a small subset of the graphs matter much more than others. Eager is the most commonly selected backend, because 1) other backends don’t support many graphs; and 2) eager often outperforms graph based backends. The biggest area I see for further performance improvement is to break graphs at unsupported ops in order to increase backend choice.
Conclusions
It remains early for TorchDynamo, and it is not ready for production usage, however these early results continue to be extremely promising. It shows that the best of both worlds is possible, where we can support the full dynamism and user experience of PyTorch and Python, but still get performance similar to or better than more restrictive graph mode runtimes.