PyTorch 2.x Inference Recommendations
PyTorch 2.x introduces a range of new technologies for model inference, and it can be overwhelming to figure out which one is most appropriate for your particular use case. This guide aims to provide clarity and guidance on the available options.
torch.compile
torch.compile speeds up PyTorch code by JIT-compiling it into optimized kernels with a simple API. It optimizes the given model using TorchDynamo and creates an optimized graph, which is then lowered to the target hardware using the backend specified in the API. The default backend in torch.compile is TorchInductor. For more information, please refer to this tutorial.
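As a minimal sketch of typical usage (the model and input shape are placeholders, and torchvision is assumed to be installed):

```python
import torch
import torchvision.models as models  # example model source, for illustration only

model = models.resnet18().eval()       # any nn.Module works here
compiled_model = torch.compile(model)  # default backend: "inductor" (TorchInductor)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = compiled_model(x)            # first call triggers JIT compilation
```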
AOTInductor (CPP)
AOTInductor (AOTI) is a specialized version of TorchInductor that takes a PyTorch exported program, optimizes it, and produces a shared library artifact that can be used in a non-Python deployment environment. We use torch.export to capture the model into a computational graph and then use AOT compilation to generate the shared library, which can be loaded and executed in a C++ deployment environment. More details can be found in this tutorial.
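The Python-side compilation step might look roughly like the sketch below; the model, input shapes, and use of torch._export.aot_compile follow the AOTInductor tutorial for PyTorch 2.x, and the exact API may change between releases.

```python
import torch

class MyModel(torch.nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = MyModel().eval()
example_inputs = (torch.randn(2, 16),)

# Ahead-of-time compile the model into a shared library (.so) that a
# C++ runtime can later load; returns the path to the generated artifact.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)
print(so_path)
```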
AOTInductor (Python)
When the shared library generated by AOTI needs to be executed in a Python runtime, AOTI provides a Python API to do so. torch._export.aot_load is used to load and execute the shared library generated by AOT compilation. For further details, please consult this tutorial.
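A sketch of the loading path in Python; the shared-library path is a placeholder for the artifact produced by the compile step above, and the library must have been compiled for the device it is loaded on:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the AOTI-generated shared library and call it like a regular module.
# "model.so" is a placeholder path from the earlier compile sketch.
runner = torch._export.aot_load("model.so", device)

x = torch.randn(2, 16, device=device)
with torch.no_grad():
    out = runner(x)
```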
ExecuTorch
ExecuTorch is an end-to-end PyTorch platform that provides the infrastructure to run PyTorch programs on edge devices, ranging from AR/VR wearables to Android and iOS mobile devices. It relies heavily on torch.compile and torch.export. ExecuTorch provides exhaustive documentation on the end-to-end flow for various hardware platforms.
In the table below, you can find PyTorch's recommendations on which technology to use, depending on the platform where inference runs and the use case being served. Please note that some of the export-related APIs mentioned above may change.
Inference Recommendations
| No. | Platform | Use Case | Recommendation |
|---|---|---|---|
| 1 | Mobile (iOS and Android) | Highly optimized inference | ExecuTorch |
| 2 | Embedded and other edge devices | Highly optimized inference | ExecuTorch |
| 3 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with no graph breaks | AOTI (Python) |
| 4 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with graph breaks | torch.compile |
| 5 | Server (when AOTI is not supported) + Python deployment | High-throughput, highly optimized inference | torch.compile |
| 6 | Server (x86, CUDA, aarch64) + non-Python deployment | High-throughput, highly optimized inference with no graph breaks | AOTI (CPP) |
| 7 | Server (x86, CUDA, aarch64) + Python deployment | When a non-Inductor backend has the best latency | torch.compile |
| 8 | Mac (M1/M2/M3) | Local development and inference | eager |