PyTorch 2.x Inference Recommendations

PyTorch 2.x introduces a range of new technologies for model inference, and it can be overwhelming to figure out which one is most appropriate for your particular use case. This guide aims to provide clarity and guidance on the various options available.

torch.compile

torch.compile speeds up PyTorch code by JIT-compiling it into optimized kernels through a simple API. It captures the given model with TorchDynamo and creates an optimized graph, which is then lowered to the target hardware using the backend specified in the API. The default backend in torch.compile is TorchInductor. For more information, please refer to this tutorial.
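
As a minimal sketch of the API (the toy model and input shapes here are illustrative, not from the original post):

```python
import torch

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 8)

    def forward(self, x):
        return torch.nn.functional.relu(self.fc(x))

model = MLP().eval()

# Default backend is TorchInductor; an alternative can be selected
# with torch.compile(model, backend="...").
compiled_model = torch.compile(model)

with torch.no_grad():
    out = compiled_model(torch.randn(4, 16))  # first call triggers JIT compilation
```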

AOTInductor (CPP)

AOTInductor (AOTI) is a specialized version of TorchInductor that takes a PyTorch exported program, optimizes it, and produces a shared library artifact that can be used in non-Python deployment environments. We use torch.export to capture the model into a computational graph and then AOT compile it to generate the shared library, which can be loaded and executed in a C++ deployment environment. More details can be found in this tutorial.
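
A rough sketch of the ahead-of-time compilation step, assuming the torch._export.aot_compile entry point (the exact export-related API may differ across 2.x releases, and the toy model is illustrative):

```python
import torch

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 8)

    def forward(self, x):
        return self.fc(x)

model = MLP().eval()
example_inputs = (torch.randn(4, 16),)

# Export the model and AOT-compile it into a shared library (.so) that a
# C++ runtime can load without a Python interpreter.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)

print(so_path)  # path to the generated shared library artifact
```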

AOTInductor (Python)

When the shared library generated by AOTI needs to be executed in a Python runtime, AOTI provides a Python API to do so. torch._export.aot_load is used to load and execute the shared library generated by AOT compile. For further details, please consult this tutorial.
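
A minimal sketch, assuming the shared library produced earlier sits at a placeholder path "model.so":

```python
import torch

# Load the AOTInductor-generated shared library back into a Python runtime.
runner = torch._export.aot_load("model.so", device="cpu")  # or device="cuda"

with torch.no_grad():
    out = runner(torch.randn(4, 16))
```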

ExecuTorch

ExecuTorch is an end-to-end PyTorch platform that provides the infrastructure to run PyTorch programs on edge devices, ranging from AR/VR wearables to Android and iOS mobile devices. It relies heavily on torch.compile and torch.export. ExecuTorch provides exhaustive documentation on the end-to-end flow for various hardware platforms.
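
The ahead-of-time part of that flow might look roughly like the sketch below, assuming the executorch package's exir.to_edge / to_executorch APIs (these have evolved across ExecuTorch releases, and the model and file name are illustrative):

```python
import torch
from executorch.exir import to_edge

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 8)

    def forward(self, x):
        return self.fc(x)

model = MLP().eval()
example_inputs = (torch.randn(1, 16),)

# Capture the model with torch.export, lower it to the Edge dialect, and
# serialize an ExecuTorch program that the on-device runtime can execute.
exported = torch.export.export(model, example_inputs)
et_program = to_edge(exported).to_executorch()

with open("mlp.pte", "wb") as f:
    f.write(et_program.buffer)
```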

In the table below, you can find PyTorch’s recommendations on which technology to use, depending on the platform where inference is performed and the use case for the deployment. Please note that some of the export-related APIs mentioned above could change.

Inference Recommendations

| No. | Platform | Use Case | Recommendation |
|-----|----------|----------|----------------|
| 1 | Mobile (iOS and Android) | Highly optimized inference | ExecuTorch |
| 2 | Embedded and other edge devices | Highly optimized inference | ExecuTorch |
| 3 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with no graph break | AOTI (Python) |
| 4 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with graph breaks | torch.compile |
| 5 | Server (when AOTI is not supported) + Python deployment | High-throughput, highly optimized inference | torch.compile |
| 6 | Server (x86, CUDA, aarch64) + non-Python deployment | High-throughput, highly optimized inference with no graph break | AOTI (CPP) |
| 7 | Server (x86, CUDA, aarch64) + Python deployment | When a non-Inductor backend has the best latency | torch.compile |
| 8 | Mac (M1/M2/M3) | Local development and inference | eager |