PyTorch 2.x introduces a range of new technologies for model inference, and it can be overwhelming to figure out which one is most appropriate for your particular use case. This guide aims to provide clarity and guidance on the options available.
torch.compile
torch.compile speeds up PyTorch code by JIT-compiling it into optimized kernels, all through a simple API. It optimizes the given model using TorchDynamo, creating an optimized graph that is then lowered to the target hardware by the backend specified in the API. The default backend in torch.compile is TorchInductor. For more information, please refer to this tutorial.
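For illustration, here is a minimal sketch of that flow (the model and shapes below are made up for the example):

```python
import torch

# Any nn.Module or plain Python function can be compiled.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
).eval()

# JIT-compile with the default TorchInductor backend.
compiled_model = torch.compile(model)  # same as backend="inductor"

x = torch.randn(8, 64)
with torch.no_grad():
    out = compiled_model(x)  # first call triggers compilation; later calls reuse the kernels
```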
AOTInductor (CPP)
AOTInductor (AOTI) is a specialized version of TorchInductor that takes a PyTorch exported program, optimizes it, and produces a shared-library artifact that can be used in a non-Python deployment environment. We use torch.export to capture the model into a computational graph and then use AOT compile to generate the shared library, which can be loaded and executed in a C++ deployment environment. More details can be found in this tutorial.
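As a rough sketch of that flow (assuming the torch._export.aot_compile entry point shown in the AOTInductor tutorial; the exact API may differ between 2.x releases):

```python
import torch

model = torch.nn.Linear(64, 10).eval()
example_inputs = (torch.randn(8, 64),)

# Ahead-of-time compile the model into a shared library (.so).
# torch._export.aot_compile captures the model with torch.export under the hood
# and returns the path to the generated artifact.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)

print(so_path)  # this .so can then be loaded from a C++ runtime, with no Python dependency
```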
AOTInductor (Python)
When the shared library generated by AOTI needs to be executed in a Python runtime, AOTI provides a Python API to do so: torch._export.aot_load is used to load and execute the shared library generated by AOT compile. For further details, please consult this tutorial.
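A hedged sketch of the Python-side flow, reusing the shared library produced above (the "model.so" path is a placeholder):

```python
import torch

# Load the AOTInductor-generated shared library into the current Python process.
# The device argument must match the device the library was compiled for.
runner = torch._export.aot_load("model.so", device="cpu")

x = torch.randn(8, 64)
out = runner(x)  # executes the precompiled kernels; no recompilation happens here
```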
ExecuTorch
ExecuTorch is an end-to-end PyTorch platform that provides the infrastructure to run PyTorch programs on edge devices, ranging from AR/VR wearables to Android and iOS mobile devices. It relies heavily on torch.compile and torch.export. ExecuTorch provides exhaustive documentation on the end-to-end flow for various hardware platforms.
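A rough sketch of the lowering flow described in the ExecuTorch documentation (the executorch.exir imports and the .pte packaging below are taken from those docs and may change; treat this as an outline rather than a reference):

```python
import torch
from executorch.exir import to_edge  # ExecuTorch export API, per the ExecuTorch docs

model = torch.nn.Linear(16, 4).eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the model with torch.export.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect, then to an ExecuTorch program.
edge_program = to_edge(exported_program)
et_program = edge_program.to_executorch()

# 3. Serialize to a .pte file that the on-device ExecuTorch runtime can load.
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```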
In the table below, you can find PyTorch’s recommendations on which technology to use, depending on the platform where inference is run and the deployment use case. Please note that some of the export-related APIs mentioned above may still change.
Inference Recommendations
| No. | Platform | Use Case | Recommendation |
|-----|----------|----------|----------------|
| 1 | Mobile (iOS and Android) | Highly optimized inference | ExecuTorch |
| 2 | Embedded and other edge devices | Highly optimized inference | ExecuTorch |
| 3 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with no graph break | AOTI (Python) |
| 4 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with graph break | torch.compile |
| 5 | Server (when AOTI is not supported) + Python deployment | High-throughput, highly optimized inference | torch.compile |
| 6 | Server (x86, CUDA, aarch64) with non-Python deployment | High-throughput, highly optimized inference with no graph break | AOTI (CPP) |
I think it would also be useful, for the AOT case, to have a documentation clarification of this old but good thread:
Also partially related is this passage from the export tutorial:
As torch.export is only a graph capturing mechanism, calling the artifact produced by torch.export eagerly will be equivalent to running the eager module. To optimize the execution of the Exported Program, we can pass this exported artifact to backends such as Inductor through torch.compile, AOTInductor, or TensorRT.
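To make that quoted point concrete, here is a small hedged sketch of the options it mentions (torch.export.export and ep.module() are the 2.x APIs; the AOT path reuses the aot_compile sketch from above):

```python
import torch

model = torch.nn.Linear(32, 8).eval()
example_inputs = (torch.randn(4, 32),)

# torch.export only captures the graph; calling ep.module() still runs eagerly.
ep = torch.export.export(model, example_inputs)
eager_out = ep.module()(*example_inputs)

# Option A: JIT-optimize the captured module with Inductor via torch.compile.
compiled = torch.compile(ep.module())
jit_out = compiled(*example_inputs)

# Option B: AOT-compile into a shared library with AOTInductor.
so_path = torch._export.aot_compile(model, example_inputs)
```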
It is quite confusing what kind of benefit we get from torch.export and the different torch.compile options, especially in the case where we are exporting on a host whose hardware differs from that of the target inference host.
Regarding your second point, I believe the user experience would be to create the shared library on the same hardware where it will be deployed. I believe this was the same with TorchScript?
And in typical deployments, a given model would be deployed on one kind of hardware. So this means there is only one shared library per model, which shouldn’t make the model ops complicated?
> And in typical deployments, a given model would be deployed on one kind of hardware.
I believe this point relates more closely to the first issue and the referenced ticket/thread.
Regarding the second point, users may still feel a bit confused about the differences between AOTI performance/optimization and regular torch.compile.
For instance, consider a scenario where we have a discrete number of inputs with dynamic sizes that aren’t limited to the batch dimension. This situation often presents a trade-off between padding and targeting a discrete number of input dimensions, which is particularly common in vision tasks with varying input resolutions during inference.
With the AOTI/export approach, you can only specify a range of values for a dynamic dimension, with no option to define a finite (sparse) set of concrete input sizes. You can see more on this in GitHub Issue #136119.
In contrast, using the “classical” torch.compile allows you to generate specific code tailored to the exact types of inputs you’ll be feeding into the model.
Some users might prefer to optimize the code for a finite set of dimensions while still leveraging the torch.compile cache or remote cache.
However, it appears to be challenging for users to grasp the performance benefits of the two solutions, especially when compiling/exporting on the same inference hardware target.
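To illustrate the trade-off being discussed, a hedged sketch (the model, sizes, and bounds are illustrative): torch.export lets you constrain a dynamic dimension only to a min/max range, while eager torch.compile with dynamic=False simply specializes and caches one graph per concrete resolution it sees.

```python
import torch
from torch.export import Dim, export

model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()

# Export/AOTI path: dynamic height/width can only be given a [min, max] range,
# not a finite set of sizes such as {224, 384, 512}.
h = Dim("h", min=224, max=512)
w = Dim("w", min=224, max=512)
ep = export(
    model,
    (torch.randn(1, 3, 256, 256),),
    dynamic_shapes=({2: h, 3: w},),
)

# Eager torch.compile path: each distinct resolution triggers its own specialized
# compilation, which is then served from the (local or remote) compile cache.
compiled = torch.compile(model, dynamic=False)
for size in (224, 384, 512):
    _ = compiled(torch.randn(1, 3, size, size))
```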
Is there current or planned support for autograd on AOTI and torch.export? By the way, excellent explanation, it is a bit overwhelming, but I’m excited for the new features =)