PyTorch 2.x Inference Recommendations

PyTorch 2.x introduces a range of new technologies for model inference and it can be overwhelming to figure out which technology is most appropriate for your particular use case. This guide aims to provide clarity and guidance on the various options available.

torch.compile

torch.compile speeds up PyTorch code by JIT-compiling it into optimized kernels through a simple API. It captures the given model with TorchDynamo into an optimized graph, which is then lowered to the target hardware by the backend specified in the API. The default backend for torch.compile is TorchInductor. For more information, please refer to this tutorial.
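For illustration, here is a minimal sketch of the API; the toy model and input shapes below are placeholders, not part of the guide:

```python
import torch

# Minimal torch.compile usage: the default backend is TorchInductor.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
).eval()

compiled_model = torch.compile(model)  # equivalent to backend="inductor"

x = torch.randn(8, 64)
with torch.no_grad():
    out = compiled_model(x)  # first call JIT-compiles; later calls reuse the kernels
```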

AOTInductor (CPP)

AOTInductor (AOTI) is a specialized version of TorchInductor that takes a PyTorch exported program, optimizes it, and produces a shared library artifact that can be used in a non-Python deployment environment. We use torch.export to capture the model into a computational graph and then use AOT compile to generate the shared library, which can be loaded and executed in a C++ deployment environment. More details can be found in this tutorial.
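As a rough sketch of the compile step, under the caveat above that these APIs may change: the private torch._export.aot_compile call and the output path option follow the 2.4-era tutorial, and the toy model is illustrative only.

```python
import os
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = Net().eval()
example_inputs = (torch.randn(4, 10),)

with torch.no_grad():
    # Export + AOT-compile the model into a shared library for non-Python deployment.
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        options={"aot_inductor.output_path": os.path.join(os.getcwd(), "model.so")},
    )
# model.so can then be loaded from a C++ runtime (e.g. via the AOTI model container runner).
```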

AOTInductor (Python)

When the shared library generated by AOTI needs to be executed in a Python runtime, AOTI provides a Python API to do so. torch._export.aot_load is used to load and execute the shared library generated by AOT compile. For further details, please consult this tutorial.
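A short sketch of the Python loading path; again, this is a private API that may change, and "model.so" is assumed to be the artifact produced by the compile step above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the AOTI-generated shared library back into a Python runtime.
runner = torch._export.aot_load("model.so", device)

x = torch.randn(4, 10, device=device)
with torch.no_grad():
    out = runner(x)
```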

ExecuTorch

ExecuTorch is an end-to-end PyTorch platform that provides the infrastructure to run PyTorch programs on edge devices, ranging from AR/VR wearables to Android/iOS mobile devices. It relies heavily on torch.compile and torch.export. ExecuTorch provides exhaustive documentation on the end-to-end flow for various hardware platforms.
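As a hedged sketch of the lowering flow: the executorch.exir calls below follow the ExecuTorch getting-started documentation and may differ across releases, and the model is a placeholder.

```python
import torch
from torch.export import export
from executorch.exir import to_edge

class Net(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x)

exported = export(Net().eval(), (torch.randn(4),))  # capture with torch.export
edge_program = to_edge(exported)                     # lower to the Edge dialect
et_program = edge_program.to_executorch()            # lower to the ExecuTorch format

with open("model.pte", "wb") as f:
    f.write(et_program.buffer)                       # .pte file consumed by the on-device runtime
```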

In the table below, you can find PyTorch’s recommendations on which technology to use, depending on the platform where inference is performed and the deployment use case. Please note that some of the export-related APIs mentioned above may change.

Inference Recommendations

| No. | Platform | Use Case | Recommendation |
|-----|----------|----------|----------------|
| 1 | Mobile (iOS and Android) | Highly optimized inference | ExecuTorch |
| 2 | Embedded and other edge devices | Highly optimized inference | ExecuTorch |
| 3 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with no graph breaks | AOTI (Python) |
| 4 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with graph breaks | torch.compile |
| 5 | Server (when AOTI is not supported) + Python deployment | High-throughput, highly optimized inference | torch.compile |
| 6 | Server (x86, CUDA, aarch64) + non-Python deployment | High-throughput, highly optimized inference with no graph breaks | AOTI (CPP) |
| 7 | Server (x86, CUDA, aarch64) + Python deployment | When a non-Inductor backend has the best latency | torch.compile |
| 8 | Mac (M1/M2/M3) | Local development and inference | eager |

What is the recommended solution for non-Python Mac systems? Is TorchScript still the way to go?

Non-Python Mac systems would be AOTI too.

TorchScript is deprecated and not the way to go.

Do you have any sources on AOTI inference on macOS? I only found this: AOT Inductor and macOS · Issue #119803 · pytorch/pytorch · GitHub

And regarding TorchScript, the latest info I saw comes from this thread:

Is it deprecated now?

The same tutorial mentioned in the guide works on Mac (CPU): torch.export AOTInductor Tutorial for Python runtime (Beta) — PyTorch Tutorials 2.4.0+cu121 documentation

Thanks for the link, but this is for Python-based inference, whereas the systems in question are non-Python.

You can take a look at how torchchat does this on Mac: GitHub - pytorch/torchchat: Run PyTorch LLMs locally on servers, desktop and mobile

I think it would also be useful for the AOT case to have a documentation clarification on this old but good thread:

Also partially related is this passage from the export tutorial:

As torch.export is only a graph capturing mechanism, calling the artifact produced by torch.export eagerly will be equivalent to running the eager module. To optimize the execution of the Exported Program, we can pass this exported artifact to backends such as Inductor through torch.compile, AOTInductor, or TensorRT.
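A small sketch of that distinction, using a toy model purely for illustration:

```python
import torch
from torch.export import export

class MLP(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x @ x.T)

model = MLP().eval()
x = torch.randn(32, 32)

exported = export(model, (x,))            # graph capture only, no optimization
eager_out = exported.module()(x)          # equivalent to running the eager module

optimized = torch.compile(exported.module())  # hand the captured graph to Inductor
opt_out = optimized(x)
```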

It is quite confusing what kind of benefit we get from torch.export versus the different torch.compile options, especially when we export on a host whose hardware differs from the target inference host.


Regarding your second point, I believe the intended user experience is to create the shared library on the same hardware where it will be deployed. I believe it was the same with TorchScript?

And in typical deployments, a given model would be deployed on one kind of hardware. So there is only one shared library per model, which shouldn’t complicate model ops too much?

And in typical deployments, a given model would be deployed on one kind of hardware.

I believe this point relates more closely to the first issue and the referenced ticket/thread.

Regarding the second point, users may still feel a bit confused about the differences between AOTI performance/optimization and regular torch.compile.

For instance, consider a scenario where we have a discrete number of inputs with dynamic sizes that aren’t limited to the batch dimension. This situation often presents a trade-off between padding and targeting a discrete number of input dimensions, which is particularly common in vision tasks with varying input resolutions during inference.

With the AOTI approach during export, you can only specify a range of values, without the option to define a finite (sparse) set of actual input sizes. You can see more on this in GitHub Issue #136119.
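For concreteness, a hedged sketch of the range-based spec; the model, dimension names, and bounds are illustrative, not taken from the issue above:

```python
import torch
from torch.export import Dim, export

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

example = (torch.randn(1, 3, 224, 224),)
height = Dim("height", min=64, max=1024)
width = Dim("width", min=64, max=1024)

# export accepts a continuous [min, max] range per dimension, not a finite
# set of resolutions such as {224, 512, 640}.
exported = export(
    Net().eval(),
    example,
    dynamic_shapes={"x": {2: height, 3: width}},
)
```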

In contrast, using the “classical” torch.compile allows you to generate specific code tailored to the exact types of inputs you’ll be feeding into the model.

Some users might prefer to optimize the code for a finite set of dimensions while still leveraging the torch.compile cache or remote cache.

However, it appears challenging for users to grasp the relative performance benefits of the two solutions, especially when compiling/exporting on the same hardware target used for inference.

No, TorchScript would create one artifact that could run on other hardware as well.

Is there current or planned support for autograd with AOTI and torch.export? By the way, excellent explanation. It is a bit overwhelming, but I’m excited for the new features =)