PyTorch 2.x introduces a range of new technologies for model inference, and it can be overwhelming to figure out which one is most appropriate for your particular use case. This guide aims to provide clarity and guidance on the options available.
torch.compile
torch.compile speeds up PyTorch code by JIT-compiling it into optimized kernels, all through a simple API. It optimizes the given model using TorchDynamo, creating an optimized graph that is then lowered to the target hardware by the backend specified in the API. The default backend in torch.compile is TorchInductor. For more information, please refer to this tutorial.
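For illustration, here is a minimal sketch of that flow (the model and shapes below are made up for the example):

```python
import torch

# Any nn.Module or plain Python function can be compiled.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
).eval()

# JIT-compile with the default TorchInductor backend.
compiled_model = torch.compile(model)  # same as backend="inductor"

x = torch.randn(8, 64)
with torch.no_grad():
    out = compiled_model(x)  # first call triggers compilation; later calls reuse the kernels
```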
AOTInductor (CPP)
AOTInductor (AOTI) is a specialized version of TorchInductor that takes a PyTorch exported program, optimizes it, and produces a shared-library artifact that can be used in a non-Python deployment environment. We use torch.export to capture the model into a computational graph and then use AOT compile to generate the shared library, which can be loaded and executed in a C++ deployment environment. More details can be found in this tutorial.
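As a rough sketch of that flow (assuming the torch._export.aot_compile entry point shown in the AOTInductor tutorial; the exact API may differ between 2.x releases):

```python
import torch

model = torch.nn.Linear(64, 10).eval()
example_inputs = (torch.randn(8, 64),)

# Ahead-of-time compile the model into a shared library (.so).
# torch._export.aot_compile captures the model with torch.export under the hood
# and returns the path to the generated artifact.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)

print(so_path)  # this .so can then be loaded from a C++ runtime, with no Python dependency
```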
AOTInductor (Python)
When the shared library generated by AOTI needs to be executed in a Python runtime, AOTI provides a Python API to do so: torch._export.aot_load is used to load and execute the shared library generated by AOT compile. For further details, please consult this tutorial.
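A hedged sketch of the Python-side flow, reusing the shared library produced above (the "model.so" path is a placeholder):

```python
import torch

# Load the AOTInductor-generated shared library into the current Python process.
# The device argument must match the device the library was compiled for.
runner = torch._export.aot_load("model.so", device="cpu")

x = torch.randn(8, 64)
out = runner(x)  # executes the precompiled kernels; no recompilation happens here
```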
ExecuTorch
ExecuTorch is an end-to-end PyTorch platform that provides the infrastructure to run PyTorch programs on edge devices, ranging from AR/VR wearables to Android and iOS mobile devices. It relies heavily on torch.compile and torch.export. ExecuTorch provides exhaustive documentation on the end-to-end flow for various hardware platforms.
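A rough sketch of the lowering flow described in the ExecuTorch documentation (the executorch.exir imports and the .pte packaging below are taken from those docs and may change; treat this as an outline rather than a reference):

```python
import torch
from executorch.exir import to_edge  # ExecuTorch export API, per the ExecuTorch docs

model = torch.nn.Linear(16, 4).eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the model with torch.export.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect, then to an ExecuTorch program.
edge_program = to_edge(exported_program)
et_program = edge_program.to_executorch()

# 3. Serialize to a .pte file that the on-device ExecuTorch runtime can load.
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```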
In the table below, you can find PyTorch’s recommendations on which technology to use, depending on the platform where inference is run and the deployment use case. Please note that some of the export-related APIs mentioned above may still change.
Inference Recommendations
| No. | Platform | Use Case | Recommendation |
|-----|----------|----------|----------------|
| 1 | Mobile (iOS and Android) | Highly optimized inference | ExecuTorch |
| 2 | Embedded and other edge devices | Highly optimized inference | ExecuTorch |
| 3 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with no graph break | AOTI (Python) |
| 4 | Server (x86, CUDA, aarch64) + Python deployment | High-throughput, highly optimized inference with graph break | torch.compile |
| 5 | Server (when AOTI is not supported) + Python deployment | High-throughput, highly optimized inference | torch.compile |
| 6 | Server (x86, CUDA, aarch64) with non-Python deployment | High-throughput, highly optimized inference with no graph break | AOTI (CPP) |
I think it would also be useful, for the AOT case, to have a documentation clarification of this old but good thread:
Also partially related is this passage from the export tutorial:
As torch.export is only a graph capturing mechanism, calling the artifact produced by torch.export eagerly will be equivalent to running the eager module. To optimize the execution of the Exported Program, we can pass this exported artifact to backends such as Inductor through torch.compile, AOTInductor, or TensorRT.
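To make that quoted point concrete, here is a small hedged sketch of the options it mentions (torch.export.export and ep.module() are the 2.x APIs; the AOT path reuses the aot_compile sketch from above):

```python
import torch

model = torch.nn.Linear(32, 8).eval()
example_inputs = (torch.randn(4, 32),)

# torch.export only captures the graph; calling ep.module() still runs eagerly.
ep = torch.export.export(model, example_inputs)
eager_out = ep.module()(*example_inputs)

# Option A: JIT-optimize the captured module with Inductor via torch.compile.
compiled = torch.compile(ep.module())
jit_out = compiled(*example_inputs)

# Option B: AOT-compile into a shared library with AOTInductor.
so_path = torch._export.aot_compile(model, example_inputs)
```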
It is quite confusing what kind of benefit we get from torch.export and the different torch.compile options, especially in the case where we are exporting on a host whose hardware differs from that of the target inference host.
Regarding your second point, I believe the user experience would be to create the shared library on the same hardware where it will be deployed. I believe this was the same with TorchScript?
And in typical deployments, a given model would be deployed on one kind of hardware. So this means there is only one shared library per model, which shouldn’t make the model ops complicated?
> And in typical deployments, a given model would be deployed on one kind of hardware.
I believe this point relates more closely to the first issue and the referenced ticket/thread.
Regarding the second point, users may still feel a bit confused about the differences between AOTI performance/optimization and regular torch.compile.
For instance, consider a scenario where we have a discrete number of inputs with dynamic sizes that aren’t limited to the batch dimension. This situation often presents a trade-off between padding and targeting a discrete number of input dimensions, which is particularly common in vision tasks with varying input resolutions during inference.
With the AOTI/export approach, you can only specify a range of values for a dynamic dimension, with no option to define a finite (sparse) set of concrete input sizes. You can see more on this in GitHub Issue #136119.
In contrast, using the “classical” torch.compile allows you to generate specific code tailored to the exact types of inputs you’ll be feeding into the model.
Some users might prefer to optimize the code for a finite set of dimensions while still leveraging the torch.compile cache or remote cache.
However, it appears to be challenging for users to grasp the performance benefits of the two solutions, especially when compiling/exporting on the same inference hardware target.
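To illustrate the trade-off being discussed, a hedged sketch (the model, sizes, and bounds are illustrative): torch.export lets you constrain a dynamic dimension only to a min/max range, while eager torch.compile with dynamic=False simply specializes and caches one graph per concrete resolution it sees.

```python
import torch
from torch.export import Dim, export

model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()

# Export/AOTI path: dynamic height/width can only be given a [min, max] range,
# not a finite set of sizes such as {224, 384, 512}.
h = Dim("h", min=224, max=512)
w = Dim("w", min=224, max=512)
ep = export(
    model,
    (torch.randn(1, 3, 256, 256),),
    dynamic_shapes=({2: h, 3: w},),
)

# Eager torch.compile path: each distinct resolution triggers its own specialized
# compilation, which is then served from the (local or remote) compile cache.
compiled = torch.compile(model, dynamic=False)
for size in (224, 384, 512):
    _ = compiled(torch.randn(1, 3, size, size))
```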
Is there current or planned support for autograd on AOTI and torch.export? By the way, excellent explanation, it is a bit overwhelming, but I’m excited for the new features =)