What’s preventing PyTorch from being competitive with Llamafile?

Over the past year, the PyTorch team has done a lot of work to improve the user experience of torch.compile and torch.export. As the PyTorch Compiler Deployment team, we’ve seen a lot of interesting use cases around “minimal” deployment, where people try to compile PyTorch models down to self-contained executables without dependencies, because:

  • While torch.compile is useful, the dependencies it involves are a non-starter for a large variety of use cases.
  • A lot of people want to just download a binary and run it (e.g. Llamafile is a cool project that demos LLMs with very few installation steps).

Although we have a lot of tools for enabling native deployment of PyTorch models (export/aoti), almost nobody is actually using them to build and ship binaries in practice.

Therefore, I wanted to understand what the obstacles are, so I tried packing Whisper into a single binary, and this is what I found.

Summary

  • Although we have a lot of tools to compete with Llamafile, the libtorch dependency and export’s top-down UX make it unappealing to use torch.export to turn an existing PyTorch model into a distributable executable.
  • To address these issues we propose to push on the following projects:
    • libtorch-free aotinductor.
    • An eager-style torch.export API.
    • TorchNative: a set of tech demos to showcase using torch.export+aoti on real-world PyTorch models.

Experiment on OpenAI Whisper

To concretely understand what could be helpful for export+aoti (and follow the spirit of the PyTorch team’s User Empathy Day™), I decided to have a hack-a-week to see what I could come up with when using export+aoti on some random open source model. I went the extra mile and spent a whole week because I wanted to truly understand what it would take to fully get rid of the Python environment to run Whisper, and a single user empathy day is definitely not enough to take me from the Python wonderland to the no-Python wonderland :slight_smile:

It turns out that it’s totally feasible to produce a native binary which takes an audio file and produces the transcription using export and aoti, i.e. a fully reproduced e2e Whisper run with only torch-compiled kernels and torch ops, without Python! I put the demo and code here: https://github.com/zhxchen17/torchnative/tree/main/whisper_aoti (note: I still need to locally patch PyTorch to make this work. Link to the patch: gist:48bcf4bf5af77302a1047f6ba67e8568 · GitHub)

Here is an architecture overview of the produced binary:

Observation 1

Problem: The libtorch dependency makes us non-competitive in the single-binary setting, because users would have to download this dependency alongside the model binary, and libtorch adds an extra couple of gigabytes (very bad).

Solution: Libtorch-free aotinductor.

Remember the architecture diagram we just saw? Everything looks perfect, right? Well, I intentionally left one big part out of the diagram: libtorch! A more complete architecture looks like the following:

In our example, the binary size is on the order of hundreds of megabytes (and can be optimized further), but libtorch itself takes a few gigabytes, which makes distribution of the whole binary much harder.

As a side note, today the generated code from AOTI also cannot be easily statically linked, because different model binaries tend to have symbol conflicts in a single-binary setting. Currently this is worked around by dlopen-ing each shared object separately so that their namespaces are isolated from each other.
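To illustrate that workaround, here is a minimal sketch using Python’s ctypes (which wraps dlopen); the real demo does the equivalent from Rust, and the file names below are hypothetical:

import ctypes
import os

# Hypothetical AOTI outputs; each shared object exports the same entry
# points, e.g. AOTInductorModelContainerRun.
encoder_lib = ctypes.CDLL("whisper_encoder.so", mode=os.RTLD_LOCAL)
decoder_lib = ctypes.CDLL("whisper_decoder.so", mode=os.RTLD_LOCAL)

# RTLD_LOCAL keeps each library's symbols out of the global namespace,
# so the duplicated symbol names never collide; each handle resolves its
# own copy of the entry point.
run_encoder = encoder_lib.AOTInductorModelContainerRun
run_decoder = decoder_lib.AOTInductorModelContainerRun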

Another issue is that with libtorch as a dependency, we need to carefully maintain ABI compatibility between the produced binary and the libtorch we ship with it. With the design of the libtorch.so ↔ model.so shim layer, this problem is managed, but the ABI issue is eliminated entirely if we don’t dynamically link against libtorch.so in the first place, resulting in a much simpler deployment pipeline.

Overall, I strongly agree with Bin Bao’s vision that we should work on libtorch-free code generation in AOTI.

Observation 2

Problem: Export’s UX is too “top-down”, meaning either it works on the first try, or it’s hard to make it work. However, for “real” models like Whisper, it’s much easier to approach it in a “bottom-up” way.

Solution: Eager style export API.

Today when a user wants to AOTI-compile a model, the process is repetitive and tedious: first, you need to acquire sample inputs from somewhere, which results in hacky code that “breakpoints” at the model’s entry point to observe the inputs there. Second, you need to pass the ExportedProgram and sample inputs to AOTI’s API and package the result manually.
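To make that boilerplate concrete, here is a rough sketch of the manual flow on a toy module; the module, sample inputs, and output path are made up, and the packaging API names reflect recent PyTorch releases, so they may differ in yours:

import torch
import torch._inductor

class Block(torch.nn.Module):
    def forward(self, x, y):
        return x + y

# Step 1: scrape together representative sample inputs (in practice, by
# breakpointing at the real model's entry point).
sample_inputs = (torch.randn(4, 8), torch.randn(4, 8))

# Step 2: export to an ExportedProgram.
ep = torch.export.export(Block(), sample_inputs)

# Step 3: compile with AOTInductor and package the artifact by hand.
pkg = torch._inductor.aoti_compile_and_package(ep, package_path="/tmp/block.pt2")

# Step 4: load the package and run it to sanity-check the result.
runner = torch._inductor.aoti_load_package(pkg)
print(runner(*sample_inputs))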

For Whisper, I actually needed to export the model as 5 different smaller parts, so people can imagine how much boilerplate code I ended up with. Therefore, based on the sticky cache idea floating around the PyTorch team, I want to explore a decorator-style API like the following:

This is in the prototype phase; the API design is subject to change.

@_compile_cache("model", out="/tmp/zhxchen17/", input_matchers={"x": {0: Dim("a")}, "y": None})
def model_code(x, y):
    return x + y

model_code(x, y)

The torch.export team has started discussing this API, and we believe it will make the user experience much better when exporting models at smaller granularity.

Observation 3

Problem: Almost nobody in the public community is using export/aotinductor, and it’s unclear what the standard workflow is for deploying exported models as distributable binaries.

Solution: TorchNative: A set of tech demos to showcase how to use export/aotinductor and compile PyTorch models into single binaries.

After the Whisper experiment, it has become more and more clear that there’s an interesting problem space we can explore to further realize export+aoti’s potential: reducing the dependencies to the absolute bare minimum and making the workflow much easier to adopt. To give some examples of what seems useful to build so far:

  • No-libtorch deployment as a first-class feature.
  • Easier export API adoption via a decorator-style API.
  • Better support for profiling.
  • Weight sharing.
  • Better CPU performance.

Feeling that there should be an umbrella name for these related projects, I coined one for what we can achieve here: TorchNative, meaning we use torch-native tools to produce natively runnable binaries on any platform.

Under the umbrella name of TorchNative, our goal is to produce a set of simple, highly informative tech demos on how to apply PyTorch Compiler technologies to a variety of open source PyTorch models. While TorchNative’s key deliverables are the tech demos themselves, our hope is that we can uncover many interesting issues by using the export+aoti workflow like a real end user, and generate new ideas like the eager-style export API and no-libtorch deployment. For example, we are going to upstream a bunch of local patches we’ve made to the PyTorch main branch to make sure everything we put in the demos/tutorials is working and ready to go for end users.

Takeaways

By working more on the end users’ side, we are able to make reproducible and holistic decisions about what to build in PyTorch Compiler to benefit end users, and we should continue working on enabling more models with export+aoti.

This post shares some of my thoughts on export+aoti since September, and hopefully it brings some awareness to PyTorch developers/users of what we want to build next.


Thanks for the great sharing.

Is the motivation inference deployment? Thanks.

Yes, this is for inference deployment.

Thanks, and I have some other questions.

Q1.
Regarding the dependencies in “While torch.compile is useful, the dependencies it involves are a non-starter for a large variety of use cases”, are they the Triton compiler and the Triton runtime?

Q2.
Regarding libtorch-free aotinductor, my understanding is that libtorch.so and libc10.so are not required. Are there other .so files from the PyTorch project that are required (do we still need something from PyTorch on the deployment machine)? Is there a technical introduction for the libtorch-free aotinductor?

Q3.
Regarding @_compile_cache("model", out="/tmp/zhxchen17/", ...), does it cover both export and aotinductor, so that the result is /tmp/zhxchen17/model.so without linking to libtorch.so and libc10.so?

Great thinking and presentation. I do have some comments, questions and thoughts on this:

[1] Can you please elaborate on this statement: “As a side note, today the generated code from AOTI also cannot be easily statically linked, because different model binaries tend to have symbol conflicts in a single-binary setting”?

[2] Why is it “much simpler”?: “Another issue is that with libtorch as a dependency, we need to carefully maintain ABI compatibility between the produced binary and the libtorch we ship with it. With the design of the libtorch.so ↔ model.so shim layer, this problem is managed, but the ABI issue is eliminated entirely if we don’t dynamically link against libtorch.so in the first place, resulting in a much simpler deployment pipeline.”

[3] I have looked at the Rust code. Why do we need to split the Whisper model into K parts and reformat them as standalone shared objects?

[4] How can we design “no-libtorch deployment as a first-class feature”?

Thanks for your great effort and sharing.

Q1: I think Triton should be a fine dependency. I was referring to the CPython runtime, which is not allowed or suitable in a variety of use cases.

Q2:

my understanding is that libtorch.so and libc10.so are not required.

I don’t think so. Today, by default, the AOTI shared objects we generate are dynamically linked against libtorch.so. Also, there are some fallback ops needed from libtorch, at least for the Whisper model.

Is there a technical introduction for the libtorch-free aotinductor?

We are working on this right now; hopefully we’ll get something done next year.

Q3: It should cover both export and aoti. Ideally the resulting .so file shouldn’t link against libtorch, but that’s not the reality today.

Thanks for the questions!

[1] Today AOTI’s generated code just dumps the same symbol names into the global namespace. For example, every piece of generated code will have something like pytorch/torch/csrc/inductor/aoti_runtime/interface.h at main · pytorch/pytorch · GitHub (AOTInductorModelContainerRun). This is fine with a single model file, but if we have multiple model files we will run into linking issues. This shouldn’t be hard to support, but today this pattern doesn’t work.

[2] I was thinking in terms of two perspectives: 1. As PyTorch developers, we need to maintain this surface of shim APIs, and there will be design and development cost associated with it. For example, every time a new feature is added to aten, we need to decide whether we have to put it into libtorch and shim the feature through this surface. Some energy will be spent on making sure shim.h is not bloated and that we have a clean separation between the generated code and libtorch itself.
2. As a PyTorch user, it’s simpler to download a single file than to have the additional step of installing libtorch, which just adds more friction to the process. Also, some deployment environments don’t even have dynamic linking (e.g. small devices, custom OSes, etc.), so we may not want to make a dynamically linked libtorch.so a requirement for these groups of users.

[3] Glad you asked this. I think there were two reasons for doing this in my experiment: 1. I found it’s much easier to incrementally export each part of the model and test whether the parts work piece by piece, to upper-bound the number of issues I hit in one pass.
2. There is some dynamic control flow in Whisper’s model code which I preferred not to export in the end. I was trying to export Whisper without a single line of change to the original code, so I didn’t spend much time exploring the control flow ops from PT2.
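As a purely hypothetical sketch of that incremental, piece-by-piece approach (the module structure below is a stand-in, not Whisper’s real code), each part gets exported on its own, while the dynamic control flow that stitches the parts together stays in ordinary host code:

import torch

# Stand-ins for two pieces of a larger model.
class Encoder(torch.nn.Module):
    def forward(self, mel):
        return mel * 2.0

class Decoder(torch.nn.Module):
    def forward(self, tokens, audio_features):
        return tokens + audio_features.sum()

# Export each piece separately; each export can be tested in isolation.
enc_ep = torch.export.export(Encoder(), (torch.randn(1, 80, 3000),))
dec_ep = torch.export.export(Decoder(), (torch.zeros(1, 4), torch.randn(1, 4)))

# The host program (Python here, Rust in the demo) then loops and branches
# between the exported pieces at runtime instead of exporting that control
# flow itself.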

[4] Yeah, @desertfire shared some thoughts on this project inside the PyTorch team recently. I think it boils down to: 1. generating more ops with Inductor instead of falling back; 2. providing a portable implementation of libtorch’s tensor library. Meanwhile, I plan to keep updating torchnative’s demos until we make libtorch-free AOTI more viable. Hopefully we will have a shared roadmap later.

Thanks for your questions!

I don’t think so. Today, by default, the AOTI shared objects we generate are dynamically linked against libtorch.so.

Q2 is about the libtorch-free aotinductor: why do we call it libtorch-free if the generated file still links against libtorch.so?

Is the shim layer a new .so file (for example, libtorchshim.so)?