Torch::deploy - The Build


Overview

torch::deploy offers a way to run Python/PyTorch code in a multithreaded environment, for example enabling N threads to serve production traffic against a single copy of a model (tensors/weights) without GIL contention. It does this by constructing N complete copies of CPython and the torch_python bindings inside a single process.

This post is a deep dive into how the build system is set up to do some unusual things in order to hide multiple CPython instances inside a PyTorch process.

This is the simplest picture I could think of to show the system architecture, but I’m glossing over most of the interesting stuff.

Each copy of the interpreter contains a whole memory image of cpython and libtorch_python bindings, which essentially converts all global objects (such as the infamous GIL) into members of a particular interpreter.

The embedded interpreter payload is literally a binary string that we can use to construct a new copy of an interpreter via dlopen at runtime.

What goes into an interpreter?

Each interpreter has two main components: (1) the public interface, and (2) the python distribution. I’ll talk more about the public interface in another post. The python distribution is literally that: a customized build of CPython (3.8, currently), with a particular set of python stdlib modules and pytorch bindings included, and nothing else.

The python distribution is intentionally sealed off - this means it has no notion of import paths on your filesystem, and it can only use the packages it comes with. However, you can still customize the distribution; you just have to do that up front by modifying how torch::deploy is built.

  • any ‘pure python’ modules or packages can be used seamlessly with torch::deploy without modifying the underlying distribution, by bundling them inside a torch.package.
  • the embedded python version and libraries have no dependency or interaction with the system python you may have installed, or your PATH, etc.
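To get a feel for what “sealed off” means in practice, here is a small Python sketch (illustrative only, not torch::deploy code) that simulates a distribution with no filesystem import paths by emptying sys.path: imports that would hit the disk fail, while modules compiled into the interpreter binary still load.

```python
import importlib
import sys

# Simulate a sealed distribution: no filesystem import paths at all.
# (torch::deploy bakes this into the embedded CPython; clearing sys.path
# here is just a rough illustration of the effect.)
sys.modules.pop("colorsys", None)   # make sure it isn't already cached
saved_path = sys.path[:]
sys.path = []

try:
    importlib.import_module("colorsys")   # a pure-python stdlib module on disk
except ModuleNotFoundError:
    print("filesystem import blocked")

# Modules compiled into the interpreter binary are unaffected -- analogous to
# the frozen stdlib and pytorch bindings each interpreter ships with.
assert "itertools" in sys.builtin_module_names
it = importlib.import_module("itertools")
print(list(it.islice(it.count(), 3)))   # [0, 1, 2]

sys.path = saved_path
```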

With that out of the way, here are the main components to the python distribution.

(TODO: there is an image I want to post here, but I can’t post more than one image per post!)

A note on FrozenPython

Freezing refers to a process of compiling a python module and serializing its bytecode in a way that it can be read directly by python’s built-in import machinery. For details about how freezing works, see freeze.py on github (torch/csrc/deploy/interpreter/freeze.py) - a script that compiles .py files into bytecode and writes them into an embedded binary string.
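The core of the idea can be sketched in a few lines of Python (the module source and names below are invented for illustration; freeze.py itself does more bookkeeping): compile the source to a code object, marshal its bytecode, and render the bytes as a C array that can be linked into the binary.

```python
import marshal

# Compile a tiny "module" to a code object, as the freeze step does for each
# .py file. (The source and names here are made up for illustration.)
source = "GREETING = 'hello from frozen code'\n"
code = compile(source, "<frozen demo>", "exec")
frozen_bytes = marshal.dumps(code)

# freeze.py-style output: render the bytecode as a C byte array that can be
# compiled into the binary and handed to CPython's frozen-import machinery.
c_array = ", ".join(str(b) for b in frozen_bytes[:8])
print(f"unsigned char _frozen_demo[] = {{ {c_array}, ... }};")

# Round trip: the marshaled bytecode can be rehydrated and executed with no
# .py file on disk -- this is what the frozen importer does at runtime.
namespace = {}
exec(marshal.loads(frozen_bytes), namespace)
print(namespace["GREETING"])   # hello from frozen code
```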

We extend pytorch’s CMake build system to clone and build a fixed version of CPython (3.8), and then add extra steps to package up the pieces in a way that they can be embedded inside our library.

(torch/csrc/deploy/interpreter/CMakeLists.txt) is where CPython is built.

How interpreter creation works

In the simplest world, torch::deploy could simply call dlopen on “path_to_libinterpreter.so” and get back a handle to the interpreter interface. But to avoid requiring that every torch::deploy user also install this .so on their system, we bundle it as part of the torch::deploy binary itself.

In (torch/csrc/deploy/CMakeLists.txt), we add a new target that takes the embedded interpreter library (.so) created in the previous step and embeds it deeper into our library, so it doesn’t have to be lying around on the filesystem at runtime.

add_custom_command(
  OUTPUT libtorch_deployinterpreter.o
  COMMAND cp $<TARGET_FILE:torch_deployinterpreter> .
  COMMAND ld -r -b binary -o libtorch_deployinterpreter.o libtorch_deployinterpreter.so
  COMMAND rm libtorch_deployinterpreter.so
  DEPENDS torch_deployinterpreter
  VERBATIM
)

The snippet above simply takes the libinterpreter.so and uses ld -r -b binary to serialize it as one long binary blob in a new object file, with its start and end conveniently indicated by _binary_libtorch_deployinterpreter_so_start and _binary_libtorch_deployinterpreter_so_end symbols named according to the .so.

So now, at runtime we just have to write the bytes between those start/end symbols into a /tmp file, call dlopen on that file, and then unlink it.

Hiding the Symbols

This is a critical step that ensures the only symbol exposed globally from the dlopened libinterpreter is the one for a function that constructs an interpreter impl. Marking everything else as local ensures, for example, that the copies of static CPython symbols in one interpreter do not interfere with those of another interpreter.

hide_symbols.script:

INTERPRETER_0.1 {
  global: new_interpreter_impl;
  local: *;
};

an entry on the cmake target for libinterpreter.so:

set_property(
  TARGET torch_deployinterpreter APPEND_STRING PROPERTY
  LINK_FLAGS " -Wl,--version-script=${LINKER_SCRIPT}")

A Few More Important Details

  • libinterpreter.so depends on python headers, but these must not be the headers from whatever python is on your PATH when you run cmake - so we make sure to include this line:
    • target_include_directories(torch_deployinterpreter PUBLIC ${PYTHON_INC_DIR})
  • the CPython build races with other parts of the pytorch build, so adding the right manual dependencies is important:
    • add_dependencies(torch_deployinterpreter cpython)
    • add_dependencies(torch_python_obj cpython)
  • the torch-python files that are part of torch::deploy are built from the same cmake objects as the ones used for the main torch-python library:
    • add_library(torch_python_static STATIC $<TARGET_OBJECTS:torch_python_obj>)

Resolving Symbols External to the Interpreter

Each interpreter contains code for the torch python bindings, but not for the stuff those bindings bind. For example, the function THPModule_getNumThreads is a binding that calls at::get_num_threads() and returns its result as a py integer. Only THPModule_getNumThreads has code inside libinterpreter; at::get_num_threads is an unresolved symbol located in libtorch that has to be resolved at dlopen time. These symbols must be resolved by libraries already loaded into the application at runtime, or they will cause an undefined-symbol error.

Future Work

Lots of work is ongoing, both in the core of torch::deploy and in efforts to integrate it into production workflows such as those using PyTorch predictor.

In particular, I’d like to call out two items related to customizing the torch::deploy python distribution to add support for your favorite python libraries: Zach DeVito is looking into a custom ELF loader, which would automate the process of dlopening a library in a ‘replicated’ way for interpreters to use, and Michael Suo is looking broadly at ways to extend the python distribution for customized use cases.

Thanks for reading! Hopefully this helped explain the internals of the torch::deploy build system, and I plan to post more deep dives on other aspects of torch::deploy.


Here is the image I wanted to post above, illustrating the components of libinterpreter.so