The nuances of PyTorch Graph Capture

penguinwu · February 16, 2022, 12:13pm

I have always had trouble explaining to outsiders what PyTorch compilers truly are. In my previous careers, I could point to an anchoring infrastructure. But, PyTorch compilers have been constantly morphing – with new solution stacks or IRs popping up every half, each partially overlapping with previous solutions. Why is that?

Why are PyTorch Compilers Plural?

PyTorch compilers are plural because there is no single mechanism to convert PyTorch programs into IR forms (or graphs).

Other ML frameworks have either graph abstractions built into the programming model (e.g., TF) or the evaluation model (e.g., TVM), or a language frontend (e.g., Relay) that can be deterministically converted into IRs. In contrast, graph capture for an eager-first ML framework like PyTorch is non-trivial and a design space in itself. To a large extent, the solution space of PyTorch compilers reflects the evolution or fragmentation of PyTorch graph capture mechanisms.

The rest of the post dives into the nuances of the PyTorch graph captures and tries to offer a framework for understanding all of them.

PyTorch Graph Capture according to UX

The following table buckets different PyTorch graph captures based on UX.

We gauge user experiences based on the following metrics:

Amount of user effort required or allowed:
- Flip-switch: minimal user effort, but not much customization of the capture scope;
- User-directed: needs deeper knowledge to successfully capture a graph; it often implies adoption barriers (negative) and/or customizability (positive).
Whether graph capture is guaranteed to succeed (i.e., always-succeed vs. best-effort).
Whether graph (re)play is sound or not (i.e., sound vs. unsound).
Whether whole graphs or multiple partial graphs are captured, which determines characteristics of the replay execution (e.g., whether Python fallbacks are needed).

For ease of reference, we give each bucket a short name:

Out-of-box: flip-switch, always-succeed capture (not necessarily whole-graph), and sound replay w/ Python fallbacks. This one is our best UX option.
- Examples: Lazy Tensor, TorchDynamo
Human-in-the-loop: user-directed, best-effort, whole-graph capture. This option may have a steep tryout cost upfront but allows customization.
- Examples: torch.jit.script, torch.fx, AOTAutograd
Best-effort: flip-switch, always-succeed whole-graph capture, and unsound replay. This option is easy to try out.
- Examples: torch.jit.trace
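
As a compact restatement of the three buckets above, here is a small Python sketch that encodes them as plain data. The class and field names (CaptureBucket, whole_graph, sound_replay, bucket_of) are hypothetical and exist only for this illustration; the values are taken from the bucket descriptions, and sound_replay is left as None where the description does not state it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CaptureBucket:
    name: str
    effort: str                   # "flip-switch" or "user-directed"
    always_succeeds: bool         # False means best-effort capture
    whole_graph: bool             # False means partial graphs are allowed
    sound_replay: Optional[bool]  # None: not stated in the bucket description
    examples: Tuple[str, ...]


BUCKETS = (
    CaptureBucket("Out-of-box", "flip-switch", True, False, True,
                  ("Lazy Tensor", "TorchDynamo")),
    CaptureBucket("Human-in-the-loop", "user-directed", False, True, None,
                  ("torch.jit.script", "torch.fx", "AOTAutograd")),
    CaptureBucket("Best-effort", "flip-switch", True, True, False,
                  ("torch.jit.trace",)),
)


def bucket_of(mechanism: str) -> str:
    """Return the bucket name that lists the given capture mechanism."""
    for bucket in BUCKETS:
        if mechanism in bucket.examples:
            return bucket.name
    raise KeyError(mechanism)
```

For example, bucket_of("TorchDynamo") looks up the bucket that lists TorchDynamo as an example.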

What makes graph capture & replay sound?

Two traits influence the soundness and usability of a graph capture & replay system.

The ability to (transparently) skip unwanted ops in the capture scope, which determines whether capture is guaranteed to succeed. Unless we intend to develop a Python compiler, graph IR for an ML compiler cannot be the same as Python IR. Thus, a sound graph capture must be able to exclude Python ops that are not supported by the graph IR, preferably transparently.
- Multi-graph w/ Python fallback: the system may capture multiple partial graphs to safely skip any unsupported Python ops in the captured scope. Because Python IR is interleaved with multiple partial graphs, the (re)play system usually supports Python execution (or fallback), and entering and exiting the execution of a partial graph is usually transparent to programmers.
- Whole-graph: the system captures a single graph for the entire capture scope. If the capture scope contains Python constructs unsupported by the graph IR, the system may fail to capture a graph. In a single-graph system, users can explicitly replay a captured graph, and the replay system does not have to support Python execution. All existing capture-and-replay systems are whole-graph systems.
Interaction between capture and replay, which determines the soundness of replay.
- Recapture-and-play (i.e., capture many times and play one or more times). Such systems check whether captured graphs match the replay context and re-capture if mismatched.
- Capture-and-replay (i.e., capture once and replay many times). Such a system requires users to guarantee the soundness of the replay.
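
To make the two replay styles concrete, here is a toy sketch in plain Python. It is not PyTorch code: Proxy, capture, replay, guards_hold, and RecaptureAndPlay are hypothetical names, and the "graph" is just a tape of arithmetic ops on one number. It only illustrates why replaying a graph captured once can be unsound when the program branches on its input, and how checking guards before replay (and recapturing on a mismatch) avoids that.

```python
import operator


class Proxy:
    """Toy traced value: holds a concrete number and records ops on a tape."""

    def __init__(self, value, tape, guards):
        self.value, self.tape, self.guards = value, tape, guards

    def _record(self, op, const):
        self.tape.append((op, const))
        return Proxy(op(self.value, const), self.tape, self.guards)

    def __mul__(self, const):
        return self._record(operator.mul, const)

    def __sub__(self, const):
        return self._record(operator.sub, const)

    def __gt__(self, const):
        # The branch is decided by the concrete value and never enters the
        # tape; only a guard remembers which way it went.
        outcome = self.value > const
        self.guards.append((len(self.tape), const, outcome))
        return outcome


def capture(fn, example):
    """Run fn once on an example input; return the recorded tape and guards."""
    tape, guards = [], []
    fn(Proxy(example, tape, guards))
    return tape, guards


def replay(tape, x):
    """Apply the recorded ops to a new input without running Python code."""
    for op, const in tape:
        x = op(x, const)
    return x


def guards_hold(tape, guards, x):
    """Check that every recorded branch would go the same way for input x."""
    return all((replay(tape[:n], x) > const) == outcome
               for n, const, outcome in guards)


class RecaptureAndPlay:
    """Check guards before each replay; recapture when no cached graph fits."""

    def __init__(self, fn):
        self.fn, self.cache, self.captures = fn, [], 0

    def __call__(self, x):
        for tape, guards in self.cache:
            if guards_hold(tape, guards, x):
                return replay(tape, x)
        tape, guards = capture(self.fn, x)
        self.captures += 1
        self.cache.append((tape, guards))
        return replay(tape, x)


def model(x):
    if x > 0:
        return x * 2
    return x - 1


# Capture-and-replay: capture once with x = 3, then replay blindly.
tape, _ = capture(model, 3)
unsound = replay(tape, -4)   # takes the recorded x > 0 path; eager gives model(-4)

# Recapture-and-play: the guard on x > 0 fails for -4 and triggers a recapture.
compiled = RecaptureAndPlay(model)
results = [compiled(3), compiled(5), compiled(-4)]
```

In this toy, the recurring guard check is the price of soundness, which matches the distinction drawn above: capture-and-replay leaves that responsibility to the user.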

The following table classifies existing graph capture along the two dimensions.

Quadrant (IV) (i.e., Out-of-box) is sound because it can transparently skip unwanted Python constructs (through multi-graph capture w/ Python fallback) and support sound replay (via recapture-and-play).
Quadrant (II) (i.e., human-in-the-loop, best-effort) is good for the export path because it’s easier to export a whole graph than multiple partial graphs. More importantly, if the execution environment does not support Python fallback, then Quadrant (II) is the only viable solution.
Quadrant (I) is the best of both worlds. We do not have any solution in this quadrant, but there may be space for innovation to improve Quadrant (IV) solutions so that they capture more and more whole graphs (perhaps via user intervention).
Quadrant (III) does not make a lot of sense.
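
The difference between whole-graph capture and multi-graph capture w/ Python fallback can be sketched with a toy model in which a program is just a list of op names. Everything here is hypothetical (SUPPORTED_OPS, capture_whole_graph, capture_partial_graphs); it is not how any PyTorch capture mechanism is implemented, only an illustration of why one style can fail while the other always succeeds.

```python
SUPPORTED_OPS = {"add", "mul", "relu"}   # hypothetical graph-IR op set


def capture_whole_graph(ops):
    """Best-effort: fail if any op cannot be expressed in the graph IR."""
    unsupported = [op for op in ops if op not in SUPPORTED_OPS]
    if unsupported:
        raise RuntimeError(f"cannot capture whole graph, unsupported: {unsupported}")
    return list(ops)


def capture_partial_graphs(ops):
    """Always-succeed: split into graph segments and Python fallback segments."""
    segments = []
    for op in ops:
        kind = "graph" if op in SUPPORTED_OPS else "python"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(op)       # extend the current segment
        else:
            segments.append((kind, [op]))    # start a new segment
    return segments


program = ["add", "mul", "print", "relu"]
segments = capture_partial_graphs(program)

try:
    capture_whole_graph(program)
    whole_graph_ok = True
except RuntimeError:
    whole_graph_ok = False
```

Here the unsupported "print" splits the program into two partial graphs with a Python fallback in between, while whole-graph capture of the same program fails.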

It matters when a graph is captured

A system can capture graphs at different stages of a model's execution life cycle, which leads to different overhead, IR semantics, and composability.

Before execution, by examining Python bytecode (e.g., TorchDynamo) or AST (e.g., torch.jit.script). This is also called zero-overhead graph capture.
Tracing-based capture, which captures graphs during the execution of Python programs. PyTorch provides two tracing mechanisms that capture IRs w/ different semantics:
- Python-level tracing via __torch_function__, which captures IR at the Python level (e.g., FX).
- C++-level dispatcher tracing via either a custom dispatcher key (e.g., Lazy Tensor, torch.jit.trace) or __torch_dispatch__ (e.g., AOTAutograd), which captures streams of aten ops (i.e., aten IRs).
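
The contrast between before-execution capture and tracing can be illustrated with the Python standard library alone. The sketch below uses ast and dis to inspect a function without calling it, then traces the same function by running it on a recording object. The names SOURCE, log, and Recorder are hypothetical, and this is only an analogy for the two capture stages, not how TorchDynamo, torch.jit.script, or FX are implemented.

```python
import ast
import dis

SOURCE = '''
def model(x):
    log.append("executed")   # side effect, visible only if the body runs
    return x * 2 + 1
'''

log = []
namespace = {"log": log}
exec(compile(SOURCE, "<model>", "exec"), namespace)   # defines model, does not call it
model = namespace["model"]

# Before execution: inspect the AST and the bytecode without calling model.
ast_ops = {type(node.op).__name__
           for node in ast.walk(ast.parse(SOURCE))
           if isinstance(node, ast.BinOp)}
bytecode_ops = [ins.opname for ins in dis.get_instructions(model)]
ran_during_static_capture = list(log)   # snapshot: the body has not run yet


class Recorder:
    """Toy traced value that records the ops applied to it."""

    def __init__(self, trace):
        self.trace = trace

    def __mul__(self, other):
        self.trace.append(("mul", other))
        return self

    def __add__(self, other):
        self.trace.append(("add", other))
        return self


# Tracing: the function body must actually execute to produce the trace.
trace = []
model(Recorder(trace))
```

The static inspection sees the multiply and add without triggering the side effect, whereas the trace only exists because the body ran once.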

On overhead

Before-execution capture is zero-overhead;
Tracing-based systems always incur overhead, either during warm-up time as in capture-and-replay systems, or recurring as in recapture-and-play systems.

On composability w/ PyTorch core extension points

Before-execution graph capture is the least composable because it 1) takes additional handling to “see through” functions; 2) cannot access C++ (dispatcher) level semantics;
Tracing-based graph capture is composable w/ functional transforms, as it naturally traces through functions (incl. first-order and higher-order functions);
Furthermore, dispatcher-level tracing is the most composable as it can transparently incorporate dispatcher-level semantics like autograd and vmap.
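
The "see through functions" point can be illustrated with a toy tracer in plain Python. The names Recorder, compose, double, and increment are hypothetical, and the sketch does not use any PyTorch API. It shows that running a higher-order function on a recording object yields the ops of the inner functions with no extra work, while the code object of the composed function only mentions its free variables.

```python
class Recorder:
    """Toy traced value that records the ops applied to it."""

    def __init__(self, trace):
        self.trace = trace

    def __mul__(self, other):
        self.trace.append(("mul", other))
        return self

    def __add__(self, other):
        self.trace.append(("add", other))
        return self


def compose(f, g):
    """Higher-order function: returns a function that applies f, then g."""
    return lambda x: g(f(x))


def double(x):
    return x * 2


def increment(x):
    return x + 1


pipeline = compose(double, increment)

# Static view: the lambda's code object only knows it calls two free
# variables; which functions they are is not visible in its own bytecode.
static_view = set(pipeline.__code__.co_freevars)

# Tracing view: executing the pipeline records the ops inside double and
# increment without any special handling of the function calls.
trace = []
pipeline(Recorder(trace))
```

A before-execution capture would have to resolve and inline f and g itself to reach the same two ops, which is the additional handling mentioned above.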

On lowering to aten IRs

Dispatcher-level tracing has the major advantage of lowering to aten IRs in a way that is naturally consistent w/ eager execution. For instance, TorchDynamo uses torch.jit.trace() to lower from captured Python bytecode graphs to TorchScript IR. This process is sound because TorchDynamo-captured graphs contain no control flow and are shape-specialized. Two ongoing explorations, combining TorchDynamo w/ AOTAutograd to capture forward and backward graphs together and combining TorchDynamo w/ LazyTensor tracing, are further examples of pairing before-execution graph capture with tracing-based graph capture while remaining sound.

Concluding remarks

The following picture summarizes different aspects of evaluating a graph capture. This post is just the starting point of understanding the existing design space of PyTorch compilers to lay the foundation for building a truly composable and more stable PyTorch compiler stack.

garymm · May 23, 2022, 8:40pm

Great and very useful breakdown.
Would torchdynamo.optimize(nopython=True) fit into quadrant (I) of your 2D table?

penguinwu · May 24, 2022, 8:47pm

There are several tables in the post and I did not label them :<, so I do not know which table you are referring to :).

torchdynamo.optimize(nopython=True) would fall into sound, best-effort, whole-graph capture because the API may fail (by design). But when it succeeds, it produces a guarded whole graph. This API also provides diagnostic output to help users rewrite model code so that graph capture can succeed. So it also has the nature of a human-in-the-loop tool.

fortianyou · September 26, 2022, 6:26am

@penguinwu Great! I think another dimension to distinguish those graph capture mechanisms is their capability to fetch control flow graphs versus dataflow. What’s more, dynamic shapes or symbolic shapes can be considered as well.

_sean_silva · October 18, 2022, 1:54pm

Thanks for this wonderful write-up @penguinwu !!

penguinwu · November 7, 2022, 2:55pm

Totally agree. This post was written early in the year. Since then, we have made great strides in TorchDynamo (our next-generation graph capture technology). Ed’s weekly posts on the State of Symbolic shape are a great place to get a pulse on our current development of graph capture w/ dynamic shapes.

And please watch out for the upcoming PyTorch Conference (12/02), where much more will be shared.

kiszk · December 8, 2022, 5:01am

@penguinwu It is a great article.
After the announcement of PyTorch 2.0, I had one question when I revisited the figure in “Concluding remarks”. My question is “Will TorchScript APIs like torch.jit.script and torch.jit.trace in Python programs be deprecated for capturing a graph after PT 2.0?”.

PT 2.0 will provide a new API “torch.compile()” to capture the whole graph. Your talk at the PyTorch Conference (PyTorch Conference 2022 - YouTube) also presents the migration to TorchDynamo. Will TorchScript APIs in users’ Python programs (e.g. transformers/benchmark.py at main · huggingface/transformers · GitHub or transformers/run_clip.py at main · huggingface/transformers · GitHub) continue to exist or be deprecated after PT 2.0?

kiszk · December 8, 2022, 7:23pm

I realized that I am a bit confused. “Will torch.jit.script and torch.jit.trace be deprecated?” is a better question.

IMHO, torch.jit.script() may correspond to torch.export(), and torch.jit.trace() may correspond to torch.compile(). Any answers or comments are appreciated.

penguinwu · December 8, 2022, 10:37pm

No, torch.jit.script() and torch.jit.trace() are not yet deprecated, for two reasons:

Many vendors are still integrating their backends via TorchScript IR;
torch.export() is not yet ready.

We do want to consolidate front-end solutions into TorchDynamo organically. The speed of this consolidation depends on how fast vendors integrate with the PT2 stack and how soon we have PT2 export ready.

kiszk · December 9, 2022, 1:25am

Thank you very much for the quick clarification.
