Why PyTorch does not need a new standardized operator set

with @gchanan

Background: There is no right answer, only trade-offs

There have been many past efforts to build a standardized operator set for PyTorch:

  • PrimTorch was the most expansive effort, taking a maximal position of trying to decompose things into as small a set of operators as possible. The original target was <50 operators, but that target was never reached. There were a number of practical problems with the operator set. It decomposed things so aggressively that it was difficult for backend compilers to recover the performance lost to decompositions.
  • NVPrims was a fork of the PrimTorch operator set tailored to nvFuser. It disabled many of the PrimTorch decompositions that were harmful for performance and made a number of design decisions optimized for nvFuser.
  • Executorch IR (aka "Core ATen IR") made another attempt at IR standardization. It chose a set of operators optimized for edge devices and produced a list of operators that would support "90% of models" without being exhaustive. Since Executorch was inference-only, many assumptions baked into this IR made it unsuitable for training, and it is now being redesigned (see Pre-grad ATen IR below).
  • PyTorch/XLA also proposed an IR standardization that would make PyTorch's IR much more similar to StableHLO, since the main motivation was converting to XLA. We decided not to make these changes.
  • TorchScript's approach to IR standardization was to focus on forward/backward compatibility rather than a minimal operator set. A long-lasting (likely to outlast TorchScript itself) effect of this is restrictions on the types of schema changes that can be made to ATen operators. For example, you can add an optional argument to an ATen operator, but you can't remove an argument. This gives every ATen operator compatibility guarantees.
  • ONNX IR focused on cross-framework interoperability, which results in a lowest-common-denominator effect. It was initially inference-only, which makes it less optimized for training than IRs built with training in mind.
  • Pre-grad ATen IR is an ongoing effort to replace Executorch IR ("Core ATen IR") for some use cases, trying to fix the inference-only aspects of its design by capturing the IR before running the autograd engine (which destroys information needed for training). It is not a minimal operator set, in that it does not try to apply decompositions.
  • Inductor's ATen dialect is different from all of the above and includes a few Inductor-specific ops for handling random number generation. It is intentionally not trying to be a standard and does not try to impose its IR design choices on other backends. This gives Inductor the freedom to change things over time.

So what did we learn from all of these prior efforts? The main lesson is that there is no "best" IR, just many trade-offs. Different IRs are better (and worse) for different use cases.

Another key learning is that decompositions destroy information that could be useful to downstream compilers. It is much harder to go from a lower-level IR to a higher-level IR than the other way around, so it is far better to leave things at a higher level and let the backend progressively lower them as needed. The high-level original Torch IR can work for everyone, because it can be easily lowered to all of the IRs listed above.

The other learning is that making everyone use the same IR dialect is a false requirement. Looking at the above list of IRs, many of them are actively being used and working well. There is no practical downside to having multiple IRs, since all of them can be created from the original PyTorch program. If we were to pick one winner and impose it on all users, things would be much worse.

A better way: Configurable IR dialects

What we have today is actually a lot more powerful than picking a single IR. We have a configurable library of decompositions that allows every backend to rapidly create the perfect IR for its use case. Apply all decompositions: you get something close to PrimTorch. Apply no decompositions: you get something like TorchScript IR. Apply a medium amount of decompositions: you get close to the other IRs listed above (with a different set of decompositions for each one). It is a sliding scale where tuning the IR to meet your needs is just changing a configuration in your call to AOTAutograd.
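
A minimal sketch of what that configuration looks like, using the functorch.compile entry point to AOTAutograd (the choice of ops here is purely illustrative):

```python
import torch
from torch._decomp import get_decompositions
from functorch.compile import aot_function

# Choose a dialect: decompose only these ops; everything else
# stays at the higher-level ATen representation.
decomps = get_decompositions([
    torch.ops.aten.native_layer_norm,
    torch.ops.aten.gelu,
])

def inspect_backend(gm, example_inputs):
    print(gm.code)  # the graph arrives in our chosen dialect
    return gm.forward

def f(x, w, b):
    y = torch.nn.functional.layer_norm(x, (x.shape[-1],), w, b)
    return torch.nn.functional.gelu(y)

compiled = aot_function(f, fw_compiler=inspect_backend, decompositions=decomps)
compiled(torch.randn(4, 8), torch.randn(8), torch.randn(8))
```

Passing a different `decompositions` dict is all it takes to switch dialects; the backend code itself does not change.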

For portable representations (before passing to a backend) we should keep things in Torch IR or in non-decomposed, non-functionalized pre-grad ATen IR. This maintains the maximum flexibility, and we provide tools to easily lower this IR to whatever IR you want for your use case. These are the only IRs that work with all backends.
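
For example, a custom torch.compile backend receives this high-level Torch IR directly (a sketch; `my_backend` is just an illustrative name):

```python
import torch

def my_backend(gm: torch.fx.GraphModule, example_inputs):
    # gm holds high-level Torch IR (torch.* ops, not yet decomposed
    # or functionalized); the backend can lower it however it wants.
    print(gm.code)
    return gm.forward

@torch.compile(backend=my_backend)
def f(x):
    return torch.nn.functional.softmax(x @ x.T, dim=-1)

f(torch.randn(4, 4))
```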

Even though every set of selected decompositions defines a different IR dialect, there is still lots of shared code between backends. The library of decompositions is shared. The tools for working with the IRs (FX) are shared. And perhaps most importantly, AOTAutograd for dealing with training (which is so hard most backends don't even try) is shared and easily reused across backends.
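
Because every dialect is still an FX graph, generic graph-walking code works unchanged no matter which decomposition set a backend picked, e.g.:

```python
import torch

def f(x):
    return torch.relu(x) + 1

gm = torch.fx.symbolic_trace(f)
for node in gm.graph.nodes:
    if node.op == "call_function":
        print(node.target)  # same traversal code for any dialect
```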

In this design, since decompositions are run inside the backend (and backends can define custom decompositions), the mapping from operators to decompositions is not a backwards compatibility surface. If you run a saved PyTorch model with a newer backend, you will use the newer set of decompositions. This allows backends to change their IR dialect while maintaining support for serialized models. From a maintainability standpoint, this is far better because it allows backends to evolve their IR over time.
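
A backend can keep such custom decompositions in its own private table, sketched below (`backend_decomps` is an illustrative name, not a PyTorch API):

```python
import torch
from torch._decomp import register_decomposition

aten = torch.ops.aten

# Backend-private table: because the op -> decomposition mapping lives
# inside the backend, it can change between backend releases without
# breaking serialized models.
backend_decomps = {}

@register_decomposition(aten.silu, registry=backend_decomps)
def silu_decomp(x):
    return x * torch.sigmoid(x)

# backend_decomps can then be passed as `decompositions=` to AOTAutograd.
```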

To be precise, the BC contract is as follows:

  • decomposition operators are a BC surface equivalent to other PyTorch operators, following the Frontend BC Policy
  • the default mapping applied from PyTorch operators to decompositions is not a BC surface, in the same way that the implementation details of a composite implicit PyTorch operator are not a BC surface.

We could also impose a requirement that all new operators have decompositions; however, we currently feel the friction here is not worth the benefit.

How can we make things even better?

Obviously, things are not perfect. There are three main areas of investment where we are looking for contributions:

  • Long-tail decompositions. There are still many operators in PyTorch missing decompositions, especially in areas like linalg and special. These operators rarely appear in real models, but we should aim to have 100% decomposition coverage for non-primitive ops.
  • Better decomp UX. For example, add a way to define an operator set by listing supported ops rather than decompositions. This would let you opt into decomposing new operators by default (see the sketch after this list).
  • A build-your-own-IR webpage. Our system is very powerful, but new people trying to build a backend for PyTorch don't realize this. We should build a web page (similar to the installation matrix on PyTorch.org) that lets you select/deselect a set of decompositions and produces the set of operators you need to implement to cover certain percentages of PyTorch models. The page could have a number of presets corresponding to commonly used options, but let people tailor the IR to their use cases. It should also provide the code you need to copy-and-paste to lower to your customized IR.
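
A hypothetical helper for the supported-ops workflow mentioned above (not an existing PyTorch API; it simply inverts the current decomposition table):

```python
import torch
from torch._decomp import decomposition_table

# Given the ops a backend implements natively, decompose everything
# else for which a decomposition is known.
def decomps_for_supported_ops(supported):
    return {op: fn for op, fn in decomposition_table.items()
            if op.overloadpacket not in supported}

decomps = decomps_for_supported_ops({
    torch.ops.aten.mm,
    torch.ops.aten.convolution,
})
```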

The webpage might look something like this mockup:

[mockup image: build-your-own-IR page]

Questions

Q: What about serialization?

Serialization should be done pre-autograd, with no decompositions and no functionalization. Decompositions and functionalization should be applied after loading the model. This preserves the maximum amount of information and works with all backends. It also preserves flexibility, allowing backends to change their dialect without breaking backwards compatibility.
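
A rough sketch of that load-then-decompose workflow, using the torch.export APIs as an approximation:

```python
import torch
from torch._decomp import get_decompositions

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

# Serialize at the highest level available: no decompositions applied.
ep = torch.export.export(M(), (torch.randn(4),))
torch.export.save(ep, "model.pt2")

# After loading, each backend applies its own chosen decompositions.
ep = torch.export.load("model.pt2")
ep = ep.run_decompositions(get_decompositions([torch.ops.aten.gelu]))
```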

If a webpage like this exists, it would be quite complicated and difficult to keep up to date with the code, right?

Maybe we should invest more effort in documenting and explaining the available functionality in the decomposition subsystem, making it more modular and developer-friendly. I have seen several times that developers try to reinvent the wheel when they don't know such functionality exists in the decomposition system.

The data backing the webpage would need to be auto-generated from the code.

Even if we build a website from the code automatically, we still need to clearly document the decomposition subsystem, so that the website designer understands it 🙂