The very first item in the PyTorch Design Philosophy is “Usability over Performance”.
As discussed in the design page, while this principle can be surprising, it is at the core of PyTorch's flexibility and has been key to PyTorch's success.
But the ML ecosystem has evolved quickly in the past few years along two axes of interest here: the concentration of most users on a small number of models, and a continuous increase in hardware specialization.
In particular, while the traditional user base of PyTorch has not shrunk, most new users are focused on a few Transformer-based architectures. As a result, the number of modalities and model architectures needed to cover a fixed top percentage of users has shrunk significantly. This concentration means that the performance of these models has gained outsized importance and has shifted the original balance of “Usability over Performance”.
Moreover, with more specialized hardware for these Transformer-based architectures, and even “general purpose” accelerators adding specialized components to speed up matrix multiplications, kernel implementations are becoming more complex and thus more specialized. This means that a single kernel in PyTorch that used to provide “reasonable performance” can no longer get as close to roofline performance, again shifting the balance of “Usability over Performance”.
What do we need?
At a very high level, I would argue that:
- The speed of a few ops is key for short-term, high-visibility impact
- Fast adoption of new ops and techniques is key for mid-term relevance
- Consistent “reasonable performance” across ops and architectures is necessary for long-term adaptability
Most of these objectives are achieved by providing a more diverse set of kernels in a consistent manner. This creates a specific challenge for PyTorch Core, where usability remains at the forefront of our concerns.
So what should we do with new kernels?
There are 3 main directions we can go here:
A) Add all of them in core and compile them into our shipped binary
This is the most natural approach, and it has key benefits:
- Great out-of-the-box performance, with no external dependency or installation needed
- No warmup cost from just-in-time compilation
- No dependency on a compilation toolchain
- Familiar user experience, as this is how we have shipped kernels historically
But it also has major drawbacks:
- Feasibility of shipping as a PyPI package, given the binary size
- Usability impact of the increased binary size on users (download failures, disk usage, etc.)
- Maintenance burden of all these kernels
- Complex wheel infrastructure to manage compilation and binary size
B) Just-in-time compile them all in core
This approach mainly tackles the binary-size concern, at the cost of increased friction for the end user. On the benefits side:
- Minimal binary size
- Simpler wheel build
- Allows leveraging JIT-only, high-level kernel-writing languages like Triton
But it comes with significant drawbacks:
- Startup time for users, who have to compile these kernels on first use (see the sketch after this list)
- Runtime dependencies on a full compilation toolchain
- Maintenance burden of the kernels themselves, plus a delayed signal on the impact of any change (from build time to whenever the kernel is actually run).
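To make the warmup and toolchain points concrete, here is a minimal sketch of what this JIT path looks like today with torch.utils.cpp_extension.load_inline; the scaled_add op and its source are purely illustrative, not an existing PyTorch kernel.

```python
import torch
from torch.utils.cpp_extension import load_inline

# C++ source for a toy op; load_inline prepends the torch/extension.h include.
cpp_source = """
torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double alpha) {
  // Equivalent to a + alpha * b, using ATen's add-with-alpha.
  return a.add(b, alpha);
}
"""

# The first call pays the compilation cost and requires a C++ toolchain at
# runtime; later calls in the same environment reuse the cached build.
module = load_inline(
    name="scaled_add_ext",
    cpp_sources=cpp_source,
    functions=["scaled_add"],
)

out = module.scaled_add(torch.randn(4), torch.randn(4), 0.5)
```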
C) Move all of these kernels out of core
The last option is to move all these kernels into a collection of third-party packages out of core. This has quite a few benefits from the Core point of view:
- Significantly simpler maintenance, in particular to drop features
- Best possible performance, as nothing prevents adding ever more specialized kernels
- Easy core build, as it only needs to support the same small number of kernels as today
It also comes with drawbacks:
- While Core has many extension points, more might be needed to fully enable this (for example, adding an out-of-core implementation of an existing op for a specific dtype or layout)
- Dependency and multi-package management for end users
- Surprising performance cliffs, where a missing “pip install” or “import” leads to significantly different performance (the sketch after this list shows the import-time registration behind this)
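For illustration, here is a minimal sketch of the kind of extension point an out-of-core package can already use: the low-level torch.library.Library API to override an existing aten op for one dispatch key. The replacement implementation is a toy stand-in; a real package would call into its own specialized kernel. Because the registration only happens when the package is imported, a forgotten import silently falls back to the stock kernel, which is exactly the performance cliff in the last item above.

```python
import torch
from torch.library import Library

# Out-of-core packages typically run this at import time: grab a handle on the
# aten namespace and register an implementation for one dispatch key.
_lib = Library("aten", "IMPL")

def _my_div(self, other):
    # Toy stand-in for a specialized out-of-core kernel (avoids calling div
    # itself so it does not re-enter this override).
    return self * other.reciprocal()

# From here on, aten::div.Tensor on CPU routes through _my_div.
_lib.impl("div.Tensor", _my_div, "CPU")

x = torch.randn(8)
print(torch.div(x, torch.full_like(x, 2.0)))
```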
So what should we do?
As with many of these more complex problems, there is no one-size-fits-all solution.
I would suggest the following:
We should use A for ops that we expect to stay in heavy use long-term (2+ years), and for which we aim to provide 95% of roofline performance.
To be able to achieve this, we will need:
- Advanced kernel engineering for the different backends
- Improvements in our packaging solution to reduce binary size from where we are today
- A policy for moving kernels to just-in-time compilation or out of tree, to keep long-term binary size and maintenance burden in check
We should use B for two main use cases: as an opt-in performance boost (for less-used configurations) and for long-term ops with lower usage (as we do today for some ops handling complex dtypes in core).
To be able to achieve this, we will need:
- A proper per-backend just-in-time compilation toolchain (cpp_extensions and Triton support a couple of backends; the JITerator is CUDA-only)
- Op usage tracking for balancing A and B
- Caching, and the option to share compiled artifacts, to ensure users of these “lower usage” ops still get a best-in-class experience (see the sketch after this list).
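As a rough illustration of the caching point, here is a minimal sketch using today's torch.utils.cpp_extension.load with an explicit build directory; the fused_op name, the fused_op.cpp source file, and the shared path are all hypothetical.

```python
import os
from torch.utils.cpp_extension import load

# Hypothetical shared location (network mount, baked into a container image,
# or pre-populated by CI) where compiled artifacts live.
shared_cache = "/shared/torch-extension-cache/fused_op"
os.makedirs(shared_cache, exist_ok=True)

# Only the first load into an empty build directory pays the compilation cost;
# subsequent loads reuse the artifacts already present there. Without
# build_directory, a per-user cache directory is used instead.
fused_op = load(
    name="fused_op",
    sources=["fused_op.cpp"],  # hypothetical source file shipped with the op
    build_directory=shared_cache,
)
```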
We should use C for three main use cases: squeezing the last drop of performance out of a given use case, potentially short-lived ops, and the long tail of low-usage ops.
To be able to achieve this, we will need:
- ABI stability for C++ extensions
- Backend coverage for the cpp_extensions submodule beyond cuda/rocm/xpu
- A better balance of extension-point overhead vs. capability (can we add per-dtype, per-layout, per-memory-format extension points for each kernel? see the hypothetical sketch after this list)
- An improved dependency-management story for end users (both for the torch package and for the ecosystem of packages built on top of it)
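To make the per-dtype question concrete, here is a purely hypothetical sketch of what a per-dtype extension point could look like; none of these helpers exist in torch.library today, and the routing is done in Python only for clarity, whereas a real extension point would live inside the dispatcher.

```python
import torch

# Hypothetical registry that an out-of-core package would populate at import time.
_MM_KERNELS = {}  # dtype -> callable

def register_mm_kernel(dtype, fn):
    # Hypothetical registration hook; not an existing torch.library API.
    _MM_KERNELS[dtype] = fn

def mm(a, b):
    # Route to a specialized kernel when one is registered for this dtype,
    # otherwise fall back to the stock implementation.
    kernel = _MM_KERNELS.get(a.dtype)
    if kernel is not None:
        return kernel(a, b)
    return torch.mm(a, b)

# Example: an out-of-core package registering a (fake) fp16-specialized kernel.
register_mm_kernel(torch.float16, lambda a, b: torch.mm(a, b))
```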
Conclusion
I think that “Usability over Performance” remains a key differentiator for PyTorch, and we need to preserve it even in a world where the performance of a few architectures is disproportionately important.
But we also need to improve our flexibility and extension points to enable increased performance for the few models that have an outsized impact on the industry.
The most important piece here is to allow out-of-core, fast-moving projects to complement core for these extremely specialized usages (per-size or per-dtype kernels), while increasing our capacity to add the most important kernels to core directly.