The very first item in the PyTorch Design Philosophy is “Usability over Performance”.
As discussed in the design page, while this principle can be surprising, it is at the core of PyTorch's flexibility and has been key to PyTorch's success.
But the ML ecosystem has evolved quickly in the past few years along two axes of interest here: the concentration of most users on a small number of models, and a continuous increase in hardware specialization.
In particular, while the traditional user base of PyTorch has not shrunk, most new users are focused on a few Transformer-based architectures. As a result, the number of modalities and model architectures needed to cover a fixed top percentage of users has shrunk significantly. This concentration means that the performance of these models has gained outsized importance and has shifted the original balance of “Usability over Performance”.
Moreover, with more specialized hardware for these Transformer-based architectures, and even “general purpose” accelerators adding specialized components to speed up matrix multiplications, kernel implementations are becoming more complex and thus more specialized. This means that a single kernel in PyTorch that used to provide “reasonable performance” can no longer get as close to roofline performance, again shifting the balance of “Usability over Performance”.
What do we need?
At a very high level, I would argue that:
- The speed of a few ops is key for short-term, high-visibility impact
- Fast adoption of new ops and techniques is key for mid-term relevance
- Consistent “reasonable performance” across ops and architectures is necessary for long-term adaptability
Most of these objectives are achieved by providing a more diverse set of kernels in a consistent manner. This creates a specific challenge for PyTorch Core, where usability remains at the forefront of our concerns.
So what should we do with new kernels?
There are 3 main directions we can go here:
A) Add all of them in core and compile them into our shipped binary
This is the most natural approach, and it has key benefits:
- Great out-of-the-box performance, with no external dependency or installation needed
- No warmup cost from just-in-time compilation
- No dependency on a compilation toolchain
- Familiar user experience, as this is how we have shipped kernels historically
But it also has major drawbacks:
- Feasibility of shipping as a PyPI package, given the binary size
- Usability impact of the increased binary size on users (download failures, disk usage, etc.)
- Maintenance burden of all these kernels
- Complex wheel infrastructure to manage compilation and binary size
B) Just-in-time compile them all in core
This approach mainly tackles the binary-size concern, at the cost of increased friction for the end user. On the benefits side:
- Minimal binary size
- Simpler wheel build
- Allows leveraging JIT-only, high-level kernel-writing languages like Triton
But it comes with significant drawbacks:
- Startup time for users, who have to compile these kernels on first use (see the sketch after this list)
- Runtime dependencies on a full compilation toolchain
- Maintenance burden of the kernels themselves, plus a delayed signal on the impact of any change (from build time to whenever the kernel is actually run).
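To make the warmup and toolchain points concrete, here is a minimal sketch of what this JIT path looks like today with torch.utils.cpp_extension.load_inline; the scaled_add op and its source are purely illustrative, not an existing PyTorch kernel.

```python
import torch
from torch.utils.cpp_extension import load_inline

# C++ source for a toy op; load_inline prepends the torch/extension.h include.
cpp_source = """
torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double alpha) {
  // Equivalent to a + alpha * b, using ATen's add-with-alpha.
  return a.add(b, alpha);
}
"""

# The first call pays the compilation cost and requires a C++ toolchain at
# runtime; later calls in the same environment reuse the cached build.
module = load_inline(
    name="scaled_add_ext",
    cpp_sources=cpp_source,
    functions=["scaled_add"],
)

out = module.scaled_add(torch.randn(4), torch.randn(4), 0.5)
```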
C) Move all of these kernels out of core
The last option is to move all these kernels into a collection of third-party packages out of core. This has quite a few benefits from the Core point of view:
- Significantly simpler maintenance, in particular to drop features
- Best possible performance, as nothing prevents adding ever more specialized kernels
- Easy core build, as it only needs to support the same small number of kernels as today
It also comes with drawbacks:
- While Core has many extension points, more might be needed to fully enable this (for example, adding an out-of-core implementation of an existing op for a specific dtype or layout)
- Dependency and multi-package management for end users
- Surprising performance cliffs, where a missing “pip install” or “import” leads to significantly different performance (the sketch after this list shows the import-time registration behind this)
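For illustration, here is a minimal sketch of the kind of extension point an out-of-core package can already use: the low-level torch.library.Library API to override an existing aten op for one dispatch key. The replacement implementation is a toy stand-in; a real package would call into its own specialized kernel. Because the registration only happens when the package is imported, a forgotten import silently falls back to the stock kernel, which is exactly the performance cliff in the last item above.

```python
import torch
from torch.library import Library

# Out-of-core packages typically run this at import time: grab a handle on the
# aten namespace and register an implementation for one dispatch key.
_lib = Library("aten", "IMPL")

def _my_div(self, other):
    # Toy stand-in for a specialized out-of-core kernel (avoids calling div
    # itself so it does not re-enter this override).
    return self * other.reciprocal()

# From here on, aten::div.Tensor on CPU routes through _my_div.
_lib.impl("div.Tensor", _my_div, "CPU")

x = torch.randn(8)
print(torch.div(x, torch.full_like(x, 2.0)))
```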
So what should we do?
As with many of these more complex problems, there is no one-size-fits-all solution.
I would suggest the following:
We should use A for ops that we expect to stay in heavy use long-term (2+ years), and for which we aim to provide 95% of roofline performance.
To be able to achieve this, we will need:
- Advanced kernel engineering for the different backends
- Improvements in our packaging solution to reduce binary size from where we are today
- A policy for moving kernels to just-in-time compilation or out of tree, to keep long-term binary size and maintenance burden in check
We should use B for two main use cases: as an opt-in performance boost (for less-used configurations) and for long-term ops with lower usage (as we do today for some ops handling complex dtypes in core).
To be able to achieve this, we will need:
- A proper per-backend just-in-time compilation toolchain (cpp_extensions and Triton support a couple of backends; the JITerator is CUDA-only)
- Op usage tracking for balancing A and B
- Caching, and the option to share compiled artifacts, to ensure users of these “lower usage” ops still get a best-in-class experience (see the sketch after this list).
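As a rough illustration of the caching point, here is a minimal sketch using today's torch.utils.cpp_extension.load with an explicit build directory; the fused_op name, the fused_op.cpp source file, and the shared path are all hypothetical.

```python
import os
from torch.utils.cpp_extension import load

# Hypothetical shared location (network mount, baked into a container image,
# or pre-populated by CI) where compiled artifacts live.
shared_cache = "/shared/torch-extension-cache/fused_op"
os.makedirs(shared_cache, exist_ok=True)

# Only the first load into an empty build directory pays the compilation cost;
# subsequent loads reuse the artifacts already present there. Without
# build_directory, a per-user cache directory is used instead.
fused_op = load(
    name="fused_op",
    sources=["fused_op.cpp"],  # hypothetical source file shipped with the op
    build_directory=shared_cache,
)
```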
We should use C for three main use cases: squeezing the last drop of performance out of a given use case, potentially short-lived ops, and the long tail of low-usage ops.
To be able to achieve this, we will need:
- ABI stability for C++ extensions
- Backend coverage for the cpp_extensions submodule beyond cuda/rocm/xpu
- A better balance of extension-point overhead vs. capability (can we add per-dtype, per-layout, per-memory-format extension points for each kernel? see the hypothetical sketch after this list)
- An improved dependency-management story for end users (both for the torch package and for the ecosystem of packages built on top of it)
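To make the per-dtype question concrete, here is a purely hypothetical sketch of what a per-dtype extension point could look like; none of these helpers exist in torch.library today, and the routing is done in Python only for clarity, whereas a real extension point would live inside the dispatcher.

```python
import torch

# Hypothetical registry that an out-of-core package would populate at import time.
_MM_KERNELS = {}  # dtype -> callable

def register_mm_kernel(dtype, fn):
    # Hypothetical registration hook; not an existing torch.library API.
    _MM_KERNELS[dtype] = fn

def mm(a, b):
    # Route to a specialized kernel when one is registered for this dtype,
    # otherwise fall back to the stock implementation.
    kernel = _MM_KERNELS.get(a.dtype)
    if kernel is not None:
        return kernel(a, b)
    return torch.mm(a, b)

# Example: an out-of-core package registering a (fake) fp16-specialized kernel.
register_mm_kernel(torch.float16, lambda a, b: torch.mm(a, b))
```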
Conclusion
I think that “Usability over Performance” remains a key differentiator for PyTorch, and we need to preserve it even in a world where the performance of a few architectures is disproportionately important.
But we also need to improve our flexibility and extension points to enable increased performance for the few models that have an outsized impact on the industry.
The most important piece here is to allow out-of-core, fast-moving projects to complement core for these extremely specialized usages (per-size or per-dtype kernels), while increasing our capacity to add the most important kernels to core directly.