Native-ops DSL support in PyTorch core
tl;dr: We have merged support for writing Python- and DSL-based native ops in PyTorch. You can now start using your favorite DSLs to write new implementations of native operators for inclusion in core.
Why?
DSLs like Triton, Helion, cuteDSL, and others have become powerful tools for writing performant GPU (and CPU) code without going all the way to hand-tuned C++ or CUDA. They provide a great developer experience without sacrificing either code readability or performance.
Notably, Flash-Attention-v4, a key optimization for modern LLMs, is implemented in cuteDSL. After its successful integration into core, we decided to formalize a) what’s expected from DSL-based native ops, b) how they should be added to core, and c) what expectations we have of them and how they interact with other PyTorch systems.
We strongly suspect that some of the most performant implementations of important operators will start to be written in DSLs, and these changes give us the option, and a clear path if desired, to include them in PyTorch core.
How?
PR 176280 adds the torch/_native directory, containing general utilities shared across all DSL runtimes (e.g., runtime-availability and versioning checks), along with DSL-specific utilities, triton_utils.py and cutedsl_utils.py. These provide APIs to:
- Determine whether the DSL runtime is a) installed, and b) at a supported version
- Register implementations to override native ops when appropriate, either for a subset of functionality or for all cases
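To illustrate the first point, a runtime-availability and version check along the lines of what triton_utils.py provides could be written with the standard library alone. The helper names and the minimum version below are hypothetical, not the actual torch/_native API:

```python
import importlib.metadata
import importlib.util

# Hypothetical minimum version, for illustration only; the real
# torch/_native utilities define their own version policy.
MIN_TRITON_VERSION = (3, 0)

def runtime_installed(module_name: str) -> bool:
    """Return True if the DSL runtime module can be imported."""
    return importlib.util.find_spec(module_name) is not None

def runtime_version_ok(dist_name: str, minimum: tuple[int, ...]) -> bool:
    """Return True if the installed distribution meets the minimum version."""
    try:
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return False
    parsed = tuple(
        int(part) for part in version.split(".")[: len(minimum)] if part.isdigit()
    )
    return parsed >= minimum

def triton_available() -> bool:
    """Both checks must pass before a Triton-based op may be registered."""
    return runtime_installed("triton") and runtime_version_ok(
        "triton", MIN_TRITON_VERSION
    )
```

Keeping these checks in one shared place is the point of torch/_native: each DSL op asks "is my runtime usable?" once, rather than re-implementing the probe per operator.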
We are enforcing that all Python native ops are registered through PyTorch’s dispatcher. This means ops are immediately tied into all dispatch logic, notably (but not limited to!) autograd, without requiring further developer intervention.
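The pattern, in spirit, is conditional registration: the DSL implementation is registered only when its runtime checks pass, and otherwise dispatch falls through to the existing native kernel. The toy pure-Python registry below sketches that idea only; the real mechanism is PyTorch's dispatcher, not this hand-rolled dictionary:

```python
# Toy sketch of conditional override registration. Illustrative only:
# PyTorch uses its dispatcher (torch.library) for this, not a dict.
from typing import Callable

_registry: dict[str, Callable] = {}

def register_op(name: str, fn: Callable, *, condition: bool = True) -> None:
    """Register fn as the implementation of `name` only if `condition`
    holds, mirroring how a DSL kernel is registered only when its
    runtime is installed and at a usable version."""
    if condition:
        _registry[name] = fn

def dispatch(name: str, *args):
    """Call whichever implementation is currently registered for `name`."""
    return _registry[name](*args)

# The default (reference) implementation is always registered.
register_op("mul", lambda a, b: a * b)

# A "DSL" override is registered only when its runtime is available;
# the availability flag is hard-coded here for illustration.
dsl_runtime_available = False
register_op("mul", lambda a, b: "dsl result", condition=dsl_runtime_available)
```

Because the override above is gated on an unavailable runtime, `dispatch("mul", 2, 3)` still hits the reference implementation. In real code the gating condition would be the availability check from the shared utilities, and the registration call would go through the dispatcher so autograd and the rest of dispatch see the op automatically.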
There are a few smaller restrictions (see link), but beyond those, PyTorch is now ready for DSLs to start contributing to native operators.
Further, the torch/_native/ops directory has been created to hold all our DSL ops, and it will soon fill with the DSL-based code that has already been written but didn’t have a path to living within core.
Further details and examples are located in torch/_native/README.md.
What’s Next?
Start using it! If you’re thinking of updating or adding an op, you now have a new way to do it, if the pros are worth it to you!
There are a few things immediately on our list:
- Operators
- Start writing implementations using the framework.
- Support more DSLs
- Helion / cuTile / anything else
- AOT Compilation
- Compiled objects for python-less environments
- Custom Ops
- Don’t override an existing op; create an entirely new one. Currently hampered by performance concerns