When integrating custom kernels (CUDA, Triton, etc.) into a PyTorch training loop,
you need to tell PyTorch how to run the kernel and how to differentiate it.
Two APIs exist for this, and they interact with torch.compile very differently.
For the full reference, see the Custom Ops landing page and the Custom Ops manual.
API 1: torch.autograd.Function
Note: This is the most widely used API, but it is not the recommended
one if you need torch.compile integration. Most users should migrate to
torch.library.custom_op, described below.
The classic approach – define forward and backward in a single class:
```python
class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        out, lse, S = my_kernel_fwd(q, k, v)
        ctx.save_for_backward(q, k, v, lse, S)
        return out, lse, S  # intermediates must be returned -- see below

    @staticmethod
    def backward(ctx, grad_output, grad_lse, grad_S):
        q, k, v, lse, S = ctx.saved_tensors
        return my_kernel_bwd(q, k, v, lse, S, grad_output)

def my_op(q, k, v):
    out, _lse, _S = MyOp.apply(q, k, v)  # drop intermediates at the call site
    return out
```
What happens under torch.compile?
When torch.compile encounters MyOp.apply(...), Dynamo traces inside the
forward and backward methods. Whether this succeeds depends entirely on what
those methods contain:
- Pure PyTorch ops: Dynamo traces through them, everything compiles. No graph
  break.
- NumPy calls: Dynamo converts them to PyTorch equivalents and compiles through.
- Custom C++/CUDA kernel calls (via pybind11,
torch.ops, cpp_extension, etc.):
Dynamo graph-breaks because it can’t symbolically trace through the kernel –
specifically, it can’t determine the output shapes and dtypes without a
FakeTensor/Meta implementation.
Most real custom ops fall into the third category. When a graph break happens,
the compiled graph gets split into pieces around the custom op call, and the op
itself runs in eager mode. You get multiple compiled “frames” instead of one
unified graph, losing optimization opportunities.
API 2: torch.library.custom_op
The compile-friendly approach – register the op in PyTorch’s dispatcher:
```python
@torch.library.custom_op("mylib::my_op", mutates_args=())
def _my_op(q: Tensor, k: Tensor, v: Tensor) -> tuple[Tensor, Tensor, Tensor]:
    out, lse, S = my_kernel_fwd(q, k, v)
    return out, lse, S

@_my_op.register_fake
def _(q, k, v):
    # No real computation -- just describe the output shapes and dtypes.
    return (
        torch.empty_like(q),
        torch.empty(q.shape[0], q.shape[1], q.shape[2], device=q.device),
        torch.empty_like(q),
    )

def setup_context(ctx, inputs, output):
    q, k, v = inputs
    out, lse, S = output
    ctx.save_for_backward(q, k, v, lse, S)

def backward(ctx, grad_out, grad_lse, grad_S):
    q, k, v, lse, S = ctx.saved_tensors
    return my_kernel_bwd(q, k, v, lse, S, grad_out)

_my_op.register_autograd(backward, setup_context=setup_context)

def my_op(q, k, v):
    out, _lse, _S = _my_op(q, k, v)  # drop intermediates at the call site
    return out
```
Dynamo treats this as an opaque node in the graph (like torch.mm). It never
tries to trace inside. The register_fake implementation provides the shape/dtype
metadata needed for symbolic tracing. No graph break, guaranteed.
Answers to common questions
Q1: Does autograd.Function fall back to eager under torch.compile?
It depends on what’s inside forward(). Dynamo traces into the method body. If
the kernel call is opaque (no FakeTensor/Meta implementation), Dynamo graph-breaks
at that point. The surrounding code compiles, but the custom op itself runs in
eager.
You can verify this by counting compiled frames (see runnable example below).
Q2: Why register autograd separately?
Because torch.compile needs different information at different stages:
| Stage | What the compiler needs | What provides it |
|---|---|---|
| Tracing (compile-time) | Output shapes and dtypes | register_fake |
| Execution (forward pass) | The actual kernel | custom_op function body |
| Differentiation (backward pass) | Gradient formula | register_autograd |
autograd.Function bundles all three into one class. When Dynamo traces into
forward() and hits an opaque kernel, it can’t satisfy the tracing stage –
it doesn’t know what shapes come out. The custom_op API forces you to provide
this information explicitly via register_fake, which is what makes it
compile-friendly.
Q3: The forward doesn’t have ctx – how do I save intermediates?
Return them as additional outputs from the custom op. The setup_context
callback receives the complete forward output, so you can unpack and save
whatever you need:
```python
@torch.library.custom_op("mylib::my_op", mutates_args=())
def my_op(x: Tensor) -> tuple[Tensor, Tensor]:
    result = torch.sin(x)
    intermediate = torch.cos(x)  # needed for backward, expensive to recompute
    return result, intermediate

def setup_context(ctx, inputs, output):
    result, intermediate = output  # <-- full forward output is available
    ctx.save_for_backward(intermediate)

def backward(ctx, grad_result, grad_intermediate):
    intermediate, = ctx.saved_tensors
    return grad_result * intermediate  # d/dx sin(x) = cos(x)

my_op.register_autograd(backward, setup_context=setup_context)
```
setup_context runs inline during the forward pass, right after the op executes.
It is not rerunning forward – it just receives the output that was already
computed. No redundant work.
Runnable example
The example below demonstrates everything with simulated “custom kernels”
(registered C++ ops without Meta implementations, which is the situation custom
kernel authors typically face).
```python
import torch
from torch import Tensor

# --- Simulate custom C++ kernels (registered ops without Meta impl) ---
_lib = torch.library.Library("demo", "DEF")
_lib.define("kernel_fwd(Tensor x) -> (Tensor, Tensor)")
_lib.impl("kernel_fwd", lambda x: (torch.sin(x), torch.cos(x)), "CPU")
_lib.define("kernel_bwd(Tensor cos_x, Tensor grad) -> Tensor")
_lib.impl("kernel_bwd", lambda cos_x, grad: grad * cos_x, "CPU")

# --- API 1: autograd.Function ---
class SinOpV1(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out, cos_x = torch.ops.demo.kernel_fwd(x)
        ctx.save_for_backward(cos_x)
        return out

    @staticmethod
    def backward(ctx, grad):
        cos_x, = ctx.saved_tensors
        return torch.ops.demo.kernel_bwd(cos_x, grad)

def sin_v1(x):
    return SinOpV1.apply(x)

# --- API 2: custom_op ---
@torch.library.custom_op("demo::sin_op_v2", mutates_args=())
def sin_v2(x: Tensor) -> tuple[Tensor, Tensor]:
    return torch.ops.demo.kernel_fwd(x)

@sin_v2.register_fake
def _(x):
    return torch.empty_like(x), torch.empty_like(x)

def _setup_ctx(ctx, inputs, output):
    _out, cos_x = output
    ctx.save_for_backward(cos_x)

def _backward(ctx, grad_out, _grad_cos):
    cos_x, = ctx.saved_tensors
    return torch.ops.demo.kernel_bwd(cos_x, grad_out)

sin_v2.register_autograd(_backward, setup_context=_setup_ctx)

# --- Compare under torch.compile ---
frame_count = 0

def counting_backend(gm, example_inputs):
    global frame_count
    frame_count += 1
    return gm

x = torch.randn(8, requires_grad=True)

# API 1
frame_count = 0
f1 = torch.compile(lambda x: sin_v1(x).sum(), backend=counting_backend)
loss1 = f1(x)
loss1.backward()
grad1 = x.grad.clone()
print(f"autograd.Function: {frame_count} frame(s) {'(graph broke!)' if frame_count > 1 else ''}")

# API 2
x.grad = None
frame_count = 0
f2 = torch.compile(lambda x: sin_v2(x)[0].sum(), backend=counting_backend)
loss2 = f2(x)
loss2.backward()
grad2 = x.grad.clone()
print(f"custom_op: {frame_count} frame(s)")

# Verify correctness
print(f"\nGradients match: {torch.allclose(grad1, grad2)}")
print(f"Reference (cos): {torch.cos(x[:4]).tolist()}")
print(f"API 1 grad: {grad1[:4].tolist()}")
print(f"API 2 grad: {grad2[:4].tolist()}")
```
Expected output:
```
autograd.Function: 2 frame(s) (graph broke!)
custom_op: 1 frame(s)

Gradients match: True
Reference (cos): [...]
API 1 grad: [...]  # same values
API 2 grad: [...]  # same values
```
Both produce correct gradients, but custom_op compiles as a single graph while
autograd.Function graph-breaks around the opaque kernel call.
Summary
| | autograd.Function | torch.library.custom_op |
|---|---|---|
| Eager mode | works | works |
| torch.compile | graph-breaks on opaque kernels | no graph break |
| Forward intermediates | ctx.save_for_backward() in forward | return as extra outputs, save in setup_context |
| Autograd | in the class | register_autograd |
| Shape info for compiler | not provided | register_fake |
Recommendation: Use torch.library.custom_op for custom kernels that need to
work with torch.compile. The API is more explicit – you tell the compiler exactly
what shapes your op produces (register_fake), how to differentiate it
(register_autograd), and what it computes (the function body). This separation is
what makes graph-break-free compilation possible. Beyond compile, going through the
dispatcher also gives you multiple backend registration (CPU, CUDA, XPU, etc. from
a single op definition), faster inference_mode dispatch, and correct interaction
with other PyTorch subsystems like vmap.
Why you should never save non-input/non-output tensors
A common mistake with autograd.Function is saving intermediate tensors (ones that
are neither inputs nor outputs) via ctx.save_for_backward:
```python
# BAD: lse and S are intermediates, not inputs or outputs
class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        out, lse, S = my_kernel_fwd(q, k, v)
        ctx.save_for_backward(q, k, v, lse, S)  # lse, S are neither input nor output!
        return out
```
This breaks several things:
- Double backward (higher-order gradients) breaks. If you need to
  differentiate through the backward pass (e.g., for second-order optimization
  or torch.autograd.grad with create_graph=True), the saved intermediates must
  carry proper autograd metadata. Only tensors that are inputs or outputs of
  the autograd.Function have this metadata. Intermediates stashed in ctx that
  are neither input nor output lack the autograd graph linkage needed to
  propagate gradients through the backward pass itself.
- torch.compile tracing fails. When Dynamo traces inside the
  autograd.Function, it builds a graph of the forward. Intermediates stashed
  in ctx but not returned become invisible to the graph – they are side
  effects. The compiler cannot reason about them, leading to incorrect
  behavior or errors.
- Incorrect memory accounting. The autograd engine tracks the lifecycle of
  saved tensors based on input/output relationships. Tensors that are neither
  input nor output escape this tracking, potentially keeping large buffers
  alive longer than expected.
The fix: Return intermediates as additional outputs and drop them at the call
site:
```python
# GOOD: lse and S are outputs, dropped by the wrapper
class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        out, lse, S = my_kernel_fwd(q, k, v)
        ctx.save_for_backward(q, k, v, lse, S)  # all are inputs or outputs now
        return out, lse, S

    @staticmethod
    def backward(ctx, grad_output, grad_lse, grad_S):
        q, k, v, lse, S = ctx.saved_tensors
        return my_kernel_bwd(q, k, v, lse, S, grad_output)

def my_op(q, k, v):
    out, _lse, _S = MyOp.apply(q, k, v)
    return out
```
This pattern makes the intermediates visible to the autograd engine, activation
checkpointing, and torch.compile. The custom_op API naturally encourages this
pattern since setup_context already receives the full output.