Example inputs to compilers are now fake tensors

(Editor’s note: I meant to send this in December, but forgot. Here you go, later than it should have been!)

The merged PR at Use dynamo fake tensor mode in aot_autograd, move aot_autograd compilation to lowering time [Merger of 89672 and 89773] by voznesenskym · Pull Request #90039 · pytorch/pytorch · GitHub changes how Dynamo invokes backends: instead of passing real tensors as example inputs, we now pass fake tensors, which don’t contain any actual data.

The motivation for this PR is in the dynamic shapes workstream. The essential problem is this: when compiling for dynamic shapes, you don’t want a backend compiler to specialize on the exact sizes the input tensors had. Instead, you want the compiler to be parametric over the sizes, and perhaps only peek at the real sizes occasionally (introducing a guard) when it would really benefit from specializing to a specific size.

There is no way to enforce this if we pass real tensors to backends: the real tensors have all of the real shapes, and don’t say anything about what relationships the sizes have symbolically. However, if we pass fake tensors, we can also replace the concrete sizes with symbolic sizes. If you use those sizes without looking at their concrete values, you end up with a compiled result that works with arbitrary choices of sizes; if you perform boolean tests on the sizes, we automatically introduce guards that limit the validity of the graph to whenever those guards are true.
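To make this concrete, here is a minimal sketch of a custom torch.compile backend that just inspects the example inputs it receives; the backend name and function are invented for illustration, and it assumes the calling convention described in this post (see the update further down the thread for a later change). With dynamic shapes enabled, the sizes of the fake inputs may be symbolic rather than burned-in integers:

import torch

# Hypothetical backend for illustration: a torch.compile backend receives
# (graph_module, example_inputs); after this PR the example inputs are
# FakeTensors, so their sizes may be SymInts instead of concrete ints.
def inspecting_backend(gm: torch.fx.GraphModule, example_inputs):
    for i, t in enumerate(example_inputs):
        # Printing a shape is harmless; branching on a size (e.g. `if t.shape[0] > 4`)
        # would introduce a guard restricting when this compiled graph is valid.
        print(f"input {i}: {type(t).__name__}, size {tuple(t.shape)}")
    return gm.forward  # just run the captured graph eagerly

@torch.compile(backend=inspecting_backend, dynamic=True)
def f(x):
    return x * 2 + 1

f(torch.randn(8))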

There is also a minor side benefit to passing fake tensors: with real tensors, it’s easy to believe that to perform static analysis on the graph (e.g., shape propagation), you have to actually run the graph (with real tensor inputs). This is very slow (since you’re running the real tensor compute at compile time) and uses up a lot of memory. Fake tensors encourage you to make use of FakeTensorMode, which allows you to run tensor computation without actually doing any of the real work.
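As a minimal sketch of this second point (the import path below is an internal module and may differ between releases): tensors created under FakeTensorMode carry shapes, dtypes, and devices but no real storage, so “running” the computation only propagates metadata:

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # No real memory is allocated and no real compute happens here.
    x = torch.empty(128, 512)
    w = torch.empty(512, 64)
    y = torch.relu(x @ w)
    print(type(y).__name__, y.shape)  # FakeTensor torch.Size([128, 64])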

Hopefully, the changes needed to make things work with fake tensors are straightforward. Please ask here if you have any questions!


Hi @ezyang, how do you recommend making custom PyTorch ops better support FakeTensor?

I want to run my model with torch.compile; however, the model uses a couple of custom ops, for which I get errors like:

RuntimeError: Failed running call_function my_custom_ops.my_custom_function(*(FakeTensor(FakeTensor(..., device='meta', size=(1, 109200, 8, 16),
           grad_fn=<ViewBackward0>), cuda:0), FakeTensor(FakeTensor(..., device='meta', size=(3, 2), dtype=torch.int64), cuda:0), FakeTensor(FakeTensor(..., device='meta', size=(3,), dtype=torch.int64), cuda:0), FakeTensor(FakeTensor(..., device='meta', size=(1, 109200, 8, 3, 4, 2),
           grad_fn=<AddBackward0>), cuda:0), FakeTensor(FakeTensor(..., device='meta', size=(1, 109200, 8, 3, 4),
           grad_fn=<ViewBackward0>), cuda:0), 64), **{}):
The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
(scroll up for backtrace)

Write a meta implementation for the operator. Check out The dynamic shapes manual - Google Docs

I see, it seems like directly registering to meta is only supported for smaller ops? I got this error when trying it on this kernel, for example: Deformable-DETR/ms_deform_attn_cuda.h at 11169a60c33333af00a4849f1808023eba96a931 · fundamentalvision/Deformable-DETR · GitHub

RuntimeError: We should not register a meta kernel directly to the operator 'ms_deform_attn', because it has a CompositeImplicitAutograd kernel in core. Instead we should let the operator decompose, and ensure that we have meta kernels for the base ops that it decomposes into.

Looking through the doc you sent, am I supposed to decorate it so that it decomposes? In my case, something like this?

@torch.ops.deformable_attention_ops.py_impl(DispatchKey.CompositeImplicitAutograd) # the ops name here
def ms_deform_attn(arg1, arg2...):  # the function name here

(It’s probably not correct, as I got AttributeError: '_OpNamespace' 'deformable_attention_ops' object has no attribute 'py_impl'. I also saw the “The place to put the operator (torch/_decomp or torch/_refs)” section, but I’m not sure it applies to my own custom ops? It looks like it’s referring to ops that PyTorch wants to expose to the public.)

For this op, it should “just work” even without doing anything, because for a composite op, the expectation is that the inner ops it calls have meta implementations. Is that true for your op? You didn’t link the rest of the decomposition implementation so I can’t tell easily.
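To illustrate the expectation for composite ops, here is a rough sketch (the namespace and op name are made up): a CompositeImplicitAutograd registration whose body decomposes entirely into base ATen ops. Because those base ops already have meta implementations, the op works under fake tensors without a meta kernel of its own:

import torch

# Hypothetical op for illustration only.
lib = torch.library.Library("myattn", "DEF")
lib.define("simple_attn(Tensor q, Tensor k, Tensor v) -> Tensor")

def simple_attn_composite(q, k, v):
    # Decomposes entirely into base ATen ops (matmul, softmax), each of
    # which already has a meta implementation.
    scores = torch.softmax(q @ k.transpose(-1, -2), dim=-1)
    return scores @ v

lib.impl("simple_attn", simple_attn_composite, "CompositeImplicitAutograd")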

This is the op: Deformable-DETR/ms_deform_attn_cuda.cu at 11169a60c33333af00a4849f1808023eba96a931 · fundamentalvision/Deformable-DETR · GitHub. It does call a few layers of ops, but after tracing, at the end of the day it just uses + and *, although I don’t think there is a meta implementation for any of those layers.

For this kind of nested op, how should we go about writing its decomposition implementation?

Hey @ezyang, I notice that both the doc you shared above and the examples in torch/_meta_registrations.py deal with ATen ops. Does that mean that for other custom kernels (e.g., flashattention), we need to manually write the meta backend implementation?

Yup! Custom kernels use the same registration mechanism as regular kernels, so you can register metas for them the same way.
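A rough sketch of what that registration can look like with torch.library (the namespace, op, and output shape here are invented for illustration; the meta function only computes output metadata and never touches real data):

import torch

# Hypothetical custom op for illustration.
lib = torch.library.Library("mylib", "DEF")
lib.define("fused_bias_gelu(Tensor x, Tensor bias) -> Tensor")

def fused_bias_gelu_meta(x, bias):
    # Shape/dtype propagation only: the output has the same metadata as x.
    return torch.empty_like(x)

lib.impl("fused_bias_gelu", fused_bias_gelu_meta, "Meta")

# The real kernels are registered under their own dispatch keys,
# e.g. lib.impl("fused_bias_gelu", cuda_impl, "CUDA").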

Update: this post is no longer accurate. PR https://github.com/pytorch/pytorch/pull/99320 changed the calling convention back to real tensors.


Thanks @ezyang for pointing to the nicely written manual on dynamic shapes. Is there similar documentation on at::SymInt → at::IntArrayRef specialization? If not, what’s the best way to learn about SymInt specialization?

P.S.: I think “c10/core/SymIntArrayRef.h” and “c10/core/SymInt.h” are good places to start in the absence of a comprehensive document/tutorial.