Embrace tensor subclass as a Python device registration API

Hey!

Thanks for taking the time to write down this proposal !

The experience of the above journey is overall pleasant, but it still has a few issues:

  1. This is ~expected as of today, in particular because we have been focusing torch.compile/subclass support on subclasses that are “wrappers around other Tensors that eventually desugar into ops on plain Tensors”.
    I think you can make your design fit this mold, but it might be awkward: store all the data in another Tensor that is just a holder, and translate everything into your own custom ops (see the PyTorch Custom Operators tutorial).
    Then compile will “desugar” into these ops and run them as black boxes.

  2. IIRC the FakeTensor trick is pretty simple: see torch/_subclasses/fake_tensor.py in the pytorch/pytorch repo on GitHub.
    I would agree with Ed on the issue: if you can, have the reported device be accurate based on the device you actually want to use. It is also a bit tricky, and while FakeTensor helped us clean up a lot of things, there are most likely a few rough edges left.
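As a small illustration of the FakeTensor mechanism mentioned in 2. (a sketch only; `FakeTensorMode` lives in a private module, `torch._subclasses.fake_tensor`, and its internals may change):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# Under FakeTensorMode, tensor factories produce FakeTensors: the storage
# lives on the meta device, but the tensor still *reports* a regular device,
# so shape/dtype/device propagation runs without any real allocation.
with FakeTensorMode():
    x = torch.empty(4, 4)
    y = x @ x  # only metadata is propagated, no actual matmul happens

print(type(y).__name__)  # -> FakeTensor
print(y.shape, y.device)
```

This is why an accurate device on the fake tensor matters: downstream tracing logic reads `y.device` to decide which kernels and device guards to emit.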

While we are working with @janeyx99 to provide ABI stability for a subset of libtorch, we are focusing on custom kernel writers right now, not out-of-tree devices.
This work will help if you go down the path of a Python-only Tensor subclass + custom ops in C++.
But if you need to use the PrivateUse1 extension points, this will not be covered in the current plan.

> If we define a custom C++ torch.Tensor subclass described in 1., can that class live
> in the upstream so there’s no torch header dependency from a backend? This class
> is generic enough for any backend to use. Maybe give it its own dispatch key so
> we don’t need to overload PrivateUse1?

I’m not sure I understand what you mean here, and I have a couple of questions:

  • What blocks you today from doing all of this with the subclass?
  • The second concern for most backend writers, once they have something that works, is performance. I am not sure what the actual performance characteristics of this approach will be, or whether the overhead will be acceptable.
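For context on the “overload privateuseone” point: today an out-of-tree backend reuses the single PrivateUse1 dispatch key by renaming it from Python; there is no mechanism to allocate a fresh key per backend. A rough sketch of that flow (the “myaccel” name and the stub module are made up; real kernels would still have to be registered in C++ for the device to be usable):

```python
import torch

# Claim the PrivateUse1 key under a custom backend name (illustrative name).
torch.utils.rename_privateuse1_backend("myaccel")

# Register a (stub) device module so torch-level queries have somewhere to go.
class _MyAccelModule:
    @staticmethod
    def is_available():
        return True

torch._register_device_module("myaccel", _MyAccelModule)

# The name is now a valid device type, even though no kernels exist yet.
d = torch.device("myaccel", 0)
print(d)  # -> myaccel:0
```

Since there is only one PrivateUse1 key, two backends cannot coexist in one process this way, which is presumably what motivates the “give it its own dispatch key” question above.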