Possible to use custom backend to create TensorImpl that allows custom datatype?

Forgive the potential naiveté of this question, but would it be possible to build a custom PyTorch backend that would allow me to create a Tensor with a custom datatype?

For context, this custom datatype is actually a class defined in C++ that acts as a “virtual” float32. The class has all of its binary / arithmetic operators defined, so one can treat it like a “regular” float32. In a similar use case, we’re able to use NumPy arrays with dtype=object to create arrays of our “virtual” float32s and run various operations through NumPy’s Python API. Effectively, this “virtual” float32 doesn’t explicitly compute anything; it just builds, internally, a graph of the computations that produced it.
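
To make that concrete, here is a rough Python stand-in for what the C++ class does. The `VirtualFloat` name and the recorded-expression representation are just placeholders for illustration, not the real class:

```python
import numpy as np

class VirtualFloat:
    """Hypothetical stand-in for the C++ "virtual" float32: it doesn't
    compute anything, it just records the expression that produced it."""
    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        return VirtualFloat(("add", self.expr, getattr(other, "expr", other)))

    def __mul__(self, other):
        return VirtualFloat(("mul", self.expr, getattr(other, "expr", other)))

    def __repr__(self):
        return f"VirtualFloat({self.expr!r})"

# dtype=object lets NumPy hold arbitrary Python objects and route
# elementwise arithmetic through their operator overloads.
a = np.array([VirtualFloat("a0"), VirtualFloat("a1")], dtype=object)
b = np.array([VirtualFloat("b0"), VirtualFloat("b1")], dtype=object)
print(a * b + a)   # each element carries its own little computation graph
```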

PyTorch is an impressive and complicated beast (much more so than NumPy), so figuring out how to do an analogous extension is difficult. Previously, I was able to create a wrapper subclass of PyTorch’s Python Tensor class and use the __torch_dispatch__ method to achieve my desired results. Basically, these WrappedTensors would just carry along a NumPy array of my “virtual” float32s, and __torch_dispatch__ would ensure that whatever tensor operation was occurring also applied the NumPy equivalent to the carried array.
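
For reference, the wrapper looked roughly like the sketch below (heavily simplified; the per-op table and attribute names are placeholders, not the real implementation):

```python
import numpy as np
import torch
from torch.utils._pytree import tree_map

# Hypothetical per-op table mapping aten ops to their NumPy equivalents.
NUMPY_EQUIVALENTS = {
    torch.ops.aten.add.Tensor: np.add,
    torch.ops.aten.mul.Tensor: np.multiply,
}

class WrappedTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, data, virtual):
        # A wrapper subclass: a real tensor as far as PyTorch is concerned,
        # plus a carried NumPy object array of "virtual" float32s.
        t = torch.Tensor._make_wrapper_subclass(
            cls, data.size(), dtype=data.dtype, device=data.device,
            requires_grad=data.requires_grad,
        )
        t._data = data
        t._virtual = virtual
        return t

    def __repr__(self):
        return f"WrappedTensor({self._data}, virtual={self._virtual})"

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        unwrap = lambda x: x._data if isinstance(x, WrappedTensor) else x
        unvirt = lambda x: x._virtual if isinstance(x, WrappedTensor) else x

        # Run the real aten op on the plain tensors ...
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))

        # ... and mirror it on the carried object arrays when we know how.
        np_op = NUMPY_EQUIVALENTS.get(func)
        virtual = np_op(*tree_map(unvirt, args)) if np_op else None
        return WrappedTensor(out, virtual)
```

With that in place, adding two WrappedTensors runs the real aten::add on the plain tensors and, in the same dispatch call, np.add on the two carried object arrays.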

Unfortunately, this approach ran into trouble when I wanted to overload base aten operations that didn’t involve my WrappedTensor. Without a way to force a __torch_dispatch__ interception, I couldn’t bake in the logic I needed.
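
In other words, something like this never reaches the subclass at all:

```python
import torch

a = torch.ones(3)
b = torch.ones(3)

# No WrappedTensor anywhere in the arguments, so WrappedTensor.__torch_dispatch__
# is never consulted; the stock CPU kernel for aten::add runs unmodified and
# there is nowhere to hook in the "virtual" bookkeeping.
c = torch.add(a, b)
```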

That turned my attention to potentially creating a new PyTorch backend that could support my needs and these “virtual” float32s. My understanding of custom TensorImpls and how they can be used to create custom Tensors is rough at best, so I’m not even sure whether what I’m asking is possible. I believe I’d be able to create a TensorImpl that lets me carry along these “virtual” float32s in a new attribute (an array container, perhaps even NumPy’s PyArrayObject). But if I could make this new TensorImpl hold these “virtual” float32s as its actual element type from the start, that would be even better / cleaner.
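
For what it’s worth, my (possibly wrong) mental model of the Python-visible surface of such a backend is roughly the sketch below. Everything that actually matters here (the allocator, device guard, and the TensorImpl carrying the “virtual” float32s) would presumably still have to live on the C++ side, and `virtual_add` is just a placeholder:

```python
import torch

# PyTorch reserves the PrivateUse1 dispatch key for out-of-tree backends;
# it can be given a friendlier name.
torch.utils.rename_privateuse1_backend("virtual")

# Kernels for the new key are registered against the existing aten schemas.
lib = torch.library.Library("aten", "IMPL")

def virtual_add(self, other, alpha=1):
    # Placeholder: this is where the op would be mirrored onto the carried
    # "virtual" float32s instead of (or in addition to) doing real math.
    raise NotImplementedError

lib.impl("add.Tensor", virtual_add, "PrivateUse1")
```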

In the end, this isn’t so much a hardware backend as a “virtual” hardware backend (or even just another software backend layer).

If anyone has any thoughts on the feasibility of this idea, they would be much appreciated!