One problem with Tensor::contiguous() is that when the Tensor is already contiguous, we still copy it and incur reference count increment/decrement costs. (Note that, per "Copy elision" on cppreference.com, contiguous() is not eligible for C++17 mandatory copy elision, because *this is not a prvalue! I'm not enough of a language lawyer to be sure whether NRVO is permissible in this case, but it does not seem to be happening in practice.) To fix this, I'm considering introducing c10::MaybeOwned<T>, which holds either a const T* or a T and takes up two pointers' worth of space instead of one, together with a new Tensor::expect_contiguous method returning c10::MaybeOwned<Tensor>.
The full code is in #53066. The reason I'm writing this post is to get more eyes on the question of how we can tell whether calling expect_contiguous() is a good idea in any particular case. The cost is always 1 extra pointer of space, an extra branch to access the contained Tensor, and an extra branch at destruction time. There is no benefit for non-contiguous Tensors. Contiguous Tensors save a reference count increment/decrement.
There is an argument to be made that expect_contiguous() should always be used: if the Tensor is non-contiguous, callers are committed to doing a reference count bump around it anyway, so the extra space and branch are probably negligible. If it is contiguous, callers should be happy to pay a small amount of extra space and branching to save expensive reference counting operations. (Note that, IIUC, expect_contiguous() cannot simply replace contiguous(), because that would break backward compatibility.)
I would love to hear thoughts on whether this is useful and how much we should use it, if at all.
Since this is a performance optimization, some performance numbers would be helpful to address the "how useful" question. In particular, how much does this speed up calls to contiguous()? In the workloads you've seen, how much do the contiguous() calls contribute to overhead?
There are some non-quantifiable costs to the proposed change:
- It adds complexity to the C++ API by introducing a new public type (MaybeOwned) and by having two functions that effectively do the same thing (contiguous and expect_contiguous).
- The C++ API diverges slightly more from the Python API.
There are alternatives to complicating the API:
Complicate the call sites:
if (!x.is_contiguous()) x = x.contiguous();
This also avoids the reference count operation in the contiguous case. It is more practical if there are only a few "hot spot" contiguous() calls, or if only a few workloads would substantially benefit.
Complicate the reference counting implementation:
We can make reference counting a lot cheaper – to the point where it’s generally not worth optimizing for increments/decrements. I’ve had a good experience using “biased reference counting” in Python, but there is substantial engineering work and complexity in implementing the asynchronous queueing/notification system between threads.
Can you elaborate on this? If x is of type const Tensor& or even Tensor&, how do you get it to work?
Yeah, then it gets uglier…
bool is_contiguous = x.is_contiguous();
Tensor holder = is_contiguous ? Tensor() : x.contiguous();
const Tensor& t = is_contiguous ? x : holder;
One problem with Tensor::contiguous() is that when the Tensor is contiguous, we still copy it and incur reference count increment/decrement costs

You mean a copy of the Tensor object, not the data, right?

you mean copy of the Tensor object

Yes, copying a Tensor is a refcount bump.
These costs have improved since I originally posted. There is still 1 pointer of extra space (used to hold a boolean isBorrowed_ flag). However, accessing the underlying Tensor does not require an extra indirection for the borrowed case, nor does it require a branch to determine if the c10::MaybeOwned is in the borrowed state. [PyTorch] Avoid double indirection in MaybeOwned's borrowed state by swolchok · Pull Request #55685 · pytorch/pytorch · GitHub fixed this.
I am now leaning more strongly toward migrating library code that doesn't know whether a Tensor is actually contiguous from Tensor::contiguous() to Tensor::expect_contiguous() because of this cost reduction.
I don't have data on that directly, but a pair of PRs ([PyTorch] Support borrowing in/out Tensors in TensorIterator by swolchok · Pull Request #55690 · pytorch/pytorch · GitHub and [PyTorch] Migrate add operators to borrow in TensorIteratorBase by swolchok · Pull Request #55691 · pytorch/pytorch · GitHub) that migrated TensorIterator to use MaybeOwned<Tensor> to borrow its operands when possible showed a 2.5% regression in cases where borrowing did not happen, but a 5-6% win when it did, on an addition microbenchmark.