Optimizing contiguous() for the case where the Tensor is_contiguous()?

swolchok · March 2, 2021, 12:39am

One problem with Tensor::contiguous() is that when the Tensor is contiguous, we still copy it and incur reference count increment/decrement costs. (Note that, per Copy elision - cppreference.com, contiguous() is not eligible for C++17 mandatory copy elision, because *this is not a prvalue! I’m not enough of a language lawyer to be sure whether NRVO is permissible in this case, but it does not seem to be happening in practice.) To fix this, I’m considering introducing c10::MaybeOwned<T>, which holds either a const T* or a T and takes up two pointers’ worth of space instead of one, together with a new Tensor::expect_contiguous method returning c10::MaybeOwned<Tensor>.

The full code is in #53066. The reason I’m writing this post is to get more eyes on the question of how we can tell whether calling expect_contiguous() is a good idea in any particular case. The cost is always 1 extra pointer of space, an extra branch to access the included Tensor, and an extra branch at destruction time. There is no benefit for non-contiguous Tensors. Contiguous Tensors save a reference count increment/decrement.

There is an argument to be made that expect_contiguous() should always be used: if the Tensor is non-contiguous, callers are committed to doing a reference count bump around it anyway, so the extra space and branch are probably negligible. If it is contiguous, callers should be happy to pay a small amount of extra space and branching to save expensive reference counting operations. (Note that IIUC expect_contiguous() cannot replace contiguous() because that would break backward compatibility.)

I would love to hear thoughts on whether this is useful and how much we should use it, if at all.

colesbury · March 2, 2021, 4:28pm

Since this is a performance optimization, some performance numbers would be helpful to address the “how useful” questions. In particular, how much does this speed up calls to contiguous()? In the workloads you’ve seen, how much do the contiguous() calls contribute to overhead?

There are some non-quantifiable costs to the proposed change:

It adds complexity to the C++ API by introducing a new public type (MaybeOwned) and by having two functions that effectively do the same thing (contiguous and expect_contiguous).
The C++ API diverges slightly more from the Python API.

There are alternatives to complicating the API:

Complicate the call sites:

if (!x.is_contiguous()) x = x.contiguous();

Also avoids the reference count operation in the contiguous case. More practical if there are a few “hot spot” contiguous calls or if there are only a few workloads that would substantially benefit.

Complicate the reference counting implementation

We can make reference counting a lot cheaper – to the point where it’s generally not worth optimizing for increments/decrements. I’ve had a good experience using “biased reference counting” in Python, but there is substantial engineering work and complexity in implementing the asynchronous queueing/notification system between threads.

swolchok · March 2, 2021, 10:07pm

Can you elaborate on this? If x is of type const Tensor& or even Tensor &, how do you get it to work?

colesbury · March 2, 2021, 10:58pm

Yeah then it gets uglier…

bool is_contiguous = x.is_contiguous();
Tensor holder = is_contiguous ? Tensor() : x.contiguous();
const Tensor& t = is_contiguous ? x : holder;

gchanan · March 4, 2021, 8:13pm

One problem with Tensor::contiguous() is that when the Tensor is contiguous, we still copy it and incur reference count increment/decrement costs

You mean copy of the Tensor object, not the data right?

swolchok · March 4, 2021, 8:55pm

you mean copy of the Tensor object

yes, copying a Tensor is a refcount bump.

swolchok · May 24, 2021, 11:25pm

These costs have improved since I originally posted. There is still 1 pointer of extra space (used to hold a boolean isBorrowed_ flag). However, accessing the underlying Tensor does not require an extra indirection for the borrowed case, nor does it require a branch to determine if the c10::MaybeOwned is in the borrowed state. [PyTorch] Avoid double indirection in MaybeOwned's borrowed state by swolchok · Pull Request #55685 · pytorch/pytorch · GitHub fixed this.

I am now leaning more strongly toward migrating library code that doesn’t know whether a Tensor is actually contiguous from Tensor::contiguous() to Tensor::expect_contiguous() because of this cost reduction

I don’t have data on that directly, but a pair of PRs ([PyTorch] Support borrowing in/out Tensors in TensorIterator by swolchok · Pull Request #55690 · pytorch/pytorch · GitHub and [PyTorch] Migrate add operators to borrow in TensorIteratorBase by swolchok · Pull Request #55691 · pytorch/pytorch · GitHub) that migrated TensorIterator to use MaybeOwned<Tensor> to borrow its operands when possible had a 2.5% regression in cases where borrowing did not happen, but a 5-6% win when it did on an addition microbenchmark.

Topic		Replies	Views
Investigation report: what would it cost to optimize c10::intrusive_ptr destruction for refcount == 1? (A: too much) performance	0	727	May 3, 2022
We shouldn't feel bad about passing `Tensor` by reference performance	0	2378	January 25, 2021
Comparing the performance of 0.4.1 and master performance	0	2374	February 9, 2021
State of PyTorch core: September 2021 edition frontend API	1	9439	September 21, 2021
Unionizing for Profit: How to Exploit the Power of Unions in C++ performance	2	3429	January 7, 2022

Optimizing contiguous() for the case where the Tensor is_contiguous()?

Related topics