Last year, I added an optimization to c10::intrusive_ptr for the common case of zero weak references: before performing an expensive atomic decrement on the weak reference count, we first check whether the count is 1. This works because weak references are uncommon, so the cheap check (a plain atomic load) usually replaces the atomic decrement entirely.
Note that this optimization is also performed by libc++'s shared_ptr implementation for the weak reference count but not the strong reference count.
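To make that fast path concrete, here is a minimal sketch of the check-before-decrement pattern, assuming (as in c10) that the weak count carries an implicit +1 while any strong reference is alive; this is an illustration, not the actual implementation:

#include <atomic>
#include <cstdint>

// Called when the last strong reference has just been dropped. Returns
// true if the caller should go ahead and delete the object.
bool drop_implicit_weak_ref(std::atomic<std::uint32_t>& weakcount) {
  // Cheap acquire load first: a weak count of 1 means no weak references
  // exist (and none can be created now), so we can skip the expensive
  // atomic read-modify-write entirely.
  return weakcount.load(std::memory_order_acquire) == 1 ||
         // Otherwise fall back to the real decrement; we held the last
         // weak reference if the previous value was 1.
         weakcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}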
I worry a lot about the cost of reference counting in the case where a Tensor
actually has unique ownership, so I decided to measure the downside of applying this optimization to the decrement of the strong reference count as well. I added the following benchmark to demonstrate the potential upside:
static void BM_IntrusivePtrExclusiveOwnership(benchmark::State& state) {
  while (state.KeepRunning()) {
    // Allocate, construct, and immediately destroy a uniquely-owned object;
    // volatile keeps the compiler from eliding the temporary.
    volatile auto var = make_intrusive<Foo>(0);
  }
}
// and similarly for std::shared_ptr
and this one to demonstrate the costs:
static void BM_IntrusivePtrWidelyShared(benchmark::State& state) {
  intrusive_ptr<Foo> var = make_intrusive<Foo>(0);
  const size_t kLength = state.range(0);
  while (state.KeepRunning()) {
    // Each iteration makes kLength copies of var (one refcount increment
    // each) and then destroys them all (one refcount decrement each).
    intrusive_ptr<Foo> vararray[kLength];
    for (const auto i : c10::irange(kLength)) {
      vararray[i] = var;
    }
  }
}
BENCHMARK(BM_IntrusivePtrWidelyShared)->RangeMultiplier(2)->Range(1, 1024);
// again, similarly for std::shared_ptr
The results show a clear win in the exclusive-ownership case, but I think the cost in the widely-shared case is too high. These measurements were taken on a Broadwell machine with turbo off.
Before:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_IntrusivePtrExclusiveOwnership 19.0 ns 19.0 ns 36252895
BM_SharedPtrExclusiveOwnership 13.7 ns 13.7 ns 51741294
BM_IntrusivePtrWidelyShared/1 19.3 ns 19.3 ns 35670363
BM_IntrusivePtrWidelyShared/2 35.2 ns 35.2 ns 19897026
BM_IntrusivePtrWidelyShared/4 67.2 ns 67.2 ns 10448680
BM_IntrusivePtrWidelyShared/8 131 ns 131 ns 5358368
BM_IntrusivePtrWidelyShared/16 259 ns 259 ns 2705537
BM_IntrusivePtrWidelyShared/32 515 ns 515 ns 1358814
BM_IntrusivePtrWidelyShared/64 1029 ns 1029 ns 680294
BM_IntrusivePtrWidelyShared/128 2055 ns 2055 ns 340070
BM_IntrusivePtrWidelyShared/256 4109 ns 4109 ns 170251
BM_IntrusivePtrWidelyShared/512 8252 ns 8251 ns 84905
BM_IntrusivePtrWidelyShared/1024 16481 ns 16480 ns 42467
BM_SharedPtrWidelyShared/1 26.4 ns 26.4 ns 26541711
BM_SharedPtrWidelyShared/2 50.7 ns 50.7 ns 13818673
BM_SharedPtrWidelyShared/4 99.2 ns 99.2 ns 7053498
BM_SharedPtrWidelyShared/8 196 ns 196 ns 3570876
BM_SharedPtrWidelyShared/16 392 ns 392 ns 1785831
BM_SharedPtrWidelyShared/32 786 ns 786 ns 890721
BM_SharedPtrWidelyShared/64 1570 ns 1570 ns 445820
BM_SharedPtrWidelyShared/128 3138 ns 3138 ns 223137
BM_SharedPtrWidelyShared/256 6357 ns 6357 ns 106196
BM_SharedPtrWidelyShared/512 12591 ns 12590 ns 55661
BM_SharedPtrWidelyShared/1024 25165 ns 25163 ns 27847
After:
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_IntrusivePtrExclusiveOwnership 12.8 ns 12.8 ns 55437906
BM_SharedPtrExclusiveOwnership 13.6 ns 13.6 ns 51318161
BM_IntrusivePtrWidelyShared/1 31.9 ns 31.9 ns 21965314
BM_IntrusivePtrWidelyShared/2 54.0 ns 54.0 ns 12960875
BM_IntrusivePtrWidelyShared/4 96.0 ns 96.0 ns 7290989
BM_IntrusivePtrWidelyShared/8 201 ns 201 ns 3486922
BM_IntrusivePtrWidelyShared/16 357 ns 357 ns 1959926
BM_IntrusivePtrWidelyShared/32 721 ns 721 ns 971482
BM_IntrusivePtrWidelyShared/64 1385 ns 1385 ns 505587
BM_IntrusivePtrWidelyShared/128 2775 ns 2775 ns 252312
BM_IntrusivePtrWidelyShared/256 5504 ns 5504 ns 127161
BM_IntrusivePtrWidelyShared/512 11026 ns 11025 ns 63486
BM_IntrusivePtrWidelyShared/1024 22044 ns 22044 ns 31750
BM_SharedPtrWidelyShared/1 26.4 ns 26.4 ns 26542642
BM_SharedPtrWidelyShared/2 50.7 ns 50.7 ns 13804664
BM_SharedPtrWidelyShared/4 99.3 ns 99.2 ns 7055370
BM_SharedPtrWidelyShared/8 196 ns 196 ns 3570706
BM_SharedPtrWidelyShared/16 392 ns 392 ns 1785854
BM_SharedPtrWidelyShared/32 784 ns 784 ns 893000
BM_SharedPtrWidelyShared/64 1568 ns 1568 ns 446616
BM_SharedPtrWidelyShared/128 3137 ns 3137 ns 223263
BM_SharedPtrWidelyShared/256 6296 ns 6295 ns 111228
BM_SharedPtrWidelyShared/512 12575 ns 12575 ns 55664
BM_SharedPtrWidelyShared/1024 25140 ns 25138 ns 27822
So, we get a very nice improvement (19.0 ns → 12.8 ns for malloc/create/destroy/free) in the exclusive ownership case, but a similarly-sized per-copy slowdown in the shared ownership cases. We could restrict this optimization to intrusive_ptr<TensorImpl> to limit the downside, but unless we are sure that Tensors which never see a reference count increment greatly outnumber reference count increment/decrement pairs, it is probably not worthwhile.
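Back-of-the-envelope, using the numbers above: the change saves about 19.0 - 12.8 ≈ 6.2 ns per exclusively-owned create/destroy, while the widely-shared benchmark loses about (22044 - 16481) / 1024 ≈ 5.4 ns per copy/destroy pair at the largest size. So the change only breaks even if exclusively-owned lifetimes are roughly as numerous as increment/decrement pairs.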
Postscript
I did, however, notice that we are currently worse at handling exclusively-owned objects than std::shared_ptr (19.0 ns vs. 13.7 ns in the "Before" table). I suspect that this cost is due to having to make an extra virtual call to intrusive_ptr_target::release_resources, which exists to support weak_intrusive_ptr. If we could enforce that intrusive_ptr_target subclasses may not rely on release_resources being called, we could skip this virtual call in the common case of 0 weak references. More on this later!
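For concreteness, here is a hedged sketch of what that could look like; Target is a hypothetical stand-in for intrusive_ptr_target, and the field layout and memory orderings are illustrative assumptions, not the real c10 code:

#include <atomic>
#include <cstdint>

// Hypothetical stand-in for intrusive_ptr_target (illustration only).
struct Target {
  std::atomic<std::uint32_t> refcount_{1};
  std::atomic<std::uint32_t> weakcount_{1};  // implicit +1 while refcount_ > 0
  virtual void release_resources() {}  // today this is always called
  virtual ~Target() = default;
};

// Proposed fast path: when the last strong reference dies and no weak
// references exist, delete the object directly and let the destructor free
// everything, skipping the virtual release_resources() call. This relies on
// subclasses not being allowed to depend on release_resources() running.
void release_strong(Target* self) {
  if (self->refcount_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
    // We just dropped the last strong reference.
    if (self->weakcount_.load(std::memory_order_acquire) == 1) {
      delete self;  // common case: zero weak references, no virtual call
    } else {
      self->release_resources();  // weak refs remain; free owned data early
      if (self->weakcount_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete self;  // we also held the last (implicit) weak reference
      }
    }
  }
}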