Investigation report: what would it cost to optimize c10::intrusive_ptr destruction for refcount == 1? (A: too much)

swolchok · May 3, 2022, 4:48pm

Last year, I added an optimization to c10::intrusive_ptr for the common case of 0 weak references: before performing an expensive atomic decrement on the weak reference count, we check to see if the count is 1. This optimization works because weak references are uncommon, so the cheap check usually just replaces the atomic weak reference count decrement.

Note that this optimization is also performed by libc++'s shared_ptr implementation for the weak reference count but not the strong reference count.

I worry a lot about the cost of reference counting in the case where a Tensor actually has unique ownership, so I decided to measure the downside of applying this optimization to the decrement of the strong reference count as well. I added the following benchmark to demonstrate the potential upside:

static void BM_IntrusivePtrExclusiveOwnership(benchmark::State& state) {
  while (state.KeepRunning()) {
    volatile auto var = make_intrusive<Foo>(0);
  }
}

// and similarly for std::shared_ptr

and this one to demonstrate the costs:

static void BM_IntrusivePtrWidelyShared(benchmark::State& state) {
  intrusive_ptr<Foo> var = make_intrusive<Foo>(0);
  const size_t kLength = state.range(0);
  while (state.KeepRunning()) {
    intrusive_ptr<Foo> vararray[kLength];
    for (const auto i : c10::irange(kLength)) {
      vararray[i] = var;
    }
  }
}
BENCHMARK(BM_IntrusivePtrWidelyShared)->RangeMultiplier(2)->Range(1, 1024);

// again, similarly for std::shared_ptr

The results showed a clear win for the upside case, but I think the cost is too high. These measurements are taken on a Broadwell machine with turbo off.

Before:

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
BM_IntrusivePtrExclusiveOwnership       19.0 ns         19.0 ns     36252895
BM_SharedPtrExclusiveOwnership          13.7 ns         13.7 ns     51741294
BM_IntrusivePtrWidelyShared/1           19.3 ns         19.3 ns     35670363
BM_IntrusivePtrWidelyShared/2           35.2 ns         35.2 ns     19897026
BM_IntrusivePtrWidelyShared/4           67.2 ns         67.2 ns     10448680
BM_IntrusivePtrWidelyShared/8            131 ns          131 ns      5358368
BM_IntrusivePtrWidelyShared/16           259 ns          259 ns      2705537
BM_IntrusivePtrWidelyShared/32           515 ns          515 ns      1358814
BM_IntrusivePtrWidelyShared/64          1029 ns         1029 ns       680294
BM_IntrusivePtrWidelyShared/128         2055 ns         2055 ns       340070
BM_IntrusivePtrWidelyShared/256         4109 ns         4109 ns       170251
BM_IntrusivePtrWidelyShared/512         8252 ns         8251 ns        84905
BM_IntrusivePtrWidelyShared/1024       16481 ns        16480 ns        42467
BM_SharedPtrWidelyShared/1              26.4 ns         26.4 ns     26541711
BM_SharedPtrWidelyShared/2              50.7 ns         50.7 ns     13818673
BM_SharedPtrWidelyShared/4              99.2 ns         99.2 ns      7053498
BM_SharedPtrWidelyShared/8               196 ns          196 ns      3570876
BM_SharedPtrWidelyShared/16              392 ns          392 ns      1785831
BM_SharedPtrWidelyShared/32              786 ns          786 ns       890721
BM_SharedPtrWidelyShared/64             1570 ns         1570 ns       445820
BM_SharedPtrWidelyShared/128            3138 ns         3138 ns       223137
BM_SharedPtrWidelyShared/256            6357 ns         6357 ns       106196
BM_SharedPtrWidelyShared/512           12591 ns        12590 ns        55661
BM_SharedPtrWidelyShared/1024          25165 ns        25163 ns        27847

After:

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
BM_IntrusivePtrExclusiveOwnership       12.8 ns         12.8 ns     55437906
BM_SharedPtrExclusiveOwnership          13.6 ns         13.6 ns     51318161
BM_IntrusivePtrWidelyShared/1           31.9 ns         31.9 ns     21965314
BM_IntrusivePtrWidelyShared/2           54.0 ns         54.0 ns     12960875
BM_IntrusivePtrWidelyShared/4           96.0 ns         96.0 ns      7290989
BM_IntrusivePtrWidelyShared/8            201 ns          201 ns      3486922
BM_IntrusivePtrWidelyShared/16           357 ns          357 ns      1959926
BM_IntrusivePtrWidelyShared/32           721 ns          721 ns       971482
BM_IntrusivePtrWidelyShared/64          1385 ns         1385 ns       505587
BM_IntrusivePtrWidelyShared/128         2775 ns         2775 ns       252312
BM_IntrusivePtrWidelyShared/256         5504 ns         5504 ns       127161
BM_IntrusivePtrWidelyShared/512        11026 ns        11025 ns        63486
BM_IntrusivePtrWidelyShared/1024       22044 ns        22044 ns        31750
BM_SharedPtrWidelyShared/1              26.4 ns         26.4 ns     26542642
BM_SharedPtrWidelyShared/2              50.7 ns         50.7 ns     13804664
BM_SharedPtrWidelyShared/4              99.3 ns         99.2 ns      7055370
BM_SharedPtrWidelyShared/8               196 ns          196 ns      3570706
BM_SharedPtrWidelyShared/16              392 ns          392 ns      1785854
BM_SharedPtrWidelyShared/32              784 ns          784 ns       893000
BM_SharedPtrWidelyShared/64             1568 ns         1568 ns       446616
BM_SharedPtrWidelyShared/128            3137 ns         3137 ns       223263
BM_SharedPtrWidelyShared/256            6296 ns         6295 ns       111228
BM_SharedPtrWidelyShared/512           12575 ns        12575 ns        55664
BM_SharedPtrWidelyShared/1024          25140 ns        25138 ns        27822

So, we get a very nice improvement (19 ns → 12.8 ns for malloc/create/destroy/free) in the exclusive ownership case, but a similarly-sized slowdown in the shared ownership cases. We could restrict this optimization to intrusive_ptr<TensorImpl> to limit the downsides, but unless we are sure we have a lot more Tensors that never take a reference count increment than reference count increment/decrement pairs, it is probably not worthwhile.

Postscript

I did, however, notice that we are worse at handling exclusively-owned objects than std::shared_ptr. I suspect that this cost is due to having to make an extra virtual call for intrusive_ptr_target::release_resources, which exists to support weak_intrusive_ptr. If we could enforce that intrusive_ptr_target subclasses may not rely on release_resources being called, we could skip this virtual call in the common case of 0 weak references. More on this later!

Topic		Replies	Views
Optimizing contiguous() for the case where the Tensor is_contiguous()? performance	6	1622	May 24, 2021
Working With `c10::IValue` Efficiently performance	0	2613	April 18, 2022
We shouldn't feel bad about passing `Tensor` by reference performance	0	2318	January 25, 2021
TorchInductor Update 9: Harden Vectorization Support and Enhance Loop Optimizations in TorchInductor CPP Backend compiler	0	496	September 4, 2024
Fixing torch.compile Reference Leaks (Automatic Deletion of Dynamo Code Objects) compiler	0	694	July 2, 2024

Investigation report: what would it cost to optimize c10::intrusive_ptr destruction for refcount == 1? (A: too much)

Postscript

Related topics