Why `torch::jit::pop` (and sometimes push) is worse for performance than direct `std::vector` access

swolchok · March 17, 2021, 6:24pm

There is a convenient interface for stacks of IValues represented as std::vector<IValue> in ATen’s stack.h. Unfortunately, torch::jit::pop() in particular can easily be a performance problem relative to using the native std::vector API. For your convenience, here is pop():

static inline IValue pop(Stack& stack) {
  auto r = std::move(stack.back());
  stack.pop_back();
  return r;
}

There is nothing wrong with this implementation; it does what it needs to do in order to implement the pop() interface. The problem is that even if pop() is inlined, the compiler cannot help it out as much as would be needed. The C++ copy elision rules only permit copies (and thus moves, which are just optimized copies) to be elided in specific circumstances, which typical use of pop() does not qualify for. Let’s take a look at some examples, which I’ve also set up on Compiler Explorer so that we can easily see how many copy and move operations are happening.

Just pop() a stack

First, let’s suppose you want to remove the last value from a Stack. You could do it with pop():

void justPop(Stack& stack) {
  pop(stack);
}

or you could use std::vector::pop_back() directly:

void justPopManual(Stack& stack) {
  stack.pop_back();
}

The problem is that, in the first case, you are creating a new move of a Value, and that cannot be elided even though the Value is unused! If we go look on Compiler Explorer, we can see that justPop does one Value move construction and two Value destructions, whereas justPopManual does just one Value destruction.

Pop one value and push another

Let’s suppose that we want to pop a value off the stack and push another value:

void popAndPush(Stack& stack) {
    auto v = pop(stack);
    // Some accessor on Value that returns another Value; compare IValue::toObject()
    stack.push_back(v.subValue());
}

We can do this better by not popping or pushing at all, but rather using assigment:

void popAndPushManual(Stack& stack) {
    stack.back() = stack.back().subValue();
}

Again checking assembly to confirm, popAndPush does two move constructions and 3 Value destructor calls (and emits code to conditionally extend the vector), whereas popAndPushManual does one move assignment and one Value destructor call.

Extracting multiple items from the stack

Let’s say we want to grab a few items from the end of the stack:

void multiGet(Stack& stack, std::vector<Value>& out, int N) {
  for (int ii = 0; ii < N; ii++) {
      out[ii] = pop(stack);
  }
}

Again, we can do better by moving off the stack without popping and then erasing the end of the stack:

void multiGetManual(Stack& stack, std::vector<Value>& out, int N) {
  for (int ii = 0; ii < N; ii++) {
      out[ii] = std::move(*(stack.end() - 1 - ii));
  }
  stack.erase(stack.end() - N, stack.end());
}

multiGet does 1 move construction, 1 move assignment, and 2 destructor calls per loop iteration. multiGetManual does 1 move assignment and 1 destructor call.

Why does this matter?

Most concretely, all of these cases show up in the JIT interpreter implementation: justPop is DROP, popAndPush is GET_ATTR, and multiGet is STOREN. [PyTorch][JIT] Full pass at push/pop fixes in interpreter by swolchok · Pull Request #54197 · pytorch/pytorch · GitHub attempts to fix these.

In general, the stack.h interface is used elsewhere in the codebase as well. In my opinion, we should stop using it in performance-sensitive paths in favor of direct std::vector access.

Topic		Replies	Views
Working With `c10::IValue` Efficiently performance	0	2647	April 18, 2022
We shouldn't feel bad about passing `Tensor` by reference performance	0	2337	January 25, 2021
Keeping PyTorch's Ops Maintainable: The Jiterator hardware-backends	7	1770	February 27, 2023
Optimizing contiguous() for the case where the Tensor is_contiguous()? performance	6	1652	May 24, 2021
TorchInductor Update 9: Harden Vectorization Support and Enhance Loop Optimizations in TorchInductor CPP Backend compiler	0	513	September 4, 2024

Why `torch::jit::pop` (and sometimes push) is worse for performance than direct `std::vector` access

Just pop() a stack

Pop one value and push another

Extracting multiple items from the stack

Why does this matter?

Related topics