Introducing c10::Synchronized<T>, a safe abstraction for mutex-based concurrency

dhrubird · March 2, 2022, 9:16pm

Author(s): Dhruv Matani (@dhrubird), Jacob Szwejbka, Pavithran Ramachandran (@pavithran), Chen Lai

c10::Synchronized<T> is heavily inspired by folly::Synchronized and borrows a lot of the good ideas from there. Here’s what folly::Synchronized claims to be:

folly/Synchronized.h introduces a simple abstraction for mutex-based concurrency. It replaces convoluted, unwieldy, and just plain wrong code with simple constructs that are easy to get right and difficult to get wrong.

c10::Synchronized<T> is the same, but in the PyTorch/c10 codebase. Additionally, it provides just the bare minimum API needed to write thread-safe and concurrency-aware code in a way that’s hard to get wrong.

Let’s dive into the details.

Motivation

When data structures and containers (basically variables) may be accessed and/or updated from multiple threads concurrently, you want to protect them with a mutex so that you don’t end up corrupting their internal state. For example, if you have a std::vector<T> that two or more threads write to, then you should use a mutex to serialize those writes. See this page to learn more about why to use a mutex.

Okay, now that we are convinced that using a mutex to protect access to shared resources is a desirable thing, let’s see how one may do it naively.

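#include <mutex>
#include <vector>

// Global vector of integers and the mutex that protects it
std::vector<int> v;
std::mutex m;

void called_from_multiple_threads(int element) {
  // Hold m for the duration of the push_back
  std::lock_guard<std::mutex> guard(m);
  v.push_back(element);
}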

The code above prevents simultaneous unsafe insertion (push_back) of elements into the vector v.

However, you can see that this relies on the developer remembering to create the std::lock_guard before touching vector v. If the developer forgets this, nothing stops the unsynchronized access from compiling, and the resulting data race is undefined behavior.

Solution

The idea is to force the mutex (m) and the vector (v) to be co-homed so that one cannot easily get a handle on the vector without holding a lock on the mutex.

// The vector wrapped by v below can only be reached through the withLock method.
c10::Synchronized<std::vector<int>> v;

void called_from_multiple_threads(int element) {
  // The callback runs with v's internal mutex held.
  v.withLock([element](std::vector<int>& vec) {
    vec.push_back(element);
  });
}

The withLock method accepts a callback that will be invoked with the mutex safely held. This way, every caller is forced (by this abstraction) to hold a lock on the mutex before they can update the shared data structure, which here is the vector of integers.
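To make the mechanism concrete, here is a minimal sketch of how such a wrapper could be put together. The class name SketchSynchronized and the helper fill_from_threads are hypothetical and exist only for illustration; this is not the actual c10::Synchronized<T> source, just one plausible shape under the assumption that the wrapper owns both the mutex and the data and exposes the data only through a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration; NOT the real c10::Synchronized<T>.
template <typename T>
class SketchSynchronized {
 public:
  SketchSynchronized() = default;
  explicit SketchSynchronized(T data) : data_(std::move(data)) {}

  // std::mutex is neither copyable nor movable, so the wrapper is not either.
  SketchSynchronized(const SketchSynchronized&) = delete;
  SketchSynchronized& operator=(const SketchSynchronized&) = delete;

  // Invokes cb(data) with the mutex held and forwards cb's return value.
  template <typename CB>
  auto withLock(CB&& cb) {
    std::lock_guard<std::mutex> guard(mutex_);
    return std::forward<CB>(cb)(data_);
  }

 private:
  std::mutex mutex_;
  T data_;  // Reachable only through withLock.
};

// Hypothetical helper: each of num_threads threads appends per_thread
// elements to one shared vector; returns the final element count.
inline std::size_t fill_from_threads(int num_threads, int per_thread) {
  SketchSynchronized<std::vector<int>> shared;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&shared, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        shared.withLock([i](std::vector<int>& vec) { vec.push_back(i); });
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return shared.withLock([](std::vector<int>& vec) { return vec.size(); });
}
```

Because data_ is private and withLock is the only accessor in this sketch, there is no code path that reaches the vector without first taking the lock. A callback can still leak a reference or pointer to the data out of the locked region, so the guarantee is "hard to get wrong" rather than "impossible to get wrong".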

Impact on PyTorch

PyTorch currently has 160 instances of std::lock_guard in the codebase that can be replaced with the use of c10::Synchronized<T>.

Gaps compared to `folly::Synchronized<T>`

As mentioned above, c10::Synchronized<T> doesn’t implement the complete API implemented by folly::Synchronized<T>. Notable absences are:

No way to write a single-line method call on T, for example data.lock()->push_back(...)
No way to use read-write locks via wlock() or rlock()
No way to upgrade locks
No support for acquireLocked(), which is able to lock multiple mutexes simultaneously and safely when using multiple c10::Synchronized<T> objects

These are definitely implementable as additions to the current API but don’t seem to be critical for the basic functionality that is currently provided.
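As an illustration of the last gap, the sketch below shows the standard-library technique that a multi-object lock could build on: std::lock acquires several mutexes with a deadlock-avoidance algorithm, regardless of the order in which callers name them. The Guarded struct and the withBothLocked and move_all helpers are hypothetical names invented for this example; they are not part of c10 or folly, and this is not how either library implements the feature.

```cpp
#include <cassert>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pairing of a value with its mutex, for illustration only.
template <typename T>
struct Guarded {
  std::mutex mutex;
  T data;
};

// Hypothetical helper: acquires both mutexes without risking deadlock,
// then runs cb on both values. a and b must be distinct objects.
template <typename A, typename B, typename CB>
auto withBothLocked(Guarded<A>& a, Guarded<B>& b, CB&& cb) {
  // std::lock takes both mutexes using a deadlock-avoidance algorithm.
  std::lock(a.mutex, b.mutex);
  // adopt_lock: the guards take ownership of the already-held mutexes
  // and release them on scope exit.
  std::lock_guard<std::mutex> guard_a(a.mutex, std::adopt_lock);
  std::lock_guard<std::mutex> guard_b(b.mutex, std::adopt_lock);
  return std::forward<CB>(cb)(a.data, b.data);
}

// Moves every element of `from` to the end of `to` while holding both locks.
inline void move_all(Guarded<std::vector<int>>& from,
                     Guarded<std::vector<int>>& to) {
  withBothLocked(from, to, [](std::vector<int>& src, std::vector<int>& dst) {
    dst.insert(dst.end(), src.begin(), src.end());
    src.clear();
  });
}
```

Taking the two locks one after another with separate std::lock_guard objects would deadlock if one thread called move_all(a, b) while another called move_all(b, a); going through std::lock avoids that ordering problem.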

Thanks

To Edward Yang (@ezyang) for reaching out with a crash in the model tracer that led to the discovery of a missing mutex, and the subsequent discussion that led to the implementation of this (missing) abstraction for safe concurrent access to shared data structures.
