Race conditions in persistent kernels

TL;DR: Probably no one really understands CUDA’s async proxy. We certainly didn’t, and we ran into race conditions when ownership of shared memory repeatedly alternates between regular threads and async units (TMA, CLC, …). We’ve since found instances of this pattern across other kernels, suggesting the issue is somewhat widespread.

The async proxy

The concept of “async proxy” was formalized in Hopper (even though the cp.async ops already existed in Ampere) to reason about the memory model involving asynchronous GPU operations (such as newer Tensor Cores, the Tensor Memory Accelerator, Cluster Launch Control, …). It states that, in addition to all the usual parallelism (across threads, warps, CTAs, …), there is another level of concurrency, which is completely unsynchronized with the first one. One way to think about it: all the threads in a CTA (with all their syncs, barriers, …) collectively act like a single “thread”, and they can race with the “thread” consisting of the async operations.

When two racing threads need to safely collaborate on the same memory region, they follow an acquire/release protocol: the threads alternately take exclusive control of the memory, and whoever holds it is free to read and write it. The acquire and release steps (typically performed on locks, semaphores, …) ensure that operations don’t get reordered incorrectly, that modifications are flushed, and that caches are invalidated.
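As a generic illustration of that protocol (using libcu++ atomics; `data` and `use` are placeholder names): a release store publishes the producer’s writes, and the matching acquire load makes them visible to the consumer.

```cuda
cuda::atomic<int, cuda::thread_scope_block> flag{0};
// Producer thread:
data = 42;                                             // write the shared region
flag.store(1, cuda::std::memory_order_release);        // release: publish the write
// Consumer thread:
while (flag.load(cuda::std::memory_order_acquire) == 0) {} // acquire: wait and synchronize
use(data);                                             // guaranteed to observe 42
```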

In CUDA, interleaving regular operations (on the “generic” proxy) with async ones requires these same acquire/release steps. They are mostly implicit when control goes from the async proxy to the generic proxy (e.g., via a wait on a mbarrier, or a wait_group instruction). In the other direction, however, developers have to insert an explicit async-proxy fence (which is bi-directional). For example, people mostly “know” that they need to do this before using a TMA store operation to take data that the threads wrote to smem and send it to gmem. However, there are other instances as well.
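For concreteness, here’s a sketch of that well-known TMA-store pattern (tma_store is a hypothetical stand-in for the actual cp.async.bulk.tensor call; the fence is the libcu++ cuda::ptx wrapper):

```cuda
smem[tid] = result;                                     // generic-proxy writes
cuda::ptx::fence_proxy_async(cuda::ptx::space_shared);  // make the writes visible to the async proxy
__syncthreads();                                        // order every thread's fence before the store
if (tid == 0) { tma_store(gmem, smem); }                // async proxy reads smem, writes gmem
```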

The race condition(s)

Another intriguing need for async-proxy fences can be seen in the sample snippet for CLC. The fence is needed after the threads are done consuming the result, in order to “delay” the next request and prevent it from overwriting the smem. This example defies all intuitive rules of causality: one would think that the syncthreads is enough to ensure that the next CLC operation is only issued after all threads have fully read the previous values, and yet removing that fence causes observable race conditions. That’s weak ordering for you!
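A minimal sketch of that CLC pattern (clc_try_cancel stands in for the actual clusterlaunchcontrol try_cancel PTX; all names are placeholders):

```cuda
int4 next_cta = smem_clc_result;                        // threads consume the previous result
cuda::ptx::fence_proxy_async(cuda::ptx::space_shared);  // "delay" the next async write to smem
__syncthreads();
if (elected) { clc_try_cancel(&smem_clc_result, &bar); } // next request may overwrite smem_clc_result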

This risk exists for all async ops, including TMA! Concretely, it means that persistent kernels that repeatedly load input tiles in a loop via TMA must have a fence after they’re done consuming the tiles. Some of our kernels lacked this fence, which caused sporadically wrong outputs.

int my_val = smem[tid];                                 // all threads read the current tile
cuda::ptx::fence_proxy_async(cuda::ptx::space_shared);  // <-- THIS IS REQUIRED!
__syncthreads();
if (tid == 0) { tma_load(smem, gmem); }                 // next async load may overwrite smem

A common sub-pitfall is swapping the order of the fence and the syncthreads. The only sequence compliant with the memory model is: fence first, syncthreads second. That’s because the fence only orders the calling thread’s own smem accesses; for the next TMA operation issued by thread 0 to be ordered after every thread’s accesses, the syncthreads must collect all of those per-thread fences, and must therefore sit between the fence and the TMA op.
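Putting it together, a persistent-kernel load loop would look roughly like this (tma_load and mbar_wait are placeholders for the real TMA and mbarrier calls):

```cuda
for (int tile = 0; tile < num_tiles; ++tile) {
    if (tid == 0) { tma_load(smem, gmem, tile); }          // async proxy writes smem
    mbar_wait(bar, phase);                                 // async -> generic handoff (implicit)
    consume(smem[tid]);                                    // generic-proxy reads
    cuda::ptx::fence_proxy_async(cuda::ptx::space_shared); // fence first: order this thread's reads
    __syncthreads();                                       // then sync: collect all threads' fences
}
```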

Exclusions

Luckily, two classes of kernels are mostly exempt from this issue:

  • In most GEMM kernels, the data loaded from TMA is typically directly fed to TensorCores (wgmma or tcgen05 instructions). These operations are themselves async and have their own dedicated fence instructions, which developers tend to be aware of. This means that GEMMs are usually fine.

  • Kernels that use TMA stores to write out their results tend to have a fence just before that step and, depending on how the kernel is written, this might be enough to prevent overwriting by the next TMA load. However, this pattern looks more accidental than intentional.
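A sketch of why that second pattern can work out (tma_store and tma_load are placeholders): the fence inserted for the store happens to also cover the earlier reads.

```cuda
consume(smem_in[tid]);                                  // reads of the current input tile
smem_out[tid] = result;                                 // writes of the output tile
cuda::ptx::fence_proxy_async(cuda::ptx::space_shared);  // inserted "for the store"...
__syncthreads();
if (tid == 0) {
    tma_store(gmem_out, smem_out);                      // the intended beneficiary of the fence
    tma_load(smem_in, gmem_in, next_tile);              // ...but this load is protected too
}
```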

Occurrences in the wild

After finding a couple of hits in our own kernels, we swept across various codebases looking for other instances. We managed to find and fix some in FBGEMM and torchao. The blast radius could have been worse, but this is still concerning.

Catching it systematically?

Root-causing this bug proved challenging because NVIDIA’s compute-sanitizer, which is purpose-built for this job, didn’t detect it. This isn’t the first time compute-sanitizer has shown its limitations, and coverage of TMA is a known gap (supposedly targeted for CUDA 13.3). We will keep pressing NVIDIA on the importance of closing these gaps.

The second-best option would be for our fuzzer tools (e.g., CUTracer) to assist with exposing such bugs more reliably in unit tests and aid with tracing them back to specific parts of the kernel. This is currently being tested.

Bonus: mbarrier init

It’s unclear whether an async-proxy fence is needed after initializing a mbarrier. One would think so, as the init is performed by the threads but it needs to become visible to the async proxy. NVIDIA included the fence in snippets in old versions of the documentation, but removed it in newer versions. Feel free to share your take in the comments.
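For reference, the debated pattern looks like this (mbarrier_init is the libcu++ cuda::ptx wrapper; whether the fence is required is exactly the open question):

```cuda
__shared__ uint64_t bar;
if (tid == 0) {
    cuda::ptx::mbarrier_init(&bar, blockDim.x);            // generic-proxy write to the barrier
    cuda::ptx::fence_proxy_async(cuda::ptx::space_shared); // needed to publish it to the async proxy?
}
__syncthreads();
```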

Thanks

We’d like to thank Natalia Gimelshein for discussing this bug and the CUDA memory model with us (and reassuring us that we’re not crazy).