Tracing with Primitives: Update 1, nvFuser and its Primitives

This article is part of the “Tracing with Primitives” series. See the first update here.

Hey PyTorch community!

nvFuser is a trace executor we’ve been developing for a while at NVIDIA. We’re excited about the “Tracing with Primitives” program and our ability to quickly execute traces of primitive operations, and in this update we’d like to introduce nvFuser itself.

To start, check out our GTC 2022 and GTC 2021 talks on nvFuser. The 2022 talk is a great introduction to nvFuser, and the 2021 talk has more technical details (you might want to skip the first 5 minutes). You may need to go through a free registration process to watch them.

nvFuser is designed to accelerate PyTorch programs running on CUDA devices without user intervention. There are a variety of ways to approach this problem, but we decided early on to have three additional goals: to support program diversity, to be consistent with PyTorch’s current behavior, and to minimize slow recompilations.

PyTorch programs are diverse. The community is especially creative, and PyTorch programs are used for innumerable purposes. Being faster on just a few PyTorch operations, or on PyTorch programs written in a very specific way, was never an option for nvFuser. From the beginning we wanted to speed up all programs running on CUDA devices, so we designed nvFuser to be as flexible as possible. We write general optimization policies that apply not just to existing operations but to new operations, too. “Tracing with primitives” makes it easier to target program diversity because it represents programs using a smaller set of primitive operations.

Being consistent with PyTorch’s current behavior is also critically important. Users shouldn’t feel like using nvFuser is using a different framework. The “tracing with primitives” program is making this easier by specifying PyTorch’s behavior clearly.

Finally, we gave ourselves a goal to minimize slow recompilations. nvFuser can generate very fast, very specific CUDA kernels for PyTorch programs, but the more specific a kernel is, the fewer inputs it supports. Part of what makes PyTorch great, however, is its dynamism. PyTorch doesn’t require building a graph or that every batch have the same shape. It doesn’t even require that neural network training loops always call the same operations in the same order. nvFuser had to be as flexible as PyTorch, so it can produce very general kernels that work on a variety of inputs or very specific kernels that are much faster but narrower in application. nvFuser decides which type of kernel to generate depending on the PyTorch program it’s executing, trying to minimize total runtime by trading off compilation time for better performance.
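To make that trade-off concrete, here is a toy cost model in Python. It is only a sketch: the `KernelCost` class, the `choose_kernel` function, and every timing number are made up for illustration and say nothing about nvFuser's actual heuristics. The one assumption it encodes is that a general kernel is compiled once, while a specific kernel is recompiled for each distinct input shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelCost:
    """Hypothetical cost estimate for one kind of kernel (made-up numbers)."""
    compile_seconds: float  # cost of a single compilation
    run_seconds: float      # cost of a single execution


def estimated_total(cost: KernelCost, num_runs: int, num_compiles: int) -> float:
    """Total time = all compilations plus all executions."""
    return cost.compile_seconds * num_compiles + cost.run_seconds * num_runs


def choose_kernel(num_runs: int, num_distinct_shapes: int,
                  general: KernelCost, specific: KernelCost) -> str:
    """Pick whichever kind of kernel has the lower estimated total time.

    Assumption: a general kernel compiles once and serves every shape,
    while a specific kernel must be recompiled for each distinct shape.
    """
    general_total = estimated_total(general, num_runs, 1)
    specific_total = estimated_total(specific, num_runs, num_distinct_shapes)
    return "specific" if specific_total < general_total else "general"


# Illustrative numbers only: the specific kernel runs faster per call.
GENERAL = KernelCost(compile_seconds=2.0, run_seconds=0.010)
SPECIFIC = KernelCost(compile_seconds=2.0, run_seconds=0.004)

# Many runs, one shape: the faster kernel pays for its compilation.
print(choose_kernel(10_000, 1, GENERAL, SPECIFIC))
# Few runs, many shapes: repeated recompilation dominates.
print(choose_kernel(1_000, 100, GENERAL, SPECIFIC))
```

With these numbers the first call favors the specific kernel and the second favors the general one, which is the shape of the decision described above: specialization only pays off when the same inputs recur often enough.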

Internally, nvFuser is a Halide-like system, and we think Halide’s separation of “algorithms” and “schedules” (i.e. traces and optimizations) is the best way to accomplish the above goals. We’ll go more into the details of nvFuser’s implementation in future updates, but a summary of how operations are expressed in PyTorch and executed by nvFuser is:

  • A PyTorch program, with no modification, is traced using TorchDynamo.
  • The traced operations are translated into primitive operations.
  • nvFuser reviews the primitives and the trace’s inputs to determine an appropriate set of optimizations.
  • The optimization strategy is translated to nvFuser’s Halide-like internal representation, and nvFuser generates one or more kernels to implement the strategy.
  • nvFuser caches its generated kernels for future use (so they don’t need to be recompiled), and executes the optimized trace using the generated kernels.
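The steps above can be mimicked with a small pure-Python sketch. Nothing here is nvFuser's or TorchDynamo's real API: the op names, the `DECOMPOSITIONS` table, and the `ToyExecutor` class are hypothetical, "compiling" just builds a Python closure, and keying the cache on the input length is an assumption standing in for shape specialization.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical decomposition table: composite op -> primitive ops.
DECOMPOSITIONS: Dict[str, List[str]] = {
    "sigmoid": ["neg", "exp", "add_one", "reciprocal"],
}

# Hypothetical scalar primitives.
PRIMITIVES: Dict[str, Callable[[float], float]] = {
    "neg": lambda x: -x,
    "exp": math.exp,
    "add_one": lambda x: x + 1.0,
    "reciprocal": lambda x: 1.0 / x,
}

Kernel = Callable[[Sequence[float]], List[float]]


def to_primitives(trace: Sequence[str]) -> Tuple[str, ...]:
    """Translate traced ops into primitive ops."""
    prims: List[str] = []
    for op in trace:
        prims.extend(DECOMPOSITIONS.get(op, [op]))
    return tuple(prims)


class ToyExecutor:
    """Toy stand-in for a trace executor with a kernel cache."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple[Tuple[str, ...], int], Kernel] = {}
        self.compile_count = 0

    def _compile(self, prims: Tuple[str, ...]) -> Kernel:
        # Stand-in for code generation: fuse all primitives into one
        # loop over the input so no intermediate lists are built.
        self.compile_count += 1
        fns = [PRIMITIVES[name] for name in prims]

        def kernel(values: Sequence[float]) -> List[float]:
            out = []
            for v in values:
                for fn in fns:
                    v = fn(v)
                out.append(v)
            return out

        return kernel

    def run(self, trace: Sequence[str], values: Sequence[float]) -> List[float]:
        prims = to_primitives(trace)
        key = (prims, len(values))  # assumption: specialize on input length
        if key not in self.cache:   # compile only on a cache miss
            self.cache[key] = self._compile(prims)
        return self.cache[key](values)


executor = ToyExecutor()
print(executor.run(["sigmoid"], [0.0]))  # [0.5]
executor.run(["sigmoid"], [3.0])         # same key: reuses the cached kernel
print(executor.compile_count)            # 1
```

A second call with the same trace and input length hits the cache, so the "kernel" is built only once; a different input length would trigger a new compilation in this sketch.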

We’re excited to talk more about nvFuser, its design, and its performance in later updates. Please comment below with your questions!