CUDA loops case study: code generation vs templates

Problem statement

It’s not easy to define a pointwise operation in CUDA for TensorIterator. Because we are ahead-of-time compiling a set of kernels that must work universally for all dtype/scalar combinations (in this Quip, I ignore the actual iteration aspects of TensorIterator), there is quite a bit of work you have to do:

  • When defining any particular kernel, you must statically know both the stored-in-memory type (scalar_t) and the intermediate computation type (accscalar_t), because they can diverge (e.g., read half out of memory, but do the computation in float).
  • You need separate kernels for tensor-tensor computation and tensor-scalar computation, as the scalar typically lives in CPU memory and must be transmitted (via CUDA parameters) to the CUDA kernel. Furthermore, the scalar should be transmitted in accscalar_t, as scalars may still have more precision than the tensor itself (even if they do not cause the type of the tensor to promote; this occurs when you do something like half_tensor * 0.25; the result is a half tensor, even though the constant is represented in float precision).
  • Your kernels may have extra, non-Tensor arguments (e.g., alpha in add) which need to be transmitted to the kernel via CUDA parameters in accscalar_t precision.
  • You cannot use GPU lambdas, because NVCC will fail to deduplicate identical nested lambdas (as of 2020; see Use explicit templates in `gpu_kernel_with_scalars` by malfet · Pull Request #40992 · pytorch/pytorch · GitHub), increasing our binary size.

Naive implementation (aka what to generate in codegen)

Here’s what a naive (i.e., with flagrant code duplication) implementation of addition might look like, ignoring the “no GPU lambdas” constraint and assuming gpu_kernel (which handles the actual iteration) as a primitive:

template <typename scalar_t>
void add_kernel(TensorIteratorBase& iter, const Scalar& alpha_scalar) {
  using accscalar_t = at::acc_type<scalar_t, true>;
  auto alpha = alpha_scalar.to<accscalar_t>();

  if (iter.is_cpu_scalar(1)) {
    auto a = iter.scalar_value<accscalar_t>(1);
    iter.remove_operand(1);
    gpu_kernel(iter, [=](scalar_t b) -> scalar_t { return a + b * alpha; });

  } else if (iter.is_cpu_scalar(2)) {
    auto b = iter.scalar_value<accscalar_t>(2);
    iter.remove_operand(2);
    gpu_kernel(iter, [=](scalar_t a) -> scalar_t { return a + b * alpha; });

  } else {
    gpu_kernel(iter, [=](scalar_t a, scalar_t b) -> scalar_t { return a + b * alpha; });
  }
}

There are a number of things about this implementation worth calling out:

  • We generate three kernels in total: Scalar-Tensor, Tensor-Scalar, Tensor-Tensor. This is a 3x binary size increase over just generating a Tensor-Tensor kernel.
    • Each lambda captures by value a different set of values (everyone captures alpha, the first lambda captures a, the second captures b); the capture by value translates into CUDA arguments.
    • Thought: As these kernels are typically memory bound, it may be possible to coalesce these kernels into a single kernel along the lines of (as + (has_at ? at : 0)) + (bs + (has_bt ? bt : 0)) * alpha, if the ternaries translate into conditional loads.
  • The computation in the lambdas is always done in accscalar_t, as alpha is higher precision and forces promotion of all the other arguments. In other kernels, this promotion does not necessarily always happen, and so for correct behavior in half precision it is necessary to explicitly cast arguments to the intermediate computation type; e.g., accscalar_t(a) + accscalar_t(b) * accscalar_t(alpha)
  • It is important for the lambda to explicitly take scalar_t and return a scalar_t; this is how we tell gpu_kernel what the expected input/output types of the tensors are. If there is a mismatch, this sends us down a dynamic_casting codepath which, as currently implemented, incurs a 2x slowdown.
  • At the same time, it is important that scalars a and b are represented as accscalar_t; otherwise you could lose precision if the scalar is at higher precision.
  • Each lambda can be individually converted into a functor using the standard transformation (make every closed-over value a field on the functor), resulting in three structs.

In a code generator implementation of TensorIterator, we can simply choose to generate the code as seen here and call it a day.

Template implementation

The logic seen above must be replicated for every binary operator, so there is a good deal of incentive to reduce the duplication. There are two axes of duplication which are desirable to remove:

  • Logic for testing if either argument is a CPU scalar, and appropriately dispatching
  • The lambda itself, which is replicated three times with different values that it closes over

The current implementation shipped by PyTorch handles this by deriving the Tensor-Scalar/Scalar-Tensor lambdas from the Tensor-Tensor variant. Here is what is effectively done (for clarity, I’ve taken this code from db2b273d1, prior to the functor-ization of the lambdas, and done some modest editing):

template <typename func_t>
void gpu_kernel_with_scalars(TensorIterator& iter, const func_t& f) {
  ASSERT_HOST_DEVICE_LAMBDA(func_t);

  using traits = function_traits<func_t>;
  using arg1_t = typename traits::template arg<0>::type;
  using arg2_t = typename traits::template arg<1>::type;

  if (iter.is_cpu_scalar(1)) {
    auto a = iter.scalar_value<arg1_t>(1);
    iter.remove_operand(1);
    gpu_kernel(iter, [=]GPU_LAMBDA(arg2_t b) {
      return f(a, b);
    });
  } else if (iter.is_cpu_scalar(2)) {
    auto b = iter.scalar_value<arg2_t>(2);
    iter.remove_operand(2);
    gpu_kernel(iter, [=]GPU_LAMBDA(arg1_t a) {
      return f(a, b);
    });
  } else {
    gpu_kernel(iter, f);
  }
}

// Invoked as:

gpu_kernel_with_scalars(iter, [alpha]GPU_LAMBDA(scalar_t a, scalar_t b) -> scalar_t {
    return a + alpha * b;
});

Unfortunately, this implementation fails to handle CPU scalars correctly in half precision. This is because the user-provided lambda, from which the Tensor-Scalar/Scalar-Tensor lambdas are derived, takes scalar_t arguments, but we actually want the CPU scalar to be applied directly to the computation in accscalar_t precision. Even if we correct the scalar_value invocations in the body of the code to read accscalar_t, an undesirable downcast will still occur when we invoke the f functor.

One logical fix for this issue is to make the lambda accept accscalar_t:

gpu_kernel_with_scalars(iter, [alpha]GPU_LAMBDA(accscalar_t a, accscalar_t b) -> scalar_t {
    return a + alpha * b;
});

However, this silently tanks the performance of the kernel by more than 2x (see Add acc_gpu_kernel_with_scalars and port add to use it by ezyang · Pull Request #63884 · pytorch/pytorch · GitHub); this is because the static type of the lambda no longer matches the type of the data in memory in the tensors, which shunts us onto the dynamic_casting codepath. A final fix that does work is to explicitly instruct gpu_kernel_with_scalars what the intended input types are, and then rewrap the lambda to expose the correct static types:

template <typename scalar_t, typename func_t>
void gpu_kernel_with_scalars(TensorIterator& iter, const func_t& f) {
  ASSERT_HOST_DEVICE_LAMBDA(func_t);

  using traits = function_traits<func_t>;
  using arg1_t = typename traits::template arg<0>::type;
  using arg2_t = typename traits::template arg<1>::type;

  if (iter.is_cpu_scalar(1)) {
    auto a = iter.scalar_value<arg1_t>(1);
    iter.remove_operand(1);
    gpu_kernel(iter, [=]GPU_LAMBDA(scalar_t b) -> scalar_t {
      return f(a, b);
    });
  } else if (iter.is_cpu_scalar(2)) {
    auto b = iter.scalar_value<arg2_t>(2);
    iter.remove_operand(2);
    gpu_kernel(iter, [=]GPU_LAMBDA(scalar_t a) -> scalar_t {
      return f(a, b);
    });
  } else {
    gpu_kernel(iter, [=]GPU_LAMBDA(scalar_t a, scalar_t b) -> scalar_t {
      return f(a, b);
    });
  }
}

// Invoked as:

gpu_kernel_with_scalars<scalar_t>(iter, [alpha]GPU_LAMBDA(accscalar_t a, accscalar_t b) -> scalar_t {
    return a + alpha * b;
});

Each inner lambda translates into a separate functor which wraps the initially provided lambda/functor.

Analysis

I originally went into writing this case study with the idea that it was not possible to implement the logic here nicely with templates, and was pleasantly surprised when I came up with a solution while explaining the difficulties with templating. I think in the short term, the modification of gpu_kernel_with_scalars to add an (optional) template argument for true memory type is probably the most expedient course of action. However, I have some other thoughts:

  • The handling of dtypes when writing lambdas for CUDA kernels is extremely subtle. gpu_kernel does not provide any compile-time or runtime feedback about whether or not you have provided the correct types, so the only way to find out if you’ve fallen into the dynamic casting trap is by noticing that your CUDA kernel performance has tanked (granted, we make everyone working on CUDA kernels benchmark their kernels, so hopefully something like this would be noticed).
  • Functorization of lambdas (not shown here) is a serious impediment to the readability of our code here, so it is worth carefully analyzing why functorization helps reduce CUDA binary size; it really shouldn’t!
  • We are generating waaaay too much code for CUDA TensorIterator, and Tensor-Scalar shenanigans only make it (3x) worse. We should figure out if there are ways of compacting our AOT kernels without sacrificing performance, by taking advantage of the fact that most pointwise kernels are memory bound.

Question. Why don’t you compile the kernel using NVRTC, similarly to what is done in OpenGL, Direct3D, Vulkan, and OpenCL?

This way the binary will be small, and you’ll be able to combine/merge kernels in whatever way you need.

Maybe I missed something… just a thought

It’s a great idea, and we should do it!

This is what I do in dlprimitives OpenCL library: https://github.com/artyom-beilis/dlprimitives

Many parameters are provided as defines at compile time, so the kernels are relatively simple, but I can reuse the same code for GEMM and convolution, and add activation and bias to the same kernel transparently.

I have a long-shot goal of creating an OpenCL backend for PyTorch. However, at this point I have started by adding dlprimitives support to Caffe, which I’m familiar with and which already has decent OpenCL support (but relatively poor performance). That way I can have a POC outside my mini-framework before I dive into PyTorch’s complicated internals.