CUDA loops case study: code generation vs templates

This is what I do in the dlprimitives OpenCL library: https://github.com/artyom-beilis/dlprimitives

Many parameters are provided as defines at kernel compilation time, so the kernels stay relatively simple, but I can reuse the same code for GEMM and convolution and fold activation and bias into the same kernel transparently. A rough sketch of the idea is below.
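
Here is a minimal sketch of that approach, not the actual dlprimitives source: the macro names (`HAS_BIAS`, `ACTIVATION_RELU`) and the kernel are illustrative only. The host passes options like `-DHAS_BIAS=1 -DACTIVATION_RELU` to `clBuildProgram`, so one source file yields many specialized kernels:

```c
// Hypothetical OpenCL kernel specialized via preprocessor defines
// supplied at build time by the host (names are illustrative).

#ifdef ACTIVATION_RELU
#  define ACTIVATION(x) max((x), 0.0f)
#else
#  define ACTIVATION(x) (x)
#endif

__kernel void apply_bias_activation(__global const float *in,
#if HAS_BIAS
                                    __global const float *bias,
#endif
                                    __global float *out,
                                    int n)
{
    int i = get_global_id(0);
    if (i >= n)
        return;

    float v = in[i];
#if HAS_BIAS
    v += bias[i];            // bias folded into the same kernel
#endif
    out[i] = ACTIVATION(v);  // activation folded in transparently
}
```

The same trick extends to GEMM and convolution tiles: tile sizes, whether a bias is applied, and which activation to use all become compile-time constants, so the compiler can unroll loops and drop dead branches while the kernel source stays generic.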

My long shot is to create an OpenCL backend for pytorch. However, at this point I started by adding dlprimitives support to Caffe, which I'm familiar with and which already has decent OpenCL support (but relatively poor performance). That way I can have a POC outside my mini-framework before I dive into pytorch's complicated internals.