Recently the privateuseone device was introduced, and integration with mainstream PyTorch has become transparent. All you need to do is install the stable 1.13 PyTorch version and build the backend against it - a matter of a few minutes: set up a pip virtual environment, run a simple cmake build, and you are good to go.
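For illustration, a minimal sketch of what using the built backend looks like from Python (the shared-library name/path below is an assumption - point it at whatever your cmake build produced, and note that only the operators already implemented by the backend will work):

import torch

# Load the out-of-tree OpenCL backend built against stable PyTorch 1.13.
# The library name here is an assumption - use your actual build output.
torch.ops.load_library("build/libpt_ocl.so")

# Tensors and modules can then be placed on the device exposed via privateuseone.
x = torch.randn(4, 3, device="privateuseone:0")
y = (x * 2 + 1).cpu()
print(y)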
More than that, I even did brief testing on Windows using VS2022 and was able to do the same - build the backend and run some training and tests (of course with a little less ease, as with anything on Windows).
Basically, I think it is currently the only option for using an AMD GPU on Windows with PyTorch! Unlike ROCm, it works very well even with older AMD GPUs like "Stoney Ridge" and runs well on the latest RDNA2 lineup - like the RX 6600 XT I use myself. Of course, nVidia GPUs work very well too.
The current version is aligned with the 1.13 release and the 1.14 nightly builds.
What was tested?
I managed to validate many standard vision networks such as ResNet, VGG, AlexNet, DenseNet, MobileNet, SqueezeNet, ShuffleNet, GoogLeNet, MNASNet, RegNet and EfficientNet. I also ran it on many PyTorch examples, including super-resolution and style transfer.
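For example, a rough sketch of this kind of validation with a torchvision model (the device string follows the privateuseone naming used above and may differ in your setup):

import torch
import torchvision.models as models

dev = "privateuseone:0"  # device name assumed here

model = models.resnet18(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    ref = model(x)                        # CPU reference
    out = model.to(dev)(x.to(dev)).cpu()  # OpenCL backend

# The OpenCL result should match the CPU reference within tolerance.
print(torch.allclose(ref, out, atol=1e-4, rtol=1e-3))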
Many operators are still missing and there is a long way to go, but it is a work in progress and, more importantly, it is already highly useful and quite easy to use - thanks to the latest PyTorch improvements.
How is the performance?
Compared to cuda/cudnn on the GTX 960 I have, it reaches 50-60% of cuda/cudnn performance for training and around 60-70% for inference, depending on the specific network.
Summary
I encourage everybody to try it and test it - it is really easy now!
This is great. Are AMD GPUs on Windows the main motivation? Or are you also thinking of targeting mobile? We have a Vulkan backend on mobile, but I have been curious about OpenCL.
My goal is to have an open, cross-platform solution that does not depend on a specific platform API like nVidia's CUDA, AMD's CUDA clone ROCm/HIP, Apple's Metal, Windows DirectCompute, etc.
I want a truly open-source, cross-platform solution that works across various GPUs (with relevant optimizations, of course) and multiple operating systems.
OpenCL is the most widely supported computing platform; it works very well and is actually very similar to CUDA - many kernels can be written for both CUDA and OpenCL with a handful of defines. OpenCL was designed for compute (not graphics, like Vulkan).
I always planned to support Windows, as some "poor souls" still tend to use it; with the latest support for out-of-tree backends and no need to build PyTorch itself, it has become a reality - all you need is VS2022, the OpenCL SDK and optionally the SQLite library. I myself rarely use Windows.
However, since I targeted cross-platform support, it has indeed become a reality. And AMD is of course one of the main beneficiaries, for several reasons:
Their GPUs are the only real competitors to nVidia's with enough power to do real training.
Their own ROCm/HIP platform, while more mature, is much more limited in terms of compatibility. The OpenCL backend is supported and actually runs on:
older GPUs including GPUs that AMD abandoned in ROCm like GCN4.
APUs
Windows
It even allows using the Clover Mesa OpenCL driver, which makes it possible to run GCN4 GPUs connected to chipset PCI-E lanes (ROCm requires PCI-E lanes connected directly to the CPU).
Regarding other GPUs/mobile platforms:
nVidia is of course fully supported and runs very well.
Intel integrated GPUs are tested and working, but the performance isn't as optimized as for nVidia or AMD GPUs; still, it beats Intel's own oneDNN in channel-first memory format.
I haven't tested it on Intel Arc yet for obvious reasons - but I'm highly curious what can be done there.
It was found to run fairly well on Apple's M1, but it does not yet exploit the internal matrix-multiplication functions to use the GPU to its full potential. Also, I don't own an Apple M1 myself, but I got positive reports.
It was tested for bare functionality on Mali, but it is not optimized at all.
In comparison to Vulkan, it seems that it already provides many more operators than the Vulkan backend.
And for inference - dlprimitives itself (the core library) provides initial ONNX support, so you can have one small library for inference that runs on nVidia, AMD, Intel, M1 and potentially many more GPUs, with minimal dependencies - basically an OpenCL driver, protobuf and sqlite3.
ONNX support is also a work in progress, but most vision classification networks like ResNet, MobileNet, EfficientNet and many others are tested and supported.
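As a small illustration of the workflow, here is only the standard PyTorch export step; how dlprimitives loads the resulting ONNX file is described in its own documentation and examples:

import torch
import torchvision.models as models

# Export a vision model to ONNX; the resulting file can then be fed to
# dlprimitives' ONNX-based inference path.
model = models.mobilenet_v2(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "mobilenet_v2.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)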
Great work, this is huge for running on embedded hardware.
Question: how would this solution work with torch's CUDA/C++ extensions? I'm trying to migrate the cuRobo library onto an embedded board and would like to convert the CUDA code to OpenCL and then load it into your torch version - is that even possible? Thanks!
I don't see a major problem with using it as an extension. However, a few things to note:
You'll need to link against the backend and use its headers/API to access the memory-allocation and tensor-processing facilities.
For operations that are pointwise, do reduction or broadcasting, you don't even need to write your own kernels - just pass the small computation code as a string and it will work.
For more complex kernels you can access all the underlying OpenCL objects via its API:
Basically:
// Accessing memory
dlprim::Tensor X = todp(input); // create a dlprim Tensor from a pytorch tensor
cl::Buffer buffer = X.device_buffer();
// Accessing all the OpenCL objects (queue, device, context)
dlprim::ExecutionContext ctx = getExecutionContext(input.device());
However, if you want to benefit from the other goodies dlprim implements, like kernel building and caching, small changes are needed in dlprimitives: all kernels are currently embedded in the dlprim library itself, so an extension point needs to be added to load and use external custom kernels.
Shouldn’t be a big problem.
Great work, this is huge for running on embedded hardware.
Now a small warning: although I have tested it on a Mali G52 MC2 several times, the performance is far from optimal.
The GEMM and Winograd kernels (basically matrix multiplication and convolution) are optimized for large GPUs like nVidia's and AMD's; there is some custom GEMM optimization for Intel GPUs, but nothing like that exists for Mali. So performance may be sub-optimal (it is on the to-do list).
Thanks for the info. Just to clarify, I want to use your opencl torch version with an extension (not as an extension). The desired extension would be created by converting that cuda code to opencl.
Given that "all kernels are currently embedded in the dlprim library itself and an extension point needs to be added to load and use external custom kernels":
What happens if your OpenCL torch version calls the torch.utils.cpp_extension functions?
The target GPU is an Arm Mali-G57 MC3; any idea how far below optimal the performance would be?
Now it works with PyTorch 2.4 - in fact, it is a requirement: either 1.13 or >=2.4.
I created a much easier interface - all you need is to import the pytorch_ocl module and you get all the goodies on Linux and Windows.
With the Python module you can use torch.ocl.synchronize() and torch.ocl.empty_cache(), just as with CUDA.
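A minimal usage sketch (the "ocl:0" device string is assumed here - adjust it if your setup exposes a different name):

import torch
import pytorch_ocl  # registers the OpenCL backend and the torch.ocl helpers

x = torch.randn(1024, 1024, device="ocl:0")
y = x @ x

torch.ocl.synchronize()   # wait for queued OpenCL work, like torch.cuda.synchronize()
torch.ocl.empty_cache()   # release cached device memory, like torch.cuda.empty_cache()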
I ordered an Intel Arc GPU (A380) - so hopefully I'll be able to check and optimize for this new platform.
Implemented other required things like manual_seed_all, as needed for the backend.
Known issues: currently there is a problem with loading a saved state dictionary back if it was saved from the ocl device. It crashes for some reason (outside of the ocl backend). Workaround: move the model to the CPU when saving/restoring it.
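A sketch of the workaround (the "ocl:0" device string is again an assumption):

import torch

model = torch.nn.Linear(4, 2).to("ocl:0")

# Workaround: move the model to CPU before saving the state dict...
torch.save(model.cpu().state_dict(), "model.pt")

# ...and load it on the CPU first, then move the model back to the OpenCL device.
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.to("ocl:0")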
Now a question: is there any online service that can help me build pip packages for Linux and Windows, so users can just pip install a nightly build instead of going through the build instructions (especially on Windows)?
I released version 0.2.0 - including binary whl files
Lots of improvements and bug fixes. The new Arc GPU is now well tested and performance improvements have been added. Vision transformers are now validated and working. But there is lots more to do.