OpenCL Backend - Important Updates

What is New? Ease of use!

I’ve been working for a while on an out-of-tree OpenCL backend for PyTorch.

Recently the privateuseone device was introduced, and integration with mainstream PyTorch has become transparent. All you need to do is install the stable 1.13 PyTorch version and build the backend against it - a matter of a few minutes: set up a pip virtual environment, run a simple cmake build, and you are good to go.
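
To illustrate, here is a minimal sketch of what usage looks like once the backend is built - the library path below is just where my build happens to put it, so adjust it to your setup:

```python
import torch

# Load the out-of-tree OpenCL backend (path is build-specific).
torch.ops.load_library("build/libpt_ocl.so")

# "privateuseone" is the device type reserved for out-of-tree backends;
# the index selects among the available OpenCL devices.
dev = torch.device("privateuseone:0")

x = torch.randn(16, 32, device=dev)
y = (x * 2 + 1).sum()
print(y.cpu())  # copy back to CPU to print the result
```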

More than that, I even did brief testing on Windows using VS2022 and was able to do the same - build the backend and run some training and tests (with a little less ease, of course - as with anything on Windows).

Basically, I think this is currently the only option for using an AMD GPU on Windows with PyTorch! Unlike ROCm, it works very well even with older AMD GPUs like “Stoney Ridge” and runs well on the latest RDNA2 lineup - like the RX 6600 XT I use myself. Of course, NVIDIA GPUs work very well too.

The current version is aligned with the stable 1.13 and nightly 1.14 versions.

What was tested?

I managed to validate many standard vision networks, like ResNet, VGG, AlexNet, DenseNet, MobileNet, SqueezeNet, ShuffleNet, GoogLeNet, MNASNet, RegNet and EfficientNet. I also ran it on many PyTorch examples, including super resolution and style transfer.
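
As an illustration, here is a hedged sketch of how one of these networks can be checked against the CPU result (the tolerances are arbitrary choices of mine, and the library path comes from the setup above):

```python
import torch
import torchvision.models as models

torch.ops.load_library("build/libpt_ocl.so")  # build-specific path, see above
dev = torch.device("privateuseone:0")

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    ref = model(x)                    # CPU reference result
    out = model.to(dev)(x.to(dev))    # same model on the OpenCL device

# loose tolerance: results won't be bit-exact across backends
print(torch.allclose(ref, out.cpu(), rtol=1e-3, atol=1e-3))
```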

Many operators are still missing and there is a long way to go, but it is a work in progress and, more importantly, it is already highly useful and quite easy to use - thanks to the latest PyTorch improvements.

How is the performance?

Compared to CUDA/cuDNN performance on the GTX 960 I have, it reaches 50-60% of CUDA/cuDNN performance for training and around 60-70% for inference, depending on the specific network.
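
These numbers are of course setup-dependent; if you want to measure on your own hardware, here is a rough sketch of a timing loop (the model, batch size and iteration count are arbitrary choices of mine):

```python
import time
import torch
import torch.nn.functional as F
import torchvision.models as models

torch.ops.load_library("build/libpt_ocl.so")  # build-specific path, see above
dev = torch.device("privateuseone:0")

model = models.resnet18().to(dev)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(32, 3, 224, 224).to(dev)
labels = torch.randint(0, 1000, (32,)).to(dev)

def step():
    opt.zero_grad()
    loss = F.cross_entropy(model(x), labels)
    loss.backward()
    opt.step()
    return loss.item()  # .item() synchronizes with the device

step()  # warm-up: the first run includes kernel compilation
start = time.time()
for _ in range(20):
    step()
print("%.1f ms/step" % ((time.time() - start) / 20 * 1000))
```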

Summary

I encourage everybody to try it out and test - now it is really easy!


This is great. Are AMD GPUs on Windows the main motivation? Or are you also thinking of targeting mobile? We have a Vulkan backend on mobile, but I have been curious about OpenCL.


My goal is to have an open, cross-platform solution that does not depend on a vendor-specific API like NVIDIA’s CUDA, AMD’s CUDA clone ROCm/HIP, Apple’s Metal, Windows DirectCompute, etc.

I want a truly open source, cross-platform solution that works across various GPUs (with relevant optimizations, of course) and multiple operating systems.

OpenCL is the most widely supported computing platform. It works very well and is actually very similar to CUDA - many kernels can be written once for both CUDA and OpenCL with a handful of defines. And OpenCL was designed for computing (not for graphics, like Vulkan).

I always planned to support Windows, as some “poor souls” still tend to use it. With the latest support for out-of-tree backends and no need to build PyTorch itself, it has become a reality - all you need is VS2022, an OpenCL SDK and, optionally, the SQLite library. I myself rarely use Windows.

However, since I targeted cross-platform support, it has indeed become a reality. And AMD is of course one of the main beneficiaries, for several reasons:

  1. Their GPUs are the only real competitors to NVIDIA’s with enough power to do real training.

  2. Their own ROCm/HIP platform, while more mature, is far more limited in terms of compatibility. The OpenCL backend is supported and actually runs on:

    • older GPUs, including GPUs that AMD abandoned in ROCm, like GCN4
    • APUs
    • Windows
    • even the Clover Mesa OpenCL driver, making it possible to run GCN4 GPUs connected to chipset PCI-E lanes (ROCm requires PCI-E connected directly to the CPU)

Regarding other GPUs/mobile platforms:

  • NVIDIA is of course fully supported and runs very well.
  • Intel integrated GPUs are tested and working. The performance isn’t as optimized as for NVIDIA or AMD GPUs, but it still beats Intel’s own oneDNN in channels-first memory format.
  • I haven’t tested it on Intel Arc yet, for obvious reasons - but I’m highly curious what can be done there.
  • It was found to run fairly well on Apple’s M1, but it does not yet exploit the internal matrix multiplication functions needed to utilize the GPU to its full potential. Also, I don’t own an Apple M1 myself, but I have received positive reports.
  • It was tested for bare functionality on Mali, but it is not optimized at all.

Compared to Vulkan, it seems the OpenCL backend already provides many more operators than the Vulkan backend.

And for inference - dlprimitives itself (the core library) provides initial ONNX support, so you can have one small library for inference that runs on NVIDIA, AMD, Intel, M1 and potentially many more GPUs, with minimal dependencies: basically an OpenCL driver, protobuf and sqlite3.

ONNX support is a work in progress as well, but most vision classification networks, like ResNet, MobileNet, EfficientNet and many others, are tested and supported.
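
The dlprimitives loading API itself is out of scope here; on the PyTorch side, producing a file it can consume is just the standard ONNX export - a minimal sketch (the model choice and opset are arbitrary):

```python
import torch
import torchvision.models as models

# Export a vision classifier to ONNX so it can be loaded by an
# ONNX-capable runtime such as dlprimitives.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "mobilenet_v2.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,  # arbitrary; pick one your runtime supports
)
```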