Recently the privateuseone device was introduced, and integration with mainstream PyTorch has become transparent. All you need to do is install the stable 1.13 PyTorch version and build the backend against it - a matter of a few minutes: set up a pip virtual environment, run a simple cmake build, and you are good to go.
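For illustration, a minimal sketch of what using the built backend looks like from Python (the shared-library name/path below is an assumption - point it at whatever your cmake build produced, and note that only the operators already implemented by the backend will work):

import torch

# Load the out-of-tree OpenCL backend built against stable PyTorch 1.13.
# The library name here is an assumption - use your actual build output.
torch.ops.load_library("build/libpt_ocl.so")

# Tensors and modules can then be placed on the device exposed via privateuseone.
x = torch.randn(4, 3, device="privateuseone:0")
y = (x * 2 + 1).cpu()
print(y)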
More than that, I even did brief testing on Windows using VS2022 and was able to do the same - build the backend and run some training and tests (of course with a little less ease, as with anything on Windows).
Basically, I think it is currently the only option for using an AMD GPU on Windows with PyTorch! Unlike ROCm, it works very well even with older AMD GPUs like "Stoney Ridge" and runs well on the latest RDNA2 lineup - like the RX 6600 XT I use myself. Of course, nVidia GPUs work very well too.
The current version is aligned with the 1.13 release and the 1.14 nightly builds.
What was tested?
I managed to validate many standard vision networks such as ResNet, VGG, AlexNet, DenseNet, MobileNet, SqueezeNet, ShuffleNet, GoogLeNet, MNASNet, RegNet and EfficientNet. I also ran it on many PyTorch examples, including super-resolution and style transfer.
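For example, a rough sketch of this kind of validation with a torchvision model (the device string follows the privateuseone naming used above and may differ in your setup):

import torch
import torchvision.models as models

dev = "privateuseone:0"  # device name assumed here

model = models.resnet18(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    ref = model(x)                        # CPU reference
    out = model.to(dev)(x.to(dev)).cpu()  # OpenCL backend

# The OpenCL result should match the CPU reference within tolerance.
print(torch.allclose(ref, out, atol=1e-4, rtol=1e-3))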
Many operators are still missing and there is a long way to go, but it is a work in progress and, more importantly, it is already highly useful and quite easy to use - thanks to the latest PyTorch improvements.
How is the performance?
Compared to cuda/cudnn on the GTX 960 I have, it reaches 50-60% of cuda/cudnn performance for training and around 60-70% for inference, depending on the specific network.
Summary
I encourage everybody to try it and test it - it is really easy now!
This is great. Are AMD GPUs on Windows the main motivation? Or are you also thinking of targeting mobile? We have a Vulkan backend on mobile, but I have been curious about OpenCL.
My goal is to have an open, cross-platform solution that does not depend on a specific platform API like nVidia's CUDA, AMD's CUDA clone ROCm/HIP, Apple's Metal, Windows DirectCompute, etc.
I want a truly open-source, cross-platform solution that works across various GPUs (with relevant optimizations, of course) and multiple operating systems.
OpenCL is the most widely supported computing platform; it works very well and is actually very similar to CUDA - many kernels can be written for both CUDA and OpenCL with a handful of defines. OpenCL was designed for compute (not graphics, like Vulkan).
I always planned to support Windows, as some "poor souls" still tend to use it; with the latest support for out-of-tree backends and no need to build PyTorch itself, it has become a reality - all you need is VS2022, the OpenCL SDK and optionally the SQLite library. I myself rarely use Windows.
However, since I targeted cross-platform support, it has indeed become a reality. And AMD is of course one of the main beneficiaries, for several reasons:
Their GPUs are the only real competitors to nVidia's with enough power to do real training.
Their own ROCm/HIP platform, while more mature, is much more limited in terms of compatibility. The OpenCL backend is supported and actually runs on:
older GPUs including GPUs that AMD abandoned in ROCm like GCN4.
APUs
Windows
It even allows using the Clover Mesa OpenCL driver, which makes it possible to run GCN4 GPUs connected to chipset PCI-E lanes (ROCm requires PCI-E lanes connected directly to the CPU).
Regarding other GPUs/mobile platforms:
nVidia is of course fully supported and runs very well.
Intel integrated GPUs are tested and working, but the performance isn't as optimized as for nVidia or AMD GPUs; still, it beats Intel's own oneDNN in channel-first memory format.
I haven't tested it on Intel Arc yet for obvious reasons - but I'm highly curious what can be done there.
It was found to run fairly well on Apple's M1, but it does not yet exploit the internal matrix-multiplication functions to use the GPU to its full potential. Also, I don't own an Apple M1 myself, but I got positive reports.
It was tested for bare functionality on Mali, but it is not optimized at all.
In comparison to Vulkan, it seems that it already provides many more operators than the Vulkan backend.
And for inference - dlprimitives itself (the core library) provides initial ONNX support, so you can have one small library for inference that runs on nVidia, AMD, Intel, M1 and potentially many more GPUs, with minimal dependencies - basically an OpenCL driver, protobuf and sqlite3.
ONNX support is also a work in progress, but most vision classification networks like ResNet, MobileNet, EfficientNet and many others are tested and supported.
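As a small illustration of the workflow, here is only the standard PyTorch export step; how dlprimitives loads the resulting ONNX file is described in its own documentation and examples:

import torch
import torchvision.models as models

# Export a vision model to ONNX; the resulting file can then be fed to
# dlprimitives' ONNX-based inference path.
model = models.mobilenet_v2(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "mobilenet_v2.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)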
Great work, this is huge for running on embedded hardware.
Question: how would this solution work with torch's CUDA/C++ extensions? I'm trying to migrate the cuRobo library onto an embedded board and would like to convert the CUDA code to OpenCL and then load it into your torch version - is that even possible? Thanks!
I don't see a major problem with using it as an extension. However, a few things to note:
You'll need to link against the backend and use its headers/API to access the memory-allocation and tensor-processing facilities.
For operations that are pointwise, do reduction or broadcasting, you don't even need to write your own kernels - just pass the small computation code as a string and it will work.
For more complex kernels you can access all the underlying OpenCL objects via its API:
Basically:
// Accessing memory
dlprim::Tensor X = todp(input); // create a dlprim Tensor from a pytorch tensor
cl::Buffer buffer = X.device_buffer();
// Accessing all the OpenCL objects (queue, device, context)
dlprim::ExecutionContext ctx = getExecutionContext(input.device());
However, if you want to benefit from the other goodies dlprim implements, like kernel building and caching, small changes are needed in dlprimitives: all kernels are currently embedded in the dlprim library itself, so an extension point needs to be added to load and use external custom kernels.
Shouldn’t be a big problem.
Great work, this is huge for running on embedded hardware.
Now a small warning: although I have tested it on a Mali G52 MC2 several times, the performance is far from optimal.
The GEMM and Winograd kernels (basically matrix multiplication and convolution) are optimized for large GPUs like nVidia's and AMD's; there is some custom GEMM optimization for Intel GPUs, but nothing like that exists for Mali. So performance may be sub-optimal (it is on the to-do list).
Thanks for the info. Just to clarify, I want to use your opencl torch version with an extension (not as an extension). The desired extension would be created by converting that cuda code to opencl.
Given that "all kernels are currently embedded in the dlprim library itself and an extension point needs to be added to load and use external custom kernels":
What happens if your OpenCL torch version calls the torch.utils.cpp_extension functions?
The target GPU is an Arm Mali-G57 MC3; any idea how far below optimal the performance would be?
Now it works with PyTorch 2.4 - in fact, it is a requirement: either 1.13 or >=2.4.
I created a much easier interface - all you need is to import the pytorch_ocl module and you get all the goodies on Linux and Windows.
With the Python module you can use torch.ocl.synchronize() and torch.ocl.empty_cache(), just as with CUDA.
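A minimal usage sketch (the "ocl:0" device string is assumed here - adjust it if your setup exposes a different name):

import torch
import pytorch_ocl  # registers the OpenCL backend and the torch.ocl helpers

x = torch.randn(1024, 1024, device="ocl:0")
y = x @ x

torch.ocl.synchronize()   # wait for queued OpenCL work, like torch.cuda.synchronize()
torch.ocl.empty_cache()   # release cached device memory, like torch.cuda.empty_cache()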
I ordered an Intel Arc GPU (A380) - so hopefully I'll be able to check and optimize for this new platform.
Implemented other required things like manual_seed_all, as needed for the backend.
Known issues: currently there is a problem with loading a saved state dictionary back if it was saved from the ocl device. It crashes for some reason (outside of the ocl backend). Workaround: move the model to the CPU when saving/restoring it.
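A sketch of the workaround (the "ocl:0" device string is again an assumption):

import torch

model = torch.nn.Linear(4, 2).to("ocl:0")

# Workaround: move the model to CPU before saving the state dict...
torch.save(model.cpu().state_dict(), "model.pt")

# ...and load it on the CPU first, then move the model back to the OpenCL device.
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.to("ocl:0")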
Now a question: is there any online service that can help me build pip packages for Linux and Windows, so users can just pip install a nightly build instead of going through the build instructions (especially on Windows)?
I released version 0.2.0 - including binary whl files
Lots of improvements and bug fixes. The new Arc GPU is now well tested and performance improvements have been added. Vision transformers are now validated and working. But there is lots more to do.