OpenCL backend dev - questions/support

Hi,

I'm continuing to make progress on the PyTorch OpenCL backend.

Now I have some questions that I'm not sure how to handle properly.

Questions

  1. Double support fallback. Intel GPUs (iGPU and Arc) do not support fp64. I noticed that double tensor creation was requested during printing of a tensor; this wasn't an issue on AMD or NVIDIA devices, since fp64 is supported there, but it failed on Intel. Currently I added a workaround: in the implementation of aten::allocate_empty I check whether fp64 is enabled on the device, and if a kDouble tensor is requested I allocate a float tensor instead and issue a warning (a sketch follows below the list). I'm not entirely sure this is the correct way to do things. I assume XPU has the same issue, but I couldn't find how they deal with it.
  2. There are some operators that can be (and are) implemented in terms of other operators. I noticed recently that I had no problems running vit_b_16 and similar networks, but when the model was set to eval mode with no_grad enabled, it requested the native_multi_head_attention operator, which was not implemented… I looked into the code and there are several very similar multi_head_attention cases… Is there any way to automatically use an existing one (like a non-optimized multi_head_attention), since it seems that training is much more important than inference in terms of performance?
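
To make the workaround in question 1 concrete, here is a minimal sketch, assuming a hypothetical device_has_fp64() helper on the backend side; the helper and function names are illustrative, not the actual backend code:

```cpp
// Minimal sketch (not the actual backend code): fall back to fp32 when a
// kDouble tensor is requested on a device without fp64 support.
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Hypothetical helper; a real backend would query the OpenCL device
// (e.g. CL_DEVICE_DOUBLE_FP_CONFIG) once and cache the result.
bool device_has_fp64(const c10::Device& dev);

at::ScalarType effective_dtype(at::ScalarType requested, const c10::Device& dev) {
    if (requested == at::kDouble && !device_has_fp64(dev)) {
        TORCH_WARN_ONCE(
            "Device does not support fp64; allocating an fp32 tensor instead");
        return at::kFloat;
    }
    return requested;
}
```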

I also noticed that lots of operators that manipulate tensors don't actually have any OpenCL-specific code - just manipulation of strides, sizes, etc.

Anyway, my general thought is that a lot of effort is wasted (AMD, XPU, Metal, etc.): perhaps 5% of the code is actually relevant for device-specific optimization, and most of it is just a reimplementation of exactly the same stuff. Intel, AMD and Apple keep reinventing the wheel when there is a standard GPU compute API (OpenCL) - but that's a side note… it's just a pity.

Hey!

Thanks for your dedication working on this topic!

  1. We have this suboptimal code at pytorch/torch/_tensor_str.py at 0a3c064c124222884a4b10156d070a061741f84b · pytorch/pytorch · GitHub that handles the XPU case. I think we can make the has_fp64 flag exist for privateuse1 as well, so it can be defined by the backend; then printing will use fp32. What do you think?
  2. These generic functions should be registered as CompositeExplicitAutograd so that this device-agnostic code is used by all backends by default. For various reasons, some of these are not done properly (either by mistake or oversight). The general fix is either, on your end, to register the CPU kernel onto the privateuse1 key (see the sketch after this list), or, in PyTorch, to update native_functions.yaml to properly register these onto CompositeExplicitAutograd instead of specific CPU/CUDA/etc. keys.
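
To illustrate the first option, here is a hedged sketch of registering an implementation written purely in terms of other dispatched operators onto the PrivateUse1 key; relu is used only because its schema is simple, not because it actually needs this treatment:

```cpp
// Sketch: register a device-agnostic implementation onto the PrivateUse1
// dispatch key so the out-of-tree backend picks it up.
#include <torch/library.h>
#include <ATen/ATen.h>

// Written purely in terms of other dispatched ops, so no device-specific
// code is needed here; the redispatched ops do the device work.
at::Tensor relu_via_other_ops(const at::Tensor& self) {
  return at::clamp_min(self, 0);
}

TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("relu", &relu_via_other_ops);
}
```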

I see, yes, this would be excellent.

Ok, interesting.

I’ll try to see if the CPU version works. It would be an excellent time saver.

Thank you for the help!

I just think it is important to have a GPU-agnostic backend. It can be a base for cross-platform DL. I know it is a very long-term goal, but… I keep dreaming.

(It looks like I’ll be able to integrate Intel’s oneDNN/OpenCL library for better performance on Intel GPUs.)

I’ll try to see if the CPU version works. It would be an excellent time saver.

My WIP PR has an example of re-using the CPU implementation for metadata-only updates: Add device daemon by albanD · Pull Request #131814 · pytorch/pytorch · GitHub

I just think it is important to have a GPU-agnostic backend. It can be a base for cross-platform DL. I know it is a very long-term goal, but… I keep dreaming.

I do think that we’ve made more progress in the past year and a half than in the years prior. So I do see hope for this to actually happen!

I’ll try to see if the CPU version works. It would be an excellent time saver.

First of all, it worked. Instead of implementing the full multi_head_attention, I only had to implement the much simpler transform_bias_rescale_qkv, and now all the vit_x_NN networks work properly.

It is still not perfect performance-wise, since I do one extra copy, but once I implement a generic dlprim::core::pointwise_operation_broadcast option with support for strides, it would also improve some other performance problems with non-contiguous tensors that I have in convnext nets.

Most of the dlprimitives code assumes contiguous, channels-first tensors, and in some cases that adds an extra copy - but this is fixable, at least for basic pointwise broadcast/reduce operations.
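
As a rough illustration of the stride-aware direction (a generic sketch, not dlprimitives' actual pointwise_operation_broadcast API): the input offset for each output element can be computed directly from the output index and the input strides, with a stride of 0 marking a broadcast dimension, so non-contiguous inputs never need a contiguizing copy.

```cpp
// Generic sketch (not dlprimitives code): map a linear output index to an
// input offset using sizes/strides; stride 0 encodes a broadcast dimension.
#include <cstdint>
#include <vector>

int64_t input_offset(int64_t linear_out_index,
                     const std::vector<int64_t>& out_sizes,
                     const std::vector<int64_t>& in_strides) {
    int64_t offset = 0;
    // Peel coordinates off the linear index from the innermost dimension out.
    for (int64_t d = static_cast<int64_t>(out_sizes.size()) - 1; d >= 0; --d) {
        int64_t coord = linear_out_index % out_sizes[d];
        linear_out_index /= out_sizes[d];
        offset += coord * in_strides[d];
    }
    return offset;
}
```

The same index arithmetic can live inside the OpenCL kernel itself, so the extra copy disappears entirely.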

My WIP PR has an example of re-using the CPU implementation for metadata-only updates …

Ohhh, I wish I had noticed this some time ago. It would have been a major time saver, because I spent lots of time scratching my head over some of these operators and couldn't understand why they would need special implementations.

Thank you very much for the support!
