Why so many HW backends, and why does nobody cooperate?

Ok… sorry, but I want to address the elephant in the room. Actually, elephants.

  1. NVidia - CUDA - ok, they have the crown.
  2. AMD - ROCm/HIP - let's just copy CUDA… Why? Because we are so far behind, so let's take the shortcut - never mind whether it is a good or poor long-term strategy.
  3. Intel - ohh, we are smarter, let's do SYCL - with all the drawbacks of CUDA (everything statically compiled) but without the advantage, because nobody implements SYCL (besides Intel).
  4. Apple - ok, we are who we are, we just must be smart a… let's do Metal.
  5. Some others - let's reinvent the wheel and build a smart compiler that will help us: Triton…
  6. Let's do Vulkan (yeah… that went well).
  7. We are Microsoft, so we do something Direct: DirectML.

What would happen in the gaming industry if every GPU vendor required its own language? There is Vulkan, which runs everywhere, and there is Direct3D if you want to optimize for Windows.

Ohhh, it would be cool if we had something similar for compute… Ohhh, but there is one - OpenCL, which runs everywhere, even on nVidia and on your smartphone!

And the cool thing is that 95% of the code isn't performance-critical and can easily be implemented as simple kernels that run very well. The 5% that actually is performance-critical can be optimized with assembly for each platform - GEMM, convolution and a handful of other kernels.
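For a sense of scale, here is a minimal sketch (in OpenCL C) of what one of those "simple" kernels can look like - a plain ReLU. The kernel name and signature are illustrative only, not taken from any particular backend:

```c
// Minimal elementwise ReLU kernel - illustrative only.
__kernel void relu_fwd(__global const float *x,
                       __global float *y,
                       int n)
{
    int i = (int)get_global_id(0);
    if (i < n)
        y[i] = x[i] > 0.0f ? x[i] : 0.0f;
}
```

Kernels like this are memory-bound, so a straightforward version is typically already close to the hardware limit.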

And with code-generation helpers it is really easy to write the simple operators. So why not cooperate instead of reinventing the wheel for the Nth time? Why not build something that is generic, well supported and actually works? Especially since it is quite important for the ML world to have healthy competition and to actually see good open-source kernels that can be improved.

If I was able, on my own, to get it working - not even fully featured - and managed to reach 70%-85% of the performance on some hardware with very limited effort and time, what could it become with some more effort?

  1. Standards like OpenCL can’t (by definition) incorporate vendor-specific instructions like mma.async
  2. Because AI is a fast-moving field, vendors innovate a lot, and because of (1), standards don’t stick
  3. Vendors enjoy ecosystem lock-in, as it has tangible benefits, and will continue down that path until the ecosystem incentivizes common standards - but that won't happen until AI slows down (i.e. because of (2)).
  4. You could probably get to 70%-85% perf on FP32. Now try BF16 and/or FP8 and tell me how you could pull that off with OpenCL – I doubt you could without reaching for vendor-specific inline assembly.
> Standards like OpenCL can’t (by definition) incorporate vendor-specific instructions like mma.async

That is like saying you can't use vendor-specific code (Intel SSE/AVX, ARM NEON) in C++… because it is cross-platform.

You absolutely can and should.
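To make the analogy concrete, here is a small C sketch of the usual CPU pattern - vendor intrinsics behind compile-time guards with a portable fallback. The function is illustrative, not from any real library:

```c
#include <stddef.h>

#if defined(__AVX__)
#  include <immintrin.h>
#elif defined(__ARM_NEON)
#  include <arm_neon.h>
#endif

/* Scale a float array in place: vendor fast path where available,
 * plain C everywhere else. Illustrative sketch only. */
void scale_f32(float *v, float a, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(v + i, _mm256_mul_ps(_mm256_loadu_ps(v + i), va));
#elif defined(__ARM_NEON)
    float32x4_t va = vdupq_n_f32(a);
    for (; i + 4 <= n; i += 4)
        vst1q_f32(v + i, vmulq_f32(vld1q_f32(v + i), va));
#endif
    for (; i < n; ++i)   /* portable tail / fallback */
        v[i] *= a;
}
```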

First of all, it is more than OK to use vendor-specific inline assembly where it matters, like in GEMM and other important operations - here is an nVidia example. The same can be done for mma.async, which you can call from an OpenCL kernel on nVidia - I've seen some examples but haven't gotten to implementing it yet. Here is some Intel-specific code, and so on.

So you optimize the specific heavy kernels per vendor - convolutions, GEMM - even writing them in vendor inline assembly, but you share the vast majority of the other operator code, which is usually trivial.

Just want to clarify: it isn't like saying you can't use vendor intrinsics for SIMD, which are localized and uniform in their memory model.
It's a lot more fine-grained than that. In the latest generation of accelerators, leveraging async operations together with various DMA mechanisms involves significant logic – which, when written with intrinsics/assembly, becomes substantially annoying.


Obviously, implementing high-performance GEMM/conv and other critical kernels is hard - really hard. And they should be vendor-specific to get the most out of the hardware.

But so far, most of the time and effort I've spent on the OpenCL backend has actually gone into implementing the simple, not very complex operators that are common. I started with GEMM/Winograd/conv to see that I could get reasonable performance, but after that, 95% of the effort has been in implementing quite trivial things: pooling, bilinear interpolation, various activations and simple operators, tinkering with various copy/cast semantics, etc.

And this is something that is very small performance-wise and can totally be shared.
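As an example of the kind of thing that eats up that effort, here is a hedged sketch of a strided cast-and-copy kernel in OpenCL C. The name and layout are made up for illustration; a real backend also has to handle broadcasting, many dtypes and non-contiguous shapes on the host side:

```c
// Strided float -> int cast-and-copy: trivial math, mostly index bookkeeping.
__kernel void copy_cast_strided(__global const float *src,
                                __global int *dst,
                                int n,
                                int src_stride,
                                int dst_stride)
{
    int i = (int)get_global_id(0);
    if (i < n)
        dst[i * dst_stride] = (int)src[i * src_stride];
}
```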

AMD has its MIOpen OpenCL variant (I hope they will not deprecate it) and Intel has oneDNN with OpenCL support - I plan to use them both for the critical operations (and maybe later some ARM/Mali libraries), while most of the other stuff can be shared.

As an example, oneDNN is a small library that can easily be built for any system. MIOpen is somewhat more problematic.

But that is the point: keep the vendor-specific code small - just the critical parts, optimized well - and the rest can easily be implemented in much simpler code without performance problems.
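A hypothetical sketch of that split, in plain C; none of the names below are real MIOpen or oneDNN APIs, they just stand in for the vendor-tuned and generic paths:

```c
typedef void (*gemm_fn)(const float *A, const float *B, float *C,
                        int M, int N, int K);

/* Shared reference GEMM - the "good enough everywhere" path. */
static void gemm_generic(const float *A, const float *B, float *C,
                         int M, int N, int K)
{
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

/* In a real backend these slots would point to MIOpen- or oneDNN-backed
 * implementations; in this sketch everything falls back to the generic path. */
static gemm_fn pick_gemm(const char *vendor_name)
{
    (void)vendor_name;   /* e.g. "amd" -> MIOpen path, "intel" -> oneDNN path */
    return gemm_generic;
}
```

Only the slots for the heavy kernels ever need vendor-specific work; everything else goes through the shared code.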

Here is a list of operators I need to implement to make YOLO/YOLOX and other detection networks work in the OpenCL backend.

There is nothing complex in them beyond their sometimes tricky semantics and a few generic algorithms (see the sketch after the list for what one of them boils down to). But it is all work to do. And think - every backend needs to implement them… instead of spending the time on optimizing a few performance-critical kernels per vendor and sharing the rest in a cross-platform way.

'aten::addmv.out'
'aten::all.all_out'
'aten::embedding_dense_backward'
'aten::gather.out'
'aten::_index_put_impl_'
'aten::index_select'
'aten::index.Tensor_out'
'aten::linalg_vector_norm.out'
'aten::masked_fill_.Scalar'
'aten::max.dim_max'
'aten::max_pool2d_with_indices.out'
'aten::nonzero'
'aten::scatter_add.out'
'aten::scatter.value_out'
'aten::sort.values_stable'
'aten::topk.values'
'aten::unfold'
'aten::_unique2'
'aten::where.self'
'torchvision::nms'
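For a feel of how shallow most of these are, here is a hedged sketch of what something like 'aten::where.self' boils down to as an OpenCL kernel. Broadcasting and dtype dispatch would be handled on the host side and are omitted; the kernel name is made up:

```c
// Elementwise select: out[i] = cond[i] ? a[i] : b[i]. Illustrative only.
__kernel void where_select(__global const uchar *cond,
                           __global const float *a,
                           __global const float *b,
                           __global float *out,
                           int n)
{
    int i = (int)get_global_id(0);
    if (i < n)
        out[i] = cond[i] ? a[i] : b[i];
}
```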

It isn't just GEMM/conv kernels. You haven't gotten to distributed training yet, have you? All the accelerators now have DMA of some sort, and they need to use the compute cores to issue cross-accelerator comms.
And the DMA mechanisms differ across accelerators.

I think it is possible for all vendors to try to be within the same stack – the closest thing to that today is MLIR rather than OpenCL, because it allows vendors to write IRs within a common IR framework, yet extend them in the ways they need without having to build standards.

I'll just note it here: implementing efficient cross-vendor memory access/DMA is an order of magnitude simpler than writing efficient computational kernels (i.e. ones reaching close to 80-90% of peak FLOPS).

Still, the point stands.