OK… sorry, but I want to address the elephant in the room. Actually, elephants.
- NVIDIA - CUDA - OK, they have the crown.
- AMD - ROCm/HIP - let's just copy CUDA… Why? Because we are so far behind, so let's take a shortcut - never mind whether it is a good or poor long-term strategy.
- Intel - ohh, we're smarter, let's do SYCL - with all the drawbacks of CUDA (everything statically compiled) but without the advantage, because nobody implements SYCL (besides Intel).
- Apple - OK. We are who we are, we just must be smart a… let's do Metal.
- Some others: let's reinvent the wheel - let's create a smart compiler that will help us. Triton…
- Let's do Vulkan (yeah… that went well).
- We are Microsoft, so we do something Direct: DirectML.
What would happen in the gaming industry if every GPU vendor required its own language? There is Vulkan, which runs everywhere, and there is Direct3D if you want to optimize for Windows.
Ohhh, it would be cool if we had something similar for compute… Ohhh, but there is one - OpenCL, which runs everywhere, even on NVIDIA and on your smartphone!
And the cool part is that 95% of the code isn't performance-critical: it can be easily implemented as simple kernels and will run very, very well. The 5% that actually is performance-critical can be optimized down to assembly for each platform - GEMM, convolution, and a handful of other kernels.
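To give a feel for how small such a non-critical kernel is, here is a minimal OpenCL C kernel for element-wise addition - a hypothetical sketch of the kind of "simple operator" meant above, not code from any particular project (it needs a host program and an OpenCL runtime to actually run):

```c
// Minimal OpenCL C kernel: element-wise addition of two float buffers.
// One work-item handles one output element.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out,
                      const int n)
{
    int i = get_global_id(0);   // global index of this work-item
    if (i < n)                  // guard against padded global sizes
        out[i] = a[i] + b[i];
}
```

Kernels like this are portable across every OpenCL implementation; only the few hot kernels need vendor-specific tuning.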
And with code-generation helpers it is really easy to write simple operators. So why not cooperate instead of reinventing the wheel for the Nth time? Why not build something that is generic, well supported, and actually works? Especially since it is kinda important for the ML world to have healthy competition and good open-source kernels that can actually be improved.
If I, on my own, was able to get it working - not fully featured, but reaching 70%-85% of the performance on some HW with very limited effort and time - what could it become with some more effort?