Hi everyone,
As I am getting ready to submit the PR, I wanted to share some performance numbers, which are better than I anticipated. For 32-bit precision, cuSOLVER seems to be 4-5x faster than MAGMA; for 64-bit, about 2x. This was tested on an RTX 4070 and a Ryzen 5700X with Ubuntu and CUDA 12.8. I am trying to get some tests run on an H100 and an A100, but this might take a while.
For all the numbers shown here, I submitted batches of 10 matrices (which ATen loops over; there is no real batching in xgeev yet) and averaged the times over 5 runs for each matrix type and size.
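For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the timing loop described above: a batch of matrices is processed one by one and the wall-clock time is averaged over several runs. It is not the script used for the numbers in this post. The name `time_batched`, the `sync` hook, and the warm-up pass are my own assumptions for illustration.

```python
import time
from statistics import mean


def time_batched(fn, batch, runs=5, warmup=1, sync=None):
    """Return the average wall-clock seconds needed to apply fn to each item in batch.

    fn     -- callable applied to one matrix at a time (hypothetical placeholder)
    batch  -- sequence of matrices, e.g. 10 of them as in the post
    runs   -- number of timed repetitions to average over
    warmup -- untimed passes before measuring (an assumption, not from the post)
    sync   -- optional callable that blocks until pending device work is done
    """
    def run_once():
        if sync is not None:
            sync()  # make sure earlier asynchronous work is not counted
        start = time.perf_counter()
        for matrix in batch:
            fn(matrix)
        if sync is not None:
            sync()  # wait for the work launched in this pass to finish
        return time.perf_counter() - start

    for _ in range(warmup):
        run_once()
    return mean(run_once() for _ in range(runs))


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU or PyTorch.
    dummy_batch = [list(range(1000)) for _ in range(10)]
    print(f"average seconds per batch: {time_batched(sum, dummy_batch):.6f}")
```

On a GPU, one would presumably pass `sync=torch.cuda.synchronize` and something like `fn=torch.linalg.eig`, since CUDA calls return before the device has finished and the timings would otherwise be too optimistic.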
It seems that the CPU path is still faster for small matrices (<256), so it might be worth discussing whether we want to keep the automatic switch to CPU. For now, I have removed the automatic CPU path for small matrices, as I think that is a little cleaner, and cuSOLVER overtakes the CPU path at much smaller matrix sizes than MAGMA does.
I hope I can submit my PR today or tomorrow. There is still a test case I have to fix, but I think I know what the issue is, so I will keep you updated.
Kind regards, and as always, happy to receive any input.



