Eigendecomposition tests

Hi!
I am currently working on integrating the new cuSOLVER Xgeev eigendecomposition into PyTorch.

As my code is mostly working, I have moved on to testing.
I have also done some testing on my own. I mainly checked whether the returned eigenvalues and eigenvectors fulfill the eigenvalue equation A v = λ v.
This seems to be the case: the deviations I see are on the order of the numerical precision of the datatype used. However, when I run the official tests, I get multiple failures in TestLinalgCUDA.test_eig_compare_backends_cuda_…
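For reference, a stripped-down version of that check looks roughly like this (just an illustrative example with an arbitrary 5x5 double-precision matrix):

import torch

# Arbitrary example matrix on the GPU
A = torch.randn(5, 5, dtype=torch.float64, device="cuda")
w, v = torch.linalg.eig(A)

# The eigenvectors are the columns of v, so A @ v should equal v with each
# column i scaled by the corresponding eigenvalue w[i]
residual = A.to(v.dtype) @ v - v * w.unsqueeze(-2)
print(residual.abs().max())  # typically on the order of float64 machine precision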
I am not entirely sure, but I think that these tests might not be mathematically valid. They just compare the eigenvalues of the two backends elementwise. However, the eigendecomposition is not unique: the eigenvalues can be returned in any order, and the eigenvectors are only defined up to a phase. The documentation seems to reflect this:

The returned eigenvectors are normalized to have norm 1. Even then, the eigenvectors of a matrix are not unique, nor are they continuous with respect to A. Due to this lack of uniqueness, different hardware and software may compute different eigenvectors.

This non-uniqueness is caused by the fact that multiplying an eigenvector by e^{iφ}, φ ∈ ℝ produces another set of valid eigenvectors of the matrix. For this reason, the loss function shall not depend on the phase of the eigenvectors, as this quantity is not well-defined. This is checked when computing the gradients of this function. As such, when inputs are on a CUDA device, the computation of the gradients of this function synchronizes that device with the CPU.
(https://docs.pytorch.org/docs/stable/generated/torch.linalg.eig.html)
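To illustrate the phase issue: multiplying each eigenvector (column) by an arbitrary factor e^{iφ} gives another perfectly valid set of eigenvectors, so two backends can disagree elementwise while both being correct. A small self-contained example (the random phases are of course made up for the demonstration):

import torch

A = torch.randn(5, 5, dtype=torch.complex128)
w, v = torch.linalg.eig(A)

# Rotate every eigenvector (column) by a random phase e^{i*phi}
phi = torch.rand(5, dtype=torch.float64) * 2 * torch.pi
v_rotated = v * torch.exp(1j * phi).unsqueeze(-2)

# Both sets of eigenvectors satisfy the eigenvalue equation to machine precision ...
print((A @ v - v * w.unsqueeze(-2)).abs().max())
print((A @ v_rotated - v_rotated * w.unsqueeze(-2)).abs().max())

# ... but they are nowhere near elementwise equal
print(torch.allclose(v, v_rotated))  # generally False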

Therefore, I think this test might have to be disregarded; it only worked until now because MAGMA performed most of its eigendecomposition on the CPU, which happened to make the eigenvalues come out similar to the CPU results.

I would love to hear from someone who has worked more with the different implementations of the eigenvalue problem in MAGMA and on the CPU, as I am not entirely sure my deduction is correct. If it is, I think we could move to a purely mathematical test: check that the number of eigenvalues matches the CPU result and that all eigenpairs fulfill the eigenvalue equation within the expected accuracy of the datatype.
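For the "within the expected accuracy of the datatype" part, one possible way to pick the tolerance would be to scale the machine epsilon of the computation dtype by the size and magnitude of the matrix. Just a sketch: the names eig_residual_ok and factor are made up, and the bound is a heuristic rather than a rigorous backward-error bound.

import torch

def eig_residual_ok(A, w, v, factor=10.0):
    # factor is an arbitrary safety margin
    A = A.to(v.dtype)
    residual = (A @ v - v * w.unsqueeze(-2)).abs().max()
    eps = torch.finfo(v.dtype).eps  # eps of the underlying real dtype
    tol = factor * eps * A.shape[-1] * A.abs().max()
    return (residual <= tol).item()

A = torch.randn(5, 5, dtype=torch.float32)
w, v = torch.linalg.eig(A)
print(eig_residual_ok(A, w, v))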

Happy for any input!

Best
Johannes

I just sketched a quick test that basically mimics the behaviour of the test mentioned above, but allows for any valid set of eigenvalues and eigenvectors:

import torch
import warnings

passing = True  # set to False by any failing check

shapes = [(0, 0),  # Empty matrix
          (5, 5),  # Single matrix
          (0, 0, 0), (0, 5, 5),  # Zero batch dimension tensors
          (2, 5, 5),  # 3-dim tensors
          (2, 1, 5, 5)]  # 4-dim tensors
for dtype, rtol, atol in [(torch.complex64, 1e-4, 1e-5), (torch.complex128, 1e-12, 1e-13), (torch.float32, 1e-4, 1e-5), (torch.float64, 1e-12, 1e-13)]:
    print(f"\nTesting dtype: {dtype}")
    for shape in shapes:
        A = torch.randn(shape, dtype=dtype, device="cuda")
        print(f"\nInput matrix A with shape {shape}: {A}")
        print("Calling torch.linalg.eig(A) on CPU...")
        w_cpu, v_cpu = torch.linalg.eig(A.cpu())
        print("Calling torch.linalg.eig(A) on CUDA...")
        w, v = torch.linalg.eig(A)
        if w.numel() != w_cpu.numel():
            passing = False
            warnings.warn(f"Number of eigenvalues differs between CPU and CUDA for shape {shape}!")
            continue
        if v.numel() != v_cpu.numel():
            passing = False
            warnings.warn(f"Number of eigenvectors differs between CPU and CUDA for shape {shape}!")
            continue
        A = A.to(v_cpu.dtype)  # Ensure A is the same dtype as v_cpu for matrix multiplication
        diff_cpu = (A.cpu()@v_cpu) - (v_cpu * w_cpu.unsqueeze(-2))
        diff_cuda = (A@v) - (v * w.unsqueeze(-2))
        if diff_cpu.numel() == 0:
            if diff_cuda.numel() == 0:
                print("Both differences are empty, skipping comparison.")
            else:
                passing = False
                warnings.warn(f"Difference on CPU is empty but on CUDA is not for shape {shape}!")
        else:
            if diff_cuda.numel() == 0:
                passing = False
                warnings.warn(f"Difference on CUDA is empty but on CPU is not for shape {shape}!")
                continue

            print(f"Max difference on CPU: {diff_cpu.abs().max().item()}")
            print(f"Max difference on CUDA: {diff_cuda.abs().max().item()}")
            close = torch.allclose(diff_cpu, diff_cuda.cpu(), atol=atol, rtol=rtol)
            if not close:
                passing = False
                print(f"differences are NOT close for shape {shape}!")
            else:
                print(f"differences are close for shape {shape}.")



if passing:
    print("\nAll tests passed!")
else:
    print("\nSome tests failed!")

I will probably update the official test accordingly and open a PR for it.