I already discussed this in the context of bug 151351, but while looking at issues related to “activation” sparsity I noticed a very large performance difference between converting to COO and converting to CSR. Take the following extreme example, where A is mostly negative, so after ReLU, T is mostly zero and is then converted to a truly sparse S:
import time
import torch
d = 1024 * 32
A = -torch.ones(d, d)
A[0, 0] = 111
A[10, 10] = 222 # most entries in A are < 0
T = torch.relu(A) # materializes very sparse T as dense first
S = T.to_sparse_csr() # would be nice to have something like S = torch.sparse_relu(A)
# but that is not the point of this bug yet
Here, I noticed a huge performance (time and memory) difference between converting to COO and converting to CSR. Adding some delays for ease of profiling:
.. construct A ..
time.sleep(10) # INTERVAL 1
T = torch.relu(A) # materializes very sparse T as dense first
time.sleep(10) # INTERVAL 2
S = T.to_sparse() # to COO
time.sleep(10) # INTERVAL 3
Then compare the memory/time plots for converting to COO (to_sparse) with those for CSR (to_sparse_csr). The same problem occurs on CPU (shown) and on GPU (since the same codepaths are taken).
Here, it is very clear that COO behaves as expected: a small bump to allocate A, then a small bump to allocate T, and then nothing for the two elements in COO. The three intervals are consecutive, with no wasted time. However, CSR behaves very poorly between INTERVAL 2 and INTERVAL 3: three spikes in memory (and extra time) are clearly visible.
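A rough way to reproduce the comparison without an external memory profiler is to run both conversions on CUDA and read the caching allocator's peak counter (the plots above were made on CPU, so this is only an approximation of the same effect). A minimal sketch, with a smaller d so the temporaries fit next to A and T:

import time
import torch

def profile_conversion(convert, d=1024 * 8):
    A = -torch.ones(d, d, device="cuda")
    A[0, 0] = 111
    A[10, 10] = 222
    T = torch.relu(A)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    t0 = time.time()
    S = convert(T)
    torch.cuda.synchronize()
    # peak includes A and T, which are still alive; the extra on top of that
    # is the temporary memory used by the conversion itself
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{convert.__name__}: {time.time() - t0:.3f}s, peak {peak_gib:.2f} GiB")
    return S

profile_conversion(torch.Tensor.to_sparse)      # COO
profile_conversion(torch.Tensor.to_sparse_csr)  # CSR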
I investigated this a bit further and observed the following.
Spikes 1 and 2 are due to the col/row indices construction with “arange” in the method _not_zero_mask_to_col_row_indices(). Spike 3 is due to the “arange” in the method _mask_to_indices().
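At least for the third spike, the problem is that the arange-based construction materializes a dense int64 tensor with one index per element of the mask, only for masked_select to throw almost all of it away, whereas nonzero allocates just the nnz-sized result. A Python-level illustration of the same pattern (the real helpers are C++; d is smaller here so the temporary is only about 0.5 GiB):

import torch

d = 1024 * 8
mask = torch.zeros(d, d, dtype=torch.bool)
mask[0, 0] = mask[10, 10] = True                 # nnz = 2

# arange-based path: allocates a d*d int64 tensor (~0.5 GiB) just to mask it away
flat = torch.arange(mask.numel()).masked_select(mask.flatten())

# nonzero-based path: allocates only the nnz-sized result
flat2 = mask.flatten().nonzero().flatten()

assert torch.equal(flat, flat2)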
I played around with various fixes. For the latter method, replacing
return at::native::arange(mask.numel(), at::kLong, kStrided, mask.device())
.masked_select(mask);
with
return at::native::flatten(at::nonzero(mask));
removed the third spike. Then, replacing
auto col_indices =
at::native::arange(
not_zero_mask.size(-1), index_dtype, kStrided, index_device)
.view({1, not_zero_mask.size(-1)})
.expand_as(not_zero_mask)
.masked_select(not_zero_mask);
auto row_indices =
at::native::arange(
not_zero_mask.size(-2), index_dtype, kStrided, index_device)
.view({not_zero_mask.size(-2), 1})
.expand_as(not_zero_mask)
.masked_select(not_zero_mask);
return std::pair<Tensor, Tensor>(col_indices, row_indices);
with
Tensor nz = not_zero_mask.nonzero().transpose(0, 1);
return std::pair<Tensor, Tensor>(nz[1], nz[0]);
removed the other two spikes. After this, CSR behaved much more like COO! I also verified that this runs on GPU, and that it behaves just as well when the input matrix is fully dense (in fact, the original implementation would go out-of-memory on GPU for larger cases).
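As a quick Python-level sanity check (not a proof, and not covering all dtypes/devices the C++ has to handle), the nonzero-based construction produces the same row/col indices, in the same row-major order, as the arange/expand/masked_select construction on random masks:

import torch

for _ in range(100):
    m, n = torch.randint(1, 64, (2,)).tolist()
    not_zero_mask = torch.rand(m, n) < 0.1       # random sparse boolean mask

    # original arange/expand/masked_select construction
    col_ref = torch.arange(n).view(1, n).expand_as(not_zero_mask).masked_select(not_zero_mask)
    row_ref = torch.arange(m).view(m, 1).expand_as(not_zero_mask).masked_select(not_zero_mask)

    # nonzero-based construction
    nz = not_zero_mask.nonzero().transpose(0, 1)
    assert torch.equal(nz[1], col_ref) and torch.equal(nz[0], row_ref)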
I have to convince myself that the fixes are correct for all cases (input is welcome), but I hope this will help with “activation” sparsity cases that prefer CSR over COO for the next layer (and of course, avoiding materialization of T altogether here would be even better, but that is next).
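For reference, one possible shape of that fused variant at the Python level; relu_to_sparse_csr is a hypothetical name (no such function exists in torch today), and it still reads the dense A, it just skips the dense float copy T and reuses the nonzero trick from above:

import torch

def relu_to_sparse_csr(A):
    # hypothetical fused helper: CSR of relu(A) without first materializing
    # the dense float result of relu; only a bool mask and nnz-sized tensors
    # are allocated on top of A
    mask = A > 0                                  # bool mask, 1 byte/element
    nz = mask.nonzero()                           # (nnz, 2) coordinates in row-major order
    values = A[mask]                              # only the surviving values are copied
    counts = torch.bincount(nz[:, 0], minlength=A.size(0))
    crow_indices = torch.zeros(A.size(0) + 1, dtype=torch.int64, device=A.device)
    crow_indices[1:] = torch.cumsum(counts, 0)
    return torch.sparse_csr_tensor(crow_indices, nz[:, 1].contiguous(),
                                   values, size=A.shape)

d = 1024
A = -torch.ones(d, d)
A[0, 0] = 111
A[10, 10] = 222
assert torch.equal(relu_to_sparse_csr(A).to_dense(), torch.relu(A))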