I already discussed this in the context of bug 151351, but while looking at issues related to “activation” sparsity I noticed a very large performance difference between converting to COO and converting to CSR. Take the following extreme example, where A is mostly negative, so after ReLU, T is mostly zero and is then converted to a truly sparse S:
import time
import torch
d = 1024 * 32
A = -torch.ones(d, d)
A[0, 0] = 111
A[10, 10] = 222 # most entries in A are < 0
T = torch.relu(A) # materializes very sparse T as dense first
S = T.to_sparse_csr() # would be nice to have something like S = torch.sparse_relu(A)
# but that is not the point of this bug yet
Here, I noticed a huge performance (time and memory) difference between converting to COO and converting to CSR. Adding some delays for ease of profiling:
.. construct A ..
time.sleep(10) # INTERVAL 1
T = torch.relu(A) # materializes very sparse T as dense first
time.sleep(10) # INTERVAL 2
S = T.to_sparse() # to COO
time.sleep(10) # INTERVAL 3
Then compare the memory/time plots for converting to COO (to_sparse) with those for CSR (to_sparse_csr). The same problem occurs on CPU (shown) and on GPU (since the same codepaths are taken).
Here, it is very clear that COO behaves as expected: a small bump to allocate A, then a small bump to allocate T, and then nothing for the two elements in COO. The three intervals are consecutive, with no wasted time. However, CSR behaves very poorly between INTERVAL 2 and INTERVAL 3: three spikes in memory (and extra time) are clearly visible.
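A rough way to reproduce the comparison without an external memory profiler is to run both conversions on CUDA and read the caching allocator's peak counter (the plots above were made on CPU, so this is only an approximation of the same effect). A minimal sketch, with a smaller d so the temporaries fit next to A and T:

import time
import torch

def profile_conversion(convert, d=1024 * 8):
    A = -torch.ones(d, d, device="cuda")
    A[0, 0] = 111
    A[10, 10] = 222
    T = torch.relu(A)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    t0 = time.time()
    S = convert(T)
    torch.cuda.synchronize()
    # peak includes A and T, which are still alive; the extra on top of that
    # is the temporary memory used by the conversion itself
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{convert.__name__}: {time.time() - t0:.3f}s, peak {peak_gib:.2f} GiB")
    return S

profile_conversion(torch.Tensor.to_sparse)      # COO
profile_conversion(torch.Tensor.to_sparse_csr)  # CSR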
I investigated this a bit further and observed the following.
Spikes 1 and 2 are due to the col/row indices construction with “arange” in the method _not_zero_mask_to_col_row_indices(). Spike 3 is due to the “arange” in the method _mask_to_indices().
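At least for the third spike, the problem is that the arange-based construction materializes a dense int64 tensor with one index per element of the mask, only for masked_select to throw almost all of it away, whereas nonzero allocates just the nnz-sized result. A Python-level illustration of the same pattern (the real helpers are C++; d is smaller here so the temporary is only about 0.5 GiB):

import torch

d = 1024 * 8
mask = torch.zeros(d, d, dtype=torch.bool)
mask[0, 0] = mask[10, 10] = True                 # nnz = 2

# arange-based path: allocates a d*d int64 tensor (~0.5 GiB) just to mask it away
flat = torch.arange(mask.numel()).masked_select(mask.flatten())

# nonzero-based path: allocates only the nnz-sized result
flat2 = mask.flatten().nonzero().flatten()

assert torch.equal(flat, flat2)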
I played around with various fixes. For the latter method, replacing
return at::native::arange(mask.numel(), at::kLong, kStrided, mask.device())
.masked_select(mask);
with
return at::native::flatten(at::nonzero(mask));
removed the third spike. Then, replacing
auto col_indices =
at::native::arange(
not_zero_mask.size(-1), index_dtype, kStrided, index_device)
.view({1, not_zero_mask.size(-1)})
.expand_as(not_zero_mask)
.masked_select(not_zero_mask);
auto row_indices =
at::native::arange(
not_zero_mask.size(-2), index_dtype, kStrided, index_device)
.view({not_zero_mask.size(-2), 1})
.expand_as(not_zero_mask)
.masked_select(not_zero_mask);
return std::pair<Tensor, Tensor>(col_indices, row_indices);
with
Tensor nz = not_zero_mask.nonzero().transpose(0, 1);
return std::pair<Tensor, Tensor>(nz[1], nz[0]);
removed the other two spikes. After this, CSR behaved much more like COO! I also verified that this runs on GPU, and that it behaves just as well when the input matrix is fully dense (in fact, the original implementation would go out-of-memory on GPU for larger cases).
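As a quick Python-level sanity check (not a proof, and not covering all dtypes/devices the C++ has to handle), the nonzero-based construction produces the same row/col indices, in the same row-major order, as the arange/expand/masked_select construction on random masks:

import torch

for _ in range(100):
    m, n = torch.randint(1, 64, (2,)).tolist()
    not_zero_mask = torch.rand(m, n) < 0.1       # random sparse boolean mask

    # original arange/expand/masked_select construction
    col_ref = torch.arange(n).view(1, n).expand_as(not_zero_mask).masked_select(not_zero_mask)
    row_ref = torch.arange(m).view(m, 1).expand_as(not_zero_mask).masked_select(not_zero_mask)

    # nonzero-based construction
    nz = not_zero_mask.nonzero().transpose(0, 1)
    assert torch.equal(nz[1], col_ref) and torch.equal(nz[0], row_ref)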
I have to convince myself that the fixes are correct for all cases (input is welcome), but I hope this will help with “activation” sparsity cases that prefer CSR over COO for the next layer (and of course, avoiding materialization of T altogether here would be even better, but that is next).
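For reference, one possible shape of that fused variant at the Python level; relu_to_sparse_csr is a hypothetical name (no such function exists in torch today), and it still reads the dense A, it just skips the dense float copy T and reuses the nonzero trick from above:

import torch

def relu_to_sparse_csr(A):
    # hypothetical fused helper: CSR of relu(A) without first materializing
    # the dense float result of relu; only a bool mask and nnz-sized tensors
    # are allocated on top of A
    mask = A > 0                                  # bool mask, 1 byte/element
    nz = mask.nonzero()                           # (nnz, 2) coordinates in row-major order
    values = A[mask]                              # only the surviving values are copied
    counts = torch.bincount(nz[:, 0], minlength=A.size(0))
    crow_indices = torch.zeros(A.size(0) + 1, dtype=torch.int64, device=A.device)
    crow_indices[1:] = torch.cumsum(counts, 0)
    return torch.sparse_csr_tensor(crow_indices, nz[:, 1].contiguous(),
                                   values, size=A.shape)

d = 1024
A = -torch.ones(d, d)
A[0, 0] = 111
A[10, 10] = 222
assert torch.equal(relu_to_sparse_csr(A).to_dense(), torch.relu(A))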