- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 132
Description
CSD
stands for Compressed Sparse Dimensions. It is a multidimensional generalization of CSR
, CSC
and COO
.
CSD
has a few "compressed" dimensions and a few "uncompressed" dimensions. The "compressed" dimensions are linearized and go into indptr
similar to CSR/CSC. The "uncompressed" dimensions (corresponding to indices
in CSR/CSC, but there can be multiple) will be stored in coords
.
The plan is to:
- Write
CSD
. - Test it.
- Make
COO
and others inherit from it.
Task checklist:
Implement CSD (parametrize tests over compressed axes).:
- Element-wiseReductionsSlicingChange compressed dimensionsTo pick up a draggable item, press the space bar. While dragging, use the arrow keys to move the item. Press space again to drop the item in its new position, or press escape to cancel.
Post-Implementation (parametrize tests over types):
- Make COO inherit from CSDMake CSR inherit from CSDMake CSC inherit from CSDTo pick up a draggable item, press the space bar. While dragging, use the arrow keys to move the item. Press space again to drop the item in its new position, or press escape to cancel.
Compressed Sparse Dimensions — Format Specification
Compressed Sparse Dimensions (CSD) is an multidimensional sparse array storage format.
Let’s say that the number of dimensions is ndim
and compressed_axes
represents the axes/dimensions that are compressed, as a tuple.
It contains three main arrays, one of them two-dimensional:
data
(one dimensional,(nnz,)
)coords
(two-dimensional,(ndim - len(compressed_axes), nnz)
)indptr
(one dimensional(reduce(operator.mul, (shape[a] for a in compressed_axes), 1) + 1,)
)
data
/coords
/indptr
from an ndarray
Relating to a normal NumPy array, we can specify data
and coords
directly. Let’s say arr
is the Numpy array to convert.
non_compressed_axes = tuple(a for a in range(ndim) if a not in set(compressed_axis))
coo_coords = np.where(arr)
data = arr[coo_coords]
axes_order = compressed_axes + non_compressed_axes
if axes_order != tuple(range(ndim)):
lin_coo_coords = linear_loc(coo_coords[np.asarray(axes_order)], tuple(self.shape[a] for a in axes_order))
idx = np.argsort(lin_coo_coords)
data = data[idx]
coords = coords[:, idx]
coords = coo_coords[np.asarray(non_compressed_axes)].astype(np.min_scalar_type(max(non_compressed_axes)))
compressed_linear_shape = reduce(operator.mul, (shape[a] for a in compressed_axes), 1)
compressed_linearized_coords = linear_loc(coo_coords[np.array(compressed_axes)], compressed_axes)
indptr = np.empty(compressed_linear_shape + 1, dtype=np.min_scalar_type(compressed_linear_shape))
indptr[0] = 0
np.cumsum(np.bincount(compressed_linearized_coords, minlength=compressed_linear_shape), out=indptr[1:])
Interpreting the arrays
For CSR, the interpretation is simple: indices
represents the columns
, data
the respective data. indptr[i]:indptr[i+1]
represents the slice corresponding to row i
for which columns
and data
are valid.
We can extend this definition to CSD as well. coords
represent the non-compressed axes, and data
the respective data. indptr
can be interpreted the same way, replacing the word “row” with “linearized compressed axes index”, but it turns out that there is a simpler way to work with things.
Alternative representation for indptr
Consider the following two arrays produced from indptr
:
starts = indptr[:-1].reshape((shape[a] for a in compressed_axes))
stops indptr[1:].reshape((shape[a] for a in compressed_axes))
Note that these are views, no copies are made. This gives a simpler and more intuitive interpretation for indptr
. For every possible index corresponding to the compressed axes: Indexing into starts
gives the starts for that index and stops
gives the stops.
The “multiple-COO” view
For every value in starts
/stops
, coords[:, start_val:stop_val]
and data[start_val:stop_val]
can be seen as a mini-COO array, and can thus have any COO-related algorithms applied to it.
Activity
mrocklin commentedon Apr 18, 2018
I mentioned this effort to some Scikit-Learn developers (notably @ogrisel) and they seemed interested in this. I think that in particular they want an object that is compatible with the CSR and CSC scipy objects in that is has the
data
,indptr
, andindices
attributes (these are necessary to be used within their codebase) and that also satisfies the numpy.ndarray interface well enough to be used within dask array.@hameerabbasi do you have thoughts on how best to accomplish this other than what is stated above? Do you have any plans to work on this near-term? (no pressure, just curious)
hameerabbasi commentedon Apr 18, 2018
Immediately, I plan to work on #114 and #124, as they're relatively easy to do. But since this is relatively important, yes, I plan to work on it soon-ish (right after that, ideally).
As
indices
in the case of N-D will need to store more than one dimension, I was thinking of naming itcoords
and making it 2-D (although it's essentially the same).I've thought a bit about it for the past week or so, but I can't seem to find a better way to proceed than what's already in the OP. I'd like to unify implementations as much as possible, if the performance cost isn't too high. Of course I could hack together 2-D limited CSR/CSC fairly quickly, but I want to keep things general as much as possible, and do things once (and properly).
@stefanv also seemed interested in joining the development team, and getting this into SciPy (dependency or merging) as soon as possible, I'm assuming CSD is essential to that.
On another note, fairly often, I feel my thought processes being limited by what Numba can currently do, and I have to adapt algorithms to match. Example: and no support for nested
list
s (pushed back to Numba 0.39 from 0.38, I have a strong suspicion this will be needed for CSD) ordict
(currently tagged 1.0).hameerabbasi commentedon Apr 18, 2018
The reason I feel it is this way is simple: In CSD, for every value of
indptr
, we get a newCOO
object in essence. So Numba-fication will become important even where it wasn't before.stefanv commentedon Apr 18, 2018
(Aside: my interest is in getting SciPy to support sparse arrays, so that packages like this one can depend on that structure internally, and so that we can eventually remove the Matrix class from NumPy).
Can you clarify in the description what "CSD" stands for?
hameerabbasi commentedon Apr 18, 2018
Thanks, updated the issue with what it stands for and a short description.
hameerabbasi commentedon Apr 18, 2018
Just a bit of background: A lot of machine learning algorithms use CSR/CSC as their primary input, as do BLAS/LAPACK/ARPACK etc. If we want to truly deprecate
scipy.sparse.spmatrix
and thereforenp.matrix
, we'll need this.Overall,
CSD
(whichCSR
/CSC
/COO
will be thin inheritances of) andDOK
will cover at least 80% of all use-cases (DOK
for mutable uses andCSD
for immutable ones).The rest of the 20% will need block formats
BSD
/BDOK
(not a particularly hard extension) andLIL
/DIA
(which can also be in block format).Details of most formats are written up in this SPEP.
stefanv commentedon Apr 18, 2018
About the CSD format, is this described in any publication, or was this designed here? Has it been compared to alternative storage options, such as ECCS (https://ieeexplore.ieee.org/abstract/document/1252859/)? Does it outperform coordinate storage, combined with a good query strategy?
hameerabbasi commentedon Apr 18, 2018
I'll confess I haven't done a literature review on this before starting down this path (other than Wikipedia). But with a short look, I kind of got the gist of how ECCS will work.
Coordinate storage and query strategy will both be superior in ECCS. The downside is that it doesn't generalize CSR/CSC/COO, and most current algorithms are implemented/designed around CSR/CSC (machine learning/LAPACK/ARPACK/BLAS), so implementing those kind of makes more sense practically.
I suppose CSD is a generalization of CRS/CCS in literature, except that you can compress any combination of axes rather than just rows/columns after linearizing it.
Of course, later, we could also add these more sophisticated storage methods, but given the libraries that already exist around CSR/CSC, that may be unwise at this stage.
Another option would be to implement GCRS/GCCS, but that will present issues in several indexing/reduction tasks unless we convert the indices in real-time all the time.
mrocklin commentedon Apr 18, 2018
I think that @stefanv 's question is good at highlighting that perhaps it would be good to ask around a bit about possible sparse storage structures, and lean about their pros and cons, before investing a lot of time into a new approach.
mrocklin commentedon Apr 18, 2018
If you're interested in walking back from the Numba decision that's fine with me. We should probably discuss that in another issue though.
It might be useful if you included more detail about what you're thinking both in terms of the CSD data structure and key algorithms that might be useful for it and how they would work. Perhaps others here could recommend alternatives.
stefanv commentedon Apr 19, 2018
Other formats:
From Twitter:
14 remaining items
daletovar commentedon Jun 10, 2019
I recently started working on this. It's basically just GCRS/GCCS but I opted to represent the rows of the underlying matrix as the first several axes and used the remaining axes to represent the columns. This is in line with how numpy reshapes arrays and so it made the most sense. I like this format a lot for a few reasons: 1) adding additional axes only minimally affects the compression ratio, 2) indexing (in principle) should be faster, especially for 2d arrays, and 3) because a 2d array is just a csr/csc matrix with the data, indices, and indptr attributes. Consequently, you get scipy/scipy#8162 for free. These arrays could well with anything that expects either a scipy.sparse matrix or np.ndarray. Good candidates for all of this are scikit-learn, dask, and xarray. I only just started the project but I like the idea of adding this to pydata/sparse when it's ready.
hameerabbasi commentedon Jun 11, 2019
Hi @daletovar! I’m glad... I lazed a bit too long on this but I’m happy someone really is working on it.
We don’t need a complete implementation, feel free to work within PyData/Sparse. I know some things like coverage and documentation will slow you down, so it’s better to work on a fork.
Feel free to reach out to me personally via email if you’d like to coordinate and concentrate efforts. I’d love to work with you on this! From your last commits, you seem to have a great handle on PyData/Sparse!
daletovar commentedon Jun 11, 2019
Thanks @hameerabbasi, I'd love to coordinate efforts on this.
hameerabbasi commentedon Jun 11, 2019
Feel free to reach out to me at
habbasi [at] quansight [dot] com
. 😄 Let's set up a meeting!daletovar commentedon Jun 11, 2019
Perfect, that sounds great!
ivirshup commentedon Nov 13, 2019
I'm really looking forward to this! Is there more work that needs to be done before the
GXCS
format can be in a release?daletovar commentedon Nov 13, 2019
@ivirshup I'm not sure exactly when, but there are some things I'd like to add (faster slicing, stack, concatenate) plus some documentation. I don't think it'll be too long though.
ivirshup commentedon Nov 13, 2019
Sounds good! Would you mind if I took a crack at implementing any of those in the meantime?
daletovar commentedon Nov 13, 2019
That would be fantastic! I was going to add stack and concatenate in the next couple of days. Faster slicing would be a huge help. We also eventually need to add elementwise ops, reductions, dot, tensordot, and matmul. Any contributions on any of that would be greatly appreciated.
luk-f-a commentedon Jun 27, 2020
hi! was this closed by #258 ?
hameerabbasi commentedon Jun 27, 2020
It still isn't complete, nor is it public API, so I haven't closed it yet.
BorisTheBrave commentedon Jan 1, 2021
I found the idea of CSF intgruiging, but poorly explained in general, so I have made my own attempt in a blog post. Would love a review from @ShadenSmith .
I haven't compared CSF to CSD or GCRS, though.