
Commit cbc01d1

Speed up clonotype distance calculation (#470)

Authored by felixpetschko, pre-commit-ci[bot] and grst.

* compute_distances: new version added
* [pre-commit.ci] auto fixes from pre-commit.com hooks (applied repeatedly throughout; see https://pre-commit.ci)
* data type for csr max computation set to int16 to allow negative values
* data type also for csr_min set to int16 to allow negative values
* raise AssertionError instead of assert False
* matrix shape changed in lookup function
* array size changed for v-gene and column matching
* naming conventions
* use self._chain_count2 for asymmetric matrix for filter_chain_count
* reverse table assignment adjusted in lookup function
* changed AssertionError to NotImplementedError
* removed unnecessary for loop
* refactored lookup function and added docstring
* adapted docstring of lookup function
* j_gene matching implemented
* fixed typo in j_gene parameter
* bind loop variables in function match_gene_segment
* set data types of csr matrix arrays explicitly
* docstring and data type checks added for csr_min and csr_max
* changed data types of indices and indptr arrays for the csr matrices in _dist_for_clonotype and lookup to int64 instead of int32
* changed data type checks to max value checks
* documentation adaptations
* update pre-commit config; reformat
* deleted print statement
* implemented graph partitioning method "fastgreedy"
* test case for j_gene added
* refactored filter_chain_count and added documentation
* documentation and refactoring
* rerun tutorial
* update CHANGELOG
* attempt to fix conda CI; update conda CI; fix Python version in conda CI; override Python version

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Gregor Sturm <[email protected]>
1 parent 83e0961 commit cbc01d1

File tree

8 files changed: +402 additions, −237 deletions

.github/workflows/conda.yaml

Lines changed: 6 additions & 7 deletions

```diff
@@ -22,27 +22,26 @@ jobs:
       matrix:
         include:
           - os: ubuntu-latest
-            python: "3.12"
+            python: "3.11"

     env:
       OS: ${{ matrix.os }}
       PYTHON: ${{ matrix.python }}

     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4

       - name: Setup Miniconda
-        uses: conda-incubator/setup-miniconda@v2
+        uses: conda-incubator/setup-miniconda@v3
         with:
-          miniforge-variant: Mambaforge
-          miniforge-version: latest
+          mamba-version: "*"
           channels: conda-forge,bioconda
           channel-priority: strict
-          python-version: ${{ matrix.python-version }}
+          python-version: ${{ matrix.python }}

       - name: install conda build
         run: |
-          mamba install -y boa conda-verify
+          mamba install -y boa conda-verify python=${{ matrix.python }}
         shell: bash

       - name: build and test package
```

CHANGELOG.md

Lines changed: 6 additions & 4 deletions

```diff
@@ -8,20 +8,22 @@ and this project adheres to [Semantic Versioning][].
 [keep a changelog]: https://keepachangelog.com/en/1.0.0/
 [semantic versioning]: https://semver.org/spec/v2.0.0.html

-## [Unreleased]
+## v0.18.0

 ### Additions

 - Isotypically included B cells are now labelled as `receptor_subtype="IGH+IGK/L"` instead of `ambiguous` in `tl.chain_qc` ([#537](https://github.com/scverse/scirpy/pull/537)).
 - Added the `normalized_hamming` metric to `pp.ir_dist` that accounts for differences in CDR3 sequence length ([#512](https://github.com/scverse/scirpy/pull/512)).
+- `tl.define_clonotype_clusters` now has an option to require J genes to match (`same_j_gene=True`) in addition to `same_v_gene`. ([#470](https://github.com/scverse/scirpy/pull/470)).

 ### Performance improvements

-- The hamming distance was reimplemented with numba, achieving a significant speedup ([#512](https://github.com/scverse/scirpy/pull/512)).
+- The hamming distance has been reimplemented with numba, achieving a significant speedup ([#512](https://github.com/scverse/scirpy/pull/512)).
+- Clonotype clustering has been accelerated leveraging sparse matrix operations ([#470](https://github.com/scverse/scirpy/pull/470)).

 ### Fixes

-- Fix that pl.clonotype_network couldn't use non-standard obsm key ([#545](https://github.com/scverse/scirpy/pull/545)).
+- Fix that `pl.clonotype_network` couldn't use non-standard obsm key ([#545](https://github.com/scverse/scirpy/pull/545)).

 ### Other changes

@@ -54,7 +56,7 @@ and this project adheres to [Semantic Versioning][].

 ### Fixes

-- Fix issue with detecting the number of available CPUs on MacOD ([#518](https://github.com/scverse/scirpy/pull/502))
+- Fix issue with detecting the number of available CPUs on MacOS ([#518](https://github.com/scverse/scirpy/pull/502))

 ## v0.16.1
```
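The changelog entry above mentions the `normalized_hamming` metric, which accounts for CDR3 sequence length. As a rough illustration only: a minimal sketch of a length-normalized hamming distance, where the mismatch count is expressed as a percentage of sequence length. This is an assumption about the general idea, not scirpy's actual implementation (which lives in `pp.ir_dist` and may scale and handle unequal lengths differently).

```python
def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance as a percentage of sequence length.

    Hypothetical sketch: sequences of unequal length are treated as
    maximally distant (an assumption, not necessarily scirpy's behavior).
    """
    if len(a) != len(b):
        return 100.0
    # Count positions where the two sequences differ
    mismatches = sum(x != y for x, y in zip(a, b))
    return 100.0 * mismatches / len(a)


# One substitution in a 12-residue CDR3 gives a small normalized distance
print(normalized_hamming("CASSLGTDTQYF", "CASSLGADTQYF"))
```

Normalizing by length keeps a single substitution in a long CDR3 from counting as much as one in a short CDR3, which is the motivation stated in the changelog entry.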

docs/tutorials/tutorial_3k_tcr.ipynb

Lines changed: 17 additions & 91 deletions
Large diffs are not rendered by default.

src/scirpy/ir_dist/_clonotype_neighbors.py

Lines changed: 273 additions & 96 deletions
Large diffs are not rendered by default.

src/scirpy/ir_dist/_util.py

Lines changed: 65 additions & 36 deletions

```diff
@@ -233,61 +233,90 @@ def n_cols(self):

     def lookup(
         self,
-        object_id: int,
-        forward_lookup_table: str,
-        reverse_lookup_table: str | None = None,
-    ) -> coo_matrix | np.ndarray:
-        """Get ids of neighboring objects from a lookup table.
-
-        Performs the following lookup:
-
-            object_id -> dist_mat -> neighboring features -> neighboring objects.
-
-        where an object is a clonotype in our case (but could be used for something else)
-
-        "nan"s are not looked up via the distance matrix, they return a row of zeros
+        object_ids: np.ndarray[int],
+        forward_lookup_table_name: str,
+        reverse_lookup_table_name: str | None = None,
+    ) -> sp.csr_matrix:
+        """
+        Creates a distance matrix between objects with the given ids based on a feature distance matrix.
+
+        To get the distance between two objects we need to look up the features of the two objects.
+        The distance between those two features is then the distance between the two objects.
+
+        To do so, we first use the `object_ids` together with the `forward_lookup_table` to look up
+        the indices of the objects in the feature `distance_matrix`. Afterwards we pick the according row for each object
+        out of the `distance_matrix` and construct a `rows` matrix (n_object_ids x n_features).
+
+        "nan"s (index = -1) are not looked up in the feature `distance_matrix`, they return a row of zeros
         instead.

+        Then we use the entries of the `reverse_lookup_table` to construct a `reverse_lookup_matrix` (n_features x n_object_ids).
+        By multiplying the `rows` matrix with the `reverse_lookup_matrix` we get the final `object_distance_matrix` that shows
+        the distances between the objects with the given `object_ids` regarding a certain feature column.
+
+        It might not be obvious at the first sight that the matrix multiplication between `rows` and `reverse_lookup_matrix` gives
+        us the desired result. But this trick allows us to use the built-in sparse matrix multiplication of `scipy.sparse`
+        for enhanced performance.
+
         Parameters
         ----------
-        object_id
-            The row index of the feature_table.
-        forward_lookup_table
+        object_ids
+            The row indices of the feature_table.
+        forward_lookup_table_name
             The unique identifier of a lookup table previously added via
             `add_lookup_table`.
-        reverse_lookup_table
+        reverse_lookup_table_name
             The unique identifier of the lookup table used for the reverse lookup.
             If not provided will use the same lookup table for forward and reverse
             lookup. This is useful to calculate distances across features from
             different columns of the feature table (e.g. primary and secondary VJ chains).
+
+        Returns
+        -------
+        object_distance_matrix
+            A CSR matrix containing the pairwise distances between objects with the
+            given `object_ids` regarding a certain feature column.
         """
-        distance_matrix_name, forward, reverse = self.lookups[forward_lookup_table]
+        distance_matrix_name, forward_lookup_table, reverse_lookup_table = self.lookups[forward_lookup_table_name]

-        if reverse_lookup_table is not None:
-            distance_matrix_name_reverse, _, reverse = self.lookups[reverse_lookup_table]
+        if reverse_lookup_table_name is not None:
+            distance_matrix_name_reverse, _, reverse_lookup_table = self.lookups[reverse_lookup_table_name]
             if distance_matrix_name != distance_matrix_name_reverse:
                 raise ValueError("Forward and reverse lookup tablese must be defined " "on the same distance matrices.")

         distance_matrix = self.distance_matrices[distance_matrix_name]
-        idx_in_dist_mat = forward[object_id]
-        if idx_in_dist_mat == -1:  # nan
-            return reverse.empty()
-        else:
-            # get distances from the distance matrix...
-            row = distance_matrix[idx_in_dist_mat, :]
-
-            if reverse.is_boolean:
-                assert (
-                    len(row.indices) == 1  # type: ignore
-                ), "Boolean reverse lookup only works for identity distance matrices."
-                return reverse[row.indices[0]]  # type: ignore
-            else:
-                # ... and get column indices directly from sparse row
-                # sum concatenates coo matrices
-                return merge_coo_matrices(
-                    (reverse[i] * multiplier for i, multiplier in zip(row.indices, row.data, strict=False)),  # type: ignore
-                    shape=(1, reverse.size),
-                )
+
+        if np.max(distance_matrix.data) > np.iinfo(np.uint8).max:
+            raise OverflowError(
+                "The data values in the distance scipy.sparse.csr_matrix exceed the maximum value for uint8 (255)"
+            )
+
+        indices_in_dist_mat = forward_lookup_table[object_ids]
+        indptr = np.empty(distance_matrix.indptr.shape[0] + 1, dtype=np.int64)
+        indptr[:-1] = distance_matrix.indptr
+        indptr[-1] = indptr[-2]
+        distance_matrix_extended = sp.csr_matrix(
+            (distance_matrix.data.astype(np.uint8), distance_matrix.indices, indptr),
+            shape=(distance_matrix.shape[0] + 1, distance_matrix.shape[1]),
+        )
+        rows = distance_matrix_extended[indices_in_dist_mat, :]
+
+        reverse_matrix_data = [np.array([], dtype=np.uint8)] * rows.shape[1]
+        reverse_matrix_col = [np.array([], dtype=np.int64)] * rows.shape[1]
+        nnz_array = np.zeros(rows.shape[1], dtype=np.int64)
+
+        for key, value in reverse_lookup_table.lookup.items():
+            reverse_matrix_data[key] = value.data
+            reverse_matrix_col[key] = value.col
+            nnz_array[key] = value.nnz
+
+        data = np.concatenate(reverse_matrix_data)
+        col = np.concatenate(reverse_matrix_col)
+        indptr = np.concatenate([np.array([0], dtype=np.int64), np.cumsum(nnz_array)])
+
+        reverse_matrix = sp.csr_matrix((data, col, indptr), shape=(rows.shape[1], reverse_lookup_table.size))
+        object_distance_matrix = rows * reverse_matrix
+        return object_distance_matrix

     def add_distance_matrix(
         self,
```
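The docstring in the new `lookup` describes a sparse-multiplication trick: pick one row of the feature distance matrix per object (with an appended all-zero row so that index -1, i.e. "nan", selects zeros), then multiply by a features-by-objects reverse matrix. A minimal self-contained sketch with toy data (not scirpy's actual lookup tables) showing why `rows @ reverse_matrix` yields object-to-object distances:

```python
import numpy as np
import scipy.sparse as sp

# Toy feature distance matrix for 3 features (nonzero = within cutoff)
D = sp.csr_matrix(
    np.array(
        [
            [1, 2, 0],
            [2, 1, 0],
            [0, 0, 1],
        ],
        dtype=np.uint8,
    )
)

# Feature index per object; -1 encodes "nan" (no feature)
feat = np.array([0, 2, -1, 1])

# Append an all-zero row so that index -1 selects it (the "nan" trick)
indptr = np.empty(D.indptr.shape[0] + 1, dtype=np.int64)
indptr[:-1] = D.indptr
indptr[-1] = indptr[-2]
D_ext = sp.csr_matrix((D.data, D.indices, indptr), shape=(D.shape[0] + 1, D.shape[1]))

# Forward lookup: one distance-matrix row per object (n_objects x n_features)
rows = D_ext[feat, :]

# Reverse matrix: which objects carry which feature (n_features x n_objects)
n_obj = len(feat)
has_feat = feat >= 0
R = sp.csr_matrix(
    (np.ones(int(has_feat.sum()), dtype=np.uint8), (feat[has_feat], np.flatnonzero(has_feat))),
    shape=(D.shape[1], n_obj),
)

# The multiplication scatters each feature distance onto the objects
# carrying that feature, giving pairwise object distances directly
obj_dist = rows @ R
print(obj_dist.toarray())
```

Object 2 (the "nan") ends up with an all-zero row and column, matching the docstring's statement that nans return zeros, while all other pairs inherit the distance between their features.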
Binary file (4.93 MB) not shown.

src/scirpy/tests/test_clonotypes.py

Lines changed: 21 additions & 0 deletions

```diff
@@ -2,6 +2,7 @@
 import sys
 from typing import cast

+import anndata as ad
 import numpy as np
 import numpy.testing as npt
 import pandas as pd
@@ -339,3 +340,23 @@ def test_clonotype_convergence(adata_clonotype):
             categories=["convergent", "not convergent"],
         ),
     )
+
+
+def test_j_gene_matching():
+    from . import TESTDATA
+
+    data = ad.read_h5ad(TESTDATA / "clonotypes_test_data/j_gene_test_data.h5ad")
+
+    ir.tl.define_clonotype_clusters(
+        data,
+        sequence="nt",
+        metric="normalized_hamming",
+        receptor_arms="all",
+        dual_ir="any",
+        same_j_gene=True,
+        key_added="test_j_gene",
+    )
+
+    clustering = data.obs["test_j_gene"].tolist()
+    expected = ["0", "0", "0", "0", "0", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2"]
+    assert np.array_equal(clustering, expected)
```

src/scirpy/tl/_clonotypes.py

Lines changed: 14 additions & 3 deletions

```diff
@@ -197,9 +197,10 @@ def define_clonotype_clusters(
     receptor_arms: Literal["VJ", "VDJ", "all", "any"] = "all",
     dual_ir: Literal["primary_only", "all", "any"] = "any",
     same_v_gene: bool = False,
+    same_j_gene: bool = False,
     within_group: Sequence[str] | str | None = "receptor_type",
     key_added: str | None = None,
-    partitions: Literal["connected", "leiden"] = "connected",
+    partitions: Literal["connected", "leiden", "fastgreedy"] = "connected",
     resolution: float = 1,
     n_iterations: int = 5,
     distance_key: str | None = None,
@@ -249,12 +250,19 @@ def define_clonotype_clusters(

     partitions
         How to find graph partitions that define a clonotype.
-        Possible values are `leiden`, for using the "Leiden" algorithm and
+        Possible values are `leiden`, for using the "Leiden" algorithm,
+        `fastgreedy` for using the "Fastgreedy" algorithm and
         `connected` to find fully connected sub-graphs.

-        The difference is that the Leiden algorithm further divides
+        The difference is that the Leiden and Fastgreedy algorithms further divide
         fully connected subgraphs into highly-connected modules.

+        "Leiden" finds the community structure of the graph using the
+        Leiden algorithm of Traag, van Eck & Waltman.
+
+        "Fastgreedy" finds the community structure of the graph according to the
+        algorithm of Clauset et al based on the greedy optimization of modularity.
+
     resolution
         `resolution` parameter for the leiden algorithm.
     n_iterations
@@ -289,6 +297,7 @@ def define_clonotype_clusters(
         receptor_arms=receptor_arms,  # type: ignore
         dual_ir=dual_ir,  # type: ignore
         same_v_gene=same_v_gene,
+        same_j_gene=same_j_gene,
         match_columns=within_group,
         distance_key=distance_key,
         sequence_key="junction_aa" if sequence == "aa" else "junction",
@@ -304,6 +313,8 @@ def define_clonotype_clusters(
             resolution_parameter=resolution,
             n_iterations=n_iterations,
         )
+    elif partitions == "fastgreedy":
+        part = g.community_fastgreedy().as_clustering()
     else:
         part = g.clusters(mode="weak")
```
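With `partitions="connected"`, clonotype clusters are the weakly connected components of the clonotype network, while `leiden` and `fastgreedy` subdivide dense components further by optimizing modularity. A minimal sketch of the "connected" case, substituting `scipy.sparse.csgraph` for the igraph calls scirpy actually uses (the toy edge list is invented for illustration):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

# Toy clonotype network: 6 clonotypes, two disconnected groups
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
row, col = zip(*edges)
adj = sp.coo_matrix((np.ones(len(edges)), (row, col)), shape=(6, 6))

# Equivalent in spirit to igraph's g.clusters(mode="weak"):
# every clonotype reachable from another lands in the same partition
n_parts, labels = connected_components(adj, directed=False)
print(n_parts, labels)
```

A modularity-based method such as fastgreedy could split either of these two components further if it contained weakly linked dense sub-modules, which is exactly the behavior difference the updated docstring describes.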
