Skip to content

[BUG] Passing 'groups' parameter to rapids_singlecell.tl.rank_genes_groups gives all 0 scores/NA LFCs #650

@dguin

Description

@dguin

Describe the bug

When calling rsc.tl.rank_genes_groups with the groups=[...] parameter and method='t-test' or method='wilcoxon', the resulting scores, p-values, and log fold changes are all zero or NaN (scores are okay for wilcoxon). Calling the same function on the same AnnData without the groups= parameter (computing all groups) returns correct, non-degenerate values for every group including the one that was previously broken. The method='wilcoxon' path is partially affected: scores compute correctly with groups=, but pvals, pvals_adj, and logfoldchanges are still all zero / NaN.

scanpy's sc.tl.rank_genes_groups on the identical AnnData with the same groupby and groups= arguments returns correct values, so the data itself is valid.

Steps/Code to reproduce bug

import cupy as cp
import cupyx.scipy.sparse as cusp
import anndata as ad
import numpy as np
import pandas as pd
import rapids_singlecell as rsc
import scanpy as sc

n_cells, n_genes = 5000, 1000
X = cusp.random(n_cells, n_genes, density=0.1, dtype=cp.float32, format='csr')
obs = pd.DataFrame({
    'group': pd.Categorical(np.random.choice(list('ABCDE'), n_cells))
})
adata = ad.AnnData(X=X, obs=obs)
rsc.pp.normalize_total(adata)
rsc.pp.log1p(adata)

# Works: scores, pvals, lfc all populated correctly for every group
rsc.tl.rank_genes_groups(adata, groupby='group', method='t-test', use_raw=False)
print("All groups (no `groups=`):")
print(pd.DataFrame(adata.uns['rank_genes_groups']['scores']).head())
print(pd.DataFrame(adata.uns['rank_genes_groups']['pvals']).head())
print(pd.DataFrame(adata.uns['rank_genes_groups']['logfoldchanges']).head())

# Broken: scores all 0, pvals all 0, logfoldchanges all NaN
rsc.tl.rank_genes_groups(adata, groupby='group', groups=['A'],
                         method='t-test', use_raw=False)
print("\nWith `groups=['A']`:")
print(pd.DataFrame(adata.uns['rank_genes_groups']['scores']).head())
print(pd.DataFrame(adata.uns['rank_genes_groups']['pvals']).head())
print(pd.DataFrame(adata.uns['rank_genes_groups']['logfoldchanges']).head())

# scanpy's equivalent on the same data works fine for comparison
adata_cpu = adata.copy()
rsc.get.anndata_to_CPU(adata_cpu)
sc.tl.rank_genes_groups(adata_cpu, groupby='group', groups=['A'],
                        method='t-test', use_raw=False)
print("\nscanpy with `groups=['A']`:")
print(pd.DataFrame(adata_cpu.uns['rank_genes_groups']['scores']).head())

Expected behavior

Calling rsc.tl.rank_genes_groups with groups=[X] should produce the same scores, p-values, adjusted p-values, and log fold changes for group X as calling it without the groups= argument and indexing into the result for group X. Currently the two code paths disagree:

  • Without groups=: correct values for all groups.
  • With groups=[X]: scores 0, pvals 0, logfoldchanges NaN (t-test); scores correct but pvals/lfc still 0/NaN (wilcoxon).

Environment details (please complete the following information):

  • Environment location: Cloud (Lambda Labs / Lambda Cloud GPU VM)
  • Linux Distro/Architecture: Ubuntu 22.04 amd64
  • GPU Model/Driver: NVIDIA H100 PCIe/Driver Version: 580.105.08
  • CUDA: cu13
  • Method of Rapids install: pip
    • pip install 'rapids-singlecell[rapids-cu12]' --extra-index-url=https://pypi.nvidia.com
    • Also tried pip install 'rapids-singlecell[rapids-cu13]' --extra-index-url=https://pypi.nvidia.com which gives a similar output

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions