Add secondary data check for Ensembl gene ID index (SCP-5853) #370

bistline · 2024-10-28T17:44:32Z

BACKGROUND & CHANGES

This update adds a secondary data inspection check when determining if an AnnData file is indexed using Ensembl gene IDs (e.g. ENSG0000...) rather than traditional gene symbols. Instead of relying on adata.var.index.name to declare gene_ids, now the adata.var_names list is dumped and the first 4 characters from every entry is added to a set. If the matrix indexes on IDs, then the set should have only one entry of ENSG.

MANUAL TESTING

Initialize local environment as normal and enter a Python shell
Load the AnnDataIngestor module and initialize with a small CellXGene formatted file:

from anndata_ import AnnDataIngestor
ingestor = AnnDataIngestor('../tests/data/anndata/cellxgene.human_liver_b_cells.h5ad', 'addedfeed000000000000000', 'dec0dedfeed1111111111111')
adata = ingestor.obtain_adata()

Confirm that the file does not have a value declared for var.index.name but still returns that it is indexed on Ensembl IDs:

>>> print(adata.var.index.name)
None
>>> ingestor.check_ensembl_index(adata)
True

(Optional): Confirm the exported features file has both gene IDs and symbols:

import pandas as pd
feature_frame = adata.var.feature_name
index = True
pd.DataFrame(feature_frame).to_csv(
    "h5ad_frag.features.processed.tsv.gz",
    sep="\t",
    index=index,
    header=False,
    compression="gzip",
)

### exported file should look like this
ENSG00000000003	TSPAN6
ENSG00000000005	TNMD
ENSG00000000419	DPM1
ENSG00000000457	SCYL3
ENSG00000000460	C1orf112
ENSG00000000938	FGR
ENSG00000000971	CFH
...

eweitz

Code looks good, thanks for the quick solution!

codecov · 2024-10-28T19:42:45Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.70%. Comparing base (a11e54c) to head (79a6800).
Report is 4 commits behind head on development.

Additional details and impacted files

@@               Coverage Diff               @@
##           development     #370      +/-   ##
===============================================
+ Coverage        75.67%   75.70%   +0.03%     
===============================================
  Files               30       30              
  Lines             4394     4400       +6     
===============================================
+ Hits              3325     3331       +6     
  Misses            1069     1069

Files with missing lines	Coverage Δ
ingest/anndata_.py	`88.57% <100.00%> (+0.51%)`	⬆️

jlchang

Manual review - code works as expected.

I did some searching on adata.var.index.name usage. The check for assignment of gene_id to adata.var.index.name is both legacy and a product of too narrow a sample size of exemplar data. Similar to the test case provided for this PR, several tutorials from Parse Biosciences deliberately assign adata.var.index.name to None (example 1 2. The scanpy developers recommend keeping adata.var.index.name as a unique canonical identifier but point out it can be useful to point the index to gene_symbols for certain use cases. This all suggests that there isn't a strong convention for setting adata.var.index.name with consistency.

I'm satisfied that the checking for "ENS" across gene_id[:3] is a good, general check for AnnData that use Ensembl identifiers (ie. any AnnData that conforms to CellxGene schema) BUT for data whose species don't have Ensembl genome annotations (ie. niche species) AND, very likely, any AnnData that isn't pure RNAseq data (For example, experiments with Feature Barcoding would have both Ensembl genes as features AND the cell surface proteins that were assayed so len(prefixes) won't evaluate to 1 but the experiment would have feature_names that should be accounted for). I guess this is a long way of saying that we will eventually need more than the two checks in this PR, but we can cross that bridge when we encounter a dataset that needs it. (The adata.var.index.name == 'gene_ids' check is legacy, but perhaps not obsolete, even though current h5ad refer to "features" and not "genes")

FWIW, LaminLabs uses a library with a function to standardize the Gene symbols -> Ensembl IDs process similar to Ideogram's gene name-to-id maps - the only caveat with such an approach is whether we'd be referencing the specific genome annotation version that the data was analyzed under...

bistline added 2 commits October 28, 2024 13:26

Add secondary data check for Ensembl gene ID index

616c2bd

adding negative test

e6e61c8

bistline requested review from eweitz and jlchang October 28, 2024 17:44

eweitz approved these changes Oct 28, 2024

View reviewed changes

Only check first 3 chars to allow for other species

79a6800

jlchang approved these changes Oct 29, 2024

View reviewed changes

eweitz merged commit 00109a5 into development Oct 29, 2024
5 checks passed

eweitz mentioned this pull request Oct 30, 2024

Update to Ingest 1.37.0: Better gene search for CELLxGENE AnnData broadinstitute/single_cell_portal_core#2163

Merged

bistline deleted the jb-anndata-ensembl-check branch December 9, 2024 15:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add secondary data check for Ensembl gene ID index (SCP-5853) #370

Add secondary data check for Ensembl gene ID index (SCP-5853) #370

Uh oh!

bistline commented Oct 28, 2024

Uh oh!

eweitz left a comment

Uh oh!

codecov bot commented Oct 28, 2024 •

edited

Loading

Uh oh!

jlchang left a comment •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Add secondary data check for Ensembl gene ID index (SCP-5853) #370

Add secondary data check for Ensembl gene ID index (SCP-5853) #370

Uh oh!

Conversation

bistline commented Oct 28, 2024

BACKGROUND & CHANGES

MANUAL TESTING

Uh oh!

eweitz left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Oct 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jlchang left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Oct 28, 2024 •

edited

Loading

jlchang left a comment •

edited

Loading