Skip to content

Conversation

@bistline
Copy link
Contributor

BACKGROUND & CHANGES

This update adds a secondary data inspection check when determining if an AnnData file is indexed using Ensembl gene IDs (e.g. ENSG0000...) rather than traditional gene symbols. Instead of relying on adata.var.index.name to declare gene_ids, now the adata.var_names list is dumped and the first 4 characters from every entry is added to a set. If the matrix indexes on IDs, then the set should have only one entry of ENSG.

MANUAL TESTING

  1. Initialize local environment as normal and enter a Python shell
  2. Load the AnnDataIngestor module and initialize with a small CellXGene formatted file:
from anndata_ import AnnDataIngestor
ingestor = AnnDataIngestor('../tests/data/anndata/cellxgene.human_liver_b_cells.h5ad', 'addedfeed000000000000000', 'dec0dedfeed1111111111111')
adata = ingestor.obtain_adata()
  1. Confirm that the file does not have a value declared for var.index.name but still returns that it is indexed on Ensembl IDs:
>>> print(adata.var.index.name)
None
>>> ingestor.check_ensembl_index(adata)
True
  1. (Optional): Confirm the exported features file has both gene IDs and symbols:
import pandas as pd
feature_frame = adata.var.feature_name
index = True
pd.DataFrame(feature_frame).to_csv(
    "h5ad_frag.features.processed.tsv.gz",
    sep="\t",
    index=index,
    header=False,
    compression="gzip",
)

### exported file should look like this
ENSG00000000003	TSPAN6
ENSG00000000005	TNMD
ENSG00000000419	DPM1
ENSG00000000457	SCYL3
ENSG00000000460	C1orf112
ENSG00000000938	FGR
ENSG00000000971	CFH
...

@bistline bistline requested review from eweitz and jlchang October 28, 2024 17:44
Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, thanks for the quick solution!

@codecov
Copy link

codecov bot commented Oct 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.70%. Comparing base (a11e54c) to head (79a6800).
Report is 4 commits behind head on development.

Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #370      +/-   ##
===============================================
+ Coverage        75.67%   75.70%   +0.03%     
===============================================
  Files               30       30              
  Lines             4394     4400       +6     
===============================================
+ Hits              3325     3331       +6     
  Misses            1069     1069              
Files with missing lines Coverage Δ
ingest/anndata_.py 88.57% <100.00%> (+0.51%) ⬆️

Copy link
Contributor

@jlchang jlchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual review - code works as expected.

I did some searching on adata.var.index.name usage. The check for assignment of gene_id to adata.var.index.name is both legacy and a product of too narrow a sample size of exemplar data. Similar to the test case provided for this PR, several tutorials from Parse Biosciences deliberately assign adata.var.index.name to None (example 1 2. The scanpy developers recommend keeping adata.var.index.name as a unique canonical identifier but point out it can be useful to point the index to gene_symbols for certain use cases. This all suggests that there isn't a strong convention for setting adata.var.index.name with consistency.

I'm satisfied that the checking for "ENS" across gene_id[:3] is a good, general check for AnnData that use Ensembl identifiers (ie. any AnnData that conforms to CellxGene schema) BUT for data whose species don't have Ensembl genome annotations (ie. niche species) AND, very likely, any AnnData that isn't pure RNAseq data (For example, experiments with Feature Barcoding would have both Ensembl genes as features AND the cell surface proteins that were assayed so len(prefixes) won't evaluate to 1 but the experiment would have feature_names that should be accounted for). I guess this is a long way of saying that we will eventually need more than the two checks in this PR, but we can cross that bridge when we encounter a dataset that needs it. (The adata.var.index.name == 'gene_ids' check is legacy, but perhaps not obsolete, even though current h5ad refer to "features" and not "genes")

FWIW, LaminLabs uses a library with a function to standardize the Gene symbols -> Ensembl IDs process similar to Ideogram's gene name-to-id maps - the only caveat with such an approach is whether we'd be referencing the specific genome annotation version that the data was analyzed under...

@eweitz eweitz merged commit 00109a5 into development Oct 29, 2024
5 checks passed
@bistline bistline deleted the jb-anndata-ensembl-check branch December 9, 2024 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants