-
Notifications
You must be signed in to change notification settings - Fork 0
Add secondary data check for Ensembl gene ID index (SCP-5853) #370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eweitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, thanks for the quick solution!
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## development #370 +/- ##
===============================================
+ Coverage 75.67% 75.70% +0.03%
===============================================
Files 30 30
Lines 4394 4400 +6
===============================================
+ Hits 3325 3331 +6
Misses 1069 1069
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual review - code works as expected.
I did some searching on adata.var.index.name usage. The check for assignment of gene_id to adata.var.index.name is both legacy and a product of too narrow a sample size of exemplar data. Similar to the test case provided for this PR, several tutorials from Parse Biosciences deliberately assign adata.var.index.name to None (example 1 2. The scanpy developers recommend keeping adata.var.index.name as a unique canonical identifier but point out it can be useful to point the index to gene_symbols for certain use cases. This all suggests that there isn't a strong convention for setting adata.var.index.name with consistency.
I'm satisfied that the checking for "ENS" across gene_id[:3] is a good, general check for AnnData that use Ensembl identifiers (ie. any AnnData that conforms to CellxGene schema) BUT for data whose species don't have Ensembl genome annotations (ie. niche species) AND, very likely, any AnnData that isn't pure RNAseq data (For example, experiments with Feature Barcoding would have both Ensembl genes as features AND the cell surface proteins that were assayed so len(prefixes) won't evaluate to 1 but the experiment would have feature_names that should be accounted for). I guess this is a long way of saying that we will eventually need more than the two checks in this PR, but we can cross that bridge when we encounter a dataset that needs it. (The adata.var.index.name == 'gene_ids' check is legacy, but perhaps not obsolete, even though current h5ad refer to "features" and not "genes")
FWIW, LaminLabs uses a library with a function to standardize the Gene symbols -> Ensembl IDs process similar to Ideogram's gene name-to-id maps - the only caveat with such an approach is whether we'd be referencing the specific genome annotation version that the data was analyzed under...
BACKGROUND & CHANGES
This update adds a secondary data inspection check when determining if an AnnData file is indexed using Ensembl gene IDs (e.g. ENSG0000...) rather than traditional gene symbols. Instead of relying on
adata.var.index.nameto declaregene_ids, now theadata.var_nameslist is dumped and the first 4 characters from every entry is added to a set. If the matrix indexes on IDs, then the set should have only one entry ofENSG.MANUAL TESTING
AnnDataIngestormodule and initialize with a small CellXGene formatted file:var.index.namebut still returns that it is indexed on Ensembl IDs: