@bistline
Contributor

BACKGROUND & CHANGES

This fixes an issue in AnnData metadata extraction where columns containing boolean data fail to extract, causing the entire ingest run to fail. The root cause is that the pandas is_numeric_dtype() function returns True for boolean data, so boolean columns were being treated as numeric. Ingest now checks for a boolean column first and coerces it to a group-based annotation, storing True/False values as strings instead of native booleans. The rest of the type-checking logic is unchanged.
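For reviewers who want to see the ordering issue in isolation, here is a minimal sketch of the "check boolean first" approach described above. It is not the code from this PR: `classify_column` and its return shape are hypothetical names used only for illustration, while `is_bool_dtype` and `is_numeric_dtype` are the real functions from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def classify_column(column):
    """Hypothetical helper: return (annotation_type, values) for one obs column."""
    if is_bool_dtype(column):
        # Booleans must be checked first, because is_numeric_dtype()
        # also returns True for boolean data.
        return "group", column.astype(str)
    if is_numeric_dtype(column):
        return "numeric", column
    # Everything else is treated as a group annotation in this sketch.
    return "group", column.astype(str)
```

With this ordering, a boolean column such as is_primary_data comes back as a group annotation whose values are the strings "True" and "False", and numeric columns are handled as before.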

MANUAL TESTING

  1. Initialize your environment as normal
  2. Run the command to extract a flat metadata file using the test AnnData file that contains a boolean column:
python3 ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/anndata_boolean_test.h5ad --extract "['metadata']"
  3. You should see output similar to the following. Note the is_primary_data column listed at the end of obs:
AnnData object with n_obs × n_vars = 40 × 4
    obs: 'donor_id', 'biosample_id', 'sex', 'species', 'species__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'organ', 'organ__ontology_label', 'disease', 'disease__ontology_label', 'is_primary_data'
    obsm: 'X_tsne', 'spatial'
distinct_id: 2f30ec50-a04d-4d43-8fd1-b136a2045079
studyAccession: SCPdev
fileName: 5dd5ae25421aa910a723a337
fileType: input_validation_bypassed
fileSize: 1
trigger: dev-mode
logger: ingest-pipeline
appId: single-cell-portal
action: ingest_anndata
status: success
functionName: extract_from_anndata
perfTime: 0.335
  4. Open the h5ad_frag.metadata.tsv.gz file and confirm that the entire last column is of type GROUP and that all of its values are False
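The last check can also be scripted instead of done by eye. This is only a sketch under stated assumptions: it assumes the extracted file is a gzipped TSV whose first row holds the column names and whose second row holds the annotation types, and `summarize_last_column` is a hypothetical helper name, not part of the ingest pipeline.

```python
import csv
import gzip


def summarize_last_column(path):
    """Return (name, type label, distinct values) for the last column of a gzipped TSV."""
    with gzip.open(path, "rt", newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    # Assumed layout: row 0 = column names, row 1 = annotation types, rest = data.
    header, type_row, data_rows = rows[0], rows[1], rows[2:]
    return header[-1], type_row[-1], {row[-1] for row in data_rows}
```

Under those assumptions, calling summarize_last_column("h5ad_frag.metadata.tsv.gz") should report is_primary_data as the column name, a group type label (compare case-insensitively), and a single distinct value of "False".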

@bistline bistline requested review from eweitz and jlchang October 23, 2024 15:47
Member

@eweitz eweitz left a comment

Code looks good! Together with retry-on-OOM, fixing this false-positive metadata validation failure seems like it gets us close to unblocking (and perhaps even solves) full ingest for these files.

@codecov
codecov bot commented Oct 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.67%. Comparing base (941b2fe) to head (f062a1e).
Report is 7 commits behind head on development.

Additional details and impacted files

@@               Coverage Diff               @@
##           development     #368      +/-   ##
===============================================
+ Coverage        75.66%   75.67%   +0.01%     
===============================================
  Files               30       30              
  Lines             4392     4394       +2     
===============================================
+ Hits              3323     3325       +2     
  Misses            1069     1069              
Files with missing lines    Coverage Δ
ingest/anndata_.py          88.05% <100.00%> (+0.18%) ⬆️

Contributor

@jlchang jlchang left a comment

Changes make sense to me!

@jlchang jlchang merged commit a11e54c into development Oct 23, 2024
4 checks passed
@bistline bistline deleted the jb-metadata-boolean branch October 23, 2024 16:31
4 participants