DotPlotGenes class for pre-computing gene-level dot plot metrics (SCP-5979) #399
Build failure seems to be an issue with fetching PMC text, and is unrelated to this work.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## development #399 +/- ##
==============================================
Coverage ? 77.30%
==============================================
Files ? 31
Lines ? 4635
Branches ? 0
==============================================
Hits ? 3583
Misses ? 1052
Partials ? 0
eweitz
left a comment
Code looks good! Kudos on the parallelized approach.
I suggest a few refinements and ask a question about specific code below. I also have some higher-level questions. I don't consider any of them blocking.
Do you have a sense for how much time (and money) it will take to pre-compute this dot plot data in a typical study?
Also:
Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.
Can you link to the provenance for this? I assume it's somewhere in the Morpheus code, which I couldn't uncover in a cursory scan last week. (If we have no formal provenance and arrived at this definition through empirical tests, I think that suffices, but a formal link would be nice if we have it.)
Somewhat related, I suspect, is a note about a comment from Sarah Nyquist years ago. That linked note describes how our understanding of scaled mean expression might be contradicted by the existence of dots with "0.00". Is there any contradiction? Now seems like an opportune time to assess that.
Finally: what is the use case for including cell names, as seen in e.g. the test file tests/data/expression_writer/gene_dicts/Sergef.json? Could we potentially later optimize these to omit explicit cell names, and deduce cell names by array index offsets as we do in cell filtering? I imagine cell names will be our single biggest storage element, and inferring them away if possible would reduce storage costs, and speed up times for transfer and parsing.
I'll respond to the questions here first and then the code suggestions inline.
Not yet, because I haven't been able to run this in the Batch API yet. But processing on my 4-core machine takes about 10 minutes total using the
This was checked empirically using the
I'm still a little fuzzy on that as well, but my suspicion is that the gene has non-zero expression for some cells in the study, while a given group in an annotation may have none. If no cells from that group express the gene, then it's going to show up as 0, since that's what it gets scaled by. I will say that this code does the proper filtering of cells such that we only save significant values, but the above scenario can still happen. I'm open to discussion on whether or not we need to save this; I am saving it for now just to keep the code simpler.
This is an intermediate file that isn't persisted anywhere. This is what gets rendered by
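To make the scaled-mean definition and the "0.00" scenario discussed above concrete, here is a minimal sketch in plain Python. It is not the DotPlotGenes implementation: the function name `dot_plot_metrics`, its arguments, and the choice to return the share of expressing cells as a fraction (0.5 rather than 50) are all assumptions made for illustration.

```python
def dot_plot_metrics(expression_by_cell, label_cells):
    """Return (scaled_mean, fraction_expressing) for one gene in one label.

    Hypothetical helper, not taken from the PR.
    expression_by_cell: dict mapping cell name -> expression value for the gene
    label_cells: list of all cell names assigned to the annotation label
    """
    if not label_cells:
        return 0.0, 0.0

    # "Observed" cells are those in the label with non-zero expression
    observed = [
        expression_by_cell[cell]
        for cell in label_cells
        if expression_by_cell.get(cell, 0) > 0
    ]

    # A gene can be expressed elsewhere in the study yet have no
    # expressing cells in this label, which yields a 0 dot
    if not observed:
        return 0.0, 0.0

    fraction_expressing = len(observed) / len(label_cells)
    mean_observed = sum(observed) / len(observed)

    # Scaled mean = mean of observed cells x share of cells expressing
    return mean_observed * fraction_expressing, fraction_expressing


# Two of four cells express the gene, with a mean of 1.5 among them
print(dot_plot_metrics({"a": 1.0, "b": 2.0}, ["a", "b", "c", "d"]))

# The gene is expressed in the study, but not in this label
print(dot_plot_metrics({"a": 1.0}, ["x", "y"]))
```

The first call reproduces the example from the PR description (mean 1.5, 50% expressing, scaled mean 0.75); the second shows how a label with no expressing cells ends up with a scaled mean of 0.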
…o jb-dot-plot-gene-processing
Co-authored-by: Eric Weitz <[email protected]>
BACKGROUND & CHANGES
This update adds the new DotPlotGenes class for computing gene-level dot plot metrics for all applicable annotations in a study scoped to a particular cluster. This class leverages the existing ExpressionWriter class for parallel rendering of gene-level expression values that are already filtered by the list of cells from a given cluster. These gene documents are then processed in parallel against a map of qualifying (i.e. group-based, 2-200 values) annotations and their associated cells to calculate the scaled mean and percentage of cells expressing for that given label. Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.
MANUAL TESTING
ingest directory, run the following example command to process a small dense matrix, taking note of the log message that starts with "creating data directory", as this is where your output files are:
dot_plot_genes directory. Open Sergef.json, confirming the contents are as follows: