@bistline
Contributor

@bistline bistline commented Jul 9, 2025

BACKGROUND & CHANGES

This update adds the new DotPlotGenes class, which computes gene-level dot plot metrics for all applicable annotations in a study, scoped to a particular cluster. The class leverages the existing ExpressionWriter class to render, in parallel, gene-level expression values that are already filtered by the list of cells from a given cluster. These gene documents are then processed in parallel against a map of qualifying annotations (i.e. group-based, with 2-200 values) and their associated cells to calculate, for each label, the scaled mean and the percentage of cells expressing (stored as a fraction, e.g. 0.6). Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label: e.g. a mean expression of 1.5 with 50% of cells expressing gives a scaled mean of 0.75.
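As a minimal sketch of the definition above (not the DotPlotGenes implementation itself), the two metrics for a single label could be computed as follows. The function name `dot_plot_metrics` is hypothetical, zero is assumed to mean "not expressing", and the rounding to 3 and 4 decimal places is an assumption inferred from the sample output later in this description.

```python
def dot_plot_metrics(values):
    """Return [scaled_mean, fraction_expressing] for one label.

    `values` holds one expression value per cell in the label; a value of
    zero is assumed to mean the cell does not express the gene.
    """
    expressing = [v for v in values if v > 0]
    if not expressing:
        # No cells in this label express the gene (or the label is empty).
        return [0.0, 0.0]
    fraction = len(expressing) / len(values)
    mean_expressing = sum(expressing) / len(expressing)
    # Rounding precision is an assumption based on the sample output.
    return [round(mean_expressing * fraction, 3), round(fraction, 4)]


# Mean of expressing cells is 1.5, half of the cells express.
print(dot_plot_metrics([1.0, 2.0, 0.0, 0.0]))  # [0.75, 0.5]
```

Under this reading, the scaled mean is algebraically the same as the total expression divided by the number of cells in the label, since the count of expressing cells cancels out.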

MANUAL TESTING

  1. Initialize your dev environment as normal
  2. From the ingest directory, run the following example command to process a small dense matrix. Take note of the log message that starts with "creating data directory", as this is where your output files are written:
$ python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 ingest_dot_plot_genes --cluster-group-id dec0dedfeed2222222222222 --matrix-file-path ../tests/data/dense_expression_matrix.txt --matrix-file-type dense --cell-metadata-file ../tests/data/metadata_example.txt --cluster-file ../tests/data/cluster_example.txt --ingest-dot-plot-genes

beginning rendering of ../tests/data/dense_expression_matrix.txt into DotPlotGene entries
getting cluster cells from ../tests/data/cluster_example.txt
preprocessing annotation data from ../tests/data/metadata_example.txt
Opening ../tests/data/metadata_example.txt as: text/plain
Opening ../tests/data/cluster_example.txt as: text/plain
reading Cluster, found 3 labels
reading Sub-Cluster, found 6 labels
reading Category, found 3 labels
Annotation data preprocessing for ../tests/data/metadata_example.txt complete
Rendering cluster-filtered gene expression from ../tests/data/dense_expression_matrix.txt
 creating data directory at cluster_entry_dec0dedfeed2222222222222_DJsfyW
 reading ../tests/data/dense_expression_matrix.txt as dense matrix
 determining seek points for ../tests/data/dense_expression_matrix.txt with chunk size 112
...
  3. After the processing stops, confirm you see the following output:
STATUS after ingest dot plot genes: [0]
distinct_id: 2f30ec50-a04d-4d43-8fd1-b136a2045079
studyAccession: SCPdev
fileName: dec0dedfeed1111111111111
fileType: input_validation_bypassed
fileSize: 1
trigger: dev-mode
logger: ingest-pipeline
appId: single-cell-portal
action: ingest_dot_plot_genes
  4. Go into the output folder you saw above and look for the dot_plot_genes directory. Open Sergef.json and confirm that its contents are as follows:
{
  "study_id": "addedfeed000000000000000",
  "study_file_id": "dec0dedfeed1111111111111",
  "cluster_group_id": "dec0dedfeed2222222222222",
  "exp_scores": {
    "Cluster--group--study": {
      "CLST_A": [4.281, 0.6],
      "CLST_B": [2.869, 0.4],
      "CLST_C": [3.876, 0.6]
    },
    "Sub-Cluster--group--study": {
      "CLST_A_1": [2.364, 0.3333],
      "CLST_A_2": [7.157, 1.0], 
      "CLST_B_1": [0.0, 0.0],
      "CLST_B_2": [4.782, 0.6667],
      "CLST_C_1": [4.59, 0.6667],
      "CLST_C_2": [2.805, 0.5]
    },
    "Category--group--cluster" : {
      "A": [4.111, 0.6],
      "B": [5.763, 0.8333],
      "C": [0.0, 0.0]
    }
  },
  "gene_symbol": "Sergef",
  "searchable_gene": "sergef"
}
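The expected contents above could also be sanity-checked with a short script rather than by eye. This is only a sketch: `check_dot_plot_doc` is a hypothetical helper, not part of the ingest pipeline, and the rules it enforces (annotation keys of the form name--group--scope, fractions between 0 and 1, a zero scaled mean wherever the fraction is zero) are inferred from the sample document rather than taken from the code.

```python
import json


def check_dot_plot_doc(doc):
    """Sanity-check one gene document against the structure shown above."""
    for key in ("study_id", "study_file_id", "cluster_group_id",
                "exp_scores", "gene_symbol", "searchable_gene"):
        assert key in doc, f"missing key: {key}"
    for annotation, labels in doc["exp_scores"].items():
        # Keys look like "<name>--<type>--<scope>"; split from the right in
        # case the annotation name itself contains "--".
        _name, annot_type, scope = annotation.rsplit("--", 2)
        assert annot_type == "group", annotation
        assert scope in ("study", "cluster"), annotation
        for label, (scaled_mean, fraction) in labels.items():
            assert 0.0 <= fraction <= 1.0, (annotation, label)
            if fraction == 0.0:
                # Nothing expressing means nothing to scale.
                assert scaled_mean == 0.0, (annotation, label)
    return True


# A trimmed copy of the expected Sergef.json contents.
sample = json.loads("""
{
  "study_id": "addedfeed000000000000000",
  "study_file_id": "dec0dedfeed1111111111111",
  "cluster_group_id": "dec0dedfeed2222222222222",
  "exp_scores": {
    "Cluster--group--study": {"CLST_A": [4.281, 0.6], "CLST_B": [2.869, 0.4]},
    "Category--group--cluster": {"A": [4.111, 0.6], "C": [0.0, 0.0]}
  },
  "gene_symbol": "Sergef",
  "searchable_gene": "sergef"
}
""")
print(check_dot_plot_doc(sample))  # True
```

To check a real run, load the file from the data directory reported in the log with `json.load` and pass the result to the same function.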

@bistline
Contributor Author

bistline commented Jul 9, 2025

Build failure seems to be an issue with fetching PMC text, and is unrelated to this work.

@codecov

codecov bot commented Jul 9, 2025

Codecov Report

Attention: Patch coverage is 96.46018% with 8 lines in your changes missing coverage. Please review.

Please upload report for BASE (development@07a3ae6). Learn more about missing BASE report.
Report is 14 commits behind head on development.

Files with missing lines     Patch %   Lines
ingest/dot_plot_genes.py     97.63%    4 Missing ⚠️
ingest/ingest_pipeline.py    76.47%    4 Missing ⚠️
Additional details and impacted files

@@              Coverage Diff               @@
##             development     #399   +/-   ##
==============================================
  Coverage               ?   77.30%           
==============================================
  Files                  ?       31           
  Lines                  ?     4635           
  Branches               ?        0           
==============================================
  Hits                   ?     3583           
  Misses                 ?     1052           
  Partials               ?        0           
Files with missing lines       Coverage Δ
ingest/cli_parser.py           99.11% <100.00%> (ø)
ingest/expression_writer.py    90.90% <100.00%> (ø)
ingest/writer_functions.py     97.87% <100.00%> (ø)
ingest/dot_plot_genes.py       97.63% <97.63%> (ø)
ingest/ingest_pipeline.py      62.08% <76.47%> (ø)

Member

@eweitz eweitz left a comment

Code looks good! Kudos on the parallelized approach.

I suggest a few refinements and ask a question about specific code below. I also have some higher-level questions. I don't consider any of them blocking.

Do you have a sense for how much time (and money) it will take to pre-compute this dot plot data in a typical study?

Also:

> Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.

Can you link to the provenance for this? I assume it's somewhere in the Morpheus code, which I couldn't uncover in a cursory scan last week. (If we have no formal provenance and arrived at this definition through empirical tests, I think that suffices, but a formal link would be nice if we have it.)

Somewhat related, I suspect, is a note about a comment from Sarah Nyquist years ago. That linked note describes how our understanding of scaled mean expression might be contradicted by the existence of dots with "0.00". Is there any contradiction? Now seems like an opportune time to assess that.

Finally: what is the use case for including cell names, as seen in e.g. the test file tests/data/expression_writer/gene_dicts/Sergef.json? Could we potentially later optimize these to omit explicit cell names, and deduce cell names by array index offsets as we do in cell filtering? I imagine cell names will be our single biggest storage element, and inferring them away if possible would reduce storage costs, and speed up times for transfer and parsing.
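To make the index-offset idea concrete, here is a rough sketch of how explicit cell names could be swapped for positions in a shared cell list. Everything here is hypothetical: the `encode`/`decode` names, the `indices`/`values` layout, and the sample data are illustrations only, and the existing cell filtering approach mentioned above may work differently.

```python
# Hypothetical shared list of cluster cells, stored once rather than per gene.
cluster_cells = ["cell_1", "cell_2", "cell_3", "cell_4"]


def encode(expression, cells):
    """Replace cell names with their positions in the shared cell list."""
    index = {name: i for i, name in enumerate(cells)}
    pairs = sorted((index[name], value) for name, value in expression.items())
    return {
        "indices": [i for i, _ in pairs],
        "values": [v for _, v in pairs],
    }


def decode(encoded, cells):
    """Recover the cell-name-to-value mapping from the index offsets."""
    return {cells[i]: v for i, v in zip(encoded["indices"], encoded["values"])}


gene_expression = {"cell_3": 1.2, "cell_1": 0.4}
encoded = encode(gene_expression, cluster_cells)
print(encoded)  # {'indices': [0, 2], 'values': [0.4, 1.2]}
print(decode(encoded, cluster_cells) == gene_expression)  # True
```

The trade-off in a layout like this is that every reader needs the same ordered cell list to interpret the indices.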

@bistline
Contributor Author

I'll respond to the questions here first and then the code suggestions inline.

> Do you have a sense for how much time (and money) it will take to pre-compute this dot plot data in a typical study?

Not yet, because I haven't been able to run this in the Batch API. Processing on my 4-core machine takes about 10 minutes total using the compliant_liver.h5ad AnnData file, so we can extrapolate to roughly 5 minutes on an n2d-highmem-8 for the same file. Granted, this file is small (~150 cells), but I expect ingest times to scale linearly with cell count, not exponentially. My feeling so far is that these jobs will cost roughly the same as other non-DE ingest processes, which is to say pennies rather than dollars per run.

> Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.
>
> Can you link to the provenance for this? I assume it's somewhere in the Morpheus code, which I couldn't uncover in a cursory scan last week. (If we have no formal provenance and arrived at this definition through empirical tests, I think that suffices, but a formal link would be nice if we have it.)

This was checked empirically using the compliant_liver.h5ad file - I was able to sample random genes and the scaled mean/pct expressing matched perfectly for everything I checked. I then found this comment which sort of reaffirmed my conclusion. We should still ask Tim/Farzaneh about this to be sure, which I plan on doing at demo next Tuesday.

> Somewhat related, I suspect, is a note about a comment from Sarah Nyquist years ago. That linked note describes how our understanding of scaled mean expression might be contradicted by the existence of dots with "0.00". Is there any contradiction? Now seems like an opportune time to assess that.

I'm still a little fuzzy on that as well, but my suspicion is that the gene has non-zero expression for some cells in the study, while a given group in an annotation may have none. If no cells from that group express the gene, the dot shows up as 0, since the fraction of expressing cells is what the mean gets scaled by. I will say that this code does the proper filtering of cells such that we only save significant values, but the above scenario can still happen. I'm open to discussion on whether or not we need to save these zero entries; I am saving them for now just to keep the code simpler.
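The scenario described above can be reproduced with a toy example. The data and names below are made up for illustration, and treating "significant values" as non-zero values is an assumption.

```python
# Hypothetical gene document: only non-zero values are kept (an assumption
# about what "significant values" means above).
gene_expression = {"cell_1": 2.5, "cell_2": 1.0, "cell_3": 4.0}

# Hypothetical annotation: label -> cells carrying that label.
label_cells = {
    "A": ["cell_1", "cell_2", "cell_7"],
    "B": ["cell_3", "cell_4"],
    "C": ["cell_5", "cell_6"],
}

# The gene is expressed somewhere in the study...
expressed_in_study = bool(gene_expression)

# ...but the fraction of expressing cells is computed per label.
fraction_expressing = {
    label: sum(cell in gene_expression for cell in cells) / len(cells)
    for label, cells in label_cells.items()
}

# Any label with a zero fraction gets a scaled mean of 0 as well, because
# the mean is multiplied by that fraction.
zero_labels = [label for label, frac in fraction_expressing.items() if frac == 0.0]

print(expressed_in_study, zero_labels)  # True ['C']
```

Here label C would appear as a "0.00" dot even though the gene is expressed elsewhere in the study, which is consistent with the scaled mean definition rather than a contradiction of it.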

> Finally: what is the use case for including cell names, as seen in e.g. the test file tests/data/expression_writer/gene_dicts/Sergef.json? Could we potentially later optimize these to omit explicit cell names, and deduce cell names by array index offsets as we do in cell filtering? I imagine cell names will be our single biggest storage element, and inferring them away if possible would reduce storage costs, and speed up times for transfer and parsing.

This is an intermediate file that isn't persisted anywhere. It is what ExpressionWriter renders when it processes the expression matrix and filters everything by the list of cluster cells. I then read in all of those filtered gene-level documents and compute the scaled mean/pct expressing by finding the intersection of the observed cells and the cells from each annotation label. What gets saved in Mongo looks exactly like what we designed:

{
  "organ__ontology_label--group--study": {
    "brain": [1.234, 0.5142],
    "brain stem": [0.294, 0.2284],
    …
  },
  "cell_type__ontology_label--group--study": {
    "dopaminergic neuron": [ … ],
    …
  }
}
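As a rough illustration of the intersection step, the sketch below builds a nested document of the same shape from one intermediate gene document and an annotation map. The function name `compute_exp_scores`, the input layouts, and the sample values are all hypothetical; the actual DotPlotGenes code may be organized differently.

```python
def compute_exp_scores(gene_expression, annotation_map):
    """Build {annotation: {label: [scaled_mean, fraction]}} for one gene.

    gene_expression: cell name -> expression value, for observed cells only.
    annotation_map:  annotation key -> {label -> list of cell names}.
    """
    observed_cells = set(gene_expression)
    exp_scores = {}
    for annotation, labels in annotation_map.items():
        exp_scores[annotation] = {}
        for label, cells in labels.items():
            # Cells in this label that have an observed value for the gene.
            expressing = observed_cells & set(cells)
            if not expressing:
                exp_scores[annotation][label] = [0.0, 0.0]
                continue
            fraction = len(expressing) / len(cells)
            mean = sum(gene_expression[c] for c in expressing) / len(expressing)
            exp_scores[annotation][label] = [
                round(mean * fraction, 3),
                round(fraction, 4),
            ]
    return exp_scores


# Made-up inputs for illustration only.
gene_expression = {"c1": 1.0, "c2": 3.0, "c5": 0.5}
annotation_map = {
    "organ__ontology_label--group--study": {
        "brain": ["c1", "c2", "c3", "c4"],
        "brain stem": ["c5", "c6"],
    }
}
print(compute_exp_scores(gene_expression, annotation_map))
# {'organ__ontology_label--group--study': {'brain': [1.0, 0.5], 'brain stem': [0.25, 0.5]}}
```

Note that no cell names survive into the result, which matches the point above that the cell-level file is only an intermediate.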

Co-authored-by: Eric Weitz <[email protected]>
@bistline bistline merged commit c440ba0 into development Jul 10, 2025
6 checks passed