DotPlotGenes class for pre-computing gene-level dot plot metrics (SCP-5979) #399

bistline · 2025-07-09T19:07:13Z

BACKGROUND & CHANGES

This update adds the new DotPlotGenes class, which computes gene-level dot plot metrics for all applicable annotations in a study, scoped to a particular cluster. The class leverages the existing ExpressionWriter class for parallel rendering of gene-level expression values that are already filtered by the list of cells from a given cluster. These gene documents are then processed in parallel against a map of qualifying annotations (i.e. group-based, with 2-200 values) and their associated cells to calculate the scaled mean and the percentage of cells expressing for each label. Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.
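
As a rough illustration of the scaled mean definition above (this is not the code in ingest/dot_plot_genes.py), here is a minimal sketch. The function name `dot_plot_metrics`, its arguments, and the rounding precision are assumptions made for this example only.

```python
def dot_plot_metrics(observed_values, total_cells):
    """Return [scaled_mean, fraction_expressing] for one annotation label.

    observed_values: expression values for the cells in this label that
                     express the gene (hypothetical input shape)
    total_cells:     number of cells in this label, expressing or not
    """
    if total_cells == 0 or not observed_values:
        # no cells in the label, or none expressing: avoid dividing by zero
        return [0.0, 0.0]
    mean_expression = sum(observed_values) / len(observed_values)
    fraction_expressing = len(observed_values) / total_cells
    scaled_mean = mean_expression * fraction_expressing
    return [round(scaled_mean, 3), round(fraction_expressing, 4)]


# the example from the description: mean of 1.5 with 50% of cells expressing
print(dot_plot_metrics([1.0, 2.0], 4))  # [0.75, 0.5]
```

Note that under this definition the scaled mean is equivalent to the sum of observed expression divided by the total number of cells in the label.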

MANUAL TESTING

Initialize your dev environment as normal.
From the ingest directory, run the following example command to process a small dense matrix. Take note of the log message that starts with "creating data directory", as this is where your output files are written:

$ python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 ingest_dot_plot_genes --cluster-group-id dec0dedfeed2222222222222 --matrix-file-path ../tests/data/dense_expression_matrix.txt --matrix-file-type dense --cell-metadata-file ../tests/data/metadata_example.txt --cluster-file ../tests/data/cluster_example.txt --ingest-dot-plot-genes

beginning rendering of ../tests/data/dense_expression_matrix.txt into DotPlotGene entries
getting cluster cells from ../tests/data/cluster_example.txt
preprocessing annotation data from ../tests/data/metadata_example.txt
Opening ../tests/data/metadata_example.txt as: text/plain
Opening ../tests/data/cluster_example.txt as: text/plain
reading Cluster, found 3 labels
reading Sub-Cluster, found 6 labels
reading Category, found 3 labels
Annotation data preprocessing for ../tests/data/metadata_example.txt complete
Rendering cluster-filtered gene expression from ../tests/data/dense_expression_matrix.txt
 creating data directory at cluster_entry_dec0dedfeed2222222222222_DJsfyW
 reading ../tests/data/dense_expression_matrix.txt as dense matrix
 determining seek points for ../tests/data/dense_expression_matrix.txt with chunk size 112
...

After the processing stops, confirm you see the following output:

STATUS after ingest dot plot genes: [0]
distinct_id: 2f30ec50-a04d-4d43-8fd1-b136a2045079
studyAccession: SCPdev
fileName: dec0dedfeed1111111111111
fileType: input_validation_bypassed
fileSize: 1
trigger: dev-mode
logger: ingest-pipeline
appId: single-cell-portal
action: ingest_dot_plot_genes

Go into the output folder you saw above and look for the dot_plot_genes directory. Open Sergef.json, confirming the contents are as follows:

{
  "study_id": "addedfeed000000000000000",
  "study_file_id": "dec0dedfeed1111111111111",
  "cluster_group_id": "dec0dedfeed2222222222222",
  "exp_scores": {
    "Cluster--group--study": {
      "CLST_A": [4.281, 0.6],
      "CLST_B": [2.869, 0.4],
      "CLST_C": [3.876, 0.6]
    },
    "Sub-Cluster--group--study": {
      "CLST_A_1": [2.364, 0.3333],
      "CLST_A_2": [7.157, 1.0], 
      "CLST_B_1": [0.0, 0.0],
      "CLST_B_2": [4.782, 0.6667],
      "CLST_C_1": [4.59, 0.6667],
      "CLST_C_2": [2.805, 0.5]
    },
    "Category--group--cluster" : {
      "A": [4.111, 0.6],
      "B": [5.763, 0.8333],
      "C": [0.0, 0.0]
    }
  },
  "gene_symbol": "Sergef",
  "searchable_gene": "sergef"
}
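
If you would rather check the output programmatically than by eye, the following sketch may help. It assumes only the document shape shown above; the helper name `check_dot_plot_gene` and the specific checks are hypothetical and are not part of the ingest pipeline.

```python
import json

REQUIRED_KEYS = ("study_id", "study_file_id", "cluster_group_id",
                 "exp_scores", "gene_symbol", "searchable_gene")


def check_dot_plot_gene(doc):
    """Return a list of problems found in a DotPlotGene-style dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    for annotation, labels in doc.get("exp_scores", {}).items():
        for label, scores in labels.items():
            where = f"{annotation} / {label}"
            if len(scores) != 2:
                problems.append(f"{where}: expected [scaled_mean, fraction]")
                continue
            scaled_mean, fraction = scores
            if not 0.0 <= fraction <= 1.0:
                problems.append(f"{where}: fraction {fraction} outside [0, 1]")
            if fraction == 0 and scaled_mean != 0:
                problems.append(f"{where}: non-zero scaled mean with no expressing cells")
    return problems


# abbreviated copy of the expected Sergef.json contents
sample = json.loads("""
{
  "study_id": "addedfeed000000000000000",
  "study_file_id": "dec0dedfeed1111111111111",
  "cluster_group_id": "dec0dedfeed2222222222222",
  "exp_scores": {"Cluster--group--study": {"CLST_A": [4.281, 0.6], "CLST_B": [2.869, 0.4]}},
  "gene_symbol": "Sergef",
  "searchable_gene": "sergef"
}
""")
print(check_dot_plot_gene(sample))  # []
```

To run it against the real output, load Sergef.json from the dot_plot_genes directory with `json.load` and pass the resulting dict to the helper.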

bistline · 2025-07-09T19:07:53Z

Build failure seems to be an issue with fetching PMC text, and is unrelated to this work.

codecov · 2025-07-09T19:13:22Z

Codecov Report

Attention: Patch coverage is 96.46018% with 8 lines in your changes missing coverage. Please review.

Please upload report for BASE (development@07a3ae6). Learn more about missing BASE report.
Report is 14 commits behind head on development.

Files with missing lines	Patch %	Lines
ingest/dot_plot_genes.py	97.63%	4 Missing ⚠️
ingest/ingest_pipeline.py	76.47%	4 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             development     #399   +/-   ##
==============================================
  Coverage               ?   77.30%           
==============================================
  Files                  ?       31           
  Lines                  ?     4635           
  Branches               ?        0           
==============================================
  Hits                   ?     3583           
  Misses                 ?     1052           
  Partials               ?        0

Files with missing lines	Coverage Δ
ingest/cli_parser.py	`99.11% <100.00%> (ø)`
ingest/expression_writer.py	`90.90% <100.00%> (ø)`
ingest/writer_functions.py	`97.87% <100.00%> (ø)`
ingest/dot_plot_genes.py	`97.63% <97.63%> (ø)`
ingest/ingest_pipeline.py	`62.08% <76.47%> (ø)`


eweitz

Code looks good! Kudos on the parallelized approach.

I suggest a few refinements, and ask a question for specific code below. I also have some higher-level questions. I don't consider any of them blocking.

Do you have a sense for how much time (and money) it will take to pre-compute this dot plot data in a typical study?

Also:

Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.

Can you link to the provenance for this? I assume it's somewhere in the Morpheus code, which I couldn't uncover in a cursory scan last week. (If we have no formal provenance and arrived at this definition through empirical tests, I think that suffices, but a formal link would be nice if we have it.)

Somewhat related, I suspect, is a note about a comment from Sarah Nyquist years ago. That linked note describes how our understanding of scaled mean expression might be contradicted by the existence of dots with "0.00". Is there any contradiction? Now seems like an opportune time to assess that.

Finally: what is the use case for including cell names, as seen in e.g. the test file tests/data/expression_writer/gene_dicts/Sergef.json? Could we potentially later optimize these to omit explicit cell names, and deduce cell names by array index offsets as we do in cell filtering? I imagine cell names will be our single biggest storage element, and inferring them away if possible would reduce storage costs, and speed up times for transfer and parsing.
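
To make the index-offset idea concrete, here is a small hypothetical sketch. It assumes a gene document maps cell names to expression values and that an ordered list of all cluster cells is available separately; neither the function names nor this data shape are taken from the actual test file.

```python
def encode_by_index(gene_cells, all_cells):
    """Replace cell names with their positions in a shared, ordered cell list."""
    index_of = {name: i for i, name in enumerate(all_cells)}
    # sorted (index, value) pairs; the names themselves are no longer stored
    return sorted((index_of[name], value) for name, value in gene_cells.items())


def decode_by_index(pairs, all_cells):
    """Recover the name-keyed dict from (index, value) pairs."""
    return {all_cells[i]: value for i, value in pairs}


all_cells = ["CELL_0001", "CELL_0002", "CELL_0003", "CELL_0004"]
gene_cells = {"CELL_0003": 1.5, "CELL_0001": 0.7}

encoded = encode_by_index(gene_cells, all_cells)
print(encoded)  # [(0, 0.7), (2, 1.5)]
print(decode_by_index(encoded, all_cells) == gene_cells)  # True
```

The saving comes from storing the cell name list once rather than repeating names in every gene document.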

ingest/dot_plot_genes.py

ingest/writer_functions.py

tests/data/expression_writer/gene_dicts/THRA1%2FBTR.json

bistline · 2025-07-10T16:14:02Z

I'll respond to the questions here first and then the code suggestions inline.

Do you have a sense for how much time (and money) it will take to pre-compute this dot plot data in a typical study?

Not yet, because I haven't been able to run this in the Batch API yet. But processing on my 4-core machine takes about 10 minutes total using the compliant_liver.h5ad AnnData file, so we can extrapolate out to 5 min on an n2d-highmem-8 for the same file. Granted, this file is small (~150 cells) but I expect the cell counts to scale ingest times linearly, not exponentially. My feeling so far is that these jobs will cost roughly the same as other non-DE ingest processes, which is to say pennies rather than dollars per run.

Scaled mean for dot plots is defined as the mean expression of observed cells multiplied by the percentage of cells expressing in that label - e.g. a mean expression of 1.5 with 50% cells expressing will have a scaled mean of 0.75.

Can you link to the provenance for this? I assume it's somewhere in the Morpheus code, which I couldn't uncover in a cursory scan last week. (If we have no formal provenance and arrived at this definition through empirical tests, I think that suffices, but a formal link would be nice if we have it.)

This was checked empirically using the compliant_liver.h5ad file - I was able to sample random genes and the scaled mean/pct expressing matched perfectly for everything I checked. I then found this comment which sort of reaffirmed my conclusion. We should still ask Tim/Farzaneh about this to be sure, which I plan on doing at demo next Tuesday.

Somewhat related, I suspect, is a note about a comment from Sarah Nyquist years ago. That linked note describes how our understanding of scaled mean expression might be contradicted by the existence of dots with "0.00". Is there any contradiction? Now seems like an opportune time to assess that.

I'm still a little fuzzy on that as well, but my suspicion is that the gene has non-zero expression for some cells in the study, while a given group in an annotation may have none. If no cells from that group express the gene, the dot shows up as 0, since that's the percentage the mean gets scaled by. I will say that this code does the proper filtering of cells such that we only save significant values, but the above scenario can still happen. I'm open to discussion on whether or not we need to save these zero entries - I am saving them for now just to make the code simpler.

Finally: what is the use case for including cell names, as seen in e.g. the test file tests/data/expression_writer/gene_dicts/Sergef.json? Could we potentially later optimize these to omit explicit cell names, and deduce cell names by array index offsets as we do in cell filtering? I imagine cell names will be our single biggest storage element, and inferring them away if possible would reduce storage costs, and speed up times for transfer and parsing.

This is an intermediate file that isn't persisted anywhere. It is what ExpressionWriter renders when it processes the expression matrix and filters everything by the list of cluster cells. I then read in all of those filtered gene-level documents and compute the scaled mean/pct expression by finding the intersection of the observed cells and the cells from each annotation label. What gets saved in Mongo looks exactly like what we designed:

{
  "organ__ontology_label--group--study": {
    "brain": [1.234, 0.5142],
    "brain stem": [0.294, 0.2284],
    …
  },
  "cell_type__ontology_label--group--study": {
    "dopaminergic neuron": [ … ],
    …
  }
}
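
The intersection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DotPlotGenes implementation: the function name `compute_exp_scores`, the input shapes, and the rounding are hypothetical.

```python
def compute_exp_scores(gene_cells, annotation_map):
    """Build an exp_scores-style dict for one gene.

    gene_cells:     {cell_name: expression} for observed (expressing) cells,
                    already filtered to the cluster (assumed shape)
    annotation_map: {annotation_id: {label: [cell names]}} (assumed shape)
    """
    exp_scores = {}
    for annotation, labels in annotation_map.items():
        exp_scores[annotation] = {}
        for label, cells in labels.items():
            label_cells = set(cells)
            # cells in this label that also have observed expression
            observed = label_cells.intersection(gene_cells)
            if not observed:
                # empty label or no expressing cells: record zeros
                exp_scores[annotation][label] = [0.0, 0.0]
                continue
            fraction = len(observed) / len(label_cells)
            mean = sum(gene_cells[c] for c in observed) / len(observed)
            exp_scores[annotation][label] = [round(mean * fraction, 3), round(fraction, 4)]
    return exp_scores


gene_cells = {"c1": 2.0, "c2": 4.0}
annotation_map = {"Cluster--group--study": {"A": ["c1", "c2", "c3", "c4"], "B": ["c5"]}}
print(compute_exp_scores(gene_cells, annotation_map))
# {'Cluster--group--study': {'A': [1.5, 0.5], 'B': [0.0, 0.0]}}
```

Label "B" shows the zero case discussed earlier: the gene is expressed elsewhere in the cluster, but no cell in that label expresses it, so both values are 0.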

bistline added 9 commits July 2, 2025 17:22

Adding support for writing JSON gene dicts from expression_writer.py

71fdf26

Adding DotPlotGenes for processing expression/cluster/annotation data…

31721b4

…, test stubs

fixing metrics computation, adding test coverage

fb2d8bb

Standardizing cli interface, adding to ingest harness, refactoring

dc61626

fixing transform, init bug

4df2242

fixing transform again, docstring updates, adding ingest test

bfec7fa

fixing ingest tests, divide by zero issue, example code

c0f3474

fixing Mongo connection & remainder insert, naming/ObjectID issue, ne…

35ec2bb

…w tests

fixing mock issues, adding random string to avoid dirname collisions

353293c

bistline requested a review from eweitz July 9, 2025 19:07

bistline added the build failure: false positive label Jul 9, 2025

Update minified ontologies via GitHub Actions

69624d8

eweitz approved these changes Jul 10, 2025

bistline added 2 commits July 10, 2025 12:51

addressing PR comments

c115357

Merge remote-tracking branch 'origin/jb-dot-plot-gene-processing' int…

dc69a7a

…o jb-dot-plot-gene-processing

bistline removed the build failure: false positive label Jul 10, 2025

Removing fallback imports

a5442bf

Co-authored-by: Eric Weitz <[email protected]>

bistline merged commit c440ba0 into development Jul 10, 2025
6 checks passed

This was referenced Jul 14, 2025

Fixing parameter default to skip delocalizing outputs (SCP-5979) #401

Merged

Rails-based integration for DotPlotGene processing (SCP-6029) broadinstitute/single_cell_portal_core#2283

Merged
