24 commits
68e8811
adding gitignore file, adding .DS_store to that
graceakatsu Jun 16, 2025
4a09f2f
adding input data contents (other than intersecting genes csv file) t…
graceakatsu Jun 16, 2025
0d803ce
moving loading renv to a new script on its own and renaming the get_d…
graceakatsu Jun 17, 2025
ea8081a
updating numbering for scripts
graceakatsu Jun 17, 2025
bd1e9ac
Adding renv folder with activate.R and settings.json
graceakatsu Jun 17, 2025
664bdf0
updating nextflow file and config. also editing run_pipeline to remov…
graceakatsu Jun 18, 2025
eac1f9a
replace python unzipping file with R to simplify file types and also …
graceakatsu Jun 18, 2025
b51c4d0
updating git ignore with nextflow and R hidden files
graceakatsu Jun 18, 2025
11c61ca
updating readme, removing conda environment, updating main.nf to incl…
graceakatsu Jun 18, 2025
d78e4de
updating .gitignore
graceakatsu Jun 18, 2025
768131a
replacing R unzipping file with python one
graceakatsu Jun 18, 2025
8d8ca52
adding back in conda environment with python packages only
graceakatsu Jun 18, 2025
42031d1
making it so that the pipeline shell script activates conda and remov…
graceakatsu Jun 18, 2025
8055b58
updating readme with info about optional unzipping script
graceakatsu Jun 18, 2025
b9e3eaf
adding optional commented-out line to run the unzipping script
graceakatsu Jun 18, 2025
25e2499
minimally modifying run_pipeline to be suitable for HPC
graceakatsu Jun 18, 2025
6462b01
specifying CRAN mirror
graceakatsu Jun 18, 2025
9f0253f
adding amc partition name to nextflow.config
graceakatsu Jun 18, 2025
f9af7be
adding more info needed to nextflow config
graceakatsu Jun 18, 2025
9239777
fixing extra bracket
graceakatsu Jun 18, 2025
d56aa27
fixing order of CRAN mirror specification
graceakatsu Jun 18, 2025
4b2e56e
editing .sh file to add some file paths specific to alpine, per convo…
Jun 25, 2025
cb5781b
fixing the paths so that it is executable locally and on HPC
Jun 25, 2025
cb26931
Update README.md
graceakatsu Jun 25, 2025
1 change: 1 addition & 0 deletions .Rprofile
@@ -0,0 +1 @@
source("renv/activate.R")
9 changes: 9 additions & 0 deletions .gitignore
@@ -9,6 +9,10 @@
*.user
*.userosscache
*.sln.docstates
*.nextflow/
*scripts/.Rhistory
*.Rhistory
*.nextflow.log.*

# User-specific files (MonoDevelop/Xamarin Studio)
*.userprefs
@@ -398,3 +402,8 @@ FodyWeavers.xsd

# JetBrains Rider
*.sln.iml
.DS_Store

# All input_data, except for intersecting genes csv file
/input_data/*
!/input_data/intersect_3d.csv
27 changes: 15 additions & 12 deletions README.md
@@ -2,6 +2,7 @@
1) Introduction
2) Scripts
3) Input and output data preparation and organization
4) Running the pipeline
# 1) Introduction:
The purpose of the code in this repository is to use the InstaPrism R package (https://github.com/humengying0907/InstaPrism/tree/master) to run Bayes Prism deconvolution on the following bulk RNA seq and microarray datasets of high grade serous ovarian carcinoma (HGSOC) samples:
- “SchildkrautB” – bulk RNA sequencing of HGSOC from Black patients.
@@ -13,23 +14,23 @@ The purpose of the code in this repository is to use the InstaPrism R package (h

Bayes Prism requires a single cell reference dataset to perform deconvolution. However, it has been previously shown that certain cell types, notably adipocytes, are present in bulk tumor samples but largely absent from single cell RNA sequencing results (https://www.biorxiv.org/content/10.1101/2024.04.25.590992v1). Adipocytes can instead be captured using single nucleus RNA sequencing. Here, we incorporate single nucleus RNA sequencing data from adipocytes, in addition to single cell RNA sequencing data of HGSOC, in deconvolution of bulk HGSOC RNA sequencing and microarray data.
# 2) Scripts:
Prior to running these scripts, please download the required data as outlined below in “Input and output data preparation and organization.” Please also ensure that the renv folder has been downloaded and is in the same directory as the scripts. These scripts are intended to be run in the following order:
### (Optional) unzip_input_data.py
Uncompresses all files in the required input data, if this has not been done manually. Example: python unzip_input_data.py path/to/input_data/
## 1_get_data.R
Prior to running this pipeline, please download the required data as outlined below in “Input and output data preparation and organization.”
## (OPTIONAL) 00_unzip_input_data.R
This script checks for any zipped .gz or .zip files present in input_data, and will unzip and format anything it finds. It will not run automatically, and is commented out in the shell script. Uncomment to use if needed.
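
As a rough illustration of what such an unzipping step might do, the Python sketch below walks a directory and decompresses any .gz or .zip files it finds. The function name `unzip_inputs` and its behaviour (leaving the archives in place and writing the extracted files next to them) are assumptions made for illustration, not the repository's actual script.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def unzip_inputs(input_dir):
    """Decompress every .gz and .zip file found under input_dir.

    Hypothetical helper: archives are left in place and the extracted
    files are written alongside them.
    """
    extracted = []
    # sorted() materialises the listing first, so files created while
    # extracting are not revisited.
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".gz":
            target = path.with_suffix("")  # drop the trailing .gz
            with gzip.open(path, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
        elif path.suffix == ".zip":
            with zipfile.ZipFile(path) as archive:
                archive.extractall(path.parent)
            extracted.append(path.parent)
    return extracted
```

Calling `unzip_inputs("input_data")` would return the list of paths it wrote to, which can be useful for logging what was unpacked.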
## 01_load_renv.R
This script loads the environment using the renv lockfile.
## 02_get_data.R
This script reads the bulk RNA sequencing and microarray datasets and filters them to only include genes present in one common gene mapping list. It transforms the microarray data using 2^(...) to match the scale of the bulk RNA sequencing data values – this is used for InstaPrism deconvolution. It also transforms the bulk RNA sequencing data using log10(...+1) to match the scale of the microarray data – this is used for clustering. All of these matrices are saved in a uniform format with sample IDs as rows and genes as columns; file names are appended with either “asImported” or “transformed.” It also saves any metadata information (e.g. prior clustering) about the samples in a separate file for reference.
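
The two transforms described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the microarray values are on a log2 scale; the function names are hypothetical and the actual script is written in R.

```python
import math


def microarray_to_linear(log2_values):
    """Undo a log2 transform: x -> 2**x, giving RNA-seq-like 'pseudocounts'."""
    return [2 ** v for v in log2_values]


def rnaseq_to_log(counts):
    """Compress counts with log10(x + 1); the +1 keeps zero counts at zero."""
    return [math.log10(c + 1) for c in counts]
```

For example, a microarray value of 3 becomes 2^3 = 8, and an RNA-seq count of 99 becomes log10(100) = 2.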
## 2_get_clustering.R
## 03_get_clustering.R
This script performs k-means clustering (k=2,3,4), NMF clustering (k=2,3,4), and consensusOV subtyping (https://github.com/bhklab/consensusOV) for each bulk dataset. For k-means clustering, it uses log10(...+1) transformed data (for RNAseq), and raw log2 data (for microarray). For NMF clustering and consensusOV subtyping, it uses raw counts (for RNAseq), and 2^(...) scaled "pseudocounts" data (for microarray). It saves the results in one csv file per dataset, where each row corresponds to one sample. It also saves a csv file containing the results for all datasets.
## 3_prepare_reference_data.R
## 04_prepare_reference_data.R
This script creates a single-cell/single-nucleus reference matrix for use as input into InstaPrism. It requires single cell RNA sequencing data of HGSOC (from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE217517) and cell type labels for those cells (from https://github.com/greenelab/deconvolution_pilot/tree/main/data/cell_labels). It also requires single nucleus RNA sequencing data of adipocytes (from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176171). Please refer to the “Input and output data preparation and organization” subsection of README for specific files necessary. It reads in the HGSOC single cell and adipocyte single nucleus RNA sequencing data and performs some pre-processing on the adipocyte data (removing duplicate samples, removing non-adipocyte samples, removing samples with many mitochondrial gene reads, and using Seurat to remove low-quality nuclei, empty droplets, and nuclei doublets/multiplets). It then combines the single cell and single nucleus data to generate an expression matrix where each row corresponds to a gene (GeneCards symbols) and each column corresponds to a sample (each assigned a unique numerical ID). It also generates a cell type file which serves as a key, and associates each sample ID to its cell type.
## 4_run_instaprism.R
This script runs InstaPrism, an R package that performs deconvolution with a similar but faster method to BayesPrism. First, it loads in the reference single cell plus single nucleus RNA sequencing reference dataset created in the previous script. Since the reference dataset is so large, it randomly selects 500 of each cell type to use. It then generates and saves two reference objects for input into InstaPrism, one with adipocytes and one without adipocytes. It runs InstaPrism twice on each of the six bulk datasets, both with and without the adipocytes in the reference data. Of note, Instaprism requires non-log-transformed bulk data, so it is performed on the original, non-transformed data for the bulk RNA sequencing datasets and on the 2^(…) transformed data for the microarray datasets.
## 5_visualize_instaprism_outputs.R
## 05_run_instaprism.R
This script runs InstaPrism, an R package that performs deconvolution with a similar but faster method to BayesPrism. First, it loads the single cell plus single nucleus RNA sequencing reference dataset created in the previous script. It also removes a set of previously identified genes to mitigate technical factors unique to single nucleus RNA sequencing. Since the reference dataset is so large, it also randomly selects 500 cells of each cell type to use. It then generates and saves two reference objects for input into InstaPrism, one with adipocytes and one without adipocytes. It runs InstaPrism twice on each of the six bulk datasets, once with and once without the adipocytes in the reference data. Of note, InstaPrism requires non-log-transformed bulk data, so it is run on the original, non-transformed data for the bulk RNA sequencing datasets and on the 2^(…) transformed data for the microarray datasets.
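
The per-cell-type downsampling step can be pictured with the following Python sketch. It is a minimal illustration under stated assumptions: the function name, the fixed seed, and the choice to keep cell types with fewer than 500 cells whole are all hypothetical, and the repository's R script may handle these details differently.

```python
import random
from collections import defaultdict


def subsample_per_cell_type(cell_types, n_per_type=500, seed=0):
    """Return sorted cell IDs with at most n_per_type cells per type.

    cell_types maps cell ID -> cell type label. Types with fewer than
    n_per_type cells are kept whole (an assumption; the script may differ).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_type = defaultdict(list)
    for cell_id, label in cell_types.items():
        by_type[label].append(cell_id)
    kept = []
    for label in sorted(by_type):
        ids = sorted(by_type[label])
        if len(ids) > n_per_type:
            ids = rng.sample(ids, n_per_type)
        kept.extend(ids)
    return sorted(kept)
```

The returned IDs could then be used to select the matching columns of the reference expression matrix and the matching rows of the cell type key.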
## 06_visualize_instaprism_outputs.R
This script creates several figures to visualize the deconvolution results generated in the previous script, and compare the results when run with and without adipocyte single nucleus RNA sequencing data in the reference data. It creates 100% stacked bar charts to visualize the total cell proportions per bulk dataset, 100% stacked bar charts showing cell proportions per sample in each dataset, and bar charts showing the absolute change of cell type proportions in total per dataset.
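
The quantities behind these charts can be sketched as follows. This Python example is illustrative only: it assumes each sample's deconvolution result is a mapping from cell type to proportion, and it interprets "absolute change" as a difference in proportion (with adipocytes minus without) rather than a percent change. The function names are hypothetical.

```python
def mean_proportions(samples):
    """Average each cell type's proportion across samples (a list of dicts)."""
    types = sorted({t for s in samples for t in s})
    return {t: sum(s.get(t, 0.0) for s in samples) / len(samples) for t in types}


def proportion_change(with_adipo, without_adipo):
    """Per-cell-type change (with minus without); missing types count as 0."""
    types = sorted(set(with_adipo) | set(without_adipo))
    return {t: with_adipo.get(t, 0.0) - without_adipo.get(t, 0.0) for t in types}
```

The output of `mean_proportions` corresponds to one stacked bar per dataset, and the output of `proportion_change` to one bar per cell type in the change plots.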
# 3) Input and output data preparation and organization:
Prior to running these scripts, please ensure that the below raw data files have been downloaded and are present in a folder entitled “input_data” inside the same project and directory as the scripts. (The results will be created and stored in another folder inside the same directory entitled “output_data”.) Sample directory contents:

![Screenshot 2025-02-25 at 2 28 25 PM](https://github.com/user-attachments/assets/b44cfc2a-8242-4f5e-ba2b-f75ff6712de9)
Prior to running these scripts, please ensure that the below raw data files (.gz and .zip zipped files okay) have been downloaded and are present in a folder entitled “input_data” inside the same project and directory as the scripts. (The results will be created and stored in another folder inside the same directory entitled “output_data”.)
## SchildkrautB, SchildkrautW, and reference gene list:
### From https://github.com/greenelab/hgsc_characterization:
- /reference_data/ensembl_hgnc_entrez.tsv
@@ -129,3 +130,5 @@ Prior to running these scripts, please ensure that the below raw data files have
- GSM5820686_Hs_SAT_11-1.dge.tsv.gz
#### Metadata
- GSE176171_cell_metadata.tsv.gz
# 4) Running the pipeline:
The pipeline can be run by executing run_pipeline.sh, which will in turn run main.nf, a Nextflow pipeline to execute the scripts in order (except for the optional unzipping script). run_pipeline.sh has configuration profile options to run either locally or on high performance computing (HPC) using Slurm. Be sure to open the file before running and uncomment running the unzipping script if needed.
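
To make the execution order concrete, here is a hedged Python sketch of what "running the scripts in order" amounts to. It is not a replacement for run_pipeline.sh or main.nf; the helper names `build_commands` and `run_in_order` are hypothetical, and only the script file names and the `Rscript --vanilla` invocation are taken from this repository.

```python
import subprocess
from pathlib import Path

# Script names as listed in the README, in execution order.
SCRIPTS = [
    "01_load_renv.R",
    "02_get_data.R",
    "03_get_clustering.R",
    "04_prepare_reference_data.R",
    "05_run_instaprism.R",
    "06_visualize_instaprism_outputs.R",
]


def build_commands(script_dir):
    """Return one Rscript command (as an argument list) per pipeline step."""
    return [["Rscript", "--vanilla", str(Path(script_dir) / name)] for name in SCRIPTS]


def run_in_order(script_dir, project_dir):
    """Run each step from project_dir, stopping at the first failure."""
    for cmd in build_commands(script_dir):
        subprocess.run(cmd, cwd=project_dir, check=True)
```

Because `check=True` raises on a non-zero exit code, a failing step prevents later steps from running, which mirrors the strict ordering the Nextflow workflow enforces.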
16 changes: 16 additions & 0 deletions env_hgsoc.yml
@@ -0,0 +1,16 @@
name: env_hgsoc
channels:
- conda-forge
- bioconda
dependencies:
# ---- core language runtimes ----
- r-base=4.4.1
- r-irkernel
- python=3.10.16
- nextflow # workflow engine
- openjdk # Nextflow runtime
# ---- Python packages ----
- pandas
- numpy
- jupyterlab
- matplotlib
49 changes: 0 additions & 49 deletions environments/env_hgsoc.yml

This file was deleted.

194 changes: 121 additions & 73 deletions main.nf
@@ -1,87 +1,135 @@
#!/usr/bin/env nextflow
/*
* main.nf — run 5 R scripts in strict order
* main.nf — run 6 R scripts in strict order
*
* You already version-lock packages with renv.lock.
* Each script therefore starts with `renv::load()` (or restore),
* Each script therefore starts with `renv::load()`,
* so the only thing we need is the R interpreter on PATH.
*/

workflow {
nextflow.enable.dsl = 2

/*
* 0) Decompress all input data.
*/
process UNZIP {
tag 'unzip_data'
script:
"""
cd ${params.projectDir}
python ${params.scriptDir}/0_unzip_input_data.py ${params.projectDir}
"""
}
/*
* 1) Load the renv
*/
process LOAD_RENV {
tag '01_load_renv'

input:
val dummy_input // Receive the trigger

output:
val true, emit: renv_loaded

script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/01_load_renv.R
"""
}

/*
* 1) Download / tidy raw data
*/
process GET_DATA {
tag '1_get_data'
script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/1_get_data.R
"""
}
/*
* 2) Download / tidy raw data
*/
process GET_DATA {
tag '02_get_data'

input:
val dummy_input

output:
val true, emit: data_ready

script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/02_get_data.R
"""
}

/*
* 2) Cluster single-cell data
*/
process GET_CLUSTERING {
tag '2_get_clustering'
script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/2_get_clustering.R
"""
}
GET_CLUSTERING.after GET_DATA // enforce order
/*
* 3) Cluster bulk datasets
*/
process GET_CLUSTERING {
tag '03_get_clustering'

input:
val dummy_input

output:
val true, emit: clustering_done

script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/03_get_clustering.R
"""
}

/*
* 3) Build reference matrices
*/
process PREP_REF_DATA {
tag '3_prepare_reference_data'
script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/3_prepare_reference_data.R
"""
}
PREP_REF_DATA.after GET_CLUSTERING
/*
* 4) Build reference matrices
*/
process PREP_REF_DATA {
tag '04_prepare_reference_data'

input:
val dummy_input

output:
val true, emit: ref_data_ready

script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/04_prepare_reference_data.R
"""
}

/*
* 4) Deconvolution with InstaPrism
*/
process RUN_INSTAPRISM {
tag '4_run_instaprism'
script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/4_run_instaprism.R
"""
}
RUN_INSTAPRISM.after PREP_REF_DATA
/*
* 5) Deconvolution with InstaPrism
*/
process RUN_INSTAPRISM {
tag '05_run_instaprism'

input:
val dummy_input

output:
val true, emit: instaprism_complete

script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/05_run_instaprism.R
"""
}

/*
* 5) Visualisation step
*/
process VISUALISE {
tag '5_visualize'
script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/5_visualize_instaprism_outputs.R
"""
}
VISUALISE.after RUN_INSTAPRISM
/*
* 6) Visualisation step
*/
process VISUALISE {
tag '06_visualize_instaprism_outputs'

input:
val dummy_input

script:
"""
cd ${params.projectDir}
Rscript --vanilla ${params.scriptDir}/06_visualize_instaprism_outputs.R
"""
// No output needed as this is the last step
}

workflow {
// Create initial trigger channel
trigger_channel = Channel.value(true)

// Run processes in strict order using the output of the previous as input for the next
LOAD_RENV(trigger_channel)
GET_DATA(LOAD_RENV.out.renv_loaded)
GET_CLUSTERING(GET_DATA.out.data_ready)
PREP_REF_DATA(GET_CLUSTERING.out.clustering_done)
RUN_INSTAPRISM(PREP_REF_DATA.out.ref_data_ready)
VISUALISE(RUN_INSTAPRISM.out.instaprism_complete)
}