greenelab · graceakatsu · Jun 16, 2025 · Jun 16, 2025 · Jun 17, 2025 · Jun 17, 2025
diff --git a/.Rprofile b/.Rprofile
@@ -0,0 +1 @@
+source("renv/activate.R")
diff --git a/.gitignore b/.gitignore
@@ -9,6 +9,10 @@
 *.user
 *.userosscache
 *.sln.docstates
+*.nextflow/
+*scripts/.Rhistory
+*.Rhistory
+*.nextflow.log.*
 
 # User-specific files (MonoDevelop/Xamarin Studio)
 *.userprefs
@@ -398,3 +402,8 @@ FodyWeavers.xsd
 
 # JetBrains Rider
 *.sln.iml
+.DS_Store
+
+# All input_data, except for intersecting genes csv file
+/input_data/*
+!/input_data/intersect_3d.csv
diff --git a/README.md b/README.md
@@ -2,6 +2,7 @@
 1)	Introduction
 2)	Scripts
 3)	Input and output data preparation and organization
+4)  Running the pipeline
 # 1) Introduction:
 The purpose of the code in this repository is to use the InstaPrism R package (https://github.com/humengying0907/InstaPrism/tree/master) to run Bayes Prism deconvolution on the following bulk RNA seq and microarray datasets of high grade serous ovarian carcinoma (HGSOC) samples:
 - “SchildkrautB” – bulk RNA sequencing of HGSOC from Black patients.
@@ -13,23 +14,23 @@ The purpose of the code in this repository is to use the InstaPrism R package (h
 
 Bayes Prism requires a single cell reference dataset to perform deconvolution. However, it has been previously shown that certain cell types, notably adipocytes, are present in bulk tumor samples but largely absent from single cell RNA sequencing results (https://www.biorxiv.org/content/10.1101/2024.04.25.590992v1). However, adipocytes can be captured using single nucleus RNA sequencing. Here, we incorporate single nucleus RNA sequencing data from adipocytes, in addition to single cell RNA sequencing data of HGSOC, in deconvolution of bulk HGSOC RNA sequencing and microarray data. 
 # 2) Scripts:
-Prior to running these scripts, please download the required data as outlined below in “Input and output data preparation and organization.” Please also ensure that the renv folder has been downloaded and is in the same directory as the scripts. These scripts are intended to be run in the following order:
-### (Optional) unzip_input_data.py
-To uncompress all files in the input data requiered, if not done manually. Exmaple: python unzip_input_data.py path/to/input_data/
-## 1_get_data.R
+Prior to running this pipeline, please download the required data as outlined below in “Input and output data preparation and organization.” 
+## (OPTIONAL) 00_unzip_input_data.R
+This script checks input_data for compressed .gz or .zip files, then decompresses and formats any it finds. It does not run automatically: the call is commented out in the shell script, so uncomment it if needed.
+## 01_load_renv.R
+This script loads the environment using the renv lockfile.
+## 02_get_data.R
 This script reads the bulk RNA sequencing and microarray datasets and filters them to only include genes present in one common gene mapping list. It transforms the microarray data using 2^(...) to match the scale of the bulk RNA sequencing data values  – this is used for InstaPrism deconvolution. It also transforms the bulk RNA sequencing data using log10(...+1) to match the scale of the microarray data – this is used for clustering. All of these matrices are saved in a uniform format containing on sample ID (rows) and genes (columns); file names are appended with either “asImported” or “transformed.” It also saves any metadata information (ex. prior clustering) about the samples in a separate file for reference.
-## 2_get_clustering.R
+## 03_get_clustering.R
 This script performs k-means clustering (k=2,3,4), NMF clustering (k=2,3,4), and consensusOV subtyping (https://github.com/bhklab/consensusOV) for each bulk dataset. For k-means clustering, it uses log10(...+1) transformed data (for RNAseq), and raw log2 data (for microarray). For NMF clustering and consensusOV subtyping, it uses raw counts (for RNAseq), and 2^(...) scaled "pseudocounts" data (for microarray). It saves the results in one csv file per dataset, where each row corresponds to one sample. It also saves a csv file containing the results for all datasets.
-## 3_prepare_reference_data.R
+## 04_prepare_reference_data.R
 This script creates a single-cell/single-nucleus reference matrix for use as input into InstaPrism. It requires single cell RNA sequencing data of HGSOC (from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE217517) and cell type labels for those cells (from https://github.com/greenelab/deconvolution_pilot/tree/main/data/cell_labels). It also requires single nucleus RNA sequencing data of adipocytes (from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176171). Please refer to the “Input and output data preparation and organization” subsection of README for specific files necessary. It reads in the HGSOC single cell and adipocyte single nucleus RNA sequencing data and performs some pre-processing on the adipocyte data (removing duplicate samples, removing non-adipocyte samples, removing samples with many mitochondrial gene reads, and using Seurat to remove low-quality nuclei, empty droplets, and nuclei doublets/multiplets). It then combines the single cell and single nucleus data to generate an expression matrix where each row corresponds to a gene (GeneCards symbols) and each column corresponds to a sample (each assigned a unique numerical ID). It also generates a cell type file which serves as a key, and associates each sample ID to its cell type. 
-## 4_run_instaprism.R
-This script runs InstaPrism, an R package that performs deconvolution with a similar but faster method to BayesPrism. First, it loads in the reference single cell plus single nucleus RNA sequencing reference dataset created in the previous script. Since the reference dataset is so large, it randomly selects 500 of each cell type to use. It then generates and saves two reference objects for input into InstaPrism, one with adipocytes and one without adipocytes. It runs InstaPrism twice on each of the six bulk datasets, both with and without the adipocytes in the reference data. Of note, Instaprism requires non-log-transformed bulk data, so it is performed on the original, non-transformed data for the bulk RNA sequencing datasets and on the 2^(…) transformed data for the microarray datasets.
-## 5_visualize_instaprism_outputs.R
+## 05_run_instaprism.R
+This script runs InstaPrism, an R package that performs deconvolution using a method similar to, but faster than, BayesPrism. First, it loads the combined single cell plus single nucleus RNA sequencing reference dataset created in the previous script. It then removes a set of previously identified genes to mitigate technical factors unique to single nucleus RNA sequencing. Since the reference dataset is so large, it also randomly selects 500 cells of each cell type to use. It then generates and saves two reference objects for input into InstaPrism, one with adipocytes and one without adipocytes. It runs InstaPrism twice on each of the six bulk datasets, once with and once without the adipocytes in the reference data. Of note, InstaPrism requires non-log-transformed bulk data, so it is run on the original, non-transformed data for the bulk RNA sequencing datasets and on the 2^(…) transformed data for the microarray datasets.
+## 06_visualize_instaprism_outputs.R
 This script creates several figures to visualize the deconvolution results generated in the previous script, and compare the results when run with and without adipocyte single nucleus RNA sequencing data in the reference data. It creates 100% stacked bar charts to visualize the total cell proportions per bulk dataset, 100% stacked bar charts showing cell proportions per sample in each dataset, and bar charts showing the absolute change of cell type proportions in total per dataset.
 # 3) Input and output data preparation and organization:
-Prior to running these scripts, please ensure that the below raw data files have been downloaded and are present in a folder entitled “input_data” inside the same project and directory as the scripts. (The results will be created and stored in another folder inside the same directory entitled “output_data”.) Sample directory contents:
-
-![Screenshot 2025-02-25 at 2 28 25 PM](https://github.com/user-attachments/assets/b44cfc2a-8242-4f5e-ba2b-f75ff6712de9)
+Prior to running these scripts, please ensure that the raw data files listed below (compressed .gz and .zip files are fine) have been downloaded and placed in a folder named “input_data” in the same project directory as the scripts. (The results will be written to a folder named “output_data” in that same directory.)
 ## SchildkrautB, SchildkrautW, and reference gene list:
 ### From https://github.com/greenelab/hgsc_characterization:
 - /reference_data/ensembl_hgnc_entrez.tsv
@@ -129,3 +130,5 @@ Prior to running these scripts, please ensure that the below raw data files have
 - GSM5820686_Hs_SAT_11-1.dge.tsv.gz
 #### Metadata
 - GSE176171_cell_metadata.tsv.gz
+# 4) Running the pipeline:
+The pipeline can be run by executing run_pipeline.sh, which in turn runs main.nf, a Nextflow pipeline that executes the scripts in order (except for the optional unzipping script). run_pipeline.sh has configuration profile options to run either locally or on a high performance computing (HPC) cluster using Slurm. If your input data is still compressed, open run_pipeline.sh before running and uncomment the call to the unzipping script.
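
run_pipeline.sh itself is not part of this diff, so the following is only a minimal sketch of what such a launcher might do. The profile names `local` and `slurm`, the `scripts/` subfolder, and the helper name `build_nextflow_cmd` are assumptions for illustration. `-profile` and the `--<param>` overrides are standard Nextflow command-line syntax, and `projectDir` and `scriptDir` are the two params that main.nf reads.

```shell
#!/usr/bin/env bash
# Hypothetical launcher sketch; the real run_pipeline.sh is not shown in this diff.

# Print the Nextflow command for a given profile and project directory.
# Assumptions: profiles are named "local" and "slurm", and the R scripts
# live in a "scripts" subfolder of the project directory.
build_nextflow_cmd() {
    local profile="$1"
    local project_dir="$2"
    case "$profile" in
        local|slurm) ;;
        *) echo "unknown profile: ${profile}" >&2; return 1 ;;
    esac
    # Paths are printed unquoted, so this is for display rather than eval.
    printf 'nextflow run main.nf -profile %s --projectDir %s --scriptDir %s/scripts\n' \
        "$profile" "$project_dir" "$project_dir"
}
```

Calling `build_nextflow_cmd slurm "$PWD"` prints the command rather than running it, which makes it easy to check the profile and paths before submitting a long job.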
diff --git a/env_hgsoc.yml b/env_hgsoc.yml
@@ -0,0 +1,16 @@
+name: env_hgsoc
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  # ---- core language runtimes ----
+  - r-base=4.4.1
+  - r-irkernel                
+  - python=3.10.16
+  - nextflow                   # workflow engine
+  - openjdk                    # Nextflow runtime
+  # ---- Python packages ----
+  - pandas
+  - numpy
+  - jupyterlab
+  - matplotlib
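
The header comment in main.nf says the pipeline only needs the R interpreter on PATH, and this environment file is what supplies it along with Nextflow and its Java runtime. A small pre-flight check, sketched here under the assumption that the environment is created with the standard `conda env create` command, can confirm the tools are visible before launching. The helper name `check_tools` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumed one-time setup, run beforehand:
#   conda env create -f env_hgsoc.yml
#   conda activate env_hgsoc

# Return 0 only if every named tool is on PATH; report missing ones on stderr.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing from PATH: ${tool}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this environment, a call such as `check_tools Rscript nextflow java python` would cover the runtimes listed in the YAML file.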
diff --git a/environments/env_hgsoc.yml b/environments/env_hgsoc.yml
diff --git a/main.nf b/main.nf
@@ -1,87 +1,135 @@
 #!/usr/bin/env nextflow
 /*
- * main.nf — run 5 R scripts in strict order
+ * main.nf — run 6 R scripts in strict order
  *
  * You already version-lock packages with renv.lock.
- * Each script therefore starts with `renv::load()` (or restore),
+ * Each script therefore starts with `renv::load()`,
  * so the only thing we need is the R interpreter on PATH.
  */
 
-workflow {
+nextflow.enable.dsl = 2
 
-    /*
-     * 0) Decompress all input data.
-     */
-    process UNZIP {
-        tag 'unzip_data'
-        script:
-        """
-        cd ${params.projectDir}
-        python ${params.scriptDir}/0_unzip_input_data.py ${params.projectDir}
-        """
-    }
+/*
+ * 1) Load the renv
+ */
+process LOAD_RENV {
+    tag '01_load_renv'
+
+    input:
+    val dummy_input // Receive the trigger
+
+    output:
+    val true, emit: renv_loaded
+
+    script:
+    """
+    cd ${params.projectDir}
+    Rscript --vanilla ${params.scriptDir}/01_load_renv.R
+    """
+}
 
-    /*
-     * 1) Download / tidy raw data
-     */
-    process GET_DATA {
-        tag '1_get_data'
-        script:
-        """
-        cd ${params.projectDir}
-        Rscript --vanilla ${params.scriptDir}/1_get_data.R
-        """
-    }
+/*
+ * 2) Read / tidy raw data
+ */
+process GET_DATA {
+    tag '02_get_data'
+
+    input:
+    val dummy_input
+
+    output:
+    val true, emit: data_ready
+
+    script:
+    """
+    cd ${params.projectDir}
+    Rscript --vanilla ${params.scriptDir}/02_get_data.R
+    """
+}
 
-    /*
-     * 2) Cluster single-cell data
-     */
-    process GET_CLUSTERING {
-        tag '2_get_clustering'
-        script:
-        """
-        cd ${params.projectDir}
-        Rscript --vanilla ${params.scriptDir}/2_get_clustering.R
-        """
-    }
-    GET_CLUSTERING.after GET_DATA        // enforce order
+/*
+ * 3) Cluster bulk datasets
+ */
+process GET_CLUSTERING {
+    tag '03_get_clustering'
+
+    input:
+    val dummy_input
+
+    output:
+    val true, emit: clustering_done
+
+    script:
+    """
+    cd ${params.projectDir}
+    Rscript --vanilla ${params.scriptDir}/03_get_clustering.R
+    """
+}
 
-    /*
-     * 3) Build reference matrices
-     */
-    process PREP_REF_DATA {
-        tag '3_prepare_reference_data'
-        script:
-        """
-        cd ${params.projectDir}
-        Rscript --vanilla ${params.scriptDir}/3_prepare_reference_data.R
-        """
-    }
-    PREP_REF_DATA.after GET_CLUSTERING
+/*
+ * 4) Build reference matrices
+ */
+process PREP_REF_DATA {
+    tag '04_prepare_reference_data'
+
+    input:
+    val dummy_input
+
+    output:
+    val true, emit: ref_data_ready
+
+    script:
+    """
+    cd ${params.projectDir}
+    Rscript --vanilla ${params.scriptDir}/04_prepare_reference_data.R
+    """
+}
 
-    /*
-     * 4) Deconvolution with InstaPrism
-     */
-    process RUN_INSTAPRISM {
-        tag '4_run_instaprism'
-        script:
-        """
-        cd ${params.projectDir}
-        Rscript --vanilla ${params.scriptDir}/4_run_instaprism.R
-        """
-    }
-    RUN_INSTAPRISM.after PREP_REF_DATA
+/*
+ * 5) Deconvolution with InstaPrism
+ */
+process RUN_INSTAPRISM {
+    tag '05_run_instaprism'
+
+    input:
+    val dummy_input
+
+    output:
+    val true, emit: instaprism_complete
+
+    script:
+    """
+    cd ${params.projectDir}
+    Rscript --vanilla ${params.scriptDir}/05_run_instaprism.R
+    """
+}
 
-    /*
-     * 5) Visualisation step
-     */
-    process VISUALISE {
-        tag '5_visualize'
-        script:
-        """
-        cd ${params.projectDir}
-        Rscript --vanilla ${params.scriptDir}/5_visualize_instaprism_outputs.R
-        """
-    }
-    VISUALISE.after RUN_INSTAPRISM
+/*
+ * 6) Visualisation step
+ */
+process VISUALISE {
+    tag '06_visualize_instaprism_outputs'
+
+    input:
+    val dummy_input
+
+    script:
+    """
+    cd ${params.projectDir}
+    Rscript --vanilla ${params.scriptDir}/06_visualize_instaprism_outputs.R
+    """
+    // No output needed as this is the last step
 }
+
+workflow {
+    // Create initial trigger channel
+    trigger_channel = Channel.value(true)
+
+    // Run processes in strict order using the output of the previous as input for the next
+    LOAD_RENV(trigger_channel)
+    GET_DATA(LOAD_RENV.out.renv_loaded)
+    GET_CLUSTERING(GET_DATA.out.data_ready)
+    PREP_REF_DATA(GET_CLUSTERING.out.clustering_done)
+    RUN_INSTAPRISM(PREP_REF_DATA.out.ref_data_ready)
+    VISUALISE(RUN_INSTAPRISM.out.instaprism_complete)
+}
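
The workflow block enforces ordering by feeding each process a dummy `true` value emitted by the previous one. For comparison, the same strict-order behaviour can be sketched as a plain sequential loop that stops at the first failing step. This is not how the PR runs the pipeline, only an illustration of the semantics; `run_in_order` and the `RUNNER` override are hypothetical names.

```shell
#!/usr/bin/env bash
# Sequential equivalent of the workflow block: each step starts only if the
# previous one succeeded. RUNNER is a hypothetical override for dry runs;
# it defaults to the command used in main.nf.
run_in_order() {
    local script_dir="$1"
    shift
    local step
    for step in "$@"; do
        echo "running ${step}"
        # Deliberately unquoted so "Rscript --vanilla" splits into command and flag.
        ${RUNNER:-Rscript --vanilla} "${script_dir}/${step}" || {
            echo "failed at ${step}" >&2
            return 1
        }
    done
}
```

A call such as `run_in_order scripts 01_load_renv.R 02_get_data.R 03_get_clustering.R` would run those steps one after another. Unlike Nextflow, this loop offers no resume, logging, or Slurm submission.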