|
| 1 | +--- |
| 2 | +title: Pipeline Setup |
| 3 | +layout: default |
| 4 | +nav_order: 2 |
| 5 | +--- |
| 6 | + |
| 7 | +### Before you start |
| 8 | + |
| 9 | +There are two parts to set up this pipeline: |
| 10 | + |
| 11 | +- **Software installation**: to install all tools required by this pipeline. |
| 12 | +- **Reference preparation**: to generate reference files used in this pipeline. |
| 13 | + |
| 14 | +For Yu Lab members, we have set this pipeline up on HPC: |
| 15 | + |
| 16 | +- All the tools have been compiled in one conda environment, which can be launched by: |
| 17 | + |
| 18 | + ``` bash |
| 19 | + module load conda3/202402 |
| 20 | + conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025 |
| 21 | + ``` |
| 22 | + |
| 23 | +- We have generated the files for four reference genomes: hg38, hg19, mm39 and mm10. |
| 24 | + |
| 25 | + |
| 26 | + | Genome | GENCODE release | Release date | Ensembl release | Path | |
| 27 | + | ----------------- | --------------- | ------------ | --------------- | ------------------------------------------------------------ | |
| 28 | + | hg38 (GRCh38.p14) | v48 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48 | |
| 29 | + | hg19 (GRCh37.p13) | v48lift37* | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg19/gencode.release48 | |
| 30 | + | mm39 (GRCm39) | vM37 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37 | |
| 31 | + | mm10 (GRCm38.p6) | vM25 | 04.2020** | v100 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm10/gencode.releaseM25 | |
| 32 | + |
| 33 | + *: The updates for the hg19/GRCh37 genome assembly have stopped in 2013. However, gene annotation continue to be updated by mapping the comprehensive gene annotations originally created for the GRCh38/hg38 reference chromosomes onto GRCh37 primary assembly using [gencode-backmap](https://github.com/diekhans/gencode-backmap) . |
| 34 | + |
| 35 | + **: The updates for the mm10/GRCm38 genome assembly and gene annotation have stopped in 2019. |
| 36 | + |
| 37 | +You should run this tutorial only when you want to set up this pipeline locally or complie other reference genome / gene annotation. |
| 38 | + |
| 39 | +### Part I: Software installation |
| 40 | + |
| 41 | +We selcted the tools for this pipeline mainly based on two considerations: 1) they are well-established and widely-used; 2) they can work together with each other. We finally managed to complie all tools in one single conda environment. To install them: |
| 42 | + |
| 43 | +1. Install **conda** if it's not available yet. This can be done by following this [tutorial](https://www.anaconda.com/docs/getting-started/getting-started). |
| 44 | + |
| 45 | +2. Create a conda env for this pipeline: |
| 46 | + |
| 47 | + ``` shell |
| 48 | + conda create --prefix /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025 python=3.9 r-base=4.4 |
| 49 | + ``` |
| 50 | + |
| 51 | +3. Install key software and dependencies: |
| 52 | + |
| 53 | + ``` shell |
| 54 | + ## activate the conda env |
| 55 | + conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025 |
| 56 | + |
| 57 | + ## install tools from bioconda channel |
| 58 | + conda install -c bioconda rseqc=5.0.4 fastp=1.0.1 biobambam=2.0.185 samtools=1.22.1 fastqc=0.12.1 bedtools=2.31.1 bowtie2=2.5.4 rsem=1.3.3 star=2.7.11b salmon=1.10.3 cutadapt=5.1 htseq=2.0.9 ucsc-genepredtobed=482 ucsc-gtftogenepred=482 |
| 59 | + |
| 60 | + ## install dependencies from conda-forge |
| 61 | + conda install -c conda-forge pandoc=3.7.0.2 |
| 62 | + |
| 63 | + ## install R packages from r channel |
| 64 | + conda install -c r r-rmarkdown=2.29 r-ggplot2=3.5.2 r-dplyr=1.1.4 r-envstats=3.1.0 r-kableextra=1.4.0 r-rjson=0.2.23 r-cowplot=1.2.0 |
| 65 | + |
| 66 | + ## deactivate conda env |
| 67 | + conda deactivate |
| 68 | + ``` |
| 69 | + |
| 70 | + |
| 71 | + |
| 72 | +### Part II: Reference preparation |
| 73 | + |
| 74 | +In addition to the tools, you will also need to prepare **reference genomes** for alignment, quantification and QC assessment. Below is a summary of reference preparation (use hg38 as an example): |
| 75 | + |
| 76 | + |
| 77 | + |
| 78 | +#### 1. Data collection |
| 79 | + |
| 80 | +The reference preparation stars with FOUR files that can be directly downloaded from websites: |
| 81 | + |
| 82 | +- **Gene Annotation file in [GTF](https://biocorecrg.github.io/PhD_course_genomics_format_2021/gtf_format.html) (Gene Transfer Format) format**: e.g., /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf |
| 83 | + |
| 84 | +- **Genome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa |
| 85 | + |
| 86 | +- **Transcriptome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa`. |
| 87 | + |
| 88 | + ***<u>NOTE:</u>*** For the three files above, they are usually available at open-sourced websites. For human and mouse, we recommend [GENCODE](https://www.gencodegenes.org/) to collect them, while for other species, we recommend [Ensembl](https://useast.ensembl.org/info/data/ftp/index.html). |
| 89 | + |
| 90 | +- **HouseKeeping gene list**: the housekeeping genes defined by [this study](https://www.sciencedirect.com/science/article/pii/S0168952513000899?via%3Dihub) (N = 3804), e.g., /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt. |
| 91 | + |
| 92 | + ***<u>NOTE:</u>*** For other species, you can generate the housekeeping gene list by gene homology conversion using BiomaRt or other tools. Below is the codes I used to generate the housekeeping genes in mouse: |
| 93 | + |
| 94 | + ``` R |
| 95 | + library(NetBID2) |
| 96 | + |
| 97 | + HK_hg <- read.table("/Volumes/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt") |
| 98 | + HK_mm <- get_IDtransfer_betweenSpecies( |
| 99 | + from_spe = "human", to_spe = "mouse", from_type = "hgnc_symbol", to_type = "mgi_symbol", |
| 100 | + use_genes = unique(HK_hg$V1)); |
| 101 | + colnames(HK_mm) <- paste0("#", colnames(HK_mm)) |
| 102 | + write.table(HK_mm[,c(2,1)], ## Please make sure the mouse gene symbols are in the FIRST column |
| 103 | + file = "/Volumes/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37/housekeeping_genes.mouse.txt", col.names = T, row.names = F, sep = "\t", quote = F) |
| 104 | + ``` |
| 105 | + |
| 106 | + |
| 107 | + |
| 108 | +#### 2. Parsing annotation file |
| 109 | + |
| 110 | +In this step, we will parse the gene annotation file and generate four files that required in downstream analysis: |
| 111 | + |
| 112 | +- gencode.v48.primary_assembly.annotation.gene2transcript.txt and gencode.v48.primary_assembly.annotation.transcript2gene.txt: the mappings between transcripts and genes. They are required in gene-level quantification. |
| 113 | +- gencode.v48.primary_assembly.annotation.geneAnnotation.txt and gencode.v48.primary_assembly.annotation.transcriptAnnotation.txt: the gene or transcript annotation file. They are required in the final gene expression matrix generation. |
| 114 | + |
| 115 | +To make the parsing analysis easiler, we created a script, `parseAnnotation.pl`, that you can easily generate the four files wit h it: |
| 116 | + |
| 117 | +``` bash |
| 118 | +## parse the gene anotation file |
| 119 | +## This command will generate the four files in the same folder as gencode.v48.primary_assembly.annotation.gtf. |
| 120 | +## Only ONE argument is need: gene annotation file in GTF format. |
| 121 | +perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/parseAnnotation.pl gencode.v48.primary_assembly.annotation.gtf |
| 122 | +``` |
| 123 | + |
| 124 | + |
| 125 | + |
| 126 | +#### 3. Creating gene body bins |
| 127 | + |
| 128 | +In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required in genebody coverage statistics - an important QC metrics that indicates the extent of RNA degradation. |
| 129 | + |
| 130 | +Two files will be generated: |
| 131 | + |
| 132 | +- ./bulkRNAseq/genebodyBins/genebodyBins_allTranscripts.txt: bins of the longest transcripts of all genes. This one is the most reliable solution since it calculates the gene body coverage across all genes (N = 46,402 for human). However, it's much slower. |
| 133 | +- ./bulkRNAseq/genebodyBins/genebodyBins_HouseKeepingTranscripts.txt: bins of the longest transcripts of precurated housekeeping genes. Though it only considers the housekeeping genes (N = 3,515 in human), based on our tests across 30+ datasets, no significant difference of gene coverate statistics was observed compared to the all-transcript version. And it's way faster. This is widely-used in many pipelines, including the [RseQC](https://rseqc.sourceforge.net/#genebody-coverage-py). So, we set it as the default in this pipeline. |
| 134 | + |
| 135 | +You can easily generate these two files with the command below: |
| 136 | + |
| 137 | +``` bash |
| 138 | +## create the gene body bins |
| 139 | +## This command will generate the two files containing the bin list of the longest transcript of all genes and housekeeping genes. |
| 140 | +## Three arguments are needed: transcriptome sequence file in FASTA format, a txt file containiing housekeeping genes in the first column, and a directory to save the output files. |
| 141 | +perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/createBins.pl gencode.v48.transcripts.fa housekeeping_genes.human.txt ./bulkRNAseq/genebodyBins |
| 142 | +``` |
| 143 | + |
| 144 | + |
| 145 | + |
| 146 | +#### 4. Create genome index files for RSEM |
| 147 | + |
| 148 | +``` bash |
| 149 | +#BSUB -P buildIndex |
| 150 | +#BSUB -n 8 |
| 151 | +#BSUB -M 8000 |
| 152 | +#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err |
| 153 | +#BSUB -J buildIndex |
| 154 | +#BSUB -q priority |
| 155 | + |
| 156 | +rsem_index=/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin/rsem-prepare-reference |
| 157 | + |
| 158 | +## Bowtie2-RSEM |
| 159 | +mkdir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2 |
| 160 | +$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --bowtie2 --bowtie2-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2/hg38 |
| 161 | + |
| 162 | +## STAR-RSEM |
| 163 | +mkdir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star |
| 164 | +$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --star --star-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 --star-sjdboverhang 100 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star/hg38 |
| 165 | +``` |
| 166 | + |
| 167 | + |
| 168 | + |
| 169 | +#### 5. Create genome index files for Salmon |
| 170 | + |
| 171 | +``` bash |
| 172 | +#BSUB -P salmonIndex |
| 173 | +#BSUB -n 8 |
| 174 | +#BSUB -M 8000 |
| 175 | +#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err |
| 176 | +#BSUB -J buildIndex |
| 177 | +#BSUB -q standard |
| 178 | + |
| 179 | +## generate a decoy-aware transcriptome |
| 180 | +# https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ |
| 181 | +grep "^>" /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | cut -d ">" -f 2 > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt |
| 182 | + |
| 183 | +cat /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa |
| 184 | + |
| 185 | +salmon index -t /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa -d /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt -p 8 -i index_decoy --gencode -k 31 |
| 186 | +``` |
| 187 | + |
| 188 | + |
| 189 | + |
| 190 | +#### 6. Create genome index files for STAR |
| 191 | + |
| 192 | +``` bash |
| 193 | +#BSUB -P STAR_Index |
| 194 | +#BSUB -n 8 |
| 195 | +#BSUB -M 8000 |
| 196 | +#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err |
| 197 | +#BSUB -J buildIndex |
| 198 | +#BSUB -q standard |
| 199 | + |
| 200 | +STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/STAR/index_overhang100 --genomeFastaFiles /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --sjdbOverhang 100 |
| 201 | +``` |
| 202 | + |
0 commit comments