Skip to content

Commit 41a0492

Browse files
committed
update setup.md
1 parent 49251d6 commit 41a0492

File tree

7 files changed

+204
-2
lines changed

7 files changed

+204
-2
lines changed

.DS_Store

0 Bytes
Binary file not shown.

docs/.DS_Store

0 Bytes
Binary file not shown.
826 KB
Loading
459 KB
Loading

docs/preprocessing.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
title: Data Preprocessing
33
layout: default
4-
nav_order: 2
4+
nav_order: 3
55
---
66

77
**Why is preprocessing required?** The RNA-Seq data you start with could **vary in format** (e.g., FASTQ, BAM, FASTA) and usually **contain noisy sequences** (e.g., adapters leftovers, poor quality bases and other contaminations). So, we need to pre-process these data to generate the standard-in-format, clean-in-sequence FASTQ files which can be directly proceed to quantification analysis.

docs/quantification/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
---
22
title: Quantification
3-
nav_order: 3
3+
nav_order: 4
44
permalink: /docs/quantification/index
55
---

docs/setup.md

Lines changed: 202 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,202 @@
1+
---
2+
title: Pipeline Setup
3+
layout: default
4+
nav_order: 2
5+
---
6+
7+
### Before you start
8+
9+
There are two parts to set up this pipeline:
10+
11+
- **Software installation**: to install all tools required by this pipeline.
12+
- **Reference preparation**: to generate reference files used in this pipeline.
13+
14+
For Yu Lab members, we have set this pipeline up on HPC:
15+
16+
- All the tools have been compiled in one conda environment, which can be launched by:
17+
18+
``` bash
19+
module load conda3/202402
20+
conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
21+
```
22+
23+
- We have generated the files for four reference genomes: hg38, hg19, mm39 and mm10.
24+
25+
26+
| Genome | GENCODE release | Release date | Ensembl release | Path |
27+
| ----------------- | --------------- | ------------ | --------------- | ------------------------------------------------------------ |
28+
| hg38 (GRCh38.p14) | v48 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48 |
29+
| hg19 (GRCh37.p13) | v48lift37* | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg19/gencode.release48 |
30+
| mm39 (GRCm39) | vM37 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37 |
31+
| mm10 (GRCm38.p6) | vM25 | 04.2020** | v100 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm10/gencode.releaseM25 |
32+
33+
*: The updates for the hg19/GRCh37 genome assembly have stopped in 2013. However, gene annotation continue to be updated by mapping the comprehensive gene annotations originally created for the GRCh38/hg38 reference chromosomes onto GRCh37 primary assembly using [gencode-backmap](https://github.com/diekhans/gencode-backmap) .
34+
35+
**: The updates for the mm10/GRCm38 genome assembly and gene annotation have stopped in 2019.
36+
37+
You should run this tutorial only when you want to set up this pipeline locally or complie other reference genome / gene annotation.
38+
39+
### Part I: Software installation
40+
41+
We selcted the tools for this pipeline mainly based on two considerations: 1) they are well-established and widely-used; 2) they can work together with each other. We finally managed to complie all tools in one single conda environment. To install them:
42+
43+
1. Install **conda** if it's not available yet. This can be done by following this [tutorial](https://www.anaconda.com/docs/getting-started/getting-started).
44+
45+
2. Create a conda env for this pipeline:
46+
47+
``` shell
48+
conda create --prefix /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025 python=3.9 r-base=4.4
49+
```
50+
51+
3. Install key software and dependencies:
52+
53+
``` shell
54+
## activate the conda env
55+
conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
56+
57+
## install tools from bioconda channel
58+
conda install -c bioconda rseqc=5.0.4 fastp=1.0.1 biobambam=2.0.185 samtools=1.22.1 fastqc=0.12.1 bedtools=2.31.1 bowtie2=2.5.4 rsem=1.3.3 star=2.7.11b salmon=1.10.3 cutadapt=5.1 htseq=2.0.9 ucsc-genepredtobed=482 ucsc-gtftogenepred=482
59+
60+
## install dependencies from conda-forge
61+
conda install -c conda-forge pandoc=3.7.0.2
62+
63+
## install R packages from r channel
64+
conda install -c r r-rmarkdown=2.29 r-ggplot2=3.5.2 r-dplyr=1.1.4 r-envstats=3.1.0 r-kableextra=1.4.0 r-rjson=0.2.23 r-cowplot=1.2.0
65+
66+
## deactivate conda env
67+
conda deactivate
68+
```
69+
70+
71+
72+
### Part II: Reference preparation
73+
74+
In addition to the tools, you will also need to prepare **reference genomes** for alignment, quantification and QC assessment. Below is a summary of reference preparation (use hg38 as an example):
75+
76+
![Picture](/Users/qpan/Documents/Pipelines/bulkRNAseq_quantification_pipeline/docs/figures/referencePreparation.png)
77+
78+
#### 1. Data collection
79+
80+
The reference preparation stars with FOUR files that can be directly downloaded from websites:
81+
82+
- **Gene Annotation file in [GTF](https://biocorecrg.github.io/PhD_course_genomics_format_2021/gtf_format.html) (Gene Transfer Format) format**: e.g., /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf
83+
84+
- **Genome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa
85+
86+
- **Transcriptome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa`.
87+
88+
***<u>NOTE:</u>*** For the three files above, they are usually available at open-sourced websites. For human and mouse, we recommend [GENCODE](https://www.gencodegenes.org/) to collect them, while for other species, we recommend [Ensembl](https://useast.ensembl.org/info/data/ftp/index.html).
89+
90+
- **HouseKeeping gene list**: the housekeeping genes defined by [this study](https://www.sciencedirect.com/science/article/pii/S0168952513000899?via%3Dihub) (N = 3804), e.g., /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt.
91+
92+
***<u>NOTE:</u>*** For other species, you can generate the housekeeping gene list by gene homology conversion using BiomaRt or other tools. Below is the codes I used to generate the housekeeping genes in mouse:
93+
94+
``` R
95+
library(NetBID2)
96+
97+
HK_hg <- read.table("/Volumes/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt")
98+
HK_mm <- get_IDtransfer_betweenSpecies(
99+
from_spe = "human", to_spe = "mouse", from_type = "hgnc_symbol", to_type = "mgi_symbol",
100+
use_genes = unique(HK_hg$V1));
101+
colnames(HK_mm) <- paste0("#", colnames(HK_mm))
102+
write.table(HK_mm[,c(2,1)], ## Please make sure the mouse gene symbols are in the FIRST column
103+
file = "/Volumes/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37/housekeeping_genes.mouse.txt", col.names = T, row.names = F, sep = "\t", quote = F)
104+
```
105+
106+
107+
108+
#### 2. Parsing annotation file
109+
110+
In this step, we will parse the gene annotation file and generate four files that required in downstream analysis:
111+
112+
- gencode.v48.primary_assembly.annotation.gene2transcript.txt and gencode.v48.primary_assembly.annotation.transcript2gene.txt: the mappings between transcripts and genes. They are required in gene-level quantification.
113+
- gencode.v48.primary_assembly.annotation.geneAnnotation.txt and gencode.v48.primary_assembly.annotation.transcriptAnnotation.txt: the gene or transcript annotation file. They are required in the final gene expression matrix generation.
114+
115+
To make the parsing analysis easiler, we created a script, `parseAnnotation.pl`, that you can easily generate the four files wit h it:
116+
117+
``` bash
118+
## parse the gene anotation file
119+
## This command will generate the four files in the same folder as gencode.v48.primary_assembly.annotation.gtf.
120+
## Only ONE argument is need: gene annotation file in GTF format.
121+
perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/parseAnnotation.pl gencode.v48.primary_assembly.annotation.gtf
122+
```
123+
124+
125+
126+
#### 3. Creating gene body bins
127+
128+
In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required in genebody coverage statistics - an important QC metrics that indicates the extent of RNA degradation.
129+
130+
Two files will be generated:
131+
132+
- ./bulkRNAseq/genebodyBins/genebodyBins_allTranscripts.txt: bins of the longest transcripts of all genes. This one is the most reliable solution since it calculates the gene body coverage across all genes (N = 46,402 for human). However, it's much slower.
133+
- ./bulkRNAseq/genebodyBins/genebodyBins_HouseKeepingTranscripts.txt: bins of the longest transcripts of precurated housekeeping genes. Though it only considers the housekeeping genes (N = 3,515 in human), based on our tests across 30+ datasets, no significant difference of gene coverate statistics was observed compared to the all-transcript version. And it's way faster. This is widely-used in many pipelines, including the [RseQC](https://rseqc.sourceforge.net/#genebody-coverage-py). So, we set it as the default in this pipeline.
134+
135+
You can easily generate these two files with the command below:
136+
137+
``` bash
138+
## create the gene body bins
139+
## This command will generate the two files containing the bin list of the longest transcript of all genes and housekeeping genes.
140+
## Three arguments are needed: transcriptome sequence file in FASTA format, a txt file containiing housekeeping genes in the first column, and a directory to save the output files.
141+
perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/createBins.pl gencode.v48.transcripts.fa housekeeping_genes.human.txt ./bulkRNAseq/genebodyBins
142+
```
143+
144+
145+
146+
#### 4. Create genome index files for RSEM
147+
148+
``` bash
149+
#BSUB -P buildIndex
150+
#BSUB -n 8
151+
#BSUB -M 8000
152+
#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err
153+
#BSUB -J buildIndex
154+
#BSUB -q priority
155+
156+
rsem_index=/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin/rsem-prepare-reference
157+
158+
## Bowtie2-RSEM
159+
mkdir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2
160+
$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --bowtie2 --bowtie2-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2/hg38
161+
162+
## STAR-RSEM
163+
mkdir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star
164+
$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --star --star-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 --star-sjdboverhang 100 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star/hg38
165+
```
166+
167+
168+
169+
#### 5. Create genome index files for Salmon
170+
171+
``` bash
172+
#BSUB -P salmonIndex
173+
#BSUB -n 8
174+
#BSUB -M 8000
175+
#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err
176+
#BSUB -J buildIndex
177+
#BSUB -q standard
178+
179+
## generate a decoy-aware transcriptome
180+
# https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
181+
grep "^>" /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | cut -d ">" -f 2 > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt
182+
183+
cat /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa
184+
185+
salmon index -t /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa -d /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt -p 8 -i index_decoy --gencode -k 31
186+
```
187+
188+
189+
190+
#### 6. Create genome index files for STAR
191+
192+
``` bash
193+
#BSUB -P STAR_Index
194+
#BSUB -n 8
195+
#BSUB -M 8000
196+
#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err
197+
#BSUB -J buildIndex
198+
#BSUB -q standard
199+
200+
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/STAR/index_overhang100 --genomeFastaFiles /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --sjdbOverhang 100
201+
```
202+

0 commit comments

Comments
 (0)