Commit e2ba264

committed
reorganize setup
1 parent e07015e commit e2ba264

3 files changed

+3
-436
lines changed

docs/pipeline_setup/reference.md

Lines changed: 2 additions & 67 deletions
@@ -1,80 +1,15 @@
11
---
2-
title: Reference Preparation
2+
title: II-Reference Preparation
33
layout: default
44
nav_order: 2
55
parent: Pipeline Setup
66
---
77

8-
### Before you start
9-
10-
There are two parts to set up this pipeline:
11-
12-
- **Software installation**: to install all tools required by this pipeline.
13-
- **Reference preparation**: to generate reference files used in this pipeline.
14-
15-
For Yu Lab members, we have set this pipeline up on HPC:
16-
17-
- All the tools have been compiled in one conda environment, which can be launched by:
18-
19-
``` bash
20-
module load conda3/202402
21-
conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
22-
```
23-
24-
- We have generated the files for four reference genomes: hg38, hg19, mm39 and mm10.
25-
26-
27-
| Genome | GENCODE release | Release date | Ensembl release | Path |
28-
| ----------------- | --------------- | ------------ | --------------- | ------------------------------------------------------------ |
29-
| hg38 (GRCh38.p14) | v48 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48 |
30-
| hg19 (GRCh37.p13) | v48lift37* | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg19/gencode.release48 |
31-
| mm39 (GRCm39) | vM37 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37 |
32-
| mm10 (GRCm38.p6) | vM25 | 04.2020** | v100 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm10/gencode.releaseM25 |
33-
34-
*: Updates to the hg19/GRCh37 genome assembly stopped in 2013. However, the gene annotation continues to be updated by mapping the comprehensive gene annotations originally created for the GRCh38/hg38 reference chromosomes onto the GRCh37 primary assembly using [gencode-backmap](https://github.com/diekhans/gencode-backmap).
35-
36-
**: The updates for the mm10/GRCm38 genome assembly and gene annotation have stopped in 2019.
37-
38-
You should run this tutorial only when you want to set up this pipeline locally or compile another reference genome / gene annotation.
39-
40-
### Part I: Software installation
41-
42-
We selected the tools for this pipeline mainly based on two considerations: 1) they are well established and widely used; 2) they work well with each other. We managed to compile all tools in one single conda environment. To install them:
43-
44-
1. Install **conda** if it's not available yet. This can be done by following this [tutorial](https://www.anaconda.com/docs/getting-started/getting-started).
45-
46-
2. Create a conda env for this pipeline:
47-
48-
``` shell
49-
conda create --prefix /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025 python=3.9 r-base=4.4
50-
```
51-
52-
3. Install key software and dependencies:
53-
54-
``` shell
55-
## activate the conda env
56-
conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
57-
58-
## install tools from bioconda channel
59-
conda install -c bioconda rseqc=5.0.4 fastp=1.0.1 biobambam=2.0.185 samtools=1.22.1 fastqc=0.12.1 bedtools=2.31.1 bowtie2=2.5.4 rsem=1.3.3 star=2.7.11b salmon=1.10.3 cutadapt=5.1 htseq=2.0.9 ucsc-genepredtobed=482 ucsc-gtftogenepred=482
60-
61-
## install dependencies from conda-forge
62-
conda install -c conda-forge pandoc=3.7.0.2
63-
64-
## install R packages from r channel
65-
conda install -c r r-rmarkdown=2.29 r-ggplot2=3.5.2 r-dplyr=1.1.4 r-envstats=3.1.0 r-kableextra=1.4.0 r-rjson=0.2.23 r-cowplot=1.2.0
66-
67-
## deactivate conda env
68-
conda deactivate
69-
```
70-
71-
72-
738
### Part II: Reference preparation
749

7510
In addition to the tools, you will also need to prepare **reference genomes** for alignment, quantification and QC assessment. Below is a summary of reference preparation (using hg38 as an example):
7611

77-
![Picture](/Users/qpan/Documents/Pipelines/bulkRNAseq_quantification_pipeline/docs/figures/referencePreparation.png)
12+
![Picture](docs/figures/referencePreparation.png)
7813

7914
#### 1. Data collection
8015

docs/pipeline_setup/software.md

Lines changed: 1 addition & 167 deletions
@@ -1,42 +1,10 @@
11
---
2-
title: Software Installation
2+
title: I-Software Installation
33
layout: default
44
nav_order: 1
55
parent: Pipeline Setup
66
---
77

8-
### Before you start
9-
10-
There are two parts to set up this pipeline:
11-
12-
- **Software installation**: to install all tools required by this pipeline.
13-
- **Reference preparation**: to generate reference files used in this pipeline.
14-
15-
For Yu Lab members, we have set this pipeline up on HPC:
16-
17-
- All the tools have been compiled in one conda environment, which can be launched by:
18-
19-
``` bash
20-
module load conda3/202402
21-
conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
22-
```
23-
24-
- We have generated the files for four reference genomes: hg38, hg19, mm39 and mm10.
25-
26-
27-
| Genome | GENCODE release | Release date | Ensembl release | Path |
28-
| ----------------- | --------------- | ------------ | --------------- | ------------------------------------------------------------ |
29-
| hg38 (GRCh38.p14) | v48 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48 |
30-
| hg19 (GRCh37.p13) | v48lift37* | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg19/gencode.release48 |
31-
| mm39 (GRCm39) | vM37 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37 |
32-
| mm10 (GRCm38.p6) | vM25 | 04.2020** | v100 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm10/gencode.releaseM25 |
33-
34-
*: Updates to the hg19/GRCh37 genome assembly stopped in 2013. However, the gene annotation continues to be updated by mapping the comprehensive gene annotations originally created for the GRCh38/hg38 reference chromosomes onto the GRCh37 primary assembly using [gencode-backmap](https://github.com/diekhans/gencode-backmap).
35-
36-
**: The updates for the mm10/GRCm38 genome assembly and gene annotation have stopped in 2019.
37-
38-
You should run this tutorial only when you want to set up this pipeline locally or compile another reference genome / gene annotation.
39-
408
### Part I: Software installation
419

4210
We selected the tools for this pipeline mainly based on two considerations: 1) they are well established and widely used; 2) they work well with each other. We managed to compile all tools in one single conda environment. To install them:
@@ -67,137 +35,3 @@ We selcted the tools for this pipeline mainly based on two considerations: 1) th
6735
## deactivate conda env
6836
conda deactivate
6937
```
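
Once the environment is built, a quick check that the main executables resolve on `PATH` can catch a partially failed install. The helper below is a minimal sketch; `check_tools` is a hypothetical name and is not part of this pipeline.

```bash
# Report any of the given executables that are not found on PATH.
# Prints the missing names (one per line) and returns non-zero if any are missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

After activating the conda env, `check_tools fastp samtools bowtie2 STAR salmon rsem-prepare-reference` prints nothing when every tool is found.
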
70-
71-
72-
73-
### Part II: Reference preparation
74-
75-
In addition to the tools, you will also need to prepare **reference genomes** for alignment, quantification and QC assessment. Below is a summary of reference preparation (using hg38 as an example):
76-
77-
![Picture](/Users/qpan/Documents/Pipelines/bulkRNAseq_quantification_pipeline/docs/figures/referencePreparation.png)
78-
79-
#### 1. Data collection
80-
81-
The reference preparation starts with FOUR files that can be directly downloaded from websites:
82-
83-
- **Gene annotation file in [GTF](https://biocorecrg.github.io/PhD_course_genomics_format_2021/gtf_format.html) (Gene Transfer Format) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf`
84-
85-
- **Genome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa`
86-
87-
- **Transcriptome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa`.
88-
89-
***<u>NOTE:</u>*** The three files above are usually available from public databases. For human and mouse, we recommend downloading them from [GENCODE](https://www.gencodegenes.org/); for other species, we recommend [Ensembl](https://useast.ensembl.org/info/data/ftp/index.html).
90-
91-
- **Housekeeping gene list**: the housekeeping genes defined by [this study](https://www.sciencedirect.com/science/article/pii/S0168952513000899?via%3Dihub) (N = 3804), e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt`.
92-
93-
***<u>NOTE:</u>*** For other species, you can generate the housekeeping gene list by gene homology conversion using biomaRt or other tools. Below is the code I used to generate the housekeeping genes in mouse:
94-
95-
``` R
96-
library(NetBID2)
97-
98-
HK_hg <- read.table("/Volumes/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt")
99-
HK_mm <- get_IDtransfer_betweenSpecies(
100-
from_spe = "human", to_spe = "mouse", from_type = "hgnc_symbol", to_type = "mgi_symbol",
101-
use_genes = unique(HK_hg$V1));
102-
colnames(HK_mm) <- paste0("#", colnames(HK_mm))
103-
write.table(HK_mm[,c(2,1)], ## Please make sure the mouse gene symbols are in the FIRST column
104-
            file = "/Volumes/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37/housekeeping_genes.mouse.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)
105-
```
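
For the three GENCODE files listed in this step, the download locations can be assembled from the species, release and assembly name. The sketch below assumes the public GENCODE FTP layout (base URL and file naming) as it appears on the GENCODE site; `gencode_urls` is a hypothetical helper, and the URLs should be checked against the release page before use.

```bash
# Print the download URLs of the three GENCODE files for one release.
# Arguments: species (human|mouse), release (e.g. 48 or M37), assembly (e.g. GRCh38).
# ASSUMPTION: base URL and file names follow the public GENCODE FTP layout.
gencode_urls() {
    local species="$1" release="$2" assembly="$3"
    local base="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_${species}/release_${release}"
    echo "${base}/gencode.v${release}.primary_assembly.annotation.gtf.gz"
    echo "${base}/${assembly}.primary_assembly.genome.fa.gz"
    echo "${base}/gencode.v${release}.transcripts.fa.gz"
}
```

For example, `gencode_urls human 48 GRCh38 | xargs -n 1 wget` would fetch the hg38 files, which then need to be decompressed with `gunzip` before the steps below.
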
106-
107-
108-
109-
#### 2. Parsing annotation file
110-
111-
In this step, we will parse the gene annotation file and generate four files that are required in downstream analysis:
112-
113-
- gencode.v48.primary_assembly.annotation.gene2transcript.txt and gencode.v48.primary_assembly.annotation.transcript2gene.txt: the mappings between transcripts and genes. They are required in gene-level quantification.
114-
- gencode.v48.primary_assembly.annotation.geneAnnotation.txt and gencode.v48.primary_assembly.annotation.transcriptAnnotation.txt: the gene or transcript annotation file. They are required in the final gene expression matrix generation.
115-
116-
To make the parsing easier, we created a script, `parseAnnotation.pl`, that generates the four files:
117-
118-
``` bash
119-
## parse the gene annotation file
120-
## This command will generate the four files in the same folder as gencode.v48.primary_assembly.annotation.gtf.
121-
## Only ONE argument is needed: gene annotation file in GTF format.
122-
perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/parseAnnotation.pl gencode.v48.primary_assembly.annotation.gtf
123-
```
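
To give a sense of what a transcript-to-gene mapping involves, the sketch below pulls `transcript_id` and `gene_id` out of the attribute column of each `transcript` record. It is an illustration only: it assumes GENCODE-style attributes, and neither its logic nor its output format is taken from `parseAnnotation.pl`. The name `gtf_transcript2gene` is hypothetical.

```bash
# Print "transcript_id<TAB>gene_id" for every transcript record in a GTF file.
# ASSUMPTION: GENCODE-style attributes, e.g. gene_id "ENSG..."; transcript_id "ENST...";
gtf_transcript2gene() {
    awk -F '\t' '$3 == "transcript" {
        gene = ""; tx = ""
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            a = attrs[i]
            gsub(/^ +/, "", a)                     # drop leading spaces
            if (a ~ /^gene_id /)       { gene = a; gsub(/^gene_id "|"$/, "", gene) }
            if (a ~ /^transcript_id /) { tx = a;   gsub(/^transcript_id "|"$/, "", tx) }
        }
        if (gene != "" && tx != "") print tx "\t" gene
    }' "$1"
}
```

Running `gtf_transcript2gene gencode.v48.primary_assembly.annotation.gtf | head` is a quick way to eyeball the IDs before comparing them with the files produced by the script.
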
124-
125-
126-
127-
#### 3. Creating gene body bins
128-
129-
In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required for gene body coverage statistics, an important QC metric that indicates the extent of RNA degradation.
130-
131-
Two files will be generated:
132-
133-
- ./bulkRNAseq/genebodyBins/genebodyBins_allTranscripts.txt: bins of the longest transcripts of all genes. This one is the most reliable solution since it calculates the gene body coverage across all genes (N = 46,402 for human). However, it's much slower.
134-
- ./bulkRNAseq/genebodyBins/genebodyBins_HouseKeepingTranscripts.txt: bins of the longest transcripts of precurated housekeeping genes. Though it only considers the housekeeping genes (N = 3,515 in human), in our tests across 30+ datasets we observed no significant difference in gene body coverage statistics compared to the all-transcript version, and it is much faster. This approach is widely used in many pipelines, including [RSeQC](https://rseqc.sourceforge.net/#genebody-coverage-py), so we set it as the default in this pipeline.
135-
136-
You can easily generate these two files with the command below:
137-
138-
``` bash
139-
## create the gene body bins
140-
## This command will generate the two files containing the bin list of the longest transcript of all genes and housekeeping genes.
141-
## Three arguments are needed: transcriptome sequence file in FASTA format, a txt file containing housekeeping genes in the first column, and a directory to save the output files.
142-
perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/createBins.pl gencode.v48.transcripts.fa housekeeping_genes.human.txt ./bulkRNAseq/genebodyBins
143-
```
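
As a rough picture of what "100 bins per transcript" means, the sketch below splits a transcript of a given length into contiguous 1-based bins. The integer boundary rule is an assumption made for illustration; it is not taken from `createBins.pl`, and `make_bins` is a hypothetical name.

```bash
# Print N contiguous 1-based bins (index, start, end) covering a transcript of LENGTH bases.
# Usage: make_bins LENGTH [N]   (N defaults to 100)
# ASSUMPTION: boundaries are floor(i * LENGTH / N); when LENGTH < N some bins are empty (end < start).
make_bins() {
    awk -v len="$1" -v n="${2:-100}" 'BEGIN {
        for (i = 1; i <= n; i++) {
            start = int((i - 1) * len / n) + 1
            end   = int(i * len / n)
            print i "\t" start "\t" end
        }
    }'
}
```

With this rule, a 1,000-base transcript yields bins of 10 bases each, and the last bin always ends at the transcript length.
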
144-
145-
146-
147-
#### 4. Create genome index files for RSEM
148-
149-
``` bash
150-
#BSUB -P buildIndex
151-
#BSUB -n 8
152-
#BSUB -M 8000
153-
#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err
154-
#BSUB -J buildIndex
155-
#BSUB -q priority
156-
157-
rsem_index=/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin/rsem-prepare-reference
158-
159-
## Bowtie2-RSEM
160-
mkdir -p /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2
161-
$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --bowtie2 --bowtie2-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2/hg38
162-
163-
## STAR-RSEM
164-
mkdir -p /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star
165-
$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --star --star-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 --star-sjdboverhang 100 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star/hg38
166-
```
167-
168-
169-
170-
#### 5. Create genome index files for Salmon
171-
172-
``` bash
173-
#BSUB -P salmonIndex
174-
#BSUB -n 8
175-
#BSUB -M 8000
176-
#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err
177-
#BSUB -J buildIndex
178-
#BSUB -q standard
179-
180-
## generate a decoy-aware transcriptome
181-
# https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
182-
grep "^>" /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | cut -d ">" -f 2 > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt
183-
184-
cat /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa
185-
186-
salmon index -t /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa -d /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt -p 8 -i index_decoy --gencode -k 31
187-
```
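
The decoy list in the script above is simply the sequence names of the genome FASTA, and the transcripts are concatenated before the genome so that the decoy sequences sit at the end of `gentrome.fa`. Wrapped as a small function (the name `fasta_names` is hypothetical), the same `grep`/`cut` chain can be reused as a sanity check on `decoys.txt`:

```bash
# Print the sequence names of a FASTA file, one per line.
# Header text after the first space is dropped, as in the decoy step above.
fasta_names() {
    grep "^>" "$1" | cut -d " " -f 1 | cut -d ">" -f 2
}
```

For instance, `fasta_names GRCh38.primary_assembly.genome.fa | wc -l` should match `wc -l < decoys.txt`, and every name in `decoys.txt` should appear as a header in `gentrome.fa`.
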
188-
189-
190-
191-
#### 6. Create genome index files for STAR
192-
193-
``` bash
194-
#BSUB -P STAR_Index
195-
#BSUB -n 8
196-
#BSUB -M 8000
197-
#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err
198-
#BSUB -J buildIndex
199-
#BSUB -q standard
200-
201-
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/STAR/index_overhang100 --genomeFastaFiles /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --sjdbOverhang 100
202-
```
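
The value 100 for `--sjdbOverhang` corresponds to the common guidance of read length minus one (for example, 101 bp reads). If an index for a different read length is needed, the value and a matching directory name could be derived as sketched below; both helper names are hypothetical, and the `index_overhang<N>` naming simply mirrors the directory used above.

```bash
# Print the --sjdbOverhang value for a given read length (read length - 1).
sjdb_overhang() {
    echo $(( $1 - 1 ))
}

# Print an index directory name following the index_overhang<N> pattern used above.
star_index_dir() {
    echo "index_overhang$(sjdb_overhang "$1")"
}
```

For 151 bp reads, `star_index_dir 151` gives `index_overhang150`, which can be passed to `--genomeDir` together with `--sjdbOverhang "$(sjdb_overhang 151)"`.
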
203-
