|
1 | 1 | --- |
2 | | -title: Software Installation |
| 2 | +title: I-Software Installation |
3 | 3 | layout: default |
4 | 4 | nav_order: 1 |
5 | 5 | parent: Pipeline Setup |
6 | 6 | --- |
7 | 7 |
|
8 | | -### Before you start |
9 | | - |
10 | | -There are two parts to set up this pipeline: |
11 | | - |
12 | | -- **Software installation**: to install all tools required by this pipeline. |
13 | | -- **Reference preparation**: to generate reference files used in this pipeline. |
14 | | - |
15 | | -For Yu Lab members, we have set this pipeline up on HPC: |
16 | | - |
17 | | -- All the tools have been compiled in one conda environment, which can be launched by: |
18 | | - |
19 | | - ``` bash |
20 | | - module load conda3/202402 |
21 | | - conda activate /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025 |
22 | | - ``` |
23 | | - |
24 | | -- We have generated the files for four reference genomes: hg38, hg19, mm39 and mm10. |
25 | | - |
26 | | - |
27 | | - | Genome | GENCODE release | Release date | Ensembl release | Path | |
28 | | - | ----------------- | --------------- | ------------ | --------------- | ------------------------------------------------------------ | |
29 | | - | hg38 (GRCh38.p14) | v48 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48 | |
30 | | - | hg19 (GRCh37.p13) | v48lift37* | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg19/gencode.release48 | |
31 | | - | mm39 (GRCm39) | vM37 | 05.2025 | v114 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37 | |
32 | | - | mm10 (GRCm38.p6) | vM25 | 04.2020** | v100 | /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/mm10/gencode.releaseM25 | |
33 | | - |
34 | | - *: Updates to the hg19/GRCh37 genome assembly stopped in 2013. However, the gene annotation continues to be updated by mapping the comprehensive gene annotations originally created for the GRCh38/hg38 reference chromosomes onto the GRCh37 primary assembly using [gencode-backmap](https://github.com/diekhans/gencode-backmap). |
35 | | - |
36 | | - **: Updates to the mm10/GRCm38 genome assembly and gene annotation stopped in 2019. |
37 | | - |
38 | | -You only need to follow this tutorial if you want to set up this pipeline locally or compile another reference genome or gene annotation. |
39 | | - |
40 | 8 | ### Part I: Software installation |
41 | 9 |
|
42 | 10 | We selected the tools for this pipeline mainly based on two considerations: 1) they are well established and widely used; 2) they work well together. We managed to compile all of the tools in a single conda environment. To install them: |
@@ -67,137 +35,3 @@ We selected the tools for this pipeline mainly based on two considerations: 1) th |
67 | 35 | ## deactivate conda env |
68 | 36 | conda deactivate |
69 | 37 | ``` |
70 | | - |
71 | | - |
72 | | - |
73 | | -### Part II: Reference preparation |
74 | | - |
75 | | -In addition to the tools, you will also need to prepare **reference genomes** for alignment, quantification and QC assessment. Below is a summary of reference preparation (using hg38 as an example): |
76 | | - |
77 | | - |
78 | | - |
79 | | -#### 1. Data collection |
80 | | - |
81 | | -The reference preparation starts with FOUR files that can be downloaded directly from public websites: |
82 | | - |
83 | | -- **Gene annotation file in [GTF](https://biocorecrg.github.io/PhD_course_genomics_format_2021/gtf_format.html) (Gene Transfer Format) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf`. |
84 | | - |
85 | | -- **Genome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa`. |
86 | | - |
87 | | -- **Transcriptome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format**: e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa`. |
88 | | - |
89 | | - ***<u>NOTE:</u>*** The three files above are usually available from public databases. For human and mouse, we recommend downloading them from [GENCODE](https://www.gencodegenes.org/); for other species, we recommend [Ensembl](https://useast.ensembl.org/info/data/ftp/index.html). |
90 | | - |
91 | | -- **Housekeeping gene list**: the housekeeping genes defined by [this study](https://www.sciencedirect.com/science/article/pii/S0168952513000899?via%3Dihub) (N = 3804), e.g., `/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt`. |
92 | | - |
93 | | - ***<u>NOTE:</u>*** For other species, you can generate the housekeeping gene list by gene homology conversion using biomaRt or other tools. Below is the code I used to generate the mouse housekeeping gene list: |
94 | | - |
95 | | - ``` R |
96 | | - library(NetBID2) |
97 | | - |
98 | | - HK_hg <- read.table("/Volumes/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/housekeeping_genes.human.txt") |
99 | | - HK_mm <- get_IDtransfer_betweenSpecies( |
100 | | - from_spe = "human", to_spe = "mouse", from_type = "hgnc_symbol", to_type = "mgi_symbol", |
101 | | - use_genes = unique(HK_hg$V1)); |
102 | | - colnames(HK_mm) <- paste0("#", colnames(HK_mm)) |
103 | | - write.table(HK_mm[,c(2,1)], ## Please make sure the mouse gene symbols are in the FIRST column |
104 | | - file = "/Volumes/projects/software_JY/yu3grp/yulab_databases/references/mm39/gencode.releaseM37/housekeeping_genes.mouse.txt", col.names = T, row.names = F, sep = "\t", quote = F) |
105 | | - ``` |
106 | | - |
107 | | - |
108 | | - |
109 | | -#### 2. Parsing annotation file |
110 | | - |
111 | | -In this step, we will parse the gene annotation file and generate four files that are required in downstream analysis: |
112 | | - |
113 | | -- gencode.v48.primary_assembly.annotation.gene2transcript.txt and gencode.v48.primary_assembly.annotation.transcript2gene.txt: the mappings between transcripts and genes. They are required for gene-level quantification. |
114 | | -- gencode.v48.primary_assembly.annotation.geneAnnotation.txt and gencode.v48.primary_assembly.annotation.transcriptAnnotation.txt: the gene and transcript annotation files. They are required for generating the final gene expression matrix. |
115 | | - |
116 | | -To make the parsing easier, we created a script, `parseAnnotation.pl`, that generates all four files in one step: |
117 | | - |
118 | | -``` bash |
119 | | -## parse the gene annotation file |
120 | | -## This command will generate the four files in the same folder as gencode.v48.primary_assembly.annotation.gtf. |
121 | | -## Only ONE argument is needed: the gene annotation file in GTF format. |
122 | | -perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/parseAnnotation.pl gencode.v48.primary_assembly.annotation.gtf |
123 | | -``` |
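
To illustrate what a transcript-to-gene mapping contains, here is a minimal sketch that extracts one from a toy GTF with `awk`. It is not the logic of `parseAnnotation.pl`: the toy records, the `workdir` variable and the output name `transcript2gene.tsv` are assumptions made for illustration, and the columns written by the real script may differ.

``` bash
## toy GTF with one gene and two transcripts (hypothetical IDs, for illustration only)
workdir=$(mktemp -d)
printf 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG01.1"; gene_name "GENEA";\n' > "$workdir/toy.gtf"
printf 'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG01.1"; transcript_id "ENST01.1"; gene_name "GENEA";\n' >> "$workdir/toy.gtf"
printf 'chr1\tHAVANA\ttranscript\t1\t80\t.\t+\t.\tgene_id "ENSG01.1"; transcript_id "ENST02.1"; gene_name "GENEA";\n' >> "$workdir/toy.gtf"

## for every "transcript" feature, pull transcript_id and gene_id out of column 9
awk -F'\t' '$3 == "transcript" {
    tx = ""; gene = ""
    if (match($9, /transcript_id "[^"]+"/)) tx = substr($9, RSTART + 15, RLENGTH - 16)
    if (match($9, /gene_id "[^"]+"/)) gene = substr($9, RSTART + 9, RLENGTH - 10)
    if (tx != "" && gene != "") print tx "\t" gene
}' "$workdir/toy.gtf" > "$workdir/transcript2gene.tsv"

cat "$workdir/transcript2gene.tsv"
```

Each output row pairs one transcript ID with its parent gene ID, which is the information gene-level quantification needs to sum transcript-level estimates per gene.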
124 | | - |
125 | | - |
126 | | - |
127 | | -#### 3. Creating gene body bins |
128 | | - |
129 | | -In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required for gene body coverage statistics, an important QC metric that indicates the extent of RNA degradation. |
130 | | - |
131 | | -Two files will be generated: |
132 | | - |
133 | | -- ./bulkRNAseq/genebodyBins/genebodyBins_allTranscripts.txt: bins of the longest transcripts of all genes. This is the most reliable option because it calculates gene body coverage across all genes (N = 46,402 for human). However, it is much slower. |
134 | | -- ./bulkRNAseq/genebodyBins/genebodyBins_HouseKeepingTranscripts.txt: bins of the longest transcripts of the precurated housekeeping genes. Although it only considers housekeeping genes (N = 3,515 in human), in our tests across 30+ datasets we observed no significant difference in gene body coverage statistics compared with the all-transcript version, and it is much faster. This approach is widely used in many pipelines, including [RSeQC](https://rseqc.sourceforge.net/#genebody-coverage-py), so we set it as the default in this pipeline. |
135 | | - |
136 | | -You can easily generate these two files with the command below: |
137 | | - |
138 | | -``` bash |
139 | | -## create the gene body bins |
140 | | -## This command will generate the two files containing the bin list of the longest transcript of all genes and housekeeping genes. |
141 | | -## Three arguments are needed: the transcriptome sequence file in FASTA format, a txt file containing housekeeping genes in the first column, and a directory to save the output files. |
142 | | -perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/git_repo/scripts/setup/createBins.pl gencode.v48.transcripts.fa housekeeping_genes.human.txt ./bulkRNAseq/genebodyBins |
143 | | -``` |
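
The binning idea can be sketched on a toy transcriptome as follows. This is not the implementation of `createBins.pl`: it assumes GENCODE-style FASTA headers (transcript ID in field 1, gene ID in field 2, transcript length in field 7, separated by `|`), and the toy IDs, the `workdir` variable and the `bins.txt` column layout are hypothetical.

``` bash
## toy transcriptome with GENCODE-style headers (hypothetical IDs and lengths)
workdir=$(mktemp -d)
cat > "$workdir/transcripts.fa" <<'EOF'
>ENST01.1|ENSG01.1|-|-|GENEA-201|GENEA|250|protein_coding|
ACGT
>ENST02.1|ENSG01.1|-|-|GENEA-202|GENEA|120|protein_coding|
ACGT
>ENST03.1|ENSG02.1|-|-|GENEB-201|GENEB|400|lncRNA|
ACGT
EOF

## keep the longest transcript per gene, then split it into nbins bins
## columns: transcript, gene, bin index, bin start, bin end (1-based, inclusive)
awk -F'|' -v OFS='\t' -v nbins=100 '
  /^>/ {
    tx = substr($1, 2); gene = $2; len = $7 + 0
    if (len > best_len[gene]) { best_len[gene] = len; best_tx[gene] = tx }
  }
  END {
    for (gene in best_tx) {
      len = best_len[gene]
      for (i = 1; i <= nbins; i++) {
        bin_start = int((i - 1) * len / nbins) + 1
        bin_end = int(i * len / nbins)
        print best_tx[gene], gene, i, bin_start, bin_end
      }
    }
  }' "$workdir/transcripts.fa" | sort -k1,1 -k3,3n > "$workdir/bins.txt"

head -n 3 "$workdir/bins.txt"
```

With integer boundaries, transcripts shorter than the number of bins produce some empty bins (end smaller than start), so a real implementation would need to decide how to handle them.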
144 | | - |
145 | | - |
146 | | - |
147 | | -#### 4. Create genome index files for RSEM |
148 | | - |
149 | | -``` bash |
150 | | -#BSUB -P buildIndex |
151 | | -#BSUB -n 8 |
152 | | -#BSUB -M 8000 |
153 | | -#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err |
154 | | -#BSUB -J buildIndex |
155 | | -#BSUB -q priority |
156 | | - |
157 | | -rsem_index=/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin/rsem-prepare-reference |
158 | | - |
159 | | -## Bowtie2-RSEM |
160 | | -mkdir -p /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2 |
161 | | -$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --bowtie2 --bowtie2-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_bowtie2/hg38 |
162 | | - |
163 | | -## STAR-RSEM |
164 | | -mkdir -p /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star |
165 | | -$rsem_index --gtf /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --star --star-path /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/bin --num-threads 8 --star-sjdboverhang 100 /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/RSEM/index_star/hg38 |
166 | | -``` |
167 | | - |
168 | | - |
169 | | - |
170 | | -#### 5. Create genome index files for Salmon |
171 | | - |
172 | | -``` bash |
173 | | -#BSUB -P salmonIndex |
174 | | -#BSUB -n 8 |
175 | | -#BSUB -M 8000 |
176 | | -#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err |
177 | | -#BSUB -J buildIndex |
178 | | -#BSUB -q standard |
179 | | - |
180 | | -## generate a decoy-aware transcriptome |
181 | | -# https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ |
182 | | -grep "^>" /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | cut -d ">" -f 2 > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt |
183 | | - |
184 | | -cat /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.transcripts.fa /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa > /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa |
185 | | - |
186 | | -salmon index -t /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/gentrome.fa -d /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/Salmon/decoys.txt -p 8 -i index_decoy --gencode -k 31 |
187 | | -``` |
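
To see what the decoy preparation produces before running it on a full genome, the same `grep`/`cut`/`cat` steps can be tried on a toy example. The file contents and the `workdir` variable below are made up for illustration; only the commands mirror the job script above.

``` bash
## toy genome and transcriptome (hypothetical sequences, for illustration only)
workdir=$(mktemp -d)
printf '>chr1 1\nACGTACGT\n>chrM MT\nGGCC\n' > "$workdir/genome.fa"
printf '>ENST0001.1|ENSG0001.1|\nACGT\n' > "$workdir/transcripts.fa"

## decoy names = genome FASTA headers up to the first space, without the leading ">"
grep "^>" "$workdir/genome.fa" | cut -d " " -f 1 | cut -d ">" -f 2 > "$workdir/decoys.txt"

## gentrome = transcripts FIRST, then the genome sequences that act as decoys
cat "$workdir/transcripts.fa" "$workdir/genome.fa" > "$workdir/gentrome.fa"

cat "$workdir/decoys.txt"
```

The order in the `cat` step matters: the decoy sequences listed in `decoys.txt` have to come after the transcripts in `gentrome.fa`.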
188 | | - |
189 | | - |
190 | | - |
191 | | -#### 6. Create genome index files for STAR |
192 | | - |
193 | | -``` bash |
194 | | -#BSUB -P STAR_Index |
195 | | -#BSUB -n 8 |
196 | | -#BSUB -M 8000 |
197 | | -#BSUB -oo 01_buildIndex.out -eo 01_buildIndex.err |
198 | | -#BSUB -J buildIndex |
199 | | -#BSUB -q standard |
200 | | - |
201 | | -STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/bulkRNAseq/STAR/index_overhang100 --genomeFastaFiles /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/yulab_databases/references/hg38/gencode.release48/gencode.v48.primary_assembly.annotation.gtf --sjdbOverhang 100 |
202 | | -``` |
203 | | - |