- Introduction
- Recommended Readings
- Setup
- Exploratory Analysis
- Differential Gene Expression (DGE) Analysis
- Gene Set Enrichment Analysis (GSEA)
- Visualizing Results
- Pathway Analysis (to be added)
- TF Enrichment Analysis (to be added)
- TIME Analysis (to be added)
This is an analysis pipeline for Bulk RNAseq data. We will re-analyze the mouse RNAseq data from Donahue et al. The data can be downloaded from here
This tutorial will not cover essentials for RNAseq statistics. Your can read more from the following resources:
First we will load essential libraries and import the data.
The goal of this section is to construct a numeric matrix where the rows represent gene names/ids and the columns represent samples and metadata dataframe where the row names match the column names of the count matrix to create DESeq2 object.
## load libraries
suppressPackageStartupMessages({
library(DESeq2)
library(tidyverse)
library(PCAtools)
library(fgsea)
library(msigdbr)
library(EnhancedVolcano)
library(pheatmap)
library(decoupleR)
library(ashr)
})
## Create output directory
dir.create("outputs", showWarnings = F, recursive = T)
## loading count data
## NOTES: here we will store the gene symbols in a dataframe and keep the gene id as the rowname
## This is because in the count matrix that we have, some genes ids have the same gene symbols
## so we will keep the gene id as rownames to avoid duplicates and add the gene symbols in the end
## for visualization.
ct_mtx <- read.table("data/gene_expected_count.annot.txt", quote = "", header = T, fill = T, sep = "\t")
gene_symbols <- data.frame(gene_symbol = ct_mtx$external_gene_name) %>% `rownames<-`(ct_mtx$gene_id)
ct_mtx <- ct_mtx %>%
select(-c("entrezgene_id", "external_gene_name", "description")) %>%
column_to_rownames('gene_id')
## Remove the X in the begining of each sample name and make them match the sample names in metadata file
colnames(ct_mtx) <- gsub("^X", "", colnames(ct_mtx))
colnames(ct_mtx) <- gsub("\\.", "-", colnames(ct_mtx))
## loading metadata
metadata <- read.csv("data/metadata.csv", header = T) %>%
column_to_rownames('sample')
metadata <- metadata[colnames(ct_mtx),]
## converting the batch info to factor because it is categorical data not numerical
metadata$Batch <- as.factor(metadata$Batch)
# DESeq object all samples ------------------------------------------------
dds <- DESeqDataSetFromMatrix(countData = ct_mtx,
colData = metadata,
rowData = gene_symbols,
design = ~ 1)Here we are visualizing the PCA plots for the samples. You can change
the shape and color based on the color_by and shape_by variables.
You can review PCAtools
Vignette
to learn how to customize the PCA plot.
vst <- assay(vst(dds))
rv <- rowVars(vst)
select <- order(rv, decreasing=TRUE)[seq_len(min(2000, length(dds)))]
pca <- pca(vst[select,], metadata = colData(dds))
color_by <- 'IL33.Status'
shape_by <- 'Cell.Line'
biplot(pca,
showLoadings = F,
colby = color_by,
shape = shape_by,
labSize = 2, pointSize = 5, drawConnectors = T,
legendPosition = 'right',
legendLabSize = 7, legendIconSize = 4, legendTitleSize = 10)You can review this tutorial to learn how to customize the heatmap using pheatmap R package, including adding column and row annotations
ann <- data.frame(KO_status = metadata$IL33.Status,
CellLine = metadata$Cell.Line,
row.names = rownames(metadata))
pheatmap(cor(vst), annotation_col = ann, annotation_row = ann)Here we will run the heart of RNAseq analysis. Defining differentially expressed genes (Up- and down-regulated genes). It is curcial to know how comparisons you are trying to make and construct the design matrix accordingly. Most of the RNAseq experiments are based on categorical data analysis (e.g., WT vs KO, Treated vs. Non-Treated), but you can use continuous variables (e.g., Glucose level, Drug Dosage). You can review the guidelines for constructing a design matrix here.
Here we will compare the Il33 KO to WT while accounting for other variables (Cell.Line, Treatment and batch). This can be conducted using one of two ways as follows:
1) Setting WT as a reference, and comparing everything to it
dds$IL33.Status <- factor(dds$IL33.Status, levels = c('WT', 'KO'))
design(dds) <- model.matrix(~ IL33.Status + Batch + Treatment, data = colData(dds))
dds <- DESeq(dds, quiet = T)
## We can check the coefficients for different contrasts using `resultsNames(dds)`
resultsNames(dds)## [1] "Intercept" "IL33.StatusKO" "Batch2" "Batch3"
## [5] "Batch4" "TreatmentNT"
We can see that we can check the effect of being KO using the
coefficient IL33.StatusKO
KO_v_wt <- lfcShrink(dds, coef = 'IL33.StatusKO', type = 'ashr', quiet = T) %>%
data.frame() %>%
merge(gene_symbols, by = 0) %>%
filter(!is.na(padj)) %>%
rename(gene_id = Row.names) 2) Removing the intercept and setting up contrasts to compare different groups.
This is particularly beneficial when you have more than one level in the variable (e.g., WT, KO, DKO) and you want to have pairwise comparisons. Read more here
design(dds) <- model.matrix(~ 0 + IL33.Status + Batch + Treatment, data = colData(dds))
dds <- DESeq(dds, quiet = T)
## We can check the coefficients for different contrasts using `resultsNames(dds)`
resultsNames(dds)## [1] "IL33.StatusWT" "IL33.StatusKO" "Batch2" "Batch3"
## [5] "Batch4" "TreatmentNT"
We can see that here to get the DEGs we have first to specific the contrast.
KO_v_wt <- lfcShrink(dds, contrast = list('IL33.StatusKO','IL33.StatusWT'), type = 'ashr', quiet = T) %>%
data.frame() %>%
merge(gene_symbols, by = 0) %>%
filter(!is.na(padj)) %>%
rename(gene_id = Row.names) Both approaches should have very similar results. We will save the results as a csv file
write.csv(KO_v_wt, "outputs/mouse_il33_v_wt_dge.csv", row.names = F)
head(KO_v_wt[,-1])## baseMean log2FoldChange lfcSE pvalue padj gene_symbol
## 1 52.404144 0.01835905 0.2511604 0.80139134 0.9979924 eGFP
## 2 7973.141256 0.17936812 0.1905877 0.13272198 0.6359739 Gnai3
## 3 425.362690 0.24824921 0.2380886 0.03812011 0.4038640 Cdc45
## 4 19.674564 0.22507335 0.5498473 0.03161745 0.3765405 H19
## 5 163.227191 -0.11123294 0.1980364 0.33159105 0.8275319 Scml2
## 6 8.329456 0.02685919 0.2313623 0.76173908 0.9979924 Apoh
Here we will run GSEA using Hallmark pathways from MSigDB. Pay attention to changing the species to match the species of your data.
species <- 'Mus musculus'
hallmarks_pathways <- msigdbr(species = species, category = 'H', subcategory = NULL)
hallmarks_pathways <- split(x = hallmarks_pathways$ensembl_gene, f = hallmarks_pathways$gs_name)
# > GSEA ------------------------------------------------------------------
gene_rank <- KO_v_wt %>%
pull(log2FoldChange) %>%
`names<-`(KO_v_wt$gene_id) %>%
sort()
gsea_res <- fgsea(pathways = hallmarks_pathways,
stats = gene_rank)## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (4.89% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## We will save the results as a csv file
write.csv(gsea_res[,-8], "outputs/mouse_il33_v_wt_gsea.csv", row.names = F)Visualizing results can be done using Volcano plots and checking normalized expression for specific genes. Here we will use EnhancedVolcano R package. Glimma provides an interactive app for visualising the results of DGE analysis.
EnhancedVolcano(KO_v_wt,
lab = KO_v_wt$gene_symbol,
x = 'log2FoldChange',
y = 'padj',
title = 'IL33 KO vs WT',
subtitle = NULL,
pCutoff = 0.05,
FCcutoff = 1,
pointSize = 2,
xlim = c(-5, 5),
ylim = c(0, 20),
legendLabels = c('NS', 'FDR < 0.05', 'log2FC > 1', 'FDR < 0.05 & log2FC > 1'),
legendPosition = 'top',
legendLabSize = 10,
legendIconSize = 5,
labSize = 2)gene <- 'Igfbp5'
gene_id <- gene_symbols %>% filter(gene_symbol == gene) %>% rownames()
group <- 'IL33.Status'
plot_df <- plotCounts(dds, gene = gene_id, intgroup = group, returnData = T)
ggplot(plot_df, aes_string(x = group, y = 'count', fill = group)) +
geom_boxplot() +
geom_jitter(width = 0.1) +
labs(title = gene)## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
topPathwaysUp <- gsea_res[NES > 0][head(order(padj), n=5), pathway]
topPathwaysDown <- gsea_res[NES < 0][head(order(padj), n=5), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(hallmarks_pathways[topPathways],
gene_rank,
gsea_res,
gseaParam=0.5)
We can also
visualize one pathway as follows
plotEnrichment(hallmarks_pathways$HALLMARK_MYC_TARGETS_V1,
gene_rank,
gseaParam=0.5)



