Usage

Functional analysis

InterTADs can optionally perform gene enrichment analysis (GO MF terms, GO BP terms, KEGG pathways) and motif enrichment analysis with the script 03_functional_analysis.R. The script analyzes the integrated tables produced by 02c_evenDiff.R and 02d_TADiff.R. It takes the path of a folder as input and analyzes every file inside that folder, recursively.
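As a rough illustration of what traversing a folder recursively can look like, the sketch below uses base R's list.files. The folder layout and file names are invented for the example, and 03_functional_analysis.R may collect its inputs differently.

```r
# Hypothetical input folder with one nested subfolder (illustration only).
input_dir <- file.path(tempdir(), "intertads_inputs")
dir.create(file.path(input_dir, "nested"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(input_dir, "evenDiff_table.txt"))
file.create(file.path(input_dir, "nested", "TADiff_table.txt"))

# recursive = TRUE descends into subfolders;
# full.names = TRUE prepends input_dir to each file name.
input_files <- list.files(input_dir, recursive = TRUE, full.names = TRUE)

for (f in input_files) {
  message("Would analyse: ", f)
}
```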

Prerequisites

The following extra packages must be installed on your local machine:

install.packages(c("ggseqlogo","seqinr","httr","jsonlite","xml2","enrichR","stats","purrr","igraph","ggraph","hrbrthemes","extrafont","gridExtra","ggpubr","pryr", "ape"))

devtools::install_github("nikopech/saveImageHigh")

If you install the "extrafont" package, make sure to import the fonts with the following command. This only needs to be done once.

extrafont::font_import()

And from Bioconductor:

BiocManager::install(c("KEGGREST","PWMEnrich","PWMEnrich.Hsapiens.background"))
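Before running the analysis it can help to check which of the packages above are still missing. The following is a minimal sketch; missing_packages is a hypothetical helper written for this example, not part of InterTADs.

```r
# Packages named in this section (CRAN, GitHub and Bioconductor).
required_pkgs <- c(
  "ggseqlogo", "seqinr", "httr", "jsonlite", "xml2", "enrichR", "stats",
  "purrr", "igraph", "ggraph", "hrbrthemes", "extrafont", "gridExtra",
  "ggpubr", "pryr", "ape", "saveImageHigh",
  "KEGGREST", "PWMEnrich", "PWMEnrich.Hsapiens.background"
)

# Hypothetical helper: returns the packages that are not installed.
# system.file(package = p) gives "" when package p cannot be found.
missing_packages <- function(pkgs) {
  paths <- vapply(pkgs, function(p) system.file(package = p), character(1))
  pkgs[!nzchar(paths)]
}

print(missing_packages(required_pkgs))
```

An empty result suggests that every listed package is available in the current library paths.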

Running the step

To run this final step, use the following command:

source("03_functional_analysis.R")

Input files

Each input file must contain the following columns:

tad_name: TAD that the event corresponds to
chromosome_name: chromosome that the event corresponds to
ID: unique event identifier
start_position: start position of the event in the chromosome
end_position: end position of the event in the chromosome
Gene_id: symbols of the genes the event corresponds to; multiple values are separated with "|"

It may also contain:

extra columns with the sample data
extra columns with information of the statistical significance of the event

An example of the necessary input data is provided:

tad_name	chromosome_name	ID	start_position	end_position	Gene_id
TAD107	1	cg00295853	87245070	87245071	NA
TAD107	1	cg00994394	86934606	86934607	CLCA1
TAD107	1	cg01118679	87018134	87018135	CLCA4
TAD107	1	cg01712258	86819370	86819371	ODF2L
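The column requirements above can be checked before running the analysis. The sketch below embeds the example rows as tab-separated text, verifies that the required columns are present, and splits Gene_id on "|". It assumes the input is tab-delimited; the variable names are chosen for this example only.

```r
# The example rows, embedded as tab-separated text for illustration.
example_txt <- paste(
  "tad_name\tchromosome_name\tID\tstart_position\tend_position\tGene_id",
  "TAD107\t1\tcg00295853\t87245070\t87245071\tNA",
  "TAD107\t1\tcg00994394\t86934606\t86934607\tCLCA1",
  "TAD107\t1\tcg01118679\t87018134\t87018135\tCLCA4",
  "TAD107\t1\tcg01712258\t86819370\t86819371\tODF2L",
  sep = "\n"
)
events <- read.delim(text = example_txt, stringsAsFactors = FALSE)

# Stop early if any required column is absent.
required_cols <- c("tad_name", "chromosome_name", "ID",
                   "start_position", "end_position", "Gene_id")
missing_cols <- setdiff(required_cols, colnames(events))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# Gene_id may hold several symbols separated by "|"; fixed = TRUE treats
# the pipe literally instead of as a regular-expression alternation.
genes_per_event <- strsplit(events$Gene_id, "|", fixed = TRUE)
names(genes_per_event) <- events$ID
```

Events without an annotated gene keep NA in genes_per_event, so they can be filtered out before any enrichment call.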

Output files

The output graphs and csv files are stored in a separate output folder for each input file. Each folder consists of three subfolders:

GO EA Outputs (includes the folders GO enrich all and GO enrich per TAD, and each one of them includes the subfolders GO MF and GO BP)
Motif EA Outputs
Pathways EA Outputs (includes the folders Path enrich all and Path enrich per TAD)

Each subfolder corresponds to the biological characteristic the input data is enriched with. For files produced by 02d_TADiff.R, the enrichment analysis looks in two directions for each biological characteristic: the statistical significance of the enriched terms (the "enrich all" folders) and the importance of the TADs as functional genomic regions (the "enrich per TAD" folders). For files produced by 02c_evenDiff.R, the per-TAD analysis is omitted. The results of each type of analysis are stored mainly in two csv files per characteristic. The first one is grouped by enriched term; an example is presented below:

Term	TAD	P.value	P.adjust
ABC transporters	TAD2726|TAD2727	6.31521032101889e-09|9.723691144917e-05	1.89456309630567e-08|0.000116684293739004
Pancreatic secretion	TAD107	3.74E-05	5.61E-05
Renin secretion	TAD107	1.30E-05	2.60E-05
Salivary secretion	TAD2130	0.00012399	0.00012399

And the second one is grouped per TAD:

TAD	Term	P.value	P.adjust
TAD1916	ABC transporters|Taste transduction	0.0778524501891508|0.139171179564888	0.0934229402269809|0.139171179564888
TAD2130	Salivary secretion|Taste transduction	0.000123989884158722|5.71267530160134e-23	0.000185984826238082|3.4276051809608e-22
TAD2726	ABC transporters	6.32E-09	1.89E-08
TAD2727	ABC transporters	9.72E-05	0.000185985
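Both grouped tables pack several values into one cell, so downstream filtering is often easier on a long table with one row per term and TAD pair. The sketch below expands two rows of the per-term example. It assumes the fields inside a cell are separated by a plain "|" and that the file is tab-delimited; split_pipe is a hypothetical helper.

```r
# Two rows of the per-term example, embedded as tab-separated text.
per_term_txt <- paste(
  "Term\tTAD\tP.value\tP.adjust",
  paste0("ABC transporters\tTAD2726|TAD2727\t",
         "6.31521032101889e-09|9.723691144917e-05\t",
         "1.89456309630567e-08|0.000116684293739004"),
  "Pancreatic secretion\tTAD107\t3.74E-05\t5.61E-05",
  sep = "\n"
)
# Read every column as text so the "|"-separated cells stay intact.
per_term <- read.delim(text = per_term_txt, colClasses = "character")

# Hypothetical helper: split each cell on a literal pipe.
split_pipe <- function(x) strsplit(x, "|", fixed = TRUE)

# One output row per (Term, TAD) pair; the term is repeated per TAD.
tads <- split_pipe(per_term$TAD)
long <- data.frame(
  Term     = rep(per_term$Term, lengths(tads)),
  TAD      = unlist(tads),
  P.value  = as.numeric(unlist(split_pipe(per_term$P.value))),
  P.adjust = as.numeric(unlist(split_pipe(per_term$P.adjust))),
  stringsAsFactors = FALSE
)
print(long)
```

The same approach applies to the per-TAD table by swapping the roles of the Term and TAD columns.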

Input parameters

The main script has the following inputs:

tech: Human Genome Reference used (accepted values are hg19 and hg38)
exp.parent: number of the parent file of the expression data (can be found in the 01_data_integration.R output txt file)
dbs: vector with a list of the Enrichr libraries used for the enrichment analysis (default is the vector c("GO_Molecular_Function_2018", "GO_Biological_Process_2018", "KEGG_2019_Human"), for the complete list of the Enrichr libraries type the command listEnrichrDbs())
type: acronyms of the previously selected databases, used in the names of the output files (default is c("GO.MF","GO.BP","KEGG"))
genes.cover: vector with the number of genes covered by each Enrichr library in dbs (default, matching the default libraries, is c(11459,14433,7802))
p.adjust.method: the method used for adjustment of the p-values of the enrichment analysis (default value is "fdr", accepted values are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
criterio: Enrichr result column used as the selection criterion (accepted values are "P.value" and "Adjusted.P.value")
cut.off: the threshold of significant (adjusted) p-values (default value is 0.05)
cut.off.TF: the cut-off for the adjusted p-value of the motif enrichment (default value is 0.05)
min.genes: the minimum number of genes in over-represented terms (default value is 3)
system: the OS of the local machine (default value is "win")
dir_name: the name of the input data folder
output_folder: the name of the outputs folder
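To make the parameter list concrete, here is a sketch of one possible configuration with a few consistency checks. The defaults are the ones listed above; the values for tech, exp.parent, criterio, dir_name and output_folder are hypothetical placeholders, and where exactly the script expects these assignments is not stated here. The last lines only illustrate how a p-value adjustment and cut-off behave; the script's own filtering may differ.

```r
# Values marked "hypothetical" are placeholders, not project defaults.
tech            <- "hg19"       # hypothetical choice; "hg38" is also accepted
exp.parent      <- 1            # hypothetical; see the 01_data_integration.R output
dbs             <- c("GO_Molecular_Function_2018",
                     "GO_Biological_Process_2018",
                     "KEGG_2019_Human")
type            <- c("GO.MF", "GO.BP", "KEGG")
genes.cover     <- c(11459, 14433, 7802)
p.adjust.method <- "fdr"
criterio        <- "P.value"    # hypothetical choice between the two accepted values
cut.off         <- 0.05
cut.off.TF      <- 0.05
min.genes       <- 3
system          <- "win"
dir_name        <- "Datasets"   # hypothetical folder name
output_folder   <- "Outputs"    # hypothetical folder name

# Consistency checks: one acronym and one gene count per library,
# and only accepted values for the categorical parameters.
stopifnot(
  tech %in% c("hg19", "hg38"),
  length(dbs) == length(type),
  length(dbs) == length(genes.cover),
  p.adjust.method %in% stats::p.adjust.methods,
  criterio %in% c("P.value", "Adjusted.P.value")
)

# Illustration only: adjust three made-up p-values and apply the cut-off.
adjusted    <- stats::p.adjust(c(0.01, 0.04, 0.03), method = p.adjust.method)
significant <- adjusted <= cut.off
```

Checking the vector lengths up front catches the common mistake of changing dbs without updating type and genes.cover to match.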
