Usage
InterTADs provides the option of performing gene enrichment analysis (GO MF Terms, GO BP Terms, KEGG Pathways) and motif enrichment analysis using the script 03_functional_analysis.R. The script analyzes the integrated tables produced by the scripts: 02c_evenDiff.R and 02d_TADiff.R. It accepts as input the filepath of a folder and performs the analysis for each individual file inside the folder recursively.
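The recursive traversal of the input folder can be pictured with base R's `list.files()`. This is only a minimal sketch, not the script's actual code: the helper name `list_input_files` and the file-extension pattern are assumptions made for illustration.

```r
# Hypothetical helper (not part of InterTADs): collect every file under an
# input folder, descending into subfolders.
# The extension filter is an assumption; adjust it to the files you produce.
list_input_files <- function(dir_name, pattern = "\\.(csv|txt)$") {
  list.files(dir_name, pattern = pattern, recursive = TRUE, full.names = TRUE)
}

# Demo on a temporary folder that contains one top-level and one nested file.
demo_dir <- file.path(tempdir(), "intertads_demo")
dir.create(file.path(demo_dir, "nested"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "a.csv"))
file.create(file.path(demo_dir, "nested", "b.txt"))

input_files <- list_input_files(demo_dir)
print(basename(input_files))
```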
The extra packages that must be installed on your local machine are listed below ("stats" ships with base R and does not need to be installed):
install.packages(c("ggseqlogo","seqinr","httr","jsonlite","xml2","enrichR","purrr","igraph","ggraph","hrbrthemes","extrafont","gridExtra","ggpubr","pryr","ape"))
devtools::install_github("nikopech/saveImageHigh")
After installing the "extrafont" package, import the system fonts with the following command. This only needs to be done once.
extrafont::font_import()
And from Bioconductor:
BiocManager::install(c("KEGGREST","PWMEnrich","PWMEnrich.Hsapiens.background"))
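Before running the analysis it can help to check which of the packages above are still missing. The sketch below uses base R's `requireNamespace()`. The helper name `missing_packages` is hypothetical, and the short package list is only a sample of the names given above.

```r
# Hypothetical helper: return the packages from `pkgs` that cannot be loaded.
# Note that requireNamespace() loads the namespace of each package it finds.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# A few of the package names listed above; extend the vector as needed.
to_check <- c("enrichR", "igraph", "KEGGREST")
print(missing_packages(to_check))
```

Any name that is printed still has to be installed with `install.packages()` or `BiocManager::install()`, depending on where the package is hosted.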
To run this final step, use the following command:
source("03_functional_analysis.R")
Each input file must contain the following columns:
- tad_name: TAD that the event corresponds to
- chromosome_name: chromosome that the event corresponds to
- ID: unique event identifier
- start_position: start position of the event in the chromosome
- end_position: end position of the event in the chromosome
- Gene_id: symbols of the genes the event corresponds to; multiple values are separated with "|"
It may also contain:
- extra columns with the sample data
- extra columns with information of the statistical significance of the event
An example of the necessary input data is provided:
| tad_name | chromosome_name | ID | start_position | end_position | Gene_id |
|---|---|---|---|---|---|
| TAD107 | 1 | cg00295853 | 87245070 | 87245071 | NA |
| TAD107 | 1 | cg00994394 | 86934606 | 86934607 | CLCA1 |
| TAD107 | 1 | cg01118679 | 87018134 | 87018135 | CLCA4 |
| TAD107 | 1 | cg01712258 | 86819370 | 86819371 | ODF2L |
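The column requirements can be checked before running the analysis. The following is a minimal sketch under stated assumptions: the helper `check_input_columns` is hypothetical, and the data frame simply re-creates the example rows above instead of reading a real file.

```r
# Required columns, as listed above.
required_cols <- c("tad_name", "chromosome_name", "ID",
                   "start_position", "end_position", "Gene_id")

# Hypothetical helper: stop with a message if a required column is absent.
check_input_columns <- function(df) {
  missing <- setdiff(required_cols, colnames(df))
  if (length(missing) > 0) {
    stop("Missing required columns: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# The example rows from the table above, rebuilt by hand.
example <- data.frame(
  tad_name = "TAD107",
  chromosome_name = "1",
  ID = c("cg00295853", "cg00994394", "cg01118679", "cg01712258"),
  start_position = c(87245070, 86934606, 87018134, 86819370),
  end_position = c(87245071, 86934607, 87018135, 86819371),
  Gene_id = c(NA, "CLCA1", "CLCA4", "ODF2L"),
  stringsAsFactors = FALSE
)
check_input_columns(example)

# Split "|"-separated gene symbols; fixed = TRUE treats "|" as a literal
# character instead of the regex alternation operator.
genes <- unique(unlist(strsplit(example$Gene_id, "|", fixed = TRUE)))
genes <- genes[!is.na(genes)]
print(genes)
```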
The output graphs and csv files are stored in a separate output folder for each input file. Each folder consists of three subfolders:
- GO EA Outputs (includes the folders GO enrich all and GO enrich per TAD, and each one of them includes the subfolders GO MF and GO BP)
- Motif EA Outputs
- Pathways EA Outputs (includes the folders Path enrich all and Path enrich per TAD)
Each subfolder corresponds to the biological characteristic the input data is enriched with. For the files produced by 02d_TADiff.R, the enrichment analysis is evaluated in two ways for each biological characteristic: the statistical significance of the enriched terms (the "enrich all" folders) and the importance of the TADs as functional genomic regions (the "enrich per TAD" folders). For the files produced by 02c_evenDiff.R, the latter is omitted. The results of each type of analysis are stored mainly in two csv files per characteristic. The first one is grouped by enriched term, and an example is presented below:
| Term | TAD | P.value | P.adjust |
|---|---|---|---|
| ABC transporters | TAD2726\|TAD2727 | 6.31521032101889e-09\|9.723691144917e-05 | 1.89456309630567e-08\|0.000116684293739004 |
| Pancreatic secretion | TAD107 | 3.74E-05 | 5.61E-05 |
| Renin secretion | TAD107 | 1.30E-05 | 2.60E-05 |
| Salivary secretion | TAD2130 | 0.00012399 | 0.00012399 |
And the second one is grouped per TAD:
| TAD | Term | P.value | P.adjust |
|---|---|---|---|
| TAD1916 | ABC transporters\|Taste transduction | 0.0778524501891508\|0.139171179564888 | 0.0934229402269809\|0.139171179564888 |
| TAD2130 | Salivary secretion\|Taste transduction | 0.000123989884158722\|5.71267530160134e-23 | 0.000185984826238082\|3.4276051809608e-22 |
| TAD2726 | ABC transporters | 6.32E-09 | 1.89E-08 |
| TAD2727 | ABC transporters | 9.72E-05 | 0.000185985 |
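Because several terms and p-values can share one cell, separated with "|", it is often convenient to expand such a table to one row per TAD and term. The sketch below does this in base R for two of the rows shown above. It assumes the values inside a row are paired by position (first term with first p-value, and so on), which the example suggests but the source does not state. The helper `split_cell` is hypothetical.

```r
# Two rows copied from the per-TAD example above (P.adjust omitted for brevity).
per_tad <- data.frame(
  TAD = c("TAD2130", "TAD2726"),
  Term = c("Salivary secretion|Taste transduction", "ABC transporters"),
  P.value = c("0.000123989884158722|5.71267530160134e-23", "6.32E-09"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a column on the literal "|" separator.
split_cell <- function(x) strsplit(x, "|", fixed = TRUE)

terms <- split_cell(per_tad$Term)
pvals <- split_cell(per_tad$P.value)

# Assumption: each row holds the same number of terms and p-values.
stopifnot(lengths(terms) == lengths(pvals))

# One row per (TAD, Term) pair, with numeric p-values.
long <- data.frame(
  TAD = rep(per_tad$TAD, lengths(terms)),
  Term = unlist(terms),
  P.value = as.numeric(unlist(pvals)),
  stringsAsFactors = FALSE
)
print(long)
```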
The main script has the following inputs:
- tech: Human Genome Reference used (accepted values are `hg19` and `hg38`)
- exp.parent: number of the parent file of the expression data (can be found in the `01_data_integration.R` output txt file)
- dbs: vector of the Enrichr libraries used for the enrichment analysis (default is the vector `c("GO_Molecular_Function_2018", "GO_Biological_Process_2018", "KEGG_2019_Human")`; for the complete list of the Enrichr libraries, run the command `listEnrichrDbs()`)
- type: acronyms of the previously selected databases, used in the names of the output files (default is `c("GO.MF","GO.BP","KEGG")`)
- genes.cover: vector with the number of genes covered by each Enrichr library in `dbs` (default is `c(11459,14433,7802)`, which matches the default libraries)
- p.adjust.method: the method used for adjustment of the p-values of the enrichment analysis (default value is `"fdr"`, accepted values are `"holm"`, `"hochberg"`, `"hommel"`, `"bonferroni"`, `"BH"`, `"BY"`, `"fdr"`, `"none"`)
- criterio: Enrichr result column used as the selection criterion (accepted values are `"P.value"` and `"Adjusted.P.value"`)
- cut.off: the threshold of significant (adjusted) p-values (default value is `0.05`)
- cut.off.TF: adjusted p-value cut-off for the motif enrichment (default value is `0.05`)
- min.genes: the minimum number of genes in over-represented terms (default value is `3`)
- system: the OS of the local machine (default value is `"win"`)
- dir_name: the name of the input data folder
- output_folder: the name of the outputs folder
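To illustrate how `p.adjust.method`, `criterio`, and `cut.off` could interact, the sketch below adjusts three raw p-values taken from the example term table and keeps the rows at or below the threshold. It is not the script's actual code. The data frame, the locally computed `Adjusted.P.value` column (a stand-in for the column Enrichr returns), and the `<=` comparison are all assumptions. The adjusted values need not match the tables above, since the script adjusts over its own set of terms.

```r
# Parameter values: the listed defaults, plus one accepted value for criterio
# (the source gives no default for it).
p.adjust.method <- "fdr"
criterio <- "Adjusted.P.value"
cut.off <- 0.05

# Raw p-values copied from the example term table above.
enrich <- data.frame(
  Term = c("Pancreatic secretion", "Renin secretion", "Salivary secretion"),
  P.value = c(3.74e-05, 1.30e-05, 0.00012399),
  stringsAsFactors = FALSE
)

# Stand-in for Enrichr's adjusted column, computed here with stats::p.adjust().
enrich$Adjusted.P.value <- p.adjust(enrich$P.value, method = p.adjust.method)

# Keep the terms whose selected column passes the threshold (assumed "<=").
significant <- enrich[enrich[[criterio]] <= cut.off, ]
print(significant)
```

In `stats::p.adjust()`, `"fdr"` is an alias for `"BH"` (Benjamini-Hochberg), so both settings give the same adjusted values.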