mcmaniou edited this page Mar 8, 2021 · 3 revisions

Functional analysis

InterTADs provides the option of performing gene enrichment analysis (GO MF Terms, GO BP Terms, KEGG Pathways) and motif enrichment analysis using the script 03_functional_analysis.R. The script analyzes the integrated tables produced by the scripts: 02c_evenDiff.R and 02d_TADiff.R. It accepts as input the filepath of a folder and performs the analysis for each individual file inside the folder recursively.
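The recursive folder traversal described above can be sketched in base R. This is only an illustration: the helper name `list_input_files` is hypothetical, and whether 03_functional_analysis.R filters files by extension is not stated here, so no filter is applied.

```r
# Hypothetical helper: collect every file under a folder, including
# files in nested subfolders (directories themselves are not returned).
list_input_files <- function(dir_name) {
  list.files(dir_name, recursive = TRUE, full.names = TRUE)
}

# Small demonstration in a temporary folder with one nested subfolder.
demo_dir <- file.path(tempdir(), "intertads_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "a.txt"),
                      file.path(demo_dir, "sub", "b.txt")))

input_files <- list_input_files(demo_dir)
print(input_files)
```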

Prerequisites

The extra packages that must be installed on your local machine are:

install.packages(c("ggseqlogo","seqinr","httr","jsonlite","xml2","enrichR","stats","purrr","igraph","ggraph","hrbrthemes","extrafont","gridExtra","ggpubr","pryr", "ape"))

devtools::install_github("nikopech/saveImageHigh")

If you install the "extrafont" package, make sure to import the fonts with the following command. This only needs to be done once.

extrafont::font_import()

And from Bioconductor:

BiocManager::install(c("KEGGREST","PWMEnrich","PWMEnrich.Hsapiens.background"))
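Before running the script, it may help to confirm that the packages listed above can actually be found. The sketch below uses only base R; the helper name `missing_packages` is an assumption made for illustration and is not part of InterTADs.

```r
# Package names taken from the install commands above.
cran_pkgs <- c("ggseqlogo", "seqinr", "httr", "jsonlite", "xml2", "enrichR",
               "purrr", "igraph", "ggraph", "hrbrthemes", "extrafont",
               "gridExtra", "ggpubr", "pryr", "ape")
github_pkgs <- c("saveImageHigh")
bioc_pkgs <- c("KEGGREST", "PWMEnrich", "PWMEnrich.Hsapiens.background")

# Hypothetical helper: returns the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!found])
}

still_missing <- missing_packages(c(cran_pkgs, github_pkgs, bioc_pkgs))
if (length(still_missing) > 0) {
  message("Not installed yet: ", paste(still_missing, collapse = ", "))
} else {
  message("All listed packages are available.")
}
```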

Running the step

In order to run this final step use the following command:

source("03_functional_analysis.R")

Input files

Each input file must contain the following columns:

  • tad_name: TAD that the event corresponds to
  • chromosome_name: chromosome that the event corresponds to
  • ID: unique event identifier
  • start_position: start position of the event in the chromosome
  • end_position: end position of the event in the chromosome
  • Gene_id: symbols of the genes the event corresponds to, multiple values are separated with "|"

It may also contain:

  • extra columns with the sample data
  • extra columns with information of the statistical significance of the event

An example of the necessary input data is provided:

tad_name chromosome_name ID start_position end_position Gene_id
TAD107 1 cg00295853 87245070 87245071 NA
TAD107 1 cg00994394 86934606 86934607 CLCA1
TAD107 1 cg01118679 87018134 87018135 CLCA4
TAD107 1 cg01712258 86819370 86819371 ODF2L
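A quick sanity check of an input table could look like the sketch below. It rebuilds the example rows above as a data frame, verifies the required columns, and splits Gene_id on "|". The helper names `check_input_table` and `split_genes` are hypothetical and are not functions of the InterTADs scripts.

```r
required_cols <- c("tad_name", "chromosome_name", "ID",
                   "start_position", "end_position", "Gene_id")

# Hypothetical helper: stop with a message if a required column is absent.
check_input_table <- function(tbl) {
  missing_cols <- setdiff(required_cols, colnames(tbl))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical helper: unique gene symbols, splitting multi-gene entries
# on "|" and skipping missing values.
split_genes <- function(gene_id) {
  gene_id <- as.character(gene_id)
  genes <- unlist(strsplit(gene_id[!is.na(gene_id)], "|", fixed = TRUE))
  unique(genes[nzchar(genes)])
}

# The example rows shown above.
example <- data.frame(
  tad_name = "TAD107",
  chromosome_name = "1",
  ID = c("cg00295853", "cg00994394", "cg01118679", "cg01712258"),
  start_position = c(87245070, 86934606, 87018134, 86819370),
  end_position = c(87245071, 86934607, 87018135, 86819371),
  Gene_id = c(NA, "CLCA1", "CLCA4", "ODF2L"),
  stringsAsFactors = FALSE
)

check_input_table(example)
genes <- split_genes(example$Gene_id)
print(genes)
```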

Output files

The output graphs and csv files are stored in a separate output folder for each input file. Each folder contains three subfolders:

  • GO EA Outputs (includes the folders GO enrich all and GO enrich per TAD, and each one of them includes the subfolders GO MF and GO BP)
  • Motif EA Outputs
  • Pathways EA Outputs (includes the folders Path enrich all and Path enrich per TAD)

Each subfolder corresponds to one biological characteristic the input data is enriched with. For files produced by 02d_TADiff.R, the enrichment analysis takes two directions for each biological characteristic: the statistical significance of the enriched terms (the enrich all folders) and the importance of the TADs as functional genomic regions (the enrich per TAD folders). For files produced by 02c_evenDiff.R, the latter is omitted. The results of each type of analysis are stored mainly in two csv files per characteristic. The first one is grouped by enriched term; an example is presented below:

Term TAD P.value P.adjust
ABC transporters TAD2726|TAD2727 6.31521032101889e-09|9.723691144917e-05 1.89456309630567e-08|0.000116684293739004
Pancreatic secretion TAD107 3.74E-05 5.61E-05
Renin secretion TAD107 1.30E-05 2.60E-05
Salivary secretion TAD2130 0.00012399 0.00012399

And the second one is grouped per TAD:

TAD Term P.value P.adjust
TAD1916 ABC transporters|Taste transduction 0.0778524501891508|0.139171179564888 0.0934229402269809|0.139171179564888
TAD2130 Salivary secretion|Taste transduction 0.000123989884158722|5.71267530160134e-23 0.000185984826238082|3.4276051809608e-22
TAD2726 ABC transporters 6.32E-09 1.89E-08
TAD2727 ABC transporters 9.72E-05 0.000185985
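For downstream work it can be convenient to expand the "|"-separated cells into one row per TAD and term. The sketch below does this for two rows of the per-TAD example above. It assumes that the i-th p-value in a cell belongs to the i-th term in the same row, which matches the example values but is not stated explicitly; `expand_per_tad` is a hypothetical helper name.

```r
# Two rows copied from the per-TAD example above.
per_tad <- data.frame(
  TAD = c("TAD2130", "TAD2726"),
  Term = c("Salivary secretion|Taste transduction", "ABC transporters"),
  P.value = c("0.000123989884158722|5.71267530160134e-23", "6.32E-09"),
  P.adjust = c("0.000185984826238082|3.4276051809608e-22", "1.89E-08"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per (TAD, Term) pair.
# Assumes terms and p-values are aligned by position within each cell.
expand_per_tad <- function(tbl) {
  split_col <- function(x) strsplit(as.character(x), "|", fixed = TRUE)
  terms <- split_col(tbl$Term)
  data.frame(
    TAD = rep(tbl$TAD, lengths(terms)),
    Term = unlist(terms),
    P.value = as.numeric(unlist(split_col(tbl$P.value))),
    P.adjust = as.numeric(unlist(split_col(tbl$P.adjust))),
    stringsAsFactors = FALSE
  )
}

long <- expand_per_tad(per_tad)
print(long)
```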

Input parameters

The main script has the following inputs:

  1. tech: Human Genome Reference used (accepted values are hg19 and hg38)
  2. exp.parent: number of the parent file of the expression data (can be found in the 01_data_integration.R output txt file)
  3. dbs: vector with a list of the Enrichr libraries used for the enrichment analysis (default is the vector c("GO_Molecular_Function_2018", "GO_Biological_Process_2018", "KEGG_2019_Human"), for the complete list of the Enrichr libraries type the command listEnrichrDbs())
  4. type: acronyms of the previously selected databases, used in the names of the output files (default is: c("GO.MF","GO.BP","KEGG"))
  5. genes.cover: vector with the number of genes covered by each Enrichr library in dbs, in the same order (default is c(11459,14433,7802), matching the default libraries)
  6. p.adjust.method: the method used for adjustment of the p-values of the enrichment analysis (default value is "fdr", accepted values are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
  7. criterio: Enrichr result column used as the selection criterion (accepted values are: "P.value" and "Adjusted.P.value")
  8. cut.off: the threshold of significant (adjusted) p-values (default value is 0.05)
  9. cut.off.TF: cut-off motif enrichment adjusted p-value (default value is 0.05)
  10. min.genes: the minimum number of genes in over-represented terms (default value is 3)
  11. system: the OS of the local machine (default value is "win")
  12. dir_name: the name of the input data folder
  13. output_folder: the name of the outputs folder
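The interplay of dbs, type, genes.cover, p.adjust.method and cut.off can be illustrated with a short base R sketch. The default values are the ones listed above; the p-values are made up for illustration, and the strict `<` comparison against cut.off is an assumption, since the exact filtering rule of the script is not described here.

```r
# Defaults as listed above; the three vectors describe the same libraries,
# so they are expected to have the same length.
dbs <- c("GO_Molecular_Function_2018", "GO_Biological_Process_2018",
         "KEGG_2019_Human")
type <- c("GO.MF", "GO.BP", "KEGG")
genes.cover <- c(11459, 14433, 7802)
stopifnot(length(dbs) == length(type), length(dbs) == length(genes.cover))

p.adjust.method <- "fdr"
cut.off <- 0.05

# Made-up raw p-values, for illustration only.
p_values <- c(0.001, 0.008, 0.02, 0.3, 0.6)

# stats::p.adjust accepts the same method names as p.adjust.method.
adjusted <- stats::p.adjust(p_values, method = p.adjust.method)

# Assumed rule: keep terms whose adjusted p-value is below cut.off.
significant <- adjusted < cut.off
print(data.frame(p_values, adjusted, significant))
```

In `stats::p.adjust`, "fdr" is an alias for "BH" (Benjamini-Hochberg), so both settings give the same adjusted values.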
