OG files from Lidia were pre-treated in bash before R import:
# Just gene-KEGG annotation
awk '{print $1, $2}' kegg_BBMOSOLA-GC_250bp_gene.lengthNorm.SingleCopyGeneNorm.counts.tbl > gene_kegg.tbl
# Just SOLA abundance table
awk '{for(i=87;i<=176;i++) printf "%s%s", $i, (i==176?RS:FS)}' kegg_BBMOSOLA-GC_250bp_gene.lengthNorm.SingleCopyGeneNorm.counts.tbl > SOLA_2009_2014_gene.tbl
# stitch them back together
paste -d ' ' gene_kegg.tbl SOLA_2009_2014_gene.tbl > SOLA_2009_2014_gene_annot.tbl
# filtered out empty rows (absent genes)
awk 'NR==1 {print; next} {sum=0; for(i=3;i<=NF;i++) sum+=$i; if(sum!=0) print}' SOLA_2009_2014_gene_annot.tbl > SOLA_2009_2014_gene_annot_filtered.tbl
# convert delimiters to tab sep
awk -v OFS='\t' '{$1=$1}1' SOLA_2009_2014_gene_annot_filtered.tbl > SOLA_2009_2014_gene.tbl
# remove transition files
rm SOLA_2009_2014_gene_annot.tbl
rm mOLA_2009_2014_gene_annot_filtered.tbl
rm gene_kegg.tbl
- 1 sample info table (with dates)
- 1 KEGG Abundance table
- 1 KEGG Functional Annotation table
- 1 Genes Abundance table
- 1 Genes-KEGG annotation table
Results table for the 8375 KEGGs looks like this:
kegg PNmax Period p.value Phase
1 K00525 0.207330 362.000 0.010 Mar
2 K02014 0.140910 181.000 0.212 Oct
3 K02335 0.186610 362.000 0.061 May
4 K03046 0.284880 362.000 0.000 Apr
5 K03043 0.223030 362.000 0.000 Apr
6 K02337 0.088779 38.394 0.939 May
7 K01915 0.180820 362.000 0.020 Apr
8 K00605 0.251340 362.000 0.000 Apr
9 K01952 0.124520 31.675 0.424 May
10 K01652 0.111540 362.000 0.535 Apr
2857 Keggs were found to be rhythmic (PNmax > 0.1 and p.value < 0.01) 5517 Keggs were not significantly rhythmic (over the 7 years time period)
As expected, rhythmicity was well correlated to KEGG's total abundance, with rarer KEGGs having sparser distribution (potentially absent for long time periods):

We also can explore the differences in rhythmicity across KEGGs functional categories:

We also investigated the most rhythmic KEGG's patterns over the time series:

Its functional annotation is not very revealing:
egg Gene names kegg annotation GN1 GN2 GN3 GN4 GN5 GN6 GN7 GN8 GN9 KH1
3292 K06287 maf septum formation protein maf 09190 Not Included in Pathway or Brite
KH2 KH3 KH4 KH5 KH6 KH7 KH8 KH9 KH10 KH11 KH12 KH13 KH14 KH15 KH16
3292 09193 Unclassified: signaling and cellular processes 99978 Cell growth <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
KH17 KH18 KH19 KH20 KH21 KH22 KH23 KH24 KH25 KH26 KH27 KH28 KH29 KH30 KH31 tot.ab nb.genes
3292 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 48.81385 5067
Next we check the patterns of the genes annotated to this KEGG (the KEGG's composition).
5067 genes were annotated to this specific KEGG, here are their cumulated abundance over the time-series:

We note that the abundances of genes annotated to this KEGG is not equal to the abundance of the KEGG (see here across samples):

The idea of this script is to compute LSP rhythmicity for all KEGGs based on all possible random combinations of Years. First we create a matrix of years combinations, from 2-Years to 7-Years:
# Initialize a list to store data.frames
Years=unique(info$Year)
nYears=length(unique(info$Year))
yl<-year.length(Years)
# Create all unique combinations of years from 2 years to 7
comb_list <- list()
for (k in 2:nYears) {
tmp <- as.data.frame(t(combn(Years, k)))
colnames(tmp) <- paste0("Year", 1:k)
comb_list[[as.character(k)]] <- tmp
}
# Combine all into a single data.frame (with varying number of columns)
all_combs <- bind_rows(comb_list, .id = "size" )[,-1]
all_combs$NB <- apply(all_combs, 1, function(x) max(which(!is.na(x))))
Which gives us the following table with 120 combinations, in which each year is represented equitatively (63 times):
Year1 Year2 Year3 Year4 Year5 Year6 Year7 NB
1: 2009 2010 <NA> <NA> <NA> <NA> <NA> 2
2: 2009 2011 <NA> <NA> <NA> <NA> <NA> 2
3: 2009 2012 <NA> <NA> <NA> <NA> <NA> 2
4: 2009 2013 <NA> <NA> <NA> <NA> <NA> 2
5: 2009 2014 <NA> <NA> <NA> <NA> <NA> 2
---
116: 2009 2010 2011 2013 2014 2015 <NA> 6
117: 2009 2010 2012 2013 2014 2015 <NA> 6
118: 2009 2011 2012 2013 2014 2015 <NA> 6
119: 2010 2011 2012 2013 2014 2015 <NA> 6
120: 2009 2010 2011 2012 2013 2014 2015 7
We then create 120 subdatasets using only each of these Years combinations, we create a fake consecutive "Julian Days" column that run over the year-gaps. We can then run the LSP algorithm and save the results for all 8374 KEGGs over the 120 Year-combinations:
list_comb_lsp<-list()
for (cb in seq_len(length(list_comb_info))){
# merge info on the time series with KEGGs abundances
lsp.keggs<-merge(list_comb_info[[cb]], t(keggs.ab), by.x = "ID", by.y = "row.names" )
# loop the command over all Keggs
r=NULL
for (k in 11:ncol(lsp.keggs)){
rl<-summary(randlsp(x = lsp.keggs[, c(9, k)], repeats = 99, type = "period", from = 30, to = 1000, trace = FALSE, plot = FALSE ))
r<-rbind(r,data.frame(kegg = colnames(lsp.keggs)[k], PNmax = as.numeric(rl[9,]), Period = as.numeric(rl[10,]), p.value = as.numeric(rl[13,]), Phase = format(lsp.keggs[which.max(lsp.keggs[,k]),"Date"], "%b") ))
print(paste(cb, k, ncol(lsp.keggs), sep = " / "))
}
list_comb_lsp[[paste(cb)]]<-r
}
saveRDS(list_comb_lsp, "list_comb_lsp.RData")
# results
list_comb_lsp<-readRDS("~/Desktop/PETRIMED/ANALYSES/ts_rhythm/list_comb_lsp.RData")
## format results
# empty files
pnmax=NULL
period=NULL
pv=NULL
phase=NULL
# extract and merge results per coefficient
for (i in 1:120){
pnmax<-cbind(pnmax,list_comb_lsp[[i]][,2]);colnames(pnmax)[ncol(pnmax)]<-i
period<-cbind(period,list_comb_lsp[[i]][,3]);colnames(period)[ncol(period)]<-i
pv<-cbind(pv,list_comb_lsp[[i]][,4]);colnames(pv)[ncol(pv)]<-i
phase<-cbind(phase,list_comb_lsp[[i]][,5]);colnames(phase)[ncol(phase)]<-i
}
# format rownmaes of result table
rownames(pnmax)<-list_comb_lsp[[i]][,1]
rownames(period)<-list_comb_lsp[[i]][,1]
rownames(pv)<-list_comb_lsp[[i]][,1]
rownames(phase)<-list_comb_lsp[[i]][,1]
# which combination is the best for each KEGG in terms of PNmax
comb.PN<-data.frame(PNmax = apply(pnmax,1, which.max))
We studied which combinations yielded the highest PNmax (rhytmicity) for each KEGG. We can also study which Years' removal generated the highest PNmax over all KEGGs:
# which Year's removal generate the highest PNmax?
missing_years<-list()
for (y in 1:120){missing_years[[y]]<-Years[!Years %in% list_comb_info[[y]]$Year]}
all.KO=NULL
for (k in rownames(comb.PN) ){all.KO<-c(all.KO,missing_years[[comb.PN[k,]]])}
table(all.KO)
which gives us:
table(all.KO)
all.KO
2009 2010 2011 2012 2013 2014 2015
5599 5030 5615 5911 6691 6364 6494
It seems that the absence of the years 2013 and 2015 generated the highest PNmax for most KEGGs (6691 and 6494 over 8734). Could these years harbor more perturbations than other years causing a lack of rythmicity for most KEGGs?
Let's check the variability of environmental variables across years (source: SOMLIT SOLA):

We can also study the number of significantly rhythmic KEGGs (PNmax > 0.1, p.value ≤ 0.01, period >100) per combination of years:
# results
list_comb_lsp<-readRDS("~/Desktop/PETRIMED/ANALYSES/ts_rhythm/list_comb_lsp.RData")
# filter only results
list_comb_lsp_filtered<-list()
for (cb in 1:120){list_comb_lsp_filtered[[cb]]<-list_comb_lsp[[cb]][list_comb_lsp[[cb]]$PNmax>0.1 &list_comb_lsp[[cb]]$p.value<=0.01 & list_comb_lsp[[cb]]$Period>100,]}
comb.kr<-cbind(all_combs,data.frame(nb.rhytmic.keggs = unlist(lapply(list_comb_lsp_filtered,nrow))))
comb.kr$string=apply(comb.kr[,1:7], 1, function(x) paste(x[!is.na(x)], collapse = " ") )
ggplot(data = comb.kr, aes(y = string, x = nb.rhytmic.keggs))+
facet_grid(NB~., space = "free", scale = "free")+
scale_x_continuous(expand = c(0,0))+
geom_bar(stat = "identity", color = "steelblue4" )+
labs(y = "Years included in the\ncomputation of Rythmicity", x = "# of significantly rhythmic KEGGs")+
theme_minimal()+
theme(strip.text.y = element_text(angle= 0, hjust = 0))
Overall the number of rhythmic KEGGs increases with the number of years included (to reach a maximu when including the 7 years). But within subsets
We can also study the number of significantly rhythmic KEGGs per year included and combinations of years:

Additionally we can test the significativity of the difference between years within combinations of years (in case some spy is interested...):

In combinations of 2 or 3 years we can observe that including the 2013, 2014, and 2015 significantly descreases the number of rhythmic KEGGs.