Thanks for the tool, it sounds very useful! I want to use it to annotate the viruses I find in my metagenomes, but I have a few questions about using it:
- In the literature (for example, the article "Structure-guided discovery of anti-CRISPR and anti-phage defense proteins" from last month), a TM-score > 0.6 between an unknown protein and a known one is used to predict the function of the unknown one. The default thresholds in Phold are an e-value of 1e-3 and a sensitivity of 9.5 for Foldseek. How do Phold's default thresholds compare to this score? My guess is that they are less sensitive, because the aim is to get true annotations rather than extreme novelty. Also, you use a stricter e-value cutoff for CARD hits, and I am not sure why?
- Phold makes great use of sequence and structure alignments to maximize the number of protein annotations. Do you think large language models might improve Phold's results by providing at least the PHROG category for some unknown genes? The results in "Large language models improve annotation of prokaryotic viral proteins" (2023) sounded promising.
- To further improve the annotations, I feel that using the colocalization of viral genes might work. PHROG incorporates a network of colocalized genes: do you think it could be leveraged to decide between several hits that would otherwise be equally likely?
- Overlapping genes are not reported by any viral annotation tool I know of. What I do so far is look for potential overlapping genes by running blastp searches within the viral genes to find potential additional genes. Would there be a way to look for likely overlapping genes and add them to the Phold output?
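
To make the last question more concrete, here is a minimal sketch of one way candidate overlapping genes could be enumerated before a blastp search. It is only an illustration under my own assumptions, not something Phold does: the function name `nested_orfs`, the `min_codons` cutoff, the ATG-only start codon and the restriction to the forward strand are all hypothetical choices.

```python
# Sketch (assumption, not part of Phold): list ORFs that sit in the two
# alternative forward reading frames of an already annotated gene.
START = "ATG"                   # assumed start codon; alternative starts are ignored
STOPS = {"TAA", "TAG", "TGA"}   # standard stop codons


def nested_orfs(gene_seq, min_codons=30):
    """Return (start, end, frame) tuples for ORFs in frames +1 and +2.

    Coordinates are 0-based and end-exclusive, relative to gene_seq,
    and the stop codon is included in the reported span.
    """
    seq = gene_seq.upper()
    found = []
    for frame in (1, 2):  # frame 0 is the annotated gene itself
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == START:
                orf_start = i  # open a candidate ORF at the first ATG
            elif orf_start is not None and codon in STOPS:
                # keep the ORF only if it has enough codons before the stop
                if (i - orf_start) // 3 >= min_codons:
                    found.append((orf_start, i + 3, frame))
                orf_start = None
    return found


if __name__ == "__main__":
    # Toy sequence with one short ORF in frame +1
    print(nested_orfs("AATGGCCGCCGCCTAACC", min_codons=3))
```

The candidates returned this way would still need to be translated and searched (for example with blastp) to decide whether any of them is a plausible gene.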