Question about excessive featurising time caused by large MSA inputs and template option #641

@kimlab-cnu

Hello! I am a student in Korea studying protein design, and I would like to ask about an MSA-related issue we encountered during prediction of a 3-chain complex.

We are currently predicting a ternary complex composed of:
two designed proteins, each approximately 400 aa in length, and one cyclic peptide represented as SMILES (roughly 10–20 residues in size)

To apply the template option supported by AlphaFold 3, we downloaded experimentally validated mmCIF files from the RCSB PDB, prepared template structure files for each protein chain, and added the template information to a JSON file that already contained MSA results.

However, after enabling templates, the runtime increased substantially.
Previously, one complex took about 1 hour, but with templates applied, the same job required about 6 hours.

From the log, we found the following:

Featurising data with 1 seed(s)...
Featurising data with seed 220.
Featurising data with seed 220 took 21036.24 seconds.
Featurising data with 1 seed(s) took 21046.55 seconds.

The actual model inference takes only about 1 minute, while the featurising step takes nearly 5 hours and 50 minutes.
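As a quick check of the arithmetic, the durations in the log excerpt can be converted to hours, minutes and seconds. This is only a sketch: the regular expression assumes the "took N seconds" wording shown above, and the helper names are illustrative.

```python
import re

# Log excerpt copied from above.
LOG = """\
Featurising data with 1 seed(s)...
Featurising data with seed 220.
Featurising data with seed 220 took 21036.24 seconds.
Featurising data with 1 seed(s) took 21046.55 seconds.
"""


def featurisation_seconds(log_text: str) -> list[float]:
    """Return every 'took N seconds' duration found in the log text."""
    return [float(s) for s in re.findall(r"took ([0-9.]+) seconds", log_text)]


def as_hms(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm SSs' (rounded to whole seconds)."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


durations = featurisation_seconds(LOG)
print([as_hms(d) for d in durations])
```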

1. Can the template option have a substantial impact on the overall runtime of AlphaFold 3, including structure inference?

We then checked the MSA sizes for each chain:

Chain A
unpairedMsa length: 5,035,677
pairedMsa length: 15,933,678

Chain C
unpairedMsa length: 4,819,630
pairedMsa length: 15,324,761

These were generated using the default jackhmmer search settings.
Since Chain B is a ligand represented by SMILES, it does not contribute much meaningful paired MSA information.
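For reference, a minimal sketch of how such sizes can be read from the input JSON. It assumes the AlphaFold 3 input layout (a top-level "sequences" list whose "protein" entries carry "unpairedMsa" and "pairedMsa" as A3M-formatted strings). Because "length" could mean either characters or sequences, it reports both; the function names are illustrative.

```python
import json


def a3m_stats(a3m: str) -> tuple[int, int]:
    """Return (number of sequences, number of characters) for an A3M string."""
    n_seqs = sum(1 for line in a3m.splitlines() if line.startswith(">"))
    return n_seqs, len(a3m)


def msa_report(af3_input: dict) -> dict[str, dict[str, tuple[int, int]]]:
    """Collect MSA statistics for every protein chain in an AF3-style input."""
    report = {}
    for entry in af3_input.get("sequences", []):
        protein = entry.get("protein")
        if protein is None:
            continue  # ligand and other non-protein entries are skipped
        report[str(protein["id"])] = {
            field: a3m_stats(protein.get(field) or "")
            for field in ("unpairedMsa", "pairedMsa")
        }
    return report


def load_input(path: str) -> dict:
    """Load an input JSON file from disk."""
    with open(path) as handle:
        return json.load(handle)


# Toy input for illustration only; real MSAs are far larger.
example = {
    "sequences": [
        {"protein": {"id": "A", "sequence": "MKV",
                     "unpairedMsa": ">query\nMKV\n>hit1\nMRV\n",
                     "pairedMsa": ">query\nMKV\n"}},
        {"ligand": {"id": "B", "smiles": "CCO"}},
    ]
}
report = msa_report(example)
print(report)
```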

Based on this, we are currently considering the following possibilities:

A. Reducing the size of both unpairedMsa and pairedMsa
B. Because Chain B is a SMILES ligand and contributes little meaningful paired MSA information, we suspect that the pairedMsa may fail to preserve useful inter-chain linkage and instead be dominated by Chains A and C. We are therefore also considering reducing or even removing the pairedMsa
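To make option B concrete, here is a minimal sketch that blanks the paired MSA of every protein chain in an AF3-style input dictionary. It assumes that an empty "pairedMsa" string is accepted and treated as "no paired MSA"; that should be checked against the AlphaFold 3 input documentation. The function name is hypothetical, and inputs that reference MSAs by file path are not handled.

```python
import copy


def drop_paired_msa(af3_input: dict) -> dict:
    """Return a copy of the input with pairedMsa emptied for all protein chains."""
    out = copy.deepcopy(af3_input)  # leave the caller's dictionary untouched
    for entry in out.get("sequences", []):
        protein = entry.get("protein")
        if protein is not None and "pairedMsa" in protein:
            # Assumption: an empty string means "use no paired MSA".
            protein["pairedMsa"] = ""
    return out


# Toy input for illustration only.
original = {
    "sequences": [
        {"protein": {"id": "A", "sequence": "MKV",
                     "unpairedMsa": ">query\nMKV\n",
                     "pairedMsa": ">query\nMKV\n>hit1\nMRV\n"}},
        {"ligand": {"id": "B", "smiles": "CCO"}},
    ]
}
stripped = drop_paired_msa(original)
print(stripped["sequences"][0]["protein"]["pairedMsa"] == "")
```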

I would greatly appreciate your thoughts on whether these ideas seem reasonable.

In addition, when reducing MSA depth, I understand that there may not be a single correct answer. However, I have had difficulty finding clear guidance in the paper or related resources regarding:

2. How to decide how much to reduce the MSA?
My initial idea was to truncate the current MSA to approximately 70%, 50%, 30%, and 20% of its original size, then compare the resulting structures using metrics such as ipTM and PAE.

Would this seem like a reasonable benchmarking strategy? Any advice or practical recommendations would be greatly appreciated.
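A minimal sketch of the truncation step in that benchmark, assuming the MSAs are A3M-formatted strings with the query as the first record. It keeps the first records in file order, which is only one possible choice (random subsampling or filtering by identity would be alternatives), and it does not address how truncation affects cross-chain pairing of the paired MSA. The function names are illustrative.

```python
def split_a3m(a3m: str) -> list[tuple[str, str]]:
    """Split an A3M string into (header, sequence) records."""
    records = []
    header, seq_lines = None, []
    for line in a3m.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line, []
        elif line.strip():
            seq_lines.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def truncate_a3m(a3m: str, fraction: float) -> str:
    """Keep roughly the first `fraction` of records, always retaining the query."""
    records = split_a3m(a3m)
    if not records:
        return a3m
    keep = max(1, round(len(records) * fraction))
    return "".join(f"{header}\n{seq}\n" for header, seq in records[:keep])


# Toy MSA with 10 records; real MSAs are far larger.
toy = "".join(f">seq{i}\nMKV\n" for i in range(10))
for fraction in (0.7, 0.5, 0.3, 0.2):
    kept = len(split_a3m(truncate_a3m(toy, fraction)))
    print(fraction, kept)
```

Each truncated MSA would then be written back into a copy of the input JSON and run with the same seed, so that differences in ipTM and PAE reflect the MSA depth rather than sampling noise.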

Thank you very much for your time and help.
