ML-Bioinfo-CEITEC/KPDwESM3

Advancing Knotted Protein Design with ESM3

Code and experiments for the paper "Advancing Knotted Protein Design with ESM3".

We investigate how multimodal protein language models handle topological complexity, using knotted proteins as a test case. With ESM3's guided generation, 89% of attempts produce a knotted protein, compared with ~0.5% for unguided approaches. We also find that knot topology is remarkably robust to sequence perturbation (mean breaking point: 84% of the sequence masked) and that structural drift precedes topological disruption.

Repository Structure

├── src/                    # Experiment scripts (Modal GPU compute)
├── tex/                    # Paper LaTeX source and figures
├── results/                # Experiment outputs (JSON)
├── notes/                  # Working notes and analysis
└── old_codes/              # Original Jupyter notebooks

Experiment Scripts (src/)

All experiments run on Modal serverless GPUs using the ESM3-SM (1.4B) model.

| Script | Description |
|---|---|
| smoke_test.py | Minimal validation: load ESM3, generate structure, run topoly |
| benchmark.py | Per-operation timing benchmarks on Modal A10G |
| guided_gen_run.py | De novo guided generation of knotted proteins (n=100) |
| masking_experiment.py | Knot stability under random masking (n=250, 10 levels) |
| rmsd_analysis.py | RMSD structural drift analysis (n=80) |
| embeddings_classifier.py | ESM3 embedding extraction + MLP classifier (n=5000) |
| contiguous_masking.py | Contiguous vs random masking comparison (n=50) |
| targeted_masking.py | Core vs non-core targeted masking (n=40) |
| sliding_window.py | Position-resolved vulnerability profiles (n=40) |
| unknot_final.py | Unknotted-to-knotted conversion (n=99) |
| typed_gen.py | Knot-type-specific guided generation (4 types × 10) |
| length_gen.py | Length-dependent generation success (6 lengths × 10) |
| extract_embeddings.py | Full embedding extraction for UMAP visualization |
| restyle_figures.py | Generate all paper figures from result JSONs (3 selectable themes) |
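To make the difference between the random and contiguous masking experiments concrete, here is a minimal sketch of the two strategies on a plain sequence string. It is not taken from the scripts in src/: the function names, the toy sequence, and the use of `_` as the mask placeholder are assumptions made for illustration.

```python
import random

MASK = "_"  # assumption: placeholder character standing in for a masked residue


def mask_random(seq: str, fraction: float, rng: random.Random) -> str:
    """Mask round(fraction * len(seq)) positions chosen uniformly at random."""
    n_mask = round(fraction * len(seq))
    positions = set(rng.sample(range(len(seq)), n_mask))
    return "".join(MASK if i in positions else aa for i, aa in enumerate(seq))


def mask_contiguous(seq: str, fraction: float, rng: random.Random) -> str:
    """Mask one contiguous window covering round(fraction * len(seq)) positions."""
    n_mask = round(fraction * len(seq))
    start = rng.randint(0, len(seq) - n_mask)  # window start, inclusive bounds
    return seq[:start] + MASK * n_mask + seq[start + n_mask:]


rng = random.Random(0)  # fixed seed so runs are repeatable
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"  # arbitrary 40-residue toy sequence

for level in (10, 50, 90):  # masking levels in percent
    print(level, mask_random(seq, level / 100, rng))
    print(level, mask_contiguous(seq, level / 100, rng))
```

Both functions mask the same number of residues at a given level, so any difference in knot survival between them would come from where the masked positions fall rather than how many there are.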

Setup

Requires uv and a Modal account.

# Install dependencies
uv sync

# Set Modal token (one-time)
uv run modal token set --token-id <ID> --token-secret <SECRET>

# Create HuggingFace secret on Modal
uv run modal secret create huggingface-secret HF_TOKEN=<TOKEN> --force

Running Experiments

# Smoke test (~5 min)
uv run modal run src/smoke_test.py

# Guided generation (n=100, ~10 min with 10 GPUs)
uv run modal run src/guided_gen_run.py --n-attempts 100

# Masking stability (n=300, ~90 min with 10 GPUs)
uv run modal run src/masking_experiment.py \
  --n-proteins 300 --n-trials 8 \
  --levels "10,20,30,40,50,60,70,80,85,90"

# Generate all figures locally with the bold theme (Okabe-Ito colorblind-safe palette, used in the paper)
uv run python src/restyle_figures.py --theme bold

# Or render all three themes side-by-side for comparison (calm / minimal / bold)
uv run python src/restyle_figures.py

To keep a job running after your terminal disconnects, pass --detach to modal run, for example: uv run modal run --detach src/smoke_test.py

Key Results

| Experiment | N | Result |
|---|---|---|
| Guided generation | 100 | 89% success (95% CI: 81–94%) |
| Masking breaking point | 250 | Mean 84% (±1.2% SE) |
| RMSD at 50% masking | 80 | 3.24 Å median, knot prob 0.85 |
| Embedding classifier | 5000 | 97.1% accuracy |
| Max seq identity to known | 89 | 14.5% (random baseline: 12.5%) |
| Unknotted-to-knotted | 99 | 17% (95% CI: 10–26%) |
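For readers who want to sanity-check numbers of this kind, the sketch below computes a 95% confidence interval for a success rate and a mean with its standard error. The interval method behind the table is not stated here, so the Wilson score interval is an assumption and may differ from the reported bounds by about a percentage point. The function names and the toy breaking-point values are illustrative only.

```python
import math
import statistics


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def mean_and_se(values: list[float]) -> tuple[float, float]:
    """Sample mean and standard error (sample std dev / sqrt(n))."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Guided generation: 89 knotted proteins out of 100 attempts (from the table).
lo, hi = wilson_interval(89, 100)
print(f"success rate 95% CI: {lo:.1%} to {hi:.1%}")

# Toy per-protein breaking points in percent; NOT data from the experiments.
mean, se = mean_and_se([80.0, 85.0, 90.0, 85.0])
print(f"mean breaking point: {mean:.1f}% (SE {se:.2f})")
```

The Wilson interval is used here because it needs only the standard library and behaves well for proportions near 0 or 1; an exact (Clopper-Pearson) interval would be slightly wider.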

Dataset

EvaKlimentova/Diffusion-all_knots on HuggingFace Hub: 15,000 proteins from three sources (Real, RFdiffusion, EvoDiff), each contributing 1,000 knotted and 4,000 unknotted proteins.

Dependencies

  • ESM3 (esm==3.2.3): EvolutionaryScale multimodal protein language model
  • Topoly: Alexander polynomial knot detection
  • Modal: Serverless GPU compute
  • AlphaKnot 2.0: Knot core position ground truth

Citation

@inproceedings{simecek2025advancing,
  title     = {Advancing Knotted Protein Design with ESM3: Guided Generation and Topological Insights},
  author    = {Simecek, Petr and Marsalkova, Eva},
  booktitle = {Proceedings of the ICML 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences},
  year      = {2025},
  url       = {https://openreview.net/forum?id=gYUAJPJeWP}
}

Acknowledgments

Supported by the Czech Science Foundation, project no. 23-04260L ("Biological code of knots"). Computational resources provided by Modal serverless GPU compute.
