Code and experiments for the paper "Advancing Knotted Protein Design with ESM3".
We investigate how multimodal protein language models handle topological complexity, using knotted proteins as a test case. With ESM3's guided generation, we achieve an 89% success rate in producing knotted proteins (compared to ~0.5% for unguided approaches). We also find that knot topology is remarkably robust to sequence perturbation (mean breaking point: 84% of residues masked) and that structural drift precedes topological disruption.
├── src/ # Experiment scripts (Modal GPU compute)
├── tex/ # Paper LaTeX source and figures
├── results/ # Experiment outputs (JSON)
├── notes/ # Working notes and analysis
└── old_codes/ # Original Jupyter notebooks
All experiments run on Modal serverless GPUs using the ESM3-SM (1.4B) model.
| Script | Description |
|---|---|
| `smoke_test.py` | Minimal validation: load ESM3, generate structure, run topoly |
| `benchmark.py` | Per-operation timing benchmarks on Modal A10G |
| `guided_gen_run.py` | De novo guided generation of knotted proteins (n=100) |
| `masking_experiment.py` | Knot stability under random masking (n=250, 10 levels) |
| `rmsd_analysis.py` | RMSD structural drift analysis (n=80) |
| `embeddings_classifier.py` | ESM3 embedding extraction + MLP classifier (n=5000) |
| `contiguous_masking.py` | Contiguous vs random masking comparison (n=50) |
| `targeted_masking.py` | Core vs non-core targeted masking (n=40) |
| `sliding_window.py` | Position-resolved vulnerability profiles (n=40) |
| `unknot_final.py` | Unknotted-to-knotted conversion (n=99) |
| `typed_gen.py` | Knot-type-specific guided generation (4 types × 10) |
| `length_gen.py` | Length-dependent generation success (6 lengths × 10) |
| `extract_embeddings.py` | Full embedding extraction for UMAP visualization |
| `restyle_figures.py` | Generate all paper figures from result JSONs (3 selectable themes) |
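
The random and contiguous masking schemes named in the table can be illustrated with a small sketch. This is not the code in `src/`; the function names are hypothetical, and the use of `_` as the mask character is an assumption about how masked positions are written in an ESM3 sequence prompt.

```python
import random

MASK = "_"  # assumption: masked positions are written as "_" in the sequence prompt


def random_mask(seq: str, fraction: float, rng: random.Random) -> str:
    """Mask round(fraction * len(seq)) residues at positions drawn uniformly at random."""
    k = round(fraction * len(seq))
    hidden = set(rng.sample(range(len(seq)), k))
    return "".join(MASK if i in hidden else aa for i, aa in enumerate(seq))


def contiguous_mask(seq: str, fraction: float, rng: random.Random) -> str:
    """Mask a single contiguous window covering round(fraction * len(seq)) residues."""
    k = round(fraction * len(seq))
    start = rng.randint(0, len(seq) - k)  # randint is inclusive on both ends
    return seq[:start] + MASK * k + seq[start + k:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the masks are reproducible
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"  # arbitrary example sequence
    for level in (10, 50, 90):
        print(level, random_mask(seq, level / 100, rng))
        print(level, contiguous_mask(seq, level / 100, rng))
```

Both functions mask the same number of residues at a given level, so any difference in knot survival between them reflects where the masked positions fall rather than how many there are.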
Requires uv and a Modal account.
# Install dependencies
uv sync
# Set Modal token (one-time)
uv run modal token set --token-id <ID> --token-secret <SECRET>
# Create HuggingFace secret on Modal
uv run modal secret create huggingface-secret HF_TOKEN=<TOKEN> --force

# Smoke test (~5 min)
uv run modal run src/smoke_test.py
# Guided generation (n=100, ~10 min with 10 GPUs)
uv run modal run src/guided_gen_run.py --n-attempts 100
# Masking stability (n=300, ~90 min with 10 GPUs)
uv run modal run src/masking_experiment.py \
--n-proteins 300 --n-trials 8 \
--levels "10,20,30,40,50,60,70,80,85,90"
# Generate all figures locally (Okabe-Ito colorblind-safe theme used in the paper)
uv run python src/restyle_figures.py --theme bold
# Or render all three themes side-by-side for comparison (calm / minimal / bold)
uv run python src/restyle_figures.py

Add `--detach` to a `modal run` command to keep the job running if your terminal disconnects.
| Experiment | N | Result |
|---|---|---|
| Guided generation | 100 | 89% success (95% CI: 81–94%) |
| Masking breaking point | 250 | Mean 84% (±1.2% SE) |
| RMSD at 50% masking | 80 | 3.24 Å median, knot prob 0.85 |
| Embedding classifier | 5000 | 97.1% accuracy |
| Max seq identity to known | 89 | 14.5% (random baseline: 12.5%) |
| Unknotted-to-knotted | 99 | 17% (95% CI: 10–26%) |
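
For readers who want to sanity-check the binomial confidence intervals in the table, here is a minimal sketch using the Wilson score interval. The README does not say which interval method the authors used, so Wilson is an assumption here, and its bounds can differ from an exact (Clopper-Pearson) interval by about a percentage point.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, see lead-in)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    lo, hi = wilson_interval(89, 100)  # guided generation: 89 successes out of 100
    print(f"{lo:.1%} to {hi:.1%}")
```

For 89 successes out of 100 this gives roughly 81% to 94%, in line with the guided generation row.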
EvaKlimentova/Diffusion-all_knots on the Hugging Face Hub: 15,000 proteins from three sources (Real, RFdiffusion, EvoDiff), with 1,000 knotted and 4,000 unknotted proteins per source.
- ESM3 (`esm==3.2.3`): EvolutionaryScale multimodal protein language model
- Topoly: Alexander polynomial knot detection
- Modal: Serverless GPU compute
- AlphaKnot 2.0: Knot core position ground truth
@inproceedings{simecek2025advancing,
title = {Advancing Knotted Protein Design with ESM3: Guided Generation and Topological Insights},
author = {Simecek, Petr and Marsalkova, Eva},
booktitle = {Proceedings of the ICML 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences},
year = {2025},
url = {https://openreview.net/forum?id=gYUAJPJeWP}
}
Supported by the Czech Science Foundation, project no. 23-04260L ("Biological code of knots"). Computational resources provided by Modal serverless GPU compute.