7 changes: 7 additions & 0 deletions .gitignore
@@ -42,6 +42,8 @@ npm-debug.log
.nextflow.log*
.nextflow/
nf_workspace/
work/
nextflow

# Input/Output directories
input/
@@ -60,3 +62,8 @@ uniprot/outputs/parquet/
# Generated ModelCIF metadata
examples/complexes/modelcif_metadata/
examples/multimer_examples/*.test.cif

# ipSAE C++ build artifacts and fetched dependencies
afdb_integration_kit/ipsae/ipsae_cpp
afdb_integration_kit/ipsae/deps/eigen-*/
afdb_integration_kit/ipsae/deps/*.tar.gz
227 changes: 177 additions & 50 deletions README.md
@@ -14,8 +14,9 @@ A comprehensive toolkit for integrating structural models into the AlphaFold Dat
- [3. Install Mol\* CLI](#3-install-mol-cli)
- [4. Install DSSP](#4-install-dssp)
- [5. Download mmCIF Dictionary (Required for ModelCIF Generator)](#5-download-mmcif-dictionary-required-for-modelcif-generator)
- [6. Install Nextflow (Optional)](#6-install-nextflow-optional)
- [7. Install Docker (Optional)](#7-install-docker-optional)
- [6. Install Production Pipeline Dependencies (Optional)](#6-install-production-pipeline-dependencies-optional)
- [7. Install Nextflow (Optional)](#7-install-nextflow-optional)
- [8. Install Docker (Optional)](#8-install-docker-optional)
- [Quick Start](#quick-start)
- [Verify Installation](#verify-installation)
- [Basic Usage Example](#basic-usage-example)
@@ -24,6 +25,8 @@ A comprehensive toolkit for integrating structural models into the AlphaFold Dat
- [CIF to BCIF Converter](#cif-to-bcif-converter)
- [DSSP Secondary Structure Assignment](#dssp-secondary-structure-assignment)
- [Metadata Schema Validation](#metadata-schema-validation)
- [Production Pipeline](#production-pipeline)
- [Prepare Inputs (Standalone)](#prepare-inputs-standalone)
- [Docker Usage](#docker-usage)
- [Use Prebuilt Docker Image (Recommended)](#use-prebuilt-docker-image-recommended)
- [Build Docker Image (Optional)](#build-docker-image-optional)
@@ -54,6 +57,7 @@ A comprehensive toolkit for integrating structural models into the AlphaFold Dat
- **Metadata Schema Validation**: Validate model and provider metadata JSONs against AFDB-defined schemas
- **UniProt Metadata Tooling**: Streamline UniProt subset extraction and AF metadata generation (see [uniprot/README.md](uniprot/README.md))
- **Automated Workflows**: Nextflow-based end-to-end processing pipelines
- **Production Pipeline**: Standalone Python pipeline with logging, caching, resume capability, structure analysis (clash detection, interface residues), ipSAE quality scoring, and mmCIF QA metric embedding
- **Docker Support**: Containerized execution for reproducible results
- **Validation Tools**: Built-in testing and validation utilities

@@ -110,7 +114,9 @@ Without nvm:
npm install -g molstar
```

### 4. Install DSSP
### 4. Install DSSP (Nextflow workflow only)

The production pipeline uses built-in Python secondary structure algorithms (`pydssp`, `psea`, `tmalign`) and does **not** require an external DSSP binary. This step is only needed if you use the Nextflow workflow.

We use the modern DSSP implementation by the PDB-REDO team:

Expand Down Expand Up @@ -141,7 +147,44 @@ curl -o mmcif_ma.dic https://raw.githubusercontent.com/ihmwg/ModelCIF/refs/heads

**Note:** This step is automatically handled in the Docker environment, but is required for local installations.

### 6. Install Nextflow (Optional)
### 6. Install Production Pipeline Dependencies (Optional)

The production pipeline (`scripts/production_pipeline.py`) requires additional dependencies for structure analysis (clash detection, interface residues). These use PyTorch and torch_cluster.

**Option A: Using `environment.yml` (recommended):**

```bash
conda env create -f environment.yml
conda activate afdb-toolkit

# Install Mol* CLI into the environment
npm install -g molstar
```

This installs everything (core + production + C++ build tools + Node.js) in one step.

**Option B: Manual pip installation:**

```bash
# Install PyTorch 2.8.0 (CPU version) - pinned for torch_cluster compatibility
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cpu

# Install torch_cluster (CPU version)
pip install torch_cluster -f https://data.pyg.org/whl/torch-2.8.0+cpu.html

# Install other production dependencies
uv pip install ".[production]"
```

**Verify installation:**

```bash
python -c "import torch; from torch_cluster import radius_graph; print('torch_cluster OK')"
```

For available torch_cluster versions, see: https://data.pyg.org/whl/

### 7. Install Nextflow (Optional)

For workflow automation:

@@ -154,7 +197,7 @@ chmod +x nextflow
sudo mv nextflow /usr/local/bin/
```

### 7. Install Docker (Optional)
### 8. Install Docker (Optional)

For containerized execution:
- **macOS/Windows**: Download Docker Desktop from https://www.docker.com/products/docker-desktop
@@ -198,59 +241,27 @@ uv run main.py run-dssp \

Convert ColabFold score JSON + PDB to AFDB ingest JSONs (pLDDT/PAE) and optional UniProt-style manifests.

#### Included example data

Sample ColabFold outputs are bundled under `examples/colabfold-output/` as zipped result folders:

- `ACATN_HUMAN_19de7.result.zip`
- `C76C2_ARATH_6db51.result.zip`
- `CDK9_CAEEL_5ca86.result.zip`

Each archive contains the files produced by ColabFold (scores JSON, PAE JSON, per-model unrelaxed PDBs, `config.json`, run markers, etc.). Unpack one to inspect or test the converter:

```bash
unzip examples/colabfold-output/ACATN_HUMAN_19de7.result.zip -d /tmp/colabfold
ls /tmp/colabfold/ACATN_HUMAN_19de7
```

#### Walkthrough using the bundled data

Prerequisites for `afdb-colabfold-convert`:

- Python deps: `orjson`, `duckdb` (install via `uv pip install -r requirements.txt` if you haven't already)
- (Optional) AFDB chain manifest CSV with columns `model_entity_id,entity_id,chain_id,uniprot_ac` (see merge instructions below)
- (Optional) DuckDB generated from UniProt flat files (built once per UniProt release with `afdb-uniprot-extract`/`afdb-uniprot-build-db`)
Requirements: `orjson`, `duckdb`, a chain manifest (`model_entity_id,entity_id,chain_id,uniprot_ac` at minimum), and a DuckDB built from the UniProt subset.

The converter can emit pLDDT/PAE JSONs using just the ColabFold score JSON + PDB. Provide the manifest + DuckDB when you also need UniProt-aware chain metadata or want to write chain/model manifest CSVs.

Run the converter on a sample ColabFold model by pointing to the unpacked score JSON and a PDB of your choice:
Example (per model, safer for many parallel jobs):

```
afdb-colabfold-convert \
/tmp/colabfold/ACATN_HUMAN_19de7/ACATN_HUMAN_19de7_scores_rank_001_alphafold2_ptm_model_1_seed_000.json \
/tmp/colabfold/ACATN_HUMAN_19de7/ACATN_HUMAN_19de7_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb \
/path/to/<AC>_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json \
/path/to/<AC>_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb \
--manifest /mnt/disks/data/sample/config/uniprot_afid_mapping.csv \
--duckdb /mnt/disks/data/sample/db/uniprot_2025_04.duckdb \
--model-entity-id AF-0000000000001201 \
--outdir /mnt/disks/data/sample/colabfold_output/ACATN_HUMAN_19de7-model_v4 \
--outdir /mnt/disks/data/sample/colabfold_output/<AC>-model_v4 \
--chain-manifest-dir /mnt/disks/data/sample/per_accession/manifests/chains \
--model-manifest-dir /mnt/disks/data/sample/per_accession/manifests/models
```

Drop the `--manifest`, `--duckdb`, `--chain-manifest-dir`, and `--model-manifest-dir` flags if you only need the AFDB JSON outputs; they are optional extras for UniProt-aware metadata.

**What the manifest directories do:**

- `--chain-manifest-dir`: writes `<model_entity_id>_afid_mapping.csv` per run containing chain-level averages/fractions (pLDDT bins, residue ranges) sourced from the manifest/DuckDB. These files mirror the schema expected by `uniprot_afid_mapping.csv` and live in a staging area until you merge them.
- `--model-manifest-dir`: writes `<model_entity_id>_model_metadata.csv` per run with model-level averages (pLDDT only). These append into the global `uniprot_model_metadata.csv` referenced by other tooling.

Use these directories when you want each ColabFold conversion to emit the per-model snippets that eventually roll up into the UniProt manifests; merge them later using the commands below once you finish processing a batch.

Outputs from the walkthrough:

- AFDB JSONs in `--outdir`: `<model_entity_id>-confidence_v1.json` (pLDDT) and `<model_entity_id>-predicted_aligned_error_v1.json` (PAE)
- Per-model manifest CSVs (created inside the respective `--chain-manifest-dir` / `--model-manifest-dir` paths) for aggregating pLDDT summaries
- Optional UniProt-style manifests can be merged across models as described next
Outputs:
- AFDB JSONs: `<model_entity_id>-confidence_v1.json` and `<model_entity_id>-predicted_aligned_error_v1.json` in `--outdir`.
- Per-model manifests:
  - Chains: `<model_entity_id>_afid_mapping.csv` with pLDDT averages/fractions and local 1..N residue ranges.
  - Models: `<model_entity_id>_model_metadata.csv` with average pLDDT and ipTM (if present in scores JSON).
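
The "pLDDT averages/fractions" in the chain manifest can be pictured with a small sketch. It assumes the usual AFDB confidence bands (below 50, 50 to 70, 70 to 90, 90 and above). The key names (`avg_plddt`, `frac_*`) and the treatment of values that sit exactly on a boundary are illustrative assumptions, not the converter's actual column names or rules.

```python
from statistics import fmean

# Confidence bands as (label, lower bound inclusive, upper bound exclusive).
# Labels and boundary handling are assumptions made for this sketch.
BANDS = [
    ("very_low", 0.0, 50.0),
    ("low", 50.0, 70.0),
    ("confident", 70.0, 90.0),
    ("very_high", 90.0, 100.0001),  # small margin so 100.0 is included
]


def summarize_plddt(plddt):
    """Return the mean pLDDT and the fraction of residues in each band."""
    if not plddt:
        raise ValueError("empty pLDDT list")
    summary = {"avg_plddt": round(fmean(plddt), 2)}
    for label, lower, upper in BANDS:
        count = sum(1 for value in plddt if lower <= value < upper)
        summary[f"frac_{label}"] = round(count / len(plddt), 3)
    return summary


# Per-residue scores for one hypothetical four-residue chain.
print(summarize_plddt([90.0, 80.0, 60.0, 40.0]))
```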

Merge per-model manifests when needed (keep the header, append rows):

@@ -340,7 +351,11 @@ uv run main.py run-cif2bcif -i <input_cif> -o <output_bcif>

### DSSP Secondary Structure Assignment

Assigns secondary structure annotations based on atomic coordinates.
Assigns 3-state secondary structure annotations (helix, strand, coil) based on atomic coordinates. Three algorithms are available:

- **pydssp** (default) — hydrogen-bond based assignment
- **psea** — geometry-based assignment using CA coordinates
- **tmalign** — CA-CA distance-based assignment

**Command:**
```bash
@@ -437,6 +452,120 @@ uv run main.py validate-sequences-file --file path/to/sequences.fasta

Each command exits with code `1` if it encounters validation errors, making them easy to embed in automated pipelines.
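
Because failures are signalled through the exit code, a wrapper script only needs to inspect the return code. The following is a minimal sketch of such a gate. The helper name `passes_validation` is hypothetical and not part of the toolkit.

```python
import subprocess
import sys


def passes_validation(cmd):
    """Run a validation command and return True only if it exits with code 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Forward the tool's own message so pipeline logs show what failed.
        print(result.stderr or result.stdout, file=sys.stderr)
    return result.returncode == 0
```

For example, `passes_validation(["uv", "run", "main.py", "validate-sequences-file", "--file", "path/to/sequences.fasta"])` would return `False` when the FASTA check reports errors, and the caller can stop the batch at that point.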

### Production Pipeline

The production pipeline (`scripts/production_pipeline.py`) provides a standalone alternative to the Nextflow workflow with comprehensive logging, caching, and resume capability. It processes models through 16 stages (executed in this order):

1. **Prepare assets** – symlink PDB + meta JSON to staging
2. **Validate assets** – check PDB/JSON consistency
3. **Convert ColabFold** – produce AFDB-format confidence & PAE JSONs
4. **Merge manifests** – merge per-model chain/model manifests
5. **Calculate ipSAE scores** – interface quality metrics (ipSAE, pDockQ, LIS)
6. **Analyze clashes/interfaces** – VDW clashes, interface residues
7. **Export model metadata** – generate per-model metadata JSONs (enriched with ipSAE/clash metrics)
8. **Export chain metadata** – generate per-chain metadata JSONs (enriched with ipSAE metrics)
9. **Combine model metadata** – batch into chunked JSONs
10. **Combine chain metadata** – batch into chunked JSONs
11. **Export ModelCIF input** – prepare ModelCIF metadata from template
12. **Generate ModelCIF** – PDB → mmCIF with full metadata and optional QA metrics
13. **DSSP** – secondary structure annotation (3-state: helix/strand/coil)
14. **Enrich PDB** – add AFDB headers to PDB files
15. **CIF → BCIF** – BinaryCIF conversion
16. **Cleanup** – optional intermediate file cleanup (skipped by default)

> **Note:** ipSAE and clash analysis (stages 5-6) run *before* metadata export (stages 7-8) so that quality metrics are available for JSON enrichment and CIF embedding.
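
To make the "interface residues" output of stage 6 concrete, here is a brute-force sketch of a distance-cutoff definition: a residue counts as interface if any of its atoms lies within the cutoff of any atom in the partner chain. This definition and the `(residue_id, (x, y, z))` record layout are assumptions for illustration. The pipeline itself relies on PyTorch and torch_cluster for this analysis, and its exact criteria may differ.

```python
from math import dist


def interface_residues(chain_a, chain_b, cutoff=8.0):
    """Return sorted residue ids of each chain within `cutoff` angstroms of the other.

    Each chain is a list of (residue_id, (x, y, z)) atom records.
    """
    near_a, near_b = set(), set()
    for res_a, xyz_a in chain_a:
        for res_b, xyz_b in chain_b:
            if dist(xyz_a, xyz_b) <= cutoff:
                near_a.add(res_a)
                near_b.add(res_b)
    return sorted(near_a), sorted(near_b)


# Two toy chains with one atom per residue.
chain_a = [(1, (0.0, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
chain_b = [(1, (5.0, 0.0, 0.0)), (2, (40.0, 0.0, 0.0))]
print(interface_residues(chain_a, chain_b))
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a toy input but is the reason real implementations use a neighbor search instead.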

**Prerequisites:** Install production dependencies first (see [Installation section 6](#6-install-production-pipeline-dependencies-optional)), or use the `environment.yml`:

```bash
conda env create -f environment.yml
conda activate afdb-toolkit
```

#### Homodimer mode (default)

All config files are provided up front — no API calls, no manifest resolution:

```bash
python scripts/production_pipeline.py \
--output-dir /path/to/output \
--input-dir /path/to/input \
--mapping-file /path/to/mapping.tsv \
--chain-mapping /path/to/manifest.csv \
--dataset-config /path/to/config.json \
--provider-json /path/to/provider.json \
--uniprot-db /path/to/uniprot.duckdb \
--workers 30 \
--cif-qa-metrics auto
```

#### Heterodimer mode

Enable with `--heterodimers`. Requires `--chain-mapping` and `--uniprot-db`. Config files (mapping TSV, dataset config, provider JSON) are auto-generated if not provided. Model IDs are derived from the chain mapping CSV.

```bash
python scripts/production_pipeline.py \
--output-dir /path/to/output \
--input-dir /path/to/raw_colabfold \
--heterodimers \
--chain-mapping /path/to/manifest.csv \
--uniprot-db /path/to/uniprot.duckdb \
--workers 4 \
--cif-qa-metrics auto
```

The `--input-dir` may contain raw ColabFold outputs (long suffixes like `_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb` are detected automatically).
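
As an illustration of what detecting these long suffixes can involve, the sketch below parses ColabFold-style file names with a regular expression and pairs each PDB with its scores JSON. The pattern is inferred from the file names shown in this README and is an assumption, not the pipeline's actual matcher. The function names are hypothetical.

```python
import re

# Pattern inferred from names such as
# <AC>_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
COLABFOLD_NAME = re.compile(
    r"^(?P<stem>.+?)_(?P<kind>scores|unrelaxed|relaxed)"
    r"_rank_(?P<rank>\d+)_(?P<preset>alphafold2\w*?)"
    r"_model_(?P<model>\d+)_seed_(?P<seed>\d+)\.(?P<ext>json|pdb)$"
)


def parse_colabfold_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = COLABFOLD_NAME.match(filename)
    return match.groupdict() if match else None


def pair_models(filenames):
    """Group files by model identity and keep only complete PDB + scores pairs."""
    groups = {}
    for name in filenames:
        info = parse_colabfold_name(name)
        if info is None:
            continue  # e.g. config.json or run markers
        key = (info["stem"], info["rank"], info["preset"], info["model"], info["seed"])
        slot = "scores" if info["kind"] == "scores" else "pdb"
        groups.setdefault(key, {})[slot] = name
    return {k: v for k, v in groups.items() if {"pdb", "scores"} <= v.keys()}
```

Note that in this sketch a relaxed and an unrelaxed PDB for the same model share one slot, so whichever is seen last wins. A real implementation would need an explicit preference.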

#### Key options

| Flag | Description |
|------|-------------|
| `--resume` | Resume from previous run (skip completed stages) |
| `--skip-stages stage_12,stage_13` | Skip specific stages (comma-separated) |
| `--dry-run` | Show what would be executed without running |
| `--dssp-algorithm` | Secondary structure algorithm: `psea`, `pydssp` (default), or `tmalign` |
| `--workers N` | Parallel workers (default: all CPUs) |
| `--pae-cutoff` / `--dist-cutoff` | ipSAE thresholds (default: 10.0 / 15.0) |
| `--clash-cutoff` / `--interface-cutoff` | Clash/interface thresholds (default: 0.4 / 8.0 Å) |
| `--analysis-batch-size N` | Batch size for clash/interface GPU analysis (default: 4) |
| `--cif-qa-metrics` | QA metrics to embed in mmCIF: `auto` (default, all metrics) or comma-separated list (e.g. `ipsae_AB,iptm_af,N_clash_backbone`) |
| `--enrichment-metrics` | ipSAE/clash metric names to include in model/chain metadata JSONs (default: all known metrics) |
| `--interface-clash-analysis` | Which analyses to run: `interface`, `backbone_clashes`, `heavy_atom_clashes` (default: all three) |
| `--modelcif-template` | Path to ModelCIF metadata template JSON (default: `uniprot/templates/modelcif_metadata.json`) |

**Output:** Results are written to the output directory with logs in `logs/`, cache in `.pipeline_cache.json`, and a results summary in `pipeline_results.json`.

Run `python scripts/production_pipeline.py --help` for full documentation.
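
The interaction between `--skip-stages` and `--resume` can be sketched as follows. This is a conceptual illustration only: the cache layout (a JSON object with a `completed` list) and the function names are assumptions, and the real format of `.pipeline_cache.json` may differ.

```python
import json
import tempfile
from pathlib import Path


def parse_skip_stages(value):
    """Turn a string like 'stage_12,stage_13' into a set of stage names."""
    return {part.strip() for part in value.split(",") if part.strip()}


def run_stages(stages, cache_path, skip=frozenset(), resume=False):
    """Run (name, callable) stages in order and record completions in a JSON cache."""
    cache_path = Path(cache_path)
    done = set()
    if resume and cache_path.exists():
        done = set(json.loads(cache_path.read_text())["completed"])
    executed = []
    for name, func in stages:
        if name in skip or name in done:
            continue
        func()
        done.add(name)
        executed.append(name)
        # Persist after every stage so an interrupted run can pick up from here.
        cache_path.write_text(json.dumps({"completed": sorted(done)}))
    return executed


# Demo: the first run skips stage_2, the resumed run executes only what is missing.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache.json"
    stages = [(f"stage_{i}", lambda: None) for i in (1, 2, 3)]
    first = run_stages(stages, cache, skip=parse_skip_stages("stage_2"))
    second = run_stages(stages, cache, resume=True)
    print(first, second)
```

Writing the cache after each stage rather than once at the end is what makes resuming after a crash possible in this sketch.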

### Prepare Inputs (Standalone)

`scripts/prepare_inputs.py` can also be used independently (outside the production pipeline) to prepare ColabFold outputs into the canonical layout the pipeline expects. It scans for matched PDB + scores-JSON pairs, builds config files, and symlinks inputs.

**Production mode** (pre-built assets, no network):

```bash
python scripts/prepare_inputs.py \
--input-dir /data/colabfold/gpu0 \
--output-dir /data/workdir \
--chain-mapping /data/prebuilt_manifest.csv \
--uniprot-db /data/uniprot.duckdb \
--provider-id afcdb-heterodimers \
--provider-name "AFCDB Heterodimers"
```

**Dev mode** (resolves AF-IDs from the AFCDB manifest + fetches from UniProt API):

```bash
python scripts/prepare_inputs.py \
--input-dir ./gpu0 \
--output-dir ./workdir \
--build-from-api /data/afdb_toolkit_manifest_file.csv \
--provider-id afcdb-heterodimers \
--provider-name "AFCDB Heterodimers"
```

By default, scores files are **symlinked** as meta JSONs (zero I/O). Pass `--extract-meta` to parse and re-write leaner JSONs, or `--copy` to copy instead of symlink.
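
The trade-off between symlinking and copying can be illustrated with a short sketch. It is not the script's implementation, and `stage_input` is a hypothetical helper: a symlink only creates a directory entry that points at the source, while a copy duplicates the file contents.

```python
import shutil
import tempfile
from pathlib import Path


def stage_input(src, dst, copy=False):
    """Place `src` at `dst` as a symlink (default) or as a real copy."""
    src, dst = Path(src).resolve(), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.is_symlink() or dst.exists():
        dst.unlink()  # replace any earlier link or file so reruns are idempotent
    if copy:
        shutil.copy2(src, dst)
    else:
        dst.symlink_to(src)
    return dst


# Demo in a throwaway directory; the file names are made up for this example.
with tempfile.TemporaryDirectory() as tmp:
    scores = Path(tmp) / "raw" / "model_scores.json"
    scores.parent.mkdir()
    scores.write_text('{"plddt": [90.0]}')
    linked = stage_input(scores, Path(tmp) / "staged" / "model_meta.json")
    copied = stage_input(scores, Path(tmp) / "staged" / "model_meta_copy.json", copy=True)
    link_is_symlink = linked.is_symlink()
    copy_is_symlink = copied.is_symlink()
    same_content = linked.read_text() == copied.read_text()
```

Symlinks break if the source directory is moved or deleted, which is one reason to prefer copying when the staged directory has to outlive the raw ColabFold outputs.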

## Docker Usage

### Use Prebuilt Docker Image (Recommended)
@@ -619,8 +748,6 @@ AF-0001234567890125
AF-0001234567890126
```

The ModelPDB step also requires provider metadata. By default the workflow reads this from `input/provider.json`; override it with `--provider_json <path>` if your provider file is elsewhere.

**Example input.txt:**
```bash
# Create the input list file