Functional ANnoTAtion based on embedding space SImilArity
FANTASIA is an advanced pipeline for the automatic functional annotation of protein sequences using state-of-the-art protein language models. It integrates deep learning embeddings and in-memory similarity searches, retrieving reference vectors from a PostgreSQL database with pgvector-backed storage, to associate Gene Ontology (GO) terms with proteins.
Note
For the full FANTASIA workflow, input proteomes should be provided as decompressed FASTA files.
If your source file is gzip-compressed, such as .fa.gz or .fasta.gz, decompress it before
running embedding or full-pipeline jobs.
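A gzip-compressed proteome can be unpacked with the Python standard library before launching a job. The sketch below is only an illustration; the helper name `decompress_fasta` is hypothetical and is not part of FANTASIA.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path):
    """Write a decompressed copy of a .gz FASTA file next to the original.

    Hypothetical helper for illustration; not part of the FANTASIA code base.
    """
    src = Path(gz_path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # proteome.fasta.gz -> proteome.fasta
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream in chunks so large proteomes are never held fully in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

The returned path can then be used as the FASTA input in the configuration.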
Important
Two intended usage modes
1. Annotation mode
   - Use k = 1
   - Do not apply self-exclusion-style identity filtering
   - Goal: maximize annotation coverage for unknown genomes or proteomes not present in the reference set
2. Benchmark / leakage-control mode
   - Use k > 1
   - Apply explicit sequence-identity exclusion against the query
   - Keep the best remaining donor by highest reliability_index after filtering
   - Goal: reduce near-self or near-orthologue leakage during benchmarking
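The difference between the two modes can be sketched as a small post-filtering step over the retrieved hits. This is a minimal illustration, not FANTASIA's implementation: the function name `pick_donor`, the `max_identity` cutoff, and the toy hit records are hypothetical, and the identity values are assumed to be on the same scale as the cutoff.

```python
def pick_donor(hits, max_identity=None):
    """Return the hit with the highest reliability_index, or None.

    hits: list of dicts with 'identity' and 'reliability_index' keys.
    max_identity: hypothetical cutoff; hits at or above it are dropped.
                  None means no identity filtering (annotation mode).
    """
    if max_identity is not None:
        hits = [h for h in hits if h["identity"] < max_identity]
    if not hits:
        return None
    return max(hits, key=lambda h: h["reliability_index"])


# Toy hits for one query, ordered from nearest to farthest (made-up values).
hits = [
    {"protein_id": "refA", "identity": 98.0, "reliability_index": 0.95},
    {"protein_id": "refB", "identity": 41.0, "reliability_index": 0.81},
    {"protein_id": "refC", "identity": 35.0, "reliability_index": 0.77},
]

annotation_donor = pick_donor(hits[:1])                  # k = 1, no identity filter
benchmark_donor = pick_donor(hits, max_identity=90.0)    # k > 1, near-self hit removed
```

In this toy case annotation mode keeps `refA`, while benchmark mode discards it as a near-self match and falls back to `refB`.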
For full documentation, visit FANTASIA Documentation.
For users who need a lightweight, standalone alternative, FANTASIA-Lite provides fast Gene Ontology annotation directly from local FASTA files, without requiring a database server or the full FANTASIA infrastructure. It leverages protein language model embeddings and nearest-neighbor similarity in embedding space to deliver high-quality functional annotations with minimal setup.
For FANTASIA-Lite, visit https://github.com/CBBIO/FANTASIA-Lite
Two packaged reference datasets are available; select one depending on your analysis needs:
- Main Reference (last layer, default)
  Embeddings extracted only from the final hidden layer of each PLM.
  Recommended for most annotation tasks (smaller, faster to load).
  Record: https://zenodo.org/records/17795871
- Multilayer Reference (early layers + final layers)
  Embeddings extracted from multiple hidden layers (including intermediate and final).
  Suitable for comparative and exploratory analyses requiring layer-wise representations.
  Record: https://zenodo.org/records/17793273
Available Embedding Models
Supports protein language models: ESM-2, ProtT5, ProstT5, Ankh3-Large, and ESM3c for sequence representation.
- Redundancy Filtering
  Provides optional MMseqs2-based query-aware redundancy masking during lookup. This masks donor sequences that MMseqs2 assigns to the same cluster as the query, but it does not guarantee removal of all clearly similar or high-identity donors. For benchmark or leakage-control workflows, retrieving more neighbors and applying explicit sequence-identity filtering is the more reliable approach.
- Optimized Data Storage
  Embeddings are stored in HDF5 format for input sequences. The reference table, however, is hosted in a public relational PostgreSQL database using pgvector.
- Efficient Similarity Lookup
  High-throughput similarity search with a hybrid approach: reference embeddings are stored in a PostgreSQL + pgvector database, then loaded per model/layer into memory so similarities can be computed efficiently in the application with vectorized GPU or CPU operations. In the repository default configuration, lookup runs on GPU (lookup.use_gpu: true). CPU lookup is available by setting lookup.use_gpu: false.
- Sequential Embedding + Lookup
  FANTASIA first computes query embeddings and stores them in embeddings.h5, then runs the lookup stage. These stages execute sequentially within a run, so embedding and lookup do not compete for GPU resources unless multiple FANTASIA jobs are launched at the same time.
- Global & Local Alignment of Hits
  Candidate hits from the reference table are aligned both globally and locally against the input protein for validation and scoring.
- Multi-layer Embedding Support
  Optional support for intermediate + final layers to enable layer-wise analyses and improved exploration. Layer indices are specified per model in the YAML config under embedding.models.<Model>.layer_index. Indexing is relative to the output end of the network: 0 = final/output layer, 1 = penultimate layer, 2 = third-to-last layer, and so on.
- Raw Outputs & Flexible Post-processing
  Exposes raw result tables for custom analyses and includes a flexible post-processing & scoring system that produces TopGO-ready files.
  Performs high-speed searches using in-memory computations. Reference vectors are retrieved from a PostgreSQL database with pgvector-backed storage for comparison.
- Functional Annotation by Similarity
  Assigns Gene Ontology (GO) terms to proteins based on embedding space similarity, using pre-trained embeddings from all supported models.
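The layer_index convention described under Multi-layer Embedding Support counts backwards from the output end of the network. As a rough sketch, assuming the per-layer outputs are available as a sequence ordered from input to output (an assumption about the model wrapper, not something FANTASIA documents here), the mapping could look like this; `select_layer` is a hypothetical name.

```python
def select_layer(hidden_states, layer_index):
    """Pick a layer counting back from the output end.

    layer_index 0 is the final layer, 1 the penultimate layer, and so on.
    hidden_states is assumed to be ordered from input to output.
    """
    if not 0 <= layer_index < len(hidden_states):
        raise IndexError(f"layer_index {layer_index} out of range")
    return hidden_states[-1 - layer_index]


# Placeholder labels stand in for real per-layer tensors.
layers = ["embedding", "block_1", "block_2", "block_3"]
final_layer = select_layer(layers, 0)
penultimate_layer = select_layer(layers, 1)
```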
- Embedding Generation
  Computes protein embeddings using deep learning models (ProtT5, ProstT5, ESM-2, Ankh3-Large, and ESM3c).
- GO Term Lookup
  Performs vector similarity searches using in-memory computations to assign Gene Ontology terms. Reference embeddings are retrieved from a PostgreSQL database with pgvector-backed storage and loaded per model/layer into memory. In the default configuration, this stage runs on GPU (lookup.use_gpu: true). Only experimental evidence codes are used for transfer.
The repository default is GPU execution for both embedding (embedding.device: cuda) and lookup (lookup.use_gpu: true). CPU remains available as an explicit fallback by setting embedding.device: cpu and lookup.use_gpu: false. In the current pipeline, embeddings are generated first and lookup runs afterward, so Stage A (embedding) and Stage B (lookup) do not overlap within the same run.
When processing multiple proteomes on a single GPU-equipped machine, a sequential launcher script is recommended. Running one proteome at a time preserves the same non-overlapping execution model used within a single FANTASIA run and avoids GPU contention between concurrent jobs. This is often the simplest and most reliable strategy for small-to-medium batches of proteomes.
If you plan to annotate with several embedding models, it is usually better to run one model at a time rather than enabling all models in a single launch. This keeps GPU memory use predictable, makes failures easier to isolate, and avoids contention between large model loads on remote or shared machines.
Example:
./scripts/run_sequential_proteomes.sh config/prott5_full.yaml /path/to/proteomes /path/to/experiments prott5
The GPU memory required by the lookup stage depends mainly on:
- the size of the reference embedding matrix
- the lookup query batch size
- the embedding dimensionality
- temporary tensors created during cosine or euclidean distance computation
Because FANTASIA runs embeddings first and lookup afterward, GPU lookup memory requirements do not depend on the embedding step being active within the same run.
For a typical single-model Prot-T5 layer-0 lookup on a proteome, the reference matrix may be on the order of 123,977 x 1024, with lookup batches such as 516 x 1024 using float32 tensors. In practice, this fits comfortably on a 24 GB GPU and is generally expected to fit on a 16 GB GPU as well. Actual memory requirements still depend on the selected reference dataset, enabled layers/models, and lookup batch size.
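The sizes quoted above can be turned into a back-of-the-envelope estimate. The arithmetic below assumes float32 values (4 bytes each) and one full batch-by-reference distance matrix; it ignores any further temporaries, such as normalized copies, that the real lookup may allocate.

```python
BYTES_PER_FLOAT32 = 4


def tensor_mib(rows, cols, bytes_per_value=BYTES_PER_FLOAT32):
    """Size of a dense rows x cols tensor in MiB."""
    return rows * cols * bytes_per_value / 2**20


reference = tensor_mib(123_977, 1024)   # reference embedding matrix
batch = tensor_mib(516, 1024)           # one lookup query batch
distances = tensor_mib(516, 123_977)    # one batch-by-reference distance matrix

print(f"reference matrix: {reference:.1f} MiB")
print(f"query batch:      {batch:.1f} MiB")
print(f"distance matrix:  {distances:.1f} MiB")
```

This comes to roughly 484 MiB for the reference matrix, 2 MiB for a query batch, and 244 MiB for the distance matrix, which is consistent with the statement that a single-model lookup fits well within a 16 GB or 24 GB GPU.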
The table below summarizes a lookup-only benchmark on a single proteome using the same precomputed Prot-T5 embeddings and the same reference table. Only the lookup execution device was changed.
Benchmark hardware for the GPU run:
- GPU: NVIDIA GeForce RTX 3090 Ti
- VRAM: 24 GB
- CUDA available in the runtime environment: True
- PyTorch build used for the benchmark: 2.11.0+cu130
| Proteome | Input proteins | Mean protein length (aa) | Max protein length (aa) | Embedded proteins | Lookup tasks | Lookup device | Distance time (total) | Distance time / batch | Lookup wall time |
|---|---|---|---|---|---|---|---|---|---|
| A proteome (Prot-T5, layer 0) | 20,223 | 392.25 | 8,215 | 20,223 | 20,223 | CPU | 1,835.89 s | 45.90 s | 1,933.08 s |
| A proteome (Prot-T5, layer 0) | 20,223 | 392.25 | 8,215 | 20,223 | 20,223 | GPU | 17.05 s | 0.43 s | 126.95 s |
Observed speedup in this benchmark:
- Distance kernel: about 108x faster on GPU (1835.89 s → 17.05 s)
- Lookup wall time: about 15x faster on GPU (1933.08 s → 126.95 s)
In this benchmark, no proteins were discarded before embedding: the input FASTA contained 20,223 proteins and the generated embeddings.h5 also contained 20,223 embedded accessions.
Long proteins are not removed either. FANTASIA only truncates query sequences before embedding when embedding.max_sequence_length is set to a positive value. The repository default is 0 (no truncation).
FANTASIA writes lookup results in three main forms:
- Per-accession raw CSV files under raw_results/{model}/layer_{k}/
- A global summary.csv produced during post-processing
- TopGO-ready files under topgo/
If you need to consolidate many per-accession raw CSV files into a single table for downstream analysis, use the merge utility.
Example:
python scripts/merge_raw_results.py \
/path/to/experiment/raw_results/prot-t5/layer_0 \
-o /path/to/experiment/raw_results/prot-t5/layer_0_merged.csv \
--add-source-file
The raw CSVs are the most detailed output. Each row represents one transferred GO annotation associated with one retrieved reference hit for one query protein.
Typical columns include:
- accession: query protein accession
- go_id: transferred GO term
- go_description: GO term name
- category: GO namespace, typically BP, MF, or CC
- distance: embedding-space distance between the query and the selected reference hit
- reliability_index: similarity-derived score computed from distance
- model_name: embedding model used for the lookup
- layer_index: model layer used for the lookup
- protein_id, organism, gene_name: metadata from the matched reference protein
- evidence_code: evidence code associated with the transferred annotation
- query_len, ref_len: query and reference sequence lengths
If sequence-aware storage is enabled, the raw CSVs can also include alignment-derived metrics:
- identity, similarity, alignment_score, gaps_percentage: global alignment metrics
- identity_sw, similarity_sw, alignment_score_sw, gaps_percentage_sw: local Smith-Waterman-style alignment metrics
- alignment_length, alignment_length_sw: aligned lengths for the global and local alignments
distance is the nearest-neighbor distance in embedding space, so lower values indicate a closer reference match.
reliability_index is derived from distance so that higher values indicate stronger support:
- cosine lookup: reliability_index = 1 - distance
- euclidean lookup: reliability_index = 0.5 / (0.5 + distance)
- other metrics: reliability_index = 1 / (1 + distance)
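The three formulas above translate directly into a small function. This is a sketch of the stated mapping only; the function signature and the metric strings `"cosine"` and `"euclidean"` are assumptions chosen for illustration.

```python
def reliability_index(distance, metric):
    """Map an embedding-space distance to a higher-is-better score."""
    if metric == "cosine":
        return 1.0 - distance
    if metric == "euclidean":
        return 0.5 / (0.5 + distance)
    # Fallback used for any other distance metric.
    return 1.0 / (1.0 + distance)


cosine_ri = reliability_index(0.2, "cosine")
euclidean_ri = reliability_index(0.5, "euclidean")
```

In all three cases a distance of 0 yields a reliability_index of 1, and the score decreases as the distance grows.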
In practice:
- lower distance is better
- higher reliability_index is better
- reliability_index is the easiest column to rank by in the raw files
When alignment metrics are present:
- identity and related columns summarize the global end-to-end alignment
- identity_sw and related columns summarize the best local alignment segment
This is useful because some hits may share only a conserved local region. A protein can therefore have:
- moderate global identity but high local identity
- strong embedding similarity together with weak sequence alignment, or the reverse
These fields are best interpreted as complementary evidence rather than strict pass/fail filters.
summary.csv is the post-processed accession-by-GO summary table. It should be interpreted as the output of a heuristic ranking procedure, not as a table of probabilities. In particular, final_score is not a probability score and should not be read as a calibrated confidence value. The table aggregates all raw rows belonging to the same (accession, go_id, model_name, layer_index) combination and computes configured statistics such as min, max, and mean.
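The grouping step described above can be sketched with the standard library. This is an illustration of the aggregation idea, not FANTASIA's post-processing code: the function name `summarize` and the output layout are hypothetical, and only a single metric column is aggregated.

```python
from collections import defaultdict
from statistics import mean

GROUP_KEYS = ("accession", "go_id", "model_name", "layer_index")


def summarize(rows, metric="reliability_index"):
    """Aggregate one metric over raw rows sharing the same group key."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in GROUP_KEYS)
        groups[key].append(float(row[metric]))
    return {
        key: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
        for key, vals in groups.items()
    }


# Made-up raw rows: two hits support the same GO term for the same query.
raw_rows = [
    {"accession": "Q1", "go_id": "GO:0005524", "model_name": "ProtT5",
     "layer_index": 0, "reliability_index": "0.9"},
    {"accession": "Q1", "go_id": "GO:0005524", "model_name": "ProtT5",
     "layer_index": 0, "reliability_index": "0.7"},
]
summary = summarize(raw_rows)
```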
When sequence-aware exports are enabled, FANTASIA can also write auxiliary files such as sequences.fasta and query_index_mapping.csv to help relate internal Q* identifiers back to parsed query accessions. These mapping aids are provided as optional conveniences for downstream inspection. Their use is left to the user, since FANTASIA's primary goal is to provide a solid and flexible annotation framework rather than to impose a single interpretation or accuracy-estimation workflow.
By default, the repository configuration summarizes:
- reliability_index
- identity
- identity_sw
- support count normalized by limit_per_entry
The default aliases are:
- ri for reliability_index
- id_g for global identity
- id_l for local identity
In the current code, the support count metric is derived from the number of raw rows supporting the same (accession, go_id, model_name, layer_index) group, normalized by limit_per_entry. This means count acts as a support-strength signal rather than a probability: GO terms supported repeatedly across raw hits receive a larger value.
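As a rough sketch of this support signal, the count can be computed per group and divided by limit_per_entry. Reading "normalized by" as a plain division is an assumption, and the function name `support_count` is hypothetical.

```python
from collections import Counter


def support_count(rows, limit_per_entry):
    """Number of raw rows per group, divided by limit_per_entry."""
    counts = Counter(
        (r["accession"], r["go_id"], r["model_name"], r["layer_index"])
        for r in rows
    )
    return {key: n / limit_per_entry for key, n in counts.items()}


# Made-up rows: GO:0005524 is supported by two hits, GO:0016301 by one.
rows = [
    {"accession": "Q1", "go_id": "GO:0005524", "model_name": "ProtT5", "layer_index": 0},
    {"accession": "Q1", "go_id": "GO:0005524", "model_name": "ProtT5", "layer_index": 0},
    {"accession": "Q1", "go_id": "GO:0016301", "model_name": "ProtT5", "layer_index": 0},
]
support = support_count(rows, limit_per_entry=5)
```

Under this reading, a GO term supported by more of the retrieved hits gets a proportionally larger value.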
So columns such as max_ri_ProtT5_L0, mean_id_g_ProtT5_L0, or max_id_l_ProtT5_L0 in summary.csv represent aggregated per-model, per-layer evidence for the same accession and GO term.
If weights are configured, FANTASIA also writes:
- weighted columns prefixed by w_
- a composite final_score
final_score is a configuration-driven heuristic ranking score, not a universal probability or calibrated confidence value. Its objective is to combine several evidence signals into one sortable value so candidate GO terms can be prioritized within the same run and configuration.
In the repository default configuration, final_score is built from a weighted combination of:
- the best embedding-derived support (max_ri)
- the best global alignment identity (max_id_g)
- the best local alignment identity (max_id_l)
- the support count
This makes final_score useful for ranking candidate GO terms, filtering outputs, and downstream prioritization, but its numerical value should not be interpreted as a probability of correctness. Changing the configured metrics or weights changes the meaning of the score.
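To make the idea of a weighted combination concrete, here is a minimal sketch assuming a plain weighted sum over metrics already scaled to the 0 to 1 range. The weights and values below are invented for illustration; the actual defaults live in the repository configuration, and the real code may scale or normalize the inputs differently.

```python
def final_score(metrics, weights):
    """Weighted sum of the configured metrics (heuristic, not a probability)."""
    return sum(weight * metrics[name] for name, weight in weights.items())


# Hypothetical weights and per-GO-term evidence values.
weights = {"max_ri": 0.5, "max_id_g": 0.2, "max_id_l": 0.2, "count": 0.1}
metrics = {"max_ri": 0.8, "max_id_g": 0.5, "max_id_l": 0.6, "count": 0.4}

score = final_score(metrics, weights)
```

Because the result depends entirely on the chosen weights, scores are only comparable between GO terms produced with the same configuration.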
If lookup.topgo: true, FANTASIA also exports TopGO-compatible files under topgo/.
- Per-model/layer exports keep rows separated by model, layer, and GO category
- Ensemble exports keep the best reliability_index per (accession, go_id, category) across all models and layers
These files contain three columns in tab-separated form:
- accession
- GO term
- reliability index
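The ensemble rule and the three-column layout can be sketched as follows. This is an illustration only: the function names are hypothetical, and details such as row ordering and number formatting in FANTASIA's own exports may differ.

```python
import csv
import io


def ensemble_rows(rows):
    """Keep the best reliability_index per (accession, go_id, category)."""
    best = {}
    for r in rows:
        key = (r["accession"], r["go_id"], r["category"])
        ri = float(r["reliability_index"])
        if key not in best or ri > best[key]:
            best[key] = ri
    # Output columns: accession, GO term, reliability index.
    return [(acc, go, ri) for (acc, go, _category), ri in sorted(best.items())]


def write_topgo(rows, handle):
    """Write tab-separated rows to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Made-up rows: the same GO term is found by two different models.
rows = [
    {"accession": "Q1", "go_id": "GO:0005524", "category": "MF", "reliability_index": 0.7},
    {"accession": "Q1", "go_id": "GO:0005524", "category": "MF", "reliability_index": 0.9},
]
buffer = io.StringIO()
write_topgo(ensemble_rows(rows), buffer)
```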
FANTASIA requires two key services:
- PostgreSQL 16 with pgvector: Stores reference protein embeddings used by the lookup stage
- RabbitMQ: Message broker for distributed embedding task processing
Other prerequisites:
- Python 3.12 (the project metadata specifies >=3.12,<3.13). A Conda environment based on Python 3.12 is a suitable local setup option.
- Docker and Docker Compose installed
Additional dependency notes:
- MMseqs2 is required if you enable redundancy filtering during lookup. FANTASIA invokes the external mmseqs executable, so it must be installed separately and available in your PATH. In the current workflow, this feature should be interpreted as query-aware redundancy masking rather than a guaranteed exclusion of all clearly similar donors.
- Parasail is used for alignment-based post-processing through its Python package. When FANTASIA is installed through its declared Python dependencies, parasail is provided by the runtime environment and does not need to be invoked as a separate command-line tool.
- Taxonomy descendant expansion (taxonomy.get_descendants: true) is currently disabled. The original implementation relied on ete3.NCBITaxa and its local NCBI taxonomy database rather than on the FANTASIA PostgreSQL reference database, which introduced environment-dependent behavior. The current taxonomy filter therefore works on the exact taxonomy IDs you provide. For benchmark-style exclusions, manually list the relevant species, subspecies, or related taxa in taxonomy_ids_to_exclude and keep get_descendants: false.
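The practical consequence of exact-ID filtering is that descendant taxa are not excluded unless they are listed explicitly. The sketch below illustrates this with a hypothetical `taxonomy_id` field on each hit and a hypothetical helper name; it is not FANTASIA's filter code.

```python
def filter_by_taxonomy(hits, taxonomy_ids_to_exclude):
    """Drop hits whose taxonomy ID exactly matches one of the excluded IDs."""
    excluded = {str(t) for t in taxonomy_ids_to_exclude}
    return [h for h in hits if str(h["taxonomy_id"]) not in excluded]


# Made-up hits: 9606 is Homo sapiens, 63221 is a Homo sapiens subspecies.
hits = [
    {"protein_id": "refA", "taxonomy_id": 9606},
    {"protein_id": "refB", "taxonomy_id": 63221},
]

species_only = filter_by_taxonomy(hits, [9606])
species_and_subspecies = filter_by_taxonomy(hits, [9606, 63221])
```

Excluding only 9606 leaves the subspecies hit in place, so both IDs have to be listed to remove both donors.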
Execution modes:
- Default run: embedding + lookup
- only_lookup: true: skip embedding and use an existing embeddings.h5
- only_embedding: true: stop after generating embeddings.h5
- only_lookup: true and only_embedding: true cannot be used together
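The rules above amount to a small decision table. As a sketch, assuming the configuration has been loaded into a plain dictionary (the function name `resolve_stages` and the stage labels are hypothetical):

```python
def resolve_stages(config):
    """Return the stages to run for a given pair of mode flags."""
    only_lookup = bool(config.get("only_lookup", False))
    only_embedding = bool(config.get("only_embedding", False))
    if only_lookup and only_embedding:
        raise ValueError("only_lookup and only_embedding cannot both be true")
    if only_lookup:
        return ["lookup"]        # reuse an existing embeddings.h5
    if only_embedding:
        return ["embedding"]     # stop after writing embeddings.h5
    return ["embedding", "lookup"]


default_stages = resolve_stages({})
```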
To run Stage A only (embedding generation with no lookup search), set this in your config:
only_lookup: false
only_embedding: true
Deployment note
These updates do not change the overall deployment strategy for Docker, Slurm, or array-based cluster execution. The main changes are at the application level:
- explicit support for only_embedding: true
- clearer disabling of distance_threshold (for example false instead of legacy 0, while keeping backward compatibility)
- corrected and clarified taxonomy filtering behavior
- recommendation to use decompressed FASTA files for full embedding and full-pipeline runs
- optional generation of query_index_mapping.csv for sequence-aware outputs
- GPU-oriented defaults in the packaged config (embedding.device: cuda, lookup.use_gpu: true); CPU-only deployments should set embedding.device: cpu and lookup.use_gpu: false
Existing deployment wrappers should therefore remain structurally valid, but may require small updates if they assume the previous threshold convention, gzipped FASTA inputs, or older output expectations.
FANTASIA needs two local services while it runs:
- PostgreSQL + pgvector for the reference database
- RabbitMQ for the embedding queue
Start the services once, run as many FANTASIA jobs as needed, then stop them when you are done.
- Start services (from the FANTASIA directory):
  docker compose up -d # or: docker-compose up -d
- Verify services are running:
  docker compose ps # or: docker-compose ps
  Expected output:
  CONTAINER ID   IMAGE                          STATUS
  xxx            pgvector/pgvector:0.7.0-pg16   Up (healthy)
  xxx            rabbitmq:3.13-management       Up (healthy)
- Test database connection:
  PGPASSWORD=clave psql -h localhost -U usuario -d BioData -c "SELECT 1"
Some shared servers have Docker installed but neither docker compose nor
docker-compose. In that case, start the same services with plain docker run.
This example maps PostgreSQL to host port 5433 to avoid conflicts with an
existing system PostgreSQL on 5432.
docker run -d \
--name fantasia-postgres \
-e POSTGRES_USER=usuario \
-e POSTGRES_PASSWORD=clave \
-e POSTGRES_DB=BioData \
-p 5433:5432 \
-v fantasia_postgres_data:/var/lib/postgresql/data \
pgvector/pgvector:0.7.0-pg16
docker run -d \
--name fantasia-rabbitmq \
-e RABBITMQ_DEFAULT_USER=guest \
-e RABBITMQ_DEFAULT_PASS=guest \
-p 5672:5672 \
-p 15672:15672 \
-v fantasia_rabbitmq_data:/var/lib/rabbitmq \
rabbitmq:3.13-management-alpine
Check the services:
nc -z localhost 5433 && echo "Postgres OK"
nc -z localhost 5672 && echo "RabbitMQ OK"
When using this plain-Docker setup, run FANTASIA with the matching database port:
FANTASIA_DB_PORT=5433 ./tests/benchmark/run_benchmark_example.sh
For direct fantasia or python -m fantasia.main commands, pass the port as a CLI
override instead:
python -m fantasia.main initialize \
--config ./fantasia/config.yaml \
--DB_HOST localhost \
--DB_PORT 5433
After logging out and back in, reactivate your environment and restart existing containers:
cd /path/to/FANTASIA
source ~/anaconda3/etc/profile.d/conda.sh
conda activate fantasia-py312
docker start fantasia-postgres fantasia-rabbitmq
Then check the ports before launching a job:
nc -z localhost 5433 && echo "Postgres OK"
nc -z localhost 5672 && echo "RabbitMQ OK"
The docker-compose.yml is configured with the following default credentials (matching config.yaml):
| Service | Host | Port | User | Password | Database |
|---|---|---|---|---|---|
| PostgreSQL | localhost | 5432 | usuario | clave | BioData |
| RabbitMQ | localhost | 5672 | guest | guest | - |
BioData is the default local PostgreSQL database name used for the restored reference lookup table downloaded from Zenodo. It is a configurable database name, not a separate repository requirement.
RabbitMQ Management UI is available at: http://localhost:15672 (user: guest, password: guest)
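On machines where nc is not installed, the same reachability check can be done from Python with the standard library. This is a minimal sketch; the helper name `port_open` is hypothetical, and the ports are the defaults from the table above (use 5433 for PostgreSQL with the plain-Docker setup).

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in (("Postgres", 5432), ("RabbitMQ", 5672)):
        status = "OK" if port_open("localhost", port) else "not reachable"
        print(f"{name} ({port}): {status}")
```

A successful connection only shows that something is listening on the port; it does not verify credentials or that the reference database has been restored.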
Docker Compose is unavailable:
If both of these fail:
docker compose version
docker-compose --version
use the plain docker run commands described above.
Connection refused error:
# Check if containers are running
docker compose ps
# or: docker-compose ps
# If stopped, restart them
docker compose restart
# or: docker-compose restart
# View logs
docker compose logs postgres
docker compose logs rabbitmq
# or:
# docker-compose logs postgres
# docker-compose logs rabbitmq
For plain Docker, use:
docker ps -a --filter name=fantasia-postgres
docker ps -a --filter name=fantasia-rabbitmq
docker start fantasia-postgres fantasia-rabbitmq
docker logs fantasia-postgres
docker logs fantasia-rabbitmq
Wrong PostgreSQL port:
If host port 5432 is already occupied by a system PostgreSQL, the plain-Docker
example uses host port 5433. In that case, run FANTASIA with:
FANTASIA_DB_PORT=5433 ./tests/benchmark/run_benchmark_example.sh
For direct CLI calls, use --DB_PORT 5433; environment variables are only
interpreted by the helper scripts.
Password authentication failed:
Ensure the credentials in docker-compose.yml match those in config.yaml:
# Current values in docker-compose.yml
POSTGRES_USER: usuario
POSTGRES_PASSWORD: clave
POSTGRES_DB: BioData
Permission denied for schema public:
If startup fails while creating PIS support tables, grant schema permissions to the configured database user:
PGPASSWORD=clave psql -h localhost -p 5432 -U usuario -d BioData \
-c "GRANT USAGE, CREATE ON SCHEMA public TO usuario;"
With the local Docker service, you can also run the grant inside the container:
docker compose exec postgres psql -U usuario -d BioData \
-c "ALTER SCHEMA public OWNER TO usuario; GRANT USAGE, CREATE ON SCHEMA public TO usuario;"
# or:
docker-compose exec postgres psql -U usuario -d BioData \
-c "ALTER SCHEMA public OWNER TO usuario; GRANT USAGE, CREATE ON SCHEMA public TO usuario;"
If that user is not allowed to grant privileges, run the same grant with a
PostgreSQL admin user for the BioData database.
Cleaning up: To remove containers and volumes:
docker compose down -v
# or: docker-compose down -v
| Name | Model ID | Params | Architecture | Description |
|---|---|---|---|---|
| ESM-2 | facebook/esm2_t33_650M_UR50D | 650M | Encoder (33L) | Learns structure/function from UniRef50. No MSAs. Optimized for accuracy. |
| ProtT5 | Rostlab/prot_t5_xl_uniref50 | 1.2B | Encoder-Decoder | Trained on UniRef50. Strong transfer for structure/function tasks. |
| ProstT5 | Rostlab/ProstT5 | 1.2B | Multi-modal T5 | Learns 3Di structural states + function. Enhances contact/function tasks. |
| Ankh3-Large | ElnaggarLab/ankh3-large | 620M | Encoder (T5-style) | Fast inference. Good semantic/structural representation. |
| ESM3c | esmc_600m | 600M | Encoder (36L) | New gen. model trained on UniRef + MGnify + JGI. High precision & speed. |
FANTASIA is the result of a collaborative effort between Ana Rojas’ Lab (CBBIO) (Andalusian Center for Developmental Biology, CSIC) and Rosa Fernández’s Lab (Metazoa Phylogenomics Lab, Institute of Evolutionary Biology, CSIC-UPF). This project demonstrates the synergy between research teams with diverse expertise.
This version of FANTASIA builds upon previous work from:
- Metazoa Phylogenomics Lab's FANTASIA
  The original implementation of FANTASIA for functional annotation.
- bio_embeddings
  A state-of-the-art framework for generating protein sequence embeddings.
- GoPredSim
  A similarity-based approach for Gene Ontology annotation.
- MMseqs2
  Used for optional query-aware redundancy masking during lookup workflows.
- Parasail
  Provides high-performance pairwise sequence alignment routines used in hit validation and post-processing.
- protein-information-system
  Serves as the reference biological information system, providing a robust data model and curated datasets for protein structural and functional analysis.
We also extend our gratitude to LifeHUB-CSIC for inspiring this initiative and fostering innovation in computational biology.
If you use FANTASIA in your research, please cite the following publications:
- Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024).
  Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
  DOI: 10.1101/2024.02.28.582465
- Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024).
  Decoding proteome functional information in model organisms using protein language models.
  DOI: 10.1101/2024.02.14.580341
FANTASIA is distributed under the terms of the GNU Affero General Public License v3.0.
- Ana M. Rojas: a.rojas.m@csic.es
- Rosa Fernández: rosa.fernandez@ibe.upf-csic.es
- Belén Carbonetto: belen.carbonetto.metazomics@gmail.com
- Àlex Domínguez Rodríguez: adomrod4@upo.es
- Gemma I. Martínez-Redondo: gemma.martinez@ibe.upf-csic.es
- Francisco Miguel Pérez Canales: fmpercan@upo.es
- Francisco J. Ruiz Mota: fraruimot@alum.us.es
