ProstT5 issue submitting job with GPU on HPC #119

@abigailvolk

Hi! I am having issues running phold on an HPC with GPU. I have access to CUDA v12.8.1 and an NVIDIA A100 GPU on the HPC. I installed phold and the GPU version of PyTorch via Conda.

I reviewed similar closed issues, but none of the suggestions worked (such as setting the environment variables to offline when submitting an HPC job, manually downloading and extracting the database, etc.).
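
For readers unfamiliar with the offline setting mentioned above, here is a minimal sketch. It assumes the setting refers to the Hugging Face variables `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE`, exported in the job script before `phold run`. Whether phold honours them on this failure path is not something the logs in this issue confirm.

```shell
# Assumption: these are the Hugging Face offline switches referred to above.
# They ask the Hugging Face libraries to use only locally cached files.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

# Print the values so they show up in the job log.
echo "HF_HUB_OFFLINE=${HF_HUB_OFFLINE} TRANSFORMERS_OFFLINE=${TRANSFORMERS_OFFLINE}"
```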

When I run as a job on an HPC cluster, it consistently fails here, even when I manually download/extract the DB from Zenodo:

phold run \
  --input /fs/ess/PAS1806/Abby/viral/synteny/microcystis/test3/pharokka.gbk \
  --database /fs/ess/PAS1806/Abby/TOOLS/phold \
  --output /fs/ess/PAS1806/Abby/viral/synteny/microcystis/test3/pholdtest \
  --threads 8 \
  --hyps \
  --foldseek_gpu

2026-02-02 00:57:55.095 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0085 as it has a known function from Pharokka
2026-02-02 00:57:55.095 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0091 as it has a known function from Pharokka
2026-02-02 00:57:55.095 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0094 as it has a known function from Pharokka
2026-02-02 00:57:55.095 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0115 as it has a known function from Pharokka
2026-02-02 00:57:55.095 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0151 as it has a known function from Pharokka
2026-02-02 00:57:55.095 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0159 as it has a known function from Pharokka
2026-02-02 00:57:55.250 | INFO     | phold.features.predict_3Di:get_T5_model:125 - Using device: cuda:0
2026-02-02 00:57:55.250 | INFO     | phold.features.predict_3Di:get_T5_model:131 - Loading T5 from: /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16
2026-02-02 00:57:55.250 | INFO     | phold.features.predict_3Di:get_T5_model:132 - If /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16 is not found, it will be downloaded
2026-02-02 00:58:04.553 | WARNING  | phold.features.predict_3Di:get_T5_model:155 - Download from Hugging Face failed. Trying backup from Zenodo.
2026-02-02 00:58:04.553 | INFO     | phold.databases.db:download_zenodo_prostT5:289 - Downloading ProstT5 model backup from https://zenodo.org/records/11234657/files/models--Rostlab--ProstT5_fp16.tar.gz
2026-02-02 00:58:04.563 | INFO     | phold.utils.external_tools:run_stream:56 - Started running aria2c --dir /fs/ess/PAS1806/Abby/TOOLS/phold --out models--Rostlab--ProstT5_fp16.tar.gz --max-connection-per-server=8 --allow-overwrite=true https://zenodo.org/records/11234657/files/models--Rostlab--ProstT5_fp16.tar.gz ...
2026-02-02 00:58:05.020 | INFO     | phold.utils.external_tools:run_stream:73 - Done running aria2c --dir /fs/ess/PAS1806/Abby/TOOLS/phold --out models--Rostlab--ProstT5_fp16.tar.gz --max-connection-per-server=8 --allow-overwrite=true https://zenodo.org/records/11234657/files/models--Rostlab--ProstT5_fp16.tar.gz
2026-02-02 00:58:05.020 | ERROR    | phold.utils.external_tools:run_download:134 - Error calling aria2c --dir /fs/ess/PAS1806/Abby/TOOLS/phold --out models--Rostlab--ProstT5_fp16.tar.gz --max-connection-per-server=8 --allow-overwrite=true https://zenodo.org/records/11234657/files/models--Rostlab--ProstT5_fp16.tar.gz (return code 22)
2026-02-02 00:58:05.020 | WARNING  | phold.databases.db:download:253 - Downloading the database with aria2c failed. Trying now without.

02/02 00:58:04 [NOTICE] Downloading 1 item(s)

02/02 00:58:05 [ERROR] CUID#7 - Download aborted. URI=https://zenodo.org/records/11234657/files/models--Rostlab--ProstT5_fp16.tar.gz
Exception: [AbstractCommand.cc:351] errorCode=22 URI=https://zenodo.org/records/11234657/files/models--Rostlab--ProstT5_fp16.tar.gz
  -> [HttpSkipResponseCommand.cc:239] errorCode=22 The response status is not successful. status=403

02/02 00:58:05 [NOTICE] Download GID#e7ec869eebdeb19a not complete: /fs/ess/PAS1806/Abby/TOOLS/phold/models--Rostlab--ProstT5_fp16.tar.gz

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
e7ec86|ERR |       0B/s|/fs/ess/PAS1806/Abby/TOOLS/phold/models--Rostlab--ProstT5_fp16.tar.gz

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.

But when I run this from the login node, it appears to get past the point where the job fails, but then does not progress (I do not have access to a GPU on the login node):

2026-02-02 03:00:57.486 | INFO     | phold.utils.validation:validate_input:61 - Successfully parsed input /fs/ess/PAS1806/Abby/viral/synteny/microcystis/test3/pharokka.gbk as a Pharokka style Genbank file.
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:76 - You have used --hyps and a Pharokka style input Genbank was detected.
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:79 - Only unknown function proteins from your Pharokka input Genbank will be extracted and annotated with Phold.
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0008 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0009 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0024 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0025 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0026 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0027 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0031 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0045 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0048 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0049 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0051 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0052 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0060 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0065 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0067 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0076 as it has a known function from Pharokka
2026-02-02 03:00:57.486 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0078 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0082 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0083 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0085 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0091 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0094 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0115 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0151 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | INFO     | phold.subcommands.predict:subcommand_predict:122 - Skipping DSJFIZFZ_CDS_0159 as it has a known function from Pharokka
2026-02-02 03:00:57.487 | WARNING  | phold.features.predict_3Di:get_T5_model:119 - No available GPU was found, but --cpu was not specified
2026-02-02 03:00:57.487 | WARNING  | phold.features.predict_3Di:get_T5_model:122 - ProstT5 will be run with CPU only
2026-02-02 03:00:57.488 | INFO     | phold.features.predict_3Di:get_T5_model:125 - Using device: cpu
2026-02-02 03:00:57.488 | INFO     | phold.features.predict_3Di:get_T5_model:131 - Loading T5 from: /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16
2026-02-02 03:00:57.488 | INFO     | phold.features.predict_3Di:get_T5_model:132 - If /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16 is not found, it will be downloaded
2026-02-02 03:01:08.835 | INFO     | phold.features.predict_3Di:get_T5_model:174 - Rostlab/ProstT5_fp16 loaded
2026-02-02 03:01:08.972 | INFO     | phold.features.predict_3Di:get_embeddings:491 - Beginning ProstT5 predictions
2026-02-02 03:01:09.045 | INFO     | phold.features.predict_3Di:get_embeddings:496 - Using models in half-precision

And when I run install before all of this, it also says the ProstT5 model is available.

phold install \
  -d /fs/ess/PAS1806/Abby/TOOLS/phold \
  --foldseek_gpu \
  -t 16
2026-02-02 02:54:00.165 | INFO     | phold:install:1343 - You have specified the /fs/ess/PAS1806/Abby/TOOLS/phold directory to store the Phold database and ProstT5 model
2026-02-02 02:54:00.165 | INFO     | phold:install:1355 - Checking that the Rostlab/ProstT5_fp16 ProstT5 model is available in /fs/ess/PAS1806/Abby/TOOLS/phold
2026-02-02 02:54:00.167 | INFO     | phold.features.predict_3Di:get_T5_model:125 - Using device: cpu
2026-02-02 02:54:00.180 | INFO     | phold.features.predict_3Di:get_T5_model:131 - Loading T5 from: /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16
2026-02-02 02:54:00.180 | INFO     | phold.features.predict_3Di:get_T5_model:132 - If /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16 is not found, it will be downloaded
2026-02-02 02:54:10.539 | INFO     | phold.features.predict_3Di:get_T5_model:174 - Rostlab/ProstT5_fp16 loaded
2026-02-02 02:54:10.573 | INFO     | phold:install:1366 - ProstT5 model downloaded
2026-02-02 02:54:10.573 | INFO     | phold.databases.db:install_database:143 - Checking Phold database installation in /fs/ess/PAS1806/Abby/TOOLS/phold.
2026-02-02 02:54:10.597 | INFO     | phold.databases.db:install_database:146 - All Phold databases files are present
2026-02-02 02:54:10.597 | INFO     | phold.databases.db:install_database:185 - All Phold database files compatible with Foldseek-GPU are present
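
Since the same database path loads on the login node but not inside the GPU job, one check that might help narrow this down is whether the model directory looks the same from both places. Below is a minimal sketch; `check_model_dir` is a hypothetical helper, not part of phold, and it only tests that the directory exists, is readable, and contains at least one file. It says nothing about whether the model files themselves are complete.

```shell
# Hypothetical helper (not part of phold): report whether a directory
# is visible, readable, and non-empty from the node it is run on.
check_model_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -r "$dir" ]; then
        echo "unreadable: $dir"
        return 2
    fi
    # Count regular files anywhere below the directory.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "empty: $dir"
        return 3
    fi
    echo "ok: $dir ($count files)"
}

# Example call, using the path from the logs above:
# check_model_dir /fs/ess/PAS1806/Abby/TOOLS/phold/Rostlab/ProstT5_fp16
```

Running the example call once on the login node and once at the top of the job script would show whether the compute node sees the same files before `phold run` starts.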
