Description
Dear MMseqs2 team,
I am unable to successfully run nucleotide-vs-nucleotide searches for taxonomic annotation.
Environment:
MMseqs2 Version: 18.8cc5c
OS: Linux (HPC environment)
Installation method: Conda
Bug Description
When performing a nucleotide-vs-nucleotide search (--search-type 3) using a set of assembled contigs against the ref_prok_rep_genomes database, the prefilter subprocess is terminated and the shell reports it as Killed.
I have observed this exact behavior with two different approaches, both following official documentation:
- Using the mmseqs taxonomy easy-workflow.
- Using an explicit, modular workflow (mmseqs createdb -> mmseqs search -> mmseqs lca).
Could this behavior be related to the problems discussed in Issue #932?
Database Preparation
For full context, the target MMseqs2 database was created from a local BLAST database (ref_prok_rep_genomes) and the NCBI taxdump, following the standard procedure outlined in the MMseqs2 User Guide.
# 1. Download NCBI taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir taxonomy && tar -xzf taxdump.tar.gz -C taxonomy
# 2. Extract FASTA and mapping file from BLAST DB
blastdbcmd \
-db ref_prok_rep_genomes \
-entry all > ref_prok_rep_genomes.fna
blastdbcmd \
-db ref_prok_rep_genomes \
-entry all \
-outfmt "%a %T" > ref_prok_rep_genomes.taxidmapping
# 3. Create the MMseqs2 sequence database
mmseqs createdb \
ref_prok_rep_genomes.fna \
ref_prok_rep_genomes_db \
--dbtype 2
# 4. Create the final taxonomically-annotated database
mmseqs createtaxdb \
ref_prok_rep_genomes_db \
tmp_taxdb \
--ncbi-tax-dump taxonomy/ \
--tax-mapping-file ref_prok_rep_genomes.taxidmapping
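To rule out a malformed mapping file, it can be checked before createtaxdb is run. This is a minimal sketch, assuming the two-column layout written by -outfmt "%a %T" (an accession followed by a numeric taxon ID). The function name count_bad_mapping_lines is hypothetical and is not part of MMseqs2 or BLAST+.

```shell
#!/usr/bin/env bash
# Count lines that do not look like "<accession> <numeric taxid>".
# Assumption: whitespace-separated columns, as written by
# blastdbcmd -outfmt "%a %T".
count_bad_mapping_lines() {
    awk 'NF != 2 || $2 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Usage (file name taken from the steps above):
# count_bad_mapping_lines ref_prok_rep_genomes.taxidmapping
```

A result of 0 only means every line has the expected shape. It does not confirm that the taxon IDs exist in the taxdump.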
Steps to Reproduce
The query is a standard set of metagenomic contigs.
# Step 1: Create query database
mmseqs createdb \
path/to/contigs.fna \
path/to/queryDB \
--compressed 1 \
--dbtype 2
# Step 2: Perform nucleotide search
mmseqs search \
path/to/queryDB \
path/to/ref_prok_rep_genomes_db \
path/to/search_results.db \
path/to/tmp_dir \
--split-memory-limit 250G \
--max-seq-len 300000000 \
--search-type 3 \
-s 4.0 \
--compress 1
# Step 3: LCA
mmseqs lca \
path/to/ref_prok_rep_genomes_db \
path/to/search_results.db \
path/to/lca.db \
--tax-lineage 1
# Step 4: Create TSV report
mmseqs createtsv \
path/to/queryDB \
path/to/lca.db \
path/to/tax.tsv \
--compressed 1
# Step 5: Generate Kraken-style report
mmseqs taxonomyreport \
path/to/ref_prok_rep_genomes_db \
path/to/lca.db \
path/to/tax.report \
--report-mode 0
# Step 6: Generate Krona report
mmseqs taxonomyreport \
path/to/ref_prok_rep_genomes_db \
path/to/lca.db \
path/to/tax.html \
--report-mode 1
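The 250G passed to --split-memory-limit in Step 2 was chosen by hand. As a sketch of an alternative, the value could be derived from the memory the node reports as available. The helper name split_limit_g and the 80% margin are assumptions made for illustration, and whether a lower limit avoids the failure has not been tested here.

```shell
#!/usr/bin/env bash
# Turn a memory size in KiB into a "<N>G" string at a given percentage,
# rounding down. Arguments: <memory in KiB> <percent>.
split_limit_g() {
    echo "$(( $1 * $2 / 100 / 1024 / 1024 ))G"
}

# Usage on Linux (MemAvailable is reported in kB in /proc/meminfo):
# avail_kib=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
# mmseqs search ... --split-memory-limit "$(split_limit_g "$avail_kib" 80)"
```

For example, 80% of a 256 GiB node (268435456 KiB) gives 204G.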
Observed Behavior
The workflow fails during the prefilter step. The log output shows that the process is Killed after estimating memory consumption and starting the first of three prefiltering steps.
Query database size: 19348 type: Nucleotide
Target split mode. Searching through 3 splits
Estimated memory consumption: 222G
Target database size: 1102829 type: Nucleotide
The output of the prefilter cannot be compressed during target split mode. Prefilter result will not be compressed.
Process prefiltering step 1 of 3
Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
[=================================================================]
/path/to/blastp.sh: line 144: 1652760 Killed $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"
Error: Prefilter died
Error: Search step died
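The estimate in the log can be compared with the memory actually available on the node. The helpers below are a hypothetical sketch, not an MMseqs2 feature. They assume the log was saved to a file (search.log is a placeholder name), that the estimate is printed with a G suffix as above, and that /proc/meminfo is present (Linux).

```shell
#!/usr/bin/env bash
# Print the integer after "Estimated memory consumption:" (222 for "222G").
estimated_mem_g() {
    sed -n 's/^Estimated memory consumption: \([0-9][0-9]*\)G.*$/\1/p' "$1" | head -n 1
}

# Print available memory in GiB (integer, rounded down). Linux only.
available_mem_g() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Usage:
# est=$(estimated_mem_g search.log)
# avail=$(available_mem_g)
# [ "$est" -gt "$avail" ] && echo "estimate exceeds available memory"
```

On a cluster, the scheduler's per-job memory limit may be lower than what /proc/meminfo reports, so the job allocation is worth checking as well.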
Further Questions
Could you clarify the behavior of the --compress 1 flag? Is it safe to use this flag at every possible step (createdb, search, etc.)?
What are the best practices for nucleotide-vs-nucleotide searches?
Thank you for your help!