Prefilter process is Killed during nucleotide search (--search-type 3) #1024

@ohickl

Dear MMseqs2 team,

I am unable to successfully run nucleotide-vs-nucleotide searches for taxonomic annotation.

Environment:

MMseqs2 Version: 18.8cc5c

OS: Linux (HPC environment)

Installation method: Conda

Bug Description

When performing a nucleotide-vs-nucleotide search (--search-type 3) with a set of assembled contigs against the ref_prok_rep_genomes database, the prefilter subprocess is terminated and the shell reports it as Killed.

I have observed this exact behavior with two different approaches, both following official documentation:

  • Using the mmseqs taxonomy easy-workflow.
  • Using an explicit, modular workflow (mmseqs createdb -> mmseqs search -> mmseqs lca).

Could this behavior be related to the problems discussed in Issue #932?

Database Preparation

For full context, the target MMseqs2 database was created from a local BLAST database (ref_prok_rep_genomes) and the NCBI taxdump, following the standard procedure outlined in the MMseqs2 User Guide.

# 1. Download NCBI taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir taxonomy && tar -xzf taxdump.tar.gz -C taxonomy

# 2. Extract FASTA and mapping file from BLAST DB
blastdbcmd \
    -db ref_prok_rep_genomes \
    -entry all > ref_prok_rep_genomes.fna
blastdbcmd \
    -db ref_prok_rep_genomes \
    -entry all \
    -outfmt "%a %T" > ref_prok_rep_genomes.taxidmapping

# 3. Create the MMseqs2 sequence database
mmseqs createdb \
    ref_prok_rep_genomes.fna \
    ref_prok_rep_genomes_db \
    --dbtype 2

# 4. Create the final taxonomically-annotated database
mmseqs createtaxdb \
    ref_prok_rep_genomes_db \
    tmp_taxdb \
    --ncbi-tax-dump taxonomy/ \
    --tax-mapping-file ref_prok_rep_genomes.taxidmapping
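
To rule out a malformed mapping file, the output of step 2 can be checked before running createtaxdb. This is only a minimal sketch: the helper name count_bad_mapping_lines is hypothetical (not part of MMseqs2), and it assumes each line should contain exactly an accession and a numeric taxid, as produced by -outfmt "%a %T".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MMseqs2): count lines that do not look like
# "<accession> <numeric taxid>", i.e. the layout written by -outfmt "%a %T".
count_bad_mapping_lines() {
    awk 'NF != 2 || $2 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Usage: only runs if the mapping file from step 2 exists in the current directory.
if [ -f ref_prok_rep_genomes.taxidmapping ]; then
    echo "malformed lines: $(count_bad_mapping_lines ref_prok_rep_genomes.taxidmapping)"
fi
```

A result of 0 only says the file has the expected two-column layout; it does not confirm that the taxids exist in the taxdump.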

Steps to Reproduce

The query is a standard set of metagenomic contigs.

# Step 1: Create query database
mmseqs createdb \
    path/to/contigs.fna \
    path/to/queryDB \
    --compressed 1 \
    --dbtype 2

# Step 2: Perform nucleotide search
mmseqs search \
    path/to/queryDB \
    path/to/ref_prok_rep_genomes_db \
    path/to/search_results.db \
    path/to/tmp_dir \
    --split-memory-limit 250G \
    --max-seq-len 300000000 \
    --search-type 3 \
    -s 4.0 \
    --compressed 1

# Step 3: LCA
mmseqs lca \
    path/to/ref_prok_rep_genomes_db \
    path/to/search_results.db \
    path/to/lca.db \
    --tax-lineage 1

# Step 4: Create TSV report
mmseqs createtsv \
    path/to/queryDB \
    path/to/lca.db \
    path/to/tax.tsv \
    --compressed 1

# Step 5: Generate Kraken-style report
mmseqs taxonomyreport \
    path/to/ref_prok_rep_genomes_db \
    path/to/lca.db \
    path/to/tax.report \
    --report-mode 0

# Step 6: Generate Krona report
mmseqs taxonomyreport \
    path/to/ref_prok_rep_genomes_db \
    path/to/lca.db \
    path/to/tax.html \
    --report-mode 1

Observed Behavior

The workflow fails during the prefilter step. The log output shows that the process is Killed after estimating memory consumption and starting the first of three prefiltering steps.

Query database size: 19348 type: Nucleotide
Target split mode. Searching through 3 splits
Estimated memory consumption: 222G
Target database size: 1102829 type: Nucleotide
The output of the prefilter cannot be compressed during target split mode. Prefilter result will not be compressed.
Process prefiltering step 1 of 3

Index table k-mer threshold: 0 at k-mer size 15 
Index table: counting k-mers
[=================================================================]
/path/to/blastp.sh: line 144: 1652760 Killed                  $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"
Error: Prefilter died
Error: Search step died

mmseqs2_contigs.mg.log

Further Questions

Could you clarify the behavior of the --compressed 1 flag? Is it safe to use this flag at every possible step (createdb, search, etc.)?

What are the best practices for nucleotide-vs-nucleotide searches?

Thank you for your help!
