
BadZipFile Error in download_reference.py - NCBI API returns non-ZIP data with HTTP 200 #290

@vaishnavpvarma

Description of the bug

Hello,

I encountered an error in the KMERFINDER_DOWNLOAD_REFERENCE step, specifically during the NCBI API-based retrieval of reference files (e.g., .gff, .fna). The failure appears to stem from missing validation of the downloaded content before it is extracted.

Below are my observations:

The download_reference.py script fails with a "BadZipFile: File is not a zip file" error when the NCBI Datasets API returns an HTTP 200 status with non-ZIP data (such as a JSON error response or an HTML page). The script assumes that every 200 response contains a valid ZIP archive and does not validate the response content type or format (see download_reference.py, lines 74-76).
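One possible guard is sketched below. It is not the code from download_reference.py: the function name ensure_zip, the exception NotAZipError, and the content_type parameter are hypothetical names chosen for illustration. The idea is to check the response body with the standard library's zipfile.is_zipfile before extraction and to fail with a message that shows what the server actually sent.

```python
import io
import zipfile


class NotAZipError(RuntimeError):
    """Raised when a response body is not a readable ZIP archive (hypothetical name)."""


def ensure_zip(body: bytes, content_type: str = "") -> io.BytesIO:
    """Return the body as a BytesIO if it is a ZIP archive, otherwise raise.

    `body` is the raw HTTP response payload and `content_type` is the value
    of the Content-Type header, used only to make the error message useful.
    """
    buffer = io.BytesIO(body)
    # is_zipfile accepts a file-like object and looks for the ZIP
    # end-of-central-directory record, so JSON or HTML bodies are rejected.
    if not zipfile.is_zipfile(buffer):
        preview = body[:200].decode("utf-8", errors="replace")
        raise NotAZipError(
            f"Expected a ZIP archive but got content type {content_type!r}; "
            f"body starts with: {preview!r}"
        )
    buffer.seek(0)  # rewind so the caller can hand it to zipfile.ZipFile
    return buffer
```

With a check like this in place, an HTTP 200 response that carries a JSON error would surface as a descriptive error (and could be retried or skipped per accession) instead of an unhandled BadZipFile during extraction.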

Version of nf-core/bacass: v2.5.0, executed on an HPC system with a Lustre file system.

This bug blocks the entire kmerfinder workflow, preventing:

  • Reference-based QUAST analysis
  • LIFTOFF annotation transfer
  • Pipeline completion with proper reference genome assessment

Here's the snippet of the error message:

Execution cancelled -- Finishing pending tasks before exit  
-[nf-core/bacass] Pipeline completed with errors-  
ERROR ~ Error executing process > 'NFCORE_BACASS:BACASS:KMERFINDER_SUMMARY_DOWNLOAD:KMERFINDER_DOWNLOAD_REFERENCE (NFCORE_BACASS:BACASS:KMERFINDER_SUMMARY_DOWNLOAD:KMERFINDER_DOWNLOAD_REFERENCE)'  
  
Caused by:  
  Process `NFCORE_BACASS:BACASS:KMERFINDER_SUMMARY_DOWNLOAD:KMERFINDER_DOWNLOAD_REFERENCE (NFCORE_BACASS:BACASS:KMERFINDER_SUMMARY_DOWNLOAD:KMERFINDER_DOWNLOAD_REFERENCE)` terminated with an error exit status (1)  
  
Command error:  
  Traceback (most recent call last):  
    File "/home/user/.nextflow/assets/nf-core/bacass/bin/download_reference.py", line 157, in <module>  
      sys.exit(main())  
    File "/home/user/.nextflow/assets/nf-core/bacass/bin/download_reference.py", line 148, in main  
      _extract_files(zip_bytes, acc, out_dir)  
    File "/home/user/.nextflow/assets/nf-core/bacass/bin/download_reference.py", line 97, in _extract_files  
      with zipfile.ZipFile(zip_bytes) as zf:  
    File "/usr/local/lib/python3.10/zipfile.py", line 1258, in __init__  
      self._RealGetContents()  
    File "/usr/local/lib/python3.10/zipfile.py", line 1325, in _RealGetContents  
      raise BadZipFile("File is not a zip file")  
  zipfile.BadZipFile: File is not a zip file  

Command used and terminal output

nextflow run nf-core/bacass \
-profile singularity \
-r 2.5.0 \
--input /lustre/user/project_file/nanopore_run/fastq_pass/merged_fastq/samplesheet_bacass_exp2.tsv \
--outdir /lustre/user/project_file/nanopore_run/bacass_op_exp2_run5/ \
--skip_toulligqc \
--skip_fastqc \
--skip_nanoplot \
--assembly_type long \
--assembler dragonflye \
--kmerfinderdb /lustre/user/project_file/databases/kmerfinder/20190108_stable_dirs/bacteria \
--dragonflye_args "--gsize 5m" \
--polish_method medaka \
--annotation_tool prokka \
--kraken2db /lustre/user/project_file/databases/kraken_tar/k2_standard_8gb_20210517.tar \
--busco_mode genome \
--busco_db_path /lustre/user/project_file/databases/kmerfinder/20190108_stable_dirs/bacteria/databases/busco/bacteria_odb10

Relevant files

No response

System information

No response
