Add BBTools Java implementation for fqcnt benchmark #25

Open
bbushnell wants to merge 1 commit into lh3:master from bbushnell:add-bbtools-java
@bbushnell

This PR adds BBTools FastqScan as a Java implementation for the fqcnt benchmark.

Implementation Details

  • Uses BBTools' FastqScan tool with multithreaded SIMD-accelerated parsing
  • Wrapper script: fqcnt_java_bbtools.sh
  • Output format matches biofast specification: <records>\t<bases>\t<qualities>
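For reference, the three output fields can be illustrated with a minimal single-threaded counter. This is only a sketch of the output contract, assuming strict 4-line FASTQ records; it is not how FastqScan is implemented, and the class and method names (FqCountSketch, count, format) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical reference counter; assumes each record is exactly 4 lines.
final class FqCountSketch {
    /** Returns {records, bases, qualities} for the FASTQ text read from in. */
    static long[] count(BufferedReader in) throws IOException {
        long records = 0, bases = 0, quals = 0;
        String header;
        while ((header = in.readLine()) != null) {
            if (header.isEmpty()) continue;   // tolerate blank lines between records
            String seq = in.readLine();       // sequence line
            String plus = in.readLine();      // '+' separator line
            String qual = in.readLine();      // quality line
            if (seq == null || plus == null || qual == null) {
                throw new IOException("truncated FASTQ record: " + header);
            }
            records++;
            bases += seq.length();
            quals += qual.length();
        }
        return new long[] {records, bases, quals};
    }

    /** Formats the counts as <records>\t<bases>\t<qualities>. */
    static String format(long[] c) {
        return c[0] + "\t" + c[1] + "\t" + c[2];
    }

    public static void main(String[] args) throws IOException {
        String fastq = "@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n";
        System.out.println(format(count(new BufferedReader(new StringReader(fastq)))));
    }
}
```

For well-formed input the second and third fields are equal, since every base has one quality character.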

Testing

Tested with M_abscessus_HiSeq.fq (5,682,010 reads):

5682010	568201000	568201000

Requirements

  • Java 18+ required (for jdk.incubator.vector SIMD support)
  • Java 25 recommended for optimal performance
  • BBTools: git clone --depth=1 https://github.com/bbushnell/BBTools

About BBTools

BBTools is a comprehensive suite of bioinformatics tools developed at the Joint Genome Institute (JGI). FastqScan provides high-performance FASTQ parsing optimized for modern hardware.

Repository: https://github.com/bbushnell/BBTools

Adds fqcnt_java_bbtools.sh wrapper for BBTools FastqScan.

BBTools is a comprehensive suite of bioinformatics tools developed at
JGI. FastqScan uses multithreaded SIMD-accelerated parsing for high
performance on modern hardware.

Tested with M_abscessus_HiSeq.fq (5,682,010 reads).

Requirements:
- Java 18+ (for jdk.incubator.vector SIMD support)
- Java 25 recommended for optimal performance
- BBTools: git clone --depth=1 https://github.com/bbushnell/BBTools
@bbushnell
Author

bbushnell commented Dec 11, 2025

Performance Note: FastqScan is fastest with larger files and BGZF compression

JVM Startup Overhead

Java has ~0.25 s of startup and JIT-compilation overhead, which dominates benchmarks on small files (like the ~5.7M-read test case). This overhead is:

  • Amortized on production-scale files (100M+ reads)
  • Irrelevant when called from Java code (JVM already running)
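The amortization argument can be made concrete with a little arithmetic. The sketch below assumes the ~0.25 s startup cost is fixed and that the parser otherwise runs at a constant steady-state rate; the 4.2 GB/s rate and the 1 GB and 20 GB file sizes are illustrative assumptions, not measurements.

```java
// Hypothetical model: effective throughput = size / (startup + size / steadyRate).
final class StartupAmortization {
    /** Effective throughput in GB/s once a fixed startup cost is included. */
    static double effectiveGBps(double fileGB, double steadyGBps, double startupSec) {
        double parseSec = fileGB / steadyGBps;   // time spent actually parsing
        return fileGB / (startupSec + parseSec); // startup inflates total wall time
    }

    public static void main(String[] args) {
        double startupSec = 0.25; // startup/JIT overhead quoted above
        double steadyGBps = 4.2;  // assumed steady-state rate
        for (double fileGB : new double[] {1.0, 20.0}) { // hypothetical file sizes
            System.out.printf("%.0f GB file: %.2f GB/s effective%n",
                    fileGB, effectiveGBps(fileGB, steadyGBps, startupSec));
        }
    }
}
```

Under these assumptions a 1 GB file sees roughly half the steady-state rate (about 2.0 GB/s), while a 20 GB file recovers about 4.0 GB/s.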

BGZF Multithreaded Decompression

FastqScan is actually faster on BGZF-compressed files than on plaintext, because decompression runs in parallel, provided enough cores are available (~20). On an 80M-read file:

  • Plaintext: ~4.2 GB/s (single-threaded)
  • BGZF compressed: ~6.2 GB/s (multithreaded decompression)
  • FastqScanMT (with t=2): ~9.5 GB/s on BGZF
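Parallel decompression is possible because a BGZF file is a series of independently compressed gzip blocks. The sketch below illustrates the idea with plain gzip members from java.util.zip rather than real BGZF (it omits the BGZF extra field that records each block's size), and it is not FastqScan's code; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of block-parallel decompression.
final class ParallelBlockGunzip {
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] block) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(block))) {
            return gz.readAllBytes();
        }
    }

    /** Splits data into fixed-size chunks and compresses each one independently. */
    static List<byte[]> compressBlocks(byte[] data, int blockSize) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += blockSize) {
            int end = Math.min(data.length, off + blockSize);
            blocks.add(gzip(Arrays.copyOfRange(data, off, end)));
        }
        return blocks;
    }

    /** Decompresses blocks on a thread pool and reassembles them in input order. */
    static byte[] decompressParallel(List<byte[]> blocks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] block : blocks) {
                futures.add(pool.submit(() -> gunzip(block))); // blocks share no state
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> f : futures) {
                out.write(f.get()); // waiting in submission order preserves byte order
            }
            return out.toByteArray();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "@r1\nACGT\n+\nIIII\n".repeat(1000).getBytes(StandardCharsets.US_ASCII);
        List<byte[]> blocks = compressBlocks(data, 4096);
        byte[] restored = decompressParallel(blocks, 4);
        System.out.println(blocks.size() + " blocks, round trip ok: " + Arrays.equals(data, restored));
    }
}
```

Because no block depends on another, decompression throughput can scale with core count until parsing or I/O becomes the bottleneck, which is consistent with the core-count caveat above.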

[Figure: FastqScan performance chart. Performance comparison showing FastqScanMT at 13.5x faster than Rust needletail on BGZF files.]
