yangao07/longcallD

Updates (release v0.0.7)

  • sort reads internally before processing to fix potential inconsistency when multiple input BAM/CRAM files are provided

Getting Started

# Download pre-built executables and test data (recommended)
# Linux-x64
wget https://github.com/yangao07/longcallD/releases/download/v0.0.7/longcallD-v0.0.7_x64-linux.tar.gz
tar -zxvf longcallD-v0.0.7_x64-linux.tar.gz && cd longcallD-v0.0.7_x64-linux
# macOS-arm64
wget https://github.com/yangao07/longcallD/releases/download/v0.0.7/longcallD-v0.0.7_arm64-macos.tar.gz
tar -zxvf longcallD-v0.0.7_arm64-macos.tar.gz && cd longcallD-v0.0.7_arm64-macos

# PacBio HiFi reads
./longcallD call ./test_data/chr11_2M.fa ./test_data/HG002_chr11_hifi_test.bam --hifi > HG002_hifi_test.vcf
# Oxford Nanopore reads
./longcallD call ./test_data/chr11_2M.fa ./test_data/HG002_chr11_ont_test.bam --ont > HG002_ont_test.vcf

Introduction

LongcallD is a local-haplotagging-based variant caller designed for detecting small variants and structural variants (SVs) using long-read sequencing data. It supports both PacBio HiFi and Oxford Nanopore reads.

LongcallD phases long reads into haplotypes using SNPs and small indels before calling SVs. It outputs phased variant calls in VCF format, including SNPs, small indels, and large SVs (currently only supporting insertions and deletions).

LongcallD (≥v0.0.5) can also call low-allele-fraction mosaic variants when -s/--mosaic is used. Currently, only mosaic SNVs and large indels are supported; mosaic small indels are not called. In particular, longcallD can sensitively identify mosaic mobile element insertions (MEIs). Providing the annotation sequences of common mobile elements (Alu, L1, and SVA) with -T is highly recommended; a file with these sequences is included here.

Installation

Pre-built executables (recommended)

Linux-x64

wget https://github.com/yangao07/longcallD/releases/download/v0.0.7/longcallD-v0.0.7_x64-linux.tar.gz
tar -zxvf longcallD-v0.0.7_x64-linux.tar.gz

macOS-arm64

wget https://github.com/yangao07/longcallD/releases/download/v0.0.7/longcallD-v0.0.7_arm64-macos.tar.gz
tar -zxvf longcallD-v0.0.7_arm64-macos.tar.gz

Linux-arm64/macOS-x64

There is no pre-built executable for Linux-arm64 or macOS-x64; please install via conda or build from source.

Bioconda

For Linux and macOS

conda install -c bioconda longcalld

Build from source

To compile longcallD from source, make sure GCC or Clang (9.0+) is installed, along with zlib, libbz2, liblzma, and libcurl (required by htslib).

wget https://github.com/yangao07/longcallD/releases/download/v0.0.7/longcallD-v0.0.7.tar.gz
tar -zxvf longcallD-v0.0.7.tar.gz
cd longcallD-v0.0.7; make

Usage

LongcallD requires a reference genome (FASTA) and a long-read BAM/CRAM file as inputs. It outputs phased variant calls in VCF format.

Variant calling with PacBio HiFi/Nanopore long reads

longcallD call -t16 ref.fa hifi.bam > hifi.vcf         # default for PacBio HiFi reads (--hifi)
longcallD call -t16 ref.fa ont.bam --ont > ont.vcf     # for ONT reads

Multiple input BAM/CRAM files of the same sample

You can provide multiple BAM/CRAM files of the same sample for variant calling using --input-is-list or -X:

longcallD call -t16 --input-is-list ref.fa bam_list.txt > sample.vcf
# where bam_list.txt contains:
# sample_part1.bam
# sample_part2.bam
# sample_part3.bam

or

longcallD call -t16 ref.fa sample_part1.bam -X sample_part2.bam -X sample_part3.bam > sample.vcf
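
The list file used with --input-is-list is a plain text file with one BAM/CRAM path per line, as shown in the comments above. A minimal sketch for generating it follows; `make_bam_list` is a hypothetical helper written for this example, not part of longcallD.

```shell
# Hypothetical helper: write one BAM/CRAM path per line to a list file.
# Usage: make_bam_list <list_file> <bam_or_cram> [<bam_or_cram> ...]
make_bam_list() {
    out=$1
    shift
    printf '%s\n' "$@" > "$out"
}

make_bam_list bam_list.txt sample_part1.bam sample_part2.bam sample_part3.bam

# The list can then be passed to longcallD as shown above:
# longcallD call -t16 --input-is-list ref.fa bam_list.txt > sample.vcf
```

A shell glob such as `sample_part*.bam` can be passed in place of the explicit file names.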

Low-allele-fraction mosaic variant calling

With -s, longcallD will detect both germline and low-allele-fraction somatic/mosaic variants.

For each somatic/mosaic variant, a SOMATIC tag will be added to the INFO field in the output VCF.

longcallD call -s -t16 ref.fa hifi.bam > hifi.vcf
longcallD call -s -t16 ref.fa hifi.bam -T AluY_L1_SVA_cons_noPA.fa > hifi.vcf # add MEI information in INFO field
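
To look only at the somatic/mosaic calls, the output VCF can be filtered on the SOMATIC tag. The sketch below assumes SOMATIC appears as a key in the semicolon-separated INFO column (the 8th VCF column), either as a bare flag or as `SOMATIC=...`; `somatic_only` is a hypothetical helper name.

```shell
# Hypothetical helper: keep VCF header lines plus records whose INFO column
# (column 8) contains the SOMATIC key. Reads a VCF on stdin, writes to stdout.
somatic_only() {
    awk -F'\t' '
        /^#/ { print; next }                       # pass header lines through
        $8 ~ /(^|;)SOMATIC(;|=|$)/ { print }       # match SOMATIC as a whole key
    '
}

# Example (assumes hifi.vcf was produced with -s):
# somatic_only < hifi.vcf > hifi_somatic.vcf
```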

Region-specific variant calling

LongcallD supports region-based variant calling, similar to samtools view.

longcallD call -t16 ref.fa hifi.bam chr11:10,229,956-10,256,221 > hifi_reg.vcf
longcallD call -t16 ref.fa hifi.bam chr11:10,229,956-10,256,221 chr12:10,576,356-10,583,438 > hifi_regs.vcf
longcallD call -t16 ref.fa hifi.bam --region-file reg.bed > hifi_regs.vcf
longcallD call -t16 ref.fa hifi.bam --autosome > hifi_autosome.vcf
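
A region file for --region-file can be built from the same region strings. The sketch below assumes longcallD reads standard BED coordinates (0-based start, exclusive end), whereas samtools-style region strings are 1-based and inclusive, so the start is decremented by one. `region_to_bed` is a hypothetical helper and does not handle contig names that themselves contain `:` or `-`.

```shell
# Hypothetical helper: convert "chr:start-end" (1-based, inclusive, commas
# allowed) into a tab-separated BED line (0-based start, exclusive end).
region_to_bed() {
    printf '%s\n' "$1" | tr -d ',' |
        awk -F'[-:]' '{ printf "%s\t%d\t%d\n", $1, $2 - 1, $3 }'
}

region_to_bed chr11:10,229,956-10,256,221  > reg.bed
region_to_bed chr12:10,576,356-10,583,438 >> reg.bed
```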

Output phased (& refined) long-read BAM/CRAM

LongcallD performs read phasing during variant calling and can output phased long reads in BAM/CRAM.

With --refine-aln, it can further output refined read alignment based on multiple sequence alignment within each haplotype, which is especially useful for low-complexity regions like homopolymers and tandem repeats.

longcallD call -t16 ref.fa hifi.bam --hifi -b hifi_phased.bam > hifi.vcf                  # output phased HiFi reads (BAM tag: HP & PS)
longcallD call -t16 ref.fa ont.bam --ont --refine-aln -b ont_phased_refined.bam > ont.vcf # output phased & refined ONT reads (BAM tag: HP & PS)
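
As a quick sanity check on the phased output, reads can be tallied by their HP tag. The sketch below works on SAM text (for example from `samtools view -h`) and assumes that unphased reads simply carry no HP tag; `count_haplotypes` is a hypothetical helper name.

```shell
# Hypothetical helper: count reads per HP (haplotype) tag value in SAM text
# read from stdin. Reads without an HP tag are reported as "untagged".
count_haplotypes() {
    awk -F'\t' '
        /^@/ { next }                              # skip SAM header lines
        {
            hp = "untagged"
            for (i = 12; i <= NF; i++)             # optional fields start at column 12
                if ($i ~ /^HP:i:/) hp = substr($i, 6)
            n[hp]++
        }
        END { for (h in n) print h "\t" n[h] }
    ' | sort
}

# Example (assumes samtools is installed):
# samtools view -h hifi_phased.bam | count_haplotypes
```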

Variant calling from remote files

ref=https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18.fasta.gz
bam=https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_HiFi-Revio_20231031/HG002_PacBio-HiFi-Revio_20231031_48x_GRCh38-GIABv3.bam
longcallD call -t16 $ref $bam chr11:10,229,956-10,256,221 chr12:10,576,356-10,583,438 > hifi_regs.vcf

Memory usage

Because longcallD performs multiple sequence alignment and re-alignment, both of which are memory-intensive, it generally requires more memory than other variant callers. Peak memory consumption depends primarily on the number of threads (-t/--threads), sequencing coverage, and read length. For human whole-genome datasets at ~40× coverage, longcallD typically uses about 1 GB (HiFi) or 2 GB (ONT R10) of memory per thread for germline variant calling.

Memory usage and runtime increase further when mosaic variant calling is enabled.
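
The per-thread figures above give a rough way to budget memory before launching a job. The sketch below simply multiplies the thread count by those figures; it applies only to germline calling on human whole-genome data at ~40× coverage, and `estimate_mem_gb` is a hypothetical helper, not a longcallD command.

```shell
# Hypothetical helper: rough peak-memory estimate in GB for germline calling,
# using the per-thread figures quoted above (1 GB HiFi, 2 GB ONT R10).
# Usage: estimate_mem_gb <threads> <hifi|ont>
estimate_mem_gb() {
    threads=$1
    platform=$2
    case "$platform" in
        hifi) per_thread=1 ;;
        ont)  per_thread=2 ;;
        *)    echo "unknown platform: $platform" >&2; return 1 ;;
    esac
    echo $(( threads * per_thread ))
}

estimate_mem_gb 16 hifi   # prints 16
estimate_mem_gb 16 ont    # prints 32
```

Treat the result as a lower bound when -s is enabled, since mosaic calling increases memory usage.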

If you encounter memory constraints, you may restrict processing to specific genomic regions using --region-file. A region list for the human genome that excludes centromeres is available here.

Acknowledgements

LongcallD depends on the following libraries; we are grateful to all their developers and maintainers:

  • htslib: read/write BAM/CRAM/VCF
  • abPOA: consensus calling
  • WFA: pairwise alignment
  • edlib: fast sequence similarity calculation
  • cgranges: interval operations
  • sdust: identify low-complexity regions

Contact

For any questions or support, please contact:
