Fast genome alignment with plane sweep filtering and scaffolding. Wraps wfmash and FastGA aligners and applies plane sweep and other filtering methods to keep the best non-overlapping alignments.
SweepGA can:
- Align FASTA files directly using integrated wfmash or FastGA (supports .fa.gz)
- Filter existing alignments from any aligner (wfmash, minimap2, etc.)
- Apply scaffolding/chaining to merge nearby alignments into syntenic regions
- Output multiple formats: PAF (text) or .1aln (binary ONE format)
By default, it applies 1:1 plane sweep filtering to remove duplicate mappings. Scaffolding/chaining is disabled by default but can be enabled with -j to merge nearby alignments into syntenic regions.
This package includes two binaries:
- sweepga - Genome alignment, filtering, and scaffolding tool
- alnstats - Alignment statistics and validation tool
Use alnstats to verify filtering results:
# Show statistics for a PAF file
alnstats alignments.paf
# Compare before/after filtering
alnstats raw.paf filtered.paf
# Detailed per-genome-pair breakdown
alnstats alignments.paf -d
cargo install sweepga
This installs both the sweepga and alnstats binaries from the published crate.
Requires Rust 1.70+. Clone and install:
git clone https://github.com/pangenome/sweepga.git
cd sweepga
cargo install --force --path .
Symptoms: Build fails with linker errors like:
ld: /usr/lib/x86_64-linux-gnu/librt.so: undefined reference to '__pthread_barrier_wait@GLIBC_PRIVATE'
This occurs on systems with multiple package managers (e.g., Debian + Guix) providing different glibc versions.
Fix: Use the clean build script to isolate from environment conflicts:
./scripts/build-clean.sh --install
See docs/BUILD-NOTES.md for details.
Adapted from https://issues.genenetwork.org/topics/rust/guix-rust-bootstrap:
# Update Guix
mkdir -p $HOME/opt
guix pull -p $HOME/opt/guix-pull-20251012 --url=https://codeberg.org/guix/guix
# Be sure to use the updated Guix
alias guix=$HOME/opt/guix-pull-20251012/bin/guix
# Update Rust and Cargo
mkdir -p ~/.cargo ~/.rustup # to prevent rebuilds
guix shell --share=$HOME/.cargo --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl git
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. ~/.cargo/env
rustup default stable
exit
# Clone the repository
git clone https://github.com/pangenome/sweepga.git
cd sweepga
guix shell --share=$HOME/.cargo --share=$HOME/.rustup -C -N -D -F -v 3 guix gcc-toolchain make libdeflate pkg-config xz coreutils sed zstd zlib nss-certs openssl curl cmake git clang # we need cmake and clang too for building
. ~/.cargo/env
export LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib
cargo build --release
# Check the lib path and put it into your ~/.bashrc or ~/.zshrc
echo $GUIX_ENVIRONMENT/
#/gnu/store/whgjblccmr4kdmsi4vg8h0p53m5f7sch-profile/
exit
echo "export GUIX_ENVIRONMENT=/gnu/store/whgjblccmr4kdmsi4vg8h0p53m5f7sch-profile/" >> ~/.bashrc # or ~/.zshrc
source ~/.bashrc # or ~/.zshrc
# Use the executable in sweepga/target/release
env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib ./target/release/sweepga --help
# Self-alignment with 1:1 filtering
sweepga genome.fa.gz > output.paf
# Pairwise alignment (target, query order)
sweepga target.fa query.fa > output.paf
# With 2 threads
sweepga genome.fa.gz -t 2 > output.paf
# Output in .1aln format (binary, more compact)
sweepga genome.fa.gz --output-file output.1aln
# Filter PAF from stdin (pipe)
cat alignments.paf | sweepga > filtered.paf
# Filter PAF from stdin (redirection)
sweepga < alignments.paf > filtered.paf
# Read PAF file directly
sweepga alignments.paf > filtered.paf
# Filter .1aln format (preserves format)
sweepga alignments.1aln --output-file filtered.1aln
# Convert .1aln to PAF
sweepga alignments.1aln --paf > output.paf
Scaffolding is disabled by default. Enable it with -j to merge nearby alignments into syntenic chains:
# Enable scaffolding with 10kb gap distance
sweepga genome.fa.gz -j 10k > output.paf
# Aggressive scaffolding with 1:1 filtering and rescue
sweepga alignments.paf -j 10k -s 10k -m 1:1 -d 20k > filtered.paf
# Permissive: keep all scaffolds without filtering
sweepga alignments.paf -j 50k -m many:many > filtered.paf
-n/--num-mappings - Pre-scaffold filter: n:m-best mappings in query:target dimensions (default: 1:1)
- "1:1" - Orthogonal: keep best mapping on both query and target axes
- "1" or "1:∞" - Keep best mapping per query position only
- "many" or "N:N" - No pre-filtering, pass all to scaffolding
- "n:m" - Keep top n per query, top m per target
-o/--overlap - Maximum overlap ratio for mappings (default: 0.95)
- Mappings with >95% overlap with a better-scoring mapping are removed
-b/--block-length - Minimum alignment block length (default: 0)
-i/--min-identity - Minimum identity threshold (0-1 fraction, 1-100%, or "aniN")
--scoring - Scoring function for plane sweep (default: log-length-ani)
- ani - Sort by alignment identity %
- length - Sort by alignment length
- length-ani - Sort by length × identity
- log-length-ani - Sort by log(length) × identity
- matches - Sort by number of matching bases
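The scoring modes can be pictured with a small Python sketch. This is an illustration only, not SweepGA's implementation: the function name `score` is hypothetical, identity is assumed to be a 0-1 fraction, and the natural logarithm is an assumption because the base is not stated here.

```python
import math

def score(length: int, identity: float, matches: int,
          mode: str = "log-length-ani") -> float:
    """Return a sort key for one alignment under the named scoring mode."""
    if mode == "ani":
        return identity
    if mode == "length":
        return float(length)
    if mode == "length-ani":
        return length * identity
    if mode == "log-length-ani":
        # Assumption: natural log; the base is not documented here.
        return math.log(length) * identity
    if mode == "matches":
        return float(matches)
    raise ValueError(f"unknown scoring mode: {mode}")

# A long, divergent alignment versus a shorter, cleaner one.
long_aln = (100_000, 0.80, 80_000)
short_aln = (20_000, 0.99, 19_800)
print(score(*long_aln, mode="length-ani") > score(*short_aln, mode="length-ani"))
print(score(*long_aln) > score(*short_aln))
```

Under these assumptions, length-ani ranks the 100 kb alignment first, while log-length-ani ranks the 20 kb alignment first because the logarithm damps the length advantage.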
--self - Include self-mappings (excluded by default)
-N/--no-filter - Disable all filtering
-j/--scaffold-jump - Gap distance for merging alignments into scaffolds (default: 0 = disabled)
- 0 - Scaffolding disabled (default)
- 10000 or 10k - Merge alignments within 10kb gaps (moderate)
- Higher values create longer scaffold chains
- Accepts k/m/g suffix (e.g., 50k, 1m)
-s/--scaffold-mass - Minimum scaffold chain length (default: 10k)
- Filters out short scaffold chains
- Accepts k/m/g suffix
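A minimal Python sketch of how a k/m/g suffix might be interpreted. The helper `parse_size` is hypothetical. Since 10000 and 10k are listed as equivalent, k is taken as 1,000; treating m and g as decimal multipliers too is an assumption.

```python
def parse_size(text: str) -> int:
    """Convert strings such as '10k', '1m' or '20000' to an integer."""
    # Assumption: decimal multipliers (k = 1,000), matching "10000 or 10k".
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

for value in ("0", "10k", "50k", "1m"):
    print(value, parse_size(value))
```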
-m/--scaffold-filter - Scaffold filter mode (default: many:many = no filtering)
- "1:1" - Keep best scaffold per chromosome pair (90-99% reduction)
- "1" or "1:∞" - One scaffold per query, many per target
- "N:N" or "many:many" - Keep all non-overlapping scaffolds (30-70% reduction)
- "n:m" - Keep top n per query, top m per target
-O/--scaffold-overlap - Overlap threshold for scaffold filtering (default: 0.5)
-d/--scaffold-dist - Maximum distance for rescuing alignments near scaffolds (default: 0 = disabled)
- 0 - No rescue (default)
- 20000 or 20k - Rescue alignments within 20kb of scaffold anchors
- Higher values rescue more alignments
- Distance is Euclidean: sqrt((q_dist)² + (t_dist)²)
- Only active when scaffolding is enabled (-j > 0)
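The rescue distance can be sketched in Python as follows. This is a hedged illustration: the function names and dictionary keys are hypothetical, q_dist and t_dist are assumed to be the gaps between the alignment and the anchor on each axis (zero when they overlap), and treating the threshold as inclusive is also an assumption.

```python
import math

def interval_gap(a_start, a_end, b_start, b_end):
    """Gap between two intervals; 0 when they overlap or touch."""
    return max(0, a_start - b_end, b_start - a_end)

def rescue_distance(aln, anchor):
    """Euclidean distance sqrt(q_dist^2 + t_dist^2) between two alignments."""
    q_dist = interval_gap(aln["q_start"], aln["q_end"],
                          anchor["q_start"], anchor["q_end"])
    t_dist = interval_gap(aln["t_start"], aln["t_end"],
                          anchor["t_start"], anchor["t_end"])
    return math.hypot(q_dist, t_dist)

def is_rescued(aln, anchors, max_dist):
    """True if rescue is enabled and any anchor lies within max_dist."""
    return max_dist > 0 and any(
        rescue_distance(aln, anchor) <= max_dist for anchor in anchors
    )

anchor = {"q_start": 0, "q_end": 1000, "t_start": 0, "t_end": 1000}
aln = {"q_start": 4000, "q_end": 5000, "t_start": 5000, "t_end": 6000}
print(rescue_distance(aln, anchor))  # gaps of 3000 and 4000 give 5000.0
```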
-Y/--min-scaffold-identity - Minimum scaffold identity threshold (defaults to -i value)
--scaffolds-only - Output only scaffold chains for debugging
--output-file <FILE> - Write output to file (auto-detects format from extension)
- .paf - PAF text format
- .1aln - Binary ONE format (more compact)
--paf - Force PAF output format (overrides .1aln default for .1aln inputs)
--1aln - Output .1aln binary format instead of default PAF
-x/--sparsify - Tree sparsification pattern (default: 1.0 = keep all)
- "0.5" - Keep 50% of alignments
- "tree:3" - Tree pattern with depth 3
- "tree:3,2,0.1" - Complex tree pattern
--ani-method - ANI calculation method (default: n100)
- all - Use all bases
- orthogonal - Use orthogonal alignments only
- nX[-sort] - Use top X% (e.g., n50, n90-identity, n100-score)
-t/--threads - Number of threads for parallel processing (default: 8)
--quiet - Quiet mode (no progress output)
--tempdir <DIR> - Temporary directory for intermediate files (defaults to TMPDIR env var, then current directory). Use --tempdir ramdisk for /dev/shm (Linux RAM-backed tmpfs)
--check-fastga - Check FastGA binary locations and exit (diagnostic)
--aligner <ALIGNER> - Aligner for FASTA input: wfmash or fastga (default: wfmash)
-f/--frequency <N> - FastGA k-mer frequency threshold
--all-pairs - Align all genome pairs separately (for many genomes)
--disk-usage - Report disk usage statistics (current, peak, cumulative bytes written)
--batch-bytes <SIZE> - Maximum resource usage per batch (e.g., 10G, 500M). Partitions genomes into batches. FastGA: limits disk (index ~0.1GB + 12 bytes/bp). Wfmash: limits memory (~0.5GB + 20 bytes/bp).
--zstd - Compress k-mer index with zstd for ~2x disk savings and faster I/O
--zstd-level <N> - Zstd compression level 1-19 (default: 3). Higher = smaller files but slower
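The --batch-bytes estimates quoted above work out as in the following Python sketch. The function names are hypothetical, "GB" is assumed to mean 10^9 bytes, and the way SweepGA actually partitions genomes into batches is not modeled.

```python
# (fixed overhead in bytes, bytes per base pair), taken from the estimates above.
# Assumption: 1 GB = 1,000,000,000 bytes.
COSTS = {
    "fastga": (100_000_000, 12),   # disk: ~0.1GB index + 12 bytes/bp
    "wfmash": (500_000_000, 20),   # memory: ~0.5GB + 20 bytes/bp
}

def estimate_bytes(total_bp: int, aligner: str) -> int:
    """Estimated resource usage for a batch of total_bp base pairs."""
    base, per_bp = COSTS[aligner]
    return base + per_bp * total_bp

def max_bp_for_budget(budget_bytes: int, aligner: str) -> int:
    """Largest batch (in bp) whose estimate fits within budget_bytes."""
    base, per_bp = COSTS[aligner]
    return max(0, (budget_bytes - base) // per_bp)

print(estimate_bytes(3_000_000_000, "fastga"))
print(max_bp_for_budget(10_000_000_000, "fastga"))
```

Under these assumptions, a 3 Gbp batch needs roughly 36.1 GB of disk with FastGA, and a 10G budget fits about 825 Mbp per FastGA batch.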
SweepGA can process AGC genome archives directly:
--agc-prefix <PREFIX> - Extract only samples matching this prefix
--agc-samples <LIST> - Extract only these samples (comma-separated or @file)
--agc-queries <LIST> / --agc-targets <LIST> - Query/target samples for asymmetric alignment
--agc-tempdir <DIR> - Temp directory for AGC extraction (defaults to --tempdir)
For controlling which genome pairs are aligned:
--pairs <FILE> - File of sample pairs to align (TSV: query<tab>target)
--list-pairs - List all sample pairs and exit
--max-pairs <N> - Maximum number of pairs to process (0 = unlimited)
--shuffle-pairs / --shuffle-seed <N> - Shuffle pair order for load balancing
--pairs-done <FILE> / --pairs-remaining <FILE> - Resume capability
--pair-start <N> - Start index for pair range (0-based)
--sparsify-pairs <STRATEGY> - Pair sparsification (none, auto, random:<frac>, giant:<prob>, tree:<near>:<far>:<random>)
When scaffolding is enabled (with -j > 0), the filtering process follows these steps:
- Input Processing - Filter by minimum block length, exclude self-mappings
- Pre-scaffold Filter - Apply plane sweep filter to individual mappings (-n, default: 1:1)
  - Removes duplicate/overlapping mappings
  - Keeps best mapping per query-target pair
- Scaffold Creation - Merge nearby alignments into chains using union-find algorithm
  - Alignments within -j gap distance on both query and target are merged
  - Chains shorter than -s minimum length are discarded
- Scaffold Filter - Apply plane sweep filter to scaffold chains (-m, default: many:many)
  - 1:1 mode: Keep single best scaffold per chromosome pair (aggressive)
  - many:many mode: Keep all non-overlapping scaffolds (moderate, default)
- Rescue Phase - Recover alignments near kept scaffolds (-d, default: 0 = disabled)
  - When enabled, alignments within Euclidean distance of scaffold anchors are rescued
  - Works per chromosome pair only
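The Scaffold Creation step can be sketched with a small union-find in Python. This is a simplified illustration, not SweepGA's code: all names are hypothetical, the pairwise comparison is quadratic for clarity, alignments are assumed to come from a single chromosome pair, and measuring chain length as the query span is an assumption.

```python
def find(parent, i):
    """Return the root of i, shortening the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(parent, a, b):
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

def gap(a_start, a_end, b_start, b_end):
    """Gap between two intervals; 0 when they overlap or touch."""
    return max(0, a_start - b_end, b_start - a_end)

def build_scaffolds(alns, max_jump, min_length):
    """alns: (q_start, q_end, t_start, t_end) tuples on one chromosome pair."""
    parent = list(range(len(alns)))
    for i in range(len(alns)):
        for j in range(i + 1, len(alns)):
            a, b = alns[i], alns[j]
            # Merge only when the gap is small on BOTH query and target.
            if (gap(a[0], a[1], b[0], b[1]) <= max_jump
                    and gap(a[2], a[3], b[2], b[3]) <= max_jump):
                union(parent, i, j)
    chains = {}
    for i, aln in enumerate(alns):
        chains.setdefault(find(parent, i), []).append(aln)
    kept = []
    for members in chains.values():
        # Assumption: chain length is the span covered on the query.
        span = max(m[1] for m in members) - min(m[0] for m in members)
        if span >= min_length:
            kept.append(sorted(members))
    return sorted(kept)

alns = [
    (0, 5_000, 0, 5_000),
    (8_000, 14_000, 8_000, 14_000),          # 3 kb from the first: merged
    (100_000, 101_000, 100_000, 101_000),    # isolated and short: discarded
]
print(build_scaffolds(alns, max_jump=10_000, min_length=10_000))
```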
Example effects (514k input alignments):
- Default (-n 1:1 -j 0): 476k alignments (7% reduction from 1:1 plane sweep only)
- Moderate scaffolding (-n 1:1 -j 10k -m many:many -d 20k): 180k alignments (65% reduction)
- Aggressive scaffolding (-n 1:1 -j 10k -m 1:1 -d 20k): 13k alignments (97% reduction)
The plane sweep algorithm operates per query-target chromosome pair:
- Group by chromosome pairs
- Sort alignments by query position
- Score each alignment using --scoring function (default: log(length) × identity)
- Sweep left-to-right, keeping best alignments based on multiplicity:
  - 1:1: Keep single best alignment per position on both query and target
  - 1:∞: Keep best alignment per query position (multiple targets allowed)
  - N:N: Keep all non-overlapping alignments
- Filter alignments with overlap exceeding threshold (-o)
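To make the 1:1 case concrete, here is a Python sketch of a greedy stand-in for the filter. It is not a true sweep-line and not SweepGA's implementation: it takes alignments from one chromosome pair in descending score order and drops any alignment that a better-scoring kept alignment covers by more than the overlap threshold on either axis. All names are hypothetical, and the natural log in the score is an assumption.

```python
import math

def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval a that is covered by interval b."""
    length = a_end - a_start
    if length <= 0:
        return 0.0
    return max(0, min(a_end, b_end) - max(a_start, b_start)) / length

def filter_one_to_one(alns, max_overlap=0.95):
    """alns: dicts with q_start, q_end, t_start, t_end, identity."""
    def score(a):
        # Default scoring: log(length) x identity (natural log assumed).
        return math.log(a["q_end"] - a["q_start"]) * a["identity"]

    kept = []
    for aln in sorted(alns, key=score, reverse=True):
        shadowed = any(
            overlap_fraction(aln["q_start"], aln["q_end"],
                             k["q_start"], k["q_end"]) > max_overlap
            or overlap_fraction(aln["t_start"], aln["t_end"],
                                k["t_start"], k["t_end"]) > max_overlap
            for k in kept
        )
        if not shadowed:
            kept.append(aln)
    return sorted(kept, key=lambda a: a["q_start"])

alns = [
    {"q_start": 0, "q_end": 10_000, "t_start": 0, "t_end": 10_000, "identity": 0.99},
    # Same query region mapped to a second target locus with lower identity.
    {"q_start": 100, "q_end": 9_900, "t_start": 50_000, "t_end": 59_800, "identity": 0.90},
    {"q_start": 20_000, "q_end": 30_000, "t_start": 20_000, "t_end": 30_000, "identity": 0.95},
]
print([a["q_start"] for a in filter_one_to_one(alns)])
```

In this example the second alignment is removed because the first covers its whole query interval with a better score; the other two survive.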
When using PanSN-style sequence names (e.g., genome#haplotype#chromosome):
- Filtering operates per chromosome pair within genome pairs
- Example: SGDref#1#chrI paired with DBVPG6765#1#chrI
- Ensures best alignments are kept for each chromosome pair within each genome pair
- Maintains ~100% coverage for highly similar genomes
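A short Python sketch of splitting PanSN-style names and building a grouping key per chromosome pair. The helper names and the exact key layout are hypothetical; how SweepGA handles names that do not follow the pattern is not covered here.

```python
def parse_pansn(name: str, delimiter: str = "#"):
    """Split 'genome#haplotype#chromosome'; return None if the name does not fit."""
    parts = name.split(delimiter, 2)
    if len(parts) != 3:
        return None
    genome, haplotype, chromosome = parts
    return genome, haplotype, chromosome

def pair_key(query_name: str, target_name: str):
    """Hypothetical grouping key: one group per (query, target) sequence pair."""
    return parse_pansn(query_name), parse_pansn(target_name)

print(parse_pansn("SGDref#1#chrI"))
print(pair_key("SGDref#1#chrI", "DBVPG6765#1#chrI"))
```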
query_name query_len query_start query_end strand target_name target_len target_start target_end matches block_len mapq [tags...]
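The column layout above can be read with a few lines of Python. This is a minimal sketch that assumes tab-separated fields as in standard PAF; the class and function names are hypothetical, and matches / block_len is used only as an approximate identity.

```python
from typing import List, NamedTuple

class PafRecord(NamedTuple):
    query_name: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int
    tags: List[str]

    @property
    def identity(self) -> float:
        """Approximate identity as matches / block_len."""
        return self.matches / self.block_len if self.block_len else 0.0

def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line into a PafRecord; optional tags are kept as strings."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(f)}")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), f[12:])

rec = parse_paf_line("q1\t1000\t0\t500\t+\tt1\t2000\t100\t600\t450\t500\t60\ttp:A:P")
print(rec.query_name, rec.target_name, rec.identity)
```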
- More compact than PAF (typically 50-70% smaller)
- Preserves all alignment information including X records (edit distances)
- Faster to read/write for large files
- Use --paf flag to convert to PAF for visualization
SweepGA: Fast plane sweep filtering for genome alignments https://github.com/pangenome/sweepga
MIT License - see LICENSE file for details