trim.alignments.parallel.sh

Simon Crameri edited this page Apr 3, 2022 · 2 revisions

Description

Trim alignments internally based on alignment completeness, for a batch of alignments in parallel. Optionally, use a sliding window approach to mask individual sequences that appear erroneously aligned.

Usage

trim.alignments.parallel.sh -s <file> -d <directory> -c <numeric fraction> -z <positive integer> \
                            -n <numeric fraction> -S <positive integer> -m <string> -t <positive integer> \
                            -w <positive integer> -h <positive integer> -i -v

Dependencies

# R package:
ape

Arguments

# Required
-s            File with sample names in FIRST column. Header and tab-separation expected. Any additional columns
              are ignored.
-d            Path to directory with raw or end-trimmed alignments. This directory should contain a FASTA file
              for each target region, each with aligned contigs of multiple samples.

# Optional [DEFAULT]
-c    [0.3]   Completeness parameter. Any alignment site with nucleotides in less than this specified fraction of
              aligned sequences is removed.
              Aligned sequences are defined as any row in the alignment with nucleotide data; samples where a locus
              is entirely missing (see -m option) are ignored.
-z     [20]   Window size parameter. Potential mis-assemblies or mis-alignments in each sequence are identified
              using a sliding window approach with this specified window size (in number of bases). 
-S      [1]   Step size parameter. Potential mis-assemblies or mis-alignments in each sequence are identified
              using a sliding window approach with this specified step size (in number of bases). 
-n    [0.5]   Conservation parameter. Entire windows are successively trimmed at contig ends if more than this
              fraction of nucleotides in the conserved part of the window deviate from the alignment consensus. 
              A conserved part of each window is defined as the alignment sites with nucleotides in at least
              20% of samples, and where the frequencies of minor alleles are all below 30% without considering gaps.
              By default, the sliding window approach stops as soon as a window survives the trimming (see -i option).
-m    ['-']   Gap character. This character is interpreted as missing data or a gap when using the -c and -n filters.
-i  [false]   FLAG, if turned on, the sliding window approach is not only applied to contig ends, but extended to
              internal regions of the alignment (no stopping criterion used).
-v  [false]   FLAG, if turned on, the alignment trimming will be visualized as a PDF
              (recommended for few alignments only).
-w     [15]   Width of output PDF file.
-h      [7]   Height of output PDF file.
-t      [4]   Number of parallel threads.
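
The completeness filter (-c) described above can be illustrated with a small stand-alone sketch. This is not the script's own code (the script relies on the R package ape); `kept_sites` is a hypothetical helper, and it assumes that every character other than the gap character counts as a nucleotide. It prints the 1-based alignment positions that would be retained.

```shell
# Hypothetical sketch of the -c filter; NOT taken from trim.alignments.parallel.sh.
# Usage: kept_sites <alignment.fasta> [completeness] [gap character]
kept_sites() {
  fasta="$1"; c="${2:-0.3}"; gap="${3:--}"
  awk -v c="$c" -v gap="$gap" '
    function flush_seq(   i, has_data) {
      if (seq == "") return
      has_data = 0
      for (i = 1; i <= length(seq); i++)
        if (substr(seq, i, 1) != gap) { has_data = 1; break }
      if (has_data) {               # rows that are entirely gaps are ignored
        nrows++
        if (length(seq) > len) len = length(seq)
        for (i = 1; i <= length(seq); i++)
          if (substr(seq, i, 1) != gap) filled[i]++
      }
      seq = ""
    }
    /^>/ { flush_seq(); next }      # a header line starts a new sequence
    { seq = seq $0 }                # sequences may span several lines
    END {
      flush_seq()
      # keep a site if at least fraction c of the non-empty rows have data there
      for (i = 1; i <= len; i++)
        if (nrows > 0 && filled[i] / nrows >= c) printf "%s%d", (n++ ? " " : ""), i
      print ""
    }
  ' "$fasta"
}
```

Because entirely missing rows are dropped before the fraction is computed, adding empty samples to an alignment does not change which sites pass the filter in this sketch.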

Details

Value

This script creates an output directory of the form <inputdirectory>.c${c}.n${n}, where ${c} is the completeness parameter and ${n} is the conservation parameter.
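
As a rough illustration of this naming scheme, the expected output directory can be composed in the shell before a run. `expected_outdir` is a hypothetical helper, and it assumes the parameter values are appended exactly as typed on the command line (for example `0.4` rather than `0.40`); the script's actual number formatting may differ.

```shell
# Hypothetical helper; NOT part of trim.alignments.parallel.sh.
# Usage: expected_outdir <inputdirectory> <c> <n>
expected_outdir() {
  indir="${1%/}"   # strip one trailing slash, if present
  printf '%s.c%s.n%s\n' "$indir" "$2" "$3"
}

# Input directory and parameters as used in the examples on this page.
expected_outdir mafft.63.2396.c0.5.n0.25 0.4 0.5
```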

Examples

# no visualization
trim.alignments.parallel.sh -s mapfile.txt -d mafft.63.2396.c0.5.n0.25 -c 0.4 -z 20 -S 1 -n 0.5 -t 20

# with visualization
trim.alignments.parallel.sh -s mapfile.txt -d mafft.63.2396.c0.5.n0.25 -c 0.4 -z 20 -S 1 -n 0.5 -t 20 -v

# with visualization and internal trimming
trim.alignments.parallel.sh -s mapfile.txt -d mafft.63.2396.c0.5.n0.25 -c 0.4 -z 20 -S 1 -n 0.5 -t 20 -iv
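
Before launching a batch like the ones above, it can help to confirm that the sample file and the alignment directory look as the Arguments section describes. The following is a minimal sketch; `check_inputs` is a hypothetical helper, and it assumes that every regular file in the directory is one FASTA alignment.

```shell
# Hypothetical pre-flight check; NOT part of trim.alignments.parallel.sh.
# Usage: check_inputs mapfile.txt mafft.63.2396.c0.5.n0.25
check_inputs() {
  sfile="$1"; adir="$2"
  [ -f "$sfile" ] || { echo "sample file not found: $sfile" >&2; return 1; }
  [ -d "$adir" ]  || { echo "alignment directory not found: $adir" >&2; return 1; }
  # Sample names: first tab-separated column, header line skipped.
  nsamples=$(awk -F '\t' 'NR > 1 && $1 != "" { n++ } END { print n + 0 }' "$sfile")
  # Alignments: regular files directly inside the directory.
  nfiles=0
  for f in "$adir"/*; do
    if [ -f "$f" ]; then nfiles=$((nfiles + 1)); fi
  done
  echo "samples=$nsamples alignments=$nfiles"
}
```

A sample count of zero usually points to a missing header line or a file that is not tab-separated.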
