Skip to content

filter.visual.coverages.sh

Simon Crameri edited this page Apr 2, 2022 · 5 revisions

Description

Filter samples and loci (target regions) with poor sequence data, taking taxon groups into account and using filtering thresholds informed by comprehensive visualizations.

Usage

filter.visual.coverages.sh -s <file> -t <file> -r <file> -a <numeric fraction> -b <numeric fraction> \
                           -c <positive integer> -d <positive integer> -e <positive integer> \
                           -f <numeric fraction> -p <numeric fraction>

Dependencies

# R packages:
ggplot2
ape
grid
VennDiagramm
tidyr

Arguments

# Required
-s         Path to samples file. Header and tab-separation expected.
          
           Sample IDs must be in the FIRST column. These must match (a subset of) sample names in
           the mapping stats passed via -t.
          
           Group IDs can be specified in the SECOND column (if not specified, all 
           samples are assumed to constitute one group).
          
           The group ID is used to apply region filtering criteria 4-9 within all considered groups,
           to determine regions passing the filtering criteria in all groups.
          
           Samples that do not belong to any specified group (second column empty or 'NA') will be 
           displayed in summary plots but will not be considerd during region filtering. 
          
           Additional columns are ignored.
          
          
-t         Path to coverage statistics. Header and tab-separation expected.

           Sample IDs must be in the FIRST column. Coverage statistics must be in the following 
           columns as defined in filter.visual.coverages.R.
          
           Only coverage statistics of samples passed via -s will be used. A Warning or Stop 
           is issued if there are mismatches.
          

-r         Path to region reference sequences. FASTA format expected. Used to correlate alignment 
           stats with reference sequence lengths and GC content.
          
           Only target regions passed via -t will be considered. A Warning or Stop is issued 
           if there are mismatches.


# Optional [DEFAULT]
           # Absolute thresholds to remove poorly sequenced samples or regions:
- a  [0.3] minimum fraction of regions with at least one mapped read in a sample (filters samples)
- b  [0.3] minimum fraction of samples with at least one mapped read in a region (filters target regions)
 
           # Relative thresholds...
- c  [500] minimum BWA-MEM alignment length
- d   [10] minimum average coverage in the aligned region
- e [1000] maximum average coverage in the aligned region
- f  [0.5] minimum alignment fraction (BWA-MEM alignment length divided by target region length)

           # ...that need to be met in a specified *fraction* of samples in each considered taxon group:
- p  [0.9]  minimum fraction of samples in each taxon group that need to pass filters c-f in order to
           keep a region

Details

The function can be executed several times using different threshold combinations. The output files will be different depending on the choice of filtering thresholds.

Value

The following five files are produced each time the function is executed (assuming coverage_stats.Q10.txt was passed via the -t option, and assuming default filters):

- coverage_stats.Q10-taxa-0.3.txt                          List of kept samples.
- coverage_stats.Q10-regions-0.2-0.7-1-8-1000-0.5-0.9.txt  List of kept target regions.
- coverage_stats.Q10-0.3-0.3-500-10-1000-0.5-0.9.txt       Coverage statistics merged with taxon groups
                                                           and further statistics.
- coverage_stats.Q10-0.3-0.3-500-10-1000-0.5-0.9.pdf       Visualizations.
- coverage_stats.Q10-0.3-0.3-500-10-1000-0.5-0.9.log       Log file documenting the chosen parameters
                                                           and their effects.

Examples

filter.visual.coverages.sh -s samples.mapfile.txt -t coverage_stats.txt -r reference.fasta \
                           -a 0.2 -b 0.4 -c 1 -d 8 -e 1000 -f 0 -p 0.4

Clone this wiki locally