-
Notifications
You must be signed in to change notification settings - Fork 0
filter.visual.coverages.sh
Simon Crameri edited this page Apr 2, 2022
·
5 revisions
Filter samples and loci (target regions) with poor sequence data, taking taxon groups into account and using filtering thresholds informed by comprehensive visualizations.
filter.visual.coverages.sh -s <file> -t <file> -r <file> -a <numeric fraction> -b <numeric fraction> \
-c <positive integer> -d <positive integer> -e <positive integer> \
-f <numeric fraction> -p <numeric fraction>
# R packages:
ggplot2
ape
grid
VennDiagramm
tidyr
# Required
-s Path to samples file. Header and tab-separation expected.
Sample IDs must be in the FIRST column. These must match (a subset of) sample names in
the mapping stats passed via -t.
Group IDs can be specified in the SECOND column (if not specified, all
samples are assumed to constitute one group).
The group ID is used to apply region filtering criteria 4-9 within all considered groups,
to determine regions passing the filtering criteria in all groups.
Samples that do not belong to any specified group (second column empty or 'NA') will be
displayed in summary plots but will not be considerd during region filtering.
Additional columns are ignored.
-t Path to coverage statistics. Header and tab-separation expected.
Sample IDs must be in the FIRST column. Coverage statistics must be in the following
columns as defined in filter.visual.coverages.R.
Only coverage statistics of samples passed via -s will be used. A Warning or Stop
is issued if there are mismatches.
-r Path to region reference sequences. FASTA format expected. Used to correlate alignment
stats with reference sequence lengths and GC content.
Only target regions passed via -t will be considered. A Warning or Stop is issued
if there are mismatches.
# Optional [DEFAULT]
# Absolute thresholds to remove poorly sequenced samples or regions:
- a [0.3] minimum fraction of regions with at least one mapped read in a sample (filters samples)
- b [0.3] minimum fraction of samples with at least one mapped read in a region (filters target regions)
# Relative thresholds...
- c [500] minimum BWA-MEM alignment length
- d [10] minimum average coverage in the aligned region
- e [1000] maximum average coverage in the aligned region
- f [0.5] minimum alignment fraction (BWA-MEM alignment length divided by target region length)
# ...that need to be met in a specified *fraction* of samples in each considered taxon group:
- p [0.9] minimum fraction of samples in each taxon group that need to pass filters c-f in order to
keep a region
The function can be executed several times using different threshold combinations. The output files will be different depending on the choice of filtering thresholds.
The following five files are produced each time the function is executed (assuming coverage_stats.Q10.txt was passed via the -t option, and assuming default filters):
- coverage_stats.Q10-taxa-0.3.txt List of kept samples.
- coverage_stats.Q10-regions-0.2-0.7-1-8-1000-0.5-0.9.txt List of kept target regions.
- coverage_stats.Q10-0.3-0.3-500-10-1000-0.5-0.9.txt Coverage statistics merged with taxon groups
and further statistics.
- coverage_stats.Q10-0.3-0.3-500-10-1000-0.5-0.9.pdf Visualizations.
- coverage_stats.Q10-0.3-0.3-500-10-1000-0.5-0.9.log Log file documenting the chosen parameters
and their effects.
filter.visual.coverages.sh -s samples.mapfile.txt -t coverage_stats.txt -r reference.fasta \
-a 0.2 -b 0.4 -c 1 -d 8 -e 1000 -f 0 -p 0.4
CaptureAl v0.1 Documentation