
filter.visual.coverages.R

Simon Crameri edited this page Mar 25, 2020 · 1 revision

Description

Visualize and filter regions based on coverage statistics, using ggplot2 (Wickham 2016) box-and-violin plots and heatmaps.

Usage

filter.visual.coverages.R mapfile.txt coverage.stats.txt reference.fasta <min.pregion=0.3> <min.ptaxa=0.3> <min.len=500> <min.cov=10> <max.cov=1000> <min.ratio=0.5> <min.frac=0.9>
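As a rough illustration of how the three required paths and seven optional thresholds could be read, here is a minimal R sketch. It assumes the optional values are purely positional, in the order shown in the usage line; `parse_args` is a hypothetical helper and not a function of the script itself.

```r
# Hypothetical helper: split command-line arguments into the three required
# paths and the seven optional numeric thresholds (documented defaults).
# Assumption: optional arguments are positional, in the order of the usage line.
parse_args <- function(args) {
  if (length(args) < 3) {
    stop("at least 3 arguments required: sfile stats refseqs")
  }
  opts <- c(min.pregion = 0.3, min.ptaxa = 0.3, min.len = 500,
            min.cov = 10, max.cov = 1000, min.ratio = 0.5, min.frac = 0.9)
  extra <- args[-(1:3)]
  if (length(extra) > length(opts)) stop("too many arguments")
  if (length(extra) > 0) {
    # Overwrite the leading defaults; element names are preserved.
    opts[seq_along(extra)] <- as.numeric(extra)
  }
  list(sfile = args[1], stats = args[2], refseqs = args[3], opts = opts)
}

# In a script, args would come from commandArgs(trailingOnly = TRUE).
parsed <- parse_args(c("mapfile.txt", "coverage.stats.txt",
                       "reference.fasta", "0.4", "0.2"))
```

Here only `min.pregion` and `min.ptaxa` are overridden; the remaining thresholds keep their defaults.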

Arguments

Required

sfile|CHR path to samples file. Header and tab-separation expected. Sample IDs must be in the first column. Group IDs can be specified in the second column (if not specified, all samples are assumed to constitute one group). The group ID is used to apply region filtering criteria 4-9 within all considered groups, to determine regions passing the filtering criteria in all groups. Samples that do not belong to any specified group (second column empty or 'NA') will be displayed in summary plots but will not be considered during region filtering. Additional columns are ignored.
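The group handling described for the samples file could look roughly like this in R. This is a sketch under assumptions, not the script's own code: `read_samples`, the placeholder group label "all" and the toy file contents are invented for illustration.

```r
# Hypothetical reader for a tab-separated samples file with a header.
# Column 1 holds sample IDs, optional column 2 holds group IDs.
read_samples <- function(path) {
  d <- read.delim(path, header = TRUE, colClasses = "character")
  samples <- data.frame(id = d[[1]], stringsAsFactors = FALSE)
  if (ncol(d) >= 2) {
    grp <- d[[2]]
    grp[grp %in% c("", "NA")] <- NA   # empty or 'NA' means no group
    samples$group <- grp
  } else {
    samples$group <- "all"            # no group column: a single group
  }
  samples
}

# Toy samples file: s4 and s5 do not belong to any group.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID\tGROUP", "s1\tA", "s2\tA", "s3\tB", "s4\tNA", "s5\t"), tmp)
samples <- read_samples(tmp)

# Only grouped samples would take part in region filtering.
considered <- samples[!is.na(samples$group), ]
```

In this toy input, `s4` and `s5` stay in `samples` (so they could still be plotted) but are absent from `considered`.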

stats|CHR path to alignment stats. Header and tab-separation expected. Sample IDs must be in the first column. Alignment stats must be in the following columns as defined in lines 114-122 of this script. Only alignment stats of samples in <sfile> will be read (warns or stops if there is a mismatch).

refseqs|CHR path to region reference sequences. Fasta format expected. Used to correlate alignment stats with reference sequence lengths and GC content. Only regions in <stats> will be considered (warns or stops if there is a mismatch).

Optional

sample filtering criteria (sample quality)

min.pregion|NUM minimum fraction of regions recovered in a sample (i.e., sample has at least 1 mapped read in <min.pregion>*nregions regions) [DEFAULT: 0.3]

region filtering criteria (mapping sensitivity)

min.ptaxa|NUM minimum fraction of samples recovered in a region (i.e., region has at least 1 mapped read in <min.ptaxa>*nsamples samples) [DEFAULT: 0.3]
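The two recovery fractions can be illustrated on a toy matrix of mapped read counts. The matrix and variable names below are made up, and the sketch assumes that <min.ptaxa> is evaluated on the samples that passed <min.pregion>; the documentation does not state the order.

```r
# Toy matrix of mapped read counts: rows are samples, columns are regions.
reads <- matrix(c(5, 0, 2, 0,
                  0, 0, 1, 0,
                  3, 4, 6, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"),
                                c("r1", "r2", "r3", "r4")))
min.pregion <- 0.3
min.ptaxa <- 0.3

# Fraction of regions with at least 1 mapped read, per sample.
pregion <- rowMeans(reads > 0)
keep_samples <- pregion >= min.pregion

# Fraction of retained samples with at least 1 mapped read, per region.
# Assumption: computed after sample filtering.
ptaxa <- colMeans(reads[keep_samples, , drop = FALSE] > 0)
keep_regions <- ptaxa >= min.ptaxa
```

With these counts, `s2` is recovered in only 1 of 4 regions (0.25) and is dropped, and `r4` has no mapped reads in any retained sample and is dropped.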

region filtering criteria (mapping specificity)

min.len|NUM minimum length in .bam [DEFAULT: 500]

min.cov|NUM minimum coverage in .bam [DEFAULT: 10]

max.cov|NUM maximum coverage in .bam [DEFAULT: 1000]

min.ratio|NUM minimum alignment fraction [DEFAULT: 0.5]

min.frac|NUM minimum fraction of samples conforming to the absolute filtering criteria 5-8 (i.e., regions must meet criteria 5-8 in (100*<min.frac>)% of considered samples, separately for each considered group) [DEFAULT: 0.9]
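The interplay between the absolute thresholds and <min.frac> can be sketched as follows. The long-format table and its column names (`len`, `cov`, `ratio`) are hypothetical stand-ins for the alignment stats; the sketch only shows the logic of requiring a region to pass in every considered group.

```r
# Hypothetical per-sample, per-region alignment stats (toy values).
stats <- data.frame(
  sample = rep(c("s1", "s2", "s3", "s4"), each = 2),
  group  = rep(c("A", "A", "B", "B"), each = 2),
  region = rep(c("r1", "r2"), times = 4),
  len    = c(800, 600, 900, 300, 700, 650, 750, 700),
  cov    = c(50, 20, 40, 15, 30, 5, 60, 25),
  ratio  = c(0.9, 0.8, 0.7, 0.6, 0.8, 0.9, 0.6, 0.7),
  stringsAsFactors = FALSE
)
min.len <- 500; min.cov <- 10; max.cov <- 1000
min.ratio <- 0.5; min.frac <- 0.9

# Does each sample meet all absolute criteria in each region?
stats$ok <- stats$len >= min.len & stats$cov >= min.cov &
  stats$cov <= max.cov & stats$ratio >= min.ratio

# Fraction of conforming samples per region (rows) and group (columns).
frac <- tapply(stats$ok, list(stats$region, stats$group), mean)

# A region passes only if it reaches min.frac in every group.
passed <- rownames(frac)[apply(frac >= min.frac, 1, all)]
failed <- setdiff(rownames(frac), passed)
```

In this toy data, `r2` fails because only half of the samples in each group meet the absolute criteria (s2 is too short, s3 has too little coverage).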

Details

  • Heatmaps are produced for ALL, PASSED and FAILED regions.

Value

  • loci_kept-q10-200-8-1000-0.2-0.15.txt with regions that passed filters.
  • loci_rm-q10-200-8-1000-0.2-0.15.txt with regions that failed filters.
  • coverage_stats-q10-500-10-1000-0.5-0.1.log with filtering log.
  • coverage_stats-q10-500-10-1000-0.5-0.1.pdf with visualizations.

Authors

Simon Crameri (ETHZ)

References

  • H. Wickham 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
