Getting alignment statistics with vg filter
The various mappers in vg (giraffe, map) create GAMs which include alignment metadata. In addition to the high-level statistics from vg stats -a, vg filter has a --tsv-out option to write a TSV with information about each individual read in a (possibly filtered subset of a) GAM.
The general syntax for using --tsv-out is:
vg filter --tsv-out FIELD mappings.gam > statistics.tsv
# Separate fields with semicolons & wrap in quotation marks
vg filter --tsv-out "FIELD1;FIELD2" mappings.gam > statistics.tsv
Other vg filter options are still applied. For example, this command outputs name and score only for mapped reads whose names begin with hifi:
vg filter --name-prefix hifi --only-mapped \
--tsv-out "name;score" mappings.gam > statistics.tsv
The output file is a TSV with a header line of column names. The first column name will have a # prefix. Non-header lines have the requested fields for a single read in the GAM. For example, running on the test file in test/surject/perpendicular.gam results in:
$ vg filter --tsv-out "name;score;cigar" ./test/surject/perpendicular.gam
#name score cigar
A00744:46:HV3C3DSXX:2:1503:9887:31485 121 10M1X19M1X10M15D28M1X17M1X63M
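Because the first column name carries a # prefix, scripts that look up columns by name need to strip it first. The following is a minimal sketch of one way to do that with awk, assuming the columns are tab-separated as the TSV format implies. The file name stats_example.tsv and the helper get_column are hypothetical names used only for illustration; the data row is the example output shown above.

```shell
# Recreate the example output above as a tab-separated file
# (stats_example.tsv is a hypothetical file name).
printf '#name\tscore\tcigar\n' > stats_example.tsv
printf 'A00744:46:HV3C3DSXX:2:1503:9887:31485\t121\t10M1X19M1X10M15D28M1X17M1X63M\n' >> stats_example.tsv

# Hypothetical helper: print one column, selected by header name.
get_column() {
    # $1 = column name, $2 = TSV file
    awk -F'\t' -v want="$1" '
        NR == 1 {
            sub(/^#/, "", $1)    # drop the # prefix from the first column name
            for (i = 1; i <= NF; i++) if ($i == want) col = i
            if (!col) { print "no such column: " want > "/dev/stderr"; exit 1 }
            next
        }
        { print $col }
    ' "$2"
}

get_column score stats_example.tsv
```

Selecting by name rather than by position keeps the script working if the order of fields passed to --tsv-out changes.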
Some statistics are pulled directly from the GAM, though not all GAM fields are available. Others are calculated on the fly from the information in the GAM. Statistics pulled from the GAM aren’t recalculated if missing. For example, unless --add-identity is used during vg inject, the resulting GAM won’t have an identity field. Asking vg filter to output the missing identity field will cause an error.
- name: Read name (pulled from GAM)
- score: Alignment score (pulled from GAM)
  - note that several options in vg filter can affect score, such as --rescore, --frac-score, and --substitutions
- correctly_mapped: True if a read was correctly mapped, False otherwise (pulled from GAM)
  - requires a known-truth mapping location, e.g. for simulated reads
- correctness: correct if a read was correctly mapped, off_reference if it was set to have no truth, incorrect otherwise (pulled from GAM)
  - requires a known-truth mapping location, e.g. for simulated reads
- softclip_start: number of bases soft-clipped off the beginning of a read (calculated on the fly)
- softclip_end: number of bases soft-clipped off the end of a read (calculated on the fly)
  - NOT the index of a soft-clip position
- cigar: the read's CIGAR string; X is a mismatch and all Ms are true matches (calculated on the fly)
- nodes: a comma-separated list of node IDs and orientations (e.g. 89607846+,89607845+,89607844+,89607843-,) in the order in which the read's alignment traverses them (pulled from GAM)
- identity: identity score (% of non-clipped read bases which are matches) of mapping (pulled from GAM)
  - will be 0 if read is not mapped
- is_perfect: 1 if an alignment is “perfect” (consisting of only matches and no mismatches, indels, or soft clips), 0 otherwise (calculated on the fly)
- mapping_quality: MQ score (pulled from GAM)
- sequence: read base sequence (pulled from GAM)
- length: length of read sequence (pulled from GAM)
- time_used: time in seconds spent on mapping (pulled from GAM)
- annotation: all annotations (pulled from GAM)
  - typically very large
- annotation.X: value of the X annotation (pulled from GAM)
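Per-read fields such as mapping_quality and is_perfect can be aggregated with standard text tools once they are in a TSV. The sketch below is one possible approach, not part of vg itself. The file per_read.tsv, the helper summarize, and every read name and value in the sample data are made up for illustration; in practice the file would come from a command like vg filter --tsv-out "name;mapping_quality;is_perfect" mappings.gam.

```shell
# Hypothetical data standing in for real vg filter output.
# Read names and values are invented for illustration only.
printf '#name\tmapping_quality\tis_perfect\n' > per_read.tsv
printf 'read1\t60\t1\n' >> per_read.tsv
printf 'read2\t60\t0\n' >> per_read.tsv
printf 'read3\t0\t0\n' >> per_read.tsv
printf 'read4\t30\t1\n' >> per_read.tsv

# Hypothetical helper: read count, mean MQ, and fraction of perfect alignments.
# Assumes column 2 is mapping_quality and column 3 is is_perfect (0 or 1).
summarize() {
    awk -F'\t' '
        NR == 1 { next }                    # skip the header line
        { n++; mq += $2; perfect += $3 }    # accumulate per-read values
        END {
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d mean_mq=%.2f perfect_fraction=%.2f\n", n, mq / n, perfect / n
        }
    ' "$1"
}

summarize per_read.tsv
# prints: reads=4 mean_mq=37.50 perfect_fraction=0.50
```

The column positions are hard-coded here for brevity, so the fields must be requested in the same order as in the comment above.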
Please request additional fields by opening an issue.