Getting alignment statistics with vg filter

Faith Okamoto edited this page Dec 17, 2025 · 7 revisions

The mappers in vg (giraffe, map) produce GAMs that include alignment metadata. In addition to the high-level statistics from vg stats -a, vg filter has a --tsv-out option that writes a TSV with information about each individual read in a GAM, or in a filtered subset of it.

Syntax

The general syntax for using --tsv-out is:

vg filter --tsv-out FIELD mappings.gam > statistics.tsv
# Separate multiple fields with semicolons and quote the list, so the shell does not treat ; as a command separator
vg filter --tsv-out "FIELD1;FIELD2" mappings.gam > statistics.tsv

Other vg filter options are still applied. For example, this command outputs name and score only for mapped reads whose names begin with hifi:

vg filter --name-prefix hifi --only-mapped \
    --tsv-out "name;score" mappings.gam > statistics.tsv

Output example

The output file is a TSV with a header line of column names; the first column name has a # prefix. Each non-header line holds the requested fields for a single read in the GAM. For example, running on the test file test/surject/perpendicular.gam results in:

$ vg filter --tsv-out "name;score;cigar" ./test/surject/perpendicular.gam

#name   score   cigar
A00744:46:HV3C3DSXX:2:1503:9887:31485   121     10M1X19M1X10M15D28M1X17M1X63M
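
Because the header line starts with #, downstream tools need to strip that prefix or skip the line. The following is a minimal sketch, assuming standard POSIX tools (printf, head, sed, awk). The file name example_statistics.tsv and its two reads are made-up stand-ins for real vg filter --tsv-out "name;score" output, and the variable names are illustrative only.

```shell
# Hypothetical sample standing in for real `vg filter --tsv-out "name;score"` output.
printf '#name\tscore\nread1\t121\nread2\t99\n' > example_statistics.tsv

# Recover the column names by dropping the leading '#' from the header line.
columns=$(head -n 1 example_statistics.tsv | sed 's/^#//')

# Skip the header (line 1) and average the score column (column 2).
mean_score=$(awk -F'\t' 'NR > 1 { sum += $2; n++ }
                         END { if (n) printf "%.1f", sum / n }' example_statistics.tsv)

echo "columns: $columns"
echo "mean score: $mean_score"
```

Splitting on tabs only (-F'\t') avoids mis-splitting fields whose values might contain spaces.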

Available fields

Some statistics are pulled directly from the GAM, though not all GAM fields are available. Others are calculated on the fly from the information in the GAM. Statistics pulled from the GAM aren’t recalculated if missing. For example, unless --add-identity is used during vg inject, the resulting GAM won’t have an identity field. Asking vg filter to output the missing identity field will cause an error.

  • name: Read name (pulled from GAM)
  • score: Alignment score (pulled from GAM) - note that several options in vg filter can affect score, such as --rescore, --frac-score, and --substitutions
  • correctly_mapped: True if a read was correctly mapped, False otherwise (pulled from GAM) - requires a known-truth mapping location, e.g. for simulated reads
  • correctness: “correct” if a read was correctly mapped, “off_reference” if it was set to have no truth, “incorrect” otherwise (pulled from GAM) - requires a known-truth mapping location, e.g. for simulated reads
  • softclip_start: number of bases soft-clipped off the beginning of a read (calculated on the fly)
  • softclip_end: number of bases soft-clipped off the end of a read (calculated on the fly) - NOT the index of a soft-clip position
  • cigar: the read's CIGAR string; X is a mismatch and all Ms are true matches (calculated on the fly)
  • nodes: a comma-separated list of node IDs and orientations (e.g. 89607846+,89607845+,89607844+,89607843-,) in the order in which the read's alignment traverses them (pulled from GAM)
  • identity: identity score (% of non-clipped read bases which are matches) of mapping (pulled from GAM) - will be 0 if read is not mapped
  • is_perfect: 1 if an alignment is “perfect” (consisting of only matches and no mismatches, indels, or soft clips), 0 otherwise (calculated on the fly)
  • mapping_quality: MQ score (pulled from GAM)
  • sequence: read base sequence (pulled from GAM)
  • length: length of read sequence (pulled from GAM)
  • time_used: time in seconds spent on mapping (pulled from GAM)
  • annotation: all annotations (pulled from GAM) - typically very large
  • annotation.X: value of the X annotation (pulled from GAM)
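
As an illustration of working with these fields, the sketch below counts perfect alignments from is_perfect and counts the node visits in each read's nodes list, accounting for the trailing comma shown in the nodes example above. It assumes POSIX awk. The file example_fields.tsv, its read names, and its node IDs are invented for the example, and treating an empty nodes value as zero visits is an assumption rather than documented vg behavior.

```shell
# Hypothetical sample standing in for `vg filter --tsv-out "name;is_perfect;nodes"` output.
printf '#name\tis_perfect\tnodes\nread1\t1\t12+,13+,\nread2\t0\t7-,\nread3\t1\t40+,41+,42-,\n' > example_fields.tsv

# Count reads whose is_perfect column is 1 (header line skipped).
perfect_count=$(awk -F'\t' 'NR > 1 && $2 == 1 { n++ } END { print n + 0 }' example_fields.tsv)

# Count node visits per read from the comma-separated nodes column.
node_counts=$(awk -F'\t' 'NR > 1 {
    list = $3
    sub(/,$/, "", list)                               # drop the trailing comma
    n = (list == "") ? 0 : split(list, visits, ",")   # assumption: empty list means 0 visits
    printf "%s=%d\n", $1, n
}' example_fields.tsv)

echo "perfect alignments: $perfect_count"
echo "$node_counts"
```

Each entry in the visits array keeps its orientation suffix (+ or -), so the same split can be reused when orientation matters.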

Please request additional fields by opening an issue.
