Ignore sites with null GTs during de novo rate calculation #810

epiercehoffman · 2025-05-09T17:22:34Z

Updates

analyze_fams.R is intended to exclude sites at which any member of the trio has a null genotype (line 120). However, CollectVidsPerSample excludes null genotypes from each sample's carrier VID list (line 105). When that list is merged with the list of another trio member who is a carrier, the null genotype appears as NA for that sample in analyze_fams.R and is treated thereafter as hom ref (line 133). This is not the intended behavior, and it makes the de novo rate ambiguous, because a null genotype makes no claim about the sample's genotype.
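The failure mode can be illustrated with a minimal Python sketch. This is not the code from analyze_fams.R or CollectVidsPerSample; the trio, the VIDs, and the helper name `carrier_vids_dropping_nulls` are hypothetical, and the set difference stands in for the merge-then-treat-NA-as-hom-ref step.

```python
# Toy genotypes for one hypothetical trio, keyed by variant ID (VID).
genotypes = {
    "proband": {"sv1": "0/1", "sv2": "0/1", "sv3": "0/1"},
    "father":  {"sv1": "0/1", "sv2": "./.", "sv3": "0/0"},
    "mother":  {"sv1": "0/0", "sv2": "0/0", "sv3": "0/0"},
}

REF_GT = "0/0"
NULL_GT = "./."


def carrier_vids_dropping_nulls(sample_gts):
    """Mimic the old behavior: list only non-ref, non-null genotypes."""
    return {vid for vid, gt in sample_gts.items() if gt not in (REF_GT, NULL_GT)}


carriers = {
    sample: carrier_vids_dropping_nulls(gts) for sample, gts in genotypes.items()
}

# A VID missing from a parent's list is implicitly treated as hom ref,
# so the father's null genotype at sv2 cannot be told apart from 0/0.
apparent_de_novo = sorted(
    carriers["proband"] - carriers["father"] - carriers["mother"]
)
print(apparent_de_novo)  # ['sv2', 'sv3']
```

Here sv2 is counted as de novo even though the father's genotype is unknown; only sv3 is supported by two hom-ref parental calls.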

This PR retains the VIDs with null genotypes for each sample so those sites can be discarded for the entire trio as intended during the de novo rate calculation. The null genotypes are removed prior to per-sample QC to maintain the previous behavior. The impact of null genotypes on the de novo rate should be examined further and reported as described in #807.
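The approach described above can be sketched as follows. This is a simplified Python illustration, not the WDL or R code in this PR: the record layout, the function names, and the definition of the rate (the fraction of the proband's eligible variants carried by neither parent) are assumptions made for the example.

```python
NULL_GT = "./."

# Hypothetical per-sample VID lists as (vid, genotype) records,
# with null genotypes retained instead of dropped.
vid_lists = {
    "proband": [("sv1", "0/1"), ("sv2", "0/1"), ("sv3", "0/1")],
    "father":  [("sv1", "0/1"), ("sv2", "./.")],
    "mother":  [],
}


def per_sample_qc_vids(records):
    """Drop null genotypes so per-sample QC sees the same input as before."""
    return [vid for vid, gt in records if gt != NULL_GT]


def de_novo_rate(vid_lists, proband, parents):
    """Assumed definition: proband variants absent from both parents,
    divided by proband variants at sites with no null genotype in the trio."""
    trio = [proband, *parents]
    # Sites where any trio member has a null genotype are discarded entirely.
    null_sites = {
        vid for sample in trio for vid, gt in vid_lists[sample] if gt == NULL_GT
    }
    carriers = {
        sample: {vid for vid, gt in vid_lists[sample] if gt != NULL_GT}
        for sample in trio
    }
    eligible = carriers[proband] - null_sites
    if not eligible:
        return None  # no informative sites for this trio
    parental = set().union(*(carriers[p] for p in parents))
    return len(eligible - parental) / len(eligible)


rate = de_novo_rate(vid_lists, "proband", ["father", "mother"])
print(rate)  # 0.5
```

In this toy trio, sv2 is removed from both the numerator and the denominator because the father's genotype is null, while `per_sample_qc_vids(vid_lists["father"])` still returns only `["sv1"]`.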

Testing

- Validated all WDLs and JSONs with womtool and the Terra validation script.
- Successfully ran MainVcfQc.wdl on the HGDP data (which contains trios):
  - Verified that ./. genotypes were retained in the VID lists and locally tested their removal during per-sample QC.
  - The de novo rate decreases slightly compared to the version on main (from 15.8% to 13.8%) owing to the dropped sites.
  - The SVs per genome were identical when examining the same subset of samples.
  - The cost and runtime were comparable despite the increase in intermediate data.

epiercehoffman · 2025-05-27T15:41:11Z

Updated to also discard null genotypes prior to per-sample benchmarking. Re-tested and verified that the de novo rate is the only output that differs between this branch and main. The following MainVcfQc plots were identical:

SVs per genome (on identical set of samples)
Per-sample benchmarking against HGSVC
Site-level benchmarking against HGSVC and gnomAD
Genotype distribution
Frequency distribution
Size distribution
SV site counts

Also verified that the sample VID lists are only used in per-sample QC, per-family QC, and per-sample benchmarking.

epiercehoffman added 3 commits May 5, 2025 15:07

- keep nulls for fam qc (b565e7e)
- fix column number (6c5590d)
- drop null gts in per-sample benchmarking (7dcbabb)
