@epiercehoffman
Collaborator

Updates

analyze_fams.R is intended to exclude sites at which any member of the trio has a null genotype (line 120). However, CollectVidsPerSample excludes null genotypes from each sample's carrier VID list (line 105). When that list is merged with the list of another trio member who is a carrier, those genotypes appear as NA for that sample in analyze_fams.R and are subsequently treated as hom ref (line 133). This is not the intended behavior, and it makes the de novo rate ambiguous, because a null genotype makes no claim about the sample's actual genotype.

This PR retains the VIDs with null genotypes for each sample so those sites can be discarded for the entire trio as intended during the de novo rate calculation. The null genotypes are removed prior to per-sample QC to maintain the previous behavior. The impact of null genotypes on the de novo rate should be examined further and reported as described in #807.
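The effect of the change on a single trio can be illustrated with a minimal Python sketch. This is not the code in analyze_fams.R or CollectVidsPerSample: the data layout (a per-sample dict mapping VID to alt allele count, with `None` standing in for a null genotype), the function names, and the definition of the de novo rate as the fraction of the proband's variants that are hom ref in both parents are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical layout: sample -> {VID: alt allele count}, where None marks a
# null (./.) genotype. A VID missing from a sample's dict is read as hom ref.
VidLists = Dict[str, Dict[str, Optional[int]]]


def strip_nulls(vid_lists: VidLists) -> VidLists:
    """Mimic the previous behavior: drop null genotypes from every list."""
    return {
        sample: {vid: gt for vid, gt in gts.items() if gt is not None}
        for sample, gts in vid_lists.items()
    }


def de_novo_rate(vid_lists: VidLists, trio: Tuple[str, str, str]) -> float:
    """Fraction of the proband's variants that are hom ref in both parents.

    Sites where any trio member has a null genotype are discarded entirely.
    """
    proband, father, mother = trio
    all_vids = set().union(*(vid_lists[s].keys() for s in trio))
    carrier_sites = 0
    de_novo_sites = 0
    for vid in all_vids:
        # Absent VID -> 0 (hom ref); this is where a stripped null is misread.
        child, dad, mom = (vid_lists[s].get(vid, 0) for s in trio)
        if child is None or dad is None or mom is None:
            continue  # null genotype in the trio: skip the whole site
        if child > 0:
            carrier_sites += 1
            if dad == 0 and mom == 0:
                de_novo_sites += 1
    return de_novo_sites / carrier_sites if carrier_sites else 0.0


# Toy data (made up): the father has a null genotype at sv2.
trio = ("child", "father", "mother")
vid_lists: VidLists = {
    "child": {"sv1": 1, "sv2": 1, "sv3": 1, "sv4": 1},
    "father": {"sv1": 1, "sv2": None},
    "mother": {"sv3": 1},
}

rate_nulls_retained = de_novo_rate(vid_lists, trio)
rate_nulls_stripped = de_novo_rate(strip_nulls(vid_lists), trio)
```

In this toy trio, retaining the null discards sv2 and leaves one de novo call among three proband sites. Stripping the null first makes sv2 look hom ref in the father, so it is counted as a second de novo call among four sites.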

Testing

  • Validated all WDLs and JSONs with womtool and the Terra validation script.
  • Successfully ran MainVcfQc.wdl on the HGDP data (which contains trios).
    • Verified that ./. genotypes were retained in the VID lists, and locally tested their removal during per-sample QC.
    • The de novo rate decreases slightly compared to the version on main (from 15.8% to 13.8%) owing to the dropped sites.
    • The SVs per genome were identical when examining the same subset of samples.
    • The cost and runtime were comparable despite the increase in intermediate data.
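The local check on per-sample QC can be pictured with the short Python sketch below. It assumes, purely for illustration, that a sample's VID list is a sequence of (VID, genotype string) pairs and that a null genotype is written as `./.`. The actual file format and filtering code in the pipeline may differ, and `drop_null_genotypes` is a hypothetical helper, not a function from the repository.

```python
from typing import List, Tuple

# Assumed null encoding for a diploid call; other encodings are not handled.
NULL_GENOTYPE = "./."

VidRecord = Tuple[str, str]  # (VID, genotype string), a hypothetical layout


def drop_null_genotypes(records: List[VidRecord]) -> List[VidRecord]:
    """Return the records whose genotype is not null, preserving order."""
    return [(vid, gt) for vid, gt in records if gt != NULL_GENOTYPE]


# Made-up list in the new style, with the null genotype retained.
retained = [("sv1", "0/1"), ("sv2", "./."), ("sv3", "1/1")]

# Per-sample QC would see only the filtered view, so the number of SVs
# counted for this sample matches a list that never contained the null.
per_sample_qc_input = drop_null_genotypes(retained)
svs_per_genome = len(per_sample_qc_input)
```

Per-family QC would instead read the unfiltered list, so the null at sv2 stays visible when the trio is assembled.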

@epiercehoffman
Collaborator Author

Updated to also discard null genotypes prior to per-sample benchmarking. Re-tested and verified that the de novo rate is the only result that differs between this branch and main. The following MainVcfQc plots were identical:

  • SVs per genome (on identical set of samples)
  • Per-sample benchmarking against HGSVC
  • Site-level benchmarking against HGSVC and gnomAD
  • Genotype distribution
  • Frequency distribution
  • Size distribution
  • SV site counts

Also verified that the sample VID lists are only used in per-sample QC, per-family QC, and per-sample benchmarking.
