@kjaisingh (Collaborator) commented Apr 25, 2025

Description

This PR updates the methods used for callset evaluation. It includes the following changes:

  • Introduce a new WDL for callset evaluations.
  • Add stratification options to SVConcordance.wdl, which the callset evaluations now use.
  • Allow Vapor to run on CRAM files stored in requester-pays buckets.
  • Add an option to modify the SV size limit for Vapor in the long-read comparisons.
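
As a rough illustration of what stratified concordance means here, the sketch below groups evaluated variants by SV type and size bin and reports the fraction in each stratum that has a match in the truth set. The record fields (`svtype`, `svlen`, `matched`), the size bins, and the function names are hypothetical and are not taken from SVConcordance.wdl.

```python
from collections import defaultdict

# Hypothetical size bins in bp; the actual stratification options may differ.
SIZE_BINS = [
    (0, 500, "<500bp"),
    (500, 5000, "500bp-5kb"),
    (5000, float("inf"), ">=5kb"),
]


def size_bin(svlen):
    """Return the label of the size bin that contains |svlen|."""
    for low, high, label in SIZE_BINS:
        if low <= abs(svlen) < high:
            return label
    raise ValueError(f"no size bin for svlen={svlen}")


def stratified_concordance(records):
    """Return {(svtype, size_label): fraction of records with a truth match}."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [matched, total]
    for rec in records:
        key = (rec["svtype"], size_bin(rec["svlen"]))
        counts[key][0] += int(rec["matched"])
        counts[key][1] += 1
    return {key: matched / total for key, (matched, total) in counts.items()}


# Toy input, invented purely for illustration.
example = [
    {"svtype": "DEL", "svlen": 300, "matched": True},
    {"svtype": "DEL", "svlen": 450, "matched": False},
    {"svtype": "DUP", "svlen": 12000, "matched": True},
]
result = stratified_concordance(example)
```

Keeping the stratum key as a plain tuple makes it easy to add or swap stratification fields without changing the counting logic.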

Testing

Pre-Merge Changes Required

  • Create a new callset evaluation WDL.
    • Use the modified MakeGqRecalibratorTrainingSetFromPacBio as a base.
    • Rename MakeGqRecalibratorTrainingSetFromPacBio to use a prefix of AllOfUs, deprecate it, and move it into the internal repository.
    • Modify input parameters to be agnostic to the type of comparison needed: raw calls, joint callset, CNV sites or CNV calls.
    • Add input parameters to optionally do Vapor preprocessing, if Vapor is required.
    • Add input parameters to optionally standardize input VCFs.
    • Remove differentiation between loose and strict modes.
    • Integrate the pre- and post-processing functionality from the various new scripts and WDLs into this WDL.
  • Include a Python notebook for analyzing the results of the callset evaluation WDL, with user-level parameters that govern how the WDL's output is processed.
    • Modify input parameters to be agnostic to the type of comparison needed: raw calls, joint callset, CNV sites or CNV calls.
    • Use the pacbio_support_summary_table for Vapor results instead of the vapor_output_json.
    • Generalize the stratification field definition.
    • Include optional sections that run depending on the comparison mode, e.g. short-read caller concordance, caller & depth only coverage, and subsetting results to samples. Review the archived slides covering the comparisons done previously for further ideas.
  • Add documentation that outlines how to use the long-read evaluations.
  • Decide on the future of the DRAGEN-CNV standardizer (std_dragen_cnv.py). If it is included, add it to the standardizer options and documentation.
  • Remove automated sync of the MakeGqRecalibratorTrainingSetFromPacBio WDL to Dockstore.
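
One possible shape for the notebook's user-level parameters, sketched under assumptions: a single parameter object carries the comparison mode, the stratification fields, and the optional sections to run, and validates them up front. All names here (`NotebookParams`, the mode and section strings, the default `SVTYPE` field) are hypothetical and only mirror the items listed above.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the comparison types named in the checklist.
COMPARISON_MODES = {"raw_calls", "joint_callset", "cnv_sites", "cnv_calls"}

# Hypothetical identifiers for the optional notebook sections.
OPTIONAL_SECTIONS = {
    "short_read_caller_concordance",
    "caller_and_depth_only_coverage",
    "sample_subset",
}


@dataclass
class NotebookParams:
    comparison_mode: str
    # Assumed default; the real stratification field is left to the user.
    stratification_fields: list = field(default_factory=lambda: ["SVTYPE"])
    optional_sections: set = field(default_factory=set)
    sample_subset: list = field(default_factory=list)

    def __post_init__(self):
        # Fail early on typos rather than partway through the analysis.
        if self.comparison_mode not in COMPARISON_MODES:
            raise ValueError(f"unknown comparison mode: {self.comparison_mode}")
        unknown = set(self.optional_sections) - OPTIONAL_SECTIONS
        if unknown:
            raise ValueError(f"unknown optional sections: {sorted(unknown)}")
        if "sample_subset" in self.optional_sections and not self.sample_subset:
            raise ValueError("sample_subset section requires a list of samples")

    def should_run(self, section):
        """Return True if the given optional section was requested."""
        return section in self.optional_sections


params = NotebookParams(
    comparison_mode="joint_callset",
    optional_sections={"short_read_caller_concordance"},
)
```

Each optional notebook cell could then be guarded with `if params.should_run("..."):`, so the same notebook serves every comparison mode.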

@kjaisingh added the methods label and removed the enhancement label on Jul 30, 2025
@kjaisingh changed the title from "Enable standardized callset evaluations in GATK-SV" to "Enable standardized callset evaluations" on Jul 31, 2025
@kjaisingh changed the title from "Enable standardized callset evaluations" to "Provide methods for standardized callset evaluation" on Jul 31, 2025
@kjaisingh changed the title from "Provide methods for standardized callset evaluation" to "Introduce methods for standardized callset evaluation" on Aug 14, 2025