Skip to content

Issues 601, 616: Avoid negative BAF values, update VCF docs#1047

Merged
etal merged 2 commits into
masterfrom
i601_616
Apr 29, 2026
Merged

Issues 601, 616: Avoid negative BAF values, update VCF docs#1047
etal merged 2 commits into
masterfrom
i601_616

Conversation

@etal
Copy link
Copy Markdown
Owner

@etal etal commented Apr 29, 2026

Addresses #601 and #616 --

Ensure negative BAF values aren't produced, even from VCFs that break assumptions/expectations.

Update and clarify documentation around VCF handling.

etal and others added 2 commits April 28, 2026 22:54
When `cnvkit call --purity` rescaled observed BAFs to estimate tumor BAFs,
the formula could produce values outside [0, 1] -- and the negative or >1
results were written straight into the .cns "baf" column. This happens
when the input VCF violates the heterozygous-germline-SNP assumption
(e.g. somatic-only variants) or the purity estimate is too low.

`rescale_baf` now clamps output to [0, 1] and logs a warning with the
count of affected segments, suggesting the likely causes. A small
tolerance avoids spurious warnings for values within float-arithmetic
noise of the boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #616 noted that the VCF, BAF rescaling, and LOH detection parts of
the docs were obscure. Restructure the docs and add load-time sanity
checks that catch the most common misuses before they produce nonsense.

Docs:
* `doc/baf.rst` becomes the canonical guide -- expanded with sections on
  VCF preparation, sample identification (-i/-n, PEDIGREE), BAF
  rescaling math (with equations), allele-specific copy number / LOH,
  and a troubleshooting section that documents the load-time and
  rescale-time warnings.
* `doc/fileformats.rst` and `doc/pipeline.rst` now defer to baf.rst
  rather than duplicating VCF-prep guidance.

Code:
* `cnvlib/cmdutil.py` -- new `_warn_if_baf_input_suspicious` helper
  invoked from `load_het_snps`, warning when no heterozygous variants
  survive filtering or when the median allele frequency is far from 0.5
  (the latter catches misuse like wrong sample IDs or somatic-only
  VCFs lacking a SOMATIC INFO tag).

Tests:
* `test/test_commands.py` adds `LoadHetSnpsTests` with seven cases
  covering empty inputs, skewed/balanced distributions, the small-set
  silence rule, the empty-VCF integration path, the legitimate-VCF
  silence path, and the missing-sample-IDs warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@etal
Copy link
Copy Markdown
Owner Author

etal commented Apr 29, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 29, 2026

Claude finished @etal's task —— View job


Code Review for PR #1047: Issues 601, 616

Reviewing changes for negative BAF fixes and VCF documentation updates.

Review Progress

  • Read and analyze changed files
  • Review core BAF handling logic changes
  • Evaluate documentation updates
  • Check test coverage and implementation
  • Provide comprehensive review feedback

Summary

This PR addresses issues #601 and #616 by implementing proper clamping of BAF values to [0, 1] and adding comprehensive VCF documentation. The changes are well-implemented with excellent test coverage.

Detailed Review

✅ Core Logic Changes (call.py)

Lines 25-28, 565-578: The addition of _BAF_RANGE_TOL = 1e-9 and proper clamping in rescale_baf() effectively solves issue #601:

  • Uses a sensible tolerance (1e-9) to distinguish genuine out-of-range values from floating-point noise
  • Provides clear warning when clamping occurs with helpful diagnostic information
  • The warning message appropriately suggests potential causes (low purity estimate, somatic variants)
  • Clamping logic correctly preserves NaN values using pandas boolean operations

Lines 186-187: Integration with the main pipeline is clean and maintains backward compatibility.

✅ VCF Input Validation (cmdutil.py)

Lines 24-29, 93-125: Excellent addition of input validation for heterozygous SNPs:

  • _BAF_INPUT_MEDIAN_TOL = 0.2 and _BAF_INPUT_MIN_HET = 50 are reasonable thresholds
  • _warn_if_baf_input_suspicious() provides early detection of problematic VCF inputs
  • Warning messages are clear and actionable, guiding users to check sample IDs and variant types
  • Minimum count threshold prevents false alarms on small panels

Lines 82-86: Integration into load_het_snps() workflow is seamless.

✅ Test Coverage

test_commands.py (lines +69): The LoadHetSnpsTests class provides comprehensive coverage:

  • Tests both helper function and integration scenarios
  • Covers edge cases (empty VCFs, skewed distributions, missing sample IDs)
  • Uses appropriate mocking with assertLogs and assertNoLogs

test_properties.py (+50 lines): Outstanding property-based tests for rescale_baf():

  • Tests key invariants (identity, round-trip, clamping)
  • Includes specific regression tests for issue negative BAF values estimating LOH #601 scenarios
  • Uses Hypothesis effectively for comprehensive edge case coverage
  • Tests boundary conditions and NaN handling

✅ Documentation Updates

doc/baf.rst: Significantly improved with clearer structure:

  • New section "Preparing a VCF" with specific, actionable guidance
  • Clear explanation of germline vs. somatic variant requirements
  • Concrete examples of suitable VCF sources
  • Helpful troubleshooting information

doc/fileformats.rst & doc/pipeline.rst: Appropriate cross-reference updates maintain consistency.

Code Quality Assessment

Style & Conventions: ✅

  • Follows project's modern Python conventions (PEP 604 union syntax)
  • Consistent with existing error handling patterns
  • Proper type hints throughout

Performance: ✅

  • Changes are minimal overhead (simple clipping operation)
  • Early validation prevents wasted computation downstream
  • No unnecessary loops or copies

Security: ✅

  • No security implications
  • Proper input validation added

Maintainability: ✅

  • Clear function/variable names (_BAF_RANGE_TOL, _warn_if_baf_input_suspicious)
  • Comprehensive docstrings with examples
  • Well-structured test organization

Suggestions for Improvement

  1. Minor: Consider adding a code example to the rescale_baf() docstring showing the clamping behavior
  2. Enhancement: The warning message could potentially suggest specific purity ranges, but current message is already quite helpful

Clinical Impact Assessment

⚠️ Important: This changeset alters numerical output for cases where BAF would previously be negative. Downstream clinical pipelines should be aware that:

  • Previously negative BAF values will now be clamped to 0.0
  • This is a correctness fix - negative BAF values are nonsensical
  • Results will be more stable and interpretable
  • No changes to file formats (.cnr, .cns, .cnn, SEG, VCF output structure)

Final Assessment

Strong approval recommended. This is a well-crafted fix that:

  • ✅ Solves the reported issues effectively
  • ✅ Includes exceptional test coverage (both unit and property-based)
  • ✅ Provides excellent user guidance through documentation
  • ✅ Maintains backward compatibility while improving correctness
  • ✅ Follows all project conventions and best practices

The implementation demonstrates thoughtful engineering with comprehensive edge case handling and clear communication to users about potential issues.


@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.54%. Comparing base (2f6f3e0) to head (2b72abf).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1047      +/-   ##
==========================================
+ Coverage   66.41%   66.54%   +0.12%     
==========================================
  Files          72       72              
  Lines        7409     7425      +16     
  Branches     1303     1307       +4     
==========================================
+ Hits         4921     4941      +20     
+ Misses       2049     2048       -1     
+ Partials      439      436       -3     
Flag Coverage Δ
unittests 66.54% <100.00%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@etal etal merged commit 085d562 into master Apr 29, 2026
15 checks passed
@etal etal deleted the i601_616 branch April 29, 2026 07:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant