Add depth command for computing coverage depth across all sequences #125
Open
unavailable-2374 wants to merge 8 commits into pangenome:main from
Force-pushed from 03643d4 to f31996b
The depth command iterates through all sequences as references, calculates coverage depth (unique sample count) for each position using a sweep-line algorithm, and outputs a TSV table with window_id, depth, and sample columns.

Key features:
- Two-phase processing: parallel overlap queries + sequential deduplication
- Union approach for A-B/B-A alignment asymmetry (merges all intervals per sample)
- Groups by PanSN sample name (sample#haplotype#chr -> sample)
- Supports transitive queries, custom reference ordering, and adjacent window merging
- Each position is output only once via ProcessedRegions tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
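The sweep-line idea described in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: the function names, the `(name, start, end)` tuple layout, and the half-open coordinate convention are all assumptions made for the example. Intervals are grouped by PanSN sample, merged per sample (the union approach), and then swept so that each sample contributes at most 1 to the depth at any position.

```rust
use std::collections::BTreeMap;

/// Extract the sample from a PanSN name (`sample#haplotype#contig`).
/// Names without `#` are returned unchanged.
fn pansn_sample(name: &str) -> &str {
    name.split('#').next().unwrap_or(name)
}

/// Merge overlapping or touching half-open intervals (hypothetical helper).
fn merge_intervals(mut iv: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    iv.sort_unstable();
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (s, e) in iv {
        if s >= e {
            continue; // skip empty intervals
        }
        match out.last_mut() {
            Some(last) if s <= last.1 => last.1 = last.1.max(e),
            _ => out.push((s, e)),
        }
    }
    out
}

/// Return `(start, end, depth)` segments, where depth is the number of
/// distinct samples covering the segment. Input tuples are
/// `(pansn_name, start, end)` on the anchor sequence (assumed layout).
fn sweep_line_depth(alignments: &[(&str, u64, u64)]) -> Vec<(u64, u64, usize)> {
    // Union step: collect and merge all intervals belonging to one sample.
    let mut per_sample: BTreeMap<&str, Vec<(u64, u64)>> = BTreeMap::new();
    for &(name, s, e) in alignments {
        per_sample.entry(pansn_sample(name)).or_default().push((s, e));
    }

    // Event step: +1 where a sample starts covering, -1 where it stops.
    let mut events: BTreeMap<u64, i64> = BTreeMap::new();
    for iv in per_sample.into_values() {
        for (s, e) in merge_intervals(iv) {
            *events.entry(s).or_insert(0) += 1;
            *events.entry(e).or_insert(0) -= 1;
        }
    }

    // Sweep step: walk positions in order, emitting one segment per depth run.
    let mut out: Vec<(u64, u64, usize)> = Vec::new();
    let mut depth: i64 = 0;
    let mut prev: Option<u64> = None;
    for (pos, delta) in events {
        if let Some(p) = prev {
            if depth > 0 && pos > p {
                let d = depth as usize;
                match out.last_mut() {
                    // Extend the previous segment if depth did not change.
                    Some(last) if last.1 == p && last.2 == d => last.1 = pos,
                    _ => out.push((p, pos, d)),
                }
            }
        }
        depth += delta;
        prev = Some(pos);
    }
    out
}
```

With two haplotypes of one sample overlapping a second sample, the two haplotypes count once: `sweep_line_depth(&[("HG002#1#chr1", 0, 10), ("HG002#2#chr1", 5, 15), ("HG003#1#chr1", 8, 20)])` yields segments `(0, 8, 1)`, `(8, 15, 2)`, `(15, 20, 1)`.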
Add the new subset_filter parameter (as None) to query_transitive_dfs and query_transitive_bfs calls to match updated method signatures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ptimization

Major enhancements to the depth command:
- Add --stats mode for global depth statistics across all sequences
- Add --combined-output for merged BED output with sample lists
- Add sample filtering via --samples and --samples-file options
- Add --memory-efficient mode using compressed bitmaps (~50x less memory)
- Add --fai-list for filling uncovered regions with depth=1
- Add --ref option for targeted mode (single reference sample)
- Add --merge-tolerance for combining adjacent intervals
- Add region query support via -r/--target-range and -b/--target-bed

Technical changes:
- Implement SampleFilter for flexible sample inclusion/exclusion
- Add parallel computation with configurable thread pool
- Optimize memory usage for TB-scale datasets
- Support both per-sample and combined output modes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
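As an illustration of what filling uncovered regions with depth=1 could look like, here is a small sketch. The function name `fill_uncovered` and the `(start, end, depth)` segment layout are hypothetical, and the sequence length is assumed to come from the `.fai` index. The sketch assumes the input segments are sorted and non-overlapping; every stretch of the sequence that no segment covers is emitted with depth 1, on the reading that only the sequence itself covers it.

```rust
/// Fill gaps between covered segments with depth-1 segments so the output
/// spans the whole sequence `[0, seq_len)`.
///
/// Assumes `segments` are half-open, sorted by start, and non-overlapping.
fn fill_uncovered(seq_len: u64, segments: &[(u64, u64, usize)]) -> Vec<(u64, u64, usize)> {
    let mut out = Vec::with_capacity(segments.len() * 2 + 1);
    let mut cursor = 0u64; // first position not yet emitted
    for &(s, e, d) in segments {
        if s > cursor {
            // Gap before this segment: only the sequence itself covers it.
            out.push((cursor, s, 1));
        }
        out.push((s, e, d));
        cursor = cursor.max(e);
    }
    if cursor < seq_len {
        // Trailing gap up to the end of the sequence.
        out.push((cursor, seq_len, 1));
    }
    out
}
```

For a 100 bp sequence with covered segments `(10, 20, 3)`, `(20, 30, 2)`, and `(50, 60, 4)`, the gaps `(0, 10)`, `(30, 50)`, and `(60, 100)` are added with depth 1.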
Force-pushed from 638c8e8 to 321d99b
Major improvements to depth command:

1. New --windowed mode (compute_depth_windowed_v2):
   - Parallel batch processing of sequences (100 seqs/batch)
   - Sparse sample storage (Vec<(u16,u32,i64,i64)> vs Vec<Option>)
   - Streaming output per sequence (reduced memory)
   - Numeric sorting by seq_id instead of string comparison

2. Output format changes:
   - Column 1: ID (row number) instead of seq_name
   - Anchor sample column shows anchor sequence coordinates

3. Self-alignment filtering:
   - Default: filter out same-sample transitive alignments
   - Use --include-self-alignments to include intra-genome duplications

4. Merge tolerance for windowed mode:
   - --merge-tolerance now works with --windowed (default: 0.05)
   - Adjacent intervals with depth diff <= tolerance are merged

Performance: ~10x memory reduction, full CPU utilization with -t flag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
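The merge-tolerance rule can be sketched as below. The commit message does not say whether the tolerance is absolute or relative, nor which depth a merged interval keeps; this example assumes a relative difference (which would fit a default of 0.05) and a length-weighted mean depth. The `Window` struct and `merge_adjacent` function are hypothetical names, not taken from the PR.

```rust
/// One output window with a (possibly averaged) depth. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
struct Window {
    start: u64,
    end: u64,
    depth: f64,
}

/// Merge adjacent windows whose depths differ by at most `tolerance`,
/// interpreted here as a relative difference (an assumption).
/// The merged window keeps the length-weighted mean depth (also an assumption).
fn merge_adjacent(windows: &[Window], tolerance: f64) -> Vec<Window> {
    let mut out: Vec<Window> = Vec::new();
    for w in windows {
        if let Some(last) = out.last_mut() {
            let denom = last.depth.max(w.depth);
            let rel = if denom > 0.0 {
                (last.depth - w.depth).abs() / denom
            } else {
                0.0 // both depths are zero, so they are identical
            };
            let len_a = (last.end - last.start) as f64;
            let len_b = (w.end - w.start) as f64;
            if last.end == w.start && rel <= tolerance && len_a + len_b > 0.0 {
                last.depth = (last.depth * len_a + w.depth * len_b) / (len_a + len_b);
                last.end = w.end;
                continue;
            }
        }
        out.push(w.clone());
    }
    out
}
```

With a tolerance of 0.05, windows of depth 100 and 98 merge into one window of depth 99, while a following window of depth 50 stays separate.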
Major improvements to depth computation correctness and performance:

- Add jemalloc allocator to fix glibc malloc fragmentation with 128 threads
- Add query_raw_overlapping() for O(n+k) range queries instead of full interval loading
- Add clear_sub_index_cache() to bound memory for per-file indexing mode
- Two-phase processing: hub sequences (Phase 1) complete before leaves (Phase 2)
  - With --ref: ref sequences are Phase 1
  - Without --ref: auto-detect hubs via alignment degree pre-scan
- Phase 1 uses chunk-level parallelism (5MB chunks) for transitive mode
- Degree-based sorting ensures high-connectivity sequences always anchor first
- Fixes incorrect depth in star topologies when hub wasn't processed first

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
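The degree-based ordering can be illustrated with a short sketch. It assumes that "alignment degree" means the number of alignment records a sequence takes part in, and that sequences are identified by numeric IDs; both are assumptions, and `alignment_degrees` and `hub_first_order` are hypothetical names rather than the PR's `compute_alignment_degrees()`.

```rust
use std::collections::HashMap;

/// Count, for every sequence ID, how many alignment records involve it.
/// Each record is a `(query_id, target_id)` pair (assumed layout).
fn alignment_degrees(pairs: &[(u32, u32)]) -> HashMap<u32, usize> {
    let mut deg: HashMap<u32, usize> = HashMap::new();
    for &(q, t) in pairs {
        *deg.entry(q).or_insert(0) += 1;
        if t != q {
            // Count a self-alignment only once.
            *deg.entry(t).or_insert(0) += 1;
        }
    }
    deg
}

/// Order sequences so that the most connected ones (hubs) come first.
/// Ties are broken by ascending ID to keep the order deterministic.
fn hub_first_order(pairs: &[(u32, u32)]) -> Vec<u32> {
    let deg = alignment_degrees(pairs);
    let mut ids: Vec<u32> = deg.keys().copied().collect();
    ids.sort_unstable_by(|a, b| deg[b].cmp(&deg[a]).then(a.cmp(b)));
    ids
}
```

In a star topology where sequence 0 aligns to sequences 1, 2, and 3, sequence 0 has the highest degree and is placed first, so it anchors before any of its leaves.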
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Restructure help headings: "Mode selection" for --ref/--ref-only/-r/-b, "Statistics add-on" for --stats/--combined-output
- Update CLI descriptions to reflect 3 modes: ref-anchored, ref-only, region query
- Update log messages: "Ref-anchored mode" / "Ref-only mode" / "region query mode"
- Update compute_depth_global doc comments to match hub-first design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Default transitive depth now uses raw-interval BFS with linear interpolation instead of CIGAR BFS, avoiding chunk boundary gaps and improving performance. CIGAR BFS retained via --use-BFS flag.
- Remove --approximate flag (redundant with raw BFS default).
- Fix depth counting: depth now consistently counts total unique samples covering each position (including the anchor sample). Previously anchor was excluded from aligned regions but included in gap regions, creating an off-by-one inconsistency.
- sweep_line_depth() is now a pure sweep-line with no anchor special-casing; callers add a synthetic anchor alignment.
- Fix 200+GB RAM during pre-scan by disabling tree cache before compute_alignment_degrees() and clearing sub-index cache after.
- Fix BFS exploration using full alignment extents instead of linearly-interpolated clipped coordinates to avoid missing alignments at hop boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
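Linear interpolation over a raw alignment interval can be sketched as follows. This is only an illustration of the general idea: the function name `interpolate_clip`, the tuple layout, and the restriction to forward-strand alignments are assumptions. The alignment is treated as a straight line from the anchor interval to the target interval, so a clipped anchor range is projected proportionally instead of walking the CIGAR.

```rust
/// Clip an alignment to a query range on the anchor and project the clipped
/// range onto the target by linear interpolation.
///
/// `anchor` and `target` are the half-open extents of one alignment;
/// `clip` is the half-open range of interest on the anchor.
/// Returns `(clipped_anchor, projected_target)`, or `None` if there is no
/// overlap. Forward strand only (an assumption of this sketch).
fn interpolate_clip(
    anchor: (i64, i64),
    target: (i64, i64),
    clip: (i64, i64),
) -> Option<((i64, i64), (i64, i64))> {
    let (a_start, a_end) = anchor;
    let (t_start, t_end) = target;
    let c_start = clip.0.max(a_start);
    let c_end = clip.1.min(a_end);
    if a_end <= a_start || c_start >= c_end {
        return None; // empty alignment or no overlap with the clip range
    }
    let a_len = (a_end - a_start) as f64;
    let t_len = (t_end - t_start) as f64;
    // Map an anchor position to the target by its fractional offset.
    let project = |p: i64| t_start + (((p - a_start) as f64 / a_len) * t_len).round() as i64;
    Some(((c_start, c_end), (project(c_start), project(c_end))))
}
```

For an alignment from anchor `[100, 200)` to target `[1000, 1200)`, clipping to `[150, 300)` keeps anchor `[150, 200)` and projects it to target `[1100, 1200)`. The last bullet above notes that BFS exploration itself uses the full alignment extents rather than these clipped coordinates, so a projection like this would apply to the reported intervals, not to the choice of what to explore next.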
The depth command iterates through all sequences as references, calculates coverage depth (unique sample count) for each position using a sweep-line algorithm, and outputs a TSV table with window_id, depth, and sample columns.
Key features: