
sldsc_postprocessing_pipeline: align sd_annot names to polyfun .results categories (covers --snp-list mode) #488

Merged
gaow merged 2 commits into StatFunGen:main from al4225:main
May 9, 2026

@al4225 al4225 commented May 8, 2026

Summary

sldsc_postprocessing_pipeline() matches the names of sd_annot_full (taken from the .annot.gz columns) against the Category column of polyfun's .results file. The previous code only worked when polyfun preserved the original annot column name. This PR makes the matching robust to all four pipeline configurations (single or joint annotation × with or without --snp-list).

Background — the naming asymmetry

Polyfun appends _<file_idx> to each LD score column name when writing .results.Category; the target annotation file has file_idx=0. The LD score column name itself depends on which polyfun script wrote the file, and the new pipeline picks the script based on --snp-list:

| Branch | polyfun script | LD score col | .results target |
|---|---|---|---|
| no --snp-list, single | compute_ldscores.py | preserved ANNOT | ANNOT_0 |
| --snp-list, single | ldsc.py --l2 | hardcoded L2 (ldsc.py:317) | L2_0 |
| no --snp-list, joint (N) | compute_ldscores.py | preserved A1, A2, … | A1_0, A2_0, … |
| --snp-list, joint (N) | ldsc.py --l2 | <annot>L2 per col | A1L2_0, A2L2_0, … |

compute_sldsc_annot_sd() reads only .annot.gz, so the names it returns never reflect polyfun's renaming. Without alignment, intersect() was empty whenever --snp-list was used, which left target_categories empty and caused a cryptic downstream failure.
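The naming rule in the table can be sketched in a few lines. This is a Python illustration only (the actual change lives in R/sldsc_wrapper.R), and the helper names expected_target_categories and stage1_keys are hypothetical. It assumes the four branches behave exactly as listed above.

```python
def expected_target_categories(annot_cols, snp_list):
    """Predict the target Category names in .results, following the table above.

    annot_cols: column names from the .annot.gz file (assumed input).
    snp_list:   True if the pipeline ran with --snp-list.
    """
    if not snp_list:
        # compute_ldscores.py path: annot column names are preserved.
        ld_cols = list(annot_cols)
    elif len(annot_cols) == 1:
        # ldsc.py --l2 path, single annotation: column is hardcoded to "L2".
        ld_cols = ["L2"]
    else:
        # ldsc.py --l2 path, joint annotations: "<annot>L2" per column.
        ld_cols = [col + "L2" for col in annot_cols]
    # The target annotation file has file_idx = 0, so "_0" is appended.
    return [col + "_0" for col in ld_cols]


def stage1_keys(annot_cols):
    """Keys obtained by only appending "_0" to the .annot.gz names."""
    return [col + "_0" for col in annot_cols]


# With --snp-list the suffixed names never overlap the real categories.
overlap = set(stage1_keys(["A1", "A2"])) & set(
    expected_target_categories(["A1", "A2"], snp_list=True)
)
print(overlap)  # prints set()
```

The empty overlap in the last lines is the failure mode described above: suffixing alone cannot recover the names written by the ldsc.py --l2 path.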

Fix

Two-stage match in sldsc_postprocessing_pipeline:

  1. Append "_0" (via paste0()) to the names of sd_annot_full / is_binary_full and intersect() them with .results.Category (covers no-snp-list).
  2. If intersect() is empty, take the first length(sd_annot_full) rows of .results.Category as targets and rename sd_annot_full / is_binary_full positionally (covers snp-list). Polyfun puts file_idx=0 rows first in .results, so positional alignment is safe across all 4 branches.
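A minimal sketch of the two-stage match, written in Python for illustration rather than copied from the R implementation. The function name match_target_categories, the dict-based representation of sd_annot_full, and the baseline category names in the usage lines are all assumptions.

```python
def match_target_categories(sd_annot, results_categories):
    """Align annotation SD names with .results categories in two stages.

    sd_annot:           dict mapping .annot.gz column name -> SD, in column order.
    results_categories: Category values from .results, in file order.
    Returns (target_categories, renamed_sd_annot, used_fallback).
    """
    # Stage 1: append "_0" and look for a direct name match.
    suffixed = {name + "_0": sd for name, sd in sd_annot.items()}
    targets = [cat for cat in results_categories if cat in suffixed]
    if targets:
        return targets, suffixed, False

    # Stage 2: positional fallback. Assumes the target rows (file_idx = 0)
    # come first in .results, as stated in the PR description.
    n_target = len(sd_annot)
    if n_target > len(results_categories):
        raise ValueError("fewer .results rows than target annotations")
    targets = list(results_categories[:n_target])
    renamed = dict(zip(targets, sd_annot.values()))
    return targets, renamed, True


# No --snp-list: stage 1 matches directly ("base_1" is a made-up baseline row).
print(match_target_categories({"ANNOT": 0.3}, ["ANNOT_0", "base_1"]))
# --snp-list, joint: stage 2 renames by position ("baseL2_1" is made up).
print(match_target_categories({"A1": 0.1, "A2": 0.2},
                              ["A1L2_0", "A2L2_0", "baseL2_1"]))
```

Returning the used_fallback flag lets a caller log the old and new names when stage 2 fires, in the spirit of the INFO message the commit describes.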

Why this approach

  • Pipeline-side renaming would reproduce ldsc.py's hardcoded L2/<annot>L2 rule — brittle.
  • A use_snp_list flag from pipeline → pecotmr would couple the two repos.
  • Trusting polyfun's row ordering (target before baseline because file_idx=0 < 1) needs no flag and works for any future polyfun script with the same convention.
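Because the positional fallback relies on the row-ordering convention, a caller could guard it with a cheap check. The sketch below is a hypothetical Python helper, not part of this PR. It assumes every Category ends in _<file_idx> and that the target rows are exactly the ones ending in _0.

```python
def first_n_are_target(results_categories, n_target):
    """Check that exactly the first n_target categories carry the "_0" suffix.

    Returns False if there are too few rows, if any leading row lacks "_0",
    or if a "_0" row appears after the leading block.
    """
    head = results_categories[:n_target]
    tail = results_categories[n_target:]
    return (
        len(head) == n_target
        and all(cat.endswith("_0") for cat in head)
        and not any(cat.endswith("_0") for cat in tail)
    )


# Category names below are made up for illustration.
print(first_n_are_target(["A1L2_0", "A2L2_0", "baseL2_1"], 2))  # prints True
print(first_n_are_target(["baseL2_1", "L2_0"], 1))              # prints False
```

A check of this kind would turn a silent misalignment into an explicit error if a future polyfun script wrote rows in a different order.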

Validation

  • Standalone match test on real .results: ADSP allm (ANNOT_0, stage 1) and 1000G allm_snplist (L2_0, stage 2) both resolve correctly.
  • MWE end-to-end (validate_pecotmr_fix.sh): COMPLETED 0:0 / 9:41 — stage 1 path unchanged.
  • Production (1000G allm_snplist + m50_snplist, 6 contexts × 96 traits): all 6 produced <ctx>.sldsc_postprocess.rds (per_trait[96] + meta{tau_star, enrichment, enrichstat}) and 5 <group>.meta.rds per context — stage 2 confirmed at scale.

Test plan

  • Reproduce MWE: bash xqtl-protocol/test/scripts/validate_pecotmr_fix.sh
  • Run on a --snp-list .results (target = L2_0) and confirm fallback message in logs
  • Run on a no-snp-list .results (target = ANNOT_0) and confirm no fallback
  • Joint smoke: 2-col annot + --snp-list, verify target_categories = <a>L2_0, <b>L2_0

Files

  • R/sldsc_wrapper.R (+32 lines, single function modified)

🤖 Generated with Claude Code

al4225 and others added 2 commits May 8, 2026 10:13
…ts categories

Polyfun appends "_<file_idx>" to LD score column names when writing .results
Category, where the target annotation is file_idx=0. Two cases now handled:

1. compute_ldscores.py path (no --snp-list): preserves .annot.gz column names,
   so .results target Category = "<annot_col>_0". Add paste0("_0") to
   sd_annot_full / is_binary_full names so intersect() with .results categories
   matches.

2. ldsc.py --l2 path (with --snp-list): hardcodes LD score col to "L2" (single)
   or "<annot_col>L2" (joint), so paste0("_0") on .annot.gz names gives the
   wrong key. Fall back to positional rename: take the first
   length(sd_annot_full) rows of .results Category as target_categories
   (polyfun puts target before baseline because file_idx=0 < 1), rename
   sd_annot_full / is_binary_full to those names, and emit an INFO message
   with old/new names and the baseline count for traceability.

Without this, postprocess silently produced empty target_categories whenever
the pipeline ran with --snp-list, breaking downstream meta-analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gaow gaow merged commit 326fc90 into StatFunGen:main May 9, 2026
5 checks passed