sldsc_enrichment: fix snp-list mode + meta_subset tau_star handling #1319

Merged: gaow merged 2 commits into StatFunGen:main from al4225:main on May 10, 2026

@al4225 al4225 commented May 8, 2026

Summary

Two independent fixes to code/enrichment/sldsc_enrichment.ipynb to make the new pipeline runnable at production scale:

  1. [make_annotation_files_ldscore]: reshape annot dataframe so ldsc.py --print-snps accepts it (snp-list mode).
  2. [meta_subset]: split tau_star output into tau_star_single / tau_star_joint and project per-trait summaries through the view helper, so meta_sldsc_random finds the bare column names it expects.

Fix 1 — [make_annotation_files_ldscore] snp-list mode (commit d89e3918)

When --snp-list is set, Step C invokes polyfun/ldsc.py --l2 --print-snps (instead of compute_ldscores.py). ldsc.py reads columns strictly by position: cols 0..3 as CHR/BP/SNP/CM and cols 4+ as numeric annotations. It also requires the .annot SNP set to equal the .bim SNP set in identical row order. The previous Step A output violated both constraints:

| Constraint | Old behavior | Symptom |
| --- | --- | --- |
| Cols 4+ must be numeric | Wrote A1/A2/MAF as cols 4-6 | TypeError: can't multiply sequence by non-int of type 'float' (1000G) |
| .annot rows == .bim rows | Used merged-input row set | ValueError: shapes (634887,) (1698778,) not broadcastable (ADSP) |
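
The two constraints above can be illustrated with a small preflight check. This is a minimal sketch in plain Python, not code from the notebook or from ldsc.py: the function name `check_annot_against_bim`, the toy rows, and the message strings are all hypothetical.

```python
def check_annot_against_bim(header, rows, bim_snps):
    """Hypothetical preflight check; not part of ldsc.py or the notebook.

    header   : list of .annot column names
    rows     : list of rows, each a list of strings
    bim_snps : SNP IDs in .bim row order
    """
    problems = []
    if header[:4] != ["CHR", "BP", "SNP", "CM"]:
        problems.append("first four columns are not CHR/BP/SNP/CM")
    # Constraint 1: every column from position 4 onward must be numeric.
    for j in range(4, len(header)):
        try:
            for row in rows:
                float(row[j])
        except ValueError:
            problems.append(f"column {header[j]} is not numeric")
    # Constraint 2: same SNPs as the .bim, in the same order.
    snp_col = header.index("SNP")
    annot_snps = [row[snp_col] for row in rows]
    if annot_snps != list(bim_snps):
        problems.append(
            f"SNP rows differ from .bim ({len(annot_snps)} vs {len(bim_snps)})"
        )
    return problems


# Toy data only: the old layout (A1/A2/MAF in cols 4-6, merged-input rows).
bim_snps = ["rs1", "rs2", "rs3"]
old_header = ["CHR", "BP", "SNP", "CM", "A1", "A2", "MAF", "annot1"]
old_rows = [
    ["1", "1000", "rs1", "0", "A", "G", "0.12", "1"],
    ["1", "3000", "rs3", "0", "C", "T", "0.30", "0"],
]
old_problems = check_annot_against_bim(old_header, old_rows, bim_snps)

# Toy data only: the layout the fix is meant to produce.
new_header = ["CHR", "BP", "SNP", "CM", "annot1"]
new_rows = [
    ["1", "1000", "rs1", "0", "1"],
    ["1", "2000", "rs2", "0", "0"],
    ["1", "3000", "rs3", "0", "0"],
]
new_problems = check_annot_against_bim(new_header, new_rows, bim_snps)
```

On the toy old-layout input the check flags the A1 and A2 columns and the row mismatch; on the normalized layout it reports nothing.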

Add a small normalize_for_ldsc() helper, applied to both single and joint annot dataframes before fwrite when use_print_snps is true:

  • drops A1/A2/MAF/CM (CM is re-sourced),
  • left-joins to .bim SNP set, fills 0 for missing SNPs,
  • takes CM from .bim (authoritative; ADSP .bim has CM=0, 1000G has real cM),
  • reorders to CHR-BP-SNP-CM-<annot…> matching .bim row order.

No-op when --snp-list is not set (compute_ldscores.py path is unchanged).
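
The normalization steps listed above can be sketched as follows. This is an illustrative Python version using plain dicts, not the notebook's actual normalize_for_ldsc() helper, whose source is not reproduced in this description. The argument names and the choice to take CHR and BP from the .bim (needed here so that SNPs absent from the annotation still get coordinates) are assumptions.

```python
def normalize_for_ldsc(annot_rows, bim_rows, annot_cols):
    """Illustrative sketch only; argument names are hypothetical.

    annot_rows : list of dicts keyed by column name (must include "SNP")
    bim_rows   : list of dicts with CHR, BP, SNP, CM, in .bim row order
    annot_cols : names of the numeric annotation columns to keep
    """
    by_snp = {row["SNP"]: row for row in annot_rows}
    normalized = []
    for bim in bim_rows:  # iterating the .bim fixes both row set and order
        annot = by_snp.get(bim["SNP"], {})
        # CM comes from the .bim; A1/A2/MAF are dropped by never copying them.
        # Assumption: CHR and BP are also read from the .bim.
        row = {"CHR": bim["CHR"], "BP": bim["BP"],
               "SNP": bim["SNP"], "CM": bim["CM"]}
        for col in annot_cols:
            row[col] = annot.get(col, 0)  # fill 0 for SNPs missing from annot
        normalized.append(row)
    return normalized


# Toy data only.
bim_rows = [
    {"CHR": 1, "BP": 1000, "SNP": "rs1", "CM": 0.0},
    {"CHR": 1, "BP": 2000, "SNP": "rs2", "CM": 0.0},
    {"CHR": 1, "BP": 3000, "SNP": "rs3", "CM": 0.0},
]
annot_rows = [
    {"CHR": 1, "BP": 3000, "SNP": "rs3", "CM": 0.5,
     "A1": "C", "A2": "T", "MAF": 0.30, "annot1": 1},
    {"CHR": 1, "BP": 9000, "SNP": "rs9", "CM": 0.7,
     "A1": "A", "A2": "G", "MAF": 0.10, "annot1": 1},
]
normalized = normalize_for_ldsc(annot_rows, bim_rows, ["annot1"])
```

Driving the loop from the .bim rows rather than from the annotation rows is what guarantees the output has exactly the .bim SNP set in .bim order; SNPs present only in the annotation (rs9 in the toy data) are dropped.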

Fix 2 — [meta_subset] view helper for tau_star (commit c08438db)

postprocess writes per_trait[i]$summary with wide column names so a single per-trait list can hold both modes:

target, is_binary,
tau_single, tau_se_single, tau_star_single, tau_star_se_single,
enrichment_single, enrichment_se_single, enrichment_p_single,
enrichstat_single, enrichstat_se_single,
tau_joint, tau_se_joint, tau_star_joint, tau_star_se_joint,
enrichment_joint, ...

But meta_sldsc_random looks up bare names (tau_star, tau_star_se, etc.). The previous meta_subset cell passed subset_per_trait directly to meta_sldsc_random(..., "tau_star"). Because the summaries had no bare tau_star column, all 96 traits were skipped and out$tau_star was a list of NA.

Project subset_per_trait through pecotmr:::.sldsc_view_for_meta() once per mode (single | joint) before calling meta_sldsc_random. Output structure now mirrors postprocess for the "all" group:

out$tau_star_single   (per-target meta over single-tau)
out$tau_star_joint    (per-target meta over joint-tau)
out$enrichment        (single only — joint enrichment isn't well-defined)
out$enrichstat        (single only)
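
The projection from wide to bare column names can be sketched as below. This is a hypothetical Python illustration of the idea only. The real helper is pecotmr:::.sldsc_view_for_meta() in R, and its internals are not shown in this description. The function name `view_for_mode`, the dict-of-lists representation, and the toy values are assumptions.

```python
def view_for_mode(summary, mode):
    """Hypothetical sketch: project a wide per-trait summary to bare names.

    summary : dict mapping column name -> list of values (wide layout)
    mode    : "single" or "joint"
    """
    suffix = "_" + mode
    # Mode-independent columns are carried over unchanged.
    view = {k: v for k, v in summary.items() if k in ("target", "is_binary")}
    for name, values in summary.items():
        if name.endswith(suffix):
            view[name[: -len(suffix)]] = values  # tau_star_single -> tau_star
    return view


# Toy values only; not results from the pipeline.
summary = {
    "target": ["annotA", "annotB"],
    "is_binary": [True, True],
    "tau_star_single": [0.10, 0.20],
    "tau_star_se_single": [0.01, 0.02],
    "tau_star_joint": [0.05, 0.15],
    "tau_star_se_joint": [0.02, 0.03],
}

has_bare_before = "tau_star" in summary  # the lookup that failed before the fix
single_view = view_for_mode(summary, "single")
joint_view = view_for_mode(summary, "joint")
```

Building one view per mode keeps the wide per-trait list as the single source of truth while giving a consumer that expects bare names a separate, unambiguous input for each mode.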

Validation

  • MWE end-to-end (test/scripts/validate_pecotmr_fix.sh): COMPLETED 0:0 / 9:41 (full make_annotation → get_heritability → postprocess → meta_subset for category1).
  • Production (1000G allm_snplist + 1000G m50_snplist, 6 contexts × 96 traits, ROSMAP_eQTL_{Ast,Inh,Mic}_mega): all 6 jobs produced complete <ctx>.sldsc_postprocess.rds and 5 <group>.meta.rds per context (brain / blood / brain_neurodegenerative / brain_psychiatric / brain_imaging).

Dependencies

This PR depends on the matching pecotmr PR (StatFunGen/pecotmr#488) for sd_annot ↔ polyfun .results category alignment. Without that, postprocess fails on snp-list-mode .results (target Category = L2_0/<annot>L2_0 instead of the .annot.gz column name).

Files

  • code/enrichment/sldsc_enrichment.ipynb (2 commits, +33 / -3 lines total)

🤖 Generated with Claude Code

al4225 and others added 2 commits May 8, 2026 10:46
…in snp-list mode

When --snp-list is set, Step C invokes polyfun's ldsc.py --l2 --print-snps
(rather than compute_ldscores.py). ldsc.py strict-positionally reads cols
0..3 as CHR/BP/SNP/CM and cols 4+ as numeric annotations, and requires the
.annot SNP set to equal the .bim SNP set in identical order. Add a small
normalize_for_ldsc() helper, applied to both single and joint annot
dataframes before fwrite when use_print_snps is true. No-op otherwise.

Without this, snp-list-flavored settings (ADSP allm_snplist / m50_snplist,
1000G allm_snplist / m50_snplist) failed Step C with TypeError on 1000G
(A1/A2 strings parsed as numeric) or ValueError on ADSP (annot vs bim
shape mismatch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…view helper

postprocess writes per_trait[i]$summary with wide names (tau_star_single,
tau_star_joint, ...) so a single per-trait list can hold both modes. But
meta_sldsc_random looks up bare names (tau_star, tau_star_se), so passing
subset_per_trait directly returned NULLs for all 96 traits, leaving meta
output empty.

Project subset_per_trait through pecotmr:::.sldsc_view_for_meta() once per
mode (single | joint) before calling meta_sldsc_random. Output structure
now mirrors postprocess for the "all" group: tau_star_single,
tau_star_joint, enrichment, enrichstat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gaow gaow merged commit 2e5034a into StatFunGen:main May 10, 2026