sldsc_enrichment: fix snp-list mode + meta_subset tau_star handling #1319
Merged
…in snp-list mode

When --snp-list is set, Step C invokes polyfun's ldsc.py --l2 --print-snps (rather than compute_ldscores.py). ldsc.py strict-positionally reads cols 0..3 as CHR/BP/SNP/CM and cols 4+ as numeric annotations, and requires the .annot SNP set to equal the .bim SNP set in identical order.

Add a small normalize_for_ldsc() helper, applied to both single and joint annot dataframes before fwrite when use_print_snps is true. No-op otherwise.

Without this, snp-list-flavored settings (ADSP allm_snplist / m50_snplist, 1000G allm_snplist / m50_snplist) failed Step C with TypeError on 1000G (A1/A2 strings parsed as numeric) or ValueError on ADSP (annot vs bim shape mismatch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
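The reshaping this commit describes can be sketched in a few lines. The notebook's actual helper is not shown on this page and appears to operate on dataframes written with fwrite, so the following is only a minimal stdlib-Python illustration of the same idea. The plain-dict row representation, matching rows by SNP ID, and the toy annotation column `enhancer` are assumptions made for the example; the function name is reused only for readability.

```python
from typing import Any, Dict, List

# Columns that are metadata rather than annotations (assumed set for this sketch).
META_COLS = {"CHR", "BP", "SNP", "CM", "A1", "A2", "MAF"}


def normalize_for_ldsc(
    annot_rows: List[Dict[str, Any]], bim_rows: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Return one row per .bim SNP, in .bim order, as CHR/BP/SNP/CM + annotations."""
    # Annotation columns are everything that is not metadata, in first-seen order.
    annot_cols: List[str] = []
    for row in annot_rows:
        for col in row:
            if col not in META_COLS and col not in annot_cols:
                annot_cols.append(col)

    # Index annotation rows by SNP ID (assumption: SNP IDs are unique).
    by_snp = {row["SNP"]: row for row in annot_rows}

    out = []
    for bim in bim_rows:
        src = by_snp.get(bim["SNP"], {})
        # CHR/BP/SNP/CM all come from the .bim row; A1/A2/MAF are dropped.
        new = {"CHR": bim["CHR"], "BP": bim["BP"], "SNP": bim["SNP"], "CM": bim["CM"]}
        for col in annot_cols:
            # SNPs present in .bim but absent from the annotation get 0.
            new[col] = src.get(col, 0)
        out.append(new)
    return out


# Toy input: the annotation covers only rs2, the .bim lists rs1 and rs2.
annot = [
    {"CHR": 1, "BP": 200, "SNP": "rs2", "CM": 0.0,
     "A1": "A", "A2": "G", "MAF": 0.1, "enhancer": 1},
]
bim = [
    {"CHR": 1, "SNP": "rs1", "CM": 0.5, "BP": 100},
    {"CHR": 1, "SNP": "rs2", "CM": 0.7, "BP": 200},
]
normalized = normalize_for_ldsc(annot, bim)
```

In this toy run the output has the same number of rows as the `.bim` input and no string-valued columns after CM, which are the two properties the errors above point to.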
…view helper

postprocess writes per_trait[i]$summary with wide names (tau_star_single, tau_star_joint, ...) so a single per-trait list can hold both modes. But meta_sldsc_random looks up bare names (tau_star, tau_star_se), so passing subset_per_trait directly returned NULLs for all 96 traits, leaving meta output empty.

Project subset_per_trait through pecotmr:::.sldsc_view_for_meta() once per mode (single | joint) before calling meta_sldsc_random. Output structure now mirrors postprocess for the "all" group: tau_star_single, tau_star_joint, enrichment, enrichstat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
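The name mismatch behind this commit can be illustrated with a small stdlib-Python sketch. It does not reproduce pecotmr's `.sldsc_view_for_meta()`, whose implementation is not shown on this page; the function `view_for_meta`, the `<name>_<mode>` suffix convention (including `tau_star_se_single`), and the numeric values are all hypothetical and chosen only to show why a bare-name lookup finds nothing until the summary is projected per mode.

```python
from typing import Any, Dict

MODES = ("single", "joint")


def view_for_meta(summary: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Project a wide per-trait summary onto bare names for one mode."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    suffix = "_" + mode
    view: Dict[str, Any] = {}
    for name, value in summary.items():
        if name.endswith(suffix):
            # Strip the mode suffix: tau_star_single -> tau_star.
            view[name[: -len(suffix)]] = value
        elif not any(name.endswith("_" + m) for m in MODES):
            # Entries without any mode suffix pass through unchanged.
            view[name] = value
        # Entries belonging to the other mode are left out of this view.
    return view


# Made-up wide summary for one trait (values are illustrative only).
summary = {
    "tau_star_single": 0.12,
    "tau_star_se_single": 0.03,
    "tau_star_joint": 0.08,
    "tau_star_se_joint": 0.04,
}

bare_lookup = summary.get("tau_star")          # no such key in the wide summary
single_view = view_for_meta(summary, "single")
joint_view = view_for_meta(summary, "joint")
```

Projecting once per mode, as the commit describes, gives the downstream lookup a `tau_star` entry for each mode while the wide per-trait summary itself stays unchanged.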
Summary
Two independent fixes to `code/enrichment/sldsc_enrichment.ipynb` to make the new pipeline runnable at production scale:

- `[make_annotation_files_ldscore]`: reshape the annot dataframe so `ldsc.py --print-snps` accepts it (snp-list mode).
- `[meta_subset]`: split `tau_star` output into `tau_star_single` / `tau_star_joint` and project per-trait summaries through the view helper, so `meta_sldsc_random` finds the bare column names it expects.

Fix 1: `[make_annotation_files_ldscore]` snp-list mode (commit d89e3918)

When `--snp-list` is set, Step C invokes `polyfun/ldsc.py --l2 --print-snps` (instead of `compute_ldscores.py`). `ldsc.py` reads columns strictly by position, cols 0..3 as CHR/BP/SNP/CM and cols 4+ as numeric annotations, and requires the `.annot` SNP set to equal the `.bim` SNP set in identical row order. The previous Step A output failed both requirements:

- `A1/A2/MAF` as cols 4-6: `TypeError: can't multiply sequence by non-int of type 'float'` (1000G)
- `.annot` rows == `.bim` rows did not hold: `ValueError: shapes (634887,) (1698778,) not broadcastable` (ADSP)

Add a small `normalize_for_ldsc()` helper, applied to both single and joint annot dataframes before `fwrite` when `use_print_snps` is true. It:

- drops `A1/A2/MAF/CM` (CM is re-sourced),
- aligns rows to the `.bim` SNP set and fills 0 for missing SNPs,
- takes CM from `.bim` (authoritative; ADSP `.bim` has CM=0, 1000G has real cM),
- writes columns as `CHR-BP-SNP-CM-<annot…>` matching `.bim` row order.

No-op when `--snp-list` is not set (the `compute_ldscores.py` path is unchanged).

Fix 2: `[meta_subset]` view helper for tau_star (commit c08438db)

`postprocess` writes `per_trait[i]$summary` with wide column names (`tau_star_single`, `tau_star_joint`, ...) so a single per-trait list can hold both modes.

But `meta_sldsc_random` looks up bare names (`tau_star`, `tau_star_se`, etc.). The previous `meta_subset` cell passed `subset_per_trait` directly to `meta_sldsc_random(..., "tau_star")` → no bare `tau_star` column → all 96 traits skipped → `out$tau_star` was a list of NA.

Project `subset_per_trait` through `pecotmr:::.sldsc_view_for_meta()` once per mode (single | joint) before calling `meta_sldsc_random`. Output structure now mirrors `postprocess` for the "all" group: `tau_star_single`, `tau_star_joint`, `enrichment`, `enrichstat`.

Validation

- `test/scripts/validate_pecotmr_fix.sh`: COMPLETED 0:0 / 9:41 (full make_annotation → get_heritability → postprocess → meta_subset for category1).
- `<ctx>.sldsc_postprocess.rds` and 5 `<group>.meta.rds` per context (brain / blood / brain_neurodegenerative / brain_psychiatric / brain_imaging).

Dependencies

This PR depends on the matching pecotmr PR (StatFunGen/pecotmr#488) for sd_annot ↔ polyfun `.results` category alignment. Without that, `postprocess` fails on snp-list-mode `.results` (target Category = `L2_0` / `<annot>L2_0` instead of the `.annot.gz` column name).

Files

`code/enrichment/sldsc_enrichment.ipynb` (2 commits, +33 / -3 lines total)

🤖 Generated with Claude Code