From a17c30aab08ae428b19b4f909810e47b1a23582b Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Mon, 11 May 2026 15:48:54 -0400 Subject: [PATCH 1/4] Ensemble TWAS weight notebook Remove CV R^2 and update based on PR #486 pecotmr --- .../ensemble_twas_weights.ipynb | 992 ++++++++++++++++++ 1 file changed, 992 insertions(+) create mode 100644 code/pecotmr_integration/ensemble_twas_weights.ipynb diff --git a/code/pecotmr_integration/ensemble_twas_weights.ipynb b/code/pecotmr_integration/ensemble_twas_weights.ipynb new file mode 100644 index 000000000..9277af022 --- /dev/null +++ b/code/pecotmr_integration/ensemble_twas_weights.ipynb @@ -0,0 +1,992 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Ensemble TWAS Weights via Stacked Regression\n", + "\n", + "## Overview\n", + "\n", + "`pecotmr` offers 10+ TWAS weight methods (SuSiE, LASSO, Elastic Net, mr.ash, BayesR, DPR, etc.), each suited to different genetic architectures. Picking one method is arbitrary; running all and testing each incurs a multiple-testing penalty. **Stacked regression** ([SR-TWAS; Dai et al.,2024](https://doi.org/10.1038/s41467-024-50983-w)) provides a principled alternative: learn one convex combination of methods per gene, producing a single weight vector for downstream TWAS testing with no method-selection bias.\n", + "\n", + "This notebook:\n", + "1. Presents the mathematical framework and algorithm for ensemble weight learning\n", + "2. Benchmarks the approach via simulation using **genotypes** from LD sketch reference panels, so that realistic LD patterns are preserved\n", + "3. Produces diagnostic plots comparing MSE against ground truth and MSE against observed $Y$\n", + "\n", + "| Step | Description | Parallelism |\n", + "|------|-------------|-------------|\n", + "| `simulate_and_fit` | Load X from sketch LD, simulate Y, run methods + ensemble, save RDS | Per-region (`for_each`) |\n", + "| `analyze_results` | Load RDS files, compute metrics, generate diagnostic plots | Single task |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Mathematical Framework\n", + "\n", + "### Gene Expression Model\n", + "\n", + "For a target gene $g$ with $p$ cis-SNPs, the expression level of individual $i$ is:\n", + "\n", + "$$E_i = \\mathbf{G}_i \\mathbf{w} + \\epsilon_i, \\quad \\epsilon_i \\sim N(0, \\sigma^2_\\epsilon)$$\n", + "\n", + "where $\\mathbf{G}_i \\in \\mathbb{R}^{1 \\times p}$ is the genotype vector and $\\mathbf{w} \\in \\mathbb{R}^p$ is the true cis-eQTL effect-size vector.\n", + "\n", + "### Base Models\n", + "\n", + "We train $K$ prediction models $\\{m_1, \\ldots, m_K\\}$, each producing a weight estimate $\\hat{\\mathbf{w}}_k \\in \\mathbb{R}^p$. Under $F$-fold cross-validation, the out-of-fold prediction for sample $i$ from method $k$ is:\n", + "\n", + "$$\\hat{E}_i^{(k)} = \\mathbf{G}_i \\hat{\\mathbf{w}}_k^{(-f_i)}$$\n", + "\n", + "where $f_i$ denotes the fold containing sample $i$ and $\\hat{\\mathbf{w}}_k^{(-f)}$ is trained on all samples except fold $f$. Stacking all $n$ out-of-fold predictions across $K$ methods yields the prediction matrix:\n", + "\n", + "$$\\mathbf{P} = \\begin{pmatrix} \\hat{E}_1^{(1)} & \\cdots & \\hat{E}_1^{(K)} \\\\ \\vdots & \\ddots & \\vdots \\\\ \\hat{E}_n^{(1)} & \\cdots & \\hat{E}_n^{(K)} \\end{pmatrix} \\in \\mathbb{R}^{n \\times K}$$\n", + "\n", + "### Stacked Regression (Constrained QP)\n", + "\n", + "We seek ensemble coefficients $\\boldsymbol{\\zeta} = (\\zeta_1, \\ldots, \\zeta_K)^\\top$ that minimize the prediction error of the convex combination:\n", + "\n", + "$$\\min_{\\boldsymbol{\\zeta}} \\;\\; \\left\\| \\mathbf{E} - \\sum_{k=1}^{K} \\zeta_k \\hat{\\mathbf{E}}^{(k)} \\right\\|^2 = \\left\\| \\mathbf{E} - \\mathbf{P} \\boldsymbol{\\zeta} \\right\\|^2$$\n", + "\n", + "$$\\text{subject to} \\quad \\zeta_k \\geq 0 \\;\\; \\forall k, \\qquad \\sum_{k=1}^{K} \\zeta_k = 1$$\n", + "\n", + "This is a convex quadratic program solved via `quadprog::solve.QP` with:\n", + "\n", + "$$\\mathbf{D} = \\mathbf{P}^\\top \\mathbf{P} + \\lambda \\mathbf{I}, \\quad \\mathbf{d} = \\mathbf{P}^\\top \\mathbf{E}$$\n", + "\n", + "where $\\lambda = 10^{-8} \\cdot \\text{mean}(\\text{diag}(\\mathbf{P}^\\top \\mathbf{P}))$ provides ridge regularization for numerical stability.\n", + "\n", + "### Ensemble Weight Vector\n", + "\n", + "The final ensemble TWAS weight vector combines the full-data weight estimates:\n", + "\n", + "$$\\tilde{\\mathbf{w}} = \\sum_{k=1}^{K} \\zeta_k^* \\hat{\\mathbf{w}}_k$$\n", + "\n", + "### Performance Metrics\n", + "\n", + "Two complementary evaluation criteria:\n", + "\n", + "- **Against ground truth** (oracle): $\\text{MSE}_{\\text{truth}} = \\frac{1}{n} \\| \\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}} \\|^2$\n", + "- **Against observed $Y$** (practical): $R^2_Y = \\text{cor}(Y,\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$ and $\\text{MSE}_Y = \\frac{1}{n}\\|Y - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Algorithm\n", + "\n", + "The algorithm below shows the existing pecotmr cross-validation framework (Steps 1--5) and the **new ensemble learning extension** (Steps 6--9, in bold). Step numbers reference the mathematical formulation above.\n", + "\n", + "```\n", + "Algorithm: TWAS Weight Estimation with Ensemble Learning\n", + "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n", + "\n", + "Input: Genotype matrix G (n x p), expression vector E (n x 1),\n", + " weight methods M = {m_1, ..., m_K}, number of CV folds F\n", + "Output: Per-method weights {w_k}, ensemble coefficients {zeta_k},\n", + " combined weight vector w_tilde, CV performance metrics\n", + "\n", + "── Existing Framework: twas_weights_cv() + twas_weights() ──────\n", + "\n", + "1. Partition n samples into F folds: {S_1, ..., S_F}\n", + "2. For each fold f = 1, ..., F:\n", + " For each method k = 1, ..., K:\n", + " a. Train: w_k^(-f) <- m_k(G[T_f,], E[T_f])\n", + " b. Predict: E_hat^(k)[S_f] <- G[S_f,] * w_k^(-f)\n", + "3. Assemble out-of-fold prediction matrix P (n x K)\n", + " P[i,k] = E_hat_i^(k) from the fold containing sample i\n", + "4. Compute per-method performance:\n", + " R^2_k = cor(E, P[,k])^2 for k = 1, ..., K\n", + "5. Train final weights on full data:\n", + " w_k <- m_k(G, E) for all k\n", + "\n", + "── NEW: Ensemble Learning via ensemble_weights() ──────────────\n", + "\n", + " 6. Solve constrained QP for ensemble coefficients:\n", + " D <- P^T P + lambda * I (ridge regularization)\n", + " d <- P^T E\n", + " zeta* <- argmin 1/2 zeta^T D zeta - d^T zeta\n", + " s.t. sum(zeta_k) = 1, zeta_k >= 0 for all k\n", + " 7. Compute ensemble prediction:\n", + " E_hat_ens = P zeta*\n", + " 8. Compute ensemble performance:\n", + " R^2_ens = cor(E, E_hat_ens)^2\n", + " 9. Combine final weight vectors:\n", + " w_tilde = sum_k zeta_k * w_k\n", + "\n", + "Return: {w_k}, {zeta_k}, w_tilde, {R^2_k}, R^2_ens\n", + "```\n", + "\n", + "**Implementation**: Steps 1--5 are handled internally by `pecotmr::twas_weights_pipeline()`. Steps 6--9 are `pecotmr::ensemble_weights()`, now exported in PR #486 (no local source file needed)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Weight Methods\n", + "\n", + "Two method sets are provided depending on computational budget:\n", + "\n", + "### Expensive (default)\n", + "\n", + "| # | Method | Type | pecotmr function | Key parameters |\n", + "|---|--------|------|------------------|----------------|\n", + "| 1 | SuSiE | Bayesian sparse (L0-like) | `susie_weights` | `refine=FALSE, init_L=5, max_L=10` |\n", + "| 2 | SuSiE-ash | Bayesian sparse + ash shrinkage | `susie_ash_weights` | default |\n", + "| 3 | SuSiE-inf | Bayesian sparse + infinitesimal | `susie_inf_weights` | default |\n", + "| 4 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` | `init_prior_sd=TRUE, max.iter=100` |\n", + "| 5 | Elastic Net | L1+L2 penalized (glmnet) | `enet_weights` | default (`alpha=0.5`) |\n", + "| 6 | LASSO | L1 penalized (glmnet) | `lasso_weights` | default |\n", + "| 7 | BayesR | Bayesian mixture of normals | `bayes_r_weights` | default (`qgg`) |\n", + "| 8 | BayesL | Bayesian LASSO | `bayes_l_weights` | default |\n", + "| 9 | BayesA | Bayesian ridge-like | `bayes_a_weights` | default |\n", + "| 10 | BayesB | Bayesian variable selection | `bayes_b_weights` | default |\n", + "| 11 | BayesC | Bayesian spike-and-slab | `bayes_c_weights` | default |\n", + "| 12 | BayesN | Bayesian normal mixture | `bayes_n_weights` | default |\n", + "| 13 | B-LASSO | Bayesian LASSO variant | `b_lasso_weights` | default |\n", + "| 14 | DPR-VB | Dirichlet process regression (VB) | `dpr_vb_weights` | default (`RcppDPR`) |\n", + "| 15 | DPR-Gibbs | Dirichlet process regression (Gibbs) | `dpr_gibbs_weights` | default |\n", + "| 16 | DPR-Adaptive Gibbs | Dirichlet process regression (adaptive) | `dpr_adaptive_gibbs_weights` | default |\n", + "| 17 | SCAD | SCAD penalized | `scad_weights` | default |\n", + "| 18 | MCP | MCP penalized | `mcp_weights` | default |\n", + "| 19 | L0Learn | Best-subset approximation | `l0learn_weights` | default |\n", + "\n", + "### Cheap (fast iteration)\n", + "\n", + "| # | Method | Type | pecotmr function |\n", + "|---|--------|------|------------------|\n", + "| 1 | SuSiE | Bayesian sparse | `susie_weights` |\n", + "| 2 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` |\n", + "| 3 | LASSO | L1 penalized | `lasso_weights` |\n", + "| 4 | Elastic Net | L1+L2 penalized | `enet_weights` |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Simulation Design\n", + "\n", + "### Genotypes\n", + "\n", + "Each replicate uses **genotypes** loaded from LD sketch reference panels via `pecotmr::load_LD_matrix(meta, region, return_genotype = TRUE)`. Preserving LD patterns is critical for faithfully benchmarking TWAS weight methods — no synthetic genotype simulation can reproduce the complex correlation structure of actual genomic data.\n", + "\n", + "The pipeline requires two input files:\n", + "\n", + "1. **`ld_meta_file`** — Sketch LD metadata TSV pointing to plink2 files (one row per chromosome, `start=0, end=0` sentinel). Generated by the `rss_ld_sketch` pipeline. Example:\n", + " ```\n", + " #chrom start end path\n", + " 22 0 0 ADSP.R5.EUR.chr22\n", + " ```\n", + " The `path` column is the plink2 file prefix (resolved relative to the TSV directory). The corresponding `.pgen`, `.pvar`, `.psam`, `.afreq` files must exist.\n", + "\n", + "2. **`ld_block_file`** — BED or TSV file defining LD block boundaries (one row per block). Each block becomes a replicate region. Example (BED format, 0-based half-open):\n", + " ```\n", + " chr22 16051249 17614263\n", + " chr22 17614263 18834263\n", + " ...\n", + " ```\n", + " Or TSV with `#chrom start end` header. These define the genomic windows passed to `load_LD_matrix()` to extract variants per region.\n", + "\n", + "### Phenotypes\n", + "\n", + "- **Sparse**: `n_sparse` variants with large effects\n", + "- **Oligogenic**: `n_oligogenic` variants with moderate effects\n", + "- **Infinitesimal**: `n_inf` variants with small effects\n", + "- Total heritability: $h^2_g$ (default 0.2)\n", + "\n", + "### Replicates\n", + "\n", + "Each genomic region in the `ld_block_file` defines one replicate. The `n_seeds` parameter controls the number of independent phenotype realizations per region (different random seeds produce different causal architectures on the same genotype matrix).\n", + "\n", + "### Data Saving\n", + "\n", + "Set `save_data = TRUE` to save the raw genotype matrix, simulated phenotype, and ground-truth causal information in a separate `data.rds` file per replicate. This is useful for debugging but produces large files (one per region x seed). Disabled by default.\n", + "\n", + "### Evaluation Metrics\n", + "\n", + "Two complementary metrics per replicate:\n", + "\n", + "1. **Against ground truth** — MSE of predicted genetic component vs. true genetic component: $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$. This measures how well the method recovers the true signal.\n", + "\n", + "2. **Against observed $Y$** — $R^2$ and MSE of predictions vs. observed phenotype $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$. This is the practical metric: how well would the weights predict expression in a held-out sample?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Usage\n", + "\n", + "### Quick test (cheap methods, few regions)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n", + " --ld-meta-file data/ld_meta_file.tsv \\\n", + " --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed \\\n", + " --chrom 21 \\\n", + " --n-seeds 1 \\\n", + " --method-set cheap \\\n", + " -J 20" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Full benchmark (expensive methods, 10 seeds per region)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n", + " --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n", + " --ld-block-file /path/to/EUR_LD_blocks.bed \\\n", + " --chrom 22 \\\n", + " --n-seeds 10 \\\n", + " --method-set expensive \\\n", + " -J 40" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### With data saving (for debugging)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n", + " --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n", + " --ld-block-file /path/to/EUR_LD_blocks.bed \\\n", + " --chrom 22 \\\n", + " --n-seeds 1 \\\n", + " --save-data TRUE \\\n", + " -J 20" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "\n", + "### Analysis and plots" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jp-MarkdownHeadingCollapsed": true, + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/ensemble_twas_weights.ipynb analyze_results \\\n", + " --cwd /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/ensemble_twas \\\n", + " --ld-meta-file /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/rss_ld_sketch/sketch_ld_meta.tsv \\\n", + " --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Pipeline Implementation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "# Output directory\n", + "parameter: cwd = path(\"output/ensemble_twas\")\n", + "\n", + "\n", + "# ── Genotype source ──────────────────────────────────────────\n", + "# Sketch LD metadata TSV (plink2 format: chrom, start=0, end=0, path=prefix)\n", + "# Generated by the rss_ld_sketch pipeline. Used by load_LD_matrix() to load genotypes.\n", + "parameter: ld_meta_file = path(\"\")\n", + "# LD block file defining replicate regions (BED or TSV with chrom/start/end)\n", + "# Each block becomes one replicate. Can be EUR_LD_blocks.bed or similar.\n", + "parameter: ld_block_file = path(\"\")\n", + "# Chromosome filter (0 = all chromosomes)\n", + "parameter: chrom = 0\n", + "# Max variants to subsample per region (0 = no limit)\n", + "parameter: max_variants = 0\n", + "\n", + "# ── Method selection ─────────────────────────────────────────\n", + "# \"expensive\": full 19-method list (SuSiE variants, mr.ash, BayesR family, DPR family, LASSO, Elastic Net, SCAD, MCP, L0Learn)\n", + "# \"cheap\": \"fast_default\" preset — SuSiE, mr.ash, Elastic Net, LASSO\n", + "parameter: method_set = \"expensive\"\n", + "\n", + "# ── Simulation parameters ────────────────────────────────────\n", + "parameter: h2g = 0.2\n", + "parameter: n_sparse = 3\n", + "parameter: n_oligogenic = 5\n", + "parameter: n_inf = 15\n", + "parameter: cv_folds = 5\n", + "parameter: max_cv_variants = 500\n", + "\n", + "# ── Replicate control ────────────────────────────────────────\n", + "parameter: n_seeds = 1\n", + "parameter: seed_base = 42\n", + "\n", + "# ── Data saving (for debugging) ──────────────────────────────\n", + "# When TRUE, saves raw genotype matrix X, phenotype y, and sim object\n", + "# as a separate data.rds per replicate. Produces large files.\n", + "parameter: save_data = False\n", + "\n", + "# ── Cluster resources ────────────────────────────────────────\n", + "parameter: job_size = 1\n", + "parameter: walltime = \"4:00:00\"\n", + "parameter: mem = \"16G\"\n", + "parameter: numThreads = 4\n", + "\n", + "import os\n", + "cwd = path(f'{cwd:a}')\n", + "\n", + "def _read_ld_blocks(block_file, chrom_filter):\n", + " \"\"\"Parse LD block file (BED or TSV with chrom/start/end columns).\n", + " Returns list of region dicts with id, chrom, start, end, region string.\n", + " Supports BED (0-based half-open) and TSV with #chrom header.\n", + " \"\"\"\n", + " regions = []\n", + " with open(block_file) as fh:\n", + " for line in fh:\n", + " line = line.strip()\n", + " if not line or line.startswith('#chrom') or line.startswith('chrom'):\n", + " continue\n", + " parts = line.split()\n", + " if len(parts) < 3:\n", + " continue\n", + " c = parts[0]\n", + " if not c.startswith('chr'):\n", + " c = f'chr{c}'\n", + " try:\n", + " cnum = int(c.replace('chr', ''))\n", + " except ValueError:\n", + " continue\n", + " if not (1 <= cnum <= 22):\n", + " continue\n", + " if chrom_filter != 0 and cnum != chrom_filter:\n", + " continue\n", + " start, end = int(parts[1]), int(parts[2])\n", + " regions.append({\n", + " 'id': f'{c}_{start}_{end}',\n", + " 'chrom': c,\n", + " 'start': start,\n", + " 'end': end,\n", + " 'region': f'{c}:{start}-{end}'\n", + " })\n", + " return regions\n", + "\n", + "# Validate inputs\n", + "if not str(ld_meta_file) or not os.path.isfile(str(ld_meta_file)):\n", + " raise ValueError(\"ld_meta_file is required — must point to a sketch LD metadata TSV (plink2 format)\")\n", + "if not str(ld_block_file) or not os.path.isfile(str(ld_block_file)):\n", + " raise ValueError(\"ld_block_file is required — must point to an LD block BED/TSV defining replicate regions\")\n", + "\n", + "# Build replicate list: regions x seeds\n", + "_base_regions = _read_ld_blocks(str(ld_block_file), chrom)\n", + "if not _base_regions:\n", + " raise ValueError(f\"No regions found in {ld_block_file} for chrom={chrom}\")\n", + "\n", + "replicates = []\n", + "for seed_idx in range(n_seeds):\n", + " for i, r in enumerate(_base_regions):\n", + " rep = dict(r)\n", + " rep['seed'] = seed_base + seed_idx * len(_base_regions) + i\n", + " rep['rep_id'] = f'{r[\"id\"]}_seed{seed_idx}'\n", + " replicates.append(rep)\n", + "\n", + "print(f\" {len(_base_regions)} regions x {n_seeds} seeds = {len(replicates)} replicates\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### `simulate_and_fit`\n", + "\n", + "For each replicate (region x seed):\n", + "1. Load genotype matrix $\\mathbf{G}$ from LD sketch via `pecotmr::load_LD_matrix()` using the sketch LD metadata and the region defined by the LD block file\n", + "2. Simulate expression $E$ via `simxQTL::generate_cis_qtl_data()` with mixed architecture on top of $\\mathbf{G}$\n", + "3. Run `pecotmr::twas_weights_pipeline()` with selected methods and 5-fold CV\n", + "4. Call `pecotmr::ensemble_weights()` (now exported, PR #486) to learn $\\boldsymbol{\\zeta}$ and combine weights\n", + "5. Compute metrics against ground truth and against observed $Y$\n", + "6. Save results as RDS; optionally save raw data (X, y, sim) as separate `data.rds` when `--save-data TRUE`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[simulate_and_fit]\n", + "input: for_each = \"replicates\"\n", + "output_files = [f'{cwd}/{_replicates[\"rep_id\"]}/results.rds']\n", + "if save_data:\n", + " output_files.append(f'{cwd}/{_replicates[\"rep_id\"]}/data.rds')\n", + "output: output_files\n", + "# Pre-compute data path in Python so the R script can reference it unconditionally.\n", + "# (Using ${_output[1]} directly fails with IndexError when save_data=False\n", + "# because SoS's ${} substitution is Python-eager, regardless of R control flow.)\n", + "_data_path = f'{cwd}/{_replicates[\"rep_id\"]}/data.rds' if save_data else ''\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", + "R: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + "\n", + " library(pecotmr)\n", + " library(simxQTL)\n", + " set.seed(${_replicates['seed']})\n", + " cat(\"Replicate: ${_replicates['rep_id']}\\n\")\n", + " cat(\"Seed: ${_replicates['seed']}\\n\")\n", + " cat(\"Region: ${_replicates['region']}\\n\")\n", + "\n", + " # ── 1. Load genotype from LD sketch ─────────────────────\n", + " ld_data <- load_LD_matrix(\n", + " \"${ld_meta_file}\",\n", + " region = \"${_replicates['region']}\",\n", + " return_genotype = TRUE\n", + " )\n", + " X <- ld_data$LD_matrix\n", + " X <- scale(X)\n", + " X[is.na(X)] <- 0\n", + " n <- nrow(X); p <- ncol(X)\n", + " cat(sprintf(\"Genotype matrix: %d samples x %d variants\\n\", n, p))\n", + "\n", + " max_var <- ${max_variants}\n", + " if (max_var > 0 && p > max_var) {\n", + " keep_idx <- sort(sample(p, max_var))\n", + " X <- X[, keep_idx, drop = FALSE]\n", + " p <- ncol(X)\n", + " cat(sprintf(\"Subsampled to %d variants\\n\", p))\n", + " }\n", + "\n", + " # ── 2. Simulate training phenotype ───────────────────────\n", + " sim <- generate_cis_qtl_data(X, h2g = ${h2g},\n", + " n_sparse = ${n_sparse},\n", + " n_oligogenic = ${n_oligogenic},\n", + " n_inf = ${n_inf})\n", + " y <- sim$y # training Y: G*w_true + eps_train\n", + " cat(sprintf(\"Realized h2g: %.4f residual var: %.4f\\n\",\n", + " sim$h2g, sim$residual_variance))\n", + "\n", + " # ── 2a. Build ground-truth pieces from sim$beta ───────────\n", + " if (!is.null(sim$beta) && length(sim$beta) == p) {\n", + " true_beta <- as.numeric(sim$beta)\n", + " } else {\n", + " warning(\"sim$beta missing or wrong length (\", length(sim$beta),\n", + " \" vs p=\", p, \"); defaulting to zero vector\")\n", + " true_beta <- rep(0, p)\n", + " }\n", + " true_genetic <- as.vector(X %*% true_beta)\n", + "\n", + " # ── 2b. Fresh held-out Y (same X, same w_true, new noise) ─\n", + " # This is what lets us evaluate prediction without in-sample\n", + " # overfitting bias — the weights never saw this Y.\n", + " sigma2 <- sim$residual_variance\n", + " eps_new <- rnorm(n, mean = 0, sd = sqrt(sigma2))\n", + " y_test <- true_genetic + eps_new\n", + " cat(\"Generated fresh Y_test (independent noise draw)\\n\")\n", + "\n", + " # ── 2c. Optional raw-data dump for debugging ─────────────\n", + " data_path <- \"${_data_path}\"\n", + " if (nzchar(data_path)) {\n", + " data_out <- list(\n", + " X = X, y = y, y_test = y_test, sim = sim,\n", + " region = \"${_replicates['region']}\",\n", + " rep_id = \"${_replicates['rep_id']}\",\n", + " seed = ${_replicates['seed']},\n", + " variant_info = ld_data$ref_panel\n", + " )\n", + " dir.create(dirname(data_path), recursive = TRUE, showWarnings = FALSE)\n", + " saveRDS(data_out, data_path)\n", + " cat(\"Saved raw data:\", data_path, \"\\n\")\n", + " rm(data_out)\n", + " }\n", + "\n", + " # ── 3. Define weight methods ─────────────────────────────\n", + " method_set <- \"${method_set}\"\n", + " if (method_set == \"expensive\") {\n", + " weight_methods <- list(\n", + " susie_weights = list(refine = FALSE, init_L = 5, max_L = 10),\n", + " susie_ash_weights = list(),\n", + " susie_inf_weights = list(),\n", + " mrash_weights = list(init_prior_sd = TRUE, max.iter = 100),\n", + " enet_weights = list(),\n", + " lasso_weights = list(),\n", + " bayes_r_weights = list(),\n", + " bayes_l_weights = list(),\n", + " bayes_a_weights = list(),\n", + " bayes_b_weights = list(),\n", + " bayes_c_weights = list(),\n", + " bayes_n_weights = list(),\n", + " b_lasso_weights = list(),\n", + " dpr_vb_weights = list(),\n", + " dpr_gibbs_weights = list(),\n", + " dpr_adaptive_gibbs_weights = list(),\n", + " scad_weights = list(),\n", + " mcp_weights = list(),\n", + " l0learn_weights = list()\n", + "\n", + " )\n", + " } else {\n", + " weight_methods <- list(\n", + " susie_weights = list(refine = FALSE, init_L = 5, max_L = 10),\n", + " mrash_weights = list(init_prior_sd = TRUE, max.iter = 100),\n", + " lasso_weights = list(),\n", + " enet_weights = list()\n", + " )\n", + " }\n", + "\n", + " # ── 4. TWAS weights pipeline (CV + full-data weights) ────\n", + " # Note: twas_weights_pipeline() in pecotmr >=0.5 (PR #486) does NOT accept num_threads.\n", + " # ensemble_weights() is now exported; no local R/ensemble_weights.R source needed.\n", + " cat(\"Running twas_weights_pipeline with\", length(weight_methods), \"methods...\\n\")\n", + " res <- tryCatch(\n", + " twas_weights_pipeline(\n", + " X, y,\n", + " cv_folds = ${cv_folds},\n", + " weight_methods = weight_methods,\n", + " max_cv_variants = ${max_cv_variants}\n", + " ),\n", + " error = function(e) {\n", + " cat(\"Pipeline error:\", conditionMessage(e), \"\\n\")\n", + " NULL\n", + " }\n", + " )\n", + "\n", + " if (is.null(res)) {\n", + " cat(\"Pipeline failed — exiting non-zero so SoS flags this replicate\\n\")\n", + " saveRDS(list(results = NULL, ensemble = NULL,\n", + " truth = list(rep_id = \"${_replicates['rep_id']}\",\n", + " seed = ${_replicates['seed']},\n", + " error = TRUE)),\n", + " \"${_output[0]}\")\n", + " quit(save = \"no\", status = 1)\n", + " }\n", + "\n", + " # ── 5. Ensemble learning (steps 6-9 in the algorithm) ────\n", + " cat(\"Computing ensemble weights...\\n\")\n", + " ens <- tryCatch(\n", + " pecotmr::ensemble_weights(\n", + " cv_results = res$twas_cv_result,\n", + " Y = y,\n", + " twas_weight_list = res$twas_weights\n", + " ),\n", + " error = function(e) {\n", + " cat(\"Ensemble error:\", conditionMessage(e), \"\\n\")\n", + " NULL\n", + " }\n", + " )\n", + "\n", + " # ── 6. Compute metrics ───────────────────────────────────\n", + " # Four metrics per method and per ensemble:\n", + " # mse_w = mean((w_hat - w_true)^2) weight-space, LD-independent\n", + " # mse_pred_truth = mean((X*w_hat - X*w_true)^2) prediction-space, LD-aware (what TWAS uses)\n", + " # mse_y = mean((X*w_hat - y_test)^2) held-out Y (fresh eps), honest out-of-sample\n", + " # rsq_y = cor(X*w_hat, y_test)^2 held-out Y R^2\n", + " compute_metrics <- function(w_hat) {\n", + " if (is.matrix(w_hat)) w_hat <- w_hat[, 1]\n", + " pred <- as.vector(X %*% w_hat)\n", + " mse_w <- mean((w_hat - true_beta)^2)\n", + " mse_pred_truth <- mean((pred - true_genetic)^2)\n", + " mse_y <- mean((pred - y_test)^2)\n", + " rsq_y <- if (sd(pred) > 0) cor(pred, y_test)^2 else 0\n", + " c(mse_w = mse_w,\n", + " mse_pred_truth = mse_pred_truth,\n", + " mse_y = mse_y,\n", + " rsq_y = rsq_y)\n", + " }\n", + "\n", + " method_metrics <- sapply(names(res$twas_weights), function(wname) {\n", + " compute_metrics(res$twas_weights[[wname]])\n", + " })\n", + " colnames(method_metrics) <- gsub(\"_weights$\", \"\", colnames(method_metrics))\n", + "\n", + " ensemble_metrics <- c(mse_w = NA, mse_pred_truth = NA, mse_y = NA, rsq_y = NA)\n", + " if (!is.null(ens) && !is.null(ens$ensemble_twas_weights)) {\n", + " ensemble_metrics <- compute_metrics(ens$ensemble_twas_weights)\n", + " }\n", + "\n", + "\n", + " # ── 7. Save results ──────────────────────────────────────\n", + " truth <- list(\n", + " causal_sparse = sim$sparse_indices,\n", + " causal_oligo = sim$oligogenic_indices,\n", + " causal_inf = sim$infinitesimal_indices,\n", + " true_beta = true_beta,\n", + " true_y = y,\n", + " y_test = y_test,\n", + " residual_variance = sigma2,\n", + " h2g = sim$h2g,\n", + " rep_id = \"${_replicates['rep_id']}\",\n", + " region = \"${_replicates['region']}\",\n", + " seed = ${_replicates['seed']},\n", + " n = n, p = p,\n", + " method_metrics = method_metrics,\n", + " ensemble_metrics = ensemble_metrics,\n", + " )\n", + "\n", + " out <- list(results = res, ensemble = ens, truth = truth)\n", + " dir.create(dirname(\"${_output[0]}\"), recursive = TRUE, showWarnings = FALSE)\n", + " saveRDS(out, \"${_output[0]}\")\n", + " cat(\"Saved:\", \"${_output[0]}\", \"\\n\")\n", + " tryCatch({\n", + " cat(\"Ensemble MSE (weights): \", round(ensemble_metrics[\"mse_w\"], 6), \"\\n\")\n", + " cat(\"Ensemble MSE (pred vs truth):\",\n", + " round(ensemble_metrics[\"mse_pred_truth\"], 6), \"\\n\")\n", + " cat(\"Ensemble MSE (held-out Y): \",\n", + " round(ensemble_metrics[\"mse_y\"], 6), \"\\n\")\n", + " cat(\"Ensemble R^2 (held-out Y): \",\n", + " round(ensemble_metrics[\"rsq_y\"], 4), \"\\n\")\n", + " }, error = function(e) cat(\"Metric-print failed (non-fatal):\", conditionMessage(e), \"\\n\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### `analyze_results`\n", + "\n", + "Load all per-replicate RDS files and produce:\n", + "1. **MSE of weights** — $\\frac{1}{p}\\|\\hat{\\mathbf{w}} - \\mathbf{w}_{\\text{true}}\\|^2$ (LD-independent)\n", + "2. **MSE of predicted genetic component** — $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$ (LD-aware, TWAS-relevant)\n", + "3. **MSE vs held-out $Y$** — $\\frac{1}{n}\\|Y_{\\text{test}} - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y_{\\text{test}} = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\varepsilon_{\\text{new}}$\n", + "4. **$R^2$ vs held-out $Y$** — $\\text{cor}(Y_{\\text{test}},\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$\n", + "5. **Coefficient heatmap** — ensemble $\\zeta_k$ across replicates\n", + "6. **Scatter plot** — ensemble $R^2$ vs. best single-method $R^2$\n", + "7. **Summary CSV** — one row per replicate with all metrics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[analyze_results]\n", + "parameter: results_dir = path(f'{cwd}')\n", + "output: f'{results_dir}/diagnostics.pdf', f'{results_dir}/summary_table.csv'\n", + "task: trunk_workers = 1, walltime = \"1:00:00\", mem = \"8G\", cores = 1\n", + "R: expand = \"${ }\", stderr = f'{results_dir}/analyze.stderr', stdout = f'{results_dir}/analyze.stdout'\n", + "\n", + " library(ggplot2)\n", + " library(tidyr)\n", + " library(dplyr)\n", + "\n", + " results_dir <- \"${results_dir}\"\n", + "\n", + " # ── 1. Load all RDS files ────────────────────────────────\n", + " rds_files <- list.files(results_dir, pattern = \"results\\\\.rds$\",\n", + " recursive = TRUE, full.names = TRUE)\n", + " cat(sprintf(\"Found %d RDS files\\n\", length(rds_files)))\n", + "\n", + " all_data <- lapply(rds_files, function(f) {\n", + " tryCatch(readRDS(f), error = function(e) NULL)\n", + " })\n", + " all_data <- all_data[!sapply(all_data, is.null)]\n", + " all_data <- all_data[!sapply(all_data, function(x) isTRUE(x$truth$error))]\n", + " cat(sprintf(\"%d successful replicates\\n\", length(all_data)))\n", + "\n", + " if (length(all_data) == 0) stop(\"No valid results found\")\n", + "\n", + " # ── 2. Extract metrics ───────────────────────────────────\n", + " metrics_list <- lapply(seq_along(all_data), function(i) {\n", + " x <- all_data[[i]]\n", + " coefs <- if (!is.null(x$ensemble)) x$ensemble$method_coef else NULL\n", + " list(\n", + " rep_id = x$truth$rep_id, seed = x$truth$seed, region = x$truth$region,\n", + " method_coef = coefs,\n", + " method_metrics = x$truth$method_metrics,\n", + " ensemble_metrics = x$truth$ensemble_metrics\n", + " )\n", + " })\n", + "\n", + " # ── 3. Tidy data frames ──────────────────────────────────\n", + " build_metric_df <- function(metric_name) {\n", + " do.call(rbind, lapply(metrics_list, function(m) {\n", + " if (is.null(m$method_metrics)) return(NULL)\n", + " if (!(metric_name %in% rownames(m$method_metrics))) return(NULL)\n", + " methods <- colnames(m$method_metrics)\n", + " vals <- m$method_metrics[metric_name, ]\n", + " method_df <- data.frame(rep_id = m$rep_id, method = methods,\n", + " value = as.numeric(vals),\n", + " stringsAsFactors = FALSE)\n", + " ens_df <- data.frame(rep_id = m$rep_id, method = \"ensemble\",\n", + " value = as.numeric(m$ensemble_metrics[metric_name]),\n", + " stringsAsFactors = FALSE)\n", + " rbind(method_df, ens_df)\n", + " }))\n", + " }\n", + "\n", + "\n", + " mse_w_df <- build_metric_df(\"mse_w\")\n", + " mse_pred_truth_df <- build_metric_df(\"mse_pred_truth\")\n", + " mse_y_df <- build_metric_df(\"mse_y\")\n", + " rsq_y_df <- build_metric_df(\"rsq_y\")\n", + "\n", + " coef_mat <- do.call(rbind, lapply(metrics_list, function(m) {\n", + " if (is.null(m$method_coef)) return(NULL)\n", + " m$method_coef\n", + " }))\n", + " if (!is.null(coef_mat)) {\n", + " rownames(coef_mat) <- sapply(metrics_list[!sapply(metrics_list,\n", + " function(m) is.null(m$method_coef))], function(m) m$rep_id)\n", + " }\n", + "\n", + " scatter_df <- do.call(rbind, lapply(metrics_list, function(m) {\n", + " if (is.null(m$method_metrics)) return(NULL)\n", + " if (!(\"rsq_y\" %in% rownames(m$method_metrics))) return(NULL)\n", + " rsq_vals <- m$method_metrics[\"rsq_y\", ]\n", + " data.frame(\n", + " rep_id = m$rep_id,\n", + " best_single_rsq = max(rsq_vals, na.rm = TRUE),\n", + " best_method = names(which.max(rsq_vals)),\n", + " ensemble_rsq = as.numeric(m$ensemble_metrics[\"rsq_y\"]),\n", + " stringsAsFactors = FALSE\n", + " )\n", + " }))\n", + "\n", + " # ── 4. Diagnostic plots ──────────────────────────────────\n", + " make_boxplot <- function(df, y_label, title, descending = FALSE) {\n", + " if (is.null(df) || nrow(df) == 0) return(NULL)\n", + " ord <- df %>% group_by(method) %>%\n", + " summarise(med = median(value, na.rm = TRUE))\n", + " if (descending) ord <- ord %>% arrange(desc(med)) else ord <- ord %>% arrange(med)\n", + " df$method <- factor(df$method, levels = ord$method)\n", + " ggplot(df, aes(x = method, y = value,\n", + " fill = ifelse(method == \"ensemble\", \"Ensemble\", \"Single\"))) +\n", + " geom_boxplot(outlier.size = 0.8) +\n", + " scale_fill_manual(values = c(\"Ensemble\" = \"#E41A1C\",\n", + " \"Single\" = \"#377EB8\"), name = \"\") +\n", + " labs(title = title, x = \"Method\", y = y_label) +\n", + " theme_bw(base_size = 12) +\n", + " theme(axis.text.x = element_text(angle = 45, hjust = 1))\n", + " }\n", + "\n", + " pdf(\"${_output[0]}\", width = 12, height = 8)\n", + "\n", + "\n", + " p <- make_boxplot(mse_w_df, \"MSE of weights\",\n", + " \"MSE on Weights (LD-independent): mean((w_hat - w_true)^2)\",\n", + " descending = TRUE)\n", + " if (!is.null(p)) print(p)\n", + "\n", + " p <- make_boxplot(mse_pred_truth_df, \"MSE(G*w_hat vs G*w_true)\",\n", + " \"MSE on Predicted Genetic Component (LD-aware, TWAS-relevant)\",\n", + " descending = TRUE)\n", + " if (!is.null(p)) print(p)\n", + "\n", + " p <- make_boxplot(mse_y_df, \"MSE (vs held-out Y)\",\n", + " \"MSE Against Held-Out Y (fresh noise — honest out-of-sample)\",\n", + " descending = TRUE)\n", + " if (!is.null(p)) print(p)\n", + "\n", + " p <- make_boxplot(rsq_y_df, expression(R^2~(vs~held-out~Y)),\n", + " \"R-squared Against Held-Out Y\")\n", + " if (!is.null(p)) print(p)\n", + "\n", + " if (!is.null(coef_mat) && nrow(coef_mat) > 1) {\n", + " coef_long <- as.data.frame(coef_mat) %>%\n", + " mutate(replicate = rownames(coef_mat)) %>%\n", + " pivot_longer(-replicate, names_to = \"method\", values_to = \"zeta\")\n", + " p5 <- ggplot(coef_long, aes(x = method, y = replicate, fill = zeta)) +\n", + " geom_tile() +\n", + " scale_fill_gradient(low = \"white\", high = \"#E41A1C\",\n", + " name = expression(zeta[k]), limits = c(0, 1)) +\n", + " labs(title = \"Ensemble Coefficients Across Replicates\",\n", + " x = \"Method\", y = \"Replicate\") +\n", + " theme_bw(base_size = 10) +\n", + " theme(axis.text.x = element_text(angle = 45, hjust = 1),\n", + " axis.text.y = element_text(size = 6))\n", + " print(p5)\n", + " }\n", + "\n", + " if (!is.null(scatter_df) && nrow(scatter_df) > 0) {\n", + " p6 <- ggplot(scatter_df, aes(x = best_single_rsq, y = ensemble_rsq)) +\n", + " geom_point(alpha = 0.6, size = 2, color = \"#377EB8\") +\n", + " geom_abline(intercept = 0, slope = 1, linetype = \"dashed\",\n", + " color = \"grey40\") +\n", + " labs(title = \"Ensemble vs Best Single Method (R^2 vs held-out Y)\",\n", + " x = expression(Best~single~method~R^2),\n", + " y = expression(Ensemble~R^2)) +\n", + " theme_bw(base_size = 12) + coord_equal()\n", + " print(p6)\n", + " }\n", + "\n", + " dev.off()\n", + " cat(\"Saved plots:\", \"${_output[0]}\", \"\\n\")\n", + "\n", + " # ── 5. Summary table ─────────────────────────────────────\n", + " metric_rows <- c(\"mse_w\", \"mse_pred_truth\", \"mse_y\", \"rsq_y\")\n", + " summary_rows <- lapply(metrics_list, function(m) {\n", + " row <- data.frame(\n", + " rep_id = m$rep_id, region = m$region, seed = m$seed,\n", + " stringsAsFactors = FALSE\n", + " )\n", + " for (mr in metric_rows) {\n", + " row[[paste0(\"ensemble_\", mr)]] <- as.numeric(m$ensemble_metrics[mr])\n", + " }\n", + " if (!is.null(m$method_metrics)) {\n", + " for (mn in colnames(m$method_metrics)) {\n", + " for (mr in metric_rows) {\n", + " if (mr %in% rownames(m$method_metrics)) {\n", + " row[[paste0(mn, \"_\", mr)]] <- m$method_metrics[mr, mn]\n", + " }\n", + " }\n", + " }\n", + " }\n", + " if (!is.null(m$method_coef)) {\n", + " for (mn in names(m$method_coef)) {\n", + " row[[paste0(\"zeta_\", mn)]] <- m$method_coef[[mn]]\n", + " }\n", + " }\n", + " row\n", + " })\n", + " summary_df <- bind_rows(summary_rows)\n", + "\n", + " method_rsq_cols <- grep(\"_rsq_y$\", names(summary_df), value = TRUE)\n", + " method_rsq_cols <- setdiff(method_rsq_cols, \"ensemble_rsq_y\")\n", + " if (length(method_rsq_cols) > 0) {\n", + " summary_df$best_method_rsq_y <- apply(summary_df[method_rsq_cols], 1,\n", + " max, na.rm = TRUE)\n", + " summary_df$best_method <- apply(summary_df[method_rsq_cols], 1, function(r) {\n", + " gsub(\"_rsq_y$\", \"\", names(which.max(r)))\n", + " })\n", + " summary_df$rsq_y_improvement <-\n", + " summary_df$ensemble_rsq_y - summary_df$best_method_rsq_y\n", + " }\n", + "\n", + " write.csv(summary_df, \"${_output[1]}\", row.names = FALSE)\n", + " cat(\"Saved summary:\", \"${_output[1]}\", \"\\n\")\n", + "\n", + " cat(\"\\n=== Summary ===\\n\")\n", + " cat(sprintf(\"Replicates: %d\\n\", nrow(summary_df)))\n", + " cat(sprintf(\"Mean ensemble MSE (weights) : %.6f\\n\",\n", + " mean(summary_df$ensemble_mse_w, na.rm = TRUE)))\n", + " cat(sprintf(\"Mean ensemble MSE (pred truth) : %.6f\\n\",\n", + " mean(summary_df$ensemble_mse_pred_truth, na.rm = TRUE)))\n", + " cat(sprintf(\"Mean ensemble MSE (held-out Y) : %.6f\\n\",\n", + " mean(summary_df$ensemble_mse_y, na.rm = TRUE)))\n", + " cat(sprintf(\"Mean ensemble R^2 (held-out Y) : %.4f\\n\",\n", + " mean(summary_df$ensemble_rsq_y, na.rm = TRUE)))\n", + " if (\"rsq_y_improvement\" %in% names(summary_df)) {\n", + " cat(sprintf(\"Ensemble >= best in %.1f%% of replicates (R^2 vs held-out Y)\\n\",\n", + " 100 * mean(summary_df$rsq_y_improvement >= 0, na.rm = TRUE)))\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Expected Results\n", + "\n", + "With default simulation parameters ($h^2_g = 0.2$, mixed architecture) on genotype regions:\n", + "\n", + "- **MSE against truth vs. MSE against $Y$**: MSE against truth isolates weight estimation quality (unaffected by noise). MSE against $Y$ includes irreducible noise $\\sigma^2_\\epsilon$, so it is always larger.\n", + "- **Coefficient patterns**: Sparse architectures favor SuSiE/LASSO; polygenic architectures favor BayesR/DPR; the ensemble adapts by assigning larger $\\zeta_k$ to the method best suited to each region's architecture.\n", + "- **Scatter plot**: Most points should lie on or above the diagonal, indicating ensemble matches or improves on the best single method." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "R", + "ir", + "R", + "", + "r" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ] + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 7780dbfd0f6f861567a4d933141eb055c180abdc Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 19 May 2026 09:13:59 -0400 Subject: [PATCH 2/4] Add new sections for snRNAseq and pseudobulk preprocessing --- website/_toc.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/website/_toc.yml b/website/_toc.yml index f6e3744c5..d5efb01d2 100644 --- a/website/_toc.yml +++ b/website/_toc.yml @@ -26,6 +26,8 @@ parts: - file: code/molecular_phenotypes/calling/RNA_calling.ipynb - file: code/molecular_phenotypes/QC/bulk_expression_QC.ipynb - file: code/molecular_phenotypes/QC/bulk_expression_normalization.ipynb + - file: code/molecular_phenotypes/snRNAseq_preprocessing.ipynb + - file: code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb - file: code/molecular_phenotypes/methylation.ipynb sections: - file: code/molecular_phenotypes/calling/methylation_calling.ipynb From 5bd3996e32957eeb4a5b88a45bc2e2b4ba37a8d8 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 19 May 2026 09:15:26 -0400 Subject: [PATCH 3/4] Delete code/pecotmr_integration/ensemble_twas_weights.ipynb --- .../ensemble_twas_weights.ipynb | 992 ------------------ 1 file changed, 992 deletions(-) delete mode 100644 code/pecotmr_integration/ensemble_twas_weights.ipynb diff --git a/code/pecotmr_integration/ensemble_twas_weights.ipynb b/code/pecotmr_integration/ensemble_twas_weights.ipynb deleted file mode 100644 index 9277af022..000000000 --- a/code/pecotmr_integration/ensemble_twas_weights.ipynb +++ /dev/null @@ -1,992 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Ensemble TWAS Weights via Stacked Regression\n", - "\n", - "## Overview\n", - "\n", - "`pecotmr` offers 10+ TWAS weight methods (SuSiE, LASSO, Elastic Net, mr.ash, BayesR, DPR, etc.), each suited to different genetic architectures. Picking one method is arbitrary; running all and testing each incurs a multiple-testing penalty. **Stacked regression** ([SR-TWAS; Dai et al.,2024](https://doi.org/10.1038/s41467-024-50983-w)) provides a principled alternative: learn one convex combination of methods per gene, producing a single weight vector for downstream TWAS testing with no method-selection bias.\n", - "\n", - "This notebook:\n", - "1. Presents the mathematical framework and algorithm for ensemble weight learning\n", - "2. Benchmarks the approach via simulation using **genotypes** from LD sketch reference panels, so that realistic LD patterns are preserved\n", - "3. Produces diagnostic plots comparing MSE against ground truth and MSE against observed $Y$\n", - "\n", - "| Step | Description | Parallelism |\n", - "|------|-------------|-------------|\n", - "| `simulate_and_fit` | Load X from sketch LD, simulate Y, run methods + ensemble, save RDS | Per-region (`for_each`) |\n", - "| `analyze_results` | Load RDS files, compute metrics, generate diagnostic plots | Single task |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Mathematical Framework\n", - "\n", - "### Gene Expression Model\n", - "\n", - "For a target gene $g$ with $p$ cis-SNPs, the expression level of individual $i$ is:\n", - "\n", - "$$E_i = \\mathbf{G}_i \\mathbf{w} + \\epsilon_i, \\quad \\epsilon_i \\sim N(0, \\sigma^2_\\epsilon)$$\n", - "\n", - "where $\\mathbf{G}_i \\in \\mathbb{R}^{1 \\times p}$ is the genotype vector and $\\mathbf{w} \\in \\mathbb{R}^p$ is the true cis-eQTL effect-size vector.\n", - "\n", - "### Base Models\n", - "\n", - "We train $K$ prediction models $\\{m_1, \\ldots, m_K\\}$, each producing a weight estimate $\\hat{\\mathbf{w}}_k \\in \\mathbb{R}^p$. Under $F$-fold cross-validation, the out-of-fold prediction for sample $i$ from method $k$ is:\n", - "\n", - "$$\\hat{E}_i^{(k)} = \\mathbf{G}_i \\hat{\\mathbf{w}}_k^{(-f_i)}$$\n", - "\n", - "where $f_i$ denotes the fold containing sample $i$ and $\\hat{\\mathbf{w}}_k^{(-f)}$ is trained on all samples except fold $f$. Stacking all $n$ out-of-fold predictions across $K$ methods yields the prediction matrix:\n", - "\n", - "$$\\mathbf{P} = \\begin{pmatrix} \\hat{E}_1^{(1)} & \\cdots & \\hat{E}_1^{(K)} \\\\ \\vdots & \\ddots & \\vdots \\\\ \\hat{E}_n^{(1)} & \\cdots & \\hat{E}_n^{(K)} \\end{pmatrix} \\in \\mathbb{R}^{n \\times K}$$\n", - "\n", - "### Stacked Regression (Constrained QP)\n", - "\n", - "We seek ensemble coefficients $\\boldsymbol{\\zeta} = (\\zeta_1, \\ldots, \\zeta_K)^\\top$ that minimize the prediction error of the convex combination:\n", - "\n", - "$$\\min_{\\boldsymbol{\\zeta}} \\;\\; \\left\\| \\mathbf{E} - \\sum_{k=1}^{K} \\zeta_k \\hat{\\mathbf{E}}^{(k)} \\right\\|^2 = \\left\\| \\mathbf{E} - \\mathbf{P} \\boldsymbol{\\zeta} \\right\\|^2$$\n", - "\n", - "$$\\text{subject to} \\quad \\zeta_k \\geq 0 \\;\\; \\forall k, \\qquad \\sum_{k=1}^{K} \\zeta_k = 1$$\n", - "\n", - "This is a convex quadratic program solved via `quadprog::solve.QP` with:\n", - "\n", - "$$\\mathbf{D} = \\mathbf{P}^\\top \\mathbf{P} + \\lambda \\mathbf{I}, \\quad \\mathbf{d} = \\mathbf{P}^\\top \\mathbf{E}$$\n", - "\n", - "where $\\lambda = 10^{-8} \\cdot \\text{mean}(\\text{diag}(\\mathbf{P}^\\top \\mathbf{P}))$ provides ridge regularization for numerical stability.\n", - "\n", - "### Ensemble Weight Vector\n", - "\n", - "The final ensemble TWAS weight vector combines the full-data weight estimates:\n", - "\n", - "$$\\tilde{\\mathbf{w}} = \\sum_{k=1}^{K} \\zeta_k^* \\hat{\\mathbf{w}}_k$$\n", - "\n", - "### Performance Metrics\n", - "\n", - "Two complementary evaluation criteria:\n", - "\n", - "- **Against ground truth** (oracle): $\\text{MSE}_{\\text{truth}} = \\frac{1}{n} \\| \\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}} \\|^2$\n", - "- **Against observed $Y$** (practical): $R^2_Y = \\text{cor}(Y,\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$ and $\\text{MSE}_Y = \\frac{1}{n}\\|Y - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Algorithm\n", - "\n", - "The algorithm below shows the existing pecotmr cross-validation framework (Steps 1--5) and the **new ensemble learning extension** (Steps 6--9, in bold). Step numbers reference the mathematical formulation above.\n", - "\n", - "```\n", - "Algorithm: TWAS Weight Estimation with Ensemble Learning\n", - "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n", - "\n", - "Input: Genotype matrix G (n x p), expression vector E (n x 1),\n", - " weight methods M = {m_1, ..., m_K}, number of CV folds F\n", - "Output: Per-method weights {w_k}, ensemble coefficients {zeta_k},\n", - " combined weight vector w_tilde, CV performance metrics\n", - "\n", - "── Existing Framework: twas_weights_cv() + twas_weights() ──────\n", - "\n", - "1. Partition n samples into F folds: {S_1, ..., S_F}\n", - "2. For each fold f = 1, ..., F:\n", - " For each method k = 1, ..., K:\n", - " a. Train: w_k^(-f) <- m_k(G[T_f,], E[T_f])\n", - " b. Predict: E_hat^(k)[S_f] <- G[S_f,] * w_k^(-f)\n", - "3. Assemble out-of-fold prediction matrix P (n x K)\n", - " P[i,k] = E_hat_i^(k) from the fold containing sample i\n", - "4. Compute per-method performance:\n", - " R^2_k = cor(E, P[,k])^2 for k = 1, ..., K\n", - "5. Train final weights on full data:\n", - " w_k <- m_k(G, E) for all k\n", - "\n", - "── NEW: Ensemble Learning via ensemble_weights() ──────────────\n", - "\n", - " 6. Solve constrained QP for ensemble coefficients:\n", - " D <- P^T P + lambda * I (ridge regularization)\n", - " d <- P^T E\n", - " zeta* <- argmin 1/2 zeta^T D zeta - d^T zeta\n", - " s.t. sum(zeta_k) = 1, zeta_k >= 0 for all k\n", - " 7. Compute ensemble prediction:\n", - " E_hat_ens = P zeta*\n", - " 8. Compute ensemble performance:\n", - " R^2_ens = cor(E, E_hat_ens)^2\n", - " 9. Combine final weight vectors:\n", - " w_tilde = sum_k zeta_k * w_k\n", - "\n", - "Return: {w_k}, {zeta_k}, w_tilde, {R^2_k}, R^2_ens\n", - "```\n", - "\n", - "**Implementation**: Steps 1--5 are handled internally by `pecotmr::twas_weights_pipeline()`. Steps 6--9 are `pecotmr::ensemble_weights()`, now exported in PR #486 (no local source file needed)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Weight Methods\n", - "\n", - "Two method sets are provided depending on computational budget:\n", - "\n", - "### Expensive (default)\n", - "\n", - "| # | Method | Type | pecotmr function | Key parameters |\n", - "|---|--------|------|------------------|----------------|\n", - "| 1 | SuSiE | Bayesian sparse (L0-like) | `susie_weights` | `refine=FALSE, init_L=5, max_L=10` |\n", - "| 2 | SuSiE-ash | Bayesian sparse + ash shrinkage | `susie_ash_weights` | default |\n", - "| 3 | SuSiE-inf | Bayesian sparse + infinitesimal | `susie_inf_weights` | default |\n", - "| 4 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` | `init_prior_sd=TRUE, max.iter=100` |\n", - "| 5 | Elastic Net | L1+L2 penalized (glmnet) | `enet_weights` | default (`alpha=0.5`) |\n", - "| 6 | LASSO | L1 penalized (glmnet) | `lasso_weights` | default |\n", - "| 7 | BayesR | Bayesian mixture of normals | `bayes_r_weights` | default (`qgg`) |\n", - "| 8 | BayesL | Bayesian LASSO | `bayes_l_weights` | default |\n", - "| 9 | BayesA | Bayesian ridge-like | `bayes_a_weights` | default |\n", - "| 10 | BayesB | Bayesian variable selection | `bayes_b_weights` | default |\n", - "| 11 | BayesC | Bayesian spike-and-slab | `bayes_c_weights` | default |\n", - "| 12 | BayesN | Bayesian normal mixture | `bayes_n_weights` | default |\n", - "| 13 | B-LASSO | Bayesian LASSO variant | `b_lasso_weights` | default |\n", - "| 14 | DPR-VB | Dirichlet process regression (VB) | `dpr_vb_weights` | default (`RcppDPR`) |\n", - "| 15 | DPR-Gibbs | Dirichlet process regression (Gibbs) | `dpr_gibbs_weights` | default |\n", - "| 16 | DPR-Adaptive Gibbs | Dirichlet process regression (adaptive) | `dpr_adaptive_gibbs_weights` | default |\n", - "| 17 | SCAD | SCAD penalized | `scad_weights` | default |\n", - "| 18 | MCP | MCP penalized | `mcp_weights` | default |\n", - "| 19 | L0Learn | Best-subset approximation | `l0learn_weights` | default |\n", - "\n", - "### Cheap (fast iteration)\n", - "\n", - "| # | Method | Type | pecotmr function |\n", - "|---|--------|------|------------------|\n", - "| 1 | SuSiE | Bayesian sparse | `susie_weights` |\n", - "| 2 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` |\n", - "| 3 | LASSO | L1 penalized | `lasso_weights` |\n", - "| 4 | Elastic Net | L1+L2 penalized | `enet_weights` |\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Simulation Design\n", - "\n", - "### Genotypes\n", - "\n", - "Each replicate uses **genotypes** loaded from LD sketch reference panels via `pecotmr::load_LD_matrix(meta, region, return_genotype = TRUE)`. Preserving LD patterns is critical for faithfully benchmarking TWAS weight methods — no synthetic genotype simulation can reproduce the complex correlation structure of actual genomic data.\n", - "\n", - "The pipeline requires two input files:\n", - "\n", - "1. **`ld_meta_file`** — Sketch LD metadata TSV pointing to plink2 files (one row per chromosome, `start=0, end=0` sentinel). Generated by the `rss_ld_sketch` pipeline. Example:\n", - " ```\n", - " #chrom start end path\n", - " 22 0 0 ADSP.R5.EUR.chr22\n", - " ```\n", - " The `path` column is the plink2 file prefix (resolved relative to the TSV directory). The corresponding `.pgen`, `.pvar`, `.psam`, `.afreq` files must exist.\n", - "\n", - "2. **`ld_block_file`** — BED or TSV file defining LD block boundaries (one row per block). Each block becomes a replicate region. Example (BED format, 0-based half-open):\n", - " ```\n", - " chr22 16051249 17614263\n", - " chr22 17614263 18834263\n", - " ...\n", - " ```\n", - " Or TSV with `#chrom start end` header. These define the genomic windows passed to `load_LD_matrix()` to extract variants per region.\n", - "\n", - "### Phenotypes\n", - "\n", - "- **Sparse**: `n_sparse` variants with large effects\n", - "- **Oligogenic**: `n_oligogenic` variants with moderate effects\n", - "- **Infinitesimal**: `n_inf` variants with small effects\n", - "- Total heritability: $h^2_g$ (default 0.2)\n", - "\n", - "### Replicates\n", - "\n", - "Each genomic region in the `ld_block_file` defines one replicate. The `n_seeds` parameter controls the number of independent phenotype realizations per region (different random seeds produce different causal architectures on the same genotype matrix).\n", - "\n", - "### Data Saving\n", - "\n", - "Set `save_data = TRUE` to save the raw genotype matrix, simulated phenotype, and ground-truth causal information in a separate `data.rds` file per replicate. This is useful for debugging but produces large files (one per region x seed). Disabled by default.\n", - "\n", - "### Evaluation Metrics\n", - "\n", - "Two complementary metrics per replicate:\n", - "\n", - "1. **Against ground truth** — MSE of predicted genetic component vs. true genetic component: $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$. This measures how well the method recovers the true signal.\n", - "\n", - "2. **Against observed $Y$** — $R^2$ and MSE of predictions vs. observed phenotype $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$. This is the practical metric: how well would the weights predict expression in a held-out sample?" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Usage\n", - "\n", - "### Quick test (cheap methods, few regions)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n", - " --ld-meta-file data/ld_meta_file.tsv \\\n", - " --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed \\\n", - " --chrom 21 \\\n", - " --n-seeds 1 \\\n", - " --method-set cheap \\\n", - " -J 20" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Full benchmark (expensive methods, 10 seeds per region)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n", - " --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n", - " --ld-block-file /path/to/EUR_LD_blocks.bed \\\n", - " --chrom 22 \\\n", - " --n-seeds 10 \\\n", - " --method-set expensive \\\n", - " -J 40" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### With data saving (for debugging)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n", - " --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n", - " --ld-block-file /path/to/EUR_LD_blocks.bed \\\n", - " --chrom 22 \\\n", - " --n-seeds 1 \\\n", - " --save-data TRUE \\\n", - " -J 20" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "\n", - "### Analysis and plots" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "jp-MarkdownHeadingCollapsed": true, - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/ensemble_twas_weights.ipynb analyze_results \\\n", - " --cwd /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/ensemble_twas \\\n", - " --ld-meta-file /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/rss_ld_sketch/sketch_ld_meta.tsv \\\n", - " --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Pipeline Implementation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "# Output directory\n", - "parameter: cwd = path(\"output/ensemble_twas\")\n", - "\n", - "\n", - "# ── Genotype source ──────────────────────────────────────────\n", - "# Sketch LD metadata TSV (plink2 format: chrom, start=0, end=0, path=prefix)\n", - "# Generated by the rss_ld_sketch pipeline. Used by load_LD_matrix() to load genotypes.\n", - "parameter: ld_meta_file = path(\"\")\n", - "# LD block file defining replicate regions (BED or TSV with chrom/start/end)\n", - "# Each block becomes one replicate. Can be EUR_LD_blocks.bed or similar.\n", - "parameter: ld_block_file = path(\"\")\n", - "# Chromosome filter (0 = all chromosomes)\n", - "parameter: chrom = 0\n", - "# Max variants to subsample per region (0 = no limit)\n", - "parameter: max_variants = 0\n", - "\n", - "# ── Method selection ─────────────────────────────────────────\n", - "# \"expensive\": full 19-method list (SuSiE variants, mr.ash, BayesR family, DPR family, LASSO, Elastic Net, SCAD, MCP, L0Learn)\n", - "# \"cheap\": \"fast_default\" preset — SuSiE, mr.ash, Elastic Net, LASSO\n", - "parameter: method_set = \"expensive\"\n", - "\n", - "# ── Simulation parameters ────────────────────────────────────\n", - "parameter: h2g = 0.2\n", - "parameter: n_sparse = 3\n", - "parameter: n_oligogenic = 5\n", - "parameter: n_inf = 15\n", - "parameter: cv_folds = 5\n", - "parameter: max_cv_variants = 500\n", - "\n", - "# ── Replicate control ────────────────────────────────────────\n", - "parameter: n_seeds = 1\n", - "parameter: seed_base = 42\n", - "\n", - "# ── Data saving (for debugging) ──────────────────────────────\n", - "# When TRUE, saves raw genotype matrix X, phenotype y, and sim object\n", - "# as a separate data.rds per replicate. Produces large files.\n", - "parameter: save_data = False\n", - "\n", - "# ── Cluster resources ────────────────────────────────────────\n", - "parameter: job_size = 1\n", - "parameter: walltime = \"4:00:00\"\n", - "parameter: mem = \"16G\"\n", - "parameter: numThreads = 4\n", - "\n", - "import os\n", - "cwd = path(f'{cwd:a}')\n", - "\n", - "def _read_ld_blocks(block_file, chrom_filter):\n", - " \"\"\"Parse LD block file (BED or TSV with chrom/start/end columns).\n", - " Returns list of region dicts with id, chrom, start, end, region string.\n", - " Supports BED (0-based half-open) and TSV with #chrom header.\n", - " \"\"\"\n", - " regions = []\n", - " with open(block_file) as fh:\n", - " for line in fh:\n", - " line = line.strip()\n", - " if not line or line.startswith('#chrom') or line.startswith('chrom'):\n", - " continue\n", - " parts = line.split()\n", - " if len(parts) < 3:\n", - " continue\n", - " c = parts[0]\n", - " if not c.startswith('chr'):\n", - " c = f'chr{c}'\n", - " try:\n", - " cnum = int(c.replace('chr', ''))\n", - " except ValueError:\n", - " continue\n", - " if not (1 <= cnum <= 22):\n", - " continue\n", - " if chrom_filter != 0 and cnum != chrom_filter:\n", - " continue\n", - " start, end = int(parts[1]), int(parts[2])\n", - " regions.append({\n", - " 'id': f'{c}_{start}_{end}',\n", - " 'chrom': c,\n", - " 'start': start,\n", - " 'end': end,\n", - " 'region': f'{c}:{start}-{end}'\n", - " })\n", - " return regions\n", - "\n", - "# Validate inputs\n", - "if not str(ld_meta_file) or not os.path.isfile(str(ld_meta_file)):\n", - " raise ValueError(\"ld_meta_file is required — must point to a sketch LD metadata TSV (plink2 format)\")\n", - "if not str(ld_block_file) or not os.path.isfile(str(ld_block_file)):\n", - " raise ValueError(\"ld_block_file is required — must point to an LD block BED/TSV defining replicate regions\")\n", - "\n", - "# Build replicate list: regions x seeds\n", - "_base_regions = _read_ld_blocks(str(ld_block_file), chrom)\n", - "if not _base_regions:\n", - " raise ValueError(f\"No regions found in {ld_block_file} for chrom={chrom}\")\n", - "\n", - "replicates = []\n", - "for seed_idx in range(n_seeds):\n", - " for i, r in enumerate(_base_regions):\n", - " rep = dict(r)\n", - " rep['seed'] = seed_base + seed_idx * len(_base_regions) + i\n", - " rep['rep_id'] = f'{r[\"id\"]}_seed{seed_idx}'\n", - " replicates.append(rep)\n", - "\n", - "print(f\" {len(_base_regions)} regions x {n_seeds} seeds = {len(replicates)} replicates\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### `simulate_and_fit`\n", - "\n", - "For each replicate (region x seed):\n", - "1. Load genotype matrix $\\mathbf{G}$ from LD sketch via `pecotmr::load_LD_matrix()` using the sketch LD metadata and the region defined by the LD block file\n", - "2. Simulate expression $E$ via `simxQTL::generate_cis_qtl_data()` with mixed architecture on top of $\\mathbf{G}$\n", - "3. Run `pecotmr::twas_weights_pipeline()` with selected methods and 5-fold CV\n", - "4. Call `pecotmr::ensemble_weights()` (now exported, PR #486) to learn $\\boldsymbol{\\zeta}$ and combine weights\n", - "5. Compute metrics against ground truth and against observed $Y$\n", - "6. Save results as RDS; optionally save raw data (X, y, sim) as separate `data.rds` when `--save-data TRUE`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[simulate_and_fit]\n", - "input: for_each = \"replicates\"\n", - "output_files = [f'{cwd}/{_replicates[\"rep_id\"]}/results.rds']\n", - "if save_data:\n", - " output_files.append(f'{cwd}/{_replicates[\"rep_id\"]}/data.rds')\n", - "output: output_files\n", - "# Pre-compute data path in Python so the R script can reference it unconditionally.\n", - "# (Using ${_output[1]} directly fails with IndexError when save_data=False\n", - "# because SoS's ${} substitution is Python-eager, regardless of R control flow.)\n", - "_data_path = f'{cwd}/{_replicates[\"rep_id\"]}/data.rds' if save_data else ''\n", - "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", - "R: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - "\n", - " library(pecotmr)\n", - " library(simxQTL)\n", - " set.seed(${_replicates['seed']})\n", - " cat(\"Replicate: ${_replicates['rep_id']}\\n\")\n", - " cat(\"Seed: ${_replicates['seed']}\\n\")\n", - " cat(\"Region: ${_replicates['region']}\\n\")\n", - "\n", - " # ── 1. Load genotype from LD sketch ─────────────────────\n", - " ld_data <- load_LD_matrix(\n", - " \"${ld_meta_file}\",\n", - " region = \"${_replicates['region']}\",\n", - " return_genotype = TRUE\n", - " )\n", - " X <- ld_data$LD_matrix\n", - " X <- scale(X)\n", - " X[is.na(X)] <- 0\n", - " n <- nrow(X); p <- ncol(X)\n", - " cat(sprintf(\"Genotype matrix: %d samples x %d variants\\n\", n, p))\n", - "\n", - " max_var <- ${max_variants}\n", - " if (max_var > 0 && p > max_var) {\n", - " keep_idx <- sort(sample(p, max_var))\n", - " X <- X[, keep_idx, drop = FALSE]\n", - " p <- ncol(X)\n", - " cat(sprintf(\"Subsampled to %d variants\\n\", p))\n", - " }\n", - "\n", - " # ── 2. Simulate training phenotype ───────────────────────\n", - " sim <- generate_cis_qtl_data(X, h2g = ${h2g},\n", - " n_sparse = ${n_sparse},\n", - " n_oligogenic = ${n_oligogenic},\n", - " n_inf = ${n_inf})\n", - " y <- sim$y # training Y: G*w_true + eps_train\n", - " cat(sprintf(\"Realized h2g: %.4f residual var: %.4f\\n\",\n", - " sim$h2g, sim$residual_variance))\n", - "\n", - " # ── 2a. Build ground-truth pieces from sim$beta ───────────\n", - " if (!is.null(sim$beta) && length(sim$beta) == p) {\n", - " true_beta <- as.numeric(sim$beta)\n", - " } else {\n", - " warning(\"sim$beta missing or wrong length (\", length(sim$beta),\n", - " \" vs p=\", p, \"); defaulting to zero vector\")\n", - " true_beta <- rep(0, p)\n", - " }\n", - " true_genetic <- as.vector(X %*% true_beta)\n", - "\n", - " # ── 2b. Fresh held-out Y (same X, same w_true, new noise) ─\n", - " # This is what lets us evaluate prediction without in-sample\n", - " # overfitting bias — the weights never saw this Y.\n", - " sigma2 <- sim$residual_variance\n", - " eps_new <- rnorm(n, mean = 0, sd = sqrt(sigma2))\n", - " y_test <- true_genetic + eps_new\n", - " cat(\"Generated fresh Y_test (independent noise draw)\\n\")\n", - "\n", - " # ── 2c. Optional raw-data dump for debugging ─────────────\n", - " data_path <- \"${_data_path}\"\n", - " if (nzchar(data_path)) {\n", - " data_out <- list(\n", - " X = X, y = y, y_test = y_test, sim = sim,\n", - " region = \"${_replicates['region']}\",\n", - " rep_id = \"${_replicates['rep_id']}\",\n", - " seed = ${_replicates['seed']},\n", - " variant_info = ld_data$ref_panel\n", - " )\n", - " dir.create(dirname(data_path), recursive = TRUE, showWarnings = FALSE)\n", - " saveRDS(data_out, data_path)\n", - " cat(\"Saved raw data:\", data_path, \"\\n\")\n", - " rm(data_out)\n", - " }\n", - "\n", - " # ── 3. Define weight methods ─────────────────────────────\n", - " method_set <- \"${method_set}\"\n", - " if (method_set == \"expensive\") {\n", - " weight_methods <- list(\n", - " susie_weights = list(refine = FALSE, init_L = 5, max_L = 10),\n", - " susie_ash_weights = list(),\n", - " susie_inf_weights = list(),\n", - " mrash_weights = list(init_prior_sd = TRUE, max.iter = 100),\n", - " enet_weights = list(),\n", - " lasso_weights = list(),\n", - " bayes_r_weights = list(),\n", - " bayes_l_weights = list(),\n", - " bayes_a_weights = list(),\n", - " bayes_b_weights = list(),\n", - " bayes_c_weights = list(),\n", - " bayes_n_weights = list(),\n", - " b_lasso_weights = list(),\n", - " dpr_vb_weights = list(),\n", - " dpr_gibbs_weights = list(),\n", - " dpr_adaptive_gibbs_weights = list(),\n", - " scad_weights = list(),\n", - " mcp_weights = list(),\n", - " l0learn_weights = list()\n", - "\n", - " )\n", - " } else {\n", - " weight_methods <- list(\n", - " susie_weights = list(refine = FALSE, init_L = 5, max_L = 10),\n", - " mrash_weights = list(init_prior_sd = TRUE, max.iter = 100),\n", - " lasso_weights = list(),\n", - " enet_weights = list()\n", - " )\n", - " }\n", - "\n", - " # ── 4. TWAS weights pipeline (CV + full-data weights) ────\n", - " # Note: twas_weights_pipeline() in pecotmr >=0.5 (PR #486) does NOT accept num_threads.\n", - " # ensemble_weights() is now exported; no local R/ensemble_weights.R source needed.\n", - " cat(\"Running twas_weights_pipeline with\", length(weight_methods), \"methods...\\n\")\n", - " res <- tryCatch(\n", - " twas_weights_pipeline(\n", - " X, y,\n", - " cv_folds = ${cv_folds},\n", - " weight_methods = weight_methods,\n", - " max_cv_variants = ${max_cv_variants}\n", - " ),\n", - " error = function(e) {\n", - " cat(\"Pipeline error:\", conditionMessage(e), \"\\n\")\n", - " NULL\n", - " }\n", - " )\n", - "\n", - " if (is.null(res)) {\n", - " cat(\"Pipeline failed — exiting non-zero so SoS flags this replicate\\n\")\n", - " saveRDS(list(results = NULL, ensemble = NULL,\n", - " truth = list(rep_id = \"${_replicates['rep_id']}\",\n", - " seed = ${_replicates['seed']},\n", - " error = TRUE)),\n", - " \"${_output[0]}\")\n", - " quit(save = \"no\", status = 1)\n", - " }\n", - "\n", - " # ── 5. Ensemble learning (steps 6-9 in the algorithm) ────\n", - " cat(\"Computing ensemble weights...\\n\")\n", - " ens <- tryCatch(\n", - " pecotmr::ensemble_weights(\n", - " cv_results = res$twas_cv_result,\n", - " Y = y,\n", - " twas_weight_list = res$twas_weights\n", - " ),\n", - " error = function(e) {\n", - " cat(\"Ensemble error:\", conditionMessage(e), \"\\n\")\n", - " NULL\n", - " }\n", - " )\n", - "\n", - " # ── 6. Compute metrics ───────────────────────────────────\n", - " # Four metrics per method and per ensemble:\n", - " # mse_w = mean((w_hat - w_true)^2) weight-space, LD-independent\n", - " # mse_pred_truth = mean((X*w_hat - X*w_true)^2) prediction-space, LD-aware (what TWAS uses)\n", - " # mse_y = mean((X*w_hat - y_test)^2) held-out Y (fresh eps), honest out-of-sample\n", - " # rsq_y = cor(X*w_hat, y_test)^2 held-out Y R^2\n", - " compute_metrics <- function(w_hat) {\n", - " if (is.matrix(w_hat)) w_hat <- w_hat[, 1]\n", - " pred <- as.vector(X %*% w_hat)\n", - " mse_w <- mean((w_hat - true_beta)^2)\n", - " mse_pred_truth <- mean((pred - true_genetic)^2)\n", - " mse_y <- mean((pred - y_test)^2)\n", - " rsq_y <- if (sd(pred) > 0) cor(pred, y_test)^2 else 0\n", - " c(mse_w = mse_w,\n", - " mse_pred_truth = mse_pred_truth,\n", - " mse_y = mse_y,\n", - " rsq_y = rsq_y)\n", - " }\n", - "\n", - " method_metrics <- sapply(names(res$twas_weights), function(wname) {\n", - " compute_metrics(res$twas_weights[[wname]])\n", - " })\n", - " colnames(method_metrics) <- gsub(\"_weights$\", \"\", colnames(method_metrics))\n", - "\n", - " ensemble_metrics <- c(mse_w = NA, mse_pred_truth = NA, mse_y = NA, rsq_y = NA)\n", - " if (!is.null(ens) && !is.null(ens$ensemble_twas_weights)) {\n", - " ensemble_metrics <- compute_metrics(ens$ensemble_twas_weights)\n", - " }\n", - "\n", - "\n", - " # ── 7. Save results ──────────────────────────────────────\n", - " truth <- list(\n", - " causal_sparse = sim$sparse_indices,\n", - " causal_oligo = sim$oligogenic_indices,\n", - " causal_inf = sim$infinitesimal_indices,\n", - " true_beta = true_beta,\n", - " true_y = y,\n", - " y_test = y_test,\n", - " residual_variance = sigma2,\n", - " h2g = sim$h2g,\n", - " rep_id = \"${_replicates['rep_id']}\",\n", - " region = \"${_replicates['region']}\",\n", - " seed = ${_replicates['seed']},\n", - " n = n, p = p,\n", - " method_metrics = method_metrics,\n", - " ensemble_metrics = ensemble_metrics,\n", - " )\n", - "\n", - " out <- list(results = res, ensemble = ens, truth = truth)\n", - " dir.create(dirname(\"${_output[0]}\"), recursive = TRUE, showWarnings = FALSE)\n", - " saveRDS(out, \"${_output[0]}\")\n", - " cat(\"Saved:\", \"${_output[0]}\", \"\\n\")\n", - " tryCatch({\n", - " cat(\"Ensemble MSE (weights): \", round(ensemble_metrics[\"mse_w\"], 6), \"\\n\")\n", - " cat(\"Ensemble MSE (pred vs truth):\",\n", - " round(ensemble_metrics[\"mse_pred_truth\"], 6), \"\\n\")\n", - " cat(\"Ensemble MSE (held-out Y): \",\n", - " round(ensemble_metrics[\"mse_y\"], 6), \"\\n\")\n", - " cat(\"Ensemble R^2 (held-out Y): \",\n", - " round(ensemble_metrics[\"rsq_y\"], 4), \"\\n\")\n", - " }, error = function(e) cat(\"Metric-print failed (non-fatal):\", conditionMessage(e), \"\\n\"))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### `analyze_results`\n", - "\n", - "Load all per-replicate RDS files and produce:\n", - "1. **MSE of weights** — $\\frac{1}{p}\\|\\hat{\\mathbf{w}} - \\mathbf{w}_{\\text{true}}\\|^2$ (LD-independent)\n", - "2. **MSE of predicted genetic component** — $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$ (LD-aware, TWAS-relevant)\n", - "3. **MSE vs held-out $Y$** — $\\frac{1}{n}\\|Y_{\\text{test}} - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y_{\\text{test}} = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\varepsilon_{\\text{new}}$\n", - "4. **$R^2$ vs held-out $Y$** — $\\text{cor}(Y_{\\text{test}},\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$\n", - "5. **Coefficient heatmap** — ensemble $\\zeta_k$ across replicates\n", - "6. **Scatter plot** — ensemble $R^2$ vs. best single-method $R^2$\n", - "7. **Summary CSV** — one row per replicate with all metrics" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[analyze_results]\n", - "parameter: results_dir = path(f'{cwd}')\n", - "output: f'{results_dir}/diagnostics.pdf', f'{results_dir}/summary_table.csv'\n", - "task: trunk_workers = 1, walltime = \"1:00:00\", mem = \"8G\", cores = 1\n", - "R: expand = \"${ }\", stderr = f'{results_dir}/analyze.stderr', stdout = f'{results_dir}/analyze.stdout'\n", - "\n", - " library(ggplot2)\n", - " library(tidyr)\n", - " library(dplyr)\n", - "\n", - " results_dir <- \"${results_dir}\"\n", - "\n", - " # ── 1. Load all RDS files ────────────────────────────────\n", - " rds_files <- list.files(results_dir, pattern = \"results\\\\.rds$\",\n", - " recursive = TRUE, full.names = TRUE)\n", - " cat(sprintf(\"Found %d RDS files\\n\", length(rds_files)))\n", - "\n", - " all_data <- lapply(rds_files, function(f) {\n", - " tryCatch(readRDS(f), error = function(e) NULL)\n", - " })\n", - " all_data <- all_data[!sapply(all_data, is.null)]\n", - " all_data <- all_data[!sapply(all_data, function(x) isTRUE(x$truth$error))]\n", - " cat(sprintf(\"%d successful replicates\\n\", length(all_data)))\n", - "\n", - " if (length(all_data) == 0) stop(\"No valid results found\")\n", - "\n", - " # ── 2. Extract metrics ───────────────────────────────────\n", - " metrics_list <- lapply(seq_along(all_data), function(i) {\n", - " x <- all_data[[i]]\n", - " coefs <- if (!is.null(x$ensemble)) x$ensemble$method_coef else NULL\n", - " list(\n", - " rep_id = x$truth$rep_id, seed = x$truth$seed, region = x$truth$region,\n", - " method_coef = coefs,\n", - " method_metrics = x$truth$method_metrics,\n", - " ensemble_metrics = x$truth$ensemble_metrics\n", - " )\n", - " })\n", - "\n", - " # ── 3. Tidy data frames ──────────────────────────────────\n", - " build_metric_df <- function(metric_name) {\n", - " do.call(rbind, lapply(metrics_list, function(m) {\n", - " if (is.null(m$method_metrics)) return(NULL)\n", - " if (!(metric_name %in% rownames(m$method_metrics))) return(NULL)\n", - " methods <- colnames(m$method_metrics)\n", - " vals <- m$method_metrics[metric_name, ]\n", - " method_df <- data.frame(rep_id = m$rep_id, method = methods,\n", - " value = as.numeric(vals),\n", - " stringsAsFactors = FALSE)\n", - " ens_df <- data.frame(rep_id = m$rep_id, method = \"ensemble\",\n", - " value = as.numeric(m$ensemble_metrics[metric_name]),\n", - " stringsAsFactors = FALSE)\n", - " rbind(method_df, ens_df)\n", - " }))\n", - " }\n", - "\n", - "\n", - " mse_w_df <- build_metric_df(\"mse_w\")\n", - " mse_pred_truth_df <- build_metric_df(\"mse_pred_truth\")\n", - " mse_y_df <- build_metric_df(\"mse_y\")\n", - " rsq_y_df <- build_metric_df(\"rsq_y\")\n", - "\n", - " coef_mat <- do.call(rbind, lapply(metrics_list, function(m) {\n", - " if (is.null(m$method_coef)) return(NULL)\n", - " m$method_coef\n", - " }))\n", - " if (!is.null(coef_mat)) {\n", - " rownames(coef_mat) <- sapply(metrics_list[!sapply(metrics_list,\n", - " function(m) is.null(m$method_coef))], function(m) m$rep_id)\n", - " }\n", - "\n", - " scatter_df <- do.call(rbind, lapply(metrics_list, function(m) {\n", - " if (is.null(m$method_metrics)) return(NULL)\n", - " if (!(\"rsq_y\" %in% rownames(m$method_metrics))) return(NULL)\n", - " rsq_vals <- m$method_metrics[\"rsq_y\", ]\n", - " data.frame(\n", - " rep_id = m$rep_id,\n", - " best_single_rsq = max(rsq_vals, na.rm = TRUE),\n", - " best_method = names(which.max(rsq_vals)),\n", - " ensemble_rsq = as.numeric(m$ensemble_metrics[\"rsq_y\"]),\n", - " stringsAsFactors = FALSE\n", - " )\n", - " }))\n", - "\n", - " # ── 4. Diagnostic plots ──────────────────────────────────\n", - " make_boxplot <- function(df, y_label, title, descending = FALSE) {\n", - " if (is.null(df) || nrow(df) == 0) return(NULL)\n", - " ord <- df %>% group_by(method) %>%\n", - " summarise(med = median(value, na.rm = TRUE))\n", - " if (descending) ord <- ord %>% arrange(desc(med)) else ord <- ord %>% arrange(med)\n", - " df$method <- factor(df$method, levels = ord$method)\n", - " ggplot(df, aes(x = method, y = value,\n", - " fill = ifelse(method == \"ensemble\", \"Ensemble\", \"Single\"))) +\n", - " geom_boxplot(outlier.size = 0.8) +\n", - " scale_fill_manual(values = c(\"Ensemble\" = \"#E41A1C\",\n", - " \"Single\" = \"#377EB8\"), name = \"\") +\n", - " labs(title = title, x = \"Method\", y = y_label) +\n", - " theme_bw(base_size = 12) +\n", - " theme(axis.text.x = element_text(angle = 45, hjust = 1))\n", - " }\n", - "\n", - " pdf(\"${_output[0]}\", width = 12, height = 8)\n", - "\n", - "\n", - " p <- make_boxplot(mse_w_df, \"MSE of weights\",\n", - " \"MSE on Weights (LD-independent): mean((w_hat - w_true)^2)\",\n", - " descending = TRUE)\n", - " if (!is.null(p)) print(p)\n", - "\n", - " p <- make_boxplot(mse_pred_truth_df, \"MSE(G*w_hat vs G*w_true)\",\n", - " \"MSE on Predicted Genetic Component (LD-aware, TWAS-relevant)\",\n", - " descending = TRUE)\n", - " if (!is.null(p)) print(p)\n", - "\n", - " p <- make_boxplot(mse_y_df, \"MSE (vs held-out Y)\",\n", - " \"MSE Against Held-Out Y (fresh noise — honest out-of-sample)\",\n", - " descending = TRUE)\n", - " if (!is.null(p)) print(p)\n", - "\n", - " p <- make_boxplot(rsq_y_df, expression(R^2~(vs~held-out~Y)),\n", - " \"R-squared Against Held-Out Y\")\n", - " if (!is.null(p)) print(p)\n", - "\n", - " if (!is.null(coef_mat) && nrow(coef_mat) > 1) {\n", - " coef_long <- as.data.frame(coef_mat) %>%\n", - " mutate(replicate = rownames(coef_mat)) %>%\n", - " pivot_longer(-replicate, names_to = \"method\", values_to = \"zeta\")\n", - " p5 <- ggplot(coef_long, aes(x = method, y = replicate, fill = zeta)) +\n", - " geom_tile() +\n", - " scale_fill_gradient(low = \"white\", high = \"#E41A1C\",\n", - " name = expression(zeta[k]), limits = c(0, 1)) +\n", - " labs(title = \"Ensemble Coefficients Across Replicates\",\n", - " x = \"Method\", y = \"Replicate\") +\n", - " theme_bw(base_size = 10) +\n", - " theme(axis.text.x = element_text(angle = 45, hjust = 1),\n", - " axis.text.y = element_text(size = 6))\n", - " print(p5)\n", - " }\n", - "\n", - " if (!is.null(scatter_df) && nrow(scatter_df) > 0) {\n", - " p6 <- ggplot(scatter_df, aes(x = best_single_rsq, y = ensemble_rsq)) +\n", - " geom_point(alpha = 0.6, size = 2, color = \"#377EB8\") +\n", - " geom_abline(intercept = 0, slope = 1, linetype = \"dashed\",\n", - " color = \"grey40\") +\n", - " labs(title = \"Ensemble vs Best Single Method (R^2 vs held-out Y)\",\n", - " x = expression(Best~single~method~R^2),\n", - " y = expression(Ensemble~R^2)) +\n", - " theme_bw(base_size = 12) + coord_equal()\n", - " print(p6)\n", - " }\n", - "\n", - " dev.off()\n", - " cat(\"Saved plots:\", \"${_output[0]}\", \"\\n\")\n", - "\n", - " # ── 5. Summary table ─────────────────────────────────────\n", - " metric_rows <- c(\"mse_w\", \"mse_pred_truth\", \"mse_y\", \"rsq_y\")\n", - " summary_rows <- lapply(metrics_list, function(m) {\n", - " row <- data.frame(\n", - " rep_id = m$rep_id, region = m$region, seed = m$seed,\n", - " stringsAsFactors = FALSE\n", - " )\n", - " for (mr in metric_rows) {\n", - " row[[paste0(\"ensemble_\", mr)]] <- as.numeric(m$ensemble_metrics[mr])\n", - " }\n", - " if (!is.null(m$method_metrics)) {\n", - " for (mn in colnames(m$method_metrics)) {\n", - " for (mr in metric_rows) {\n", - " if (mr %in% rownames(m$method_metrics)) {\n", - " row[[paste0(mn, \"_\", mr)]] <- m$method_metrics[mr, mn]\n", - " }\n", - " }\n", - " }\n", - " }\n", - " if (!is.null(m$method_coef)) {\n", - " for (mn in names(m$method_coef)) {\n", - " row[[paste0(\"zeta_\", mn)]] <- m$method_coef[[mn]]\n", - " }\n", - " }\n", - " row\n", - " })\n", - " summary_df <- bind_rows(summary_rows)\n", - "\n", - " method_rsq_cols <- grep(\"_rsq_y$\", names(summary_df), value = TRUE)\n", - " method_rsq_cols <- setdiff(method_rsq_cols, \"ensemble_rsq_y\")\n", - " if (length(method_rsq_cols) > 0) {\n", - " summary_df$best_method_rsq_y <- apply(summary_df[method_rsq_cols], 1,\n", - " max, na.rm = TRUE)\n", - " summary_df$best_method <- apply(summary_df[method_rsq_cols], 1, function(r) {\n", - " gsub(\"_rsq_y$\", \"\", names(which.max(r)))\n", - " })\n", - " summary_df$rsq_y_improvement <-\n", - " summary_df$ensemble_rsq_y - summary_df$best_method_rsq_y\n", - " }\n", - "\n", - " write.csv(summary_df, \"${_output[1]}\", row.names = FALSE)\n", - " cat(\"Saved summary:\", \"${_output[1]}\", \"\\n\")\n", - "\n", - " cat(\"\\n=== Summary ===\\n\")\n", - " cat(sprintf(\"Replicates: %d\\n\", nrow(summary_df)))\n", - " cat(sprintf(\"Mean ensemble MSE (weights) : %.6f\\n\",\n", - " mean(summary_df$ensemble_mse_w, na.rm = TRUE)))\n", - " cat(sprintf(\"Mean ensemble MSE (pred truth) : %.6f\\n\",\n", - " mean(summary_df$ensemble_mse_pred_truth, na.rm = TRUE)))\n", - " cat(sprintf(\"Mean ensemble MSE (held-out Y) : %.6f\\n\",\n", - " mean(summary_df$ensemble_mse_y, na.rm = TRUE)))\n", - " cat(sprintf(\"Mean ensemble R^2 (held-out Y) : %.4f\\n\",\n", - " mean(summary_df$ensemble_rsq_y, na.rm = TRUE)))\n", - " if (\"rsq_y_improvement\" %in% names(summary_df)) {\n", - " cat(sprintf(\"Ensemble >= best in %.1f%% of replicates (R^2 vs held-out Y)\\n\",\n", - " 100 * mean(summary_df$rsq_y_improvement >= 0, na.rm = TRUE)))\n", - " }" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Expected Results\n", - "\n", - "With default simulation parameters ($h^2_g = 0.2$, mixed architecture) on genotype regions:\n", - "\n", - "- **MSE against truth vs. MSE against $Y$**: MSE against truth isolates weight estimation quality (unaffected by noise). MSE against $Y$ includes irreducible noise $\\sigma^2_\\epsilon$, so it is always larger.\n", - "- **Coefficient patterns**: Sparse architectures favor SuSiE/LASSO; polygenic architectures favor BayesR/DPR; the ensemble adapts by assigning larger $\\zeta_k$ to the method best suited to each region's architecture.\n", - "- **Scatter plot**: Most points should lie on or above the diagonal, indicating ensemble matches or improves on the best single method." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "SoS", - "language": "sos", - "name": "sos" - }, - "language_info": { - "codemirror_mode": "sos", - "file_extension": ".sos", - "mimetype": "text/x-sos", - "name": "sos", - "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", - "pygments_lexer": "sos" - }, - "sos": { - "kernels": [ - [ - "R", - "ir", - "R", - "", - "r" - ], - [ - "SoS", - "sos", - "", - "", - "sos" - ] - ] - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} From b27a39f3f394a11044990357d6157c833d06fb84 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 19 May 2026 09:22:28 -0400 Subject: [PATCH 4/4] Add new QC notebooks to the table of contents --- website/_toc.yml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/website/_toc.yml b/website/_toc.yml index d5efb01d2..5b919a091 100644 --- a/website/_toc.yml +++ b/website/_toc.yml @@ -26,8 +26,11 @@ parts: - file: code/molecular_phenotypes/calling/RNA_calling.ipynb - file: code/molecular_phenotypes/QC/bulk_expression_QC.ipynb - file: code/molecular_phenotypes/QC/bulk_expression_normalization.ipynb + - file: code/molecular_phenotypes/QC/apa_impute.ipynb - file: code/molecular_phenotypes/snRNAseq_preprocessing.ipynb - file: code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb + - file: code/molecular_phenotypes/QC/pseudobulk_expression_aggregation_QC_norm.ipynb + - file: code/molecular_phenotypes/QC/pseudobulk_mega_expression_QC_and_normalization.ipynb - file: code/molecular_phenotypes/methylation.ipynb sections: - file: code/molecular_phenotypes/calling/methylation_calling.ipynb