From a17c30aab08ae428b19b4f909810e47b1a23582b Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Mon, 11 May 2026 15:48:54 -0400
Subject: [PATCH 1/4] Ensemble TWAS weight notebook

Remove CV R^2 and update based on PR #486 pecotmr
---
 .../ensemble_twas_weights.ipynb               | 992 ++++++++++++++++++
 1 file changed, 992 insertions(+)
 create mode 100644 code/pecotmr_integration/ensemble_twas_weights.ipynb

diff --git a/code/pecotmr_integration/ensemble_twas_weights.ipynb b/code/pecotmr_integration/ensemble_twas_weights.ipynb
new file mode 100644
index 000000000..9277af022
--- /dev/null
+++ b/code/pecotmr_integration/ensemble_twas_weights.ipynb
@@ -0,0 +1,992 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "# Ensemble TWAS Weights via Stacked Regression\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "`pecotmr` offers 10+ TWAS weight methods (SuSiE, LASSO, Elastic Net, mr.ash, BayesR, DPR, etc.), each suited to different genetic architectures. Picking one method is arbitrary; running all and testing each incurs a multiple-testing penalty. **Stacked regression** ([SR-TWAS; Dai et al.,2024](https://doi.org/10.1038/s41467-024-50983-w)) provides a principled alternative: learn one convex combination of methods per gene, producing a single weight vector for downstream TWAS testing with no method-selection bias.\n",
+    "\n",
+    "This notebook:\n",
+    "1. Presents the mathematical framework and algorithm for ensemble weight learning\n",
+    "2. Benchmarks the approach via simulation using **genotypes** from LD sketch reference panels, so that realistic LD patterns are preserved\n",
+    "3. Produces diagnostic plots comparing MSE against ground truth and MSE against observed $Y$\n",
+    "\n",
+    "| Step | Description | Parallelism |\n",
+    "|------|-------------|-------------|\n",
+    "| `simulate_and_fit` | Load X from sketch LD, simulate Y, run methods + ensemble, save RDS | Per-region (`for_each`) |\n",
+    "| `analyze_results` | Load RDS files, compute metrics, generate diagnostic plots | Single task |"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Mathematical Framework\n",
+    "\n",
+    "### Gene Expression Model\n",
+    "\n",
+    "For a target gene $g$ with $p$ cis-SNPs, the expression level of individual $i$ is:\n",
+    "\n",
+    "$$E_i = \\mathbf{G}_i \\mathbf{w} + \\epsilon_i, \\quad \\epsilon_i \\sim N(0, \\sigma^2_\\epsilon)$$\n",
+    "\n",
+    "where $\\mathbf{G}_i \\in \\mathbb{R}^{1 \\times p}$ is the genotype vector and $\\mathbf{w} \\in \\mathbb{R}^p$ is the true cis-eQTL effect-size vector.\n",
+    "\n",
+    "### Base Models\n",
+    "\n",
+    "We train $K$ prediction models $\\{m_1, \\ldots, m_K\\}$, each producing a weight estimate $\\hat{\\mathbf{w}}_k \\in \\mathbb{R}^p$. Under $F$-fold cross-validation, the out-of-fold prediction for sample $i$ from method $k$ is:\n",
+    "\n",
+    "$$\\hat{E}_i^{(k)} = \\mathbf{G}_i \\hat{\\mathbf{w}}_k^{(-f_i)}$$\n",
+    "\n",
+    "where $f_i$ denotes the fold containing sample $i$ and $\\hat{\\mathbf{w}}_k^{(-f)}$ is trained on all samples except fold $f$. Stacking all $n$ out-of-fold predictions across $K$ methods yields the prediction matrix:\n",
+    "\n",
+    "$$\\mathbf{P} = \\begin{pmatrix} \\hat{E}_1^{(1)} & \\cdots & \\hat{E}_1^{(K)} \\\\ \\vdots & \\ddots & \\vdots \\\\ \\hat{E}_n^{(1)} & \\cdots & \\hat{E}_n^{(K)} \\end{pmatrix} \\in \\mathbb{R}^{n \\times K}$$\n",
+    "\n",
+    "### Stacked Regression (Constrained QP)\n",
+    "\n",
+    "We seek ensemble coefficients $\\boldsymbol{\\zeta} = (\\zeta_1, \\ldots, \\zeta_K)^\\top$ that minimize the prediction error of the convex combination:\n",
+    "\n",
+    "$$\\min_{\\boldsymbol{\\zeta}} \\;\\; \\left\\| \\mathbf{E} - \\sum_{k=1}^{K} \\zeta_k \\hat{\\mathbf{E}}^{(k)} \\right\\|^2 = \\left\\| \\mathbf{E} - \\mathbf{P} \\boldsymbol{\\zeta} \\right\\|^2$$\n",
+    "\n",
+    "$$\\text{subject to} \\quad \\zeta_k \\geq 0 \\;\\; \\forall k, \\qquad \\sum_{k=1}^{K} \\zeta_k = 1$$\n",
+    "\n",
+    "This is a convex quadratic program solved via `quadprog::solve.QP` with:\n",
+    "\n",
+    "$$\\mathbf{D} = \\mathbf{P}^\\top \\mathbf{P} + \\lambda \\mathbf{I}, \\quad \\mathbf{d} = \\mathbf{P}^\\top \\mathbf{E}$$\n",
+    "\n",
+    "where $\\lambda = 10^{-8} \\cdot \\text{mean}(\\text{diag}(\\mathbf{P}^\\top \\mathbf{P}))$ provides ridge regularization for numerical stability.\n",
+    "\n",
+    "### Ensemble Weight Vector\n",
+    "\n",
+    "The final ensemble TWAS weight vector combines the full-data weight estimates:\n",
+    "\n",
+    "$$\\tilde{\\mathbf{w}} = \\sum_{k=1}^{K} \\zeta_k^* \\hat{\\mathbf{w}}_k$$\n",
+    "\n",
+    "### Performance Metrics\n",
+    "\n",
+    "Two complementary evaluation criteria:\n",
+    "\n",
+    "- **Against ground truth** (oracle): $\\text{MSE}_{\\text{truth}} = \\frac{1}{n} \\| \\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}} \\|^2$\n",
+    "- **Against observed $Y$** (practical): $R^2_Y = \\text{cor}(Y,\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$ and $\\text{MSE}_Y = \\frac{1}{n}\\|Y - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Algorithm\n",
+    "\n",
+    "The algorithm below shows the existing pecotmr cross-validation framework (Steps 1--5) and the **new ensemble learning extension** (Steps 6--9, in bold). Step numbers reference the mathematical formulation above.\n",
+    "\n",
+    "```\n",
+    "Algorithm: TWAS Weight Estimation with Ensemble Learning\n",
+    "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n",
+    "\n",
+    "Input:  Genotype matrix G (n x p), expression vector E (n x 1),\n",
+    "        weight methods M = {m_1, ..., m_K}, number of CV folds F\n",
+    "Output: Per-method weights {w_k}, ensemble coefficients {zeta_k},\n",
+    "        combined weight vector w_tilde, CV performance metrics\n",
+    "\n",
+    "── Existing Framework: twas_weights_cv() + twas_weights() ──────\n",
+    "\n",
+    "1. Partition n samples into F folds: {S_1, ..., S_F}\n",
+    "2. For each fold f = 1, ..., F:\n",
+    "   For each method k = 1, ..., K:\n",
+    "     a. Train:   w_k^(-f)  <-  m_k(G[T_f,], E[T_f])\n",
+    "     b. Predict:  E_hat^(k)[S_f]  <-  G[S_f,] * w_k^(-f)\n",
+    "3. Assemble out-of-fold prediction matrix P  (n x K)\n",
+    "     P[i,k] = E_hat_i^(k) from the fold containing sample i\n",
+    "4. Compute per-method performance:\n",
+    "     R^2_k = cor(E, P[,k])^2    for k = 1, ..., K\n",
+    "5. Train final weights on full data:\n",
+    "     w_k  <-  m_k(G, E)    for all k\n",
+    "\n",
+    "── NEW: Ensemble Learning via ensemble_weights() ──────────────\n",
+    "\n",
+    "  6. Solve constrained QP for ensemble coefficients:\n",
+    "       D  <-  P^T P + lambda * I      (ridge regularization)\n",
+    "       d  <-  P^T E\n",
+    "       zeta*  <-  argmin  1/2 zeta^T D zeta - d^T zeta\n",
+    "                  s.t.  sum(zeta_k) = 1,  zeta_k >= 0  for all k\n",
+    "  7. Compute ensemble prediction:\n",
+    "       E_hat_ens  =  P zeta*\n",
+    "  8. Compute ensemble performance:\n",
+    "       R^2_ens  =  cor(E, E_hat_ens)^2\n",
+    "  9. Combine final weight vectors:\n",
+    "       w_tilde  =  sum_k  zeta_k * w_k\n",
+    "\n",
+    "Return: {w_k}, {zeta_k}, w_tilde, {R^2_k}, R^2_ens\n",
+    "```\n",
+    "\n",
+    "**Implementation**: Steps 1--5 are handled internally by `pecotmr::twas_weights_pipeline()`. Steps 6--9 are `pecotmr::ensemble_weights()`, now exported in PR #486 (no local source file needed)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Weight Methods\n",
+    "\n",
+    "Two method sets are provided depending on computational budget:\n",
+    "\n",
+    "### Expensive (default)\n",
+    "\n",
+    "| # | Method | Type | pecotmr function | Key parameters |\n",
+    "|---|--------|------|------------------|----------------|\n",
+    "| 1 | SuSiE | Bayesian sparse (L0-like) | `susie_weights` | `refine=FALSE, init_L=5, max_L=10` |\n",
+    "| 2 | SuSiE-ash | Bayesian sparse + ash shrinkage | `susie_ash_weights` | default |\n",
+    "| 3 | SuSiE-inf | Bayesian sparse + infinitesimal | `susie_inf_weights` | default |\n",
+    "| 4 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` | `init_prior_sd=TRUE, max.iter=100` |\n",
+    "| 5 | Elastic Net | L1+L2 penalized (glmnet) | `enet_weights` | default (`alpha=0.5`) |\n",
+    "| 6 | LASSO | L1 penalized (glmnet) | `lasso_weights` | default |\n",
+    "| 7 | BayesR | Bayesian mixture of normals | `bayes_r_weights` | default (`qgg`) |\n",
+    "| 8 | BayesL | Bayesian LASSO | `bayes_l_weights` | default |\n",
+    "| 9 | BayesA | Bayesian ridge-like | `bayes_a_weights` | default |\n",
+    "| 10 | BayesB | Bayesian variable selection | `bayes_b_weights` | default |\n",
+    "| 11 | BayesC | Bayesian spike-and-slab | `bayes_c_weights` | default |\n",
+    "| 12 | BayesN | Bayesian normal mixture | `bayes_n_weights` | default |\n",
+    "| 13 | B-LASSO | Bayesian LASSO variant | `b_lasso_weights` | default |\n",
+    "| 14 | DPR-VB | Dirichlet process regression (VB) | `dpr_vb_weights` | default (`RcppDPR`) |\n",
+    "| 15 | DPR-Gibbs | Dirichlet process regression (Gibbs) | `dpr_gibbs_weights` | default |\n",
+    "| 16 | DPR-Adaptive Gibbs | Dirichlet process regression (adaptive) | `dpr_adaptive_gibbs_weights` | default |\n",
+    "| 17 | SCAD | SCAD penalized | `scad_weights` | default |\n",
+    "| 18 | MCP | MCP penalized | `mcp_weights` | default |\n",
+    "| 19 | L0Learn | Best-subset approximation | `l0learn_weights` | default |\n",
+    "\n",
+    "### Cheap (fast iteration)\n",
+    "\n",
+    "| # | Method | Type | pecotmr function |\n",
+    "|---|--------|------|------------------|\n",
+    "| 1 | SuSiE | Bayesian sparse | `susie_weights` |\n",
+    "| 2 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` |\n",
+    "| 3 | LASSO | L1 penalized | `lasso_weights` |\n",
+    "| 4 | Elastic Net | L1+L2 penalized | `enet_weights` |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Simulation Design\n",
+    "\n",
+    "### Genotypes\n",
+    "\n",
+    "Each replicate uses **genotypes** loaded from LD sketch reference panels via `pecotmr::load_LD_matrix(meta, region, return_genotype = TRUE)`. Preserving LD patterns is critical for faithfully benchmarking TWAS weight methods — no synthetic genotype simulation can reproduce the complex correlation structure of actual genomic data.\n",
+    "\n",
+    "The pipeline requires two input files:\n",
+    "\n",
+    "1. **`ld_meta_file`** — Sketch LD metadata TSV pointing to plink2 files (one row per chromosome, `start=0, end=0` sentinel). Generated by the `rss_ld_sketch` pipeline. Example:\n",
+    "   ```\n",
+    "   #chrom  start  end  path\n",
+    "   22      0      0    ADSP.R5.EUR.chr22\n",
+    "   ```\n",
+    "   The `path` column is the plink2 file prefix (resolved relative to the TSV directory). The corresponding `.pgen`, `.pvar`, `.psam`, `.afreq` files must exist.\n",
+    "\n",
+    "2. **`ld_block_file`** — BED or TSV file defining LD block boundaries (one row per block). Each block becomes a replicate region. Example (BED format, 0-based half-open):\n",
+    "   ```\n",
+    "   chr22  16051249  17614263\n",
+    "   chr22  17614263  18834263\n",
+    "   ...\n",
+    "   ```\n",
+    "   Or TSV with `#chrom  start  end` header. These define the genomic windows passed to `load_LD_matrix()` to extract variants per region.\n",
+    "\n",
+    "### Phenotypes\n",
+    "\n",
+    "- **Sparse**: `n_sparse` variants with large effects\n",
+    "- **Oligogenic**: `n_oligogenic` variants with moderate effects\n",
+    "- **Infinitesimal**: `n_inf` variants with small effects\n",
+    "- Total heritability: $h^2_g$ (default 0.2)\n",
+    "\n",
+    "### Replicates\n",
+    "\n",
+    "Each genomic region in the `ld_block_file` defines one replicate. The `n_seeds` parameter controls the number of independent phenotype realizations per region (different random seeds produce different causal architectures on the same genotype matrix).\n",
+    "\n",
+    "### Data Saving\n",
+    "\n",
+    "Set `save_data = TRUE` to save the raw genotype matrix, simulated phenotype, and ground-truth causal information in a separate `data.rds` file per replicate. This is useful for debugging but produces large files (one per region x seed). Disabled by default.\n",
+    "\n",
+    "### Evaluation Metrics\n",
+    "\n",
+    "Two complementary metrics per replicate:\n",
+    "\n",
+    "1. **Against ground truth** — MSE of predicted genetic component vs. true genetic component: $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$. This measures how well the method recovers the true signal.\n",
+    "\n",
+    "2. **Against observed $Y$** — $R^2$ and MSE of predictions vs. observed phenotype $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$. This is the practical metric: how well would the weights predict expression in a held-out sample?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Usage\n",
+    "\n",
+    "### Quick test (cheap methods, few regions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n",
+    "    --ld-meta-file data/ld_meta_file.tsv \\\n",
+    "    --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed \\\n",
+    "    --chrom 21 \\\n",
+    "    --n-seeds 1 \\\n",
+    "    --method-set cheap \\\n",
+    "    -J 20"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Full benchmark (expensive methods, 10 seeds per region)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n",
+    "    --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n",
+    "    --ld-block-file /path/to/EUR_LD_blocks.bed \\\n",
+    "    --chrom 22 \\\n",
+    "    --n-seeds 10 \\\n",
+    "    --method-set expensive \\\n",
+    "    -J 40"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### With data saving (for debugging)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n",
+    "    --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n",
+    "    --ld-block-file /path/to/EUR_LD_blocks.bed \\\n",
+    "    --chrom 22 \\\n",
+    "    --n-seeds 1 \\\n",
+    "    --save-data TRUE \\\n",
+    "    -J 20"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "\n",
+    "### Analysis and plots"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true,
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/ensemble_twas_weights.ipynb analyze_results \\\n",
+    "    --cwd /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/ensemble_twas \\\n",
+    "    --ld-meta-file /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/rss_ld_sketch/sketch_ld_meta.tsv \\\n",
+    "    --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "# Pipeline Implementation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[global]\n",
+    "# Output directory\n",
+    "parameter: cwd = path(\"output/ensemble_twas\")\n",
+    "\n",
+    "\n",
+    "# ── Genotype source ──────────────────────────────────────────\n",
+    "# Sketch LD metadata TSV (plink2 format: chrom, start=0, end=0, path=prefix)\n",
+    "# Generated by the rss_ld_sketch pipeline. Used by load_LD_matrix() to load genotypes.\n",
+    "parameter: ld_meta_file = path(\"\")\n",
+    "# LD block file defining replicate regions (BED or TSV with chrom/start/end)\n",
+    "# Each block becomes one replicate. Can be EUR_LD_blocks.bed or similar.\n",
+    "parameter: ld_block_file = path(\"\")\n",
+    "# Chromosome filter (0 = all chromosomes)\n",
+    "parameter: chrom = 0\n",
+    "# Max variants to subsample per region (0 = no limit)\n",
+    "parameter: max_variants = 0\n",
+    "\n",
+    "# ── Method selection ─────────────────────────────────────────\n",
+    "# \"expensive\": full 19-method list (SuSiE variants, mr.ash, BayesR family, DPR family, LASSO, Elastic Net, SCAD, MCP, L0Learn)\n",
+    "# \"cheap\":     \"fast_default\" preset — SuSiE, mr.ash, Elastic Net, LASSO\n",
+    "parameter: method_set = \"expensive\"\n",
+    "\n",
+    "# ── Simulation parameters ────────────────────────────────────\n",
+    "parameter: h2g = 0.2\n",
+    "parameter: n_sparse = 3\n",
+    "parameter: n_oligogenic = 5\n",
+    "parameter: n_inf = 15\n",
+    "parameter: cv_folds = 5\n",
+    "parameter: max_cv_variants = 500\n",
+    "\n",
+    "# ── Replicate control ────────────────────────────────────────\n",
+    "parameter: n_seeds = 1\n",
+    "parameter: seed_base = 42\n",
+    "\n",
+    "# ── Data saving (for debugging) ──────────────────────────────\n",
+    "# When TRUE, saves raw genotype matrix X, phenotype y, and sim object\n",
+    "# as a separate data.rds per replicate. Produces large files.\n",
+    "parameter: save_data = False\n",
+    "\n",
+    "# ── Cluster resources ────────────────────────────────────────\n",
+    "parameter: job_size = 1\n",
+    "parameter: walltime = \"4:00:00\"\n",
+    "parameter: mem = \"16G\"\n",
+    "parameter: numThreads = 4\n",
+    "\n",
+    "import os\n",
+    "cwd = path(f'{cwd:a}')\n",
+    "\n",
+    "def _read_ld_blocks(block_file, chrom_filter):\n",
+    "    \"\"\"Parse LD block file (BED or TSV with chrom/start/end columns).\n",
+    "    Returns list of region dicts with id, chrom, start, end, region string.\n",
+    "    Supports BED (0-based half-open) and TSV with #chrom header.\n",
+    "    \"\"\"\n",
+    "    regions = []\n",
+    "    with open(block_file) as fh:\n",
+    "        for line in fh:\n",
+    "            line = line.strip()\n",
+    "            if not line or line.startswith('#chrom') or line.startswith('chrom'):\n",
+    "                continue\n",
+    "            parts = line.split()\n",
+    "            if len(parts) < 3:\n",
+    "                continue\n",
+    "            c = parts[0]\n",
+    "            if not c.startswith('chr'):\n",
+    "                c = f'chr{c}'\n",
+    "            try:\n",
+    "                cnum = int(c.replace('chr', ''))\n",
+    "            except ValueError:\n",
+    "                continue\n",
+    "            if not (1 <= cnum <= 22):\n",
+    "                continue\n",
+    "            if chrom_filter != 0 and cnum != chrom_filter:\n",
+    "                continue\n",
+    "            start, end = int(parts[1]), int(parts[2])\n",
+    "            regions.append({\n",
+    "                'id': f'{c}_{start}_{end}',\n",
+    "                'chrom': c,\n",
+    "                'start': start,\n",
+    "                'end': end,\n",
+    "                'region': f'{c}:{start}-{end}'\n",
+    "            })\n",
+    "    return regions\n",
+    "\n",
+    "# Validate inputs\n",
+    "if not str(ld_meta_file) or not os.path.isfile(str(ld_meta_file)):\n",
+    "    raise ValueError(\"ld_meta_file is required — must point to a sketch LD metadata TSV (plink2 format)\")\n",
+    "if not str(ld_block_file) or not os.path.isfile(str(ld_block_file)):\n",
+    "    raise ValueError(\"ld_block_file is required — must point to an LD block BED/TSV defining replicate regions\")\n",
+    "\n",
+    "# Build replicate list: regions x seeds\n",
+    "_base_regions = _read_ld_blocks(str(ld_block_file), chrom)\n",
+    "if not _base_regions:\n",
+    "    raise ValueError(f\"No regions found in {ld_block_file} for chrom={chrom}\")\n",
+    "\n",
+    "replicates = []\n",
+    "for seed_idx in range(n_seeds):\n",
+    "    for i, r in enumerate(_base_regions):\n",
+    "        rep = dict(r)\n",
+    "        rep['seed'] = seed_base + seed_idx * len(_base_regions) + i\n",
+    "        rep['rep_id'] = f'{r[\"id\"]}_seed{seed_idx}'\n",
+    "        replicates.append(rep)\n",
+    "\n",
+    "print(f\"  {len(_base_regions)} regions x {n_seeds} seeds = {len(replicates)} replicates\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### `simulate_and_fit`\n",
+    "\n",
+    "For each replicate (region x seed):\n",
+    "1. Load genotype matrix $\\mathbf{G}$ from LD sketch via `pecotmr::load_LD_matrix()` using the sketch LD metadata and the region defined by the LD block file\n",
+    "2. Simulate expression $E$ via `simxQTL::generate_cis_qtl_data()` with mixed architecture on top of $\\mathbf{G}$\n",
+    "3. Run `pecotmr::twas_weights_pipeline()` with selected methods and 5-fold CV\n",
+    "4. Call `pecotmr::ensemble_weights()` (now exported, PR #486) to learn $\\boldsymbol{\\zeta}$ and combine weights\n",
+    "5. Compute metrics against ground truth and against observed $Y$\n",
+    "6. Save results as RDS; optionally save raw data (X, y, sim) as separate `data.rds` when `--save-data TRUE`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[simulate_and_fit]\n",
+    "input: for_each = \"replicates\"\n",
+    "output_files = [f'{cwd}/{_replicates[\"rep_id\"]}/results.rds']\n",
+    "if save_data:\n",
+    "    output_files.append(f'{cwd}/{_replicates[\"rep_id\"]}/data.rds')\n",
+    "output: output_files\n",
+    "# Pre-compute data path in Python so the R script can reference it unconditionally.\n",
+    "# (Using ${_output[1]} directly fails with IndexError when save_data=False\n",
+    "#  because SoS's ${} substitution is Python-eager, regardless of R control flow.)\n",
+    "_data_path = f'{cwd}/{_replicates[\"rep_id\"]}/data.rds' if save_data else ''\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
+    "R: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
+    "\n",
+    "    library(pecotmr)\n",
+    "    library(simxQTL)\n",
+    "    set.seed(${_replicates['seed']})\n",
+    "    cat(\"Replicate: ${_replicates['rep_id']}\\n\")\n",
+    "    cat(\"Seed: ${_replicates['seed']}\\n\")\n",
+    "    cat(\"Region: ${_replicates['region']}\\n\")\n",
+    "\n",
+    "    # ── 1. Load genotype from LD sketch ─────────────────────\n",
+    "    ld_data <- load_LD_matrix(\n",
+    "        \"${ld_meta_file}\",\n",
+    "        region = \"${_replicates['region']}\",\n",
+    "        return_genotype = TRUE\n",
+    "    )\n",
+    "    X <- ld_data$LD_matrix\n",
+    "    X <- scale(X)\n",
+    "    X[is.na(X)] <- 0\n",
+    "    n <- nrow(X); p <- ncol(X)\n",
+    "    cat(sprintf(\"Genotype matrix: %d samples x %d variants\\n\", n, p))\n",
+    "\n",
+    "    max_var <- ${max_variants}\n",
+    "    if (max_var > 0 && p > max_var) {\n",
+    "        keep_idx <- sort(sample(p, max_var))\n",
+    "        X <- X[, keep_idx, drop = FALSE]\n",
+    "        p <- ncol(X)\n",
+    "        cat(sprintf(\"Subsampled to %d variants\\n\", p))\n",
+    "    }\n",
+    "\n",
+    "    # ── 2. Simulate training phenotype ───────────────────────\n",
+    "    sim <- generate_cis_qtl_data(X, h2g = ${h2g},\n",
+    "                                  n_sparse = ${n_sparse},\n",
+    "                                  n_oligogenic = ${n_oligogenic},\n",
+    "                                  n_inf = ${n_inf})\n",
+    "    y <- sim$y                     # training Y:  G*w_true + eps_train\n",
+    "    cat(sprintf(\"Realized h2g: %.4f    residual var: %.4f\\n\",\n",
+    "                sim$h2g, sim$residual_variance))\n",
+    "\n",
+    "    # ── 2a. Build ground-truth pieces from sim$beta ───────────\n",
+    "    if (!is.null(sim$beta) && length(sim$beta) == p) {\n",
+    "        true_beta <- as.numeric(sim$beta)\n",
+    "    } else {\n",
+    "        warning(\"sim$beta missing or wrong length (\", length(sim$beta),\n",
+    "                \" vs p=\", p, \"); defaulting to zero vector\")\n",
+    "        true_beta <- rep(0, p)\n",
+    "    }\n",
+    "    true_genetic <- as.vector(X %*% true_beta)\n",
+    "\n",
+    "    # ── 2b. Fresh held-out Y (same X, same w_true, new noise) ─\n",
+    "    # This is what lets us evaluate prediction without in-sample\n",
+    "    # overfitting bias — the weights never saw this Y.\n",
+    "    sigma2 <- sim$residual_variance\n",
+    "    eps_new <- rnorm(n, mean = 0, sd = sqrt(sigma2))\n",
+    "    y_test <- true_genetic + eps_new\n",
+    "    cat(\"Generated fresh Y_test (independent noise draw)\\n\")\n",
+    "\n",
+    "    # ── 2c. Optional raw-data dump for debugging ─────────────\n",
+    "    data_path <- \"${_data_path}\"\n",
+    "    if (nzchar(data_path)) {\n",
+    "        data_out <- list(\n",
+    "            X = X, y = y, y_test = y_test, sim = sim,\n",
+    "            region = \"${_replicates['region']}\",\n",
+    "            rep_id = \"${_replicates['rep_id']}\",\n",
+    "            seed = ${_replicates['seed']},\n",
+    "            variant_info = ld_data$ref_panel\n",
+    "        )\n",
+    "        dir.create(dirname(data_path), recursive = TRUE, showWarnings = FALSE)\n",
+    "        saveRDS(data_out, data_path)\n",
+    "        cat(\"Saved raw data:\", data_path, \"\\n\")\n",
+    "        rm(data_out)\n",
+    "    }\n",
+    "\n",
+    "    # ── 3. Define weight methods ─────────────────────────────\n",
+    "    method_set <- \"${method_set}\"\n",
+    "    if (method_set == \"expensive\") {\n",
+    "        weight_methods <- list(\n",
+    "            susie_weights              = list(refine = FALSE, init_L = 5, max_L = 10),\n",
+    "            susie_ash_weights          = list(),\n",
+    "            susie_inf_weights          = list(),\n",
+    "            mrash_weights              = list(init_prior_sd = TRUE, max.iter = 100),\n",
+    "            enet_weights               = list(),\n",
+    "            lasso_weights              = list(),\n",
+    "            bayes_r_weights            = list(),\n",
+    "            bayes_l_weights            = list(),\n",
+    "            bayes_a_weights            = list(),\n",
+    "            bayes_b_weights            = list(),\n",
+    "            bayes_c_weights            = list(),\n",
+    "            bayes_n_weights            = list(),\n",
+    "            b_lasso_weights            = list(),\n",
+    "            dpr_vb_weights             = list(),\n",
+    "            dpr_gibbs_weights          = list(),\n",
+    "            dpr_adaptive_gibbs_weights = list(),\n",
+    "            scad_weights               = list(),\n",
+    "            mcp_weights                = list(),\n",
+    "            l0learn_weights            = list()\n",
+    "\n",
+    "        )\n",
+    "    } else {\n",
+    "        weight_methods <- list(\n",
+    "            susie_weights = list(refine = FALSE, init_L = 5, max_L = 10),\n",
+    "            mrash_weights = list(init_prior_sd = TRUE, max.iter = 100),\n",
+    "            lasso_weights = list(),\n",
+    "            enet_weights  = list()\n",
+    "        )\n",
+    "    }\n",
+    "\n",
+    "    # ── 4. TWAS weights pipeline (CV + full-data weights) ────\n",
+    "    # Note: twas_weights_pipeline() in pecotmr >=0.5 (PR #486) does NOT accept num_threads.\n",
+    "    # ensemble_weights() is now exported; no local R/ensemble_weights.R source needed.\n",
+    "    cat(\"Running twas_weights_pipeline with\", length(weight_methods), \"methods...\\n\")\n",
+    "    res <- tryCatch(\n",
+    "        twas_weights_pipeline(\n",
+    "            X, y,\n",
+    "            cv_folds = ${cv_folds},\n",
+    "            weight_methods = weight_methods,\n",
+    "            max_cv_variants = ${max_cv_variants}\n",
+    "        ),\n",
+    "        error = function(e) {\n",
+    "            cat(\"Pipeline error:\", conditionMessage(e), \"\\n\")\n",
+    "            NULL\n",
+    "        }\n",
+    "    )\n",
+    "\n",
+    "    if (is.null(res)) {\n",
+    "        cat(\"Pipeline failed — exiting non-zero so SoS flags this replicate\\n\")\n",
+    "        saveRDS(list(results = NULL, ensemble = NULL,\n",
+    "                     truth = list(rep_id = \"${_replicates['rep_id']}\",\n",
+    "                                  seed = ${_replicates['seed']},\n",
+    "                                  error = TRUE)),\n",
+    "                \"${_output[0]}\")\n",
+    "        quit(save = \"no\", status = 1)\n",
+    "    }\n",
+    "\n",
+    "    # ── 5. Ensemble learning (steps 6-9 in the algorithm) ────\n",
+    "    cat(\"Computing ensemble weights...\\n\")\n",
+    "    ens <- tryCatch(\n",
+    "        pecotmr::ensemble_weights(\n",
+    "            cv_results = res$twas_cv_result,\n",
+    "            Y = y,\n",
+    "            twas_weight_list = res$twas_weights\n",
+    "        ),\n",
+    "        error = function(e) {\n",
+    "            cat(\"Ensemble error:\", conditionMessage(e), \"\\n\")\n",
+    "            NULL\n",
+    "        }\n",
+    "    )\n",
+    "\n",
+    "    # ── 6. Compute metrics ───────────────────────────────────\n",
+    "    # Four metrics per method and per ensemble:\n",
+    "    #   mse_w          = mean((w_hat - w_true)^2)                        weight-space, LD-independent\n",
+    "    #   mse_pred_truth = mean((X*w_hat - X*w_true)^2)                    prediction-space, LD-aware (what TWAS uses)\n",
+    "    #   mse_y          = mean((X*w_hat - y_test)^2)                      held-out Y (fresh eps), honest out-of-sample\n",
+    "    #   rsq_y          = cor(X*w_hat, y_test)^2                          held-out Y R^2\n",
+    "    compute_metrics <- function(w_hat) {\n",
+    "        if (is.matrix(w_hat)) w_hat <- w_hat[, 1]\n",
+    "        pred <- as.vector(X %*% w_hat)\n",
+    "        mse_w          <- mean((w_hat - true_beta)^2)\n",
+    "        mse_pred_truth <- mean((pred - true_genetic)^2)\n",
+    "        mse_y          <- mean((pred - y_test)^2)\n",
+    "        rsq_y          <- if (sd(pred) > 0) cor(pred, y_test)^2 else 0\n",
+    "        c(mse_w = mse_w,\n",
+    "          mse_pred_truth = mse_pred_truth,\n",
+    "          mse_y = mse_y,\n",
+    "          rsq_y = rsq_y)\n",
+    "    }\n",
+    "\n",
+    "    method_metrics <- sapply(names(res$twas_weights), function(wname) {\n",
+    "        compute_metrics(res$twas_weights[[wname]])\n",
+    "    })\n",
+    "    colnames(method_metrics) <- gsub(\"_weights$\", \"\", colnames(method_metrics))\n",
+    "\n",
+    "    ensemble_metrics <- c(mse_w = NA, mse_pred_truth = NA, mse_y = NA, rsq_y = NA)\n",
+    "    if (!is.null(ens) && !is.null(ens$ensemble_twas_weights)) {\n",
+    "        ensemble_metrics <- compute_metrics(ens$ensemble_twas_weights)\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "    # ── 7. Save results ──────────────────────────────────────\n",
+    "    truth <- list(\n",
+    "        causal_sparse = sim$sparse_indices,\n",
+    "        causal_oligo = sim$oligogenic_indices,\n",
+    "        causal_inf = sim$infinitesimal_indices,\n",
+    "        true_beta = true_beta,\n",
+    "        true_y = y,\n",
+    "        y_test = y_test,\n",
+    "        residual_variance = sigma2,\n",
+    "        h2g = sim$h2g,\n",
+    "        rep_id = \"${_replicates['rep_id']}\",\n",
+    "        region = \"${_replicates['region']}\",\n",
+    "        seed = ${_replicates['seed']},\n",
+    "        n = n, p = p,\n",
+    "        method_metrics = method_metrics,\n",
+    "        ensemble_metrics = ensemble_metrics,\n",
+    "    )\n",
+    "\n",
+    "    out <- list(results = res, ensemble = ens, truth = truth)\n",
+    "    dir.create(dirname(\"${_output[0]}\"), recursive = TRUE, showWarnings = FALSE)\n",
+    "    saveRDS(out, \"${_output[0]}\")\n",
+    "    cat(\"Saved:\", \"${_output[0]}\", \"\\n\")\n",
+    "    tryCatch({\n",
+    "    cat(\"Ensemble MSE (weights): \", round(ensemble_metrics[\"mse_w\"],          6), \"\\n\")\n",
+    "    cat(\"Ensemble MSE (pred vs truth):\",\n",
+    "        round(ensemble_metrics[\"mse_pred_truth\"], 6), \"\\n\")\n",
+    "    cat(\"Ensemble MSE (held-out Y): \",\n",
+    "        round(ensemble_metrics[\"mse_y\"], 6), \"\\n\")\n",
+    "    cat(\"Ensemble R^2  (held-out Y): \",\n",
+    "        round(ensemble_metrics[\"rsq_y\"], 4), \"\\n\")\n",
+    "    }, error = function(e) cat(\"Metric-print failed (non-fatal):\", conditionMessage(e), \"\\n\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### `analyze_results`\n",
+    "\n",
+    "Load all per-replicate RDS files and produce:\n",
+    "1. **MSE of weights** — $\\frac{1}{p}\\|\\hat{\\mathbf{w}} - \\mathbf{w}_{\\text{true}}\\|^2$ (LD-independent)\n",
+    "2. **MSE of predicted genetic component** — $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$ (LD-aware, TWAS-relevant)\n",
+    "3. **MSE vs held-out $Y$** — $\\frac{1}{n}\\|Y_{\\text{test}} - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y_{\\text{test}} = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\varepsilon_{\\text{new}}$\n",
+    "4. **$R^2$ vs held-out $Y$** — $\\text{cor}(Y_{\\text{test}},\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$\n",
+    "5. **Coefficient heatmap** — ensemble $\\zeta_k$ across replicates\n",
+    "6. **Scatter plot** — ensemble $R^2$ vs. best single-method $R^2$\n",
+    "7. **Summary CSV** — one row per replicate with all metrics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[analyze_results]\n",
+    "parameter: results_dir = path(f'{cwd}')\n",
+    "output: f'{results_dir}/diagnostics.pdf', f'{results_dir}/summary_table.csv'\n",
+    "task: trunk_workers = 1, walltime = \"1:00:00\", mem = \"8G\", cores = 1\n",
+    "R: expand = \"${ }\", stderr = f'{results_dir}/analyze.stderr', stdout = f'{results_dir}/analyze.stdout'\n",
+    "\n",
+    "    library(ggplot2)\n",
+    "    library(tidyr)\n",
+    "    library(dplyr)\n",
+    "\n",
+    "    results_dir <- \"${results_dir}\"\n",
+    "\n",
+    "    # ── 1. Load all RDS files ────────────────────────────────\n",
+    "    rds_files <- list.files(results_dir, pattern = \"results\\\\.rds$\",\n",
+    "                            recursive = TRUE, full.names = TRUE)\n",
+    "    cat(sprintf(\"Found %d RDS files\\n\", length(rds_files)))\n",
+    "\n",
+    "    all_data <- lapply(rds_files, function(f) {\n",
+    "        tryCatch(readRDS(f), error = function(e) NULL)\n",
+    "    })\n",
+    "    all_data <- all_data[!sapply(all_data, is.null)]\n",
+    "    all_data <- all_data[!sapply(all_data, function(x) isTRUE(x$truth$error))]\n",
+    "    cat(sprintf(\"%d successful replicates\\n\", length(all_data)))\n",
+    "\n",
+    "    if (length(all_data) == 0) stop(\"No valid results found\")\n",
+    "\n",
+    "    # ── 2. Extract metrics ───────────────────────────────────\n",
+    "    metrics_list <- lapply(seq_along(all_data), function(i) {\n",
+    "        x <- all_data[[i]]\n",
+    "        coefs      <- if (!is.null(x$ensemble)) x$ensemble$method_coef else NULL\n",
+    "        list(\n",
+    "            rep_id = x$truth$rep_id, seed = x$truth$seed, region = x$truth$region,\n",
+    "            method_coef = coefs,\n",
+    "            method_metrics = x$truth$method_metrics,\n",
+    "            ensemble_metrics = x$truth$ensemble_metrics\n",
+    "        )\n",
+    "    })\n",
+    "\n",
+    "    # ── 3. Tidy data frames ──────────────────────────────────\n",
+    "    build_metric_df <- function(metric_name) {\n",
+    "        do.call(rbind, lapply(metrics_list, function(m) {\n",
+    "            if (is.null(m$method_metrics)) return(NULL)\n",
+    "            if (!(metric_name %in% rownames(m$method_metrics))) return(NULL)\n",
+    "            methods <- colnames(m$method_metrics)\n",
+    "            vals    <- m$method_metrics[metric_name, ]\n",
+    "            method_df <- data.frame(rep_id = m$rep_id, method = methods,\n",
+    "                                    value = as.numeric(vals),\n",
+    "                                    stringsAsFactors = FALSE)\n",
+    "            ens_df <- data.frame(rep_id = m$rep_id, method = \"ensemble\",\n",
+    "                                 value = as.numeric(m$ensemble_metrics[metric_name]),\n",
+    "                                 stringsAsFactors = FALSE)\n",
+    "            rbind(method_df, ens_df)\n",
+    "        }))\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "    mse_w_df          <- build_metric_df(\"mse_w\")\n",
+    "    mse_pred_truth_df <- build_metric_df(\"mse_pred_truth\")\n",
+    "    mse_y_df          <- build_metric_df(\"mse_y\")\n",
+    "    rsq_y_df          <- build_metric_df(\"rsq_y\")\n",
+    "\n",
+    "    coef_mat <- do.call(rbind, lapply(metrics_list, function(m) {\n",
+    "        if (is.null(m$method_coef)) return(NULL)\n",
+    "        m$method_coef\n",
+    "    }))\n",
+    "    if (!is.null(coef_mat)) {\n",
+    "        rownames(coef_mat) <- sapply(metrics_list[!sapply(metrics_list,\n",
+    "            function(m) is.null(m$method_coef))], function(m) m$rep_id)\n",
+    "    }\n",
+    "\n",
+    "    scatter_df <- do.call(rbind, lapply(metrics_list, function(m) {\n",
+    "        if (is.null(m$method_metrics)) return(NULL)\n",
+    "        if (!(\"rsq_y\" %in% rownames(m$method_metrics))) return(NULL)\n",
+    "        rsq_vals <- m$method_metrics[\"rsq_y\", ]\n",
+    "        data.frame(\n",
+    "            rep_id = m$rep_id,\n",
+    "            best_single_rsq = max(rsq_vals, na.rm = TRUE),\n",
+    "            best_method = names(which.max(rsq_vals)),\n",
+    "            ensemble_rsq = as.numeric(m$ensemble_metrics[\"rsq_y\"]),\n",
+    "            stringsAsFactors = FALSE\n",
+    "        )\n",
+    "    }))\n",
+    "\n",
+    "    # ── 4. Diagnostic plots ──────────────────────────────────\n",
+    "    make_boxplot <- function(df, y_label, title, descending = FALSE) {\n",
+    "        if (is.null(df) || nrow(df) == 0) return(NULL)\n",
+    "        ord <- df %>% group_by(method) %>%\n",
+    "            summarise(med = median(value, na.rm = TRUE))\n",
+    "        if (descending) ord <- ord %>% arrange(desc(med)) else ord <- ord %>% arrange(med)\n",
+    "        df$method <- factor(df$method, levels = ord$method)\n",
+    "        ggplot(df, aes(x = method, y = value,\n",
+    "                       fill = ifelse(method == \"ensemble\", \"Ensemble\", \"Single\"))) +\n",
+    "            geom_boxplot(outlier.size = 0.8) +\n",
+    "            scale_fill_manual(values = c(\"Ensemble\" = \"#E41A1C\",\n",
+    "                                          \"Single\"  = \"#377EB8\"), name = \"\") +\n",
+    "            labs(title = title, x = \"Method\", y = y_label) +\n",
+    "            theme_bw(base_size = 12) +\n",
+    "            theme(axis.text.x = element_text(angle = 45, hjust = 1))\n",
+    "    }\n",
+    "\n",
+    "    pdf(\"${_output[0]}\", width = 12, height = 8)\n",
+    "\n",
+    "\n",
+    "    p <- make_boxplot(mse_w_df, \"MSE of weights\",\n",
+    "                      \"MSE on Weights  (LD-independent):  mean((w_hat - w_true)^2)\",\n",
+    "                      descending = TRUE)\n",
+    "    if (!is.null(p)) print(p)\n",
+    "\n",
+    "    p <- make_boxplot(mse_pred_truth_df, \"MSE(G*w_hat vs G*w_true)\",\n",
+    "                      \"MSE on Predicted Genetic Component  (LD-aware, TWAS-relevant)\",\n",
+    "                      descending = TRUE)\n",
+    "    if (!is.null(p)) print(p)\n",
+    "\n",
+    "    p <- make_boxplot(mse_y_df, \"MSE (vs held-out Y)\",\n",
+    "                      \"MSE Against Held-Out Y  (fresh noise — honest out-of-sample)\",\n",
+    "                      descending = TRUE)\n",
+    "    if (!is.null(p)) print(p)\n",
+    "\n",
+    "    p <- make_boxplot(rsq_y_df, expression(R^2~(vs~held-out~Y)),\n",
+    "                      \"R-squared Against Held-Out Y\")\n",
+    "    if (!is.null(p)) print(p)\n",
+    "\n",
+    "    if (!is.null(coef_mat) && nrow(coef_mat) > 1) {\n",
+    "        coef_long <- as.data.frame(coef_mat) %>%\n",
+    "            mutate(replicate = rownames(coef_mat)) %>%\n",
+    "            pivot_longer(-replicate, names_to = \"method\", values_to = \"zeta\")\n",
+    "        p5 <- ggplot(coef_long, aes(x = method, y = replicate, fill = zeta)) +\n",
+    "            geom_tile() +\n",
+    "            scale_fill_gradient(low = \"white\", high = \"#E41A1C\",\n",
+    "                                name = expression(zeta[k]), limits = c(0, 1)) +\n",
+    "            labs(title = \"Ensemble Coefficients Across Replicates\",\n",
+    "                 x = \"Method\", y = \"Replicate\") +\n",
+    "            theme_bw(base_size = 10) +\n",
+    "            theme(axis.text.x = element_text(angle = 45, hjust = 1),\n",
+    "                  axis.text.y = element_text(size = 6))\n",
+    "        print(p5)\n",
+    "    }\n",
+    "\n",
+    "    if (!is.null(scatter_df) && nrow(scatter_df) > 0) {\n",
+    "        p6 <- ggplot(scatter_df, aes(x = best_single_rsq, y = ensemble_rsq)) +\n",
+    "            geom_point(alpha = 0.6, size = 2, color = \"#377EB8\") +\n",
+    "            geom_abline(intercept = 0, slope = 1, linetype = \"dashed\",\n",
+    "                        color = \"grey40\") +\n",
+    "            labs(title = \"Ensemble vs Best Single Method  (R^2 vs held-out Y)\",\n",
+    "                 x = expression(Best~single~method~R^2),\n",
+    "                 y = expression(Ensemble~R^2)) +\n",
+    "            theme_bw(base_size = 12) + coord_equal()\n",
+    "        print(p6)\n",
+    "    }\n",
+    "\n",
+    "    dev.off()\n",
+    "    cat(\"Saved plots:\", \"${_output[0]}\", \"\\n\")\n",
+    "\n",
+    "    # ── 5. Summary table ─────────────────────────────────────\n",
+    "    metric_rows <- c(\"mse_w\", \"mse_pred_truth\", \"mse_y\", \"rsq_y\")\n",
+    "    summary_rows <- lapply(metrics_list, function(m) {\n",
+    "        row <- data.frame(\n",
+    "            rep_id = m$rep_id, region = m$region, seed = m$seed,\n",
+    "            stringsAsFactors = FALSE\n",
+    "        )\n",
+    "        for (mr in metric_rows) {\n",
+    "            row[[paste0(\"ensemble_\", mr)]] <- as.numeric(m$ensemble_metrics[mr])\n",
+    "        }\n",
+    "        if (!is.null(m$method_metrics)) {\n",
+    "            for (mn in colnames(m$method_metrics)) {\n",
+    "                for (mr in metric_rows) {\n",
+    "                    if (mr %in% rownames(m$method_metrics)) {\n",
+    "                        row[[paste0(mn, \"_\", mr)]] <- m$method_metrics[mr, mn]\n",
+    "                    }\n",
+    "                }\n",
+    "            }\n",
+    "        }\n",
+    "        if (!is.null(m$method_coef)) {\n",
+    "            for (mn in names(m$method_coef)) {\n",
+    "                row[[paste0(\"zeta_\", mn)]] <- m$method_coef[[mn]]\n",
+    "            }\n",
+    "        }\n",
+    "        row\n",
+    "    })\n",
+    "    summary_df <- bind_rows(summary_rows)\n",
+    "\n",
+    "    method_rsq_cols <- grep(\"_rsq_y$\", names(summary_df), value = TRUE)\n",
+    "    method_rsq_cols <- setdiff(method_rsq_cols, \"ensemble_rsq_y\")\n",
+    "    if (length(method_rsq_cols) > 0) {\n",
+    "        summary_df$best_method_rsq_y <- apply(summary_df[method_rsq_cols], 1,\n",
+    "                                              max, na.rm = TRUE)\n",
+    "        summary_df$best_method <- apply(summary_df[method_rsq_cols], 1, function(r) {\n",
+    "            gsub(\"_rsq_y$\", \"\", names(which.max(r)))\n",
+    "        })\n",
+    "        summary_df$rsq_y_improvement <-\n",
+    "            summary_df$ensemble_rsq_y - summary_df$best_method_rsq_y\n",
+    "    }\n",
+    "\n",
+    "    write.csv(summary_df, \"${_output[1]}\", row.names = FALSE)\n",
+    "    cat(\"Saved summary:\", \"${_output[1]}\", \"\\n\")\n",
+    "\n",
+    "    cat(\"\\n=== Summary ===\\n\")\n",
+    "    cat(sprintf(\"Replicates: %d\\n\", nrow(summary_df)))\n",
+    "    cat(sprintf(\"Mean ensemble MSE (weights)     : %.6f\\n\",\n",
+    "                mean(summary_df$ensemble_mse_w,         na.rm = TRUE)))\n",
+    "    cat(sprintf(\"Mean ensemble MSE (pred truth)  : %.6f\\n\",\n",
+    "                mean(summary_df$ensemble_mse_pred_truth, na.rm = TRUE)))\n",
+    "    cat(sprintf(\"Mean ensemble MSE (held-out Y)  : %.6f\\n\",\n",
+    "                mean(summary_df$ensemble_mse_y,         na.rm = TRUE)))\n",
+    "    cat(sprintf(\"Mean ensemble R^2 (held-out Y)  : %.4f\\n\",\n",
+    "                mean(summary_df$ensemble_rsq_y,         na.rm = TRUE)))\n",
+    "    if (\"rsq_y_improvement\" %in% names(summary_df)) {\n",
+    "        cat(sprintf(\"Ensemble >= best in %.1f%% of replicates (R^2 vs held-out Y)\\n\",\n",
+    "            100 * mean(summary_df$rsq_y_improvement >= 0, na.rm = TRUE)))\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Expected Results\n",
+    "\n",
+    "With default simulation parameters ($h^2_g = 0.2$, mixed architecture) on genotype regions:\n",
+    "\n",
+    "- **MSE against truth vs. MSE against $Y$**: MSE against truth isolates weight estimation quality (unaffected by noise). MSE against $Y$ includes irreducible noise $\\sigma^2_\\epsilon$, so it is always larger.\n",
+    "- **Coefficient patterns**: Sparse architectures favor SuSiE/LASSO; polygenic architectures favor BayesR/DPR; the ensemble adapts by assigning larger $\\zeta_k$ to the method best suited to each region's architecture.\n",
+    "- **Scatter plot**: Most points should lie on or above the diagonal, indicating ensemble matches or improves on the best single method."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "SoS",
+   "language": "sos",
+   "name": "sos"
+  },
+  "language_info": {
+   "codemirror_mode": "sos",
+   "file_extension": ".sos",
+   "mimetype": "text/x-sos",
+   "name": "sos",
+   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
+   "pygments_lexer": "sos"
+  },
+  "sos": {
+   "kernels": [
+    [
+     "R",
+     "ir",
+     "R",
+     "",
+     "r"
+    ],
+    [
+     "SoS",
+     "sos",
+     "",
+     "",
+     "sos"
+    ]
+   ]
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

From 7780dbfd0f6f861567a4d933141eb055c180abdc Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 19 May 2026 09:13:59 -0400
Subject: [PATCH 2/4] Add new sections for snRNAseq and pseudobulk
 preprocessing

---
 website/_toc.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/website/_toc.yml b/website/_toc.yml
index f6e3744c5..d5efb01d2 100644
--- a/website/_toc.yml
+++ b/website/_toc.yml
@@ -26,6 +26,8 @@ parts:
       - file: code/molecular_phenotypes/calling/RNA_calling.ipynb
       - file: code/molecular_phenotypes/QC/bulk_expression_QC.ipynb
       - file: code/molecular_phenotypes/QC/bulk_expression_normalization.ipynb
+      - file: code/molecular_phenotypes/snRNAseq_preprocessing.ipynb
+      - file: code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb
     - file: code/molecular_phenotypes/methylation.ipynb
       sections:
       - file: code/molecular_phenotypes/calling/methylation_calling.ipynb

From 5bd3996e32957eeb4a5b88a45bc2e2b4ba37a8d8 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 19 May 2026 09:15:26 -0400
Subject: [PATCH 3/4] Delete
 code/pecotmr_integration/ensemble_twas_weights.ipynb

---
 .../ensemble_twas_weights.ipynb               | 992 ------------------
 1 file changed, 992 deletions(-)
 delete mode 100644 code/pecotmr_integration/ensemble_twas_weights.ipynb

diff --git a/code/pecotmr_integration/ensemble_twas_weights.ipynb b/code/pecotmr_integration/ensemble_twas_weights.ipynb
deleted file mode 100644
index 9277af022..000000000
--- a/code/pecotmr_integration/ensemble_twas_weights.ipynb
+++ /dev/null
@@ -1,992 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "# Ensemble TWAS Weights via Stacked Regression\n",
-    "\n",
-    "## Overview\n",
-    "\n",
-    "`pecotmr` offers 10+ TWAS weight methods (SuSiE, LASSO, Elastic Net, mr.ash, BayesR, DPR, etc.), each suited to different genetic architectures. Picking one method is arbitrary; running all and testing each incurs a multiple-testing penalty. **Stacked regression** ([SR-TWAS; Dai et al.,2024](https://doi.org/10.1038/s41467-024-50983-w)) provides a principled alternative: learn one convex combination of methods per gene, producing a single weight vector for downstream TWAS testing with no method-selection bias.\n",
-    "\n",
-    "This notebook:\n",
-    "1. Presents the mathematical framework and algorithm for ensemble weight learning\n",
-    "2. Benchmarks the approach via simulation using **genotypes** from LD sketch reference panels, so that realistic LD patterns are preserved\n",
-    "3. Produces diagnostic plots comparing MSE against ground truth and MSE against observed $Y$\n",
-    "\n",
-    "| Step | Description | Parallelism |\n",
-    "|------|-------------|-------------|\n",
-    "| `simulate_and_fit` | Load X from sketch LD, simulate Y, run methods + ensemble, save RDS | Per-region (`for_each`) |\n",
-    "| `analyze_results` | Load RDS files, compute metrics, generate diagnostic plots | Single task |"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Mathematical Framework\n",
-    "\n",
-    "### Gene Expression Model\n",
-    "\n",
-    "For a target gene $g$ with $p$ cis-SNPs, the expression level of individual $i$ is:\n",
-    "\n",
-    "$$E_i = \\mathbf{G}_i \\mathbf{w} + \\epsilon_i, \\quad \\epsilon_i \\sim N(0, \\sigma^2_\\epsilon)$$\n",
-    "\n",
-    "where $\\mathbf{G}_i \\in \\mathbb{R}^{1 \\times p}$ is the genotype vector and $\\mathbf{w} \\in \\mathbb{R}^p$ is the true cis-eQTL effect-size vector.\n",
-    "\n",
-    "### Base Models\n",
-    "\n",
-    "We train $K$ prediction models $\\{m_1, \\ldots, m_K\\}$, each producing a weight estimate $\\hat{\\mathbf{w}}_k \\in \\mathbb{R}^p$. Under $F$-fold cross-validation, the out-of-fold prediction for sample $i$ from method $k$ is:\n",
-    "\n",
-    "$$\\hat{E}_i^{(k)} = \\mathbf{G}_i \\hat{\\mathbf{w}}_k^{(-f_i)}$$\n",
-    "\n",
-    "where $f_i$ denotes the fold containing sample $i$ and $\\hat{\\mathbf{w}}_k^{(-f)}$ is trained on all samples except fold $f$. Stacking all $n$ out-of-fold predictions across $K$ methods yields the prediction matrix:\n",
-    "\n",
-    "$$\\mathbf{P} = \\begin{pmatrix} \\hat{E}_1^{(1)} & \\cdots & \\hat{E}_1^{(K)} \\\\ \\vdots & \\ddots & \\vdots \\\\ \\hat{E}_n^{(1)} & \\cdots & \\hat{E}_n^{(K)} \\end{pmatrix} \\in \\mathbb{R}^{n \\times K}$$\n",
-    "\n",
-    "### Stacked Regression (Constrained QP)\n",
-    "\n",
-    "We seek ensemble coefficients $\\boldsymbol{\\zeta} = (\\zeta_1, \\ldots, \\zeta_K)^\\top$ that minimize the prediction error of the convex combination:\n",
-    "\n",
-    "$$\\min_{\\boldsymbol{\\zeta}} \\;\\; \\left\\| \\mathbf{E} - \\sum_{k=1}^{K} \\zeta_k \\hat{\\mathbf{E}}^{(k)} \\right\\|^2 = \\left\\| \\mathbf{E} - \\mathbf{P} \\boldsymbol{\\zeta} \\right\\|^2$$\n",
-    "\n",
-    "$$\\text{subject to} \\quad \\zeta_k \\geq 0 \\;\\; \\forall k, \\qquad \\sum_{k=1}^{K} \\zeta_k = 1$$\n",
-    "\n",
-    "This is a convex quadratic program solved via `quadprog::solve.QP` with:\n",
-    "\n",
-    "$$\\mathbf{D} = \\mathbf{P}^\\top \\mathbf{P} + \\lambda \\mathbf{I}, \\quad \\mathbf{d} = \\mathbf{P}^\\top \\mathbf{E}$$\n",
-    "\n",
-    "where $\\lambda = 10^{-8} \\cdot \\text{mean}(\\text{diag}(\\mathbf{P}^\\top \\mathbf{P}))$ provides ridge regularization for numerical stability.\n",
-    "\n",
-    "### Ensemble Weight Vector\n",
-    "\n",
-    "The final ensemble TWAS weight vector combines the full-data weight estimates:\n",
-    "\n",
-    "$$\\tilde{\\mathbf{w}} = \\sum_{k=1}^{K} \\zeta_k^* \\hat{\\mathbf{w}}_k$$\n",
-    "\n",
-    "### Performance Metrics\n",
-    "\n",
-    "Two complementary evaluation criteria:\n",
-    "\n",
-    "- **Against ground truth** (oracle): $\\text{MSE}_{\\text{truth}} = \\frac{1}{n} \\| \\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}} \\|^2$\n",
-    "- **Against observed $Y$** (practical): $R^2_Y = \\text{cor}(Y,\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$ and $\\text{MSE}_Y = \\frac{1}{n}\\|Y - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Algorithm\n",
-    "\n",
-    "The algorithm below shows the existing pecotmr cross-validation framework (Steps 1--5) and the **new ensemble learning extension** (Steps 6--9, in bold). Step numbers reference the mathematical formulation above.\n",
-    "\n",
-    "```\n",
-    "Algorithm: TWAS Weight Estimation with Ensemble Learning\n",
-    "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n",
-    "\n",
-    "Input:  Genotype matrix G (n x p), expression vector E (n x 1),\n",
-    "        weight methods M = {m_1, ..., m_K}, number of CV folds F\n",
-    "Output: Per-method weights {w_k}, ensemble coefficients {zeta_k},\n",
-    "        combined weight vector w_tilde, CV performance metrics\n",
-    "\n",
-    "── Existing Framework: twas_weights_cv() + twas_weights() ──────\n",
-    "\n",
-    "1. Partition n samples into F folds: {S_1, ..., S_F}\n",
-    "2. For each fold f = 1, ..., F:\n",
-    "   For each method k = 1, ..., K:\n",
-    "     a. Train:   w_k^(-f)  <-  m_k(G[T_f,], E[T_f])\n",
-    "     b. Predict:  E_hat^(k)[S_f]  <-  G[S_f,] * w_k^(-f)\n",
-    "3. Assemble out-of-fold prediction matrix P  (n x K)\n",
-    "     P[i,k] = E_hat_i^(k) from the fold containing sample i\n",
-    "4. Compute per-method performance:\n",
-    "     R^2_k = cor(E, P[,k])^2    for k = 1, ..., K\n",
-    "5. Train final weights on full data:\n",
-    "     w_k  <-  m_k(G, E)    for all k\n",
-    "\n",
-    "── NEW: Ensemble Learning via ensemble_weights() ──────────────\n",
-    "\n",
-    "  6. Solve constrained QP for ensemble coefficients:\n",
-    "       D  <-  P^T P + lambda * I      (ridge regularization)\n",
-    "       d  <-  P^T E\n",
-    "       zeta*  <-  argmin  1/2 zeta^T D zeta - d^T zeta\n",
-    "                  s.t.  sum(zeta_k) = 1,  zeta_k >= 0  for all k\n",
-    "  7. Compute ensemble prediction:\n",
-    "       E_hat_ens  =  P zeta*\n",
-    "  8. Compute ensemble performance:\n",
-    "       R^2_ens  =  cor(E, E_hat_ens)^2\n",
-    "  9. Combine final weight vectors:\n",
-    "       w_tilde  =  sum_k  zeta_k * w_k\n",
-    "\n",
-    "Return: {w_k}, {zeta_k}, w_tilde, {R^2_k}, R^2_ens\n",
-    "```\n",
-    "\n",
-    "**Implementation**: Steps 1--5 are handled internally by `pecotmr::twas_weights_pipeline()`. Steps 6--9 are `pecotmr::ensemble_weights()`, now exported in PR #486 (no local source file needed)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Weight Methods\n",
-    "\n",
-    "Two method sets are provided depending on computational budget:\n",
-    "\n",
-    "### Expensive (default)\n",
-    "\n",
-    "| # | Method | Type | pecotmr function | Key parameters |\n",
-    "|---|--------|------|------------------|----------------|\n",
-    "| 1 | SuSiE | Bayesian sparse (L0-like) | `susie_weights` | `refine=FALSE, init_L=5, max_L=10` |\n",
-    "| 2 | SuSiE-ash | Bayesian sparse + ash shrinkage | `susie_ash_weights` | default |\n",
-    "| 3 | SuSiE-inf | Bayesian sparse + infinitesimal | `susie_inf_weights` | default |\n",
-    "| 4 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` | `init_prior_sd=TRUE, max.iter=100` |\n",
-    "| 5 | Elastic Net | L1+L2 penalized (glmnet) | `enet_weights` | default (`alpha=0.5`) |\n",
-    "| 6 | LASSO | L1 penalized (glmnet) | `lasso_weights` | default |\n",
-    "| 7 | BayesR | Bayesian mixture of normals | `bayes_r_weights` | default (`qgg`) |\n",
-    "| 8 | BayesL | Bayesian LASSO | `bayes_l_weights` | default |\n",
-    "| 9 | BayesA | Bayesian ridge-like | `bayes_a_weights` | default |\n",
-    "| 10 | BayesB | Bayesian variable selection | `bayes_b_weights` | default |\n",
-    "| 11 | BayesC | Bayesian spike-and-slab | `bayes_c_weights` | default |\n",
-    "| 12 | BayesN | Bayesian normal mixture | `bayes_n_weights` | default |\n",
-    "| 13 | B-LASSO | Bayesian LASSO variant | `b_lasso_weights` | default |\n",
-    "| 14 | DPR-VB | Dirichlet process regression (VB) | `dpr_vb_weights` | default (`RcppDPR`) |\n",
-    "| 15 | DPR-Gibbs | Dirichlet process regression (Gibbs) | `dpr_gibbs_weights` | default |\n",
-    "| 16 | DPR-Adaptive Gibbs | Dirichlet process regression (adaptive) | `dpr_adaptive_gibbs_weights` | default |\n",
-    "| 17 | SCAD | SCAD penalized | `scad_weights` | default |\n",
-    "| 18 | MCP | MCP penalized | `mcp_weights` | default |\n",
-    "| 19 | L0Learn | Best-subset approximation | `l0learn_weights` | default |\n",
-    "\n",
-    "### Cheap (fast iteration)\n",
-    "\n",
-    "| # | Method | Type | pecotmr function |\n",
-    "|---|--------|------|------------------|\n",
-    "| 1 | SuSiE | Bayesian sparse | `susie_weights` |\n",
-    "| 2 | mr.ash | Bayesian adaptive shrinkage | `mrash_weights` |\n",
-    "| 3 | LASSO | L1 penalized | `lasso_weights` |\n",
-    "| 4 | Elastic Net | L1+L2 penalized | `enet_weights` |\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Simulation Design\n",
-    "\n",
-    "### Genotypes\n",
-    "\n",
-    "Each replicate uses **genotypes** loaded from LD sketch reference panels via `pecotmr::load_LD_matrix(meta, region, return_genotype = TRUE)`. Preserving LD patterns is critical for faithfully benchmarking TWAS weight methods — no synthetic genotype simulation can reproduce the complex correlation structure of actual genomic data.\n",
-    "\n",
-    "The pipeline requires two input files:\n",
-    "\n",
-    "1. **`ld_meta_file`** — Sketch LD metadata TSV pointing to plink2 files (one row per chromosome, `start=0, end=0` sentinel). Generated by the `rss_ld_sketch` pipeline. Example:\n",
-    "   ```\n",
-    "   #chrom  start  end  path\n",
-    "   22      0      0    ADSP.R5.EUR.chr22\n",
-    "   ```\n",
-    "   The `path` column is the plink2 file prefix (resolved relative to the TSV directory). The corresponding `.pgen`, `.pvar`, `.psam`, `.afreq` files must exist.\n",
-    "\n",
-    "2. **`ld_block_file`** — BED or TSV file defining LD block boundaries (one row per block). Each block becomes a replicate region. Example (BED format, 0-based half-open):\n",
-    "   ```\n",
-    "   chr22  16051249  17614263\n",
-    "   chr22  17614263  18834263\n",
-    "   ...\n",
-    "   ```\n",
-    "   Or TSV with `#chrom  start  end` header. These define the genomic windows passed to `load_LD_matrix()` to extract variants per region.\n",
-    "\n",
-    "### Phenotypes\n",
-    "\n",
-    "- **Sparse**: `n_sparse` variants with large effects\n",
-    "- **Oligogenic**: `n_oligogenic` variants with moderate effects\n",
-    "- **Infinitesimal**: `n_inf` variants with small effects\n",
-    "- Total heritability: $h^2_g$ (default 0.2)\n",
-    "\n",
-    "### Replicates\n",
-    "\n",
-    "Each genomic region in the `ld_block_file` defines one replicate. The `n_seeds` parameter controls the number of independent phenotype realizations per region (different random seeds produce different causal architectures on the same genotype matrix).\n",
-    "\n",
-    "### Data Saving\n",
-    "\n",
-    "Set `save_data = TRUE` to save the raw genotype matrix, simulated phenotype, and ground-truth causal information in a separate `data.rds` file per replicate. This is useful for debugging but produces large files (one per region x seed). Disabled by default.\n",
-    "\n",
-    "### Evaluation Metrics\n",
-    "\n",
-    "Two complementary metrics per replicate:\n",
-    "\n",
-    "1. **Against ground truth** — MSE of predicted genetic component vs. true genetic component: $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$. This measures how well the method recovers the true signal.\n",
-    "\n",
-    "2. **Against observed $Y$** — $R^2$ and MSE of predictions vs. observed phenotype $Y = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\epsilon$. This is the practical metric: how well would the weights predict expression in a held-out sample?"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Usage\n",
-    "\n",
-    "### Quick test (cheap methods, few regions)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n",
-    "    --ld-meta-file data/ld_meta_file.tsv \\\n",
-    "    --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed \\\n",
-    "    --chrom 21 \\\n",
-    "    --n-seeds 1 \\\n",
-    "    --method-set cheap \\\n",
-    "    -J 20"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### Full benchmark (expensive methods, 10 seeds per region)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n",
-    "    --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n",
-    "    --ld-block-file /path/to/EUR_LD_blocks.bed \\\n",
-    "    --chrom 22 \\\n",
-    "    --n-seeds 10 \\\n",
-    "    --method-set expensive \\\n",
-    "    -J 40"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### With data saving (for debugging)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/ensemble_twas_weights.ipynb simulate_and_fit \\\n",
-    "    --ld-meta-file /path/to/sketch_ld_meta.tsv \\\n",
-    "    --ld-block-file /path/to/EUR_LD_blocks.bed \\\n",
-    "    --chrom 22 \\\n",
-    "    --n-seeds 1 \\\n",
-    "    --save-data TRUE \\\n",
-    "    -J 20"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "\n",
-    "### Analysis and plots"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "jp-MarkdownHeadingCollapsed": true,
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "sos run pipeline/ensemble_twas_weights.ipynb analyze_results \\\n",
-    "    --cwd /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/ensemble_twas \\\n",
-    "    --ld-meta-file /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/rss_ld_sketch/sketch_ld_meta.tsv \\\n",
-    "    --ld-block-file /restricted/projectnb/casa/oaolayin/ROSMAP_NIA_geno/EUR_LD_blocks.bed"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "# Pipeline Implementation"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[global]\n",
-    "# Output directory\n",
-    "parameter: cwd = path(\"output/ensemble_twas\")\n",
-    "\n",
-    "\n",
-    "# ── Genotype source ──────────────────────────────────────────\n",
-    "# Sketch LD metadata TSV (plink2 format: chrom, start=0, end=0, path=prefix)\n",
-    "# Generated by the rss_ld_sketch pipeline. Used by load_LD_matrix() to load genotypes.\n",
-    "parameter: ld_meta_file = path(\"\")\n",
-    "# LD block file defining replicate regions (BED or TSV with chrom/start/end)\n",
-    "# Each block becomes one replicate. Can be EUR_LD_blocks.bed or similar.\n",
-    "parameter: ld_block_file = path(\"\")\n",
-    "# Chromosome filter (0 = all chromosomes)\n",
-    "parameter: chrom = 0\n",
-    "# Max variants to subsample per region (0 = no limit)\n",
-    "parameter: max_variants = 0\n",
-    "\n",
-    "# ── Method selection ─────────────────────────────────────────\n",
-    "# \"expensive\": full 19-method list (SuSiE variants, mr.ash, BayesR family, DPR family, LASSO, Elastic Net, SCAD, MCP, L0Learn)\n",
-    "# \"cheap\":     \"fast_default\" preset — SuSiE, mr.ash, Elastic Net, LASSO\n",
-    "parameter: method_set = \"expensive\"\n",
-    "\n",
-    "# ── Simulation parameters ────────────────────────────────────\n",
-    "parameter: h2g = 0.2\n",
-    "parameter: n_sparse = 3\n",
-    "parameter: n_oligogenic = 5\n",
-    "parameter: n_inf = 15\n",
-    "parameter: cv_folds = 5\n",
-    "parameter: max_cv_variants = 500\n",
-    "\n",
-    "# ── Replicate control ────────────────────────────────────────\n",
-    "parameter: n_seeds = 1\n",
-    "parameter: seed_base = 42\n",
-    "\n",
-    "# ── Data saving (for debugging) ──────────────────────────────\n",
-    "# When TRUE, saves raw genotype matrix X, phenotype y, and sim object\n",
-    "# as a separate data.rds per replicate. Produces large files.\n",
-    "parameter: save_data = False\n",
-    "\n",
-    "# ── Cluster resources ────────────────────────────────────────\n",
-    "parameter: job_size = 1\n",
-    "parameter: walltime = \"4:00:00\"\n",
-    "parameter: mem = \"16G\"\n",
-    "parameter: numThreads = 4\n",
-    "\n",
-    "import os\n",
-    "cwd = path(f'{cwd:a}')\n",
-    "\n",
-    "def _read_ld_blocks(block_file, chrom_filter):\n",
-    "    \"\"\"Parse LD block file (BED or TSV with chrom/start/end columns).\n",
-    "    Returns list of region dicts with id, chrom, start, end, region string.\n",
-    "    Supports BED (0-based half-open) and TSV with #chrom header.\n",
-    "    \"\"\"\n",
-    "    regions = []\n",
-    "    with open(block_file) as fh:\n",
-    "        for line in fh:\n",
-    "            line = line.strip()\n",
-    "            if not line or line.startswith('#chrom') or line.startswith('chrom'):\n",
-    "                continue\n",
-    "            parts = line.split()\n",
-    "            if len(parts) < 3:\n",
-    "                continue\n",
-    "            c = parts[0]\n",
-    "            if not c.startswith('chr'):\n",
-    "                c = f'chr{c}'\n",
-    "            try:\n",
-    "                cnum = int(c.replace('chr', ''))\n",
-    "            except ValueError:\n",
-    "                continue\n",
-    "            if not (1 <= cnum <= 22):\n",
-    "                continue\n",
-    "            if chrom_filter != 0 and cnum != chrom_filter:\n",
-    "                continue\n",
-    "            start, end = int(parts[1]), int(parts[2])\n",
-    "            regions.append({\n",
-    "                'id': f'{c}_{start}_{end}',\n",
-    "                'chrom': c,\n",
-    "                'start': start,\n",
-    "                'end': end,\n",
-    "                'region': f'{c}:{start}-{end}'\n",
-    "            })\n",
-    "    return regions\n",
-    "\n",
-    "# Validate inputs\n",
-    "if not str(ld_meta_file) or not os.path.isfile(str(ld_meta_file)):\n",
-    "    raise ValueError(\"ld_meta_file is required — must point to a sketch LD metadata TSV (plink2 format)\")\n",
-    "if not str(ld_block_file) or not os.path.isfile(str(ld_block_file)):\n",
-    "    raise ValueError(\"ld_block_file is required — must point to an LD block BED/TSV defining replicate regions\")\n",
-    "\n",
-    "# Build replicate list: regions x seeds\n",
-    "_base_regions = _read_ld_blocks(str(ld_block_file), chrom)\n",
-    "if not _base_regions:\n",
-    "    raise ValueError(f\"No regions found in {ld_block_file} for chrom={chrom}\")\n",
-    "\n",
-    "replicates = []\n",
-    "for seed_idx in range(n_seeds):\n",
-    "    for i, r in enumerate(_base_regions):\n",
-    "        rep = dict(r)\n",
-    "        rep['seed'] = seed_base + seed_idx * len(_base_regions) + i\n",
-    "        rep['rep_id'] = f'{r[\"id\"]}_seed{seed_idx}'\n",
-    "        replicates.append(rep)\n",
-    "\n",
-    "print(f\"  {len(_base_regions)} regions x {n_seeds} seeds = {len(replicates)} replicates\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### `simulate_and_fit`\n",
-    "\n",
-    "For each replicate (region x seed):\n",
-    "1. Load genotype matrix $\\mathbf{G}$ from LD sketch via `pecotmr::load_LD_matrix()` using the sketch LD metadata and the region defined by the LD block file\n",
-    "2. Simulate expression $E$ via `simxQTL::generate_cis_qtl_data()` with mixed architecture on top of $\\mathbf{G}$\n",
-    "3. Run `pecotmr::twas_weights_pipeline()` with selected methods and 5-fold CV\n",
-    "4. Call `pecotmr::ensemble_weights()` (now exported, PR #486) to learn $\\boldsymbol{\\zeta}$ and combine weights\n",
-    "5. Compute metrics against ground truth and against observed $Y$\n",
-    "6. Save results as RDS; optionally save raw data (X, y, sim) as separate `data.rds` when `--save-data TRUE`"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[simulate_and_fit]\n",
-    "input: for_each = \"replicates\"\n",
-    "output_files = [f'{cwd}/{_replicates[\"rep_id\"]}/results.rds']\n",
-    "if save_data:\n",
-    "    output_files.append(f'{cwd}/{_replicates[\"rep_id\"]}/data.rds')\n",
-    "output: output_files\n",
-    "# Pre-compute data path in Python so the R script can reference it unconditionally.\n",
-    "# (Using ${_output[1]} directly fails with IndexError when save_data=False\n",
-    "#  because SoS's ${} substitution is Python-eager, regardless of R control flow.)\n",
-    "_data_path = f'{cwd}/{_replicates[\"rep_id\"]}/data.rds' if save_data else ''\n",
-    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
-    "R: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
-    "\n",
-    "    library(pecotmr)\n",
-    "    library(simxQTL)\n",
-    "    set.seed(${_replicates['seed']})\n",
-    "    cat(\"Replicate: ${_replicates['rep_id']}\\n\")\n",
-    "    cat(\"Seed: ${_replicates['seed']}\\n\")\n",
-    "    cat(\"Region: ${_replicates['region']}\\n\")\n",
-    "\n",
-    "    # ── 1. Load genotype from LD sketch ─────────────────────\n",
-    "    ld_data <- load_LD_matrix(\n",
-    "        \"${ld_meta_file}\",\n",
-    "        region = \"${_replicates['region']}\",\n",
-    "        return_genotype = TRUE\n",
-    "    )\n",
-    "    X <- ld_data$LD_matrix\n",
-    "    X <- scale(X)\n",
-    "    X[is.na(X)] <- 0\n",
-    "    n <- nrow(X); p <- ncol(X)\n",
-    "    cat(sprintf(\"Genotype matrix: %d samples x %d variants\\n\", n, p))\n",
-    "\n",
-    "    max_var <- ${max_variants}\n",
-    "    if (max_var > 0 && p > max_var) {\n",
-    "        keep_idx <- sort(sample(p, max_var))\n",
-    "        X <- X[, keep_idx, drop = FALSE]\n",
-    "        p <- ncol(X)\n",
-    "        cat(sprintf(\"Subsampled to %d variants\\n\", p))\n",
-    "    }\n",
-    "\n",
-    "    # ── 2. Simulate training phenotype ───────────────────────\n",
-    "    sim <- generate_cis_qtl_data(X, h2g = ${h2g},\n",
-    "                                  n_sparse = ${n_sparse},\n",
-    "                                  n_oligogenic = ${n_oligogenic},\n",
-    "                                  n_inf = ${n_inf})\n",
-    "    y <- sim$y                     # training Y:  G*w_true + eps_train\n",
-    "    cat(sprintf(\"Realized h2g: %.4f    residual var: %.4f\\n\",\n",
-    "                sim$h2g, sim$residual_variance))\n",
-    "\n",
-    "    # ── 2a. Build ground-truth pieces from sim$beta ───────────\n",
-    "    if (!is.null(sim$beta) && length(sim$beta) == p) {\n",
-    "        true_beta <- as.numeric(sim$beta)\n",
-    "    } else {\n",
-    "        warning(\"sim$beta missing or wrong length (\", length(sim$beta),\n",
-    "                \" vs p=\", p, \"); defaulting to zero vector\")\n",
-    "        true_beta <- rep(0, p)\n",
-    "    }\n",
-    "    true_genetic <- as.vector(X %*% true_beta)\n",
-    "\n",
-    "    # ── 2b. Fresh held-out Y (same X, same w_true, new noise) ─\n",
-    "    # This is what lets us evaluate prediction without in-sample\n",
-    "    # overfitting bias — the weights never saw this Y.\n",
-    "    sigma2 <- sim$residual_variance\n",
-    "    eps_new <- rnorm(n, mean = 0, sd = sqrt(sigma2))\n",
-    "    y_test <- true_genetic + eps_new\n",
-    "    cat(\"Generated fresh Y_test (independent noise draw)\\n\")\n",
-    "\n",
-    "    # ── 2c. Optional raw-data dump for debugging ─────────────\n",
-    "    data_path <- \"${_data_path}\"\n",
-    "    if (nzchar(data_path)) {\n",
-    "        data_out <- list(\n",
-    "            X = X, y = y, y_test = y_test, sim = sim,\n",
-    "            region = \"${_replicates['region']}\",\n",
-    "            rep_id = \"${_replicates['rep_id']}\",\n",
-    "            seed = ${_replicates['seed']},\n",
-    "            variant_info = ld_data$ref_panel\n",
-    "        )\n",
-    "        dir.create(dirname(data_path), recursive = TRUE, showWarnings = FALSE)\n",
-    "        saveRDS(data_out, data_path)\n",
-    "        cat(\"Saved raw data:\", data_path, \"\\n\")\n",
-    "        rm(data_out)\n",
-    "    }\n",
-    "\n",
-    "    # ── 3. Define weight methods ─────────────────────────────\n",
-    "    method_set <- \"${method_set}\"\n",
-    "    if (method_set == \"expensive\") {\n",
-    "        weight_methods <- list(\n",
-    "            susie_weights              = list(refine = FALSE, init_L = 5, max_L = 10),\n",
-    "            susie_ash_weights          = list(),\n",
-    "            susie_inf_weights          = list(),\n",
-    "            mrash_weights              = list(init_prior_sd = TRUE, max.iter = 100),\n",
-    "            enet_weights               = list(),\n",
-    "            lasso_weights              = list(),\n",
-    "            bayes_r_weights            = list(),\n",
-    "            bayes_l_weights            = list(),\n",
-    "            bayes_a_weights            = list(),\n",
-    "            bayes_b_weights            = list(),\n",
-    "            bayes_c_weights            = list(),\n",
-    "            bayes_n_weights            = list(),\n",
-    "            b_lasso_weights            = list(),\n",
-    "            dpr_vb_weights             = list(),\n",
-    "            dpr_gibbs_weights          = list(),\n",
-    "            dpr_adaptive_gibbs_weights = list(),\n",
-    "            scad_weights               = list(),\n",
-    "            mcp_weights                = list(),\n",
-    "            l0learn_weights            = list()\n",
-    "\n",
-    "        )\n",
-    "    } else {\n",
-    "        weight_methods <- list(\n",
-    "            susie_weights = list(refine = FALSE, init_L = 5, max_L = 10),\n",
-    "            mrash_weights = list(init_prior_sd = TRUE, max.iter = 100),\n",
-    "            lasso_weights = list(),\n",
-    "            enet_weights  = list()\n",
-    "        )\n",
-    "    }\n",
-    "\n",
-    "    # ── 4. TWAS weights pipeline (CV + full-data weights) ────\n",
-    "    # Note: twas_weights_pipeline() in pecotmr >=0.5 (PR #486) does NOT accept num_threads.\n",
-    "    # ensemble_weights() is now exported; no local R/ensemble_weights.R source needed.\n",
-    "    cat(\"Running twas_weights_pipeline with\", length(weight_methods), \"methods...\\n\")\n",
-    "    res <- tryCatch(\n",
-    "        twas_weights_pipeline(\n",
-    "            X, y,\n",
-    "            cv_folds = ${cv_folds},\n",
-    "            weight_methods = weight_methods,\n",
-    "            max_cv_variants = ${max_cv_variants}\n",
-    "        ),\n",
-    "        error = function(e) {\n",
-    "            cat(\"Pipeline error:\", conditionMessage(e), \"\\n\")\n",
-    "            NULL\n",
-    "        }\n",
-    "    )\n",
-    "\n",
-    "    if (is.null(res)) {\n",
-    "        cat(\"Pipeline failed — exiting non-zero so SoS flags this replicate\\n\")\n",
-    "        saveRDS(list(results = NULL, ensemble = NULL,\n",
-    "                     truth = list(rep_id = \"${_replicates['rep_id']}\",\n",
-    "                                  seed = ${_replicates['seed']},\n",
-    "                                  error = TRUE)),\n",
-    "                \"${_output[0]}\")\n",
-    "        quit(save = \"no\", status = 1)\n",
-    "    }\n",
-    "\n",
-    "    # ── 5. Ensemble learning (steps 6-9 in the algorithm) ────\n",
-    "    cat(\"Computing ensemble weights...\\n\")\n",
-    "    ens <- tryCatch(\n",
-    "        pecotmr::ensemble_weights(\n",
-    "            cv_results = res$twas_cv_result,\n",
-    "            Y = y,\n",
-    "            twas_weight_list = res$twas_weights\n",
-    "        ),\n",
-    "        error = function(e) {\n",
-    "            cat(\"Ensemble error:\", conditionMessage(e), \"\\n\")\n",
-    "            NULL\n",
-    "        }\n",
-    "    )\n",
-    "\n",
-    "    # ── 6. Compute metrics ───────────────────────────────────\n",
-    "    # Four metrics per method and per ensemble:\n",
-    "    #   mse_w          = mean((w_hat - w_true)^2)                        weight-space, LD-independent\n",
-    "    #   mse_pred_truth = mean((X*w_hat - X*w_true)^2)                    prediction-space, LD-aware (what TWAS uses)\n",
-    "    #   mse_y          = mean((X*w_hat - y_test)^2)                      held-out Y (fresh eps), honest out-of-sample\n",
-    "    #   rsq_y          = cor(X*w_hat, y_test)^2                          held-out Y R^2\n",
-    "    compute_metrics <- function(w_hat) {\n",
-    "        if (is.matrix(w_hat)) w_hat <- w_hat[, 1]\n",
-    "        pred <- as.vector(X %*% w_hat)\n",
-    "        mse_w          <- mean((w_hat - true_beta)^2)\n",
-    "        mse_pred_truth <- mean((pred - true_genetic)^2)\n",
-    "        mse_y          <- mean((pred - y_test)^2)\n",
-    "        rsq_y          <- if (sd(pred) > 0) cor(pred, y_test)^2 else 0\n",
-    "        c(mse_w = mse_w,\n",
-    "          mse_pred_truth = mse_pred_truth,\n",
-    "          mse_y = mse_y,\n",
-    "          rsq_y = rsq_y)\n",
-    "    }\n",
-    "\n",
-    "    method_metrics <- sapply(names(res$twas_weights), function(wname) {\n",
-    "        compute_metrics(res$twas_weights[[wname]])\n",
-    "    })\n",
-    "    colnames(method_metrics) <- gsub(\"_weights$\", \"\", colnames(method_metrics))\n",
-    "\n",
-    "    ensemble_metrics <- c(mse_w = NA, mse_pred_truth = NA, mse_y = NA, rsq_y = NA)\n",
-    "    if (!is.null(ens) && !is.null(ens$ensemble_twas_weights)) {\n",
-    "        ensemble_metrics <- compute_metrics(ens$ensemble_twas_weights)\n",
-    "    }\n",
-    "\n",
-    "\n",
-    "    # ── 7. Save results ──────────────────────────────────────\n",
-    "    truth <- list(\n",
-    "        causal_sparse = sim$sparse_indices,\n",
-    "        causal_oligo = sim$oligogenic_indices,\n",
-    "        causal_inf = sim$infinitesimal_indices,\n",
-    "        true_beta = true_beta,\n",
-    "        true_y = y,\n",
-    "        y_test = y_test,\n",
-    "        residual_variance = sigma2,\n",
-    "        h2g = sim$h2g,\n",
-    "        rep_id = \"${_replicates['rep_id']}\",\n",
-    "        region = \"${_replicates['region']}\",\n",
-    "        seed = ${_replicates['seed']},\n",
-    "        n = n, p = p,\n",
-    "        method_metrics = method_metrics,\n",
-    "        ensemble_metrics = ensemble_metrics,\n",
-    "    )\n",
-    "\n",
-    "    out <- list(results = res, ensemble = ens, truth = truth)\n",
-    "    dir.create(dirname(\"${_output[0]}\"), recursive = TRUE, showWarnings = FALSE)\n",
-    "    saveRDS(out, \"${_output[0]}\")\n",
-    "    cat(\"Saved:\", \"${_output[0]}\", \"\\n\")\n",
-    "    tryCatch({\n",
-    "    cat(\"Ensemble MSE (weights): \", round(ensemble_metrics[\"mse_w\"],          6), \"\\n\")\n",
-    "    cat(\"Ensemble MSE (pred vs truth):\",\n",
-    "        round(ensemble_metrics[\"mse_pred_truth\"], 6), \"\\n\")\n",
-    "    cat(\"Ensemble MSE (held-out Y): \",\n",
-    "        round(ensemble_metrics[\"mse_y\"], 6), \"\\n\")\n",
-    "    cat(\"Ensemble R^2  (held-out Y): \",\n",
-    "        round(ensemble_metrics[\"rsq_y\"], 4), \"\\n\")\n",
-    "    }, error = function(e) cat(\"Metric-print failed (non-fatal):\", conditionMessage(e), \"\\n\"))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "### `analyze_results`\n",
-    "\n",
-    "Load all per-replicate RDS files and produce:\n",
-    "1. **MSE of weights** — $\\frac{1}{p}\\|\\hat{\\mathbf{w}} - \\mathbf{w}_{\\text{true}}\\|^2$ (LD-independent)\n",
-    "2. **MSE of predicted genetic component** — $\\frac{1}{n}\\|\\mathbf{G}\\hat{\\mathbf{w}} - \\mathbf{G}\\mathbf{w}_{\\text{true}}\\|^2$ (LD-aware, TWAS-relevant)\n",
-    "3. **MSE vs held-out $Y$** — $\\frac{1}{n}\\|Y_{\\text{test}} - \\mathbf{G}\\hat{\\mathbf{w}}\\|^2$, where $Y_{\\text{test}} = \\mathbf{G}\\mathbf{w}_{\\text{true}} + \\varepsilon_{\\text{new}}$\n",
-    "4. **$R^2$ vs held-out $Y$** — $\\text{cor}(Y_{\\text{test}},\\; \\mathbf{G}\\hat{\\mathbf{w}})^2$\n",
-    "5. **Coefficient heatmap** — ensemble $\\zeta_k$ across replicates\n",
-    "6. **Scatter plot** — ensemble $R^2$ vs. best single-method $R^2$\n",
-    "7. **Summary CSV** — one row per replicate with all metrics"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "outputs": [],
-   "source": [
-    "[analyze_results]\n",
-    "parameter: results_dir = path(f'{cwd}')\n",
-    "output: f'{results_dir}/diagnostics.pdf', f'{results_dir}/summary_table.csv'\n",
-    "task: trunk_workers = 1, walltime = \"1:00:00\", mem = \"8G\", cores = 1\n",
-    "R: expand = \"${ }\", stderr = f'{results_dir}/analyze.stderr', stdout = f'{results_dir}/analyze.stdout'\n",
-    "\n",
-    "    library(ggplot2)\n",
-    "    library(tidyr)\n",
-    "    library(dplyr)\n",
-    "\n",
-    "    results_dir <- \"${results_dir}\"\n",
-    "\n",
-    "    # ── 1. Load all RDS files ────────────────────────────────\n",
-    "    rds_files <- list.files(results_dir, pattern = \"results\\\\.rds$\",\n",
-    "                            recursive = TRUE, full.names = TRUE)\n",
-    "    cat(sprintf(\"Found %d RDS files\\n\", length(rds_files)))\n",
-    "\n",
-    "    all_data <- lapply(rds_files, function(f) {\n",
-    "        tryCatch(readRDS(f), error = function(e) NULL)\n",
-    "    })\n",
-    "    all_data <- all_data[!sapply(all_data, is.null)]\n",
-    "    all_data <- all_data[!sapply(all_data, function(x) isTRUE(x$truth$error))]\n",
-    "    cat(sprintf(\"%d successful replicates\\n\", length(all_data)))\n",
-    "\n",
-    "    if (length(all_data) == 0) stop(\"No valid results found\")\n",
-    "\n",
-    "    # ── 2. Extract metrics ───────────────────────────────────\n",
-    "    metrics_list <- lapply(seq_along(all_data), function(i) {\n",
-    "        x <- all_data[[i]]\n",
-    "        coefs      <- if (!is.null(x$ensemble)) x$ensemble$method_coef else NULL\n",
-    "        list(\n",
-    "            rep_id = x$truth$rep_id, seed = x$truth$seed, region = x$truth$region,\n",
-    "            method_coef = coefs,\n",
-    "            method_metrics = x$truth$method_metrics,\n",
-    "            ensemble_metrics = x$truth$ensemble_metrics\n",
-    "        )\n",
-    "    })\n",
-    "\n",
-    "    # ── 3. Tidy data frames ──────────────────────────────────\n",
-    "    build_metric_df <- function(metric_name) {\n",
-    "        do.call(rbind, lapply(metrics_list, function(m) {\n",
-    "            if (is.null(m$method_metrics)) return(NULL)\n",
-    "            if (!(metric_name %in% rownames(m$method_metrics))) return(NULL)\n",
-    "            methods <- colnames(m$method_metrics)\n",
-    "            vals    <- m$method_metrics[metric_name, ]\n",
-    "            method_df <- data.frame(rep_id = m$rep_id, method = methods,\n",
-    "                                    value = as.numeric(vals),\n",
-    "                                    stringsAsFactors = FALSE)\n",
-    "            ens_df <- data.frame(rep_id = m$rep_id, method = \"ensemble\",\n",
-    "                                 value = as.numeric(m$ensemble_metrics[metric_name]),\n",
-    "                                 stringsAsFactors = FALSE)\n",
-    "            rbind(method_df, ens_df)\n",
-    "        }))\n",
-    "    }\n",
-    "\n",
-    "\n",
-    "    mse_w_df          <- build_metric_df(\"mse_w\")\n",
-    "    mse_pred_truth_df <- build_metric_df(\"mse_pred_truth\")\n",
-    "    mse_y_df          <- build_metric_df(\"mse_y\")\n",
-    "    rsq_y_df          <- build_metric_df(\"rsq_y\")\n",
-    "\n",
-    "    coef_mat <- do.call(rbind, lapply(metrics_list, function(m) {\n",
-    "        if (is.null(m$method_coef)) return(NULL)\n",
-    "        m$method_coef\n",
-    "    }))\n",
-    "    if (!is.null(coef_mat)) {\n",
-    "        rownames(coef_mat) <- sapply(metrics_list[!sapply(metrics_list,\n",
-    "            function(m) is.null(m$method_coef))], function(m) m$rep_id)\n",
-    "    }\n",
-    "\n",
-    "    scatter_df <- do.call(rbind, lapply(metrics_list, function(m) {\n",
-    "        if (is.null(m$method_metrics)) return(NULL)\n",
-    "        if (!(\"rsq_y\" %in% rownames(m$method_metrics))) return(NULL)\n",
-    "        rsq_vals <- m$method_metrics[\"rsq_y\", ]\n",
-    "        data.frame(\n",
-    "            rep_id = m$rep_id,\n",
-    "            best_single_rsq = max(rsq_vals, na.rm = TRUE),\n",
-    "            best_method = names(which.max(rsq_vals)),\n",
-    "            ensemble_rsq = as.numeric(m$ensemble_metrics[\"rsq_y\"]),\n",
-    "            stringsAsFactors = FALSE\n",
-    "        )\n",
-    "    }))\n",
-    "\n",
-    "    # ── 4. Diagnostic plots ──────────────────────────────────\n",
-    "    make_boxplot <- function(df, y_label, title, descending = FALSE) {\n",
-    "        if (is.null(df) || nrow(df) == 0) return(NULL)\n",
-    "        ord <- df %>% group_by(method) %>%\n",
-    "            summarise(med = median(value, na.rm = TRUE))\n",
-    "        if (descending) ord <- ord %>% arrange(desc(med)) else ord <- ord %>% arrange(med)\n",
-    "        df$method <- factor(df$method, levels = ord$method)\n",
-    "        ggplot(df, aes(x = method, y = value,\n",
-    "                       fill = ifelse(method == \"ensemble\", \"Ensemble\", \"Single\"))) +\n",
-    "            geom_boxplot(outlier.size = 0.8) +\n",
-    "            scale_fill_manual(values = c(\"Ensemble\" = \"#E41A1C\",\n",
-    "                                          \"Single\"  = \"#377EB8\"), name = \"\") +\n",
-    "            labs(title = title, x = \"Method\", y = y_label) +\n",
-    "            theme_bw(base_size = 12) +\n",
-    "            theme(axis.text.x = element_text(angle = 45, hjust = 1))\n",
-    "    }\n",
-    "\n",
-    "    pdf(\"${_output[0]}\", width = 12, height = 8)\n",
-    "\n",
-    "\n",
-    "    p <- make_boxplot(mse_w_df, \"MSE of weights\",\n",
-    "                      \"MSE on Weights  (LD-independent):  mean((w_hat - w_true)^2)\",\n",
-    "                      descending = TRUE)\n",
-    "    if (!is.null(p)) print(p)\n",
-    "\n",
-    "    p <- make_boxplot(mse_pred_truth_df, \"MSE(G*w_hat vs G*w_true)\",\n",
-    "                      \"MSE on Predicted Genetic Component  (LD-aware, TWAS-relevant)\",\n",
-    "                      descending = TRUE)\n",
-    "    if (!is.null(p)) print(p)\n",
-    "\n",
-    "    p <- make_boxplot(mse_y_df, \"MSE (vs held-out Y)\",\n",
-    "                      \"MSE Against Held-Out Y  (fresh noise — honest out-of-sample)\",\n",
-    "                      descending = TRUE)\n",
-    "    if (!is.null(p)) print(p)\n",
-    "\n",
-    "    p <- make_boxplot(rsq_y_df, expression(R^2~(vs~held-out~Y)),\n",
-    "                      \"R-squared Against Held-Out Y\")\n",
-    "    if (!is.null(p)) print(p)\n",
-    "\n",
-    "    if (!is.null(coef_mat) && nrow(coef_mat) > 1) {\n",
-    "        coef_long <- as.data.frame(coef_mat) %>%\n",
-    "            mutate(replicate = rownames(coef_mat)) %>%\n",
-    "            pivot_longer(-replicate, names_to = \"method\", values_to = \"zeta\")\n",
-    "        p5 <- ggplot(coef_long, aes(x = method, y = replicate, fill = zeta)) +\n",
-    "            geom_tile() +\n",
-    "            scale_fill_gradient(low = \"white\", high = \"#E41A1C\",\n",
-    "                                name = expression(zeta[k]), limits = c(0, 1)) +\n",
-    "            labs(title = \"Ensemble Coefficients Across Replicates\",\n",
-    "                 x = \"Method\", y = \"Replicate\") +\n",
-    "            theme_bw(base_size = 10) +\n",
-    "            theme(axis.text.x = element_text(angle = 45, hjust = 1),\n",
-    "                  axis.text.y = element_text(size = 6))\n",
-    "        print(p5)\n",
-    "    }\n",
-    "\n",
-    "    if (!is.null(scatter_df) && nrow(scatter_df) > 0) {\n",
-    "        p6 <- ggplot(scatter_df, aes(x = best_single_rsq, y = ensemble_rsq)) +\n",
-    "            geom_point(alpha = 0.6, size = 2, color = \"#377EB8\") +\n",
-    "            geom_abline(intercept = 0, slope = 1, linetype = \"dashed\",\n",
-    "                        color = \"grey40\") +\n",
-    "            labs(title = \"Ensemble vs Best Single Method  (R^2 vs held-out Y)\",\n",
-    "                 x = expression(Best~single~method~R^2),\n",
-    "                 y = expression(Ensemble~R^2)) +\n",
-    "            theme_bw(base_size = 12) + coord_equal()\n",
-    "        print(p6)\n",
-    "    }\n",
-    "\n",
-    "    dev.off()\n",
-    "    cat(\"Saved plots:\", \"${_output[0]}\", \"\\n\")\n",
-    "\n",
-    "    # ── 5. Summary table ─────────────────────────────────────\n",
-    "    metric_rows <- c(\"mse_w\", \"mse_pred_truth\", \"mse_y\", \"rsq_y\")\n",
-    "    summary_rows <- lapply(metrics_list, function(m) {\n",
-    "        row <- data.frame(\n",
-    "            rep_id = m$rep_id, region = m$region, seed = m$seed,\n",
-    "            stringsAsFactors = FALSE\n",
-    "        )\n",
-    "        for (mr in metric_rows) {\n",
-    "            row[[paste0(\"ensemble_\", mr)]] <- as.numeric(m$ensemble_metrics[mr])\n",
-    "        }\n",
-    "        if (!is.null(m$method_metrics)) {\n",
-    "            for (mn in colnames(m$method_metrics)) {\n",
-    "                for (mr in metric_rows) {\n",
-    "                    if (mr %in% rownames(m$method_metrics)) {\n",
-    "                        row[[paste0(mn, \"_\", mr)]] <- m$method_metrics[mr, mn]\n",
-    "                    }\n",
-    "                }\n",
-    "            }\n",
-    "        }\n",
-    "        if (!is.null(m$method_coef)) {\n",
-    "            for (mn in names(m$method_coef)) {\n",
-    "                row[[paste0(\"zeta_\", mn)]] <- m$method_coef[[mn]]\n",
-    "            }\n",
-    "        }\n",
-    "        row\n",
-    "    })\n",
-    "    summary_df <- bind_rows(summary_rows)\n",
-    "\n",
-    "    method_rsq_cols <- grep(\"_rsq_y$\", names(summary_df), value = TRUE)\n",
-    "    method_rsq_cols <- setdiff(method_rsq_cols, \"ensemble_rsq_y\")\n",
-    "    if (length(method_rsq_cols) > 0) {\n",
-    "        summary_df$best_method_rsq_y <- apply(summary_df[method_rsq_cols], 1,\n",
-    "                                              max, na.rm = TRUE)\n",
-    "        summary_df$best_method <- apply(summary_df[method_rsq_cols], 1, function(r) {\n",
-    "            gsub(\"_rsq_y$\", \"\", names(which.max(r)))\n",
-    "        })\n",
-    "        summary_df$rsq_y_improvement <-\n",
-    "            summary_df$ensemble_rsq_y - summary_df$best_method_rsq_y\n",
-    "    }\n",
-    "\n",
-    "    write.csv(summary_df, \"${_output[1]}\", row.names = FALSE)\n",
-    "    cat(\"Saved summary:\", \"${_output[1]}\", \"\\n\")\n",
-    "\n",
-    "    cat(\"\\n=== Summary ===\\n\")\n",
-    "    cat(sprintf(\"Replicates: %d\\n\", nrow(summary_df)))\n",
-    "    cat(sprintf(\"Mean ensemble MSE (weights)     : %.6f\\n\",\n",
-    "                mean(summary_df$ensemble_mse_w,         na.rm = TRUE)))\n",
-    "    cat(sprintf(\"Mean ensemble MSE (pred truth)  : %.6f\\n\",\n",
-    "                mean(summary_df$ensemble_mse_pred_truth, na.rm = TRUE)))\n",
-    "    cat(sprintf(\"Mean ensemble MSE (held-out Y)  : %.6f\\n\",\n",
-    "                mean(summary_df$ensemble_mse_y,         na.rm = TRUE)))\n",
-    "    cat(sprintf(\"Mean ensemble R^2 (held-out Y)  : %.4f\\n\",\n",
-    "                mean(summary_df$ensemble_rsq_y,         na.rm = TRUE)))\n",
-    "    if (\"rsq_y_improvement\" %in% names(summary_df)) {\n",
-    "        cat(sprintf(\"Ensemble >= best in %.1f%% of replicates (R^2 vs held-out Y)\\n\",\n",
-    "            100 * mean(summary_df$rsq_y_improvement >= 0, na.rm = TRUE)))\n",
-    "    }"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "kernel": "SoS"
-   },
-   "source": [
-    "## Expected Results\n",
-    "\n",
-    "With default simulation parameters ($h^2_g = 0.2$, mixed architecture) on genotype regions:\n",
-    "\n",
-    "- **MSE against truth vs. MSE against $Y$**: MSE against truth isolates weight estimation quality (unaffected by noise). MSE against $Y$ includes irreducible noise $\\sigma^2_\\epsilon$, so it is always larger.\n",
-    "- **Coefficient patterns**: Sparse architectures favor SuSiE/LASSO; polygenic architectures favor BayesR/DPR; the ensemble adapts by assigning larger $\\zeta_k$ to the method best suited to each region's architecture.\n",
-    "- **Scatter plot**: Most points should lie on or above the diagonal, indicating ensemble matches or improves on the best single method."
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "SoS",
-   "language": "sos",
-   "name": "sos"
-  },
-  "language_info": {
-   "codemirror_mode": "sos",
-   "file_extension": ".sos",
-   "mimetype": "text/x-sos",
-   "name": "sos",
-   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
-   "pygments_lexer": "sos"
-  },
-  "sos": {
-   "kernels": [
-    [
-     "R",
-     "ir",
-     "R",
-     "",
-     "r"
-    ],
-    [
-     "SoS",
-     "sos",
-     "",
-     "",
-     "sos"
-    ]
-   ]
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}

From b27a39f3f394a11044990357d6157c833d06fb84 Mon Sep 17 00:00:00 2001
From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com>
Date: Tue, 19 May 2026 09:22:28 -0400
Subject: [PATCH 4/4] Add new QC notebooks to the table of contents

---
 website/_toc.yml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/website/_toc.yml b/website/_toc.yml
index d5efb01d2..5b919a091 100644
--- a/website/_toc.yml
+++ b/website/_toc.yml
@@ -26,8 +26,11 @@ parts:
       - file: code/molecular_phenotypes/calling/RNA_calling.ipynb
       - file: code/molecular_phenotypes/QC/bulk_expression_QC.ipynb
       - file: code/molecular_phenotypes/QC/bulk_expression_normalization.ipynb
+      - file: code/molecular_phenotypes/QC/apa_impute.ipynb
       - file: code/molecular_phenotypes/snRNAseq_preprocessing.ipynb
       - file: code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb
+      - file: code/molecular_phenotypes/QC/pseudobulk_expression_aggregation_QC_norm.ipynb
+      - file: code/molecular_phenotypes/QC/pseudobulk_mega_expression_QC_and_normalization.ipynb
     - file: code/molecular_phenotypes/methylation.ipynb
       sections:
       - file: code/molecular_phenotypes/calling/methylation_calling.ipynb