HFDP — Habitat Factorized Dynamics-derived Phenotypes

HFDP is research software for predicting pathologic complete response (pCR) in breast cancer from dynamic contrast‑enhanced MRI (DCE‑MRI). It models enhancement dynamics (time‑intensity curves) and learns a generative habitat factorization through diffusion‑inspired denoising reconstruction. Imaging features can optionally be fused with clinical covariates.

HFDP is organized as a two-stage pipeline:

  • Stage 1 (pretrain): learn a habitat decomposer that factorizes a DCE time series into K soft spatial habitats and K corresponding enhancement curves (curve shape + timing). Training uses noise-conditioned, diffusion-inspired denoising reconstruction, plus diagnostics.
  • Stage 2 (downstream): freeze the decomposer, cache curve-dynamics features (and optional per-habitat tokens), then train a lightweight classifier and fusion head (cov-only / img-only / fused) to predict pCR.
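
To make the stage 1 factorization concrete, here is a minimal NumPy sketch. It is not HFDP's model: it assumes the reconstruction is a habitat-weighted sum of curves, x_hat[t] = sum_k habitats[k] * curves[k, t], with soft habitat weights that sum to 1 at every voxel. The function name reconstruct_from_habitats and the tiny shapes are hypothetical.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def reconstruct_from_habitats(habitats, curves):
    """Hypothetical reconstruction: weight each habitat's curve by its soft map.

    habitats: [K, Z, X, Y] soft assignments (sum to 1 over K at every voxel)
    curves:   [K, T] one enhancement curve per habitat
    returns:  [T, Z, X, Y] reconstructed DCE time series
    """
    return np.einsum("kzxy,kt->tzxy", habitats, curves)

rng = np.random.default_rng(0)
K, T, Z, X, Y = 3, 5, 2, 4, 4
habitats = softmax(rng.normal(size=(K, Z, X, Y)), axis=0)  # soft spatial habitats
curves = rng.normal(size=(K, T))                           # one curve per habitat
x_hat = reconstruct_from_habitats(habitats, curves)
print(x_hat.shape)  # (5, 2, 4, 4)
```

Under this assumption, every voxel's time-intensity curve is a convex combination of the K habitat curves, which is what makes the habitat maps interpretable as soft memberships.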

Stage 3 (end-to-end finetuning) is planned but not implemented.

Project status

  • Pre‑alpha: APIs/configs may change without warning.
  • Not clinically validated: do not use for medical decision-making.
  • No patient data included: you provide your own DCE volumes, masks, and clinical metadata.

Repository layout

  • pretrain.py: stage 1 habitat decomposer pretraining.
  • train.py: stage 2 downstream pCR training (cached habitat features + covariate fusion).
  • hfdp/: library code (data, models, training, utils).
  • configs/: minimal example configs (see configs/README.md).
  • docs/: technical docs and diagrams.

Useful dataset/process notes:

Core tensors (stage 1)

  • x0: [T, Z, X, Y] (DCE time series)
  • times_sec: [T] (acquisition times)
  • key_padding_mask: [T] (True = padded)
  • breast_mask: [Z, X, Y]
  • tumor_mask: [Z, X, Y]
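
The shapes above can be exercised with dummy data. The sketch below is illustrative only: pad_time_axis is a hypothetical helper, and zero right-padding along the time axis is an assumption, not necessarily how HFDP batches exams. It follows the stated convention that key_padding_mask is True for padded timepoints.

```python
import numpy as np

def pad_time_axis(x0, times_sec, t_max):
    """Hypothetical helper: right-pad a DCE series to `t_max` timepoints.

    Returns (x0_padded, times_padded, key_padding_mask), where
    key_padding_mask[t] is True for padded (invalid) timepoints.
    """
    t = x0.shape[0]
    if t > t_max:
        raise ValueError(f"series has {t} timepoints, more than t_max={t_max}")
    pad = t_max - t
    x0_padded = np.pad(x0, [(0, pad), (0, 0), (0, 0), (0, 0)])  # zeros after last frame
    times_padded = np.pad(times_sec, (0, pad))
    key_padding_mask = np.arange(t_max) >= t
    return x0_padded, times_padded, key_padding_mask

# Dummy exam: 4 acquired timepoints on a tiny [Z, X, Y] grid.
T, Z, X, Y = 4, 2, 8, 8
x0 = np.random.default_rng(0).random((T, Z, X, Y), dtype=np.float32)
times_sec = np.array([0.0, 90.0, 180.0, 270.0], dtype=np.float32)
breast_mask = np.ones((Z, X, Y), dtype=bool)
tumor_mask = np.zeros((Z, X, Y), dtype=bool)
tumor_mask[:, 2:5, 2:5] = True  # tumor region lies inside the breast mask

x0_p, times_p, key_padding_mask = pad_time_axis(x0, times_sec, t_max=6)
print(x0_p.shape, key_padding_mask.tolist())
```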

Installation

git clone git@github.com:uchicago-dsi/hfdp.git
cd hfdp
git submodule update --init --recursive

Install micromamba (optional)

If you do not already have micromamba:

macOS (Homebrew)

brew install micromamba

Linux (x86_64)

mkdir -p ~/.local/bin
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj -C ~/.local/bin --strip-components=1 bin/micromamba
export PATH="$HOME/.local/bin:$PATH"

# enable `micromamba activate` (restart your shell after this)
micromamba shell init -s bash -p ~/.micromamba

For other platforms, see the official micromamba installation docs.

Then create the recommended HFDP environment:

micromamba env create -f environment.yml
micromamba activate hfdp
python -m pip install --no-build-isolation -r requirements-dev.txt
python -m pip install -e .

If ffmpeg -version fails inside the env with libopenh264.so.5: cannot open shared object file, repair the env-local ABI link once:

ln -sf "$CONDA_PREFIX/lib/libopenh264.so.2.1.1" "$CONDA_PREFIX/lib/libopenh264.so.5"

environment.yml is the default Linux + NVIDIA recipe used for HFDP work in this repo. It provides:

  • Python 3.11
  • PyTorch + CUDA runtime
  • ffmpeg for mask_debug overlay movies
  • the PyRadiomics build prerequisites needed for a clean micromamba install
  • PYTHONNOUSERSITE=1 inside the env so ~/.local packages do not leak in

The explicit pip install --no-build-isolation -r requirements-dev.txt step is intentional: pyradiomics==3.0.1 needs versioneer and the in-env numpy visible at build time, which standard isolated builds do not provide.

Verify the editable path points at this checkout:

python - <<'PY'
import hfdp
print(hfdp.__file__)
PY
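
If you want a yes/no answer rather than reading the printed path, a small helper along these lines can do the comparison. This is a sketch: is_inside_checkout is a hypothetical name, and the example paths are placeholders.

```python
from pathlib import Path

def is_inside_checkout(module_file, checkout_root):
    """Return True if `module_file` lives somewhere under `checkout_root`."""
    module_path = Path(module_file).resolve()
    root = Path(checkout_root).resolve()
    return root in module_path.parents

# In a real session you would pass hfdp.__file__ and the checkout directory:
#   import hfdp
#   print(is_inside_checkout(hfdp.__file__, "."))
editable = is_inside_checkout("/work/hfdp/hfdp/__init__.py", "/work/hfdp")
stale = is_inside_checkout(
    "/opt/env/lib/python3.11/site-packages/hfdp/__init__.py", "/work/hfdp"
)
print(editable, stale)
```

A False result for your real hfdp.__file__ would suggest that a non-editable copy is shadowing the checkout.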

The environment.yml flow above is the supported install path for this repo; do not expect a raw pip install -r requirements*.txt to reproduce the same environment.

For one-shot commands without activation, keep the same isolation explicitly:

PYTHONNOUSERSITE=1 micromamba run -n hfdp python train.py --config <yaml>

If you need a CPU-only or non-NVIDIA setup, keep the same editable-install step but swap the PyTorch lines in environment.yml for the appropriate packages for your platform.

Config quickstart

data:
  mode: breast_volume
  slice_cache:
    intensity_normalization: per_exam_minmax
    enforce_left_on_left: true
pretrain:
  training:
    max_epochs: 10
  decomposer:
    enabled: true
    k_habitats: 8
    target_grid_zyx: [96, 144, 144]
    input_representation: delta_t0

Before running, edit the example configs under configs/ to point at your data:

  • data.paths.dataset_root (required)
  • data.paths.mask_dirs and data.paths.breast_mask_dirs (required)
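
A quick pre-flight check for these keys might look like the sketch below. It is not HFDP's own validation: the helper names are hypothetical, the config is shown as an already-parsed dict, and the value types (for example, a list for mask_dirs) are assumptions.

```python
def get_nested(config, dotted_key):
    """Return config["a"]["b"]["c"] for "a.b.c", or None if any level is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

REQUIRED_KEYS = (
    "data.paths.dataset_root",
    "data.paths.mask_dirs",
    "data.paths.breast_mask_dirs",
)

def missing_required_keys(config):
    """List required dotted keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not get_nested(config, key)]

# Config as it might look after YAML parsing; paths are placeholders.
config = {
    "data": {
        "mode": "breast_volume",
        "paths": {"dataset_root": "/path/to/dataset", "mask_dirs": ["/path/to/masks"]},
    }
}
print(missing_required_keys(config))  # ['data.paths.breast_mask_dirs']
```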

Quickstart (debug)

Stage 1 (habitat decomposer pretraining):

python pretrain.py --config configs/pretrain/habitat_decomposer_mvp.yaml --debug

Stage 2 (downstream fusion head):

python train.py --config configs/downstream/habitat_fusion_baseline.yaml --debug

Submitit: single-slot CPU CV

The foreman owns queued experiment selection and normal Slurm throughput. It materializes launches through run_with_submitit.py, which remains the job-level submitit adapter for script/config/fold execution and CV aggregation. Use run_with_submitit.py directly only for one-off debug or intentionally manual launches.

For downstream-only radiomics runs, you can execute all CV folds inside one Slurm slot without requesting GPUs:

python run_with_submitit.py \
  --cpu-only \
  --single-job-cv \
  --parallel-folds-per-job 5 \
  --folds all \
  --cpus_per_task 32 \
  --mem_gb_per_gpu 192 \
  --partition general \
  --timeout 720 \
  --constraint "a100|h100|h200" \
  --name <run_name> \
  train.py --config <config.yaml>

This mode is designed for slot-limited sweeps. Tune data.num_workers per config to avoid CPU oversubscription when multiple folds run concurrently. For radiomics backends, also set train.cache.radiomics_exam_shards (usually equal to --parallel-folds-per-job) so folds cooperatively build exam-cache shards instead of waiting on a single cache lock.
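
As a starting point for tuning data.num_workers, one simple heuristic is to split the slot's CPUs evenly across concurrent folds and keep one core per fold for the training process. The sketch below illustrates that arithmetic only; workers_per_fold is a hypothetical helper, not part of run_with_submitit.py.

```python
def workers_per_fold(cpus_per_task, parallel_folds, reserve_per_fold=1):
    """Heuristic: share CPUs evenly across folds, reserving cores for each
    fold's main training process."""
    if parallel_folds < 1:
        raise ValueError("parallel_folds must be >= 1")
    share = cpus_per_task // parallel_folds
    return max(share - reserve_per_fold, 0)

# Values from the command above: --cpus_per_task 32, --parallel-folds-per-job 5
print(workers_per_fold(32, 5))  # 5
```

With 32 CPUs and 5 folds, each fold gets 6 cores, leaving 5 for data loading. Treat the result as an upper bound and lower it if radiomics extraction is itself multi-threaded.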

Clinical labels

Set the ground-truth column via data.paths.label_column (defaults to pcr):

data:
  paths:
    label_column: pcr
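
For illustration, here is how a configurable label column could be read from tabular clinical metadata. This is a sketch under assumptions: the CSV layout, the patient_id column, and the choice to skip rows with an empty label are all hypothetical and may differ from HFDP's loader.

```python
import csv
import io

def read_labels(csv_text, label_column="pcr", id_column="patient_id"):
    """Map each patient ID to an integer label, skipping rows with no label."""
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[label_column].strip()
        if value == "":
            continue  # no recorded outcome for this exam
        labels[row[id_column]] = int(value)
    return labels

metadata = "patient_id,age,pcr\nP001,52,1\nP002,47,0\nP003,60,\n"
print(read_labels(metadata))  # {'P001': 1, 'P002': 0}
```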

Citation

This repo ships a CITATION.cff file for GitHub’s citation UI.

License

MIT (see LICENSE).
