- Distribution name: grasp-tool
- Import name: grasp_tool
This is the official PyTorch implementation of:
"GRASP: Modeling Transcript Spot Graph Representations to Analyze Subcellular Patterns and Cell Clustering in High-Resolution Spatial Transcriptomics".
Note: This codebase was uploaded along with the manuscript for peer review. The complete code will be released after acceptance.
Check out the tutorial pages for demos and documentation:
If you have any questions, please don't hesitate to contact us.
- Python: 3.9+
- Training (optional): build-train-pkl and train-moco require PyTorch (torch) and PyTorch Geometric (torch-geometric). They are NOT installed by default via pip install grasp-tool.
- grasp_tool/: installable package source
- scripts/: repo-only helper scripts (tiny demo, release tooling)
- example_pkl/: example raw input used by demos (repo-only; excluded from PyPI wheel)
- demo_pkl/: small pre-generated registered subset for smoke tests (repo-only)
- envs/: optional conda env templates
- docs/: maintainer docs
The recommended workflow is:
- Use conda/mamba to create a stable Python environment (and GPU toolchain if needed)
- Use pip to install the GRASP package from PyPI (grasp-tool)
conda create -n grasp python=3.9 -y
conda activate grasp
Or use the provided environment file (creates env grasp and installs grasp-tool via pip):
conda env create -f envs/grasp-base.yml
conda activate grasp
pip install grasp-tool
Smoke checks (should work without training deps):
grasp-tool --help
grasp-tool train-moco --help
If you plan to run training commands (build-train-pkl, train-moco), install the training stack first:
- PyTorch install selector: https://pytorch.org/get-started/locally/
- PyG install guide: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
If you're running from this repo checkout, you can also run a small demo:
grasp-tool register \
--pkl_file example_pkl/simulated1_data_dict.pkl \
--output_pkl outputs/simulated1_registered.pkl
If you have a repo checkout, you can run a fast end-to-end smoke test on a small
subset of example_pkl/simulated1_data_dict.pkl.
This demo runs:
- register (with --nc_demo 4)
- create a tiny df_registered subset + pairs.csv
- portrait (JS distances)
- partition-graphs
- augment-graphs
- build-train-pkl
- train-moco for 1 epoch
Run:
# Default: uses conda env name "grasp"
bash scripts/tiny_demo_example_pkl.sh
Override the conda env name (useful if you have multiple envs):
GRASP_CONDA_ENV=<your_env_name> bash scripts/tiny_demo_example_pkl.sh
Outputs are written under:
outputs/tiny_demo_example_pkl_<timestamp>/
Notes:
- The final training step requires torch and torch-geometric.
- example_pkl/ is excluded from PyPI release artifacts; this demo is intended for repo checkouts.
- The scripts/ directory is not part of the installed PyPI wheel. If you installed via pip install grasp-tool, you need to clone this repo to use this demo script.
Pre-generated demo artifact (repo-only):
- demo_pkl/tiny_registered.pkl: a small df_registered subset (suitable for portrait, partition-graphs, cellplot)
- demo_pkl/pairs.csv: pairs table for build-train-pkl
You can use it to skip register during manual smoke tests:
python -m grasp_tool portrait \
--pkl_file demo_pkl/tiny_registered.pkl \
--output_dir outputs/portrait_demo \
--max_count 2 \
--num_threads 1 \
--visualize_top_n 0 \
--use_same_r
pip install grasp-tool intentionally does NOT pull in the training stack.
If you want to run training-related commands:
- build-train-pkl (needs torch + torch-geometric)
- train-moco (needs torch + torch-geometric)
Install them following the official PyTorch / PyTorch Geometric (PyG) instructions:
- PyTorch: https://pytorch.org/get-started/locally/
- PyG: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html
We provide two practical options below (pick one).
Option A (pip wheels; tested on Linux x86_64 + CUDA 12.1):
# Install PyTorch (CUDA 12.1 example)
pip install --no-cache-dir torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 \
--index-url https://download.pytorch.org/whl/cu121
# Install PyTorch Geometric (PyG) wheels matching torch + CUDA
pip install --no-cache-dir pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv \
-f https://data.pyg.org/whl/torch-2.2.2+cu121.html
pip install --no-cache-dir torch-geometric
Option B (conda):
# CPU-only
conda install -c pytorch pytorch torchvision torchaudio cpuonly
# CUDA (example: CUDA 12.1)
conda install -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda=12.1
Then install PyTorch Geometric (PyG) following the official guide (the exact command depends on your torch + CUDA build).
Make sure the installed PyG build matches your PyTorch and CUDA versions.
Verify your training stack:
python -c "import torch, torch_geometric; print('torch', torch.__version__, 'cuda', torch.cuda.is_available()); print('pyg', torch_geometric.__version__)"
If cuda is False, training will run on CPU.
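The same check can be made without importing the heavy packages, which is useful in a wrapper script that should fail early with a readable message. This is a minimal sketch using only the standard library; the helper name `training_stack_status` is illustrative and is not part of grasp-tool.

```python
import importlib.util


def training_stack_status(modules=("torch", "torch_geometric")):
    """Return {module_name: bool}, True when the module can be located.

    find_spec only looks the module up; it does not import it, so this
    stays fast even when torch is installed.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = training_stack_status()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
if not all(status.values()):
    print("Install the training stack before running build-train-pkl or train-moco.")
```

Note that this only tells you whether the packages are installed; it does not verify that the PyG build matches your torch and CUDA versions.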
Poetry is only needed for development and release.
poetry install
poetry run grasp-tool --help
If you are new to the project, the simplest path is:
conda create -n grasp python=3.9 -y
conda activate grasp
pip install grasp-tool
If you prefer a single command, use:
conda env create -f envs/grasp-base.yml
conda activate grasp
Optional extras (only needed for non-core utilities):
- Optimal Transport utilities (POT / import name ot): pip install grasp-tool[ot]
The recommended example input is:
example_pkl/simulated1_data_dict.pkl
This PKL contains the raw inputs required by register (and also already includes
df_registered; the steps below still show how to run the full pipeline end-to-end).
Input: a PKL dict containing at least:
- data_df (DataFrame; must include cell, type, centerX, centerY, x, y)
- cell_mask_df (DataFrame; columns: cell, x, y)
- nuclear_boundary (dict: cell -> DataFrame with columns x, y)
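The required keys and columns listed above can be checked before running register. The following is a minimal sketch, not part of grasp-tool: the function name `check_raw_input` is hypothetical, and it only relies on each table exposing a `.columns` attribute, as pandas DataFrames do.

```python
REQUIRED_COLUMNS = {
    "data_df": {"cell", "type", "centerX", "centerY", "x", "y"},
    "cell_mask_df": {"cell", "x", "y"},
}


def check_raw_input(data):
    """Return a list of problems; an empty list means the dict looks usable."""
    problems = []
    for key in ("data_df", "cell_mask_df", "nuclear_boundary"):
        if key not in data:
            problems.append(f"missing key: {key}")
    # Column checks for the two top-level tables.
    for key, required in REQUIRED_COLUMNS.items():
        if key in data:
            missing = required - set(data[key].columns)
            if missing:
                problems.append(f"{key} is missing columns: {sorted(missing)}")
    # nuclear_boundary maps each cell to a table with x and y columns.
    boundary = data.get("nuclear_boundary")
    if boundary is not None:
        if not isinstance(boundary, dict):
            problems.append("nuclear_boundary should be a dict of cell -> DataFrame")
        else:
            for cell, table in boundary.items():
                missing = {"x", "y"} - set(table.columns)
                if missing:
                    problems.append(
                        f"nuclear_boundary[{cell!r}] is missing columns: {sorted(missing)}"
                    )
    return problems


# Usage (pandas must be installed to unpickle the DataFrames):
# import pickle
# with open("example_pkl/simulated1_data_dict.pkl", "rb") as f:
#     print(check_raw_input(pickle.load(f)) or "OK")
```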
Run:
python -m grasp_tool register \
--pkl_file example_pkl/simulated1_data_dict.pkl \
--output_pkl outputs/simulated1_registered.pkl
Output: a PKL dict with keys:
- df_registered
- nuclear_boundary_df_registered
- cell_radii
- cell_nuclear_stats
The cellplot command is optional and mainly used for sanity-checking transcript spatial patterns.
Notes:
- This command can generate a large number of images if you do not restrict --cells and --genes.
- The input PKL must contain the required keys for the selected --mode.
- --mode raw-cell: expects cell_boundary (and optionally nuclear_boundary) in the raw PKL. If your raw PKL does not have cell_boundary, use --mode registered-gene instead.
Registered-gene plots (recommended; uses df_registered):
python -m grasp_tool cellplot \
--mode registered-gene \
--pkl outputs/simulated1_registered.pkl \
--output_dir outputs/cellplot \
--dataset simulated1 \
--cells cell_11 \
--genes gene_6_2_1,gene_6_3_1 \
--with_nuclear 1
Raw cell boundary plots (uses cell_boundary / nuclear_boundary in the raw PKL):
python -m grasp_tool cellplot \
--mode raw-cell \
--pkl example_pkl/simulated1_data_dict.pkl \
--output_dir outputs/cellplot_raw \
--dataset simulated1
python -m grasp_tool portrait \
--pkl_file outputs/simulated1_registered.pkl \
--output_dir outputs/portrait \
--use_same_r \
--visualize_top_n 0 \
--auto_params
python -m grasp_tool partition-graphs \
--pkl outputs/simulated1_registered.pkl \
--graph_root outputs/graphs \
--n_sectors 20 \
--m_rings 10 \
--k_neighbor 5
You can restrict scope for a quick smoke test:
python -m grasp_tool partition-graphs \
--pkl outputs/simulated1_registered.pkl \
--graph_root outputs/graphs_demo \
--cells cell_11,cell_135 \
--genes gene_0_0_0,gene_0_1_0
python -m grasp_tool augment-graphs \
--graph_root outputs/graphs \
--dropout_ratio 0.1 \
--seed 2025
You need a pairs.csv with columns: cell,gene.
This file defines which (cell, gene) pairs are included in the dataset.
build-train-pkl reads it, loads the corresponding graphs from graph_root, and writes a train.pkl used by train-moco.
You can use pairs.csv to subsample a large dataset for faster experiments.
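One way to subsample is to draw a fixed-size random subset of rows from an existing pairs file. The sketch below uses only the standard library; the function name `subsample_pairs` and the output file name are illustrative, not part of grasp-tool.

```python
import csv
import random


def subsample_pairs(src_csv, dst_csv, n_pairs, seed=0):
    """Write a random subset of (cell, gene) rows from src_csv to dst_csv.

    Returns the number of rows written. A fixed seed makes the subset
    reproducible across runs.
    """
    with open(src_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    if rows and not {"cell", "gene"} <= set(rows[0]):
        raise ValueError("pairs file must have 'cell' and 'gene' columns")
    rng = random.Random(seed)
    chosen = rng.sample(rows, min(n_pairs, len(rows)))
    with open(dst_csv, "w", newline="") as f:
        # extrasaction="ignore" drops any additional columns from the source.
        writer = csv.DictWriter(f, fieldnames=["cell", "gene"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(chosen)
    return len(chosen)


# Usage:
# subsample_pairs("outputs/pairs.csv", "outputs/pairs_small.csv", n_pairs=500)
```

Pass the smaller file to build-train-pkl via --pairs_csv, and keep the seed alongside your run notes so the subset can be regenerated.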
Example (generate pairs from df_registered):
python -c "import pickle, pandas as pd; d=pickle.load(open('outputs/simulated1_registered.pkl','rb')); pairs=d['df_registered'][['cell','gene']].drop_duplicates(); pairs.to_csv('outputs/pairs.csv', index=False); print('wrote outputs/pairs.csv', len(pairs))"
Then:
python -m grasp_tool build-train-pkl \
--pairs_csv outputs/pairs.csv \
--graph_root outputs/graphs \
--output_pkl outputs/train.pkl \
--dataset simulated1
Note: this stage requires torch and torch-geometric, which are NOT installed by
pip install grasp-tool. Install them first (see "Training dependencies" above).
python -m grasp_tool train-moco \
--dataset simulated1 \
--pkl outputs/train.pkl \
--js 0 \
--n 20 \
--m 10 \
--num_epoch 300 \
--batch_size 64 \
--cuda_device 0 \
--output_dir outputs/embeddings
If you want to use JS for positive sampling:
python -m grasp_tool train-moco \
--dataset simulated1 \
--pkl outputs/train.pkl \
--js 1 \
--js_file outputs/portrait/js_distances_*.csv
Optional: clustering evaluation with ground-truth labels
If you have a ground-truth label CSV, you can enable clustering evaluation during training. This will compute ARI/NMI/Accuracy/Precision/Recall/F1 and write best_* summaries.
Label CSV requirements:
- Must include both cell and gene columns. The embedding table is at (cell, gene) granularity, so labels must be joinable on (cell, gene).
- Must include one label column. Recognized column names (highest priority first): groundtruth_wzx, groundtruth, label, location, cluster, category, type.
Example label file:
cell,gene,groundtruth
10-0,SPTBN1,TypeA
10-0,MALAT1,TypeA
10-1,SPTBN1,TypeB
Run training with evaluation enabled:
python -m grasp_tool train-moco \
--dataset simulated1 \
--pkl outputs/train.pkl \
--num_clusters 8 \
--label_file /path/to/simulated1_label.csv
Notes:
- --num_clusters enables clustering evaluation; if omitted, no clustering metrics are computed.
- If the label file cannot be loaded, evaluation falls back to unknown labels (metrics will not be meaningful).
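Because a label file that fails to load silently degrades the metrics, it can help to pre-check the CSV header against the label CSV requirements described above. This is a minimal sketch with a hypothetical helper name, `pick_label_column`. It assumes exact, case-sensitive column names; the matching rules inside grasp-tool may differ.

```python
import csv

# Priority order as listed in the label CSV requirements.
LABEL_COLUMN_PRIORITY = [
    "groundtruth_wzx", "groundtruth", "label", "location",
    "cluster", "category", "type",
]


def pick_label_column(path):
    """Return the label column that would win by priority, or raise ValueError."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    missing = [name for name in ("cell", "gene") if name not in header]
    if missing:
        raise ValueError(f"label file is missing join columns: {missing}")
    for name in LABEL_COLUMN_PRIORITY:
        if name in header:
            return name
    raise ValueError(f"no recognized label column in header: {header}")


# Usage:
# print(pick_label_column("/path/to/simulated1_label.csv"))
```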
This section summarizes what each stage writes to disk.
Command:
python -m grasp_tool register --pkl_file <raw.pkl> --output_pkl <registered.pkl>
Output (<registered.pkl> is a dict with):
- df_registered (DataFrame): normalized transcript coordinates; contains at least cell, gene, x_c_s, y_c_s and also keeps original columns.
- nuclear_boundary_df_registered (DataFrame): normalized nucleus boundary points per cell (contains cell, x_c_s, y_c_s plus intermediate columns).
- cell_radii (dict): per-cell radius used by downstream partitioning.
- cell_nuclear_stats (DataFrame): per-cell nucleus exceed stats (exceed_percent / exceed_count / num_nuclear_points).
- meta (dict): run metadata.
Command:
python -m grasp_tool cellplot --mode <raw-cell|registered-gene> --pkl <input.pkl> --output_dir <dir>
Output (<dir>):
- Raw mode (--mode raw-cell): writes per-cell boundary plots under 1_<dataset>_raw_cell_plot/.
- Registered mode (--mode registered-gene): writes per-cell/per-gene scatter plots under <dataset>/registered_gene/<cell>/.
Command:
python -m grasp_tool portrait --pkl_file <registered.pkl> --output_dir <dir>
Output (<dir>):
- js_distances_*.csv: JS distance table used for positive sampling (when train-moco --js 1).
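If portrait has been run more than once into the same directory, the js_distances_*.csv pattern may match several files, and an unquoted shell glob would then expand to several arguments. The sketch below resolves the pattern to a single path before calling train-moco. Picking the most recently modified file is an assumption made for illustration, and `resolve_js_file` is not a grasp-tool function.

```python
import glob
import os


def resolve_js_file(pattern):
    """Return the most recently modified file matching a glob pattern."""
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"no file matches {pattern!r}")
    return max(matches, key=os.path.getmtime)


# Usage:
# js_file = resolve_js_file("outputs/portrait/js_distances_*.csv")
# print(js_file)  # pass this explicit path to: train-moco --js 1 --js_file <path>
```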
Command:
python -m grasp_tool partition-graphs --pkl <registered.pkl> --graph_root <graph_root>
Output directory layout (<graph_root>):
- <graph_root>/<cell>/<gene>_node_matrix.csv
- <graph_root>/<cell>/<gene>_adj_matrix.csv
- <graph_root>/<cell>/<gene>_dis_matrix.csv
These CSVs are the on-disk graph representation consumed by build-train-pkl.
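Given this layout, you can list which (cell, gene) pairs actually have a complete graph on disk, for example to cross-check a pairs.csv before building the training PKL. This is a minimal sketch and not part of grasp-tool. It assumes directories whose names end in `_aug` are augmented copies and skips them, which would also skip a real cell whose name happens to end in `_aug`.

```python
from pathlib import Path

SUFFIXES = ("_node_matrix.csv", "_adj_matrix.csv", "_dis_matrix.csv")


def list_complete_graphs(graph_root):
    """Return sorted (cell, gene) pairs that have all three CSVs on disk."""
    pairs = []
    for cell_dir in sorted(p for p in Path(graph_root).iterdir() if p.is_dir()):
        if cell_dir.name.endswith("_aug"):
            continue  # assumed to be output of augment-graphs
        for node_file in sorted(cell_dir.glob("*_node_matrix.csv")):
            gene = node_file.name[: -len("_node_matrix.csv")]
            if all((cell_dir / f"{gene}{suffix}").is_file() for suffix in SUFFIXES):
                pairs.append((cell_dir.name, gene))
    return pairs


# Usage:
# for cell, gene in list_complete_graphs("outputs/graphs"):
#     print(cell, gene)
```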
Command:
python -m grasp_tool augment-graphs --graph_root <graph_root>
Output directory layout:
- <graph_root>/<cell>_aug/<gene>_node_matrix.csv
- <graph_root>/<cell>_aug/<gene>_adj_matrix.csv
Command:
python -m grasp_tool build-train-pkl --pairs_csv <pairs.csv> --graph_root <graph_root> --output_pkl <train.pkl>
Output (<train.pkl> is a dict with):
- original_graphs: list of torch_geometric.data.Data
- augmented_graphs: list of torch_geometric.data.Data
- gene_labels, cell_labels: aligned labels for each graph
- meta: dataset tag + graph parameters + pairs.csv path
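A quick structural check of the loaded dict can catch a truncated or mismatched train.pkl before a long training run. The sketch below is illustrative only: `check_train_dict` is a hypothetical helper, and it assumes there is one augmented graph per original graph, which the key list above suggests but does not state.

```python
PER_GRAPH_KEYS = ("original_graphs", "augmented_graphs", "gene_labels", "cell_labels")


def check_train_dict(train):
    """Return the number of graphs, or raise if keys are missing or misaligned."""
    missing = [key for key in PER_GRAPH_KEYS if key not in train]
    if missing:
        raise KeyError(f"train dict is missing keys: {missing}")
    lengths = {key: len(train[key]) for key in PER_GRAPH_KEYS}
    # Assumption: all per-graph lists have the same length.
    if len(set(lengths.values())) != 1:
        raise ValueError(f"per-graph lists are not aligned: {lengths}")
    return lengths["original_graphs"]


# Usage (torch and torch-geometric must be installed to unpickle the graphs):
# import pickle
# with open("outputs/train.pkl", "rb") as f:
#     print(check_train_dict(pickle.load(f)), "graphs")
```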
Command:
python -m grasp_tool train-moco --dataset <name> --pkl <train.pkl> --output_dir <out_root>
Output directory layout (<out_root>):
- <out_root>/<run_id>/1_training_config.json: the full resolved args snapshot
- <out_root>/<run_id>/epoch{E}_lr{LR}_embedding.csv: main representation output
  - columns: feature_1..feature_d, cell, gene
- checkpoints:
  - <out_root>/<run_id>/epoch_{E}_lr_{LR}_checkpoint.pth
  - (optional) <out_root>/<run_id>/best_model_epoch_{E}_lr_{LR}.pth
- best summary (only when clustering is enabled via --num_clusters):
  - <out_root>/<run_id>/best_metrics_lr{LR}.json
  - <out_root>/<run_id>/best_{vis_method}_{cluster_method}_lr{LR}.png
- evaluation / visualization (from grasp_tool/gnn/plot_refined.py):
  - <out_root>/<run_id>/epoch{E}_lr{LR}_metrics*.txt
  - <out_root>/<run_id>/epoch{E}_lr{LR}_clusters*.csv
  - <out_root>/<run_id>/epoch{E}_lr{LR}_visualization*.png
- <out_root>/<run_id>/ALL_COMPLETED.txt: written after all learning rates finish
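To pick up results programmatically, a script can wait for ALL_COMPLETED.txt and then select the highest-epoch embedding CSV per learning rate. This sketch assumes that {E} is rendered as a plain integer and that {LR} is whatever text sits between `_lr` and `_embedding.csv`; check a real run directory before relying on it. The function names are illustrative, not part of grasp-tool.

```python
import re
from pathlib import Path

# Assumed rendering of epoch{E}_lr{LR}_embedding.csv
EMBEDDING_RE = re.compile(r"^epoch(\d+)_lr(.+)_embedding\.csv$")


def run_finished(run_dir):
    """True once the marker file for a completed run exists."""
    return (Path(run_dir) / "ALL_COMPLETED.txt").is_file()


def latest_embeddings(run_dir):
    """Map each learning-rate string to the path of its highest-epoch embedding CSV."""
    best = {}
    for path in Path(run_dir).iterdir():
        match = EMBEDDING_RE.match(path.name)
        if not match:
            continue
        epoch, lr = int(match.group(1)), match.group(2)
        if lr not in best or epoch > best[lr][0]:
            best[lr] = (epoch, path)
    return {lr: path for lr, (_, path) in best.items()}


# Usage:
# run_dir = "outputs/embeddings/<run_id>"
# if run_finished(run_dir):
#     for lr, path in latest_embeddings(run_dir).items():
#         print(lr, path)
```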
In general, you can always inspect the full list via:
python -m grasp_tool --help
python -m grasp_tool <command> --help
register:
- --pkl_file: input raw data dict PKL
- --output_pkl: output registered PKL (will contain df_registered)
- --nc_demo: process only first N cells (smoke test)
- --chunk_size: multiprocessing chunk size (speed/memory tradeoff)
- --clip_to_cell: 1 to clip nucleus to cell boundary; 0 to keep outside points
- --remove_outliers: 1 to drop nucleus points exceeding boundary
- --epsilon: numerical stability
cellplot:
- --mode: raw-cell or registered-gene
- --pkl / --pkl_file: input PKL path
- --output_dir: output directory root
- --dataset: dataset tag used in output paths (optional)
- --cells: restrict to a comma-separated subset of cells (recommended)
- --genes: restrict to a comma-separated subset of genes (registered-gene only; recommended)
- --with_nuclear: 1 to plot nucleus boundary if present, 0 to disable (registered-gene only)
portrait:
This command is a pass-through wrapper. Common knobs:
- --auto_params: auto-select r_min / r_max / bin_size
- --use_same_r: enforce the same r within each gene
- --max_count, --transcript_window: reduce compute for large datasets
- --output_dir: control where js_distances_*.csv is written
partition-graphs:
- --pkl: registered PKL (must contain df_registered)
- --graph_root: output root directory
- --n_sectors, --m_rings: partition resolution
- --k_neighbor: kNN graph connectivity
- --cells, --genes: restrict scope (smoke test)
- --epsilon: boundary classification tolerance
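To give a feel for what "partition resolution" means, here is one plausible reading of a sector/ring partition: each point is assigned an angular sector and a radial ring around the cell center. This is an illustrative sketch only. GRASP's actual binning, angle origin, and treatment of the registered coordinates (x_c_s, y_c_s) and per-cell radii may differ, and `polar_bin` is not a grasp-tool function.

```python
import math


def polar_bin(x, y, n_sectors=20, m_rings=10, radius=1.0):
    """Return (sector, ring) indices for a point relative to the origin.

    Sectors split the full circle into n_sectors equal angles; rings split
    [0, radius] into m_rings equal widths. Points at or beyond the rim are
    placed in the outermost ring.
    """
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)  # angle in [0, 2*pi)
    sector = min(int(theta / (2 * math.pi) * n_sectors), n_sectors - 1)
    ring = min(int(r / radius * m_rings), m_rings - 1)
    return sector, ring


# With 20 sectors and 10 rings there are 200 possible bins per cell.
print(polar_bin(0.1, 0.1))
```

Under this reading, larger --n_sectors and --m_rings give finer spatial bins at the cost of more, sparser nodes per graph.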
augment-graphs:
- --graph_root: directory created by partition-graphs
- --dropout_ratio: node dropout probability
- --seed: make augmentation deterministic
- --angle_min, --angle_max: rotation angle range (degrees)
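The interplay of a dropout probability and a seed can be illustrated with a generic node-dropout routine on an adjacency matrix. This is a sketch of the general technique under stated assumptions, not the package's augmentation code. The rotation controlled by --angle_min and --angle_max is not shown, and `drop_nodes` is a hypothetical name.

```python
import random


def drop_nodes(adjacency, dropout_ratio=0.1, seed=2025):
    """Drop each node independently with probability dropout_ratio.

    adjacency is a square list of lists. Returns (kept_indices, reduced_adjacency).
    The same seed always yields the same kept set for the same input size.
    """
    n = len(adjacency)
    if n == 0:
        return [], []
    rng = random.Random(seed)
    kept = [i for i in range(n) if rng.random() >= dropout_ratio]
    if not kept:
        kept = [rng.randrange(n)]  # keep at least one node so the graph is not empty
    reduced = [[adjacency[i][j] for j in kept] for i in kept]
    return kept, reduced


# Usage on a 4-node path graph:
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
kept, reduced = drop_nodes(path, dropout_ratio=0.5, seed=2025)
print(kept, reduced)
```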
build-train-pkl:
- --pairs_csv: CSV with columns cell, gene
- --graph_root: directory created by partition-graphs (and augmented by augment-graphs)
- --output_pkl: training PKL consumed by train-moco
- --dataset: dataset tag stored in metadata
- --processes: multiprocessing workers
train-moco:
This command runs the packaged training entrypoint (grasp_tool.cli.train_moco).
- --pkl: training PKL built by build-train-pkl
- --output_dir: output root directory
- --lrs: learning rate list (e.g. --lrs 0.001 or --lrs 0.001 0.002)
- --use_gradient_clipping: 1 (default) to clip gradients, 0 to disable
- --gradient_clip_norm: max norm for gradient clipping
- --js + --js_file: use JS distance for positive sampling
- --n, --m: must match the partition settings (in the examples above, --n 20 --m 10 pairs with --n_sectors 20 --m_rings 10)
- --seed: reproducibility
- --num_epoch, --batch_size: training schedule
- --cuda_device: GPU index
- --num_clusters: affects clustering evaluation (for very small datasets, set it <= num graphs)
- --label_file: optional ground-truth label CSV path (used by clustering evaluation; see above)
- Always record: n_sectors / m_rings / k_neighbor / dropout_ratio / seed and the exact pairs.csv.
- Prefer writing all outputs under outputs/ (or a dedicated run directory).
- For large runs, use tmux/screen; training can be slow due to evaluation + visualization.