STRAND (Subcellular-Resolved Transcriptome RNA Architecture and Navigation Database) Tools is a command-line toolbox for subcellular spatial transcriptomics analysis. It provides four commands for standardized PKL input data:
| Command | Function |
|---|---|
subcellfeat |
Full pipeline: compute 17 spatial features (Bento + SPRAWL) and predict RNA localization patterns |
subcellfeat-pattern |
Predict patterns from pre-computed feature parquet (skip feature computation) |
subcellfeat-compartment |
Detect subcellular transcriptomic compartments via RNAflux embedding + SOM clustering |
subcellfeat-coloc |
Identify significant RNA-RNA colocalization pairs via InSTAnT PP + CPB |
The toolbox is designed for molecule-resolved spatial transcriptomics datasets such as MERFISH, Xenium, CosMx, seqFISH, and other transcript-level spatial data.
The main command. It computes 17 spatial features (13 Bento + 4 SPRAWL) for each cell-gene pair from a PKL bundle, then predicts RNA localization patterns using a trained XGBoost classifier.
Supported pattern classes:
Nuclear, Nuclear edge, Cytoplasmic, Cell edge, Protrusion, Radial, Random, Foci
Default prediction strategy:
- Use the primary 8-class XGBoost model.
- Calculate the proportion of predictions classified as Foci.
- If Foci ratio > 0.5, rerun prediction with the 7-class no-Foci model.
- Save the final result with the unified column name
pattern.
Use --features-only to compute features without pattern prediction.
A lightweight command for when features have already been computed. Takes a feature parquet file as input and applies the XGBoost classifier directly, skipping the expensive Bento/SPRAWL computation.
Detects subcellular transcriptomic compartments using RNA spatial distribution and embedding-based analysis. The pipeline samples transcript neighborhoods, computes RNAflux embeddings, and clusters them via SOM (Self-Organizing Map).
Outputs (prefixed with --out-prefix):
{out_prefix}_sampled_batches.csv
{out_prefix}_sampled_batches_meta.json
{out_prefix}_{batch}_embedding.png
{out_prefix}_{batch}_subdomain.png
Identifies significant RNA-RNA colocalization relationships within cells. It runs InSTAnT Proximal Pairs (PP) and Conditional Probability of Barcodes (CPB) tests, filters for significance, and generates visualizations.
If the input data has no z column, the command automatically switches to 2D mode.
Outputs:
instant_significant_pairs.csv
prefilter_summary.csv
viz_cpb_heatmap_topN.png
viz_coloc_network_styled.png
This repository uses Git LFS for model and data files.
git clone https://github.com/TingtingLiGroup/STRAND.git
cd STRAND
git lfs pullconda env create -f environment.yml
conda activate subcellular
pip install -e .python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .The compartment detection module supports GPU acceleration via PyTorch. To enable it:
pip install -e ".[gpu]"Without PyTorch, compartment detection runs on CPU with NumPy (functionally identical, slower on large datasets).
subcellfeat --help
subcellfeat-pattern --help
subcellfeat-compartment --help
subcellfeat-coloc --helpThe toolbox uses a standardized PKL bundle as input.
The PKL file should contain a Python dictionary:
{
"data_df": pandas.DataFrame,
"cell_boundary": dict,
"nuclear_boundary": dict,
"coordinates": pandas.DataFrame # optional
}Required columns:
cell
gene
x
y
Optional columns:
z
fov
batch
sample_id
umi
annotation
Each row represents one transcript molecule.
Example:
cell gene x y z fov
cell_1 ACTB 123.5 456.2 0 FOV1
cell_1 MALAT1 130.1 440.8 0 FOV1
The cell boundary should be a dictionary:
cell_boundary[cell_id] = pandas.DataFrame({
"x": [...],
"y": [...]
})Each key is a cell ID. Each value is the polygon boundary of that cell.
The nuclear boundary should follow the same format:
nuclear_boundary[cell_id] = pandas.DataFrame({
"x": [...],
"y": [...]
})If nuclear boundary is unavailable, it can be empty or missing. Some nucleus-related features may be affected, but the pipeline should remain compatible.
subcellfeat \
--pkl data/simulated_data_dict.pkl \
--out results/simulated_pattern.parquet \
--profileFor large datasets, use --fast to skip the two slowest SPRAWL features (punctate + radial):
subcellfeat \
--pkl data/simulated_data_dict.pkl \
--out results/simulated_pattern.parquet \
--fast --profileOr reduce SPRAWL approximation parameters:
subcellfeat \
--pkl data/simulated_data_dict.pkl \
--out results/simulated_pattern.parquet \
--sprawl-iterations 20 --sprawl-pairs 2 --profileTo compute features only (no pattern prediction):
subcellfeat \
--pkl data/simulated_data_dict.pkl \
--out results/simulated_features.parquet \
--features-only --profileIf you already have a feature parquet from step 4.1 --features-only:
subcellfeat-pattern \
--input results/simulated_features.parquet \
--output results/simulated_pattern.parquet \
--profilesubcellfeat-compartment \
--pkl data/simulated_data_dict.pkl \
--out-prefix results/simulated_compartment \
--frac 0.01 \
--radius 40 \
--n-clusters auto \
--profilesubcellfeat-coloc \
--pkl data/simulated_data_dict.pkl \
--out-dir results/simulated_coloc \
--use-2d \
--profileNote: if the input PKL has no z column, --use-2d is applied automatically.
The Pattern module computes 17 features for each cell-gene pair.
bento_cell_inner_proximity
bento_nucleus_inner_proximity
bento_nucleus_outer_proximity
bento_cell_inner_asymmetry
bento_nucleus_inner_asymmetry
bento_nucleus_outer_asymmetry
bento_point_dispersion_norm
bento_nucleus_dispersion_norm
bento_l_max
bento_l_max_gradient
bento_l_min_gradient
bento_l_monotony
bento_l_half_radius
sprawl_peripheral
sprawl_central
sprawl_punctate
sprawl_radial
The two slowest SPRAWL features are:
sprawl_punctate
sprawl_radial
Their approximation parameters can be controlled by:
--sprawl-iterations
--sprawl-pairsExample:
--sprawl-iterations 20 --sprawl-pairs 2This reduces runtime but may introduce more randomness in SPRAWL punctate and radial scores.
The main output is a parquet file:
pattern.parquet
It contains:
cell
gene
Bento features
SPRAWL features
class probability columns
pattern
The final pattern label is stored in:
pattern
Typical outputs (prefixed with --out-prefix):
{out_prefix}_sampled_batches.csv
{out_prefix}_sampled_batches_meta.json
{out_prefix}_{batch}_embedding.png
{out_prefix}_{batch}_subdomain.png
Typical outputs:
instant_significant_pairs.csv
prefilter_summary.csv
viz_cpb_heatmap_topN.png
viz_coloc_network_styled.png
top pair cell plots
The toolbox uses Python pickle to load PKL input bundles. Pickle deserialization can execute arbitrary code. Only load PKL files that you or your collaborators produced from trusted data sources. Do not load PKL files from untrusted or unknown origins.
- The toolbox expects standardized PKL input.
- The main command-line tools use user-provided input and output paths.
- Large datasets should be processed with background execution.
- Pattern classification depends on trained XGBoost models stored in the
models/directory. - For reproducibility, keep the model files and the code version synchronized.
If you use STRAND Tools in your research, please cite:
STRAND Tools: A command-line toolbox for subcellular spatial transcriptomics analysis. https://github.com/TingtingLiGroup/STRAND
Contributions are welcome. Please open an issue or submit a pull request on GitHub.
This project is licensed under the MIT License. See LICENSE for details.