A preprocessing pipeline that converts h5ad single-cell RNA-seq files into "cell sentences": space-separated lists of gene symbols, ordered by descending expression level, for use in machine learning applications.
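For intuition, here is a minimal sketch (illustrative only, not the repository's API) of how a single cell's expression vector becomes a cell sentence:

```python
import numpy as np

def cell_to_sentence(expression: np.ndarray, gene_symbols: list[str], top_genes: int = 2000) -> str:
    """Rank genes by descending expression and join the top symbols into one string."""
    order = np.argsort(expression)[::-1]              # highest-expressed genes first
    order = order[expression[order] > 0][:top_genes]  # drop unexpressed genes, cap at top N
    return " ".join(gene_symbols[i] for i in order)

# cell_to_sentence(np.array([0.0, 5.2, 1.1]), ["ACTB", "CD3D", "IL7R"]) -> "CD3D IL7R"
```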
- One-step streaming - Reads h5ad → creates cell sentences → extracts age → splits train/test → writes output (no interim files!)
- Type-safe - Full type hints throughout
- Memory-efficient - Chunked processing with Polars, backed h5ad loading
- Structured logging - Eliot logging with detailed diagnostics
- HGNC gene mapping - Official gene name conversions (auto-created when needed)
- Stratified splits - Train/test splits maintaining age distribution
- Publication metadata - Optional CellxGene API lookup for publication info (DOI, title, etc.)
- Batch processing - Process multiple h5ad files in one run
- Auto-detection - Finds h5ad files in specified input folder
This pipeline was developed for the AIDA (Asian Immune Diversity Atlas) dataset. An example processed dataset is available at: https://huggingface.co/datasets/transhumanist-already-exists/aida-asian-pbmc-cell-sentence-top2000
The example dataset is not included in this repository. You can download it or use your own h5ad files.
# Install dependencies
uv sync

One-Step Streaming Approach (Recommended):
# Download example AIDA dataset (optional)
uv run preprocess download
# Or use your own h5ad files - place them in data/input/
# The pipeline will auto-detect .h5ad files in the input folder
# Run full pipeline (auto-detects h5ad in data/input/)
# This processes everything in ONE streaming pass: h5ad → cell sentences → age extraction → train/test split → output
uv run preprocess run
# Or specify input file explicitly
uv run preprocess run path/to/file.h5ad --input-dir ./data/input
# Process multiple files in batch mode
uv run preprocess run --batch-mode --input-dir ./data/input

To upload processed data, configure your HuggingFace token:
cp .env.template .env
# Edit .env and add: HF_TOKEN=your_token_here

Then run with repo ID:
uv run preprocess run --repo-id "username/dataset-name"

Get your HuggingFace token from: https://huggingface.co/settings/tokens
# Run all tests
uv run pytest tests/ -v
# Run integration tests (downloads real data from CZI Science)
uv run pytest tests/test_integration.py::TestIntegrationPipeline -v -s
# Clean up test directories older than 7 days
uv run preprocess cleanup
# Remove all test directories
uv run preprocess cleanup --days 0

See tests/README.md for more details.
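Under the hood, age-based cleanup is a simple modification-time sweep; a minimal sketch of the idea (the root path here is hypothetical):

```python
import shutil
import time
from pathlib import Path

def cleanup_old_dirs(root: Path, days: int = 7) -> None:
    """Remove subdirectories of `root` last modified more than `days` ago."""
    cutoff = time.time() - days * 86_400
    for path in root.iterdir():
        if path.is_dir() and path.stat().st_mtime < cutoff:
            shutil.rmtree(path)  # days=0 removes everything, like `cleanup --days 0`

cleanup_old_dirs(Path("./test_outputs"), days=7)  # hypothetical test-directory root
```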
The run command uses a one-step streaming approach that processes everything in a single pass (see the sketch after this list):
- Reads h5ad chunks
- Creates cell sentences
- Extracts age from development_stage
- Splits into train/test (stratified by age)
- Writes directly to output
- No interim files created!
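A condensed sketch of this pass, assuming backed AnnData loading; file paths, helper names, and the per-chunk output layout are simplifications (the real pipeline also routes rows into stratified train/test outputs, sketched separately under Processing Options):

```python
import re
import anndata as ad
import polars as pl

def extract_age(development_stage: str) -> float | None:
    """Pull a numeric age from strings like '42-year-old human stage' (illustrative regex)."""
    match = re.search(r"(\d+)[\s-]*year", development_stage)
    return float(match.group(1)) if match else None

adata = ad.read_h5ad("data/input/example.h5ad", backed="r")  # matrix stays on disk
chunk_size, top_genes = 10_000, 2_000
genes = adata.var_names.to_list()

for start in range(0, adata.n_obs, chunk_size):
    chunk = adata[start : start + chunk_size].to_memory()  # only this chunk is materialized
    X = chunk.X.toarray() if hasattr(chunk.X, "toarray") else chunk.X
    pl.DataFrame({
        "cell_sentence": [
            " ".join(genes[i] for i in row.argsort()[::-1][:top_genes] if row[i] > 0)
            for row in X
        ],
        "age": [extract_age(s) for s in chunk.obs["development_stage"]],
    }).write_parquet(f"data/output/chunk_{start}.parquet")
```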
# Auto-detect h5ad file from data/input/
uv run preprocess run
# Batch mode - process all h5ad files in a directory
uv run preprocess run --batch-mode --input-dir ./data/input
# Or specify file and folders
uv run preprocess run /path/to/file.h5ad \
--output-dir ./data/output \
--repo-id "username/dataset-name" # Optional, for upload
# Skip train/test split (produce single parquet dataset)
# Useful when you want HuggingFace users to define their own splits
uv run preprocess run /path/to/file.h5ad \
--skip-train-test-split \
--output-dir ./data/output
# Publication metadata lookup is enabled by default
# This adds columns: collection_id, publication_title, publication_doi, publication_contact
# To disable: --no-lookup-publication
uv run preprocess run /path/to/file.h5ad \
--output-dir ./data/output

The run command supports many options for fine-tuning:
Input/Output:
- `h5ad_path` (optional argument) - Path to h5ad file or directory (auto-detects from `--input-dir` if not provided)
- `--input-dir` - Directory containing input files (default: `./data/input`)
- `--output-dir` / `-o` - Directory for final output files (default: `./data/output`)
- `--interim-dir` - Directory for interim files (default: `./data/interim`)
Processing Options:
- `--chunk-size` / `-c` - Number of cells per chunk (default: `10000`)
- `--top-genes` - Number of top expressed genes per cell (default: `2000`)
- `--test-size` - Proportion of data for the test set (default: `0.05`; see the split sketch after this list)
- `--skip-train-test-split` - Skip the train/test split and produce a single parquet dataset
- `--batch-mode` - Process all h5ad files in the input directory
- `--skip-existing` - Skip datasets that already have output files
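For reference, the stratified split that `--test-size` controls can be expressed with scikit-learn (an illustration under assumed names and paths; the repository may implement the split differently, e.g. streaming per chunk):

```python
import polars as pl
from sklearn.model_selection import train_test_split

df = pl.read_parquet("data/interim/cells.parquet")  # hypothetical input
train_idx, test_idx = train_test_split(
    list(range(df.height)),
    test_size=0.05,                # the --test-size default
    stratify=df["age"].to_list(),  # ages treated as discrete labels to preserve their distribution
    random_state=42,
)
df[train_idx].write_parquet("data/output/train.parquet")
df[test_idx].write_parquet("data/output/test.parquet")
```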
Compression:
- `--compression` - Compression algorithm: `uncompressed`, `snappy`, `gzip`, `lzo`, `brotli`, `lz4`, `zstd` (default: `zstd`)
- `--compression-level` - Compression level: 1-9 for zstd/gzip, 1-11 for brotli (default: `3`)
- `--use-pyarrow` / `--no-pyarrow` - Use the pyarrow backend for parquet writes (default: `True`)
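These flags map directly onto Polars' parquet writer; a minimal sketch of the equivalent call:

```python
import polars as pl

df = pl.DataFrame({"cell_sentence": ["CD3D IL7R ACTB"], "age": [42.0]})
# zstd at level 3 mirrors the pipeline defaults; use_pyarrow picks the write backend
df.write_parquet("cells.parquet", compression="zstd", compression_level=3, use_pyarrow=True)
```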
HGNC Gene Mapping:
- `--mappers` / `-m` - Path to the HGNC mappers pickle file (optional, auto-created if needed)
- `--create-hgnc` - Force creation of the HGNC mapper (default: auto-created only if needed)
HuggingFace Upload:
- `--repo-id` / `-r` - HuggingFace repository ID (e.g., `username/dataset-name`)
- `--token` / `-t` - HuggingFace API token (can also use the `HF_TOKEN` env var)
Publication Metadata:
- `--lookup-publication` / `--no-lookup-publication` - Enable/disable the CellxGene API lookup (default: enabled)
Logging:
- `--log-dir` - Directory for log files, with a separate log per file (default: `./logs`)
Other:
- `--keep-interim` - Keep interim parquet files after processing (default: `False`; interim files are cleaned up to save space)
For datasets from CellxGene Discover, publication metadata lookup is enabled by default. This automatically adds publication information:
# Publication lookup is enabled by default
uv run preprocess run ./data/input/10cc50a0-af80-4fa1-b668-893dd5c0113a.h5ad
# To disable publication lookup
uv run preprocess run ./data/input/file.h5ad --no-lookup-publication
# In batch mode (enabled by default)
uv run preprocess run --batch-mode --input-dir ./data/input

This queries the CellxGene API and adds the following columns to your output:
- `collection_id` - CellxGene collection ID
- `publication_title` - Title of the associated publication/collection
- `publication_doi` - DOI of the publication (if available)
- `publication_contact` - Contact name for the publication
Note: The API lookup may fail for some datasets if:
- The dataset is not from CellxGene Discover
- The CellxGene API changes
- Network issues occur
When lookup fails, processing continues without publication metadata.
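A rough sketch of such a lookup against the Discover curation API; the endpoints and field names here are assumptions and may not match what the pipeline actually calls:

```python
import requests

API = "https://api.cellxgene.cziscience.com/curation/v1"  # assumed base URL

def lookup_publication(dataset_id: str) -> dict:
    """Find the collection containing `dataset_id` and pull publication fields."""
    try:
        for collection in requests.get(f"{API}/collections", timeout=30).json():
            detail = requests.get(f"{API}/collections/{collection['collection_id']}", timeout=30).json()
            if any(d.get("dataset_id") == dataset_id for d in detail.get("datasets", [])):
                return {
                    "collection_id": detail["collection_id"],
                    "publication_title": detail.get("name"),
                    "publication_doi": detail.get("doi"),
                    "publication_contact": detail.get("contact_name"),
                }
    except requests.RequestException:
        pass  # as in the pipeline: continue without publication metadata on failure
    return {}
```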
If you need more control, you can run individual steps. Note: This creates interim files.
# Download with default URL (AIDA dataset)
uv run preprocess download
# Download from custom URL
uv run preprocess download --url https://example.com/dataset.h5ad
# Specify output directory and filename
uv run preprocess download \
--url https://example.com/dataset.h5ad \
--input-dir ./data/input \
--filename custom_name.h5ad
# Force re-download even if file exists
uv run preprocess download --force

Download Command Options:
- `--url` / `-u` - URL to download the dataset from (default: AIDA dataset URL)
- `--input-dir` / `-i` - Directory to save downloaded files (default: `./data/input`)
- `--filename` / `-f` - Optional filename (if not provided, extracted from the URL)
- `--force` - Force re-download even if the file exists
- `--log-file` / `-l` - Path to the eliot log file (optional)
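The download itself is a straightforward streaming fetch; a minimal sketch:

```python
from pathlib import Path
import requests

def download(url: str, dest: Path, force: bool = False) -> Path:
    """Stream a large h5ad to disk without holding it in memory."""
    if dest.exists() and not force:
        return dest  # mirrors the skip-if-exists behavior of the command
    dest.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(dest, "wb") as fh:
            for block in response.iter_content(chunk_size=1 << 20):  # 1 MiB blocks
                fh.write(block)
    return dest
```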
# Create HGNC mapper explicitly (usually auto-created when needed)
uv run preprocess hgnc-mapper --interim-dir ./data/interim
# Or force creation during run command
uv run preprocess run --create-hgnc

HGNC Mapper Command Options:
- `--interim-dir` / `-i` - Directory to save interim files (HGNC mappers) (default: `./data/interim`)
- `--log-file` / `-l` - Path to the eliot log file (optional)
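A sketch of what building such a mapper can look like, using the HGNC complete-set TSV; the URL, column choices, and pickle layout are assumptions, not necessarily the repository's:

```python
import io
import pickle
import polars as pl
import requests

HGNC_TSV = "https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt"  # assumed source

resp = requests.get(HGNC_TSV, timeout=120)
resp.raise_for_status()
hgnc = pl.read_csv(io.BytesIO(resp.content), separator="\t", infer_schema_length=0)

# Map Ensembl gene IDs (common in h5ad var) to approved HGNC symbols
ensembl_to_symbol = dict(
    hgnc.drop_nulls("ensembl_gene_id").select("ensembl_gene_id", "symbol").iter_rows()
)
with open("data/interim/hgnc_mappers.pkl", "wb") as fh:
    pickle.dump({"ensembl_to_symbol": ensembl_to_symbol}, fh)
```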
# Upload a single dataset directory (uses default repo-id if not specified)
uv run preprocess upload \
--output-dir ./data/output \
--token $HF_TOKEN
# Or specify custom repo-id
uv run preprocess upload \
--output-dir ./data/output \
--repo-id "username/dataset-name" \
--token $HF_TOKEN
# With custom README file
uv run preprocess upload \
--output-dir ./data/output \
--repo-id "username/dataset-name" \
--readme ./README.md \
--token $HF_TOKEN
# Or upload during processing (per-file uploads)
uv run preprocess run /path/to/file.h5ad \
--repo-id "username/dataset-name" \
--token $HF_TOKEN

Upload Command Options:
- `--output-dir` / `-o` - Directory containing the train/test subdirectories (default: `./data/output`)
- `--repo-id` / `-r` - HuggingFace repository ID (default: `longevity-genie/cell2sentence4longevity-data`)
- `--token` / `-t` - HuggingFace API token (required; can also use the `HF_TOKEN` env var)
- `--readme` - Path to a README file to include in the dataset (optional)
- `--log-file` / `-l` - Path to the eliot log file (optional)
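The upload step is equivalent to a `huggingface_hub` folder upload; a minimal sketch:

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo("username/dataset-name", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="./data/output",   # contains the train/test subdirectories
    repo_id="username/dataset-name",
    repo_type="dataset",
)
```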
Process multiple h5ad files efficiently. See BATCH_PROCESSING.md for full details.
# Process all h5ad files in a directory
uv run preprocess run --batch-mode --input-dir ./data/input
# With upload to HuggingFace (uploads after each successful file)
uv run preprocess run --batch-mode \
--input-dir ./data/input \
--repo-id username/my-datasets \
--token $HF_TOKEN
# Skip files that already have output
uv run preprocess run --batch-mode \
--input-dir ./data/input \
--skip-existing

Key features:
- One-step streaming: No interim files, everything in one pass
- Memory efficient: Processes terabytes of data with constant memory usage
- Error isolation: Failures in one file don't stop others
- Per-file logging: Separate logs for each dataset
- Batch summary: Creates `batch_processing_summary.tsv` with timing and status for all files (see the sketch after this list)
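A sketch of the error-isolation pattern behind batch mode (`process_one_file` is a hypothetical stand-in for the per-file streaming pass):

```python
import time
import traceback
from pathlib import Path
import polars as pl

def process_one_file(path: Path) -> None:
    """Hypothetical stand-in for the pipeline's per-file streaming pass."""
    ...

rows = []
for h5ad in sorted(Path("./data/input").glob("*.h5ad")):
    started = time.time()
    try:
        process_one_file(h5ad)
        status = "success"
    except Exception:
        traceback.print_exc()
        status = "failed"  # one bad file does not stop the rest
    rows.append({"file": h5ad.name, "status": status, "seconds": round(time.time() - started, 1)})

pl.DataFrame(rows).write_csv("batch_processing_summary.tsv", separator="\t")
```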
cell2sentence4longevity/
├── src/cell2sentence4longevity/
│ ├── preprocess.py # Main CLI with all commands
│ ├── cleanup.py # Cleanup utilities
│ └── preprocessing/
│ ├── hgnc_mapper.py # Gene mapping
│ ├── h5ad_converter.py # H5AD conversion (one-step & two-step)
│ ├── train_test_split.py # Data splitting (legacy)
│ ├── upload.py # HuggingFace upload
│ └── download.py # Dataset download
├── tests/ # Integration tests
├── docs/ # Documentation
├── pyproject.toml
└── README.md
All operations use Eliot structured logging. To enable file logging:
# Logs are written to --log-dir (default: ./logs)
# Each dataset gets its own log directory
uv run preprocess run /path/to/file.h5ad --log-dir ./logs
# In batch mode, each file gets its own log subdirectory
uv run preprocess run --batch-mode --log-dir ./logs

This creates per-dataset logs:
- `./logs/{dataset_name}/pipeline.json` - Machine-readable structured logs
- `./logs/{dataset_name}/pipeline.log` - Human-readable formatted logs
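The logging pattern itself is plain Eliot; a minimal sketch of how nested actions end up as JSON messages (the log path and action fields here are illustrative, though the action names echo the grep examples below):

```python
from eliot import start_action, to_file

to_file(open("logs/pipeline.json", "ab"))  # every message becomes one JSON object

with start_action(action_type="age_extraction_summary", n_cells=10_000):
    with start_action(action_type="parse_development_stage"):
        pass  # real work goes here; exceptions mark the action as failed in the log
```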
# View specific log sections
grep "age_extraction_summary" logs/pipeline.log
grep "filtering_summary" logs/pipeline.log
grep "gene_mapping_summary" logs/pipeline.logSee docs/LOGGING.md for detailed documentation.
CC BY 4.0