Peter's Parse and Processing of Prenatal Particulars via Pandas
P6 is a simple, extensible CLI for downloading the Human Phenotype Ontology (HPO), parsing genotype/phenotype Excel workbooks, and producing GA4GH Phenopackets. It downloads the latest or a specified HPO JSON release, auto-classifies Excel sheets as genotype or phenotype data, normalizes column names and HPO IDs, and writes one Phenopacket per record. Additional commands audit workbooks for header normalization, sheet classification, and required variant columns. P6 runs locally, installs via pip, and is intended to support reproducible phenotypic data preparation for research and clinical workflows. Its end use is to convert an existing digital record of phenotypic data into Phenopackets so that they can be linked to their corresponding VCFs and integrated into a larger federated repository system.
- Features
- Prerequisites
- Installation
- Quickstart
- CLI Reference
- Development & Testing
- Contributing
- License
- Contact
- Download: fetch the latest or a specific hp.json release from GitHub
- Parse: autodetect genotype vs phenotype sheets in any Excel workbook
- Normalize: clean up column names, HPO IDs, timestamps, and data types
- Generate: emit individual Phenopacket files, one per record (will change the file extension later)
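The Normalize step can be illustrated with a small sketch. This is not P6's actual implementation: the helper names `normalize_column_name` and `normalize_hpo_id`, and the exact rules (snake_case headers, `HP:` followed by seven zero-padded digits), are assumptions chosen for illustration.

```python
import re


def normalize_column_name(name: str) -> str:
    """Lower-case a header and collapse runs of non-alphanumerics into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def normalize_hpo_id(raw: str) -> str:
    """Coerce variants such as 'hp_1250' or 'HP:0001250 ' to 'HP:0001250'."""
    match = re.fullmatch(r"\s*(?:HP)?[:_\s]*(\d{1,7})\s*", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not an HPO identifier: {raw!r}")
    # Zero-pad the numeric part to the seven digits used by HPO term IDs.
    return f"HP:{int(match.group(1)):07d}"
```

Both helpers work on plain strings, so they could be mapped over a pandas column index or a column of IDs; values that do not look like HPO identifiers raise `ValueError` rather than being silently passed through.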
- Clone the repo:
git clone https://github.com/VarenyaJ/P6.git
cd P6
- (Recommended) Create a virtual environment (venv or Conda):
With venv:
python3 -m venv .venv
source .venv/bin/activate
With Conda:
conda env create -f requirements/environment.yml -y
conda activate P6
- Install via pip:
python3 -m pip install -r requirements/requirements.txt .
- Verify the installation:
p6 --help
You should see something like:
Usage: p6 [OPTIONS] COMMAND [ARGS]...

  P6: Peter's Parse and Processing of Prenatal Particulars via Pandas.

Options:
  --help  Show this message and exit.

Commands:
  download     Download a specific or the latest HPO JSON release into...
  parse-excel  Read each sheet, check column order, then: - Identify as a...
Fetch the latest release into tests/data/ (the default directory):
p6 download
After running, you'll have tests/data/hp.json.
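To sanity-check the downloaded file, a short script can build an ID-to-label lookup. The sketch below assumes hp.json follows the OBO Graphs JSON layout (a top-level `graphs` list whose `nodes` carry an IRI in `id` and a label in `lbl`); the function names are hypothetical and not part of P6.

```python
import json
from pathlib import Path

# Assumed IRI prefix for HPO term nodes in the OBO Graphs layout.
HPO_IRI_PREFIX = "http://purl.obolibrary.org/obo/HP_"


def hpo_labels(document: dict) -> dict[str, str]:
    """Map 'HP:0001250'-style IDs to labels, skipping non-HPO nodes."""
    labels = {}
    for graph in document.get("graphs", []):
        for node in graph.get("nodes", []):
            node_id = node.get("id", "")
            if node_id.startswith(HPO_IRI_PREFIX) and "lbl" in node:
                labels["HP:" + node_id[len(HPO_IRI_PREFIX):]] = node["lbl"]
    return labels


def load_hpo_labels(path: str = "tests/data/hp.json") -> dict[str, str]:
    """Read the JSON file written by `p6 download` and index its terms."""
    with Path(path).open(encoding="utf-8") as handle:
        return hpo_labels(json.load(handle))
```

Calling `len(load_hpo_labels())` after a download gives a quick indication of whether the file is complete and parseable.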
With your HPO JSON in place at tests/data/hp.json, run:
p6 parse-excel -e tests/data/Sydney_Python_transformation.xlsx
Resulting phenopacket files will be under:
phenopacket_from_excel/$(date "+%Y-%m-%d_%H-%M-%S")/phenopackets/
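The `$(date ...)` part of that path is shell notation for a timestamp. As a rough sketch, assuming the directory is named for the moment the command runs, the same path can be computed in Python; `phenopacket_output_dir` is a hypothetical helper, not a P6 function.

```python
from datetime import datetime
from pathlib import Path


def phenopacket_output_dir(now: datetime, base: str = "phenopacket_from_excel") -> Path:
    """Build <base>/<YYYY-MM-DD_HH-MM-SS>/phenopackets for a given timestamp."""
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")  # same format string as the shell `date` call
    return Path(base) / stamp / "phenopackets"
```

Because each run gets its own timestamped directory, finding the most recent output amounts to sorting the subdirectories of `phenopacket_from_excel/` by name.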
Quickly check each sheet in an Excel file for header normalization, sheet classification, and presence of required variant columns.
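One of these checks, the presence of required variant columns, can be sketched as follows. The column names in `REQUIRED_VARIANT_COLUMNS` are placeholders rather than the columns P6 actually requires, and `missing_variant_columns` is a hypothetical helper.

```python
# Placeholder names only; substitute the columns your genotype sheets must contain.
REQUIRED_VARIANT_COLUMNS = ("chromosome", "start_position", "reference", "alternate")


def missing_variant_columns(headers, required=REQUIRED_VARIANT_COLUMNS):
    """Return the required columns absent from a sheet's headers, in order."""
    present = {header.strip().lower() for header in headers}
    return [column for column in required if column not in present]
```

An empty list means the sheet passes this check; a non-empty list names exactly what is missing, which is the kind of detail an audit report can surface per sheet.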
p6 audit-excel -e tests/data/Sydney_Python_transformation.xlsx
By default you get a table; use -r for a JSON output to the console.
p6 audit-excel -e tests/data/Sydney_Python_transformation.xlsx -r
Usage:
p6 download [OPTIONS]
Options:
-d, --data-path PATH where to save HPO JSON (default: tests/data)
-v, --hpo-version TEXT exact HPO release tag (e.g. 2025-03-03 or v2025-03-03)
--help Show this help message and exit.
Examples:
Fetch a specific release tag (e.g. v2025-03-03 or 2025-03-03) into tests/data/ (the default directory):
p6 download -v 2025-03-03
p6 download --hpo-version 2025-03-03
Fetch a specific release tag (e.g. v2025-03-03 or 2025-03-03) into a custom directory:
p6 download -d src/P6 -v 2025-03-03
p6 download --data-path src/P6 --hpo-version 2025-03-03
Read an Excel workbook, classify sheets, normalize fields, and emit Phenopacket protobuffers.
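For orientation, the shape of a single record's output can be sketched as a plain dictionary using the JSON field names of the GA4GH Phenopacket Schema v2. This is an illustration only: P6 emits protobuf messages, and the function name, the `createdBy` value, and the example identifiers below are assumptions.

```python
import json
from datetime import datetime, timezone


def build_phenopacket(record_id, subject_id, hpo_terms, hpo_version, created=None):
    """Assemble a minimal Phenopacket-shaped dict from (HPO ID, label) pairs."""
    created = created or datetime.now(timezone.utc)  # `created` is assumed to be in UTC
    return {
        "id": record_id,
        "subject": {"id": subject_id},
        "phenotypicFeatures": [
            {"type": {"id": term_id, "label": label}} for term_id, label in hpo_terms
        ],
        "metaData": {
            "created": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "createdBy": "P6",  # placeholder value
            "resources": [{
                "id": "hp",
                "name": "human phenotype ontology",
                "url": "http://purl.obolibrary.org/obo/hp.owl",
                "version": hpo_version,
                "namespacePrefix": "HP",
                "iriPrefix": "http://purl.obolibrary.org/obo/HP_",
            }],
            "phenopacketSchemaVersion": "2.0",
        },
    }


if __name__ == "__main__":
    packet = build_phenopacket("rec-1", "patient-1", [("HP:0001250", "Seizure")], "2025-03-03")
    print(json.dumps(packet, indent=2))
```

Recording the HPO release in `metaData.resources` ties each Phenopacket to the ontology version it was generated against, which matters when the same workbook is reprocessed after a new HPO release.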
Usage: p6 parse-excel [OPTIONS] EXCEL_FILE
Options:
-e, --excel-path FILE path to the Excel workbook [required]
-hpo, --custom-hpo FILE path to a custom HPO JSON file (defaults to `tests/data/hp.json`)
--help Show this message and exit.
Example:
Explicitly point at a custom HPO file:
p6 parse-excel -e tests/data/Sydney_Python_transformation.xlsx -hpo src/P6/hp.json
Run a lightweight audit on each sheet in an Excel workbook, reporting header counts, sheet classification, and missing variant-column checks.
Usage: p6 audit-excel [OPTIONS] EXCEL_FILE
Options:
-e, --excel-path FILE path to the Excel workbook [required]
-r, --report-json output audit report as JSON instead of table
--help Show this message and exit.
Install dev requirements:
python3 -m pip install -r requirements/requirements.txt -r requirements/requirements_test.txt .
This installs P6 along with the dependencies needed for development.
Run the full test suite:
pytest -q
Lint & type-check (via ruff and built-in assertions):
ruff check .
ruff format .
- Fork the repo & create a feature branch
- Make your changes & add tests
- Ensure all tests pass & lint is clean
- Submit a pull request against main
- Please follow the AGPL-3.0 code of conduct.
This project is licensed under the AGPL-3.0. See LICENSE for details.
Varenya Jain [email protected] GitHub: @VarenyaJ