Binary file added .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .devcontainer/Dockerfile
@@ -0,0 +1,3 @@
FROM debian:bookworm

WORKDIR /workspaces/genome-exploration
25 changes: 25 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,25 @@
{
"name": "Pathogenic Variant Finder Dev Container",
"dockerFile": "Dockerfile",
"customizations": {
"vscode": {
"extensions": [
"ExodiusStudios.comment-anchors",
"ms-vscode-remote.remote-containers",
"ms-azuretools.vscode-docker"
],
"settings": {
"editor.tabSize": 2,
"terminal.integrated.defaultProfile.linux": "zsh"
}
}
},
"features": {
"ghcr.io/stuartleeks/dev-container-features/shell-history:0": {},
"ghcr.io/schlich/devcontainer-features/powerlevel10k:1": {},
"ghcr.io/nils-geistmann/devcontainers-features/zsh:0": {},
"ghcr.io/devcontainers/features/rust:1": {},
"ghcr.io/devcontainers/features/python:1": {}
},
"postStartCommand": "apt-get update && apt-get install -y pkg-config libssl-dev && cargo build --release && ln -sf \"${PWD}/target/release/pathogenic_variant_finder\" /usr/local/bin/pathogenic"
}
21 changes: 21 additions & 0 deletions .gitignore
@@ -0,0 +1,21 @@
# Generated by Cargo
# will have compiled files and executables
debug/
target/

# These are backup files generated by rustfmt
**/*.rs.bk

# MSVC Windows builds of rustc generate these, which store debugging information
*.pdb

#Additional filetypes to ignore
*.log
*.csv

#Specific directories / files to ignore
reports/
data/
clinvar_data/
Cargo.lock
.cursorrules
10 changes: 10 additions & 0 deletions Cargo.toml
@@ -14,3 +14,13 @@ csv = "1.3"
rayon = "1.10"
# For .gz
flate2 = "1.1.0"
# Noodles for VCF parsing/handling - use the meta crate
noodles = { version = "0.95.0", features = ["vcf", "csi", "core"] }
# Date/time handling for logging
chrono = "0.4"
# Progress bars
indicatif = "0.17"
# CPU count for parallel processing
num_cpus = "1.16"
# XZ compression support
xz2 = "0.1"
123 changes: 116 additions & 7 deletions README.md
@@ -8,7 +8,7 @@ This tool analyzes VCF (Variant Call Format) files to identify potentially patho

## Features

- **Comprehensive Pathogenic Variant Detection**: Identifies variants classified as pathogenic or likely pathogenic in ClinVar
- **Comprehensive Variant Detection**: Identifies variants classified as pathogenic, benign, or of uncertain significance in ClinVar
- **Population Frequency Integration**: Incorporates allele frequencies from 1000 Genomes (global and by population)
- **High Performance**:
- Parallel processing using Rayon's work-stealing thread pool
@@ -21,39 +21,96 @@ This tool analyzes VCF (Variant Call Format) files to identify potentially patho
- Mode of inheritance if available
- Citations and additional evidence
- **Automatic Reference Data Management**: Downloads and maintains necessary reference databases
- **Advanced Reporting**: Generates detailed CSV reports and summary statistics files
- **Flexible Variant Selection**: Configurable inclusion of pathogenic, VUS, and benign variants

## Installation

There are multiple ways to install the Pathogenic Variant Finder:

### Option 1: Using the installer script (recommended)

```bash
# Clone the repository
git clone https://github.com/SauersML/pathogenic.git
cd pathogenic

# Run the installer script (you might need to make it executable first)
chmod +x install.sh
./install.sh

# Or run it directly with bash
bash install.sh
```

The installer will:
1. Build the release version of the tool
2. Give you options to create a symlink in a directory in your PATH
3. Allow you to run the tool simply as `pathogenic` from anywhere

### Option 2: Manual installation

```bash
# Clone the repository
git clone https://github.com/SauersML/pathogenic.git
cd pathogenic

# Build the project
cargo build --release

# The binary will be available at target/release/pathogenic
# The binary will be available at target/release/pathogenic_variant_finder
# You can run it directly:
./target/release/pathogenic_variant_finder -b GRCh38 -i your_variants.vcf

# Optional: Create a symlink for easier access
sudo ln -sf "$(pwd)/target/release/pathogenic_variant_finder" /usr/local/bin/pathogenic
```

### Development Container Users

If you're using the provided Dev Container, the tool will be automatically built and made available as `pathogenic` in your PATH when the container starts.

## Usage

```
# Basic usage
pathogenic --build GRCh38 --input your_variants.vcf > results.csv
Once installed, you can use the tool simply as:

```bash
# Basic usage (pathogenic variants only)
pathogenic -b GRCh38 -i your_variants.vcf

# Include Variants of Uncertain Significance (VUS)
pathogenic -b GRCh38 -i your_variants.vcf -v

# Include benign variants
pathogenic -b GRCh38 -i your_variants.vcf -n

# Include both VUS and benign variants
pathogenic -b GRCh38 -i your_variants.vcf -v -n

# Disable markdown report generation
pathogenic -b GRCh38 -i your_variants.vcf --markdown-report=false
```

### Command-line Arguments

- `--build`, `-b`: Genome build, must be either "GRCh37" (hg19) or "GRCh38" (hg38)
- `--input`, `-i`: Path to the input VCF file (can be uncompressed or gzipped)
- `--include-vus`, `-v`, `--vus`: Include variants of uncertain significance in the output
- `--include-benign`, `-n`, `--benign`: Include benign variants in the output
- `--markdown-report`, `--md-report`: Generate markdown report (enabled by default, use `--markdown-report=false` to disable)
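
The effect of `-v` and `-n` on which variants are reported can be sketched as below. This is a minimal illustration, not the tool's actual code: the `Category` enum, the `Selection` struct, and `is_reported` are hypothetical names, and grouping conflicting classifications with VUS is an assumption.

```rust
/// Broad significance buckets, mirroring the "Significance Category" column.
/// The enum itself is hypothetical and only used for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category {
    Pathogenic,
    Benign,
    Vus,
    Conflicting,
}

/// Hypothetical selection flags corresponding to `-v` and `-n`.
#[derive(Debug, Clone, Copy, Default)]
struct Selection {
    include_vus: bool,
    include_benign: bool,
}

/// Returns true if a variant in `category` should appear in the report.
/// Assumption: pathogenic variants are always reported, and conflicting
/// classifications are treated like VUS. The real tool may differ.
fn is_reported(category: Category, sel: Selection) -> bool {
    match category {
        Category::Pathogenic => true,
        Category::Vus | Category::Conflicting => sel.include_vus,
        Category::Benign => sel.include_benign,
    }
}

fn main() {
    // Equivalent to running with `-v` but without `-n`.
    let sel = Selection { include_vus: true, include_benign: false };
    for c in [Category::Pathogenic, Category::Benign, Category::Vus, Category::Conflicting] {
        println!("{:?}: {}", c, is_reported(c, sel));
    }
}
```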

## Output

The tool outputs a CSV file to stdout with the following columns:
The tool generates the following output files in the `reports/` directory:

### 1. CSV Report File

The CSV report contains detailed information about all identified variants with the following columns:

- Chromosome, Position, Reference Allele, Alternate Allele
- Clinical Significance
- Is Alt Pathogenic
- Significance Category (pathogenic, benign, vus, conflicting)
- Gene
- ClinVar Allele ID
- Clinical Disease Name (CLNDN)
@@ -74,6 +131,46 @@ The tool outputs a CSV file to stdout with the following columns:

Depending on the data available for a given variant, many of these fields may be empty.

### 2. Statistics Text File

A companion statistics file summarizing the analysis settings and results:

- Analysis settings (input file, genome build, variant types included)
- Command used to run the analysis
- Total variants processed and reported
- Number of unique genes
- Counts of each variant classification type
- Distribution of allele frequencies across populations

### 3. Markdown Report File

A comprehensive, human-readable markdown report that includes:

- Analysis settings and command used
- Summary statistics with variant counts by category
- Table of contents linking to different sections
- Detailed variant information organized by clinical significance
- Variants grouped by gene with sortable tables
- Detailed information about each variant including:
- Location, DNA change, and genotype
- Clinical significance and disease associations
- Population frequencies
- Mode of inheritance and other annotations
- Understanding section with explanations of terms

The markdown report is generated by default but can be disabled using `--markdown-report=false`.

### Output File Naming

Output files follow this naming convention:
```
[input_filename]_[analysis_type]_[timestamp].csv
[input_filename]_[analysis_type]_[timestamp]_stats.txt
[input_filename]_[analysis_type]_[timestamp].md
```

Where `analysis_type` indicates which variant types were included in the report.
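
As a rough sketch of how the three names could be assembled from one run's settings: the `analysis_type` labels, the timestamp format, and the function names below are assumptions made for illustration, and the tool's real values may differ.

```rust
use std::path::Path;

/// Hypothetical label for the `analysis_type` component of the file name.
fn analysis_type(include_vus: bool, include_benign: bool) -> &'static str {
    match (include_vus, include_benign) {
        (false, false) => "pathogenic_only",
        (true, false) => "pathogenic_vus",
        (false, true) => "pathogenic_benign",
        (true, true) => "pathogenic_vus_benign",
    }
}

/// Builds the CSV, statistics, and markdown file names for one run.
/// `timestamp` is passed in preformatted so the sketch needs only std;
/// its exact format here is an assumption.
fn report_names(input: &str, include_vus: bool, include_benign: bool, timestamp: &str) -> [String; 3] {
    // Strip the directory and the last extension from the input path.
    let stem = Path::new(input)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("input");
    let base = format!("{}_{}_{}", stem, analysis_type(include_vus, include_benign), timestamp);
    [
        format!("{base}.csv"),
        format!("{base}_stats.txt"),
        format!("{base}.md"),
    ]
}

fn main() {
    for name in report_names("data/your_variants.vcf", true, false, "20250101_120000") {
        println!("reports/{name}");
    }
}
```

Note that `file_stem` removes only the final extension, so a gzipped input such as `sample.vcf.gz` would yield the stem `sample.vcf` in this sketch.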

## How It Works

1. **Data Collection**: The tool automatically downloads necessary reference databases:
@@ -84,18 +181,30 @@ Depending on available data, many of these fields may not be available.
2. **Variant Processing**:
- Parses the user's VCF input
- Filters for variants present in the sample's genotype
- Matches variants against ClinVar to identify pathogenic variants
- Matches variants against ClinVar based on selected variant types
- Integrates 1000 Genomes allele frequency data
- Adds detailed annotations from ClinVar summary data

3. **Output Generation**:
- Sorts variants by chromosome and position
- Outputs comprehensive CSV with all annotations
- Generates a statistics file summarizing the analysis
- Creates a detailed markdown report with interactive sections (unless disabled)
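
The sort by chromosome and position could look like the sketch below. The `Variant` struct, `chrom_rank`, and the chosen ordering (1 to 22, then X, Y, MT, with unrecognized contigs last) are assumptions for illustration; the tool's actual ordering is not shown in this README.

```rust
/// Minimal stand-in for a reported variant; the real record has more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Variant {
    chrom: String,
    pos: u64,
}

/// Maps a chromosome name to a sortable rank: 1..=22, then X, Y, MT.
/// An optional "chr" prefix is ignored; unrecognized contigs sort last.
fn chrom_rank(chrom: &str) -> u32 {
    let name = chrom.strip_prefix("chr").unwrap_or(chrom);
    match name {
        "X" => 23,
        "Y" => 24,
        "M" | "MT" => 25,
        other => match other.parse::<u32>() {
            Ok(n) if (1..=22).contains(&n) => n,
            _ => u32::MAX,
        },
    }
}

/// Sorts in place by chromosome rank, then by position within a chromosome.
fn sort_variants(variants: &mut [Variant]) {
    variants.sort_by_key(|v| (chrom_rank(&v.chrom), v.pos));
}

fn main() {
    let mut variants = vec![
        Variant { chrom: "X".to_string(), pos: 5 },
        Variant { chrom: "10".to_string(), pos: 7 },
        Variant { chrom: "2".to_string(), pos: 9 },
        Variant { chrom: "2".to_string(), pos: 3 },
    ];
    sort_variants(&mut variants);
    for v in &variants {
        println!("{}\t{}", v.chrom, v.pos);
    }
}
```

Ranking chromosomes numerically avoids the plain string ordering in which "10" would sort before "2".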

## Logging

The tool maintains a log file (`pathogenic.log`) that captures all processing steps and can be useful for troubleshooting.

## Documentation

For more detailed information, see the documentation in the `docs/` directory:

- [Implementation Details](docs/implementation_details.md)
- [Noodles Integration](docs/noodles_integration.md)
- [Parallel Processing](docs/parallel_processing.md)
- [Reporting Features](docs/reporting_features.md)
- [1000 Genomes Frequency Extraction](docs/1000genome_frequency_extraction.md)

## Usage
Do not use this for clinical purposes.
- I might have written a bug in the code.