Genome Architecture & Taxonomy Toolkit

**No longer maintained.** See https://github.com/microstijn/genomeArch for further development.

This toolkit provides a suite of modular, command-line-driven scripts for analyzing prokaryotic genomes. The primary workflow extracts taxonomic and assembly metadata from NCBI datasets, retrieves associated environmental data from the OmniMicrobe API, and calculates detailed gene overlap and genome architecture metrics from GFF files.

The entire process is managed by a central pipeline controller (run_pipeline.jl) for automated, end-to-end analysis.

Features

  • Automated Pipeline: A master script (run_pipeline.jl) runs the entire workflow, from raw data to final analysis, in the correct order.
  • Comprehensive Analysis:
    • Extracts full taxonomic lineage (superkingdom to species).
    • Maps organisms to known environments using the OmniMicrobe database.
    • Calculates unidirectional, convergent, and divergent gene overlaps (see the sketch below).
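
The three overlap classes refer to the relative strand orientation of two overlapping neighboring genes. As a rough illustration (a hypothetical sketch, not the actual logic in calc_gene_overlap_on_GFF.jl), the classification can be expressed in a few lines of Julia:

# Hypothetical sketch of the three overlap classes; the real logic
# lives in calc_gene_overlap_on_GFF.jl and may differ in detail.
struct Gene
    start::Int
    stop::Int
    strand::Char  # '+' or '-'
end

# Two genes overlap if their coordinate ranges intersect.
overlaps(a::Gene, b::Gene) = a.start <= b.stop && b.start <= a.stop

# Classify an overlapping pair (a upstream of b) by strand orientation.
function overlap_class(a::Gene, b::Gene)
    overlaps(a, b) || return :none
    a.strand == b.strand && return :unidirectional     # -> ->
    return a.strand == '+' ? :convergent : :divergent  # -> <-  vs  <- ->
end

overlap_class(Gene(1, 100, '+'), Gene(90, 200, '-'))  # :convergent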

TODO

Module for Pre-Binned Metagenomes

A major goal is to adapt this toolkit to run on already binned and annotated metagenomic data. This will allow for the analysis of individual Metagenome-Assembled Genomes (MAGs) and a comparison of genome architecture across the recovered genomes in a community.

The planned steps to achieve this are:

  • Create an Input Handler for MAGs: The pipeline will be adapted to accept a directory containing multiple bins (MAGs). The script will iterate through each bin, using its existing annotation file (GFF format) as the primary input for the analysis.

  • Integrate MAG Taxonomic Classification: Incorporate a standard tool for assigning taxonomy to bins, such as GTDB-Tk. This step will take the contigs of each MAG and assign it a robust taxonomic lineage, which can then be mapped to an NCBI taxonomy ID for downstream use.

  • Enable Batch Processing of Bins: The run_pipeline.jl controller will be modified to loop through each MAG. For each MAG, it will execute the calc_gene_overlap_on_GFF.jl script on the corresponding GFF file. The final output will be an aggregated CSV containing the architecture metrics for all MAGs, with a new column to identify the source bin (a sketch of this loop follows this list).

  • Apply Environment Search to MAGs: Once a taxonomic ID is assigned to each MAG, the existing environment_from_omnicrobe_by_taxid.jl script can be run on these IDs to enrich the final output with potential environmental data for each recovered genome.
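
The batch step above might look roughly like the following sketch. The flag names for calc_gene_overlap_on_GFF.jl are assumed for illustration, not its actual interface; check julia calc_gene_overlap_on_GFF.jl --help.

# Hypothetical sketch of the planned per-bin loop. Paths and flags
# are illustrative only; verify against the script's --help output.
bins_dir = "path/to/bins"   # one annotated GFF per MAG
out_dir  = "mag_output"
mkpath(out_dir)

for gff in filter(f -> endswith(f, ".gff"), readdir(bins_dir; join=true))
    bin_id  = splitext(basename(gff))[1]
    out_csv = joinpath(out_dir, "$(bin_id)_metrics.csv")
    run(`julia calc_gene_overlap_on_GFF.jl --gff $gff --output $out_csv`)
end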

ML integration

The long-term dream for this toolkit is to serve as a core component in a larger machine learning pipeline capable of predicting the metabolic potential of microbial communities from environmental samples. The architectural metrics generated by these scripts would provide a unique set of structural features for the model.

Data Sources

All genomic data processed by this toolkit are sourced from the National Center for Biotechnology Information (NCBI); the environmental annotations come from the OmniMicrobe API described above.

Genome Assemblies and Metadata

The required genome assembly and metadata files can be downloaded using the NCBI datasets command-line tool. This process retrieves GFF3 files for genome annotation and the accompanying JSONL files (assembly_data_report.jsonl) that are processed by the scripts.

For example, to download all available reference genomes for Bacteria (taxon: 2) and Archaea (taxon: 2157), you can use commands like the following:

# Download dehydrated packages for Bacteria and Archaea
datasets download genome taxon 2 --assembly-source refseq --reference --include gff3,gtf,seq-report --dehydrated --filename bacteria_reference.zip
datasets download genome taxon 2157 --assembly-source refseq --reference --include gff3,gtf,seq-report --dehydrated --filename archaea_reference.zip

# Unzip the dehydrated packages, then rehydrate them to retrieve the data files
unzip bacteria_reference.zip -d bacteria_reference
unzip archaea_reference.zip -d archaea_reference
datasets rehydrate --directory bacteria_reference/
datasets rehydrate --directory archaea_reference/

Taxonomy Database

The taxonomic lineage information is derived from the NCBI Taxonomy database dump files (nodes.dmp and names.dmp). These files can be obtained from the following FTP address:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
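
The same archive is also served over HTTPS. As a convenience, it can be fetched and unpacked directly from Julia using only the standard library (extraction assumes a system tar is available on PATH):

# Fetch and unpack the NCBI taxdump with the Julia standard library.
using Downloads

url = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"
Downloads.download(url, "taxdump.tar.gz")
mkpath("taxdump")
run(`tar -xzf taxdump.tar.gz -C taxdump`)  # yields nodes.dmp, names.dmp, ...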


Installation

Prerequisites

  • Julia: Version 1.6 or higher.

Setup

  1. Clone the repository:

    git clone https://github.com/microstijn/toolkit_genome_architecture.git
    cd toolkit_genome_architecture/src
  2. Install Julia Dependencies: This project uses a local Julia environment defined by Project.toml and Manifest.toml. To install all required packages with their correct versions, start Julia in the src directory and run the following commands:

    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()

Usage

Recommended Method: The Master Pipeline

The easiest and recommended way to use the toolkit is via the run_pipeline.jl script. It handles all steps, file management, and error checking.

Run it from your terminal within the src directory, providing paths to your data.

Example Command:

julia run_pipeline.jl \
    --jsonl-files /path/to/your/data/bacteria_report.jsonl /path/to/your/data/archaea_report.jsonl \
    --gff-dir /path/to/your/gff_files/ \
    --taxdump-dir /path/to/your/taxdump/ \
    --output-dir ./pipeline_output

Example Output

A real-world example output was generated by running the architecture step on all reference genomes for Bacteria. The resulting file is in the /out directory of this repository and shows the expected format and data structure.

Arguments:

  • --jsonl-files: Path(s) to your NCBI assembly_data_report.jsonl files.
  • --gff-dir: Path to the directory containing all your .gff genome files.
  • --taxdump-dir: Path to the directory containing the NCBI nodes.dmp and names.dmp files.
  • --output-dir: The directory where all results will be saved. It will be created if it doesn't exist.

Running Individual Scripts

You can also run each script individually. Use the --help flag to see all available options for each script (e.g., julia taxid_from_ncbi_JSON.jl --help).


Scripts Overview

run_pipeline.jl

  • Description: The master controller script that orchestrates the entire workflow. This is the recommended entry point.
  • Inputs: Command-line paths to all raw data directories.
  • Outputs: A populated output directory containing all generated analysis files.

taxid_from_ncbi_JSON.jl

  • Description: Parses NCBI assembly_data_report.jsonl files to extract assembly metadata (accession, completeness, GC content) and organism info. It then uses an NCBI taxdump to map each entry to its full taxonomic lineage. A sketch of the JSONL pass follows below.
  • Inputs: NCBI JSONL file(s) and the nodes.dmp/names.dmp taxdump files.
  • Outputs: A CSV file (e.g., bacteria_report_TaxId.csv) containing the combined metadata and taxonomy.
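
The core of that JSONL pass looks roughly like the sketch below, assuming the JSON3 package and NCBI's documented field layout (accession, organism.taxId); the script's actual handling is more thorough.

# Minimal sketch of a JSONL pass using JSON3 (assumed dependency);
# field names follow NCBI's assembly_data_report.jsonl layout.
using JSON3

for line in eachline("assembly_data_report.jsonl")
    isempty(strip(line)) && continue
    rec = JSON3.read(line)
    println(rec.accession, '\t', rec.organism.taxId)
end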

environment_from_omnicrobe_by_taxid.jl

  • Description: Takes the taxonomy CSV from the previous step and queries the OmniMicrobe API to find known environmental niches for each taxon ID. It uses concurrent (asynchronous) requests to speed up the process; the pattern is sketched below.
  • Inputs: The ...TaxId.csv file.
  • Outputs: A new CSV file (e.g., ...TaxId_Omni.csv) with added columns for environments and obtId.
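
The concurrent request pattern, in miniature, assuming HTTP.jl and a placeholder endpoint (not the actual OmniMicrobe route):

# Sketch of concurrent lookups with HTTP.jl (assumed dependency).
# The URL below is a placeholder, not the real API route.
using HTTP

taxids = [562, 1280, 2336]  # example NCBI taxonomy IDs
tasks  = [@async HTTP.get("https://example.org/api/taxon/$t";
                          status_exception=false) for t in taxids]

for (t, resp) in zip(taxids, fetch.(tasks))
    println(t, " => HTTP ", resp.status)
end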

calc_gene_overlap_on_GFF.jl

  • Description: Analyzes a directory of GFF3 files to calculate detailed genome architecture metrics. It processes each contig to measure gene density, spacing, and overlaps (unidirectional, convergent, and divergent). A parsing sketch follows below.
  • Inputs: A directory containing GFF3 files.
  • Outputs: A single CSV file (genome_architecture_metrics.csv) summarizing the structural features of every contig in every input genome.
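
The essence of the per-file pass, sketched under the assumption of plain tab-separated GFF3 with gene features (the script itself handles more feature types and edge cases):

# Minimal sketch of pulling gene coordinates out of a GFF3 file;
# the real script computes density, spacing, and overlap metrics on top.
function gene_records(path)
    genes = NamedTuple[]
    for line in eachline(path)
        (isempty(line) || startswith(line, '#')) && continue
        f = split(line, '\t')
        length(f) >= 8 || continue
        f[3] == "gene" || continue
        push!(genes, (contig=f[1], start=parse(Int, f[4]),
                      stop=parse(Int, f[5]), strand=f[7][1]))
    end
    return genes
end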

Contributing

Contributions are welcome! If you would like to contribute to this project, please follow this simple workflow:

  • Open an Issue: Before starting work, please open an issue on GitHub. Describe the bug you want to fix or the feature you would like to add. This allows for discussion and ensures your work aligns with the project's goals.
  • Create a Branch: Once the issue is ready to be worked on, create a new branch from that issue. GitHub often provides a button to do this directly from the issue page. Naming the branch after the issue (e.g., issue-12-fix-api-bug) is good practice.
  • Submit a Pull Request: After committing your changes to your branch, open a pull request to merge your branch back into the main branch. Please link the pull request to the original issue so that it closes automatically upon merging.

Author & Revision

  • Original Author: SHP (2022-2025)
  • Last Revised: August 7, 2025
  • License: This project is licensed under the MIT License.
