treangenlab/somatem


somatem

A modular, open-source metagenomic analysis toolkit designed for long reads

somatem is a modular, Nextflow-based pipeline for long-read microbiome analysis, covering both 16S and metagenomic data. It supports reads from both Oxford Nanopore Technologies and PacBio platforms. Built with ease of use and analytical rigor in mind, somatem enforces best practices for long-read sequencing data analysis.

The pipeline is divided into key subworkflows, allowing users to run the exact analyses they need:

  • Pre-processing: Quality control and read filtering.
  • Taxonomic Profiling: Taxonomic classification and relative abundance estimation.
  • Assembly & MAG Analysis: De novo metagenomic assembly, binning, quality assessment, and functional annotation.
  • Genome Dynamics: Structural variant and horizontal gene transfer detection for temporal samples.

Initial Setup

Follow these steps to configure your environment and download the somatem pipeline. Note: This pipeline is designed for Linux/macOS environments and is not compatible with Windows.

1. Install conda/mamba/micromamba

We use micromamba (a faster, drop-in replacement for conda), but any of the listed package managers can install somatem. The command below installs micromamba.

"${SHELL}" <(curl -L https://micro.mamba.pm/install.sh)

2. Create and Activate the somatem Environment

Set up a dedicated environment for somatem:

micromamba create -n somatem -c bioconda somatem  # or substitute your preferred package manager
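
Because any of the three package managers works, a setup script can pick whichever one is installed. The following is a minimal sketch, not part of somatem: `first_available` is a hypothetical helper name, and the preference order (micromamba, then mamba, then conda) is an assumption.

```shell
#!/usr/bin/env bash
# Print the first command from the argument list that exists on PATH.
# NOTE: first_available is a hypothetical helper, not shipped with somatem.
first_available() {
  local candidate
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "none of the requested commands were found: $*" >&2
  return 1
}

# Example usage (commented out so sourcing this file has no side effects):
# pkg_mgr="$(first_available micromamba mamba conda)" || exit 1
# "$pkg_mgr" create -n somatem -c bioconda somatem
```

The environment creation command is the same one shown above; only the executable name changes.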

3. Test out somatem!

To process long-read 16S sequencing data with somatem, activate the environment and run the 16S subworkflow:

# activate environment
micromamba activate somatem

# run the somatem 16S subworkflow
somatem 16S -i /path/to/16S_samplesheet.csv -o /path/to/desired_output

For help on making your input samplesheet, please see the example here
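
When launching runs from a script, it can help to check the inputs before starting the pipeline. The sketch below wraps the command shown above. `run_16s` is a hypothetical function name, and creating the output directory up front is an assumption made for convenience, not a documented somatem requirement.

```shell
#!/usr/bin/env bash
# Validate the samplesheet and output paths, then call the 16S subworkflow.
# NOTE: run_16s is a hypothetical wrapper, not part of somatem.
run_16s() {
  if [ "$#" -ne 2 ]; then
    echo "usage: run_16s <samplesheet.csv> <output_dir>" >&2
    return 2
  fi
  local samplesheet="$1" outdir="$2"
  if [ ! -r "$samplesheet" ]; then
    echo "samplesheet not found or unreadable: $samplesheet" >&2
    return 1
  fi
  # Assumption: pre-creating the output directory is harmless.
  mkdir -p "$outdir" || return 1
  somatem 16S -i "$samplesheet" -o "$outdir"
}

# Example usage (run inside the activated somatem environment):
# run_16s /path/to/16S_samplesheet.csv /path/to/desired_output
```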


Usage

Information on how to run the various subworkflows in somatem can be found in our wiki pages!

Database Configuration

Several tools in this pipeline rely on large reference databases, so proper configuration is essential to manage storage effectively. The first time you run a subworkflow that requires a database, the database is installed for you and saved at the configured path for future runs.

  • Storage Requirements: Some databases (e.g., Bakta, CheckM2, SingleM) require up to 100 GB of free space. Ensure your target drive has adequate capacity.
  • Directory Setup: Decide whether you want a local database directory within the somatem folder, or a shared, centralized directory (highly recommended for HPC cluster environments).
  • Configuration: Update the /path/to/env/somatem/share/somatem-{version}/nextflow.config file to point the pipeline to your chosen directory. Locate and modify the following variable:
    db_base_dir = "/home/dbs" // Change this to "./assets/databases" for local storage
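
The configuration edit can also be scripted, which is convenient on shared clusters. The sketch below assumes the config contains a line beginning with `db_base_dir =`, as in the snippet above. `set_db_base_dir` is a hypothetical helper name, and the config path must be supplied by the caller, since it depends on your environment location and installed version.

```shell
#!/usr/bin/env bash
# Rewrite the db_base_dir assignment in a Nextflow config file.
# NOTE: set_db_base_dir is a hypothetical helper, not part of somatem.
# A backup of the original file is written next to it with a .bak suffix.
set_db_base_dir() {
  local config="$1" new_dir="$2"
  # Refuse to continue if the expected assignment is not present.
  if ! grep -Eq '^[[:space:]]*db_base_dir[[:space:]]*=' "$config"; then
    echo "no db_base_dir assignment found in $config" >&2
    return 1
  fi
  # Replace everything after the '=' (including any trailing comment).
  # Caveat: new_dir must not contain '|' or '&', which are special to sed here.
  sed -i.bak -E \
    "s|^([[:space:]]*db_base_dir[[:space:]]*=[[:space:]]*).*|\1\"${new_dir}\"|" \
    "$config"
}

# Example usage (the config path is a placeholder you must fill in):
# set_db_base_dir "$CONFIG_PATH" "./assets/databases"
```

The `-i.bak` form is used because it is accepted by both GNU sed (Linux) and BSD sed (macOS).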

Performance & Resource Notes:

  • Automated Downloads: The pipeline automatically downloads most required databases (<3 GB). However, the Bakta database used in the assembly_mags subworkflow is approximately 60 GB, so its first download may take considerably longer.
  • Compute Time: The assembly_mags step is computationally intensive. As a benchmark, processing the two example files (assets/mag_big_samplesheet.csv) takes roughly 6 hours on an HPC cluster equipped with 128 CPUs, 128 GB of memory, and 2 TB of free storage.
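
Given these storage figures, it may be worth confirming free space on the database drive before the first run. The following is a minimal sketch: `has_free_space` is a hypothetical helper name, and the 100 GB threshold in the usage example simply reuses the upper bound quoted above.

```shell
#!/usr/bin/env bash
# Succeed if the filesystem holding DIR has at least REQUIRED_GB gigabytes free.
# NOTE: has_free_space is a hypothetical helper, not part of somatem.
has_free_space() {
  local dir="$1" required_gb="$2" avail_kb
  # df -P gives stable POSIX columns; -k reports sizes in 1024-byte blocks.
  # Column 4 of the second line is the available space.
  avail_kb="$(df -Pk "$dir" | awk 'NR==2 {print $4}')"
  [ -n "$avail_kb" ] || return 1
  [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Example usage, with /home/dbs as the configured db_base_dir:
# has_free_space /home/dbs 100 || echo "less than 100 GB free for databases" >&2
```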

Pipeline Tools

somatem integrates state-of-the-art bioinformatics tools, neatly organized into the following subworkflows:

Pre-processing

Prepares raw data for downstream analysis through rigorous quality control and filtering.

  • NanoPlot: QC plotting suite for initial and final assessment of long-read sequencing data.
  • Hostile: Depletes host contamination by filtering reads that align to a host reference genome.
  • Chopper: Filters nanopore reads by quality and length, removing sub-par data.

Taxonomic Profiling

Delivers rapid and accurate taxonomic classification for metagenomic datasets.

  • Emu: Taxonomic classification and abundance estimation optimized for long-read 16S rRNA.
  • Lemur: Rapid, multi-marker gene taxonomic profiling for long-read metagenomes.
  • MAGnet: Refines taxonomic profiles via reference genome mapping to correct false positives.
  • SingleM: Profiles microbial communities using universal marker genes. Includes the pipe module for reads/assemblies and the appraise module to evaluate binning completeness.

Assembly & MAG Analysis

Handles de novo assembly, genome binning, and functional annotation.

  • Flye: Repeat-graph-based de novo assembler optimized for PacBio and Nanopore reads.
  • Minimap2 & SAMtools: Pairwise alignment processing, read mapping, and coverage calculation.
  • SemiBin2: Metagenomic binning leveraging semi-supervised deep learning.
  • CheckM2: Machine-learning-driven prediction of genome bin quality and completeness.
  • Bakta: Comprehensive and rapid annotation of bacterial genomes and plasmids.

Genome Dynamics

Investigates structural variations over time.

  • Rhea: Detects structural variants and horizontal gene transfer events in temporally evolving microbial samples.
  • Bandage: Interactive visualization tool for assembly graphs, highly useful for reviewing Rhea outputs.

Functional Annotation

Screens for targets of clinical and functional interest.

  • SeqScreen: Functional screening of pathogenic sequences and antimicrobial resistance (AMR) genes.

Reporting & Visualization

Aggregates and visualizes complex datasets.

  • Taxburst: Interactive, web-based visualization of taxonomic profiles.
  • MultiQC: Aggregates logs and results across multiple tools into a single, user-friendly HTML report.

Additional Documentation

For deeper dives into pipeline architecture and tool notes, please see the docs/ directory.

Citation

If somatem facilitates your research, please cite the underlying tools that made your analysis possible. A comprehensive list of citation links is available in docs/somatem-docs/tool_links.csv.

Contributing & License

Contributions from the community are welcome! Please review our development documentation for guidelines on how to submit pull requests.

This project is licensed under the GNU General Public License v3.0 (GPLv3). See the LICENSE file for full details.

About

LLM accessible long-read metagenomics pipeline with best practices
