A modular and open-source metagenomic analysis toolkit designed for long reads
somatem is a modular, Nextflow-based pipeline for long-read microbiome analysis that supports both 16S and metagenomic data from Oxford Nanopore Technologies and PacBio platforms. Built with ease of use and analytical rigor in mind, somatem enforces best practices for long-read sequencing data analysis.
The pipeline is divided into key subworkflows, allowing users to run the exact analyses they need:
- Pre-processing: Quality control and read filtering.
- Taxonomic Profiling: Taxonomic classification and relative abundance estimation.
- Assembly & MAG Analysis: De novo metagenomic assembly, binning, quality assessment, and functional annotation.
- Genome Dynamics: Structural variant and horizontal gene transfer detection for temporal samples.
Follow these steps to configure your environment and download the somatem pipeline. Note: This pipeline is designed for Linux/macOS environments and is not compatible with Windows.
1. Install conda/mamba/micromamba
We use micromamba (a faster, drop-in replacement for conda), but any of the listed package managers will work to install somatem. See the example below for micromamba installation.
"${SHELL}" <(curl -L https://micro.mamba.pm/install.sh)
2. Create and Activate the somatem Environment
Set up a dedicated base environment for somatem:
micromamba create -n somatem -c bioconda somatem # Again, use your package manager of choice
3. Test out somatem!
To process long-read 16S sequencing data with somatem, run:
# activate environment
micromamba activate somatem
# run the somatem 16S subworkflow
somatem 16S -i /path/to/16S_samplesheet.csv -o /path/to/desired_output
For help on making your input samplesheet, please see the example here.
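The 16S invocation above can be wrapped in a small pre-flight check so that a missing samplesheet or an inactive environment fails early. This is a minimal sketch: `run_16s` and the `SOMATEM_BIN` variable are hypothetical names introduced here for illustration and are not part of somatem. The only somatem call used is the `somatem 16S -i ... -o ...` form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of somatem): wraps the documented
# `somatem 16S -i <samplesheet> -o <outdir>` call with basic input checks.
run_16s() {
    local samplesheet="$1" outdir="$2"
    # SOMATEM_BIN is an illustrative override; set it to `echo` for a dry run
    # that prints the arguments instead of launching the pipeline.
    local bin="${SOMATEM_BIN:-somatem}"

    # Fail early if the samplesheet is missing or empty
    if [ ! -s "$samplesheet" ]; then
        echo "error: samplesheet '$samplesheet' is missing or empty" >&2
        return 1
    fi

    # Fail early if the executable is not on PATH (environment not activated)
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "error: '$bin' not found; activate the somatem environment first" >&2
        return 1
    fi

    "$bin" 16S -i "$samplesheet" -o "$outdir"
}
```

After `micromamba activate somatem`, this would be called as `run_16s /path/to/16S_samplesheet.csv /path/to/desired_output`.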
Information on how to run the various subworkflows in somatem can be found in our wiki pages!
Several tools in this pipeline rely on large reference databases. Proper configuration is essential to manage storage effectively. The first time you run a subworkflow that requires a database, the database is installed for you and saved in your configured database directory for future runs.
- Storage Requirements: Some databases (e.g., Bakta, CheckM2, SingleM) require up to 100 GB of free space. Ensure your target drive has adequate capacity.
- Directory Setup: Decide whether you want a local database directory within the `somatem` folder, or a shared, centralized directory (highly recommended for HPC cluster environments).
- Configuration: Update the `/path/to/env/somatem/share/somatem-{version}/nextflow.config` file to point the pipeline to your chosen directory. Locate and modify the following variable:
  db_base_dir = "/home/dbs" // Change this to "./assets/databases" for local storage
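The configuration step can also be scripted rather than edited by hand. The sketch below assumes that `db_base_dir` is assigned on a single line in the form `db_base_dir = "..."`, as in the snippet above, and that the new path contains no `|`, `&`, or backslash characters. The function name `set_db_base_dir` is hypothetical and is not provided by somatem.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of somatem): point db_base_dir in a
# nextflow.config file at a new directory, keeping a .bak copy of the original.
# Assumes a single-line assignment of the form: db_base_dir = "..."
set_db_base_dir() {
    local config="$1" new_dir="$2"

    [ -f "$config" ] || { echo "error: $config not found" >&2; return 1; }

    # Refuse to continue if the expected assignment is absent
    if ! grep -Eq '^[[:space:]]*db_base_dir[[:space:]]*=' "$config"; then
        echo "error: no db_base_dir assignment in $config" >&2
        return 1
    fi

    cp "$config" "$config.bak" || return 1

    # '|' is the sed delimiter, so slashes in the path need no escaping.
    # Only the quoted value is replaced; any trailing comment is kept.
    sed -E "s|^([[:space:]]*db_base_dir[[:space:]]*=[[:space:]]*)\"[^\"]*\"|\1\"$new_dir\"|" \
        "$config.bak" > "$config"
}
```

For a shared cluster directory this might be called as `set_db_base_dir /path/to/env/somatem/share/somatem-{version}/nextflow.config /shared/somatem_dbs`, where `/shared/somatem_dbs` is a placeholder path.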
Performance & Resource Notes:
- Automated Downloads: The pipeline automatically downloads most required databases (<3 GB). However, the Bakta database used in the `assembly_mags` subworkflow is approximately 60 GB and may require additional time.
- Compute Time: The `assembly_mags` step is computationally intensive. As a benchmark, processing the two example files (`assets/mag_big_samplesheet.csv`) takes roughly 6 hours on an HPC cluster equipped with 128 CPUs, 128 GB of memory, and 2 TB of free storage.
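Given these storage figures, it may be worth confirming free space on the database drive before the first run triggers a large download. The following is a minimal sketch using the standard `df` and `awk` utilities. The function name `has_free_gb` is hypothetical, and the check assumes the filesystem name reported by `df` contains no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of somatem): succeed only if the filesystem
# holding directory $1 has at least $2 GB available.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb

    # df -P gives POSIX-format output and -k reports 1024-byte blocks;
    # field 4 on the second line is the available space.
    avail_kb="$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }')"
    [ -n "$avail_kb" ] || return 1

    # Convert the requirement from GB to KB and compare
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}
```

For example, `has_free_gb /home/dbs 100 || echo "less than 100 GB free on the database drive" >&2` checks the default `db_base_dir` location against the 100 GB upper bound mentioned above.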
somatem integrates state-of-the-art bioinformatics tools, neatly organized into the following subworkflows:
Prepares raw data for downstream analysis through rigorous quality control and filtering.
- NanoPlot: QC plotting suite for initial and final assessment of long-read sequencing data.
- Hostile: Depletes host contamination by filtering reads that align to a host reference genome.
- Chopper: Filters nanopore reads by quality and length, removing sub-par data.
Delivers rapid and accurate taxonomic classification for metagenomic datasets.
- Emu: Taxonomic classification and abundance estimation optimized for long-read 16S rRNA.
- Lemur: Rapid, multi-marker gene taxonomic profiling for long-read metagenomes.
- MAGnet: Refines taxonomic profiles via reference genome mapping to correct false positives.
- SingleM: Profiles microbial communities using universal marker genes. Includes the `pipe` module for reads/assemblies and the `appraise` module to evaluate binning completeness.
Handles de novo assembly, genome binning, and functional annotation.
- Flye: Repeat-graph-based de novo assembler optimized for PacBio and Nanopore reads.
- Minimap2 & SAMtools: Pairwise alignment processing, read mapping, and coverage calculation.
- SemiBin2: Metagenomic binning leveraging semi-supervised deep learning.
- CheckM2: Machine-learning-driven prediction of genome bin quality and completeness.
- Bakta: Comprehensive and rapid annotation of bacterial genomes and plasmids.
Investigates structural variations over time.
- Rhea: Detects structural variants and horizontal gene transfer events in temporally evolving microbial samples.
- Bandage: Interactive visualization tool for assembly graphs, highly useful for reviewing Rhea outputs.
Screens for targets of clinical and functional interest.
- SeqScreen: Functional screening of pathogenic sequences and antimicrobial resistance (AMR) genes.
Aggregates and visualizes complex datasets.
- Taxburst: Interactive, web-based visualization of taxonomic profiles.
- MultiQC: Aggregates logs and results across multiple tools into a single, user-friendly HTML report.
For deeper dives into pipeline architecture and tool notes, please see the docs/ directory.
If somatem facilitates your research, please cite the underlying tools that made your analysis possible. A comprehensive list of citation links is available in docs/somatem-docs/tool_links.csv.
Contributions from the community are welcome! Please review our development documentation for guidelines on how to submit pull requests.
This project is licensed under the GNU General Public License v3.0 (GPLv3). See the LICENSE file for full details.