RNA-seq Workflow (nl-rna-seq_wf)

A comprehensive Nextflow-based RNA-seq analysis pipeline for processing and analyzing RNA sequencing data, featuring quality control, alignment, quantification, and advanced analysis capabilities including IRFinder, RNA-SeQC, and Kallisto quantification.

Current Version: v1.1.0 - Enhanced stability and coverage analysis capabilities

🚀 Features

Quality Control: Fastp preprocessing with rRNA and globin RNA contamination detection
Alignment: STAR alignment with multiple read length optimizations (69-151bp)
Quantification:
- STAR gene-level counts
- Kallisto transcript-level quantification
- Subread featureCounts for gene expression
Quality Metrics: RNA-SeQC comprehensive quality assessment
Advanced Analysis:
- IRFinder for intron retention detection
- Splice junction analysis with BAM2SJ
- Enhanced coverage analysis with dual Mosdepth processes (MOSDEPTH + MOSDEPTH_BED)
- CDS region analysis and filtering with BEDTOOLS
- Advanced variant calling with DEEPVARIANT
Output Formats: CRAM compression for efficient storage
Containerized: Docker-based execution for reproducibility

🏗️ Architecture

graph TB
    subgraph Input
        A1[FASTQ Files] --> B1[Fastp QC]
        A2[Reference Data] --> B1
    end

    subgraph Contamination Check
        B1 --> C1[BWA rRNA Check]
        B1 --> C2[BWA Globin Check]
        C1 --> D1[Samtools Flagstat]
        C2 --> D2[Samtools Flagstat]
    end

    subgraph Alignment & Quantification
        B1 --> E1[STAR Alignment]
        B1 --> E2[Kallisto Quant]
        E1 --> F1[Samtools Index]
        F1 --> F2[Sambamba MarkDup]
        F2 --> G1[FeatureCounts]
        F2 --> G2[RNA-SeQC]
        F2 --> G3[IRFinder]
        F2 --> G4[Mosdepth Coverage Analysis]
        F2 --> G4b[Enhanced Mosdepth Coverage]
        F2 --> G5[BAM2SJ]
        F2 --> G6[BEDTOOLS CDS Analysis]
        G6 --> G7[DEEPVARIANT Variant Calling]
    end

    subgraph Output
        G1 --> H1[Gene Counts]
        G2 --> H2[QC Reports]
        G3 --> H3[IRFinder Results]
        G4 --> H4[Coverage Files]
        G4b --> H4b[Enhanced Coverage Analysis]
        G5 --> H5[Splice Junctions]
        G6 --> H6[CDS Region Files]
        G7 --> H7[Variant Call Files]
        F2 --> H8[CRAM Files]
    end

    style Input fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style Contamination Check fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style Alignment fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style Output fill:#fff3e0,stroke:#e65100,stroke-width:2px

🔄 Workflow Process Flow

The pipeline follows a logical sequence where each step builds upon the previous:

Quality Control & Alignment: FASTQ files are processed and aligned to generate BAM files
Coverage Analysis: Mosdepth processes analyze coverage across the genome and coding regions
CDS Region Identification: BEDTOOLS processes the coverage data to identify coding sequences with adequate coverage
Variant Calling: DEEPVARIANT uses the BEDTOOLS-generated CDS bed file to focus variant calling on well-covered coding regions

Key Dependency: DEEPVARIANT requires the CDS bed file from BEDTOOLS to define the regions for variant calling, ensuring efficient and accurate variant detection in coding sequences.

📋 Prerequisites

System Requirements

Operating System: Linux-based (Ubuntu 18.04+ recommended)
CPU: Minimum 48 cores for optimal performance
Memory: Minimum 192GB RAM for STAR alignment
Storage: Sufficient space for RNA-seq data processing
Docker: Required for containerized execution

Software Requirements

Docker

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker

# Verify installation
docker --version

Nextflow

# Install Nextflow
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

# Verify installation
nextflow -version

🛠️ Installation

Clone the repository:

git clone https://github.com/uclanelsonlab/nl-rna-seq_wf.git
cd nl-rna-seq_wf/

Set up Docker images:

# Build or pull required Docker images
# Update nextflow.config with your Docker image paths

Verify installation:

nextflow run main.nf --help

Required Parameters

Parameter	Description	Example
`--fastq_r1`	Forward FASTQ file path	`/path/to/sample_R1.fastq.gz`
`--fastq_r2`	Reverse FASTQ file path	`/path/to/sample_R2.fastq.gz`
`--sample_name`	Sample identifier	`SAMPLE001`
`--rrna_reference`	rRNA reference genome	`/path/to/rrna_ref.fa`
`--globinrna_reference`	Globin RNA reference	`/path/to/globin_ref.fa`
`--gencode_gtf_path`	Gencode GTF annotation	`/path/to/gencode.v43.gtf`
`--human_fasta`	Human reference genome	`/path/to/GRCh38.fa`
`--star_index_151`	STAR index for 151bp reads	`/path/to/star_index_151`
`--kallisto_index`	Kallisto transcript index	`/path/to/kallisto.idx`

📊 Output Structure

The pipeline generates organized outputs in the following structure:

results/
├── QC/                          # Quality control reports
│   ├── *.rrna.flagstat.txt     # rRNA contamination stats
│   ├── *.globinrna.flagstat.txt # Globin contamination stats
│   ├── *.html                  # Fastp QC reports
│   └── *.json                  # Fastp QC metrics
├── ALIGNMENT/                   # Alignment files
│   ├── *.ReadsPerGene.out.tab.gz # STAR gene counts
│   ├── *.Log.final.out         # STAR alignment stats
│   ├── *.SJ.out.tab.gz        # Splice junctions
│   ├── *.bam2SJ.out.tab.gz    # Reconstructed junctions
│   └── *.cram                  # Compressed alignment files
├── COUNTS/                      # Gene expression counts
│   ├── *.gene_id.exon.ct      # FeatureCounts output
│   └── *.gene_id.exon.ct.summary # Counts summary
├── QUANT/                       # Kallisto quantification
│   ├── *abundance.h5           # Transcript abundances h5
│   ├── *abundance.tsv           # Transcript abundances
│   └── *run_info.json           # Quantification info
├── IR/                          # IRFinder results
│   ├── *.txt.gz                # Intron retention data
│   └── *.bw                    # Coverage bigWig files
├── MOSDEPTH/                    # Coverage analysis
│   ├── *.mosdepth.global.dist.txt # Global coverage
│   ├── *.regions.bed.gz        # Region-specific coverage
│   └── *.per-base.bed.gz       # Per-base coverage analysis
├── BEDTOOLS/                    # CDS analysis and filtering
│   └── *_cds_3x.bed            # Coding sequence regions with coverage filtering
├── DEEPVARIANT/                 # Variant calling
│   ├── *.vcf.gz                # Variant call format files
│   ├── *.g.vcf.gz              # Genomic variant call format
│   └── *.visual_report.html    # Variant calling reports
└── BAM2SJ/                      # Splice junction analysis
    ├── *_rare_junctions_all.tsv # All detected junctions
    └── *_rare_junctions_filtered.xlsx # Filtered junctions

🔧 Configuration

Docker Images

Update nextflow.config with your Docker image paths:

params {
    bwa_docker = "your-registry/bwa:latest"
    star_docker = "your-registry/star:latest"
    fastp_docker = "your-registry/fastp:latest"
    // ... other images
}

Resource Allocation

Modify process resources in nextflow.config:

process {
    withLabel: 'star_alignreads' {
        memory = 192.GB
        cpus = 48
    }
}

🧬 Analysis Components

1. Quality Control (Fastp)

Adapter trimming
Quality filtering
Length filtering
Duplicate detection

2. Contamination Detection

rRNA contamination using BWA
Globin RNA contamination analysis
Statistical reporting

3. Alignment (STAR)

Multi-read length optimization
Splice-aware alignment
Junction detection
Gene-level quantification

4. Quantification

STAR: Gene-level read counts
Kallisto: Transcript-level abundance
FeatureCounts: Exon-level counting

5. Quality Assessment (RNA-SeQC)

Coverage metrics
GC bias analysis
Mapping quality
Strand specificity

6. Advanced Analysis

IRFinder: Intron retention detection
BAM2SJ: Splice junction reconstruction
Mosdepth: Enhanced coverage analysis with dual processes for comprehensive coverage assessment
BEDTOOLS: CDS region identification and coverage-based filtering for variant calling
DEEPVARIANT: Advanced variant calling with custom model support and comprehensive reporting

Workflow Dependencies: DEEPVARIANT requires BEDTOOLS to run first, as it uses the generated CDS bed file to define the regions for variant calling. This ensures variant calling is focused on coding sequences with adequate coverage.

🐛 Troubleshooting

Common Issues

Memory Errors:
- Increase memory allocation in nextflow.config
- Use -resume to restart from failed step
Docker Issues:
- Verify Docker daemon is running
- Check image availability and permissions
Reference File Errors:
- Ensure all reference files are accessible
- Verify file formats and integrity
Channel Input Errors (Fixed in v1.1.0):
- All input channel mismatches have been resolved
- Reference file handling has been optimized
- Duplicate output channel issues have been fixed

📈 Performance

Resource Recommendations

STAR Alignment: 48 cores, 192GB RAM
IRFinder: 16 cores, 32GB RAM
Other processes: 8-16 cores, 16-32GB RAM

Expected Runtime

Small dataset (<50M reads): 2-4 hours
Medium dataset (50-100M reads): 4-8 hours
Large dataset (>100M reads): 8-16 hours

🤝 Contributing

We welcome contributions! Please:

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Contact

George Carvalho - [email protected]
UCLA Nelson Lab - https://nelsonlab.ucla.edu/

�� Acknowledgments

Nextflow community for the workflow engine
STAR developers for RNA-seq alignment
IRFinder team for intron retention analysis
All contributors to the bioinformatics tools used

📚 Additional Resources

FRASER Analysis

For aberrant splicing analysis, see the FRASER configuration in drop_files/config.yaml.

OUTRIDER Analysis

For aberrant expression analysis, use the R script in script/run_outrider.R.

Batch Processing

Use the provided sample annotation files for processing multiple samples.

🔄 Recent Updates (v1.1.0)

Stability Improvements

Fixed Channel Errors: Resolved all input channel count mismatches across samtools processes
Eliminated Duplicates: Fixed duplicate output channel naming issues
Optimized References: Streamlined reference file handling for better compatibility

Enhanced Coverage Analysis

Dual Mosdepth Processes: Added comprehensive coverage analysis with both MOSDEPTH and MOSDEPTH_BED
Improved Integration: Better workflow integration between coverage analysis and downstream processing
Performance Optimization: Enhanced channel efficiency and process compatibility

Advanced Variant Calling & Analysis

Enhanced DEEPVARIANT Integration: Improved variant calling pipeline with optimized input handling
CDS Region Analysis: Advanced BEDTOOLS processing for coding sequence identification and coverage filtering
Comprehensive Variant Reports: Enhanced VCF outputs with visual reporting and gVCF support
Sequential Processing: BEDTOOLS runs first to generate CDS bed files, which DEEPVARIANT then uses for targeted variant calling
Coverage-Based Filtering: Variant calling is restricted to coding regions with adequate coverage, improving accuracy and efficiency

What's New

Pipeline now runs without channel-related errors
Enhanced coverage analysis capabilities
Improved stability and reliability
Better resource utilization and workflow efficiency
Advanced variant calling with DEEPVARIANT integration
Enhanced CDS region analysis and filtering capabilities
Comprehensive variant reporting and visualization

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
docker		docker
drop_files		drop_files
modules		modules
script		script
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
PARAMETERS.md		PARAMETERS.md
QUICKSTART.md		QUICKSTART.md
README.md		README.md
environment.yml		environment.yml
main.nf		main.nf
nextflow.config		nextflow.config
parameters.json		parameters.json
pipeline.dot		pipeline.dot
pipeline_dag.dot		pipeline_dag.dot
pipeline_dag.svg		pipeline_dag.svg
run_parameters.json		run_parameters.json

License

uclanelsonlab/nl-rna-seq_wf

Folders and files

Latest commit

History

Repository files navigation

RNA-seq Workflow (nl-rna-seq_wf)

🚀 Features

🏗️ Architecture

🔄 Workflow Process Flow

📋 Prerequisites

System Requirements

Software Requirements

Docker

Nextflow

🛠️ Installation

Required Parameters

📊 Output Structure

🔧 Configuration

Docker Images

Resource Allocation

🧬 Analysis Components

1. Quality Control (Fastp)

2. Contamination Detection

3. Alignment (STAR)

4. Quantification

5. Quality Assessment (RNA-SeQC)

6. Advanced Analysis

🐛 Troubleshooting

Common Issues

📈 Performance

Resource Recommendations

Expected Runtime

🤝 Contributing

📄 License

👥 Contact

�� Acknowledgments

📚 Additional Resources

FRASER Analysis

OUTRIDER Analysis

Batch Processing

🔄 Recent Updates (v1.1.0)

Stability Improvements

Enhanced Coverage Analysis

Advanced Variant Calling & Analysis

What's New

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages