sc5Pseq

Scripts

Some useful scripts for the sc5Pseq project. Not all of them are fully tested. 😀

1. stat_base_content.py

Description

This script calculates and plots per-position base frequencies from a FASTQ file. It can generate an interactive HTML plot, and the analysis can be restricted to specific base positions of interest.

Requirements

The script runs under both Python 2 and Python 3, but requires a few third-party libraries (argparse, gzip, and os are part of the standard library and need no installation):

pip3 install pandas matplotlib plotly biopython

Usage:

usage: stat_base_content.py [-h] [-i INPUT_FILE] [-a TABLE_INPUT] -o OUTPUT_PREFIX
                            [-n NUM_BASES] [-s POS [POS ...]]
                            [-sp SPECIFIC_OUTPUT_PREFIX] [--interactive]
                            [--add_percentage]

Calculate and plot base frequencies in a FASTQ file or from an existing CSV table.

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        Path to the FASTQ file, support .gz
  -a TABLE_INPUT, --table_input TABLE_INPUT
                        Path to an existing base frequencies table to use as input
                        instead of a FASTQ file.
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Output file name prefix.
  -n NUM_BASES, --num_bases NUM_BASES
                        Number of bases to analyze (default: 85).
  -s POS [POS ...], --specific_positions POS [POS ...]
                        Specific base positions or range to analyze (e.g., -s 10 15)
  -sp SPECIFIC_OUTPUT_PREFIX, --specific_output_prefix SPECIFIC_OUTPUT_PREFIX
                        Output file name prefix for specific positions plot.
  --interactive         Generate an interactive HTML plot.
  --add_percentage      Add percentage labels to the static image.

Example

    module load python #(python/3.12.1)  
    - To analyze a FASTQ file and generate output files:  
        python stat_base_content.py -i example.fastq -o output_prefix  
    - To analyze specific base positions, generate an interactive HTML plot, and add base percentages to the PNG:  
        python stat_base_content.py -i example.fastq -o output_prefix -s 8 16 --interactive --add_percentage -sp special_output_prefix  
    - To generate plots directly from an existing CSV table:  
        python stat_base_content.py --table_input existing_frequencies.csv -o output_prefix --interactive  
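The core of the per-position counting can be sketched as follows (a minimal illustration, not the script itself; the function name is made up here, and the real script also handles .gz input and plotting):

```python
from collections import Counter

def base_frequencies(reads, num_bases=85):
    """Count base occurrences at each position across reads.

    Returns a list of Counters, one per position, mapping base -> count.
    """
    counts = [Counter() for _ in range(num_bases)]
    for seq in reads:
        for i, base in enumerate(seq[:num_bases]):
            counts[i][base] += 1
    return counts

# Example: three short reads
freqs = base_frequencies(["ACGT", "ACGA", "ACCT"], num_bases=4)
# Position 0 is all 'A'; position 2 splits between G and C
```

Dividing each position's counts by the number of reads gives the frequencies that are then tabulated and plotted.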

2. star_format.py

Description

Parses the .final.out file and extracts the relevant information from the STAR alignment result.

Usage:

python star_format.py DirectoryPath OutputName
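STAR's Log.final.out is a simple "Field name | value" format, so the parsing step can be sketched like this (illustrative only; the real star_format.py scans a directory and writes a combined table):

```python
def parse_star_final_out(lines):
    """Parse STAR Log.final.out lines into a dict of field -> value.

    STAR writes lines of the form 'Field name |\tvalue'; lines without
    a '|' are section headers and are skipped.
    """
    stats = {}
    for line in lines:
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats

# Example on two lines of a typical Log.final.out
example = [
    "                          Number of input reads |\t1000000",
    "                        Uniquely mapped reads % |\t85.30%",
]
parsed = parse_star_final_out(example)
```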

3. star_plot.py

Description

Plots the formatted mapping statistics from STAR. It can generate bar plots and interactive HTML to easily compare multiple samples. You can also specify which features to plot.

The input is the formatted file produced by star_format.py.

Usage

usage: star_plot.py [-h] -i INPUT [-f FEATURES [FEATURES ...]]
                    [-c {blue,orange,green,red,purple}] -o OUTPUT
                    [--interactive] [--sort-samples]

Plot features from formatted STAR log data.

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input CSV file containing formatted STAR
                        log data.
  -f FEATURES [FEATURES ...], --features FEATURES [FEATURES ...]
                        List of features to compare (default is
                        'umr_pct'). Available features: input,
                        umr_num, umr_pct, mismatch_rate, avg_len,
                        multi_num, multi_pct, unmapped_short_num,
                        unmapped_short_pct.
  -c {blue,orange,green,red,purple}, --color {blue,orange,green,red,purple}
                        Specify the color for the plots (default is
                        'blue'). Available colors: blue, orange,
                        green, red, purple.
  -o OUTPUT, --output OUTPUT
                        Output file name prefix for the plots.
  --interactive         Generate interactive plots using Plotly.
  --sort-samples        Sort sample names in alphanumeric order.

Example

    module load python #python3.12.7
    - Plot the umr_pct and umr_num features with sorted sample names, in red:
    python star_plot.py -i final.out.all -f "umr_pct" "umr_num" -o cdiff --interactive --sort-samples -c red

4. barcode_filter.py

Description

This tool analyzes and filters a set of barcodes. It lets you control the minimum Hamming distance and GC content, and exclude any sequence found by sliding-window matching. It generates the Hamming distance matrix and detailed filtering information, and visualizes the matrix as a heatmap.

The inputs are the barcode file (FASTA) and an optional FASTA of sequences to exclude.
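The two main per-barcode checks can be sketched as follows (a minimal sketch under the default thresholds; function names are illustrative, not the script's internals):

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def gc_content(seq):
    """GC percentage of a sequence."""
    return 100.0 * sum(seq.count(b) for b in "GC") / len(seq)

def passes_filters(candidate, kept, min_hamming=3, gc_min=40.0, gc_max=60.0):
    """True if candidate satisfies the GC bounds and is at least
    min_hamming away from every already-kept barcode."""
    if not (gc_min <= gc_content(candidate) <= gc_max):
        return False
    return all(hamming(candidate, k) >= min_hamming for k in kept)
```

A greedy pass over the input FASTA that keeps each barcode only if `passes_filters` holds against the set kept so far yields a mutually distant barcode set.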

Usage

usage: barcode_filter.py [-h] -i INPUT [-e EXCLUDE] [-ht HAMMING] [-gmin GC_MIN] [-gmax GC_MAX] -of
                         OUTPUT_FILTERED -oi OUTPUT_INFO -hmf HAMMING_MATRIX_FILE -hpf HAMMING_PLOT_FILE
                         [-d]

Filter barcode sequences based on Hamming distance, GC content, and exclusions.

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input fasta file containing barcodes.
  -e EXCLUDE, --exclude EXCLUDE
                        Fasta file containing sequences to exclude.
  -ht HAMMING, --hamming HAMMING
                        Minimum Hamming distance threshold (default: 3).
  -gmin GC_MIN, --gc_min GC_MIN
                        Minimum GC content percentage (default: 40.0).
  -gmax GC_MAX, --gc_max GC_MAX
                        Maximum GC content percentage (default: 60.0).
  -of OUTPUT_FILTERED, --output_filtered OUTPUT_FILTERED
                        Output fasta file for filtered barcodes.
  -oi OUTPUT_INFO, --output_info OUTPUT_INFO
                        Output text file for detailed filtering information.
  -hmf HAMMING_MATRIX_FILE, --hamming_matrix_file HAMMING_MATRIX_FILE
                        Output text file for Hamming distance matrix.
  -hpf HAMMING_PLOT_FILE, --hamming_plot_file HAMMING_PLOT_FILE
                        Output image file for Hamming distance heatmap.
  -d, --debug           Enable debug logging.

Example

barcode_filter.py -i barcode.fa -e exclude_file.fa -of barcode.filter.fa -oi barcode.filterinfo.txt -hmf barcode.filter.matrix.txt -hpf barcode.filter.matrix.plot

5. parse_umi_log.py

Description

This script parses umi-tools log files and extracts statistics for R1 only. It extracts the following from each log file:

  • Input Reads (only R1; lines mentioning read2 are skipped)
  • regex matches read1
  • regex does not match read1

The sample name is derived from the file name by taking the part before the first underscore (e.g., for A1_S188.umi.extract.log, the sample is A1). If multiple log files belong to the same sample, their statistics are merged. Additionally, an optional parameter (-a) allows you to append an extra column (for example, to indicate a version or any other field) immediately after the sample column in the output.
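The sample-name derivation and merging described above can be sketched like this (illustrative helper names, not the script's actual internals):

```python
import os

def sample_name(log_path):
    """Sample name is the basename part before the first underscore,
    e.g. 'A1_S188.umi.extract.log' -> 'A1'."""
    return os.path.basename(log_path).split("_", 1)[0]

def merge_stats(per_file_stats):
    """Sum statistics across files that share a sample name.

    per_file_stats: iterable of (path, {stat_name: count}) pairs.
    """
    merged = {}
    for path, stats in per_file_stats:
        bucket = merged.setdefault(sample_name(path), {})
        for key, count in stats.items():
            bucket[key] = bucket.get(key, 0) + count
    return merged

# Two logs from the same sample merge into one row
merged = merge_stats([
    ("A1_S188.umi.extract.log", {"Input Reads": 10}),
    ("A1_S189.umi.extract.log", {"Input Reads": 5}),
])
```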

Requirements

This script requires Python 3 and uses only standard libraries:

  • argparse
  • re
  • os

No additional packages are needed.

Usage

usage: parse_umi_log.py -i INPUT [INPUT ...] [-o OUTPUT] [-a APPEND]

Parse umi-tools log files and output merged summary for R1 only.

options:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        List of .log files to process.
  -o OUTPUT, --output OUTPUT
                        Output file name (default: merged_summary.txt).
  -a APPEND, --append APPEND
                        Additional field to append after sample (e.g. version).

Example

python parse_umi_log.py -i A1_S188.umi.extract.log B2_S189.umi.extract.log -o umi_summary.txt -a version1.0

This command will:

  • Process the log files A1_S188.umi.extract.log and B2_S189.umi.extract.log.
  • Parse only the R1 statistics from each file.
  • Merge statistics for samples with the same sample name.
  • Append an extra column (here, version1.0) after the sample column.
  • Write the output summary to umi_summary.txt.

6. detect_remove_artifactTSO.py

This toolset identifies, visualizes, and filters TSO-related artifacts in 5' end-mapping RNA-seq data. The input BAM should be pre-trimmed to remove TSO/barcode/primer sequences so that only the true RNA/cDNA 5' ends are mapped. Support for untrimmed input is under development.

module load python/3.12.3 # if working on PDC
python3 detect_remove_artifactTSO.py -h

usage: detect_remove_artifactTSO.py [-h] {stat,plot,strand_invasion,missing_pairing} ...

A pipeline for strand invasion/missing pairing artifact analysis: stat, plot, filter.

positional arguments:
  {stat,plot,strand_invasion,missing_pairing}
    stat                Stat mode: outputs TSV for TSO-hamming+G, rcTSO-hamming+C, logo table, and grouped percentage curves.
    plot                Plot mode: visualize heatmaps/logo/curve from stat .tsv
    strand_invasion     Filter by G in last 3 nt + TSO hamming
    missing_pairing     Filter by C in last 3 nt + rcTSO hamming

options:
  -h, --help            show this help message and exit

Features:

  • stat: Output statistics (TSO/rcTSO hamming distance vs. last 3nt G/C count) and sequence logo table.
  • plot: Flexible visualization (heatmaps, sequence logos, grouped line plots) from stat output.
  • filtering: Filter BAM files by user-specified thresholds.

TSO sequence can include IUPAC ambiguity codes. Hamming distance is computed only on the TSO core (excluding the trailing GGG/CCC).
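IUPAC-aware matching over the TSO core can be sketched as follows (an illustration of the idea, not the script's exact implementation):

```python
# IUPAC nucleotide codes mapped to the bases they allow
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_hamming(tso_core, read_seq):
    """Mismatch count where an IUPAC code in the TSO matches any of
    its allowed bases; compared over the TSO core only."""
    return sum(b not in IUPAC.get(t, t) for t, b in zip(tso_core, read_seq))

# Core of TTTCTTATATGGG is TTTCTTATAT (trailing GGG excluded)
core = "TTTCTTATATGGG"[:-3]
```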


Installation

Python 3.7+ is required. Install dependencies with:

pip install pysam pyfaidx pandas matplotlib logomaker

Usage

  1. Stat: Generate all statistics and tables
python detect_remove_artifactTSO.py stat \
  -b input.bam -f genome.fa -t TTTCTTATATGGG \
  --g_out stat_g.tsv --c_out stat_c.tsv --logo_table logo_table.tsv \
  --g_curve_out g_curve.tsv --c_curve_out c_curve.tsv
  2. Plot: Visualize heatmap, sequence logo, or grouped line plot
python detect_remove_artifactTSO.py plot \
  --g_out stat_g.tsv --c_out stat_c.tsv --logo_table logo_table.tsv \
  --g_curve g_curve.tsv --logo_mode G --curve_png gcurve.png --logo_png logo.png
  3. Filter: Remove artifact reads from BAM
python detect_remove_artifactTSO.py strand_invasion \
  -b input.bam -f genome.fa -t TTTCTTATATGGG \
  --min_g 2 --max_hamming 3 --outbam filtered.bam

7. scSaturation.py

Generates single-cell sequencing saturation curves from a generic BAM file, with library-level and per-cell support.

Modes (mutually exclusive via --mode):

  • global: Monte Carlo subsampling of the whole BAM, counting uniques:
      --dedup cell      -> unique (CB, UMI)
      --dedup library   -> unique UMI (ignores CB & Gene)
      --dedup cell-gene -> unique (CB, Gene, UMI)
  • per-cell: For each selected CB, compute the EXPECTED number of unique molecules vs. reads analytically (no Monte Carlo):
      --per-cell-dedup {cell, cell-gene}
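The analytic per-cell expectation avoids Monte Carlo: if a cell's molecules were observed c_1, ..., c_k times at full depth, then after retaining each read independently with probability p, the expected number of molecules still observed is the sum of 1 - (1 - p)^c_i. A sketch of that estimator (the script's exact implementation may differ):

```python
def expected_unique(dup_counts, p):
    """Expected number of distinct molecules still observed when each
    read is retained independently with probability p, given the
    full-depth read count per molecule."""
    return sum(1.0 - (1.0 - p) ** c for c in dup_counts)

# A cell with 3 molecules seen 5x, 2x and 1x; subsample half the reads
e = expected_unique([5, 2, 1], 0.5)
```

Evaluating this at a grid of p values yields the per-cell saturation curve without resampling the BAM.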

Plot Axes:

  • X axis is always "Number of reads (M)"; you can override both axes via --xlabel/--ylabel
  • Journal style (no grid). Optional --logx, and --no-band (global only)

CSV (long-format):

  • global : reads,reads_M,mean_unique,std_unique,dedup
  • per-cell: cell,reads,reads_M,expected_unique,dedup

Requirements

If working on PDC: ml bioinfo-tools; ml python/3.12.3

Usage

  1. Global, per-cell dedup, 25 points, CSV
python scSaturation.py --mode global --bam in.bam --out-prefix sample \
  --dedup cell --num-points 25 --skip-secondary --csv
  2. Global, library-level unique UMI (ignore CB/Gene), log-x, no band
python scSaturation.py --mode global --bam in.bam --out-prefix sample_lib \
  --dedup library --logx --no-band
  3. Global, cell-gene dedup using GX then GN
python scSaturation.py --mode global --bam in.bam --out-prefix sample_cg \
  --dedup cell-gene --gene-tags GX,GN
  4. Per-cell curves, per-cell dedup, top100 cells (>=2k reads), CSV
python scSaturation.py --mode per-cell --bam in.bam --out-prefix sample_cells \
  --per-cell-dedup cell --per-cell-max 100 --per-cell-min-reads 2000 --csv
  5. Per-cell curves, cell-gene dedup for a custom CB list
python scSaturation.py --mode per-cell --bam in.bam --out-prefix sample_cells_cg \
  --per-cell-dedup cell-gene --cb-list cells.txt

8. libend_artifact_filter.py

Pipeline to score, visualize, and filter 5' and 3' end artefacts in RNA-seq libraries.

Usage

usage: libend_artifact_filter.py [-h] --mode {5p,3p} [--log-level {DEBUG,INFO,WARNING,ERROR}] {score,plot,filter} ...

Unified 5' (TSO) & 3' (oligo-dT) artefact analysis: score, plot, filter.

positional arguments:
  {score,plot,filter}
    score               Compute per-read scores and threshold grids.
    plot                Plot heatmaps, coverage curves and logos.
    filter              Filter BAM by artefact rules.

options:
  -h, --help            show this help message and exit
  --mode {5p,3p}        Library end mode: 5p (TSO) or 3p (oligo-dT)
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Logging verbosity.

Example

python3 libend_artifact_filter.py --mode 3p score \
  -b CCE78_SCFlu25r1t2_3P_dedup.bam \
  -f /cfs/klemming/projects/supr/sllstore2017018/ref_share/fungi/SaCer_embl_74/Saccharomyces_cerevisiae.EF4.74.dna.toplevel.fa \
  --oligos /cfs/klemming/home/h/huizhou/software/RNAEdgeFlow_dev/external/oligo_dT.tsv \
  --min-mapq 255 \
  --prefix CCE78_SCFlu25r1t2_3P_dedup

python3 libend_artifact_filter.py --mode 3p plot \
  --prefix CCE78_SCFlu25r1t2_3P_dedup

python3 libend_artifact_filter.py --mode 3p plot \
  --prefix CCE78_SCFlu25r1t2_3P_dedup \
  --make-logo bits \
  --logo-score1-max 2 \
  --logo-score2-max 7
  
# Filter reads based on 3' artifact scores and logo plots
python3 libend_artifact_filter.py --mode 3p filter \
    -b CCE78_SCFlu25r1t2_3P_dedup.bam \
    -f /cfs/klemming/projects/supr/sllstore2017018/ref_share/fungi/SaCer_embl_74/Saccharomyces_cerevisiae.EF4.74.dna.toplevel.fa \
    --oligos /cfs/klemming/home/h/huizhou/software/RNAEdgeFlow_dev/external/oligo_dT.tsv \
    --read-window-len 34 --score1-thresh 2 --score2-thresh 7 \
    -o CCE78_SCFlu25r1t2_3P_filtered.bam
