Decomposes satellite DNA arrays into monomers within telomere-to-telomere (T2T) assemblies. Ideal for analyzing centromeric and pericentromeric regions at the monomer level.
Status: Production ready. Successfully handles arrays from kilobase to megabase scale.
Key Features:
- De novo monomer identification without prior knowledge
- Autocorrelation-based period detection for robust periodicity analysis
- Automatic orientation to canonical form (A>T, C>G)
- Deterministic output sorted by genomic coordinates
- Multi-threaded processing
Performance: CHM13v2.0 assembly (~1300 alpha satellite arrays) processes in ~3.5 minutes (16 threads)
Installation:

    pip install arraysplitter

Or build from source:

    cd src/rust/arraysplitter
    cargo build --release

Usage:

    # Basic decomposition
    arraysplitter -i arrays.fa -o output_prefix -t 16

    # With predefined cut sequences
    arraysplitter -i arrays.fa -o output_prefix -c ATG,CGCG -t 16

    # Show version
    arraysplitter --version

All output is deterministically sorted by chromosome and genomic position (chr1 → chr22 → chrX → chrY → chrM).

| File | Description |
|---|---|
| `.decomposed.fasta` | Monomers with orientation info in headers |
| `.hors.tsv` | HOR-level decomposition with metrics per HOR monomer |
| `.monomers.tsv` | Base-level monomers from recursive HOR decomposition |
| `.lengths` | Fragment lengths for each array |

The `.hors.tsv` file contains the primary decomposition into HOR (Higher Order Repeat) monomers.

| Column | Description |
|---|---|
| `array_id` | Array identifier (`chr_start_end_len_period_type`) |
| `type` | `pred_array`, `monomer`, `flank`, `array`, `consensus` |
| `idx` | Monomer index within array |
| `length` | Sequence length |
| `source` | Detection method (`anchor`, `split_2x`, etc.) |
| `ed_tmpl` | Edit distance to consensus template |
| `ed_prev` | Edit distance to previous monomer |
| `ed_next` | Edit distance to next monomer |
| `period` | Detected repeat period |
| `autocorr` | Autocorrelation value at period |
| `cut_sequence` | Anchor sequence used for splitting |
| `orientation` | `fwd` or `rev` (reverse complemented) |
| `sequence` | Actual DNA sequence |
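
As a rough illustration of consuming this table, the sketch below reads monomer rows with Python's standard `csv` module. It assumes the file is tab-separated with a header row that uses the column names listed above; the function name `read_monomers` is hypothetical and not part of ArraySplitter.

```python
import csv
import io


def read_monomers(handle):
    """Yield rows of type 'monomer' from an open .hors.tsv handle.

    Assumption: the file is tab-separated and its first line is a header
    using the column names documented above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["type"] != "monomer":
            continue
        # Convert the numeric fields used most often downstream.
        row["idx"] = int(row["idx"])
        row["length"] = int(row["length"])
        yield row


# Tiny in-memory example with only a few of the documented columns.
example = io.StringIO(
    "array_id\ttype\tidx\tlength\tsequence\n"
    "chr1_100_112_12_4_x\tflank\t0\t3\tGCT\n"
    "chr1_100_112_12_4_x\tmonomer\t1\t4\tACGT\n"
)
monomers = list(read_monomers(example))
```

For a real run, the same function could be given `open("output_prefix.hors.tsv", newline="")`, assuming the output file name is the prefix followed by the `.hors.tsv` suffix.
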

The `.monomers.tsv` file contains base-level monomers after recursive decomposition of HORs. Each HOR is recursively decomposed until no further periodicity is detected (autocorrelation ≤ 0.5) or the minimum length (5 bp) is reached.

| Column | Description |
|---|---|
| `array_id` | Array identifier |
| `hor_idx` | Index of parent HOR from primary decomposition |
| `sub_idx` | Index within parent HOR (hierarchical for nested decomposition) |
| `level` | Recursion depth (1 = direct child of HOR) |
| `length` | Sequence length |
| `period` | Detected period at this level (0 if base monomer) |
| `autocorr` | Autocorrelation value at detected period |
| `source` | `recursive_anchor`, `recursive_split`, `base`, `recursive_flank` |
| `sequence` | Actual DNA sequence |

ArraySplitter employs an autocorrelation-based algorithm for detecting repeat periods and decomposing satellite DNA arrays.
Arrays are oriented to canonical form:
- Primary rule: A > T (more A's than T's)
- Secondary rule: C > G (if A=T)
- Non-canonical arrays are reverse complemented
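
The orientation rules above can be sketched in a few lines of Python. This is an illustration only, not ArraySplitter's implementation: the function names are hypothetical, and leaving a fully balanced sequence (A = T and C = G) unchanged is an assumption.

```python
# Watson-Crick complement table; N and other symbols pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_canonical(seq: str) -> tuple[str, str]:
    """Orient a sequence so that A > T, or C > G when A == T.

    Returns (oriented_sequence, orientation), where orientation is "fwd"
    if the input was kept and "rev" if it was reverse complemented.
    """
    upper = seq.upper()
    a, t, c, g = (upper.count(base) for base in "ATCG")
    # Assumption: when A == T and C == G the sequence is kept as-is.
    if a > t or (a == t and c >= g):
        return seq, "fwd"
    return reverse_complement(seq), "rev"
```

Reverse complementing swaps the A/T and C/G counts, so a sequence that fails either rule satisfies it after flipping.
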
The algorithm computes sequence autocorrelation to detect periodicity:
autocorr(offset) = matches / comparisons
Where matches counts identical nucleotides at positions i and i + offset.
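
A direct, unoptimized Python reading of this formula is shown below. It assumes `comparisons` is the number of positions `i` for which `i + offset` is still inside the sequence; the tool itself may compute this differently.

```python
def autocorr(seq: str, offset: int) -> float:
    """Fraction of positions i where seq[i] == seq[i + offset]."""
    comparisons = len(seq) - offset
    if offset <= 0 or comparisons <= 0:
        return 0.0
    matches = sum(1 for i in range(comparisons) if seq[i] == seq[i + offset])
    return matches / comparisons
```

For a perfect tandem repeat such as `"ACGT" * 5`, the value is 1.0 at offset 4 and 0.0 at offset 1.
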
Key innovations:
- Random expectation correction: Subtracts expected random match rate based on nucleotide composition
- Refined period search: Uses FFT-like peak detection to find true period vs harmonics
- Confidence scoring: Autocorrelation excess over random indicates detection confidence
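
To make the random expectation correction concrete, the sketch below subtracts the composition-based match probability (the sum of squared base frequencies) from the raw match rate and picks a period from the result. The harmonic handling here is a deliberately simple stand-in, not the tool's peak detection: it takes the smallest offset whose excess is within a hypothetical `tolerance` of the best one. All function and parameter names are illustrative.

```python
from collections import Counter


def expected_random_match(seq: str) -> float:
    """Probability that two independently drawn bases match: sum of p_b ** 2."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum((count / n) ** 2 for count in Counter(seq).values())


def match_rate(seq: str, offset: int) -> float:
    """Fraction of positions i where seq[i] == seq[i + offset]."""
    comparisons = len(seq) - offset
    if offset <= 0 or comparisons <= 0:
        return 0.0
    return sum(seq[i] == seq[i + offset] for i in range(comparisons)) / comparisons


def detect_period(seq: str, tolerance: float = 0.05) -> tuple[int, float]:
    """Return (period, excess over random), or (0, 0.0) if seq is too short.

    Multiples of the true period score about as well as the period itself,
    so the smallest offset within `tolerance` of the best excess is chosen.
    """
    baseline = expected_random_match(seq)
    scores = {p: match_rate(seq, p) - baseline for p in range(1, len(seq) // 2 + 1)}
    if not scores:
        return 0, 0.0
    best = max(scores.values())
    period = min(p for p, s in scores.items() if s >= best - tolerance)
    return period, scores[period]
```

The returned excess can be read as a confidence value: near 0 means the match rate is no better than chance for that base composition.
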
For the detected period, the algorithm finds the optimal anchor (cut sequence) using:
- K-mer enumeration: Extract all k-mers (k=10 by default) from the sequence
- Position analysis: For each k-mer, record all occurrence positions
- Scoring metrics:
  - Uniqueness: Fraction of occurrences exactly `period` apart
  - Regularity: How evenly spaced the occurrences are
- Combined score: `uniqueness × regularity`
- Deterministic selection: K-mers sorted lexicographically for reproducible tie-breaking
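
The anchor search above might look roughly like the following. The exact formulas are not given here, so both metrics are assumptions: uniqueness is taken as the fraction of consecutive gaps equal to the period, and regularity as `1 / (1 + coefficient of variation of the gaps)`. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def kmer_positions(seq: str, k: int = 10) -> dict[str, list[int]]:
    """Map every k-mer in seq to the sorted list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def score_anchor(positions: list[int], period: int) -> float:
    """Combined score uniqueness * regularity for one k-mer (assumed formulas)."""
    if len(positions) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    uniqueness = sum(1 for gap in gaps if gap == period) / len(gaps)
    regularity = 1.0 / (1.0 + pstdev(gaps) / mean(gaps))
    return uniqueness * regularity


def select_anchor(seq: str, period: int, k: int = 10):
    """Return (best_kmer, score); ties go to the lexicographically smallest k-mer."""
    best_kmer, best_score = None, 0.0
    for kmer, positions in sorted(kmer_positions(seq, k).items()):
        score = score_anchor(positions, period)
        if score > best_score:  # strict '>' keeps the earliest k-mer on ties
            best_kmer, best_score = kmer, score
    return best_kmer, best_score
```

Iterating in sorted order with a strict comparison is what makes the choice reproducible when several k-mers score equally.
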
Using the selected anchor:
- Split array at all anchor occurrences
- First fragment → left flank (if < 70% of period)
- Middle fragments → monomers
- Last fragment → right flank (if < 70% of period)
- Apply heuristics for multiplet splitting (doublets, triplets, etc.)
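
A minimal sketch of the splitting step, assuming cuts are placed immediately before each anchor occurrence (cutting after the anchor would be equally plausible). It covers only the flank rule; the multiplet heuristics are omitted. The name `split_at_anchor` and the `flank_fraction` parameter are hypothetical.

```python
def split_at_anchor(seq: str, anchor: str, period: int, flank_fraction: float = 0.7):
    """Split seq before every anchor occurrence and label the fragments.

    Returns a list of (kind, fragment) tuples, where kind is "flank" for a
    first or last fragment shorter than flank_fraction * period and
    "monomer" otherwise.
    """
    cuts = []
    start = seq.find(anchor)
    while start != -1:
        cuts.append(start)
        start = seq.find(anchor, start + 1)

    # Fragment boundaries: sequence start, each cut, sequence end.
    bounds = [0] + [c for c in cuts if c > 0] + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

    labeled = []
    last = len(fragments) - 1
    for i, fragment in enumerate(fragments):
        is_edge = i == 0 or i == last
        is_short = len(fragment) < flank_fraction * period
        labeled.append(("flank" if is_edge and is_short else "monomer", fragment))
    return labeled
```

Concatenating the fragments in order reproduces the input, so no bases are lost or duplicated by the split.
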
Results are:
- Sorted by chromosome (natural order: 1, 2, ..., 22, X, Y, M)
- Within chromosome, sorted by start position
- Fully deterministic across runs
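
One way to obtain this ordering is a sort key that ranks numeric chromosomes first, then X, Y, and M. The sketch below is illustrative: treating `MT` like `M` and placing unrecognized names last are assumptions, and the function names are hypothetical.

```python
import re


def chrom_sort_key(name: str):
    """Natural chromosome order: 1..22, then X, Y, M; other names last."""
    label = re.sub(r"^chr", "", name, flags=re.IGNORECASE)
    if label.isdigit():
        return (0, int(label), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}  # assumption: MT sorts like M
    rank = special.get(label.upper())
    if rank is not None:
        return (1, rank, "")
    return (2, 0, label)  # assumption: unknown contigs sort alphabetically at the end


def sort_records(records):
    """Sort (chrom, start, ...) tuples by chromosome, then by start position."""
    return sorted(records, key=lambda rec: (chrom_sort_key(rec[0]), rec[1]))
```

Because the key depends only on the record itself, the order is identical across runs and thread counts.
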
The `autocorr` method uses autocorrelation for period detection. Best for:
- Regular tandem repeats
- Alpha satellite arrays
- HOR (Higher Order Repeat) structures

The `classic` method uses a frequency suffix tree approach. Better for:
- Irregular or degenerate repeats
- Very short arrays
- Arrays with high mutation rates

The `both` method tries autocorrelation first and falls back to `classic` if autocorrelation fails.

    arraysplitter --help

    Options:
      -i, --input <FILE>      Input FASTA file
      -o, --output <PREFIX>   Output prefix
      -t, --threads <N>       Number of threads [default: all cores]
      -c, --cuts <SEQ,SEQ>    Predefined cut sequences (comma-separated)
      -d, --depth <N>         Max depth for cut search [default: 100]
      --method <METHOD>       Detection method: autocorr, classic, both [default: autocorr]
      --max-ed-len <N>        Max monomer length for edit distance [default: 10000]
      --stats                 Print detailed statistics
      --top-outliers <N>      Number of outliers to show [default: 10]
      -V, --version           Print version

If you use ArraySplitter in your research, please cite: [Publication pending]
For questions or support: ad3002@gmail.com