🧬🪢 DNA Risk Classification 🪢🧬

This bioinformatics pipeline combines knot theory with genomic analysis to classify synthetic DNA sequences by structural and biological risk. It relies on five complementary metrics that together support risk assessment for genetic engineering projects.

📋 Overview

This project analyzes twenty diverse DNA sequences whose risk profiles range from low to medium-high. It uses knot invariants and biophysical properties to predict secondary structure formation, combining topological analysis with standard bioinformatics methods to stratify risk for synthetic biology.

📊 Dataset Profile

Sequences Analyzed: 20 diverse DNA sequences
Risk Distribution: 35 low-risk, 27 medium-risk, 0 high-risk, 0 critical-risk
Structural Variety: Simple repeats, GC-rich, hairpin-prone, inverted structures, triplex-forming, palindromic, complex topologies

📖 Five Core Metrics

Test	Metric	Mean	Standard Deviation	Finding
Test 1	GC Content	53.39%	23.95%	Moderate thermal stability; optimal for synthetic applications
Test 2	Melting Temperature	65.4°C	9.8°C	Stable DNA denaturation profile; suitable for PCR
Test 3	Homopolymer Runs	2.97 bp	—	16.1% of sequences problematic; structural risk indicator
Test 4	Shannon Entropy	1.163	0.525	Low-moderate complexity; secondary structure propensity
Test 5	Codon Bias	GAT/GAC	—	Complete analysis; expression-level risk assessment

📈 Results & Visualizations

Test 1: GC Content Distribution

Figure 1.1 - Distribution of guanine-cytosine (GC) content across the analyzed DNA sequences.
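As a minimal sketch of the metric behind this figure, GC content can be computed as the share of G and C bases in a sequence. The function name `gc_content` is an illustrative assumption and is not taken from the project's source files.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence.

    Hypothetical helper for illustration; the project's own
    implementation may differ.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Count every base that is either G or C.
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


print(gc_content("ATGC"))  # 2 of 4 bases are G or C -> 50.0
```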

Test 2: Melting Temperature Distribution

Figure 1.2 - Distribution of melting temperature (Tm), the temperature at which half of the double-stranded DNA has separated into single strands.
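The README does not say which thermodynamic model the pipeline uses for Tm. The sketch below assumes two widely used textbook approximations: the Wallace rule for short oligonucleotides and a basic GC-based formula for longer ones. The function name and the 14-base cutoff are illustrative assumptions.

```python
def melting_temperature(sequence: str) -> float:
    """Estimate Tm in degrees Celsius from base composition.

    Assumed formulas (not confirmed by the project):
      - fewer than 14 bases: Wallace rule, Tm = 2*(A+T) + 4*(G+C)
      - otherwise: Tm = 64.9 + 41*(G+C - 16.4) / N
    Bases other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    n = gc + at
    if n == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    if n < 14:
        return 2.0 * at + 4.0 * gc
    return 64.9 + 41.0 * (gc - 16.4) / n


print(melting_temperature("ATGC"))      # short oligo, Wallace rule
print(melting_temperature("ATGC" * 5))  # 20-mer, GC-based formula
```

Both formulas ignore salt and strand concentration, so they are only rough estimates; nearest-neighbor models are more accurate when those conditions are known.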

Test 3: Homopolymer vs Complexity

Figure 1.3 - Compares sequences dominated by long runs of a single nucleotide with sequences of more varied composition.
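One way to measure homopolymer runs is to find the longest stretch of a single repeated nucleotide. The sketch below uses `itertools.groupby` from the standard library. The helper name is an assumption made for illustration.

```python
from itertools import groupby


def longest_homopolymer(sequence: str) -> tuple[str, int]:
    """Return (base, length) of the longest single-nucleotide run.

    Hypothetical helper; returns ("", 0) for an empty sequence.
    """
    best_base, best_len = "", 0
    # groupby yields one group per stretch of identical adjacent bases.
    for base, group in groupby(sequence.upper()):
        run = sum(1 for _ in group)
        if run > best_len:
            best_base, best_len = base, run
    return best_base, best_len


print(longest_homopolymer("ATTTTGCC"))  # ('T', 4)
```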

Test 4: Shannon Entropy Distribution

Figure 1.4 - Shows how entropy valleys align with knots, risk hotspots, and fragile sites.
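A short sketch of how Shannon entropy can be computed for a sequence, and how a sliding window can expose low-entropy "valleys". It assumes entropy is measured in bits over single nucleotides, which gives a maximum of 2.0 for DNA. The window size and function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits of the nucleotide distribution."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def windowed_entropy(sequence: str, window: int = 10, step: int = 1) -> list[float]:
    """Entropy of each sliding window; low values mark repetitive regions."""
    last_start = len(sequence) - window
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(0, last_start + 1, step)
    ]


print(shannon_entropy("ACGT"))  # all four bases equally likely -> 2.0
print(windowed_entropy("AAAACGT", window=4))
```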

Test 5: Codon Usage Bias

Figure 1.5 - Shows how often each synonymous codon is used, which indicates translational risk and evolutionary pressure.
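Codon usage can be tallied by splitting a sequence into triplets and comparing counts within a synonymous family. The sketch below uses GAT and GAC, which both encode aspartic acid. The function names and the choice of reading frame 0 are assumptions for illustration.

```python
from collections import Counter


def codon_counts(sequence: str, frame: int = 0) -> Counter:
    """Count complete codons in the given reading frame."""
    seq = sequence.upper()
    # Stop at len - 2 so that a trailing partial codon is skipped.
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def synonymous_fraction(counts: Counter, codon: str, family: tuple[str, ...]) -> float:
    """Share of one codon among all codons in its synonymous family."""
    total = sum(counts[c] for c in family)
    return counts[codon] / total if total else 0.0


counts = codon_counts("GATGACGATAAA")  # GAT, GAC, GAT, AAA
print(counts["GAT"], counts["GAC"])    # 2 1
print(synonymous_fraction(counts, "GAT", ("GAT", "GAC")))
```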

Integrated Risk Assessment

Figure 1.6 - Overlay map of stability, entropy, codon bias, and topology, which reveals genomic 'hotspots' of biological and structural risk.

Figure 1.7 - Distribution of clusters of high-risk regions versus broadly stable zones.

👨‍🔬 Results Interpretation

The five-metric approach stratified sequences effectively across the risk spectrum. Topological features correlated strongly with structural stability and showed high predictive value. GC content in the 50 to 60 percent range appears optimal for synthetic biology applications. Sequences with lower entropy showed a significantly higher propensity for secondary structure. Codon choice matters too: the dominance of GAT and GAC points to expression-level design considerations.
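To make the reasoning above concrete, here is a hedged sketch of how such observations could be turned into a simple rule-based risk label. Only the 50 to 60 percent GC range comes from this README. The entropy floor, the run-length limit, the scoring scheme, and all names are hypothetical and do not describe the project's actual classifier, which also draws on topological features.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    gc_percent: float    # GC content in percent
    entropy_bits: float  # Shannon entropy in bits
    longest_run: int     # longest homopolymer run in bases


def risk_score(m: Metrics, entropy_floor: float = 1.0, max_run: int = 4) -> int:
    """Count how many checks a sequence fails (0 to 3).

    entropy_floor and max_run are hypothetical thresholds.
    """
    score = 0
    if not 50.0 <= m.gc_percent <= 60.0:  # outside the GC sweet spot
        score += 1
    if m.entropy_bits < entropy_floor:    # low complexity
        score += 1
    if m.longest_run > max_run:           # long homopolymer run
        score += 1
    return score


def risk_label(score: int) -> str:
    """Map a 0-3 score onto the four risk levels named in this README."""
    return ("low", "medium", "high", "critical")[score]


print(risk_label(risk_score(Metrics(55.0, 1.8, 2))))  # low
```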

👨‍💻 Quick Start

python main.py

⌨️ Technical Stack

Language: Python 3.x
Libraries: NumPy, Pandas, Matplotlib, Seaborn
Methods: Knot invariant computation, thermodynamic modeling, statistical analysis

📁 Project Structure

secondary_structure_predictor/
├── config.py                      # Configuration & thresholds
├── knot_analyzer.py               # Knot theory calculations
├── bioinformatics_analyzer.py      # Metric computation
├── main.py                        # Pipeline orchestration
├── sequence_parser.py             # FASTA parsing
├── structure_predictor.py         # Secondary structure prediction
├── visualization.py               # Visualization generation
├── example_sequences.fasta         # Test data
├── requirements.txt               # Dependencies
├── README.md                      # This file
├── QUICKSTART.md                  # Quick reference
└── results/
    ├── test1_gc_content.png
    ├── test2_tm_analysis.png
    ├── test3_homopolymer.png
    ├── test4_entropy.png
    ├── test5_codon_bias.png
    ├── knot_risk_landscape.png
    ├── risk_distribution.png
    └── sequence_*.png             # 20 individual analyses

🛠️ Workflow

FASTA Input
    ↓
Parse DNA Sequences
    ↓
Extract Biophysical Metrics (GC, Tm, Homopolymers, Entropy, Codons)
    ↓
Analyze Topological Properties (Knot Theory)
    ↓
Predict Secondary Structure & Classify Risk
    ↓
Generate Visualizations (7 plots)
    ↓
Execute Pipeline & Generate Report
    ↓
20 Sequence Analyses + Comprehensive Visualizations + Summary Statistics
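
The first and last steps of the workflow above can be sketched as follows: parse FASTA records, compute one metric per sequence, and summarize it. This is a minimal standard-library illustration, not the contents of `sequence_parser.py` or `main.py`. The function names and the inline example data are assumptions.

```python
import io
import statistics


def parse_fasta(handle) -> dict[str, str]:
    """Read FASTA text into {record_id: sequence}."""
    records: dict[str, str] = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first token of the header as the record id.
            parts = line[1:].split(maxsplit=1)
            name = parts[0] if parts else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def gc_percent(seq: str) -> float:
    """GC content in percent; 0.0 for an empty sequence."""
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Hypothetical two-record input standing in for example_sequences.fasta.
example = io.StringIO(">seq1 demo\nATGC\nGGCC\n>seq2\nAATT\n")
records = parse_fasta(example)
gc_values = [gc_percent(s) for s in records.values()]

print(records)
print("mean GC:", statistics.mean(gc_values))
print("SD GC:", statistics.stdev(gc_values))
```

To read a real file, the same function would take an open file handle, for example `with open("example_sequences.fasta") as fh: records = parse_fasta(fh)`.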

Repository: JobinJohn24/DNA_Risk_Classification
