EnzEngDB

A comprehensive database and analysis pipeline for studying directed evolution of enzymes performing new-to-nature reactions.

Overview

The Enzyme Engineering database curates and analyzes data from directed evolution experiments documented in scientific literature, focusing on engineered enzymes that catalyze reactions not found in nature. The project creates molecular embeddings for both protein sequences and chemical reactions to enable machine learning applications and comparative analyses.

Features

Data Curation: Systematic collection of enzyme-reaction pairs from 36+ research papers
Molecular Embeddings: State-of-the-art embeddings for proteins (ESM3) and reactions (ChemBERTa2, RxnFP)
Chemical Space Analysis: Visualization and comparison of engineered vs natural enzyme reaction space
Standardized Format: Conversion to LevSeq format for broader accessibility
Comprehensive Pipeline: End-to-end processing from raw data to analysis-ready datasets

Dataset Statistics

1,341 enzyme-reaction pairs
640 unique reactions
367 unique protein variants
36 research papers included

Installation

Prerequisites

Python 3.8+
PyTorch (for ESM models)
RDKit (for chemistry operations)

Install via pip

# Clone the repository
git clone https://github.com/yourusername/EnzymeEngineeringDB.git
cd EnzymeEngineeringDB

# Create and activate a conda environment
conda create --name enzengdb "python>=3.8"
conda activate enzengdb

# Install dependencies
pip install -r requirements.txt

Dependencies

The project requires the following main packages:

Core: pandas, numpy, scikit-learn, matplotlib, seaborn
Chemistry: rdkit, pubchempy, biopython
Deep Learning: torch, esm, huggingface-hub
Other: enzymetk, sciutil, sciviso

See requirements.txt for complete list with versions.
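
Before opening the notebooks, it can help to confirm that the packages above are importable in the active environment. The following is a minimal sketch using only the standard library. The import names are assumptions: several differ from the pip package name (scikit-learn imports as sklearn, biopython as Bio, huggingface-hub as huggingface_hub), and enzymetk, sciutil and sciviso are assumed to import under their own names.

```python
from importlib.util import find_spec

# Import names are assumptions; some differ from the pip package name
# (scikit-learn -> sklearn, biopython -> Bio, huggingface-hub -> huggingface_hub).
# enzymetk, sciutil and sciviso are assumed to import under their own names.
IMPORT_NAMES = [
    "pandas", "numpy", "sklearn", "matplotlib", "seaborn",
    "rdkit", "pubchempy", "Bio",
    "torch", "esm", "huggingface_hub",
    "enzymetk", "sciutil", "sciviso",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(IMPORT_NAMES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

This only checks that a module can be located; it does not verify versions, so requirements.txt remains the reference for those.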

Usage

The analysis pipeline consists of four main notebooks that should be run in sequence:

1. Clean Reaction Data

jupyter notebook analysis/N1_CleanReactionData.ipynb

Validates and canonicalizes reaction SMILES
Creates reaction embeddings using ChemBERTa2 and RxnFP
Outputs: cannoical_smiles.pkl, rxn_chemberta.pkl, rxn_rxnfp.pkl
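
To illustrate the validation part of this step, here is a minimal, dependency-free sketch that checks the reactants>agents>products layout of a reaction SMILES and orders the molecules on each side deterministically. It is not the notebook's implementation. The function name and the canonicalize hook are hypothetical; real per-molecule canonicalization would come from a cheminformatics toolkit such as RDKit, which this sketch deliberately leaves out.

```python
def normalize_reaction_smiles(rxn, canonicalize=lambda smi: smi):
    """Check the reactants>agents>products layout and order molecules deterministically.

    `canonicalize` is a hypothetical hook for a per-molecule canonicalizer
    (for example one built on RDKit); the default leaves each molecule unchanged.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got {len(parts) - 1}: {rxn!r}")

    sides = []
    for part in parts:
        # Molecules on one side are separated by '.'; empty entries are dropped.
        molecules = [canonicalize(m) for m in part.split(".") if m]
        sides.append(".".join(sorted(molecules)))

    reactants, agents, products = sides
    if not reactants or not products:
        raise ValueError(f"reaction needs both reactants and products: {rxn!r}")
    return f"{reactants}>{agents}>{products}"
```

With the default hook, two strings that list the same molecules in a different order map to the same output, but two different SMILES spellings of the same molecule do not. Closing that gap is what a real canonicalizer is for.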

2. Clean Enzyme Data

jupyter notebook analysis/N2_CleanEnzymeData.ipynb

Processes enzyme sequences and mutations
Generates protein embeddings using ESM3
Outputs: protein-evolution-database_V4_embedded_proteins.pkl, variant_df_no_errors.pkl
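
Processing mutations typically means turning a parent sequence plus a list of substitutions into a full variant sequence. The sketch below shows one way to do that, assuming mutations are written as wild-type residue, 1-based position, new residue (for example "K2R"). That notation, the function name, and the error handling are assumptions for illustration and may not match what the notebook does.

```python
import re

# Assumed notation: wild-type residue, 1-based position, new residue (e.g. "K2R").
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(parent, mutations):
    """Return the variant sequence obtained by applying point substitutions to `parent`."""
    residues = list(parent)
    for mutation in mutations:
        match = MUTATION_PATTERN.match(mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation string: {mutation!r}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"{mutation}: position outside sequence of length {len(residues)}")
        # Check the stated wild-type residue against the parent to catch numbering errors.
        if parent[position - 1] != wild_type:
            raise ValueError(
                f"{mutation}: parent has {parent[position - 1]!r} at position {position}"
            )
        residues[position - 1] = new
    return "".join(residues)
```

Checking the wild-type residue is a cheap way to flag variants whose numbering does not match the parent sequence, which is a common source of errors in literature-derived data.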

3. Analyze Combined Data

jupyter notebook analysis/N3_AnalyseEnzymeReactionData.ipynb

Combines protein and reaction data
Performs principal component analysis (PCA) and visualization
Compares engineered enzymes to natural enzyme space
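
As a rough illustration of what the PCA step computes, the sketch below finds the first principal component of a small set of vectors using power iteration on the covariance matrix, then projects each vector onto it. It is a dependency-free teaching example, not the notebook's code; in practice a library implementation (scikit-learn is listed in the dependencies) would be used on the embedding matrices, usually with two or more components. The function name and the starting vector are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (component, scores): the leading PCA direction and each row's projection."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it (an uneven start makes
    # that unlikely but not impossible).
    vec = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance along the current direction
        vec = [x / norm for x in nxt]

    # Project the centered rows onto the component.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores
```

The sign of a principal component is arbitrary, so plots from different implementations can appear mirrored while describing the same structure.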

4. Convert to Standard Format

jupyter notebook analysis/N4_ConvertFormatToLevSeq.ipynb

Converts data to LevSeq format
Organizes by experiment/paper
Creates metadata files

Project Structure

DirectedEvolutionDB/
--- README.md
--- requirements.txt
--- LICENSE
--- data/                   # Raw data files
--- analysis/
------ N1_CleanReactionData.ipynb
------ N2_CleanEnzymeData.ipynb
------ N3_AnalyseEnzymeReactionData.ipynb
------ N4_ConvertFormatToLevSeq.ipynb
------ scripts/
------ esm3.py              # ESM3 embedding utilities
------ output/              # Processed data outputs
------ Archive/             # Previous notebook versions

Output Files

cannoical_smiles.pkl: Standardized reaction SMILES
rxn_chemberta.pkl: ChemBERTa2 reaction embeddings
rxn_rxnfp.pkl: RxnFP reaction fingerprints
protein-evolution-database_V4_embedded_proteins.pkl: ESM3 protein embeddings
variant_df_no_errors.pkl: Cleaned variant data with yields
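
The files above are Python pickles, so they can be read back with the standard library. The sketch below loads whichever of the listed files exist in a given directory. The file names come from the list above; the function name and the directory you pass in are assumptions (the project structure suggests an output/ folder under analysis/, but check your checkout). If the pickles hold pandas objects, pandas must be installed for loading to succeed.

```python
import pickle
from pathlib import Path

# File names are taken from the list above; the directory passed to
# load_outputs is an assumption and may differ in your checkout.
OUTPUT_FILES = [
    "cannoical_smiles.pkl",
    "rxn_chemberta.pkl",
    "rxn_rxnfp.pkl",
    "protein-evolution-database_V4_embedded_proteins.pkl",
    "variant_df_no_errors.pkl",
]


def load_outputs(output_dir, names=OUTPUT_FILES):
    """Load every listed pickle that exists in `output_dir`; skip the ones that do not."""
    loaded = {}
    for name in names:
        path = Path(output_dir) / name
        if path.is_file():
            with path.open("rb") as handle:
                # Only unpickle files you produced or trust: pickle can run arbitrary code.
                loaded[name] = pickle.load(handle)
    return loaded
```

A call such as load_outputs("analysis/output") returns a dictionary keyed by file name, containing only the files that were found.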

Key Findings

Directed evolution has successfully expanded enzyme function into previously unexplored chemical space
Engineered enzymes cluster in distinct regions when visualized using dimensionality reduction
Different research groups tend to explore different regions of chemical/sequence space
The database captures the diversity of new-to-nature enzymatic reactions

LLM pipeline

The LLM pipeline can be accessed at: https://github.com/YuemingLong/DEBase

The tool for automated download of PubMed papers can be accessed at: https://github.com/31415erre/pubmed2pdf

Website and database

The database and website can be accessed at:

Contributing

We welcome contributions! Please feel free to submit issues or pull requests. Contributed data should follow the LevSeq output format, which requires several specific headers.

Citation

To cite this project, please refer to our releases.

A paper citation will be added soon.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or collaborations, please open an issue on GitHub.
