Multi-level BO with Hierarchical Coarse-Graining

This repository contains the source code for the paper "Navigating Chemical Space: Multi-Level Bayesian Optimization with Hierarchical Coarse-Graining".

Citing this work

If you find this work useful in your research, please consider citing:

```
@article{Walter2025,
  title = {Navigating Chemical Space: Multi-Level Bayesian Optimization with Hierarchical Coarse-Graining},
  url = {http://dx.doi.org/10.1039/D5SC03855C},
  DOI = {10.1039/d5sc03855c},
  journal = {Chem. Sci.},
  publisher = {The Royal Society of Chemistry},
  author = {Walter, Luis J. and Bereau, Tristan},
  year = {2025}
}
```

Tutorial

The repository Molecule Optimization w Hierarchical CG Tutorial contains a tutorial notebook that explains multi-level Bayesian optimization and hierarchical coarse-graining for simple two-bead molecules in a water-hexane mixture.

Overview

The following presents the main components of the repository:

chespex: A small Python package that helps with the implementation of the multi-level Bayesian optimization (BO) algorithm for coarse-grained (CG) molecules. It does not contain application-specific details, but rather provides the general framework for the algorithm. It includes the following components:
- molecules: This module contains a Molecule and a MoleculeGenerator class. The Molecule class is used to represent a molecule and its properties, while the MoleculeGenerator class is used to enumerate all possible molecular graphs for a given number of CG beads.
- encoding: This module contains the Encoder and Decoder models of the molecular autoencoder. It also contains the permutation-invariant loss function used to train the autoencoder.
- optimization: This module includes an implementation of a simple Dataframe class that stores the data for the optimization process. It also includes a GPyTorch-based GaussianProcess class used for the Bayesian optimization.
- simulation: This module contains simulation utilities for a high-throughput molecular dynamics (MD) simulation workflow. It includes the Simulation and SimulationSetup classes, which provide a Python interface to GROMACS simulations. Together with the simulator functions, they implement a simple queueing system to run multiple GROMACS simulations in parallel and/or on different machines.
Application-specific code: The 1_bead_types, 2_molecule_enumeration, 3_autoencoder, 4_membrane_setup, and 5_optimize folders contain the code used to optimize a small molecule that enhances phospholipid phase separation. The numbers indicate the execution sequence. See below for further details.
toy-mol-optimization: This folder contains a Jupyter notebook which was used for the toy example shown in the supplementary information of the paper. It is a simplified (non-invariant) system used to compare the performance of the multi-level BO algorithm with a standard BO algorithm.
molecules.json: This JSON file contains the list of molecules that were selected by the multi-level BO algorithm and evaluated via MD simulations. For each molecule, it also lists the demixing free-energy difference, the latent space representation, and the CG resolution level.
molecules-single-level.json: Similar to the above, but for the standard BO algorithm performed at the highest CG resolution only.
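
The queueing idea behind the simulation module can be illustrated with a minimal sketch. This is not the chespex implementation: the function names `run_job` and `run_queue`, the job format, and the use of a thread pool are assumptions made for illustration, and the commands are placeholders for whatever GROMACS call a real job would run.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_job(directory, command):
    """Run one command inside its simulation directory and return the exit code."""
    result = subprocess.run(command, cwd=directory, capture_output=True)
    return result.returncode


def run_queue(jobs, max_parallel=2):
    """Run (directory, command) jobs, at most `max_parallel` at a time.

    Returns a dict mapping each directory to the exit code of its command.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {str(d): pool.submit(run_job, d, cmd) for d, cmd in jobs}
        return {d: f.result() for d, f in futures.items()}


# Placeholder jobs: each "simulation" only writes a marker file.
# A real job would call GROMACS here instead (hypothetical usage).
base = Path(tempfile.mkdtemp())
jobs = []
for name in ["C,", "N,", "P,"]:
    directory = base / name
    directory.mkdir()
    jobs.append((directory, [sys.executable, "-c", "open('done.txt', 'w').close()"]))

exit_codes = run_queue(jobs, max_parallel=2)
print(exit_codes)
```

Threads are sufficient in this sketch because each worker only waits for an external process. Distributing jobs over several machines, as the simulation module supports, would need a shared job store and is not covered here.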

Remarks

CG molecules are generally treated as graphs with fixed bead-type dependent bond lengths and no angle or dihedral potentials. Molecules are represented as a string by concatenating the bead types and the bond indices. For example, a molecule with bead types A, B, and C and bonds between A and B and between B and C is represented as A B C,0-1 1-2. The Molecule class implements a function to convert molecules to this string representation.
The repository includes neither a list of all enumerated molecules nor their latent space representations due to the size of the files. However, the molecule enumeration can easily be repeated with the MoleculeGenerator, as shown in the 2_molecule_enumeration/1_generate-molecules.ipynb notebook. The 3_autoencoder folder contains the trained autoencoder models. The 2_molecule_enumeration/2_insert-latent-space.ipynb notebook shows how to generate latent space representations for all enumerated molecules based on the trained autoencoder models.
The low, medium, and high CG resolutions are generally referred to as level 0, level 1, and level 2, respectively.
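
The string representation described in the first remark can be sketched in a few lines. The helpers `to_string` and `from_string` are hypothetical and written only for illustration; they are not the Molecule class API.

```python
def to_string(bead_types, bonds):
    """Join bead types with spaces and bonds as 'i-j' pairs, separated by a comma."""
    beads = " ".join(bead_types)
    bond_part = " ".join(f"{i}-{j}" for i, j in bonds)
    return f"{beads},{bond_part}"


def from_string(text):
    """Inverse of to_string: return (bead_types, bonds)."""
    beads, _, bond_part = text.partition(",")
    bead_types = beads.split()
    bonds = [tuple(int(k) for k in pair.split("-")) for pair in bond_part.split()]
    return bead_types, bonds


example = to_string(["A", "B", "C"], [(0, 1), (1, 2)])
print(example)               # A B C,0-1 1-2
print(from_string(example))  # (['A', 'B', 'C'], [(0, 1), (1, 2)])
print(to_string(["C"], []))  # C,
```

Under this convention, a single bead without bonds ends in a trailing comma, which has the same form as the directory names used for the single-bead simulations in the optimization procedure.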

Setup

It is recommended to create a new Python environment (e.g. with conda or uv) with Python version 3.11. The following commands install the helper package chespex and other dependencies for the optimization procedure.

```
# The following command is only needed if you want to install the CUDA version of PyTorch
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu118
# Install all required packages
pip install -r requirements.txt
```

Other requirements:

GROMACS is required to run the MD simulations. We used GROMACS version 2024.2 for all simulations. See the GROMACS installation guide for instructions on how to install it.
A local MongoDB database is used to store and retrieve the enumerated molecules. It is also possible to use a remote MongoDB database. With adapted code, any other database with multiple indexes can be used as well. The indexes are used for a fast retrieval of molecules.
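
To illustrate the remark that any database with multiple indexes could be used, here is a minimal sketch with Python's built-in sqlite3 module standing in for MongoDB. The table layout, column names, and bead names are assumptions for illustration and do not reflect the schema used in this repository.

```python
import sqlite3

# In-memory database standing in for the molecule store (illustration only).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE molecules ("
    " name TEXT PRIMARY KEY,"  # string representation of the molecule
    " level INTEGER,"          # CG resolution level (0, 1, or 2)
    " parent TEXT"             # name of the parent molecule at the lower level
    ")"
)
# Several indexes so that lookups by level or by parent avoid full table scans.
connection.execute("CREATE INDEX idx_level ON molecules (level)")
connection.execute("CREATE INDEX idx_parent ON molecules (parent)")

# Hypothetical molecules and parent assignments.
rows = [
    ("A A,0-1", 0, None),
    ("A1 A2,0-1", 1, "A A,0-1"),
    ("A1 A3,0-1", 1, "A A,0-1"),
]
connection.executemany("INSERT INTO molecules VALUES (?, ?, ?)", rows)

children = [
    row[0]
    for row in connection.execute(
        "SELECT name FROM molecules WHERE parent = ? ORDER BY name", ("A A,0-1",)
    )
]
print(children)  # ['A1 A2,0-1', 'A1 A3,0-1']
```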

Optimization procedure

Bead type generation → 1_bead_types

force-fields.ipynb: This notebook generates a modified martini.itp force field file with custom bead types for low-resolution CG models. The lower resolution bead types are called K and L for the low and medium resolution models, respectively. This file is required for the GROMACS simulations of lower resolution models.
prepare-bead-types.ipynb: This notebook generates the files mapping.json and bead-types.json which contain the mapping of the bead types between CG resolutions and bead type features, respectively. These files are used for the enumeration and encoding of the molecules.
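
As a rough sketch of how a bead-type mapping between resolutions might be applied, the example below replaces each bead type with a lower-resolution counterpart. The mapping entries, the function name `to_lower_resolution`, and the assumption that the molecular graph stays the same while only bead types change are all illustrative; the actual structure of mapping.json may differ.

```python
# Hypothetical mapping from higher-resolution to lower-resolution bead types.
# The real mapping.json may be structured differently.
mapping = {"P1": "P", "P2": "P", "C1": "C", "C2": "C"}


def to_lower_resolution(bead_types, bonds, mapping):
    """Replace every bead type by its lower-resolution counterpart; keep the bonds."""
    return [mapping[bead] for bead in bead_types], list(bonds)


parent_beads, parent_bonds = to_lower_resolution(
    ["P1", "C2", "C1"], [(0, 1), (1, 2)], mapping
)
print(parent_beads, parent_bonds)  # ['P', 'C', 'C'] [(0, 1), (1, 2)]
```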

Molecule enumeration (part 1) → 2_molecule_enumeration

1_generate-molecules.ipynb: This notebook generates all possible molecular graphs for a given maximum number of CG beads based on the previously generated bead types. The molecules are stored in a MongoDB database.
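
The idea of enumerating all molecular graphs can be illustrated with a brute-force sketch for very small molecules. This is not the algorithm used by the MoleculeGenerator: it assumes that any connected graph (including rings) is allowed, removes duplicates by trying every node permutation, and therefore only scales to a handful of beads. All function names are hypothetical.

```python
from itertools import combinations, permutations, product


def is_connected(n, edges):
    """Check connectivity of a graph with nodes 0..n-1 by a simple traversal."""
    neighbors = {i: set() for i in range(n)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbors[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == n


def canonical(types, edges):
    """Smallest relabeling over all node permutations; identical for isomorphic graphs."""
    n = len(types)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        new_types = [None] * n
        for old, new in enumerate(perm):
            new_types[new] = types[old]
        new_edges = tuple(sorted(tuple(sorted((perm[i], perm[j]))) for i, j in edges))
        candidate = (tuple(new_types), new_edges)
        if best is None or candidate < best:
            best = candidate
    return best


def enumerate_molecules(bead_types, n_beads):
    """Brute-force enumeration of distinct connected graphs with typed nodes."""
    pairs = list(combinations(range(n_beads), 2))
    unique = set()
    for types in product(bead_types, repeat=n_beads):
        # A connected graph on n nodes needs at least n - 1 edges.
        for k in range(n_beads - 1, len(pairs) + 1):
            for edges in combinations(pairs, k):
                if is_connected(n_beads, edges):
                    unique.add(canonical(types, edges))
    return sorted(unique)


molecules = enumerate_molecules(["A", "B"], 2)
for types, edges in molecules:
    print(" ".join(types) + "," + " ".join(f"{i}-{j}" for i, j in edges))
```

For two bead types and two beads, this prints the three distinct molecules `A A,0-1`, `A B,0-1`, and `B B,0-1`.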

Autoencoder training → 3_autoencoder

training.ipynb: This notebook trains an autoencoder model for the latent space representation of the enumerated molecules. The model is trained on the enumerated molecules in the MongoDB database. Trained models are saved in the 3_autoencoder folder.

Molecule enumeration (part 2) → 2_molecule_enumeration

2_insert-latent-space.ipynb: This notebook generates the latent space representation of all enumerated molecules using the trained autoencoder models. The latent space representations are stored in the MongoDB database.
3_insert-parents.ipynb: This notebook generates the parent-child relationships between the enumerated molecules at different CG resolutions. The parent-child relationships are stored in the MongoDB database for a fast retrieval of the molecules based on their parent or child molecules.
4_index-latent-space.ipynb: This notebook generates a cell list for the latent space and a corresponding index. This cell list is used for a fast retrieval of neighboring molecules in the latent space.
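
A cell list speeds up neighbor searches by binning points into grid cells, so that a query only inspects the cell containing the query point and its adjacent cells. The following is a minimal sketch of that general technique with made-up two-dimensional coordinates; the function names and the in-memory dictionary are assumptions and do not mirror the notebook or the database index.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def build_cell_list(points, cell_size):
    """Map each grid cell (tuple of integer indices) to the indices of points inside it."""
    cells = defaultdict(list)
    for index, point in enumerate(points):
        cell = tuple(floor(x / cell_size) for x in point)
        cells[cell].append(index)
    return cells


def neighbors_within(query, points, cells, cell_size, radius):
    """Return indices of points within `radius` of `query`, searching only nearby cells.

    Requires radius <= cell_size so that the adjacent cells cover the search sphere.
    """
    if radius > cell_size:
        raise ValueError("radius must not exceed cell_size")
    center = tuple(floor(x / cell_size) for x in query)
    found = []
    for offset in product((-1, 0, 1), repeat=len(center)):
        cell = tuple(c + o for c, o in zip(center, offset))
        for index in cells.get(cell, []):
            if dist(query, points[index]) <= radius:
                found.append(index)
    return sorted(found)


# Toy 2D "latent space" (made-up coordinates).
points = [(0.1, 0.2), (0.4, 0.1), (2.5, 2.5), (0.9, 1.1), (-0.3, 0.0)]
cells = build_cell_list(points, cell_size=1.0)
nearby = neighbors_within((0.0, 0.0), points, cells, cell_size=1.0, radius=0.5)
print(nearby)  # [0, 1, 4]
```

The number of cells inspected per query grows as 3^d with the dimension d, so this approach is practical only for low-dimensional latent spaces.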

Membrane system setup → 4_membrane_setup

generate_membrane.py: This script uses the program Insane to generate a membrane system. After the generation, the system is minimized and equilibrated with GROMACS. The script can be called with various arguments:

```
usage: generate_membrane.py [-h] [-t {DPPC,DIPC,MIX}] [-s SIZE] [-z HEIGHT] [-d DIRECTORY]
options:
-h                    Show help message
-t {DPPC,DIPC,MIX}    Type of membrane to setup
-s SIZE               X and Y size of the membrane
-z HEIGHT             Height of the simulation box
-d DIRECTORY          Directory to store the membrane files
```

Molecule optimization → 5_optimize

Generate simulation directories for all single-bead molecules at the lowest CG resolution to obtain a prior for the lowest CG resolution:
```
cd 5_optimize
for m in C N P Q+ Q- SC SN SP SQ+ SQ- TC TN TP TQ+ TQ-; do
    mkdir -p "simulations/level-0/$m,"
done
```

Run simulations for all single-bead molecules at the lowest CG resolution:

```
python simulation_helper.py
# After the simulations are finished, we calculate the free energies
python run_mbar.py
```

initialize.ipynb: This notebook generates the initialization molecules for the optimization.

Once again, we run the simulation_helper.py script to run the simulations for the initialization molecules:

```
python simulation_helper.py
# After the simulations are finished, we calculate the free energies
python run_mbar.py
```

optimize.py: This script runs the multi-level Bayesian optimization algorithm. It continues the optimization until interrupted by the user.

Single-level optimization with standard BO → 5_optimize

initialize-single-level.ipynb: This notebook generates the initialization molecules for the standard BO.

We run the simulation_helper.py script to run the simulations for the initialization molecules:

```
python simulation_helper.py
# After the simulations are finished, we calculate the free energies
python run_mbar.py level-2
```

single-level-helper-files/create.py: Execute this script to generate NumPy files of the high-resolution latent space. These files speed up the evaluation of the acquisition function in the subsequent optimization step.
optimize-single-level.py: This script runs the standard Bayesian optimization algorithm. It continues the optimization until interrupted by the user.

BereauLab/Multi-Level-BO-w-Hierarchical-CG
