This repository contains the source code for the paper "Navigating Chemical Space: Multi-Level Bayesian Optimization with Hierarchical Coarse-Graining".
If you find this work useful in your research, please consider citing:
@article{Walter2025,
title = {Navigating Chemical Space: Multi-Level Bayesian Optimization with Hierarchical Coarse-Graining},
url = {http://dx.doi.org/10.1039/D5SC03855C},
DOI = {10.1039/d5sc03855c},
journal = {Chem. Sci.},
publisher = {The Royal Society of Chemistry},
author = {Walter, Luis J. and Bereau, Tristan},
year = {2025}
}
The repository Molecule Optimization w Hierarchical CG Tutorial contains a tutorial notebook that explains multi-level Bayesian optimization and hierarchical coarse-graining for simple two-bead molecules in a water-hexane mixture.
The following presents the main components of the repository:
- chespex: A small Python package that helps with the implementation of the multi-level Bayesian optimization (BO) algorithm for coarse-grained (CG) molecules. It does not contain application-specific details, but rather provides the general framework for the algorithm. It includes the following components:
  - molecules: This module contains a Molecule and a MoleculeGenerator class. The Molecule class is used to represent a molecule and its properties, while the MoleculeGenerator class is used to enumerate all possible molecular graphs for a given number of CG beads.
  - encoding: This module contains the model for the Encoder and Decoder of the molecular Autoencoder. It also contains the permutation-invariant loss function used to train the autoencoder.
  - optimization: This module includes an implementation of a simple Dataframe class that is used to store the data for the optimization process. It also includes a GPyTorch-based GaussianProcess class used for the Bayesian optimization.
  - simulation: This module contains simulation utilities for a high-throughput molecular dynamics (MD) simulation workflow. It includes a Simulation and a SimulationSetup class that provide a Python interface to GROMACS simulations. Together with the simulator functions, they implement a simple queueing system to run multiple GROMACS simulations in parallel and/or on different machines.
- Application-specific code in the 1_bead_types, 2_molecule_enumeration, 3_autoencoder, 4_membrane_setup, and 5_optimize folders. These folders contain the code used to optimize for a small molecule that enhances phospholipid phase separation. The numbers indicate the execution sequence. See below for further details.
- toy-mol-optimization: This folder contains a Jupyter notebook which was used for the toy example shown in the supplementary information of the paper. It is a simplified (non-invariant) system used to compare the performance of the multi-level BO algorithm with a standard BO algorithm.
- molecules.json: This JSON file contains the list of molecules that were selected by the multi-level BO algorithm and evaluated via MD simulations. For each molecule, it also lists the resulting demixing free-energy difference, the latent space representation, and the CG resolution level.
- molecules-single-level.json: Similar to the above, but for the standard BO algorithm performed at the highest CG resolution only.
- CG molecules are generally treated as graphs with fixed bead-type-dependent bond lengths and no angle or dihedral potentials. Molecules are represented as a string by concatenating the bead types and the bond indices. For example, a molecule with bead types A, B, and C and bonds between A and B and between B and C is represented as A B C,0-1 1-2. The Molecule class implements a function to convert molecules to this string representation.
- The repository includes neither a list of all enumerated molecules nor their latent space representations due to the size of the files. However, the molecule enumeration can be easily repeated by the MoleculeGenerator as shown in the 2_molecule_enumeration/1_generate-molecules.ipynb notebook. The 3_autoencoder folder contains the trained autoencoder models. The 2_molecule_enumeration/2_insert-latent-space.ipynb notebook shows how to generate latent space representations for all enumerated molecules based on the trained autoencoder models.
- The low, medium, and high CG resolutions are generally referred to as level 0, level 1, and level 2, respectively.
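The string representation described in the notes above can be illustrated with a minimal sketch. The two helper functions below are hypothetical and are not part of chespex (the Molecule class provides its own conversion); they only assume the format given in the example, i.e. space-separated bead types, a comma, then space-separated bond index pairs.

```python
def molecule_to_string(bead_types, bonds):
    """Join bead types and bond index pairs into the 'A B C,0-1 1-2' form.

    Hypothetical helper for illustration only, not the chespex API.
    """
    beads = " ".join(bead_types)
    bond_part = " ".join(f"{i}-{j}" for i, j in bonds)
    return f"{beads},{bond_part}"


def string_to_molecule(text):
    """Split the string form back into bead types and bond index pairs."""
    # Only the part after the comma is split on '-', so bead names
    # such as 'Q-' are not affected.
    beads, _, bond_part = text.partition(",")
    bonds = [tuple(int(k) for k in bond.split("-")) for bond in bond_part.split()]
    return beads.split(), bonds


print(molecule_to_string(["A", "B", "C"], [(0, 1), (1, 2)]))  # A B C,0-1 1-2
print(molecule_to_string(["C"], []))  # C,
```

Under this assumed format, a single bead without bonds ends in a trailing comma, which would be consistent with directory names such as simulations/level-0/C, used in the optimization step.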
It is recommended to create a new Python environment (e.g. with conda or uv) with Python version 3.11. The following commands install the helper package chespex and other dependencies for the optimization procedure.
# The following command is only needed if you want to install the CUDA version of PyTorch
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu118
# Install all required packages
pip install -r requirements.txt
Other requirements:
- GROMACS is required to run the MD simulations. We used GROMACS version 2024.2 for all simulations. See the official GROMACS installation guide for setup instructions.
- A local MongoDB database is used to store and retrieve the enumerated molecules. It is also possible to use a remote MongoDB database. With adapted code, any other database with multiple indexes can be used as well. The indexes are used for a fast retrieval of molecules.
Bead type generation → 1_bead_types
- force-fields.ipynb: This notebook generates a modified martini.itp force field file with custom bead types for low-resolution CG models. The lower resolution bead types are called K and L for the low and medium resolution models, respectively. This file is required for the GROMACS simulations of lower resolution models.
- prepare-bead-types.ipynb: This notebook generates the files mapping.json and bead-types.json, which contain the mapping of the bead types between CG resolutions and the bead type features, respectively. These files are used for the enumeration and encoding of the molecules.
Molecule enumeration (part 1) → 2_molecule_enumeration
- 1_generate-molecules.ipynb: This notebook generates all possible molecular graphs for a given maximum number of CG beads based on the previously generated bead types. The molecules are stored in a MongoDB database.
Autoencoder training → 3_autoencoder
- training.ipynb: This notebook trains an autoencoder model for the latent space representation of the enumerated molecules. The model is trained on the enumerated molecules in the MongoDB database. Trained models are saved in the 3_autoencoder folder.
Molecule enumeration (part 2) → 2_molecule_enumeration
- 2_insert-latent-space.ipynb: This notebook generates the latent space representation of all enumerated molecules using the trained autoencoder models. The latent space representations are stored in the MongoDB database.
- 3_insert-parents.ipynb: This notebook generates the parent-child relationships between the enumerated molecules at different CG resolutions. The parent-child relationships are stored in the MongoDB database for a fast retrieval of the molecules based on their parent or child molecules.
- 4_index-latent-space.ipynb: This notebook generates a cell list for the latent space and a corresponding index. This cell list is used for a fast retrieval of neighboring molecules in the latent space.
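The cell list idea mentioned for 4_index-latent-space.ipynb can be sketched as follows. This is a generic, in-memory illustration under stated assumptions, not the notebook's implementation: the function names, the cell size, and the use of a Python dictionary instead of a MongoDB index are all hypothetical.

```python
from collections import defaultdict
from itertools import product
import math


def cell_of(point, cell_size):
    """Integer grid coordinates of the cell containing a point."""
    return tuple(math.floor(x / cell_size) for x in point)


def build_cell_list(points, cell_size):
    """Map each occupied cell to the indices of the points inside it."""
    cells = defaultdict(list)
    for index, point in enumerate(points):
        cells[cell_of(point, cell_size)].append(index)
    return cells


def neighbors_within(query, points, cells, cell_size, radius):
    """Indices of points within `radius` of `query`.

    Only the query cell and its direct neighbors are searched, which is
    sufficient as long as radius <= cell_size.
    """
    if radius > cell_size:
        raise ValueError("radius must not exceed cell_size")
    center = cell_of(query, cell_size)
    found = []
    for offset in product((-1, 0, 1), repeat=len(center)):
        cell = tuple(c + o for c, o in zip(center, offset))
        for index in cells.get(cell, ()):
            if math.dist(query, points[index]) <= radius:
                found.append(index)
    return sorted(found)


# Hypothetical 2D latent points, chosen only for illustration
points = [(0.1, 0.1), (0.4, 0.2), (2.5, 2.5), (0.9, 1.1)]
cells = build_cell_list(points, cell_size=1.0)
print(neighbors_within((0.2, 0.2), points, cells, cell_size=1.0, radius=0.5))  # [0, 1]
```

The number of neighboring cells grows as 3^d with the latent dimension d, so a scheme like this is practical mainly for low-dimensional latent spaces.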
Membrane system setup → 4_membrane_setup
- generate_membrane.py: This script uses the program Insane to generate a membrane system. After the generation, the system is minimized and equilibrated with GROMACS. The script can be called with various arguments:
    usage: generate_membrane.py [-h] [-t {DPPC,DIPC,MIX}] [-s SIZE] [-z HEIGHT] [-d DIRECTORY]

    options:
      -h                  Show help message
      -t {DPPC,DIPC,MIX}  Type of membrane to setup
      -s SIZE             X and Y size of the membrane
      -z HEIGHT           Height of the simulation box
      -d DIRECTORY        Directory to store the membrane files
Molecule optimization → 5_optimize
- Generate simulation directories for all single-bead molecules at the lowest CG resolution to obtain a prior for the lowest CG resolution:
    cd 5_optimize
    for m in C N P Q+ Q- SC SN SP SQ+ SQ- TC TN TP TQ+ TQ-; do
        mkdir -p "simulations/level-0/$m,"
    done
- Run simulations for all single-bead molecules at the lowest CG resolution:
    python simulation_helper.py
    # After the simulations are finished, we calculate the free energies
    python run_mbar.py
- initialize.ipynb: This notebook generates the initialization molecules for the optimization.
- Once again, we run the simulation_helper.py script to run the simulations for the initialization molecules:
    python simulation_helper.py
    # After the simulations are finished, we calculate the free energies
    python run_mbar.py
- optimize.py: This script runs the multi-level Bayesian optimization algorithm. It continues the optimization until interrupted by the user.
Single-level optimization with standard BO → 5_optimize
- initialize-single-level.ipynb: This notebook generates the initialization molecules for the standard BO.
- We run the simulation_helper.py script to run the simulations for the initialization molecules:
    python simulation_helper.py
    # After the simulations are finished, we calculate the free energies
    python run_mbar.py level-2
- single-level-helper-files/create.py: Execute this script to generate NumPy files of the high-resolution latent space. These files speed up the acquisition function evaluation in the subsequent optimization step.
- optimize-single-level.py: This script runs the standard Bayesian optimization algorithm. It continues the optimization until interrupted by the user.
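Both optimization scripts use an acquisition function to pick the next molecule to simulate. This README does not state which acquisition function is used, so the following is only a generic sketch of one common choice, expected improvement, assuming that lower free-energy differences are better and that a Gaussian process supplies a predictive mean and standard deviation per candidate. All names and numbers below are hypothetical.

```python
import math


def expected_improvement(mean, std, best):
    """Expected improvement for minimization, given a Gaussian prediction.

    mean, std: predictive mean and standard deviation of a candidate
    best: lowest objective value observed so far
    """
    improvement = best - mean
    if std <= 0.0:
        # Without predictive uncertainty, EI reduces to the plain improvement
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + std * pdf


def select_next(candidates, best):
    """Return the name of the candidate with the highest expected improvement.

    candidates: iterable of (name, mean, std) tuples
    """
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]


# Hypothetical predictions for three candidate molecules
candidates = [("a", 0.0, 0.1), ("b", -0.5, 0.1), ("c", 0.0, 1.0)]
print(select_next(candidates, best=0.0))  # b
```

Precomputed latent space arrays, such as those generated by single-level-helper-files/create.py, would allow a function like this to be evaluated over all candidates without querying the database for each one.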