VibraCLIP is a multi-modal framework, inspired by the CLIP model [1], that integrates molecular graph representations with infrared (IR) and Raman spectra from the QM9S dataset [2] and experimental data, leveraging machine learning to capture the complex relationships between molecular structure and vibrational spectroscopy. By aligning these modalities in a shared representation space, VibraCLIP enables precise molecular identification, bridging the gap between spectral data and molecular interpretation.
After installing conda, run the following commands to create a new environment named `vibraclip_cpu` or `vibraclip_gpu` and install the dependencies:
```bash
conda env create -f env_gpu.yml
conda activate vibraclip_gpu
pre-commit install
```
We recommend using `lightning` instead of the deprecated `pytorch_lightning`, by implementing the changes suggested in the following pull request to `pytorch_geometric`: link.
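As a rough illustration (a sketch, not the exact diff from that pull request), the change amounts to importing from the unified `lightning` package instead of the standalone `pytorch_lightning` one:

```python
# Deprecated import style
# import pytorch_lightning as pl

# Replacement using the unified "lightning" package; the Trainer/LightningModule
# API stays the same, only the import path changes.
import lightning.pytorch as pl

trainer = pl.Trainer(max_epochs=1, accelerator="auto")
```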
To generate the LMDB file, you first need to place a pickle file with all the raw information in the `data` folder. We provide both the pickle file (see supplementary data) and the generation script in the `scripts` folder, so the user can either use our pickle file to re-generate the LMDB file or create a new pickle file from the original QM9S dataset [2]. Then, once the pickle file is in place, generate the LMDB file with the `create_lmdb.py` script as follows:
```python
from vibraclip.preprocessing.graph import QM9Spectra

# Paths
data_path = "./data/qm9s_ir_raman.pkl"
db_path = "./data/qm9s_ir_raman"

# LMDB Generator
extractor = QM9Spectra(
    data_path=data_path,  # Path where the pickle file is placed
    db_path=db_path,      # Path where the LMDB file will be located
    spectra_dim=1750,     # Interpolate both IR and Raman spectra to this dimension
)

# Run
extractor.get_lmdb()
# extractor.get_pickle()
```
This method automatically generates the molecular graph representations and stores the processed IR and Raman spectra, together with other metadata, inside the PyG `Data` objects. Finally, the LMDB file is exported to the same `data` folder.
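To sanity-check the exported file, a minimal sketch along these lines can be used, assuming the entries are pickled PyG `Data` objects (the exact key scheme and serialization are defined by `QM9Spectra`, so adjust accordingly):

```python
import pickle
import lmdb

# Open the generated LMDB read-only; depending on how it was written, the path
# may point to a directory (subdir=True) or to a single file (subdir=False).
env = lmdb.open("./data/qm9s_ir_raman", subdir=False, readonly=True, lock=False)
with env.begin() as txn, txn.cursor() as cursor:
    for key, value in cursor:
        sample = pickle.loads(value)  # assumed: pickled PyG Data object
        print(key, sample)
        break  # inspect only the first entry
env.close()
```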
To train VibraCLIP, we use hydra to configure the model's hyperparameters and training settings through the `config.yaml` file stored in the `configs` folder. We provide the general configuration file `config.yaml` for pre-training the model and `config_ft.yaml` for the fine-tuning (realignment) stage with the QM9S external dataset from PubChem and experimental data. We refer the user to the yaml files for all the hyperparameters used in our work.
Please change the experiment id inside the `config.yaml` file to a label that tracks your experiments (e.g., `id: "vibraclip_graph_ir_mass_01"`). Also, inside `paths`, the `root_dir` tag should be changed to the path where `vibraclip` is cloned (e.g., `root_dir: "/home/USER/vibraclip"`), as sketched below.
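For reference, the relevant part of the configuration might look roughly like this (a sketch based only on the fields mentioned above; the actual file contains many more settings):

```yaml
id: "vibraclip_graph_ir_mass_01"    # label used to track your experiments

paths:
  root_dir: "/home/USER/vibraclip"  # path where the repository is cloned
```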
VibraCLIP considers different scenarios for training, depending on the included modalities:
To train VibraCLIP only on the Graph-IR relationship, use the following command:
```bash
python main_ir.py --config-name config.yaml
```
Then, to train VibraCLIP on the Graph-IR-Raman relationships, use the following command:
```bash
python main_ir_raman.py --config-name config.yaml
```
Note that both models can be trained using the same `config.yaml` file.
The model's checkpoint files are stored automatically in the `checkpoints` folder, and the `RetrievalAccuracy` callbacks will save a pickle file in the `outputs` folder for further analysis of the model's performance on the test dataset. These outputs can be visualized with the provided Jupyter notebooks (see the Evaluate vibraclip performance section).
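If you prefer to inspect a callback output outside the notebooks, it is a regular pickle file and can be loaded directly (the filename below is hypothetical; use whichever files appear in `outputs`):

```python
import pickle
from pathlib import Path

# Load one RetrievalAccuracy callback output for ad-hoc inspection.
# The filename is hypothetical; use the files actually written to outputs/.
output_file = Path("outputs") / "retrieval_accuracy_test.pkl"
with output_file.open("rb") as f:
    results = pickle.load(f)

print(type(results))
```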
We strongly recommend using the wandb platform to track the training/validation/testing loss functions during execution.
When training starts, a `processed` folder is created inside the `data` directory; if another training run is launched, the system will automatically reuse the processed data. To force reprocessing, simply delete the `processed` folder.
For HPO we use the optuna Python library to optimize both the model's architecture and the training hyperparameters. Since VibraCLIP is a multi-modal framework, we use a multi-objective optimization strategy that minimizes both the validation loss associated with the graph representation and the validation loss from the spectra. We recommend the user look at the `main_optuna.py` script before launching an HPO experiment.
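As a rough, self-contained sketch of what such a multi-objective study looks like with optuna (the real search space and objectives live in `main_optuna.py`; the ones below are placeholders):

```python
import optuna


def objective(trial: optuna.Trial) -> tuple[float, float]:
    # Placeholder search space; the real one covers VibraCLIP's architecture
    # and training hyperparameters (see main_optuna.py).
    hidden_dim = trial.suggest_categorical("hidden_dim", [128, 256, 512])
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)

    # Placeholder objectives standing in for the two validation losses
    # (graph side and spectra side) obtained after training with these settings.
    val_loss_graph = (hidden_dim / 512) * lr * 10
    val_loss_spectra = (512 / hidden_dim) * lr * 5
    return val_loss_graph, val_loss_spectra


# Multi-objective study: minimize both validation losses simultaneously.
study = optuna.create_study(directions=["minimize", "minimize"])
study.optimize(objective, n_trials=20)
print(study.best_trials)  # Pareto-optimal trials
```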
We provide two Jupyter notebooks in the `notebooks` folder, along with the pickle files containing all the testing data (see the supplementary data section), to analyze and visualize the performance of VibraCLIP and reproduce the plots in the publication manuscript.
- `notebooks/vibraclip_metrics.ipynb`: plots the retrieval accuracy of the test set and the chemical spaces based on Top-K.
- `notebooks/vibraclip_plots.ipynb`: the actual retrieval accuracy plots from the publication for better comparison.
We also include a `notebooks/figures` folder with all the performance and molecular grid figures from the publication. The `notebooks/outputs` folder is where the callback pickle files should be placed for analysis.
Inside the `Makefile` there are a few handy commands to streamline cleaning tasks:
```bash
make wandb-sync   # In case of using wandb offline
make clean-data   # Remove the processed folder from PyG
make clean-all    # Clean __pycache__ folders and other unnecessary files
```
The supplementary data has been published in a Zenodo repository, providing the datasets in both pickle and LMDB formats, the pre-trained VibraCLIP checkpoints for all experiments, and the output callback pickle files to ensure full reproducibility of the reported results.
To reproduce the reported results, the dataset pickle and LMDB files should be placed in the `data` folder, the pre-trained checkpoints in the `pre_trained` folder, and the outputs from the callbacks can be visualized by placing them within the `notebooks/outputs` folder.
The authors thank the Institute of Chemical Research of Catalonia (ICIQ) Summer Fellow Program for its support. We also acknowledge the Department of Research and Universities of the Generalitat de Catalunya for funding through grant (reference: SGR-01155). Additionally, we are grateful to Dr. Georgiana Stoica and Mariona Urtasun from the ICIQ Research Support Area (Spectroscopy and Material Characterization Unit) for their valuable assistance. Computational resources were provided by the Barcelona Supercomputing Center (BSC), which we gratefully acknowledge.
VibraCLIP is released under the MIT license.
If you use this codebase in your work, please consider citing:
```bibtex
@article{vibraclip,
  title   = {Multi-Modal Contrastive Learning for Chemical Structure Elucidation with VibraCLIP},
  author  = {Pau Rocabert-Oriols and Nuria López and Javier Heras-Domingo},
  journal = {submitted},
  year    = {2025},
  doi     = {},
}
```
[1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sutskever, I. Learning transferable visual models from natural language supervision. ICML, 2021, 8748-8763, URL
[2] Zou, Z., Zhang, Y., Liang, L., Wei, M., Leng, J., Jiang, J., Hu, W. A deep learning model for predicting selected organic molecular spectra. Nature Computational Science, 2023, 3(11), 957-964, URL