Conversation
Pull request overview
This pull request adds a Parquet storage backend to FPSim2 and updates Python version requirements. The PR is titled "Py314" but primarily introduces a new storage backend using the Apache Parquet format as an alternative to the PyTables HDF5 and SQLAlchemy backends. It also drops Python 3.10 support, raising the minimum Python version to 3.11, and attempts to add Python 3.14 support (though Python 3.14 has not been officially released yet).
Changes:
- Adds ParquetStorageBackend as a new storage backend option with support for creating Parquet files from molecule sources and converting existing HDF5 files to Parquet format
- Drops Python 3.10 support and raises minimum Python version to 3.11
- Attempts to add Python 3.14 support (including free-threaded builds) across CI workflows and package classifiers
- Updates pybind11 requirement from >=2.2 to >=2.13
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| FPSim2/io/backends/parquet.py | New file implementing ParquetStorageBackend with create_parquet_file, h5_to_parquet conversion, and backend loading functionality |
| FPSim2/io/backends/__init__.py | Exports ParquetStorageBackend for use by other modules |
| FPSim2/base.py | Integrates Parquet backend into BaseEngine with validation for required parameters |
| FPSim2/FPSim2.py | Adds validation requiring in-memory fingerprints for Parquet backend |
| tests/test_parquet.py | Comprehensive test suite for Parquet backend covering file creation, loading, searching, and HDF5 conversion |
| pyproject.toml | Updates Python version classifiers, minimum version requirement, and pybind11 dependency |
| .github/workflows/wheels.yml | Removes Python 3.10, adds Python 3.14 (including free-threaded variants) to build matrix |
| .github/workflows/ci.yml | Updates Python version matrix for CI testing |
Comments suppressed due to low confidence (1)
pyproject.toml:37
- PyArrow is required for the Parquet storage backend but is not listed in the dependencies. While the code handles the missing dependency gracefully with HAS_PYARROW flag, users who want to use the Parquet backend will encounter import errors. Consider either: 1) adding pyarrow to dependencies, or 2) creating an optional extras group like [parquet] that includes pyarrow, similar to how some packages handle optional dependencies.
```toml
dependencies = [
    "rdkit>=2022.3.3",
    "tables>=3.10",
    "numpy>=1.14",
    "sqlalchemy>=1.4.47",
    "scipy",
    "tqdm"
]
```
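One way to express the extras-group option from the comment above — a sketch of a hypothetical `[project.optional-dependencies]` table for pyproject.toml (the table contents are illustrative, not taken from the PR):

```toml
[project.optional-dependencies]
# Installed via: pip install FPSim2[parquet]
parquet = ["pyarrow"]
```

This keeps the base install lean while letting users opt in to the Parquet backend explicitly.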
FPSim2/io/backends/parquet.py (Outdated)

```python
"""Parquet storage backend for FPSim2."""

from typing import Dict, Iterable as IterableType, Tuple, Union
from .base import BaseStorageBackend
import numpy as np
import rdkit
import math
import json
import time
from importlib.metadata import version

__version__ = version("FPSim2")

try:
    import pyarrow.parquet as pq
    import pyarrow as pa
```
Missing documentation for the new Parquet storage backend. The codebase documents alternative storage backends (see user_guide/sql_backend.md), so the Parquet backend should also be documented. Consider adding a user guide that explains how to create Parquet files, convert from HDF5, and use the Parquet backend with FPSim2Engine.
pyproject.toml

```diff
@@ -1,5 +1,5 @@
 [build-system]
```
The PR title "Py314" is misleading. Python 3.14 has not been officially released yet, and the main contribution of this PR is adding Parquet storage backend support, not Python 3.14 compatibility. Consider a more descriptive title like "Add Parquet storage backend and update Python version support" or "Add Parquet backend, drop Python 3.10".
FPSim2/io/backends/parquet.py (Outdated)

```python
        return fp_type, fp_params, rdkit_ver, fpsim2_ver

    def load_fps(self) -> None:
```
The ParquetStorageBackend does not follow the abstract method signature from BaseStorageBackend. The abstract method expects load_fps(in_memory_fps, fps_sort) but this implementation has load_fps() with no parameters. While the current implementation always loads into memory and always sorts, the method signature should match the base class for consistency, even if some parameters are ignored.
Suggested change:

```diff
-    def load_fps(self) -> None:
+    def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:
```
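Keeping the override signature-compatible with the abstract base lets callers treat all backends uniformly even when a backend ignores some parameters. A minimal sketch of the pattern (class and attribute names here are hypothetical, not FPSim2's actual classes):

```python
from abc import ABC, abstractmethod


class BaseBackend(ABC):
    @abstractmethod
    def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:
        ...


class AlwaysInMemoryBackend(BaseBackend):
    def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:
        # Parameters accepted for interface compatibility; this backend
        # always loads into memory and always sorts, so they are ignored.
        self.loaded = True


backend = AlwaysInMemoryBackend()
backend.load_fps(in_memory_fps=False)  # callable through the shared signature
```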
```diff
 urls = { Homepage = "https://github.com/chembl/FPSim2" }
 license = { text = "MIT" }
-requires-python = ">=3.10"
+requires-python = ">=3.11"
```
Dropping support for Python 3.10 while the package is still in Beta (as indicated by the classifier "Development Status :: 4 - Beta") may affect users. Consider documenting this breaking change in the release notes or changelog, and ensure the version number follows semantic versioning to indicate the breaking change.
FPSim2/io/backends/parquet.py (Outdated)

```python
        # Create structured array and fill column by column (avoids full table copy)
        fps = np.empty(n_rows, dtype=dtype)
        fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
        for col in fp_cols:
            fps[col] = pf.read(columns=[col]).column(0).to_numpy()
        fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
```
The load_fps method reads the Parquet file multiple times - once for each column (mol_id, each f column, and popcnt). This is inefficient and could significantly slow down loading for files with many fingerprint columns. Consider reading all columns in a single operation: pf.read(columns=columns) and then extracting the individual columns from the result.
Suggested change:

```diff
-        # Create structured array and fill column by column (avoids full table copy)
-        fps = np.empty(n_rows, dtype=dtype)
-        fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
-        for col in fp_cols:
-            fps[col] = pf.read(columns=[col]).column(0).to_numpy()
-        fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
+        # Read all required columns in a single operation
+        table = pf.read(columns=columns)
+        # Create structured array and fill column by column (avoids full table copy)
+        fps = np.empty(n_rows, dtype=dtype)
+        # mol_id is stored in the first column of the table
+        fps["fp_id"] = table.column(0).to_numpy()
+        # Fingerprint columns follow in the same order as in fp_cols
+        for idx, col in enumerate(fp_cols, start=1):
+            fps[col] = table.column(idx).to_numpy()
+        # popcnt is the last column in "columns"
+        fps["popcnt"] = table.column(len(columns) - 1).to_numpy()
```
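The single-read pattern can be sketched without pyarrow installed — below, a plain dict of NumPy arrays stands in for the Table that a single `pf.read(columns=columns)` call would return (with a real pyarrow Table, `table.column(name)` also accepts a column name, which is more robust than positional indexing). All data values here are made up for illustration:

```python
import numpy as np

# Stand-in for a pyarrow Table read in one call: a mapping of column
# name -> numpy array, as .column(name).to_numpy() would yield.
table = {
    "mol_id": np.array([3, 1, 2], dtype=np.int64),
    "f1": np.array([7, 0, 5], dtype=np.uint64),
    "f2": np.array([1, 2, 3], dtype=np.uint64),
    "popcnt": np.array([4, 1, 3], dtype=np.int64),
}
fp_cols = ["f1", "f2"]
dtype = np.dtype(
    [("fp_id", "<i8")] + [(c, "<u8") for c in fp_cols] + [("popcnt", "<i8")]
)
n_rows = len(table["mol_id"])

# Fill the structured array column by column from the single read,
# extracting by name rather than by positional index.
fps = np.empty(n_rows, dtype=dtype)
fps["fp_id"] = table["mol_id"]
for col in fp_cols:
    fps[col] = table[col]
fps["popcnt"] = table["popcnt"]
fps.sort(order="popcnt")  # in-place popcount sort, no extra copy
```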
FPSim2/io/backends/parquet.py (Outdated)

```python
        # Time sorting
        sort_start = time.time()
        # In-place sort (no copy)
        fps.sort(order="popcnt")
        sort_end = time.time()
        print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
```
The fps_sort parameter is accepted in `__init__` but never used. The load_fps method always sorts the fingerprints (line 263) regardless of the fps_sort value. Either: 1) use the fps_sort parameter to conditionally sort, matching PyTablesStorageBackend's behavior, or 2) if sorting is always required for the Parquet backend, document this and consider removing the unused parameter.
Suggested change:

```diff
-        # Time sorting
-        sort_start = time.time()
-        # In-place sort (no copy)
-        fps.sort(order="popcnt")
-        sort_end = time.time()
-        print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
+        # Time sorting (if enabled)
+        if getattr(self, "fps_sort", True):
+            sort_start = time.time()
+            # In-place sort (no copy)
+            fps.sort(order="popcnt")
+            sort_end = time.time()
+            print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
```
FPSim2/io/backends/parquet.py (Outdated)

```python
        print(f"Time to load Parquet file: {load_end - load_start:.4f} seconds")

        # Time converting to numpy
        convert_start = time.time()
        # Create structured array and fill column by column (avoids full table copy)
        fps = np.empty(n_rows, dtype=dtype)
        fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
        for col in fp_cols:
            fps[col] = pf.read(columns=[col]).column(0).to_numpy()
        fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
        convert_end = time.time()
        print(f"Time to convert to numpy: {convert_end - convert_start:.4f} seconds")

        # Time sorting
        sort_start = time.time()
        # In-place sort (no copy)
        fps.sort(order="popcnt")
        sort_end = time.time()
        print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
```
Debug timing print statements should be removed from production code or made conditional (e.g., using a logger or debug flag). These print statements will clutter the output for all users loading Parquet files, which is not appropriate for a library.
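One conventional way to make the timing output conditional is a module-level logger, so the messages appear only when the application opts in to DEBUG-level logging. A minimal sketch — the `timed` helper and logger name are hypothetical, not part of FPSim2:

```python
import logging
import time

# Library code should log to a named logger and never configure handlers itself;
# messages are silent unless the application enables DEBUG for this logger.
logger = logging.getLogger("FPSim2.io.backends.parquet")


def timed(label, func, *args, **kwargs):
    """Run func, logging the elapsed time at DEBUG level instead of printing."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    logger.debug("%s took %.4f seconds", label, time.perf_counter() - start)
    return result


# Usage: wrap an expensive step; output goes to the logger, not stdout.
total = timed("sum", sum, range(1000))
```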
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>