Skip to content

Py314#139

Merged
eloyfelix merged 11 commits intomasterfrom
py314
Feb 26, 2026
Merged

Py314#139
eloyfelix merged 11 commits intomasterfrom
py314

Conversation

@eloyfelix
Copy link
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings February 25, 2026 23:03
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds Parquet storage backend support to FPSim2 and updates Python version requirements. The PR is titled "Py314" but primarily introduces a new storage backend using Apache Parquet format as an alternative to PyTables HDF5 and SQLAlchemy backends. It also drops Python 3.10 support, increases the minimum Python version to 3.11, and attempts to add Python 3.14 support (though Python 3.14 has not been officially released yet).

Changes:

  • Adds ParquetStorageBackend as a new storage backend option with support for creating Parquet files from molecule sources and converting existing HDF5 files to Parquet format
  • Drops Python 3.10 support and raises minimum Python version to 3.11
  • Attempts to add Python 3.14 support (including free-threaded builds) across CI workflows and package classifiers
  • Updates pybind11 requirement from >=2.2 to >=2.13

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
FPSim2/io/backends/parquet.py New file implementing ParquetStorageBackend with create_parquet_file, h5_to_parquet conversion, and backend loading functionality
FPSim2/io/backends/init.py Exports ParquetStorageBackend for use by other modules
FPSim2/base.py Integrates Parquet backend into BaseEngine with validation for required parameters
FPSim2/FPSim2.py Adds validation requiring in-memory fingerprints for Parquet backend
tests/test_parquet.py Comprehensive test suite for Parquet backend covering file creation, loading, searching, and HDF5 conversion
pyproject.toml Updates Python version classifiers, minimum version requirement, and pybind11 dependency
.github/workflows/wheels.yml Removes Python 3.10, adds Python 3.14 (including free-threaded variants) to build matrix
.github/workflows/ci.yml Updates Python version matrix for CI testing
Comments suppressed due to low confidence (1)

pyproject.toml:37

  • PyArrow is required for the Parquet storage backend but is not listed in the dependencies. While the code handles the missing dependency gracefully with HAS_PYARROW flag, users who want to use the Parquet backend will encounter import errors. Consider either: 1) adding pyarrow to dependencies, or 2) creating an optional extras group like [parquet] that includes pyarrow, similar to how some packages handle optional dependencies.
dependencies = [
    "rdkit>=2022.3.3",
    "tables>=3.10",
    "numpy>=1.14",
    "sqlalchemy>=1.4.47",
    "scipy",
    "tqdm"

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1 to +16
"""Parquet storage backend for FPSim2."""

from typing import Dict, Iterable as IterableType, Tuple, Union
from .base import BaseStorageBackend
import numpy as np
import rdkit
import math
import json
import time
from importlib.metadata import version

__version__ = version("FPSim2")

try:
import pyarrow.parquet as pq
import pyarrow as pa
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing documentation for the new Parquet storage backend. The codebase documents alternative storage backends (see user_guide/sql_backend.md), so the Parquet backend should also be documented. Consider adding a user guide that explains how to create Parquet files, convert from HDF5, and use the Parquet backend with FPSim2Engine.

Copilot uses AI. Check for mistakes.
@@ -1,5 +1,5 @@
[build-system]
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title "Py314" is misleading. Python 3.14 has not been officially released yet, and the main contribution of this PR is adding Parquet storage backend support, not Python 3.14 compatibility. Consider a more descriptive title like "Add Parquet storage backend and update Python version support" or "Add Parquet backend, drop Python 3.10".

Copilot uses AI. Check for mistakes.

return fp_type, fp_params, rdkit_ver, fpsim2_ver

def load_fps(self) -> None:
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ParquetStorageBackend does not follow the abstract method signature from BaseStorageBackend. The abstract method expects load_fps(in_memory_fps, fps_sort) but this implementation has load_fps() with no parameters. While the current implementation always loads into memory and always sorts, the method signature should match the base class for consistency, even if some parameters are ignored.

Suggested change
def load_fps(self) -> None:
def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:

Copilot uses AI. Check for mistakes.
urls = { Homepage = "https://github.com/chembl/FPSim2" }
license = { text = "MIT" }
requires-python = ">=3.10"
requires-python = ">=3.11"
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping support for Python 3.10 while the package is still in Beta (as indicated by the classifier "Development Status :: 4 - Beta") may affect users. Consider documenting this breaking change in the release notes or changelog, and ensure the version number follows semantic versioning to indicate the breaking change.

Copilot uses AI. Check for mistakes.
Comment on lines +251 to +256
# Create structured array and fill column by column (avoids full table copy)
fps = np.empty(n_rows, dtype=dtype)
fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
for col in fp_cols:
fps[col] = pf.read(columns=[col]).column(0).to_numpy()
fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load_fps method reads the Parquet file multiple times - once for each column (mol_id, each f column, and popcnt). This is inefficient and could significantly slow down loading for files with many fingerprint columns. Consider reading all columns in a single operation: pf.read(columns=columns) and then extracting the individual columns from the result.

Suggested change
# Create structured array and fill column by column (avoids full table copy)
fps = np.empty(n_rows, dtype=dtype)
fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
for col in fp_cols:
fps[col] = pf.read(columns=[col]).column(0).to_numpy()
fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
# Read all required columns in a single operation
table = pf.read(columns=columns)
# Create structured array and fill column by column (avoids full table copy)
fps = np.empty(n_rows, dtype=dtype)
# mol_id is stored in the first column of the table
fps["fp_id"] = table.column(0).to_numpy()
# Fingerprint columns follow in the same order as in fp_cols
for idx, col in enumerate(fp_cols, start=1):
fps[col] = table.column(idx).to_numpy()
# popcnt is the last column in "columns"
fps["popcnt"] = table.column(len(columns) - 1).to_numpy()

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this saves RAM

Comment on lines +260 to +265
# Time sorting
sort_start = time.time()
# In-place sort (no copy)
fps.sort(order="popcnt")
sort_end = time.time()
print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fps_sort parameter is accepted in init but never used. The load_fps method always sorts the fingerprints (line 263) regardless of the fps_sort value. Either: 1) use the fps_sort parameter to conditionally sort, matching PyTablesStorageBackend's behavior, or 2) if sorting is always required for the Parquet backend, document this and consider removing the unused parameter.

Suggested change
# Time sorting
sort_start = time.time()
# In-place sort (no copy)
fps.sort(order="popcnt")
sort_end = time.time()
print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
# Time sorting (if enabled)
if getattr(self, "fps_sort", True):
sort_start = time.time()
# In-place sort (no copy)
fps.sort(order="popcnt")
sort_end = time.time()
print(f"Time to sort: {sort_end - sort_start:.4f} seconds")

Copilot uses AI. Check for mistakes.
Comment on lines +247 to +265
print(f"Time to load Parquet file: {load_end - load_start:.4f} seconds")

# Time converting to numpy
convert_start = time.time()
# Create structured array and fill column by column (avoids full table copy)
fps = np.empty(n_rows, dtype=dtype)
fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
for col in fp_cols:
fps[col] = pf.read(columns=[col]).column(0).to_numpy()
fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
convert_end = time.time()
print(f"Time to convert to numpy: {convert_end - convert_start:.4f} seconds")

# Time sorting
sort_start = time.time()
# In-place sort (no copy)
fps.sort(order="popcnt")
sort_end = time.time()
print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug timing print statements should be removed from production code or made conditional (e.g., using a logger or debug flag). These print statements will clutter the output for all users loading Parquet files, which is not appropriate for a library.

Copilot uses AI. Check for mistakes.
@eloyfelix eloyfelix merged commit 576a732 into master Feb 26, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants