Conversation
Pull request overview
This pull request adds a Parquet storage backend to FPSim2 and updates Python version requirements. The PR is titled "Py314" but primarily introduces a new storage backend using the Apache Parquet format as an alternative to the PyTables HDF5 and SQLAlchemy backends. It also drops Python 3.10 support, raising the minimum Python version to 3.11, and attempts to add Python 3.14 support (though Python 3.14 has not been officially released yet).
Changes:
- Adds ParquetStorageBackend as a new storage backend option with support for creating Parquet files from molecule sources and converting existing HDF5 files to Parquet format
- Drops Python 3.10 support and raises minimum Python version to 3.11
- Attempts to add Python 3.14 support (including free-threaded builds) across CI workflows and package classifiers
- Updates pybind11 requirement from >=2.2 to >=2.13
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| FPSim2/io/backends/parquet.py | New file implementing ParquetStorageBackend with create_parquet_file, h5_to_parquet conversion, and backend loading functionality |
| FPSim2/io/backends/__init__.py | Exports ParquetStorageBackend for use by other modules |
| FPSim2/base.py | Integrates Parquet backend into BaseEngine with validation for required parameters |
| FPSim2/FPSim2.py | Adds validation requiring in-memory fingerprints for Parquet backend |
| tests/test_parquet.py | Comprehensive test suite for Parquet backend covering file creation, loading, searching, and HDF5 conversion |
| pyproject.toml | Updates Python version classifiers, minimum version requirement, and pybind11 dependency |
| .github/workflows/wheels.yml | Removes Python 3.10, adds Python 3.14 (including free-threaded variants) to build matrix |
| .github/workflows/ci.yml | Updates Python version matrix for CI testing |
Comments suppressed due to low confidence (1)
pyproject.toml:37
- PyArrow is required for the Parquet storage backend but is not listed in the dependencies. While the code handles the missing dependency gracefully with HAS_PYARROW flag, users who want to use the Parquet backend will encounter import errors. Consider either: 1) adding pyarrow to dependencies, or 2) creating an optional extras group like [parquet] that includes pyarrow, similar to how some packages handle optional dependencies.
```toml
dependencies = [
    "rdkit>=2022.3.3",
    "tables>=3.10",
    "numpy>=1.14",
    "sqlalchemy>=1.4.47",
    "scipy",
    "tqdm"
]
```
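One way to express the extras-group option from the comment above — a sketch of a hypothetical `[project.optional-dependencies]` table for pyproject.toml (the table contents are illustrative, not taken from the PR):

```toml
[project.optional-dependencies]
# Installed via: pip install FPSim2[parquet]
parquet = ["pyarrow"]
```

This keeps the base install lean while letting users opt in to the Parquet backend explicitly.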
FPSim2/io/backends/parquet.py (Outdated)

```python
"""Parquet storage backend for FPSim2."""

from typing import Dict, Iterable as IterableType, Tuple, Union
from .base import BaseStorageBackend
import numpy as np
import rdkit
import math
import json
import time
from importlib.metadata import version

__version__ = version("FPSim2")

try:
    import pyarrow.parquet as pq
    import pyarrow as pa
```
Missing documentation for the new Parquet storage backend. The codebase documents alternative storage backends (see user_guide/sql_backend.md), so the Parquet backend should also be documented. Consider adding a user guide that explains how to create Parquet files, convert from HDF5, and use the Parquet backend with FPSim2Engine.
pyproject.toml

```diff
@@ -1,5 +1,5 @@
 [build-system]
```
The PR title "Py314" is misleading. Python 3.14 has not been officially released yet, and the main contribution of this PR is adding Parquet storage backend support, not Python 3.14 compatibility. Consider a more descriptive title like "Add Parquet storage backend and update Python version support" or "Add Parquet backend, drop Python 3.10".
FPSim2/io/backends/parquet.py (Outdated)

```python
        return fp_type, fp_params, rdkit_ver, fpsim2_ver

    def load_fps(self) -> None:
```
The ParquetStorageBackend does not follow the abstract method signature from BaseStorageBackend. The abstract method expects load_fps(in_memory_fps, fps_sort) but this implementation has load_fps() with no parameters. While the current implementation always loads into memory and always sorts, the method signature should match the base class for consistency, even if some parameters are ignored.
Suggested change:

```diff
-    def load_fps(self) -> None:
+    def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:
```
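Keeping the override signature-compatible with the abstract base lets callers treat all backends uniformly even when a backend ignores some parameters. A minimal sketch of the pattern (class and attribute names here are hypothetical, not FPSim2's actual classes):

```python
from abc import ABC, abstractmethod


class BaseBackend(ABC):
    @abstractmethod
    def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:
        ...


class AlwaysInMemoryBackend(BaseBackend):
    def load_fps(self, in_memory_fps: bool = True, fps_sort: bool = True) -> None:
        # Parameters accepted for interface compatibility; this backend
        # always loads into memory and always sorts, so they are ignored.
        self.loaded = True


backend = AlwaysInMemoryBackend()
backend.load_fps(in_memory_fps=False)  # callable through the shared signature
```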
```diff
 urls = { Homepage = "https://github.com/chembl/FPSim2" }
 license = { text = "MIT" }
-requires-python = ">=3.10"
+requires-python = ">=3.11"
```
Dropping support for Python 3.10 while the package is still in Beta (as indicated by the classifier "Development Status :: 4 - Beta") may affect users. Consider documenting this breaking change in the release notes or changelog, and ensure the version number follows semantic versioning to indicate the breaking change.
FPSim2/io/backends/parquet.py (Outdated)

```python
        # Create structured array and fill column by column (avoids full table copy)
        fps = np.empty(n_rows, dtype=dtype)
        fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
        for col in fp_cols:
            fps[col] = pf.read(columns=[col]).column(0).to_numpy()
        fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
```
The load_fps method reads the Parquet file multiple times - once for each column (mol_id, each f column, and popcnt). This is inefficient and could significantly slow down loading for files with many fingerprint columns. Consider reading all columns in a single operation: pf.read(columns=columns) and then extracting the individual columns from the result.
Suggested change:

```diff
-        # Create structured array and fill column by column (avoids full table copy)
-        fps = np.empty(n_rows, dtype=dtype)
-        fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
-        for col in fp_cols:
-            fps[col] = pf.read(columns=[col]).column(0).to_numpy()
-        fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
+        # Read all required columns in a single operation
+        table = pf.read(columns=columns)
+        # Create structured array and fill column by column (avoids full table copy)
+        fps = np.empty(n_rows, dtype=dtype)
+        # mol_id is stored in the first column of the table
+        fps["fp_id"] = table.column(0).to_numpy()
+        # Fingerprint columns follow in the same order as in fp_cols
+        for idx, col in enumerate(fp_cols, start=1):
+            fps[col] = table.column(idx).to_numpy()
+        # popcnt is the last column in "columns"
+        fps["popcnt"] = table.column(len(columns) - 1).to_numpy()
```
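The single-read pattern can be sketched without pyarrow installed — below, a plain dict of NumPy arrays stands in for the Table that a single `pf.read(columns=columns)` call would return (with a real pyarrow Table, `table.column(name)` also accepts a column name, which is more robust than positional indexing). All data values here are made up for illustration:

```python
import numpy as np

# Stand-in for a pyarrow Table read in one call: a mapping of column
# name -> numpy array, as .column(name).to_numpy() would yield.
table = {
    "mol_id": np.array([3, 1, 2], dtype=np.int64),
    "f1": np.array([7, 0, 5], dtype=np.uint64),
    "f2": np.array([1, 2, 3], dtype=np.uint64),
    "popcnt": np.array([4, 1, 3], dtype=np.int64),
}
fp_cols = ["f1", "f2"]
dtype = np.dtype(
    [("fp_id", "<i8")] + [(c, "<u8") for c in fp_cols] + [("popcnt", "<i8")]
)
n_rows = len(table["mol_id"])

# Fill the structured array column by column from the single read,
# extracting by name rather than by positional index.
fps = np.empty(n_rows, dtype=dtype)
fps["fp_id"] = table["mol_id"]
for col in fp_cols:
    fps[col] = table[col]
fps["popcnt"] = table["popcnt"]
fps.sort(order="popcnt")  # in-place popcount sort, no extra copy
```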
FPSim2/io/backends/parquet.py (Outdated)

```python
        # Time sorting
        sort_start = time.time()
        # In-place sort (no copy)
        fps.sort(order="popcnt")
        sort_end = time.time()
        print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
```
The fps_sort parameter is accepted in `__init__` but never used. The load_fps method always sorts the fingerprints (line 263) regardless of the fps_sort value. Either: 1) use the fps_sort parameter to conditionally sort, matching PyTablesStorageBackend's behavior, or 2) if sorting is always required for the Parquet backend, document this and consider removing the unused parameter.
Suggested change:

```diff
-        # Time sorting
-        sort_start = time.time()
-        # In-place sort (no copy)
-        fps.sort(order="popcnt")
-        sort_end = time.time()
-        print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
+        # Time sorting (if enabled)
+        if getattr(self, "fps_sort", True):
+            sort_start = time.time()
+            # In-place sort (no copy)
+            fps.sort(order="popcnt")
+            sort_end = time.time()
+            print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
```
FPSim2/io/backends/parquet.py (Outdated)

```python
        print(f"Time to load Parquet file: {load_end - load_start:.4f} seconds")

        # Time converting to numpy
        convert_start = time.time()
        # Create structured array and fill column by column (avoids full table copy)
        fps = np.empty(n_rows, dtype=dtype)
        fps["fp_id"] = pf.read(columns=["mol_id"]).column(0).to_numpy()
        for col in fp_cols:
            fps[col] = pf.read(columns=[col]).column(0).to_numpy()
        fps["popcnt"] = pf.read(columns=["popcnt"]).column(0).to_numpy()
        convert_end = time.time()
        print(f"Time to convert to numpy: {convert_end - convert_start:.4f} seconds")

        # Time sorting
        sort_start = time.time()
        # In-place sort (no copy)
        fps.sort(order="popcnt")
        sort_end = time.time()
        print(f"Time to sort: {sort_end - sort_start:.4f} seconds")
```
Debug timing print statements should be removed from production code or made conditional (e.g., using a logger or debug flag). These print statements will clutter the output for all users loading Parquet files, which is not appropriate for a library.
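One conventional way to make the timing output conditional is a module-level logger, so the messages appear only when the application opts in to DEBUG-level logging. A minimal sketch — the `timed` helper and logger name are hypothetical, not part of FPSim2:

```python
import logging
import time

# Library code should log to a named logger and never configure handlers itself;
# messages are silent unless the application enables DEBUG for this logger.
logger = logging.getLogger("FPSim2.io.backends.parquet")


def timed(label, func, *args, **kwargs):
    """Run func, logging the elapsed time at DEBUG level instead of printing."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    logger.debug("%s took %.4f seconds", label, time.perf_counter() - start)
    return result


# Usage: wrap an expensive step; output goes to the logger, not stdout.
total = timed("sum", sum, range(1000))
```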
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>