Fix CIF conversion failures and coordinate overflows (#445, #303) by bamattsson · Pull Request #446 · Electrostatics/pdb2pqr

bamattsson · 2026-01-26T14:01:59Z

Hi @sobolevnrm and the other maintainers! Thanks for maintaining this tool for the structural biology community.

This PR aims to resolve issues raised in #445 and #303 regarding mmCIF support and conversion stability.

Key changes:

Graceful handling of missing data: In cif.py, the functions header, keywds, expdata, author, cryst1, scalen, and origxn now return default/empty values when the corresponding objects are missing from the CIF file, rather than raising an exception.
Coordinate overflow protection: Fixed a failure in atom_site where CIF coordinates with high precision (many significant digits) caused the formatted PDB line to exceed the strict 80-character limit, crashing the program. Added logic to ensure coordinates adhere to the fixed-width 8.3f format.
Refactoring: Refactored atom_site to reduce code duplication.
Test Coverage: Added a new test suite using three real-world CIF files that previously caused crashes.
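
The coordinate overflow protection listed above can be illustrated with a minimal sketch. This is not the code from this PR: the helper name `format_pdb_coordinate` and the fallback of dropping decimals when a value is too wide are assumptions made for illustration only.

```python
def format_pdb_coordinate(value: float, width: int = 8, decimals: int = 3) -> str:
    """Format a coordinate for a fixed-width PDB column (hypothetical helper).

    Tries the standard 8.3f layout first and drops one decimal at a time
    if the formatted text would be wider than the column.
    """
    for ndigits in range(decimals, -1, -1):
        text = f"{value:{width}.{ndigits}f}"
        if len(text) <= width:
            return text
    # Even with no decimals the value is too wide for the column.
    raise ValueError(f"coordinate {value!r} does not fit in {width} columns")


# A high-precision CIF value is rounded to three decimals and padded.
print(repr(format_pdb_coordinate(12.3456789012)))
# A large-magnitude value keeps its width by giving up decimals.
print(repr(format_pdb_coordinate(-12345.6789)))
```

Raising on values that cannot fit at all, instead of silently truncating them, keeps a malformed input visible to the user.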

How this has been tested:

Ran the pre-existing pytest suites (all passed)
Added new tests to test cif file reading (all passed)
Manually verified pdb2pqr output on a few cases

Question for maintainers:

In Point 1, I assumed that missing optional metadata (like author or scalen) does not break the downstream hydrogen assignment logic. My investigation suggests these are primarily used for PDB header generation, but I would appreciate a second look from someone familiar with the core solver.

Looking forward to any feedback :)

sobolevnrm · 2026-02-01T03:32:23Z

Hi @bamattsson --

Thank you for submitting this. Did the CIF files you included in your new tests fail with the old code? If not, could you please include one that failed previously and now works?

Thanks again,

Nathan

bamattsson · 2026-02-01T15:26:24Z

Hi Nathan!

Yes, all three of the CIFs I added in a07392dc fail on master and need these changes. I reconfirmed that today by:

Creating a fresh conda environment (python 3.10)
Installing pdb2pqr:master with pip install -e ".[pkaani,test]"
Running pytest -> passes
Cherry-picking only the new tests from a07392dc onto master (i.e. without the new cif-parsing code) and rerunning pytest -> those 3 new tests fail
Checking out the rest of the code on this branch as well (i.e. with the new cif-parsing code) -> all tests (including these 3 new ones) pass

I can see that the build failed above. It seems to be related to the pkaani tests. Would I need to make any changes to get those to pass? (They pass when I run them locally.)

bamattsson · 2026-02-04T10:35:12Z

Found one more edge-case bug and fixed it. Reran the linting checks and all the tests.

Copilot

Pull request overview

This PR addresses mmCIF parsing robustness in pdb2pqr, aiming to prevent crashes when optional CIF metadata is missing and when high-precision coordinates would otherwise overflow fixed-width PDB formatting.

Changes:

Refactors atom_site CIF→PDB-line conversion and adds fixed-width coordinate formatting safeguards.
Adds guard clauses to several CIF metadata handlers (header, keywds, expdata, author, cryst1, scalen, origxn) to better tolerate missing CIF categories.
Adds regression tests that run pdb2pqr on previously-problematic real-world CIF files.
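
As an illustration of the guard-clause pattern summarized above, here is a minimal sketch. It does not use the actual pdb2pqr or mmCIF reader API: the `categories` dictionary, the `struct_keywords`/`text` lookup, and the function name `keywds_record` are hypothetical stand-ins for the real data container.

```python
from typing import Dict


def keywds_record(categories: Dict[str, Dict[str, str]]) -> str:
    """Build a KEYWDS-style header line, or return "" if the data is absent.

    `categories` is a hypothetical mapping of CIF category name to a
    mapping of attribute name to value.
    """
    struct_keywords = categories.get("struct_keywords")
    if struct_keywords is None:
        # Guard clause: a missing optional category yields an empty record
        # instead of raising an exception.
        return ""
    text = struct_keywords.get("text")
    if text in (None, "?", "."):
        # CIF uses "?" and "." as placeholders for unknown/inapplicable values.
        return ""
    return f"KEYWDS    {text}"


print(repr(keywds_record({})))
print(repr(keywds_record({"struct_keywords": {"text": "HYDROLASE"}})))
```

Returning an empty string keeps the header generation optional, so a file without this metadata can still proceed to the atom records.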

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 8 comments.

File	Description
`pdb2pqr/cif.py`	Refactors atom_site conversion and adds missing-data handling for several CIF-derived PDB header records.
`tests/core_test.py`	Adds a parametrized test to ensure CIF inputs run through the pipeline without crashing.

Copilot · 2026-02-07T15:14:02Z

pdb2pqr/cif.py

+    # Extract and cast values
+    group = atoms.get_value("group_PDB", row_index=row_index)
+    serial = int(atoms.get_value("id", row_index=row_index))
+    name = atoms.get_value("label_atom_id", row_index=row_index)
+    alt_id = atoms.get_value("label_alt_id", row_index=row_index)
+    res_name = atoms.get_value("label_comp_id", row_index=row_index)
+    chain = atoms.get_value("label_asym_id", row_index=row_index)
+    res_seq = int(atoms.get_value("auth_seq_id", row_index=row_index))
+    x = float(atoms.get_value("Cartn_x", row_index=row_index))
+    y = float(atoms.get_value("Cartn_y", row_index=row_index))
+    z = float(atoms.get_value("Cartn_z", row_index=row_index))
+    occ = float(atoms.get_value("occupancy", row_index=row_index))
+    temp = float(atoms.get_value("B_iso_or_equiv", row_index=row_index))
+    element = atoms.get_value("type_symbol", row_index=row_index)
+
+    # Handle the '?' or '.' cases for alt_id and charge
+    alt_id = alt_id if alt_id != "." else " "
+    charge = (
+        atoms.get_value("pdbx_formal_charge", row_index=row_index)
+        if "pdbx_formal_charge" in atoms.attribute_list
+        else "  "
+    )
+    if charge in ["?", None]:


convert_cif_atom_site_to_pdb_line() casts several CIF fields with int()/float() directly. If any value is missing/unknown (e.g. ?, ., or None), this will raise ValueError/TypeError, and atom_site() currently doesn’t catch exceptions around the conversion call. Please add handling for missing/unknown values (return None for the line, or substitute safe defaults) and ensure the caller catches conversion failures.

Suggested change

    def _is_missing(value) -> bool:
        """Return True if a CIF value represents missing/unknown data."""
        return value is None or value == "?" or value == "."

    # Extract and cast values, handling missing/unknown data robustly
    group = atoms.get_value("group_PDB", row_index=row_index)
    serial_raw = atoms.get_value("id", row_index=row_index)
    if _is_missing(serial_raw):
        _LOGGER.warning("Missing atom serial number at row %s; skipping atom.", row_index)
        return None
    try:
        serial = int(serial_raw)
    except (TypeError, ValueError):
        _LOGGER.warning("Invalid atom serial number %r at row %s; skipping atom.", serial_raw, row_index)
        return None
    name = atoms.get_value("label_atom_id", row_index=row_index)
    if name is None:
        _LOGGER.warning("Missing atom name at row %s; skipping atom.", row_index)
        return None
    alt_id = atoms.get_value("label_alt_id", row_index=row_index)
    res_name = atoms.get_value("label_comp_id", row_index=row_index)
    chain = atoms.get_value("label_asym_id", row_index=row_index)
    res_seq_raw = atoms.get_value("auth_seq_id", row_index=row_index)
    if _is_missing(res_seq_raw):
        _LOGGER.warning("Missing residue sequence number at row %s; skipping atom.", row_index)
        return None
    try:
        res_seq = int(res_seq_raw)
    except (TypeError, ValueError):
        _LOGGER.warning(
            "Invalid residue sequence number %r at row %s; skipping atom.",
            res_seq_raw,
            row_index,
        )
        return None
    x_raw = atoms.get_value("Cartn_x", row_index=row_index)
    y_raw = atoms.get_value("Cartn_y", row_index=row_index)
    z_raw = atoms.get_value("Cartn_z", row_index=row_index)
    if _is_missing(x_raw) or _is_missing(y_raw) or _is_missing(z_raw):
        _LOGGER.warning("Missing Cartesian coordinates at row %s; skipping atom.", row_index)
        return None
    try:
        x = float(x_raw)
        y = float(y_raw)
        z = float(z_raw)
    except (TypeError, ValueError):
        _LOGGER.warning(
            "Invalid Cartesian coordinates (%r, %r, %r) at row %s; skipping atom.",
            x_raw,
            y_raw,
            z_raw,
            row_index,
        )
        return None
    occ_raw = atoms.get_value("occupancy", row_index=row_index)
    if _is_missing(occ_raw):
        occ = 1.0
    else:
        try:
            occ = float(occ_raw)
        except (TypeError, ValueError):
            _LOGGER.warning(
                "Invalid occupancy %r at row %s; using default 1.0.",
                occ_raw,
                row_index,
            )
            occ = 1.0
    temp_raw = atoms.get_value("B_iso_or_equiv", row_index=row_index)
    if _is_missing(temp_raw):
        temp = 0.0
    else:
        try:
            temp = float(temp_raw)
        except (TypeError, ValueError):
            _LOGGER.warning(
                "Invalid B-factor %r at row %s; using default 0.0.",
                temp_raw,
                row_index,
            )
            temp = 0.0
    element = atoms.get_value("type_symbol", row_index=row_index)
    if _is_missing(element):
        element = ""
    else:
        element = str(element).strip()

    # Handle the '?' or '.' cases for alt_id and charge
    alt_id = alt_id if (alt_id is not None and alt_id not in (".", "?")) else " "
    charge = (
        atoms.get_value("pdbx_formal_charge", row_index=row_index)
        if "pdbx_formal_charge" in atoms.attribute_list
        else " "
    )
    if charge in ["?", ".", None]:

I did this for occupancy and B_iso_or_equiv.

I did not do it for id, auth_seq_id or Cartn_x/y/z. IMHO, if any of these cannot be converted to numbers, it's probably worth raising an error so the user sees it and can decide what to do, as there's probably some larger problem with the CIF file. Simply dropping the line and logging an error could cause very unexpected results downstream that the user might not notice.
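
The split described in this reply (defaults for occupancy and B-factor, hard errors for identifiers and coordinates) could look roughly like the sketch below. The helper names `optional_float` and `required_float` are hypothetical and are not taken from the PR.

```python
import logging

_LOGGER = logging.getLogger(__name__)

# CIF placeholders for unknown ("?") and inapplicable (".") values.
_MISSING = (None, "?", ".")


def optional_float(raw, default: float, label: str) -> float:
    """Parse a non-critical field, falling back to a default (hypothetical)."""
    if raw in _MISSING:
        return default
    try:
        return float(raw)
    except (TypeError, ValueError):
        _LOGGER.warning("Invalid %s %r; using default %s.", label, raw, default)
        return default


def required_float(raw, label: str) -> float:
    """Parse a critical field and raise if it is unusable (hypothetical)."""
    if raw in _MISSING:
        raise ValueError(f"missing required CIF value for {label}")
    try:
        return float(raw)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"invalid CIF value {raw!r} for {label}") from exc


occ = optional_float("?", 1.0, "occupancy")  # falls back to the default
x = required_float("12.345", "Cartn_x")      # parsed normally
```

With this split, a file with a missing occupancy still converts, while a file with an unreadable coordinate stops with an error that names the offending field.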

bamattsson · 2026-02-10T20:20:31Z

Thanks for the review Copilot & @sobolevnrm. I fixed the suggestions now. Left a comment on one I only fixed partly.

I ran all tests locally and they all pass.

sobolevnrm · 2026-02-13T14:33:30Z

I'll try the test/build workflows here again. If they don't work, we'll need to figure out how to fix them before I can approve the merge.

bamattsson · 2026-02-15T18:44:47Z

Thanks @sobolevnrm! It seemed to fail again.

I looked through the three test/build workflows that failed above. They seem to fall into two categories, both driven by pkaani and torchani failures that seem to be unrelated to the changes in this PR. I include the full details below.

I don't really know much about torchani or pkaani, but this looks like a version incompatibility between the packages. Comparing against the most recent successful build I can find, I can see that torchani has been updated from 2.2 to 2.7.9. The only thing I can suggest is pinning the torchani requirement to <=2.2. If that fails, maybe it would be better to involve the people who contributed the pkaani code?

Failure in Python 3.10

For job 1 and job 2 the app failed here. Inside pkaani it fails here:

.venv/lib/python3.10/site-packages/pkaani/pkaani.py:147: in calculate_pka
    ani_descriptors,features=get_desc_arrays(ani,species_coordinates,aev,res_acti,res_aevi,a_symbols,a_type)
.venv/lib/python3.10/site-packages/pkaani/ani_descriptors.py:303: in get_desc_arrays
    nn_act = nn[:-1](aev[species_coordinates[0] == i])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = AtomicNetwork(
  layer_dims=(1008, 192, 160, 128, 1), 
  activation=TightCELU(), 
  bias=True,
  (layers): ModuleList(...s=128, bias=True)
  )
  (final_layer): Linear(in_features=128, out_features=1, bias=True)
  (activation): TightCELU()
)
idx = slice(None, -1, None)

    def __getitem__(self, idx: int) -> torch.nn.Module:
        if idx in [-1, len(self.layers)]:
            return self.final_layer
>       if idx < -1:
E       TypeError: '<' not supported between instances of 'slice' and 'int'

.venv/lib/python3.10/site-packages/torchani/nn/_core.py:144: TypeError
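
The mechanism behind this TypeError can be reproduced with a toy class. This is only a sketch: `ToyNetwork` is hypothetical and mirrors just the two checks visible in the traceback above; everything after the `idx < -1` comparison is an assumption and is not torchani's code.

```python
class ToyNetwork:
    """Toy container whose __getitem__ only anticipates integer indices."""

    def __init__(self, layers, final_layer):
        self.layers = list(layers)
        self.final_layer = final_layer

    def __getitem__(self, idx):
        # A slice compares unequal to every int, so this check is skipped.
        if idx in [-1, len(self.layers)]:
            return self.final_layer
        # Ordering a slice against an int is not defined, so this raises.
        if idx < -1:
            idx += len(self.layers) + 1  # assumed normalization, illustration only
        return self.layers[idx]


def slicing_error(network):
    """Return the error message raised by network[:-1], or None if it works."""
    try:
        network[:-1]
    except TypeError as exc:
        return str(exc)
    return None


print(slicing_error(ToyNetwork(["layer0", "layer1"], "final")))
```

Code written against a container that supported slicing, as the `nn[:-1]` call in pkaani appears to assume, fails in this way once `__getitem__` accepts integers only.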

Failure in Python 3.13

For job 3 the app fails here. Note that the try-except ImportError block, which complains that pdb2pqr[pkaani] is not installed, reports the wrong cause! Looking through the stack trace from the job, the ImportError inside that try block seems to come from torchani trying to run from pkg_resources import get_distribution, DistributionNotFound, as you can see in the log excerpt below:

>           from pkaani.pkaani import calculate_pka as calculate_pkaani

pdb2pqr/main.py:631: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.13/site-packages/pkaani/__init__.py:12: in <module>
    from pkaani.pkaani import calculate_pka
.venv/lib/python3.13/site-packages/pkaani/pkaani.py:2: in <module>
    import torchani
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    # -*- coding: utf-8 -*-
    """`TorchANI`_ is a PyTorch implementation of `ANI`_, created and maintained by
    the `Roitberg group`_.  TorchANI contains classes like
    :class:`AEVComputer`, :class:`ANIModel`, and :class:`EnergyShifter` that can
    be pipelined to compute molecular energies from the 3D coordinates of
    molecules.  It also include tools to: deal with ANI datasets(e.g. `ANI-1`_,
    `ANI-1x`_, `ANI-1ccx`_, `ANI-2x`_) at :attr:`torchani.data`, import various file
    formats of NeuroChem at :attr:`torchani.neurochem`, and more at :attr:`torchani.utils`.
    
    .. _TorchANI:
        https://doi.org/10.26434/chemrxiv.12218294.v1
    
    .. _ANI:
        http://pubs.rsc.org/en/Content/ArticleLanding/2017/SC/C6SC05720A#!divAbstract
    
    .. _Roitberg group:
        https://roitberg.chem.ufl.edu/
    
    .. _ANI-1:
        https://www.nature.com/articles/sdata2017193
    
    .. _ANI-1x:
        https://aip.scitation.org/doi/abs/10.1063/1.5023802
    
    .. _ANI-1ccx:
        https://doi.org/10.26434/chemrxiv.6744440.v1
    
    .. _ANI-2x:
        https://doi.org/10.26434/chemrxiv.11819268.v1
    """
    
    from .utils import EnergyShifter
    from .nn import ANIModel, Ensemble, SpeciesConverter
    from .aev import AEVComputer
    from . import utils
    from . import neurochem
    from . import models
    from . import units
>   from pkg_resources import get_distribution, DistributionNotFound
E   ModuleNotFoundError: No module named 'pkg_resources'

.venv/lib/python3.13/site-packages/torchani/__init__.py:39: ModuleNotFoundError
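
One way to avoid the misleading "not installed" message is to check which module the ImportError actually refers to. The sketch below is a suggestion only and is not code from pdb2pqr: the function name `describe_import_failure` is hypothetical. It relies on the standard `name` attribute of `ImportError`.

```python
def describe_import_failure(exc: ImportError, package: str) -> str:
    """Distinguish "package missing" from "a dependency of it failed to import"."""
    missing = exc.name or ""
    if missing == package or missing.startswith(package + "."):
        return f"{package} is not installed"
    # The package was found, but something it imports is missing or broken.
    return f"{package} is installed, but importing it failed: {exc}"


# Simulate the failure from the log: torchani cannot import pkg_resources.
error = ModuleNotFoundError("No module named 'pkg_resources'", name="pkg_resources")
print(describe_import_failure(error, "pkaani"))
```

In a real handler the exception would come from the `except ImportError as exc:` clause around the pkaani import, so the message could point at the dependency that is actually missing.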

sobolevnrm · 2026-02-15T19:16:34Z

Yes, we need help from @sastrys1 and @adnaksskanda on these errors. Hopefully they can fix them in another PR.

sobolevnrm · 2026-02-15T19:23:10Z

@bamattsson - since we're stuck for a bit, I've updated the main branch to main from master. You may need to update your local copy with something like:

git checkout main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

bamattsson · 2026-02-16T09:47:34Z

Perfect! I've merged in the changes on main now, and repointed this PR to main.

sobolevnrm · 2026-02-16T15:36:27Z

PR #451 fixes the build failure -- but in a way that I'm not very happy about. I don't have much time for this project anymore, so I'm hoping others can help find a better solution. If not, we can merge it next weekend.

bamattsson added 4 commits January 25, 2026 16:22

Expanding support for cif files missing certain data values

a7f35cc

Pulling out atom_site to pdb line conversion into function and refactoring

a55774a

Fixing ruff suggestions

ef7cd9b

Adding a test for reading cif files

a07392d

Fixing a bug that happens when some entity_src_gen lines miss certain datapoints

fcda43b

sobolevnrm requested a review from Copilot February 7, 2026 15:10

Copilot started reviewing on behalf of sobolevnrm February 7, 2026 15:11

sobolevnrm self-requested a review February 7, 2026 15:11

Copilot AI reviewed Feb 7, 2026

bamattsson and others added 2 commits February 10, 2026 19:15

Using args instead of kwards as suggested by copilot

26e31d8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fixing suggestions from copilot

ab72cb2

Merge branch 'main' into bamattsson/#445_CIF_file_support

da8e216

bamattsson changed the base branch from DEPRECATED_master to main February 16, 2026 09:45
