
Fix CIF conversion failures and coordinate overflows (#445, #303)#446

Open
bamattsson wants to merge 8 commits into Electrostatics:main from bamattsson:bamattsson/#445_CIF_file_support

Conversation

@bamattsson

Hi @sobolevnrm and the other maintainers! Thanks for maintaining this tool for the structural biology community.

This PR aims to resolve issues raised in #445 and #303 regarding mmCIF support and conversion stability.

Key changes:

  1. Graceful handling of missing data: In cif.py, the functions header, keywds, expdata, author, cryst1, scalen, and origxn have been updated to return default/empty values when the corresponding objects are missing from the CIF file, rather than raising an exception.
  2. Coordinate overflow protection: Fixed a failure in atom_site where CIF coordinates with high precision (many significant digits) caused PDB line formatting to exceed the strict 80-character limit and crash the program. Added logic to ensure coordinates adhere to the fixed-width 8.3f format.
  3. Refactoring: Refactored atom_site to reduce code duplication.
  4. Test Coverage: Added a new test suite using three real-world CIF files that previously caused crashes.
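The safeguard in point 2 can be sketched roughly as follows. This is a minimal illustration of the idea, not the exact code in the PR; the function name `format_coord` is hypothetical:

```python
def format_coord(value: float) -> str:
    """Render a coordinate into the fixed 8-character PDB field (%8.3f).

    PDB columns are fixed width: the %8.3f rendering rounds away excess
    precision, and anything outside roughly -999.999..9999.999 cannot fit
    the 8-character column at all, so it is rejected explicitly instead of
    silently producing an over-long PDB line.
    """
    text = f"{value:8.3f}"  # rounds extra digits and pads to width 8
    if len(text) > 8:
        raise ValueError(f"coordinate {value} overflows the 8.3f PDB field")
    return text
```

Rounding (rather than truncating the string) keeps the coordinate as close as possible to the CIF value while preserving the column layout.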

How this has been tested:

  • Ran the pre-existing pytest suites (all passed)
  • Added new tests covering CIF file reading (all passed)
  • Manually verified pdb2pqr output on a few cases

Question for maintainers:

In Point 1, I assumed that missing optional metadata (like author or scalen) does not break the downstream hydrogen assignment logic. My investigation suggests these are primarily used for PDB header generation, but I would appreciate a second look from someone familiar with the core solver.

Looking forward to any feedback :)

@sobolevnrm
Member

Hi @bamattsson --

Thank you for submitting this. Did the CIF files you included in your new tests fail with the old code? If not, could you please include one that failed previously and now works?

Thanks again,

Nathan

@bamattsson
Author

bamattsson commented Feb 1, 2026

Hi Nathan!

Yes, all three of the CIFs I added in a07392dc fail on master and require these changes. I reconfirmed that today by:

  • Creating a fresh conda environment (python 3.10)
  • Installing pdb2pqr:master with pip install -e ".[pkaani,test]"
  • Running pytest -> passes
  • Cherry-picking only the new tests from a07392dc onto master (i.e. without the new CIF-parsing code) and rerunning pytest -> those 3 new tests fail
  • Checking out the rest of the code on this branch as well (i.e. with the new CIF-parsing code) -> all tests (including the 3 new ones) pass

I can see that the build failed above. It seems to be something to do with the pkaani tests. Would I need to make any changes to get those to pass? (They pass when I run them locally.)

@bamattsson
Author

Found one more edge-case bug, which I fixed. Reran the linting checks and all the tests.


Copilot AI left a comment


Pull request overview

This PR addresses mmCIF parsing robustness in pdb2pqr, aiming to prevent crashes when optional CIF metadata is missing and when high-precision coordinates would otherwise overflow fixed-width PDB formatting.

Changes:

  • Refactors atom_site CIF→PDB-line conversion and adds fixed-width coordinate formatting safeguards.
  • Adds guard clauses to several CIF metadata handlers (header, keywds, expdata, author, cryst1, scalen, origxn) to better tolerate missing CIF categories.
  • Adds regression tests that run pdb2pqr on previously-problematic real-world CIF files.

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 8 comments.

File Description
pdb2pqr/cif.py Refactors atom_site conversion and adds missing-data handling for several CIF-derived PDB header records.
tests/core_test.py Adds a parametrized test to ensure CIF inputs run through the pipeline without crashing.


pdb2pqr/cif.py Outdated
Comment on lines +110 to +132
# Extract and cast values
group = atoms.get_value("group_PDB", row_index=row_index)
serial = int(atoms.get_value("id", row_index=row_index))
name = atoms.get_value("label_atom_id", row_index=row_index)
alt_id = atoms.get_value("label_alt_id", row_index=row_index)
res_name = atoms.get_value("label_comp_id", row_index=row_index)
chain = atoms.get_value("label_asym_id", row_index=row_index)
res_seq = int(atoms.get_value("auth_seq_id", row_index=row_index))
x = float(atoms.get_value("Cartn_x", row_index=row_index))
y = float(atoms.get_value("Cartn_y", row_index=row_index))
z = float(atoms.get_value("Cartn_z", row_index=row_index))
occ = float(atoms.get_value("occupancy", row_index=row_index))
temp = float(atoms.get_value("B_iso_or_equiv", row_index=row_index))
element = atoms.get_value("type_symbol", row_index=row_index)

# Handle the '?' or '.' cases for alt_id and charge
alt_id = alt_id if alt_id != "." else " "
charge = (
    atoms.get_value("pdbx_formal_charge", row_index=row_index)
    if "pdbx_formal_charge" in atoms.attribute_list
    else " "
)
if charge in ["?", None]:

Copilot AI Feb 7, 2026


convert_cif_atom_site_to_pdb_line() casts several CIF fields with int()/float() directly. If any value is missing/unknown (e.g. ?, ., or None), this will raise ValueError/TypeError, and atom_site() currently doesn’t catch exceptions around the conversion call. Please add handling for missing/unknown values (return None for the line, or substitute safe defaults) and ensure the caller catches conversion failures.

Suggested change

# Extract and cast values
group = atoms.get_value("group_PDB", row_index=row_index)
serial = int(atoms.get_value("id", row_index=row_index))
name = atoms.get_value("label_atom_id", row_index=row_index)
alt_id = atoms.get_value("label_alt_id", row_index=row_index)
res_name = atoms.get_value("label_comp_id", row_index=row_index)
chain = atoms.get_value("label_asym_id", row_index=row_index)
res_seq = int(atoms.get_value("auth_seq_id", row_index=row_index))
x = float(atoms.get_value("Cartn_x", row_index=row_index))
y = float(atoms.get_value("Cartn_y", row_index=row_index))
z = float(atoms.get_value("Cartn_z", row_index=row_index))
occ = float(atoms.get_value("occupancy", row_index=row_index))
temp = float(atoms.get_value("B_iso_or_equiv", row_index=row_index))
element = atoms.get_value("type_symbol", row_index=row_index)
# Handle the '?' or '.' cases for alt_id and charge
alt_id = alt_id if alt_id != "." else " "
charge = (
    atoms.get_value("pdbx_formal_charge", row_index=row_index)
    if "pdbx_formal_charge" in atoms.attribute_list
    else " "
)
if charge in ["?", None]:

def _is_missing(value) -> bool:
    """Return True if a CIF value represents missing/unknown data."""
    return value is None or value == "?" or value == "."

# Extract and cast values, handling missing/unknown data robustly
group = atoms.get_value("group_PDB", row_index=row_index)
serial_raw = atoms.get_value("id", row_index=row_index)
if _is_missing(serial_raw):
    _LOGGER.warning("Missing atom serial number at row %s; skipping atom.", row_index)
    return None
try:
    serial = int(serial_raw)
except (TypeError, ValueError):
    _LOGGER.warning("Invalid atom serial number %r at row %s; skipping atom.", serial_raw, row_index)
    return None
name = atoms.get_value("label_atom_id", row_index=row_index)
if name is None:
    _LOGGER.warning("Missing atom name at row %s; skipping atom.", row_index)
    return None
alt_id = atoms.get_value("label_alt_id", row_index=row_index)
res_name = atoms.get_value("label_comp_id", row_index=row_index)
chain = atoms.get_value("label_asym_id", row_index=row_index)
res_seq_raw = atoms.get_value("auth_seq_id", row_index=row_index)
if _is_missing(res_seq_raw):
    _LOGGER.warning("Missing residue sequence number at row %s; skipping atom.", row_index)
    return None
try:
    res_seq = int(res_seq_raw)
except (TypeError, ValueError):
    _LOGGER.warning(
        "Invalid residue sequence number %r at row %s; skipping atom.",
        res_seq_raw,
        row_index,
    )
    return None
x_raw = atoms.get_value("Cartn_x", row_index=row_index)
y_raw = atoms.get_value("Cartn_y", row_index=row_index)
z_raw = atoms.get_value("Cartn_z", row_index=row_index)
if _is_missing(x_raw) or _is_missing(y_raw) or _is_missing(z_raw):
    _LOGGER.warning("Missing Cartesian coordinates at row %s; skipping atom.", row_index)
    return None
try:
    x = float(x_raw)
    y = float(y_raw)
    z = float(z_raw)
except (TypeError, ValueError):
    _LOGGER.warning(
        "Invalid Cartesian coordinates (%r, %r, %r) at row %s; skipping atom.",
        x_raw,
        y_raw,
        z_raw,
        row_index,
    )
    return None
occ_raw = atoms.get_value("occupancy", row_index=row_index)
if _is_missing(occ_raw):
    occ = 1.0
else:
    try:
        occ = float(occ_raw)
    except (TypeError, ValueError):
        _LOGGER.warning(
            "Invalid occupancy %r at row %s; using default 1.0.",
            occ_raw,
            row_index,
        )
        occ = 1.0
temp_raw = atoms.get_value("B_iso_or_equiv", row_index=row_index)
if _is_missing(temp_raw):
    temp = 0.0
else:
    try:
        temp = float(temp_raw)
    except (TypeError, ValueError):
        _LOGGER.warning(
            "Invalid B-factor %r at row %s; using default 0.0.",
            temp_raw,
            row_index,
        )
        temp = 0.0
element = atoms.get_value("type_symbol", row_index=row_index)
if _is_missing(element):
    element = ""
else:
    element = str(element).strip()
# Handle the '?' or '.' cases for alt_id and charge
alt_id = alt_id if (alt_id is not None and alt_id not in (".", "?")) else " "
charge = (
    atoms.get_value("pdbx_formal_charge", row_index=row_index)
    if "pdbx_formal_charge" in atoms.attribute_list
    else " "
)
if charge in ["?", ".", None]:

Author


I did this for occupancy and B_iso_or_equiv.

I did not do it for id, auth_seq_id, or Cartn_x/y/z. In my opinion, if any of these cannot be converted to numbers, it is worth raising an error so the user sees it and can decide what to do, since there is probably some larger problem with the CIF file. Silently dropping the line and logging an error could cause very unexpected results downstream that the user might not notice.
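The distinction I mean could be sketched roughly like this (the helper names `strict_float` and `lenient_float` are illustrative, not the actual code in the PR):

```python
def strict_float(raw, field: str) -> float:
    """Convert a required CIF value; raise a clear error on bad data.

    Used for identity and geometry fields (id, auth_seq_id, Cartn_x/y/z),
    where bad data should be surfaced to the user rather than hidden.
    """
    try:
        return float(raw)
    except (TypeError, ValueError) as err:
        raise ValueError(f"invalid value {raw!r} for required field {field}") from err


def lenient_float(raw, default: float) -> float:
    """Convert an optional CIF value, falling back to a default.

    Used for occupancy and B_iso_or_equiv, where a sensible default
    exists and the atom is still usable.
    """
    if raw in (None, "?", "."):
        return default
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default
```

So, for example, a missing occupancy would become `lenient_float(raw_occ, 1.0)`, while a malformed Cartn_x would abort via `strict_float(raw_x, "Cartn_x")`.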

bamattsson and others added 2 commits February 10, 2026 19:15
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@bamattsson
Author

Thanks for the review, Copilot & @sobolevnrm. I've applied the suggestions now, and left a comment on one that I only partly fixed.

I ran all tests locally and they all pass.

@sobolevnrm
Member

I'll try the test/build workflows here again. If they don't work, we'll need to figure out how to fix them before I can approve the merge.

@bamattsson
Author

Thanks @sobolevnrm! It seemed to fail again.

I looked through the three test/build workflows that failed above. They fall into two categories, both driven by pkaani and torchani failures that seem unrelated to the changes in this PR. Full details are below.

I don't know much about torchani or pkaani, but this looks like a version incompatibility between the packages. Comparing against the most recent successful build I can find, torchani has been updated from 2.2 to 2.7.9. The only thing I can suggest is to pin the torchani requirement to <=2.2. If that fails, maybe it would be better to involve the people who contributed the pkaani code?
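Concretely, the pin I have in mind is a one-line constraint in the project's dependency specification (the exact file and extras layout are assumptions on my part, since I haven't looked at how the pkaani extra is declared):

```
# hypothetical constraint in the [pkaani] extra's requirements
torchani<=2.2
```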

Failure in Python 3.10

For job 1 and job 2 the app failed here. Inside pkaani it fails here:

.venv/lib/python3.10/site-packages/pkaani/pkaani.py:147: in calculate_pka
    ani_descriptors,features=get_desc_arrays(ani,species_coordinates,aev,res_acti,res_aevi,a_symbols,a_type)
.venv/lib/python3.10/site-packages/pkaani/ani_descriptors.py:303: in get_desc_arrays
    nn_act = nn[:-1](aev[species_coordinates[0] == i])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = AtomicNetwork(
  layer_dims=(1008, 192, 160, 128, 1), 
  activation=TightCELU(), 
  bias=True,
  (layers): ModuleList(...s=128, bias=True)
  )
  (final_layer): Linear(in_features=128, out_features=1, bias=True)
  (activation): TightCELU()
)
idx = slice(None, -1, None)

    def __getitem__(self, idx: int) -> torch.nn.Module:
        if idx in [-1, len(self.layers)]:
            return self.final_layer
>       if idx < -1:
E       TypeError: '<' not supported between instances of 'slice' and 'int'

.venv/lib/python3.10/site-packages/torchani/nn/_core.py:144: TypeError

Failure in Python 3.13

For job 3 the app fails here. Note that the try-except ImportError complaining that pdb2pqr[pkaani] is not installed is misleading: judging from the job's stack trace, the ImportError inside that try block actually comes from torchani running `from pkg_resources import get_distribution, DistributionNotFound`, as you can see in the log excerpt below.

>           from pkaani.pkaani import calculate_pka as calculate_pkaani

pdb2pqr/main.py:631: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.13/site-packages/pkaani/__init__.py:12: in <module>
    from pkaani.pkaani import calculate_pka
.venv/lib/python3.13/site-packages/pkaani/pkaani.py:2: in <module>
    import torchani
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    # -*- coding: utf-8 -*-
    """`TorchANI`_ is a PyTorch implementation of `ANI`_, created and maintained by
    the `Roitberg group`_.  TorchANI contains classes like
    :class:`AEVComputer`, :class:`ANIModel`, and :class:`EnergyShifter` that can
    be pipelined to compute molecular energies from the 3D coordinates of
    molecules.  It also include tools to: deal with ANI datasets(e.g. `ANI-1`_,
    `ANI-1x`_, `ANI-1ccx`_, `ANI-2x`_) at :attr:`torchani.data`, import various file
    formats of NeuroChem at :attr:`torchani.neurochem`, and more at :attr:`torchani.utils`.
    
    .. _TorchANI:
        https://doi.org/10.26434/chemrxiv.12218294.v1
    
    .. _ANI:
        http://pubs.rsc.org/en/Content/ArticleLanding/2017/SC/C6SC05720A#!divAbstract
    
    .. _Roitberg group:
        https://roitberg.chem.ufl.edu/
    
    .. _ANI-1:
        https://www.nature.com/articles/sdata2017193
    
    .. _ANI-1x:
        https://aip.scitation.org/doi/abs/10.1063/1.5023802
    
    .. _ANI-1ccx:
        https://doi.org/10.26434/chemrxiv.6744440.v1
    
    .. _ANI-2x:
        https://doi.org/10.26434/chemrxiv.11819268.v1
    """
    
    from .utils import EnergyShifter
    from .nn import ANIModel, Ensemble, SpeciesConverter
    from .aev import AEVComputer
    from . import utils
    from . import neurochem
    from . import models
    from . import units
>   from pkg_resources import get_distribution, DistributionNotFound
E   ModuleNotFoundError: No module named 'pkg_resources'

.venv/lib/python3.13/site-packages/torchani/__init__.py:39: ModuleNotFoundError
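The misleading message arises because a broad `except ImportError:` catches the nested failure too. A rough sketch of how the handler could tell the two cases apart (this paraphrases the idea; it is not the actual code in main.py):

```python
def is_missing_extra(err: ModuleNotFoundError,
                     extra_modules=("pkaani", "torchani")) -> bool:
    """True if the import failed because the optional extra itself is
    absent; False if some other module broke deep in the import chain."""
    return err.name in extra_modules


try:
    from pkaani.pkaani import calculate_pka  # optional [pkaani] extra
except ModuleNotFoundError as err:
    if is_missing_extra(err):
        calculate_pka = None  # extra genuinely not installed: show install hint
    else:
        # e.g. pkg_resources missing on Python 3.13: a real environment
        # problem that the "install pdb2pqr[pkaani]" hint would only hide.
        raise
```

`ModuleNotFoundError.name` carries the module that actually failed to import, which is what distinguishes "extra not installed" from "dependency broken".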

@sobolevnrm
Member

Yes, we need help from @sastrys1 and @adnaksskanda on these errors. Hopefully they can fix them in another PR.

@sobolevnrm
Member

@bamattsson - since we're stuck for a bit, I've updated the main branch to main from master. You may need to update your local copy with something like:

git checkout main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

@bamattsson bamattsson changed the base branch from DEPRECATED_master to main February 16, 2026 09:45
@bamattsson
Author

Perfect! I've merged in the changes on main now, and repointed this PR to main.

@sobolevnrm
Member

PR #451 fixes the build failure -- but in a way that I'm not very happy about. I don't have much time for this project anymore, so I'm hoping others can help find a better solution. If not, we can merge it next weekend.
