Fix CIF conversion failures and coordinate overflows (#445, #303)#446
Fix CIF conversion failures and coordinate overflows (#445, #303)#446bamattsson wants to merge 8 commits intoElectrostatics:mainfrom
Conversation
|
Hi @bamattsson -- Thank you for submitting this. Did the CIF files you included in your new tests fail with the old code? If not, could you please include one that failed previously and now works? Thanks again, Nathan |
|
Hi Nathan! Yes, all of the three cifs I added in
I can see that the build failed above. Seems to be something to do with pkaani tests. Would I need to do any changes to make those pass? (they pass when I run it locally) |
|
Found one more edge case bug that I fixed. Reran linting checks and all the tests |
There was a problem hiding this comment.
Pull request overview
This PR addresses mmCIF parsing robustness in pdb2pqr, aiming to prevent crashes when optional CIF metadata is missing and when high-precision coordinates would otherwise overflow fixed-width PDB formatting.
Changes:
- Refactors
atom_siteCIF→PDB-line conversion and adds fixed-width coordinate formatting safeguards. - Adds guard clauses to several CIF metadata handlers (
header,keywds,expdata,author,cryst1,scalen,origxn) to better tolerate missing CIF categories. - Adds regression tests that run
pdb2pqron previously-problematic real-world CIF files.
Reviewed changes
Copilot reviewed 2 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
pdb2pqr/cif.py |
Refactors atom_site conversion and adds missing-data handling for several CIF-derived PDB header records. |
tests/core_test.py |
Adds a parametrized test to ensure CIF inputs run through the pipeline without crashing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pdb2pqr/cif.py
Outdated
| # Extract and cast values | ||
| group = atoms.get_value("group_PDB", row_index=row_index) | ||
| serial = int(atoms.get_value("id", row_index=row_index)) | ||
| name = atoms.get_value("label_atom_id", row_index=row_index) | ||
| alt_id = atoms.get_value("label_alt_id", row_index=row_index) | ||
| res_name = atoms.get_value("label_comp_id", row_index=row_index) | ||
| chain = atoms.get_value("label_asym_id", row_index=row_index) | ||
| res_seq = int(atoms.get_value("auth_seq_id", row_index=row_index)) | ||
| x = float(atoms.get_value("Cartn_x", row_index=row_index)) | ||
| y = float(atoms.get_value("Cartn_y", row_index=row_index)) | ||
| z = float(atoms.get_value("Cartn_z", row_index=row_index)) | ||
| occ = float(atoms.get_value("occupancy", row_index=row_index)) | ||
| temp = float(atoms.get_value("B_iso_or_equiv", row_index=row_index)) | ||
| element = atoms.get_value("type_symbol", row_index=row_index) | ||
|
|
||
| # Handle the '?' or '.' cases for alt_id and charge | ||
| alt_id = alt_id if alt_id != "." else " " | ||
| charge = ( | ||
| atoms.get_value("pdbx_formal_charge", row_index=row_index) | ||
| if "pdbx_formal_charge" in atoms.attribute_list | ||
| else " " | ||
| ) | ||
| if charge in ["?", None]: |
There was a problem hiding this comment.
convert_cif_atom_site_to_pdb_line() casts several CIF fields with int()/float() directly. If any value is missing/unknown (e.g. ?, ., or None), this will raise ValueError/TypeError, and atom_site() currently doesn’t catch exceptions around the conversion call. Please add handling for missing/unknown values (return None for the line, or substitute safe defaults) and ensure the caller catches conversion failures.
| # Extract and cast values | |
| group = atoms.get_value("group_PDB", row_index=row_index) | |
| serial = int(atoms.get_value("id", row_index=row_index)) | |
| name = atoms.get_value("label_atom_id", row_index=row_index) | |
| alt_id = atoms.get_value("label_alt_id", row_index=row_index) | |
| res_name = atoms.get_value("label_comp_id", row_index=row_index) | |
| chain = atoms.get_value("label_asym_id", row_index=row_index) | |
| res_seq = int(atoms.get_value("auth_seq_id", row_index=row_index)) | |
| x = float(atoms.get_value("Cartn_x", row_index=row_index)) | |
| y = float(atoms.get_value("Cartn_y", row_index=row_index)) | |
| z = float(atoms.get_value("Cartn_z", row_index=row_index)) | |
| occ = float(atoms.get_value("occupancy", row_index=row_index)) | |
| temp = float(atoms.get_value("B_iso_or_equiv", row_index=row_index)) | |
| element = atoms.get_value("type_symbol", row_index=row_index) | |
| # Handle the '?' or '.' cases for alt_id and charge | |
| alt_id = alt_id if alt_id != "." else " " | |
| charge = ( | |
| atoms.get_value("pdbx_formal_charge", row_index=row_index) | |
| if "pdbx_formal_charge" in atoms.attribute_list | |
| else " " | |
| ) | |
| if charge in ["?", None]: | |
| def _is_missing(value) -> bool: | |
| """Return True if a CIF value represents missing/unknown data.""" | |
| return value is None or value == "?" or value == "." | |
| # Extract and cast values, handling missing/unknown data robustly | |
| group = atoms.get_value("group_PDB", row_index=row_index) | |
| serial_raw = atoms.get_value("id", row_index=row_index) | |
| if _is_missing(serial_raw): | |
| _LOGGER.warning("Missing atom serial number at row %s; skipping atom.", row_index) | |
| return None | |
| try: | |
| serial = int(serial_raw) | |
| except (TypeError, ValueError): | |
| _LOGGER.warning("Invalid atom serial number %r at row %s; skipping atom.", serial_raw, row_index) | |
| return None | |
| name = atoms.get_value("label_atom_id", row_index=row_index) | |
| if name is None: | |
| _LOGGER.warning("Missing atom name at row %s; skipping atom.", row_index) | |
| return None | |
| alt_id = atoms.get_value("label_alt_id", row_index=row_index) | |
| res_name = atoms.get_value("label_comp_id", row_index=row_index) | |
| chain = atoms.get_value("label_asym_id", row_index=row_index) | |
| res_seq_raw = atoms.get_value("auth_seq_id", row_index=row_index) | |
| if _is_missing(res_seq_raw): | |
| _LOGGER.warning("Missing residue sequence number at row %s; skipping atom.", row_index) | |
| return None | |
| try: | |
| res_seq = int(res_seq_raw) | |
| except (TypeError, ValueError): | |
| _LOGGER.warning( | |
| "Invalid residue sequence number %r at row %s; skipping atom.", | |
| res_seq_raw, | |
| row_index, | |
| ) | |
| return None | |
| x_raw = atoms.get_value("Cartn_x", row_index=row_index) | |
| y_raw = atoms.get_value("Cartn_y", row_index=row_index) | |
| z_raw = atoms.get_value("Cartn_z", row_index=row_index) | |
| if _is_missing(x_raw) or _is_missing(y_raw) or _is_missing(z_raw): | |
| _LOGGER.warning("Missing Cartesian coordinates at row %s; skipping atom.", row_index) | |
| return None | |
| try: | |
| x = float(x_raw) | |
| y = float(y_raw) | |
| z = float(z_raw) | |
| except (TypeError, ValueError): | |
| _LOGGER.warning( | |
| "Invalid Cartesian coordinates (%r, %r, %r) at row %s; skipping atom.", | |
| x_raw, | |
| y_raw, | |
| z_raw, | |
| row_index, | |
| ) | |
| return None | |
| occ_raw = atoms.get_value("occupancy", row_index=row_index) | |
| if _is_missing(occ_raw): | |
| occ = 1.0 | |
| else: | |
| try: | |
| occ = float(occ_raw) | |
| except (TypeError, ValueError): | |
| _LOGGER.warning( | |
| "Invalid occupancy %r at row %s; using default 1.0.", | |
| occ_raw, | |
| row_index, | |
| ) | |
| occ = 1.0 | |
| temp_raw = atoms.get_value("B_iso_or_equiv", row_index=row_index) | |
| if _is_missing(temp_raw): | |
| temp = 0.0 | |
| else: | |
| try: | |
| temp = float(temp_raw) | |
| except (TypeError, ValueError): | |
| _LOGGER.warning( | |
| "Invalid B-factor %r at row %s; using default 0.0.", | |
| temp_raw, | |
| row_index, | |
| ) | |
| temp = 0.0 | |
| element = atoms.get_value("type_symbol", row_index=row_index) | |
| if _is_missing(element): | |
| element = "" | |
| else: | |
| element = str(element).strip() | |
| # Handle the '?' or '.' cases for alt_id and charge | |
| alt_id = alt_id if (alt_id is not None and alt_id not in (".", "?")) else " " | |
| charge = ( | |
| atoms.get_value("pdbx_formal_charge", row_index=row_index) | |
| if "pdbx_formal_charge" in atoms.attribute_list | |
| else " " | |
| ) | |
| if charge in ["?", ".", None]: |
There was a problem hiding this comment.
I did this for occupancy and B_iso_or_equiv.
I did not do it for id, auth_seq_id or Cartn_x/y/z. Imho, if any of these cannot be converted into floats it's probably worth raising an error and letting the user see this and decide what to do, as there's probably some larger error with the cif-file. Simply dropping the line and logging an error could cause very unexpected results down stream which the user might not realise.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Thanks for the review Copilot & @sobolevnrm. I fixed the suggestions now. Left a comment on one I only fixed partly. I ran all tests locally and they all pass. |
|
I'll try the test/build workflows here again. If they don't work, we'll need to figure out how to fix them before I can approve the merge. |
|
Thanks @sobolevnrm! It seemed to fail again. I looked through the three test/build workflows that have failed above. They seem to fall into two categories. Both driven by pkaani and torchani failures, that seems to be unrelated to the changes in this PR. I include the full details below. I don't really know much about torchani or pkaani, but this looks like maybe there's some version incompatibility in the packages. From comparing the most recently successful build that I can find I can see that torchani has been updated 2.2->2.7.9. The only thing I could suggest is to pin the requirement on torchani to <=2.2. If that fails maybe it would be better if we involve the people who contributed the pkaani code? Failure in Python 3.10For job 1 and job 2 the app failed here. Inside pkaani it fails here: Failure in Python 3.13For job 3 the app fails here. Note that the try-except ImportError which complains that |
|
Yes, we need help from @sastrys1 and @adnaksskanda on these errors. Hopefully they can fix them in another PR. |
|
@bamattsson - since we're stuck for a bit, I've updated the main branch to git checkout main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a |
|
Perfect! I've merged in the changes on main now, and repointed this PR to main. |
|
PR #451 fixes the build failure -- but in a way that I'm not very happy about. I don't have much time for this project anymore, so I'm hoping others can help find a better solution. If not, we can merge it next weekend. |
Hi @sobolevnrm and the other maintainers! Thanks for maintaining this tool for the structural biology community.
This PR aims to resolve issues raised in #445 and #303 regarding mmCIF support and conversion stability.
Key changes:
cif.pythe following functionsheader,keywds,expdata,author,cryst1,cryst1,scalen,origxnhave been updated to return default/empty values if the corresponding objects are missing from the CIF file, rather than raising an exception.atom_sitewhere CIF coordinates with high precision (many significant digits) caused PDB line formatting to exceed the strict 80-character limit and the program crashing. Added logic to ensure coordinates adhere to the fixed-width8.3fformat.atom_siteto reduce code duplication.How this has been tested:
pytestsuites (all passed)pdb2pqroutput on a few casesQuestion for maintainers:
In Point 1, I assumed that missing optional metadata (like author or scalen) does not break the downstream hydrogen assignment logic. My investigation suggests these are primarily used for PDB header generation, but I would appreciate a second look from someone familiar with the core solver.
Looking forward to any feedback :)