Implementation of fetch_pdb() #4943


Open: wants to merge 30 commits into base branch develop

Conversation


@jauy123 jauy123 commented Mar 3, 2025

Fixes #4907

Changes made in this Pull Request:

This is still a work in progress, but here's an implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.

  • Added two classes, DownloaderBase and PDBDownloader, to implement downloading structure files from online sources such as the PDB databank.
  • Added requests as a dependency.
  • mda.fetch_pdb() is implemented as a wrapper around the commonly used options in PDBDownloader.

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/


jauy123 commented Mar 3, 2025

I'm not sure where to put this code in the codebase, so I created a new folder for it for now. I'm open to it being moved somewhere else.

Some things I'd still like to add (besides tests and docs):

  1. A verbose option for PDBDownloader.download() (I think tqdm was a dependency last time I checked?)
  2. Integration with MDAnalysis' logger
  3. The cache logic could probably be wrapped into a separate function

@BradyAJohnston BradyAJohnston self-assigned this Mar 4, 2025

BradyAJohnston commented Mar 4, 2025

I think others will have to confirm, but we'll likely want requests to be an optional dependency to reduce core library dependencies (as fetching structures won't be something that a lot of users will be doing).

Additionally, it's not finalised yet, but if the mmCIF reader in #2367 gets finalised then the default download format shouldn't be .pdb (though it can remain for now).


codecov bot commented Mar 4, 2025

Codecov Report

Attention: Patch coverage is 91.76471% with 7 lines in your changes missing coverage. Please review.

Project coverage is 93.62%. Comparing base (5c0563d) to head (252b23c).

Files with missing lines Patch % Lines
package/MDAnalysis/web/downloaders.py 90.90% 3 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4943      +/-   ##
===========================================
- Coverage    93.62%   93.62%   -0.01%     
===========================================
  Files          177      180       +3     
  Lines        22001    22086      +85     
  Branches      3114     3127      +13     
===========================================
+ Hits         20599    20677      +78     
- Misses         947      950       +3     
- Partials       455      459       +4     



jauy123 commented Mar 5, 2025

I'm ok with that. I can make the code raise an exception if requests is not in the environment.


jauy123 commented Mar 10, 2025

Assuming that requests will be an optional dependency, how exactly would I specify that in the build files? Right now I'm just hard-coding it in so that the GitHub CI tests can build successfully and run.

@BradyAJohnston

You've added it to one of the optional dependency categories, which is all that should be required. In the files where it is used you'll need something set up like the usage of biopython:

try:
    import Bio.AlignIO
    import Bio.Align
    import Bio.Align.Applications
except ImportError:
    HAS_BIOPYTHON = False
else:
    HAS_BIOPYTHON = True

I'm not an expert on the pipelines so someone else would have to pitch in more on that.


jauy123 commented Mar 19, 2025

Thanks for the comment!


jauy123 commented Mar 19, 2025

I happen to have another question!

Is it normal for some of the tests to be inconsistent across commits? From what I understand, each GitHub CI run has to fetch and build MDAnalysis from source, and that step can apparently time out, judging by what I observe across commits.

The macOS run (of the latest commit) failed at 97% of the tests because it reached the maximum wall time of two hours.

Even then, the latest Azure tests failed because of other tests in the source code which I didn't write.

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

@orbeckst

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something that you added. I haven't looked at your code, but it might simply be that some things need to be written differently for Windows.

@orbeckst

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!


jauy123 commented Jun 12, 2025

I have time again. I was busy starting at the end of spring break with comps, classes, and you know what.


jauy123 commented Jun 26, 2025

@BradyAJohnston @orbeckst

Can I formally ask for a code review? I've finished up my code, and I'm currently unsure where to put it. For right now I have placed all my code in package/MDAnalysis/web and the tests in testsuite/MDAnalysisTests/web/. I defined a downloader class (PDBDownloader) that I think should be placed in MDAnalysis/coordinates/, and a wrapper function (fetch_pdb) for that class which should be loaded into the main MDAnalysis namespace. I'm not sure if that should be the definitive spot for it, though.

@orbeckst

Even without a code review, you can try to make the linters happy (click through and see what they're complaining about; probably start by running black on the files that you touched).

Look at the Azure tests (such as https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&jobId=c20f733f-1203-5ae6-f137-2a50b85410ce&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328 ): I see failures in your functionality https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328&l=337 ; see if you can do something about that.

@orbeckst orbeckst left a comment

It's a fair amount of code to do one thing, so the question is whether the complexity of a BaseDownloader and PDBDownloader class plus a fetch_pdb() is justified. We are always careful with adding new code because it invariably increases the maintenance burden. To help make decisions:

  • Can you think of other applications of BaseDownloader (eg mda.fetch_alphafold() #3377)?
  • Can you move code from PDBDownloader into BaseDownloader to make it more reusable?
  • Can you summarize the capabilities and advantages of your code?

I am really not quite sure where to put such code. My first instinct is to add any base functionality to coordinates.base and the format-specific code to coordinates.PDB. The fetch_pdb function can then be imported at the top level, or we write a top-level mda.fetch(...) that automatically calls the right fetcher.

@MDAnalysis/coredevs any suggestions how to organize "fetchers"?
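One way to organize such "fetchers" is a small registry that a top-level fetch(...) dispatches through; a sketch under assumed names (the registry, decorator, and the fetch_pdb stand-in below are all illustrative, not the PR's API):

```python
# Registry mapping a format name to its fetcher function.
_FETCHERS = {}


def register_fetcher(fmt):
    """Decorator registering a format-specific fetcher under a lowercase key."""
    def deco(func):
        _FETCHERS[fmt.lower()] = func
        return func
    return deco


@register_fetcher("pdb")
def fetch_pdb(pdb_id, **kwargs):
    # Stand-in for the PR's real downloader call.
    return f"would fetch {pdb_id} from the PDB"


def fetch(identifier, fmt="pdb", **kwargs):
    """Dispatch to the fetcher registered for the requested format."""
    try:
        fetcher = _FETCHERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no fetcher registered for format {fmt!r}")
    return fetcher(identifier, **kwargs)
```

New formats (e.g. a future mmCIF or AlphaFold fetcher) would then only need to register themselves rather than grow the top-level namespace.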

Member

remove

@@ -76,6 +77,7 @@ extra_formats = [
"pytng>=0.2.3",
"gsd>3.0.0",
"rdkit>=2020.03.1",
"requests"
Member

remove, it's already in the core deps

return tmp_path_factory.mktemp("cache")


class Test_PDBDownloaderBaseFunctionality:
Member

Use skipif if there's no connection to the PDB (basically, make it so that the test does not fail if there are internet issues).

mda.web.PDBDownloader(PDB_ID="BananaBoat").convert_to_universe()


class Test_PDBDownloader_Cache:
Member

The tests should not be using a shared cache directory. Tests may run in parallel, and then you may get the behavior that multiple tests write to the cache at the same time or find a file there that they didn't expect.

Make it so that the tests use a temporary directory that is cleaned up afterwards. pytest has a tmp_path_factory or tmp_path fixture to aid in this common usage.
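The isolated-directory pattern might look like this with pytest's built-in fixtures (the fixture and test names below are illustrative, not taken from the PR):

```python
import pytest


@pytest.fixture
def cache_dir(tmp_path_factory):
    # Fresh per-test directory; pytest cleans it up, so parallel tests
    # never share a cache.
    return tmp_path_factory.mktemp("pdb_cache")


def test_download_writes_into_private_cache(cache_dir):
    # A real test would point the downloader at cache_dir; here we only
    # show that the directory starts empty and is isolated.
    assert cache_dir.is_dir()
    assert not any(cache_dir.iterdir())
```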

import pytest
import requests

working_PDB_ID = "1DPX" # egg white lysozyme
Member

It may be more in line with how pytest does it if you made it a fixture (it can be module level), PDB_ID, and then used it as a fixture in all tests that need it. This makes it explicit where it's used.

It also then becomes possible to make it a fixture that provides multiple PDB_IDs if needed (to run each test multiple times), e.g., if we later find out that specific PDBs create issues.
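A sketch of that fixture, parametrized so more IDs can be added later (fixture and test names are illustrative):

```python
import pytest


# Module-level fixture in place of the bare working_PDB_ID global; the
# params list can grow if specific PDBs turn out to be problematic.
@pytest.fixture(params=["1DPX"])  # egg white lysozyme
def PDB_ID(request):
    return request.param


def test_id_is_four_characters(PDB_ID):
    # Illustrative use only; real tests would hand PDB_ID to the downloader.
    assert len(PDB_ID) == 4
```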

Comment on lines +158 to +161
if progress_bar:
    self._requests_progress_bar(r)
else:
    self._file.write(r.content)
Member

The method _requests_progress_bar is poorly named, I suggest something like _write_with_progressbar to indicate what it's actually doing.

if self._download:
    try:
        r = requests.get(
            f"https://files.rcsb.org/download/{self.id}.{self.file_format}",
Member

I'd make the URL a class variable as it is what makes this class the PDB downloader.
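Hoisting the URL template into a class attribute might look like the sketch below; the attribute and property names are assumptions, not the PR's final API:

```python
class PDBDownloader:
    # Class-level URL template: the one thing that makes this the *PDB*
    # downloader. Subclasses or siblings would override it.
    URL_TEMPLATE = "https://files.rcsb.org/download/{id}.{file_format}"

    def __init__(self, PDB_ID, file_format="pdb"):
        self.id = PDB_ID
        self.file_format = file_format

    @property
    def url(self):
        """The concrete download URL for this entry."""
        return self.URL_TEMPLATE.format(id=self.id, file_format=self.file_format)
```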

pb.update(chunk_size)

def download(self, cache_path=None, timeout=None, progress_bar=False):
    """Downloads files from the Protein Data Bank"""
Member

Needs more docs explaining the function and its arguments.

It needs to be made clear that the method can either keep the data in memory (StringIO) or write it to a file.
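A fuller numpydoc-style docstring might read as below; the parameter semantics are inferred from the surrounding diff and may not match the final PR exactly:

```python
class PDBDownloader:
    def download(self, cache_path=None, timeout=None, progress_bar=False):
        """Download the structure file from the Protein Data Bank.

        Parameters
        ----------
        cache_path : str or Path, optional
            Directory used as a download cache. If ``None`` (default),
            the downloaded content is kept in memory instead of on disk.
        timeout : float, optional
            Seconds to wait for the server before giving up.
        progress_bar : bool, optional
            If ``True``, display a progress bar while writing content.
        """
        ...  # body elided; see the PR diff
```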

Comment on lines +105 to +139
def _open_file(self, cache_path):
    """Either load/create the cache or reserve a spot in memory to store the topology"""

    if cache_path is None:
        self._file = io.BytesIO()
        self._download = True

    else:
        cache_file_path = Path(cache_path) / self.file_name

        # Found cache, so don't download anything and open the existing file.
        # Note this doesn't check the content of the file!
        if cache_file_path.exists() and cache_file_path.is_file():
            self._file = open(cache_file_path, "r")
            self._download = False

        else:  # No cache found, so create the cache
            self._file = open(cache_file_path, "wb")
            self._download = True

def _requests_progress_bar(self, requests_response):
    """Puts a progress bar on writing content from a requests response"""
    chunk_size = 1  # Files are so small that you can read them one byte at a time

    with ProgressBar(
        total=len(requests_response.content),
        unit="B",
        unit_scale=True,
        desc=self.file_name,
    ) as pb:
        for byte in requests_response.iter_content(chunk_size=chunk_size):
            self._file.write(byte)
            pb.update(chunk_size)
Member

The logic for caching should be common for all file-based downloaders. This is something I'd expect to see in a base class so that the actual PDB-one can then be simply containing the URL where to download from and a format hint to pass on to the Universe.
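A sketch of what that shared base class could look like; class, attribute, and method names here are illustrative assumptions, not the PR's final design:

```python
import io
from pathlib import Path


class DownloaderBase:
    """File/cache handling shared by all file-based downloaders.

    Subclasses would only supply a download URL and a format hint to
    pass on to the Universe.
    """

    file_name = "example.dat"  # subclasses override

    def _open_file(self, cache_path=None):
        """Open a cache file or an in-memory buffer; set self._download."""
        if cache_path is None:
            # No cache requested: keep everything in memory.
            self._file = io.BytesIO()
            self._download = True
            return
        cache_file_path = Path(cache_path) / self.file_name
        if cache_file_path.is_file():
            # Cache hit: reuse the existing file and skip the download.
            self._file = open(cache_file_path, "rb")
            self._download = False
        else:
            # Cache miss: download into a fresh cache file.
            self._file = open(cache_file_path, "wb")
            self._download = True
```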

from .downloaders import PDBDownloader


def fetch_pdb(
Member

Perhaps add as coordinates.PDB.fetch() and then we can add fetch() functions to others if necessary.


Successfully merging this pull request may close these issues.

mda.fetch_pdb() to generate Universe from Protein Databank structures
3 participants