Implementation of fetch_pdb() #4943
base: develop
I'm not sure where to put this code in the codebase, so I created a new folder for it for now. I'm open to it being moved somewhere else. Some things I would still like to add (besides tests and docs):
I think others will have to confirm, but likely we'll want to have Additionally, it's not finalised yet, but if the mmCIF reader in #2367 gets finalised then the default download shouldn't be
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #4943     +/-   ##
===========================================
- Coverage    93.62%   93.62%   -0.01%
===========================================
  Files          177      180       +3
  Lines        22001    22086      +85
  Branches      3114     3127      +13
===========================================
+ Hits         20599    20677      +78
- Misses         947      950       +3
- Partials       455      459       +4
I'm ok with that. I can make the code raise an exception if
Assuming that
You've added it to one of the optional dependency categories, which is all that should be required. For the actual files where it is used, you'll need to have something set up like the usage of biopython: mdanalysis/package/MDAnalysis/analysis/align.py Lines 200 to 207 in dcaa087
I'm not an expert on the pipelines, so someone else would have to pitch in more on that.
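For illustration, a guarded optional import could look roughly like the sketch below. It is not a copy of the referenced align.py lines; the flag name HAS_REQUESTS and the helper download_bytes are hypothetical names chosen for this example.

```python
# Minimal sketch of an optional-dependency guard (names are hypothetical).
try:
    import requests
except ImportError:
    HAS_REQUESTS = False
else:
    HAS_REQUESTS = True


def download_bytes(url, timeout=None):
    """Return the raw content at ``url``, or raise if requests is missing."""
    if not HAS_REQUESTS:
        # Fail only when the feature is actually used, not at import time.
        raise ImportError(
            "Downloading requires the optional dependency 'requests'; "
            "install it with `pip install requests`."
        )
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP error codes into exceptions
    return response.content
```

With this layout the module stays importable without the extra package, and the error message tells the user what to install.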
Thanks for the comment!
I happen to have another question! Is it normal for some of the tests to be inconsistent from commit to commit? From what I understand, each GitHub CI job has to fetch and build MDAnalysis from source, and from what I observe a job can time out on some commits. The macOS job (for the latest commit) failed at 97% of the tests because it reached the maximum wall time of two hours. The latest Azure tests also failed, because of tests in the source code which I didn't write.
In principle, tests should pass everywhere. The Azure tests time out in the test
which looks like something that you added. I haven't looked at your code, but it might simply be that some things need to be written differently for Windows.
@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!
I have time again. From the end of spring break onward I was busy with comps, classes, and you know what.
…since text files are just binary files with special encoding
Instead of Temporary File (which are slower), buffers are used instead!
modified: pyproject.toml
Can I formally ask for a code review? I finished up with my code, and I'm currently unsure where to put it. I have placed all my code in
Even without code review, you can try to make the linters happy (click and see what they're complaining about – probably start by running Look at the Azure tests (such as https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&jobId=c20f733f-1203-5ae6-f137-2a50b85410ce&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328 ): I see failures in your functionality https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328&l=337 ; see if you can do something about that.
It's a fair amount of code to do one thing, so the question is whether the complexity of a BaseDownloader class, a PDBDownloader class, and a fetch_pdb() function is justified. We are always careful with adding new code because it invariably increases the maintenance burden. To help make decisions:
- Can you think of other applications of BaseDownloader (e.g., mda.fetch_alphafold() #3377)?
- Can you move code from PDBDownloader into BaseDownloader to make it more reusable?
- Can you summarize the capabilities and advantages of your code?
I am really not quite sure where to put such code. My first instinct is to add any base functionality to coordinates.base and the format-specific code to coordinates.PDB. The fetch_pdb function can then be imported at the top level or we write a top-level mda.fetch(...) that automatically calls the right fetcher.
@MDAnalysis/coredevs any suggestions how to organize "fetchers"?
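One possible shape for a top-level fetch(...) that calls the right fetcher is a small registry, sketched below. Everything here is hypothetical: the registry, the decorator, the source keyword and the string return value are placeholders for illustration, and a real fetcher would return a Universe rather than a string.

```python
# Hypothetical sketch of a fetcher registry; not part of this PR.
_FETCHERS = {}


def register_fetcher(source):
    """Decorator that registers a fetch function under a source name."""
    def decorator(func):
        _FETCHERS[source] = func
        return func
    return decorator


@register_fetcher("pdb")
def fetch_pdb(entry_id, **kwargs):
    # Placeholder body: a real implementation would download and parse.
    return f"pdb:{entry_id}"


def fetch(entry_id, source="pdb", **kwargs):
    """Dispatch to the fetcher registered for ``source``."""
    try:
        fetcher = _FETCHERS[source]
    except KeyError:
        known = ", ".join(sorted(_FETCHERS))
        raise ValueError(
            f"unknown source {source!r}; known sources: {known}"
        ) from None
    return fetcher(entry_id, **kwargs)
```

A later fetcher (for example for AlphaFold models, as in #3377) would then only need its own registered function, without touching fetch itself.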
package/MDAnalysis/web/TODO
Outdated
remove
@@ -76,6 +77,7 @@ extra_formats = [
    "pytng>=0.2.3",
    "gsd>3.0.0",
    "rdkit>=2020.03.1",
    "requests"
remove, it's already in the core deps
    return tmp_path_factory.mktemp("cache")


class Test_PDBDownloaderBaseFunctionality:
Use skipif if there's no connection to the PDB (basically, make it so that the test does not fail if there are internet issues).
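A minimal sketch of such a guard follows. Note that it uses a fixture that calls pytest.skip instead of a skipif marker, so that the network check runs only when the tests run and not at import time. The helper pdb_is_reachable, the fixture name requires_pdb and the opener parameter are hypothetical; the host is taken from the download URL used in this PR.

```python
import urllib.error
import urllib.request

import pytest

PDB_HOST = "https://files.rcsb.org"  # host from the download URL in this PR


def pdb_is_reachable(url=PDB_HOST, timeout=5, opener=urllib.request.urlopen):
    """Return True if the server answers at all, False on a network failure."""
    try:
        opener(url, timeout=timeout).close()
    except urllib.error.HTTPError:
        return True  # the server replied, even if with an error status
    except OSError:
        return False  # DNS failure, refused connection, timeout, ...
    return True


@pytest.fixture(scope="module")
def requires_pdb():
    """Skip the requesting tests when the PDB cannot be reached."""
    if not pdb_is_reachable():
        pytest.skip("cannot reach the PDB download server")


@pytest.mark.usefixtures("requires_pdb")
class Test_PDBDownloaderBaseFunctionality:
    def test_placeholder(self):
        # Real download tests would go here; they are skipped when offline.
        assert True
```

The opener argument exists only so that the check itself can be tested without network access.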
        mda.web.PDBDownloader(PDB_ID="BananaBoat").convert_to_universe()


class Test_PDBDownloader_Cache:
The tests should not be using a shared cache directory. Tests may run in parallel and then you may get the behavior that multiple tests write at the same time to the cache or find a file there that they didn't expect.
Make it so that the tests use a temporary directory that is cleaned up afterwards. pytest has the tmp_path_factory and tmp_path fixtures to aid in this common usage.
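As a rough illustration of a per-test cache directory, the sketch below relies on pytest's built-in tmp_path fixture, which hands every test its own fresh directory. The helper write_to_cache is a hypothetical stand-in for the downloader, used so the example needs no network access.

```python
from pathlib import Path


def write_to_cache(cache_dir, file_name, payload):
    """Hypothetical stand-in for a downloader that stores a file in a cache."""
    target = Path(cache_dir) / file_name
    target.write_bytes(payload)
    return target


def test_cache_is_private_to_the_test(tmp_path):
    # tmp_path is unique per test, so parallel tests cannot collide here.
    cache_dir = tmp_path / "cache"
    cache_dir.mkdir()

    target = write_to_cache(cache_dir, "1DPX.pdb", b"HEADER")

    assert target.read_bytes() == b"HEADER"
    # Nothing else is in the cache, because no other test shares it.
    assert [p.name for p in cache_dir.iterdir()] == ["1DPX.pdb"]
```

Because the directory is never shared, a test can also safely assert that a file is absent before the first download.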
import pytest
import requests

working_PDB_ID = "1DPX"  # egg white lysozyme
May be more in line with how pytest does it if you made it a fixture (can be module level) PDB_ID and then use it as a fixture in all tests that need it. This makes it explicit where it's used.
It also then becomes possible to make it a fixture that provides multiple PDB_IDs if that was needed (to run each test multiple times), e.g., if we later find out that specific PDBs create issues.
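A sketch of the suggested fixture, assuming a module-level list named PDB_IDS (a hypothetical name) that can later grow to cover problematic entries:

```python
import pytest

# Add further IDs here if specific entries turn out to cause problems.
PDB_IDS = ["1DPX"]  # egg white lysozyme


@pytest.fixture(scope="module", params=PDB_IDS)
def PDB_ID(request):
    """Provide each ID in PDB_IDS in turn to the tests that request it."""
    return request.param


def test_id_is_alphanumeric(PDB_ID):
    # Runs once per entry in PDB_IDS.
    assert PDB_ID.isalnum()
```

Any test that lists PDB_ID as an argument is run once for every entry in the list, which makes the dependency on the ID visible in the test signature.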
            if progress_bar:
                self._requests_progress_bar(r)
            else:
                self._file.write(r.content)
The method _requests_progress_bar is poorly named; I suggest something like _write_with_progressbar to indicate what it's actually doing.
        if self._download:
            try:
                r = requests.get(
                    f"https://files.rcsb.org/download/{self.id}.{self.file_format}",
I'd make the URL a class variable as it is what makes this class the PDB downloader.
                pb.update(chunk_size)

    def download(self, cache_path=None, timeout=None, progress_bar=False):
        """Downloads files from the Protein Data Bank"""
Needs more docs explaining the function and its arguments.
Needs to make clear that it can keep the content either in memory (StringIO) or as a file.
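As a starting point, a numpydoc-style docstring might read as in the sketch below. The wording is only a suggestion inferred from the signature and the _open_file hunk shown in this review (which uses io.BytesIO as the in-memory buffer); the class is a stub that exists only to carry the docstring, and the description of timeout is an assumption.

```python
class PDBDownloader:
    """Stub that exists only to carry the docstring sketch below."""

    def download(self, cache_path=None, timeout=None, progress_bar=False):
        """Download a structure file from the Protein Data Bank.

        Parameters
        ----------
        cache_path : str or pathlib.Path, optional
            Directory used as a cache. If ``None`` (default), the content
            is kept in an in-memory buffer (``io.BytesIO``) and nothing is
            written to disk. Otherwise the file is stored in this directory,
            and an existing file with the same name is reused instead of
            being downloaded again (its content is not checked).
        timeout : float, optional
            Seconds to wait for the server before giving up (assumed
            meaning; the relevant call is not shown in full in the diff).
        progress_bar : bool, optional
            If ``True``, show a progress bar while the content is written.
        """
        raise NotImplementedError("docstring sketch only")
```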
    def _open_file(self, cache_path):
        """This method either load/create cache or reserve a spot in memory to store topologt"""

        if cache_path is None:
            self._file = io.BytesIO()
            self._download = True

        else:
            cache_file_path = Path(cache_path) / self.file_name

            # Found Cache, so don't download anything and open existing file
            # Note this doesn't check the content of the file!
            if cache_file_path.exists() and cache_file_path.is_file():
                self._file = open(cache_file_path, "r")
                self._download = False

            else:  # No cache found, so create Cache
                self._file = open(cache_file_path, "wb")
                self._download = True

    def _requests_progress_bar(self, requests_response):
        """Puts a progress bar when writing content with a request object"""
        chunk_size = (
            1  # Files are so small that you can read them one byte at a time
        )

        with ProgressBar(
            total=len(requests_response.content),
            unit="B",
            unit_scale=True,
            desc=self.file_name,
        ) as pb:
            for byte in requests_response.iter_content(chunk_size=chunk_size):
                self._file.write(byte)
                pb.update(chunk_size)
The logic for caching should be common to all file-based downloaders. This is something I'd expect to see in a base class, so that the actual PDB one can then simply contain the URL to download from and a format hint to pass on to the Universe.
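A rough sketch of that split is shown below, under several assumptions: the attribute names url_template and file_format, the _fetch hook, and the use of urllib instead of requests are choices made for this example only, and the "pdb" format value is assumed. The URL pattern is the one from the diff.

```python
import io
import urllib.request
from pathlib import Path


class DownloaderBase:
    """Hypothetical base class that owns the download and cache logic."""

    url_template = None  # subclasses set this, e.g. ".../{id}.{file_format}"
    file_format = None   # format hint, also used as the file extension

    def __init__(self, entry_id):
        self.id = entry_id

    @property
    def file_name(self):
        return f"{self.id}.{self.file_format}"

    @property
    def url(self):
        return self.url_template.format(
            id=self.id, file_format=self.file_format
        )

    def _fetch(self, url, timeout=None):
        """Return the raw bytes at ``url``; separate so tests can stub it."""
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()

    def download(self, cache_path=None, timeout=None):
        """Return a binary file-like object with the file content."""
        if cache_path is None:
            # No cache requested: keep everything in memory.
            return io.BytesIO(self._fetch(self.url, timeout))
        cache_file = Path(cache_path) / self.file_name
        if not cache_file.is_file():
            # Cache miss: download once and store the file.
            cache_file.write_bytes(self._fetch(self.url, timeout))
        return open(cache_file, "rb")


class PDBDownloader(DownloaderBase):
    """Only the PDB-specific parts: where to download from and the format."""

    url_template = "https://files.rcsb.org/download/{id}.{file_format}"
    file_format = "pdb"
```

In this layout the file is always opened in binary mode, whether it comes from memory or from the cache, so callers see the same kind of object in both cases.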
from .downloaders import PDBDownloader


def fetch_pdb(
Perhaps add as coordinates.PDB.fetch() and then we can add fetch() functions to others if necessary.
Fixes #4907
Changes made in this Pull Request:
This is still a work in progress, but here's an implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.
- DownloaderBase and 'PDBDownloader' in order to implement downloading structure files from online sources such as the PDB databank.
- requests as a dependency
- mda.fetch_pdb() is implemented as a wrapper to commonly used options in 'PDBDownloader'
PR Checklist
- package/CHANGELOG file updated?
Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/