Migrate http client from requests to httpx async client #739
ttys0dev wants to merge 1 commit into freelawproject:main from
Conversation
Force-pushed from ce26428 to 8dd8337
mlissner left a comment
Interesting PR. I gave it a look through and it looks fine as far as I can tell, but I worry about how this will affect callers of this library. Won't they all need to be rewritten whenever this version goes live?
Yes, callers will need to be tweaked to make async calls, but that should be fairly straightforward.
Hm, I was afraid of that; this library is used by others as well. If this is an important change to make, we'll want to do a full version bump, or if there's a way to keep it backwards compatible, we'll want to do that. If we do do the version bump, is there a way for others using the library to upgrade without going full async? We should write something in the changelog about that too before releasing this.
There's not really a good way to do this without making things a lot more complex.
Yeah, I mean we can generally use |
So, for many years I've occasionally seen libraries that use tornado or other async frameworks, and whenever I did, I ran away as quickly as I could, knowing that it was a whole universe of bugs I didn't understand or want to learn about. I'm fully on board now that async is the right architecture for this library, but I'm afraid that if we make this library async-first, that'll have a similar effect on anybody that is considering using it, even if we document the heck out of it. I know as recently as a year ago, I'd have run away from this library as soon as I saw the word async. So I'm not sure what to do with this PR. Async is the right thing to do architecturally, but the wrong thing to do for the community, at least for now. A couple thoughts:
I'm also aware of the fact that Juriscraper has a number of things about it that are really bad:
I can't help but think that the thing to do is abandon ship and start breaking things out. First, we do async PACER in a new library; then we do states, one at a time, in a greenfield approach; then we see if there's anything else we care about that's left behind.
I feel like it was alluded to above, but what's wrong with synchronous wrappers around asynchronous core functions?
I do not think that the PACER/CMECF code is unpopular because it is "buried."
Well, it's just kind of ugly? Everybody that uses Juriscraper would have to sprinkle little
Sure, so we can fix that too with a clean break.
Sorry, no. I mean we retain our existing interface and make that interface a synchronous wrapper around the asynchronous core interface. No changes for juriscraper callers or users, unless they want to use async functions, in which case they have to call the async version.
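A minimal sketch of that shape, using only stdlib asyncio (function names are illustrative, not Juriscraper's actual API): the async core does the work, and the old synchronous entry point just runs it to completion.

```python
import asyncio

# Hypothetical async-native core; name and signature are illustrative,
# not Juriscraper's actual API.
async def _fetch_async(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real network I/O via httpx
    return f"fetched {url}"

def fetch(url: str) -> str:
    """Existing synchronous interface, now a thin wrapper over the async core.

    Note: asyncio.run() can't be called from inside an already-running
    event loop, so async callers should use _fetch_async directly.
    """
    return asyncio.run(_fetch_async(url))

print(fetch("https://example.com"))
```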
Oh, so you're proposing the Django thing of every method having an
Well, native asyncio is a bit different from the many older third-party library variants, since it's built into Python.
I think this is probably the best approach.
It was the opposite for me, generally: non-async functions doing network I/O always seemed like a potential performance foot-gun to be avoided.
IMO, for something like this, incremental refactoring is probably the better option.
I think it's better to have calling code handle the synchronous wrappers rather than implement them in juriscraper directly; we don't really want to make assumptions here about, say, the
What I like about forking is that it allows the old Juriscraper to continue living on (in infamy?), whereas if we don't fork, it's pretty hard to do bug fixes or anything like that for folks that are on the old version. This feels big enough to me that a fork makes sense. It's also probably time to fix all these many issues, and make a more focused PACER-only library.
Are there many others, and would they be incompatible? Django made its decision, I guess, but it feels like if Django can do a-prefixed async functions alongside the regular ones, we should be able to also. (But if we fork, this doesn't matter so much.)
I mean, you can create a legacy branch if there's any actual interest in a sync version still (which is kind of an unknown at this point).
Well there's stuff like
Well, Django internally is still more sync-oriented than async and tends to shim things in the reverse direction (i.e., sync-native functions shimmed to have async semantics using thread pools), while this would be native async shimmed to sync, so it's not quite the same.
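The two shim directions being contrasted can be sketched with stdlib asyncio (all names here are hypothetical, for illustration only):

```python
import asyncio

async def fetch_async(url: str) -> str:
    # Native-async function: the direction this PR introduces.
    await asyncio.sleep(0)  # stand-in for httpx network I/O
    return f"GET {url}"

def fetch_via_wrapper(url: str) -> str:
    # Native async shimmed to sync: run the coroutine to completion.
    return asyncio.run(fetch_async(url))

def legacy_fetch(url: str) -> str:
    # Native-sync blocking function: Django's more common starting point.
    return f"GET {url}"

async def fetch_via_thread(url: str) -> str:
    # Sync shimmed to async via a thread pool: the reverse direction
    # described above (what Django's shims typically do).
    return await asyncio.to_thread(legacy_fetch, url)

print(fetch_via_wrapper("https://example.com"))
print(asyncio.run(fetch_via_thread("https://example.com")))
```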
You're focused on the PACER part, right? Why not do an async-native PACER fork in a fresh library? It shouldn't be too terribly hard, and it'd be useful for others. We can start there, then do something for the state scrapers if we want to, in a separate approach/library/version/etc. But if we look just at PACER, I think that shouldn't be too terribly hard?
Well, I was trying to convert anything doing network I/O.
Maintaining two codebases that basically do the same thing seems to be a lot of extra work for no significant benefit (unless there really are a lot of applications that depend on juriscraper that can't be modified to handle an async API easily).
I don't propose we maintain the code here anymore. We would say that it's old and that we're focused on ajuriscraper or whatever we wind up calling it, and that we recommend people use that instead.
I mean, if the code here wouldn't be maintained, then what's the point of creating a separate project?
That would let us untether parts of Juriscraper. So PACER stuff goes into a new library and the PACER stuff here goes into a slow death spiral. Meanwhile, the rest of Juriscraper can continue advancing. It also gives a strong signal of "This is something new and different," and lets us think in greenfield kinds of ways.
It's a bit unclear to me what we want to untether exactly, but I'm not seeing an advantage to creating a separate juriscraper project for that.
I don't really see how that would be better than just refactoring things here as needed; then we don't have two implementations doing effectively the same thing.
If there are two different versions of juriscraper with significant overlapping functionality, I think it becomes less clear which one projects should be using versus a single version. Also, large flag-day upgrades tend to be difficult and risky in general, so they are often avoided in favor of, say, more incremental migrations/refactoring over time.
I think the biggest advantage is that when you Google for "pacer python" you'd get a package called

The second advantage is just making the package smaller in general. If all you want to do is scrape PACER, you get a lean library just for that. If you want to fix a bug in the PACER library, you can do that and run tests more quickly, without worrying about the rest, etc. I much prefer that as a library user, rather than big ones that do a lot more than I need. For example, if we do this, it gets a lot easier to document the library than if we want to document all of Juriscraper and its various modules.

The third thing is that it'd allow us to open our minds to bigger changes to the API (go async, clean up a few other things), because we can change things without worrying about breaking other people's systems or demanding that they sprinkle
I mean, juriscraper is the top active Python PACER scraping library on Google; the others are either completely abandoned/unmaintained or unrelated to PACER. Splitting things into a separate library may make things worse there.
You can run a subset of tests already, for example like this:
As long as the library is modular, so you can pull in only the functionality you need, I don't really think this is a big issue; and having some things in the same library does have benefits in regards to simplifying code sharing/maintenance.
I still don't see a good reason for forking the library unless there's a long-term plan to maintain both versions, which is kind of a PITA. By the way, I tried to find any active projects using juriscraper other than courtlistener but didn't see any, so this seems likely to be only a theoretical issue to some degree. I'd say we should refactor/clean things up here, and if anyone complains, see what migration/compatibility strategy best addresses their concerns at that point. Trying to anticipate in advance whether a change like going async will cause major downstream issues is hard without knowing specifics about the downstream users' use case/architecture.
Discussed via Slack, and @ttys0dev wins. I'll give this another look early next week.
Man, time flies, but it'd be good to return to this, if you're interested, @ttys0dev. The new challenge we have is that a lot of work is being done on Juriscraper now by our new developer, @grossir, and by @flooie, so we'd need to have a branch with this that stays up to date with their work until we cut this version, and then we'd quickly need to have a PR in CL that's ready to use the new version. I think I'd propose the following, but I'd love input from the three of you:
An optional step is to repeat this process for each module of Juriscraper, so we do PACER first, then opinions, then oral arguments, etc., but I'm not sure it's necessary. It would be safer, but it'd also wind up with Juriscraper 2.0, 2.1, 2.2, etc. (i.e., lots of breaking releases). Would the approach above be safe, incremental, and efficient to get this big change done?
Force-pushed from 79e866a to 4ca1e13
Rebased again; could we get this reviewed soon, since there are a lot of merge conflicts that keep showing up?
Force-pushed from 3ce903f to 2d53ce5
grossir left a comment
I suggest splitting this PR (and its CL counterpart) into X smaller PRs, where each team can check the modules they work on
I think I can review AbstractSite children (Opinion, OralArguments, FDSysSite)
    if has_cipher
    else httpx.AsyncClient(
        verify=has_cipher,  # WA has a certificate we don't understand
Just a note, not requesting a change here since this is in the current codebase too, but it may be confusing.
This is disabling verification for anything that doesn't have the cipher attribute, but we have a "verify" attribute.
Ended up just removing this; it shouldn't be needed anymore.
I don't see a way to do that due to how juriscraper is designed; essentially everything uses
grossir left a comment
Some minor comments. I will now check the CL integration
About splitting the PR, I think it is feasible
For example,
- state/texas scrapers don't use AbstractSite
- The PACER module has a BaseReport and BaseDocketReport that don't interact with AbstractSite
I think we should split this PR, since other parts will have to be reviewed by other teams, both here and their integrations with CL. The people on those teams will notice the small details (such as I am noticing in opinions)
Trying to merge everything in a single PR may be possible, but it will surely be slow. If we split it, we may get some parts merged faster.
        days_interval=days_interval,
    ).back_scrape_iterable
-   sites = site_yielder(bs_iterable, mod, **site_kwargs)
+   sites = await site_yielder(bs_iterable, mod, **site_kwargs)
Error:

    sites = await site_yielder(bs_iterable, mod, **site_kwargs)
    TypeError: object async_generator can't be used in 'await' expression

when trying to test a backscraper:

    python sample_caller.py -c juriscraper.opinions.united_states.territories.prapp --verbosity 3 --save-responses --backscrape --backscrape-start=2025/01/01 --backscrape-end=2025/02/01
Oh, it shouldn't have an await there; should be fixed now.
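For context, a minimal reproduction of that error and the fix: calling an async generator function returns an async generator object, which is consumed with `async for` rather than awaited (this `site_yielder` is a simplified stand-in, not the real one):

```python
import asyncio

async def site_yielder(items):
    # Simplified stand-in for the real site_yielder: an async generator.
    for item in items:
        yield item

async def main():
    # `sites = await site_yielder(...)` would raise:
    #   TypeError: object async_generator can't be used in 'await' expression
    # Calling it without await returns the generator, which is then
    # consumed with `async for`.
    sites = site_yielder(["site1", "site2"])
    return [s async for s in sites]

print(asyncio.run(main()))
```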
    def __del__(self):
        self.close_session()
    async def __aexit__(self):
incorrect signature?
https://docs.python.org/3/reference/datamodel.html#object.__aexit__
Why do you think that? Is this causing some error?
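Per the data model docs linked above, `__aexit__` takes three arguments beyond `self` (`exc_type`, `exc_value`, `traceback`); with the one-argument form, `async with` raises a TypeError when the block exits. A minimal sketch, where `Session` and `close_session` are stand-ins for the real class:

```python
import asyncio

class Session:
    async def close_session(self):
        # Stand-in for closing the underlying HTTP client.
        self.closed = True

    async def __aenter__(self):
        self.closed = False
        return self

    async def __aexit__(self, exc_type, exc_value, traceback):
        # The three extra arguments are required by the protocol;
        # `async def __aexit__(self)` would fail when the block exits.
        await self.close_session()
        return False  # don't suppress exceptions

async def main():
    async with Session() as s:
        pass
    return s.closed

print(asyncio.run(main()))  # True
```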
    transport = httpx.MockTransport(handler)
    s = httpx.AsyncClient(transport=transport)
    r = await s.get(url=self.url)
should use the download_url arg
In test mode, download_url can contain non-URL values such as ./bva_subexample_1.txt, and we can't pass a non-URL value as a url when calling s.get() here, as the URL validator will reject it. Since download_content appears to return only raw bytes and not a response object, I don't think using self.url will have any effect on the function output.
    return await self.fetch_document_link(case["docket"])

    return DeferringList(seed=self.cases, fetcher=fetcher)
    case_names = []
Just incorrect naming here; these are supposed to be download_urls.
Changed to download_urls.
I'm still a bit wary of this due to modules like lib being used across multiple areas of the codebase; also, by removing requests entirely we can get better confidence that there aren't any instances of accidental usage anywhere within juriscraper. And when integrating with courtlistener, especially around exception handling, it's a bit easier to reason about if we can just assume juriscraper only uses httpx and not requests anywhere.
This is rebased now that the AbstractSite parts have been merged. It migrates all remaining usage of requests to httpx.
MorganBennetDev left a comment
The Texas and SCOTUS code looks good to me with just one minor change needed. @Brennan-Chesley-FLP can take a closer look at the TAMES scraper, but it seems fine.
    elif isinstance(cookies, RequestsCookieJar):
        # Requests cookies. Convert to dict.
        requests_cookies = dict(cookies)
    httpx_cookies[cookie["name"]] = cookie["value"]
While we're changing this code could we just move the HTTPX check up and do an early return since (afaik) that's going to be the most common case? Then we don't need any elif or else blocks.
Refactored to return early for httpx Cookies.
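The early-return shape being suggested might look something like this sketch. `to_cookie_dict` is a hypothetical name, a plain dict stands in for the httpx-friendly case, and the stdlib `CookieJar` stands in for requests' `RequestsCookieJar` (which subclasses it), so the example runs without httpx or requests installed:

```python
from http.cookiejar import Cookie, CookieJar

def to_cookie_dict(cookies):
    # Hypothetical converter sketching the early-return shape: handle the
    # most common case first, so no elif/else chains are needed.
    if isinstance(cookies, dict):  # stand-in for the httpx.Cookies check
        return cookies
    if isinstance(cookies, CookieJar):  # requests' jar subclasses this
        return {c.name: c.value for c in cookies}
    raise TypeError(f"unsupported cookie container: {type(cookies)!r}")

# Build a stdlib jar with one cookie to exercise the fallback path.
jar = CookieJar()
jar.set_cookie(
    Cookie(
        0, "sid", "abc123", None, False, "example.com", True, False,
        "/", True, False, None, False, None, None, {},
    )
)
print(to_cookie_dict(jar))          # {'sid': 'abc123'}
print(to_cookie_dict({"k": "v"}))   # {'k': 'v'}
```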
        headers={"User-Agent": "Free Law Project"},
        timeout=timeout,
    )
    async with httpx.AsyncClient(http2=True) as client:
Why do we need a client for this single request?
Async requests with httpx require an explicit AsyncClient. Note that httpx internally uses a Client instance even for sync requests via a wrapper, but that's just a convenience that doesn't really have an equivalent for AsyncClient.
        params=params,
        extensions={"timeout": req_timeout.as_dict()},
    )
    r = await self.session.send(request)
This is attaching cookies in a way it didn't before, so it may not be as anonymous as the comments above are implying.
I'm not seeing where this would be adding cookies, but I added an explicit auth=None to the send() call, which should ensure any session auth credentials don't get used.
    if self.get_acms_tokens:
        for court_id in ["ca2", "ca9"]:
-           self.get_acms_auth_object(court_id)
+           await self.get_acms_auth_object(court_id)
I think there's a potential race here depending on how we're calling this, where we'll get a partial cookie jar. I /suspect/ we're fine though.
Hmm, I went ahead and refactored the cookie handling to update self.cookies all at once like before.
Related CL PR #6579