Migrate http client from requests to httpx async client #739
ttys0dev wants to merge 1 commit into freelawproject:main from
Conversation
Force-pushed from ce26428 to 8dd8337
mlissner left a comment
Interesting PR. I gave it a look through and it looks fine as far as I can tell, but I worry about how this will affect callers of this library. Won't they all need to be rewritten whenever this version goes live?
Yes, callers will need to be tweaked to make async calls, but that should be fairly straightforward.
Hm, I was afraid of that; this library is used by others as well. If this is an important change to make, we'll want to do a full version bump, or if there's a way to keep it backwards compatible, we'll want to do that. If we do do the version bump, is there a way for others using the library to upgrade without going full async? We should write something in the changelog about that too before releasing this.
There's not really a good way to do this without making things a lot more complex.
Yeah, I mean we can generally use |
So, for many years I've occasionally seen libraries that use tornado or other async frameworks, and whenever I did, I ran away as quickly as I could, knowing that it was a whole universe of bugs I didn't understand or want to learn about. I'm fully on board now that async is the right architecture for this library, but I'm afraid that if we make this library async-first, that'll have a similar effect on anybody that is considering using it, even if we document the heck out of it. I know as recently as a year ago, I'd have run away from this library as soon as I saw the word async. So I'm not sure what to do with this PR. Async is the right thing to do architecturally, but the wrong thing to do for the community, at least for now. A couple thoughts:
I'm also aware of the fact that Juriscraper has a number of things about it that are really bad:
I can't help but think that the thing to do is abandon ship and start breaking things out. First, we do async PACER in a new library; then we do states, one at a time, in a greenfield approach; then we see if there's anything else we care about that's left behind.
I feel like it was alluded to above, but what's wrong with synchronous wrappers around asynchronous core functions?
I do not think that the PACER/CMECF code is unpopular because it is "buried."
Well, it's just kind of ugly? Everybody that uses Juriscraper would have to sprinkle little
Sure, so we can fix that too with a clean break.
Sorry, no. I mean we retain our existing interface and make that interface a synchronous wrapper around the asynchronous core interface. No changes for juriscraper callers or users, unless they want to use async functions, in which case they have to call the async version.
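A minimal sketch of that shape, using only stdlib asyncio (function names are illustrative, not Juriscraper's actual API): the async core does the work, and the old synchronous entry point just runs it to completion.

```python
import asyncio

# Hypothetical async-native core; name and signature are illustrative,
# not Juriscraper's actual API.
async def _fetch_async(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real network I/O via httpx
    return f"fetched {url}"

def fetch(url: str) -> str:
    """Existing synchronous interface, now a thin wrapper over the async core.

    Note: asyncio.run() can't be called from inside an already-running
    event loop, so async callers should use _fetch_async directly.
    """
    return asyncio.run(_fetch_async(url))

print(fetch("https://example.com"))
```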
Oh, so you're proposing the Django thing of every method having an
Well, native asyncio is a bit different from the many older third-party library variants, since it's built into Python.
I think this is probably the best approach.
It was the opposite for me, generally: non-async functions doing network I/O always seemed like a potential performance foot-gun to be avoided.
IMO, for something like this, incremental refactoring is probably the better option.
I think it's better to have calling code handle the synchronous wrappers rather than implement them in juriscraper directly; we don't really want to make assumptions here about, say, the
What I like about forking is that it allows the old Juriscraper to continue living on (in infamy?), whereas if we don't fork, it's pretty hard to do bug fixes or anything like that for folks that are on the old version. This feels big enough to me that a fork makes sense. It's also probably time to fix all these many issues, and make a more focused PACER-only library.
Are there many others, and would they be incompatible? Django made its decision, I guess, but it feels like if Django can do a-prefixed async functions alongside the regular ones, we should be able to also. (But if we fork, this doesn't matter so much.)
I mean, you can create a legacy branch if there's any actual interest in a sync version still (which is kind of an unknown at this point).
Well there's stuff like
Well, Django internally is still more sync-oriented than async and tends to shim things in the reverse direction (i.e., sync-native functions shimmed to have async semantics using thread pools), while this would be native async shimmed to sync, so it's not quite the same.
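The two shim directions being contrasted can be sketched with stdlib asyncio (all names here are hypothetical, for illustration only):

```python
import asyncio

async def fetch_async(url: str) -> str:
    # Native-async function: the direction this PR introduces.
    await asyncio.sleep(0)  # stand-in for httpx network I/O
    return f"GET {url}"

def fetch_via_wrapper(url: str) -> str:
    # Native async shimmed to sync: run the coroutine to completion.
    return asyncio.run(fetch_async(url))

def legacy_fetch(url: str) -> str:
    # Native-sync blocking function: Django's more common starting point.
    return f"GET {url}"

async def fetch_via_thread(url: str) -> str:
    # Sync shimmed to async via a thread pool: the reverse direction
    # described above (what Django's shims typically do).
    return await asyncio.to_thread(legacy_fetch, url)

print(fetch_via_wrapper("https://example.com"))
print(asyncio.run(fetch_via_thread("https://example.com")))
```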
You're focused on the PACER part, right? Why not do an async-native PACER fork in a fresh library? It shouldn't be too terribly hard, and it'd be useful for others. We can start there, then do something for the state scrapers if we want to, in a separate approach/library/version/etc. But if we look just at PACER, I think that shouldn't be too terribly hard?
Well, I was trying to convert anything doing network I/O.
Maintaining two codebases that basically do the same thing seems to be a lot of extra work for no significant benefit (unless there really are a lot of applications that depend on juriscraper that can't be modified to handle an async API easily).
I don't propose we maintain the code here anymore. We would say that it's old and that we're focused on ajuriscraper or whatever we wind up calling it, and that we recommend people use that instead.
I mean, if the code here wouldn't be maintained, then what's the point of creating a separate project?
That would let us untether parts of Juriscraper. So PACER stuff goes into a new library and the PACER stuff here goes into a slow death spiral. Meanwhile, the rest of Juriscraper can continue advancing. It also gives a strong signal of "This is something new and different," and lets us think in greenfield kinds of ways.
It's a bit unclear to me what we want to untether exactly, but I'm not seeing an advantage to creating a separate juriscraper project for that.
I don't really see how that would be better than just refactoring things here as needed; then we don't have two implementations doing effectively the same thing.
If there are two different versions of juriscraper with significant overlapping functionality, I think it becomes less clear which one projects should be using versus a single version. Also, large flag-day upgrades tend to be difficult and risky in general, so they are often avoided in favor of, say, more incremental migrations/refactoring over time.
I think the biggest advantage is that when you Google for "pacer python" you'd get a package called

The second advantage is just making the package smaller in general. If all you want to do is scrape PACER, you get a lean library just for that. If you want to fix a bug in the PACER library, you can do that and run tests more quickly, without worrying about the rest, etc. I much prefer that as a library user, rather than big ones that do a lot more than I need. For example, if we do this, it gets a lot easier to document the library than if we want to document all of Juriscraper and its various modules.

The third thing is that it'd allow us to open our minds to bigger changes to the API (go async, clean up a few other things), because we can change things without worrying about breaking other people's systems or demanding that they sprinkle
I mean, juriscraper is the top active Python PACER scraping library on Google; the others are either completely abandoned/unmaintained or unrelated to PACER. Splitting things into a separate library may make things worse there.
You can run a subset of tests already, for example like this:
As long as the library is modular, so you can pull in only the functionality you need, I don't really think this is a big issue; and having some things in the same library does have benefits in regards to simplifying code sharing/maintenance.
I still don't see a good reason for forking the library unless there's a long-term plan to maintain both versions, which is kind of a PITA. By the way, I tried to find any active projects using juriscraper other than courtlistener but didn't see any, so this seems likely to be only a theoretical issue to some degree. I'd say we should refactor/clean things up here, and if anyone complains, see what migration/compatibility strategy best addresses their concerns at that point. Trying to anticipate in advance whether a change like going async will cause major downstream issues is hard without knowing specifics about the downstream users' use case/architecture.
Discussed via Slack, and @ttys0dev wins. I'll give this another look early next week.
Man, time flies, but it'd be good to return to this, if you're interested, @ttys0dev. The new challenge we have is that a lot of work is being done on Juriscraper now by our new developer, @grossir, and by @flooie, so we'd need to have a branch with this that stays up to date with their work until we cut this version, and then we'd quickly need to have a PR in CL that's ready to use the new version. I think I'd propose the following, but I'd love input from the three of you:
An optional step is to repeat this process for each module of Juriscraper, so we do PACER first, then opinions, then oral arguments, etc., but I'm not sure it's necessary. It would be safer, but it'd also wind up with Juriscraper 2.0, 2.1, 2.2, etc. (i.e., lots of breaking releases). Would the approach above be safe, incremental, and efficient to get this big change done?
Force-pushed from 79e866a to 4ca1e13
Rebased again; could we get this reviewed soon, since there are a lot of merge conflicts that keep showing up?
Force-pushed from 3ce903f to 2d53ce5
grossir left a comment
I suggest splitting this PR (and its CL counterpart) into X smaller PRs, where each team can check the modules they work on
I think I can review AbstractSite children (Opinion, OralArguments, FDSysSite)
    if has_cipher
    else httpx.AsyncClient(
        verify=has_cipher,  # WA has a certificate we don't understand
Just a note, not requesting a change here since this is in the current codebase too, but it may be confusing.
This is disabling verification for anything that doesn't have the cipher attribute, but we have a "verify" attribute.
Ended up just removing this; it shouldn't be needed anymore.
I don't see a way to do that due to how juriscraper is designed; essentially everything uses
grossir left a comment
Some minor comments. I will now check the CL integration
About splitting the PR, I think it is feasible
For example,
- state/texas scrapers don't use AbstractSite
- The PACER module has a BaseReport and BaseDocketReport that don't interact with AbstractSite
I think we should split this PR, since other parts will have to be reviewed by other teams, both here and their integrations with CL. The people on those teams will notice the small details (such as I am noticing in opinions)
Trying to merge everything in a single PR may be possible, but it will surely be slow. If we split it, we may get some parts merged faster.
        days_interval=days_interval,
    ).back_scrape_iterable
-   sites = site_yielder(bs_iterable, mod, **site_kwargs)
+   sites = await site_yielder(bs_iterable, mod, **site_kwargs)
Error:

    sites = await site_yielder(bs_iterable, mod, **site_kwargs)
    TypeError: object async_generator can't be used in 'await' expression

when trying to test a backscraper:

    python sample_caller.py -c juriscraper.opinions.united_states.territories.prapp --verbosity 3 --save-responses --backscrape --backscrape-start=2025/01/01 --backscrape-end=2025/02/01
Oh, it shouldn't have an await there; should be fixed now.
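For context, a minimal reproduction of that error and the fix: calling an async generator function returns an async generator object, which is consumed with `async for` rather than awaited (this `site_yielder` is a simplified stand-in, not the real one):

```python
import asyncio

async def site_yielder(items):
    # Simplified stand-in for the real site_yielder: an async generator.
    for item in items:
        yield item

async def main():
    # `sites = await site_yielder(...)` would raise:
    #   TypeError: object async_generator can't be used in 'await' expression
    # Calling it without await returns the generator, which is then
    # consumed with `async for`.
    sites = site_yielder(["site1", "site2"])
    return [s async for s in sites]

print(asyncio.run(main()))
```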
    def __del__(self):
        self.close_session()
    async def __aexit__(self):
incorrect signature?
https://docs.python.org/3/reference/datamodel.html#object.__aexit__
Why do you think that? Is this causing some error?
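Per the data model docs linked above, `__aexit__` takes three arguments beyond `self` (`exc_type`, `exc_value`, `traceback`); with the one-argument form, `async with` raises a TypeError when the block exits. A minimal sketch, where `Session` and `close_session` are stand-ins for the real class:

```python
import asyncio

class Session:
    async def close_session(self):
        # Stand-in for closing the underlying HTTP client.
        self.closed = True

    async def __aenter__(self):
        self.closed = False
        return self

    async def __aexit__(self, exc_type, exc_value, traceback):
        # The three extra arguments are required by the protocol;
        # `async def __aexit__(self)` would fail when the block exits.
        await self.close_session()
        return False  # don't suppress exceptions

async def main():
    async with Session() as s:
        pass
    return s.closed

print(asyncio.run(main()))  # True
```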
    transport = httpx.MockTransport(handler)
    s = httpx.AsyncClient(transport=transport)
    r = await s.get(url=self.url)
should use the download_url arg
In test mode, download_url can contain non-URL values such as ./bva_subexample_1.txt, and we can't pass a non-URL value as a url when calling s.get() here, as the URL validator will reject it. Since download_content appears to return only raw bytes and not a response object, I don't think using self.url will have any effect on the function output.
    return await self.fetch_document_link(case["docket"])

    return DeferringList(seed=self.cases, fetcher=fetcher)
    case_names = []
Just incorrect naming here; these are supposed to be download_urls.
Changed to download_urls.
I'm still a bit wary of this due to modules like lib being used across multiple areas of the codebase; also, by removing requests entirely we can get better confidence that there aren't any instances of accidental usage anywhere within juriscraper. And when integrating with courtlistener, especially around exception handling, it's a bit easier to reason about if we can just assume juriscraper only uses httpx and not requests anywhere.
This is rebased now that the AbstractSite parts have been merged. It migrates all remaining usage of requests to httpx.
MorganBennetDev left a comment
The Texas and SCOTUS code looks good to me with just one minor change needed. @Brennan-Chesley-FLP can take a closer look at the TAMES scraper, but it seems fine.
    elif isinstance(cookies, RequestsCookieJar):
        # Requests cookies. Convert to dict.
        requests_cookies = dict(cookies)
    httpx_cookies[cookie["name"]] = cookie["value"]
While we're changing this code could we just move the HTTPX check up and do an early return since (afaik) that's going to be the most common case? Then we don't need any elif or else blocks.
Refactored to return early for httpx Cookies.
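The early-return shape being suggested might look something like this sketch. `to_cookie_dict` is a hypothetical name, a plain dict stands in for the httpx-friendly case, and the stdlib `CookieJar` stands in for requests' `RequestsCookieJar` (which subclasses it), so the example runs without httpx or requests installed:

```python
from http.cookiejar import Cookie, CookieJar

def to_cookie_dict(cookies):
    # Hypothetical converter sketching the early-return shape: handle the
    # most common case first, so no elif/else chains are needed.
    if isinstance(cookies, dict):  # stand-in for the httpx.Cookies check
        return cookies
    if isinstance(cookies, CookieJar):  # requests' jar subclasses this
        return {c.name: c.value for c in cookies}
    raise TypeError(f"unsupported cookie container: {type(cookies)!r}")

# Build a stdlib jar with one cookie to exercise the fallback path.
jar = CookieJar()
jar.set_cookie(
    Cookie(
        0, "sid", "abc123", None, False, "example.com", True, False,
        "/", True, False, None, False, None, None, {},
    )
)
print(to_cookie_dict(jar))          # {'sid': 'abc123'}
print(to_cookie_dict({"k": "v"}))   # {'k': 'v'}
```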
        headers={"User-Agent": "Free Law Project"},
        timeout=timeout,
    )
    async with httpx.AsyncClient(http2=True) as client:
Why do we need a client for this single request?
Async requests with httpx require an explicit AsyncClient. Note that httpx internally uses a Client instance even for sync requests via a wrapper, but that's just a convenience that doesn't really have an equivalent for AsyncClient.
        params=params,
        extensions={"timeout": req_timeout.as_dict()},
    )
    r = await self.session.send(request)
This is attaching cookies in a way it didn't before, so it may not be as anonymous as the comments above are implying.
I'm not seeing where this would be adding cookies, but I added an explicit auth=None to the send() call, which should ensure any session auth credentials don't get used.
    if self.get_acms_tokens:
        for court_id in ["ca2", "ca9"]:
-           self.get_acms_auth_object(court_id)
+           await self.get_acms_auth_object(court_id)
I think there's a potential race here depending on how we're calling this, where we'll get a partial cookie jar. I /suspect/ we're fine though.
Hmm, I went ahead and refactored the cookie handling to update self.cookies all at once like before.
Related CL PR #6579