@thismanyboyfriends2

Scraper type(s)

  • performerByName
  • performerByFragment
  • performerByURL
  • sceneByName
  • sceneByQueryFragment
  • sceneByFragment
  • sceneByURL
  • groupByURL
  • galleryByFragment
  • galleryByURL
  • imageByFragment
  • imageByURL

Examples to test

Short description

A brand-new MeanBitches scraper in Python to replace the old simple scraper, which only supported the old (now defunct) site.

Now supports:

  • new website
  • fragment and search scraping
  • multiple fallbacks for images when one doesn't exist
  • searching paginated pages in parallel when querying
  • caching for search results
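
For illustration only, the image fallback and search caching listed above could look roughly like the sketch below. This is not the code from the PR: `first_available_image`, `make_cached_search`, and the `exists` and `search` callables are hypothetical names, and the real scraper may structure this differently.

```python
from typing import Callable, Dict, Iterable, List, Optional


def first_available_image(
    candidates: Iterable[str], exists: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate URL that exists, or None if none do.

    `exists` is a hypothetical callable, e.g. a function that issues a
    HEAD request and checks for a 200 response.
    """
    for url in candidates:
        if exists(url):
            return url
    return None


def make_cached_search(
    search: Callable[[str], List[str]],
) -> Callable[[str], List[str]]:
    """Wrap a search function so repeated queries hit an in-memory cache."""
    cache: Dict[str, List[str]] = {}

    def cached(query: str) -> List[str]:
        # Normalise the query so "Foo" and " foo " share one cache entry.
        key = query.strip().lower()
        if key not in cache:
            cache[key] = search(key)
        return cache[key]

    return cached
```

The candidates are tried in order, so the preferred image source goes first and the least preferred last.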

@thismanyboyfriends2 thismanyboyfriends2 changed the title feat: new python meanbitches scraper to replace old one New python meanbitches scraper to replace old one Nov 16, 2025
@feederbox826
Collaborator

This smells like LLM, and the new site looks like it can be parsed just fine with XPath, but the LLM is too lazy to implement it properly.

Manifest should not be included under any circumstance.

There's no reason to use urllib.request when requests is already included in the base requirements.

Converting to draft. The LLM is overcomplicating it; there's no need for async, thread-safe Python.

@feederbox826 feederbox826 marked this pull request as draft November 17, 2025 05:06
@thismanyboyfriends2
Author

So I already made an XPath scraper before I made the Python one. The main issue really is the cover image: it is not available on any page that also includes a trailer. Therefore the only way to get that image is to go back to the performer page and iterate through its pages to find the thumbnail of that scene. The vast majority of the other data can be fetched with the XPath scraper, but I considered the image too important a piece of information to omit.
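
A minimal sketch of that lookup, assuming each performer page can be reduced to a list of (scene title, thumbnail URL) pairs. `find_scene_thumbnail`, `fetch_page`, and `max_pages` are hypothetical names for illustration, not the code in this PR:

```python
from typing import Callable, List, Optional, Tuple

# One performer page, modelled as (scene_title, thumbnail_url) pairs.
Page = List[Tuple[str, str]]


def find_scene_thumbnail(
    scene_title: str,
    fetch_page: Callable[[int], Page],
    max_pages: int = 50,
) -> Optional[str]:
    """Walk the performer's paginated scene list until the scene is found.

    `fetch_page(n)` is a hypothetical callable that downloads and parses
    page n, returning an empty list once n is past the last page.
    """
    wanted = scene_title.strip().casefold()
    for page_number in range(1, max_pages + 1):
        entries = fetch_page(page_number)
        if not entries:
            return None  # ran past the last page without a match
        for title, thumbnail_url in entries:
            if title.strip().casefold() == wanted:
                return thumbnail_url
    return None
```

The walk is sequential and stops at the first match, so the cost grows with how deep the scene sits in the performer's listing.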

Yeah, the parallel search with async/threading was done by an LLM - this was an attempt at an optimisation because searching through lots of paginated pages just to find an image was taking a while. Happy to remove it, I just found it worked better.
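
For reference, the threaded variant amounts to something like the sketch below, using only the standard library. It assumes the page numbers are known up front; `find_thumbnail_parallel`, `fetch_page`, and `page_numbers` are hypothetical names, not the code in this PR:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple

Page = List[Tuple[str, str]]  # (scene_title, thumbnail_url) pairs


def find_thumbnail_parallel(
    scene_title: str,
    fetch_page: Callable[[int], Page],
    page_numbers: Iterable[int],
    max_workers: int = 4,
) -> Optional[str]:
    """Fetch several performer pages concurrently and scan them in order."""
    wanted = scene_title.strip().casefold()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of page_numbers, so the
        # first match is the same one a sequential walk would find.
        for entries in pool.map(fetch_page, page_numbers):
            for title, thumbnail_url in entries:
                if title.strip().casefold() == wanted:
                    # Leaving the with-block still waits for fetches
                    # that are already in flight to finish.
                    return thumbnail_url
    return None
```

The trade-off is that every page in `page_numbers` is submitted for fetching, even when the match is on the first one.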

I'll submit a redo in a bit with your suggestions, but I do think it needs to be Python.

