
Refactor list format to reduce the burden of maintaining the list #308

Open
Mangochicken13 wants to merge 35 commits into laylavish:main from Mangochicken13:main

Conversation

@Mangochicken13

Purpose

Addresses #204 and #164, and makes addressing #54 and #301 significantly easier: a new output format only needs an entry added to the ublock_formats dictionary in the new list_generator.py file (line 297).

Moving forward would also include looking at prefixing most, if not all, URLs with a . to address #158 and #198, and potentially changing the Google wrapper to address #268.
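For context, here is a sketch of the kind of per-format entry described above, assuming ublock_formats maps a format name to a rendering rule for a bare domain. The keys and templates are illustrative; the actual structure in list_generator.py may differ.

```python
# Hypothetical shape of the ublock_formats table: one rendering rule
# per output format. Names and templates are illustrative only.
ublock_formats = {
    "ublock":     lambda domain: f"||{domain}^",        # uBlock Origin filter
    "ublacklist": lambda domain: f"*://*.{domain}/*",   # uBlacklist match pattern
    "hosts":      lambda domain: f"0.0.0.0 {domain}",   # hosts-file entry
}

def render(domain: str, fmt: str) -> str:
    """Render one domain in the requested output format."""
    return ublock_formats[fmt](domain)

print(render("example.com", "hosts"))  # 0.0.0.0 example.com
```

With this shape, addressing a new format request is a one-line dictionary addition rather than a new script.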

Changes

  1. Lists have been split up into folders for:
    • Common pages (i.e. base URLs that are shared between the uBlockOrigin, uBlacklist, and hosts formats)
    • SubPages (i.e. specific subdomains, users, or blogs under a parent page)
    • Nuclear (the nuclear-option files shared between uBlockOrigin and uBlacklist)
    • Elements (the extra elements blocked in the uBlockOrigin list)
  2. The source lists have been split out of the original list.txt file to better sort and organise sites, and to document why they are on the list
  3. The newly split lists have been (mostly) organised alphabetically under their headers, to better organise and de-duplicate page entries
    • The YouTube list is a notable exception, as it has to be sorted manually due to the mix of channel names and channel IDs
  4. Python scripts have been created to "alphabeticise" the source lists, and to compile these source lists into the blocklists under the Export/ directory
    • These currently have to be run manually by the maintainer; I am hoping to work on the .githooks folder so that generation runs before files are committed
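The "alphabeticise" step from point 4 could look roughly like the sketch below, which assumes source lists are plain text with #-prefixed section headers and one entry per line. That file layout is my assumption, not the actual script.

```python
# Sketch: sort the entry lines under each "#" header while leaving the
# headers (and blank lines) exactly where they are.
def alphabeticise(lines):
    out, block = [], []

    def flush():
        # Emit the accumulated entries in case-insensitive order.
        out.extend(sorted(block, key=str.lower))
        block.clear()

    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            flush()            # a header/blank ends the current block
            out.append(line)
        else:
            block.append(stripped)
    flush()                    # don't forget the final block
    return out

print(alphabeticise(["# Stock sites", "vecteezy.com", "dreamstime.com"]))
# ['# Stock sites', 'dreamstime.com', 'vecteezy.com']
```

Sorting per header block, rather than globally, keeps the category structure intact while still making duplicates within a section obvious.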

Regressions

  • The regexes in the uBlacklist list are not currently preserved; this can be changed if deemed necessary before merging:
    # // Non-Conflicting Regular Expressions
    /stock\.adobe\.com\/.*(generat(ed|(ive))-ai|ai-generated)/
    /vecteezy\.com\/.*(ai-generat((ed)|(ive))|generat(ed?|(ive))-ai)/
    /pixabay\.com\/.*(ai(-|%20)generated|\/ai%20anime|\/ai-creation)/
    /freepik\.com\/.*(ai-generat(ed|or)|generative-ai)/
    /dreamstime\.com\/.*(generat(ed|ive)-(ai|art)|ai-generat(ive|ed))/
    /deviantart\.com\/.*(-generative-ai|ai-generat(ive|ed)|-ai-art)/i
    /shutterstock\.com\/.*(image-generated|generated-by-ai|ai-image-generator|image_type=generated)/
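If preserving these rules is deemed necessary, one possible approach is to detect raw regex lines and copy them through to the uBlacklist output unchanged. The heuristic below (a line wrapped in slashes with optional trailing flags) is a sketch, not the generator's actual logic.

```python
import re

# A uBlacklist regex rule looks like /pattern/ with optional flags
# (e.g. the trailing "i" on the deviantart rule above). This is an
# assumed heuristic for recognising such lines in a source list.
REGEX_RULE = re.compile(r"^/.*/[a-z]*$")

def is_regex_rule(line: str) -> bool:
    """Return True if the line should be passed through verbatim."""
    return bool(REGEX_RULE.match(line.strip()))

print(is_regex_rule(r"/freepik\.com\/.*generative-ai/"))  # True
print(is_regex_rule("example.com"))                       # False
```

The compiler could then emit matching lines verbatim under a "Non-Conflicting Regular Expressions" header instead of rewriting them as match patterns.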

@DanWaLes

Looks good, but IMO it needs better separation for DNS rules (like the hosts file), which I've mentioned in #312. I don't mind going through all the sites and categorising them. It's important to group the domains by purpose and add lists as needed, instead of completely blocking everything.

For hosts file support, there is no benefit to having separate www and non-www versions of the file; both need to be in the same file.

It may be beneficial to allow alternative DNS blocking syntaxes when generating the files. I know enough of the syntax for hosts, AdBlock, domain/subdomain lists, and dnsmasq rules.
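For reference, here is one domain rendered in each of the syntaxes mentioned; the rule shapes reflect my understanding of each format rather than anything defined in this repository.

```python
# Illustrative renderings of a single blocked domain in several
# DNS-level blocklist syntaxes (names and shapes are my assumptions).
def dns_rules(domain: str) -> dict:
    return {
        "hosts":   f"0.0.0.0 {domain}",     # classic hosts-file entry
        "adblock": f"||{domain}^",          # AdBlock-style domain anchor
        "domains": domain,                  # plain domain list, one per line
        "dnsmasq": f"address=/{domain}/",   # dnsmasq: domain + subdomains
    }

for fmt, rule in dns_rules("example.com").items():
    print(f"{fmt}: {rule}")
```

A generator could iterate this table the same way it iterates the uBlock formats, emitting one output file per syntax.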

I'm not sure if this project is actively being maintained by the author, but this does look like a good new base to stem from.

@Mangochicken13
Author

    Looks good, but IMO it needs better separation for DNS rules (like the hosts file), which I've mentioned in #312. I don't mind going through all the sites and categorising them. It's important to group the domains by purpose and add lists as needed, instead of completely blocking everything.

I'm absolutely open to a better organisation system; the current version in this PR is much more of a proof of concept than a finalised system. But going through all the sites and checking whether they're still active, let alone what to categorise them under, was a massive task that I did not want to do lmao

    For hosts file support, there is no benefit to having separate www and non-www versions of the file; both need to be in the same file.

Can definitely clean this up rq. Would it be better to have them work the way the Twitter sections currently get created (each URL/domain has all of its versions grouped together), or to just append the www versions below everything else? I'd imagine the former is better for the case where someone wants to customise the list.
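The "grouped together" option could be sketched like this, with the www variant emitted directly under its bare domain (hosts_entries is a hypothetical helper name, not part of the current scripts):

```python
# Sketch: emit each domain's www. variant immediately after the bare
# domain, so related hosts-file entries stay together for anyone
# hand-editing the generated list.
def hosts_entries(domains):
    lines = []
    for domain in domains:
        lines.append(f"0.0.0.0 {domain}")
        if not domain.startswith("www."):
            lines.append(f"0.0.0.0 www.{domain}")
    return lines

print(hosts_entries(["example.com"]))
# ['0.0.0.0 example.com', '0.0.0.0 www.example.com']
```

Appending all www entries at the end of the file would produce the same blocking behaviour, but grouping keeps each site's entries adjacent, which is friendlier to manual customisation.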

    It may be beneficial to allow alternative DNS blocking syntaxes when generating the files. I know enough of the syntax for hosts, AdBlock, domain/subdomain lists, and dnsmasq rules.

Can absolutely chuck this in as well; should be simple enough. I can go and find the formatting for everything mentioned if need be, but if you have them to hand, either dropping a comment or a PR would be much appreciated (including the comment character for each format, to replace the ! if necessary).
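A hedged sketch of what such a per-format comment-character table might look like (the format names and mapping are illustrative, not the project's actual configuration):

```python
# Hypothetical table of comment characters per output syntax, so
# generated headers can swap AdBlock's "!" for "#" where needed.
comment_chars = {
    "adblock": "!",   # AdBlock-style lists use "!" for comments
    "hosts":   "#",   # hosts files, dnsmasq confs, and plain domain
    "dnsmasq": "#",   # lists all use "#"
    "domains": "#",
}

def header(fmt: str, text: str) -> str:
    """Render a one-line comment/header in the given format."""
    return f"{comment_chars[fmt]} {text}"

print(header("hosts", "Generated list"))  # "# Generated list"
```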
