fix: Council Fix Pack - December 2025 #1705
base: master
Conversation
Walkthrough

This PR adds two new council scrapers (Blackpool, Harlow), migrates several councils from direct HTTP to Selenium or updated HTML parsing, normalises/pads UPRNs, adjusts headers/user-agent usage, fixes bin-type mapping for Lambeth, updates date logic for Thurrock, and updates tests and documentation.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant FifeOld as Fife (Old: API)
    participant FifeNew as Fife (New: Selenium)
    participant Server as Fife Server
    rect rgb(240, 240, 255)
        Note over FifeOld,Server: Previous Flow (Blocked by API)
        User->>FifeOld: parse_data(postcode)
        FifeOld->>Server: GET /api/citizen
        Server-->>FifeOld: 403 Forbidden
        FifeOld-->>User: Error
    end
    rect rgb(230, 255, 230)
        Note over FifeNew,Server: New Flow (Selenium-driven)
        User->>FifeNew: parse_data(postcode, paon)
        FifeNew->>Server: Browser navigate (Selenium)
        Server-->>FifeNew: Rendered HTML
        FifeNew->>FifeNew: Wait & bypass protections
        FifeNew->>FifeNew: Fill postcode, select address
        FifeNew->>Server: Trigger search / request data
        Server-->>FifeNew: Rendered collections table
        FifeNew->>FifeNew: Parse HTML -> {"bins":[...]}
        FifeNew-->>User: Result
    end
```

```mermaid
sequenceDiagram
    participant User
    participant RushmoorOld as Rushmoor (Old: HTML parse)
    participant RushmoorNew as Rushmoor (New: API)
    participant Server as Rushmoor Server
    rect rgb(240, 240, 255)
        Note over RushmoorOld: Old parsing relied on page structure and could hit NoneType
        User->>RushmoorOld: parse_data(page, uprn)
        RushmoorOld->>RushmoorOld: soup.find("p").contents[0] (fragile)
        RushmoorOld-->>User: AttributeError (NoneType)
    end
    rect rgb(230, 255, 230)
        Note over RushmoorNew,Server: New Flow (direct API)
        User->>RushmoorNew: parse_data(uprn)
        RushmoorNew->>Server: GET /api?uprn={uprn}
        Server-->>RushmoorNew: JSON response
        RushmoorNew->>RushmoorNew: Extract collections, format dates
        RushmoorNew-->>User: {"bins":[...]}
    end
```

```mermaid
sequenceDiagram
    participant User
    participant Blackpool as Blackpool (New)
    participant Server as Blackpool API
    rect rgb(230, 255, 230)
        Note over Blackpool,Server: Blackpool token + POST flow
        User->>Blackpool: parse_data(uprn, postcode)
        Blackpool->>Server: GET /token endpoint (with headers)
        Server-->>Blackpool: token
        Blackpool->>Server: POST /PremiseJobs with uprn, postcode, token
        Server-->>Blackpool: PremiseJobs JSON
        Blackpool->>Blackpool: Build {"bins":[...]} sorted
        Blackpool-->>User: Result
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #1705   +/-   ##
=======================================
  Coverage   86.79%   86.79%
=======================================
  Files           9        9
  Lines        1136     1136
=======================================
  Hits          986      986
  Misses        150      150
```

☔ View full report in Codecov by Sentry.
BREAKING CHANGE fix: robbrad#1693 - This scraper now uses postcode and house number instead of UPRN. It also requires a Selenium browser
fix: robbrad#1706 - London Borough of Lambeth
fix: robbrad#1721 Mid Sussex
fix: robbrad#1720 - Thurrock Council
fix: robbrad#1718 - Argyll and Bute Council - UPRN needs to be 12 digits long
fix: robbrad#1712 - South Lanarkshire Council - Could potentially return a date in the past for that week
fix: robbrad#1711 - Northumberland Council - Requires 12-digit UPRN
fix: robbrad#1209 Halton Borough Council
fix: robbrad#1689 - Wiltshire Council
feat: robbrad#1640 Adding Blackpool Council
feat: robbrad#1639 Adding Harlow Council
fix: robbrad#1724 - Rushmoor Council
Actionable comments posted: 7
🧹 Nitpick comments (9)
uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py (2)
36-40: Date update fixes the immediate issue; consider dynamic date calculation for long-term maintainability.

The hardcoded date updates from 2024 to 2025 correctly restore bin collection data generation. However, these dates will require manual maintenance in the future, creating recurring technical debt.
Consider refactoring to calculate start dates dynamically based on the current date and known collection schedules, or move these dates to a configuration file with validation. This would eliminate the need for annual manual updates.
Example approach:
```python
# Calculate the most recent start date for the current year
def get_round_start_dates(year=None):
    if year is None:
        year = datetime.now().year
    # Define the initial dates for the current year
    round_a_blue_brown = datetime(year, 11, 17)
    round_a_green_grey = datetime(year, 11, 24)
    # Return dates based on round
    return {
        'A': {'blue_brown': round_a_blue_brown, 'green_grey': round_a_green_grey},
        'B': {'green_grey': round_a_green_grey, 'blue_brown': round_a_blue_brown},
    }
```
1-1: Note: Static analysis flagged star import (pre-existing).

Ruff flags that `datetime` may be undefined due to the star import on line 1. While the code works correctly (the `datetime` class is available from the common module), explicit imports are generally preferred for clarity and to avoid namespace pollution.

This is pre-existing code and not introduced by this PR, so it's low priority. If refactoring in the future, consider replacing the star import with explicit imports:
```python
from uk_bin_collection.uk_bin_collection.common import datetime, timedelta, get_dates_every_x_days
```

uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
112-112: Variable name `type` shadows Python built-in.

Using `type` as a loop variable shadows the built-in `type()` function. Consider renaming to `bin_type` or `collection_type_item` for clarity.

Apply this diff to use a clearer variable name:
```diff
-        for type in collection_types:
+        for bin_type in collection_types:
             dict_data = {
-                "type": type,
+                "type": bin_type,
                 "collectionDate": collectiondate,
             }
```

uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (1)
64-82: Consider addressing `date_format` F405 and making cadence parsing more robust

Ruff's F405 for `date_format` is a style/linting issue but may fail CI depending on project settings. Two options:

- Import explicitly instead of relying on `import *`, e.g.:

```python
from uk_bin_collection.uk_bin_collection.common import date_format
```

  (and import other symbols you need explicitly), or

- Keep the wildcard import but add a local `# noqa: F405` where `date_format` is used if that's the established pattern in this repo.

Separately, since `schedule_cadence` is only used for a substring check, you can simplify and make it safer by operating on the full cell text and lowering it:

```diff
-                schedule_cadence = row.find("td").get_text().strip().split(" ")[1]
+                td_text = row.find("td").get_text().strip()
+                schedule_cadence = td_text.lower()
                 ...
-                    if "Fortnightly" in schedule_cadence:
+                    if "fortnightly" in schedule_cadence:
```

This avoids depending on a fixed token position and is more resilient to minor HTML/content changes.
uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py (1)
22-25: User-Agent header looks good; consider adding a request timeout

The added `headers` (User-Agent) on the `requests.get` call are a sensible fix for remote blocking. To avoid potential hangs on network issues, consider adding an explicit timeout:

```diff
-    response = requests.get(url, headers=headers)
+    response = requests.get(url, headers=headers, timeout=20)
```

Adjust the timeout value as appropriate for typical council response times.
Also applies to: 47-47
uk_bin_collection/uk_bin_collection/councils/LondonBoroughLambeth.py (1)
45-52: Commercial bin disambiguation looks good; consider case-insensitive match

The new logic properly splits ambiguous `"commercial bin"` containers into `"recycling"` vs `"refuse"` based on `Container["Name"]`, which should prevent entity collapse while keeping other types unchanged.

For a bit more robustness against API text changes, you could optionally make the match case-insensitive:
```python
name = Container["Name"]
if "recycling" in name.lower():
    Bin_Type = "recycling"
else:
    Bin_Type = "refuse"
```

uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py (1)
18-33: API flow is much cleaner; add basic HTTP and JSON guards to avoid regressions

The switch to a direct Umbraco API with UPRN looks good and should resolve the earlier setup issues. To harden this against future content changes (and avoid reintroducing the `NoneType.contents` failure):
- Add a timeout and status check on the HTTP call:
```diff
-        response = requests.get(
-            "https://www.rushmoor.gov.uk/Umbraco/Api/BinLookUpWorkAround/Get?",
-            params=params,
-            headers=headers,
-        )
+        response = requests.get(
+            "https://www.rushmoor.gov.uk/Umbraco/Api/BinLookUpWorkAround/Get?",
+            params=params,
+            headers=headers,
+            timeout=20,
+        )
+        response.raise_for_status()
```
- Guard the `<p>` lookup and `NextCollection` extraction so a missing node produces a clear error instead of an AttributeError:

```diff
-        result = soup.find("p").contents[0]
-
-        json_data = json.loads(result)["NextCollection"]
+        p_tag = soup.find("p")
+        if p_tag is None or not p_tag.contents:
+            raise ValueError("Unexpected response format from Rushmoor API")
+        result = p_tag.contents[0]
+
+        json_data = (json.loads(result).get("NextCollection") or {})
```
if len(collections) < 1: raise ValueError("No collections found")will still correctly signal the “no data” case.Also applies to: 35-45
uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py (1)
22-23: New HTTP-based scraper is a good replacement; tighten a few robustness edges

The move away from Selenium to a `requests` + BeautifulSoup pipeline is a solid simplification and aligns with the updated Mid Sussex site. A few low-friction tweaks would make this more resilient:
- Add timeouts to all HTTP calls
```diff
-        r = session.get(
-            URI,
-        )
+        r = session.get(URI, timeout=20)
 ...
-        r = session.post(nextpageurl, data)
+        r = session.post(nextpageurl, data=data, timeout=20)
 ...
-        r = session.get(
-            nextpageurl,
-        )
+        r = session.get(nextpageurl, timeout=20)
```
- Avoid `.find(...).find("a")` on a possible `None`

```diff
-        alink = soup.find("div", id="property_list").find("a")
-
-        if alink is None:
-            raise Exception("Address not found")
+        property_list = soup.find("div", id="property_list")
+        if property_list is None:
+            raise Exception("Address list not found")
+        alink = property_list.find("a")
+        if alink is None:
+            raise Exception("Address not found")
```
- Likewise for `section#scheduled-collections`

```diff
-        u1s = soup.find("section", id="scheduled-collections").find_all(
-            "ul", class_="displayinlineblock"
-        )
+        section = soup.find("section", id="scheduled-collections")
+        if section is None:
+            raise Exception("Scheduled collections section not found")
+        u1s = section.find_all("ul", class_="displayinlineblock")
```
- Minor readability

  `data` is used as the POST payload while `bindata` holds output. Renaming the POST dict to `payload` (or similar) would make the intent clearer without any behavioural change.

These changes preserve the current success path while turning markup/API drift into clear, controlled exceptions.
Also applies to: 27-37, 43-48, 49-56, 58-66, 68-76, 78-80, 82-99, 101-102
uk_bin_collection/tests/input.json (1)
232-240: Blackpool config looks consistent; optionally mark as `skip_get_url`.

The new `BlackpoolCouncil` entry correctly provides both `uprn` and `postcode` expected by the scraper. Because the scraper makes its own API calls and ignores the initial `page` content, consider adding `"skip_get_url": true` here to avoid an unnecessary preliminary GET of `url`.
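For illustration, an entry with that flag might look roughly like this (the uprn and postcode values below are placeholders, not the ones used in this PR):

```json
"BlackpoolCouncil": {
    "skip_get_url": true,
    "uprn": "000000000000",
    "postcode": "FY1 1AA",
    "url": "https://www.blackpool.gov.uk",
    "wiki_name": "Blackpool"
}
```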
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- uk_bin_collection/tests/input.json (5 hunks)
- uk_bin_collection/uk_bin_collection/councils/ArgyllandButeCouncil.py (2 hunks)
- uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py (2 hunks)
- uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py (3 hunks)
- uk_bin_collection/uk_bin_collection/councils/FifeCouncil.py (2 hunks)
- uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/LondonBoroughLambeth.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/NorthumberlandCouncil.py (2 hunks)
- uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (2 hunks)
- uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py (1 hunks)
- uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (3 hunks)
- wiki/Councils.md (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
create_webdriver(321-360)
uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
check_uprn(67-78)
uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py (2)
uk_bin_collection/uk_bin_collection/get_bin_data.py (1)
AbstractGetBinDataClass (43-146)
uk_bin_collection/uk_bin_collection/common.py (2)
check_uprn (67-78)
check_postcode (36-49)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
custom_components/uk_bin_collection/calendar.py (1)
event(54-63)
uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py (3)
uk_bin_collection/uk_bin_collection/councils/NuneatonBedworthBoroughCouncil.py (1)
get_bin_data (63-945)
uk_bin_collection/uk_bin_collection/get_bin_data.py (1)
AbstractGetBinDataClass (43-146)
uk_bin_collection/uk_bin_collection/common.py (1)
check_uprn(67-78)
uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
check_postcode(36-49)
🪛 Ruff (0.14.5)
uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py
36-36: datetime may be undefined, or defined from star imports
(F405)
37-37: datetime may be undefined, or defined from star imports
(F405)
39-39: datetime may be undefined, or defined from star imports
(F405)
40-40: datetime may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py
40-40: create_webdriver may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py
19-19: check_uprn may be undefined, or defined from star imports
(F405)
29-29: requests may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py
5-5: from uk_bin_collection.uk_bin_collection.common import * used; unable to detect undefined names
(F403)
17-17: Unused method argument: page
(ARG002)
21-21: check_uprn may be undefined, or defined from star imports
(F405)
22-22: check_postcode may be undefined, or defined from star imports
(F405)
41-41: Probable use of requests call without timeout
(S113)
59-59: Probable use of requests call without timeout
(S113)
77-77: datetime may be undefined, or defined from star imports
(F405)
80-80: date_format may be undefined, or defined from star imports
(F405)
85-85: datetime may be undefined, or defined from star imports
(F405)
85-85: date_format may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
74-74: date_format may be undefined, or defined from star imports
(F405)
82-82: date_format may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py
47-47: Probable use of requests call without timeout
(S113)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py
102-102: datetime may be undefined, or defined from star imports
(F405)
105-105: date_format may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py
6-6: from uk_bin_collection.uk_bin_collection.common import * used; unable to detect undefined names
(F403)
18-18: Unused method argument: page
(ARG002)
21-21: check_uprn may be undefined, or defined from star imports
(F405)
32-32: Probable use of requests call without timeout
(S113)
48-48: datetime may be undefined, or defined from star imports
(F405)
54-54: date_format may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/FifeCouncil.py
27-27: check_postcode may be undefined, or defined from star imports
(F405)
28-28: check_paon may be undefined, or defined from star imports
(F405)
38-38: create_webdriver may be undefined, or defined from star imports
(F405)
66-66: Unused lambda argument: d
(ARG005)
83-85: Abstract raise to an inner function
(TRY301)
83-85: Avoid specifying long messages outside the exception class
(TRY003)
102-104: Abstract raise to an inner function
(TRY301)
102-104: Avoid specifying long messages outside the exception class
(TRY003)
126-126: date_format may be undefined, or defined from star imports
(F405)
131-131: date_format may be undefined, or defined from star imports
(F405)
uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py
25-25: check_postcode may be undefined, or defined from star imports
(F405)
29-29: requests may be undefined, or defined from star imports
(F405)
44-44: Abstract raise to an inner function
(TRY301)
44-44: Create your own exception
(TRY002)
44-44: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Abstract raise to an inner function
(TRY301)
64-64: Create your own exception
(TRY002)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
76-76: Abstract raise to an inner function
(TRY301)
76-76: Create your own exception
(TRY002)
76-76: Avoid specifying long messages outside the exception class
(TRY003)
90-90: datetime may be undefined, or defined from star imports
(F405)
93-93: date_format may be undefined, or defined from star imports
(F405)
98-98: datetime may be undefined, or defined from star imports
(F405)
98-98: date_format may be undefined, or defined from star imports
(F405)
101-101: Consider moving this statement to an else block
(TRY300)
uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py
110-112: Abstract raise to an inner function
(TRY301)
110-112: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (14)
uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py (2)
17-37: LGTM! The updated docstring clearly documents the new dynamic ICS lookup flow and parameter requirements.

119-145: LGTM! The ICS event parsing logic correctly handles events within the next 60 days, splits comma-separated collection types, and ensures proper driver cleanup in the finally block.
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (4)
30-30: LGTM: UPRN padding implemented correctly.

The 12-digit UPRN padding aligns with fixes for other councils in this PR and ensures consistent data format.

57-57: LGTM: Clean data structure initialization.

The `data_bins` initialization follows the unified aggregation approach and will accumulate results across all requested months.

82-82: LGTM: Endpoint updated to wastecollectioncalendar.

This endpoint change addresses issue #1689 to ensure all collection types are returned.

110-110: Verify the separator format used by Wiltshire Council's API.

The review raises a valid concern: while the code assumes " and " (with spaces) as the separator, no Wiltshire-specific test data or documentation was found in the codebase to confirm this. Other councils in the same project use different separators (comma, forward slash, newline). Without access to actual Wiltshire API responses, the assumption cannot be verified. Check Wiltshire Council's live data to confirm the collection type format is always "Type1 and Type2" with spaces, and not alternatives like "&", "+", or formats without consistent spacing.
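Purely as an illustration of a more tolerant approach (not code from this PR; the helper name and regex are hypothetical):

```python
import re

def split_collection_types(collection_type: str) -> list[str]:
    # Accept " and ", "&" or "+" as separators, with flexible surrounding whitespace
    parts = re.split(r"\s*(?:\band\b|&|\+)\s*", collection_type)
    return [part for part in (p.strip() for p in parts) if part]

print(split_collection_types("Household waste and Recycling"))  # ['Household waste', 'Recycling']
```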
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (1)
4-6: BeautifulSoup import is fine

The `BeautifulSoup` import and usage remain standard and appropriate for this parser; no issues here.

uk_bin_collection/uk_bin_collection/councils/NorthumberlandCouncil.py (1)
22-22: UPRN normalization is correct and aligns with other councils

Coercing `user_uprn` to `str` and `zfill(12)` before using it in `select_by_value` matches the 12‑digit UPRN requirement and should fix the lookup issues without regressions for already‑padded values.

Also applies to: 43-43
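As a quick standalone sketch of the padding behaviour described above (the UPRN value here is made up):

```python
user_uprn = 77106099  # hypothetical example value
padded = str(user_uprn).zfill(12)
print(padded)  # -> 000077106099; values that are already 12 digits pass through unchanged
```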
uk_bin_collection/uk_bin_collection/councils/ArgyllandButeCouncil.py (1)
1-1: Datetime import and 12‑digit UPRN padding are appropriate

Importing `datetime` matches its later usage for parsing and sorting, and zero‑padding `user_uprn` to 12 digits before `select_by_value` aligns with the council's UPRN format requirements and the pattern used in Northumberland.

Also applies to: 31-31
uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py (1)
38-41: Explicit user-agent for Selenium driver is a solid improvement

Passing a realistic `user_agent` into `create_webdriver` should help with remote/SaaS blocking while remaining compatible with the existing helper signature. No further changes needed here.
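For reference, the call pattern used elsewhere in this PR (e.g. the new Fife scraper further down) looks roughly like this; the exact user-agent string is illustrative:

```python
# web_driver and headless come from the scraper kwargs; create_webdriver is the project's common helper
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
driver = create_webdriver(web_driver, headless, user_agent, __name__)
```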
40-40: Documentation updates match the new scraper behaviours

The additions and edits for Blackpool, Harlow, Fife, Mid Sussex, and Rushmoor are consistent with the corresponding scraper implementations and PR objectives:

- Blackpool/Harlow: new entries with `-u` (and `-p` for Blackpool).
- Fife: switched to postcode + house number + Selenium, matching the new Selenium-based scraper.
- Mid Sussex: removed web driver parameter and updated note for the non-Selenium HTML flow.
- Rushmoor: now uses the base URL and a separate `-u` UPRN parameter as per the new API-based scraper.

Given this file is auto-generated from `uk_bin_collection/tests/input.json`, the changes look correct and in sync with the code.

Also applies to: 144-144, 675-685, 1728-1735, 1949-1958, 2557-2564, 3246-3251
uk_bin_collection/tests/input.json (3)
1144-1151: Harlow config matches new scraper expectations.

The `HarlowCouncil` entry provides an `uprn` and base `url` as required by the new scraper; wiki fields and LAD code look consistent with other councils.

1564-1570: Mid Sussex sample updated to match “street name only” requirement.

The `house_number` example ("OAKLANDS") now aligns with the wiki note that the street name should be passed via the `house_number` parameter; this looks correct.

2048-2051: Rushmoor config aligns with reworked scraper flow.

Switching `url` to the base Rushmoor site and adding an explicit `uprn` with an updated wiki note matches the described change to accept a separate UPRN field. No issues spotted.
| "FifeCouncil": { | ||
| "uprn": "320203521", | ||
| "postcode": "KY16 9NQ", | ||
| "house_number": "1", | ||
| "web_driver": "http://selenium:4444", | ||
| "url": "https://www.fife.gov.uk", | ||
| "wiki_command_url_override": "https://www.fife.gov.uk", | ||
| "wiki_name": "Fife", | ||
| "wiki_note": "You will need to use [FindMyAddress](https://www.findmyaddress.co.uk/search) to find the UPRN.", | ||
| "wiki_note": "Pass in the house number and postcode parameters. This parser requires a Selenium webdriver.", | ||
| "LAD24CD": "S12000047" |
Fife config uses house_number but parser reads paon – this will fail address matching.
FifeCouncil.CouncilClass.parse_data currently pulls the address from kwargs["paon"], but this JSON only defines house_number. In typical flows where kwargs are taken directly from this file, paon will be None, so the dropdown search will never match and a ValueError will be raised for every run.
Add a paon value (or align the parser to also read house_number) so the config and implementation agree. For example:
"FifeCouncil": {
"postcode": "KY16 9NQ",
- "house_number": "1",
+ "house_number": "1",
+ "paon": "1",
"web_driver": "http://selenium:4444",
"url": "https://www.fife.gov.uk",📝 Committable suggestion
| "FifeCouncil": { | |
| "uprn": "320203521", | |
| "postcode": "KY16 9NQ", | |
| "house_number": "1", | |
| "web_driver": "http://selenium:4444", | |
| "url": "https://www.fife.gov.uk", | |
| "wiki_command_url_override": "https://www.fife.gov.uk", | |
| "wiki_name": "Fife", | |
| "wiki_note": "You will need to use [FindMyAddress](https://www.findmyaddress.co.uk/search) to find the UPRN.", | |
| "wiki_note": "Pass in the house number and postcode parameters. This parser requires a Selenium webdriver.", | |
| "LAD24CD": "S12000047" | |
| "FifeCouncil": { | |
| "postcode": "KY16 9NQ", | |
| "house_number": "1", | |
| "paon": "1", | |
| "web_driver": "http://selenium:4444", | |
| "url": "https://www.fife.gov.uk", | |
| "wiki_command_url_override": "https://www.fife.gov.uk", | |
| "wiki_name": "Fife", | |
| "wiki_note": "Pass in the house number and postcode parameters. This parser requires a Selenium webdriver.", | |
| "LAD24CD": "S12000047" |
```python
    def parse_data(self, page: str, **kwargs) -> dict:

        user_uprn = kwargs.get("uprn")
        user_postcode = kwargs.get("postcode")
        check_uprn(user_uprn)
        check_postcode(user_postcode)
        bindata = {"bins": []}

        headers = {
            "Accept": "*/*",
            "Accept-Language": "en-GB,en;q=0.9",
            "Connection": "keep-alive",
            "DNT": "1",
            "Origin": "https://www.blackpool.gov.uk",
            "Referer": "https://www.blackpool.gov.uk/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
            "sec-ch-ua": '"Google Chrome";v="141", "Not?A_Brand";v="8", "Chromium";v="141"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"Windows"',
        }

        response = requests.get(
            "https://api.blackpool.gov.uk/live//api/bartec/security/token",
            headers=headers,
        )

        token = response.text.strip().replace('"', "")

        json_data = {
            "UPRN": user_uprn,
            "USRN": "",
            "PostCode": user_postcode,
            "StreetNumber": "",
            "CurrentUser": {
                "UserId": "",
                "Token": token,
            },
        }

        response = requests.post(
            "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
            headers=headers,
            json=json_data,
        )

        # Parse the JSON response
        bin_collection = response.json()

        # Loop through each collection in bin_collection
        for collection in bin_collection["jobsField"]:

            job = collection["jobField"]
            date = job.get("scheduledStartField")
            bin_type = job.get("nameField", "") or job.get("descriptionField", "")

            dict_data = {
                "type": bin_type,
                "collectionDate": datetime.strptime(
                    date,
                    "%Y-%m-%dT%H:%M:%S",
                ).strftime(date_format),
            }
            bindata["bins"].append(dict_data)

        bindata["bins"].sort(
            key=lambda x: datetime.strptime(x.get("collectionDate"), date_format)
        )

        return bindata
```
🧩 Analysis chain
Blackpool scraper should use timeouts, check HTTP status, and handle missing jobs defensively.
As written:
- Both the token `GET` and `PremiseJobs` `POST` have no `timeout` and don't validate status, so a stuck endpoint can hang forever and 4xx/5xx responses will cause obscure JSON/Key errors.
- Iteration over `bin_collection["jobsField"]` and `collection["jobField"]` assumes those keys always exist and that `scheduledStartField` and a name/description are present.
You can keep behaviour but make it much more robust like this:
- response = requests.get(
- "https://api.blackpool.gov.uk/live//api/bartec/security/token",
- headers=headers,
- )
-
- token = response.text.strip().replace('"', "")
+ response = requests.get(
+ "https://api.blackpool.gov.uk/live//api/bartec/security/token",
+ headers=headers,
+ timeout=120,
+ )
+ response.raise_for_status()
+
+ token = response.text.strip().replace('"', "")
@@
- response = requests.post(
- "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
- headers=headers,
- json=json_data,
- )
-
- # Parse the JSON response
- bin_collection = response.json()
-
- # Loop through each collection in bin_collection
- for collection in bin_collection["jobsField"]:
-
- job = collection["jobField"]
- date = job.get("scheduledStartField")
- bin_type = job.get("nameField", "") or job.get("descriptionField", "")
-
- dict_data = {
- "type": bin_type,
- "collectionDate": datetime.strptime(
- date,
- "%Y-%m-%dT%H:%M:%S",
- ).strftime(date_format),
- }
- bindata["bins"].append(dict_data)
+ response = requests.post(
+ "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
+ headers=headers,
+ json=json_data,
+ timeout=120,
+ )
+ response.raise_for_status()
+
+ bin_collection = response.json()
+
+ for collection in bin_collection.get("jobsField", []):
+ job = collection.get("jobField") or {}
+ date = job.get("scheduledStartField")
+ bin_type = job.get("nameField") or job.get("descriptionField")
+ if not (date and bin_type):
+ continue
+
+ dict_data = {
+ "type": bin_type,
+ "collectionDate": datetime.strptime(
+ date,
+ "%Y-%m-%dT%H:%M:%S",
+ ).strftime(date_format),
+ }
+ bindata["bins"].append(dict_data)This keeps the same output structure while avoiding hangs and hard crashes on minor API changes.
🏁 Script executed:
#!/bin/bash
# Verify BlackpoolCouncil.py exists and contains the code mentioned in review
find . -name "BlackpoolCouncil.py" -type f

Length of output: 66
🏁 Script executed:
#!/bin/bash
# Inspect the parse_data method in BlackpoolCouncil.py
cat uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py | head -100

Length of output: 2865
🏁 Script executed:
#!/bin/bash
# Check how other council scrapers handle requests (e.g., timeout patterns)
rg -n "requests\.(get|post)" --type py -C 2 uk_bin_collection/uk_bin_collection/councils/ | head -60Length of output: 5470
🏁 Script executed:
#!/bin/bash
# Search for timeout usage in the codebase to see if it's a standard pattern
rg -n "timeout\s*=" --type py uk_bin_collection/uk_bin_collection/councils/ | head -20Length of output: 2862
🏁 Script executed:
#!/bin/bash
# Look for response.raise_for_status() usage in other scrapers
rg -n "raise_for_status" --type py uk_bin_collection/Length of output: 11821
Add timeout parameters and HTTP status validation to both API requests; use defensive JSON parsing.
The BlackpoolCouncil scraper omits critical error handling that is standard across 50+ other council implementations in the codebase. Both the token request and PremiseJobs request lack timeouts and status validation:
- Without `timeout`, network issues will cause indefinite hangs
- Without `raise_for_status()`, error responses (4xx/5xx) bypass validation and cause JSON parse failures
- Direct dict access (`bin_collection["jobsField"]`) crashes on API schema changes instead of gracefully skipping incomplete records
Apply the suggested changes:
- response = requests.get(
- "https://api.blackpool.gov.uk/live//api/bartec/security/token",
- headers=headers,
- )
-
- token = response.text.strip().replace('"', "")
+ response = requests.get(
+ "https://api.blackpool.gov.uk/live//api/bartec/security/token",
+ headers=headers,
+ timeout=120,
+ )
+ response.raise_for_status()
+
+ token = response.text.strip().replace('"', "")
@@
- response = requests.post(
- "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
- headers=headers,
- json=json_data,
- )
-
- # Parse the JSON response
- bin_collection = response.json()
-
- # Loop through each collection in bin_collection
- for collection in bin_collection["jobsField"]:
-
- job = collection["jobField"]
- date = job.get("scheduledStartField")
- bin_type = job.get("nameField", "") or job.get("descriptionField", "")
-
- dict_data = {
- "type": bin_type,
- "collectionDate": datetime.strptime(
- date,
- "%Y-%m-%dT%H:%M:%S",
- ).strftime(date_format),
- }
- bindata["bins"].append(dict_data)
+ response = requests.post(
+ "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
+ headers=headers,
+ json=json_data,
+ timeout=120,
+ )
+ response.raise_for_status()
+
+ bin_collection = response.json()
+
+ for collection in bin_collection.get("jobsField", []):
+ job = collection.get("jobField") or {}
+ date = job.get("scheduledStartField")
+ bin_type = job.get("nameField") or job.get("descriptionField")
+ if not (date and bin_type):
+ continue
+
+ dict_data = {
+ "type": bin_type,
+ "collectionDate": datetime.strptime(
+ date,
+ "%Y-%m-%dT%H:%M:%S",
+ ).strftime(date_format),
+ }
+ bindata["bins"].append(dict_data)🧰 Tools
🪛 Ruff (0.14.5)
17-17: Unused method argument: page
(ARG002)
21-21: check_uprn may be undefined, or defined from star imports
(F405)
22-22: check_postcode may be undefined, or defined from star imports
(F405)
41-41: Probable use of requests call without timeout
(S113)
59-59: Probable use of requests call without timeout
(S113)
77-77: datetime may be undefined, or defined from star imports
(F405)
80-80: date_format may be undefined, or defined from star imports
(F405)
85-85: datetime may be undefined, or defined from star imports
(F405)
85-85: date_format may be undefined, or defined from star imports
(F405)
```python
                calendar_url = f"https://www.chelmsford.gov.uk/bins-and-recycling/check-your-collection-day/{day}-{letter}-collection-calendar/"
                driver.get(calendar_url)
                soup = BeautifulSoup(driver.page_source, features="html.parser")
                a = soup.find(
                    "a", href=lambda h: h and h.lower().endswith(".ics")
                )
                if a:
                    ics_url = a["href"]
                else:
                    raise ValueError(
                        f"Could not find collection ICS file for address: {user_paon}"
                    )
```
Handle relative URLs when extracting ICS href.
The a["href"] attribute extracted at line 108 may be a relative URL (e.g., "/bins-and-recycling/file.ics" or "../../file.ics"). The events() function at line 124 expects an absolute URL and will fail if given a relative path.
Apply this diff to convert potentially relative URLs to absolute:
+from urllib.parse import urljoin
+
# (at line 101-112)
calendar_url = f"https://www.chelmsford.gov.uk/bins-and-recycling/check-your-collection-day/{day}-{letter}-collection-calendar/"
driver.get(calendar_url)
soup = BeautifulSoup(driver.page_source, features="html.parser")
a = soup.find(
"a", href=lambda h: h and h.lower().endswith(".ics")
)
if a:
- ics_url = a["href"]
+ ics_url = urljoin(calendar_url, a["href"])
else:
raise ValueError(
f"Could not find collection ICS file for address: {user_paon}"
)Additionally, consider improving the error message to be more specific:
else:
raise ValueError(
- f"Could not find collection ICS file for address: {user_paon}"
+ f"Could not find ICS link on calendar page ({day}-{letter}) for address: {user_paon}"
)Committable suggestion skipped: line range outside the PR's diff.
```diff
     def parse_data(self, page: str, **kwargs) -> dict:
-        # Get and check UPRN
-        user_uprn = kwargs.get("uprn")
-        check_uprn(user_uprn)
-        bindata = {"bins": []}
-
-        API_URL = "https://www.fife.gov.uk/api/custom?action=powersuite_bin_calendar_collections&actionedby=bin_calendar&loadform=true&access=citizen&locale=en"
-        AUTH_URL = "https://www.fife.gov.uk/api/citizen?preview=false&locale=en"
-        AUTH_KEY = "Authorization"
-
-        r = requests.get(AUTH_URL)
-        r.raise_for_status()
-        auth_token = r.headers[AUTH_KEY]
-
-        post_data = {
-            "name": "bin_calendar",
-            "data": {
-                "uprn": user_uprn,
-            },
-            "email": "",
-            "caseid": "",
-            "xref": "",
-            "xref1": "",
-            "xref2": "",
-        }
-
-        headers = {
-            "referer": "https://www.fife.gov.uk/services/forms/bin-calendar",
-            "accept": "application/json",
-            "content-type": "application/json",
-            AUTH_KEY: auth_token,
-        }
-
-        r = requests.post(API_URL, data=json.dumps(post_data), headers=headers)
-        r.raise_for_status()
-
-        result = r.json()
-
-        for collection in result["data"]["tab_collections"]:
-            dict_data = {
-                "type": collection["colour"],
-                "collectionDate": datetime.strptime(
-                    collection["date"],
-                    "%A, %B %d, %Y",
-                ).strftime("%d/%m/%Y"),
-            }
-            bindata["bins"].append(dict_data)
-
-        bindata["bins"].sort(
-            key=lambda x: datetime.strptime(x.get("collectionDate"), "%d/%m/%Y")
-        )
+        driver = None
+        try:
+            # Get and check UPRN
+            user_postcode = kwargs.get("postcode")
+            user_paon = kwargs.get("paon")
+            check_postcode(user_postcode)
+            check_paon(user_paon)
+
+            web_driver = kwargs.get("web_driver")
+            headless = kwargs.get("headless")
+
+            bindata = {"bins": []}
+
+            URL = "https://fife.portal.uk.empro.verintcloudservices.com/site/fife/request/bin_calendar"
+
+            user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+            driver = create_webdriver(web_driver, headless, user_agent, __name__)
+            driver.get(URL)
+
+            wait = WebDriverWait(driver, 30)
+
+            ID_POSTCODE = "dform_widget_ps_45M3LET8_txt_postcode"
+            ID_SEARCH_BTN = "dform_widget_ps_3SHSN93_searchbutton"
+            ID_ADDRESS_SELECT = "dform_widget_ps_3SHSN93_id"
+            ID_COLLECTIONS = "dform_table_tab_collections"
+
+            # Wait for initial page load and Cloudflare bypass
+            wait.until(lambda d: "Just a moment" not in d.title and d.title != "")
+            time.sleep(3)
+
+            # Wait for the postcode field to appear then populate it
+            inputElement_postcode = wait.until(
+                EC.presence_of_element_located((By.ID, ID_POSTCODE))
+            )
+            inputElement_postcode.send_keys(user_postcode)
+
+            # Click search button
+            findAddress = wait.until(EC.element_to_be_clickable((By.ID, ID_SEARCH_BTN)))
+            findAddress.click()
+
+            # Wait for the 'Select address' dropdown to appear and select option matching the house name/number
+            select_el = wait.until(
+                EC.visibility_of_element_located((By.ID, ID_ADDRESS_SELECT))
+            )
+            wait.until(lambda d: len(Select(select_el).options) > 1)
+
+            paon_norm = str(user_paon).strip().casefold()
+            sel = Select(select_el)
+
+            time.sleep(10)
+
+            def _best_option():
+                # Prefer exact contains on visible text; fallback to casefold contains
+                for opt in sel.options:
+                    txt = (opt.text or "").strip()
+                    if paon_norm and paon_norm in txt.casefold():
+                        return opt
+                return None
+
+            opt = _best_option()
+            if not opt:
+                raise ValueError(
+                    f"Could not find an address containing '{user_paon}' in the dropdown."
+                )
+            sel.select_by_visible_text(opt.text)
+
+            # After selecting, the collections table should (re)render; wait for it
+            wait.until(EC.presence_of_element_located((By.ID, ID_COLLECTIONS)))
+            # Also wait until at least one data row is present (beyond headers)
+            wait.until(
+                lambda d: len(
+                    d.find_elements(By.CSS_SELECTOR, f"#{ID_COLLECTIONS} .dform_tr")
+                )
+                > 1
+            )
+
+            soup = BeautifulSoup(driver.page_source, features="html.parser")
+
+            table = soup.find("div", id=ID_COLLECTIONS)
+            if not table:
+                raise ValueError(
+                    f"Could not find collections table by id='{ID_COLLECTIONS}'"
+                )
+
+            rows = table.find_all("div", class_="dform_tr")
+
+            # Skip header row (first row with .dform_th entries)
+            for row in rows[1:]:
+                tds = row.find_all("div", class_="dform_td")
+                if len(tds) < 3:
+                    continue
+
+                # Colour comes from the <img alt="...">
+                colour_cell = tds[0]
+                img = colour_cell.find("img")
+                colour = img.get("alt").strip() if img and img.has_attr("alt") else None
+
+                # Date text
+                raw_date = tds[1].get_text(strip=True)
+                # Example: "Wednesday, November 12, 2025"
+                dt = datetime.strptime(raw_date, "%A, %B %d, %Y")
+
+                dict_data = {
+                    "type": colour,
+                    "collectionDate": dt.strftime(date_format),
+                }
+                bindata["bins"].append(dict_data)
+
+            bindata["bins"].sort(
+                key=lambda x: datetime.strptime(x.get("collectionDate"), date_format)
+            )
+
+        except Exception as e:
+            # Here you can log the exception if needed
+            print(f"An error occurred: {e}")
+            # Optionally, re-raise the exception if you want it to propagate
+            raise
+        finally:
+            # This block ensures that the driver is closed regardless of an exception
+            if driver:
+                driver.quit()
```
Fife parser expects paon but config/wikis only supply house_number.
Inside parse_data you do:
```python
user_postcode = kwargs.get("postcode")
user_paon = kwargs.get("paon")
check_postcode(user_postcode)
check_paon(user_paon)
...
paon_norm = str(user_paon).strip().casefold()
```

However, the FifeCouncil entry in tests/input.json now only defines house_number, and the wiki text tells users to pass a “house number” parameter, not paon. In flows that rely on those kwargs, user_paon will be None, so _best_option will end up searching for "none" and always raise the “Could not find an address” ValueError.
To make this work with both existing conventions and the new config, have the parser fall back to house_number when paon is absent:
- user_postcode = kwargs.get("postcode")
- user_paon = kwargs.get("paon")
- check_postcode(user_postcode)
- check_paon(user_paon)
+ user_postcode = kwargs.get("postcode")
+ user_paon = kwargs.get("paon") or kwargs.get("house_number")
+ check_postcode(user_postcode)
+ check_paon(user_paon)
@@
- paon_norm = str(user_paon).strip().casefold()
+ paon_norm = str(user_paon).strip().casefold()(Only the source of user_paon changes.)
This keeps support for any callers already using paon while making the documented house_number parameter work for Fife.
```python
    def parse_data(self, page: str, **kwargs) -> dict:

        user_uprn = kwargs.get("uprn")
        check_uprn(user_uprn)
        bindata = {"bins": []}

        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
        }

        params = {
            "uprn": user_uprn,
        }

        response = requests.get(
            "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
            params=params,
            headers=headers,
        )

        soup = BeautifulSoup(response.text, features="html.parser")

        summary = soup.find("div", {"class": "summary"})
        collectionrows = summary.find_all("div", {"class": "collectionsrow"})

        for collectionrow in collectionrows:
            bin_type = collectionrow.find("div", {"class": "col-xs-4"})
            collection_time = collectionrow.find("div", {"class": "col-sm-6"})

            if bin_type and collection_time:
                collectionDate = datetime.strptime(
                    collection_time.text.strip(), "%a - %d %b %Y"
                )

                dict_data = {
                    "type": bin_type.text.strip(),
                    "collectionDate": collectionDate.strftime(date_format),
                }
                bindata["bins"].append(dict_data)

        return bindata
```
🧩 Analysis chain
Harden Harlow HTTP/HTML handling: add timeouts/status checks and guard missing .summary.
Right now the scraper can hang indefinitely or crash on layout/HTTP changes:
- `requests.get` has no `timeout` and doesn't check `status_code`; a slow or broken endpoint will block indefinitely, and 4xx/5xx HTML will fall through to parsing.
- `summary = soup.find("div", {"class": "summary"})` is assumed to succeed; if it's missing, `summary.find_all(...)` will raise `AttributeError`.
Consider tightening this up along these lines:
- response = requests.get(
- "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
- params=params,
- headers=headers,
- )
-
- soup = BeautifulSoup(response.text, features="html.parser")
-
- summary = soup.find("div", {"class": "summary"})
- collectionrows = summary.find_all("div", {"class": "collectionsrow"})
+ response = requests.get(
+ "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
+ params=params,
+ headers=headers,
+ timeout=120,
+ )
+ response.raise_for_status()
+
+ soup = BeautifulSoup(response.text, features="html.parser")
+
+ summary = soup.find("div", {"class": "summary"})
+ if not summary:
+ raise ValueError("Could not find bin collection summary on Harlow page")
+ collectionrows = summary.find_all("div", {"class": "collectionsrow"})You can also safely drop the unused import time at the top.
---
🏁 Script executed:
```shell
#!/bin/bash
# Locate and examine HarlowCouncil.py file
fd -t f "HarlowCouncil.py" 2>/dev/null | head -20
```

Length of output: 61
🏁 Script executed:
#!/bin/bash
# Examine the full HarlowCouncil.py file to check imports and implementation
if [ -f "uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py" ]; then
cat "uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py"
fi

Length of output: 1911
🏁 Script executed:
#!/bin/bash
# Check for any try-except or error handling patterns in the file
rg -n "timeout|raise_for_status|try|except" uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py -C2 2>/dev/nullLength of output: 0
🏁 Script executed:
#!/bin/bash
# Check if requests module has timeout defaults or if other council scrapers use timeouts
rg -n "timeout=" uk_bin_collection/uk_bin_collection/councils/ -t py 2>/dev/null | head -20Length of output: 2826
🏁 Script executed:
#!/bin/bash
# Check import patterns in the file for 'time' module
rg -n "^import|^from" uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py 2>/dev/nullLength of output: 209
Add timeout, status check, and guard against missing .summary div.
The code lacks critical robustness safeguards:
- `requests.get()` (lines 33–38) has no `timeout` and doesn't check `response.status_code`; requests can hang indefinitely, and error pages will pass through to parsing. Other council scrapers in this codebase use `timeout=20` or higher as standard.
- `summary.find_all()` (line 43) assumes the `.summary` div exists; if the HTML changes, `None.find_all()` will raise `AttributeError`.
- `import time` (line 1) is unused and should be removed.
Apply the suggested changes:
response = requests.get(
"https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
params=params,
headers=headers,
+ timeout=120,
)
+ response.raise_for_status()
soup = BeautifulSoup(response.text, features="html.parser")
summary = soup.find("div", {"class": "summary"})
+ if not summary:
+ raise ValueError("Could not find bin collection summary on Harlow page")
collectionrows = summary.find_all("div", {"class": "collectionsrow"})Also remove the unused import time at line 1.
```python
        for day in week_days:
            for row in collection_schedule:
                schedule_type = row.find("th").get_text().strip()
                results2 = re.search("([^(]+)", row.find("td").get_text().strip())
                schedule_cadence = row.find("td").get_text().strip().split(" ")[1]
                if results2:
                    schedule_day = results2[1].strip()
                    for collection_type in week_collection_types:
                        collectionDate = None
                        if collection_type in schedule_type.lower():
                            if (
                                day.weekday()
                                == time.strptime(schedule_day, "%A").tm_wday
                            ):
                                dict_data = {
                                    "type": schedule_type,
                                    "collectionDate": day.strftime(date_format),
                                }
                                data["bins"].append(dict_data)
                                collectionDate = day.strftime(date_format)
                        else:
                            if "Fortnightly" in schedule_cadence:
                                if (
                                    day.weekday()
                                    == time.strptime(schedule_day, "%A").tm_wday
                                ):
                                    day = day + timedelta(days=7)
                                    collectionDate = day.strftime(date_format)

                        if schedule_type and collectionDate:
                            dict_data = {
                                "type": schedule_type,
                                "collectionDate": collectionDate,
                            }
                            data["bins"].append(dict_data)
```
Fortnightly cadence branch is mis-attached and can generate incorrect dates/types
The new cadence logic has a couple of problems that can lead to wrong or missing collections:
-
elseis attached to the wrong condition (logic bug)- Currently:
else:(Line 75) pairs withif collection_type in schedule_type.lower():(Line 69).- This means the “Fortnightly” branch runs only when the collection_type does not match the
schedule_type, and never when it does match.
- Consequences:
- Fortnightly adjustment is skipped for the actual matching bin type.
- For non-matching
collection_typevalues, the fortnightly branch can still set acollectionDate, thenif schedule_type and collectionDate:(Lines 84–89) will append entries using the wrongschedule_type, potentially duplicating or mis-dating bins.
- Currently:
-
Brittle
schedule_cadenceparsingschedule_cadence = row.find("td").get_text().strip().split(" ")[1](Line 64) assumes there are at least two space‑separated tokens in the cell.- If the HTML ever changes to a single word (e.g. just
"Thursday") or another format, this will raise anIndexErrorand break the scraper.
-
Repeated parsing of the
<td>textrow.find("td").get_text().strip()is called multiple times (Lines 63–64). It’s minor, but you can compute this once for clarity and slight efficiency.
A safer and clearer structure is to:
- Parse the
<td>text once. - Safely derive
schedule_cadence. - Compute
schedule_dayand its weekday once. - Apply the fortnightly adjustment inside the matching-
collection_typebranch.
For example:
- for row in collection_schedule:
- schedule_type = row.find("th").get_text().strip()
- results2 = re.search("([^(]+)", row.find("td").get_text().strip())
- schedule_cadence = row.find("td").get_text().strip().split(" ")[1]
- if results2:
- schedule_day = results2[1].strip()
- for collection_type in week_collection_types:
- collectionDate = None
- if collection_type in schedule_type.lower():
- if (
- day.weekday()
- == time.strptime(schedule_day, "%A").tm_wday
- ):
- collectionDate = day.strftime(date_format)
- else:
- if "Fortnightly" in schedule_cadence:
- if (
- day.weekday()
- == time.strptime(schedule_day, "%A").tm_wday
- ):
- day = day + timedelta(days=7)
- collectionDate = day.strftime(date_format)
-
- if schedule_type and collectionDate:
- dict_data = {
- "type": schedule_type,
- "collectionDate": collectionDate,
- }
- data["bins"].append(dict_data)
+ for row in collection_schedule:
+ schedule_type = row.find("th").get_text().strip()
+ td = row.find("td")
+ if not td:
+ continue
+
+ td_text = td.get_text().strip()
+ results2 = re.search(r"([^(]+)", td_text)
+
+ tokens = td_text.split()
+ schedule_cadence = tokens[1] if len(tokens) > 1 else ""
+
+ if results2:
+ schedule_day = results2.group(1).strip()
+ schedule_wday = time.strptime(schedule_day, "%A").tm_wday
+ is_fortnightly = "Fortnightly" in schedule_cadence
+
+ for collection_type in week_collection_types:
+ collectionDate = None
+ if collection_type in schedule_type.lower():
+ if day.weekday() == schedule_wday:
+ effective_day = (
+ day + timedelta(days=7)
+ if is_fortnightly
+ else day
+ )
+ collectionDate = effective_day.strftime(date_format)
+
+ if schedule_type and collectionDate:
+ dict_data = {
+ "type": schedule_type,
+ "collectionDate": collectionDate,
+ }
+ data["bins"].append(dict_data)This keeps the fortnightly handling tied to the correct schedule_type, avoids accidental extra entries, and guards against brittle .split()[1] usage.
Actionable comments posted: 0
🧹 Nitpick comments (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
102-116: Consider renaming loop variable and reducing reliance on `*` imports

Two small polish items:

- The loop uses `for type in collection_types:` which shadows the built-in `type`; consider renaming to `bin_type` (and updating the dict accordingly) for clarity.
- Ruff flags `datetime`/`date_format` as F405 because they come from a `from ... import *`; if you want cleaner linting and clearer dependencies, consider importing the specific names you need instead of using a star import.

Example for the loop rename:
```diff
-    collection_types = collection_type.split(" and ")
-
-    for type in collection_types:
-
-        dict_data = {
-            "type": type,
+    collection_types = collection_type.split(" and ")
+
+    for bin_type in collection_types:
+
+        dict_data = {
+            "type": bin_type,
             "collectionDate": collectiondate,
         }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
custom_components/uk_bin_collection/calendar.py (1)
event(54-63)
🪛 Ruff (0.14.5)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py
102-102: datetime may be undefined, or defined from star imports
(F405)
105-105: date_format may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run Unit Tests (3.12, 1.8.4)
- GitHub Check: Run Integration Tests (3.12, 1.8.4)
🔇 Additional comments (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
57-120: Bin list construction and type/date mapping look correct
`data_bins` is initialised once and populated with per-event dicts using `"type": type` and `"collectionDate": collectiondate`, and `collection_type.split(" and ")` ensures multi-service entries are split into individual bin types. This resolves the earlier swapped-key issue and aligns with the expected `{ "bins": [...] }` structure.
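For context, the expected output shape is roughly the following (the values are illustrative placeholders; dates use the project's DD/MM/YYYY format):

```json
{
    "bins": [
        {"type": "Household waste", "collectionDate": "01/12/2025"},
        {"type": "Recycling", "collectionDate": "08/12/2025"}
    ]
}
```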
BREAKING CHANGES
fix: #1693 - Fife Council - This scraper now uses postcode and house number instead of UPRN. It also requires a Selenium browser
fix: #1724 - Rushmoor Council - This scraper requires you to reconfigure your council URL and input your UPRN in a separate field
FIXES
fix: #1622 - Armagh Banbridge Craigavon Council
fix: #1706 - London Borough of Lambeth
fix: #1707 - Chelmsford City Council
fix: #1721 - Mid Sussex Council - No longer requires Selenium
fix: #1720 - Thurrock Council
fix: #1718 - Argyll and Bute Council - Required padding of UPRN to 12 digits
fix: #1712 - South Lanarkshire Council - Scraper could return a date in the past leading to no data being displayed
fix: #1711 - Northumberland Council - Required padding of UPRN to 12 digits
fix: #1209 - Halton Borough Council
fix: #1689 - Wiltshire Council
New Councils
fix: #1640 - Adding Blackpool Council
fix: #1639 - Adding Harlow Council