
Conversation

@m26dvd
Contributor

@m26dvd m26dvd commented Nov 11, 2025

BREAKING CHANGES
fix: #1693 - Fife Council - This scraper now uses a postcode and house number instead of a UPRN. It also requires a Selenium browser.
fix: #1724 - Rushmoor Council - This scraper now requires you to reconfigure your council URL and supply your UPRN in a separate field.

FIXES

fix: #1622 - Armagh Banbridge Craigavon Council
fix: #1706 - London Borough of Lambeth
fix: #1707 - Chelmsford City Council
fix: #1721 - Mid Sussex Council - No longer requires Selenium
fix: #1720 - Thurrock Council
fix: #1718 - Argyll and Bute Council - The UPRN is now padded to 12 digits
fix: #1712 - South Lanarkshire Council - The scraper could return a date in the past, leading to no data being displayed
fix: #1711 - Northumberland Council - The UPRN is now padded to 12 digits
fix: #1209 - Halton Borough Council
fix: #1689 - Wiltshire Council

New Councils

feat: #1640 - Adding Blackpool Council
feat: #1639 - Adding Harlow Council

Summary by CodeRabbit

  • New Features

    • Added support for Blackpool and Harlow council bin lookups.
  • Bug Fixes & Improvements

    • Updated Thurrock collection start dates to 2025.
    • Improved retrieval and parsing for Fife, Mid Sussex, Rushmoor, Wiltshire, Northumberland, South Lanarkshire, Chelmsford and others for greater reliability.
    • Better address/UPRN handling and refined commercial bin classification for Lambeth.
  • Documentation

    • Updated council index and examples to include Blackpool, Harlow and parameter changes.


@coderabbitai
Contributor

coderabbitai bot commented Nov 11, 2025

Walkthrough

This PR adds two new council scrapers (Blackpool, Harlow), migrates several councils from direct HTTP to Selenium or updated HTML parsing, normalises/pads UPRNs, adjusts headers/user-agent usage, fixes bin-type mapping for Lambeth, updates date logic for Thurrock, and updates tests and documentation.

Changes

Cohort / File(s) Summary
New council scrapers
uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py, uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py
Add new CouncilClass implementations: Blackpool uses token + POST to API and parses PremiseJobs; Harlow GETs HTML with custom headers and parses collection rows.
Selenium-driven scraping / Cloudflare workarounds
uk_bin_collection/uk_bin_collection/councils/FifeCouncil.py, uk_bin_collection/uk_bin_collection/councils/ArgyllandButeCouncil.py, uk_bin_collection/uk_bin_collection/councils/NorthumberlandCouncil.py, uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py
Replace or add WebDriver flows (Fife, MidSussex) and add UPRN padding (zfill(12)) for Argyll & Bute and Northumberland; manage driver lifecycle and element waits.
API / parsing switches
uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py, uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py, uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py
Rushmoor switched to direct API call using UPRN; MidSussex replaced Selenium-heavy flow with HTTP/BeautifulSoup in one variant; Wiltshire changed endpoint, selectors and now returns {"bins":[...]} with split types.
Headers / user-agent changes
uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py, uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py
Add custom User-Agent headers for requests and pass explicit user_agent to webdriver creation.
ICS/calendar handling
uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py
Replace hardcoded ICS URL construction with dynamic calendar page retrieval and parsing to locate the ICS link; raise error if ICS not found.
Bin type disambiguation
uk_bin_collection/uk_bin_collection/councils/LondonBoroughLambeth.py
Map "commercial bin" DisplayPhrase to "recycling" or "refuse" based on container Name to avoid duplicate-key collisions.
Collection cadence & date fixes
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py, uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py
South Lanarkshire: add cadence parsing and fortnight handling; Thurrock: update hardcoded start dates to 2025.
Tests & docs updates
uk_bin_collection/tests/input.json, wiki/Councils.md
Add/modify test inputs (Blackpool, Harlow, Fife, MidSussex, Rushmoor) and update wiki usage notes and examples for affected councils.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FifeOld as Fife (Old: API)
    participant FifeNew as Fife (New: Selenium)
    participant Server as Fife Server

    rect rgb(240, 240, 255)
    Note over FifeOld,Server: Previous Flow (Blocked by API)
    User->>FifeOld: parse_data(postcode)
    FifeOld->>Server: GET /api/citizen
    Server-->>FifeOld: 403 Forbidden
    FifeOld-->>User: Error
    end

    rect rgb(230, 255, 230)
    Note over FifeNew,Server: New Flow (Selenium-driven)
    User->>FifeNew: parse_data(postcode, paon)
    FifeNew->>Server: Browser navigate (Selenium)
    Server-->>FifeNew: Rendered HTML
    FifeNew->>FifeNew: Wait & bypass protections
    FifeNew->>FifeNew: Fill postcode, select address
    FifeNew->>Server: Trigger search / request data
    Server-->>FifeNew: Rendered collections table
    FifeNew->>FifeNew: Parse HTML -> {"bins":[...]}
    FifeNew-->>User: Result
    end
sequenceDiagram
    participant User
    participant RushmoorOld as Rushmoor (Old: HTML parse)
    participant RushmoorNew as Rushmoor (New: API)
    participant Server as Rushmoor Server

    rect rgb(240, 240, 255)
    Note over RushmoorOld: Old parsing relied on page structure and could hit NoneType
    User->>RushmoorOld: parse_data(page, uprn)
    RushmoorOld->>RushmoorOld: soup.find("p").contents[0] (fragile)
    RushmoorOld-->>User: AttributeError (NoneType)
    end

    rect rgb(230, 255, 230)
    Note over RushmoorNew,Server: New Flow (direct API)
    User->>RushmoorNew: parse_data(uprn)
    RushmoorNew->>Server: GET /api?uprn={uprn}
    Server-->>RushmoorNew: JSON response
    RushmoorNew->>RushmoorNew: Extract collections, format dates
    RushmoorNew-->>User: {"bins":[...]}
    end
sequenceDiagram
    participant User
    participant Blackpool as Blackpool (New)
    participant Server as Blackpool API

    rect rgb(230, 255, 230)
    Note over Blackpool,Server: Blackpool token + POST flow
    User->>Blackpool: parse_data(uprn, postcode)
    Blackpool->>Server: GET /token endpoint (with headers)
    Server-->>Blackpool: token
    Blackpool->>Server: POST /PremiseJobs with uprn, postcode, token
    Server-->>Blackpool: PremiseJobs JSON
    Blackpool->>Blackpool: Build {"bins":[...]} sorted
    Blackpool-->>User: Result
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • FifeCouncil.py — Selenium flow, waits, Cloudflare bypass and webdriver teardown.
  • MidSussexDistrictCouncil.py — large control-flow changes and selector correctness.
  • BlackpoolCouncil.py & HarlowCouncil.py — new classes: validate API requests, tokens, date parsing and error handling.
  • LondonBoroughLambeth.py — ensure commercial-bin mapping covers edge cases without breaking existing mappings.
  • WiltshireCouncil.py & SouthLanarkshireCouncil.py — verify splitting of types and fortnight cadence logic.
  • UPRN padding changes — confirm zfill(12) is appropriate for all targets.

Suggested reviewers

  • dp247

Poem

🐇 Hop hop — a rabbit’s tiny cheer,

New scrapers bloom both far and near,
Tokens fetched and pages browed,
UPRNs padded, headers proud,
Bins sorted, dates set bright and clear. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 5.88%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check ❓ Inconclusive: The title 'fix: Council Fix Pack - December 2025' is generic and does not clearly specify the main changes; it refers to multiple unrelated council fixes without highlighting a primary objective. Resolution: consider a more specific title that highlights the primary change, such as 'fix: Update council scrapers for API/website changes and add Blackpool/Harlow', or focus on the most impactful fix.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Linked Issues check ✅ Passed: The PR successfully addresses all linked coding objectives: it updates 10+ council scrapers (Fife postcode+Selenium, Rushmoor API, Lambeth logic, Chelmsford ICS, Mid Sussex HTTP, Thurrock dates, Argyll/Northumberland UPRN padding, Halton user-agent, South Lanarkshire cadence, Wiltshire parsing) and adds 2 new councils (Blackpool, Harlow).
  • Out of Scope Changes check ✅ Passed: All changes align with linked issues: council scraper fixes, configuration updates, documentation in wiki/Councils.md, and input.json entries match the stated requirements with no extraneous modifications.

@codecov

codecov bot commented Nov 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.79%. Comparing base (37c8a80) to head (1c68a0d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1705   +/-   ##
=======================================
  Coverage   86.79%   86.79%           
=======================================
  Files           9        9           
  Lines        1136     1136           
=======================================
  Hits          986      986           
  Misses        150      150           


BREAKING CHANGE
fix: robbrad#1693

This scraper now uses postcode and house number instead of UPRN. It also requires a Selenium browser
fix: robbrad#1706 -  London Borough of Lambeth
fix: robbrad#1720 - Thurrock Council
fix: robbrad#1718 - Argyll and Bute Council

UPRN needs to be 12 digits long
fix: robbrad#1712 - South Lanarkshire Council

Council would potentially return a date in the past for that week
fix: robbrad#1711 - Northumberland Council - Requires 12 digit UPRN
fix: robbrad#1209 Halton Borough Council
fix: robbrad#1689 - Wiltshire Council
feat: robbrad#1640 Adding Blackpool Council
feat: robbrad#1639 Adding Harlow Council
fix: robbrad#1724 - Rushmoor Council
@m26dvd m26dvd mentioned this pull request Nov 20, 2025
@m26dvd m26dvd marked this pull request as ready for review November 24, 2025 20:45
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (9)
uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py (2)

36-40: Date update fixes the immediate issue; consider dynamic date calculation for long-term maintainability.

The hardcoded date updates from 2024 to 2025 correctly restore bin collection data generation. However, these dates will require manual maintenance in the future, creating recurring technical debt.

Consider refactoring to calculate start dates dynamically based on the current date and known collection schedules, or move these dates to a configuration file with validation. This would eliminate the need for annual manual updates.

Example approach:

# Calculate the most recent start date for the current year
def get_round_start_dates(year=None):
    if year is None:
        year = datetime.now().year
    # Define the initial dates for the current year
    round_a_blue_brown = datetime(year, 11, 17)
    round_a_green_grey = datetime(year, 11, 24)
    # Return dates based on round
    return {
        'A': {'blue_brown': round_a_blue_brown, 'green_grey': round_a_green_grey},
        'B': {'green_grey': round_a_green_grey, 'blue_brown': round_a_blue_brown}
    }

1-1: Note: Static analysis flagged star import (pre-existing).

Ruff flags that datetime may be undefined due to the star import on line 1. While the code works correctly (the datetime class is available from the common module), explicit imports are generally preferred for clarity and to avoid namespace pollution.

This is pre-existing code and not introduced by this PR, so it's low priority. If refactoring in the future, consider replacing the star import with explicit imports:

from uk_bin_collection.uk_bin_collection.common import datetime, timedelta, get_dates_every_x_days
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)

112-112: Variable name type shadows Python built-in.

Using type as a loop variable shadows the built-in type() function. Consider renaming to bin_type or collection_type_item for clarity.

Apply this diff to use a clearer variable name:

-                    for type in collection_types:
+                    for bin_type in collection_types:

                        dict_data = {
-                            "type": type,
+                            "type": bin_type,
                             "collectionDate": collectiondate,
                        }
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (1)

64-82: Consider addressing date_format F405 and making cadence parsing more robust

Ruff’s F405 for date_format is a style/linting issue but may fail CI depending on project settings. Two options:

  • Import explicitly instead of relying on import *, e.g.:
    from uk_bin_collection.uk_bin_collection.common import date_format
    (and import other symbols you need explicitly), or
  • Keep the wildcard import but add a local # noqa: F405 where date_format is used if that’s the established pattern in this repo.

Separately, since schedule_cadence is only used for a substring check, you can simplify and make it safer by operating on the full cell text and lowering it:

-                    schedule_cadence = row.find("td").get_text().strip().split(" ")[1]
+                    td_text = row.find("td").get_text().strip()
+                    schedule_cadence = td_text.lower()
...
-                                if "Fortnightly" in schedule_cadence:
+                                if "fortnightly" in schedule_cadence:

This avoids depending on a fixed token position and is more resilient to minor HTML/content changes.

uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py (1)

22-25: User-Agent header looks good; consider adding a request timeout

The added headers (User-Agent) on the requests.get call are a sensible fix for remote blocking. To avoid potential hangs on network issues, consider adding an explicit timeout:

-        response = requests.get(url, headers=headers)
+        response = requests.get(url, headers=headers, timeout=20)

Adjust the timeout value as appropriate for typical council response times.

Also applies to: 47-47

uk_bin_collection/uk_bin_collection/councils/LondonBoroughLambeth.py (1)

45-52: Commercial bin disambiguation looks good; consider case-insensitive match

The new logic properly splits ambiguous "commercial bin" containers into "recycling" vs "refuse" based on Container["Name"], which should prevent entity collapse while keeping other types unchanged.

For a bit more robustness against API text changes, you could optionally make the match case-insensitive:

name = Container["Name"]
if "recycling" in name.lower():
    Bin_Type = "recycling"
else:
    Bin_Type = "refuse"
uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py (1)

18-33: API flow is much cleaner; add basic HTTP and JSON guards to avoid regressions

The switch to a direct Umbraco API with UPRN looks good and should resolve the earlier setup issues. To harden this against future content changes (and avoid reintroducing the NoneType.contents failure):

  • Add a timeout and status check on the HTTP call:
-        response = requests.get(
-            "https://www.rushmoor.gov.uk/Umbraco/Api/BinLookUpWorkAround/Get?",
-            params=params,
-            headers=headers,
-        )
+        response = requests.get(
+            "https://www.rushmoor.gov.uk/Umbraco/Api/BinLookUpWorkAround/Get?",
+            params=params,
+            headers=headers,
+            timeout=20,
+        )
+        response.raise_for_status()
  • Guard the <p> lookup and NextCollection extraction so a missing node produces a clear error instead of an AttributeError:
-        result = soup.find("p").contents[0]
-
-        json_data = json.loads(result)["NextCollection"]
+        p_tag = soup.find("p")
+        if p_tag is None or not p_tag.contents:
+            raise ValueError("Unexpected response format from Rushmoor API")
+        result = p_tag.contents[0]
+
+        json_data = (json.loads(result).get("NextCollection") or {})

The existing if len(collections) < 1: raise ValueError("No collections found") will still correctly signal the “no data” case.

Also applies to: 35-45

uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py (1)

22-23: New HTTP-based scraper is a good replacement; tighten a few robustness edges

The move away from Selenium to a requests + BeautifulSoup pipeline is a solid simplification and aligns with the updated Mid Sussex site. A few low-friction tweaks would make this more resilient:

  1. Add timeouts to all HTTP calls
-        r = session.get(
-            URI,
-        )
+        r = session.get(URI, timeout=20)
...
-        r = session.post(nextpageurl, data)
+        r = session.post(nextpageurl, data=data, timeout=20)
...
-        r = session.get(
-            nextpageurl,
-        )
+        r = session.get(nextpageurl, timeout=20)
  1. Avoid .find(...).find("a") on a possible None
-            alink = soup.find("div", id="property_list").find("a")
-
-            if alink is None:
-                raise Exception("Address not found")
+            property_list = soup.find("div", id="property_list")
+            if property_list is None:
+                raise Exception("Address list not found")
+            alink = property_list.find("a")
+            if alink is None:
+                raise Exception("Address not found")
  1. Likewise for section#scheduled-collections
-            u1s = soup.find("section", id="scheduled-collections").find_all(
-                "ul", class_="displayinlineblock"
-            )
+            section = soup.find("section", id="scheduled-collections")
+            if section is None:
+                raise Exception("Scheduled collections section not found")
+            u1s = section.find_all("ul", class_="displayinlineblock")
  1. Minor readability

data is used as the POST payload while bindata holds output. Renaming the POST dict to payload (or similar) would make the intent clearer without any behavioural change.

These changes preserve the current success path while turning markup/API drift into clear, controlled exceptions.

Also applies to: 27-37, 43-48, 49-56, 58-66, 68-76, 78-80, 82-99, 101-102

uk_bin_collection/tests/input.json (1)

232-240: Blackpool config looks consistent; optionally mark as skip_get_url.

The new BlackpoolCouncil entry correctly provides both uprn and postcode expected by the scraper. Because the scraper makes its own API calls and ignores the initial page content, consider adding "skip_get_url": true here to avoid an unnecessary preliminary GET of url.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37c8a80 and 330bcd9.

📒 Files selected for processing (16)
  • uk_bin_collection/tests/input.json (5 hunks)
  • uk_bin_collection/uk_bin_collection/councils/ArgyllandButeCouncil.py (2 hunks)
  • uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py (2 hunks)
  • uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py (3 hunks)
  • uk_bin_collection/uk_bin_collection/councils/FifeCouncil.py (2 hunks)
  • uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/LondonBoroughLambeth.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/NorthumberlandCouncil.py (2 hunks)
  • uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (2 hunks)
  • uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py (1 hunks)
  • uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (3 hunks)
  • wiki/Councils.md (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
  • create_webdriver (321-360)
uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
  • check_uprn (67-78)
uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py (2)
uk_bin_collection/uk_bin_collection/get_bin_data.py (1)
  • AbstractGetBinDataClass (43-146)
uk_bin_collection/uk_bin_collection/common.py (2)
  • check_uprn (67-78)
  • check_postcode (36-49)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
custom_components/uk_bin_collection/calendar.py (1)
  • event (54-63)
uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py (3)
uk_bin_collection/uk_bin_collection/councils/NuneatonBedworthBoroughCouncil.py (1)
  • get_bin_data (63-945)
uk_bin_collection/uk_bin_collection/get_bin_data.py (1)
  • AbstractGetBinDataClass (43-146)
uk_bin_collection/uk_bin_collection/common.py (1)
  • check_uprn (67-78)
uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
  • check_postcode (36-49)
🪛 Ruff (0.14.5)
uk_bin_collection/uk_bin_collection/councils/ThurrockCouncil.py

36-36: datetime may be undefined, or defined from star imports

(F405)


37-37: datetime may be undefined, or defined from star imports

(F405)


39-39: datetime may be undefined, or defined from star imports

(F405)


40-40: datetime may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py

40-40: create_webdriver may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/RushmoorCouncil.py

19-19: check_uprn may be undefined, or defined from star imports

(F405)


29-29: requests may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py

5-5: from uk_bin_collection.uk_bin_collection.common import * used; unable to detect undefined names

(F403)


17-17: Unused method argument: page

(ARG002)


21-21: check_uprn may be undefined, or defined from star imports

(F405)


22-22: check_postcode may be undefined, or defined from star imports

(F405)


41-41: Probable use of requests call without timeout

(S113)


59-59: Probable use of requests call without timeout

(S113)


77-77: datetime may be undefined, or defined from star imports

(F405)


80-80: date_format may be undefined, or defined from star imports

(F405)


85-85: datetime may be undefined, or defined from star imports

(F405)


85-85: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py

74-74: date_format may be undefined, or defined from star imports

(F405)


82-82: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/ArmaghBanbridgeCraigavonCouncil.py

47-47: Probable use of requests call without timeout

(S113)

uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py

102-102: datetime may be undefined, or defined from star imports

(F405)


105-105: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py

6-6: from uk_bin_collection.uk_bin_collection.common import * used; unable to detect undefined names

(F403)


18-18: Unused method argument: page

(ARG002)


21-21: check_uprn may be undefined, or defined from star imports

(F405)


32-32: Probable use of requests call without timeout

(S113)


48-48: datetime may be undefined, or defined from star imports

(F405)


54-54: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/FifeCouncil.py

27-27: check_postcode may be undefined, or defined from star imports

(F405)


28-28: check_paon may be undefined, or defined from star imports

(F405)


38-38: create_webdriver may be undefined, or defined from star imports

(F405)


66-66: Unused lambda argument: d

(ARG005)


83-85: Abstract raise to an inner function

(TRY301)


83-85: Avoid specifying long messages outside the exception class

(TRY003)


102-104: Abstract raise to an inner function

(TRY301)


102-104: Avoid specifying long messages outside the exception class

(TRY003)


126-126: date_format may be undefined, or defined from star imports

(F405)


131-131: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/MidSussexDistrictCouncil.py

25-25: check_postcode may be undefined, or defined from star imports

(F405)


29-29: requests may be undefined, or defined from star imports

(F405)


44-44: Abstract raise to an inner function

(TRY301)


44-44: Create your own exception

(TRY002)


44-44: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Abstract raise to an inner function

(TRY301)


64-64: Create your own exception

(TRY002)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Abstract raise to an inner function

(TRY301)


76-76: Create your own exception

(TRY002)


76-76: Avoid specifying long messages outside the exception class

(TRY003)


90-90: datetime may be undefined, or defined from star imports

(F405)


93-93: date_format may be undefined, or defined from star imports

(F405)


98-98: datetime may be undefined, or defined from star imports

(F405)


98-98: date_format may be undefined, or defined from star imports

(F405)


101-101: Consider moving this statement to an else block

(TRY300)

uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py

110-112: Abstract raise to an inner function

(TRY301)


110-112: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (14)
uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py (2)

17-37: LGTM!

The updated docstring clearly documents the new dynamic ICS lookup flow and parameter requirements.


119-145: LGTM!

The ICS event parsing logic correctly handles events within the next 60 days, splits comma-separated collection types, and ensures proper driver cleanup in the finally block.
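
As a minimal, self-contained sketch of that pattern (a 60-day window plus comma-splitting of event summaries), with an invented event list rather than the scraper's actual ICS data:

from datetime import datetime, timedelta

date_format = "%d/%m/%Y"
now = datetime.now()

# Hypothetical parsed ICS events; the real code reads these from the fetched .ics file.
events = [
    {"summary": "Green bin, Food waste", "start": now + timedelta(days=10)},
    {"summary": "Black bin", "start": now + timedelta(days=90)},  # outside the 60-day window
]

bins = []
cutoff = now + timedelta(days=60)
for event in events:
    if now <= event["start"] <= cutoff:
        # A single event can list several bin types separated by commas.
        for bin_type in event["summary"].split(","):
            bins.append(
                {"type": bin_type.strip(), "collectionDate": event["start"].strftime(date_format)}
            )

print(bins)  # only the event inside the window remains, split into two entries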

uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (4)

30-30: LGTM: UPRN padding implemented correctly.

The 12-digit UPRN padding aligns with fixes for other councils in this PR and ensures consistent data format.


57-57: LGTM: Clean data structure initialization.

The data_bins initialization follows the unified aggregation approach and will accumulate results across all requested months.


82-82: LGTM: Endpoint updated to wastecollectioncalendar.

This endpoint change addresses issue #1689 to ensure all collection types are returned.


110-110: Verify the separator format used by Wiltshire Council's API.

The review raises a valid concern: while the code assumes " and " (with spaces) as the separator, no Wiltshire-specific test data or documentation was found in the codebase to confirm this. Other councils in the same project use different separators (comma, forward slash, newline). Without access to actual Wiltshire API responses, the assumption cannot be verified. Check Wiltshire Council's live data to confirm the collection type format is always "Type1 and Type2" with spaces, and not alternatives like "&", "+", or formats without consistent spacing.
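
If the live data does turn out to vary, one tolerant option is sketched below: a regex split that accepts "and", "&" or "+" with or without surrounding spaces. The input string is a made-up example, not captured API output.

import re

cell_text = "Household waste and Recycling"  # illustrative value only
collection_types = [
    part.strip()
    for part in re.split(r"\s*(?:\band\b|&|\+)\s*", cell_text)
    if part.strip()
]
print(collection_types)  # ['Household waste', 'Recycling']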

uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (1)

4-6: BeautifulSoup import is fine

The BeautifulSoup import and usage remain standard and appropriate for this parser; no issues here.

uk_bin_collection/uk_bin_collection/councils/NorthumberlandCouncil.py (1)

22-22: UPRN normalization is correct and aligns with other councils

Coercing user_uprn to str and zfill(12) before using it in select_by_value matches the 12‑digit UPRN requirement and should fix the lookup issues without regressions for already‑padded values.

Also applies to: 43-43

uk_bin_collection/uk_bin_collection/councils/ArgyllandButeCouncil.py (1)

1-1: Datetime import and 12‑digit UPRN padding are appropriate

Importing datetime matches its later usage for parsing and sorting, and zero‑padding user_uprn to 12 digits before select_by_value aligns with the council’s UPRN format requirements and the pattern used in Northumberland.

Also applies to: 31-31

uk_bin_collection/uk_bin_collection/councils/HaltonBoroughCouncil.py (1)

38-41: Explicit user-agent for Selenium driver is a solid improvement

Passing a realistic user_agent into create_webdriver should help with remote/SaaS blocking while remaining compatible with the existing helper signature. No further changes needed here.
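
For reference, the call shape is sketched below; it mirrors the create_webdriver usage shown in FifeCouncil elsewhere in this PR, and the web_driver/headless values here are placeholders rather than Halton's actual configuration.

from uk_bin_collection.uk_bin_collection.common import create_webdriver

web_driver = None  # placeholder; could be a remote Selenium URL such as "http://selenium:4444"
headless = True
user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

driver = create_webdriver(web_driver, headless, user_agent, __name__)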

wiki/Councils.md (1)

40-40: Documentation updates match the new scraper behaviours

The additions and edits for Blackpool, Harlow, Fife, Mid Sussex, and Rushmoor are consistent with the corresponding scraper implementations and PR objectives:

  • Blackpool/Harlow: new entries with -u (and -p for Blackpool).
  • Fife: switched to postcode + house number + Selenium, matching the new Selenium-based scraper.
  • Mid Sussex: removed web driver parameter and updated note for the non-Selenium HTML flow.
  • Rushmoor: now uses the base URL and a separate -u UPRN parameter as per the new API-based scraper.

Given this file is auto-generated from uk_bin_collection/tests/input.json, the changes look correct and in sync with the code.

Also applies to: 144-144, 675-685, 1728-1735, 1949-1958, 2557-2564, 3246-3251

uk_bin_collection/tests/input.json (3)

1144-1151: Harlow config matches new scraper expectations.

The HarlowCouncil entry provides an uprn and base url as required by the new scraper; wiki fields and LAD code look consistent with other councils.


1564-1570: Mid Sussex sample updated to match “street name only” requirement.

The house_number example ("OAKLANDS") now aligns with the wiki note that the street name should be passed via the house_number parameter; this looks correct.


2048-2051: Rushmoor config aligns with reworked scraper flow.

Switching url to the base Rushmoor site and adding an explicit uprn with an updated wiki note matches the described change to accept a separate UPRN field. No issues spotted.

Comment on lines 957 to 965
"FifeCouncil": {
"uprn": "320203521",
"postcode": "KY16 9NQ",
"house_number": "1",
"web_driver": "http://selenium:4444",
"url": "https://www.fife.gov.uk",
"wiki_command_url_override": "https://www.fife.gov.uk",
"wiki_name": "Fife",
"wiki_note": "You will need to use [FindMyAddress](https://www.findmyaddress.co.uk/search) to find the UPRN.",
"wiki_note": "Pass in the house number and postcode parameters. This parser requires a Selenium webdriver.",
"LAD24CD": "S12000047"
Contributor


⚠️ Potential issue | 🟠 Major

Fife config uses house_number but parser reads paon – this will fail address matching.

FifeCouncil.CouncilClass.parse_data currently pulls the address from kwargs["paon"], but this JSON only defines house_number. In typical flows where kwargs are taken directly from this file, paon will be None, so the dropdown search will never match and a ValueError will be raised for every run.

Add a paon value (or align the parser to also read house_number) so the config and implementation agree. For example:

     "FifeCouncil": {
         "postcode": "KY16 9NQ",
-        "house_number": "1",
+        "house_number": "1",
+        "paon": "1",
         "web_driver": "http://selenium:4444",
         "url": "https://www.fife.gov.uk",
📝 Committable suggestion


Suggested change
"FifeCouncil": {
"uprn": "320203521",
"postcode": "KY16 9NQ",
"house_number": "1",
"web_driver": "http://selenium:4444",
"url": "https://www.fife.gov.uk",
"wiki_command_url_override": "https://www.fife.gov.uk",
"wiki_name": "Fife",
"wiki_note": "You will need to use [FindMyAddress](https://www.findmyaddress.co.uk/search) to find the UPRN.",
"wiki_note": "Pass in the house number and postcode parameters. This parser requires a Selenium webdriver.",
"LAD24CD": "S12000047"
"FifeCouncil": {
"postcode": "KY16 9NQ",
"house_number": "1",
"paon": "1",
"web_driver": "http://selenium:4444",
"url": "https://www.fife.gov.uk",
"wiki_command_url_override": "https://www.fife.gov.uk",
"wiki_name": "Fife",
"wiki_note": "Pass in the house number and postcode parameters. This parser requires a Selenium webdriver.",
"LAD24CD": "S12000047"
🤖 Prompt for AI Agents
In uk_bin_collection/tests/input.json around lines 957–965, the FifeCouncil test
config uses "house_number" while the parser expects kwargs["paon"], causing paon
to be None and address matching to fail; fix this by updating the parser to read
paon = kwargs.get("paon") or kwargs.get("house_number") (i.e., fall back to
"house_number" when "paon" is absent), adjust any callers/tests if they assume
paon-only, and add a unit test or update the existing test fixture to ensure the
fallback works (or alternatively add "paon": "1" to the JSON if you prefer the
config-side fix).

Comment on lines +17 to +88
    def parse_data(self, page: str, **kwargs) -> dict:

        user_uprn = kwargs.get("uprn")
        user_postcode = kwargs.get("postcode")
        check_uprn(user_uprn)
        check_postcode(user_postcode)
        bindata = {"bins": []}

        headers = {
            "Accept": "*/*",
            "Accept-Language": "en-GB,en;q=0.9",
            "Connection": "keep-alive",
            "DNT": "1",
            "Origin": "https://www.blackpool.gov.uk",
            "Referer": "https://www.blackpool.gov.uk/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-site",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
            "sec-ch-ua": '"Google Chrome";v="141", "Not?A_Brand";v="8", "Chromium";v="141"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"Windows"',
        }

        response = requests.get(
            "https://api.blackpool.gov.uk/live//api/bartec/security/token",
            headers=headers,
        )

        token = response.text.strip().replace('"', "")

        json_data = {
            "UPRN": user_uprn,
            "USRN": "",
            "PostCode": user_postcode,
            "StreetNumber": "",
            "CurrentUser": {
                "UserId": "",
                "Token": token,
            },
        }

        response = requests.post(
            "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
            headers=headers,
            json=json_data,
        )

        # Parse the JSON response
        bin_collection = response.json()

        # Loop through each collection in bin_collection
        for collection in bin_collection["jobsField"]:

            job = collection["jobField"]
            date = job.get("scheduledStartField")
            bin_type = job.get("nameField", "") or job.get("descriptionField", "")

            dict_data = {
                "type": bin_type,
                "collectionDate": datetime.strptime(
                    date,
                    "%Y-%m-%dT%H:%M:%S",
                ).strftime(date_format),
            }
            bindata["bins"].append(dict_data)

        bindata["bins"].sort(
            key=lambda x: datetime.strptime(x.get("collectionDate"), date_format)
        )

        return bindata
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Blackpool scraper should use timeouts, check HTTP status, and handle missing jobs defensively.

As written:

  • Both the token GET and PremiseJobs POST have no timeout and don’t validate status, so a stuck endpoint can hang forever and 4xx/5xx responses will cause obscure JSON/Key errors.
  • Iteration over bin_collection["jobsField"] and collection["jobField"] assumes those keys always exist and that scheduledStartField and a name/description are present.

You can keep behaviour but make it much more robust like this:

-        response = requests.get(
-            "https://api.blackpool.gov.uk/live//api/bartec/security/token",
-            headers=headers,
-        )
-
-        token = response.text.strip().replace('"', "")
+        response = requests.get(
+            "https://api.blackpool.gov.uk/live//api/bartec/security/token",
+            headers=headers,
+            timeout=120,
+        )
+        response.raise_for_status()
+
+        token = response.text.strip().replace('"', "")

@@
-        response = requests.post(
-            "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
-            headers=headers,
-            json=json_data,
-        )
-
-        # Parse the JSON response
-        bin_collection = response.json()
-
-        # Loop through each collection in bin_collection
-        for collection in bin_collection["jobsField"]:
-
-            job = collection["jobField"]
-            date = job.get("scheduledStartField")
-            bin_type = job.get("nameField", "") or job.get("descriptionField", "")
-
-            dict_data = {
-                "type": bin_type,
-                "collectionDate": datetime.strptime(
-                    date,
-                    "%Y-%m-%dT%H:%M:%S",
-                ).strftime(date_format),
-            }
-            bindata["bins"].append(dict_data)
+        response = requests.post(
+            "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
+            headers=headers,
+            json=json_data,
+            timeout=120,
+        )
+        response.raise_for_status()
+
+        bin_collection = response.json()
+
+        for collection in bin_collection.get("jobsField", []):
+            job = collection.get("jobField") or {}
+            date = job.get("scheduledStartField")
+            bin_type = job.get("nameField") or job.get("descriptionField")
+            if not (date and bin_type):
+                continue
+
+            dict_data = {
+                "type": bin_type,
+                "collectionDate": datetime.strptime(
+                    date,
+                    "%Y-%m-%dT%H:%M:%S",
+                ).strftime(date_format),
+            }
+            bindata["bins"].append(dict_data)

This keeps the same output structure while avoiding hangs and hard crashes on minor API changes.


🏁 Script executed:

#!/bin/bash
# Verify BlackpoolCouncil.py exists and contains the code mentioned in review
find . -name "BlackpoolCouncil.py" -type f

Length of output: 66


🏁 Script executed:

#!/bin/bash
# Inspect the parse_data method in BlackpoolCouncil.py
cat uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py | head -100

Length of output: 2865


🏁 Script executed:

#!/bin/bash
# Check how other council scrapers handle requests (e.g., timeout patterns)
rg -n "requests\.(get|post)" --type py -C 2 uk_bin_collection/uk_bin_collection/councils/ | head -60

Length of output: 5470


🏁 Script executed:

#!/bin/bash
# Search for timeout usage in the codebase to see if it's a standard pattern
rg -n "timeout\s*=" --type py uk_bin_collection/uk_bin_collection/councils/ | head -20

Length of output: 2862


🏁 Script executed:

#!/bin/bash
# Look for response.raise_for_status() usage in other scrapers
rg -n "raise_for_status" --type py uk_bin_collection/

Length of output: 11821


Add timeout parameters and HTTP status validation to both API requests; use defensive JSON parsing.

The BlackpoolCouncil scraper omits critical error handling that is standard across 50+ other council implementations in the codebase. Both the token request and PremiseJobs request lack timeouts and status validation:

  • Without timeout, network issues will cause indefinite hangs
  • Without raise_for_status(), error responses (4xx/5xx) bypass validation and cause JSON parse failures
  • Direct dict access (bin_collection["jobsField"]) crashes on API schema changes instead of gracefully skipping incomplete records

Apply the suggested changes:

-        response = requests.get(
-            "https://api.blackpool.gov.uk/live//api/bartec/security/token",
-            headers=headers,
-        )
-
-        token = response.text.strip().replace('"', "")
+        response = requests.get(
+            "https://api.blackpool.gov.uk/live//api/bartec/security/token",
+            headers=headers,
+            timeout=120,
+        )
+        response.raise_for_status()
+
+        token = response.text.strip().replace('"', "")

@@
-        response = requests.post(
-            "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
-            headers=headers,
-            json=json_data,
-        )
-
-        # Parse the JSON response
-        bin_collection = response.json()
-
-        # Loop through each collection in bin_collection
-        for collection in bin_collection["jobsField"]:
-
-            job = collection["jobField"]
-            date = job.get("scheduledStartField")
-            bin_type = job.get("nameField", "") or job.get("descriptionField", "")
-
-            dict_data = {
-                "type": bin_type,
-                "collectionDate": datetime.strptime(
-                    date,
-                    "%Y-%m-%dT%H:%M:%S",
-                ).strftime(date_format),
-            }
-            bindata["bins"].append(dict_data)
+        response = requests.post(
+            "https://api.blackpool.gov.uk/live//api/bartec/collection/PremiseJobs",
+            headers=headers,
+            json=json_data,
+            timeout=120,
+        )
+        response.raise_for_status()
+
+        bin_collection = response.json()
+
+        for collection in bin_collection.get("jobsField", []):
+            job = collection.get("jobField") or {}
+            date = job.get("scheduledStartField")
+            bin_type = job.get("nameField") or job.get("descriptionField")
+            if not (date and bin_type):
+                continue
+
+            dict_data = {
+                "type": bin_type,
+                "collectionDate": datetime.strptime(
+                    date,
+                    "%Y-%m-%dT%H:%M:%S",
+                ).strftime(date_format),
+            }
+            bindata["bins"].append(dict_data)
Comment on lines +101 to +112
calendar_url = f"https://www.chelmsford.gov.uk/bins-and-recycling/check-your-collection-day/{day}-{letter}-collection-calendar/"
driver.get(calendar_url)
soup = BeautifulSoup(driver.page_source, features="html.parser")
a = soup.find(
"a", href=lambda h: h and h.lower().endswith(".ics")
)
if a:
ics_url = a["href"]
else:
raise ValueError(
f"Could not find collection ICS file for address: {user_paon}"
)
Contributor


⚠️ Potential issue | 🟠 Major

Handle relative URLs when extracting ICS href.

The a["href"] attribute extracted at line 108 may be a relative URL (e.g., "/bins-and-recycling/file.ics" or "../../file.ics"). The events() function at line 124 expects an absolute URL and will fail if given a relative path.

Apply this diff to convert potentially relative URLs to absolute:

+from urllib.parse import urljoin
+
 # (at line 101-112)
                        calendar_url = f"https://www.chelmsford.gov.uk/bins-and-recycling/check-your-collection-day/{day}-{letter}-collection-calendar/"
                        driver.get(calendar_url)
                        soup = BeautifulSoup(driver.page_source, features="html.parser")
                        a = soup.find(
                            "a", href=lambda h: h and h.lower().endswith(".ics")
                        )
                        if a:
-                           ics_url = a["href"]
+                           ics_url = urljoin(calendar_url, a["href"])
                        else:
                            raise ValueError(
                                f"Could not find collection ICS file for address: {user_paon}"
                            )

Additionally, consider improving the error message to be more specific:

                        else:
                            raise ValueError(
-                               f"Could not find collection ICS file for address: {user_paon}"
+                               f"Could not find ICS link on calendar page ({day}-{letter}) for address: {user_paon}"
                            )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In uk_bin_collection/uk_bin_collection/councils/ChelmsfordCityCouncil.py around
lines 101-112, the code takes a[href] which can be a relative URL and later
feeds it to events(), causing failures; update the extraction to resolve
relative hrefs into absolute URLs (for example use urllib.parse.urljoin with the
current page URL or driver.current_url) before assigning ics_url, and when
raising the ValueError include more specific context (address/user_paon, the
calendar page URL and the raw href if present) so the error message is
actionable.

Comment on lines 17 to +142
    def parse_data(self, page: str, **kwargs) -> dict:
        # Get and check UPRN
        user_uprn = kwargs.get("uprn")
        check_uprn(user_uprn)
        bindata = {"bins": []}

        API_URL = "https://www.fife.gov.uk/api/custom?action=powersuite_bin_calendar_collections&actionedby=bin_calendar&loadform=true&access=citizen&locale=en"
        AUTH_URL = "https://www.fife.gov.uk/api/citizen?preview=false&locale=en"
        AUTH_KEY = "Authorization"

        r = requests.get(AUTH_URL)
        r.raise_for_status()
        auth_token = r.headers[AUTH_KEY]

        post_data = {
            "name": "bin_calendar",
            "data": {
                "uprn": user_uprn,
            },
            "email": "",
            "caseid": "",
            "xref": "",
            "xref1": "",
            "xref2": "",
        }

        headers = {
            "referer": "https://www.fife.gov.uk/services/forms/bin-calendar",
            "accept": "application/json",
            "content-type": "application/json",
            AUTH_KEY: auth_token,
        }

        r = requests.post(API_URL, data=json.dumps(post_data), headers=headers)
        r.raise_for_status()

        result = r.json()

        for collection in result["data"]["tab_collections"]:
            dict_data = {
                "type": collection["colour"],
                "collectionDate": datetime.strptime(
                    collection["date"],
                    "%A, %B %d, %Y",
                ).strftime("%d/%m/%Y"),
            }
            bindata["bins"].append(dict_data)

        bindata["bins"].sort(
            key=lambda x: datetime.strptime(x.get("collectionDate"), "%d/%m/%Y")
        )
        driver = None
        try:
            # Get and check UPRN
            user_postcode = kwargs.get("postcode")
            user_paon = kwargs.get("paon")
            check_postcode(user_postcode)
            check_paon(user_paon)

            web_driver = kwargs.get("web_driver")
            headless = kwargs.get("headless")

            bindata = {"bins": []}

            URL = "https://fife.portal.uk.empro.verintcloudservices.com/site/fife/request/bin_calendar"

            user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
            driver = create_webdriver(web_driver, headless, user_agent, __name__)
            driver.get(URL)

            wait = WebDriverWait(driver, 30)

            ID_POSTCODE = "dform_widget_ps_45M3LET8_txt_postcode"
            ID_SEARCH_BTN = "dform_widget_ps_3SHSN93_searchbutton"
            ID_ADDRESS_SELECT = "dform_widget_ps_3SHSN93_id"
            ID_COLLECTIONS = "dform_table_tab_collections"

            # Wait for initial page load and Cloudflare bypass
            wait.until(lambda d: "Just a moment" not in d.title and d.title != "")
            time.sleep(3)

            # Wait for the postcode field to appear then populate it
            inputElement_postcode = wait.until(
                EC.presence_of_element_located((By.ID, ID_POSTCODE))
            )
            inputElement_postcode.send_keys(user_postcode)

            # Click search button
            findAddress = wait.until(EC.element_to_be_clickable((By.ID, ID_SEARCH_BTN)))
            findAddress.click()

            # Wait for the 'Select address' dropdown to appear and select option matching the house name/number
            select_el = wait.until(
                EC.visibility_of_element_located((By.ID, ID_ADDRESS_SELECT))
            )
            wait.until(lambda d: len(Select(select_el).options) > 1)

            paon_norm = str(user_paon).strip().casefold()
            sel = Select(select_el)

            time.sleep(10)

            def _best_option():
                # Prefer exact contains on visible text; fallback to casefold contains
                for opt in sel.options:
                    txt = (opt.text or "").strip()
                    if paon_norm and paon_norm in txt.casefold():
                        return opt
                return None

            opt = _best_option()
            if not opt:
                raise ValueError(
                    f"Could not find an address containing '{user_paon}' in the dropdown."
                )
            sel.select_by_visible_text(opt.text)

            # After selecting, the collections table should (re)render; wait for it
            wait.until(EC.presence_of_element_located((By.ID, ID_COLLECTIONS)))
            # Also wait until at least one data row is present (beyond headers)
            wait.until(
                lambda d: len(
                    d.find_elements(By.CSS_SELECTOR, f"#{ID_COLLECTIONS} .dform_tr")
                )
                > 1
            )

            soup = BeautifulSoup(driver.page_source, features="html.parser")

            table = soup.find("div", id=ID_COLLECTIONS)
            if not table:
                raise ValueError(
                    f"Could not find collections table by id='{ID_COLLECTIONS}'"
                )

            rows = table.find_all("div", class_="dform_tr")

            # Skip header row (first row with .dform_th entries)
            for row in rows[1:]:
                tds = row.find_all("div", class_="dform_td")
                if len(tds) < 3:
                    continue

                # Colour comes from the <img alt="...">
                colour_cell = tds[0]
                img = colour_cell.find("img")
                colour = img.get("alt").strip() if img and img.has_attr("alt") else None

                # Date text
                raw_date = tds[1].get_text(strip=True)
                # Example: "Wednesday, November 12, 2025"
                dt = datetime.strptime(raw_date, "%A, %B %d, %Y")

                dict_data = {
                    "type": colour,
                    "collectionDate": dt.strftime(date_format),
                }
                bindata["bins"].append(dict_data)

            bindata["bins"].sort(
                key=lambda x: datetime.strptime(x.get("collectionDate"), date_format)
            )

        except Exception as e:
            # Here you can log the exception if needed
            print(f"An error occurred: {e}")
            # Optionally, re-raise the exception if you want it to propagate
            raise
        finally:
            # This block ensures that the driver is closed regardless of an exception
            if driver:
                driver.quit()
Contributor


⚠️ Potential issue | 🟠 Major

Fife parser expects paon but config/wikis only supply house_number.

Inside parse_data you do:

user_postcode = kwargs.get("postcode")
user_paon = kwargs.get("paon")
check_postcode(user_postcode)
check_paon(user_paon)
...
paon_norm = str(user_paon).strip().casefold()

However, the FifeCouncil entry in tests/input.json now only defines house_number, and the wiki text tells users to pass a “house number” parameter, not paon. In flows that rely on those kwargs, user_paon will be None, so _best_option will end up searching for "none" and always raise the “Could not find an address” ValueError.

To make this work with both existing conventions and the new config, have the parser fall back to house_number when paon is absent:

-            user_postcode = kwargs.get("postcode")
-            user_paon = kwargs.get("paon")
-            check_postcode(user_postcode)
-            check_paon(user_paon)
+            user_postcode = kwargs.get("postcode")
+            user_paon = kwargs.get("paon") or kwargs.get("house_number")
+            check_postcode(user_postcode)
+            check_paon(user_paon)
@@
-            paon_norm = str(user_paon).strip().casefold()
+            paon_norm = str(user_paon).strip().casefold()

(Only the source of user_paon changes.)

This keeps support for any callers already using paon while making the documented house_number parameter work for Fife.

🤖 Prompt for AI Agents
In uk_bin_collection/uk_bin_collection/councils/FifeCouncil.py around lines
21-142, the parser reads user_paon from kwargs["paon"] which can be None because
callers provide "house_number"; update the source of the PAON to use
kwargs.get("paon") with a fallback to kwargs.get("house_number"), assign that
resolved value to user_paon, and use that resolved value for check_paon() and
further processing (paon_norm and error messages) so both old callers using
"paon" and new callers using "house_number" work correctly.

Comment on lines +18 to +58
    def parse_data(self, page: str, **kwargs) -> dict:

        user_uprn = kwargs.get("uprn")
        check_uprn(user_uprn)
        bindata = {"bins": []}

        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
        }

        params = {
            "uprn": user_uprn,
        }

        response = requests.get(
            "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
            params=params,
            headers=headers,
        )

        soup = BeautifulSoup(response.text, features="html.parser")

        summary = soup.find("div", {"class": "summary"})
        collectionrows = summary.find_all("div", {"class": "collectionsrow"})

        for collectionrow in collectionrows:
            bin_type = collectionrow.find("div", {"class": "col-xs-4"})
            collection_time = collectionrow.find("div", {"class": "col-sm-6"})

            if bin_type and collection_time:
                collectionDate = datetime.strptime(
                    collection_time.text.strip(), "%a - %d %b %Y"
                )

                dict_data = {
                    "type": bin_type.text.strip(),
                    "collectionDate": collectionDate.strftime(date_format),
                }
                bindata["bins"].append(dict_data)

        return bindata
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Harden Harlow HTTP/HTML handling: add timeouts/status checks and guard missing .summary.

Right now the scraper can hang indefinitely or crash on layout/HTTP changes:

  • requests.get has no timeout and doesn’t check status_code; a slow or broken endpoint will block indefinitely, and 4xx/5xx HTML will fall through to parsing.
  • summary = soup.find("div", {"class": "summary"}) is assumed to succeed; if it’s missing, summary.find_all(...) will raise AttributeError.

Consider tightening this up along these lines:

-        response = requests.get(
-            "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
-            params=params,
-            headers=headers,
-        )
-
-        soup = BeautifulSoup(response.text, features="html.parser")
-
-        summary = soup.find("div", {"class": "summary"})
-        collectionrows = summary.find_all("div", {"class": "collectionsrow"})
+        response = requests.get(
+            "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
+            params=params,
+            headers=headers,
+            timeout=120,
+        )
+        response.raise_for_status()
+
+        soup = BeautifulSoup(response.text, features="html.parser")
+
+        summary = soup.find("div", {"class": "summary"})
+        if not summary:
+            raise ValueError("Could not find bin collection summary on Harlow page")
+        collectionrows = summary.find_all("div", {"class": "collectionsrow"})

You can also safely drop the unused import time at the top.



---

🏁 Script executed:

```shell
#!/bin/bash
# Locate and examine HarlowCouncil.py file
fd -t f "HarlowCouncil.py" 2>/dev/null | head -20
```

Length of output: 61


🏁 Script executed:

#!/bin/bash
# Examine the full HarlowCouncil.py file to check imports and implementation
if [ -f "uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py" ]; then
  cat "uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py"
fi

Length of output: 1911


🏁 Script executed:

#!/bin/bash
# Check for any try-except or error handling patterns in the file
rg -n "timeout|raise_for_status|try|except" uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py -C2 2>/dev/null

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if requests module has timeout defaults or if other council scrapers use timeouts
rg -n "timeout=" uk_bin_collection/uk_bin_collection/councils/ -t py 2>/dev/null | head -20

Length of output: 2826


🏁 Script executed:

#!/bin/bash
# Check import patterns in the file for 'time' module
rg -n "^import|^from" uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py 2>/dev/null

Length of output: 209


Add timeout, status check, and guard against missing .summary div.

The code lacks critical robustness safeguards:

  • requests.get() (lines 33–38) has no timeout and doesn't check response.status_code; requests can hang indefinitely, and error pages will pass through to parsing. Other council scrapers in this codebase use timeout=20 or higher as standard.
  • summary.find_all() (line 43) assumes the .summary div exists; if the HTML changes, None.find_all() will raise AttributeError.
  • import time (line 1) is unused and should be removed.

Apply the suggested changes:

         response = requests.get(
             "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
             params=params,
             headers=headers,
+            timeout=120,
         )
+        response.raise_for_status()
 
         soup = BeautifulSoup(response.text, features="html.parser")
 
         summary = soup.find("div", {"class": "summary"})
+        if not summary:
+            raise ValueError("Could not find bin collection summary on Harlow page")
         collectionrows = summary.find_all("div", {"class": "collectionsrow"})

Also remove the unused import time at line 1.
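Pulling those points together, a self-contained sketch of the hardened request/parse flow could look like this (the fetch_harlow_rows name, the simplified user-agent and the timeout=20 value are illustrative; the URL and CSS classes are taken from the snippet above):

```python
import requests
from bs4 import BeautifulSoup


def fetch_harlow_rows(uprn: str):
    """Fetch and parse Harlow collection rows with the safeguards suggested above."""
    response = requests.get(
        "https://selfserve.harlow.gov.uk/appshost/firmstep/self/apps/custompage/bincollectionsecho",
        params={"uprn": uprn},
        headers={"user-agent": "Mozilla/5.0"},
        timeout=20,  # fail fast instead of hanging on a slow endpoint
    )
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error page

    soup = BeautifulSoup(response.text, features="html.parser")
    summary = soup.find("div", {"class": "summary"})
    if summary is None:
        # Layout changed or no data for this UPRN: fail loudly rather than
        # hitting AttributeError on None.find_all().
        raise ValueError("Could not find bin collection summary on Harlow page")
    return summary.find_all("div", {"class": "collectionsrow"})
```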

🧰 Tools
🪛 Ruff (0.14.5)

18-18: Unused method argument: page

(ARG002)


21-21: check_uprn may be undefined, or defined from star imports

(F405)


32-32: Probable use of requests call without timeout

(S113)


48-48: datetime may be undefined, or defined from star imports

(F405)


54-54: date_format may be undefined, or defined from star imports

(F405)

🤖 Prompt for AI Agents
In uk_bin_collection/uk_bin_collection/councils/HarlowCouncil.py around lines 18
to 58, the HTTP call and HTML parsing need hardening: remove the unused import
time at line 1; call requests.get(...) with a timeout (e.g. timeout=20) and
verify the response is successful (check response.status_code == 200 or call
response.raise_for_status()) before parsing; guard against a missing summary div
by checking if summary is not None before calling summary.find_all(...) (return
bindata or an empty bins list if summary is missing) so you don't call
.find_all() on None.

Comment on lines 60 to +89
for day in week_days:
for row in collection_schedule:
schedule_type = row.find("th").get_text().strip()
results2 = re.search("([^(]+)", row.find("td").get_text().strip())
schedule_cadence = row.find("td").get_text().strip().split(" ")[1]
if results2:
schedule_day = results2[1].strip()
for collection_type in week_collection_types:
collectionDate = None
if collection_type in schedule_type.lower():
if (
day.weekday()
== time.strptime(schedule_day, "%A").tm_wday
):
dict_data = {
"type": schedule_type,
"collectionDate": day.strftime(date_format),
}
data["bins"].append(dict_data)
collectionDate = day.strftime(date_format)
else:
if "Fortnightly" in schedule_cadence:
if (
day.weekday()
== time.strptime(schedule_day, "%A").tm_wday
):
day = day + timedelta(days=7)
collectionDate = day.strftime(date_format)

if schedule_type and collectionDate:
dict_data = {
"type": schedule_type,
"collectionDate": collectionDate,
}
data["bins"].append(dict_data)

⚠️ Potential issue | 🔴 Critical

Fortnightly cadence branch is mis-attached and can generate incorrect dates/types

The new cadence logic has a few problems that can lead to wrong or missing collections:

  1. else is attached to the wrong condition (logic bug)

    • Currently:
      • else: (Line 75) pairs with if collection_type in schedule_type.lower(): (Line 69).
      • This means the “Fortnightly” branch runs only when the collection_type does not match the schedule_type, and never when it does match.
    • Consequences:
      • Fortnightly adjustment is skipped for the actual matching bin type.
      • For non-matching collection_type values, the fortnightly branch can still set a collectionDate, then if schedule_type and collectionDate: (Lines 84–89) will append entries using the wrong schedule_type, potentially duplicating or mis-dating bins.
  2. Brittle schedule_cadence parsing

    • schedule_cadence = row.find("td").get_text().strip().split(" ")[1] (Line 64) assumes there are at least two space‑separated tokens in the cell.
    • If the HTML ever changes to a single word (e.g. just "Thursday") or another format, this will raise an IndexError and break the scraper.
  3. Repeated parsing of the <td> text

    • row.find("td").get_text().strip() is called multiple times (Lines 63–64). It’s minor, but you can compute this once for clarity and slight efficiency.

A safer and clearer structure is to:

  • Parse the <td> text once.
  • Safely derive schedule_cadence.
  • Compute schedule_day and its weekday once.
  • Apply the fortnightly adjustment inside the matching-collection_type branch.

For example:

-                for row in collection_schedule:
-                    schedule_type = row.find("th").get_text().strip()
-                    results2 = re.search("([^(]+)", row.find("td").get_text().strip())
-                    schedule_cadence = row.find("td").get_text().strip().split(" ")[1]
-                    if results2:
-                        schedule_day = results2[1].strip()
-                        for collection_type in week_collection_types:
-                            collectionDate = None
-                            if collection_type in schedule_type.lower():
-                                if (
-                                    day.weekday()
-                                    == time.strptime(schedule_day, "%A").tm_wday
-                                ):
-                                    collectionDate = day.strftime(date_format)
-                            else:
-                                if "Fortnightly" in schedule_cadence:
-                                    if (
-                                        day.weekday()
-                                        == time.strptime(schedule_day, "%A").tm_wday
-                                    ):
-                                        day = day + timedelta(days=7)
-                                        collectionDate = day.strftime(date_format)
-
-                            if schedule_type and collectionDate:
-                                dict_data = {
-                                    "type": schedule_type,
-                                    "collectionDate": collectionDate,
-                                }
-                                data["bins"].append(dict_data)
+                for row in collection_schedule:
+                    schedule_type = row.find("th").get_text().strip()
+                    td = row.find("td")
+                    if not td:
+                        continue
+
+                    td_text = td.get_text().strip()
+                    results2 = re.search(r"([^(]+)", td_text)
+
+                    tokens = td_text.split()
+                    schedule_cadence = tokens[1] if len(tokens) > 1 else ""
+
+                    if results2:
+                        schedule_day = results2.group(1).strip()
+                        schedule_wday = time.strptime(schedule_day, "%A").tm_wday
+                        is_fortnightly = "Fortnightly" in schedule_cadence
+
+                        for collection_type in week_collection_types:
+                            collectionDate = None
+                            if collection_type in schedule_type.lower():
+                                if day.weekday() == schedule_wday:
+                                    effective_day = (
+                                        day + timedelta(days=7)
+                                        if is_fortnightly
+                                        else day
+                                    )
+                                    collectionDate = effective_day.strftime(date_format)
+
+                            if schedule_type and collectionDate:
+                                dict_data = {
+                                    "type": schedule_type,
+                                    "collectionDate": collectionDate,
+                                }
+                                data["bins"].append(dict_data)

This keeps the fortnightly handling tied to the correct schedule_type, avoids accidental extra entries, and guards against brittle .split()[1] usage.
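As a quick sanity check of the weekday matching and fortnightly offset used in the sketch above (the dates and the date_format value here are illustrative):

```python
import time
from datetime import date, timedelta

date_format = "%d/%m/%Y"  # illustrative; the project defines its own constant

schedule_day = "Thursday"
schedule_wday = time.strptime(schedule_day, "%A").tm_wday  # Thursday -> 3 (Mon=0)
is_fortnightly = True

day = date(2025, 11, 13)  # a Thursday
if day.weekday() == schedule_wday:
    effective_day = day + timedelta(days=7) if is_fortnightly else day
    print(effective_day.strftime(date_format))  # prints 20/11/2025
```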

🧰 Tools
🪛 Ruff (0.14.5)

63-63: re may be undefined, or defined from star imports

(F405)


74-74: date_format may be undefined, or defined from star imports

(F405)


82-82: date_format may be undefined, or defined from star imports

(F405)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)

102-116: Consider renaming loop variable and reducing reliance on * imports

Two small polish items:

  • The loop uses for type in collection_types: which shadows the built-in type; consider renaming to bin_type (and updating the dict accordingly) for clarity.
  • Ruff flags datetime/date_format as F405 because they come from a from ... import *; if you want cleaner linting and clearer dependencies, consider importing the specific names you need instead of using a star import (a brief sketch follows the loop-rename example below).

Example for the loop rename:

-                    collection_types = collection_type.split(" and ")
-
-                    for type in collection_types:
-
-                        dict_data = {
-                            "type": type,
+                    collection_types = collection_type.split(" and ")
+
+                    for bin_type in collection_types:
+
+                        dict_data = {
+                            "type": bin_type,
                             "collectionDate": collectiondate,
                         }
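For the star-import point, a hedged example of what explicit imports could look like (the module path is an assumption based on the star import the councils currently use; adjust it to wherever date_format actually lives):

```python
# Instead of: from ...common import *
from datetime import datetime

# Assumed location of the shared constant; verify against the real module.
from uk_bin_collection.uk_bin_collection.common import date_format
```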
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 330bcd9 and 1c68a0d.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)
custom_components/uk_bin_collection/calendar.py (1)
  • event (54-63)
🪛 Ruff (0.14.5)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py

102-102: datetime may be undefined, or defined from star imports

(F405)


105-105: date_format may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Run Unit Tests (3.12, 1.8.4)
  • GitHub Check: Run Integration Tests (3.12, 1.8.4)
🔇 Additional comments (1)
uk_bin_collection/uk_bin_collection/councils/WiltshireCouncil.py (1)

57-120: Bin list construction and type/date mapping look correct

data_bins is initialised once and populated with per-event dicts using "type": type and "collectionDate": collectiondate, and collection_type.split(" and ") ensures multi-service entries are split into individual bin types. This resolves the earlier swapped-key issue and aligns with the expected { "bins": [...] } structure.
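A short illustration of that splitting behaviour, using a made-up combined label rather than live Wiltshire data:

```python
from datetime import datetime

date_format = "%d/%m/%Y"  # illustrative output format

collection_type = "Household waste and Recycling"  # plausible combined label
collectiondate = datetime(2025, 11, 20).strftime(date_format)

bins = []
for bin_type in collection_type.split(" and "):
    bins.append({"type": bin_type, "collectionDate": collectiondate})

print({"bins": bins})
# {'bins': [{'type': 'Household waste', 'collectionDate': '20/11/2025'},
#           {'type': 'Recycling', 'collectionDate': '20/11/2025'}]}
```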
