Skip to content

feat(scan): generic Apify actor provider#693

Open
ageem23 wants to merge 9 commits into
santifer:mainfrom
ageem23:feat/apify-provider-pr
Open

feat(scan): generic Apify actor provider#693
ageem23 wants to merge 9 commits into
santifer:mainfrom
ageem23:feat/apify-provider-pr

Conversation

@ageem23
Copy link
Copy Markdown
Contributor

@ageem23 ageem23 commented May 18, 2026

Follow-up to the Plugin Architecture RFC (#521) — implements the third item in the RFC's split plan ("Follow-up RFC + PR (Apify, closes #325)"). Phase A (plugin contract + Greenhouse/Ashby/Lever ports) is already on main; this PR builds on that.

Closes #325.

Summary

  • providers/apify.mjs — generic provider that runs any pre-built Apify actor and maps its dataset items to the scanner's {title, url, company, location} shape. All variation (actor, input, field mappings) lives in portals.yml, not in code, so a single implementation serves any site with an Apify scraper available.
  • providers/_apify.mjs — shared Apify transport (auth, run/poll/fetch, retries with end-to-end deadline, SSRF guard on actorId). Follows the _-prefix helper convention established by _http.mjs / _types.js.
  • Optional field_map.description mapping persists the JD body to jds/{slug}-{hash}.md and rewrites the entry's url to local:jds/... so downstream tools read it from disk instead of re-fetching the remote URL. Avoids paying for the same Apify run twice and dodges HTTP failures on stale job-board links. Falls through to the remote URL when the description field is missing or shorter than 50 chars.
  • .env.example — documents APIFY_TOKEN.
  • templates/portals.example.yml — worked Indeed example showing both the basic field_map shape and the optional description mapping.
  • test-all.mjs — new §11 with 34 assertions covering the security-relevant paths (URL rejection, malformed field_map, missing token, actor-id SSRF guard, defaults allowlist).

Conformance to RFC #521

  • Provider contract: default export with id, detect(), fetch(entry, ctx). detect() returns null (explicit provider: apify required — no URL scheme to auto-detect from).
  • Error contract: missing token / missing required entry config / invalid field_map shape all throw so scan.mjs records the error per-entry and continues with the next.
  • Helper convention: _apify.mjs uses the _ prefix and is skipped by the loader.
  • scan-history.tsv: source label uses the slugified actor id, so misceres/indeed-scraper produces a stable per-actor dedup key.

Security hardening

  • APIFY_TOKEN sent via Authorization: Bearer header, never as a query string (prevents token leakage into access logs).
  • Strict actorId regex (^[A-Za-z0-9][A-Za-z0-9_.-]*[~/][A-Za-z0-9][A-Za-z0-9_.-]*$) so a malformed config can't divert the bearer-token request off the intended /acts/<actor>/runs path.
  • Single shared end-to-end deadline threaded through startRunwaitForRunfetchDatasetItems. runActor's timeoutMs is the actual wall-clock ceiling.
  • waitForRun() re-throws non-retriable 4xx (401/403/404) instead of looping to the deadline and masking the real cause as a generic timeout.
  • runActor() rejects non-finite/non-positive timeoutMs up front with a clear error.
  • saveJd() filename includes sha1(url)[:10] so two distinct postings sharing the same company+title slug can't collide on one cache file.
  • field_map spec validation accepts only a string or non-empty array of strings; rejects malformed values like { title: 42 } at load time with a clear error instead of crashing mid-scan.
  • normalizeItem() defaults only apply to title/url/company/locationportals.yml cannot inject arbitrary record fields via defaults:.
  • isHttpsUrl() guard on actor-supplied URLs — drops items whose URL isn't https:, blocking javascript:, data:, file:, and protocol-downgraded http: URLs from ending up clickable in pipeline.md or feeding the JD-cache filename hash.

Tests

test-all.mjs §11 (34 assertions):

  • isHttpsUrl() accepts https; rejects http:, javascript:, data:, file:, ftp:, malformed, empty, null, undefined
  • isFieldSpec() accepts string + non-empty string array; rejects number, empty array, mixed-type array, null, object
  • normalizeItem() picks first non-empty fallback; applies defaults only to allowlisted keys
  • normalizeActorId() rejects path traversal, query, fragment, extra segments, non-string
  • apify.fetch() throws clear errors for missing APIFY_TOKEN / missing actor / malformed field_map, never reaches the network in those cases

Style follows §11/§12/§13 in #653. Full suite (node test-all.mjs --quick): 100 passed, 0 failed.

Smoke-tested against real Apify actors (Indeed, Workday-via-third-party) on my fork — items flow through correctly, JD-cache files written, no regressions vs. the pre-PR provider.

Deliberate scope limits

  • One example entry in portals.example.yml — the Indeed actor. Workday-via-third-party-rental actors work too (tested) but recommending a specific paid third-party rental in upstream docs warrants its own discussion; happy to add in a follow-up.
  • No browser transport / no scraper provider — those are the separate follow-up PR per RFC RFC: Plugin-based Provider Architecture for `scan.mjs` #521.
  • Tests cover the security-relevant paths only — happy-path fetch() through a mocked runActor and the JD-cache write path are left for a follow-up to keep §11 narrow.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an Apify provider to fetch job data with configurable actor input and field mapping; includes an Indeed example and optional local caching of job descriptions as Markdown while preserving original remote URLs.
  • Documentation

    • .env.example documents APIFY_TOKEN and notes Apify-backed entries may error cleanly when token is unset.
  • Reliability

    • Enhanced validation, timeouts/deadlines, retries, and clearer error messages.
  • Tests

    • Added comprehensive provider tests for URL handling, field validation, normalization, and error cases.

Review Change Stack

ageem23 and others added 5 commits May 18, 2026 17:35
Adds a generic provider that runs any pre-built Apify actor and maps its
dataset items to the scanner's {title, url, company, location} shape.
All variation (which actor, what input, how to read fields) lives in
portals.yml, not in code — so a single provider implementation serves
any site that has a working Apify scraper.

Optional `field_map.description` persists the JD body to jds/{slug}.md
and rewrites the entry's url to local:jds/... so downstream tools read
it from disk instead of re-fetching the remote URL. Avoids paying for
the same Apify run twice and dodges HTTP failures on stale job-board
links.

Transport (auth, run/poll/fetch, retries with a shared end-to-end
deadline, SSRF guard on actorId) lives in providers/_apify.mjs, mirroring
the _http.mjs split used by the existing providers. APIFY_TOKEN is sent
via Authorization header, never as a query string. Entries error cleanly
when the token is missing so the rest of scan.mjs continues.

Includes a worked example in templates/portals.example.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worked example showing the minimum field_map shape (positionName→title,
company/companyName fallback, location/formattedLocation fallback) for
the misceres/indeed-scraper actor.

Also demonstrates the optional `description` mapping that enables local
jds/{slug}.md caching and the local:jds/... URL rewrite. Without an
example here, that feature is only discoverable by reading the provider
source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the APIFY_TOKEN entry to .env.example so users discover the
configuration requirement when they copy the file, instead of finding
out only after a scan run hits the missing-token error from
providers/_apify.mjs.

Per the split plan in RFC santifer#521, the env entry ships with the Apify
provider PR rather than the plugin-contract PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from the round-1 review:

1. Dead `applicationUrl` fallback in saveJd() — normalizeItem never
   populates that field, so the middle clause of the hash input was
   never reached. Simplified to `normalized.url || company-title`.
   (apify.mjs:148)

2. Docs referenced `jds/{slug}.md` but saveJd actually writes
   `jds/{slug}-{hash}.md` (the sha1[:10] suffix prevents two distinct
   postings sharing the same company+title slug from colliding on one
   file). Fixed in both the apify.mjs module docstring and the
   portals.example.yml inline comment.

3. Apply PR santifer#653's URL-validation pattern: actor-supplied URLs were
   passed through unchecked to pipeline.md, the JD-cache filename
   hash, and `_remote_url`. Added isHttpsUrl() and reject items
   whose normalized.url isn't https — blocks `javascript:`, `data:`,
   `file:`, and protocol-downgraded `http:` URLs from a buggy or
   malicious actor before they become clickable links.

Also scrubbed a stale `mirrors providers/linkedin.mjs format` section
comment — that file doesn't exist on upstream/main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CodeRabbit's recommendation on PR #26, adds a narrow provider test
section focused on the new branching surface introduced by this PR:
URL rejection, field_map validation, missing token, actor-id SSRF
guard, and normalization defaults whitelist. 34 assertions, follows
the §11/§12/§13 style established by PR santifer#653.

Specifically covers:

- isHttpsUrl() rejects javascript:/data:/file:/ftp:/http:/malformed/
  empty/null/undefined (the URL-validation hardening from the PR santifer#653
  pattern)
- isFieldSpec() accepts string + non-empty string array, rejects
  number/empty array/mixed-type array/null/object
- normalizeItem() picks first non-empty fallback, applies defaults
  only to title/url/company/location (rejects defaults outside the
  allowlist so portals.yml can't inject arbitrary record fields)
- normalizeActorId() rejects path traversal (../), query (?), fragment
  (#), extra path segments, missing separator, non-string — the SSRF
  guard on /acts/<actor>/runs
- apify.fetch() throws clear errors and never reaches runActor when
  APIFY_TOKEN is unset, actor is missing, or field_map is malformed

To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem
(apify.mjs) and normalizeActorId (_apify.mjs) are now named exports
alongside the existing default export. Behavior unchanged.

The full happy-path / runActor mocking / JD-cache write path is left
for a follow-up — would require mocking _apify.mjs's runActor and is
out of scope for the security-focused §11 CodeRabbit asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Welcome to career-ops, @ageem23! Thanks for your first PR.

A few things to know:

  • Tests will run automatically — check the status below
  • Make sure you've linked a related issue (required for features)
  • Read CONTRIBUTING.md if you haven't

We'll review your PR soon. Join our Discord if you have questions.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 026fd320-bf14-4c8f-b4c7-587ba76876fe

📥 Commits

Reviewing files that changed from the base of the PR and between 844e4fa and 78cd645.

📒 Files selected for processing (2)
  • providers/apify.mjs
  • test-all.mjs

📝 Walkthrough

Walkthrough

Adds Apify-backed job scraping: a transport module for actor lifecycle and retries, a provider that maps actor dataset items into job records and optionally persists descriptions to deterministic local Markdown files, configuration examples, and tests validating safety and configuration guards.

Changes

Apify Provider Integration

Layer / File(s) Summary
Apify transport and actor lifecycle
providers/_apify.mjs
Module constants for API endpoints and retry timing, hasToken() and normalizeActorId() for token validation and actor path normalization, HTTP helpers with deadline-aware retries and per-request timeouts, startRun() to create runs, abortRun() for best-effort cleanup, waitForRun() to poll until terminal status or deadline, fetchDatasetItems() to retrieve results, and runActor() orchestrating the full lifecycle.
Provider interface and job description pipeline
providers/apify.mjs
isFieldSpec() and isHttpsUrl() validators, dotted-path extraction, HTML→text, slug/hash-based deterministic jds/ Markdown persistence, normalizeItem() to map actor items into canonical {title,url,company,location}, filtering for required fields and HTTPS, optional local JD persistence with local:jds/... rewrite and _remote_url preservation, plus the default exported provider implementing detect() and fetch().
Environment and portal configuration
.env.example, templates/portals.example.yml
Documents optional APIFY_TOKEN and adds an "Apify-backed sources" example including an enabled misceres/indeed-scraper entry mapping actor output into the scanner's job fields and enabling description caching.
Provider behavior and safety test suite
test-all.mjs
Adds tests asserting provider invariants (no autodetect, id), URL and field_map validation, normalizeItem fallback behavior and allowed defaults, normalizeActorId safety checks, and pre-network fetch() guards for missing token/invalid entries; tests are wrapped to report provider test crashes.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant runActor
  participant startRun
  participant waitForRun
  participant abortRun
  participant fetchDatasetItems
  Client->>runActor: call runActor(actorId, input, timeoutMs)
  runActor->>startRun: POST /actors/{actorId}/runs (create run)
  runActor->>waitForRun: poll /runs/{runId} until terminal or deadline
  waitForRun->>abortRun: on deadline expiry -> POST /runs/{runId}/abort
  waitForRun->>runActor: return terminal status
  alt status == SUCCEEDED
    runActor->>fetchDatasetItems: GET /runs/{runId}/dataset/items
    fetchDatasetItems->>Client: return items array
  else
    runActor->>Client: throw error (non-SUCCEEDED)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

🔧 scripts

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a generic Apify actor provider to the scanner.
Linked Issues check ✅ Passed The PR successfully implements all primary coding objectives from issue #325: Apify-backed provider, pipeline/history outputs, --dry-run support, dedup logic, title filters, graceful fallback on missing token, and field mapping without Redis (as scoped).
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #325: new Apify provider module, transport helper, configuration example, environment documentation, and security-focused tests. No unrelated alterations detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread providers/apify.mjs Fixed
Comment thread providers/apify.mjs Fixed
Comment thread providers/apify.mjs Fixed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/apify.mjs`:
- Around line 157-187: The saveJd function currently performs filesystem
operations that can throw and abort fetch for the whole source; change it to
treat local caching as best-effort: wrap the mkdirSync/existsSync/writeFileSync
sequence in a try/catch and on any error simply return the remote URL fallback
(normalized.url or `${normalized.company}-${normalized.title}` fallback) instead
of throwing; keep using slugify/join to build filename but ensure you do not
propagate exceptions from mkdirSync or writeFileSync, and sanitize inputs passed
to slugify/ join by coercing to strings (String(normalized.title/company/url))
before use so path traversal/command-injection risks are reduced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6498802c-f269-451d-9b19-db46e7dbbb5a

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1f3a3 and c3e3b2f.

📒 Files selected for processing (5)
  • .env.example
  • providers/_apify.mjs
  • providers/apify.mjs
  • templates/portals.example.yml
  • test-all.mjs

Comment thread providers/apify.mjs
CodeRabbit feedback on PR santifer#693: a synchronous mkdirSync/writeFileSync
failure (disk full, EACCES, EROFS, Windows EBUSY) inside saveJd would
propagate up through the .map() callback and abort fetch() for the
entire entry — one bad cache write taking down a whole multi-source
scan run.

Wrap the FS sequence in try/catch. On failure, warn to stderr and
return null; the caller now checks for null and leaves the remote
URL on the record instead of producing a broken `local:` link.

The "coerce inputs to String() before slugify" half of the same CR
finding is already covered: slugify() does `String(text || '')` at
its first line, and the template-literal hash input and yamlEscape
already coerce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/apify.mjs`:
- Around line 160-194: The saveJd function uses a non-atomic existsSync() then
writeFileSync() pattern which allows races; remove the existsSync() pre-check
and perform the write with writeFileSync(filepath, content, { encoding: 'utf-8',
flag: 'wx' }) so the OS atomically fails if the file was created by a concurrent
process; catch the specific EEXIST error in the existing catch block (or check
err.code === 'EEXIST') and on EEXIST return the existing
`${JDS_DIR}/${filename}` (preserving "first save wins"), otherwise continue to
log and return null as before; reference saveJd, JDS_DIR, writeFileSync,
existsSync, and the error handling in the catch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 58f477b4-4e3d-4c0c-843d-20ebe7918024

📥 Commits

Reviewing files that changed from the base of the PR and between c3e3b2f and e5cf1f6.

📒 Files selected for processing (1)
  • providers/apify.mjs

Comment thread providers/apify.mjs
GitHub Advanced Security flagged htmlToText on PR santifer#693:

- Alert #7 (bad HTML filtering regexp): `<script[\s\S]*?<\/script>`
  didn't match `</script >` (whitespace before close). Added `\s*` to
  both script and style close patterns, plus `\b` after the open tag
  so `<scripty>` isn't matched as a script open.

- Alert #8 (incomplete multi-character sanitization): a polyglot like
  `<<a>b>` left a dangling `<` after one pass. Loop the strip until
  the string stabilizes.

- Alert #6 (double escape/unescape): decoding `&amp;` before `&#NN;`
  meant `&amp;santifer#60;` round-tripped through `&santifer#60;` and out as `<`.
  Moved `&amp;` to the end of the decode chain so the literal
  `&amp;santifer#60;` stays as `&santifer#60;`.

None of the three is exploitable in our context — htmlToText output
is plain text written to a markdown JD file, never rendered — but
clean CodeQL signal on the PR is worth the small regex tweaks.

§11 grows by three assertions (one per finding) so the fixes can't
silently regress. Full suite: 103 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread providers/apify.mjs Fixed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
providers/apify.mjs (1)

135-137: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Close-tag regex doesn't handle all browser-tolerated variants.

CodeQL correctly identifies that <\/script\s*> and <\/style\s*> fail to match close tags with non-whitespace characters before >, such as </script foo> or </script\t\nbar>. Browsers treat these as valid close tags. Use [^>]* to match any characters up to >.

Proposed fix
-    .replace(/<script\b[\s\S]*?<\/script\s*>/gi, ' ')
-    .replace(/<style\b[\s\S]*?<\/style\s*>/gi, ' ')
+    .replace(/<script\b[\s\S]*?<\/script[^>]*>/gi, ' ')
+    .replace(/<style\b[\s\S]*?<\/style[^>]*>/gi, ' ')
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/apify.mjs` around lines 135 - 137, The regex used to strip
script/style blocks when computing the variable cleaned from raw doesn't match
browser-tolerated close-tag variants; update the two replace calls that target
"</script\s*>" and "</style\s*>" (the replacements applied to cleaned) to use a
closing-tag pattern like "<\/script[^>]*>" and "<\/style[^>]*>" so they match
any characters up to the final '>' (e.g., "</script foo>") before replacing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@providers/apify.mjs`:
- Around line 135-137: The regex used to strip script/style blocks when
computing the variable cleaned from raw doesn't match browser-tolerated
close-tag variants; update the two replace calls that target "</script\s*>" and
"</style\s*>" (the replacements applied to cleaned) to use a closing-tag pattern
like "<\/script[^>]*>" and "<\/style[^>]*>" so they match any characters up to
the final '>' (e.g., "</script foo>") before replacing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8b82255c-7ba1-44ac-adbb-dd86ee859f01

📥 Commits

Reviewing files that changed from the base of the PR and between e5cf1f6 and 9b1baad.

📒 Files selected for processing (2)
  • providers/apify.mjs
  • test-all.mjs

ageem23 and others added 2 commits May 18, 2026 20:01
CodeRabbit feedback on PR santifer#693: scan.mjs runs up to 10 concurrent
workers, so the existsSync()-then-writeFileSync() pattern in saveJd
is a TOCTOU race — two workers can both pass the existence check
and then clobber each other's writes.

Keep existsSync as a cheap fast-path (avoids building the YAML/markdown
content string when the cache is already populated), but switch the
actual write to `{ flag: 'wx' }` so the OS atomically fails the open
with EEXIST if a sibling worker created the file in the meantime.
Handle EEXIST in the existing catch by returning the path (preserving
the documented "first save wins" behaviour); every other FS error
still logs and returns null for the per-item fallback added in
e5cf1f6.

`relPath` is hoisted above the try so it's accessible to the catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous htmlToText fix (9b1baad) used `</script\s*>` to tolerate
the whitespace variant `</script >`. CodeQL #9 points out the regex
still misses the parser-tolerated junk-attribute form like
`</script\t\n foo bar>` that some scrapers emit — invalid per the
HTML spec but the browser parser silently accepts it as a script
close, so it can still hide script content from a stripper that's
stricter than the parser.

Switch both script and style close patterns from `\s*>` to `\b[^>]*>`
so the close tag consumes any whitespace, tabs, newlines, or invalid
attributes before the actual `>`. `\b` preserves the boundary so
`</scripty>` still doesn't match.

Adds one §11 assertion exercising the junk-attr variant. Full suite:
104 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Apify-based scraper as complement to scan.mjs for Indeed Canada

2 participants