feat(scan): generic Apify actor provider (closes #325) #26
Adds a generic provider that runs any pre-built Apify actor and maps its
dataset items to the scanner's {title, url, company, location} shape.
All variation (which actor, what input, how to read fields) lives in
portals.yml, not in code — so a single provider implementation serves
any site that has a working Apify scraper.
Optional `field_map.description` persists the JD body to jds/{slug}.md
and rewrites the entry's url to local:jds/... so downstream tools read
it from disk instead of re-fetching the remote URL. Avoids paying for
the same Apify run twice and dodges HTTP failures on stale job-board
links.
Transport (auth, run/poll/fetch, retries with a shared end-to-end
deadline, SSRF guard on actorId) lives in providers/_apify.mjs, mirroring
the _http.mjs split used by the existing providers. APIFY_TOKEN is sent
via Authorization header, never as a query string. Entries error cleanly
when the token is missing so the rest of scan.mjs continues.
Includes a worked example in templates/portals.example.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
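The header-only token handling described in this commit can be sketched as below. This is a minimal illustration, not the code in providers/_apify.mjs: `buildRunRequest` is a hypothetical name, and the sketch assumes the public Apify v2 endpoint shape (`/v2/acts/<actor>/runs`, with `~` separating user and actor name in the path).

```javascript
// Hypothetical sketch: build the "start run" request without sending it.
// The token travels only in the Authorization header, never in the URL.
function buildRunRequest(actorId, input, env = process.env) {
  const token = env.APIFY_TOKEN;
  if (!token) {
    // Throwing here lets the caller record a per-entry error and move on.
    throw new Error('APIFY_TOKEN is not set');
  }
  // Assumption: "user/actor" is addressed as "user~actor" in the URL path.
  const actorPath = actorId.replace('/', '~');
  return {
    url: `https://api.apify.com/v2/acts/${actorPath}/runs`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(input ?? {}),
    },
  };
}

const request = buildRunRequest(
  'misceres/indeed-scraper',
  { position: 'data engineer' }, // actor input shape is an assumption
  { APIFY_TOKEN: 'example-token' },
);
console.log(request.url);
```

Keeping request construction separate from the network call makes it straightforward to check that the token never appears in the URL.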
Worked example showing the minimum field_map shape (positionName→title,
company/companyName fallback, location/formattedLocation fallback) for
the misceres/indeed-scraper actor.
Also demonstrates the optional `description` mapping that enables local
jds/{slug}.md caching and the local:jds/... URL rewrite. Without an
example here, that feature is only discoverable by reading the provider
source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
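The fallback behaviour in the worked example (positionName→title, company/companyName, location/formattedLocation) can be sketched as follows. This is an illustrative approximation, not the provider's actual normalizeItem(); `pickField` is a hypothetical helper and the `url` key in the sample item is an assumption.

```javascript
// Hypothetical sketch: resolve a field spec (a key, or an ordered list of
// fallback keys) against one raw dataset item.
function pickField(item, spec) {
  const keys = Array.isArray(spec) ? spec : [spec];
  for (const key of keys) {
    const value = item[key];
    if (typeof value === 'string' && value.trim() !== '') return value.trim();
  }
  return ''; // no candidate key held a non-empty string
}

// Map a raw item to the scanner's {title, url, company, location} shape.
function normalizeItem(item, fieldMap) {
  const out = {};
  for (const field of ['title', 'url', 'company', 'location']) {
    out[field] = fieldMap[field] ? pickField(item, fieldMap[field]) : '';
  }
  return out;
}

const fieldMap = {
  title: 'positionName',
  url: 'url',
  company: ['company', 'companyName'],
  location: ['location', 'formattedLocation'],
};

const rawItem = {
  positionName: 'Data Engineer',
  url: 'https://example.com/jobs/1',
  companyName: 'Acme',
  formattedLocation: 'Remote',
};
console.log(normalizeItem(rawItem, fieldMap));
```

Because `company` is absent from the sample item, the sketch falls through to `companyName`, which is the behaviour the example field_map is meant to demonstrate.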
Adds the APIFY_TOKEN entry to .env.example so users discover the
configuration requirement when they copy the file, instead of finding
out only after a scan run hits the missing-token error from
providers/_apify.mjs. Per the split plan in RFC santifer#521, the env
entry ships with the Apify provider PR rather than the plugin-contract
PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/apify.mjs`:
- Around line 137-172: The fallback to normalized.applicationUrl in saveJd is
dead/confusing because normalizeItem never sets applicationUrl; either remove
the unused reference in saveJd and simplify the hash input to use normalized.url
|| `${normalized.company}-${normalized.title}`, or alternatively populate
applicationUrl in normalizeItem; update the code in saveJd (function saveJd) to
use only normalized.url (with the company-title fallback) or add an explicit
applicationUrl field in normalizeItem so the reference is valid—pick one
consistent approach and apply it across saveJd and normalizeItem.
In `@templates/portals.example.yml`:
- Around line 1039-1043: Update the example comment that describes where the job
description files are written for the "description" mapping: change the
documented filename from "jds/{slug}.md" to the actual hash-suffixed pattern
used at runtime (e.g., "jds/{slug}-{hash}.md" or "jds/{slug}-{sha256}.md") so
users can locate the cached files correctly when using the description:
[description, descriptionText, descriptionHTML, jobDescription] mapping.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 0aca62dc-07c7-4b97-8cc5-21c006aef44b
📒 Files selected for processing (4)
- .env.example
- providers/_apify.mjs
- providers/apify.mjs
- templates/portals.example.yml
Three findings from the round-1 review:
1. Dead `applicationUrl` fallback in saveJd() — normalizeItem never
populates that field, so the middle clause of the hash input was
never reached. Simplified to `normalized.url || company-title`.
(apify.mjs:148)
2. Docs referenced `jds/{slug}.md` but saveJd actually writes
`jds/{slug}-{hash}.md` (the sha1[:10] suffix prevents two distinct
postings sharing the same company+title slug from colliding on one
file). Fixed in both the apify.mjs module docstring and the
portals.example.yml inline comment.
3. Apply PR santifer#653's URL-validation pattern: actor-supplied URLs were
passed through unchecked to pipeline.md, the JD-cache filename
hash, and `_remote_url`. Added isHttpsUrl() and reject items
whose normalized.url isn't https — blocks `javascript:`, `data:`,
`file:`, and protocol-downgraded `http:` URLs from a buggy or
malicious actor before they become clickable links.
Also scrubbed a stale `mirrors providers/linkedin.mjs format` section
comment — that file doesn't exist on upstream/main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
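The https-only check from finding 3 could look roughly like this. The function name matches the commit message, but the body is an assumed sketch built on the standard `URL` parser, not a copy of the PR's implementation.

```javascript
// Sketch: accept only strings that parse as absolute URLs with the https
// scheme. Anything unparseable, empty, or non-string is rejected.
function isHttpsUrl(value) {
  if (typeof value !== 'string' || value === '') return false;
  try {
    return new URL(value).protocol === 'https:';
  } catch {
    return false; // new URL() throws on malformed input
  }
}

const candidates = [
  'https://example.com/jobs/1',
  'http://example.com/jobs/1',
  'javascript:alert(1)',
  'data:text/html,<b>x</b>',
  'file:///etc/passwd',
  'not a url',
];
for (const candidate of candidates) {
  console.log(candidate, isHttpsUrl(candidate));
}
```

Parsing with `URL` rather than matching a string prefix means scheme case and surrounding whitespace are normalized before the comparison.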
Round-1 review feedback addressed in 2b5561b:
Also scrubbed a stale Test gap noted but not addressed in this commit: PR santifer#653 added §11/§12/§13 in @coderabbitai review |
On the test gap: My recommendation is to add §11 for the Apify provider in this PR. The new provider introduces meaningful branching surface (field_map validation, description fallback, URL rejection, JD file write/reuse,
Triggering a fresh review now.
✅ Actions performed
Review triggered.
Per CodeRabbit's recommendation on PR #26, adds a narrow provider test
section focused on the new branching surface introduced by this PR: URL
rejection, field_map validation, missing token, actor-id SSRF guard, and
normalization defaults whitelist. 34 assertions, follows the §11/§12/§13
style established by PR santifer#653.
Specifically covers:
- isHttpsUrl() rejects javascript:/data:/file:/ftp:/http:/malformed/
  empty/null/undefined (the URL-validation hardening from the PR
  santifer#653 pattern)
- isFieldSpec() accepts string + non-empty string array, rejects
  number/empty array/mixed-type array/null/object
- normalizeItem() picks first non-empty fallback, applies defaults only
  to title/url/company/location (rejects defaults outside the allowlist
  so portals.yml can't inject arbitrary record fields)
- normalizeActorId() rejects path traversal (../), query (?), fragment
  (#), extra path segments, missing separator, non-string. This is the
  SSRF guard on /acts/<actor>/runs
- apify.fetch() throws clear errors and never reaches runActor when
  APIFY_TOKEN is unset, actor is missing, or field_map is malformed
To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem
(apify.mjs) and normalizeActorId (_apify.mjs) are now named exports
alongside the existing default export. Behavior unchanged.
The full happy-path / runActor mocking / JD-cache write path is left for
a follow-up: it would require mocking _apify.mjs's runActor and is out
of scope for the security-focused §11 CodeRabbit asked for.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
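Two of the helpers covered by §11 can be approximated as below. The regular expression is the one quoted in the PR description; the function bodies, the error message, and the `~` normalization are assumptions made for illustration rather than the exported implementations.

```javascript
// Pattern quoted in the PR description: exactly one "/" or "~" separator,
// each side starting with an alphanumeric character.
const ACTOR_ID_PATTERN = /^[A-Za-z0-9][A-Za-z0-9_.-]*[~/][A-Za-z0-9][A-Za-z0-9_.-]*$/;

// Sketch: reject anything that could steer the request off
// /acts/<actor>/runs (traversal, query, fragment, extra segments).
function normalizeActorId(actorId) {
  if (typeof actorId !== 'string' || !ACTOR_ID_PATTERN.test(actorId)) {
    throw new Error(`invalid Apify actor id: ${JSON.stringify(actorId)}`);
  }
  // Assumption: the API path form uses "~" between user and actor name.
  return actorId.replace('/', '~');
}

// Sketch: a field spec is a string or a non-empty array of strings.
function isFieldSpec(spec) {
  if (typeof spec === 'string') return true;
  return (
    Array.isArray(spec) &&
    spec.length > 0 &&
    spec.every((entry) => typeof entry === 'string')
  );
}

console.log(normalizeActorId('misceres/indeed-scraper'));
console.log(isFieldSpec(['company', 'companyName']), isFieldSpec(42));
```

Validating against an allowlist pattern, instead of stripping known-bad characters, keeps the guard simple to reason about: anything not explicitly permitted is refused.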
Added §11 to Coverage (kept narrow per the round-2 guidance):
To make the helpers testable, Skipped per CodeRabbit's "narrow §11" guidance: happy-path |
CodeRabbit feedback on PR santifer#693: a synchronous
mkdirSync/writeFileSync failure (disk full, EACCES, EROFS, Windows
EBUSY) inside saveJd would propagate up through the .map() callback and
abort fetch() for the entire entry, so one bad cache write would take
down a whole multi-source scan run.
Wrap the FS sequence in try/catch. On failure, warn to stderr and return
null; the caller now checks for null and leaves the remote URL on the
record instead of producing a broken `local:` link.
The "coerce inputs to String() before slugify" half of the same CR
finding is already covered: slugify() does `String(text || '')` at its
first line, and the template-literal hash input and yamlEscape already
coerce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
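The per-item fallback described here can be sketched with the filesystem call injected as a parameter. Both function names and the `write` callback are hypothetical stand-ins for the real mkdirSync/writeFileSync sequence inside saveJd; only the null-on-failure contract comes from the commit message.

```javascript
// Hypothetical sketch: run the cache write, but turn any synchronous FS
// error into a warning plus a null return instead of an exception.
function saveJdSafely(write, relPath, content) {
  try {
    write(relPath, content);
    return relPath;
  } catch (err) {
    console.warn(`[apify] could not cache JD at ${relPath}: ${err.message}`);
    return null;
  }
}

// Caller side: only rewrite the URL when the cache write succeeded.
function withCachedJd(record, write, relPath, content) {
  const saved = saveJdSafely(write, relPath, content);
  if (saved === null) return record; // keep the remote URL on failure
  return { ...record, url: `local:${saved}`, _remote_url: record.url };
}

const record = { title: 'Engineer', url: 'https://example.com/jobs/1' };
const failingWrite = () => {
  throw new Error('EACCES: permission denied');
};
console.log(withCachedJd(record, failingWrite, 'jds/acme-engineer.md', 'body').url);
console.log(withCachedJd(record, () => {}, 'jds/acme-engineer.md', 'body').url);
```

With the failing writer the record keeps its remote URL, so one bad cache write costs a cache miss rather than the whole entry.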
GitHub Advanced Security flagged htmlToText on PR santifer#693:
- Alert #7 (bad HTML filtering regexp): `<script[\s\S]*?<\/script>`
  didn't match `</script >` (whitespace before close). Added `\s*` to
  both script and style close patterns, plus `\b` after the open tag so
  `<scripty>` isn't matched as a script open.
- Alert #8 (incomplete multi-character sanitization): a polyglot like
  `<<a>b>` left a dangling `<` after one pass. Loop the strip until the
  string stabilizes.
- Alert #6 (double escape/unescape): decoding `&amp;` before `&#NN;`
  meant `&amp;#60;` round-tripped through `&#60;` and out as `<`. Moved
  `&amp;` to the end of the decode chain so the literal `&amp;#60;`
  stays as `&#60;`.
None of the three is exploitable in our context (htmlToText output is
plain text written to a markdown JD file, never rendered), but clean
CodeQL signal on the PR is worth the small regex tweaks.
§11 grows by three assertions (one per finding) so the fixes can't
silently regress. Full suite: 103 passed, 0 failed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
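Two of these fixes, looping the tag strip until the string is stable and decoding the ampersand entity last, can be illustrated with the sketch below. The regexes and the small entity table are assumptions chosen for the illustration; the PR's htmlToText may use different patterns.

```javascript
// Sketch: strip tags repeatedly, because removing one tag can leave
// characters that form a new tag (e.g. "<<a>b>" becomes "<b>").
function stripTags(html) {
  let previous;
  let current = html;
  do {
    previous = current;
    current = current.replace(/<[^<>]*>/g, '');
  } while (current !== previous);
  return current;
}

// Sketch: decode numeric and named entities first and "&amp;" last, so an
// escaped entity such as "&amp;#60;" is unescaped exactly once.
function decodeEntities(text) {
  return text
    .replace(/&#(\d+);/g, (match, code) => {
      const codePoint = Number(code);
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    })
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
}

function htmlToText(html) {
  return decodeEntities(stripTags(html)).trim();
}

console.log(JSON.stringify(stripTags('<<a>b>')));
console.log(htmlToText('<p>Hi &amp;#60; there</p>'));
```

If `&amp;` were decoded first, the second example would pass through `&#60;` and end up as a literal `<`, which is the double-unescape CodeQL reported.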
CodeRabbit feedback on PR santifer#693: scan.mjs runs up to 10 concurrent
workers, so the existsSync()-then-writeFileSync() pattern in saveJd is a
TOCTOU race: two workers can both pass the existence check and then
clobber each other's writes.
Keep existsSync as a cheap fast-path (avoids building the YAML/markdown
content string when the cache is already populated), but switch the
actual write to `{ flag: 'wx' }` so the OS atomically fails the open
with EEXIST if a sibling worker created the file in the meantime. Handle
EEXIST in the existing catch by returning the path (preserving the
documented "first save wins" behaviour); every other FS error still logs
and returns null for the per-item fallback added in e5cf1f6.
`relPath` is hoisted above the try so it's accessible to the catch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
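A minimal sketch of the exclusive-create pattern follows. `writeOnce` is a hypothetical name and, unlike saveJd as described, it receives the content already built; the `wx` flag and the `EEXIST` error code are standard Node.js `fs` behaviour.

```javascript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { dirname, join } from 'node:path';

// Hypothetical sketch of a "first save wins" cache write.
function writeOnce(filePath, content) {
  if (existsSync(filePath)) return filePath; // cheap fast path
  try {
    mkdirSync(dirname(filePath), { recursive: true });
    // 'wx' makes the open fail with EEXIST if the file already exists,
    // closing the gap between the check above and the write.
    writeFileSync(filePath, content, { flag: 'wx' });
    return filePath;
  } catch (err) {
    if (err.code === 'EEXIST') return filePath; // a sibling worker won
    console.warn(`cache write failed for ${filePath}: ${err.message}`);
    return null;
  }
}

const dir = mkdtempSync(join(tmpdir(), 'jds-'));
const target = join(dir, 'jds', 'acme-engineer.md');
const first = writeOnce(target, 'first');
const second = writeOnce(target, 'second');
console.log(first === second, readFileSync(target, 'utf8')); // true first
```

The existence check stays only as an optimization; correctness under concurrency comes from the exclusive flag, since the operating system decides which writer creates the file.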
My previous htmlToText fix (9b1baad) used `</script\s*>` to tolerate the
whitespace variant `</script >`. CodeQL #9 points out the regex still
misses the parser-tolerated junk-attribute form like
`</script\t\n foo bar>` that some scrapers emit. It is invalid per the
HTML spec, but the browser parser silently accepts it as a script close,
so it can still hide script content from a stripper that's stricter than
the parser.
Switch both script and style close patterns from `\s*>` to `\b[^>]*>` so
the close tag consumes any whitespace, tabs, newlines, or invalid
attributes before the actual `>`. `\b` preserves the boundary so
`</scripty>` still doesn't match.
Adds one §11 assertion exercising the junk-attr variant. Full suite: 104
passed, 0 failed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
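The close-tag pattern described above can be exercised with a short sketch. The constant and function names are hypothetical, and the open-tag half of each regex is an assumption; only the `\b[^>]*>` close form is taken from the commit message.

```javascript
// Sketch: the close tag tolerates whitespace and junk attributes before
// ">", while \b stops "<scripty>" from being treated as a script tag.
const SCRIPT_BLOCK = /<script\b[\s\S]*?<\/script\b[^>]*>/gi;
const STYLE_BLOCK = /<style\b[\s\S]*?<\/style\b[^>]*>/gi;

function stripScriptAndStyle(html) {
  return html.replace(SCRIPT_BLOCK, '').replace(STYLE_BLOCK, '');
}

const samples = [
  '<script>x()</script>ok',
  '<script>x()</script >ok',
  '<script>x()</script\t\n foo bar>ok',
  '<scripty>keep</scripty>',
];
for (const sample of samples) {
  console.log(JSON.stringify(stripScriptAndStyle(sample)));
}
```

The first three samples reduce to `ok`, while the `<scripty>` sample is left untouched because the word boundary fails after `script`.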
Follow-up to the Plugin Architecture RFC (#521): implements the third item in the RFC's split plan ("Follow-up RFC + PR (Apify, closes santifer#325)"). Phase A (plugin contract + Greenhouse/Ashby/Lever ports) is already on main; this PR builds on that.
Closes santifer#325.
Summary
- `providers/apify.mjs`: generic provider that runs any pre-built Apify actor and maps its dataset items to the scanner's `{title, url, company, location}` shape. All variation (actor, input, field mappings) lives in `portals.yml`, not in code, so a single implementation serves any site with an Apify scraper available.
- `providers/_apify.mjs`: shared Apify transport (auth, run/poll/fetch, retries with end-to-end deadline, SSRF guard on actorId). Follows the `_`-prefix helper convention established by `_http.mjs` / `_types.js`.
- `field_map.description` mapping persists the JD body to `jds/{slug}.md` and rewrites the entry's url to `local:jds/...` so downstream tools read it from disk instead of re-fetching the remote URL. Avoids paying for the same Apify run twice and dodges HTTP failures on stale job-board links. Falls through to the remote URL when the description field is missing or shorter than 50 chars.
- `.env.example`: documents `APIFY_TOKEN`.
- `templates/portals.example.yml`: worked Indeed example showing both the basic `field_map` shape and the optional `description` mapping.
Conformance to RFC santifer#521
- `id`, `detect()`, `fetch(entry, ctx)`.
- `detect()` returns `null` (explicit `provider: apify` required; there is no URL scheme to auto-detect from).
- `field_map` shape all `throw` so `scan.mjs` records the error per-entry and continues with the next.
- `_apify.mjs` uses the `_` prefix and is skipped by the loader.
- `scan-history.tsv`: source label uses the slugified actor id, so `misceres/indeed-scraper` produces a stable per-actor dedup key.
Round-2 review fixes already applied
From the prior PR santifer#454 review thread, these are folded in (one CodeRabbit pass already absorbed on my fork as PR #8):
- `APIFY_TOKEN` sent via `Authorization: Bearer` header, never as a query string (prevents token leakage into access logs).
- `actorId`: strict `^[A-Za-z0-9][A-Za-z0-9_.-]*[~/][A-Za-z0-9][A-Za-z0-9_.-]*$` validation so a malformed config can't divert the bearer-token request off the intended `/acts/<actor>/runs` path.
- `startRun` → `waitForRun` → `fetchDatasetItems`. `runActor`'s `timeoutMs` is now an actual wall-clock ceiling, not just a wait-loop ceiling.
- `waitForRun()` re-throws non-retriable 4xx (401/403/404) instead of looping to the deadline and masking the real cause as a generic timeout.
- `runActor()` validates `timeoutMs` is a positive finite number up front (a NaN previously produced `Date.now() + NaN` → confusing "did not finish within 0s" errors).
- `apify.mjs` `saveJd()` filename includes `sha1(url)[:10]` so two distinct postings sharing the same company+title slug don't collide on a single cache file (mirrors the same fix applied to the linkedin provider).
- `field_map` spec validation accepts only a string or non-empty array of strings; rejects malformed values like `{ title: 42 }` at load time with a clear error instead of crashing mid-scan.
Deliberate scope limits
- `portals.example.yml`: the Indeed actor. Workday-via-third-party-rental actors are also tested on my fork but recommending a specific paid third-party rental in upstream docs warrants its own discussion; happy to add it in a follow-up.
- `proxyConfiguration` handling: only relevant for the deferred Workday entry; can land with it.
Test plan
- `node scan.mjs --dry-run` with the Indeed example enabled and `APIFY_TOKEN` set: verify items show up under the `misceres-indeed-scraper` source label
- `APIFY_TOKEN` unset: verify the Apify entry errors cleanly and the rest of the scan continues
- `field_map` (e.g. `title: 42`) is rejected at fetch with the documented error
- `description:` mapping writes `jds/{slug}.md` and the pipeline entry uses the `local:jds/...` URL
- `local:jds/...` link)
🤖 Generated with Claude Code
Summary by CodeRabbit