feat(scan): generic Apify actor provider#693
Conversation
Adds a generic provider that runs any pre-built Apify actor and maps its
dataset items to the scanner's {title, url, company, location} shape.
All variation (which actor, what input, how to read fields) lives in
portals.yml, not in code — so a single provider implementation serves
any site that has a working Apify scraper.
Optional `field_map.description` persists the JD body to jds/{slug}.md
and rewrites the entry's url to local:jds/... so downstream tools read
it from disk instead of re-fetching the remote URL. Avoids paying for
the same Apify run twice and dodges HTTP failures on stale job-board
links.
Transport (auth, run/poll/fetch, retries with a shared end-to-end
deadline, SSRF guard on actorId) lives in providers/_apify.mjs, mirroring
the _http.mjs split used by the existing providers. APIFY_TOKEN is sent
via Authorization header, never as a query string. Entries error cleanly
when the token is missing so the rest of scan.mjs continues.
Includes a worked example in templates/portals.example.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worked example showing the minimum field_map shape (positionName→title,
company/companyName fallback, location/formattedLocation fallback) for
the misceres/indeed-scraper actor.
Also demonstrates the optional `description` mapping that enables local
jds/{slug}.md caching and the local:jds/... URL rewrite. Without an
example here, that feature is only discoverable by reading the provider
source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the APIFY_TOKEN entry to .env.example so users discover the configuration requirement when they copy the file, instead of finding out only after a scan run hits the missing-token error from providers/_apify.mjs. Per the split plan in RFC santifer#521, the env entry ships with the Apify provider PR rather than the plugin-contract PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from the round-1 review:
1. Dead `applicationUrl` fallback in saveJd() — normalizeItem never
populates that field, so the middle clause of the hash input was
never reached. Simplified to `normalized.url || company-title`.
(apify.mjs:148)
2. Docs referenced `jds/{slug}.md` but saveJd actually writes
`jds/{slug}-{hash}.md` (the sha1[:10] suffix prevents two distinct
postings sharing the same company+title slug from colliding on one
file). Fixed in both the apify.mjs module docstring and the
portals.example.yml inline comment.
3. Apply PR santifer#653's URL-validation pattern: actor-supplied URLs were
passed through unchecked to pipeline.md, the JD-cache filename
hash, and `_remote_url`. Added isHttpsUrl() and reject items
whose normalized.url isn't https — blocks `javascript:`, `data:`,
`file:`, and protocol-downgraded `http:` URLs from a buggy or
malicious actor before they become clickable links.
Also scrubbed a stale `mirrors providers/linkedin.mjs format` section
comment — that file doesn't exist on upstream/main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CodeRabbit's recommendation on PR #26, adds a narrow provider test section focused on the new branching surface introduced by this PR: URL rejection, field_map validation, missing token, actor-id SSRF guard, and normalization defaults whitelist. 34 assertions, follows the §11/§12/§13 style established by PR santifer#653. Specifically covers: - isHttpsUrl() rejects javascript:/data:/file:/ftp:/http:/malformed/ empty/null/undefined (the URL-validation hardening from the PR santifer#653 pattern) - isFieldSpec() accepts string + non-empty string array, rejects number/empty array/mixed-type array/null/object - normalizeItem() picks first non-empty fallback, applies defaults only to title/url/company/location (rejects defaults outside the allowlist so portals.yml can't inject arbitrary record fields) - normalizeActorId() rejects path traversal (../), query (?), fragment (#), extra path segments, missing separator, non-string — the SSRF guard on /acts/<actor>/runs - apify.fetch() throws clear errors and never reaches runActor when APIFY_TOKEN is unset, actor is missing, or field_map is malformed To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem (apify.mjs) and normalizeActorId (_apify.mjs) are now named exports alongside the existing default export. Behavior unchanged. The full happy-path / runActor mocking / JD-cache write path is left for a follow-up — would require mocking _apify.mjs's runActor and is out of scope for the security-focused §11 CodeRabbit asked for. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Welcome to career-ops, @ageem23! Thanks for your first PR. A few things to know:
We'll review your PR soon. Join our Discord if you have questions. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds Apify-backed job scraping: a transport module for actor lifecycle and retries, a provider that maps actor dataset items into job records and optionally persists descriptions to deterministic local Markdown files, configuration examples, and tests validating safety and configuration guards. ChangesApify Provider Integration
Sequence Diagram(s)sequenceDiagram
participant Client
participant runActor
participant startRun
participant waitForRun
participant abortRun
participant fetchDatasetItems
Client->>runActor: call runActor(actorId, input, timeoutMs)
runActor->>startRun: POST /actors/{actorId}/runs (create run)
runActor->>waitForRun: poll /runs/{runId} until terminal or deadline
waitForRun->>abortRun: on deadline expiry -> POST /runs/{runId}/abort
waitForRun->>runActor: return terminal status
alt status == SUCCEEDED
runActor->>fetchDatasetItems: GET /runs/{runId}/dataset/items
fetchDatasetItems->>Client: return items array
else
runActor->>Client: throw error (non-SUCCEEDED)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/apify.mjs`:
- Around line 157-187: The saveJd function currently performs filesystem
operations that can throw and abort fetch for the whole source; change it to
treat local caching as best-effort: wrap the mkdirSync/existsSync/writeFileSync
sequence in a try/catch and on any error simply return the remote URL fallback
(normalized.url or `${normalized.company}-${normalized.title}` fallback) instead
of throwing; keep using slugify/join to build filename but ensure you do not
propagate exceptions from mkdirSync or writeFileSync, and sanitize inputs passed
to slugify/ join by coercing to strings (String(normalized.title/company/url))
before use so path traversal/command-injection risks are reduced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6498802c-f269-451d-9b19-db46e7dbbb5a
📒 Files selected for processing (5)
.env.exampleproviders/_apify.mjsproviders/apify.mjstemplates/portals.example.ymltest-all.mjs
CodeRabbit feedback on PR santifer#693: a synchronous mkdirSync/writeFileSync failure (disk full, EACCES, EROFS, Windows EBUSY) inside saveJd would propagate up through the .map() callback and abort fetch() for the entire entry — one bad cache write taking down a whole multi-source scan run. Wrap the FS sequence in try/catch. On failure, warn to stderr and return null; the caller now checks for null and leaves the remote URL on the record instead of producing a broken `local:` link. The "coerce inputs to String() before slugify" half of the same CR finding is already covered: slugify() does `String(text || '')` at its first line, and the template-literal hash input and yamlEscape already coerce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/apify.mjs`:
- Around line 160-194: The saveJd function uses a non-atomic existsSync() then
writeFileSync() pattern which allows races; remove the existsSync() pre-check
and perform the write with writeFileSync(filepath, content, { encoding: 'utf-8',
flag: 'wx' }) so the OS atomically fails if the file was created by a concurrent
process; catch the specific EEXIST error in the existing catch block (or check
err.code === 'EEXIST') and on EEXIST return the existing
`${JDS_DIR}/${filename}` (preserving "first save wins"), otherwise continue to
log and return null as before; reference saveJd, JDS_DIR, writeFileSync,
existsSync, and the error handling in the catch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 58f477b4-4e3d-4c0c-843d-20ebe7918024
📒 Files selected for processing (1)
providers/apify.mjs
GitHub Advanced Security flagged htmlToText on PR santifer#693: - Alert #7 (bad HTML filtering regexp): `<script[\s\S]*?<\/script>` didn't match `</script >` (whitespace before close). Added `\s*` to both script and style close patterns, plus `\b` after the open tag so `<scripty>` isn't matched as a script open. - Alert #8 (incomplete multi-character sanitization): a polyglot like `<<a>b>` left a dangling `<` after one pass. Loop the strip until the string stabilizes. - Alert #6 (double escape/unescape): decoding `&` before `&#NN;` meant `&santifer#60;` round-tripped through `&santifer#60;` and out as `<`. Moved `&` to the end of the decode chain so the literal `&santifer#60;` stays as `&santifer#60;`. None of the three is exploitable in our context — htmlToText output is plain text written to a markdown JD file, never rendered — but clean CodeQL signal on the PR is worth the small regex tweaks. §11 grows by three assertions (one per finding) so the fixes can't silently regress. Full suite: 103 passed, 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
providers/apify.mjs (1)
135-137:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClose-tag regex doesn't handle all browser-tolerated variants.
CodeQL correctly identifies that
<\/script\s*>and<\/style\s*>fail to match close tags with non-whitespace characters before>, such as</script foo>or</script\t\nbar>. Browsers treat these as valid close tags. Use[^>]*to match any characters up to>.Proposed fix
- .replace(/<script\b[\s\S]*?<\/script\s*>/gi, ' ') - .replace(/<style\b[\s\S]*?<\/style\s*>/gi, ' ') + .replace(/<script\b[\s\S]*?<\/script[^>]*>/gi, ' ') + .replace(/<style\b[\s\S]*?<\/style[^>]*>/gi, ' ')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@providers/apify.mjs` around lines 135 - 137, The regex used to strip script/style blocks when computing the variable cleaned from raw doesn't match browser-tolerated close-tag variants; update the two replace calls that target "</script\s*>" and "</style\s*>" (the replacements applied to cleaned) to use a closing-tag pattern like "<\/script[^>]*>" and "<\/style[^>]*>" so they match any characters up to the final '>' (e.g., "</script foo>") before replacing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@providers/apify.mjs`:
- Around line 135-137: The regex used to strip script/style blocks when
computing the variable cleaned from raw doesn't match browser-tolerated
close-tag variants; update the two replace calls that target "</script\s*>" and
"</style\s*>" (the replacements applied to cleaned) to use a closing-tag pattern
like "<\/script[^>]*>" and "<\/style[^>]*>" so they match any characters up to
the final '>' (e.g., "</script foo>") before replacing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8b82255c-7ba1-44ac-adbb-dd86ee859f01
📒 Files selected for processing (2)
providers/apify.mjstest-all.mjs
CodeRabbit feedback on PR santifer#693: scan.mjs runs up to 10 concurrent workers, so the existsSync()-then-writeFileSync() pattern in saveJd is a TOCTOU race — two workers can both pass the existence check and then clobber each other's writes. Keep existsSync as a cheap fast-path (avoids building the YAML/markdown content string when the cache is already populated), but switch the actual write to `{ flag: 'wx' }` so the OS atomically fails the open with EEXIST if a sibling worker created the file in the meantime. Handle EEXIST in the existing catch by returning the path (preserving the documented "first save wins" behaviour); every other FS error still logs and returns null for the per-item fallback added in e5cf1f6. `relPath` is hoisted above the try so it's accessible to the catch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous htmlToText fix (9b1baad) used `</script\s*>` to tolerate the whitespace variant `</script >`. CodeQL #9 points out the regex still misses the parser-tolerated junk-attribute form like `</script\t\n foo bar>` that some scrapers emit — invalid per the HTML spec but the browser parser silently accepts it as a script close, so it can still hide script content from a stripper that's stricter than the parser. Switch both script and style close patterns from `\s*>` to `\b[^>]*>` so the close tag consumes any whitespace, tabs, newlines, or invalid attributes before the actual `>`. `\b` preserves the boundary so `</scripty>` still doesn't match. Adds one §11 assertion exercising the junk-attr variant. Full suite: 104 passed, 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the Plugin Architecture RFC (#521) — implements the third item in the RFC's split plan ("Follow-up RFC + PR (Apify, closes #325)"). Phase A (plugin contract + Greenhouse/Ashby/Lever ports) is already on
main; this PR builds on that.Closes #325.
Summary
providers/apify.mjs— generic provider that runs any pre-built Apify actor and maps its dataset items to the scanner's{title, url, company, location}shape. All variation (actor, input, field mappings) lives inportals.yml, not in code, so a single implementation serves any site with an Apify scraper available.providers/_apify.mjs— shared Apify transport (auth, run/poll/fetch, retries with end-to-end deadline, SSRF guard onactorId). Follows the_-prefix helper convention established by_http.mjs/_types.js.field_map.descriptionmapping persists the JD body tojds/{slug}-{hash}.mdand rewrites the entry'surltolocal:jds/...so downstream tools read it from disk instead of re-fetching the remote URL. Avoids paying for the same Apify run twice and dodges HTTP failures on stale job-board links. Falls through to the remote URL when the description field is missing or shorter than 50 chars..env.example— documentsAPIFY_TOKEN.templates/portals.example.yml— worked Indeed example showing both the basicfield_mapshape and the optionaldescriptionmapping.test-all.mjs— new §11 with 34 assertions covering the security-relevant paths (URL rejection, malformedfield_map, missing token, actor-id SSRF guard, defaults allowlist).Conformance to RFC #521
id,detect(),fetch(entry, ctx).detect()returnsnull(explicitprovider: apifyrequired — no URL scheme to auto-detect from).field_mapshape allthrowsoscan.mjsrecords the error per-entry and continues with the next._apify.mjsuses the_prefix and is skipped by the loader.scan-history.tsv: source label uses the slugified actor id, somisceres/indeed-scraperproduces a stable per-actor dedup key.Security hardening
APIFY_TOKENsent viaAuthorization: Bearerheader, never as a query string (prevents token leakage into access logs).actorIdregex (^[A-Za-z0-9][A-Za-z0-9_.-]*[~/][A-Za-z0-9][A-Za-z0-9_.-]*$) so a malformed config can't divert the bearer-token request off the intended/acts/<actor>/runspath.startRun→waitForRun→fetchDatasetItems.runActor'stimeoutMsis the actual wall-clock ceiling.waitForRun()re-throws non-retriable 4xx (401/403/404) instead of looping to the deadline and masking the real cause as a generic timeout.runActor()rejects non-finite/non-positivetimeoutMsup front with a clear error.saveJd()filename includessha1(url)[:10]so two distinct postings sharing the same company+title slug can't collide on one cache file.field_mapspec validation accepts only a string or non-empty array of strings; rejects malformed values like{ title: 42 }at load time with a clear error instead of crashing mid-scan.normalizeItem()defaults only apply totitle/url/company/location—portals.ymlcannot inject arbitrary record fields viadefaults:.isHttpsUrl()guard on actor-supplied URLs — drops items whose URL isn'thttps:, blockingjavascript:,data:,file:, and protocol-downgradedhttp:URLs from ending up clickable inpipeline.mdor feeding the JD-cache filename hash.Tests
test-all.mjs§11 (34 assertions):isHttpsUrl()accepts https; rejectshttp:,javascript:,data:,file:,ftp:, malformed, empty, null, undefinedisFieldSpec()accepts string + non-empty string array; rejects number, empty array, mixed-type array, null, objectnormalizeItem()picks first non-empty fallback; appliesdefaultsonly to allowlisted keysnormalizeActorId()rejects path traversal, query, fragment, extra segments, non-stringapify.fetch()throws clear errors for missingAPIFY_TOKEN/ missingactor/ malformedfield_map, never reaches the network in those casesStyle follows §11/§12/§13 in #653. Full suite (
node test-all.mjs --quick): 100 passed, 0 failed.Smoke-tested against real Apify actors (Indeed, Workday-via-third-party) on my fork — items flow through correctly, JD-cache files written, no regressions vs. the pre-PR provider.
Deliberate scope limits
portals.example.yml— the Indeed actor. Workday-via-third-party-rental actors work too (tested) but recommending a specific paid third-party rental in upstream docs warrants its own discussion; happy to add in a follow-up.fetch()through a mockedrunActorand the JD-cache write path are left for a follow-up to keep §11 narrow.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Reliability
Tests