feat(scan): generic Apify actor provider (closes #325) by ageem23 · Pull Request #26 · ageem23/career-ops

ageem23 · 2026-05-18T22:56:14Z

Note: Opened against upstream-sync (mirror of santifer/career-ops:main) on my fork to get CodeRabbit feedback before sending upstream. The body below is drafted for the eventual upstream PR.

Follow-up to the Plugin Architecture RFC (#521) — implements the third item in the RFC's split plan ("Follow-up RFC + PR (Apify, closes santifer#325)"). Phase A (plugin contract + Greenhouse/Ashby/Lever ports) is already on main; this PR builds on that.

Closes santifer#325.

Summary

providers/apify.mjs — generic provider that runs any pre-built Apify actor and maps its dataset items to the scanner's {title, url, company, location} shape. All variation (actor, input, field mappings) lives in portals.yml, not in code, so a single implementation serves any site with an Apify scraper available.
providers/_apify.mjs — shared Apify transport (auth, run/poll/fetch, retries with end-to-end deadline, SSRF guard on actorId). Follows the _-prefix helper convention established by _http.mjs / _types.js.
Optional field_map.description mapping persists the JD body to jds/{slug}.md and rewrites the entry's url to local:jds/... so downstream tools read it from disk instead of re-fetching the remote URL. Avoids paying for the same Apify run twice and dodges HTTP failures on stale job-board links. Falls through to the remote URL when the description field is missing or shorter than 50 chars.
.env.example — documents APIFY_TOKEN.
templates/portals.example.yml — worked Indeed example showing both the basic field_map shape and the optional description mapping.

Conformance to RFC santifer#521

Provider contract: default export with id, detect(), fetch(entry, ctx). detect() returns null (explicit provider: apify required — no URL scheme to auto-detect from).
Error contract: missing token / missing required entry config / invalid field_map shape all throw so scan.mjs records the error per-entry and continues with the next.
Helper convention: _apify.mjs uses the _ prefix and is skipped by the loader.
scan-history.tsv: source label uses the slugified actor id, so misceres/indeed-scraper produces a stable per-actor dedup key.

Round-2 review fixes already applied

From the prior PR santifer#454 review thread, these are folded in (one CodeRabbit pass already absorbed on my fork as PR #8):

APIFY_TOKEN sent via Authorization: Bearer header, never as a query string (prevents token leakage into access logs).
SSRF guard on actorId — strict ^[A-Za-z0-9][A-Za-z0-9_.-]*[~/][A-Za-z0-9][A-Za-z0-9_.-]*$ validation so a malformed config can't divert the bearer-token request off the intended /acts/<actor>/runs path.
End-to-end timeout via a single shared deadline threaded through startRun → waitForRun → fetchDatasetItems. runActor's timeoutMs is now an actual wall-clock ceiling, not just a wait-loop ceiling.
waitForRun() re-throws non-retriable 4xx (401/403/404) instead of looping to the deadline and masking the real cause as a generic timeout.
runActor() validates timeoutMs is a positive finite number up front (a NaN previously produced Date.now() + NaN → confusing "did not finish within 0s" errors).
apify.mjs saveJd() filename includes sha1(url)[:10] so two distinct postings sharing the same company+title slug don't collide on a single cache file (mirrors the same fix applied to the linkedin provider).
field_map spec validation accepts only a string or non-empty array of strings; rejects malformed values like { title: 42 } at load time with a clear error instead of crashing mid-scan.

Deliberate scope limits

Only one example entry in portals.example.yml — the Indeed actor. Workday-via-third-party-rental actors are also tested on my fork but recommending a specific paid third-party rental in upstream docs warrants its own discussion; happy to add it in a follow-up.
No browser transport, no scraper provider — those are the separate follow-up PR per RFC RFC: Plugin-based Provider Architecture for `scan.mjs` santifer/career-ops#521. This PR is Apify-only.
No proxyConfiguration handling — only relevant for the deferred Workday entry; can land with it.

Test plan

node scan.mjs --dry-run with the Indeed example enabled and APIFY_TOKEN set — verify items show up under the misceres-indeed-scraper source label
Same with APIFY_TOKEN unset — verify the Apify entry errors cleanly and the rest of the scan continues
Verify a malformed field_map (e.g. title: 42) is rejected at fetch with the documented error
Verify the optional description: mapping writes jds/{slug}.md and the pipeline entry uses the local:jds/... URL
Verify a posting with a short/missing description falls through to the remote URL (no broken local:jds/... link)

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Integrated Apify for automated job discovery from external sources (e.g., Indeed scraping)
- Job listings can be cached locally as formatted markdown files with job descriptions
- Added environment variable documentation for Apify token authentication setup
- Included example configuration showing how to set up Apify-backed job sources with automatic field mapping

Adds a generic provider that runs any pre-built Apify actor and maps its dataset items to the scanner's {title, url, company, location} shape. All variation (which actor, what input, how to read fields) lives in portals.yml, not in code — so a single provider implementation serves any site that has a working Apify scraper. Optional `field_map.description` persists the JD body to jds/{slug}.md and rewrites the entry's url to local:jds/... so downstream tools read it from disk instead of re-fetching the remote URL. Avoids paying for the same Apify run twice and dodges HTTP failures on stale job-board links. Transport (auth, run/poll/fetch, retries with a shared end-to-end deadline, SSRF guard on actorId) lives in providers/_apify.mjs, mirroring the _http.mjs split used by the existing providers. APIFY_TOKEN is sent via Authorization header, never as a query string. Entries error cleanly when the token is missing so the rest of scan.mjs continues. Includes a worked example in templates/portals.example.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Worked example showing the minimum field_map shape (positionName→title, company/companyName fallback, location/formattedLocation fallback) for the misceres/indeed-scraper actor. Also demonstrates the optional `description` mapping that enables local jds/{slug}.md caching and the local:jds/... URL rewrite. Without an example here, that feature is only discoverable by reading the provider source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the APIFY_TOKEN entry to .env.example so users discover the configuration requirement when they copy the file, instead of finding out only after a scan run hits the missing-token error from providers/_apify.mjs. Per the split plan in RFC santifer#521, the env entry ships with the Apify provider PR rather than the plugin-contract PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-18T22:56:20Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)

main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 5c7c6e96-3c28-4e24-ba9c-71557c03d9b9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/apify-provider-pr

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/apify.mjs`:
- Around line 137-172: The fallback to normalized.applicationUrl in saveJd is
dead/confusing because normalizeItem never sets applicationUrl; either remove
the unused reference in saveJd and simplify the hash input to use normalized.url
|| `${normalized.company}-${normalized.title}`, or alternatively populate
applicationUrl in normalizeItem; update the code in saveJd (function saveJd) to
use only normalized.url (with the company-title fallback) or add an explicit
applicationUrl field in normalizeItem so the reference is valid—pick one
consistent approach and apply it across saveJd and normalizeItem.

In `@templates/portals.example.yml`:
- Around line 1039-1043: Update the example comment that describes where the job
description files are written for the "description" mapping: change the
documented filename from "jds/{slug}.md" to the actual hash-suffixed pattern
used at runtime (e.g., "jds/{slug}-{hash}.md" or "jds/{slug}-{sha256}.md") so
users can locate the cached files correctly when using the description:
[description, descriptionText, descriptionHTML, jobDescription] mapping.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0aca62dc-07c7-4b97-8cc5-21c006aef44b

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1f3a3 and f0d49c2.

📒 Files selected for processing (4)

.env.example
providers/_apify.mjs
providers/apify.mjs
templates/portals.example.yml

Three findings from the round-1 review: 1. Dead `applicationUrl` fallback in saveJd() — normalizeItem never populates that field, so the middle clause of the hash input was never reached. Simplified to `normalized.url || company-title`. (apify.mjs:148) 2. Docs referenced `jds/{slug}.md` but saveJd actually writes `jds/{slug}-{hash}.md` (the sha1[:10] suffix prevents two distinct postings sharing the same company+title slug from colliding on one file). Fixed in both the apify.mjs module docstring and the portals.example.yml inline comment. 3. Apply PR santifer#653's URL-validation pattern: actor-supplied URLs were passed through unchecked to pipeline.md, the JD-cache filename hash, and `_remote_url`. Added isHttpsUrl() and reject items whose normalized.url isn't https — blocks `javascript:`, `data:`, `file:`, and protocol-downgraded `http:` URLs from a buggy or malicious actor before they become clickable links. Also scrubbed a stale `mirrors providers/linkedin.mjs format` section comment — that file doesn't exist on upstream/main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ageem23 · 2026-05-18T23:10:43Z

Round-1 review feedback addressed in 2b5561b:

Dead applicationUrl fallback (saveJd) — removed; hash input simplified to normalized.url || ${company}-${title}.
jds/{slug}.md → jds/{slug}-{hash}.md — fixed in both apify.mjs module docstring and the portals.example.yml inline comment.
URL validation (pattern from feat(scan): add Workable, SmartRecruiters, Recruitee ATS parsers santifer/career-ops#653's review of parseRecruiteeResponse / parseWorkableMarkdown) — actor-supplied URLs are now rejected unless protocol is https:. Blocks javascript:, data:, file:, and protocol-downgraded http: URLs from a buggy or malicious actor before they become clickable in pipeline.md or feed the JD-cache filename hash.

Also scrubbed a stale mirrors providers/linkedin.mjs format section comment — that file isn't upstream.

Test gap noted but not addressed in this commit: PR santifer#653 added §11/§12/§13 in test-all.mjs (~27 assertions per provider). Happy to add an equivalent §11 for the apify provider here or in a follow-up — let me know which fits better for the upstream PR scope.

@coderabbitai review

coderabbitai · 2026-05-18T23:11:10Z

@ageem23 Thanks for the detailed round-up — all three fixes look well-scoped:

The simplified hash input (normalized.url || \${company}-${title}``) is cleaner and the dead fallback removal reduces confusion.
The {slug}-{hash}.md docstring/YAML alignment is good housekeeping.
The https:-only guard is the right call — especially since actor output URLs flow into pipeline.md links and the JD cache filename hash; blocking downgrade/exotic protocols at the boundary is much safer than trusting the actor author.

On the test gap: My recommendation is to add §11 for the Apify provider in this PR. The new provider introduces meaningful branching surface (field_map validation, description fallback, URL rejection, JD file write/reuse, hasToken short-circuit) that benefits from co-located tests. Shipping the provider without any assertions makes future refactors riskier. If the upstream PR scope is strict, a narrow §11 covering just the security-relevant paths (URL rejection, malformed field_map, missing token) would be a reasonable minimum — the happy-path and description tests could follow in the same style as #653's §12/§13.

Triggering a fresh review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Per CodeRabbit's recommendation on PR #26, adds a narrow provider test section focused on the new branching surface introduced by this PR: URL rejection, field_map validation, missing token, actor-id SSRF guard, and normalization defaults whitelist. 34 assertions, follows the §11/§12/§13 style established by PR santifer#653. Specifically covers: - isHttpsUrl() rejects javascript:/data:/file:/ftp:/http:/malformed/ empty/null/undefined (the URL-validation hardening from the PR santifer#653 pattern) - isFieldSpec() accepts string + non-empty string array, rejects number/empty array/mixed-type array/null/object - normalizeItem() picks first non-empty fallback, applies defaults only to title/url/company/location (rejects defaults outside the allowlist so portals.yml can't inject arbitrary record fields) - normalizeActorId() rejects path traversal (../), query (?), fragment (#), extra path segments, missing separator, non-string — the SSRF guard on /acts/<actor>/runs - apify.fetch() throws clear errors and never reaches runActor when APIFY_TOKEN is unset, actor is missing, or field_map is malformed To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem (apify.mjs) and normalizeActorId (_apify.mjs) are now named exports alongside the existing default export. Behavior unchanged. The full happy-path / runActor mocking / JD-cache write path is left for a follow-up — would require mocking _apify.mjs's runActor and is out of scope for the security-focused §11 CodeRabbit asked for. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ageem23 · 2026-05-18T23:19:28Z

Added §11 to test-all.mjs in c3e3b2f — 34 assertions covering the security-relevant paths CodeRabbit asked for. Full suite: 100 passed, 0 failed.

Coverage (kept narrow per the round-2 guidance):

URL rejection — isHttpsUrl() rejects http:, javascript:, data:, file:, ftp:, malformed/empty/null/undefined
Malformed field_map — isFieldSpec() accepts string + non-empty string array; rejects number, empty array, mixed-type array, null, object
Missing token / actor / field_map — apify.fetch() throws clear errors before any network call (env var save/restore so the test works whether or not APIFY_TOKEN is set locally)
Actor-id SSRF guard — normalizeActorId() rejects path traversal, query, fragment, extra path segments, missing separator, non-string
Defaults allowlist — normalizeItem() blocks portals.yml defaults outside title/url/company/location

To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem (apify.mjs) and normalizeActorId (_apify.mjs) are now named exports alongside the existing default export. Behavior unchanged.

Skipped per CodeRabbit's "narrow §11" guidance: happy-path fetch() through a mocked runActor, JD-cache write path, and htmlToText sanitization. Worth a follow-up if the maintainer wants the broader surface.

CodeRabbit feedback on PR santifer#693: a synchronous mkdirSync/writeFileSync failure (disk full, EACCES, EROFS, Windows EBUSY) inside saveJd would propagate up through the .map() callback and abort fetch() for the entire entry — one bad cache write taking down a whole multi-source scan run. Wrap the FS sequence in try/catch. On failure, warn to stderr and return null; the caller now checks for null and leaves the remote URL on the record instead of producing a broken `local:` link. The "coerce inputs to String() before slugify" half of the same CR finding is already covered: slugify() does `String(text || '')` at its first line, and the template-literal hash input and yamlEscape already coerce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GitHub Advanced Security flagged htmlToText on PR santifer#693: - Alert #7 (bad HTML filtering regexp): `<script[\s\S]*?<\/script>` didn't match `</script >` (whitespace before close). Added `\s*` to both script and style close patterns, plus `\b` after the open tag so `<scripty>` isn't matched as a script open. - Alert #8 (incomplete multi-character sanitization): a polyglot like `<<a>b>` left a dangling `<` after one pass. Loop the strip until the string stabilizes. - Alert #6 (double escape/unescape): decoding `&` before `&#NN;` meant `&santifer#60;` round-tripped through `&santifer#60;` and out as `<`. Moved `&` to the end of the decode chain so the literal `&santifer#60;` stays as `&santifer#60;`. None of the three is exploitable in our context — htmlToText output is plain text written to a markdown JD file, never rendered — but clean CodeQL signal on the PR is worth the small regex tweaks. §11 grows by three assertions (one per finding) so the fixes can't silently regress. Full suite: 103 passed, 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CodeRabbit feedback on PR santifer#693: scan.mjs runs up to 10 concurrent workers, so the existsSync()-then-writeFileSync() pattern in saveJd is a TOCTOU race — two workers can both pass the existence check and then clobber each other's writes. Keep existsSync as a cheap fast-path (avoids building the YAML/markdown content string when the cache is already populated), but switch the actual write to `{ flag: 'wx' }` so the OS atomically fails the open with EEXIST if a sibling worker created the file in the meantime. Handle EEXIST in the existing catch by returning the path (preserving the documented "first save wins" behaviour); every other FS error still logs and returns null for the per-item fallback added in e5cf1f6. `relPath` is hoisted above the try so it's accessible to the catch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

My previous htmlToText fix (9b1baad) used `</script\s*>` to tolerate the whitespace variant `</script >`. CodeQL #9 points out the regex still misses the parser-tolerated junk-attribute form like `</script\t\n foo bar>` that some scrapers emit — invalid per the HTML spec but the browser parser silently accepts it as a script close, so it can still hide script content from a stripper that's stricter than the parser. Switch both script and style close patterns from `\s*>` to `\b[^>]*>` so the close tag consumes any whitespace, tabs, newlines, or invalid attributes before the actual `>`. `\b` preserves the boundary so `</scripty>` still doesn't match. Adds one §11 assertion exercising the junk-attr variant. Full suite: 104 passed, 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ageem23 and others added 3 commits May 18, 2026 17:35

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread providers/apify.mjs

Comment thread templates/portals.example.yml Outdated

ageem23 and others added 4 commits May 18, 2026 19:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scan): generic Apify actor provider (closes #325)#26

feat(scan): generic Apify actor provider (closes #325)#26
ageem23 wants to merge 9 commits into
upstream-syncfrom
feat/apify-provider-pr

ageem23 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Review skipped

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

ageem23 commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

Uh oh!

ageem23 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ageem23 commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Conformance to RFC santifer#521

Round-2 review fixes already applied

Deliberate scope limits

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ageem23 commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026

Uh oh!

ageem23 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ageem23 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading