Skip to content

feat(scan): generic Apify actor provider (closes #325)#26

Open
ageem23 wants to merge 9 commits into
upstream-syncfrom
feat/apify-provider-pr
Open

feat(scan): generic Apify actor provider (closes #325)#26
ageem23 wants to merge 9 commits into
upstream-syncfrom
feat/apify-provider-pr

Conversation

@ageem23
Copy link
Copy Markdown
Owner

@ageem23 ageem23 commented May 18, 2026

Note: Opened against upstream-sync (mirror of santifer/career-ops:main) on my fork to get CodeRabbit feedback before sending upstream. The body below is drafted for the eventual upstream PR.


Follow-up to the Plugin Architecture RFC (#521) — implements the third item in the RFC's split plan ("Follow-up RFC + PR (Apify, closes santifer#325)"). Phase A (plugin contract + Greenhouse/Ashby/Lever ports) is already on main; this PR builds on that.

Closes santifer#325.

Summary

  • providers/apify.mjs — generic provider that runs any pre-built Apify actor and maps its dataset items to the scanner's {title, url, company, location} shape. All variation (actor, input, field mappings) lives in portals.yml, not in code, so a single implementation serves any site with an Apify scraper available.
  • providers/_apify.mjs — shared Apify transport (auth, run/poll/fetch, retries with end-to-end deadline, SSRF guard on actorId). Follows the _-prefix helper convention established by _http.mjs / _types.js.
  • Optional field_map.description mapping persists the JD body to jds/{slug}.md and rewrites the entry's url to local:jds/... so downstream tools read it from disk instead of re-fetching the remote URL. Avoids paying for the same Apify run twice and dodges HTTP failures on stale job-board links. Falls through to the remote URL when the description field is missing or shorter than 50 chars.
  • .env.example — documents APIFY_TOKEN.
  • templates/portals.example.yml — worked Indeed example showing both the basic field_map shape and the optional description mapping.

Conformance to RFC santifer#521

  • Provider contract: default export with id, detect(), fetch(entry, ctx). detect() returns null (explicit provider: apify required — no URL scheme to auto-detect from).
  • Error contract: missing token / missing required entry config / invalid field_map shape all throw so scan.mjs records the error per-entry and continues with the next.
  • Helper convention: _apify.mjs uses the _ prefix and is skipped by the loader.
  • scan-history.tsv: source label uses the slugified actor id, so misceres/indeed-scraper produces a stable per-actor dedup key.

Round-2 review fixes already applied

From the prior PR santifer#454 review thread, these are folded in (one CodeRabbit pass already absorbed on my fork as PR #8):

  • APIFY_TOKEN sent via Authorization: Bearer header, never as a query string (prevents token leakage into access logs).
  • SSRF guard on actorId — strict ^[A-Za-z0-9][A-Za-z0-9_.-]*[~/][A-Za-z0-9][A-Za-z0-9_.-]*$ validation so a malformed config can't divert the bearer-token request off the intended /acts/<actor>/runs path.
  • End-to-end timeout via a single shared deadline threaded through startRunwaitForRunfetchDatasetItems. runActor's timeoutMs is now an actual wall-clock ceiling, not just a wait-loop ceiling.
  • waitForRun() re-throws non-retriable 4xx (401/403/404) instead of looping to the deadline and masking the real cause as a generic timeout.
  • runActor() validates timeoutMs is a positive finite number up front (a NaN previously produced Date.now() + NaN → confusing "did not finish within 0s" errors).
  • apify.mjs saveJd() filename includes sha1(url)[:10] so two distinct postings sharing the same company+title slug don't collide on a single cache file (mirrors the same fix applied to the linkedin provider).
  • field_map spec validation accepts only a string or non-empty array of strings; rejects malformed values like { title: 42 } at load time with a clear error instead of crashing mid-scan.

Deliberate scope limits

  • Only one example entry in portals.example.yml — the Indeed actor. Workday-via-third-party-rental actors are also tested on my fork but recommending a specific paid third-party rental in upstream docs warrants its own discussion; happy to add it in a follow-up.
  • No browser transport, no scraper provider — those are the separate follow-up PR per RFC RFC: Plugin-based Provider Architecture for `scan.mjs` santifer/career-ops#521. This PR is Apify-only.
  • No proxyConfiguration handling — only relevant for the deferred Workday entry; can land with it.

Test plan

  • node scan.mjs --dry-run with the Indeed example enabled and APIFY_TOKEN set — verify items show up under the misceres-indeed-scraper source label
  • Same with APIFY_TOKEN unset — verify the Apify entry errors cleanly and the rest of the scan continues
  • Verify a malformed field_map (e.g. title: 42) is rejected at fetch with the documented error
  • Verify the optional description: mapping writes jds/{slug}.md and the pipeline entry uses the local:jds/... URL
  • Verify a posting with a short/missing description falls through to the remote URL (no broken local:jds/... link)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Integrated Apify for automated job discovery from external sources (e.g., Indeed scraping)
    • Job listings can be cached locally as formatted markdown files with job descriptions
    • Added environment variable documentation for Apify token authentication setup
    • Included example configuration showing how to set up Apify-backed job sources with automatic field mapping

Review Change Stack

ageem23 and others added 3 commits May 18, 2026 17:35
Adds a generic provider that runs any pre-built Apify actor and maps its
dataset items to the scanner's {title, url, company, location} shape.
All variation (which actor, what input, how to read fields) lives in
portals.yml, not in code — so a single provider implementation serves
any site that has a working Apify scraper.

Optional `field_map.description` persists the JD body to jds/{slug}.md
and rewrites the entry's url to local:jds/... so downstream tools read
it from disk instead of re-fetching the remote URL. Avoids paying for
the same Apify run twice and dodges HTTP failures on stale job-board
links.

Transport (auth, run/poll/fetch, retries with a shared end-to-end
deadline, SSRF guard on actorId) lives in providers/_apify.mjs, mirroring
the _http.mjs split used by the existing providers. APIFY_TOKEN is sent
via Authorization header, never as a query string. Entries error cleanly
when the token is missing so the rest of scan.mjs continues.

Includes a worked example in templates/portals.example.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worked example showing the minimum field_map shape (positionName→title,
company/companyName fallback, location/formattedLocation fallback) for
the misceres/indeed-scraper actor.

Also demonstrates the optional `description` mapping that enables local
jds/{slug}.md caching and the local:jds/... URL rewrite. Without an
example here, that feature is only discoverable by reading the provider
source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the APIFY_TOKEN entry to .env.example so users discover the
configuration requirement when they copy the file, instead of finding
out only after a scan run hits the missing-token error from
providers/_apify.mjs.

Per the split plan in RFC santifer#521, the env entry ships with the Apify
provider PR rather than the plugin-contract PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)
  • main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 5c7c6e96-3c28-4e24-ba9c-71557c03d9b9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/apify-provider-pr

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/apify.mjs`:
- Around line 137-172: The fallback to normalized.applicationUrl in saveJd is
dead/confusing because normalizeItem never sets applicationUrl; either remove
the unused reference in saveJd and simplify the hash input to use normalized.url
|| `${normalized.company}-${normalized.title}`, or alternatively populate
applicationUrl in normalizeItem; update the code in saveJd (function saveJd) to
use only normalized.url (with the company-title fallback) or add an explicit
applicationUrl field in normalizeItem so the reference is valid—pick one
consistent approach and apply it across saveJd and normalizeItem.

In `@templates/portals.example.yml`:
- Around line 1039-1043: Update the example comment that describes where the job
description files are written for the "description" mapping: change the
documented filename from "jds/{slug}.md" to the actual hash-suffixed pattern
used at runtime (e.g., "jds/{slug}-{hash}.md" or "jds/{slug}-{sha256}.md") so
users can locate the cached files correctly when using the description:
[description, descriptionText, descriptionHTML, jobDescription] mapping.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 0aca62dc-07c7-4b97-8cc5-21c006aef44b

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1f3a3 and f0d49c2.

📒 Files selected for processing (4)
  • .env.example
  • providers/_apify.mjs
  • providers/apify.mjs
  • templates/portals.example.yml

Comment thread providers/apify.mjs
Comment thread templates/portals.example.yml Outdated
Three findings from the round-1 review:

1. Dead `applicationUrl` fallback in saveJd() — normalizeItem never
   populates that field, so the middle clause of the hash input was
   never reached. Simplified to `normalized.url || company-title`.
   (apify.mjs:148)

2. Docs referenced `jds/{slug}.md` but saveJd actually writes
   `jds/{slug}-{hash}.md` (the sha1[:10] suffix prevents two distinct
   postings sharing the same company+title slug from colliding on one
   file). Fixed in both the apify.mjs module docstring and the
   portals.example.yml inline comment.

3. Apply PR santifer#653's URL-validation pattern: actor-supplied URLs were
   passed through unchecked to pipeline.md, the JD-cache filename
   hash, and `_remote_url`. Added isHttpsUrl() and reject items
   whose normalized.url isn't https — blocks `javascript:`, `data:`,
   `file:`, and protocol-downgraded `http:` URLs from a buggy or
   malicious actor before they become clickable links.

Also scrubbed a stale `mirrors providers/linkedin.mjs format` section
comment — that file doesn't exist on upstream/main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ageem23
Copy link
Copy Markdown
Owner Author

ageem23 commented May 18, 2026

Round-1 review feedback addressed in 2b5561b:

  1. Dead applicationUrl fallback (saveJd) — removed; hash input simplified to normalized.url || ${company}-${title}.
  2. jds/{slug}.mdjds/{slug}-{hash}.md — fixed in both apify.mjs module docstring and the portals.example.yml inline comment.
  3. URL validation (pattern from feat(scan): add Workable, SmartRecruiters, Recruitee ATS parsers santifer/career-ops#653's review of parseRecruiteeResponse / parseWorkableMarkdown) — actor-supplied URLs are now rejected unless protocol is https:. Blocks javascript:, data:, file:, and protocol-downgraded http: URLs from a buggy or malicious actor before they become clickable in pipeline.md or feed the JD-cache filename hash.

Also scrubbed a stale mirrors providers/linkedin.mjs format section comment — that file isn't upstream.

Test gap noted but not addressed in this commit: PR santifer#653 added §11/§12/§13 in test-all.mjs (~27 assertions per provider). Happy to add an equivalent §11 for the apify provider here or in a follow-up — let me know which fits better for the upstream PR scope.

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

@ageem23 Thanks for the detailed round-up — all three fixes look well-scoped:

  • The simplified hash input (normalized.url || \${company}-${title}``) is cleaner and the dead fallback removal reduces confusion.
  • The {slug}-{hash}.md docstring/YAML alignment is good housekeeping.
  • The https:-only guard is the right call — especially since actor output URLs flow into pipeline.md links and the JD cache filename hash; blocking downgrade/exotic protocols at the boundary is much safer than trusting the actor author.

On the test gap: My recommendation is to add §11 for the Apify provider in this PR. The new provider introduces meaningful branching surface (field_map validation, description fallback, URL rejection, JD file write/reuse, hasToken short-circuit) that benefits from co-located tests. Shipping the provider without any assertions makes future refactors riskier. If the upstream PR scope is strict, a narrow §11 covering just the security-relevant paths (URL rejection, malformed field_map, missing token) would be a reasonable minimum — the happy-path and description tests could follow in the same style as #653's §12/§13.

Triggering a fresh review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Per CodeRabbit's recommendation on PR #26, adds a narrow provider test
section focused on the new branching surface introduced by this PR:
URL rejection, field_map validation, missing token, actor-id SSRF
guard, and normalization defaults whitelist. 34 assertions, follows
the §11/§12/§13 style established by PR santifer#653.

Specifically covers:

- isHttpsUrl() rejects javascript:/data:/file:/ftp:/http:/malformed/
  empty/null/undefined (the URL-validation hardening from the PR santifer#653
  pattern)
- isFieldSpec() accepts string + non-empty string array, rejects
  number/empty array/mixed-type array/null/object
- normalizeItem() picks first non-empty fallback, applies defaults
  only to title/url/company/location (rejects defaults outside the
  allowlist so portals.yml can't inject arbitrary record fields)
- normalizeActorId() rejects path traversal (../), query (?), fragment
  (#), extra path segments, missing separator, non-string — the SSRF
  guard on /acts/<actor>/runs
- apify.fetch() throws clear errors and never reaches runActor when
  APIFY_TOKEN is unset, actor is missing, or field_map is malformed

To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem
(apify.mjs) and normalizeActorId (_apify.mjs) are now named exports
alongside the existing default export. Behavior unchanged.

The full happy-path / runActor mocking / JD-cache write path is left
for a follow-up — would require mocking _apify.mjs's runActor and is
out of scope for the security-focused §11 CodeRabbit asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ageem23
Copy link
Copy Markdown
Owner Author

ageem23 commented May 18, 2026

Added §11 to test-all.mjs in c3e3b2f — 34 assertions covering the security-relevant paths CodeRabbit asked for. Full suite: 100 passed, 0 failed.

Coverage (kept narrow per the round-2 guidance):

  • URL rejectionisHttpsUrl() rejects http:, javascript:, data:, file:, ftp:, malformed/empty/null/undefined
  • Malformed field_mapisFieldSpec() accepts string + non-empty string array; rejects number, empty array, mixed-type array, null, object
  • Missing token / actor / field_mapapify.fetch() throws clear errors before any network call (env var save/restore so the test works whether or not APIFY_TOKEN is set locally)
  • Actor-id SSRF guardnormalizeActorId() rejects path traversal, query, fragment, extra path segments, missing separator, non-string
  • Defaults allowlistnormalizeItem() blocks portals.yml defaults outside title/url/company/location

To make the helpers testable, isFieldSpec / isHttpsUrl / normalizeItem (apify.mjs) and normalizeActorId (_apify.mjs) are now named exports alongside the existing default export. Behavior unchanged.

Skipped per CodeRabbit's "narrow §11" guidance: happy-path fetch() through a mocked runActor, JD-cache write path, and htmlToText sanitization. Worth a follow-up if the maintainer wants the broader surface.

ageem23 and others added 4 commits May 18, 2026 19:14
CodeRabbit feedback on PR santifer#693: a synchronous mkdirSync/writeFileSync
failure (disk full, EACCES, EROFS, Windows EBUSY) inside saveJd would
propagate up through the .map() callback and abort fetch() for the
entire entry — one bad cache write taking down a whole multi-source
scan run.

Wrap the FS sequence in try/catch. On failure, warn to stderr and
return null; the caller now checks for null and leaves the remote
URL on the record instead of producing a broken `local:` link.

The "coerce inputs to String() before slugify" half of the same CR
finding is already covered: slugify() does `String(text || '')` at
its first line, and the template-literal hash input and yamlEscape
already coerce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub Advanced Security flagged htmlToText on PR santifer#693:

- Alert #7 (bad HTML filtering regexp): `<script[\s\S]*?<\/script>`
  didn't match `</script >` (whitespace before close). Added `\s*` to
  both script and style close patterns, plus `\b` after the open tag
  so `<scripty>` isn't matched as a script open.

- Alert #8 (incomplete multi-character sanitization): a polyglot like
  `<<a>b>` left a dangling `<` after one pass. Loop the strip until
  the string stabilizes.

- Alert #6 (double escape/unescape): decoding `&amp;` before `&#NN;`
  meant `&amp;santifer#60;` round-tripped through `&santifer#60;` and out as `<`.
  Moved `&amp;` to the end of the decode chain so the literal
  `&amp;santifer#60;` stays as `&santifer#60;`.

None of the three is exploitable in our context — htmlToText output
is plain text written to a markdown JD file, never rendered — but
clean CodeQL signal on the PR is worth the small regex tweaks.

§11 grows by three assertions (one per finding) so the fixes can't
silently regress. Full suite: 103 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit feedback on PR santifer#693: scan.mjs runs up to 10 concurrent
workers, so the existsSync()-then-writeFileSync() pattern in saveJd
is a TOCTOU race — two workers can both pass the existence check
and then clobber each other's writes.

Keep existsSync as a cheap fast-path (avoids building the YAML/markdown
content string when the cache is already populated), but switch the
actual write to `{ flag: 'wx' }` so the OS atomically fails the open
with EEXIST if a sibling worker created the file in the meantime.
Handle EEXIST in the existing catch by returning the path (preserving
the documented "first save wins" behaviour); every other FS error
still logs and returns null for the per-item fallback added in
e5cf1f6.

`relPath` is hoisted above the try so it's accessible to the catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My previous htmlToText fix (9b1baad) used `</script\s*>` to tolerate
the whitespace variant `</script >`. CodeQL #9 points out the regex
still misses the parser-tolerated junk-attribute form like
`</script\t\n foo bar>` that some scrapers emit — invalid per the
HTML spec but the browser parser silently accepts it as a script
close, so it can still hide script content from a stripper that's
stricter than the parser.

Switch both script and style close patterns from `\s*>` to `\b[^>]*>`
so the close tag consumes any whitespace, tabs, newlines, or invalid
attributes before the actual `>`. `\b` preserves the boundary so
`</scripty>` still doesn't match.

Adds one §11 assertion exercising the junk-attr variant. Full suite:
104 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant