feat(archive): add archive-posting.mjs — save job postings as PDF bef…#697

Open
Krxshna wants to merge 8 commits into santifer:main from Krxshna:feat/archive-posting

@Krxshna Krxshna commented May 19, 2026

…ore they disappear

What does this PR do?

Adds archive-posting.mjs, a Playwright script that saves a live job posting as a rendered PDF in jds/ before the posting disappears. It auto-detects the company and role from the page title and supports a --pipeline mode that batch-archives all pending URLs at once.

Out of scope for this PR (can be follow-up issues):

  • portals.yml config flag for auto-archiving during scans
  • jd_pdf field in tracker TSV

Related issue

Closes #553

Type of change

  • [ ] Bug fix
  • [✓] New feature
  • [ ] Documentation / translation
  • [ ] Refactor (no behavior change)

Checklist

  • [✓] I have read CONTRIBUTING.md
  • [✓] I linked a related issue above (required for features and architecture changes)
  • [✓] My PR does not include personal data (CV, email, real names)
  • [✓] I ran node test-all.mjs and all tests pass
  • [✓] My changes respect the Data Contract (no modifications to user-layer files)
  • [ ] My changes align with the project roadmap

Aligns with the current-phase goal of zero-token utilities (same pattern as scan.mjs —
pure Playwright, no LLM cost). Addresses the data preservation gap identified in #553.


Questions? Join the Discord for faster feedback.

Summary by CodeRabbit

  • New Features
    • Added a CLI to archive live job postings to PDF with single-URL, batch (pipeline), and dry-run preview modes; supports company/role overrides and produces dated, descriptive filenames in a local archive.
  • Chores
    • Added a package script to run the archival CLI.
  • Tests
    • Added tests for CLI help, dry-run output, detection/override behavior, and a best-effort live integration that verifies PDF creation and cleanup.
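
For readers who want a concrete picture of the CLI surface summarized above, here is a minimal sketch of flag parsing that accepts both the --key=value and --key value forms. The flag names come from this PR; the function name parseArgs, the option object shape, and the error messages are assumptions for illustration and may not match archive-posting.mjs.

```javascript
// Hypothetical sketch: flag names are from the PR summary, everything else is assumed.
const BOOLEAN_FLAGS = new Map([
  ['help', 'help'],
  ['pipeline', 'pipeline'],
  ['dry-run', 'dryRun'],
]);
const VALUE_FLAGS = new Set(['company', 'role']);

function parseArgs(argv) {
  const opts = { help: false, pipeline: false, dryRun: false, company: null, role: null, url: null };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith('--')) {
      // The first bare argument is treated as the target URL.
      if (opts.url === null) opts.url = arg;
      continue;
    }
    const eq = arg.indexOf('=');
    const key = eq === -1 ? arg.slice(2) : arg.slice(2, eq);
    if (BOOLEAN_FLAGS.has(key)) {
      opts[BOOLEAN_FLAGS.get(key)] = true;
    } else if (VALUE_FLAGS.has(key)) {
      // `--key=value` carries the value inline; `--key value` consumes the next token.
      const value = eq === -1 ? argv[++i] : arg.slice(eq + 1);
      if (value === undefined || value === '') throw new Error(`Missing value for --${key}`);
      opts[key] = value;
    } else {
      throw new Error(`Unknown flag: ${arg}`);
    }
  }
  // Either pipeline mode or a target URL must be provided (unless asking for help).
  if (!opts.help && !opts.pipeline && opts.url === null) {
    throw new Error('Provide a job URL or pass --pipeline');
  }
  return opts;
}
```

In a real script this would typically be called as parseArgs(process.argv.slice(2)).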


coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a Node.js CLI archive-posting.mjs that snapshots job posting URLs to PDFs in jds/ (single URL or pipeline), resolves company/role via overrides, page title/h1, or ATS URL heuristics, registers an archive npm script, and extends tests including an optional live PDF generation check.
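
The resolution order described in the walkthrough (overrides first, then page title/h1, then URL heuristics) could look roughly like the sketch below. All names here (firstNonEmpty, companyFromUrl, resolveCompanyRole) are hypothetical, and the Greenhouse path-segment rule is an assumed heuristic rather than the script's actual logic.

```javascript
// Hypothetical sketch of the precedence: CLI override > page title/h1 > URL heuristic.
function firstNonEmpty(...candidates) {
  for (const c of candidates) {
    if (typeof c === 'string' && c.trim() !== '') return c.trim();
  }
  return null;
}

// Assumed heuristic: Greenhouse-style URLs carry the board name as the first path segment.
function companyFromUrl(url) {
  const { hostname, pathname } = new URL(url);
  const segments = pathname.split('/').filter(Boolean);
  const isGreenhouse = hostname === 'greenhouse.io' || hostname.endsWith('.greenhouse.io');
  if (isGreenhouse && segments.length > 0) return segments[0];
  // Fallback: the label before the top-level domain, e.g. "example" from "careers.example.com".
  const labels = hostname.split('.');
  return labels.length >= 2 ? labels[labels.length - 2] : hostname;
}

function resolveCompanyRole({ overrides = {}, title = {}, h1 = null, url }) {
  return {
    company: firstNonEmpty(overrides.company, title.company, companyFromUrl(url)) ?? 'unknown',
    role: firstNonEmpty(overrides.role, title.role, h1) ?? 'unknown',
  };
}
```

Keeping the precedence in one function makes it easy to test each fallback without launching a browser.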

Changes

Archive Posting Feature

| Layer | File(s) | Summary |
|---|---|---|
| CLI Interface and Argument Parsing | archive-posting.mjs | Defines help/usage banner with early exit, Node imports, and argument parsing for --help, --pipeline, --dry-run, --company, --role in both --key=value and --key value forms, with validation that pipeline mode or a target URL is provided. |
| Company and Role Detection Utilities | archive-posting.mjs | Implements slugify, today, parsePageTitle for ATS-specific regex extraction, extractCompanyFromUrl for hostname/path fallback, extractPipelineEntries() to read data/pipeline.md pending entries, and normalization logic removing common prefixes. |
| PDF Archival Implementation | archive-posting.mjs | Implements archiveUrl() using Playwright to navigate, wait for hydration, extract title and h1, resolve/normalize company+role with precedence, generate deterministic filename YYYY-MM-DD_company-slug_role-slug.pdf, render A4 PDF with background graphics and margins, write to jds/, and return metadata including size. |
| Main Orchestration and Error Handling | archive-posting.mjs | Implements main() control flow for pipeline vs single-URL selection, dry-run (no browser) vs live (shared Chromium) processing, sequential target archival, per-URL error handling, results aggregation, reference and summary printing, and top-level fatal error handler with exit-code logic. |
| NPM Script Registration | package.json | Registers archive npm script entry that runs node archive-posting.mjs. |
| Test Suite Integration | test-all.mjs | Adds filesystem utilities to imports, includes archive-posting.mjs --help in script validation phase, and introduces "ARCHIVE-POSTING" test section validating dry-run parsing, company detection heuristics, CLI overrides, output format (local:jds/ and date), and optional live integration against Greenhouse with PDF verification and cleanup; live tests are skipped with a warning if API unavailable. |
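
As an illustration of the deterministic filename format YYYY-MM-DD_company-slug_role-slug.pdf mentioned above, the following is a small sketch. The helper names slugify and today appear in the summary, but their bodies here are assumptions, and buildFilename is a hypothetical name introduced only for this example.

```javascript
// Hypothetical bodies; only the names slugify/today and the filename pattern come from the PR.
function slugify(text) {
  return text
    .normalize('NFKD')                 // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, '')   // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')       // collapse everything else into single hyphens
    .replace(/^-+|-+$/g, '');          // trim leading/trailing hyphens
}

function today(now = new Date()) {
  // Local calendar date formatted as YYYY-MM-DD.
  const y = now.getFullYear();
  const m = String(now.getMonth() + 1).padStart(2, '0');
  const d = String(now.getDate()).padStart(2, '0');
  return `${y}-${m}-${d}`;
}

function buildFilename(company, role, now = new Date()) {
  const companySlug = slugify(company) || 'unknown';
  const roleSlug = slugify(role) || 'unknown';
  return `${today(now)}_${companySlug}_${roleSlug}.pdf`;
}
```

Passing the date in as a parameter keeps the function deterministic under test; the same company, role, and date always map to the same file name.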

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ⚠️ Warning | The PR implements core requirements from #553: archiving job postings to PDF with proper filename format, company/role detection, and dry-run mode. However, portals.yml config support and jd_pdf TSV field—listed as requirements—are not implemented. | Address the missing implementations from #553: add portals.yml configuration support for PDF archiving and implement the jd_pdf field in tracker TSV, or clarify these as follow-up work. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly identifies the main feature: adding archive-posting.mjs to save job postings as PDFs, which is the primary change in the changeset. |
| Out of Scope Changes check | ✅ Passed | All changes align with the PR's stated scope: archive-posting.mjs implementation, package.json script registration, and test-all.mjs validation. No unrelated modifications were introduced. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |


@github-actions
Contributor

Welcome to career-ops, @Krxshna! Thanks for your first PR.

A few things to know:

  • Tests will run automatically — check the status below
  • Make sure you've linked a related issue (required for features)
  • Read CONTRIBUTING.md if you haven't

We'll review your PR soon. Join our Discord if you have questions.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@archive-posting.mjs`:
- Line 206: The call console.log(`\  ${url}`) prints a stray backslash instead of
the intended newline and link emoji. Locate the console.log that references the
variable url and replace the `\  ${url}` fragment with `\n🔗  ${url}`, matching
the other usage, so the output shows a newline and link emoji before the URL.

In `@test-all.mjs`:
- Line 364: The test currently interpolates untrusted liveJobUrl into a shell
string passed to run (see the run(...) invocation that builds `node
archive-posting.mjs "${liveJobUrl}"`), which allows command injection; change
the call to pass the program and arguments separately (e.g., use
execFile/child_process spawn or run with an args array) so the URL is an
argument rather than shell-parsed, and additionally validate or strictly
whitelist/sanitize liveJobUrl (or at minimum reject quotes/semicolons/newlines)
before passing it; update any related consumer in archive-posting.mjs to read
the URL from process.argv instead of relying on a shell-expanded string.
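
To make the args-array suggestion above concrete, here is a minimal sketch using Node's built-in child_process.execFileSync. The helper name runNode and the hostile sample string are hypothetical, and the child process here only echoes its argument rather than invoking archive-posting.mjs.

```javascript
import { execFileSync } from 'node:child_process';

// Hypothetical helper: run Node with arguments passed as an array,
// so no shell ever parses (or expands) them.
function runNode(args, options = {}) {
  return execFileSync(process.execPath, args, { encoding: 'utf8', ...options });
}

// A hostile-looking "URL" containing quotes and shell metacharacters.
const hostile = 'https://example.com/job"; echo INJECTED; "';

// The child prints its first argument back as JSON; with `node -e <code> <arg>`,
// the extra argument is available as process.argv[1].
const echoed = runNode(['-e', 'console.log(JSON.stringify(process.argv[1]))', hostile]);

// The argument arrives verbatim as a single string; nothing was executed by a shell.
console.log(JSON.parse(echoed) === hostile);
```

Because execFileSync does not spawn a shell by default, quoting and escaping of the URL are no longer a concern; validating where the URL points is still a separate question.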
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ec7db61c-1d0d-48de-a586-3f314893b195

📥 Commits

Reviewing files that changed from the base of the PR and between 82f0c2e and 0685716.

📒 Files selected for processing (3)
  • archive-posting.mjs
  • package.json
  • test-all.mjs

Comment thread archive-posting.mjs Outdated
Comment thread test-all.mjs Outdated
Krxshna and others added 2 commits May 22, 2026 18:52
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
test-all.mjs (1)

355-365: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate liveJobUrl before passing it to the archiver (Line 364).

liveJobUrl comes from external API data and is used directly as a crawl target. Please enforce https + hostname allowlisting to prevent SSRF-style fetches if upstream data is malformed or compromised.

Suggested hardening patch
 let liveJobUrl = null;
 try {
   const res = await fetch('https://boards-api.greenhouse.io/v1/boards/anthropic/jobs?content=false');
   const { jobs } = await res.json();
   liveJobUrl = jobs?.[0]?.absolute_url ?? null;
+  if (liveJobUrl) {
+    try {
+      const u = new URL(liveJobUrl);
+      const allowedHosts = new Set(['boards.greenhouse.io', 'job-boards.greenhouse.io']);
+      if (u.protocol !== 'https:' || !allowedHosts.has(u.hostname)) {
+        liveJobUrl = null;
+      }
+    } catch {
+      liveJobUrl = null;
+    }
+  }
 } catch { /* offline — degrade gracefully */ }

As per coding guidelines "Check for command injection, path traversal, and SSRF. Ensure scripts handle missing data/ directories gracefully."
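
The same check can also be factored into a small reusable predicate, which may be easier to unit-test than the inline patch. This is a sketch only: the function name isAllowedJobUrl is hypothetical, and the two hostnames are taken from the suggested patch above, not from any authoritative list of Greenhouse hosts.

```javascript
// Hostnames copied from the suggested patch; treat the list as an assumption.
const ALLOWED_HOSTS = new Set(['boards.greenhouse.io', 'job-boards.greenhouse.io']);

// Hypothetical predicate: true only for https URLs whose hostname is allowlisted.
function isAllowedJobUrl(candidate, allowedHosts = ALLOWED_HOSTS) {
  if (typeof candidate !== 'string') return false;
  let parsed;
  try {
    parsed = new URL(candidate);
  } catch {
    return false; // not a parseable absolute URL
  }
  // Exact hostname match: suffix or substring checks would let look-alike hosts through.
  return parsed.protocol === 'https:' && allowedHosts.has(parsed.hostname);
}
```

A caller could then write something like `liveJobUrl = isAllowedJobUrl(candidate) ? candidate : null;` and fall through to the existing skip path when the result is null.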

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test-all.mjs` around lines 355 - 365, The code uses the external liveJobUrl
directly in run('node', ['archive-posting.mjs', liveJobUrl], ...) which can
enable SSRF/command-injection; validate liveJobUrl before invoking the archiver
by parsing it (new URL(...)) and enforcing protocol === 'https:' and that
url.hostname is in an allowlist of trusted hostnames (reject or warn and skip if
missing/invalid), and only pass the validated string to run; ensure the
validation logic is applied where liveJobUrl is set/checked and that a failing
validation results in the same skip path (warn('live archive skipped ...'))
rather than calling archive-posting.mjs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 54a72fab-248e-4017-a38c-6ea663ee11fa

📥 Commits

Reviewing files that changed from the base of the PR and between a05e3ac and 6c3873b.

📒 Files selected for processing (1)
  • test-all.mjs

Krxshna and others added 4 commits May 22, 2026 19:14
…r upstream merge

The upstream rebase added sections 11 (VERSION FILE) and a LOCATION
FILTER block, leaving our ARCHIVE-POSTING section with a duplicate
number and a missing closing brace on the liveJobUrl else-block.
Renumbered ARCHIVE-POSTING → 12, LOCATION FILTER → 13, and added
the missing } to restore valid syntax (96 passed, 0 failed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Feature Request: Archive Job Postings as PDF
