[v4] Implement sneaked-reference detection #5
Summary
Detect references that appear in the reference list but are never cited in the manuscript body. This is the highest-priority v4 heuristic — high value, low complexity, immediately testable.
Background
Sneaked references are a paper mill signature. Adding uncited references inflates citation counts for specific papers (often other paper mill products) or establishes fake citation networks. A managing editor currently has to manually cross-check the reference list against in-text citations — tedious and error-prone for a 40+ reference manuscript.
See roadmap/v4-features.md (Priority 1) for full specification.
Policy Constraint: Manuscript Confidentiality
Elsevier (which publishes AWHONN's journals, JOGNN and Nursing for Women's Health) has instructed editorial teams not to use publicly available AI chatbots for reference checking. This policy is aimed at consumer LLM interfaces (claude.ai, ChatGPT, etc.) and reflects legitimate confidentiality concerns around pre-publication manuscripts.
The reference list itself is low-sensitivity — it's metadata about already-published work. The auditor's current reference-list-only design is probably defensible under this policy, though editorial teams should confirm with their publisher.
The manuscript body is high-sensitivity. Sneaked-reference detection requires cross-referencing in-text citations against the reference list, which means touching manuscript content. This creates a policy tension that the implementation must address.
Implementation Options
Option A: Local Script (Recommended for Now)
Sneaked-reference detection is a mechanical cross-check, not a forensic judgment call. It doesn't need AI at all.
Ship a lightweight Python script that runs locally on the editor's machine:
- Accepts manuscript body text (PDF, DOCX, or pasted text)
- Parses in-text citation markers
- Compares against the reference list
- Reports orphaned references (in reference list but never cited)
- Reports phantom citations (cited in text but not in reference list)
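For numbered citation styles, the cross-check above reduces to two set differences. The following is a minimal sketch, not a finished tool: the function names are hypothetical, and it assumes the reference list is numbered consecutively from 1 and that the body text has already been extracted to a plain string.

```python
import re

# Matches bracketed numeric citations such as [1], [1,2], [1-3], [2, 5-7].
NUMERIC_CITATION = re.compile(r"\[(\d+(?:\s*[-–,]\s*\d+)*)\]")


def cited_numbers(body: str) -> set[int]:
    """Collect every reference number cited in the body, expanding ranges."""
    cited: set[int] = set()
    for match in NUMERIC_CITATION.finditer(body):
        for part in match.group(1).split(","):
            # "5-7" becomes [5, 7]; a single number becomes a one-item list.
            bounds = [int(n) for n in re.split(r"[-–]", part)]
            cited.update(range(bounds[0], bounds[-1] + 1))
    return cited


def cross_check(body: str, reference_count: int) -> tuple[list[int], list[int]]:
    """Return (orphaned, phantom) reference numbers.

    Assumption: the reference list is numbered 1..reference_count.
    """
    listed = set(range(1, reference_count + 1))
    cited = cited_numbers(body)
    orphaned = sorted(listed - cited)  # in the list, never cited
    phantom = sorted(cited - listed)   # cited, but not in the list
    return orphaned, phantom


if __name__ == "__main__":
    sample = "Prior work [1] and later studies [2, 4-5] agree; see also [9]."
    print(cross_check(sample, reference_count=6))
```

Because the whole check is regular expressions and set arithmetic, it needs no network access, which fits the local-only constraint.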
Pros: Zero policy concerns — nothing leaves the editor's machine. Simple to build, simple to deploy. Separates the mechanical task from the forensic task cleanly.
Cons: Two tools instead of one. The auditor doesn't know about sneaked references, so it can't combine that signal with other heuristics (e.g., sneaked + shadow paper = very high confidence fabrication).
Option B: Extract-and-Discard
A local preprocessing step extracts only the in-text citation markers from the manuscript body — (Smith, 2024), [14], etc. — and sends that extracted list (not the manuscript text) to the auditor alongside the reference list.
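The extraction step might look roughly like the sketch below. The regular expressions, function names, and JSON keys (`citation_markers`, `references`) are illustrative assumptions, not an existing auditor interface, and the patterns cover only the two marker styles mentioned above.

```python
import json
import re

# Hypothetical marker patterns for the two styles named above.
# APA: "(Smith, 2024)" or "(Lee & Park, 2021; Chen et al., 2019)".
APA = r"\([A-Z][^()\d]*,\s*\d{4}[a-z]?(?:;\s*[A-Z][^()\d]*,\s*\d{4}[a-z]?)*\)"
# Numbered: "[14]", "[1,2]", "[1-3]".
NUMERIC = r"\[\d+(?:\s*[-–,]\s*\d+)*\]"
MARKER = re.compile(f"{APA}|{NUMERIC}")


def extract_markers(body: str) -> list[str]:
    """Return citation markers in order of appearance, with no surrounding text."""
    return [m.group(0) for m in MARKER.finditer(body)]


def build_payload(body: str, reference_list: list[str]) -> str:
    """Serialize only the markers and the reference list; the body is discarded."""
    return json.dumps(
        {
            "citation_markers": extract_markers(body),
            "references": reference_list,
        },
        ensure_ascii=False,
    )


if __name__ == "__main__":
    text = "Outcomes improved (Smith, 2024), as reported elsewhere [14]."
    print(build_payload(text, ["placeholder reference entry"]))
```

Only the string returned by `build_payload` would leave the machine, so a policy reviewer can inspect exactly what is shared.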
Pros: The auditor can integrate sneaked-reference signals with other heuristics. Minimal exposure surface: citation markers carry none of the manuscript's prose, data, or findings.
Cons: Requires a local preprocessing tool. The extracted citation list is still derived from the manuscript, so policy teams may have opinions about it. Adds friction to the workflow.
Option C: Full Manuscript via API with DPA
In a productized deployment, the publisher signs a data processing agreement (DPA) with the API provider. Full manuscript access, full sneaked-reference detection, fully compliant.
Pros: Cleanest integration. Full heuristic interaction. No workflow friction.
Cons: Requires business relationships, legal agreements, and a product that doesn't exist yet. This is the production-grade path, not the current-state path.
Recommended Sequencing
- Now: Build Option A (local Python script) as a standalone companion tool. Ships fast, zero policy risk, immediately useful.
- If productizing: Build Option B into the workflow — local extraction feeds the auditor. Test whether editorial teams find the two-step process acceptable.
- At scale: Option C with API access and a DPA. The tool handles everything.
Acceptance Criteria (Option A — Local Script)
- Python script accepts manuscript as PDF, DOCX, or plain text
- Parses APA-style in-text citations: (Author, Year), (Author & Author, Year), (Author et al., Year)
- Parses numbered citation styles: [1], [1,2], [1-3]
- Accepts a reference list (plain text or extracted from the same manuscript)
- Reports orphaned references (in list, never cited in body)
- Reports phantom citations (cited in body, not in reference list)
- Runs entirely locally — no network calls, no API access, no data leaves the machine
- Clear console output or simple HTML report
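One way to approach the APA-style criteria is to reduce both in-text citations and reference entries to a (first-author surname, year) key and compare the key sets. This is a sketch under stated assumptions: the patterns and function names are hypothetical, each reference entry is a single line beginning "Surname, Initials ... (Year)", narrative citations such as "Smith (2024)" are not handled, and the sample names are invented placeholders.

```python
import re

# Any parenthetical that contains a four-digit year.
PARENS = re.compile(r"\(([^()]*\d{4}[a-z]?[^()]*)\)")
# One "Author..., Year" unit inside a parenthetical, e.g. "Chen et al., 2019".
IN_TEXT = re.compile(r"([A-Z][\w'’-]+)[^()\d;]*,\s*(\d{4}[a-z]?)")
# Start of an APA reference entry, e.g. "Chen, L., & Wu, T. (2019). Title."
REF_ENTRY = re.compile(r"^\s*([A-Z][\w'’-]+),.*?\((\d{4}[a-z]?)\)")

Key = tuple[str, str]  # (lowercased first-author surname, year)


def cited_keys(body: str) -> set[Key]:
    """Collect (surname, year) keys from parenthetical citations in the body."""
    keys: set[Key] = set()
    for group in PARENS.finditer(body):
        for surname, year in IN_TEXT.findall(group.group(1)):
            keys.add((surname.lower(), year))
    return keys


def reference_keys(entries: list[str]) -> dict[Key, str]:
    """Map each parseable reference entry to its (surname, year) key."""
    keys: dict[Key, str] = {}
    for entry in entries:
        match = REF_ENTRY.match(entry)
        if match:
            keys[(match.group(1).lower(), match.group(2))] = entry
    return keys


def cross_check_apa(body: str, entries: list[str]) -> tuple[list[str], list[Key]]:
    """Return (orphaned reference entries, phantom citation keys)."""
    cited = cited_keys(body)
    listed = reference_keys(entries)
    orphaned = [entry for key, entry in listed.items() if key not in cited]
    phantom = sorted(cited - listed.keys())
    return orphaned, phantom
```

Matching on surname and year alone will collide when one author has two works in the same year without a letter suffix, so a real implementation would likely need a tie-breaker.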
Acceptance Criteria (Option B — Extract-and-Discard, Future)
- Local tool extracts citation markers only (no surrounding text)
- Auditor prompt accepts citation marker list as optional input
- Sneaked references classified as Elevated risk (not auto-High — could be innocent omission)
- Sneaked + shadow paper combination escalates to High risk
- Does not false-flag references cited only in tables, figure legends, or appendices
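The risk rules in these criteria can be written down as a small decision function. The sketch below is illustrative only: the `Risk` enum, the function name, and the boolean inputs are hypothetical, and how a shadow-paper signal is rated on its own is left to the other heuristics.

```python
from enum import Enum


class Risk(Enum):
    NONE = "none"
    ELEVATED = "elevated"
    HIGH = "high"


def sneaked_reference_risk(
    cited_in_body: bool,
    cited_elsewhere: bool,
    shadow_paper: bool,
) -> Risk:
    """Rate one reference for the sneaked-reference heuristic only.

    cited_elsewhere: cited in a table, figure legend, or appendix.
    shadow_paper: flagged as a shadow paper by a separate heuristic (assumed input).
    """
    sneaked = not cited_in_body and not cited_elsewhere
    if not sneaked:
        # References cited only in tables, legends, or appendices are not flagged.
        return Risk.NONE
    if shadow_paper:
        # Sneaked plus shadow paper escalates to High.
        return Risk.HIGH
    # Sneaked alone stays Elevated, since it could be an innocent omission.
    return Risk.ELEVATED
```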
Test Cases Needed
- Test manuscript with 2-3 sneaked references
- Test manuscript with one reference that is legitimately cited only outside the body (e.g., in a methods supplement), for false-positive testing
- Test both APA and numbered citation formats
Dependencies
- Requires v3 prompt as baseline (for Option B/C integration)
- Policy confirmation from editorial team on reference-list-only handling