[v4] Implement sneaked-reference detection #5

@cpitzi

Summary

Detect references that appear in the reference list but are never cited in the manuscript body. This is the highest-priority v4 heuristic — high value, low complexity, immediately testable.

Background

Sneaked references are a paper mill signature. Adding uncited references inflates citation counts for specific papers (often other paper mill products) or establishes fake citation networks. A managing editor currently has to manually cross-check the reference list against in-text citations, which is tedious and error-prone for a manuscript with 40+ references.

See roadmap/v4-features.md (Priority 1) for full specification.

Policy Constraint: Manuscript Confidentiality

Elsevier (which publishes AWHONN's journals — JOGNN, MCN, Nursing for Women's Health) has instructed editorial teams not to use publicly available AI chatbots for reference checking. This policy is aimed at consumer LLM interfaces (claude.ai, ChatGPT, etc.) and reflects legitimate confidentiality concerns around pre-publication manuscripts.

The reference list itself is low-sensitivity — it's metadata about already-published work. The auditor's current reference-list-only design is probably defensible under this policy, though editorial teams should confirm with their publisher.

The manuscript body is high-sensitivity. Sneaked-reference detection requires cross-referencing in-text citations against the reference list, which means touching manuscript content. This creates a policy tension that the implementation must address.

Implementation Options

Option A: Local Script (Recommended for Now)

Sneaked-reference detection is a mechanical cross-check, not a forensic judgment call. It doesn't need AI at all.

Ship a lightweight Python script that runs locally on the editor's machine:

  1. Accepts manuscript body text (PDF, DOCX, or pasted text)
  2. Parses in-text citation markers
  3. Compares against the reference list
  4. Reports orphaned references (in reference list but never cited)
  5. Reports phantom citations (cited in text but not in reference list)
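
The steps above reduce to two set differences once citations and reference entries are parsed. A minimal sketch for the numbered citation style, assuming reference-list entries begin with "N." — function names are illustrative, and real manuscripts will need more robust parsing:

```python
import re

# Markers like [1], [1,2], [1-3]
NUMBERED_CITATION = re.compile(r"\[(\d+(?:\s*[,-]\s*\d+)*)\]")

def cited_numbers(body: str) -> set[int]:
    """Expand markers like [1], [1,2], [1-3] into the set of cited entry numbers."""
    cited = set()
    for match in NUMBERED_CITATION.finditer(body):
        for part in match.group(1).split(","):
            if "-" in part:
                lo, hi = (int(x) for x in part.split("-"))
                cited.update(range(lo, hi + 1))
            else:
                cited.add(int(part.strip()))
    return cited

def listed_numbers(reference_list: str) -> set[int]:
    """Collect entry numbers from reference-list lines like '12. Smith J. ...'."""
    return {int(m.group(1))
            for line in reference_list.splitlines()
            if (m := re.match(r"\s*(\d+)\.\s", line))}

def cross_check(body: str, reference_list: str) -> tuple[set[int], set[int]]:
    """Return (orphaned, phantom) entry numbers."""
    cited, listed = cited_numbers(body), listed_numbers(reference_list)
    orphaned = listed - cited  # in reference list, never cited (sneaked-reference candidates)
    phantom = cited - listed   # cited in text, missing from the reference list
    return orphaned, phantom
```

The same two-set comparison works for APA-style citations once markers are normalized to author/year keys; only the parsing layer changes.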

Pros: Zero policy concerns — nothing leaves the editor's machine. Simple to build, simple to deploy. Separates the mechanical task from the forensic task cleanly.

Cons: Two tools instead of one. The auditor doesn't know about sneaked references, so it can't combine that signal with other heuristics (e.g., sneaked + shadow paper = very high confidence fabrication).

Option B: Extract-and-Discard

A local preprocessing step extracts only the in-text citation markers from the manuscript body — (Smith, 2024), [14], etc. — and sends that extracted list (not the manuscript text) to the auditor alongside the reference list.

Pros: The auditor can integrate sneaked-reference signals with other heuristics. Minimal exposure surface — citation markers contain no manuscript content.

Cons: Requires a local preprocessing tool. The extracted citation list is still derived from the manuscript, so policy teams may have opinions about it. Adds friction to the workflow.
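
The extract-and-discard step could look like the following sketch: only citation markers survive, all surrounding prose is discarded before anything is sent to the auditor. The regexes are illustrative approximations of APA and numbered styles, not a complete grammar:

```python
import re

# Approximate patterns for (Author, Year), (Author & Author, Year),
# (Author et al., Year), and [1] / [1,2] / [1-3] styles.
APA_CITATION = re.compile(
    r"\(([A-Z][\w'-]+(?:\s*(?:&|and)\s*[A-Z][\w'-]+)*(?:\s+et al\.)?,\s*\d{4}[a-z]?)\)"
)
NUMBERED_CITATION = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")

def extract_markers(body: str) -> list[str]:
    """Return deduplicated citation markers, with no surrounding manuscript text."""
    markers = NUMBERED_CITATION.findall(body)
    markers += [f"({m})" for m in APA_CITATION.findall(body)]
    seen, out = set(), []
    for m in markers:
        if m not in seen:
            seen.add(m)
            out.append(m)
    return out
```

Because the output is a bare list of markers, it carries no sentences, claims, or results from the manuscript — which is the basis for arguing it has a minimal exposure surface under the publisher policy.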

Option C: Full Manuscript via API with DPA

In a productized deployment, the publisher signs a data processing agreement (DPA) with the API provider. Full manuscript access, full sneaked-reference detection, fully compliant.

Pros: Cleanest integration. Full heuristic interaction. No workflow friction.

Cons: Requires business relationships, legal agreements, and a product that doesn't exist yet. This is the production-grade path, not the current-state path.

Recommended Sequencing

  1. Now: Build Option A (local Python script) as a standalone companion tool. Ships fast, zero policy risk, immediately useful.
  2. If productizing: Build Option B into the workflow — local extraction feeds the auditor. Test whether editorial teams find the two-step process acceptable.
  3. At scale: Option C with API access and a DPA. The tool handles everything.

Acceptance Criteria (Option A — Local Script)

  • Python script accepts manuscript as PDF, DOCX, or plain text
  • Parses APA-style in-text citations: (Author, Year), (Author & Author, Year), (Author et al., Year)
  • Parses numbered citation styles: [1], [1,2], [1-3]
  • Accepts a reference list (plain text or extracted from the same manuscript)
  • Reports orphaned references (in list, never cited in body)
  • Reports phantom citations (cited in body, not in reference list)
  • Runs entirely locally — no network calls, no API access, no data leaves the machine
  • Clear console output or simple HTML report
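
The console-output criterion above might be satisfied by a report formatter along these lines (a sketch only; the function name and labels are placeholders, and the orphaned/phantom lists are assumed to come from the cross-check step):

```python
def format_report(orphaned: list[str], phantom: list[str]) -> str:
    """Render a plain-text report of orphaned references and phantom citations."""
    lines = [f"Orphaned references (in list, never cited): {len(orphaned)}"]
    lines += [f"  - {ref}" for ref in orphaned]
    lines.append(f"Phantom citations (cited, not in list): {len(phantom)}")
    lines += [f"  - {c}" for c in phantom]
    return "\n".join(lines)
```

Keeping the report as plain text makes it trivially pasteable into an editorial decision letter; an HTML variant could wrap the same data.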

Acceptance Criteria (Option B — Extract-and-Discard, Future)

  • Local tool extracts citation markers only (no surrounding text)
  • Auditor prompt accepts citation marker list as optional input
  • Sneaked references classified as Elevated risk (not auto-High — could be innocent omission)
  • Sneaked + shadow paper combination escalates to High risk
  • Does not false-flag references cited only in tables, figure legends, or appendices

Test Cases Needed

  • Test manuscript with 2-3 sneaked references
  • Test manuscript with one legitimate uncited reference (methods supplement) for false-positive testing
  • Test both APA and numbered citation formats

Dependencies

  • Requires v3 prompt as baseline (for Option B/C integration)
  • Policy confirmation from editorial team on reference-list-only handling

Metadata

Labels

heuristic (Forensic heuristic development), prompt-engineering (Prompt design and optimization), v4 (Planned for v4)
