GiesDSRS/pdf-highlighter

PDF Highlighter

This project provides a Node.js script to programmatically find and highlight text within a PDF document. It uses pdf-lib to create the highlight annotations and pdfjs-dist to parse the PDF and find the text coordinates.

Features

  • Finds and highlights specific text in a PDF.
  • Creates proper, machine-readable "Highlight" annotations.
  • Can extract existing highlight annotations from a PDF.

Prerequisites

  • Node.js
  • npm

Installation

  1. Clone the repository:
    git clone <repository-url>
  2. Navigate to the project directory:
    cd pdf-highlighter
  3. Install the dependencies:
    npm install

How it Works

This project revolves around three core functions, each serving a distinct purpose in the PDF processing pipeline. They are designed to be used in sequence: first extracting text, then using that text to add highlights, and finally retrieving the content of those highlights.

1. extract_pdf_text

Purpose: To read a PDF and extract all of its text into a structured format that includes precise location data for every word or text chunk.

Step-by-Step:

  1. The tool loads the raw PDF file using the pdf.js library (pdfjs-dist).
  2. It iterates through the document page by page.
  3. On each page, it calls page.getTextContent() to get a list of individual text chunks. Each chunk has a str field, which may hold a line of text, a single word, a fragment, or even an empty string (""), along with its coordinates (transform), width, and height.
  4. The tool collects all these text items from all pages into a single, large array. It adds a pageNum property to each item to track its location.
  5. This complete, detailed array is then cached in memory on the server. This is a critical step, as the highlight_pdf tool relies on this cache.
  6. Finally, a lightweight version of this data (containing only id, pageNum, and the text string str) is sent back to the user or AI agent.
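
The caching and trimming steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `pdfCache`, `cacheAndSummarize`, and the use of a running index as `id` are assumptions, and the `pages` argument stands in for the per-page results of page.getTextContent() so the sketch runs without loading a real PDF.

```javascript
// Hypothetical in-memory cache keyed by PDF path (the real server's cache may look different).
const pdfCache = new Map();

// `pages` mimics the per-page output of page.getTextContent():
// an array of { items: [{ str, transform, width, height }, ...] }, one entry per page.
function cacheAndSummarize(inputPath, pages) {
  const allItems = [];
  pages.forEach((page, pageIndex) => {
    for (const item of page.items) {
      allItems.push({
        id: allItems.length,       // assumed: running index across the whole document
        pageNum: pageIndex + 1,    // pages are numbered from 1
        str: item.str,
        transform: item.transform, // [a, b, c, d, x, y]; x and y locate the chunk
        width: item.width,
        height: item.height,
      });
    }
  });

  // Keep the full, detailed array for later highlighting.
  pdfCache.set(inputPath, allItems);

  // Return only the lightweight fields to the caller.
  return allItems.map(({ id, pageNum, str }) => ({ id, pageNum, str }));
}

const summary = cacheAndSummarize("example.pdf", [
  { items: [{ str: "Hello", transform: [1, 0, 0, 1, 72, 700], width: 30, height: 12 }] },
  { items: [{ str: "world", transform: [1, 0, 0, 1, 72, 700], width: 28, height: 12 }] },
]);
console.log(summary);
```

Keeping the coordinates server-side and returning only id, pageNum, and str keeps the response small, which matters when the caller is an AI agent with a limited context window.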

2. highlight_pdf

Purpose: To add machine-readable highlight annotations to a PDF based on a list of exact text strings.

Step-by-Step:

  1. The tool receives an array of strings (textsToHighlight) that the user wants to find and highlight.
  2. It retrieves the full, detailed text data from the in-memory cache created by extract_pdf_text. If the cache is empty for that PDF, the tool fails with the message: "PDF data for ${inputPath} not found in cache. Please run 'extract_pdf_text' first."
  3. To find the text, it iterates through the cached items on a per-page basis. It joins the text chunks on a page into a single string to search for the textsToHighlight.
  4. When a match is found, the tool identifies which of the original, smaller text chunks correspond to the matched string.
  5. Using the precise coordinates of these identified chunks, it calculates the QuadPoints for the highlight. Because QuadPoints can hold one quadrilateral per chunk, a single highlight can follow the text accurately even when it wraps across line breaks.
  6. It then uses the pdf-lib library to create a new highlight annotation object in memory with these QuadPoints.
  7. After creating all requested highlights, the tool saves the modified PDF document to the specified output path.
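
The matching and QuadPoints steps above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: `findChunksForText` and `quadPointsFor` are hypothetical names, chunks are joined without a separator, whole chunks are highlighted even when the match covers only part of one, and each quad is emitted in the order top-left, top-right, bottom-left, bottom-right (an ordering many viewers accept, though the project may use another). Creating the annotation with pdf-lib and saving the file are left out.

```javascript
// Join the chunk strings of one page and remember each chunk's character range.
function findChunksForText(pageItems, text) {
  let joined = "";
  const ranges = pageItems.map((item) => {
    const start = joined.length;
    joined += item.str; // assumed: no separator between chunks
    return { item, start, end: joined.length };
  });

  const matchStart = joined.indexOf(text);
  if (matchStart === -1) return [];
  const matchEnd = matchStart + text.length;

  // Keep every non-empty chunk whose character range overlaps the match.
  return ranges
    .filter((r) => r.end > r.start && r.start < matchEnd && r.end > matchStart)
    .map((r) => r.item);
}

// Build one quadrilateral (8 numbers) per chunk from its origin, width, and height.
function quadPointsFor(items) {
  return items.flatMap(({ transform, width, height }) => {
    const x = transform[4];
    const y = transform[5];
    return [x, y + height, x + width, y + height, x, y, x + width, y];
  });
}

const pageItems = [
  { str: "The quick ", transform: [1, 0, 0, 1, 72, 700], width: 60, height: 12 },
  { str: "brown fox", transform: [1, 0, 0, 1, 132, 700], width: 50, height: 12 },
  { str: "jumps", transform: [1, 0, 0, 1, 72, 686], width: 30, height: 12 },
];

const matched = findChunksForText(pageItems, "quick brown");
const quads = quadPointsFor(matched);
console.log(matched.map((m) => m.str), quads);
```

Because the match spans two chunks, the result holds two quads (16 numbers). A match that continues onto the next line would simply add another quad at that line's coordinates.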

3. get_pdf_highlights

Purpose: To read a PDF, find all existing highlight annotations, and extract the actual text content that is visually underneath them.

Step-by-Step:

  1. The tool loads the specified PDF and, for each page, fetches two separate sets of information: (a) the list of all text chunks with their coordinates, and (b) the list of all highlight annotations with their bounding box coordinates (rect).
  2. It then iterates through each highlight annotation found.
  3. For a single highlight, the tool performs a spatial search: it goes through every text chunk on the page and checks if the center of that chunk falls within the highlight's bounding box.
  4. All text chunks that are inside the box are collected into a temporary list.
  5. To ensure the text is in a natural reading order, these collected chunks are sorted based on their vertical and then horizontal coordinates.
  6. The sorted text chunks are joined together to reconstruct the full, human-readable highlighted phrase or sentence.
  7. Finally, the tool returns a clean JSON object for each highlight, containing the page number, the reconstructed text, and the highlight's original coordinates.
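
The spatial search, sort, and join described above can be sketched as follows. This is a simplified illustration, not the project's code: `textUnderHighlight` is a hypothetical name, the chunk center is estimated from the transform origin plus half the width and height, and chunks are joined with a single space, which is an assumption about how the real tool separates them.

```javascript
// rect is [x1, y1, x2, y2] in PDF user space, where y grows upward.
function textUnderHighlight(pageNum, pageItems, rect) {
  const [x1, y1, x2, y2] = rect;
  const left = Math.min(x1, x2);
  const right = Math.max(x1, x2);
  const bottom = Math.min(y1, y2);
  const top = Math.max(y1, y2);

  // Keep chunks whose center point falls inside the highlight rectangle.
  const inside = pageItems.filter(({ transform, width, height }) => {
    const cx = transform[4] + width / 2;
    const cy = transform[5] + height / 2;
    return cx >= left && cx <= right && cy >= bottom && cy <= top;
  });

  // Reading order: higher y first (top of page), then left to right.
  inside.sort(
    (a, b) => b.transform[5] - a.transform[5] || a.transform[4] - b.transform[4]
  );

  const text = inside.map((i) => i.str).join(" ").replace(/\s+/g, " ").trim();
  return { pageNum, text, rect };
}

const items = [
  { str: "jumps", transform: [1, 0, 0, 1, 72, 686], width: 30, height: 12 },
  { str: "brown fox", transform: [1, 0, 0, 1, 132, 700], width: 50, height: 12 },
  { str: "Footer", transform: [1, 0, 0, 1, 72, 40], width: 40, height: 10 },
  { str: "The quick", transform: [1, 0, 0, 1, 72, 700], width: 55, height: 12 },
];

const result = textUnderHighlight(1, items, [70, 684, 200, 714]);
console.log(JSON.stringify(result));
```

Testing the chunk center rather than its full box tolerates highlights that are slightly smaller or larger than the text they cover, at the cost of occasionally missing a chunk that is only partly highlighted.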

Usage

This project is designed to be used programmatically via the provided MCP (Model Context Protocol) server, which exposes the PDF tools to an AI agent or other client.

1. Running the MCP Server

There are two main ways to run the server:

A) Manually (for testing):

You can run the server directly from your terminal. This is useful for quick tests.

node mcp/mcp-server.js

B) Via a Client Configuration (Recommended):

For integration with an AI agent (Claude Code, Gemini CLI, Cline, Cursor, etc.), you configure the client to manage the server process for you. Add the following to your client's configuration file (e.g., .gemini/settings.json):

If you cloned this repo:

{
  "mcpServers": {
    "pdf-highlighter": {
      "command": "node",
      "args": [
        "/absolute/path/to/pdf-highlighter/mcp/mcp-server.js"
      ]
    }
  }
}

Note: You must provide the full, absolute path to mcp-server.js.

npx ...

Once configured, the client will automatically start the server, and you should see the tools become available:

ℹ Configured MCP servers:
 
  🟢 pdf-highlighter - Ready (3 tools)
    - extract_pdf_text
    - highlight_pdf
    - get_pdf_highlights

2. Emulating Tool Use

To emulate how these tool calls work, you can run the following "task scripts":

  1. Place Your PDF: Add a PDF file to the project's root directory.
  2. Run a Task Script: Open a new terminal and run one of the scripts from the tasks/ directory. Each script emulates a specific tool call:
    • node tasks/task1.js --> Calls extract_pdf_text
    • node tasks/task2.js --> Calls highlight_pdf
    • node tasks/task3.js --> Calls get_pdf_highlights

To run a task, execute it in a separate terminal:

node tasks/task3.js

  3. Check the Output: Each script saves its result to a file so you can see exactly what the tool would return to the AI agent or user. For example, task3.js creates tasks/task3-output.json.
