This project provides a Node.js script to programmatically find and highlight text within a PDF document. It uses pdf-lib to create the highlight annotations and pdfjs-dist to parse the PDF and find the text coordinates.
- Finds and highlights specific text in a PDF.
- Creates proper, machine-readable "Highlight" annotations.
- Can extract existing highlight annotations from a PDF.
- Node.js
- npm
- Clone the repository:
git clone <repository-url>
- Navigate to the project directory:
cd pdf-highlighter
- Install the dependencies:
npm install
This project revolves around three core functions, each serving a distinct purpose in the PDF processing pipeline. They are designed to be used in sequence: first extracting text, then using that text to add highlights, and finally retrieving the content of those highlights.
Purpose: To read a PDF and extract all of its text into a structured format that includes precise location data for every word or text chunk.
Step-by-Step:
- The tool loads the raw PDF file using the `pdf.js` library.
- It iterates through the document page by page.
- On each page, it calls `page.getTextContent()` to get a list of every individual "text chunk". Each chunk has a `str` that can be a line of text, a single word, or even a fragment (sometimes an empty string, `""`), and it comes with its coordinates (`transform`), `width`, and `height`.
- The tool collects all these text items from all pages into a single, large array. It adds a `pageNum` property to each item to track its location.
- This complete, detailed array is then cached in memory on the server. This is a critical step, as the `highlight_pdf` tool relies on this cache.
- Finally, a lightweight version of this data (containing only `id`, `pageNum`, and the text string `str`) is sent back to the user or AI agent.
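
The steps above can be sketched as plain data handling. This is a minimal sketch, not the project's actual code: it assumes each page's items carry the `str`, `transform`, `width`, and `height` fields that pdf.js returns from `page.getTextContent()`. The names `pdfCache`, `flattenTextItems`, and `toLightweight` are hypothetical, and so is the id scheme (position in the flat array).

```javascript
// Hypothetical in-memory cache, keyed by the PDF's path.
const pdfCache = new Map();

// pages[i] holds the text items of page i + 1, in the shape
// pdf.js returns from page.getTextContent().items.
function flattenTextItems(pages) {
  const allItems = [];
  pages.forEach((items, pageIndex) => {
    for (const item of items) {
      allItems.push({
        id: allItems.length,    // assumed id scheme: index in the flat array
        pageNum: pageIndex + 1, // PDF pages are 1-based
        str: item.str,
        transform: item.transform,
        width: item.width,
        height: item.height,
      });
    }
  });
  return allItems;
}

// Strip the geometry so only id, pageNum, and str go back to the caller.
function toLightweight(allItems) {
  return allItems.map(({ id, pageNum, str }) => ({ id, pageNum, str }));
}

// Two hand-written pages standing in for real pdf.js output.
const samplePages = [
  [{ str: 'Hello', transform: [1, 0, 0, 1, 72, 700], width: 30, height: 12 }],
  [{ str: 'World', transform: [1, 0, 0, 1, 72, 650], width: 32, height: 12 }],
];
const detailedItems = flattenTextItems(samplePages);
pdfCache.set('sample.pdf', detailedItems); // full data stays on the server
console.log(toLightweight(detailedItems));
```

Keeping the geometry in the cache and returning only the text keeps the response small while still letting a later call look up coordinates by `id`.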
Purpose: To add machine-readable highlight annotations to a PDF based on a list of exact text strings.
Step-by-Step:
- The tool receives an array of strings (`textsToHighlight`) that the user wants to find and highlight.
- It retrieves the full, detailed text data from the in-memory cache created by `extract_pdf_text`. If the cache is empty for that PDF, the tool will fail and direct the user: `PDF data for ${inputPath} not found in cache. Please run 'extract_pdf_text' first.`
- To find the text, it iterates through the cached items on a per-page basis. It joins the text chunks on a page into a single string to search for the `textsToHighlight`.
- When a match is found, the tool identifies which of the original, smaller text chunks correspond to the matched string.
- Using the precise coordinates of these identified chunks, it calculates the `QuadPoints` for the highlight. `QuadPoints` allow for highly accurate highlights that can wrap around line breaks.
- It then uses the `pdf-lib` library to create a new highlight annotation object in memory with these `QuadPoints`.
- After creating all requested highlights, the tool saves the modified PDF document to the specified output path.
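
The matching and `QuadPoints` steps above can be illustrated with two small functions. This is a hedged sketch rather than the project's implementation. It assumes chunks are joined with no separator, that text is unrotated so `transform[4]` and `transform[5]` give the chunk's lower-left corner, and that each quad is ordered top-left, top-right, bottom-left, bottom-right (the order many viewers expect). The names `findChunksForText` and `quadPointsFor` are hypothetical.

```javascript
// Join a page's chunks into one string, remembering each chunk's character range,
// then return the chunks whose range overlaps the first match of `target`.
function findChunksForText(pageItems, target) {
  let joined = '';
  const spans = pageItems.map((item) => {
    const start = joined.length;
    joined += item.str; // assumption: no separator between chunks
    return { item, start, end: joined.length };
  });
  const matchStart = joined.indexOf(target);
  if (matchStart === -1) return [];
  const matchEnd = matchStart + target.length;
  return spans
    .filter((s) => s.end > s.start && s.start < matchEnd && s.end > matchStart)
    .map((s) => s.item);
}

// One quad (8 numbers) per chunk; assumes unrotated text and ignores descenders.
function quadPointsFor(item) {
  const x = item.transform[4];
  const y = item.transform[5];
  const { width, height } = item;
  return [x, y + height, x + width, y + height, x, y, x + width, y];
}

// A phrase that wraps across two lines, as hand-written sample chunks.
const lineItems = [
  { str: 'The quick ', transform: [1, 0, 0, 1, 72, 700], width: 60, height: 12 },
  { str: 'brown fox', transform: [1, 0, 0, 1, 72, 686], width: 55, height: 12 },
  { str: ' jumps', transform: [1, 0, 0, 1, 127, 686], width: 30, height: 12 },
];
const matchedChunks = findChunksForText(lineItems, 'quick brown');
const quadPoints = matchedChunks.flatMap((item) => quadPointsFor(item));
console.log(quadPoints);
```

Note that this sketch highlights whole chunks, so a match that covers only part of a chunk still marks the full chunk. Emitting one quad per chunk is what lets a single annotation follow a phrase across a line break.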
Purpose: To read a PDF, find all existing highlight annotations, and extract the actual text content that is visually underneath them.
Step-by-Step:
- The tool loads the specified PDF and, for each page, fetches two separate sets of information: (a) the list of all text chunks with their coordinates, and (b) the list of all highlight annotations with their bounding box coordinates (`rect`).
- It then iterates through each highlight annotation found.
- For a single highlight, the tool performs a spatial search: it goes through every text chunk on the page and checks if the center of that chunk falls within the highlight's bounding box.
- All text chunks that are inside the box are collected into a temporary list.
- To ensure the text is in a natural reading order, these collected chunks are sorted based on their vertical and then horizontal coordinates.
- The sorted text chunks are joined together to reconstruct the full, human-readable highlighted phrase or sentence.
- Finally, the tool returns a clean JSON object for each highlight, containing the page number, the reconstructed text, and the highlight's original coordinates.
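
The spatial search and sorting described above might look like the following. This is a minimal sketch under stated assumptions: `rect` is `[x1, y1, x2, y2]` in PDF user space (y grows upward), text is unrotated, chunks on the same line share an identical baseline, and chunks are joined with a single space. The function name `textUnderHighlight` and the result field names are hypothetical.

```javascript
// Collect the chunks whose center lies inside the highlight's rect,
// order them top-to-bottom then left-to-right, and join their text.
function textUnderHighlight(pageItems, rect) {
  const [x1, y1, x2, y2] = rect;
  const left = Math.min(x1, x2);
  const right = Math.max(x1, x2);
  const bottom = Math.min(y1, y2);
  const top = Math.max(y1, y2);

  const inside = pageItems.filter((item) => {
    const cx = item.transform[4] + item.width / 2;
    const cy = item.transform[5] + item.height / 2;
    return cx >= left && cx <= right && cy >= bottom && cy <= top;
  });

  // Higher y comes first (PDF origin is bottom-left); ties break on x.
  inside.sort(
    (a, b) => b.transform[5] - a.transform[5] || a.transform[4] - b.transform[4]
  );
  return inside.map((item) => item.str).join(' '); // assumed separator
}

// Hand-written chunks, deliberately out of reading order.
const highlightPageItems = [
  { str: 'fox', transform: [1, 0, 0, 1, 150, 700], width: 20, height: 12 },
  { str: 'quick', transform: [1, 0, 0, 1, 72, 700], width: 30, height: 12 },
  { str: 'jumps', transform: [1, 0, 0, 1, 72, 686], width: 30, height: 12 },
  { str: 'outside', transform: [1, 0, 0, 1, 72, 600], width: 40, height: 12 },
];
const highlightRect = [70, 684, 200, 714];
const highlightResult = {
  pageNum: 1, // hypothetical field names for the returned object
  text: textUnderHighlight(highlightPageItems, highlightRect),
  rect: highlightRect,
};
console.log(JSON.stringify(highlightResult));
```

Real PDFs often place chunks on the same visual line at slightly different baselines, so a production version would likely compare y values with a small tolerance instead of exact equality.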
This project is designed to be used programmatically via the provided MCP (Model Context Protocol) server, which exposes the PDF tools to an AI agent or other client.
There are two main ways to run the server:
A) Manually (for testing):
You can run the server directly from your terminal. This is useful for quick tests.
node mcp/mcp-server.js
B) Via a Client Configuration (Recommended):
For integration with an AI agent (Claude Code, Gemini CLI, Cline, Cursor, etc.), you configure the client to manage the server process for you. Add the following to your client's configuration file (e.g., .gemini/settings.json):
- If you cloned this repo:
{
"mcpServers": {
"pdf-highlighter": {
"command": "node",
"args": [
"/absolute/path/to/pdf-highlighter/mcp/mcp-server.js"
]
}
}
}
Note: You must provide the full, absolute path to mcp-server.js.
- From mcpservers.org/ or mcp.so/:
npx ...
Once configured, the client will automatically start the server, and you should see the tools become available:
ℹ Configured MCP servers:
🟢 pdf-highlighter - Ready (3 tools)
- extract_pdf_text
- highlight_pdf
- get_pdf_highlights
To emulate how these tool calls work, you can run the following "task scripts":
- Place Your PDF: Add a PDF file to the project's root directory.
- Run a Task Script: Open a new terminal and run one of the scripts from the `tasks/` directory. Each script emulates a specific tool call:
  - `node tasks/task1.js` --> Calls `extract_pdf_text`
  - `node tasks/task2.js` --> Calls `highlight_pdf`
  - `node tasks/task3.js` --> Calls `get_pdf_highlights`

  To run a task, execute it in a separate terminal:
  node tasks/task3.js
- Check the Output: Each script saves its result to a file so you can see what the AI agent or user would get back from the tool. For example, `task3.js` will create `tasks/task3-output.json`.