By: Eden Oved
TalkToDoc is a local AI-powered system that lets you "talk with your documents" - just like ChatGPT, but fully local and private.
It ingests PDF and Excel files, stores the text in a PostgreSQL database, builds a searchable TF-IDF index, and extracts structured project data into clean JSON files - including project timelines, milestones, contacts, keywords, and summaries.
The system minimizes LLM usage (under $3 total), caches all calls locally, and outputs both extracted answers and references to the source documents.
- Ingest all documents under `data/` → stored in PostgreSQL + `artifacts/pages.jsonl`
- Build a TF-IDF index → `artifacts/tfidf.pkl` + DB table `page_vectors`
- Group documents by project ID
- Pre-filter relevant pages per query
- Extract structured fields using OpenAI (`LLMClient`)
- Cache all LLM calls & track token cost → `outputs/cost_log.jsonl`
- Output per-project structured summaries
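The indexing and pre-filtering steps above can be sketched roughly like this. This is a minimal illustration using scikit-learn, not the project's actual code: the real `build-index` step reads pages from PostgreSQL and also persists vectors to the `page_vectors` table, and the sample pages and file name here are made-up placeholders.

```python
# Illustrative sketch of TF-IDF indexing + query-time pre-filtering.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder page texts (the real pipeline loads these from the DB /
# artifacts/pages.jsonl).
pages = [
    "Project PRJ-001 start date: March 2024.",
    "Milestone M2: delivery of the first prototype.",
    "Contact: John Smith, project manager.",
]

# Build the index: one TF-IDF vector per page.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(pages)

# Persist vectorizer + matrix, analogous to artifacts/tfidf.pkl.
with open("tfidf.pkl", "wb") as f:
    pickle.dump((vectorizer, matrix), f)

# Pre-filter: rank pages by cosine similarity to a natural-language query.
query_vec = vectorizer.transform(["project start date"])
scores = cosine_similarity(query_vec, matrix).ravel()
best = int(scores.argmax())
print(best, pages[best])
```

Ranking by cosine similarity lets the extraction step send only the top-scoring pages to the LLM, which is what keeps total API cost low.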
- Python 3.11
- PostgreSQL (via Docker)
- Typer (CLI)
- pandas
- PyMuPDF
- scikit-learn
- psycopg2
- openai
- Docker + docker-compose
- Copy `.env.example` → `.env` and set:

```
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
TOKEN_BUDGET_DOLLARS=3.0
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=talk_to_doc
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

- Place your project folders and files under `./data/`
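Inside the container these values arrive as environment variables. A hedged sketch of how the app might read them — the variable names come from `.env.example` above, but the `load_settings` helper is an illustrative assumption, not the project's actual code:

```python
# Sketch: reading settings from the environment (helper name is hypothetical).
import os

def load_settings() -> dict:
    # Defaults mirror .env.example so the app can start without a .env file.
    return {
        "openai_model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "token_budget": float(os.getenv("TOKEN_BUDGET_DOLLARS", "3.0")),
        "db_host": os.getenv("POSTGRES_HOST", "db"),
        "db_port": int(os.getenv("POSTGRES_PORT", "5432")),
        "db_name": os.getenv("POSTGRES_DB", "talk_to_doc"),
    }

settings = load_settings()
print(settings["openai_model"], settings["db_host"])
```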
```
docker compose up -d --build
docker compose exec app python app.py ingest
docker compose exec app python app.py build-index
docker compose exec app python app.py extract
```

You can ask questions in natural language, and the system will return relevant pages + evidence from your documents:

```
docker compose exec app python app.py query --q "project start date"
```

To remove all files under `artifacts/` and `outputs/` for a clean rerun:

```
docker compose exec app python app.py reset
```

To run automated pipeline tests (reset → ingest → index → extract → query):
```
docker compose exec app pytest -q
```

| File / Folder | Description |
|---|---|
| `artifacts/pages.jsonl` | Extracted text chunks |
| `artifacts/tfidf.pkl` | TF-IDF index + vectorizer |
| `artifacts/cache/` | Local cache of LLM responses |
| `outputs/index.jsonl` | Search index (debug/inspection) |
| `outputs/manifest.jsonl` | Summary of ingested documents |
| `outputs/PRJ-*_key_params.json` | Per-project metadata extracted by the LLM |
| `outputs/cost_log.jsonl` | Token usage + LLM cost log |
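The caching and cost-logging behavior behind `artifacts/cache/` and `outputs/cost_log.jsonl` can be sketched as follows. This is a simplified illustration, not the project's actual `LLMClient` API: the function names, cache layout, and per-token price are assumptions.

```python
# Sketch of LLM response caching + cost logging (names/paths are illustrative).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache")          # stands in for artifacts/cache/
COST_LOG = Path("cost_log.jsonl")  # stands in for outputs/cost_log.jsonl

def cached_call(prompt: str, call_llm) -> str:
    """Return a cached response if present; otherwise call the LLM,
    cache the result, and append token/cost info to the log."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["response"]

    response, tokens = call_llm(prompt)  # e.g. an OpenAI chat completion
    cache_file.write_text(json.dumps({"prompt": prompt, "response": response}))
    cost = tokens * 0.15 / 1_000_000  # assumed gpt-4o-mini-style input pricing
    with COST_LOG.open("a") as f:
        f.write(json.dumps({"tokens": tokens, "cost_usd": cost}) + "\n")
    return response

# Usage with a stubbed model so the sketch runs offline:
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}", 42

first = cached_call("project start date?", fake_llm)
second = cached_call("project start date?", fake_llm)  # served from cache
```

Because the cache key is a hash of the prompt, re-running `extract` over unchanged documents repeats no paid API calls, which is how the pipeline stays under its token budget.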
Built by Eden Oved
AI-driven document indexing and extraction project.