By: Eden Oved
TalkToDoc is a local AI-powered system that lets you "talk with your documents" - just like ChatGPT, but fully local and private.
It ingests PDF and Excel files, stores the text in a PostgreSQL database, builds a searchable TF-IDF index, and extracts structured project data into clean JSON files - including project timelines, milestones, contacts, keywords, and summaries.
The system minimizes LLM usage (under $3 total), caches all calls locally, and outputs both extracted answers and references to the source documents.
- Ingest all documents under `data/` → stored in PostgreSQL + `artifacts/pages.jsonl`
- Build a TF-IDF index → `artifacts/tfidf.pkl` + DB table `page_vectors`
- Group documents by project ID
- Pre-filter relevant pages per query
- Extract structured fields using OpenAI (`LLMClient`)
- Cache all LLM calls & track token cost → `outputs/cost_log.jsonl`
- Output per-project structured summaries
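The indexing and pre-filtering steps above can be sketched roughly like this. This is a minimal illustration using scikit-learn, not the project's actual code: the real `build-index` step reads pages from PostgreSQL and also persists vectors to the `page_vectors` table, and the sample pages and file name here are made-up placeholders.

```python
# Illustrative sketch of TF-IDF indexing + query-time pre-filtering.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder page texts (the real pipeline loads these from the DB /
# artifacts/pages.jsonl).
pages = [
    "Project PRJ-001 start date: March 2024.",
    "Milestone M2: delivery of the first prototype.",
    "Contact: John Smith, project manager.",
]

# Build the index: one TF-IDF vector per page.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(pages)

# Persist vectorizer + matrix, analogous to artifacts/tfidf.pkl.
with open("tfidf.pkl", "wb") as f:
    pickle.dump((vectorizer, matrix), f)

# Pre-filter: rank pages by cosine similarity to a natural-language query.
query_vec = vectorizer.transform(["project start date"])
scores = cosine_similarity(query_vec, matrix).ravel()
best = int(scores.argmax())
print(best, pages[best])
```

Ranking by cosine similarity lets the extraction step send only the top-scoring pages to the LLM, which is what keeps total API cost low.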
- Python 3.11
- PostgreSQL (via Docker)
- Typer (CLI)
- pandas
- PyMuPDF
- scikit-learn
- psycopg2
- openai
- Docker + docker-compose
- Copy `.env.example` → `.env` and set:

```
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
TOKEN_BUDGET_DOLLARS=3.0
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=talk_to_doc
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

- Place your project folders and files under `./data/`
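Inside the container these values arrive as environment variables. A hedged sketch of how the app might read them — the variable names come from `.env.example` above, but the `load_settings` helper is an illustrative assumption, not the project's actual code:

```python
# Sketch: reading settings from the environment (helper name is hypothetical).
import os

def load_settings() -> dict:
    # Defaults mirror .env.example so the app can start without a .env file.
    return {
        "openai_model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "token_budget": float(os.getenv("TOKEN_BUDGET_DOLLARS", "3.0")),
        "db_host": os.getenv("POSTGRES_HOST", "db"),
        "db_port": int(os.getenv("POSTGRES_PORT", "5432")),
        "db_name": os.getenv("POSTGRES_DB", "talk_to_doc"),
    }

settings = load_settings()
print(settings["openai_model"], settings["db_host"])
```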
```
docker compose up -d --build
docker compose exec app python app.py ingest
docker compose exec app python app.py build-index
docker compose exec app python app.py extract
```

You can ask questions in natural language, and the system will return relevant pages + evidence from your documents:

```
docker compose exec app python app.py query --q "project start date"
```

To remove all files under `artifacts/` and `outputs/` for a clean rerun:

```
docker compose exec app python app.py reset
```

To run automated pipeline tests (reset → ingest → index → extract → query):
```
docker compose exec app pytest -q
```

| File / Folder | Description |
|---|---|
| `artifacts/pages.jsonl` | Extracted text chunks |
| `artifacts/tfidf.pkl` | TF-IDF index + vectorizer |
| `artifacts/cache/` | Local cache of LLM responses |
| `outputs/index.jsonl` | Search index (debug/inspection) |
| `outputs/manifest.jsonl` | Summary of ingested documents |
| `outputs/PRJ-*_key_params.json` | Per-project metadata extracted by the LLM |
| `outputs/cost_log.jsonl` | Token usage + LLM cost log |
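The caching and cost-logging behavior behind `artifacts/cache/` and `outputs/cost_log.jsonl` can be sketched as follows. This is a simplified illustration, not the project's actual `LLMClient` API: the function names, cache layout, and per-token price are assumptions.

```python
# Sketch of LLM response caching + cost logging (names/paths are illustrative).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache")          # stands in for artifacts/cache/
COST_LOG = Path("cost_log.jsonl")  # stands in for outputs/cost_log.jsonl

def cached_call(prompt: str, call_llm) -> str:
    """Return a cached response if present; otherwise call the LLM,
    cache the result, and append token/cost info to the log."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["response"]

    response, tokens = call_llm(prompt)  # e.g. an OpenAI chat completion
    cache_file.write_text(json.dumps({"prompt": prompt, "response": response}))
    cost = tokens * 0.15 / 1_000_000  # assumed gpt-4o-mini-style input pricing
    with COST_LOG.open("a") as f:
        f.write(json.dumps({"tokens": tokens, "cost_usd": cost}) + "\n")
    return response

# Usage with a stubbed model so the sketch runs offline:
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}", 42

first = cached_call("project start date?", fake_llm)
second = cached_call("project start date?", fake_llm)  # served from cache
```

Because the cache key is a hash of the prompt, re-running `extract` over unchanged documents repeats no paid API calls, which is how the pipeline stays under its token budget.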
Built by Eden Oved
AI-driven document indexing and extraction project.