Turn messy documents into clean, structured output. The dad test: if he can use it, anyone can.
Phase 1 Goal: Prove the engine works
- Input: Messy document (meeting notes, contracts, research, etc.)
- Output: Cleaned, structured document (markdown or JSON)
# Clean up a document from stdin
cat messy-notes.txt | python cleanup.py
# From a file
python cleanup.py --file meeting-notes.txt
# Output as JSON
python cleanup.py --file notes.txt --format json

- Extract text from various formats (txt, md, pdf, docx, html)
- Identify document structure (headings, lists, paragraphs)
- Fix formatting issues (spacing, bullets, numbering)
- Generate clean, consistent output
- Optionally extract metadata (dates, names, action items)
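
The formatting-fix step in the list above (spacing, bullets, numbering) can be approximated with plain rules. The sketch below is an illustration only: `normalize_formatting` and both regexes are hypothetical names that do not come from `cleanup.py`, and the real tool may delegate this work to the Gemini API instead.

```python
import re

# Hypothetical helper, not taken from cleanup.py.
BULLET_RE = re.compile(r"^(\s*)[*•·]\s+")       # non-standard bullet markers
NUMBER_RE = re.compile(r"^(\s*)(\d+)[.)]\s+")   # "1)" or "1." style numbering


def normalize_formatting(text: str) -> str:
    """Normalize bullets, numbering, and blank-line spacing."""
    cleaned = []
    blank_run = 0
    for raw in text.splitlines():
        line = raw.rstrip()              # drop trailing whitespace
        if not line:
            blank_run += 1
            if blank_run == 1:           # keep one blank line per run
                cleaned.append("")
            continue
        blank_run = 0
        line = BULLET_RE.sub(r"\1- ", line)      # "* item" becomes "- item"
        line = NUMBER_RE.sub(r"\1\2. ", line)    # "1) item" becomes "1. item"
        cleaned.append(line)
    return "\n".join(cleaned).strip() + "\n"


if __name__ == "__main__":
    print(normalize_formatting("* one\n\n\n•   two  \n1) three\n"), end="")
```

Indentation is preserved, so nested lists keep their depth. Rules like these only cover mechanical cleanup; identifying headings or sections in free-form text needs more than regexes.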
Input: Rambling meeting notes with inconsistent formatting
Output: Structured summary with attendees, decisions, action items
Input: Scanned contract with OCR errors
Output: Clean text with sections properly identified
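
For the meeting-notes example, one way to hold the structured summary and render it as either markdown or JSON is sketched below. The `MeetingSummary` class and its field names are assumptions made for illustration; they are not the tool's actual output schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MeetingSummary:
    # Field names are illustrative assumptions, not the tool's real schema.
    attendees: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the summary as indented JSON."""
        return json.dumps(asdict(self), indent=2)

    def to_markdown(self) -> str:
        """Render each field as a heading followed by a bullet list."""
        sections = [
            ("Attendees", self.attendees),
            ("Decisions", self.decisions),
            ("Action items", self.action_items),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Made-up sample data, for demonstration only.
    summary = MeetingSummary(
        attendees=["Ana"],
        decisions=["Ship v1"],
        action_items=["Ana: draft release notes"],
    )
    print(summary.to_markdown())
    print(summary.to_json())
```

Keeping a single in-memory structure and rendering it two ways means the markdown and JSON outputs cannot drift apart.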
- Python 3.11+
- google-genai (Gemini API)
- python-docx (Word docs)
- PyMuPDF (PDFs)
- beautifulsoup4 (HTML)
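
As a rough sketch of how these libraries could divide up text extraction, the function below picks a reader by file extension. `extract_text` is a hypothetical name, not the tool's actual code, and the third-party imports are deferred so that plain-text files work even when PyMuPDF, python-docx, or beautifulsoup4 are not installed.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Return plain text from a file, choosing a reader by extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".txt", ".md"}:
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if suffix == ".docx":
        from docx import Document  # python-docx
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix in {".html", ".htm"}:
        from bs4 import BeautifulSoup  # beautifulsoup4
        html = Path(path).read_text(encoding="utf-8")
        return BeautifulSoup(html, "html.parser").get_text(separator="\n")
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

Note that `page.get_text()` only returns text already embedded in the PDF. A scanned contract without a text layer would need a separate OCR step, which this sketch does not cover.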
This is one tool in the Reify Studio collection — AI tools that feed into your personal knowledge vault.