parsing

only for advanced users

PDF to TXT converter ready to chunk for your RAG

WINDOWS ONLY
EXE and Python script available (English)

⇨ give me a ❤️, if you like ;)

newest: PDF Parser - Sevenof9_v7d.py (exe on huggingface, see below)

Most LLM applications convert your PDF to plain txt and nothing more, as if you had saved the PDF as a txt file. Text blocks often get mixed up and tables become unreadable. Therefore it is better to convert it with the help of a parser.
I work with "pdfplumber/pdfminer" without OCR, so it is very fast!
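To give an idea of what parsing with pdfplumber can look like, here is a minimal sketch. It is not the code from my parser: the function names `render_page` and `pdf_to_text` and the `=== PAGE n ===` marker are made up for this example. It only uses the pdfplumber calls `pdfplumber.open`, `page.extract_text()` and `page.extract_tables()`.

```python
import json


def render_page(page_number, text, tables):
    """Build the txt output for one page: page marker, text, then tables as JSON.

    The marker format is an assumption for this sketch.
    """
    parts = [f"=== PAGE {page_number} ===", (text or "").strip()]
    for table in tables:
        # A table is a list of rows, a row is a list of cells (str or None).
        cleaned = [[(cell or "").strip() for cell in row] for row in table]
        parts.append(json.dumps(cleaned, ensure_ascii=False))
    # Drop empty parts, e.g. a page without any extractable text.
    return "\n".join(part for part in parts if part)


def pdf_to_text(pdf_path):
    """Extract text and tables from every page of a PDF, without OCR."""
    import pdfplumber  # third-party; imported here so render_page works without it

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append(
                render_page(number, page.extract_text(), page.extract_tables())
            )
    return "\n\n".join(pages)
```

Calling `pdf_to_text("example.pdf")` (the file name is only a placeholder) returns one string with a marker line in front of every page, which makes later chunking by page easier.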

Works with a single PDF, a list of several PDFs, or a folder
Intelligent multiprocessing
Error tolerant: if a PDF cannot be converted, it is skipped without any special handling
Instant view of the result: click one PDF at the top of the list
Converts some common tables to JSON inside the txt file
Adds the absolute PAGE number to each page
All txt files are created in the original folder of the PDF
Existing txt files are overwritten
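The batch behaviour from the list above (folder input, parallel workers, skipping broken PDFs, txt next to the PDF, overwriting) could be sketched roughly like this. This is an illustration and not the code of my parser: `output_path`, `convert_one`, `default_extract` and `convert_folder` are hypothetical names, and a plain `ProcessPoolExecutor` stands in for whatever scheduling the real tool does.

```python
import sys
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def output_path(pdf_path):
    """The txt file sits next to the PDF: same folder, same name, .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")


def default_extract(pdf_path):
    """Plain text extraction with pdfplumber (no OCR)."""
    import pdfplumber  # third-party; only needed when real PDFs are converted

    with pdfplumber.open(pdf_path) as pdf:
        return "\n\n".join((page.extract_text() or "") for page in pdf.pages)


def convert_one(pdf_path, extract=None):
    """Convert one PDF; return the txt path, or None if the PDF was skipped."""
    try:
        text = (extract or default_extract)(pdf_path)
        target = output_path(pdf_path)
        target.write_text(text, encoding="utf-8")  # overwrites an existing file
        return target
    except Exception:
        return None  # not convertible: skip it, no special handling


def convert_folder(folder, workers=None):
    """Convert all PDFs in a folder in parallel; return the written txt paths."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(convert_one, pdfs))
    return [path for path in results if path is not None]


# The guard is required on Windows, where worker processes re-import this file.
if __name__ == "__main__" and len(sys.argv) > 1:
    for txt_file in convert_folder(sys.argv[1]):
        print(txt_file)
```

Catching every exception in `convert_one` keeps one bad PDF from stopping the whole batch, at the price of not telling you why a file was skipped.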

exe files available here:
https://huggingface.co/kalle07/pdf2txt_parser_converter
I created this with my own brain and the help of ChatGPT. I am not a coder, sorry, so I will not fulfill any wishes unless there are real errors.
It is really hard for me to handle the GUI and the functions and, on top of that, to compile it.
For the Python file you of course need to install any missing libraries.

I also have a "docling" parser with OCR (a GPU is needed for fast processing). It is only a Python file, not compiled.
You have to download all libraries, and on the first start the OCR models are also downloaded internally. At the moment I have prepared a kind of multi-docling: the number of parallel processes depends on the VRAM and on whether you use OCR only for tables or for everything. I have set VRAM = 16GB (my GPU RAM, you should set yours), and the number of parallel docling calls is VRAM/1.3, so it uses ~12GB (in my version) and processes 12 PDFs at once. Only text and tables are converted, so no images and no diagrams. For now all PDFs must be in the same folder as the Python file. If you switch to OCR for everything, the VRAM consumption rises and you have to change 1.3 to 2 or more.
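The VRAM rule above is just a division rounded down. A small sketch of that arithmetic, where the function name `parallel_docling_processes` and its parameter names are made up for this example and are not taken from the script:

```python
def parallel_docling_processes(vram_gb, gb_per_process=1.3):
    """Number of parallel docling processes for a given amount of VRAM.

    gb_per_process is the divisor from the text: 1.3 when OCR is used only
    for tables, 2 or more when OCR is used for everything.
    """
    if vram_gb <= 0 or gb_per_process <= 0:
        raise ValueError("vram_gb and gb_per_process must be positive")
    # Round down, but always run at least one process.
    return max(1, int(vram_gb / gb_per_process))


print(parallel_docling_processes(16))       # 16 / 1.3 = 12.3 -> 12
print(parallel_docling_processes(16, 2.0))  # 16 / 2.0 = 8.0  -> 8
```

With 16GB this gives 12 processes for the divisor 1.3 and 8 processes for the divisor 2.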

now have fun and leave a comment if you like ;)
on discord "sevenof9"
my embedder collection:
https://huggingface.co/kalle07/embedder_collection

I am not responsible for any errors or crashes on your system. If you use it, you take full responsibility!

Files in this repository:
PDF Parser - Sevenof9_v7d.py
README.md
docling_by_sevenof9_v1.py
parser_sevenof9_v1_1_en.py
parser_sevenof9_v1_de.py
parser_sevenof9_v1_en.py
parser_sevenof9_v2_en.py
