Add Tesseract training setup scripts and example data #339
TwoAbove left a comment
Looks good!
Left a couple of minor comments.
I also have a question about dataScripts/tessTrain/example_truth/97984949.png and similar ones. Would the extra M hinder the training in any way?
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
sudo apt-get install libpango1.0-dev libleptonica-dev
I think it would make sense to extract this into the README as a ## training tesseract section.
I would split this script into two parts - a setup.sh script (also mention it in the README in the setup instructions) and a train.sh script that takes in a ground truth path.
Totally, I had the same thought. One will likely run once, while the other may need many runs.
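A minimal sketch of how the proposed train.sh could validate its ground truth path argument before doing any work. The function names (`resolve_ground_truth`, `train_main`) and the usage text are hypothetical and not taken from the PR; the actual training step is left as a placeholder.

```shell
#!/usr/bin/env bash
# train.sh (hypothetical sketch): validate the ground truth path before training.

# Print a short usage message to stderr.
usage() {
  echo "usage: train.sh <ground-truth-dir>" >&2
}

# Validate the directory passed as $1 and print its absolute path.
# Returns 2 when the argument is missing, 1 when it is not a directory.
resolve_ground_truth() {
  local dir="${1:-}"
  if [ -z "$dir" ]; then
    usage
    return 2
  fi
  if [ ! -d "$dir" ]; then
    echo "error: '$dir' is not a directory" >&2
    return 1
  fi
  (cd "$dir" && pwd)
}

# Entry point: resolve the path, then hand off to the training step.
train_main() {
  local gt
  gt="$(resolve_ground_truth "$@")" || return $?
  echo "Training with ground truth in: $gt"
  # ... the actual training command would go here ...
}
```

A real script would end with `train_main "$@"`. Keeping the one-time installs in setup.sh means repeated training runs never touch `apt-get`.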
greentext "Installing Deps and Creating File Structure"

# Don't pollute the directory
mkdir -p ./tess
Since this script creates artifacts, we'll need to add them to a .gitignore file. Ideally, we would keep the sole .gitignore so it's consolidated in one place.
Good callout. I'll work out what the new entries need to be.
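One way to keep a single consolidated .gitignore is to have the setup script append its entries idempotently. The sketch below assumes a helper named `add_ignore_entry` (hypothetical); the entry in the usage comment is only a guess at where the `./tess` directory ends up relative to the repository root.

```shell
# Append an entry to a .gitignore file only if that exact line is not already present.
add_ignore_entry() {
  local ignore_file="$1" entry="$2"
  touch "$ignore_file"
  # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
  if ! grep -qxF -- "$entry" "$ignore_file"; then
    printf '%s\n' "$entry" >> "$ignore_file"
  fi
}

# Example (assumed path, adjust to the real layout):
#   add_ignore_entry .gitignore "dataScripts/tessTrain/tess/"
```

Because the check matches whole lines, re-running the setup script does not duplicate entries.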
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev

greentext "Pulling the required ENG traineddata from github"
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata
Is there a way to avoid populating paths outside of noitool? It would be great if this were confined to this directory. It looks like you can use the TESSDATA variable to make it local: https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#train
Yes, I think that's a great idea, will incorporate.
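A rough sketch of what keeping the traineddata local could look like, without `sudo mv` into `/usr/local/share/tessdata`. The directory `./tess/tessdata` and both function names are assumptions for illustration; `TESSDATA_PREFIX` is the environment variable the tesseract CLI reads to locate its tessdata directory.

```shell
# Directory inside the working tree that holds traineddata files (hypothetical location).
TESSDATA_DIR="${TESSDATA_DIR:-$PWD/tess/tessdata}"

# Download eng.traineddata into the local directory unless it is already there.
fetch_eng_traineddata() {
  mkdir -p "$TESSDATA_DIR"
  if [ ! -f "$TESSDATA_DIR/eng.traineddata" ]; then
    wget -O "$TESSDATA_DIR/eng.traineddata" \
      https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
  fi
}

# Point the tesseract CLI at the local directory for this shell session only.
use_local_tessdata() {
  export TESSDATA_PREFIX="$TESSDATA_DIR"
}
```

If training goes through the tesstrain Makefile, the same directory can be passed as its `TESSDATA` variable, as described in the linked README.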
Co-authored-by: Seva Maltsev <TwoAbove@users.noreply.github.com>
Short Answer: Long answer: |
Work in progress; opening for visibility.
Current status:
tessTrain/tessTrain.sh - works and sets up a baseline Ubuntu 22.04 environment (WSL, container, etc.) with the tools and binaries required for training Tesseract. It also runs an example training session with the included example training data. Documentation and sources are in comments inside the script; look there for further details for now.
tessTrain/example_truth/ - Example of what a training data directory needs to look like. Used by tessTrain.sh to confirm that setup was successful.
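As a quick sanity check on a training data directory, the sketch below lists images that have no transcript next to them. It assumes the tesstrain convention of pairing each line image (here `.png`) with a `.gt.txt` file of the same base name; whether example_truth follows exactly that layout is an assumption, and the function name is hypothetical.

```shell
# List every .png in a ground truth directory that has no matching .gt.txt transcript.
# Returns 0 when every image is paired, 1 otherwise.
find_unpaired_images() {
  local dir="$1" missing=0 img
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue   # the glob matched nothing
    if [ ! -f "${img%.png}.gt.txt" ]; then
      echo "$img"
      missing=1
    fi
  done
  return "$missing"
}

# Example: find_unpaired_images tessTrain/example_truth
```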
Ping me on Discord with any questions or comments! Thx.