Commit 40cd874

GerHobbelt authored and stweil committed
markdown formatting fix: whitespace only; headings are always followed by an empty line.
1 parent eda0552 commit 40cd874
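
The rule stated in the commit message (every heading is followed by an empty line) can be applied mechanically. The following is a minimal sketch, not the tool used for this commit: it assumes ATX-style `#` headings, optionally prefixed by a list marker as in `Planning.md`, and skips fenced code blocks. The function name `blank_line_after_headings` is hypothetical.

```python
import re

# ATX heading, optionally inside a list item ("* #### Title").
# Lines indented by four or more spaces are treated as code and ignored.
HEADING = re.compile(r"^[ ]{0,3}(?:[*+-][ ]+)?#{1,6}(?:\s|$)")
FENCE = re.compile(r"^\s*(```|~~~)")


def blank_line_after_headings(text: str) -> str:
    """Return text with an empty line inserted after each heading that lacks one."""
    lines = text.split("\n")
    out = []
    in_fence = False
    for i, line in enumerate(lines):
        out.append(line)
        if FENCE.match(line):
            # Toggle on every fence marker so '#' comments in code are untouched.
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        if HEADING.match(line):
            has_next = i + 1 < len(lines)
            if has_next and lines[i + 1].strip() != "":
                out.append("")
    return "\n".join(out)
```

Running it over a file that already follows the rule leaves the text unchanged, so the result is whitespace-only by construction, in keeping with the intent of this commit.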

9 files changed: 45 additions and 4 deletions.

Compiling-–-GitInstallation.md

Lines changed: 7 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -14,6 +14,7 @@
1414
These are the instructions for installing Tesseract from the git repository. You should be ready to face unexpected problems.
1515

1616
## Installing With Autoconf Tools
17+
1718
In order to do this; you must have automake, libtool, leptonica, make and pkg-config installed. In addition, you need Git and a C++ compiler.
1819

1920
On Debian or Ubuntu, you can probably install all required packages like this:
@@ -143,6 +144,7 @@ If you want to put the traineddata files in a different directory than the direc
143144
1. Place any language training data you need into this `tessdata` folder as well. For example, the English one is called `eng.traineddata`. Download it [from the tessdata repository here](https://github.com/tesseract-ocr/tessdata), and move it to your `tessdata` directory you just specified in your `TESSDATA_PREFIX` variable above.
144145

145146
### Build with TensorFlow
147+
146148
Building with TensorFlow requires additional packages for Protocol Buffers and TensorFlow.
147149
On Debian or Ubuntu, you can probably install them like this:
148150

@@ -160,6 +162,7 @@ Build support with TensorFlow is a new feature in Git master. The resulting code
160162

161163

162164
### Unit test builds
165+
163166
Such builds can be used to run the automated regression tests, which have additional requirements. This includes the additional dependencies for the training tools (as mentioned above), and downloading all git submodules, as well as the model repositories (`*.traineddata`):
164167

165168
# Clone the Tesseract source tree:
@@ -190,6 +193,7 @@ Failed tests will show prominently as segfaults or SIGILL handlers (depending on
190193

191194

192195
### Debug Builds
196+
193197
Such builds produce Tesseract binaries which run very slowly. They are not useful for production, but good to find or analyze software problems. This is a proven build sequence:
194198

195199
cd tesseract
@@ -227,6 +231,7 @@ GNU gprof is used to show the profiling information from that file.
227231

228232

229233
### Release Builds for Mass Production
234+
230235
The default build creates a Tesseract executable which is fine for processing of single images. Tesseract then uses 4 CPU cores to get an OCR result as fast as possible.
231236

232237
For mass production with hundreds or thousands of images that default is bad because the multi threaded execution has a very large overhead. It is better to run single threaded instances of Tesseract, so that every available CPU core will process a different image.
@@ -246,6 +251,7 @@ This disabled OpenMP (multi threading), does not use a shared Tesseract library
246251
disables setting of `errno` for mathematical functions (faster execution!) and enables lots of compiler warnings.
247252

248253
### Builds for fuzzing
254+
249255
Fuzzing is used to test the Tesseract API for bugs. Tesseract uses [OSS-Fuzz](https://oss-fuzz.com/),
250256
but fuzzing can also run locally. A newer Clang++ compiler is required.
251257

@@ -273,4 +279,5 @@ Example (Run the fuzzer to find new bugs):
273279
nice bin/fuzzer/fuzzer-api -jobs=16 -workers=16
274280

275281
## Building using Windows Visual Studio
282+
276283
See [Compiling for Windows](Compiling.md#windows).

Data-Files-in-tessdata_best.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ network.
1515
There are two sections below: 125 languages, followed by 37 scripts.
1616

1717
### Languages (123 + osd + eq)
18+
1819
All language and script models have the same values for the following parameters which have been removed from the
1920
individual descriptions: `int_mode=0, recoding=1, learning_rate=0.001, momentum=0.5, adam_beta=0.999 `
2021

Docker-Containers.md

Lines changed: 5 additions & 0 deletions
@@ -1,18 +1,23 @@
11
## Tesseract 4 OCR Compilation - Docker Container
2+
23
[This Github repository](https://github.com/tesseract-shadow/tesseract-ocr-compilation) contains scripts and definition of Docker container that helps to compile Tesseract 4.
34

45
Automated build Docker image: [`docker pull tesseractshadow/tesseract4cmp`](https://hub.docker.com/r/tesseractshadow/tesseract4cmp/)
56

67
## Tesseract 4 OCR Runtime Environment - Docker Container
8+
79
If you are looking for ready to use Teserract 4 Runtime Environment container (and don't want to compile it) please take look at [this Github repository](https://github.com/tesseract-shadow/tesseract-ocr-re). The repository also contains some examples of usage.
810

911
Automated build Docker image: [`docker pull tesseractshadow/tesseract4re`](https://hub.docker.com/r/tesseractshadow/tesseract4re/).
1012

1113
## Tesseract 4 OCR with OpenCV Environment - Docker Container
14+
1215
Automate build Docker Image: [`docker pull mylamour/tesseract-ocr:opencv`]
1316

1417
## Building for Android with Docker
18+
1519
[This Github repository](https://github.com/rhardih/bad/tree/master/tesseract) contains Docker images for Tesseract 4.0 and earlier.
1620

1721
## Docker - Get Started
22+
1823
If you are not familiar with Docker please read [Docker - Get Started](https://docs.docker.com/get-started/).

Examples_C++.md

Lines changed: 15 additions & 0 deletions
@@ -4,34 +4,49 @@ title: C++ API Examples
44
## C++ Examples
55

66
### Basic_example
7+
78
```
89
{% include_relative examples/Basic_example.cc %}
910
```
11+
1012
### SetRectangle_example
13+
1114
```
1215
{% include_relative examples/SetRectangle_example.cc %}
1316
```
17+
1418
### GetComponentImages_example
19+
1520
```
1621
{% include_relative examples/GetComponentImages_example.cc %}
1722
```
23+
1824
### ResultIterator_example
25+
1926
```
2027
{% include_relative examples/ResultIterator_example.cc %}
2128
```
29+
2230
### OSD_example
31+
2332
```
2433
{% include_relative examples/OSD_example.cc %}
2534
```
35+
2636
### LSTM_Choices_example
37+
2738
```
2839
{% include_relative examples/LSTM_Choices_example.cc %}
2940
```
41+
3042
### OpenCV_example
43+
3144
```
3245
{% include_relative examples/OpenCV_example.cc %}
3346
```
47+
3448
### UserPatterns_example
49+
3550
```
3651
{% include_relative examples/UserPatterns_example.cc %}
3752
```

Fonts.md

Lines changed: 2 additions & 0 deletions
@@ -108,6 +108,7 @@ The installed fonts are shown by the command `fc-list`. See also the [Debian wik
108108
* http://www.steffmann.de/wordpress/test-2/
109109

110110
#### Arabic Fonts
111+
111112
* https://fonts.google.com/?subset=arabic
112113

113114
#### Devanagari Fonts
@@ -160,6 +161,7 @@ The installed fonts are shown by the command `fc-list`. See also the [Debian wik
160161
* http://www.morscher.com/3r/fonts/fraktur.htm
161162

162163
#### Hebrew Fonts
164+
163165
* [A list of Hebrew fonts from the Open Siddur Project](http://opensiddur.org/tools/fonts/)
164166

165167
#### Collections of fonts

Planning.md

Lines changed: 9 additions & 3 deletions
@@ -89,16 +89,19 @@ Depending on available resources and opinions, these suggestions will either be
8989

9090
* #### Add option to optionally select implementation for dot product (CPU, SSE, AVX, ...)
9191

92-
* #### Relative includes for traineddata
92+
* #### Relative includes for traineddata
93+
9394
tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
9495

9596
* #### More fixes for compiler warnings and issues reported by Coverity Scan
9697

9798
* #### Add a simple bash script for building tesseract
9899

99100
* #### New traineddata format
101+
100102
In addition to the current proprietary format Tesseract could also support ZIP archives (see [discussion](https://github.com/tesseract-ocr/tesseract/pull/911)).
101-
A possible implementation using libarchive is [available](https://github.com/stweil/tesseract/tree/libarchive), but needs more testing.
103+
104+
A possible implementation using libarchive is [available](https://github.com/stweil/tesseract/tree/libarchive), but needs more testing.
102105

103106
* #### "Training light" - Learning by doing (see [issue](https://github.com/tesseract-ocr/tesseract/issues/1442))
104107

@@ -143,8 +146,11 @@ Here we collect important issues and features for the release(s) following 4.0.0
143146
This does not include OpenCL or the old Tesseract engine.
144147

145148
* #### Tesseract creates output for missing input (see [issue 1023](https://github.com/tesseract-ocr/tesseract/issues/1023)).
149+
146150
Mostly solved, but could be improved.
147151

148152

149153
* #### Issue 1353: Patch for /training/tessopt.cpp (see [pull request 13](https://github.com/tesseract-ocr/tesseract/pull/13))
150-
It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).
154+
155+
It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).
156+

README.md

Lines changed: 2 additions & 0 deletions
@@ -153,11 +153,13 @@ Please use scripts from [tesseract-ocr/tesstrain](https://github.com/tesseract-o
153153
- [Training LSTM Tesseract 5](tess5/TrainingTesseract-5.md) - based on [detailed Tesseract 4 tutorial and guide by Ray Smith](tess4/TrainingTesseract-4.00.md)
154154

155155
### Testing
156+
156157
- [Benchmarks](Benchmarks.md)
157158
- [TestingTesseract](TestingTesseract.md)
158159
- [UNLV Testing of Tesseract](UNLV-Testing-of-Tesseract.md)
159160

160161
### External Projects
162+
161163
- [AddOns](AddOns.md)
162164
- [User Projects - 3rdParty](User-Projects-–-3rdParty.md)
163165

User-Projects-–-3rdParty.md

Lines changed: 3 additions & 0 deletions
@@ -57,6 +57,7 @@
5757
* [Tesseract-OCR-iOS](https://github.com/gali8/Tesseract-OCR-iOS) - Tesseract OCR iOS is a Framework for iOS7+, compiled also for armv7s and arm64.
5858
* [OCR-iOS-Example](https://github.com/robmathews/OCR-iOS-Example) - a simple example of how to do optical character recognition (OCR) on iOS.
5959
* [Tesseract-iPhone-Demo ](https://github.com/nolanbrown/Tesseract-iPhone-Demo) - example based on tesseract 2.04.
60+
6061
* _More OS_:
6162
* [ScanBizCards](http://www.scanbizcards.com): Mobile solution for business card scanning. _Requirements:_ iPhone 4/iPhone 3/Android 2.0
6263

@@ -66,13 +67,15 @@
6667
## 4. Others (Utilities, Tools, Command-Line Interfaces [CLI], etc)
6768

6869
### A. PDF to Searchable PDF tools
70+
6971
(ie: any tool which can also handle a non-searchable PDF as an input):
7072

7173
1. [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) - Adds OCR text layer to scanned PDF files and images, allowing them to be searched. Processes pages in parallel on multi-core CPUs. Keeps exact resolution of original embedded images without recompressing JPEGs, when possible. Includes image several preprocessing options, detailed documentation, and support for many exotic PDFs.
7274
1. [pdf2pdfocr](https://github.com/LeoFCardoso/pdf2pdfocr) is a tool to OCR a PDF (or supported images) and add a text layer in the original file making it a searchable PDF. It is a python script that uses tesseract and other open source tools. Linux, macOS and Windows supported.
7375
1. [pdf2searchablepdf](https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF) - a tool which allows converting any non-searchable PDF, OR any entire directory of images, to a searchable PDF
7476

7577
### B. Others:
78+
7679
1. [ocr-fileformat](https://github.com/UB-Mannheim/ocr-fileformat) - Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)
7780
1. [Tess4J](https://github.com/nguyenq/tess4j) - A Java JNA wrapper for Tesseract OCR API.
7881
1. [Traineddata inspector](https://mazoea.com/te/traineddata/) - to inspect some of the internals of traineddata files

tess3/Training-Tesseract-3.00–3.02.md

Lines changed: 1 addition & 1 deletion
@@ -465,4 +465,4 @@ tesseract image.tif output -l [lang]
465465

466466
More options of `combine_tessdata` can be found on its [Manual Page](https://github.com/tesseract-ocr/tesseract/blob/3.02.02/doc/combine_tessdata.1.asc) or in comment of its [source code](https://github.com/tesseract-ocr/tesseract/blob/3.02.02/training/combine_tessdata.cpp#L23).
467467

468-
You can inspect some of the internals of traineddata files in 3rd party online [Traineddata inspector](https://te-traineddata-ui.herokuapp.com).
468+
You can inspect some of the internals of traineddata files in 3rd party online [Traineddata inspector](https://te-traineddata-ui.herokuapp.com).
