SentenceCleaner

The SentenceCleaner is a simple tool for "cleaning" sentences; it replaces the old PHP cleaning scripts. Filter rules are defined by regular expressions and simple length restrictions, which can be combined. The filters are applied to every sentence in the input file. Sentences that match no filter are written to the output file.
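The filtering idea described above can be sketched in a few lines of Java. This is only an illustration, not the tool's actual implementation: the class name, the rule IDs, the two patterns and the length limit of 255 are all assumptions made up for this example, and the real rules live in the rule files described under "Filters".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch: rule IDs, patterns and the length limit are invented for illustration.
class FilterSketch {
    // Rule ID -> regular expression; a sentence is rejected if any pattern is found in it.
    static final Map<String, Pattern> REGEX_RULES = new LinkedHashMap<>();
    // Assumed length restriction, not taken from the real rule files.
    static final int MAX_LENGTH = 255;

    static {
        REGEX_RULES.put("lowercase_start", Pattern.compile("^[a-z]"));
        REGEX_RULES.put("markup_chars", Pattern.compile("[|\\[\\]]"));
    }

    // Returns the ID of the first rule that matches, or empty if the sentence is clean.
    static Optional<String> firstMatchingRule(String sentence) {
        if (sentence.length() > MAX_LENGTH) {
            return Optional.of("max_length");
        }
        for (Map.Entry<String, Pattern> rule : REGEX_RULES.entrySet()) {
            if (rule.getValue().matcher(sentence).find()) {
                return Optional.of(rule.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] input = {
            "This is a well-formed sentence.",
            "lowercase start is suspicious.",
            "Menu | Home | Contact"
        };
        for (String sentence : input) {
            Optional<String> rule = firstMatchingRule(sentence);
            if (rule.isPresent()) {
                System.out.println("rejected (" + rule.get() + "): " + sentence);
            } else {
                System.out.println("kept: " + sentence);
            }
        }
    }
}
```

Only sentences for which `firstMatchingRule` returns an empty result would be written to the output file in this sketch.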

Input formats

Two input formats are supported:

Wortschatz raw text format ('source' format), which is the default
Tab-separated file, selected with the parameter -c COLUMN (position of the sentence column, starting at 0)
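To make the zero-based column index concrete, here is a minimal Java sketch of picking the sentence column out of a tab-separated line. The class and method names are hypothetical, and returning null for lines with too few fields is an assumption of this example, not documented behavior of the tool.

```java
// Hypothetical sketch of zero-based column selection in a tab-separated line.
class ColumnSketch {
    // Returns the field at the given zero-based index, or null if the line has too few fields.
    static String sentenceColumn(String line, int column) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (column < 0 || column >= fields.length) {
            return null;
        }
        return fields[column];
    }

    public static void main(String[] args) {
        String line = "42\tThis is the sentence.\tsome-source";
        // With -c 1 the second field is the one that would be checked.
        System.out.println(sentenceColumn(line, 1));
    }
}
```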

Starting the tool

java -jar SentenceCleaner.jar -i INPUT -o OUTPUT [-l LANG_CODE] [-t TEXTTYPE] [-c COLUMN] [-r] [-s] [-f] [-v] [-e]

INPUT path to the input file
OUTPUT path to the output file
LANG_CODE language code in ISO 639-3
TEXTTYPE text type: web|news|wikipedia
COLUMN column number: treats the input as a tab-separated file and checks only the specified column (index starts at 0)
-r replace: replace HTML entities with UTF-8 characters
-s summary: write summary to stdout
-f summary: write summary to stat file
-v verbose: verbose output
-e exchange: write the ill-formed sentences to the output (together with the triggered rule)

Filters

All filters are stored in the directory src/main/resources/rules/ of the project. They are divided into general ('general.rules'), language-specific (like 'lang_ces.rules') and text-type-specific (like 'texttype_web.rules') filters. The syntax of the filters is described in 'general.rules'. Every filter rule has a unique identifier. A filter in the text type file overrides a filter with the same ID in the general file; a filter in the language-specific file overrides filters with the same ID in both other files.
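The override order (general, then text type, then language) can be pictured as merging three maps keyed by rule ID, where later maps win. The following Java sketch is an illustration under that assumption; the class name, rule IDs and pattern strings are placeholders and do not reflect the real syntax in 'general.rules'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the override order: general < text type < language.
class RuleOverrideSketch {
    // Each map goes from rule ID to a rule definition (here just a placeholder string).
    static Map<String, String> effectiveRules(Map<String, String> general,
                                              Map<String, String> textType,
                                              Map<String, String> language) {
        Map<String, String> effective = new LinkedHashMap<>(general);
        effective.putAll(textType);  // text type rules replace general rules with the same ID
        effective.putAll(language);  // language rules replace both with the same ID
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> general = Map.of("1", "general-1", "2", "general-2", "3", "general-3");
        Map<String, String> textType = Map.of("2", "web-2", "3", "web-3");
        Map<String, String> language = Map.of("3", "ces-3");

        Map<String, String> effective = effectiveRules(general, textType, language);
        System.out.println("rule 1 -> " + effective.get("1"));
        System.out.println("rule 2 -> " + effective.get("2"));
        System.out.println("rule 3 -> " + effective.get("3"));
    }
}
```

In this example rule 1 comes from the general set, rule 2 from the text type set and rule 3 from the language set.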

The files containing filters are compiled into the jar. Reading filters from external files is currently not supported.

Starting the tool

TSV: java -jar SentenceCleaner.jar -i testdata/inputfile -o testdata/outputfile -c 1 -r
Raw text: java -jar SentenceCleaner.jar -i testdata/inputfile_raw -o testdata/outputfile_raw -r
