New feature: Prune XML tags and their contents from the input before tokenization (via the command line option --prune TAGNAME1 --prune TAGNAME2 … or by passing prune_tags=["TAGNAME1", "TAGNAME2", …] to tokenize_xml or tokenize_xml_file). This can be useful when processing HTML files, e.g. for removing any <script> and <style> tags from the input.
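To make concrete what pruning a tag "and its contents" means, here is a minimal standard-library sketch. It is not SoMaJo's implementation; the function name `prune_elements` is hypothetical, and the sketch assumes well-formed XML input. Text that follows a pruned element is kept, only the element itself and everything inside it is dropped.

```python
import xml.etree.ElementTree as ET


def prune_elements(xml_string, prune_tags):
    """Remove every element whose name is in prune_tags, including its contents.

    Illustrative only (hypothetical helper, not part of SoMaJo).
    """
    prune_tags = set(prune_tags)
    root = ET.fromstring(xml_string)
    for parent in list(root.iter()):
        previous = None  # last child of `parent` that was kept
        for child in list(parent):
            if child.tag in prune_tags:
                # ElementTree stores the text after an element in its `tail`;
                # move it to the preceding node so it is not lost.
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return ET.tostring(root, encoding="unicode")


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <script>alert(1)</script>world!</p></body></html>"
)
print(prune_elements(page, ["script", "style"]))
```

With SoMaJo itself, the equivalent step is requested through --prune or prune_tags as described above, so no separate preprocessing is needed.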
New feature: Delimit sentences with XML tags (via the command line option --sentence-tag TAGNAME or by passing xml_sentences="TAGNAME" to the constructor). When using this option with XML input, SoMaJo tries hard to produce well-formed XML as output. To achieve this, some tags will need to be closed and re-opened at sentence boundaries. In this paragraph, for example, the italic region contains a sentence boundary:
<p>Hi <i>there! How</i> are you?</p>
SoMaJo will close the i tag before the end of the sentence and re-open it afterwards:
<p> <s> Hi <i> there ! </i> </s> <s> <i> How </i> are you ? </s> </p>
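The close-and-re-open strategy can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not SoMaJo's actual algorithm: the function name `tag_sentences` is hypothetical, the input is already tokenized, tags carry no attributes, sentence-final tokens are given by index, and a tag opened outside a sentence is never closed inside one.

```python
import re


def tag_sentences(tokens, sentence_ends, sentence_tag="s"):
    """Wrap sentences in <s> tags while keeping the markup well-formed.

    Hypothetical helper for illustration; `sentence_ends` is the set of
    indices of sentence-final tokens.
    """
    out = []
    open_tags = []  # names of tags currently open in the output
    pending = []    # tags closed at a sentence boundary, to be re-opened
    depth = None    # len(open_tags) at sentence start; None = outside a sentence

    for i, tok in enumerate(tokens):
        opening = re.fullmatch(r"<(\w+)>", tok)
        closing = re.fullmatch(r"</(\w+)>", tok)
        if opening:
            if depth is None and pending:
                # Must nest inside the suspended tags, so defer it as well.
                pending.append(opening.group(1))
            else:
                open_tags.append(opening.group(1))
                out.append(tok)
        elif closing:
            if depth is None and pending:
                pending.pop()  # tag ended while suspended: do not re-open it
            elif depth is not None and len(open_tags) <= depth:
                raise ValueError("tag opened outside the sentence closes inside it")
            else:
                open_tags.pop()
                out.append(tok)
        else:
            if depth is None:
                # Start a new sentence, then re-open the suspended tags.
                out.append(f"<{sentence_tag}>")
                depth = len(open_tags)
                for name in pending:
                    open_tags.append(name)
                    out.append(f"<{name}>")
                pending = []
            out.append(tok)
            if i in sentence_ends:
                # Close tags opened inside this sentence, innermost first.
                pending = open_tags[depth:]
                for name in reversed(pending):
                    out.append(f"</{name}>")
                del open_tags[depth:]
                out.append(f"</{sentence_tag}>")
                depth = None
    return " ".join(out)


tokens = ["<p>", "Hi", "<i>", "there", "!", "How", "</i>", "are", "you", "?", "</p>"]
print(tag_sentences(tokens, sentence_ends={4, 9}))
```

For the example paragraph, the sketch produces the same token sequence as the output shown above: the i tag is closed before </s> and re-opened right after the next <s>.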