@krambox krambox commented May 7, 2024

Currently, the abbreviations can only be adjusted by editing the files in the data directory. To make SoMaJo usable for domain-specific texts that have their own abbreviations, this PR extends the constructor so that custom abbreviations can be passed in without forking SoMaJo.

@tsproisl
Owner

Thanks, this is something that has been requested a couple of times!

Before I merge it into develop, could you please address the following minor issues?

  • Add a space after the commas
  • Change the default value of custom_abbreviations to None (to avoid mutable default arguments)
  • Check the indentation level in TesttCustomAbbreviation
  • Fix the typo in TesttCustomAbbreviation
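
For context on the mutable-default point, here is a minimal sketch of the pitfall and the usual None-based fix. The two classes are hypothetical stand-ins written for illustration; they are not SoMaJo's actual constructor, and only the parameter name custom_abbreviations is taken from the discussion above.

```python
class BuggyTokenizer:
    """Hypothetical stand-in, not SoMaJo's class."""

    def __init__(self, custom_abbreviations=[]):
        # The default list is created once, when the function is defined,
        # and is shared by every call that omits the argument.
        self.abbreviations = custom_abbreviations


class SaferTokenizer:
    """Hypothetical stand-in using None as the default."""

    def __init__(self, custom_abbreviations=None):
        # Each instance gets its own fresh list.
        if custom_abbreviations is None:
            custom_abbreviations = []
        self.abbreviations = list(custom_abbreviations)


a = BuggyTokenizer()
a.abbreviations.append("z. B.")
b = BuggyTokenizer()  # sees the entry appended through `a`

c = SaferTokenizer()
c.abbreviations.append("z. B.")
d = SaferTokenizer()  # starts empty, unaffected by `c`
```

With the list default, state leaks between instances; with None, every instance starts from its own empty list.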

TODOs (intended as reminders to myself) until it can be merged into master and released:

  • Update the docstrings
  • When merging the custom abbreviations with the default list, check for duplicates and sort all abbreviations by length. It’s probably best to pass the custom abbreviations to utils.read_abbreviation_file() as an additional argument and to initialize the abbreviations set with them, respecting to_lower.
  • Add an argument custom_single_token_abbreviations for abbreviations that should not be split (corresponding to the single_token_abbreviations_*.txt files)
  • Add the functionality to the command-line interface, e.g. via options that let the user provide custom abbreviation files
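
The merging step from the second TODO could look roughly like the sketch below. The function merge_abbreviations is a hypothetical helper, not the real utils.read_abbreviation_file(), whose signature is not shown in this thread. Sorting longest-first with alphabetical tie-breaking is also an assumption, since the comment only says "by length".

```python
def merge_abbreviations(default_abbreviations, custom_abbreviations=None, to_lower=False):
    """Hypothetical helper: merge, deduplicate, and sort abbreviations."""
    abbreviations = set()  # a set removes duplicates automatically
    for abbreviation in list(default_abbreviations) + list(custom_abbreviations or []):
        abbreviation = abbreviation.strip()
        if not abbreviation:
            continue  # skip empty entries
        if to_lower:
            abbreviation = abbreviation.lower()
        abbreviations.add(abbreviation)
    # Assumed order: longest first, ties broken alphabetically.
    return sorted(abbreviations, key=lambda a: (-len(a), a))


merged = merge_abbreviations(["z. B.", "usw.", "Dr."], ["Dr.", "m. E.", "lt."])
```

Seeding one set with both sources handles the duplicate check, and applying to_lower before insertion keeps custom and default entries comparable.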
