The folder `sources` contains the source files of the lexica used; for more information, see the README.md in that folder.
The folder `scripts` contains the Python scripts used to generate all files.
The repository has the following files at the moment:

`Bailly2020.4a.json`, `LSJ.json`, `Pape.json`, `TBESG.json`: Greek dictionaries compiled from the respective source files in `sources`.

`pta_lexicon_grc.json`:
- compiled from LSJ, TBESG, and Pape
- has: lemma – grc_eng – grc_eng2 – grc_deu
- grc_eng = LSJ, grc_eng2 = TBESG, grc_deu = Pape
- grc_eng and grc_deu are lists, as there are homonymous lemmata.
- If there is no entry in one of the dictionaries, the entry is empty.
- The folder `pta_lexicon_grc` contains an XML version of the above.

`wordlemma_grc_cltk.json`:
- result of lemmatizing all Greek texts in pta_data; it currently has 133,438 entries. Lemmatization was done using the Classical Language Toolkit (CLTK).
- has word - lemma - POS - morphology (according to the Universal Dependencies (UD) project)
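
The word list and the compiled lexicon described above can be chained: a word form is resolved to its lemma, and the lemma is then looked up in the lexicon. The following is only a minimal sketch. It assumes that both files are lists of records keyed by the field names given above; the actual JSON layout may differ, and the sample records, glosses, and the helpers `build_index` and `glosses_for` are hypothetical.

```python
import json
import unicodedata

# Hypothetical records: field names follow the README, but the layout and
# the glosses are placeholders and NOT taken from the real files.
WORDLEMMA_SAMPLE = json.loads("""
[
  {"word": "λόγου", "lemma": "λόγος", "POS": "NOUN", "morphology": "Case=Gen|Number=Sing"},
  {"word": "ἔλεγεν", "lemma": "λέγω", "POS": "VERB", "morphology": "Number=Sing|Person=3"}
]
""")

LEXICON_SAMPLE = json.loads("""
[
  {"lemma": "λόγος",
   "grc_eng": ["placeholder LSJ sense 1", "placeholder LSJ sense 2"],
   "grc_eng2": "",
   "grc_deu": ["placeholder Pape sense"]},
  {"lemma": "λέγω", "grc_eng": [], "grc_eng2": "placeholder TBESG sense", "grc_deu": []}
]
""")


def nfc(text):
    """Normalize Greek strings so that precomposed and combining forms compare equal."""
    return unicodedata.normalize("NFC", text)


def build_index(records, key):
    """Map each (normalized) value of `key` to the list of records carrying it."""
    index = {}
    for record in records:
        index.setdefault(nfc(record[key]), []).append(record)
    return index


def glosses_for(word, word_index, lexicon_index):
    """Return (lemma, field, gloss) triples for every non-empty lexicon field."""
    results = []
    for analysis in word_index.get(nfc(word), []):
        for entry in lexicon_index.get(nfc(analysis["lemma"]), []):
            for field in ("grc_eng", "grc_eng2", "grc_deu"):
                value = entry.get(field)
                if not value:  # empty string or empty list: no entry in that dictionary
                    continue
                # Homonymous lemmata are stored as lists; wrap single strings.
                senses = value if isinstance(value, list) else [value]
                results.extend((entry["lemma"], field, sense) for sense in senses)
    return results


word_index = build_index(WORDLEMMA_SAMPLE, "word")
lexicon_index = build_index(LEXICON_SAMPLE, "lemma")
for triple in glosses_for(WORDLEMMA_SAMPLE[0]["word"], word_index, lexicon_index):
    print(triple)
```

With the real data, the two sample lists would be replaced by the result of `json.load` on the respective files, provided their layout matches the assumption made here.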

`wordlemma_grc.json` (outdated):
- result of lemmatizing part of the texts in pta_data; it has 42,346 entries. Lemmatization was done using the Morpheus morphological analysis engine used at morph.perseids.org.
- has word – lemma – morphology
- words which have not been lemmatized (for whatever reason) are not in the file.

`wordlemma_grc.xml` (outdated):
- XML version of the file above

`wordlemma_grc_diogenes.json`:
- morphology data from Diogenes; Greek is converted to UTF-8 (from Betacode).
- has word - lemma (list of possible morphology)
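
To illustrate what the conversion from Betacode to UTF-8 involves, here is a rough sketch. It is not the script used in this repository; it assumes lowercase Betacode, ignores capitals (the `*` prefix) and punctuation, and the function name `betacode_to_unicode` is made up for this example.

```python
import re
import unicodedata

# Betacode letter -> Greek letter (lowercase only; capitals are not handled here).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Betacode diacritic -> Unicode combining character.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def betacode_to_unicode(text):
    """Convert a lowercase Betacode string to precomposed (NFC) Greek."""
    converted = []
    for char in text:
        # Unknown characters (spaces, digits, ...) are passed through unchanged.
        converted.append(LETTERS.get(char) or DIACRITICS.get(char, char))
    greek = unicodedata.normalize("NFC", "".join(converted))
    # A sigma at the end of a word is written as final sigma.
    return re.sub(r"σ\b", "ς", greek)


print(betacode_to_unicode("lo/gos"))
print(betacode_to_unicode("a)/nqrwpos"))
```

Normalizing to NFC after the conversion matters when the result is compared with other Greek strings, since the same accented letter can be encoded either precomposed or as a base letter plus combining marks.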

JSON versions of the lexica in the `sources` folder, adapted for use in PTA; the folder `pta_dictionaries` contains XML versions of these.

`georges_lat.json`: compiled from the respective source file in `sources`.

`LewisShort.json`: tbd

`TLL.json`:
- built from https://publikationen.badw.de/de/api/thesaurus/html-xml/thesaurus/index.json
- has lemma - url of entry in THESAVRVS LINGVAE LATINAE Open Access

`wordlemma_lat_cltk.json`:
- result of lemmatizing all Latin texts in pta_data; it currently has 3022 entries. Lemmatization was done using the Classical Language Toolkit (CLTK).
- has word - lemma - POS - morphology (according to the Universal Dependencies (UD) project)

`wordlemma_lat_diogenes.json`:
- morphology data from Diogenes
- has word - lemma (list of possible morphology)