A spaCy pipeline for French fiction and texts written from a first-person point of view, with a focus on personal pronouns, moods, and tenses. It was trained mostly on novels.

| Feature | Description |
|---|---|
| Language | French |
| Name | fr_solipcysme |
| Default Pipeline | jusqucy_tokenizer, commecy_normalizer, jusqucy_normalizer, pretagger_hunspell, morphologizer, viceverser_lemmatizer, parser |
| Components | jusqucy_tokenizer, jusqucy_normalizer, commecy_normalizer, morphologizer, viceverser_lemmatizer, parser |
| Sources | Corpus narraFEATS (morphologizer), Universal Dependencies (parser), french-word-vectors (vectors) |
| License | GPL |
| Author | thjbdvlt |

```bash
# Main pipeline
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_lg-0.2.6-py3-none-any.whl

# Faster, less accurate, smaller model
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_sm-0.2.6-py3-none-any.whl
```

```python
import spacy

nlp = spacy.load("fr_solipcysme_sm")
doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)

for i in doc:
    print(
        i.norm_,  # commecy_normalizer / jusqucy_normalizer
        i.pos_,  # morphologizer
        i.morph,  # morphologizer
        i.lemma_,  # viceverser_lemmatizer
        i.dep_,  # parser
        i.head,  # parser
        i.sent_start,  # jusqucy_tokenizer
        i._.ttype,  # jusqucy_tokenizer
        i._.isword,  # jusqucy_tokenizer
    )

print(
    # These attributes are not especially useful on their own;
    # they mostly serve to make the morphologizer more accurate.
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,  # pretagger_hunspell
    doc._.hunspell_is,  # pretagger_hunspell
)
```

solipCysme is not only a trained pipeline but also a set of minimal pipeline components and model architectures that can be used independently.

- A modified `MultiHashEmbed` that makes it possible to use `Doc` underscore attributes as features. The value of such an attribute must be a list of `int` with the same length as the `Doc` itself.
- A modified `CharacterEmbed` that also accepts underscore attributes as features and replaces `nC` (number of characters) with `nCstart` and `nCend`, so that one can choose an asymmetric representation of words (for French, for example, only the suffix, with `nCstart = 0` and `nCend = 6`).
- A component that makes Hunspell morphological analysis available as features for the `SolipcysmeMultiHashe` or `SolipcysmeCharEmbed` architectures.
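
The two constraints described above (a per-token list of `int`, and an asymmetric character window) can be illustrated without spaCy. The sketch below is plain Python and is not solipCysme's implementation: `check_doc_feature` and `char_window` are hypothetical helper names, and the padding symbol is an assumption made for the illustration.

```python
from typing import List, Sequence


def check_doc_feature(values: Sequence[int], n_tokens: int) -> List[int]:
    """Validate a value meant for a Doc underscore attribute used as a feature.

    Hypothetical helper: it only encodes the two rules stated above
    (a list of int, with the same length as the Doc).
    """
    if len(values) != n_tokens:
        raise ValueError(
            f"expected {n_tokens} values (one per token), got {len(values)}"
        )
    if not all(isinstance(v, int) for v in values):
        raise TypeError("every value must be an int")
    return list(values)


def char_window(word: str, n_start: int, n_end: int, pad: str = "_") -> str:
    """Return the first n_start and the last n_end characters of a word.

    Illustration only: short words are padded so that the result always
    has n_start + n_end characters. The pad symbol is an assumption.
    """
    start = word[:n_start].ljust(n_start, pad)
    # word[-0:] would return the whole word, so handle n_end == 0 explicitly.
    end = word[-n_end:].rjust(n_end, pad) if n_end > 0 else ""
    return start + end


# Suffix-only view of a few French word forms (nCstart = 0, nCend = 6).
for form in ["chantions", "marchaient", "est"]:
    print(form, "->", char_window(form, n_start=0, n_end=6))
```

With `n_start=0`, only the ending of each form is kept, which is where most French verbal inflection is marked.
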

Known limitations and dependencies:

- Only straight apostrophes (`'`) and straight quotes (`"`) are recognized.
- The morphologizer depends on the `jusqucy_tokenizer`, because this tokenizer sets the value of a `Doc` extension (`Doc._.jusqucy_ttypes`) that the morphologizer uses.
- The morphologizer also depends on the `pretagger_hunspell` component, because it uses the output of Hunspell as token features (`po:` and `is:` features).
- No `Gender` feature.
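
Because only straight apostrophes and quotes are handled, one possible workaround is to normalize the text before passing it to the pipeline. The sketch below is plain Python; `straighten_quotes` is a hypothetical helper that is not part of solipCysme, and the set of characters it covers is an assumption.

```python
# Typographic characters mapped to their straight equivalents.
# The set of characters covered here is an assumption; extend it as needed.
_STRAIGHT = str.maketrans({
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace curly apostrophes and quotes with straight ones."""
    return text.translate(_STRAIGHT)


print(straighten_quotes("l\u2019heure, dit-elle, \u201cenfin\u201d"))
```

The result can then be passed to `nlp(...)` as usual. Guillemets (« ») are left untouched by this sketch.
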

This work is released under the GPL license (v3).