solipCysme

spaCy pipeline for French fiction and first-person point-of-view texts (with a focus on personal pronouns, moods, and tenses), trained mostly on novels.

Feature	Description
Language	French
Name	`fr_solipcysme`
Default Pipeline	`jusqucy_tokenizer`, `commecy_normalizer`, `jusqucy_normalizer`, `pretagger_hunspell`, `morphologizer`, `viceverser_lemmatizer`, `parser`
Components	`jusqucy_tokenizer`, `jusqucy_normalizer`, `commecy_normalizer`, `morphologizer`, `viceverser_lemmatizer`, `parser`
Sources	Corpus narraFEATS (morphologizer), Universal Dependencies (parser), french-word-vectors (vectors)
License	GPL
Author	thjbdvlt

installation

# Main pipeline
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_lg-0.2.6-py3-none-any.whl

# Faster, less accurate, smaller model
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_sm-0.2.6-py3-none-any.whl

usage

import spacy

nlp = spacy.load("fr_solipcysme_sm")

doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)

for i in doc:
    print(
        i.norm_,      # commecy_normalizer / jusqucy_normalizer
        i.pos_,       # morphologizer
        i.morph,      # morphologizer
        i.lemma_,     # viceverser_lemmatizer
        i.dep_,       # parser
        i.head,       # parser
        i.sent_start, # jusqucy_tokenizer
        i._.ttype,    # jusqucy_tokenizer
        i._.isword,   # jusqucy_tokenizer
    )

print(
    # these attributes are not especially useful on their own;
    # they mostly serve to make the morphologizer more accurate.
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,     # pretagger_hunspell
    doc._.hunspell_is,     # pretagger_hunspell
)

components and architectures

solipCysme is not only a trained pipeline but also a set of minimal pipeline components and model architectures that can be used independently.

SolipcysmeMultiHashed

a modified MultiHashEmbed that makes it possible to use Doc underscore attributes as features. The value of an attribute must be a list of int, and must have the same length as the Doc itself.
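As a rough illustration of the constraint on attribute values, the check below tests whether a candidate value is a list of int with one entry per token. The helper `is_valid_doc_feature` is hypothetical and not part of solipCysme, and a plain list of strings stands in for a spaCy Doc so that the sketch runs without any model installed.

```python
def is_valid_doc_feature(values, n_tokens):
    """Check the shape described above: a list of int, one per token.

    Hypothetical helper for illustration; not part of solipCysme.
    """
    if not isinstance(values, list) or len(values) != n_tokens:
        return False
    return all(isinstance(v, int) for v in values)


# A plain list of strings stands in for a Doc here (assumption).
tokens = ["la", "machine", "déraille"]

good = [2, 0, 1]          # one int per token
too_short = [2, 0]        # length does not match the number of tokens
wrong_type = [2, "0", 1]  # contains a str

print(is_valid_doc_feature(good, len(tokens)))
print(is_valid_doc_feature(too_short, len(tokens)))
print(is_valid_doc_feature(wrong_type, len(tokens)))
```

With a real pipeline, a value of this shape would typically be stored on a custom Doc extension registered through spaCy's `Doc.set_extension`.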

SolipcysmeCharEmbed

a modified CharacterEmbed that makes it possible to use underscore attributes as features and that replaces nC (number of characters) with nCstart and nCend, so that one can choose an asymmetric representation of words (e.g., for French, to use only suffixes, with nCstart = 0 and nCend = 6).
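To make the asymmetric selection concrete, here is a minimal sketch of which characters such a window would keep. It only illustrates the idea: the function `char_window` and the padding symbol are assumptions made for this example, and the actual architecture embeds the selected characters rather than returning a string.

```python
def char_window(word, n_start, n_end, pad="_"):
    """Keep n_start characters from the start and n_end from the end.

    Illustration only; the name and the padding symbol are assumptions.
    Words shorter than the window are padded to a fixed width.
    """
    prefix = word[:n_start].ljust(n_start, pad)
    # word[-0:] would return the whole word, so handle n_end == 0 explicitly.
    suffix = word[-n_end:].rjust(n_end, pad) if n_end > 0 else ""
    return prefix + suffix


# Suffix-only window, as in the French example (nCstart = 0, nCend = 6).
print(char_window("brouiller", 0, 6))  # uiller
print(char_window("le", 0, 6))         # ____le

# Symmetric window, for comparison.
print(char_window("brouiller", 3, 3))  # broler
```

A suffix-only window suits French because most inflectional information (tense, mood, person) sits at the end of the word.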

pretagger_hunspell

a component that makes Hunspell morphological analysis available as features for the SolipcysmeMultiHashed or SolipcysmeCharEmbed architectures.

limits and specificities

only knows about straight apostrophes (') and straight quotes (").
the morphologizer depends on the jusqucy_tokenizer, because this tokenizer sets the value of a Doc extension (Doc._.jusqucy_ttypes) that the morphologizer uses.
the morphologizer also depends on the pretagger_hunspell component, because it uses the output of Hunspell (the po: and is: fields) as token features.
no Gender feature.
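Since only straight apostrophes and quotes are recognized, one possible workaround is to straighten typographic ones before calling the pipeline. The sketch below uses only the standard library; the function name `straighten_quotes` and the choice of characters to map are assumptions for this example, not something solipCysme provides.

```python
# Map typographic apostrophes and quotes to their straight equivalents.
STRAIGHTEN = str.maketrans({
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text):
    """Return text with curly apostrophes and quotes replaced by straight ones."""
    return text.translate(STRAIGHTEN)


print(straighten_quotes("l\u2019arbre dit \u201coui\u201d"))
```

The cleaned string can then be passed to `nlp(...)`. Guillemets (« ») are deliberately left untouched here; whether to map them is a choice left to the reader.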

license

this work is released under the GPL license (v3).
