v1.0.0: Support for deep learning workflows and entity-aware rule matcher
✨ Major features and improvements
- NEW: custom processing pipelines, to support deep learning workflows
- NEW: Rule matcher now supports entity IDs and attributes
- NEW: Official/documented training APIs and `GoldParse` class
- Download and use GloVe vectors by default
- Make it easier to load and unload word vectors
- Improved rule matching functionality
- Move basic data into the code, rather than the JSON files. This makes it simpler to use the tokenizer without the models installed, and makes adding new languages much easier.
- Replace file-system strings with `Path` objects. You can now load resources over your network, or do similar trickery, by passing any object that supports the `Path` protocol.
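The `Path` change can be illustrated with a minimal sketch that does not use spaCy itself. `load_vocab_strings` is a hypothetical loader and `InMemoryPath` is a hypothetical stand-in for a resource that does not live on the local file system. The sketch assumes a loader only needs `/` joining and `open()`; the subset of the `pathlib.Path` interface that spaCy actually relies on may differ.

```python
import io
import json
import pathlib
import tempfile


class InMemoryPath:
    """Hypothetical path-like object backed by a dict instead of the disk."""

    def __init__(self, files, name=""):
        self._files = files  # maps "dir/file" strings to text content
        self._name = name

    def __truediv__(self, child):
        # Mirror pathlib's `/` operator for joining path segments.
        joined = f"{self._name}/{child}" if self._name else child
        return InMemoryPath(self._files, joined)

    def open(self, mode="r", encoding=None):
        # Return a file-like object, as pathlib.Path.open() would.
        return io.StringIO(self._files[self._name])


def load_vocab_strings(path):
    """Hypothetical loader: reads `vocab/strings.json` found under `path`."""
    with (path / "vocab" / "strings.json").open("r", encoding="utf8") as f:
        return json.load(f)


# A real directory on disk works through pathlib.Path...
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "vocab").mkdir()
    (root / "vocab" / "strings.json").write_text('["hello", "world"]', encoding="utf8")
    from_disk = load_vocab_strings(root)

# ...and so does any object exposing the same two operations.
from_memory = load_vocab_strings(
    InMemoryPath({"vocab/strings.json": '["hello", "world"]'})
)
```

Because the loader never touches a string path directly, swapping the storage backend only requires a new path-like class, not a change to the loading code.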
⚠️ Backwards incompatibilities
- The `data_dir` keyword argument of `Language.__init__` (and its subclasses `English.__init__` and `German.__init__`) has been renamed to `path`.
- Details of how the `Language` base-class and its sub-classes are loaded, and how defaults are accessed, have been heavily changed. If you have your own subclasses, you should review the changes.
- The deprecated `token.repvec` name has been removed.
- The `.train()` method of `Tagger` and `Parser` has been renamed to `.update()`.
- The previously undocumented `GoldParse` class has a new `__init__()` method. The old method has been preserved in `GoldParse.from_annot_tuples()`.
- Previously undocumented details of the `Parser` class have changed.
- The previously undocumented `get_package` and `get_package_by_name` helper functions have been moved into a new module, `spacy.deprecated`, in case you still need them while you update.
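If your own code has to run against both old and new releases while you migrate, the two renames above (`data_dir` to `path`, `.train()` to `.update()`) could be bridged with small helpers along these lines. This is only a sketch: `update_model`, `translate_kwargs` and the two stand-in tagger classes are hypothetical names, not part of spaCy, and the `(doc, gold)` call signature is an assumption made for illustration.

```python
import warnings


def update_model(model, doc, gold):
    """Call `.update()` when available, falling back to the old `.train()`."""
    if hasattr(model, "update"):
        return model.update(doc, gold)
    warnings.warn(
        "model.train() is the pre-1.0 name; use update()", DeprecationWarning
    )
    return model.train(doc, gold)


def translate_kwargs(kwargs):
    """Rename the old `data_dir` keyword to `path`, leaving other keys alone."""
    kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
    if "data_dir" in kwargs:
        if "path" in kwargs:
            raise TypeError("pass either 'path' or the old 'data_dir', not both")
        kwargs["path"] = kwargs.pop("data_dir")
    return kwargs


# Stand-in classes used only to exercise the helpers; they are not spaCy classes.
class OldStyleTagger:
    def train(self, doc, gold):
        return "train"


class NewStyleTagger:
    def update(self, doc, gold):
        return "update"
```

Helpers like these are meant to be temporary: once every caller uses `path` and `.update()`, they can be deleted.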
🔴 Bug fixes
- Fix `get_lang_class` bug when GloVe vectors are used.
- Fix Issue #411: `doc.sents` raised IndexError on empty string.
- Fix Issue #455: Correct lemmatization logic
- Fix Issue #371: Make `Lexeme` objects hashable
- Fix Issue #469: Make `noun_chunks` detect root NPs
👥 Contributors
Thanks to @daylen, @RahulKulhari, @stared, @adamhadani, @izeye and @crawfordcomeaux for the pull requests!