v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more
✨ New features and improvements
- NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
- NEW: Experimental `SpanCategorizer` component for labeling arbitrary and potentially overlapping spans of text.
- NEW: Use predicted annotations during training via the `[training.annotating_components]` config setting.
- Alpha tokenization support for Azerbaijani.
- Part-of-speech tag-based lemmatizers for Catalan and Italian.
- The TextCatCNN and TextCatBOW architectures are now resizable.
- Support updating the `EntityRecognizer` with known incorrect span annotations.
- Auto-generate a pretty `README.md` based on the meta in `spacy package`.
For more details, see the New in v3.1 usage guide.
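As a rough illustration of the `annotating_components` setting, the sketch below parses a small config excerpt and reads the list back. It uses Python's standard `configparser` and `json` modules as a stand-in for spaCy's own config loader, and the component names (`parser`, `ner`) are example values, not taken from these notes. The pipeline membership check is an assumption of this sketch, not a documented rule.

```python
import configparser
import json

# Hypothetical config excerpt; component names are examples only.
CONFIG_TEXT = """
[nlp]
pipeline = ["parser", "ner"]

[training]
frozen_components = ["parser"]
annotating_components = ["parser"]
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG_TEXT)

# Values in the excerpt are JSON-style lists, so json.loads can decode them.
pipeline = json.loads(cfg["nlp"]["pipeline"])
annotating = json.loads(cfg["training"]["annotating_components"])

# Sanity check (an assumption of this sketch): every annotating
# component should also be listed in the pipeline.
unknown = [name for name in annotating if name not in pipeline]

print("annotating components:", annotating)
print("not in pipeline:", unknown)
```

In this excerpt the parser is both frozen and annotating, so its predictions would be available to later components during training without the parser itself being updated.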
📦 New trained pipelines
| Package | Language | UPOS | Parser LAS | NER F |
|---|---|---|---|---|
| `ca_core_news_sm` | Catalan | 98.2 | 87.4 | 79.8 |
| `ca_core_news_md` | Catalan | 98.3 | 88.2 | 84.0 |
| `ca_core_news_lg` | Catalan | 98.5 | 88.4 | 84.2 |
| `ca_core_news_trf` | Catalan | 98.9 | 93.0 | 91.2 |
| `da_core_news_trf` | Danish | 98.0 | 85.0 | 82.9 |
⚠️ Upgrading from v3.0
- Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1. However, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the `spacy_version` in your model package meta to `">=3.0.0,<3.2.0"`. If you run into degraded performance, retrain your pipeline with v3.1.
- Use `spacy init fill-config` to update a v3.0 config for v3.1.
- When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in `[initialize.vectors]`.
- Logger warnings have been converted to Python warnings. Use `warnings.filterwarnings` or the new helper method `spacy.errors.filter_warning(action, error_msg='')` to manage warnings.
For more information, see Notes on upgrading from v3.0.
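Since warnings are now regular Python warnings, the standard `warnings` module can filter them by message. The sketch below uses only the standard library. The warning text and the `[W000]` code are made up for the demonstration and do not correspond to a real spaCy warning.

```python
import warnings


def emit_demo_warning():
    # Hypothetical message standing in for a library warning.
    warnings.warn("[W000] demo warning text", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record everything by default ...
    warnings.simplefilter("always")
    # ... except messages that start with the hypothetical code.
    # The `message` argument is a regex matched at the start of the text.
    warnings.filterwarnings("ignore", message=r"\[W000\]")

    emit_demo_warning()
    warnings.warn("unrelated warning", UserWarning)

# Only the warning that was not filtered out is recorded.
messages = [str(w.message) for w in caught]
print(messages)
```

Because `filterwarnings` inserts its rule ahead of earlier ones, the `ignore` rule takes precedence over the blanket `always` filter.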
🔴 Bug fixes
- Fix issue #7036: Use a context manager when reading model.
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7799: Ensure `spacy ray` command works.
- Fix issue #7807: Show warning if entity ruler runs without patterns.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7930: Make `EntityLinker` robust for nO=None.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8099: Update Vietnamese tokenizer.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns None for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8265: Address mypy errors.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8388: Don't clobber vectors when loading components from source models.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8441: Add correct types for `Language.pipe` return values.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8559: Fix vectors check for sourced components.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
👥 Contributors
@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD