Release v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more · explosion/spaCy

✨ New features and improvements

NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
NEW: Experimental SpanCategorizer component for labeling arbitrary and potentially overlapping spans of text.
NEW: Use predicted annotations during training via the [training.annotating_components] config setting.
Alpha tokenization support for Azerbaijani.
Part-of-speech tag-based lemmatizers for Catalan and Italian.
The TextCatCNN and TextCatBOW architectures are now resizable.
Support updating the EntityRecognizer with known incorrect span annotations.
Auto-generate a pretty README.md from the package meta when running spacy package.

For more details, see the New in v3.1 usage guide.
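
As a rough illustration of the annotating_components setting listed above, the sketch below embeds a small config excerpt in a Python string and reads it with the standard library. The pipeline and component names (parser, tagger) are assumptions chosen for illustration. spaCy loads configs through its own config system, not configparser, so this only shows the shape of the setting, not how spaCy processes it.

```python
import configparser
import json

# Hypothetical config excerpt: the parser sets its predicted annotations on
# the docs during training so that later components can use them.
CONFIG_EXCERPT = """
[nlp]
pipeline = ["parser", "tagger"]

[training]
annotating_components = ["parser"]
"""

reader = configparser.ConfigParser()
reader.read_string(CONFIG_EXCERPT)

# The values in this excerpt are valid JSON, so json.loads can decode them.
pipeline = json.loads(reader["nlp"]["pipeline"])
annotating = json.loads(reader["training"]["annotating_components"])

# Sanity check for this sketch: an annotating component that is not in the
# pipeline would have nothing to run.
missing = [name for name in annotating if name not in pipeline]
print(annotating, missing)
```

In a real project the setting lives in the training config file that is passed to spacy train.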

📦 New trained pipelines

Package	Language	UPOS	Parser LAS	NER F
`ca_core_news_sm`	Catalan	98.2	87.4	79.8
`ca_core_news_md`	Catalan	98.3	88.2	84.0
`ca_core_news_lg`	Catalan	98.5	88.4	84.2
`ca_core_news_trf`	Catalan	98.9	93.0	91.2
`da_core_news_trf`	Danish	98.0	85.0	82.9

⚠️ Upgrading from v3.0

Because configs are extensively versioned, v3.0 pipelines should be compatible with v3.1, although you may see slight differences in performance. Run your v3.0 pipeline under v3.1 against your test suite; if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If performance degrades, retrain your pipeline with v3.1.
Use spacy init fill-config to update a v3.0 config for v3.1.
When sourcing a pipeline component that requires static vectors, you must now include the source model's vectors in [initialize.vectors].
Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings.
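
Since these are now ordinary Python warnings, the standard warnings module can filter them. The following is a minimal sketch using only the standard library. The warning code and message text are hypothetical stand-ins, and emit_example_warning is an illustrative helper and not part of spaCy.

```python
import warnings

# Hypothetical message; the code "[W999]" and the text are made up for this sketch.
EXAMPLE_MESSAGE = "[W999] Example warning text."


def emit_example_warning() -> None:
    """Stand-in for library code that issues a Python warning."""
    warnings.warn(EXAMPLE_MESSAGE, UserWarning)


# Without a specific filter, the warning is recorded.
with warnings.catch_warnings(record=True) as caught_before:
    warnings.simplefilter("always")
    emit_example_warning()

# With an "ignore" filter on the message prefix, it is suppressed.
with warnings.catch_warnings(record=True) as caught_after:
    warnings.simplefilter("always")
    # `message` is a regular expression matched against the start of the text.
    warnings.filterwarnings("ignore", message=r"\[W999\]")
    emit_example_warning()

print(len(caught_before), len(caught_after))
```

Matching on the bracketed code at the start of the message keeps the filter narrow, so unrelated warnings still surface.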

For more information, see Notes on upgrading from v3.0.
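
The spacy_version change described in the upgrade notes amounts to editing one field of the package meta. A minimal sketch, assuming a meta.json whose contents look like the hypothetical excerpt below; the package name, the version values and the extend_spacy_version helper are made up for illustration.

```python
import json

# Hypothetical excerpt of the meta.json of a pipeline packaged for v3.0.
META_JSON = (
    '{"lang": "en", "name": "my_pipeline", "version": "0.0.1", '
    '"spacy_version": ">=3.0.0,<3.1.0"}'
)


def extend_spacy_version(meta: dict, new_range: str = ">=3.0.0,<3.2.0") -> dict:
    """Return a copy of the meta with the spacy_version range replaced."""
    updated = dict(meta)
    updated["spacy_version"] = new_range
    return updated


meta = json.loads(META_JSON)
updated_meta = extend_spacy_version(meta)
print(json.dumps(updated_meta, indent=2))
```

Only widen the range after the pipeline has passed your own tests under v3.1.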

🔴 Bug fixes

Fix issue #7036: Use a context manager when reading model.
Fix issue #7629: Fix scoring normalization.
Fix issue #7799: Ensure spacy ray command works.
Fix issue #7807: Show warning if entity ruler runs without patterns.
Fix issue #7886: Fix unknown tokens percentage in debug data.
Fix issue #7930: Make EntityLinker robust for nO=None.
Fix issue #7925: Skip vector ngram backoff if minn is not set.
Fix issue #7973: Fix debug model for transformers.
Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
Fix issue #8004: Handle errors while multiprocessing.
Fix issue #8009: Fix Doc.from_docs() for all empty docs.
Fix issue #8012: Fix ensemble textcat with listener.
Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
Fix issue #8055: Handle partial entities in Span.as_doc.
Fix issue #8062: Make all Span attrs writable.
Fix issue #8066: Update debug data for textcat.
Fix issue #8069: Custom warning if DocBin is too large.
Fix issue #8099: Update Vietnamese tokenizer.
Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
Fix issue #8116: Fix offsets in Span.get_lca_matrix.
Fix issue #8132: Remove unsupported attrs from attrs.IDS.
Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
Fix issue #8169: Fix EntityRuler bug where ent_ids returns None for phrases.
Fix issue #8208: Address missing config overrides post load of models.
Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
Fix issue #8216: Don't add duplicate patterns in EntityRuler.
Fix issue #8265: Address mypy errors.
Fix issue #8335: Raise error if deps not provided with heads in Doc.
Fix issue #8368: Preserve whitespace in Span.lemma_.
Fix issue #8388: Don't clobber vectors when loading components from source models.
Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
Fix issue #8426: Fix setting empty entities in Example.from_dict.
Fix issue #8441: Add correct types for Language.pipe return values.
Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
Fix issue #8559: Fix vectors check for sourced components.
Fix issue #8584: Raise an error for textcat with <2 labels.

👥 Contributors

@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD
