Releases: explosion/spaCy
v3.2.2: Improved NER and parser speeds, bug fixes and more
✨ New features and improvements
- Improved `parser` and `ner` speeds on long documents (see technical details in #10019).
- Support for `spancat` components in `debug data`.
- Support for `ENT_IOB` as a `Matcher` token pattern key.
- Extended and improved types for many classes.
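As a rough sketch of the new `ENT_IOB` pattern key, the snippet below matches every token that begins an entity. It assumes spaCy v3.2.2 or later is installed; the sample sentence, the manually assigned entities and the rule name `ENTITY_START` are illustrative and do not come from the release notes.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# A blank pipeline is enough here; entities are set by hand for illustration.
nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in San Francisco")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 8, label="GPE")]

matcher = Matcher(nlp.vocab)
# "B" marks the first token of an entity in the IOB scheme.
matcher.add("ENTITY_START", [[{"ENT_IOB": "B"}]])

starts = [doc[start:end].text for _, start, end in matcher(doc)]
print(starts)
```

Using the string value `"B"` relies on the string IOB support listed under the bug fixes for this release (#9738).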
🔴 Bug fixes
- Fix issue #9735: Make floret murmurhash endian-neutral.
- Fix issue #9738: Support string IOB values for `ENT_IOB`.
- Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
- Fix issue #9960: Warn about entities that cross sentence boundaries in `debug data`.
- Fix issue #9979: Fix type for `Lexeme.rank`.
- Fix issue #10026: Check for 0-size assets in `spacy project`.
- Fix issue #10051: Consistently return scalars from similarity methods.
- Fix issue #10052: Fix spaces in `Doc.from_docs()` for empty docs.
- Fix issue #10079: Fix label detection in `debug data` for components with custom names.
- Fix issue #10109: Add types to `Underscore` and `DependencyMatcher` and improve types in `Language`, `Matcher` and `PhraseMatcher`.
- Fix issue #10130: Fix `Tokenizer.explain` when infixes appear as prefixes.
- Fix issue #10143: Use simple suggester in `spancat` initialization.
- Fix issue #10164: Support `IS_SENT_END` in `Doc.has_annotation`.
- Fix issue #10192: Detect invalid package names in `spacy package`.
- Fix issue #10223: Support mixed case in package names.
- Fix issue #10234: Fix type in `PhraseMatcher`.
📖 Documentation and examples
- Various documentation updates.
- New spaCy version tags in spaCy universe.
- New `Dockerfile` for repeatable website builds and easier local development.
- New additions to spaCy universe:
  - Augmenty: a text augmentation library
  - Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
  - spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
  - spacypdfreader: easy PDF to text to spaCy text extraction
  - textnets: text analysis with networks
👥 Contributors
@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav
v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more
✨ New features and improvements
- NEW: `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce size of output docs.
- NEW: `ENT_ID` and `ENT_KB_ID` as `Matcher` pattern attributes.
- Support `kb_id` for entities in displaCy from `Doc` input.
- Add `Span.sents` property for spans spanning over more than one sentence.
- Add `EntityRuler.remove` to remove patterns by `id`.
- Make the `Tagger` `neg_prefix` configurable.
- Use `Language.pipe` in `Language.evaluate` for more efficient processing.
- Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
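To illustrate the `Span.sents` property, here is a minimal sketch that takes a span crossing a sentence boundary and lists the sentences it touches. It assumes spaCy v3.2.1 or later; the rule-based `sentencizer` and the sample text are stand-ins chosen so that no trained pipeline is needed.

```python
import spacy

nlp = spacy.blank("en")
# The sentencizer sets sentence boundaries on punctuation, without a trained model.
nlp.add_pipe("sentencizer")

doc = nlp("This is one. This is two. This is three.")
# Tokens 2..6 are "one . This is two", which straddles the first two sentences.
span = doc[2:7]

sentences = [sent.text for sent in span.sents]
print(sentences)
```

Each item yielded by `span.sents` is itself a `Span`, so the usual span attributes are available on it.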
🔴 Bug fixes
- Fix issue #9638: Make `JsonlCorpus` path optional again.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9658: Improve error message for incorrect `.jsonl` paths in `EntityRuler`.
- Fix issue #9674: Fix language-specific factory handling in package CLI.
- Fix issue #9694: Convert labels to strings for README in package CLI.
- Fix issue #9697: Exclude strings from source vector checks.
- Fix issue #9701: Allow `Scorer.score_spans` to handle predicted docs with missing annotation.
- Fix issue #9722: Initialize `parser` from reference parse rather than aligned example.
- Fix issue #9764: Set annotations more efficiently in `tagger` and `morphologizer`.
📖 Documentation and examples
- Various documentation updates: `init_tok2vec` after pretraining, batch contract for listeners.
- New additions to the spaCy universe:
  - `eng-spacysentiment`: Sentiment analysis for English.
  - Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
👥 Contributors
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar
v3.2.0: Registered scoring functions, Doc input, floret vectors and more
✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for `French` and `zh-Hans` for `Chinese`.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
  - New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
  - Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
  - Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
  - Universal Dependencies corpora updated to v2.8.
  - Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
  - English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
For more details, see the New in v3.2 usage guide.
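The `Doc` input feature listed above can be sketched as follows: a `Doc` is built from a custom word list and then passed through the pipeline, which skips the tokenizer and runs the remaining components. This assumes spaCy v3.2.0 or later; the word list and the use of the `sentencizer` component are illustrative choices.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Custom tokenization: "Well-known" stays a single token here, whereas the
# default English tokenizer would normally split it at the hyphen.
words = ["Well-known", "example", ".", "Second", "sentence", "."]
pretokenized = Doc(nlp.vocab, words=words)

# Passing a Doc instead of a string keeps the existing tokens.
doc = nlp(pretokenized)

tokens = [token.text for token in doc]
n_sents = len(list(doc.sents))
print(tokens, n_sents)
```

The same pattern applies to `nlp.pipe()`, which accepts an iterable of `Doc` objects.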
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
- Fix issue #9584: Use metaclass to subclass errors to allow better pickling.
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
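The `DocBin` change can be observed with a small sketch that adds the same document to two `DocBin` instances, one with `store_user_data=True` and one with `store_user_data=False`. It assumes spaCy v3.2.0 or later; the helper `restore_user_data`, the sample text and the `"source"` key are hypothetical names used only for this illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")


def restore_user_data(store_user_data):
    """Add one doc to a DocBin and return the user_data of the restored doc."""
    doc = nlp("A short example.")
    doc.user_data["source"] = "unit-test"  # illustrative key and value
    doc_bin = DocBin(store_user_data=store_user_data)
    doc_bin.add(doc)
    restored = list(doc_bin.get_docs(nlp.vocab))[0]
    return restored.user_data


kept = restore_user_data(True)
dropped = restore_user_data(False)
print(kept.get("source"), "source" in dropped)
```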
📖 Documentation and examples
- Demo projects for floret vectors:
  - `pipelines/floret_vectors_demo`: basic floret vector training and importing.
  - `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
  - `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
v3.1.4: Python 3.10 wheels and support for AppleOps
✨ New features and improvements
- NEW: Binary wheels for Python 3.10.
- NEW: Improve performance on Apple M1 with `AppleOps`: `pip install spacy[apple]`.
- GPU profiling with `spacy.models_with_nvtx_range.v1`.
- Full `mypy` integration in the CI and many type fixes across the code base.
- Added custom `Protocol` classes in `ty.py` to define behavior of pipeline components.
- Support for entity linking visualization in `displacy`.
- Allow overriding vars in `spacy project assets`.
- Standalone `train` function to run the training from Python scripts just like the `spacy train` CLI.
- Support for `spacy-transformers>=1.1.0` with improved IO.
- Support for `thinc>=8.0.11` with improved gradient clipping.
🔴 Bug fixes
- Fix issue #5507: Improve UX for multiprocessing on GPU.
- Fix issue #9137: Fix serialization for `KnowledgeBase.set_entities`.
- Fix issue #9244: Fix vectors for 0-length spans.
- Fix issue #9247: Improve UX for the `DocBin` constructor.
- Fix issue #9254: Allow unicode in a `spacy project` title.
- Fix issue #9263: Make added patterns consistent in the `DependencyMatcher`.
- Fix issue #9305: Restore tokenization timing during evaluation.
- Fix issue #9335: Sync vocab in vectors and sourced components.
- Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
- Fix issue #9404: Create consistent default `textcat` and `textcat_multilabel` configurations.
- Fix issue #9437: Improve UX around `Doc` object creation.
- Fix issue #9465: Fix minor issues with `convert` CLI.
- Fix issue #9500: Include `.pyi` files in the distributed package.
📖 Documentation and examples
- Various updates to the documentation.
- New additions to the spaCy universe:
  - `deplacy`: CUI-based dependency visualizer
  - `ipymarkup`: Visualizations for NER and syntax trees
  - `PhruzzMatcher`: Find fuzzy matches
  - `spacy-huggingface-hub`: Push spaCy pipelines to the Hugging Face Hub
  - `spaCyOpenTapioca`: Entity Linking on Wikidata
  - `spacy-clausie`: Clause-based information extraction system
  - "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
  - "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly
👥 Contributors
@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker
v3.1.3: Bug fixes and UX updates
✨ New features and improvements
- The `v3` of `WandbLogger` now supports optional `run_name` and `entity` parameters.
- Improved UX when providing invalid `pos` values for a `Doc` or `Token`.
🔴 Bug fixes
- Fix issue #9001: Pass alignments to `Matcher` callbacks.
- Fix issue #9009: Include component factories in third-party dependencies resolver.
- Fix issue #9012: Correct type of `config` in `create_pipe`.
- Fix issue #9014: Allow `typer` 0.4 to provide support for both Click 7 and Click 8.
- Fix issue #9033: Fix verbs list for French tokenizer exceptions.
- Fix issue #9059: Pass overrides to subcommands in `spacy project` workflows.
- Fix issue #9074: Improve UX around `repo` and `path` arguments in `spacy project`.
- Fix issue #9084: Fix inference of `epoch_resume` in `spacy pretrain`.
- Fix issue #9163: Handle `spacy-legacy` in `spacy package` dependency detection.
- Fix issue #9211: Include only runtime-relevant dependencies in `spacy package`.
📖 Documentation and examples
- Various updates to the documentation.
- A few additions and updates to the spaCy universe.
- Extended the developer documentation with information about the listener pattern, the `StringStore` and the `Vocab`.
👥 Contributors
@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker
v3.1.2: Improved spancat component and various bugfixes
✨ New features and improvements
- NEW: Provide scores for the `SpanCategorizer` predictions.
- NEW: Broader compatibility with type checkers thanks to `.pyi` stub files.
- NEW: Auto-detect package dependencies in `spacy package`.
- New `INTERSECTS` operator for the Matcher.
- More debugging info for `spacy project` `push` and `pull` commands.
- Allow passing in a precomputed array for speeding up multiple `Span.as_doc` calls.
- The default `da` transformer is now the same as the one from the trained pipelines (`Maltehb/danish-bert-botxo`).
🔴 Bug fixes
- Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
- Fix issue #8774: Ensure `debug data` runs correctly with a custom tokenizer.
- Fix issue #8784: Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
- Fix issue #8796: Respect the `no_skip` value for `spacy project run`.
- Fix issue #8810: Make `ConsoleLogger` flush after each logging line.
- Fix issue #8819: Pass `exclude` when serializing the vocab.
- Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
- Fix issue #8970: Fix `allow_overlap` default for span categorizer scoring.
- Fix issue #8982: Add glossary entry for `_SP`.
- Fix issue #9007: Fix span categorizer training on nested entities.
📖 Documentation and examples
- New developer documentation covering spaCy's internals and code conventions.
- Added a documentation section on preparing training data in spaCy's binary format.
- Updated some error/log messages to be more informative.
- Various updates to the documentation.
- A few new additions to the spaCy universe.
👥 Contributors
@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker
v3.0.7: Bug fixes and base support for Azerbaijani
✨ New features and improvements
- Alpha tokenization support for Azerbaijani.
- Updates for French stop words.
🔴 Bug fixes
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7907: Update `load_lookups` return type and docstring.
- Fix issue #7930: Make `EntityLinker` robust for `nO=None`.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #7992: Fix span offsets for `Matcher(as_spans)` on spans.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns `None` for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8244: Use context manager when reading model file.
- Fix issue #8245: Fix other open calls without context managers.
- Fix issue #8265: Address mypy errors.
- Fix issue #8299: Restrict `pymorphy2` requirement to `pymorphy2` mode in Russian and Ukrainian lemmatizers.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8396: Make `JsonlReader` path optional.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
- Fix issue #8551: Fix duplicate spacy package CLI opts.
👥 Contributors
@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD
v3.1.1: Support for Ancient Greek and various bug fixes
✨ New features and improvements
- Alpha tokenization support for Ancient Greek.
- Implementation of a `noun_chunk` iterator for Dutch.
- Support for `black` & `flake8` as pre-commit hooks.
- New `spacy.ngram_range_suggester.v1` for suggesting a range of n-gram sizes for the `spancat` component.
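To give a feel for what the n-gram range suggester produces, the sketch below looks up `spacy.ngram_range_suggester.v1` in the `misc` registry and applies it to a three-token document. It assumes spaCy v3.1.1 or later; the sample text and the `min_size`/`max_size` values are illustrative, and in a real project the suggester would normally be configured in the `spancat` section of the training config rather than called by hand.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("natural language processing")

# Registered functions can be fetched by name from the registry.
make_suggester = spacy.registry.misc.get("spacy.ngram_range_suggester.v1")
suggester = make_suggester(min_size=1, max_size=2)

# The suggester returns a ragged array of (start, end) token offsets,
# with one length entry per input doc.
candidates = suggester([doc])
n_candidates = int(candidates.lengths[0])
print(n_candidates)
```

For three tokens, sizes 1 and 2 give three unigrams plus two bigrams, so five candidate spans in total.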
🔴 Bug fixes
- Fix issue #8638: Fix Azerbaijani initialization.
- Fix issue #8639: Use 0-vector for OOV lexemes.
- Fix issue #8640: Update lexeme ranks for loaded vectors.
- Fix issue #8651: Fix `ru` and `uk` multiprocessing (with `spawn`).
- Fix issue #8663: Preserve existing `meta` information with `spacy package`.
- Fix issue #8718: Ensure that `replace_pipe` takes disabled components into account.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe
v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more
✨ New features and improvements
- NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
- NEW: Experimental `SpanCategorizer` component for labeling arbitrary and potentially overlapping spans of text.
- NEW: Use predicted annotations during training via the `[training.annotating_components]` config setting.
- Alpha tokenization support for Azerbaijani.
- Part-of-speech tag-based lemmatizers for Catalan and Italian.
- The TextCatCNN and TextCatBOW architectures are now resizable.
- Support updating the `EntityRecognizer` with known incorrect span annotations.
- Auto-generate a pretty `README.md` based on the meta in `spacy package`.
For more details, see the New in v3.1 usage guide.
📦 New trained pipelines
| Package | Language | UPOS | Parser LAS | NER F |
|---|---|---|---|---|
| `ca_core_news_sm` | Catalan | 98.2 | 87.4 | 79.8 |
| `ca_core_news_md` | Catalan | 98.3 | 88.2 | 84.0 |
| `ca_core_news_lg` | Catalan | 98.5 | 88.4 | 84.2 |
| `ca_core_news_trf` | Catalan | 98.9 | 93.0 | 91.2 |
| `da_core_news_trf` | Danish | 98.0 | 85.0 | 82.9 |
⚠️ Upgrading from v3.0
- Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1; however, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the `spacy_version` in your model package meta to `">=3.0.0,<3.2.0"`. If you run into degraded performance, retrain your pipeline with v3.1.
- Use `spacy init fill-config` to update a v3.0 config for v3.1.
- When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in `[initialize.vectors]`.
- Logger warnings have been converted to Python warnings. Use `warnings.filterwarnings` or the new helper method `spacy.errors.filter_warning(action, error_msg='')` to manage warnings.
For more information, see Notes on upgrading from v3.0.
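Because these are now ordinary Python warnings, the standard library is enough to manage them. The sketch below uses only the `warnings` module; the function `noisy_call` and the code `[W000]` are hypothetical stand-ins for a library call and its warning message, not real spaCy identifiers.

```python
import warnings


def noisy_call():
    """Hypothetical stand-in for a library call that emits a Python warning."""
    warnings.warn("[W000] hypothetical spaCy-style warning", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    # Record everything by default, then ignore messages that start with "[W000]".
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message=r"\[W000\]")
    result = noisy_call()

print(result, len(caught))
```

The `message` argument is a regular expression matched against the start of the warning text, which is why the square brackets are escaped.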
🔴 Bug fixes
- Fix issue #7036: Use a context manager when reading model.
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7799: Ensure `spacy ray` command works.
- Fix issue #7807: Show warning if entity ruler runs without patterns.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7930: Make `EntityLinker` robust for `nO=None`.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8099: Update Vietnamese tokenizer.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns `None` for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8265: Address mypy errors.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8388: Don't clobber vectors when loading components from source models.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8441: Add correct types for `Language.pipe` return values.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8559: Fix vectors check for sourced components.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
👥 Contributors
@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD
v2.3.7: Bug fix for download CLI
🔴 Bug fixes
- Fix issue #8286: Fix `spacy download`.