v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more
✨ New features and improvements
- NEW: `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce the size of output docs.
- NEW: Add `ENT_ID` and `ENT_KB_ID` to `Matcher` pattern attributes.
- Support `kb_id` for entities in displaCy from `Doc` input.
- Add `Span.sents` property for spans that cover more than one sentence.
- Add `EntityRuler.remove` to remove patterns by `id`.
- Make the `Tagger` `neg_prefix` configurable.
- Use `Language.pipe` in `Language.evaluate` for more efficient processing.
- Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
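A minimal sketch of how several of the new features might be combined, assuming spaCy 3.2.1 or later is installed. The example sentence, the pattern IDs (`apple`, `google`) and the matcher key `APPLE_BY_ID` are made up for illustration and do not come from the release itself.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline; the sentencizer provides the sentence
# boundaries that Span.sents relies on.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Entity ruler with two illustrative patterns, each carrying an "id".
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Apple", "id": "apple"},
        {"label": "ORG", "pattern": "Google", "id": "google"},
    ]
)

# doc_cleaner goes last so it can reset attributes once all other
# components have run. Here it only resets doc.tensor.
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

text = "Apple hired staff. Google did too. Both grew."
doc = nlp(text)

# Span.sents: a span that crosses a sentence boundary yields every
# sentence it overlaps.
span = doc[2:6]  # "staff. Google did"
sentences = list(span.sents)
print([s.text for s in sentences])

# Matcher with the new ENT_ID attribute: match tokens by the entity ID
# that the entity ruler assigned.
matcher = Matcher(nlp.vocab)
matcher.add("APPLE_BY_ID", [[{"ENT_ID": "apple"}]])
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)

# EntityRuler.remove: drop all patterns registered under one ID.
ruler.remove("apple")
remaining = [ent.text for ent in nlp(text).ents]
print(remaining)
```

Placing `doc_cleaner` at the end of the pipeline matters because it only resets attributes that already exist when it runs; the `attrs` mapping can list other `Doc` attributes to reset as well.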
🔴 Bug fixes
- Fix issue #9638: Make `JsonlCorpus` path optional again.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9658: Improve error message for incorrect `.jsonl` paths in `EntityRuler`.
- Fix issue #9674: Fix language-specific factory handling in package CLI.
- Fix issue #9694: Convert labels to strings for README in package CLI.
- Fix issue #9697: Exclude strings from source vector checks.
- Fix issue #9701: Allow `Scorer.score_spans` to handle predicted docs with missing annotation.
- Fix issue #9722: Initialize `parser` from reference parse rather than aligned example.
- Fix issue #9764: Set annotations more efficiently in `tagger` and `morphologizer`.
📖 Documentation and examples
- Various documentation updates: `init_tok2vec` after pretraining, batch contract for listeners.
- New additions to the spaCy universe:
  - eng-spacysentiment: Sentiment analysis for English.
  - Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
👥 Contributors
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar