Releases: openvinotoolkit/openvino_tokenizers
2025.3.0.0
What's Changed
Functional Changes
- Support More Tokenizers With 2 Inputs by @apaniukov in #526
- add Truncate and CombineSegments to node factory by @pavel-esir in #530
- add optional pad_right input to RaggedToDense op by @pavel-esir in #533
JS Bindings
- [JS] Add types for openvino-tokenizers by @Retribution98 in #497
Build, CI and GHA
- [GHA] Enabled product manifest.yml by @mryzhov in #496
- [WIN] Set binaries details by @mryzhov in #502
- [GHA] Use OV provider for testing dependabot changes by @mryzhov in #519
- [ICU] Replace py -3 on windows by @mryzhov in #531
- [CI] [GHA] Add manylinux build by @akashchi in #538
Other Changes
- Change Typing According to PEP 585 by @apaniukov in #528
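As a rough illustration of what a PEP 585 typing change looks like in general: since Python 3.9, built-in collections such as list and dict can be used directly as generic types, so the aliases from the typing module are no longer needed. The function names below are hypothetical and are not taken from the openvino_tokenizers code base.

```python
# Before PEP 585: generic aliases are imported from the typing module.
from typing import Dict, List


def count_words_old(batch: List[str]) -> Dict[str, int]:
    # Hypothetical helper: maps each text to its whitespace-separated word count.
    return {text: len(text.split()) for text in batch}


# After PEP 585 (Python 3.9+): built-in collections are used as generics directly.
def count_words_new(batch: list[str]) -> dict[str, int]:
    # Same behavior as above; only the annotations differ.
    return {text: len(text.split()) for text in batch}
```

Both functions behave identically at runtime; only the annotation style changes.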
New Contributors
Full Changelog: 2025.2.1.0...2025.3.0.0
2025.2.0.0
What's Changed
- [ICU] Set openvino compile definitions by @mryzhov in #438
- Support complex tensors for equal by @NingLi670 in #439
- Write into rt_info string instead of python enum object. by @pavel-esir in #442
- ICU compile flags by @mryzhov in #440
- Switch workflows to aks-linux-medium runner by @ababushk in #447
- Use const pointers for const tensors by @praasz in #450
- Add JiT Compilation to PCRE2 by @apaniukov in #449
- [GHA] Use separate artifacts folder structure by @akladiev in #451
- Bumped cmake req version by @MichalCiecioraIntel in #452
- Merge RegexSplit Steps by @apaniukov in #456
- Check Binary Compatibility Before Patching by @apaniukov in #459
- [ICU] Added crossplatform compilation flags by @mryzhov in #461
- [GHA] use dev package for build by @mryzhov in #453
- Skip OpenVINO_DIR by @apaniukov in #462
- Add Support For Falcon3 Tokenizer by @apaniukov in #463
- [GHA] Improving caching by @mryzhov in #464
- [JS] Upgrade the js package versions to the upcoming releases by @Retribution98 in #468
- Improve pair input by @pavel-esir in #455
- Ensure PCRE2 JiT is Available by @apaniukov in #469
- Alternative fix for JIT in PCRE2 by @ilya-lavrenov in #470
- Support Sentencepiece Chars Tokenizer by @apaniukov in #476
- Poetry dependency management by @mryzhov in #484
- [GGUF] Create tokenizers factory for GGUF support in OpenVINO GenAI by @rkazants in #494
New Contributors
- @NingLi670 made their first contribution in #439
- @MichalCiecioraIntel made their first contribution in #452
- @almilosz made their first contribution in #482
Full Changelog: 2025.1.0.0...2025.2.0.0
2025.1.0.0
What's Changed
- Replace openvino.runtime imports with openvino by @helena-intel in #378
- [JS] Fix path to libraries on linux arm by @vishniakov-nikolai in #384
- Fix RegexSplit by @pavel-esir in #390
- Support More Models by @apaniukov in #391
- [GHA] Use OV provider on macOS by @mryzhov in #392
- [GHA] Explicitly set VS 2022 version by @mryzhov in #393
- [GHA] added OV TF tests by @mryzhov in #396
- TF: dropped translate_squeeze_op by @ilya-lavrenov in #397
- Turn on StringPack/Unpack on master from opset15: 3rd attempt by @pavel-esir in #386
- Update Normalization by @apaniukov in #401
- [ICU] Support cmake < 3.16 by @mryzhov in #403
- [CI] tokenizers ccache by @mryzhov in #111
- [CI] Introduced GHA Overall_Status job by @mryzhov in #406
- [Coverity] Enabling coverity scan by @akazakov-github in #400
- [JS] Update openvino-tokenizers-node package version to 2025.0.0 by @Retribution98 in #409
- Pass max_length to convert_model, add layer tests to RaggedToDense, CombineSegments by @pavel-esir in #362
- [ICU] Do not use debug postfix on mac by @mryzhov in #412
- Fix Skips Detection For CharsMap, Special Tokens Detection by @apaniukov in #411
- Fixed compilation with external protobuf / abseil by @ilya-lavrenov in #414
- [Build] Fix ICU build for macOS by @ilya-lavrenov in #413
- [ICU] Copy icu artifacts by @mryzhov in #416
- Change ConversionExtension to tensorflow::ConversionExtension by @olpipi in #405
- Support GGUF, Update README.md by @apaniukov in #417
- CMAKE: reuse ccache for ICU by @ilya-lavrenov in #404
- [JS] Fix the rpath in the openvino-tokenizer build for npm by @Retribution98 in #425
- Allow build w/o python by @mryzhov in #428
- [GHA] Save tokenizers artifacts to cloud by @akladiev in #419
- Add Unigram Tokenizer Implementation by @apaniukov in #431
- Propagate linker flags for ICU build by @aobolensk in #432
- Add TemplateProcessor to rt_info and Update Extension Finder by @apaniukov in #433
- Fix CLI Arg Default Value by @apaniukov in #434
- Switch workflows to aks-linux-medium runner by @ababushk in #448
- [MERGE] Ported compile flags fixes by @mryzhov in #446
New Contributors
- @akazakov-github made their first contribution in #400
- @Retribution98 made their first contribution in #409
- @olpipi made their first contribution in #405
- @aobolensk made their first contribution in #432
Full Changelog: 2025.0.0.0...2025.1.0.0
2025.0.0.0
What's Changed
- Add max_length Option to CLI Convert Tool by @apaniukov in #309
- Update Regex For Clean Tokenization Spaces by @apaniukov in #314
- Suppress warnings from 3rd party headers by @ilya-lavrenov in #316
- Update Prepend Regex by @apaniukov in #317
- Add C++ example to README by @helena-intel in #320
- Turn on UTF8Validate.REPLACE by default by @pavel-esir in #322
- make skip_tokens an input for VocabDecode (parametrize detokenization/decoding) by @pavel-esir in #325
- [JS] Add sources for nodejs package of tokenizers by @vishniakov-nikolai in #312
- Port print debug errors only if ENV VAR is set to master by @pavel-esir in #348
- [bug] Fix set tensor name for attention_mask by @praasz in #352
- Support GLM Edge and ModernBERT by @apaniukov in #356
- Support BART-G2P Tokenizer by @apaniukov in #359
- Add Tests For WordLevel Tokenizer by @apaniukov in #360
- Add information about full Tokenizers version by @ilya-lavrenov in #365
- Wordpiece Detokenizer Support by @apaniukov in #369
- Write Detailed Version To XML by @apaniukov in #372
New Contributors
- @sfblackl-intel made their first contribution in #330
- @praasz made their first contribution in #352
- @jacekpawlak made their first contribution in #370
Full Changelog: 2024.6.0.0...2025.0.0.0
2024.6.0.0
What's Changed
- Port "Update Prepend Regex" by @apaniukov in #319
- Utf8 turn on by @pavel-esir in #326
- Port Fix For Llava Model To Release by @apaniukov in #333
- Print debug errors only if ENV VAR is set by @pavel-esir in #334
- Bump product version to 2024.6 by @akladiev in #336
- [MERGE][2024.6]Reverted w/a with hardcoded paths #329 by @mryzhov in #341
- GitHub workspace w/a on Windows (#342) by @akladiev in #343
- Use RC1 by @ilya-lavrenov in #337
Full Changelog: 2024.5.0.0...2024.6.0.0
2024.5.0.0
What's Changed
New and Reimplemented Operations
- Add Skip Tokens Node by @apaniukov in #264
- Optimize CombineSegments by @pavel-esir in #265
- Add Charsmap Operation by @apaniukov in #267
- Improve BPE by @pavel-esir in #281
- Reimplement WordPiece tokenization by @pavel-esir in #298
Improvements and Compatibility
- Store tokenizer conversion params in rt_info / refactor passing params by @pavel-esir in #268
- add packages versions to rt_info by @pavel-esir in #292
- Fix GLM4 Tokenization by @apaniukov in #280
Build Changes
- Dynamic linking with msvc runtime by @mryzhov in #260
- Linking with sentencepiece_train by @mryzhov in #272
Full Changelog: 2024.4.1.0...2024.5.0.0
2024.4.1.0
2024.4.0.0
What's Changed
- Reduce icud.dll by @mryzhov in #196
- Split implementation without FastTokenizer by @pavel-esir in #208
- Align Sentencepiece Model Vocab by @apaniukov in #205
- Ops Optimization by @apaniukov in #219
- [TF FE][Tokenizers] Avoid dependency from TF FE in tokenizers by @rkazants in #227
- Add Truncation To Sentencepiece by @apaniukov in #225
- reimplement BPE tokenizer by @pavel-esir in #220
- [TF FE][Tokenizers] Optimize TF FE extensions by @rkazants in #232
- Enabled build w/o FastTokenizers by @ilya-lavrenov in #237
- Win debug build by @mryzhov in #218
- Switch To BPE Backend by @apaniukov in #235
- Add UTF-8 validation by @pavel-esir in #242
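For readers unfamiliar with UTF-8 validation, the general idea can be sketched in plain Python: invalid byte sequences are either replaced with the Unicode replacement character (U+FFFD) or dropped. This is only a conceptual sketch using the standard library; the helper name and the "drop" mode are assumptions, and the actual operation in this project may behave differently.

```python
def validate_utf8(raw: bytes, replace: bool = True) -> str:
    """Hypothetical helper, not part of openvino_tokenizers.

    Decodes raw bytes as UTF-8. Invalid sequences are replaced with
    U+FFFD when replace is True, and silently dropped otherwise.
    """
    return raw.decode("utf-8", errors="replace" if replace else "ignore")


# 0xFF can never appear in well-formed UTF-8, so it is treated as invalid.
broken = b"ab\xffcd"
replaced = validate_utf8(broken)                 # 'ab\ufffdcd'
dropped = validate_utf8(broken, replace=False)   # 'abcd'
```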
Full Changelog: 2024.3.0.0...2024.4.0.0
2024.3.0.0
What's Changed
Improvements
- Fix Tokenization of Special Tokens in Sentencepiece by @apaniukov in #173
- Add Left Padding and Padding to Max Length by @apaniukov in #152
- Sentencepiece Tokenization Improvements by @apaniukov in #176
- BPE Fallback for Sentencepiece by @apaniukov in #181
- Update Sentencepiece Parsing by @apaniukov in #185
- Fix Decoding For Long Tokens by @apaniukov in #187
- Sentencepiece Left Padding by @apaniukov in #186
- Update Remaining Inputs Detection During Model Connection by @apaniukov in #190
- Update rt_info by @apaniukov in #191
- Truncate Left Side When Left Padding Is Used by @apaniukov in #192
- Add Separate Special Token Handling To Sentencepiece by @apaniukov in #198
- Support GLM-4 Tokenizer by @apaniukov in #202
- use PCRE2 fallback for RegexNormalization by @pavel-esir in #203
Changes
Build, Packaging and CI
- Package into correct dirs by @Wovchena in #148
- Set cmake policies by @ilya-lavrenov in #157
- Fix usage of protobuf_MODULE_COMPATIBLE by @ilya-lavrenov in #158
- Build release by default (#162) by @ilya-lavrenov in #163
- Fixed conda-forge on Windows (#164) by @ilya-lavrenov in #165
- Package into correct dirs (#155) by @Wovchena in #167
- New python build scheme by @ilya-lavrenov in #166
- Support build from OpenVINO wheel only by @mryzhov in #178
- Configure cmake similar to GenAI (#175) by @Wovchena in #180
- Patch icu external project by @mryzhov in #184
- [CI] Build from OV wheel by @mryzhov in #183
- [GHA] Set permissions read-all by @mryzhov in #189
- [CI] Fixed Jenkins artifacts by @mryzhov in #195
- [MERGE] reduced icudt.dll (#196) by @mryzhov in #201
Full Changelog: 2024.2.0.0...2024.3.0.0
2024.2.0.0
What's Changed
- Add support for left padding in Wordpiece, BPE and tiktoken-based tokenizers
- Enhanced handling of special tokens
- Add support for padding to a particular length
- New option to control whether special tokens are added during tokenization
- Support Punctuation Pretokenizer
- Enhance tokenizer postprocessing parser for better model coverage
- Add StringToHashBucketFast Tensorflow Translator
- Optimize EqualStr and VocabEncoder Operations
- Add Benchmarking Script
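The padding items above (left padding and padding to a particular length) can be illustrated with a small pure-Python sketch. The function name, its parameters, and the pad id of 0 are assumptions made for illustration; this is not the library's API and does not reflect how the operations are implemented.

```python
def pad_batch(batch, pad_id=0, max_length=None, padding_side="right"):
    """Hypothetical helper: pads token id lists to a common length.

    Returns (padded_ids, attention_masks). If max_length is None, the
    longest sequence in the batch sets the target length.
    """
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")
    target = max_length if max_length is not None else max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)  # sequences longer than target are left as-is
        pad, pad_mask = [pad_id] * n_pad, [0] * n_pad
        real_mask = [1] * len(ids)
        if padding_side == "left":
            padded.append(pad + ids)
            masks.append(pad_mask + real_mask)
        else:
            padded.append(ids + pad)
            masks.append(real_mask + pad_mask)
    return padded, masks


left_ids, left_mask = pad_batch([[1, 2, 3], [4]], padding_side="left")
fixed_ids, fixed_mask = pad_batch([[1, 2, 3], [4]], max_length=4)
```

Left padding is typically preferred for decoder-only generation, where the real tokens should end at the last position of each row.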
Full Changelog: 2024.1.0.2...2024.2.0.0