Releases · openvinotoolkit/openvino.genai
2025.3.0.0
What's Changed
- Bump product version 2025.3 by @akladiev in #2255
- Implement SnapKV by @vshampor in #2067
- [WWB] Additional processing of native phi4mm by @nikita-savelyevv in #2276
- Update ov genai version in samples by @as-suvorov in #2275
- use chat templates in vlm by @eaidova in #2279
- Fix 'Unsupported property' failure when prompt_lookup is set to False by @sbalandi in #2240
- Force the PA implementation in the llm-bench by default by @sbalandi in #2271
- Update Whisper README.md as "--disable-stateful" is no longer required to export models for NPU by @luke-lin-vmc in #2249
- Removed 'slices' from EncodedImage by @popovaan in #2258
- support text embeddings in llm_bench by @eaidova in #2269
- [wwb]: load transformers model first, then only trust_remote_code by @eaidova in #2270
- [GHA] Coverity pipeline fixes by @mryzhov in #2283
- [GHA][DEV] Fixed coverity path creation by @mryzhov in #2285
- [GHA][DEV] Save coverity tool to cache by @mryzhov in #2286
- [GHA][DEV] Set cache key for coverity tool by @mryzhov in #2288
- Image generation multiconcurrency (#2190) by @dkalinowski in #2284
- [GGUF] Support GGUF format for tokenizers and detokenizers by @rkazants in #2263
- Unskip whisper tests & update optimum-intel by @as-suvorov in #2247
- Update README.md with text-to-speech by @rkazants in #2294
- add new chat template for qwen3 by @eaidova in #2297
- [DOCS] Correct cmd-line for TTS conversion by @rkazants in #2303
- [GHA] Enabled product manifest.yml by @mryzhov in #2281
- Bump the npm_and_yarn group across 3 directories with 2 updates by @dependabot[bot] in #2309
- [GHA] Save artifacts to cloud share by @akladiev in #1943
- [GHA][COVERITY] Added manual trigger by @mryzhov in #2289
- [GHA] Fix missing condition for Extract Artifacts step by @akladiev in #2313
- [llm bench] Turn off PA backend for VLM by @sbalandi in #2312
- [llm_bench] Add setting of max_num_batched_tokens for SchedulerConfig by @sbalandi in #2316
- [GHA] Fix missing condition for LLM & VLM test by @sammysun0711 in #2326
- [Test] Skip gguf test on MacOS due to sporadic failure by @sammysun0711 in #2328
- [GGUF] support Qwen3 architecture by @TianmengChen in #2273
- [llm_bench] Increase max_num_batched_tokens to the largest positive integer by @sbalandi in #2327
- Bump aquasecurity/trivy-action from 0.30.0 to 0.31.0 by @dependabot[bot] in #2310
- Bump actions/download-artifact from 4.1.9 to 4.3.0 by @dependabot[bot] in #2315
- Fix system_message forwarding by @Wovchena in #2325
- Disabled crop of the prompt for minicpmv. by @andreyanufr in #2320
- [llm bench] Fix setting ATTENTION_BACKEND to plugin config in case of fallback to Optimum by @sbalandi in #2332
- Bump brace-expansion from 2.0.1 to 2.0.2 in /.github/actions/install_wheel in the npm_and_yarn group across 1 directory by @dependabot[bot] in #2338
- Fix multinomial sampling for PromptLookupDecoding by @sbalandi in #2331
- [llm bench] Avoid using parameters not supported by beam_search for the beam_search case by @sbalandi in #2336
- Update Export Requirements by @apaniukov in #2342
- [GGUF] Serialize Generated OV Model for Faster LLMPipeline Init by @sammysun0711 in #2218
- Fixed system message in chat mode. by @popovaan in #2343
- Bump librosa from 0.10.2.post1 to 0.11.0 in /samples by @dependabot[bot] in #2346
- [Test][GGUF] Add DeepSeek-R1-Distill-Qwen GGUF in CI by @sammysun0711 in #2329
- [llm_bench] Remove default scheduler config by @sbalandi in #2341
- master: add Phi-4-multimodal-instruct by @Wovchena in #2264
- Fix paths with unicode for tokenizers by @yatarkan in #2337
- [WWB] Add try-except block for processor loading by @nikita-savelyevv in #2352
- [WWB] Bring back eager attention implementation by default by @nikita-savelyevv in #2353
- fix supported models link in TTS samples by @eaidova in #2300
- StaticLLMPipeline: Add tests on caching by @smirnov-alexey in #1905
- [WWB] Fix loading the tokenizer for VLMs by @l-bat in #2351
- Pass Scheduler Config for VLM Pipeline in WhoWhatBenchmark. by @popovaan in #2318
- Remove misspelled CMAKE_CURRENT_SOUCE_DIR by @Wovchena in #2362
- Increase timeout for LLM & VLM by @Wovchena in #2359
- [llm bench] Fix setting ATTENTION_BACKEND to plugin config in case of fallback to Optimum for VLM by @sbalandi in #2361
- Support multi images for vlm benchmarking in samples and llm_bench by @wgzintel in #2197
- CB: Hetero pipeline parallel support by @WeldonWangwang in #2227
- Update conversion instructions by @adrianboguszewski in #2287
- Merge stderr from failed samples by @Wovchena in #2156
- Revert cache folder by @Wovchena in #2372
- Update README in Node.js API by @almilosz in #2374
- [Docs] Rework home page by @yatarkan in #2368
- Align PromptLookupDecoding with greedy when dynamic_split_fuse works by @sbalandi in #2360
- Support collecting latency for transformers v4.52.0 by @wgzintel in #2373
- Bump diffusers from 0.33.1 to 0.34.0 in /samples by @dependabot[bot] in #2381
- Bump diffusers from 0.33.1 to 0.34.0 in /tests/python_tests by @dependabot[bot] in #2380
- Structured Output generation with XGrammar by @pavel-esir in #2295
- Disable XGrammar on Android by @apaniukov in #2389
- [wwb] Take prompts from different categories for the default dataset for VLM by @sbalandi in #2349
- Fix for cloning NPU Image Generation pipelines (#2376) by @dkalinowski in #2393
- Set add_special_tokens=false for image tags in MiniCPM. by @popovaan in #2404
- Fix missing use cases for inpainting models and defining use case with relative path by @sbalandi in #2387
- temporary skip failing whisper tests by @pavel-esir in #2396
- Fix test_vlm_npu_no_exception by @AlexanderKalistratov in #2388
- Bump timm from 1.0.15 to 1.0.16 by @dependabot[bot] in #2390
- Optimize VisionEncoderQwen2VL::encode by @usst...
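
A headline addition in this release is GGUF support for LLMPipeline (see the GGUF items above). As a minimal sketch, assuming LLMPipeline accepts a path to a .gguf file as described in the GenAI GGUF documentation (the file name below is illustrative only):

```python
import openvino_genai

# Illustrative file name; any LLM in GGUF format supported by the GGUF reader should work.
pipe = openvino_genai.LLMPipeline("qwen2.5-0.5b-instruct-q4_0.gguf", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=64))
```

The "Serialize Generated OV Model for Faster LLMPipeline Init" item suggests the model converted from GGUF can be reused so subsequent pipeline construction is faster.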
2025.2.0.0
What's Changed
- [GHA] Replaced visual_language_chat_sample-ubuntu-minicpm_v2_6 job by @mryzhov in #1909
- [GHA] Replaced cpp-chat_sample-ubuntu pipeline by @mryzhov in #1913
- Add support of Prompt Lookup decoding to llm bench by @sbalandi in #1917
- [GHA] Introduce SDL pipeline by @mryzhov in #1924
- Switch Download OpenVINO step to aks-medium-runner by @ababushk in #1889
- Bump product version 2025.2 by @akladiev in #1920
- [GHA] Replaced cpp-continuous-batching by @mryzhov in #1910
- Update dependencies in samples by @ilya-lavrenov in #1925
- phi3_v: add universal tag by @Wovchena in #1921
- Fix image_id unary error by @rkazants in #1927
- [Docs] Image generation use case by @yatarkan in #1877
- Add perf metrics for CB VLM by @pavel-esir in #1897
- Enhance the flexibility of the c streamer by @apinge in #1941
- add Gemma3 LLM to supported models by @eaidova in #1942
- Added GPTQ/AWQ support with HF Transformers by @AlexKoff88 in #1933
- Add --static_reshape option to llm_bench, to force static reshape + compilation at pipeline creation by @RyanMetcalfeInt8 in #1851
- benchmark_image_gen: Add --reshape option, and ability to specify multiple devices by @RyanMetcalfeInt8 in #1878
- Revert perf regression changes by @dkalinowski in #1949
- Add running greedy_causal_lm for JS to the sample tests by @Retribution98 in #1930
- [Docs] Add VLM use case by @yatarkan in #1907
- Added possibility to generate base text on GPU for text evaluation. by @andreyanufr in #1945
- VLM: change infer to start_async/wait by @dkalinowski in #1948
- [WWB]: Addressed issues with validation on Windows by @AlexKoff88 in #1953
- [GHA] Remove bandit pipeline by @mryzhov in #1956
- Disable MSVC debug assertions, addressing false positives in iterator checking by @apinge in #1952
- [GHA] Replaced genai-tools pipeline by @mryzhov in #1954
- configurable delay by @eaidova in #1963
- Update cast of tensor data pointer for const tensors by @praasz in #1966
- Remove tokens after EOS for draft model for speculative decoding by @sbalandi in #1951
- Add testcase for chat_sample_c by @apinge in #1934
- Skip warm-up iteration during llm_bench results averaging by @nikita-savelyevv in #1972
- Reset pipeline cache usage statistics on each generate call by @vshampor in #1961
- [Docs] Update models, rebuild on push by @yatarkan in #1922
- Updated logic whether PA backend is explicitly required by @ilya-lavrenov in #1976
- [GHA] [MAC] Use latest_available_commit OV artifacts by @mryzhov in #1977
- [GHA] Set HF_TOKEN by @mryzhov in #1986
- [GHA] Setup ov_cache by @mryzhov in #1962
- [GHA] Changed cleanup runner by @mryzhov in #1995
- Added mutex to methods which use blocks map. by @popovaan in #1975
- Add documentation and sample on KV cache eviction by @vshampor in #1960
- StaticLLMPipeline: Simplify compile_model call logic by @smirnov-alexey in #1915
- Fix reshape in heterogeneous SD samples by @helena-intel in #1994
- Update tokenizers by @mryzhov in #2002
- docs: fix max_new_tokens option description by @tpragasa in #1987
- [Docs] Add speech recognition with whisper use case by @yatarkan in #1971
- Revert "VLM: change infer to start_async/wait " by @ilya-lavrenov in #2004
- Revert "Revert perf regression changes" by @ilya-lavrenov in #2003
- Set xfail to failing tests. by @popovaan in #2006
- [GHA] Use cpack bindings in the samples tests by @mryzhov in #1979
- [Docs]: add Phi3.5MoE to supported models by @eaidova in #2012
- add TensorArt SD3.5 models to supported list by @eaidova in #2013
- Move MiniCPM resampler to vision encoder by @popovaan in #1997
- [GHA] Fix ccache on Win/Mac by @mryzhov in #2008
- samples/python/text_generation/lora.py -> samples/python/text_generation/lora_greedy_causal_lm.py by @Wovchena in #2007
- Whisper timestamp fix by @RyanMetcalfeInt8 in #1918
- Unskip Qwen2-VL-2B-Instruct sample test by @as-suvorov in #1970
- [GHA] Use developer openvino packages by @mryzhov in #2000
- Added NNCF to export-requirements.txt by @ilya-lavrenov in #1974
- Bump py-build-cmake from 0.4.2 to 0.4.3 by @dependabot in #2016
- Use OV_CACHE for python tests by @as-suvorov in #2020
- [GHA] Disable HTTP calls to the Hugging Face Hub by @mryzhov in #2021
- Add python bindings to VLMPipeline for encrypted models by @olpipi in #1916
- Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot in #2017
- CB: auto plugin support by @ilya-lavrenov in #2034
- timeout-minutes: 90 by @Wovchena in #2039
- Bump diffusers from 0.32.2 to 0.33.1 by @dependabot in #2031
- Bump diffusers from 0.32.2 to 0.33.1 in /samples by @dependabot in #2032
- Enable cache and add cache encryption to samples by @olpipi in #1990
- Fix VLM concurrency by @mzegla in #2022
- Move Phi3 vision projection model to vision encoder by @popovaan in #2009
- Fix spelling by @Wovchena in #2025
- [Docs] Enable autogenerated samples docs by @yatarkan in #2029
- Synchronize entire embeddings calculation phase (#1967) by @mzegla in #1993
- Add missing finish reason set when finishing the sequence by @mzegla in #2036
- Bump image-size from 1.2.0 to 1.2.1 in /site in the npm_and_yarn group across 1 directory by @dependabot in #1998
- Add README for C Samples by @apinge in #2040
- Use ov_cache for test_vlm_pipeline by @as-suvorov in #2042
- increase timeouts by @Wovchena in #2041
- [GHA] Use azure runners for python tests by @mryzhov in #1991
- [WWB]: move diffusers imports closer to usage by @eaidova in #2046
- [llm bench] Move calculation of memory consumption to memory_monitor tool by @sbalandi in #1...
2025.1.0.0
What's Changed
- skip failing Chinese prompt on Win by @pavel-esir in #1573
- Bump product version 2025.1 by @akladiev in #1571
- Bump tokenizers submodule by @akladiev in #1575
- [LLM_BENCH] relax md5 checks and allow passing cb config without use_cb by @eaidova in #1570
- [VLM] Add Qwen2VL by @yatarkan in #1553
- Fix links, remind about ABI by @Wovchena in #1585
- Add nightly to instructions similar to requirements by @Wovchena in #1582
- GHA: use nightly from 2025.1.0 by @ilya-lavrenov in #1577
- NPU LLM Pipeline: Switch to STATEFUL by default by @dmatveev in #1561
- Verify not empty rendered chat template by @yatarkan in #1574
- [RTTI] Fix passes rtti definitions by @t-jankowski in #1588
- Test add_special_tokens properly by @pavel-esir in #1586
- Add indentation for llm_bench json report dumping by @nikita-savelyevv in #1584
- prioritize config model type under path-based task determination by @eaidova in #1587
- Replace openvino.runtime imports with openvino by @helena-intel in #1579
- Add tests for Whisper static pipeline by @eshiryae in #1250
- CB: removed handle_dropped() misuse by @ilya-lavrenov in #1594
- Bump timm from 1.0.13 to 1.0.14 by @dependabot in #1595
- Update samples readme by @olpipi in #1545
- [ Speculative decoding ][ Prompt lookup ] Enable Perf Metrics for assisting pipelines by @iefode in #1599
- [LLM] [NPU] StaticLLMPipeline: Export blob by @smirnov-alexey in #1601
- [llm_bench] enable prompt permutations to prevent prefix caching and fix vlm image load by @eaidova in #1607
- LLM: use set_output_seq_len instead of WA by @ilya-lavrenov in #1611
- CB: support different number of K and V heads per layer by @ilya-lavrenov in #1610
- LLM: fixed Slice / Gather of last MatMul by @ilya-lavrenov in #1616
- Switch to VS 2022 by @mryzhov in #1598
- Add Phi-3.5-vision-instruct and Phi-3-vision-128k-instruct by @Wovchena in #1609
- Whisper pipeline: apply slice matmul by @as-suvorov in #1623
- GHA: use OV master in mac.yml by @ilya-lavrenov in #1622
- [Image Generation] Image2Image for FLUX by @likholat in #1621
- add missed ignore_eos in generation config by @eaidova in #1625
- Master increase priority for rt info to fix Phi-3.5-vision-instruct and Phi-3-vision-128k-instruct by @Wovchena in #1626
- Correct model name by @wgzintel in #1624
- Token rotation by @vshampor in #987
- Whisper pipeline: use Sampler by @as-suvorov in #1615
- Fix setting eos_token_id with kwarg by @Wovchena in #1629
- Extract cacheopt E2E tests into separate test matrix field by @vshampor in #1630
- [CB] Split token streaming and generation to different threads for all CB based pipelines by @iefode in #1544
- Don't silence an error if a file can't be opened by @Wovchena in #1620
- [CMAKE]: use different version for macOS arm64 by @ilya-lavrenov in #1632
- Test invalid fields assignment raises in GenerationConfig by @Wovchena in #1633
- do_sample=False for NPU in chat_sample, add NPU to README by @helena-intel in #1637
- [JS] Add GenAI Node.js bindings by @vishniakov-nikolai in #1193
- CB: preparation for relying on KV cache precisions from plugins by @ilya-lavrenov in #1634
- [LLM bench] Support providing adapter config mode by @eaidova in #1644
- Automatically apply chat template in non-chat scenarios by @sbalandi in #1533
- beam_search_causal_lm.cpp: delete wrong comment by @Wovchena in #1639
- [WWB]: Fixed chat template usage in VLM GenAI pipeline by @AlexKoff88 in #1643
- [WWB]: Fixed nano-Llava preprocessor selection by @AlexKoff88 in #1646
- [WWB]: Added config to preprocessor call in VLMs by @AlexKoff88 in #1638
- CB: remove DeviceConfig class by @ilya-lavrenov in #1640
- [WWB]: Added initialization of nano-llava in case of Transformers model by @AlexKoff88 in #1649
- WWB: simplify code around start_chat / use_template by @ilya-lavrenov in #1650
- Tokenizers update by @ilya-lavrenov in #1653
- DOCS: reorganized supported models for image generation by @ilya-lavrenov in #1655
- Fix using llm_bench/wwb with version w/o apply_chat_template by @sbalandi in #1651
- Fix Qwen2VL generation without images by @yatarkan in #1645
- Parallel sampling with threadpool by @mzegla in #1252
- [Coverity] Enabling coverity scan by @akazakov-github in #1657
- [ CB ] Fix streaming in case of empty outputs by @iefode in #1647
- Allow overriding eos_token_id by @Wovchena in #1654
- CB: remove GenerationHandle:back by @ilya-lavrenov in #1662
- Fix tiny-random-llava-next in VLM Pipeline by @yatarkan in #1660
- [CB] Add KVHeadConfig parameters to PagedAttention's rt_info by @sshlyapn in #1666
- Bump py-build-cmake from 0.3.4 to 0.4.0 by @dependabot in #1668
- pin optimum version by @pavel-esir in #1675
- [LLM] Enabled CB by default by @ilya-lavrenov in #1455
- SAMPLER: fixed hang during destruction of ThreadPool by @ilya-lavrenov in #1681
- CB: use optimized scheduler config for cases when user explicitly asked CB backend by @ilya-lavrenov in #1679
- [CB] Return Block manager asserts to destructors by @iefode in #1569
- phi3_v: allow images, remove unused var by @Wovchena in #1670
- [Image Generation] Inpainting for FLUX by @likholat in #1685
- [WWB]: Added support for SchedulerConfig in LLMPipeline by @AlexKoff88 in #1671
- Add LongBench validation by @l-bat in #1220
- Fix Tokenizer for several added special tokens by @pavel-esir in #1659
- Unpin optimum-intel version by @ilya-lavrenov in #1680
- Image generation: proper error message when encode() is used w/o encoder passed to ctor by @ilya-lavrenov in #1683
- Fix excluding stop str from output for some tokenizer by @sbalandi in #1676
- [VLM] Fix chat template fallback in chat mode with defined system message by @yatarkan in https://github.com/openvinotoolkit/openvino.genai/pull/...
2025.0.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
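
For orientation, a minimal text-generation sketch with the openvino_genai Python package (the model folder name is only an example; the model is assumed to have been exported to OpenVINO IR beforehand, e.g. with optimum-cli):

```python
import openvino_genai

# "TinyLlama-1.1B-Chat-v1.0" is an example folder holding a model exported to OpenVINO IR.
pipe = openvino_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```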
2024.6.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
2024.5.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
2024.4.1.0
Please check out the latest documentation pages related to the new openvino_genai package!
What's Changed
- Bump OV version to 2024.4.1 by @akladiev in #894
- Update requirements.txt and add requirements_2024.4.txt by @wgzintel in #893
Full Changelog: 2024.4.0.0...2024.4.1.0
2024.4.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
What's Changed
- Support chat conversation for StaticLLMPipeline by @TolyaTalamanov in #580
- Prefix caching. by @popovaan in #639
- Allow to build GenAI with OpenVINO via extra modules by @ilya-lavrenov in #726
- Simplified partial preemption algorithm. by @popovaan in #730
- Add set_chat_template by @Wovchena in #734
- Detect KV cache sequence length axis by @as-suvorov in #744
- Enable u8 KV cache precision for CB by @ilya-lavrenov in #759
- Add test case for native pytorch model by @wgzintel in #722
- Prefix caching improvements by @popovaan in #758
- Add USS metric by @wgzintel in #762
- Prefix caching optimization by @popovaan in #785
- Transition to default int4 compression configs from optimum-intel by @nikita-savelyevv in #689
- Control KV-cache size for StaticLLMPipeline by @TolyaTalamanov in #795
- [2024.4] update optimum intel commit to include mxfp4 conversion by @eaidova in #828
- [2024.4] use perf metrics for genai in llm bench by @eaidova in #830
- Update Pybind to version 13 by @mryzhov in #836
- Introduce stop_strings and stop_token_ids sampling params [2024.4 base] by @mzegla in #817
- StaticLLMPipeline: Handle single element list of prompts by @TolyaTalamanov in #848
- Fix Meta-Llama-3.1-8B-Instruct chat template by @pavel-esir in #846
- Add GPU support for continuous batching [2024.4] by @sshlyapn in #858
Full Changelog: 2024.3.0.0...2024.4.0.0
2024.3.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
2024.2.0.0
Please check out the latest documentation pages related to the new openvino_genai package!