
Commit 54b7230 (parent: d7bf13f)

bump version to v0.4.2 (#1644)

* bump version to v0.4.2
* update latest news

File tree: 3 files changed, +7 −5 lines

README.md

Lines changed: 3 additions & 2 deletions
@@ -26,7 +26,8 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>
 
-- \[2024/05\] Support VLMs quantization, such as InternVL v1.5, LLaVa, InternLMXComposer2.
+- \[2024/05\] Balance vision model when deploying VLMs with multiple GPUs
+- \[2024/05\] Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVa, InternLMXComposer2
 - \[2024/04\] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLMXComposer2.
 - \[2024/04\] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer [here](docs/en/quantization/kv_quant.md) for detailed guide
 - \[2024/04\] TurboMind latest upgrade boosts GQA, rocketing the [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) model inference to 16+ RPS, about 1.8x faster than vLLM.
@@ -122,6 +123,7 @@ For detailed inference benchmarks in more devices and more settings, please refer
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
 <li>Phi-3-mini (3.8B)</li>
+<li>StarCoder2 (3B - 15B)</li>
 </ul>
 </td>
 <td>
@@ -133,7 +135,6 @@ For detailed inference benchmarks in more devices and more settings, please refer
 <li>DeepSeek-VL (7B)</li>
 <li>InternVL-Chat (v1.1-v1.5)</li>
 <li>MiniGeminiLlama (7B)</li>
-<li>StarCoder2 (3B - 15B)</li>
 </ul>
 </td>
 </tr>
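The 2024/04 news entry above refers to TurboMind's online int8/int4 KV cache quantization, which is selected through the `quant_policy` field of `TurbomindEngineConfig`. Below is a minimal sketch of that setting via the `lmdeploy` pipeline API; the model name and prompt are illustrative placeholders and are not part of this commit.

```python
# Minimal sketch: enable TurboMind's online KV cache quantization.
# quant_policy=8 selects int8 KV cache, quant_policy=4 selects int4,
# and 0 disables it. The model below is only an example.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)

responses = pipe(['Briefly introduce TurboMind.'])
print(responses[0])
```

The 4-bit weight-only VLM support mentioned in the 2024/05 entry quantizes weights offline beforehand; the sketch above only covers the online KV cache path.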

README_zh-CN.md

Lines changed: 3 additions & 2 deletions
@@ -26,7 +26,8 @@ ______________________________________________________________________
 <details open>
 <summary><b>2024</b></summary>
 
-- \[2024/05\] 支持 InternVL v1.5, LLaVa, InternLMXComposer2 等 VLM 模型的量化与推理。
+- \[2024/05\] 在多 GPU 上部署 VLM 模型时,支持把视觉部分的模型均分到多卡上
+- \[2024/05\] 支持 InternVL v1.5, LLaVa, InternLMXComposer2 等 VLM 模型的 4bit 权重量化和推理
 - \[2024/04\] 支持 Llama3 和 InternVL v1.1, v1.2,MiniGemini,InternLM-XComposer2 等 VLM 模型
 - \[2024/04\] TurboMind 支持 kv cache int4/int8 在线量化和推理,适用已支持的所有型号显卡。详情请参考[这里](docs/zh_cn/quantization/kv_quant.md)
 - \[2024/04\] TurboMind 引擎升级,优化 GQA 推理。[internlm2-20b](https://huggingface.co/internlm/internlm2-20b) 推理速度达 16+ RPS,约是 vLLM 的 1.8 倍
@@ -123,6 +124,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>Gemma (2B - 7B)</li>
 <li>Dbrx (132B)</li>
 <li>Phi-3-mini (3.8B)</li>
+<li>StarCoder2 (3B - 15B)</li>
 </ul>
 </td>
 <td>
@@ -134,7 +136,6 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
 <li>DeepSeek-VL (7B)</li>
 <li>InternVL-Chat (v1.1-v1.5)</li>
 <li>MiniGeminiLlama (7B)</li>
-<li>StarCoder2 (3B - 15B)</li>
 </ul>
 </td>
 </tr>

lmdeploy/version.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 from typing import Tuple
 
-__version__ = '0.4.1'
+__version__ = '0.4.2'
 short_version = __version__
 
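As a quick sanity check after installing this release, the two names shown in the hunk above can be imported directly; the install command in the comment is an assumption about the usual distribution channel and is not part of the commit.

```python
# Verify the bump locally (e.g. after `pip install lmdeploy==0.4.2`).
# __version__ and short_version are the names defined in
# lmdeploy/version.py, as shown in the hunk above.
from lmdeploy.version import __version__, short_version

assert __version__ == '0.4.2'
print(short_version)  # expected to print '0.4.2'
```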
