- \[2024/05\] Support quantization of VLMs, such as InternVL v1.5, LLaVa, InternLMXComposer2.
- \[2024/05\] Balance the vision model across multiple GPUs when deploying VLMs.
- \[2024/05\] Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVa, InternLMXComposer2.
- \[2024/04\] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLMXComposer2.
- \[2024/04\] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer [here](docs/en/quantization/kv_quant.md) for the detailed guide; a usage sketch follows this list.
- \[2024/04\] The latest TurboMind upgrade improves GQA, boosting [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) inference to 16+ RPS, about 1.8x faster than vLLM.
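
Both quantization features above are enabled at load time through the TurboMind engine config. The following is a minimal sketch, not the official example: it assumes the `lmdeploy` Python package is installed and that a 4-bit AWQ-quantized checkpoint is available at a local path; the model path and prompt are illustrative placeholders.

```python
# Minimal sketch: loading 4-bit AWQ weights and enabling online KV cache
# quantization with LMDeploy's TurboMind backend. The model path below is a
# hypothetical placeholder for a locally quantized checkpoint.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    model_format='awq',  # load 4-bit weight-only (AWQ) weights
    quant_policy=8,      # 8 -> int8 KV cache, 4 -> int4 KV cache, 0 -> disabled
)

pipe = pipeline('./internvl-v1_5-4bit-awq', backend_config=engine_config)
print(pipe(['Describe what LMDeploy does in one sentence.']))
```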
For detailed inference benchmarks in more devices and more settings, please refer to the benchmark documentation.
<li>Gemma (2B - 7B)</li>
<li>Dbrx (132B)</li>
<li>Phi-3-mini (3.8B)</li>
<li>StarCoder2 (3B - 15B)</li>
</ul>
</td>
<td>