Releases: InternLM/lmdeploy

LMDeploy Release V0.0.8

11 Sep 15:34
450757b

Highlights

  • Support Baichuan2-7B-Base and Baichuan2-7B-Chat
  • Support all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist (an infilling prompt sketch follows this list)
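
For the infilling mode specifically, here is a minimal prompt-construction sketch, assuming the <PRE>/<SUF>/<MID> sentinel strings from the Code Llama release; the exact lmdeploy entry point and tokenization for this mode may differ:

```python
def build_infill_prompt(prefix: str, suffix: str) -> str:
    # Code Llama fill-in-the-middle (PSM order): the model generates the
    # code that belongs between the given prefix and suffix.
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

# Example: ask the model to fill in a function body.
prompt = build_infill_prompt(
    prefix='def remove_non_ascii(s: str) -> str:\n    """',
    suffix='\n    return result\n',
)
```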

What's Changed

🚀 Features

🐞 Bug fixes

  • [Fix] continuous batching doesn't work when stream is False by @sleepwalker2017 in #346
  • [Fix] Set max dynamic smem size for decoder MHA to support context length > 8k by @lvhan028 in #377
  • Fix exceed session len core dump for chat and generate by @AllentDan in #366
  • [Fix] update puyu model by @Harold-lkk in #399

📚 Documentations

New Contributors

Full Changelog: v0.0.7...v0.0.8

LMDeploy Release V0.0.7

04 Sep 06:39
d065f3e

Highlights

  • Flash attention 2 is supported, boosting context decoding speed by approximately 45%
  • Token-ID decoding has been optimized for better efficiency
  • The GEMM tuning script is now included in the PyPI package

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

Full Changelog: v0.0.6...v0.0.7

LMDeploy Release V0.0.6

25 Aug 13:30
cfabbbd

Highlights

  • Support Qwen-7B with dynamic NTK scaling and logN scaling in turbomind (a scaling sketch follows this list)
  • Support tensor parallelism for W4A16
  • Add an OpenAI-like RESTful API (a usage sketch follows this list)
  • Support Llama-2 70B 4-bit quantization
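
The long-context tricks behind the Qwen-7B support can be summarized in a few lines. This is a minimal sketch of one common formulation (HF-transformers-style dynamic NTK base rescaling plus Qwen-style logN query scaling); turbomind's internal implementation may differ:

```python
import math

def dynamic_ntk_base(dim: int, seq_len: int, max_pos: int = 2048,
                     base: float = 10000.0, factor: float = 2.0) -> float:
    # Keep the original RoPE base inside the training window; beyond it,
    # grow the base so low-frequency dimensions stretch to cover the
    # longer context.
    if seq_len <= max_pos:
        return base
    return base * (factor * seq_len / max_pos - (factor - 1)) ** (dim / (dim - 2))

def logn_scale(pos: int, max_pos: int = 2048) -> float:
    # Scale queries at positions beyond the training window to keep
    # attention entropy roughly constant.
    return max(1.0, math.log(pos) / math.log(max_pos))
```

And a usage sketch for the RESTful API; the port, route, and payload fields below are assumptions, so consult the restful_api documentation for the exact interface shipped in this release:

```python
import requests

# Hypothetical OpenAI-compatible route and default port.
resp = requests.post(
    "http://0.0.0.0:23333/v1/chat/completions",
    json={
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json())
```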

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • Adjust dependency of gradio server by @AllentDan in #236
  • Implement movmatrix using warp shuffling for CUDA < 11.8 by @lzhangzz in #267
  • Add 'accelerate' to requirement list by @lvhan028 in #261
  • Fix building with CUDA 11.3 by @lzhangzz in #280
  • Pad tok_embedding and output weights to make their shapes divisible by TP by @lvhan028 in #285 (see the sketch after this list)
  • Fix llama2 70b & qwen quantization error by @pppppM in #273
  • Import turbomind in gradio server only when it is needed by @AllentDan in #303
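
The padding fix in #285 boils down to rounding the vocab dimension up to a multiple of the tensor-parallel world size so the embedding splits evenly across ranks. A minimal sketch with illustrative names, not the turbomind code itself:

```python
import torch

def pad_vocab_for_tp(weight: torch.Tensor, tp: int) -> torch.Tensor:
    # weight: (vocab_size, hidden_dim). Append zero rows until vocab_size
    # is divisible by the TP world size, so each rank gets an equal slice.
    vocab_size, hidden_dim = weight.shape
    pad = (-vocab_size) % tp
    if pad == 0:
        return weight
    return torch.cat([weight, weight.new_zeros(pad, hidden_dim)], dim=0)
```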

📚 Documentations

🌐 Other

Known issues

  • Inference with the 4-bit Qwen-7B model fails. #307 is addressing this issue.

Full Changelog: v0.0.5...v0.0.6

LMDeploy Release V0.0.5

15 Aug 07:40
271a19f

What's Changed

🐞 Bug fixes

  • Fix wrong RPATH that used an absolute path instead of a relative one by @irexyc in #239

Full Changelog: v0.0.4...v0.0.5

LMDeploy Release V0.0.4

14 Aug 11:35
8cdcb2a

Highlights

  • Support 4-bit LLM quantization and inference (a quantization sketch follows below). Check this guide for detailed information.
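
The storage idea behind 4-bit inference can be illustrated with per-group weight quantization. This is a minimal sketch only; the release's actual W4A16 pipeline differs in its scaling and packing details:

```python
import torch

def quantize_4bit(w: torch.Tensor, group_size: int = 128):
    # Per-group symmetric quantization to the int4 range [-8, 7]: store
    # int4 codes plus one fp16 scale per group. Assumes w.numel() is
    # divisible by group_size.
    groups = w.reshape(-1, group_size)
    scale = (groups.abs().amax(dim=1, keepdim=True) / 7).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_4bit(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an fp16 approximation (caller reshapes to the original shape).
    return q.to(torch.float16) * scale
```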

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • Fix the TIS client's got-no-space-result side effect introduced by PR #197 by @lvhan028 in #222

📚 Documentations

Full Changelog: v0.0.3...v0.0.4

LMDeploy Release V0.0.3

09 Aug 09:55
4bd0b48

What's Changed

🚀 Features

  • Support tensor parallelism without splitting model weights offline by @grimoire in #158
  • Add a script to split a HuggingFace model into the smallest sharded checkpoints by @LZHgrla in #199 (see the sketch after this list)
  • Add a non-streaming inference API for the chatbot by @lvhan028 in #200
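
A rough equivalent of the sharding step using the HuggingFace API directly; the model path and shard size are placeholders, and the script from #199 may expose different options:

```python
from transformers import AutoModelForCausalLM

# Placeholder checkpoint path.
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")
# save_pretrained splits the checkpoint into shards no larger than
# max_shard_size each.
model.save_pretrained("path/to/llama-7b-sharded", max_shard_size="2GB")
```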

💥 Improvements

🐞 Bug fixes

  • Fix build test error and move turbomind csrc test cases to tests/csrc by @lvhan028 in #188
  • Fix launching client error by moving lmdeploy/turbomind/utils.py to lmdeploy/utils.py by @lvhan028 in #191

📚 Documentations

New Contributors

Full Changelog: v0.0.2...v0.0.3

LMDeploy Release V0.0.2

28 Jul 07:11
7e0b75b

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

New Contributors

@streamsunshine, @del-zhenwu, @APX103, @xin-li-67, @KevinNuNu, @rollroll90