11 Nov 03:27

Jiang-Jia-Jun

cba7b29

v2.3.0 Latest

Latest

新增功能

新增GLM 4.5文本类模型部署支持 #3928
新增GPT-OSS-BF16文本类模型部署支持 #4240
新增ERNIE-4.5-VL-28B-A3B-Thinking多模态思考模型部署支持，详见文档
新增PaddleOCR-VL多模态模型部署支持 #4936
多模态模型和思考模型增加受限解码StructredOutput支持 #2749
多模态模型增加Prefix Caching与Encoder Caching支持 #4134
新增Wfp8Afp8在线量化推理支持 #4051 #4238
新增静态Cfp8量化推理支持 #4568
LogProb功能
- 支持EP并行下开启logprob #4151
- 支持MTP场景下开启logprob #4464 #4467
- 新增logprobs_mode参数指定返回结果的类型 #4567
HuggingFace Safetensors模型升级为默认能力
- Qwen2.5-VL系列支持 #3921
- ERNIE-4.5-VL系列模型支持 #4042
- 新增EP并行与Cache量化场景下支持 #3801
- 新增动态量化缓存机制，二次加载可使用缓存进行加载 #3857
Nvidia GPU下CUDA Graphs功能的完善
- CUDA Graphs默认在Decode阶段开启 #3594
- 使用统一内存池，降低显存开销 #4230
- 支持投机解码 #3769 #4545 #4617 #4669
- 支持TP、DP、EP混合并行 #4456 #4589
- 支持 PD 分离式部署 #4530
- 支持权重清理与动态加载下的重捕获 #3781 #3594
- 支持CustomAllReduce下开启CUDA Graphs重捕获 #4305
- 增加ERNIE-4.5-VL-MOE模型的支持 #3226
新增终端命令行CLI工具集
- chat：执行对话生成任务 #4037
- complete：执行文本补全任务 #4037
- serve：启动与OpenAI协议兼容的推理服务 #4226
- bench：对推理服务进行性能（延迟、吞吐）或精度评测
  - bench serve \ bench latency 精度评测工具 #4160 #4239
  - bench throughtput \ bench eval 性能评测工具 #4239
- collect-env：收集并打印系统、GPU、依赖等运行环境信息 #4044 #4558 #4159
- run-batch：批量执行推理任务，支持文件/URL输入输出 # 4237
- tokenizer：执行文本与 token 的编码、解码及词表导出 #4278
新增engine-worker-queue-port与cache-queue-port的匿名端口支持 #4597
新增```LogitsProcessors````后处理参数支持 #4515
新增ERNIE-45-VL-Thinking模型的ReasoningParser与ToolParser #4571
usage字段返回新增多模态输入与输出Token、思考Token的统计 #4648 #4520
新增n参数支持单请求返回多个生成结果 #4273
离线推理chat接口新增tool参数支持工具调用 #4415
多模态数据预处理增加对url数据的下载增加重试 #3838

性能优化

优化per_token_quant_fp8算子性能，提升50% #4238
MTP支持Chunked Prefill与V1 KVCache调度 #3659 #4366
V1 KVCache调度增加对上下文缓存的支持，并作为默认配置 #3807 #3814
优化MLA kernel性能，支持auto chunk + graph下的高性能MLA kernel #3886
优化Qwen-VL中ViT模块的CPU同步耗时 #4442
Machete GEMM支持WINT4/WINT8以及group scale，并作为默认dense GEMM后端，优化模型性能与精度 #4451 #4295 #4121 #3999 #3905
优化append attention前处理算子性能 #4443 #4369 #4367
思考长度裁剪功能自定义算子化，实现更鲁棒更规范 #4279 #4736
INTEL HPU优化多卡场景下sampling #4445
新增MergedReplicatedLinear方法，支持DeepSeek，qkv_a_proj融合 #3673
优化DeepEP buffer显存；支持EP场景下DeepEP buffer的creat/delete功能 #4039
优化集中式EP场景下DeepEP clear buffer带来的降速 #4039
spec decode适配qk norm #3637
优化MLA Kernel性能，支持auto chunk + CUDA Graphs #3886
解决KV Cache容量分配偏小问题 #4355
Engine与Worker跨进程通信支持零拷贝方式传输多模态张量数据 #4531
APIServer支持gunicore+uvicorn优化前处理耗时 #4496 #4364

多硬件

昆仑芯P800
- 新增ERNIE-4.5-VL系列模型的支持 #4030
- 新增PaddleOCR-VL 0.9B模型的支持 #4529
- BlockAttention算子支持neos版本rope #4723
- 新增W4A8精度支持 #4068
- 适配V1 KVCache调度 #4573
沐曦C550
- 优化Attention、MoE、RotaryEmbedding算子实现 #3688
- 新增DeepSeek-R1、DeepSeek-V3.1-BF16部署支持 #4498
天数CoreX
- 新增ERNIE-4.5-VL-28B-A3B部署支持 #4313
- ERNIE-4.5-300B-A47B推理性能优化 #3651
- 修复rebuild_padding错误问题 #4504

文档

新增终端命令行工具CLI命令使用说明 #4569
新增优雅退出方案 #3785
更新模型支持文档 #4754
新增2Bit量化方式和最佳实践 #3819 #3968
新增DP并行部署文档 #3883
新增昆仑芯ERNIE-4.5-VL模型部署文档 #4586
新增XPU PaddleOCR-VL模型部署文档 #4792
更新模型最佳实践文档 #3969
新增ERNIE-4.5-21B-A3B-Thinking最佳实践文档 #3994
更新metrics指标说明文档 #4061
更新接口参数文档，增加completion_tokens、rompt_tokens、tool_calls说明 #4421

Bug修复

修复DP并行场景下Prefix Caching无法部署问题 #4359 #4370
修复集中式EP并行部署下长输入KVCache调度Hang住问题 #4275
修复开启CUDA Graphs时noaux_tc算子报错CUDA 700问题 #4174
修复V1 Loader下TritonMoEBlockWiseFP8权重shape错误 #4384
修复EP场景下MoE前处理问题，增加num_experts_per_rank合法值 #4102
修复CustomAllReduce输出不稳定问题 #4437
修复昆仑芯下思考长度限制，只有思考无回复内容问题 #4539 #4760
修复推理异常退出场景下KVCache管理进程残留问题 #4410
修复部分场景默认开启ChunkedPrefill报错问题 #3759
修复调度方法导致DeepSeek模型CudaError问题 #4757
修复XPU多模下默认开启上下文缓存bug #4694
修复MTP与C8场景下模型加载问题 #4077
修复MLA默认开启TensorCore的bug #4354
修复APIServer连接重复初始化的问题 #3901
修复MultiAPIServer日志地址混乱问题 #3967
修复多机张量并行无法部署问题 #4377
修复Qwen-VL系列模型无法关闭思考问题 #3808 #4762
修复APIServer的对话接口非流式返回场景下finish_reason不正确问题 #4582
修复ERNIE-4.5-VL模型ReasoningPaserser中思考结束符错误问题 #4686
修复离线接口enable_thinking强制False的不符合预期错误 #4248
修复ERNIE-4.5-VL对PNG格式透明背景图像的处理问题 #4847
修复rope3d开启FA3下的报错问题 #3791
修复部分硬件平台上算子导入出错问题 #4559
修复PD分离EP并行场景下启动推理服务的多个问题 # 4311 #4420 #4542 #4693 #4781
修复Metrics中num_requests_running, num_requests_waiting, available_gpu_block_num统计不准确的问题 #4404
修复Trace日志在流式输出场景下trace span过多问题 #4375
修复动态C8计算错误问题 #4119
修复AppendAttention作为自定义算子注册下的Bug导致动静不统一问题 #4340
修复Qwen-VL系列模型预处理中视频与图片数据的占位符处理错误 #4065
修复模型组网存在的无用显存浪费问题 #3854
修复思考长度限制在并发场景下的Bug #4296
修复PD分离下IPC信号读取错误问题 #4309
修复metrics指标的共享目录命名冲突问题 #4007
修复昆仑芯barrier随机精度问题 #4181
修复思考长度限制超过上限时的异常问题 #4086

其它

修复沐曦硬件上的单测报错问题 #4027
修复沐曦硬件上的单测报错问题test_get_save_output_v1单测偶发挂的问题 #4732
昆仑芯增加W4A8单测用例 #4501
Config配置代码优化，去除冗余字段 #4147 #4362 #4400
第三方库采用submodule管理 #4033
新增DeepSeek-V3-0324端到端监控 #4360
ERNIE-4.5-VL模型续推字段generated_token_ids改为completion_token_ids #4086
后面进程异常退出时，APIServer进程自动退出提在终端输出提示 #3271
Metrics增加若干可观测性指标 #3868
新增Attention层的性能单测 #4494
DP+EP并行场景下支持模型权重的热更新 #3765 #3803 #3898
支持在训练场景下强制停止推理请求 #3601 #4402
修复在训练场景下Qwen3模型命名映射异常问题 #4338 #4322
修复流式请求max_streaming_response_token参数不起作用问题 #3789
增加基于ZMQ回传worker推理结果至Engine的通信方式 #3521

What's Changed

Add more runtime information to resource manager by @ming1753 in #3706
Add CI cases by @ZhangYulongg in #3714
Add loader test for mtp by @YuanRisheng in #3724
fix typos by @co63oc in #3684
add ci images build job by @XieYunshen in #3749
[DOC] fix Document by @lizexu123 in #3782
Update test_ernie_21b_mtp.py by @ZhangYulongg in #3783
fix test_load_mtp by @co63oc in #3780
[BugFix] Fix chunked prefill by @kevincheng2 in #3759
[BugFix] fix max streaming tokens invalid by @ltd0924 in #3789
[Feature] Setting number of apiserver workers automatically by @Jiang-Jia-Jun in #3790
[Feature] mm and thinking model support structred output by @kevincheng2 in #2749
[Feature] support model weight update in ep by @ltd0924 in #3765
[BugFix] fix error of import paddle.base.core.Config by @yuanlehome in #3761
[Executor] Fix bug of import paddle with RLHF by @gongshaotian in #3781
rename speculate_stop_generation_multi_stop_seqs by @co63oc in #3743
Modify mask_offset‘s format by @carryyu in #3525
rename speculate_token_penalty_multi_scores.cu by @co63oc in #3735
fix ce compile job by @XieYunshen in #3768
[v1loader]Reduce EB300B model loading time by @bukejiyu in #3700
【Fix bug] w4afp8 的nblock固定为256，并且fa3的append attn 增加mask参数 by @yangjianfengo1 in #3771
【Hackathon 9th No.64】add test_draft_model_set_value_by_flags by @Echo-Nie in #3741
[Feat] Support streaming transfer data using ZMQ by @Wanglongzhi2001 in #3521
[BugFix] fix scheduler invalid by @ltd0924 in #3803
rename fused_get_rope.cu by @co63oc in #3752
【Hackathon 9th No.84】Supplementary Unit Test for fastdeploy/reasoning by @Echo-Nie in #3570
fix w8a8.py by @co63oc in #3733
fix dcu_worker.py by @co63oc in #3734
【Hackathon 9th No.73】add unit tests for graph_opt_backend by @ooooo-create in #3609
[XPU] FIX XPU CI BUG by @plusNew001 in #3829
[Doc] update wint2 doc by @chang-wenbin in #3819
fix test_append_attention_with_output.py by @carryyu in #3831
[XPU] Update XPU CI case by @plusNew001 in #3837
qk norm for speculate decode C16 by @rsmallblue in #3637
[V1 Loader]V1 loader support EP by @YuanRisheng in #3801
[Code Simplification] delete cum_offsets_out by @lizexu123 in #3815
[Feature] ernie4_5_vl_moe support huggingface safetensor loading by @aquagull in #3750
add reasoning parser plugin by @luukunn in #3811
reopen ut by @XieYunshen in #3795
Automatically configure workers based on max-num-seqs by @yyssys in #3846
【Hackathon 9th No.43、45】add speculate_get_padding_offset by @co63oc in #3730
【Hackathon 9th No.42】ad...

Contributors

co63oc, DDDivano, and 79 other contributors

Assets 2

11 Oct 07:01

Jiang-Jia-Jun

v2.2.1

e42dc8c

v2.2.1

新增功能

新增在线权重更新支持开启Prefix Caching
新增GLM 4.5 Air模型部署支持

What's Changed

[docs] update best practice docs for release/2.2 by @zoooo0820 in #3970
[Docs] release 2.2.0 by @ming1753 in #3991
[docs] update readme by @yangjianfengo1 in #3996
[Optimize]Error messages about Model api. by @AuferGachet in #3972
[Cherry-Pick] get org_vocab_size from args by @zeroRains in #3984
【FIX】Change the name of sparse attn from moba to plas by @yangjianfengo1 in #4006
Fix down projection weight shape in fused MOE layer by @yuanlehome in #4041
[Fix] fix multi api server log dir by @ltd0924 in #3966
Fixed the issue of metrics file conflicts between multiple instances … by @zhuangzhuang12 in #4010
[Feature] Support mixed deployment with yiyan adapter in release22 by @rainyfly in #3974
[CI] update paddlepaddle==3.2.0 in release/2.2 by @EmmonsCurse in #3997
[setup optimize]Support git submodule (#4033) by @YuanRisheng in #4080
[CP]Glm45 air 2.2 by @ckl117 in #4073
[feat] support prefix cache clearing when /clear_load_weight is called by @liyonghua0910 in #4091
[BugFix]fix tp/ep group gid by @gzy19990617 in #4038
Support limit thinking lengths. by @K11OntheBoat in #4070
Add assertion for ENABLE_V1_KVCACHE_SCHEDULER by @Jiang-Jia-Jun in #4146
[fix] fix ep group all-reduce by @liyonghua0910 in #4140
[Cherry-pick] fix MTP load with v1 loader by @zoooo0820 in #4153
[CP2.2] Machete support group scale & wint8 & v1 loader by @Sunny-bot1 in #4166
[Feature] support rdma IB transfer by @ltd0924 in #4123
[BugFix]2.2 glm all reduce tp group by @ckl117 in #4188
[Executor] Adjust signal sending order in RL training (#3773) (#4066) by @gongshaotian in #4178
[fix] initialize available_gpu_block_num with max_gpu_block_num by @liyonghua0910 in #4193
[fix]Modify follow-up push parameters and Modify the verification method for thinking length by @luukunn in #4177
Fix noaux_tc cuda Error 700 in CUDAGraph and Add wfp8apf8 moe quant method by @ckl117 in #4115
[Feature]CP support data clear by @ltd0924 in #4214
[fix] fix clearing caches synchronization and add more logs by @liyonghua0910 in #4212
fix ernie vl distributed attr. by @ZHUI in #4217
[2.2]include_stop_str_in_output=False not return eos text by @ckl117 in #4231
[fix]update apply_chat_template by @luukunn in #4249
[fix]remove reasoning_max_tokens=max_toksns*0.8 in sampling_params by @luukunn in #4294
【fix】Remove the logic that assigns the default value of 80% to reasoning_max_tokens in the offline component of FastDeploy by @kxz2002 in #4304
[feature]2.2 custom_allreduce support cudagraph recapture by @ckl117 in #4307
[BUGFIX] clear request by @ltd0924 in #4320

Full Changelog: v2.2.0...v2.2.1

Contributors

zoooo0820, ZHUI, and 19 other contributors

Assets 2

08 Sep 16:17

Jiang-Jia-Jun

v2.2.0

d40a104

v2.2.0

新增功能

采样策略中的bad_words支持传入token ids
新增Qwen2.5-VL系列模型支持(视频请求不支持enable-chunked-prefill)
API-Server completions接口prompt 字段支持传入token id列表，同时支持批量推理
新增function call解析功能，支持通过tool-call-parse解析function call结果
支持服务启动或请求中自定义chat_template
支持模型chat_template.jinja文件的加载
请求报错结果增加异常堆栈信息，完善异常log记录
新增混合MTP、Ngram的投机解码方法
支持用于投机解码的Tree Attention功能
模型加载功能增强，实现了使用迭代器加载模型权重，加载速度和内存占用进一步优化
API-Server完善日志格式，增加时间信息
新增插件机制，允许用户在不修改FastDeploy核心代码的前提下扩展自定义功能
支持Marlin kernel文件在编译阶段按照模版配置自动生成
支持加载 HuggingFace原生Safetensors格式的文心、Qwen系列模型
完善DP+TP+EP混合并行推理

性能优化

新增W4Afp8 MoE Group GEMM算子
CUDA Graph增加对超32K长文的支持
优化moe_topk_select算子性能，提升MoE模型性能
新增Machete WINT4 GEMM算子，优化WINT4 GEMM性能，通过FD_USE_MACHETE=1开启
Chunked prefill 默认开启
V1 KVCache调度策略与上下文缓存默认开启
MTP支持更多草稿token推理，提升多步接受率
新增可插拔轻量化稀疏注意力加速长文推理
针对Decode支持自适应双阶段的All-to-All通信，提升通信速度
支持DeepSeek系列模型MLA Bankend encoder阶段启用Flash-Attrntion-V3
支持DeepSeek系列模型q_a_proj & kv_a_proj_with_mqa linear横向融合
API-Server新增zmq dealer 模式通信管理模块，支持连接复用进一步扩展服务可支持的最大并发数

Bug修复

completion接口echo回显支持
修复 V1调度下上下文缓存的管理 bug
修复 Qwen 模型固定 top_p=0 两次输出不一致的问题
修复 uvicorn 多worker启动、运行中随机挂掉问题
修复 API-Server completions接口中多个 prompt 的 logprobs 聚合方式
修复 MTP 的采样问题
修复PD 分离cache 传输信号错误
修复异常抛出流量控制信号释放问题
修复max_tokens为0 异常抛出失败问题
修复EP + DP 混合模式下离线推理退出hang问题

文档

更新了最佳实践文档中一些技术的用法和冲突关系
新增多机张量并行部署文档
新增数据并行部署文档

其它

CI新增对自定义算子的Approve拦截
Config整理及规范化

What's Changed

Describe PR diff coverage using JSON file by @XieYunshen in #3114
[CI] add xpu ci case by @plusNew001 in #3111
disable test_cuda_graph.py by @XieYunshen in #3124
[CE] Add base test class for web server testing by @DDDivano in #3120
[OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in #3121
[Docs] Optimal Deployment by @ming1753 in #2768
fix stop seq unittest by @zoooo0820 in #3126
[XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3133
[Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in #2937
add case by @DDDivano in #3150
fix ci by @XieYunshen in #3141
Fa3 支持集中式 by @yangjianfengo1 in #3112
Add CI cases by @ZhangYulongg in #3155
[XPU]Updata XPU dockerfiles by @plusNew001 in #3144
[Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in #3014
【Inference Optimize】Support automatic generation of marlin kernel by @chang-wenbin in #3149
Update init.py by @DDDivano in #3163
fix load_pre_sharded_checkpoint by @bukejiyu in #3152
【Feature】add fd plugins && rm model_classes by @gzy19990617 in #3123
[Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3172
Update test_base_chat.py by @DDDivano in #3183
Fix approve shell scripts by @YuanRisheng in #3108
[Bug Fix] fix the bug in test_sampler by @zeroRains in #3157
【Feature】support qwen3 name_mapping by @gzy19990617 in #3179
remove useless code by @zhoutianzi666 in #3166
[Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in #3130
[Bugfix] Fix uninitialized decoded_token and add corresponding unit t… by @sunlei1024 in #3195
[CI] add test_compare_top_logprobs by @EmmonsCurse in #3191
fix expertwise_scale by @rsmallblue in #3181
[FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3197
[plugin] Custom model_runner/model support by @lizhenyun01 in #3186
Add more base chat cases by @DDDivano in #3203
Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in #3192
[Bug Fix]Fix bug of append attention test case by @gongshaotian in #3202
add more cases by @DDDivano in #3207
fix coverage report by @XieYunshen in #3198
[New Feature] fa3 支持flash mask by @yangjianfengo1 in #3184
[Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in #3210
[EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in #3182
[Bug fix] Fix lm head bias by @RichardWooSJTU in #3185
Ce add repitation early stop cases by @DDDivano in #3213
[BugFix]fix test_air_top_p_sampling name by @ckl117 in #3211
[BugFix] support real batch_size by @lizexu123 in #3109
Ce add bad cases by @DDDivano in #3215
revise noaux_tc by @rsmallblue in #3164
[Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3176
support qk norm for append attn by @rsmallblue in #3145
Fix approve ci by @XieYunshen in #3212
[Trace]add trace when fd start by @sg263 in #3174
[New Feature] Support W4Afp8 MoE GroupGemm by @yangjianfengo1 in #3171
Perfect approve error message by @YuanRisheng in #3224
Fix the confused enable_early_stop when only set early_stop_config by @zeroRains in #3214
[CI] Add ci case for min token and max token by @xjkmfa in #3229
add some evil cases by @DDDivano in #3240
support qwen3moe by @bukejiyu in #3084
[Feature] support seed parameter by @lizexu123 in #3161
【Fix Bug】修复 fa3 支持集中式bug by @yangjianfengo1 in #3235
[bugfix]fix blockwisefp8 and all_reduce by @bukejiyu in #3243
[Feature] multi source download by @Yzc216 in #3125
[fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3247
[Doc][XPU] Update deps and fix dead links by @hong19860320 in #3252
Fix approve ci bug by @YuanRisheng in #3239
[Executor]Update graph test case and delete test_attention by @gongshaotian in #3257
[CI] remove useless case by @EmmonsCurse in #3261
Ce add benchmark test by @DDDivano in #3262
[stop_seq] fix out-bound value for stop sequence by @zoooo0820 in #3216
[fix] multi source download by @Yzc216 in #3259
[Bug fix] support logprob in scheduler v1 by @rainyfly in #3249
[feat]add fast_weights_iterator by @bukejiyu in #3258
[Iluvatar GPU] Optimze attention and moe performance by @wuyujiji in #3234
delete parallel_state.py by @yuanlehome in #3250
[bugfix]qwen3_fix and qwq fix by @bukejiyu in #3255
【Fix】【MTP】Fix MTP sample bug by @freeliuzc in #3139
[CI] add CI logprobs case by @plusNew001 in #3189
Move create_parameters to init in FuseMOE for CultassBackend and TritonBackend by @zeroRains in #3148
[Bugfix] Fix model accuracy in some ops by @gzy19990617 in #3231
add base test ci by @XieYunshen in #3225
[BugFix] fix too many ...

Contributors

co63oc, lengxia, and 60 other contributors

Assets 2

02 Sep 09:40

Jiang-Jia-Jun

v2.1.1

c49c43d

v2.1.1

文档

新增多机张量并行部署文档
文心系列模型最佳实践文档更新到最新用法
更新CUDA Graph使用说明

新增功能

返回结果新增completion_tokens与prompt_tokens，支持返回原始输入与模型原始输出文本
completion接口支持echo参数

Bug修复

修复V1 KVCache调度下LogProb无法返回问题
修复chat_template_kwargs参数无法生效问题
修复混合架构部署下的EP并行问题
修复completion接口返回结果中输出Token计数错误问题
修复logprobs返回结果聚合问题

What's Changed

[Docs] Add Multinode deployment document by @ltd0924 in #3416
[docs] cherry-pick update docs by @zoooo0820 in #3422
[Docs]update installation readme by @yongqiangma in #3435
[Docs] release 2.1 by @ming1753 in #3441
[Docs]Updata docs of graph opt backend by @gongshaotian in #3443
[Feature] Support logprob in scheduler v1 for release/2.1 by @rainyfly in #3446
[Bugfix]fix config bug in dynamic_weight_manager by @gzy19990617 in #3432
[Feature] Pass through the chat_template_kwargs to the data processing module by @luukunn in #3469
[CI] fix run_ci error in release/2.1 by @EmmonsCurse in #3499
[BugFix] fix ep real_bsz by @lizexu123 in #3396
[Feature] add prompt_tokens and completion_tokens by @memoryCoderC in #3505
[fix] setting disable_chat_template while passing prompt_token_ids led to response error by @liyonghua0910 in #3511
[Excutor] Fixed the issue of CUDA graph execution failure caused by d… by @gongshaotian in #3512
[Feature] add tool parser by @luukunn in #3518
[BUGFIX] fix ep mixed bug by @ltd0924 in #3513
[BugFix] Api server bugs by @ltd0924 in #3530
[Feature] Support limit thinking len for text models by @K11OntheBoat in #3527
[Bug Fix] Close get think_end_id for XPU for now. by @K11OntheBoat in #3563
[Feature] Support mixed deployment with yiyan adapter by @rainyfly in #3533
[Cherry-Pick] Launch expert_service before kv_cache initialization in worker_process by @zeroRains in #3558
【BugFix】completion接口echo回显支持 by @AuferGachet in #3477
[fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3588
[fix] fix ZmqIpcClient.close() error by @liyonghua0910 in #3600
[Bugfix] Correct logprobs aggregation for multiple prompts in /completions endpoint by @sunlei1024 in #3620
[BugFix] ep mixed mode offline exit failed by @ltd0924 in #3623
【Bugfix】修复2.1分支上0.3B模型性能大幅下降 by @AuferGachet in #3624
[CI] add cleanup logic in release/2.1 workflows by @EmmonsCurse in #3655
[BugFix] fix parameter is 0 by @ltd0924 in #3663
[fix] qwen output inconsistency when top_p=0 (#3634) by @liyonghua0910 in #3662
Revert "[BugFix] fix parameter is 0" by @Jiang-Jia-Jun in #3681
[feat] add metrics for yiyan adapter by @liyonghua0910 in #3615
[bugfix]PR3663 parameter is 0 by @ltd0924 in #3679
[BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER. by @lizexu123 in #3670
Revert "[BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER." by @Jiang-Jia-Jun in #3719
[Cherry-Pick] fix the bug when num_key_value_heads < tensor_parallel_size by @zeroRains in #3722
[Optimize] Increase zmq buffer size to prevent apiserver too slowly t… by @gongshaotian in #3728
[Fix] Do not drop result when request result slowly by @rainyfly in #3704
[Bug fix] Fix prefix cache in v1 by @rainyfly in #3710
[Bug fix] Fix mix deployment perf with yiyan adapter in release21 by @rainyfly in #3703

Full Changelog: v2.1.0...v2.1.1

Contributors

yongqiangma, zoooo0820, and 15 other contributors

Assets 2

15 Aug 10:26

Jiang-Jia-Jun

v2.1.0

d998efb

v2.1.0

FastDeploy v2.1.0通过升级KVCache调度机制、增强高并发场景能力以及丰富采样策略，进一步提升用户体验和服务稳定性；通过CUDA Graph以及MTP等多项优化提升推理性能；此外，还新增支持多款国产硬件上文心开源模型的推理能力。

使用体验优化

KVCache调度机制升级：采用输入与输出的KVCache统一管理方式，解决此前由于kv_cache_ratio参数配置不当导致的OOM问题；解决多模态模型由于输出KVCache不足，生成提前结束的问题。部署时通过配置环境变量export ENABLE_V1_KVCACHE_SCHEDULER=1启用（下个版本会默认开启），即可不再依赖kv_cache_ratio的设置，推荐使用。
高并发场景功能增强：增加max_concurrency/max_waiting_time控制并发，对于超时请求进行拒绝优化用户体验，保障服务稳定性。
多样的采样方式支持：新增min_p、top_k_top_p采样方式支持，使用方式参考采样说明；同时增加基于Repetition策略和基于stop词列表早停能力，详见早停说明。
服务化部署能力提升：增加return_token_ids/include_stop_str_in_output/logprobs等参数支持返回更完整的推理信息。
默认参数下性能提升：增强因max_num_seqs默认值与实际并发不一致时性能下降问题，避免手动修改max_num_seqs。

推理性能优化

CUDA Graph覆盖更多场景：覆盖多卡推理，支持与上下文缓存、Chunked Prefill同时使用，在ERNIE 4.5系列、Qwen3系列模型上性能提升17%~91%，详细使用可以参考最佳实践文档。
MTP投机解码性能提升 ：优化算子性能，减少CPU调度开销，提升整体性能；同时，相比v2.0.0版本新增ERNIE-4.5-21B-A3B模型支持MTP投机解码。
算子性能优化：优化W4A8、 KVCache INT4、WINT2 Group GEMM等计算Kernel，提升性能；如ERNIE-4.5-300B-A47B WINT2模型性能提升25.5%。
PD分离完成更多模型验证：P节点完善FlashAttention后端，提升长文推理性能，并基于ERNIE-4.5-21B-A3B等轻量模型完成验证。

国产硬件部署能力升级

新增支持昆仑芯P800上ERNIE-4.5-21B-A3B模型部署，更多说明参考昆仑芯P800部署文档。
新增支持海光K100-AI上ERNIE4.5文本系列模型部署，更多说明参考海光K100-AI部署文档。
新增支持燧原S60上ERNIE4.5文本系列模型的部署，更多说明参考燧原S60部署文档。
新增支持天数天垓150上ERNIE-4.5-300B-A47B和ERNIE-4.5-21B-A3B模型部署，并优化推理性能，更多说明参考天数部署文档。

ERNIE4.5 模型国产硬件推理适配情况（✅ 已支持 🚧 适配中 ⛔暂无计划）
模型	昆仑芯P800	昇腾910B	海光K100-AI	天数天垓150	沐曦曦云C550	燧原S60/L600
ERNIE4.5-VL-424B-A47B	🚧	🚧	⛔	⛔	⛔	⛔
ERNIE4.5-300B-A47B	✅	🚧	✅	✅	🚧	✅
ERNIE4.5-VL-28B-A3B	🚧	🚧	⛔	🚧	⛔	⛔
ERNIE4.5-21B-A3B	✅	🚧	✅	✅	✅	✅
ERNIE4.5-0.3B	✅	🚧	✅	✅	✅	✅

FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

🔥 Released FastDeploy v2.0: Supports inference and deployment for ERNIE 4.5. Furthermore, we open-source an industrial-grade PD disaggregation with context caching, dynamic role switching for effective resource utilization to further enhance inference performance for MoE models.

About

FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:

🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU etc.

Supported Models

Model	Data Type	PD Disaggregation	Chunked Prefill	Prefix Caching	MTP	CUDA Graph	Maximum Context Length
ERNIE-4.5-300B-A47B	BF16/WINT4/WINT8/W4A8C8/WINT2/FP8	✅	✅	✅	✅(WINT4)	WIP	128K
ERNIE-4.5-300B-A47B-Base	BF16/WINT4/WINT8	✅	✅	✅	✅(WINT4)	WIP	128K
ERNIE-4.5-VL-424B-A47B	BF16/WINT4/WINT8	WIP	✅	WIP	❌	WIP	128K
ERNIE-4.5-VL-28B-A3B	BF16/WINT4/WINT8	❌	✅	WIP	❌	WIP	128K
ERNIE-4.5-21B-A3B	BF16/WINT4/WINT8/FP8	❌	✅	✅	WIP	✅	128K
ERNIE-4.5-21B-A3B-Base	BF16/WINT4/WINT8/FP8	❌	✅	✅	WIP	✅	128K
ERNIE-4.5-0.3B	BF16/WINT8/FP8	❌	✅	✅	❌	✅	128K

Assets 2

Releases: PaddlePaddle/FastDeploy

v2.3.0

新增功能

性能优化

多硬件

文档

Bug修复

其它

What's Changed

Contributors

Uh oh!

v2.2.1

新增功能

What's Changed

Contributors

Uh oh!

v2.2.0

新增功能

性能优化

Bug修复

文档

其它

What's Changed

Contributors

Uh oh!

v2.1.1

文档

新增功能

Bug修复

What's Changed

Contributors

Uh oh!

v2.1.0

使用体验优化

推理性能优化

国产硬件部署能力升级

相关文档和说明

What's Changed

Contributors

Uh oh!

v2.0.0

FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

About

Supported Models

Uh oh!