Conversation

@wine99 wine99 commented Nov 18, 2025

CPU and GPU

  • llama-server with -np > 1 (multiple parallel sequences) is supported
  • llama-perplexity is supported
  • llama-bench is supported (-fa 1 is needed to enable flash attention)
  • GPU accuracy issue on quantized models (workaround: set OV_GPU_DISABLE_HORIZONTAL_FC_FUSION)
  • Performance still needs testing: small regression on CPU, large regression on GPU

All apps are currently broken on NPU; a fix is work in progress.
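The checks above can be sketched as the following invocations. This is a hedged sketch, not taken from the PR: the model path and the parallel-sequence count are placeholders, and it assumes the stock llama.cpp CLI flags (`-np` for parallel sequences on llama-server, `-fa 1` for flash attention on llama-bench) plus the OpenVINO environment variable named in the list.

```shell
# Placeholder model path -- substitute a real GGUF model.
MODEL=path/to/model.gguf

# Multi-sequence serving: -np sets the number of parallel sequences (> 1 here).
llama-server -m "$MODEL" -np 4

# Perplexity evaluation over a text file.
llama-perplexity -m "$MODEL" -f wiki.test.raw

# Benchmark with flash attention enabled via -fa 1.
llama-bench -m "$MODEL" -fa 1

# Workaround for the GPU accuracy issue on quantized models:
# disable OpenVINO's horizontal FC fusion for this run.
OV_GPU_DISABLE_HORIZONTAL_FC_FUSION=1 llama-bench -m "$MODEL" -fa 1
```

Setting the environment variable per-invocation, as in the last line, keeps the workaround scoped to the affected runs rather than the whole shell session.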

@github-actions github-actions bot added the ggml label Nov 18, 2025
@wine99 wine99 force-pushed the multi_seq branch 2 times, most recently from df6dac2 to d2524d0 Compare November 20, 2025 07:32
