-
Setup
```
llama-server -m MODEL.gguf -np 2 -c 16384 -ngl 99 -ts 1,1,0,0 --host 0.0.0.0 --port 3721 -t 8 -tb 8
```
Issue
Using identical configuration for both models:
Questions
Test Method
Simple browser-based concurrent API requests through two windows.
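For a more repeatable version of the two-browser-window test, two requests can be fired at the server concurrently from a script. A minimal sketch, assuming the server from the Setup section is listening on localhost:3721 and that this build exposes the OpenAI-compatible /v1/chat/completions endpoint (the prompt and max_tokens are placeholders):

```
#!/usr/bin/env sh
# Send two chat completions to llama-server at the same time and time each one.
URL=http://localhost:3721/v1/chat/completions

request() {
  # $1 is only a label for the printed timing line.
  start=$(date +%s)
  curl -s "$URL" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Write a 200-word story."}],"max_tokens":256}' \
    > /dev/null
  end=$(date +%s)
  echo "request $1 finished in $((end - start))s"
}

# Both requests run in background subshells; wait blocks until both return.
request A &
request B &
wait
```

Comparing these timings against a single request run on its own gives a rough picture of how much the second slot (-np 2) costs.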
Replies: 3 comments 1 reply
-
Try:
```
LLAMA_SET_ROWS=1 llama-server -m MODEL.gguf -np 2 -c 16384 -ngl 99 --host 0.0.0.0 --port 3721 -t 8 -tb 8 -fa
```
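Before re-running the concurrency test it can be worth confirming the server actually came up with the new settings. A quick check, assuming this build exposes the /health and /props endpoints:

```
# Reports the server status once the model has finished loading.
curl -s http://localhost:3721/health
# Dumps the server's runtime properties as JSON, e.g. to check the slot count from -np.
curl -s http://localhost:3721/props
```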
-
The performance of small batch sizes with MoE models is not very good in the CUDA backend. There is a fast gemv implementation, but it only works with bs=1.
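One way to put numbers on this is to compare decode throughput at different numbers of parallel sequences. A rough sketch, assuming the llama-batched-bench tool is built alongside llama-server (flag names follow its README and may differ between versions):

```
# Benchmark prompt processing and token generation with 1, 2 and 4 parallel sequences.
# -npp: prompt tokens per sequence, -ntg: generated tokens per sequence,
# -npl: list of parallel-sequence counts to test.
./llama-batched-bench -m MODEL.gguf -c 16384 -ngl 99 \
  -npp 512 -ntg 128 -npl 1,2,4
```

A sharp drop in the generation-speed column between -npl 1 and -npl 2 would be the small-batch MoE path described above.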
-
Thanks for the response. I did some further testing to confirm the behavior:
```
set LLAMA_SET_ROWS=1
llama-server -m MODEL.gguf -np 2 -c 16384 -ngl 99 --host 0.0.0.0 --port 3721 -t 8 -tb 8 -fa
```
but I didn't observe a significant improvement in concurrent request performance. I also tried vllm as an alternative, but being unfamiliar with Python environments, I found its deployment process on my older GPUs (Tesla V100) too complicated. Even after several days of attempts with AI assistance, I couldn't deploy it successfully. llama.cpp is much simpler to work with, and I'm looking forward to improvements in concurrent processing for MoE models.