I use ipex-llm==2.1.0b20240805 + vllm 0.4.2 to run Qwen2-7B-Instruct on CPU, then use curl to send an HTTP request to the OpenAI-compatible API.
The server start command:
python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /datamnt/Qwen2-7B-Instruct --port 8080 \
  --served-model-name 'Qwen/Qwen2-7B-Instruct' \
  --load-format 'auto' --device cpu --dtype bfloat16 \
  --load-in-low-bit sym_int4 \
  --max-num-batched-tokens 32768
The curl command:
time curl http://172.16.30.28:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2-7B-Instruct",
"messages": [
{"role": "system", "content": "你是一个写作助手"},
{"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
],
"top_k": 1,
"max_tokens": 256,
"stream": false}'
Then the server raised an error after the inference finished:
INFO 01-17 09:51:07 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-17 09:51:09 async_llm_engine.py:120] Finished request cmpl-a6703cc7cb0140adaebbfdd9dbf1f1e5.
INFO: 172.16.30.28:47694 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 117, in create_chat_completion
invalidInputError(isinstance(generator, ChatCompletionResponse))
TypeError: invalidInputError() missing 1 required positional argument: 'errMsg'
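The TypeError points at the check itself rather than the generation: invalidInputError apparently requires an error message in addition to the condition, but the call at api_server.py line 117 passes only the condition. A rough sketch of a call matching the reported signature (the import path and message text are assumptions for illustration, not the actual upstream fix):

# Sketch only: the traceback shows invalidInputError() also needs an 'errMsg'
# positional argument besides the condition. The import path below is assumed;
# 'generator' and 'ChatCompletionResponse' are names already in scope in
# api_server.py, and the message string is a placeholder.
from ipex_llm.utils.common import invalidInputError  # assumed import path

invalidInputError(
    isinstance(generator, ChatCompletionResponse),
    "Expected a ChatCompletionResponse for a non-streaming chat completion request",
)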
zengqingfu1442 commented on Jan 20, 2025
Meanwhile, the streaming style of request is supported:
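Presumably the same curl call with "stream" set to true, along these lines (a sketch, not the verbatim command used in the test):

curl http://172.16.30.28:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2-7B-Instruct",
"messages": [
{"role": "system", "content": "你是一个写作助手"},
{"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
],
"top_k": 1,
"max_tokens": 256,
"stream": true}'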
xiangyuT commented on Jan 21, 2025
The issue should be resolved by PR #11748. You might want to update ipex-llm to a version later than 2.1.0b20240810, or simply upgrade to the latest version.
zengqingfu1442 commented on Jan 21, 2025
I just tried to update ipex-llm to 2.1.0 with pip install ipex-llm -U and then ran it again, but there are new errors:
xiangyuT commented on Jan 21, 2025
What version of ipex-llm are you using right now? Maybe you could try:
pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu
zengqingfu1442 commented on Jan 21, 2025
I tried this command, but it seems that the newly installed transformers doesn't support the Qwen2 model.
Here are the versions:
xiangyuT commented on Jan 21, 2025
You may need to reinstall vllm after updating ipex-llm. It seems that the versions of transformers and torch are lower than recommended.
Below are some recommended versions for these libs:
And it works in my environment:
zengqingfu1442 commented on Jan 21, 2025
Ok. Reinstalling vllm after updating ipex-llm to 2.1.0b20240810 really works.
zengqingfu1442 commented on Jan 21, 2025
But the latest stable version ipex-llm==2.1.0 does not work.
zengqingfu1442 commented on Jan 21, 2025
And the latest pre-release version ipex-llm==2.2.0b20250120 does not work either.
xiangyuT commented on Jan 21, 2025
You could use version 2.1.0b20240810 for now. We will look into the issue and plan to update vllm-cpu in the future.
zengqingfu1442 commented on Jan 21, 2025
@xiangyuT It seems that low-bit does not work when the client sends many async requests. Here are my server start command and the package versions:
Here are the error logs:
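By "many async requests" is meant a batch of concurrent chat-completion calls against the same endpoint, roughly like the following hypothetical sketch (not the actual load client used in this report; assumes openai>=1.x):

# Hypothetical load sketch: fire a batch of concurrent chat-completion requests
# at the same OpenAI-compatible endpoint using the AsyncOpenAI client.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://172.16.30.28:8080/v1", api_key="none")  # dummy key

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2-7B-Instruct",  # model name from the command at the top of this issue
        messages=[{"role": "user", "content": f"请帮忙写一篇描述江南春天的小作文 ({i})"}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

async def main(n: int = 32) -> None:
    # Send n requests concurrently and wait for all of them to finish.
    results = await asyncio.gather(*(one_request(i) for i in range(n)))
    print(f"completed {len(results)} requests")

asyncio.run(main())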
xiangyuT commented on Jan 21, 2025
Understood. We are planning to update vllm-cpu to the latest version and address these issues.
zengqingfu1442 commented on Feb 8, 2025
I can successfully run this with a short user prompt, but the server crashes when using a long user prompt.
zengqingfu1442 commented on Feb 10, 2025
@xiangyuT Does ipex-llm support DeepSeek-R1-Distill-Qwen-7B?
xiangyuT commented on Feb 10, 2025
Hi @zengqingfu1442,
I cannot reproduce the issue in my environment. Could you provide some more information about it?
Yes, it is already supported.
zengqingfu1442 commented on Feb 10, 2025
xiangyuT commented on Feb 12, 2025
Hi @zengqingfu1442,
This issue should be resolved by PR #12805. Please update ipex-llm to the latest version (2.2.0b20250211 or newer) and try again. You can do this by running the following command:
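Presumably the nightly-upgrade command follows the same pattern as the one suggested earlier in this thread, along these lines (a reconstruction, not the verbatim command):

# Reconstructed upgrade command (assumed, based on the earlier suggestion in
# this thread); pulls the latest ipex-llm pre-release with CPU wheels.
pip install --pre --upgrade "ipex-llm[all]" --extra-index-url https://download.pytorch.org/whl/cpu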
zengqingfu1442 commented on Feb 12, 2025
@xiangyuT Does ipex-llm + CPU support DeepSeek-V3 and DeepSeek-R1?
rnwang04 commented on Feb 13, 2025
Hi @zengqingfu1442, ipex-llm llama.cpp can now run DeepSeek-V3 and DeepSeek-R1 on CPU / GPU / CPU + GPU (this requires ipex-llm[cpp]>=2.2.0b20250212, which will be released tonight).
zengqingfu1442 commented on Feb 20, 2025
What speed can I expect if I use a pure CPU to run DeepSeek-R1? How many tokens per second?
rnwang04 commented on Feb 21, 2025
Sorry, there was a problem with my previous statement: ipex-llm llama.cpp can only run models on GPU or CPU + GPU; at least one GPU is needed.
As for the performance, it depends on what hardware you use.
zengqingfu1442 commented on Feb 23, 2025
To be more clear, that means Intel CPU or Intel CPU + Intel GPU, right? It does not support Nvidia GPUs or AMD GPUs/CPUs?
rnwang04 commented on Feb 24, 2025
Yes, it's Intel GPU or Intel CPU + Intel GPU. 😊
zengqingfu1442 commented on Feb 25, 2025
@rnwang04 Does ipex-llm support DeepSeek-R1-Distill-Llama-70B on pure Intel CPU?
xiangyuT commented on Feb 26, 2025
Yes, you can run DeepSeek-R1-Distill-Llama-70B with ipex-llm + vLLM CPU on pure Intel CPU.
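For example, a start command analogous to the one at the top of this issue might look like the following (a hypothetical adaptation: the model path and served name are placeholders, and the flags are copied from the Qwen2 example rather than tuned for a 70B model):

# Hypothetical adaptation of the earlier Qwen2 start command; the model path
# and --served-model-name are placeholders, and the remaining flags are not
# specifically validated for DeepSeek-R1-Distill-Llama-70B.
python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --model /datamnt/DeepSeek-R1-Distill-Llama-70B --port 8080 \
  --served-model-name 'deepseek-ai/DeepSeek-R1-Distill-Llama-70B' \
  --load-format 'auto' --device cpu --dtype bfloat16 \
  --load-in-low-bit sym_int4 \
  --max-num-batched-tokens 32768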
zengqingfu1442 commented on Mar 30, 2025
Does ipex-llm + CPU support QwQ 32B and Gemma3 27B?
xiangyuT commented on Mar 31, 2025
QwQ 32B is supported; however, Gemma3 27B has not been validated yet.