Fix : mlx-server for chunked request (to support one-api, curl) #589
base: main
Conversation
@jyork03 could you have a look at it? Note for gpt-oss: mlx should update the harmony template-parsing library and add the relevant support. (Template leaks won't happen in the latest SGLang and Ollama.)
First off, thanks for the contribution! A few things I've noticed so far:
jyork03
left a comment
I've suggested some changes that address points 1, 2 and 3 from my comment above.
Tests are still needed.
Thanks @jyork03 for taking the time to review; I will update the tests soon :D. Though I don't usually work on client code, and updating a test machine within an LLM agentic environment is somewhat inconvenient, the suggestion about the potential large-payload attack is very helpful.
@jyork03 The unit test has been added. It works on my local laptop (M4) and on a Mac Studio M3 Ultra from the Apple team.
Could you say more about this feature? I'm not opposed to adding it, but it would be good to know more about why / whether it's worthwhile. Which front-ends are sending chunked requests, and why? Is this something the standard OpenAI API supports? (We tend to aim for feature parity with that.)
@awni thanks for the attention. To put it simply, the MLX-LM server is not production-grade, so the easiest way to use it in a product is not to expose MLX-LM services in our LLM API service directly; instead we use a model router that supports the OpenAI RESTful API scheme to delegate authorization, throttling, and other model-related services. Our model router sends chunked requests. During extensive tests on a Mac Studio M3 Ultra from the Apple team, we found that Ollama works well but with conservative performance, while MLX-LM gives more aggressive performance but does not support this form of request, so we added this support. It is necessary in an agentic environment where multiple models and model workflows are served: user intent analysis, output alignment, audio services, and so on. Later, I will add an online test suite for the benchmark scripts we use to measure performance, instead of relying on offline benchmarks.
@yiakwy-xpu-ml-framework-team I've noticed that all 11 of my code suggestions from my review have been resolved and the code has been added. Thank you for incorporating the feedback. However, it appears the code was added manually in commit 7596cc2 rather than by using GitHub's "Commit suggestions" feature, so my contribution was not credited. This was almost certainly an oversight. To correct it, could you please amend that commit to include my co-author attribution? Here is the exact line to add to the end of the commit message (you can find instructions here): After amending the commit, you'll just need to force-push the branch. Thanks! If you're interested in how the process usually works, here are docs on incorporating feedback in your pull request.
@jyork03 I am happy to do this. I have found the M3 Ultra extremely useful and handy for debugging an agentic LLM in one place, and I will put more effort into this technical stack (MLX-LM + the MLX Metal backend) beyond my duty.
update codes
- add body bytes limit to prevent DOS attacks
- clean codes
- add unit test
Co-authored-by: Josh York <[email protected]>
@jyork03 Let's move on :D
Hi @awni, happy weekend! Could you help me move this forward? Thank you very much!
Description
In an agentic environment, multiple requests or chunked requests are sent from a model router (one-api, for example), and we found that the server breaks because it lacks chunked-request support.
With this feature, the server can handle chunked requests in an agentic LLM flow.
Verified in an agentic environment where multiple concurrent calls are made.
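As a sketch of what chunked-request handling involves (this is illustrative, not the PR's actual code; the helper name and the size cap are assumptions), an HTTP/1.1 chunked body can be decoded with a byte limit that guards against the oversized-payload DoS concern raised in review:

```python
import io

MAX_BODY_BYTES = 100 * 1024 * 1024  # assumed cap to reject oversized payloads

def read_chunked_body(rfile, max_bytes=MAX_BODY_BYTES):
    """Decode an HTTP/1.1 chunked-encoded body from a file-like object."""
    body = bytearray()
    while True:
        # Each chunk starts with its size in hex, optionally followed by
        # extensions after ';', terminated by CRLF.
        size_line = rfile.readline().strip()
        size = int(size_line.split(b";")[0], 16)
        if size == 0:
            rfile.readline()  # consume the CRLF that ends the final chunk
            break
        if len(body) + size > max_bytes:
            raise ValueError("chunked body exceeds size limit")
        body += rfile.read(size)
        rfile.readline()  # consume the CRLF that ends each chunk
    return bytes(body)

# Example: two chunks spelling b"hello world"
wire = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(read_chunked_body(io.BytesIO(wire)))  # b'hello world'
```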
Handy Test
curl for model router
For anyone who finds this PR and needs a quick test with curl:
Agentic Entry
Our agentic model router (the model can be any model the router maps to the real model behind it):
to replace the usual test entry (where the model name must be the real model name):
Explanation: "http://localhost:3000/v1/chat/completions" is our model router, used to test the various models hosted on the Mac Studio:
It automatically routes requests to the right service hosted by MLX (default port 5001).
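The exact command was not preserved in this page; a plausible sketch of the router entry (the URL and port come from this thread, while the model name and message payload are placeholders) is:

```shell
# Chunked POST through the model router on port 3000; curl switches to
# chunked transfer encoding when the header is set explicitly.
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Transfer-Encoding: chunked" \
  -d '{"model": "any-routed-model", "messages": [{"role": "user", "content": "hi"}]}'
```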
The real request
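The command itself is missing from this page; a sketch of a direct request to the MLX-LM server (port 5001 per the thread; the payload is a placeholder) might look like:

```shell
# Same request sent straight to the MLX-LM server, bypassing the router.
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Transfer-Encoding: chunked" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```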
Simpler Test
server:
client:
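The server and client commands were stripped from this page; a plausible reproduction (the model path and prompt are assumptions; `mlx_lm.server` is the stock mlx-lm entry point) would be:

```shell
# server: start the MLX-LM OpenAI-compatible server on the thread's port
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 5001

# client: POST with a chunked body; before this fix the server rejected it
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Transfer-Encoding: chunked" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```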
Unit Test
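The PR's actual test file is not shown on this page. As a self-contained sketch of such a test (handler and variable names are hypothetical), one can start a tiny echo handler that decodes a chunked body, then POST to it with `http.client`, which chunk-encodes iterable bodies automatically when no Content-Length can be determined:

```python
import http.client
import http.server
import threading

class ChunkedEcho(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        if self.headers.get("Transfer-Encoding", "").lower() == "chunked":
            body = bytearray()
            while True:
                # hex chunk size, possibly with ";extensions", then CRLF
                size = int(self.rfile.readline().split(b";")[0].strip(), 16)
                if size == 0:
                    self.rfile.readline()  # trailing CRLF after last chunk
                    break
                body += self.rfile.read(size)
                self.rfile.readline()  # CRLF that ends each chunk
        else:
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(bytes(body))

    def log_message(self, *args):  # keep test output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), ChunkedEcho)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
# An iterable body with no Content-Length makes http.client send
# Transfer-Encoding: chunked automatically.
conn.request("POST", "/v1/chat/completions", body=iter([b"hello ", b"world"]))
echoed = conn.getresponse().read()
print(echoed)  # b'hello world'
server.shutdown()
```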