
Conversation


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 3, 2025

Description

In an agentic environment, multiple requests, or chunked requests, are sent from a model router (one-api, for example). We found that the server code breaks without chunked request support.

With this feature, the server can handle chunked requests in an agentic LLM flow.

Verification in an agentic environment, with multiple concurrent calls made:

mlx_test_img

Handy Test

curl for model router

For whom may refer to this PR and require a quick test with curl:

Agentic Entry

Our agentic model router (model can be any models handled in model router to the real model behind):

curl -v http://localhost:3000/v1/chat/completions \
 -H "Authorization: Bearer sk-${YOUR_KEY}" \
 -H "Content-Type: application/json" \
 -d '{"model": "gpt-oss-120b-MXFP4-Q4", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": false}'

in place of the usual test entry (where the model name must be the real model name):

curl -v http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b-MXFP4-Q4", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": true}'

Explanation: "http://localhost:3000/v1/chat/completions" is our model router, used to test the various models hosted on a Mac Studio.

It automatically routes requests to the right service hosted by MLX (default port 5001).

The real request:

echo -n '{"model": "gpt-oss-120b-MXFP4-Q4", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": true}' > payload.json

curl -v --request POST http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @payload.json

Simpler test

server:

python -m mlx_lm.server --model "mlx-community/Qwen1.5-0.5B-Chat-4bit" --port 5001

client:

echo -n '{"model": "mlx-community/Qwen1.5-0.5B-Chat-4bit", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": false}' > payload.json

curl -v --request POST http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @payload.json
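
As background for the curl tests above, this is the wire format a chunked request carries. The sketch below is illustrative only (not the PR's code): it frames a payload the way `Transfer-Encoding: chunked` does.

```python
# Illustrative sketch: frame a payload as HTTP/1.1 chunked transfer
# encoding. Each chunk is a hex size line, the chunk bytes, and CRLF,
# terminated by a zero-length chunk.
def frame_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        out += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # terminating zero-length chunk
    return bytes(out)

print(frame_chunked(b"hello world!", 5))
# b'5\r\nhello\r\n5\r\n worl\r\n2\r\nd!\r\n0\r\n\r\n'
```

This is what a proxy produces when it streams a body of unknown length; the server has to reassemble the chunks before it can parse the JSON.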

Unit Test

python

  • test_server.py
    • test_handle_chunked_request

Author

yiakwy-xpu-ml-framework-team commented Nov 3, 2025

@jyork03 could you have a look at it?

Note for gpt-oss: mlx should update the Harmony template parsing library and add the relevant support. (The template leak does not happen in the latest SGLang and Ollama.)

Contributor

jyork03 commented Nov 3, 2025

First off, thanks for the contribution!

A few things I've noticed so far:

  1. Either remove the debug logging or use logging.debug instead of print(f"[Debug] ...") if it's generally useful information to log while running the server. Also, fix the "reaading" typo on line 376: print(f"[Debug] unexpected error reaading chunked body.").
  2. Fix MLX_MODEL_PATH joining: os.path.join expects a variable number of path-like arguments, not a list.
  3. Don't set a "Content-Length" default. Defaulting the content length risks truncation and confusing errors. It should be handled explicitly while enforcing limits and providing clear errors.
  4. Write some tests in /tests/test_server.py:
    1. ensure chunking works appropriately for /v1/completions and /v1/chat/completions
    2. ensure errors are handled as expected
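
A minimal sketch of the shape points 3 and 4 suggest (the names are illustrative, not the PR's actual code): read chunks until the zero-length terminator and enforce an explicit byte limit rather than defaulting Content-Length.

```python
import io

# Illustrative sketch (not the PR's implementation): decode a chunked body
# from a file-like rfile, enforcing an explicit size limit instead of
# falling back to a default Content-Length.
def read_chunked_body(rfile, max_bytes: int = 1 << 20) -> bytes:
    body = bytearray()
    while True:
        size_line = rfile.readline().split(b";")[0].strip()  # drop chunk extensions
        size = int(size_line, 16)  # chunk sizes are hexadecimal
        if size == 0:
            rfile.readline()  # consume the blank line after the last chunk
            return bytes(body)
        if len(body) + size > max_bytes:
            raise ValueError("chunked body exceeds configured limit")
        body += rfile.read(size)
        rfile.readline()  # consume the CRLF that ends every chunk

print(read_chunked_body(io.BytesIO(b"5\r\nhello\r\n7\r\n world!\r\n0\r\n\r\n")))
# b'hello world!'
```

Fed from io.BytesIO in a test, this round-trips a framed body; a real handler would additionally reject malformed hex size lines with a clear 400 error.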

Contributor

@jyork03 left a comment

I've suggested some changes that address points 1, 2 and 3 from my comment above.

Tests are still needed.

Author

yiakwy-xpu-ml-framework-team commented Nov 4, 2025

> I've suggested some changes that address points 1, 2 and 3 from my comment above.
>
> Tests are still needed.

Thanks @jyork03 for taking the time to review; I will update the tests soon :D. Although I don't usually work on client code, and updating on a test machine is somewhat inconvenient, the suggestion about guarding against large-payload attacks in an agentic LLM environment is very helpful.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as draft November 4, 2025 11:43
@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as ready for review November 4, 2025 12:29
Author

yiakwy-xpu-ml-framework-team commented Nov 4, 2025

@jyork03 The unit test has been added. It works on my local laptop (M4) and on a Mac Studio M3 Ultra from the Apple team.

Member

awni commented Nov 5, 2025

Could you say more about this feature? I'm not opposed to adding it but it would be good to know more about why / if it's worthwhile.

Which front-ends are sending chunked requests / why?

Is this something the standard OpenAI API supports? (We tend to try to match feature parity with that.)

Author

yiakwy-xpu-ml-framework-team commented Nov 6, 2025

> Could you say more about this feature? I'm not opposed to adding it but it would be good to know more about why / if it's worthwhile.
>
> Which front-ends are sending chunked requests / why?
>
> Is this something the standard OpenAI API supports? (We tend to try to match feature parity with that.)

@awni thanks for the attention.

Simply put, the MLX-LM server is not production-grade, so the easiest way to use it in a product is not to expose the MLX-LM service directly as our LLM API. Instead, we put a model router that supports the OpenAI REST API scheme in front of it to delegate authorization, throttling, and other model-related services.

And our model router sends chunked requests.
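
For illustration (this is assumed client behavior, not part of the PR): Python's own http.client falls back to Transfer-Encoding: chunked whenever the request body is an iterable of unknown length, which is exactly how an intermediary like a model router can end up sending chunked requests. The fake server and all names below are purely for the demo.

```python
import http.client
import socket
import threading

def run_fake_server(srv: socket.socket, captured: list) -> None:
    """Accept one connection, record the raw request bytes, send a tiny reply."""
    conn, _ = srv.accept()
    data = b""
    while b"0\r\n\r\n" not in data:  # chunked bodies end with a zero-size chunk
        data += conn.recv(4096)
    captured.append(data)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()
    srv.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
captured: list = []
t = threading.Thread(target=run_fake_server, args=(srv, captured))
t.start()

conn = http.client.HTTPConnection("127.0.0.1", srv.getsockname()[1])
body = iter([b'{"model": "m", ', b'"messages": []}'])  # iterable, unknown length
conn.request("POST", "/v1/chat/completions", body=body,
             headers={"Content-Type": "application/json"})
resp = conn.getresponse()
t.join()

print(b"Transfer-Encoding: chunked" in captured[0])  # True
```

Since http.client cannot compute a Content-Length for an iterator, it chunk-encodes the body automatically; a server without chunked support would fail on such requests.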

During our extensive tests on a Mac Studio M3 Ultra from the Apple team, we found that Ollama works well but with conservative performance, while MLX-LM gives more aggressive performance but did not support this form of request, so we added the support.

This is necessary in an agentic environment, where multiple models and model workflows are served: user intent analysis, output alignment, audio services, et al.

Later, I will add an online test suite for the benchmark scripts we use to measure performance, instead of relying on offline benchmarks.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team changed the title fix mlx-server for chunked request (to support one-api, curl) Fix : mlx-server for chunked request (to support one-api, curl) Nov 6, 2025
@yiakwy-xpu-ml-framework-team
Author

With more and more people starting to use mlx-lm, I hope we can merge this soon :D @jyork03 @awni

Contributor

jyork03 commented Nov 7, 2025

@yiakwy-xpu-ml-framework-team I've noticed that all 11 of my code suggestions from my review have been resolved and the code has been added. Thank you for incorporating the feedback.

However, it appears the code was added manually in commit 7596cc2 rather than by using GitHub's "Commit suggestions" feature. Because of this, my contribution was not credited.

This was almost certainly an oversight. To correct it, could you please amend that commit to include my co-author attribution?

Here is the exact line to add to the end of the commit message (you can find instructions here):

Co-authored-by: Josh York <[email protected]>

After amending the commit, you'll just need to force-push the branch (git push --force) to update the pull request.

Thanks!

If you're interested in how the process usually works, here are docs on incorporating feedback in your pull request.

Author

yiakwy-xpu-ml-framework-team commented Nov 8, 2025

> @yiakwy-xpu-ml-framework-team I've noticed that all 11 of my code suggestions from my review have been resolved and the code has been added. Thank you for incorporating the feedback.
>
> However, it appears the code was added manually in commit 7596cc2 rather than by using GitHub's "Commit suggestions" feature. Because of this, my contribution was not credited.
>
> This was almost certainly an oversight. To correct it, could you please amend that commit to include my co-author attribution?
>
> Here is the exact line to add to the end of the commit message (you can find instructions here):
>
> Co-authored-by: Josh York <[email protected]>
>
> After amending the commit, you'll just need to force-push the branch (git push --force) to update the pull request.
>
> Thanks!
>
> If you're interested in how the process usually works, here are docs on incorporating feedback in your pull request.

@jyork03 I am happy to do this. I have found the M3 Ultra extremely useful and handy for debugging an agentic LLM in one stop, and I will put more effort into this technical stack (MLX-LM + the MLX Metal backend) beyond my duty.

update codes

- add body bytes limit to prevent DOS attacks
- clean codes

add unit test

Co-authored-by: Josh York <[email protected]>
@yiakwy-xpu-ml-framework-team
Author

@jyork03 Let's move on :D

@yiakwy-xpu-ml-framework-team
Author

Hi @awni, have a good weekend! Could you help me move this forward? Thank you very much!

