
Conversation


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 3, 2025

Description

In an agentic environment, multiple requests, or chunked requests, are sent from a model router (one-api, for example). We found that the server code breaks without chunked request support.

With this feature, the server can handle chunked requests in an agentic LLM flow.

Verification in an agentic environment, with multiple concurrent calls made:

mlx_test_img

Handy Test

curl for model router

For whom may refer to this PR and require a quick test with curl:

Agentic Entry

Our agentic model router (model can be any models handled in model router to the real model behind):

curl -v http://localhost:3000/v1/chat/completions \
 -H "Authorization: Bearer sk-${YOUR_KEY}" \
 -H "Content-Type: application/json" \
 -d '{"model": "gpt-oss-120b-MXFP4-Q4", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": false}'

in place of the usual test entry (where the model name must be the real model name):

curl -v http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b-MXFP4-Q4", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": true}'

Explanation: "http://localhost:3000/v1/chat/completions" is our model router, used to test the various models hosted on a Mac Studio.

It automatically routes requests to the right service hosted by MLX (default port 5001).

The real request:

echo -n '{"model": "gpt-oss-120b-MXFP4-Q4", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": true}' > payload.json

curl -v --request POST http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @payload.json

Simpler test

server:

python -m mlx_lm.server --model "mlx-community/Qwen1.5-0.5B-Chat-4bit" --port 5001

client:

echo -n '{"model": "mlx-community/Qwen1.5-0.5B-Chat-4bit", "messages": [{"role": "user", "content": "Once upon a time"}], "temperature": 0.8, "max_tokens": 1024, "stream": false}' > payload.json

curl -v --request POST http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @payload.json
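
As background for the curl tests above, this is the wire format a chunked request carries. The sketch below is illustrative only (not the PR's code): it frames a payload the way `Transfer-Encoding: chunked` does.

```python
# Illustrative sketch: frame a payload as HTTP/1.1 chunked transfer
# encoding. Each chunk is a hex size line, the chunk bytes, and CRLF,
# terminated by a zero-length chunk.
def frame_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        out += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # terminating zero-length chunk
    return bytes(out)

print(frame_chunked(b"hello world!", 5))
# b'5\r\nhello\r\n5\r\n worl\r\n2\r\nd!\r\n0\r\n\r\n'
```

This is what a proxy produces when it streams a body of unknown length; the server has to reassemble the chunks before it can parse the JSON.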

Unit Test

python

  • test_server.py
    • test_handle_chunked_request

Author

yiakwy-xpu-ml-framework-team commented Nov 3, 2025

@jyork03 could you have a look at it?

Note for gpt-oss: mlx should update the Harmony template parsing library and add the relevant support. (The template leak does not happen in the latest SGLang and Ollama.)

Contributor

jyork03 commented Nov 3, 2025

First off, thanks for the contribution!

A few things I've noticed so far:

  1. Either remove the debug logging or use logging.debug instead of print(f"[Debug] ...") if it's generally useful information to log while running the server. Also, fix the "reaading" typo on line 376: print(f"[Debug] unexpected error reaading chunked body.").
  2. Fix MLX_MODEL_PATH joining: os.path.join expects a variable number of path-like arguments, not a list.
  3. Don't set a "Content-Length" default. Defaulting the content length risks truncation and confusing errors. It should be handled explicitly while enforcing limits and providing clear errors.
  4. Write some tests in /tests/test_server.py:
    1. ensure chunking works appropriately for /v1/completions and /v1/chat/completions
    2. ensure errors are handled as expected
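
A minimal sketch of the shape points 3 and 4 suggest (the names are illustrative, not the PR's actual code): read chunks until the zero-length terminator and enforce an explicit byte limit rather than defaulting Content-Length.

```python
import io

# Illustrative sketch (not the PR's implementation): decode a chunked body
# from a file-like rfile, enforcing an explicit size limit instead of
# falling back to a default Content-Length.
def read_chunked_body(rfile, max_bytes: int = 1 << 20) -> bytes:
    body = bytearray()
    while True:
        size_line = rfile.readline().split(b";")[0].strip()  # drop chunk extensions
        size = int(size_line, 16)  # chunk sizes are hexadecimal
        if size == 0:
            rfile.readline()  # consume the blank line after the last chunk
            return bytes(body)
        if len(body) + size > max_bytes:
            raise ValueError("chunked body exceeds configured limit")
        body += rfile.read(size)
        rfile.readline()  # consume the CRLF that ends every chunk

print(read_chunked_body(io.BytesIO(b"5\r\nhello\r\n7\r\n world!\r\n0\r\n\r\n")))
# b'hello world!'
```

Fed from io.BytesIO in a test, this round-trips a framed body; a real handler would additionally reject malformed hex size lines with a clear 400 error.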

Contributor

@jyork03 left a comment

I've suggested some changes that address points 1, 2 and 3 from my comment above.

Tests are still needed.

Author

yiakwy-xpu-ml-framework-team commented Nov 4, 2025

> I've suggested some changes that address points 1, 2 and 3 from my comment above.
>
> Tests are still needed.

Thanks @jyork03 for taking the time to review; I will update the tests soon :D. Although I don't usually work on client code, and updating on a test machine is somewhat inconvenient, the suggestion about guarding against large-payload attacks in an agentic LLM environment is very helpful.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as draft November 4, 2025 11:43
@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as ready for review November 4, 2025 12:29
Author

yiakwy-xpu-ml-framework-team commented Nov 4, 2025

@jyork03 The unit test has been added. It works on my local laptop (M4) and on a Mac Studio M3 Ultra from the Apple team.

Member

awni commented Nov 5, 2025

Could you say more about this feature? I'm not opposed to adding it but it would be good to know more about why / if it's worthwhile.

Which front-ends are sending chunked requests / why?

Is this something the standard OpenAI API supports? (We tend to try to match feature parity with that.)

Author

yiakwy-xpu-ml-framework-team commented Nov 6, 2025

> Could you say more about this feature? I'm not opposed to adding it but it would be good to know more about why / if it's worthwhile.
>
> Which front-ends are sending chunked requests / why?
>
> Is this something the standard OpenAI API supports? (We tend to try to match feature parity with that.)

@awni thanks for the attention.

Simply put, the MLX-LM server is not production-grade, so the easiest way to use it in a product is not to expose the MLX-LM service directly as our LLM API. Instead, we put a model router that supports the OpenAI REST API scheme in front of it to delegate authorization, throttling, and other model-related services.

And our model router sends chunked requests.
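
For illustration (this is assumed client behavior, not part of the PR): Python's own http.client falls back to Transfer-Encoding: chunked whenever the request body is an iterable of unknown length, which is exactly how an intermediary like a model router can end up sending chunked requests. The fake server and all names below are purely for the demo.

```python
import http.client
import socket
import threading

def run_fake_server(srv: socket.socket, captured: list) -> None:
    """Accept one connection, record the raw request bytes, send a tiny reply."""
    conn, _ = srv.accept()
    data = b""
    while b"0\r\n\r\n" not in data:  # chunked bodies end with a zero-size chunk
        data += conn.recv(4096)
    captured.append(data)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()
    srv.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
captured: list = []
t = threading.Thread(target=run_fake_server, args=(srv, captured))
t.start()

conn = http.client.HTTPConnection("127.0.0.1", srv.getsockname()[1])
body = iter([b'{"model": "m", ', b'"messages": []}'])  # iterable, unknown length
conn.request("POST", "/v1/chat/completions", body=body,
             headers={"Content-Type": "application/json"})
resp = conn.getresponse()
t.join()

print(b"Transfer-Encoding: chunked" in captured[0])  # True
```

Since http.client cannot compute a Content-Length for an iterator, it chunk-encodes the body automatically; a server without chunked support would fail on such requests.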

During our extensive tests on a Mac Studio M3 Ultra from the Apple team, we found that Ollama works well but with conservative performance, while MLX-LM gives more aggressive performance but did not support this form of request, so we added the support.

This is necessary in an agentic environment, where multiple models and model workflows are served: user intent analysis, output alignment, audio services, et al.

Later, I will add an online test suite for the benchmark scripts we use to measure performance, instead of relying on offline benchmarks.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team changed the title fix mlx-server for chunked request (to support one-api, curl) Fix : mlx-server for chunked request (to support one-api, curl) Nov 6, 2025
@yiakwy-xpu-ml-framework-team
Author

With more and more people starting to use mlx-lm, I hope we can merge this soon :D @jyork03 @awni

Contributor

jyork03 commented Nov 7, 2025

@yiakwy-xpu-ml-framework-team I've noticed that all 11 of my code suggestions from my review have been resolved and the code has been added. Thank you for incorporating the feedback.

However, it appears the code was added manually in commit 7596cc2 rather than by using GitHub's "Commit suggestions" feature. Because of this, my contribution was not credited.

This was almost certainly an oversight. To correct it, could you please amend that commit to include my co-author attribution?

Here is the exact line to add to the end of the commit message (you can find instructions here):

Co-authored-by: Josh York <[email protected]>

After amending the commit, you'll just need to force-push the branch (git push --force) to update the pull request.

Thanks!

If you're interested in how the process usually works, here are docs on incorporating feedback in your pull request.

Author

yiakwy-xpu-ml-framework-team commented Nov 8, 2025

> @yiakwy-xpu-ml-framework-team I've noticed that all 11 of my code suggestions from my review have been resolved and the code has been added. Thank you for incorporating the feedback.
>
> However, it appears the code was added manually in commit 7596cc2 rather than by using GitHub's "Commit suggestions" feature. Because of this, my contribution was not credited.
>
> This was almost certainly an oversight. To correct it, could you please amend that commit to include my co-author attribution?
>
> Here is the exact line to add to the end of the commit message (you can find instructions here):
>
> Co-authored-by: Josh York <[email protected]>
>
> After amending the commit, you'll just need to force-push the branch (git push --force) to update the pull request.
>
> Thanks!
>
> If you're interested in how the process usually works, here are docs on incorporating feedback in your pull request.

@jyork03 I am happy to do this. I have found the M3 Ultra extremely useful and handy for debugging an agentic LLM in one stop, and I will put more effort into this technical stack (MLX-LM + the MLX Metal backend) beyond my duty.

update codes

- add body bytes limit to prevent DOS attacks
- clean codes

add unit test

Co-authored-by: Josh York <[email protected]>
@yiakwy-xpu-ml-framework-team
Author

@jyork03 Let's move on :D

@yiakwy-xpu-ml-framework-team
Author

Hi @awni, have a good weekend! Could you help me move this forward? Thank you very much!

