[Bug]: trtllm-serve: Harmony-format tokens and reasoning fields emitted in responses for gpt-oss-120B #9256

@jsinghchauhan1

Description

System Info

During tool calls, OpenAI Chat Completions responses sometimes include Harmony-format control tokens in the assistant content, and Harmony-only fields (e.g., reasoning) in the JSON body, when hosting the gpt-oss-120b model via TensorRT-LLM (trtllm-serve).

Example:
{
  "model": "gpt-oss-20b",
  "messages": [
    {"role": "user", "content": "Search for latest policy doc title"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search",
        "description": "How is the weather today",
        "parameters": {
          "type": "object",
          "properties": {"q": {"type": "string"}},
          "required": ["q"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
This can result in a response like:
<|channel|>commentary<|message|>{ "q": "search the web" }

The Harmony tokens leak intermittently during tool calls, mostly in the commentary channel and occasionally in the analysis channel as well.
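Until a server-side fix lands, a client-side scrubber can strip the leaked markers before content reaches downstream consumers. This is a minimal sketch, not a trtllm-serve feature; the marker set (<|channel|>, <|message|>, and the analysis/commentary/final channel names) is inferred from the examples in this report:

```python
import re

# Marker set is an assumption inferred from the leaked output in this
# report; extend the pattern if other Harmony control tokens appear.
HARMONY_MARKER = re.compile(
    r"<\|channel\|>\s*(?:analysis|commentary|final)\s*"  # channel header + name
    r"|<\|[a-z_]+\|>"                                    # any other control token
)

def strip_harmony_markers(text: str) -> str:
    """Remove Harmony control tokens, keeping only the payload text."""
    return HARMONY_MARKER.sub("", text).strip()

print(strip_harmony_markers('<|channel|>commentary<|message|>{ "q": "search the web" }'))
```

Applied to the leaked content above, this recovers the bare tool-argument JSON; it is a workaround only and does not address the reasoning field appearing in non-streaming JSON.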

Triton Information
Version: TensorRT-LLM OpenAI server (trtllm-serve).
TensorRT-LLM OpenAI server image versions: 1.2.0rc0 and 1.2.0rc0.post1.
Container vs. build: using the official container images (no custom build).

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the TensorRT-LLM OpenAI HTTP server (trtllm-serve) with gpt-oss-120b. No custom stop tokens or output filters are configured.
POST to /v1/chat/completions with:

  • Messages that trigger tool planning/execution.
  • A tools schema (tools / tool_choice) to enable tool calling.

Observe responses (both streaming and non-streaming):

  • Assistant content contains Harmony markers (e.g., <|channel|>commentary<|message|>{...}).
  • Non-streaming JSON sometimes includes a reasoning field.

Here is an example request:
{
  "model": "gpt-oss-20b",
  "messages": [
    {"role": "user", "content": "Search for latest policy doc title"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search",
        "description": "How is the weather today",
        "parameters": {
          "type": "object",
          "properties": {"q": {"type": "string"}},
          "required": ["q"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
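For scripted reproduction, the request above can be expressed as a Python payload. The endpoint path follows this report; the base URL in the comment is an assumption for a default local trtllm-serve instance:

```python
import json

# The example request from this report as a Python dict.
payload = {
    "model": "gpt-oss-20b",
    "messages": [
        {"role": "user", "content": "Search for latest policy doc title"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "search",
            "description": "How is the weather today",
            "parameters": {
                "type": "object",
                "properties": {"q": {"type": "string"}},
                "required": ["q"],
            },
        },
    }],
    "tool_choice": "auto",
}

# POST with any HTTP client, e.g. (assumed local default URL):
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```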

Expected behavior

Expected response (clean JSON, no Harmony markers):
{
  "q": "search the web for weather"
}

Actual behavior

Actual response:
<|channel|>commentary<|message|>{ "q": "search the web for weather" }
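For triage, the leaked string can be split into its Harmony channel and JSON payload. A defensive parse sketch, assuming the <|channel|>...<|message|> shape shown above:

```python
import json
import re

# The shape of the leaked content is taken from the observed output above;
# the parsing approach itself is an assumption, not a documented format.
leaked = '<|channel|>commentary<|message|>{ "q": "search the web for weather" }'

match = re.match(r"<\|channel\|>(\w+)<\|message\|>(.*)", leaked, re.DOTALL)
channel, body = match.group(1), match.group(2)
args = json.loads(body)  # the tool-call arguments that should have been returned
print(channel, args)
```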

Additional notes

Model description:
Models: openai/gpt-oss-120b, openai/gpt-oss-20b.
Served via TensorRT-LLM OpenAI server; downloaded at container start; no request/response mutation layer.
Inputs: OpenAI Chat Completions payloads with messages and tools.
Outputs: OpenAI Chat Completions (expected clean JSON and content).

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

Labels

  • LLM API<NV>: High-level LLM Python API & tools (e.g., trtllm-llmapi-launch) for TRTLLM inference/workflows.
  • Triton backend<NV>: Related to NVIDIA Triton Inference Server backend.
  • bug: Something isn't working.
