[TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve #12199
JunyiXu-nv wants to merge 4 commits into NVIDIA:main
Conversation
Add support for interleaved thinking (reasoning between tool calls) for MiniMax-M2 and GLM-4.7 model families in trtllm-serve.

- Add MiniMaxM2ToolParser for `<minimax:tool_call>` XML format with single/parallel tool calls and streaming support
- Add Glm47ToolParser extending Glm4ToolParser for GLM-4.7 models with optional arguments support
- Register new reasoning parsers (glm45, minimax_m2, minimax_m2_append_think) using the existing DeepSeekR1Parser with reasoning_at_start=True for `<think>...</think>` format
- Register new tool parsers (glm47, minimax_m2) in ToolParserFactory
- Add comprehensive unit tests for the new parsers, including streaming, parallel tool calls, and interleaved thinking integration tests

Signed-off-by: Junyi Xi <junyix@nvidia.com>
Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Add docstrings to all methods in Glm47ToolParser and MiniMaxM2ToolParser to meet the 80% docstring coverage threshold required by CI pre-merge checks

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Remove glm47_parser.py (GLM-4.7 is not in ticket scope)
- Remove glm45 reasoning parser registration
- Remove GLM-4.7 related tests
- Keep only Kimi-K2 and MiniMax-M2 as specified in ticket

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
…nking support

- Add KimiK2ReasoningParser that extends DeepSeekR1Parser to handle reasoning content implicitly ended by tool call sections (`<|tool_calls_section_begin|>`) without explicit `</think>` tags
- Support standard `<think>...</think>`, tool-call-interrupted reasoning, and no-reasoning patterns in both streaming and non-streaming modes
- Add comprehensive unit tests for the kimi_k2 parser (non-streaming, streaming, and interleaved thinking scenarios)
- Adapted from vLLM kimi_k2_reasoning_parser.py and sglang reasoning parser implementations

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
📝 Walkthrough

The PR introduces support for the MiniMax-M2 and Kimi K2 model families through new reasoning and tool parsers.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ReasoningParser as KimiK2ReasoningParser
    participant ToolParser as MiniMaxM2ToolParser
    participant Output
    Client->>ReasoningParser: parse_delta(token_stream)
    activate ReasoningParser
    ReasoningParser->>ReasoningParser: Buffer delta text
    alt Detect <think> tag
        ReasoningParser->>ReasoningParser: Mark reasoning started
    end
    alt Detect </think> or tool_calls marker
        ReasoningParser->>ReasoningParser: Mark reasoning ended
        ReasoningParser->>Output: ReasoningParserResult(reasoning_content)
    end
    ReasoningParser->>Output: ReasoningParserResult(content)
    deactivate ReasoningParser
    Output->>ToolParser: parse_streaming_increment(content)
    activate ToolParser
    alt Detect <minimax:tool_call> start
        ToolParser->>ToolParser: Initialize tool invocation buffer
    end
    ToolParser->>ToolParser: Extract function name, parameters
    ToolParser->>ToolParser: Convert parameters with type inference
    alt Detect </minimax:tool_call> end
        ToolParser->>Output: StreamingParseResult(ToolCallItem)
    end
    deactivate ToolParser
    Output-->>Client: Combined reasoning + tool results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 75-76: The MiniMax parser registrations ("minimax_m2" and
"minimax_m2_append_think") reuse DeepSeekR1Parser.parse_delta(), which fails
when a single delta contains both the post-reasoning tail and the next "<think>"
opener (e.g. "reason1</think>text1<think>reason2") because it emits the entire
tail as plain content instead of splitting and reopening a reasoning block;
update the parse_delta implementation used by these registrations (or add an
overriding wrapper) to detect a "</think>...<think>" pattern in the incoming
delta, split the tail into post-reasoning content and the reopened reasoning
segment, emit the post-reasoning text as content, then emit a token/event to
reopen a reasoning block before passing the remaining text back into the
existing parsing flow (use the register_reasoning_parser handlers and
DeepSeekR1Parser.parse_delta as reference points when inserting the
split-and-reopen logic).
- Around line 267-322: The parser currently discards any text before a found
<think> (and leaks partial start-tags) when self.in_reasoning is False; fix by
preserving the prefix as content and buffering partial start-tag suffixes.
Specifically, in the branch that computes begin_idx from self.reasoning_start,
when begin_idx != -1 set content = delta_text[:begin_idx] and set
reasoning_content = delta_text[begin_idx + len(self.reasoning_start):] (and set
self.in_reasoning True); when begin_idx == -1 do not always clear self._buffer —
detect a trailing partial prefix of self.reasoning_start or
self.tool_section_start (e.g. last '<' suffix) and set self._buffer to that
suffix while returning content=delta_text up to that suffix, otherwise clear
self._buffer and return content=delta_text; update uses of self.in_reasoning,
self._buffer, reasoning_content, begin_idx and self.reasoning_start accordingly.
In `@tensorrt_llm/serve/tool_parser/minimax_m2_parser.py`:
- Around line 104-110: The detect_and_parse method currently drops text after or
around <minimax:tool_call> by returning only the prefix or empty normal_text
when an opener exists; update detect_and_parse to preserve prefix (text before
the opener) as normal_text and also detect and include any suffix after the
closing tag when present, parsing tool call content between opener and closer
into calls; when an opener exists but no closer yet, keep the prefix in
normal_text (do not return ""), buffer the remainder for streaming updates, and
only remove the tool block once its closing tag is seen. Apply the same
preservation logic to the corresponding streaming/partial-parse handlers
referenced at the other ranges (the functions handling streaming deltas) so they
similarly retain text before the opener and surface suffix text after the closer
instead of dropping it.
- Around line 33-61: The function _parse_param_value currently JSON-parses and
coerces values before checking the declared param_type, causing values declared
as "string" in the schema to be mutated (e.g., "42" -> 42, "true" -> True). Fix
by short-circuiting when param_type == "string" (return the stripped value_str
unchanged) before any json.loads or numeric/boolean conversions; otherwise keep
the existing logic (JSON parse first, then numeric/boolean fallbacks) and
preserve the original behavior for non-string types.
📒 Files selected for processing (5)

- tensorrt_llm/llmapi/reasoning_parser.py
- tensorrt_llm/serve/tool_parser/minimax_m2_parser.py
- tensorrt_llm/serve/tool_parser/tool_parser_factory.py
- tests/unittest/llmapi/apps/test_tool_parsers.py
- tests/unittest/llmapi/test_reasoning_parser.py
```python
@register_reasoning_parser("minimax_m2", reasoning_at_start=True)
@register_reasoning_parser("minimax_m2_append_think", reasoning_at_start=True)
```
minimax_m2 interleaving still depends on lucky chunk boundaries.
This reuses `DeepSeekR1Parser.parse_delta()`, but that implementation does not handle a delta that contains post-reasoning content and the next `<think>` together. For `reason1</think>text1<think>reason2`, the tail is emitted as plain content instead of `text1` plus a reopened reasoning block, so the new MiniMax registrations still misparse valid interleaved streams.
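One hedged way to handle this (a standalone sketch, not the trtllm-serve implementation; the function name and tag defaults are illustrative) is to split such a delta into three pieces before feeding them back into the normal parsing flow:

```python
def split_interleaved_delta(delta: str,
                            end_tag: str = "</think>",
                            start_tag: str = "<think>"):
    """Split one streaming delta into (reasoning, content, reopened_reasoning).

    Handles the case where a single delta carries the close of one reasoning
    block, plain content, and the opener of the next reasoning block.
    """
    end_idx = delta.find(end_tag)
    if end_idx == -1:
        # Still inside the current reasoning block.
        return delta, "", ""
    reasoning = delta[:end_idx]
    tail = delta[end_idx + len(end_tag):]
    start_idx = tail.find(start_tag)
    if start_idx == -1:
        # Post-reasoning content only; no reopened block in this delta.
        return reasoning, tail, ""
    content = tail[:start_idx]
    reopened = tail[start_idx + len(start_tag):]
    return reasoning, content, reopened
```

With this split, the review's example `reason1</think>text1<think>reason2` yields `text1` as content and `reason2` as a reopened reasoning segment, rather than emitting the whole tail as content.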
```python
        # Wait if the buffer is a prefix of any special token.
        if (self.reasoning_start.startswith(delta_text)
                or self.reasoning_end.startswith(delta_text)
                or self.tool_section_start.startswith(delta_text)):
            return ReasoningParserResult()

        if not self.in_reasoning:
            begin_idx = delta_text.find(self.reasoning_start)
            if begin_idx == -1:
                self._buffer = ""
                return ReasoningParserResult(content=delta_text)
            self.in_reasoning = True
            reasoning_content = delta_text[begin_idx +
                                           len(self.reasoning_start):]

        if self.in_reasoning:
            delta_text = (reasoning_content
                          if reasoning_content is not None else delta_text)

            # Find the earliest end marker.
            end_idx = delta_text.find(self.reasoning_end)
            tool_idx = delta_text.find(self.tool_section_start)

            if end_idx != -1 and (tool_idx == -1 or end_idx <= tool_idx):
                # Standard </think> end.
                reasoning_content = delta_text[:end_idx]
                content = delta_text[end_idx + len(self.reasoning_end):]
                self.in_reasoning = False
                self._buffer = ""
                return ReasoningParserResult(
                    content=content, reasoning_content=reasoning_content)
            elif tool_idx != -1:
                # Implicit end via tool-call section start.
                reasoning_content = delta_text[:tool_idx]
                content = delta_text[tool_idx:]
                self.in_reasoning = False
                self._buffer = ""
                return ReasoningParserResult(
                    content=content, reasoning_content=reasoning_content)

            # No complete end marker - check for partial tag at the end of
            # the buffer (could be a prefix of </think> or
            # <|tool_calls_section_begin|>).
            last_lt = delta_text.rfind("<")
            if last_lt != -1:
                suffix = delta_text[last_lt:]
                if (self.reasoning_end.startswith(suffix)
                        or self.tool_section_start.startswith(suffix)):
                    self._buffer = suffix
                    reasoning_content = delta_text[:last_lt]
                    return ReasoningParserResult(
                        reasoning_content=reasoning_content)

            self._buffer = ""
            reasoning_content = delta_text
            return ReasoningParserResult(reasoning_content=reasoning_content)
```
Keep the content prefix when a later <think> arrives.
When `self.in_reasoning` is False, this branch jumps straight to the next `<think>` and discards everything before it. `content1<think>reason2` loses `content1`, and `content1<th` leaks the partial start tag as content. Streaming chunk boundaries are arbitrary, so multi-block Kimi responses will be corrupted.
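A possible shape for the fix (a self-contained sketch under the review's assumptions; the function name is hypothetical, not from the PR) is to preserve the prefix as content and buffer a trailing partial start tag instead of emitting it:

```python
def consume_outside_reasoning(delta: str, start_tag: str = "<think>"):
    """Process a delta received while NOT inside a reasoning block.

    Returns (content, reasoning, buffered_suffix): text before a full start
    tag is kept as content, text after it becomes reasoning, and a trailing
    partial start tag (e.g. "<th") is buffered rather than leaked.
    """
    begin = delta.find(start_tag)
    if begin != -1:
        # Preserve the prefix as content instead of discarding it.
        return delta[:begin], delta[begin + len(start_tag):], ""
    # No full tag: check for a partial start tag at the end of the delta.
    last_lt = delta.rfind("<")
    if last_lt != -1 and start_tag.startswith(delta[last_lt:]):
        return delta[:last_lt], "", delta[last_lt:]
    return delta, "", ""
```

Under this scheme, `content1<think>reason2` keeps `content1` as content, and `content1<th` holds `<th` back in the buffer to be joined with the next delta.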
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 307-307: Comment contains ambiguous `–` (EN DASH). Did you mean `-` (HYPHEN-MINUS)? (RUF003)
```python
def _parse_param_value(value_str: str, param_type: Optional[str]) -> Any:
    """Parse a parameter value string into the appropriate Python type."""
    value_str = value_str.strip()

    # Try JSON parsing first
    try:
        parsed = json.loads(value_str)
        return parsed
    except (json.JSONDecodeError, ValueError):
        pass

    # For numeric types, try numeric conversion
    if param_type in ("number", "integer"):
        try:
            if "." in value_str or "e" in value_str.lower():
                return float(value_str)
            return int(value_str)
        except (ValueError, TypeError):
            pass

    # For boolean type
    if param_type == "boolean":
        if value_str.lower() == "true":
            return True
        if value_str.lower() == "false":
            return False

    # Default: return as string
    return value_str
```
Don't coerce "string" parameters before consulting the schema.
`json.loads()` runs before any `param_type` check, so values like `42`, `true`, `null`, or `{"k":1}` are converted into non-strings even when the tool definition says `"type": "string"`. That mutates the arguments the tool receives on the happy path.
Possible direction
```diff
 def _parse_param_value(value_str: str, param_type: Optional[str]) -> Any:
     value_str = value_str.strip()
-    # Try JSON parsing first
-    try:
-        parsed = json.loads(value_str)
-        return parsed
-    except (json.JSONDecodeError, ValueError):
-        pass
+    if param_type in (None, "string"):
+        return value_str
+
+    if param_type in ("object", "array", "null"):
+        try:
+            return json.loads(value_str)
+        except json.JSONDecodeError:
+            return value_str
     # For numeric types, try numeric conversion
```
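Assembling the suggested direction into a complete, runnable function (a hedged sketch of the reviewer's proposal, not the merged code; the standalone name `parse_param_value` is illustrative):

```python
import json
from typing import Any, Optional

def parse_param_value(value_str: str, param_type: Optional[str]) -> Any:
    """Coerce a raw parameter string according to its declared schema type.

    Declared strings (and untyped parameters) pass through verbatim; only
    non-string types are parsed or converted.
    """
    value_str = value_str.strip()
    if param_type in (None, "string"):
        return value_str  # never mutate declared strings
    if param_type in ("number", "integer"):
        try:
            if "." in value_str or "e" in value_str.lower():
                return float(value_str)
            return int(value_str)
        except ValueError:
            return value_str
    if param_type == "boolean":
        lowered = value_str.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
        return value_str
    # object / array / null: fall back to JSON parsing.
    try:
        return json.loads(value_str)
    except json.JSONDecodeError:
        return value_str
```

This keeps `"42"` as the string `"42"` when the schema says `"type": "string"`, while still producing `42`, `True`, or a dict for numeric, boolean, and object parameters.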
```python
def detect_and_parse(self, text: str, tools: List[Tool]) -> StreamingParseResult:
    """One-time parsing: detect and parse all tool calls in the text."""
    idx = text.find(self.bot_token)
    normal_text = text[:idx].strip() if idx != -1 else text

    if self.bot_token not in text:
        return StreamingParseResult(normal_text=normal_text, calls=[])
```
Preserve assistant text around the MiniMax tool block.
The one-shot path keeps only the prefix before the first `<minimax:tool_call>`, so any suffix after the closing tag is dropped. The streaming path returns `normal_text=""` as soon as the opener is present, which means `prefix<minimax:tool_call>...` loses `prefix`, and `...</minimax:tool_call>suffix` is not surfaced until a later delta. Mixed content/tool responses become lossy.
Also applies to: 167-170, 259-268
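The preservation logic the review asks for can be sketched in isolation (assumptions: the tag constants match the MiniMax format described in the PR; `split_tool_block` is an illustrative helper, not a function in the parser):

```python
BOT = "<minimax:tool_call>"   # opener, per the PR's MiniMax-M2 format
EOT = "</minimax:tool_call>"  # closer

def split_tool_block(text: str):
    """Return (normal_text, tool_body), keeping text on both sides.

    - No opener: everything is normal text, no tool body.
    - Opener without closer: surface the prefix now, buffer the remainder.
    - Full block: normal text is prefix + suffix; body is the tool content.
    """
    start = text.find(BOT)
    if start == -1:
        return text, None
    end = text.find(EOT, start)
    if end == -1:
        # Closer not seen yet: do NOT drop the prefix.
        return text[:start], text[start + len(BOT):]
    prefix = text[:start]
    suffix = text[end + len(EOT):]
    body = text[start + len(BOT):end]
    return (prefix + suffix).strip(), body
```

The key property is that `prefix<minimax:tool_call>...` surfaces `prefix` immediately, and a suffix after the closing tag is returned as normal text rather than silently discarded.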
Description
Add support for interleaved thinking in trtllm-serve, specifically for the Kimi-K2-Thinking model. This addresses the case where reasoning content may be implicitly ended by a tool call section (`<|tool_calls_section_begin|>`) without an explicit `</think>` tag.

Changes:

- `KimiK2ReasoningParser`: Extends `DeepSeekR1Parser` to detect both `</think>` and `<|tool_calls_section_begin|>` as reasoning end markers. When a tool call section starts during reasoning, the reasoning is implicitly ended and the tool call section is passed through as content.
- Supports standard `<think>...</think>` patterns, tool-call-interrupted reasoning, and no-reasoning content in both `parse()` and `parse_delta()` modes.
- `kimi_k2`: Uses `reasoning_at_start=False` (matching sglang's Qwen3Detector mapping), so the model must explicitly start reasoning with `<think>`.

Supported patterns:

- `<think>reasoning</think>content` - standard thinking
- `<think>reasoning<|tool_calls_section_begin|>...` - interleaved thinking (reasoning interrupted by tool call)
- `content` (no `<think>`) - no reasoning

Adapted from:

- `vllm/reasoning/kimi_k2_reasoning_parser.py`
- `sglang/srt/parser/reasoning_parser.py`

Test Coverage

- `test_kimi_k2_reasoning_parser` - 8 parametrized non-streaming cases including tool call interruption
- `test_kimi_k2_reasoning_parser_stream` - 7 parametrized streaming cases including buffered tool token handling
- `test_interleaved_thinking_stream` - cross-parser interleaved thinking tests for minimax_m2, deepseek-r1, qwen3, and kimi_k2 (including tool-call-interrupted reasoning)
Please review the following before submitting your PR:

- [ ] PR description clearly explains what and why.
- [ ] PR follows TRT-LLM CODING GUIDELINES.
- [ ] Test cases are provided for new code paths.
- [ ] Please check this after reviewing the above items as appropriate for this PR.