
[TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve#12199

Open
JunyiXu-nv wants to merge 4 commits into NVIDIA:main from JunyiXu-nv:user/junyix/fix-trtllm-11357

Conversation


@JunyiXu-nv JunyiXu-nv commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for MiniMax-M2 and Kimi K2 models with interleaved reasoning capabilities.
    • Introduced streaming tool parsing for improved real-time tool invocation handling.
  • Tests

    • Added comprehensive test coverage for new model parsers and streaming scenarios.

Description

Add support for interleaved thinking in trtllm-serve, specifically for the Kimi-K2-Thinking model. This addresses the case where reasoning content may be implicitly ended by a tool call section (<|tool_calls_section_begin|>) without an explicit </think> tag.

Changes:

  • New KimiK2ReasoningParser: Extends DeepSeekR1Parser to detect both </think> and <|tool_calls_section_begin|> as reasoning end markers. When a tool call section starts during reasoning, the reasoning is implicitly ended and the tool call section is passed through as content.
  • Both streaming and non-streaming support: The parser handles standard <think>...</think> patterns, tool-call-interrupted reasoning, and no-reasoning content in both parse() and parse_delta() modes.
  • Registered as kimi_k2: Uses reasoning_at_start=False (matching sglang's Qwen3Detector mapping), so the model must explicitly start reasoning with <think>.

Supported patterns:

  • <think>reasoning</think>content – standard thinking
  • <think>reasoning<|tool_calls_section_begin|>... – interleaved thinking (reasoning interrupted by tool call)
  • content (no <think>) – no reasoning
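The three supported patterns can be sketched as a minimal standalone function. This is a simplified, hypothetical rendition of the non-streaming parse() behavior described above, not the actual KimiK2ReasoningParser code; the real parser also maintains streaming state and returns a ReasoningParserResult object:

```python
THINK_START = "<think>"
THINK_END = "</think>"
TOOL_SECTION_START = "<|tool_calls_section_begin|>"


def parse(text: str) -> tuple[str, str]:
    """Return (reasoning_content, content) for the three supported patterns."""
    if not text.startswith(THINK_START):
        # No reasoning: everything is plain content.
        return "", text
    body = text[len(THINK_START):]
    end_idx = body.find(THINK_END)
    tool_idx = body.find(TOOL_SECTION_START)
    if end_idx != -1 and (tool_idx == -1 or end_idx <= tool_idx):
        # Standard explicit </think> end.
        return body[:end_idx], body[end_idx + len(THINK_END):]
    if tool_idx != -1:
        # Implicit end: the tool-call section terminates reasoning and is
        # passed through as content for the downstream tool parser.
        return body[:tool_idx], body[tool_idx:]
    # Reasoning never closed: treat the whole body as reasoning.
    return body, ""
```

Note that in the tool-call case the marker itself is kept in the content, so the tool parser downstream still sees the full `<|tool_calls_section_begin|>` block.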

Adapted from:

  • vLLM vllm/reasoning/kimi_k2_reasoning_parser.py
  • sglang sglang/srt/parser/reasoning_parser.py

Test Coverage

  • test_kimi_k2_reasoning_parser – 8 parametrized non-streaming cases including tool call interruption
  • test_kimi_k2_reasoning_parser_stream – 7 parametrized streaming cases including buffered tool token handling
  • test_interleaved_thinking_stream – Cross-parser interleaved thinking tests for minimax_m2, deepseek-r1, qwen3, and kimi_k2 (including tool-call-interrupted reasoning)
  • All 69 reasoning parser tests pass (existing + new)

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why.

  • PR Follows TRT-LLM CODING GUIDELINES.

  • Test cases are provided for new code paths.

  • Please check this after reviewing the above items as appropriate for this PR.

Add support for interleaved thinking (reasoning between tool calls)
for MiniMax-M2 and GLM-4.7 model families in trtllm-serve.

- Add MiniMaxM2ToolParser for <minimax:tool_call> XML format with
  single/parallel tool calls and streaming support
- Add Glm47ToolParser extending Glm4ToolParser for GLM-4.7 models
  with optional arguments support
- Register new reasoning parsers (glm45, minimax_m2,
  minimax_m2_append_think) using existing DeepSeekR1Parser with
  reasoning_at_start=True for <think>...</think> format
- Register new tool parsers (glm47, minimax_m2) in ToolParserFactory
- Add comprehensive unit tests for new parsers including streaming,
  parallel tool calls, and interleaved thinking integration tests

Signed-off-by: Junyi Xi <junyix@nvidia.com>
Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Add docstrings to all methods in Glm47ToolParser and MiniMaxM2ToolParser
  to meet the 80% docstring coverage threshold required by CI pre-merge checks

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Remove glm47_parser.py (GLM-4.7 is not in ticket scope)
- Remove glm45 reasoning parser registration
- Remove GLM-4.7 related tests
- Keep only Kimi-K2 and MiniMax-M2 as specified in ticket

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
…nking support

- Add KimiK2ReasoningParser that extends DeepSeekR1Parser to handle
  reasoning content implicitly ended by tool call sections
  (<|tool_calls_section_begin|>) without explicit </think> tags
- Support standard <think>...</think>, tool-call-interrupted reasoning,
  and no-reasoning patterns in both streaming and non-streaming modes
- Add comprehensive unit tests for kimi_k2 parser (non-streaming,
  streaming, and interleaved thinking scenarios)
- Adapted from vLLM kimi_k2_reasoning_parser.py and sglang reasoning
  parser implementations

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
@JunyiXu-nv JunyiXu-nv requested a review from a team as a code owner March 13, 2026 11:18
@JunyiXu-nv JunyiXu-nv requested a review from hchings March 13, 2026 11:18

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

The PR introduces support for MiniMax-M2 and Kimi K2 models through a new KimiK2ReasoningParser that handles interleaved thinking (where tool-call sections implicitly end reasoning), a corresponding MiniMaxM2ToolParser for parsing tool calls in an XML-like structure, factory registration, and comprehensive test coverage spanning reasoning and tool parsing scenarios.

Changes

  • Reasoning Parser Enhancements (tensorrt_llm/llmapi/reasoning_parser.py): Adds decorator registrations for minimax_m2 and minimax_m2_append_think on DeepSeekR1Parser with reasoning_at_start=True. Introduces the new KimiK2ReasoningParser class extending DeepSeekR1Parser to support interleaved thinking with explicit (</think>) and implicit (<|tool_calls_section_begin|>) reasoning end markers. Implements parse for full-text parsing and parse_delta for incremental streaming with buffer management and partial token handling. The implementation appears duplicated later in the file for the NemotronV3 context.
  • Tool Parser Implementation (tensorrt_llm/serve/tool_parser/minimax_m2_parser.py): Introduces MiniMaxM2ToolParser extending BaseToolParser to parse MiniMax-M2 XML-like tool calls. Includes helper functions _get_param_types and _parse_param_value for type inference and value conversion. Implements detect_and_parse for full parsing and parse_streaming_increment for incremental streaming with JSON parameter buffering, partial invocation handling, and error recovery.
  • Tool Parser Registration (tensorrt_llm/serve/tool_parser/tool_parser_factory.py): Adds an import and a factory registration mapping "minimax_m2" to MiniMaxM2ToolParser in the ToolParserFactory.parsers dictionary.
  • Test Coverage (tests/unittest/llmapi/apps/test_tool_parsers.py, tests/unittest/llmapi/test_reasoning_parser.py): Comprehensive test suites for MiniMaxM2ToolParser (initialization, detection, single/multiple tool parsing, streaming, parameter types, structural tag support) and reasoning parsers (minimax_m2, minimax_m2_append_think, kimi_k2 across streaming/non-streaming modes with interleaved thinking and tool-call scenarios). Verifies factory registrations and parser coordination.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ReasoningParser as KimiK2ReasoningParser
    participant ToolParser as MiniMaxM2ToolParser
    participant Output

    Client->>ReasoningParser: parse_delta(token_stream)
    activate ReasoningParser
    ReasoningParser->>ReasoningParser: Buffer delta text
    alt Detect <think> tag
        ReasoningParser->>ReasoningParser: Mark reasoning started
    end
    alt Detect </think> or tool_calls marker
        ReasoningParser->>ReasoningParser: Mark reasoning ended
        ReasoningParser->>Output: ReasoningParserResult(reasoning_content)
    end
    ReasoningParser->>Output: ReasoningParserResult(content)
    deactivate ReasoningParser

    Output->>ToolParser: parse_streaming_increment(content)
    activate ToolParser
    alt Detect <minimax:tool_call> start
        ToolParser->>ToolParser: Initialize tool invocation buffer
    end
    ToolParser->>ToolParser: Extract function name, parameters
    ToolParser->>ToolParser: Convert parameters with type inference
    alt Detect </minimax:tool_call> end
        ToolParser->>Output: StreamingParseResult(ToolCallItem)
    end
    deactivate ToolParser

    Output-->>Client: Combined reasoning + tool results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check ✅ Passed: The title clearly and specifically describes the main change: adding support for interleaved thinking in trtllm-serve, which directly corresponds to the PR's primary objective of implementing the KimiK2ReasoningParser.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check ✅ Passed: PR description clearly explains the purpose, changes, supported patterns, and test coverage for interleaved thinking support in trtllm-serve.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 75-76: The MiniMax parser registrations ("minimax_m2" and
"minimax_m2_append_think") reuse DeepSeekR1Parser.parse_delta(), which fails
when a single delta contains both the post-reasoning tail and the next "<think>"
opener (e.g. "reason1</think>text1<think>reason2") because it emits the entire
tail as plain content instead of splitting and reopening a reasoning block;
update the parse_delta implementation used by these registrations (or add an
overriding wrapper) to detect a "</think>...<think>" pattern in the incoming
delta, split the tail into post-reasoning content and the reopened reasoning
segment, emit the post-reasoning text as content, then emit a token/event to
reopen a reasoning block before passing the remaining text back into the
existing parsing flow (use the register_reasoning_parser handlers and
DeepSeekR1Parser.parse_delta as reference points when inserting the
split-and-reopen logic).
- Around line 267-322: The parser currently discards any text before a found
<think> (and leaks partial start-tags) when self.in_reasoning is False; fix by
preserving the prefix as content and buffering partial start-tag suffixes.
Specifically, in the branch that computes begin_idx from self.reasoning_start,
when begin_idx != -1 set content = delta_text[:begin_idx] and set
reasoning_content = delta_text[begin_idx + len(self.reasoning_start):] (and set
self.in_reasoning True); when begin_idx == -1 do not always clear self._buffer —
detect a trailing partial prefix of self.reasoning_start or
self.tool_section_start (e.g. last '<' suffix) and set self._buffer to that
suffix while returning content=delta_text up to that suffix, otherwise clear
self._buffer and return content=delta_text; update uses of self.in_reasoning,
self._buffer, reasoning_content, begin_idx and self.reasoning_start accordingly.

In `@tensorrt_llm/serve/tool_parser/minimax_m2_parser.py`:
- Around line 104-110: The detect_and_parse method currently drops text after or
around <minimax:tool_call> by returning only the prefix or empty normal_text
when an opener exists; update detect_and_parse to preserve prefix (text before
the opener) as normal_text and also detect and include any suffix after the
closing tag when present, parsing tool call content between opener and closer
into calls; when an opener exists but no closer yet, keep the prefix in
normal_text (do not return ""), buffer the remainder for streaming updates, and
only remove the tool block once its closing tag is seen. Apply the same
preservation logic to the corresponding streaming/partial-parse handlers
referenced at the other ranges (the functions handling streaming deltas) so they
similarly retain text before the opener and surface suffix text after the closer
instead of dropping it.
- Around line 33-61: The function _parse_param_value currently JSON-parses and
coerces values before checking the declared param_type, causing values declared
as "string" in the schema to be mutated (e.g., "42" -> 42, "true" -> True). Fix
by short-circuiting when param_type == "string" (return the stripped value_str
unchanged) before any json.loads or numeric/boolean conversions; otherwise keep
the existing logic (JSON parse first, then numeric/boolean fallbacks) and
preserve the original behavior for non-string types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6efa071d-da90-46ac-916d-81f3f4e3635a

📥 Commits

Reviewing files that changed from the base of the PR and between 3fb931a and 1c8a086.

📒 Files selected for processing (5)
  • tensorrt_llm/llmapi/reasoning_parser.py
  • tensorrt_llm/serve/tool_parser/minimax_m2_parser.py
  • tensorrt_llm/serve/tool_parser/tool_parser_factory.py
  • tests/unittest/llmapi/apps/test_tool_parsers.py
  • tests/unittest/llmapi/test_reasoning_parser.py

Comment on lines +75 to +76
@register_reasoning_parser("minimax_m2", reasoning_at_start=True)
@register_reasoning_parser("minimax_m2_append_think", reasoning_at_start=True)

⚠️ Potential issue | 🟠 Major

minimax_m2 interleaving still depends on lucky chunk boundaries.

This reuses DeepSeekR1Parser.parse_delta(), but that implementation does not handle a delta that contains post-reasoning content and the next <think> together. For reason1</think>text1<think>reason2, the tail is emitted as plain content instead of text1 plus a reopened reasoning block, so the new MiniMax registrations still misparse valid interleaved streams.
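The split-and-reopen logic this comment asks for could look roughly like the standalone sketch below. split_interleaved is an illustrative name, not part of the actual DeepSeekR1Parser; the real fix would feed these pieces back through parse_delta's state machine:

```python
THINK_START = "<think>"
THINK_END = "</think>"


def split_interleaved(delta: str):
    """Split a delta into (reasoning, content, reopened_reasoning)."""
    end = delta.find(THINK_END)
    if end == -1:
        # No closer in this delta: still inside the reasoning block.
        return delta, "", ""
    tail = delta[end + len(THINK_END):]
    reopen = tail.find(THINK_START)
    if reopen == -1:
        # Reasoning ended; the tail is plain content.
        return delta[:end], tail, ""
    # Content between the blocks, then a reopened reasoning segment.
    return (delta[:end],
            tail[:reopen],
            tail[reopen + len(THINK_START):])
```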


Comment on lines +267 to +322
# Wait if the buffer is a prefix of any special token.
if (self.reasoning_start.startswith(delta_text)
        or self.reasoning_end.startswith(delta_text)
        or self.tool_section_start.startswith(delta_text)):
    return ReasoningParserResult()

if not self.in_reasoning:
    begin_idx = delta_text.find(self.reasoning_start)
    if begin_idx == -1:
        self._buffer = ""
        return ReasoningParserResult(content=delta_text)
    self.in_reasoning = True
    reasoning_content = delta_text[begin_idx + len(self.reasoning_start):]

if self.in_reasoning:
    delta_text = (reasoning_content
                  if reasoning_content is not None else delta_text)

    # Find the earliest end marker.
    end_idx = delta_text.find(self.reasoning_end)
    tool_idx = delta_text.find(self.tool_section_start)

    if end_idx != -1 and (tool_idx == -1 or end_idx <= tool_idx):
        # Standard </think> end.
        reasoning_content = delta_text[:end_idx]
        content = delta_text[end_idx + len(self.reasoning_end):]
        self.in_reasoning = False
        self._buffer = ""
        return ReasoningParserResult(
            content=content, reasoning_content=reasoning_content)
    elif tool_idx != -1:
        # Implicit end via tool-call section start.
        reasoning_content = delta_text[:tool_idx]
        content = delta_text[tool_idx:]
        self.in_reasoning = False
        self._buffer = ""
        return ReasoningParserResult(
            content=content, reasoning_content=reasoning_content)

    # No complete end marker - check for partial tag at the end of
    # the buffer (could be a prefix of </think> or
    # <|tool_calls_section_begin|>).
    last_lt = delta_text.rfind("<")
    if last_lt != -1:
        suffix = delta_text[last_lt:]
        if (self.reasoning_end.startswith(suffix)
                or self.tool_section_start.startswith(suffix)):
            self._buffer = suffix
            reasoning_content = delta_text[:last_lt]
            return ReasoningParserResult(
                reasoning_content=reasoning_content)

    self._buffer = ""
    reasoning_content = delta_text
    return ReasoningParserResult(reasoning_content=reasoning_content)

⚠️ Potential issue | 🟠 Major

Keep the content prefix when a later <think> arrives.

When self.in_reasoning is False, this branch jumps straight to the next <think> and discards everything before it. content1<think>reason2 loses content1, and content1<th leaks the partial start tag as content. Streaming chunk boundaries are arbitrary, so multi-block Kimi responses will be corrupted.
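A standalone sketch of the suggested fix, preserving the prefix and buffering a partial start tag. handle_outside_reasoning is a hypothetical helper, not the actual parse_delta method, and it omits the parser's state updates:

```python
THINK_START = "<think>"


def handle_outside_reasoning(delta: str):
    """Return (content, reasoning, buffered_suffix) for a delta received
    while not inside a reasoning block."""
    begin = delta.find(THINK_START)
    if begin != -1:
        # Preserve the prefix as content; the rest starts a reasoning block.
        return delta[:begin], delta[begin + len(THINK_START):], ""
    last_lt = delta.rfind("<")
    if last_lt != -1 and THINK_START.startswith(delta[last_lt:]):
        # Buffer a partial "<think" suffix instead of leaking it as content.
        return delta[:last_lt], "", delta[last_lt:]
    return delta, "", ""
```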

🧰 Tools
🪛 Ruff (0.15.5)

[warning] 307-307: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)


Comment on lines +33 to +61
def _parse_param_value(value_str: str, param_type: Optional[str]) -> Any:
    """Parse a parameter value string into the appropriate Python type."""
    value_str = value_str.strip()

    # Try JSON parsing first
    try:
        parsed = json.loads(value_str)
        return parsed
    except (json.JSONDecodeError, ValueError):
        pass

    # For numeric types, try numeric conversion
    if param_type in ("number", "integer"):
        try:
            if "." in value_str or "e" in value_str.lower():
                return float(value_str)
            return int(value_str)
        except (ValueError, TypeError):
            pass

    # For boolean type
    if param_type == "boolean":
        if value_str.lower() == "true":
            return True
        if value_str.lower() == "false":
            return False

    # Default: return as string
    return value_str

⚠️ Potential issue | 🟠 Major

Don't coerce "string" parameters before consulting the schema.

json.loads() runs before any param_type check, so values like 42, true, null, or {"k":1} are converted into non-strings even when the tool definition says "type": "string". That mutates the arguments the tool receives on the happy path.

Possible direction
 def _parse_param_value(value_str: str, param_type: Optional[str]) -> Any:
     value_str = value_str.strip()

-    # Try JSON parsing first
-    try:
-        parsed = json.loads(value_str)
-        return parsed
-    except (json.JSONDecodeError, ValueError):
-        pass
+    if param_type in (None, "string"):
+        return value_str
+
+    if param_type in ("object", "array", "null"):
+        try:
+            return json.loads(value_str)
+        except json.JSONDecodeError:
+            return value_str

     # For numeric types, try numeric conversion

Comment on lines +104 to +110
def detect_and_parse(self, text: str, tools: List[Tool]) -> StreamingParseResult:
    """One-time parsing: detect and parse all tool calls in the text."""
    idx = text.find(self.bot_token)
    normal_text = text[:idx].strip() if idx != -1 else text

    if self.bot_token not in text:
        return StreamingParseResult(normal_text=normal_text, calls=[])

⚠️ Potential issue | 🟠 Major

Preserve assistant text around the MiniMax tool block.

The one-shot path keeps only the prefix before the first <minimax:tool_call>, so any suffix after the closing tag is dropped. The streaming path returns normal_text="" as soon as the opener is present, which means prefix<minimax:tool_call>... loses prefix, and ...</minimax:tool_call>suffix is not surfaced until a later delta. Mixed content/tool responses become lossy.

Also applies to: 167-170, 259-268
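One way to sketch the suggested prefix/suffix preservation for the one-shot path. split_tool_block is illustrative only: the real detect_and_parse returns a StreamingParseResult and would parse the block body into tool calls rather than returning it raw:

```python
BOT = "<minimax:tool_call>"
EOT = "</minimax:tool_call>"


def split_tool_block(text: str):
    """Return (normal_text, tool_body), keeping text around the block."""
    start = text.find(BOT)
    if start == -1:
        return text, None
    end = text.find(EOT, start)
    if end == -1:
        # Opener but no closer yet: surface the prefix now and let the
        # streaming path buffer the remainder.
        return text[:start], None
    # Keep both the prefix and the suffix as normal assistant text.
    prefix = text[:start]
    suffix = text[end + len(EOT):]
    body = text[start + len(BOT):end]
    return prefix + suffix, body
```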


@JunyiXu-nv JunyiXu-nv removed the request for review from hchings March 13, 2026 12:53