feat: add interactive (multi-turn) evaluation support #24

Open
joyyc wants to merge 2 commits into mgechev:main from joyyc:upstream

@joyyc joyyc commented Jun 17, 2026

Contributor

Add multi-turn conversation evaluation where agents can engage in iterative dialogue with simulated user inputs. Key components:

  • InteractiveSession: orchestrates multi-turn conversation execution
  • InputInjectorManager: manages input injection, pattern matching, and stop conditions
  • ClaudeStreamAgent: persistent streaming agent for true multi-turn conversations via stream-json protocol
  • Output marker parser for agent signals ([NEEDS_INPUT:type], etc.)
  • LLM grader support for multi-turn transcripts with context-aware hints
  • Auto-switch from claude to claude-stream when interactive is enabled
  • Error-path grading: attempts evaluation even when agent setup fails

Includes eval.yaml configuration reference, interactive-demo example, and template updates.
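
As an illustration of the output marker parser listed above, here is a minimal sketch of how signals such as `[NEEDS_INPUT:type]` could be extracted from agent output. The `OutputMarker` shape, the `parseMarkers` name, and the rule that marker names are upper-case ASCII are assumptions made for this example, not the PR's actual implementation.

```typescript
// Hypothetical shape for a parsed agent signal (not taken from the PR's source).
interface OutputMarker {
  name: string;       // e.g. "NEEDS_INPUT"
  arg: string | null; // e.g. "choice", or null when the marker has no ":arg" part
  index: number;      // character offset of the marker in the agent output
}

// Scans the output for [NAME] or [NAME:arg] markers.
// Assumption: NAME consists of upper-case ASCII letters and underscores.
function parseMarkers(output: string): OutputMarker[] {
  // A regex literal inside the function yields a fresh object (and lastIndex) per call.
  const pattern = /\[([A-Z][A-Z_]*)(?::([^\]\s]+))?\]/g;
  const markers: OutputMarker[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    markers.push({
      name: match[1],
      arg: match[2] ?? null,
      index: match.index,
    });
  }
  return markers;
}
```

Bracketed text that is not upper-case, such as `[see note]`, is ignored by this sketch, so ordinary prose is less likely to be mistaken for a signal.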

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mgechev commented Jun 22, 2026

Owner

@joyyc would you share the prompt you used to generate this change? The PR has 20 modified files and 1.8k added lines, which makes it hard to review. I'd love to understand your feature request better and run the prompt in my trusted workflow.

joyyc commented Jul 2, 2026

Contributor Author

> @joyyc would you share the prompt you used to generate this change? The PR has 20 modified files and 1.8k added lines, which makes it hard to review. I'd love to understand your feature request better and run the prompt in my trusted workflow.

To be candid, in actual development this was arrived at through multiple rounds of conversation and test-driven iteration with Claude Code. The distilled prompt is as follows (the original was authored in Chinese; the English translation below is provided for readability):

Add interactive multi-turn evaluation support to skillgrade. Implement a persistent multi-turn conversational agent (ClaudeStreamAgent) on top of Claude Code's stream-json protocol (--input-format stream-json --output-format stream-json).
Core requirements:

  1. An InteractiveSession that drives the multi-turn conversation loop and honors max_turns and timeout_per_turn;
  2. An InputInjectorManager that handles input injection (via triggers such as on_turn / on_output_contains) and termination conditions;
  3. Output marker parsing (e.g. [NEEDS_INPUT:type], including Chinese markers);
  4. A new interactive configuration section in eval.yaml;
  5. LLM grader support for scoring multi-turn conversations;
  6. A complete interactive-demo example;
  7. Type definitions and README documentation.
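
Requirements 1 and 2 can be sketched roughly as follows. The trigger names `on_turn` and `on_output_contains` and the limits `max_turns` and `timeout_per_turn` come from the prompt; everything else (the `Trigger` and `InjectionRule` layouts, `pickInput`, `withTimeout`, `runSession`, and the convention that the session stops when no rule fires) is a hypothetical design for illustration and may differ from what the PR implements.

```typescript
// Hypothetical rule layout; only the trigger names come from the prompt.
type Trigger =
  | { kind: "on_turn"; turn: number }             // fire when this turn number is reached
  | { kind: "on_output_contains"; text: string }; // fire when the last output contains text

interface InjectionRule {
  trigger: Trigger;
  input: string; // simulated user reply to send when the trigger fires
}

interface SessionOptions {
  maxTurns: number;         // mirrors max_turns
  timeoutPerTurnMs: number; // mirrors timeout_per_turn, expressed in milliseconds here
}

// Sends one user input to the agent and resolves with the agent's reply.
type SendTurn = (input: string) => Promise<string>;

// Returns the input of the first rule that fires, or null when none does.
function pickInput(rules: InjectionRule[], turn: number, lastOutput: string): string | null {
  for (const rule of rules) {
    const t = rule.trigger;
    if (t.kind === "on_turn" && t.turn === turn) return rule.input;
    if (t.kind === "on_output_contains" && lastOutput.includes(t.text)) return rule.input;
  }
  return null;
}

// Rejects if the wrapped promise does not settle within ms milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`turn timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Drives the conversation until maxTurns is reached or no rule supplies a next input.
async function runSession(
  firstInput: string,
  rules: InjectionRule[],
  send: SendTurn,
  options: SessionOptions,
): Promise<string[]> {
  const transcript: string[] = [];
  let input: string | null = firstInput;
  for (let turn = 1; turn <= options.maxTurns && input !== null; turn++) {
    const output = await withTimeout(send(input), options.timeoutPerTurnMs);
    transcript.push(`user: ${input}`, `agent: ${output}`);
    input = pickInput(rules, turn + 1, output); // choose the input for the next turn
  }
  return transcript;
}
```

Passing the agent in as a `SendTurn` function keeps the loop testable with a stub, independent of how the real agent process is started.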

为skillgrade添加交互式多轮评估支持。需要使用Claude Code的stream-json协议(--input-format stream-json --output-format stream-json)实现持久化多轮对话agent(ClaudeStreamAgent)。

核心:

  1. InteractiveSession管理多轮对话循环,支持max_turns和timeout_per_turn;
  2. InputInjectorManager处理输入注入(on_turn/on_output_contains等触发器)和停止条件;
  3. Output marker解析([NEEDS_INPUT:type]等,含中文标记);
  4. eval.yaml新增interactive配置节;
  5. LLM grader支持多轮对话评分;
  6. 提供interactive-demo完整示例;
  7. 类型定义和README文档。
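
The prompt relies on the `--input-format stream-json --output-format stream-json` flags, which exchange one JSON object per line. A minimal sketch of that framing is shown below. The `UserTurnMessage` shape is an assumption that should be checked against the Claude Code documentation, and `encodeUserTurn` and `JsonLineDecoder` are hypothetical helper names, not identifiers from the PR.

```typescript
// Assumed shape of one user turn written to the agent's stdin (verify before relying on it).
interface UserTurnMessage {
  type: "user";
  message: { role: "user"; content: string };
}

// Serializes one user turn as a single newline-terminated JSON line.
function encodeUserTurn(text: string): string {
  const payload: UserTurnMessage = {
    type: "user",
    message: { role: "user", content: text },
  };
  return JSON.stringify(payload) + "\n";
}

// Reassembles newline-delimited JSON from stdout chunks that may split a line in two.
class JsonLineDecoder {
  private buffer = "";

  // Returns every complete JSON object contained in the data received so far.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }
}
```

Buffering the trailing partial line matters because a persistent process delivers stdout in arbitrary chunks, so a single message can arrive across several reads.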
