feat: add interactive (multi-turn) evaluation support #24
Add multi-turn conversation evaluation where agents can engage in iterative dialogue with simulated user inputs.

Key components:
- InteractiveSession: orchestrates multi-turn conversation execution
- InputInjectorManager: manages input injection, pattern matching, and stop conditions
- ClaudeStreamAgent: persistent streaming agent for true multi-turn conversations via stream-json protocol
- Output marker parser for agent signals ([NEEDS_INPUT:type], etc.)
- LLM grader support for multi-turn transcripts with context-aware hints
- Auto-switch from claude to claude-stream when interactive is enabled
- Error-path grading: attempts evaluation even when agent setup fails

Includes eval.yaml configuration reference, interactive-demo example, and template updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
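For reviewers trying to picture how the output marker parser and the input injection rules might fit together, here is a minimal, language-agnostic sketch in Python. It is not the code in this PR. The marker grammar (`[NAME]` or `[NAME:arg]`), the `Marker` and `Rule` types, and the `parse_markers` and `next_input` helpers are all hypothetical. Only the `[NEEDS_INPUT:type]` marker itself comes from the description above, and the turn budget stands in for whatever stop conditions the real InputInjectorManager supports.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

# Assumed marker grammar: [NAME] or [NAME:arg] with an upper-case NAME.
# Only [NEEDS_INPUT:type] is confirmed by the PR description.
MARKER_RE = re.compile(r"\[([A-Z][A-Z_]*)(?::([^\[\]]+))?\]")


@dataclass(frozen=True)
class Marker:
    name: str            # e.g. "NEEDS_INPUT"
    arg: Optional[str]   # e.g. "choice", or None when the marker has no argument


def parse_markers(text: str) -> list[Marker]:
    """Return every marker found in one chunk of agent output, in order."""
    return [Marker(m.group(1), m.group(2)) for m in MARKER_RE.finditer(text)]


@dataclass
class Rule:
    pattern: str  # regex matched against the agent's latest output
    reply: str    # simulated user input to inject when the pattern matches


def next_input(output: str, rules: list[Rule], turn: int, max_turns: int) -> Optional[str]:
    """Pick the simulated user reply for this turn, or None to end the session."""
    if turn >= max_turns:
        # Hypothetical stop condition: the turn budget is exhausted.
        return None
    if not any(m.name == "NEEDS_INPUT" for m in parse_markers(output)):
        # The agent did not signal that it is waiting for input.
        return None
    for rule in rules:
        if re.search(rule.pattern, output):
            return rule.reply  # first matching rule wins
    return None  # the agent asked for input, but no rule covers the question
```

With `rules = [Rule(r"(?i)which db", "postgres")]`, calling `next_input("Which DB? [NEEDS_INPUT:choice]", rules, 0, 3)` returns `"postgres"`, and the same call with `turn=3` returns `None` because the turn budget is spent. Returning `None` for every stop case keeps the driving loop simple, at the cost of not distinguishing why the session ended.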
@joyyc would you share the prompt you used to generate this change? There are 20 modifier changes and 1.8k lines added which makes the PR hard to review. I'd love to understand your feature request better and run the prompt in my trusted workflow.
To be candid, in actual development this was arrived at through multiple rounds of conversation and test-driven iteration with Claude Code. The distilled prompt is as follows (the original was authored in Chinese; it is given here in English translation for readability):

Add interactive multi-turn evaluation support to skillgrade. Use Claude Code's stream-json protocol (--input-format stream-json --output-format stream-json) to implement a persistent multi-turn conversational agent (ClaudeStreamAgent). Core: