Add eval API harness and restructure test agent skills #135
Open
Conversation
New eval-api script (scripts/src/eval-api.js) provides a thin HTTP client for the Power Platform Evaluation API (PPAPI) with agent-driven polling. Subcommands: list-testsets, get-testset, start-run, get-run, get-results, list-runs.

Key capabilities:
- Draft testing via runOnPublishedBot=false — no publish step needed
- Requires an App Registration with CopilotStudio.MakerOperations.Read/ReadWrite delegated permissions on the Power Platform API resource
- Reuses shared-auth for MSAL token management with device code fallback

Restructured test agent skills:
- run-eval: new skill for PPAPI evaluations with the full push→eval→fix loop
- run-tests-kit: extracted Kit batch testing (Mode A from run-tests)
- analyze-evals: extracted CSV analysis (Mode B from run-tests)
- run-tests: deprecated, with a redirect to the new skills

Updated test agent (copilot-studio-test.md):
- New routing table with draft vs. published testing modes
- Documents the edit→push→eval loop for iterative testing
- Clarifies agent lifecycle (pushed drafts are reachable by PPAPI evals)

Tested end-to-end: list-testsets, start-run (draft), polling, get-results (9/10 pass, 1 groundedness fail) against a real Copilot Studio agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
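The agent-driven polling described above (start-run, then get-run until a terminal state) can be sketched generically. The status names Queued/InProgress/Completed come from this PR; the `getRun` callback and the timing defaults are assumptions, not the script's actual API.

```javascript
// Minimal sketch of the start-run → get-run polling loop.
// `getRun` stands in for one invocation of the get-run subcommand.
async function pollRun(getRun, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await getRun();
    // Terminal states: stop polling and hand the run back to the caller.
    if (run.status === "Completed" || run.status === "Failed") return run;
    // Queued / InProgress: wait, then ask again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Evaluation run did not finish within the polling budget");
}
```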
Documents the import CSV format (question + expectedResponse columns), limits (100 questions, 500 char/question, 1000 char/response), all available test methods, and the workflow for creating test sets via CSV import in the Copilot Studio UI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
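The documented limits lend themselves to a pre-import check. A minimal sketch, assuming rows already parsed into `{question, expectedResponse}` objects (the import itself still happens in the Copilot Studio UI; `validateTestSet` is a hypothetical helper, not part of the scripts in this PR):

```javascript
// Check a test set against the documented import limits:
// max 100 questions, 500 chars per question, 1000 chars per expected response.
function validateTestSet(rows) {
  const errors = [];
  if (rows.length > 100) errors.push(`too many questions: ${rows.length} > 100`);
  rows.forEach((row, i) => {
    if (!row.question) errors.push(`row ${i + 1}: missing question`);
    else if (row.question.length > 500) errors.push(`row ${i + 1}: question exceeds 500 chars`);
    if (row.expectedResponse && row.expectedResponse.length > 1000)
      errors.push(`row ${i + 1}: expectedResponse exceeds 1000 chars`);
  });
  return errors; // empty array means the CSV should import cleanly
}
```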
Documents all 7 test methods with exact CSV column values, scoring types, expected response requirements, and UI-only configuration details. Based on official Microsoft Learn documentation for Copilot Studio evaluation.

Key additions:
- Testing method CSV column with exact string values per grader
- General quality four-criteria breakdown (relevance, groundedness, completeness, abstention)
- Compare meaning and Text similarity pass threshold details
- Capability use and Custom graders are UI-only (not settable via CSV)
- Mixed methods per CSV (each row can use a different grader)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frames the test agent's capabilities into three distinct categories upfront: in-product evaluations (PPAPI), Copilot Studio Kit (Dataverse batch), and point testing (DirectLine/SDK). Makes the taxonomy explicit before the routing table and skill list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add "create/prepare a test set" to the test agent routing table, pointing to the run-eval skill (which has the CSV format documentation)
- Add an eval scenario that verifies: the test agent is invoked, the run-eval skill is routed to, and a CSV file is created

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add a CRITICAL section to the system prompt requiring evaluation/test set tasks to go through the Test Agent (not handled directly)
- Expand the Test Agent description in the system prompt to include evaluation and test set CSV creation
- Add a disambiguation note for run-eval vs create-eval in the test agent
- Update the eval prompt to be more testing-workflow oriented

Eval results: Test Agent dispatch now works (agent_invoked passes). Inner skill routing (run-eval vs create-eval vs run-tests) is non-deterministic — needs further iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New create-eval-set skill for creating test set CSVs for Copilot Studio in-product evaluation (Evaluate tab import)
- Remove the deprecated run-tests skill entirely (it was causing routing confusion)
- Deprecate the create-eval skill description (plugin dev only, not for users)
- Update the test agent disambiguation table: create-eval-set vs run-eval vs create-eval
- Update the eval scenario to expect the create-eval-set skill

Eval results: Test agent dispatch works consistently (3/3 runs). Inner skill routing is still non-deterministic — the test agent picks run-tests (a ghost reference) instead of create-eval-set. Needs further investigation into stale skill references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows passing a local plugin directory to the claude CLI during evals, so evals can test against the working copy instead of the installed plugin cache. This resolved stale-plugin routing issues where the eval harness was loading outdated skill definitions.

Usage: python3 evals/evaluate.py --scenario X --plugin-dir /path/to/plugin

Eval results: create-eval-set routing passes 3/3 with --plugin-dir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The heavy-handed "MUST go through Test Agent" block was added during
debugging when we thought the model wasn't routing evaluation tasks.
The actual issue was stale plugin files in the installed cache, now
fixed by --plugin-dir. The existing system prompt guidance ("testing...
for which you have a sub-agent available") is sufficient.
Eval still passes 3/3 without the block.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The DEPRECATED label was added during debugging when stale plugin files caused routing confusion. The skill is still valid for plugin development evals. Keep the clarification distinguishing it from create-eval-set (Copilot Studio in-product evaluation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The official template confirms test methods cannot be set via CSV import — the Testing method column is ignored. All imported test cases get General quality as the default. Other methods must be configured in the UI after import. Updated both create-eval-set and run-eval skills. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The eval-api script triggers device code auth on first use, which requires foreground execution to complete. Running it in the background kills the auth flow before the user can authenticate, leading to repeated auth prompts.

The skill now explicitly says:
- Never use run_in_background for eval-api commands
- Use list-testsets as the auth gate (the first foreground call)
- Re-run after auth completes to get the cached token

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell the agent to watch for the device_code JSON in stdout, present the code prominently, and wait for the command to finish (not interrupt it). Matches the pattern used by chat-sdk skill for MSAL device code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
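The watch-for-device-code step can be sketched as below. The exact JSON shape (a device_code / user_code / verification_uri object printed to stdout) is an assumption modeled on the MSAL device code flow, and `findDeviceCode` is a hypothetical helper, not code from this PR.

```javascript
// Scan a chunk of stdout for a device-code JSON object and extract the
// user-facing code so the agent can present it prominently.
function findDeviceCode(stdoutChunk) {
  for (const line of stdoutChunk.split("\n")) {
    try {
      const parsed = JSON.parse(line);
      if (parsed.device_code && parsed.user_code) {
        return { code: parsed.user_code, url: parsed.verification_uri };
      }
    } catch {
      // Not JSON; ordinary log output, keep scanning.
    }
  }
  return null;
}
```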
Standardizes authentication for all test agent API calls (eval API and SDK chat) around a single "test-agent" MSAL cache slot with interactive browser login instead of device code.

Changes:
- eval-api.js: Add an `auth` subcommand with interactive browser login, switch from the "manage-agent" to the "test-agent" cache slot, remove the device code fallback for custom client IDs
- chat-with-agent.js: Switch getSdkAccessToken from device code + "chat" slot to interactive browser + "test-agent" slot. One auth covers both eval and chat.
- Test agent: Add a shared "Authentication for eval and SDK chat" section documenting the pre-auth step and required permissions
- run-eval skill: Replace device code handling with an auth command reference, add foreground-only execution rules
- chat-sdk skill: Replace device code instructions with a shared auth reference

Required App Registration permissions (all on the Power Platform API):
- CopilotStudio.MakerOperations.Read/ReadWrite (eval API)
- CopilotStudio.Copilots.Invoke (SDK chat)

Tested: eval-api auth → interactive browser → cached token → silent reuse by list-testsets (all via the "test-agent" cache slot).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
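The cache-slot pattern above can be sketched generically: acquire silently from the named slot when a valid token is cached, fall back to one interactive login otherwise. The slot name "test-agent" comes from the commit; the cache store and the `acquireInteractive` callback are assumptions standing in for the real MSAL plumbing.

```javascript
// Silent-first token acquisition keyed by a named cache slot.
// One interactive login fills the "test-agent" slot; later calls
// (eval API and SDK chat alike) reuse the cached token silently.
async function getToken(cache, slot, acquireInteractive) {
  const cached = cache.get(slot);
  if (cached && cached.expiresAt > Date.now()) return cached.token; // silent reuse
  const fresh = await acquireInteractive(); // e.g. interactive browser login
  cache.set(slot, fresh);
  return fresh.token;
}
```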
New test-auth skill centralizes authentication for all test agent workflows that need a custom App Registration:
- Asks the user for a client ID (or guides them through app registration creation)
- Discovers the tenant from conn.json automatically
- Runs eval-api auth (interactive browser login)
- Caches the token in the "test-agent" slot shared by run-eval and chat-sdk

Stripped auth logic from the run-eval and chat-sdk skills — they now reference test-auth as a prerequisite. The test agent runs test-auth before any eval or SDK chat workflow and remembers the client ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chat-sdk had old prerequisites listing only Copilots.Invoke, causing the test agent to present incomplete permission requirements instead of going through test-auth. Now chat-sdk and the point-test workflow both defer to test-auth for all app registration setup. Only test-auth has the complete permissions list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The agent was skipping the test set listing and auto-selecting without asking the user. Made the instruction explicit: MUST list test sets, MUST let user choose when multiple exist, MUST WAIT for response before starting a run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run-eval: Removed CSV format docs (belongs in create-eval-set),
removed contradictory auth instructions, simplified from 7 phases
to 5 clear steps. Auth is the caller's responsibility via test-auth.
test agent: Reduced from 6 tables to 1 routing table. Replaced
verbose sections with direct 3-step workflows for each task type.
Removed escape hatches ("if you already have the client ID, skip").
Auth flow is always: test-auth first, then run-eval/chat-sdk.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- eval-api.bundle.js script — thin HTTP client for the Power Platform Evaluation API (PPAPI) with 6 subcommands: list-testsets, get-testset, start-run, get-run, get-results, list-runs
- runOnPublishedBot=false — no publish step needed for the edit→push→eval loop
- CopilotStudio.MakerOperations.Read + ReadWrite delegated permissions (Power Platform API resource 8578e004-a5c6-46e7-913e-12f58912df43)
- run-eval (new PPAPI skill), run-tests-kit (extracted Kit mode), analyze-evals (extracted CSV mode), run-tests (deprecated)

API findings
- CopilotStudio.MakerOperations permissions — custom app registration required
- api-version=2024-10-01 is required on all endpoints

Test plan
- list-testsets — discovers test sets for a bot
- start-run with runOnPublishedBot: false — starts draft evaluation
- get-run polling — tracks Queued → InProgress → Completed
- get-results — returns full metrics (GeneralQuality with relevance/completeness/groundedness/abstention)

🤖 Generated with Claude Code