
Add eval API harness and restructure test agent skills #135

Open
adilei wants to merge 18 commits into main from feature/eval-api-harness

Conversation


@adilei adilei commented Apr 6, 2026

Summary

  • New eval-api.bundle.js script — thin HTTP client for the Power Platform Evaluation API (PPAPI) with 6 subcommands: list-testsets, get-testset, start-run, get-run, get-results, list-runs
  • Supports draft testing via runOnPublishedBot=false — no publish step needed for the edit→push→eval loop
  • Requires App Registration with CopilotStudio.MakerOperations.Read + ReadWrite delegated permissions (Power Platform API resource 8578e004-a5c6-46e7-913e-12f58912df43)
  • Restructured test skills: run-eval (new PPAPI skill), run-tests-kit (extracted Kit mode), analyze-evals (extracted CSV mode), run-tests (deprecated)
  • Rewritten test agent with routing table for draft vs published testing, and full edit→push→eval loop documentation

API findings

  • VS Code 1P token does not work with the eval API (missing CopilotStudio.MakerOperations permissions) — custom app registration required
  • Test set CRUD is not available via the API — test sets must be managed in the Copilot Studio UI
  • api-version=2024-10-01 is required on all endpoints
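The api-version requirement can be enforced in one place when building request URLs. This is a minimal sketch, not the actual `eval-api.bundle.js` code — the base URL, path shape, and parameter names here are illustrative assumptions:

```javascript
// Sketch: every PPAPI request must carry api-version=2024-10-01.
// The host and path below are placeholders, not the real endpoint layout.
const API_VERSION = "2024-10-01";

function buildEvalUrl(baseUrl, path, params = {}) {
  const url = new URL(path, baseUrl);
  // Merge caller params, then force the required api-version.
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  url.searchParams.set("api-version", API_VERSION);
  return url.toString();
}

console.log(buildEvalUrl("https://api.powerplatform.example/", "evaluations/testsets"));
// → https://api.powerplatform.example/evaluations/testsets?api-version=2024-10-01
```

Centralizing the version parameter keeps all six subcommands consistent and makes a future version bump a one-line change.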

Test plan

  • list-testsets — discovers test sets for a bot
  • start-run with runOnPublishedBot: false — starts draft evaluation
  • get-run polling — tracks Queued → InProgress → Completed
  • get-results — returns full metrics (GeneralQuality with relevance/completeness/groundedness/abstention)
  • Concurrent run limit (422) handled with clear error message
  • End-to-end: 10 test cases, 9 Pass / 1 Fail (groundedness), ~3 min runtime
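The polling and 422 handling in the test plan can be sketched as a small state machine. This is an illustrative sketch only — the field names (`status`, `statusCode`) are assumptions, not the real PPAPI response contract:

```javascript
// classify() maps a get-run response to the harness's next action;
// drive() walks a sequence of responses the way the polling loop would.
function classify(run) {
  if (run.statusCode === 422) {
    // Concurrent run limit: surface a clear, actionable error.
    throw new Error("A run is already in progress for this bot; wait for it to finish.");
  }
  if (run.status === "Completed" || run.status === "Failed") return "done";
  return "poll-again"; // Queued or InProgress
}

function drive(responses) {
  for (const run of responses) {
    if (classify(run) === "done") return run.status;
  }
  throw new Error("Run did not complete within the polling budget.");
}

console.log(
  drive([{ status: "Queued" }, { status: "InProgress" }, { status: "Completed" }])
); // prints "Completed"
```

Keeping the status classification pure makes the Queued → InProgress → Completed transitions and the 422 path easy to test without hitting the API.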

🤖 Generated with Claude Code

adilei and others added 17 commits April 6, 2026 17:42
New eval-api script (scripts/src/eval-api.js) provides a thin HTTP client for the
Power Platform Evaluation API (PPAPI) with agent-driven polling. Subcommands:
list-testsets, get-testset, start-run, get-run, get-results, list-runs.

Key capabilities:
- Draft testing via runOnPublishedBot=false — no publish step needed
- Requires App Registration with CopilotStudio.MakerOperations.Read/ReadWrite
  delegated permissions on the Power Platform API resource
- Reuses shared-auth for MSAL token management with device code fallback

Restructured test agent skills:
- run-eval: New skill for PPAPI evaluations with full push→eval→fix loop
- run-tests-kit: Extracted Kit batch testing (Mode A from run-tests)
- analyze-evals: Extracted CSV analysis (Mode B from run-tests)
- run-tests: Deprecated with redirect to new skills

Updated test agent (copilot-studio-test.md):
- New routing table with draft vs published testing modes
- Documents the edit→push→eval loop for iterative testing
- Clarifies agent lifecycle (pushed drafts reachable by PPAPI evals)

Tested end-to-end: list-testsets, start-run (draft), polling, get-results
(9/10 pass, 1 groundedness fail) against a real Copilot Studio agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the import CSV format (question + expectedResponse columns),
limits (100 questions, 500 char/question, 1000 char/response), all
available test methods, and the workflow for creating test sets via
CSV import in the Copilot Studio UI.
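The limits above lend themselves to a pre-import check. A minimal sketch, assuming rows shaped like the CSV columns (`question`, `expectedResponse`); this helper is hypothetical and not part of the PR's scripts:

```javascript
// Validate a test set against the documented import limits:
// at most 100 questions, 500 chars per question, 1000 chars per response.
function validateTestSet(rows) {
  const errors = [];
  if (rows.length > 100) errors.push(`too many questions: ${rows.length} > 100`);
  rows.forEach((row, i) => {
    if (row.question.length > 500)
      errors.push(`row ${i + 1}: question exceeds 500 chars`);
    if (row.expectedResponse.length > 1000)
      errors.push(`row ${i + 1}: expectedResponse exceeds 1000 chars`);
  });
  return errors; // empty array means the CSV should import cleanly
}

console.log(validateTestSet([
  { question: "What is PPAPI?", expectedResponse: "The Power Platform Evaluation API." },
])); // prints []
```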

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents all 7 test methods with exact CSV column values, scoring types,
expected response requirements, and UI-only configuration details. Based on
official Microsoft Learn documentation for Copilot Studio evaluation.

Key additions:
- Testing method CSV column with exact string values per grader
- General quality four-criteria breakdown (relevance, groundedness,
  completeness, abstention)
- Compare meaning and Text similarity pass threshold details
- Capability use and Custom graders are UI-only (not settable via CSV)
- Mixed methods per CSV (each row can use a different grader)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Organizes the test agent's capabilities into three distinct categories upfront:
in-product evaluations (PPAPI), Copilot Studio Kit (Dataverse batch), and
point testing (DirectLine/SDK). Makes the taxonomy explicit before the
routing table and skill list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add "create/prepare a test set" to test agent routing table pointing
  to run-eval skill (which has the CSV format documentation)
- Add eval scenario that verifies: test agent is invoked, run-eval skill
  is routed to, and a CSV file is created

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add CRITICAL section to system prompt requiring evaluation/test set
  tasks to go through the Test Agent (not handled directly)
- Expand Test Agent description in system prompt to include evaluation
  and test set CSV creation
- Add disambiguation note for run-eval vs create-eval in test agent
- Update eval prompt to be more testing-workflow oriented

Eval results: Test Agent dispatch now works (agent_invoked passes).
Inner skill routing (run-eval vs create-eval vs run-tests) is
non-deterministic — needs further iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New create-eval-set skill for creating test set CSVs for Copilot Studio
  in-product evaluation (Evaluate tab import)
- Remove deprecated run-tests skill entirely (was causing routing confusion)
- Deprecate create-eval skill description (plugin dev only, not for users)
- Update test agent disambiguation table: create-eval-set vs run-eval vs create-eval
- Update eval scenario to expect create-eval-set skill

Eval results: Test agent dispatch works consistently (3/3 runs).
Inner skill routing still non-deterministic — the test agent picks
run-tests (ghost reference) instead of create-eval-set. Needs further
investigation into stale skill references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows passing a local plugin directory to the claude CLI during evals,
so evals can test against the working copy instead of the installed
plugin cache. This resolved stale-plugin routing issues where the
eval harness was loading outdated skill definitions.

Usage: python3 evals/evaluate.py --scenario X --plugin-dir /path/to/plugin

Eval results: create-eval-set routing passes 3/3 with --plugin-dir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The heavy-handed "MUST go through Test Agent" block was added during
debugging when we thought the model wasn't routing evaluation tasks.
The actual issue was stale plugin files in the installed cache, now
fixed by --plugin-dir. The existing system prompt guidance ("testing...
for which you have a sub-agent available") is sufficient.

Eval still passes 3/3 without the block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The DEPRECATED label was added during debugging when stale plugin
files caused routing confusion. The skill is still valid for plugin
development evals. Keep the clarification distinguishing it from
create-eval-set (Copilot Studio in-product evaluation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The official template confirms test methods cannot be set via CSV import —
the Testing method column is ignored. All imported test cases get General
quality as the default. Other methods must be configured in the UI after
import. Updated both create-eval-set and run-eval skills.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The eval-api script triggers device code auth on first use, which
requires foreground execution to complete. Running in background
causes the auth to be killed before the user can authenticate,
leading to repeated auth prompts. The skill now explicitly says:
- Never use run_in_background for eval-api commands
- Use list-testsets as the auth gate (first foreground call)
- Re-run after auth completes to get the cached token
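Surfacing the device code requires spotting it in the command's stdout. A sketch of that detection — the field names (`device_code`, `user_code`, `verification_uri`) follow the standard OAuth device-code response shape and are assumptions about what the script emits:

```javascript
// Scan stdout line by line for the device-code JSON the skill tells
// the agent to watch for, so the code can be shown to the user.
function findDeviceCode(stdout) {
  for (const line of stdout.split("\n")) {
    try {
      const obj = JSON.parse(line);
      if (obj.device_code && obj.user_code) {
        return { userCode: obj.user_code, verifyUrl: obj.verification_uri };
      }
    } catch {
      // Not JSON — keep scanning.
    }
  }
  return null;
}

const out =
  'starting auth...\n{"device_code":"abc","user_code":"XYZ-123","verification_uri":"https://microsoft.com/devicelogin"}';
console.log(findDeviceCode(out).userCode); // prints "XYZ-123"
```

This is also why the command must run in the foreground: the process has to stay alive while the user completes the login.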

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell the agent to watch for the device_code JSON in stdout, present
the code prominently, and wait for the command to finish (not interrupt
it). Matches the pattern used by chat-sdk skill for MSAL device code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rowser

Standardizes authentication for all test agent API calls (eval API and
SDK chat) around a single "test-agent" MSAL cache slot with interactive
browser login instead of device code.

Changes:
- eval-api.js: Add `auth` subcommand with interactive browser login,
  switch from "manage-agent" to "test-agent" cache slot, remove device
  code fallback for custom client IDs
- chat-with-agent.js: Switch getSdkAccessToken from device code +
  "chat" slot to interactive browser + "test-agent" slot. One auth
  covers both eval and chat.
- Test agent: Add shared "Authentication for eval and SDK chat" section
  documenting the pre-auth step and required permissions
- run-eval skill: Replace device code handling with auth command
  reference, add foreground-only execution rules
- chat-sdk skill: Replace device code instructions with shared auth
  reference

Required App Registration permissions (all on Power Platform API):
- CopilotStudio.MakerOperations.Read/ReadWrite (eval API)
- CopilotStudio.Copilots.Invoke (SDK chat)

Tested: eval-api auth → interactive browser → cached token → silent
reuse by list-testsets (all via the "test-agent" cache slot).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New test-auth skill centralizes authentication for all test agent
workflows that need a custom App Registration:
- Asks user for client ID (or guides through app reg creation)
- Discovers tenant from conn.json automatically
- Runs eval-api auth (interactive browser login)
- Caches token in "test-agent" slot shared by run-eval and chat-sdk

Stripped auth logic from run-eval and chat-sdk skills — they now
reference test-auth as a prerequisite. The test agent runs test-auth
before any eval or SDK chat workflow and remembers the client ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chat-sdk had old prerequisites listing only Copilots.Invoke, causing
the test agent to present incomplete permission requirements instead
of going through test-auth. Now chat-sdk and the point-test workflow
both defer to test-auth for all app registration setup.

Only test-auth has the complete permissions list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The agent was skipping the test set listing and auto-selecting without
asking the user. Made the instruction explicit: MUST list test sets,
MUST let user choose when multiple exist, MUST WAIT for response
before starting a run.
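The selection rule reduces to a small decision function. A sketch under the assumption of a simple list of test set objects; the action strings are illustrative, not the skill's actual wording:

```javascript
// With one test set, proceed; with several, the agent must stop and
// ask the user, then wait for a reply before starting a run.
function chooseTestSet(testSets) {
  if (testSets.length === 0) return { action: "error", message: "no test sets found" };
  if (testSets.length === 1) return { action: "run", testSet: testSets[0] };
  return { action: "ask-user", options: testSets }; // MUST wait for a response
}

console.log(chooseTestSet([{ name: "smoke" }]).action); // prints "run"
console.log(chooseTestSet([{ name: "smoke" }, { name: "full" }]).action); // prints "ask-user"
```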

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ChrisGarty ChrisGarty added the "type/infra" label (Evals, hooks, CI, build, scripts) Apr 7, 2026
run-eval: Removed CSV format docs (belongs in create-eval-set),
removed contradictory auth instructions, simplified from 7 phases
to 5 clear steps. Auth is the caller's responsibility via test-auth.

test agent: Reduced from 6 tables to 1 routing table. Replaced
verbose sections with direct 3-step workflows for each task type.
Removed escape hatches ("if you already have the client ID, skip").
Auth flow is always: test-auth first, then run-eval/chat-sdk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>