
Add eval API harness and restructure test agent skills #135

Open
adilei wants to merge 18 commits into main from feature/eval-api-harness

Conversation


@adilei adilei commented Apr 6, 2026

Summary

  • New eval-api.bundle.js script — thin HTTP client for the Power Platform Evaluation API (PPAPI) with 6 subcommands: list-testsets, get-testset, start-run, get-run, get-results, list-runs
  • Supports draft testing via runOnPublishedBot=false — no publish step needed for the edit→push→eval loop
  • Requires App Registration with CopilotStudio.MakerOperations.Read + ReadWrite delegated permissions (Power Platform API resource 8578e004-a5c6-46e7-913e-12f58912df43)
  • Restructured test skills: run-eval (new PPAPI skill), run-tests-kit (extracted Kit mode), analyze-evals (extracted CSV mode), run-tests (deprecated)
  • Rewritten test agent with routing table for draft vs published testing, and full edit→push→eval loop documentation

API findings

  • VS Code 1P token does not work with the eval API (missing CopilotStudio.MakerOperations permissions) — custom app registration required
  • Test set CRUD is not available via the API — test sets must be managed in the Copilot Studio UI
  • api-version=2024-10-01 is required on all endpoints
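The api-version requirement can be enforced in one place when building request URLs. This is a minimal sketch, not the actual `eval-api.bundle.js` code — the base URL, path shape, and parameter names here are illustrative assumptions:

```javascript
// Sketch: every PPAPI request must carry api-version=2024-10-01.
// The host and path below are placeholders, not the real endpoint layout.
const API_VERSION = "2024-10-01";

function buildEvalUrl(baseUrl, path, params = {}) {
  const url = new URL(path, baseUrl);
  // Merge caller params, then force the required api-version.
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  url.searchParams.set("api-version", API_VERSION);
  return url.toString();
}

console.log(buildEvalUrl("https://api.powerplatform.example/", "evaluations/testsets"));
// → https://api.powerplatform.example/evaluations/testsets?api-version=2024-10-01
```

Centralizing the version parameter keeps all six subcommands consistent and makes a future version bump a one-line change.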

Test plan

  • list-testsets — discovers test sets for a bot
  • start-run with runOnPublishedBot: false — starts draft evaluation
  • get-run polling — tracks Queued → InProgress → Completed
  • get-results — returns full metrics (GeneralQuality with relevance/completeness/groundedness/abstention)
  • Concurrent run limit (422) handled with clear error message
  • End-to-end: 10 test cases, 9 Pass / 1 Fail (groundedness), ~3 min runtime
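The polling and 422 handling in the test plan can be sketched as a small state machine. This is an illustrative sketch only — the field names (`status`, `statusCode`) are assumptions, not the real PPAPI response contract:

```javascript
// classify() maps a get-run response to the harness's next action;
// drive() walks a sequence of responses the way the polling loop would.
function classify(run) {
  if (run.statusCode === 422) {
    // Concurrent run limit: surface a clear, actionable error.
    throw new Error("A run is already in progress for this bot; wait for it to finish.");
  }
  if (run.status === "Completed" || run.status === "Failed") return "done";
  return "poll-again"; // Queued or InProgress
}

function drive(responses) {
  for (const run of responses) {
    if (classify(run) === "done") return run.status;
  }
  throw new Error("Run did not complete within the polling budget.");
}

console.log(
  drive([{ status: "Queued" }, { status: "InProgress" }, { status: "Completed" }])
); // prints "Completed"
```

Keeping the status classification pure makes the Queued → InProgress → Completed transitions and the 422 path easy to test without hitting the API.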

🤖 Generated with Claude Code

adilei and others added 17 commits April 6, 2026 17:42
New eval-api script (scripts/src/eval-api.js) provides a thin HTTP client for the
Power Platform Evaluation API (PPAPI) with agent-driven polling. Subcommands:
list-testsets, get-testset, start-run, get-run, get-results, list-runs.

Key capabilities:
- Draft testing via runOnPublishedBot=false — no publish step needed
- Requires App Registration with CopilotStudio.MakerOperations.Read/ReadWrite
  delegated permissions on the Power Platform API resource
- Reuses shared-auth for MSAL token management with device code fallback

Restructured test agent skills:
- run-eval: New skill for PPAPI evaluations with full push→eval→fix loop
- run-tests-kit: Extracted Kit batch testing (Mode A from run-tests)
- analyze-evals: Extracted CSV analysis (Mode B from run-tests)
- run-tests: Deprecated with redirect to new skills

Updated test agent (copilot-studio-test.md):
- New routing table with draft vs published testing modes
- Documents the edit→push→eval loop for iterative testing
- Clarifies agent lifecycle (pushed drafts reachable by PPAPI evals)

Tested end-to-end: list-testsets, start-run (draft), polling, get-results
(9/10 pass, 1 groundedness fail) against a real Copilot Studio agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the import CSV format (question + expectedResponse columns),
limits (100 questions, 500 char/question, 1000 char/response), all
available test methods, and the workflow for creating test sets via
CSV import in the Copilot Studio UI.
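The limits above lend themselves to a pre-import check. A minimal sketch, assuming rows shaped like the CSV columns (`question`, `expectedResponse`); this helper is hypothetical and not part of the PR's scripts:

```javascript
// Validate a test set against the documented import limits:
// at most 100 questions, 500 chars per question, 1000 chars per response.
function validateTestSet(rows) {
  const errors = [];
  if (rows.length > 100) errors.push(`too many questions: ${rows.length} > 100`);
  rows.forEach((row, i) => {
    if (row.question.length > 500)
      errors.push(`row ${i + 1}: question exceeds 500 chars`);
    if (row.expectedResponse.length > 1000)
      errors.push(`row ${i + 1}: expectedResponse exceeds 1000 chars`);
  });
  return errors; // empty array means the CSV should import cleanly
}

console.log(validateTestSet([
  { question: "What is PPAPI?", expectedResponse: "The Power Platform Evaluation API." },
])); // prints []
```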

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents all 7 test methods with exact CSV column values, scoring types,
expected response requirements, and UI-only configuration details. Based on
official Microsoft Learn documentation for Copilot Studio evaluation.

Key additions:
- Testing method CSV column with exact string values per grader
- General quality four-criteria breakdown (relevance, groundedness,
  completeness, abstention)
- Compare meaning and Text similarity pass threshold details
- Capability use and Custom graders are UI-only (not settable via CSV)
- Mixed methods per CSV (each row can use a different grader)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Organizes the test agent's capabilities into three distinct categories upfront:
in-product evaluations (PPAPI), Copilot Studio Kit (Dataverse batch), and
point testing (DirectLine/SDK). Makes the taxonomy explicit before the
routing table and skill list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add "create/prepare a test set" to test agent routing table pointing
  to run-eval skill (which has the CSV format documentation)
- Add eval scenario that verifies: test agent is invoked, run-eval skill
  is routed to, and a CSV file is created

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add CRITICAL section to system prompt requiring evaluation/test set
  tasks to go through the Test Agent (not handled directly)
- Expand Test Agent description in system prompt to include evaluation
  and test set CSV creation
- Add disambiguation note for run-eval vs create-eval in test agent
- Update eval prompt to be more testing-workflow oriented

Eval results: Test Agent dispatch now works (agent_invoked passes).
Inner skill routing (run-eval vs create-eval vs run-tests) is
non-deterministic — needs further iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New create-eval-set skill for creating test set CSVs for Copilot Studio
  in-product evaluation (Evaluate tab import)
- Remove deprecated run-tests skill entirely (was causing routing confusion)
- Deprecate create-eval skill description (plugin dev only, not for users)
- Update test agent disambiguation table: create-eval-set vs run-eval vs create-eval
- Update eval scenario to expect create-eval-set skill

Eval results: Test agent dispatch works consistently (3/3 runs).
Inner skill routing still non-deterministic — the test agent picks
run-tests (ghost reference) instead of create-eval-set. Needs further
investigation into stale skill references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows passing a local plugin directory to the claude CLI during evals,
so evals can test against the working copy instead of the installed
plugin cache. This resolved stale-plugin routing issues where the
eval harness was loading outdated skill definitions.

Usage: python3 evals/evaluate.py --scenario X --plugin-dir /path/to/plugin

Eval results: create-eval-set routing passes 3/3 with --plugin-dir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The heavy-handed "MUST go through Test Agent" block was added during
debugging when we thought the model wasn't routing evaluation tasks.
The actual issue was stale plugin files in the installed cache, now
fixed by --plugin-dir. The existing system prompt guidance ("testing...
for which you have a sub-agent available") is sufficient.

Eval still passes 3/3 without the block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The DEPRECATED label was added during debugging when stale plugin
files caused routing confusion. The skill is still valid for plugin
development evals. Keep the clarification distinguishing it from
create-eval-set (Copilot Studio in-product evaluation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The official template confirms test methods cannot be set via CSV import —
the Testing method column is ignored. All imported test cases get General
quality as the default. Other methods must be configured in the UI after
import. Updated both create-eval-set and run-eval skills.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The eval-api script triggers device code auth on first use, which
requires foreground execution to complete. Running in background
causes the auth to be killed before the user can authenticate,
leading to repeated auth prompts. The skill now explicitly says:
- Never use run_in_background for eval-api commands
- Use list-testsets as the auth gate (first foreground call)
- Re-run after auth completes to get the cached token
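Surfacing the device code requires spotting it in the command's stdout. A sketch of that detection — the field names (`device_code`, `user_code`, `verification_uri`) follow the standard OAuth device-code response shape and are assumptions about what the script emits:

```javascript
// Scan stdout line by line for the device-code JSON the skill tells
// the agent to watch for, so the code can be shown to the user.
function findDeviceCode(stdout) {
  for (const line of stdout.split("\n")) {
    try {
      const obj = JSON.parse(line);
      if (obj.device_code && obj.user_code) {
        return { userCode: obj.user_code, verifyUrl: obj.verification_uri };
      }
    } catch {
      // Not JSON — keep scanning.
    }
  }
  return null;
}

const out =
  'starting auth...\n{"device_code":"abc","user_code":"XYZ-123","verification_uri":"https://microsoft.com/devicelogin"}';
console.log(findDeviceCode(out).userCode); // prints "XYZ-123"
```

This is also why the command must run in the foreground: the process has to stay alive while the user completes the login.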

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell the agent to watch for the device_code JSON in stdout, present
the code prominently, and wait for the command to finish (not interrupt
it). Matches the pattern used by chat-sdk skill for MSAL device code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rowser

Standardizes authentication for all test agent API calls (eval API and
SDK chat) around a single "test-agent" MSAL cache slot with interactive
browser login instead of device code.

Changes:
- eval-api.js: Add `auth` subcommand with interactive browser login,
  switch from "manage-agent" to "test-agent" cache slot, remove device
  code fallback for custom client IDs
- chat-with-agent.js: Switch getSdkAccessToken from device code +
  "chat" slot to interactive browser + "test-agent" slot. One auth
  covers both eval and chat.
- Test agent: Add shared "Authentication for eval and SDK chat" section
  documenting the pre-auth step and required permissions
- run-eval skill: Replace device code handling with auth command
  reference, add foreground-only execution rules
- chat-sdk skill: Replace device code instructions with shared auth
  reference

Required App Registration permissions (all on Power Platform API):
- CopilotStudio.MakerOperations.Read/ReadWrite (eval API)
- CopilotStudio.Copilots.Invoke (SDK chat)

Tested: eval-api auth → interactive browser → cached token → silent
reuse by list-testsets (all via the "test-agent" cache slot).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New test-auth skill centralizes authentication for all test agent
workflows that need a custom App Registration:
- Asks user for client ID (or guides through app reg creation)
- Discovers tenant from conn.json automatically
- Runs eval-api auth (interactive browser login)
- Caches token in "test-agent" slot shared by run-eval and chat-sdk

Stripped auth logic from run-eval and chat-sdk skills — they now
reference test-auth as a prerequisite. The test agent runs test-auth
before any eval or SDK chat workflow and remembers the client ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chat-sdk had old prerequisites listing only Copilots.Invoke, causing
the test agent to present incomplete permission requirements instead
of going through test-auth. Now chat-sdk and the point-test workflow
both defer to test-auth for all app registration setup.

Only test-auth has the complete permissions list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The agent was skipping the test set listing and auto-selecting without
asking the user. Made the instruction explicit: MUST list test sets,
MUST let user choose when multiple exist, MUST WAIT for response
before starting a run.
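The selection rule reduces to a small decision function. A sketch under the assumption of a simple list of test set objects; the action strings are illustrative, not the skill's actual wording:

```javascript
// With one test set, proceed; with several, the agent must stop and
// ask the user, then wait for a reply before starting a run.
function chooseTestSet(testSets) {
  if (testSets.length === 0) return { action: "error", message: "no test sets found" };
  if (testSets.length === 1) return { action: "run", testSet: testSets[0] };
  return { action: "ask-user", options: testSets }; // MUST wait for a response
}

console.log(chooseTestSet([{ name: "smoke" }]).action); // prints "run"
console.log(chooseTestSet([{ name: "smoke" }, { name: "full" }]).action); // prints "ask-user"
```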

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ChrisGarty ChrisGarty added the "type/infra" label (Evals, hooks, CI, build, scripts) Apr 7, 2026
run-eval: Removed CSV format docs (belongs in create-eval-set),
removed contradictory auth instructions, simplified from 7 phases
to 5 clear steps. Auth is the caller's responsibility via test-auth.

test agent: Reduced from 6 tables to 1 routing table. Replaced
verbose sections with direct 3-step workflows for each task type.
Removed escape hatches ("if you already have the client ID, skip").
Auth flow is always: test-auth first, then run-eval/chat-sdk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>