Add lookup-schema eval test cases#121

Open
GiorgioUghini wants to merge 2 commits into main from giorgioughini/lookup-schema-evals

@GiorgioUghini
Contributor

Summary

Adds the first eval test suite for the lookup-schema skill — a reference/query skill that looks up Copilot Studio YAML schema definitions using three commands: lookup, search, and resolve.

Since this is a stdout-based skill (no YAML files created), all checks use stdout_contains, stdout_not_contains, and exit_code.
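
For illustration, the three check types could be evaluated along these lines. This is a minimal sketch, not the harness's actual code: `RunResult`, `run_check`, and the `{"type": ..., "value": ...}` check shape are assumptions made for the example, as is the exit code of 0 for a "not found" lookup.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Captured outcome of one skill invocation (hypothetical container)."""
    stdout: str
    exit_code: int


def run_check(check: dict, result: RunResult) -> bool:
    """Evaluate a single check against the captured result.

    The {"type": ..., "value": ...} shape is assumed for illustration only.
    """
    kind, value = check["type"], check.get("value")
    if kind == "stdout_contains":
        return value in result.stdout
    if kind == "stdout_not_contains":
        return value not in result.stdout
    if kind == "exit_code":
        return result.exit_code == value
    raise ValueError(f"unsupported check type: {kind}")


# Checks loosely modelled on test case 6 (non-existent definition).
# The expected exit code of 0 is an assumption, not taken from the PR.
checks = [
    {"type": "stdout_contains", "value": "not found"},
    {"type": "stdout_not_contains", "value": "ENOENT"},
    {"type": "exit_code", "value": 0},
]
result = RunResult(stdout="Definition FakeNonExistentKind not found.", exit_code=0)
all_passed = all(run_check(c, result) for c in checks)
```

Because every check reads only the captured stdout and exit code, no fixture files need to be inspected after the run.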

Test Cases

| # | Name | Command Tested | Description |
|---|------|----------------|-------------|
| 1 | Lookup known definition — SendActivity | lookup | Looks up a common, well-known kind. Verifies the response includes SendActivity, text, and kind. |
| 2 | Lookup trigger kind — OnRecognizedIntent | lookup | Looks up a trigger definition. Verifies the response explains intent and trigger concepts. |
| 3 | Lookup top-level kind — AdaptiveDialog | lookup | Looks up the root topic kind. Verifies beginDialog and trigger are described. |
| 4 | Search for model-related definitions | search | Searches for model-related schema elements. Checks for specific terms like modelDescription and properties (not just the generic word "model"). |
| 5 | Resolve $ref references in a definition | resolve | Resolves QuestionWithOptions with its $ref sub-definitions. Verifies fully resolved output includes properties. |
| 6 | Lookup non-existent definition — graceful fallback | lookup (negative) | Looks up FakeNonExistentKind. Verifies the skill gracefully reports "not found" without crashes (ENOENT, stack trace, FATAL). |

Coverage

  • ✅ All 3 skill commands covered (lookup, search, resolve)
  • ✅ Positive and negative test cases
  • ✅ Uses basic-agent fixture (consistent with other eval suites)

GiorgioUghini and others added 2 commits April 1, 2026 15:03
Add 6 eval test cases for the lookup-schema skill covering:
- Lookup of known definitions (SendActivity, OnRecognizedIntent, AdaptiveDialog)
- Search command for model-related definitions
- Resolve command for $ref reference resolution
- Negative test for graceful handling of non-existent definitions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add explicit UTF-8 encoding and error replacement to subprocess.run
to prevent cp1252 codec failures on Windows. Also guard against
None stdout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
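
The encoding fix described in the second commit might look roughly like the sketch below. `run_capture` is a hypothetical helper name; only the `subprocess.run` keyword arguments (`encoding`, `errors`) and the None guard come from the commit message.

```python
import subprocess
import sys


def run_capture(cmd: list[str]) -> tuple[str, int]:
    """Run a command and return (stdout text, exit code).

    Decoding as UTF-8 with errors="replace" avoids codec failures when the
    platform default (e.g. cp1252 on Windows) cannot decode the output.
    """
    proc = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    # stdout is None when output is not captured; normalise to a string so
    # substring checks never operate on None.
    return proc.stdout or "", proc.returncode


out, code = run_capture([sys.executable, "-c", "print('ok')"])
```

With `errors="replace"`, undecodable bytes become U+FFFD instead of raising `UnicodeDecodeError`, so a stray byte in the skill's output fails a content check rather than crashing the run.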
@adilei
Collaborator

adilei commented Apr 5, 2026

Hey @GiorgioUghini — the evals refactor in #130 changed the eval structure, so this will need a small update before merging:

  1. Move evals/skills/lookup-schema.json → evals/scenarios/lookup-schema.json
  2. Rename top-level key from skill/evals to scenario_name/evals (see evals/scenarios/agent-settings.json for reference)
  3. Rewrite prompts as natural language — e.g. instead of "Use the lookup-schema skill to look up SendActivity", use something like "What properties does the SendActivity kind have in the Copilot Studio YAML schema?"
  4. Add routing checks, i.e. skill_invoked: "copilot-studio:lookup-schema" and optionally agent_invoked if it should route through a sub-agent
  5. Drop any changes to evaluate.py; the harness was rewritten in #130 (Refactor evals: scenario-based testing instead of skill isolation)

The actual check types you're using (stdout_contains, stdout_not_contains, exit_code) are still supported, so the test cases themselves are fine — it's mostly a structural move.
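
The structural part of the steps above (key rename plus routing check) can be sketched as a small transformation. This is only an illustration: `migrate_skill_eval`, the per-case `"checks"` field, and the `{"type": "skill_invoked", "value": ...}` shape are assumptions, so evals/scenarios/agent-settings.json remains the reference for the real layout. Prompts still have to be rewritten by hand.

```python
import copy


def migrate_skill_eval(old: dict, skill_id: str) -> dict:
    """Convert a skill-keyed eval document to a scenario-keyed one.

    Assumes the old document has top-level "skill" and "evals" keys and that
    each eval case keeps its checks in a "checks" list (hypothetical).
    """
    evals = []
    for case in old["evals"]:
        updated = copy.deepcopy(case)  # leave the input document untouched
        checks = updated.setdefault("checks", [])
        # Routing check: assert that the expected skill was invoked.
        checks.append({"type": "skill_invoked", "value": skill_id})
        evals.append(updated)
    return {"scenario_name": old["skill"], "evals": evals}


old_doc = {
    "skill": "lookup-schema",
    "evals": [
        {
            "prompt": "Use the lookup-schema skill to look up SendActivity",
            "checks": [{"type": "stdout_contains", "value": "SendActivity"}],
        }
    ],
}
new_doc = migrate_skill_eval(old_doc, "copilot-studio:lookup-schema")
```

The existing stdout and exit-code checks pass through unchanged, which matches the note that the test cases themselves are fine.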

@ChrisGarty added the type/infra (Evals, hooks, CI, build, scripts) label on Apr 7, 2026