Feat/user evals #18418

Open
PClmnt wants to merge 10 commits into master from feat/user-evals

@PClmnt (Collaborator) commented Mar 30, 2026

Description

This PR adds the ability to run several types of evaluation against your agent instructions.

The following evaluation methods are available:

  • Exact match
  • Contains
  • LLM as a Judge
  • Tool used
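The four reviewer types can be modelled as a discriminated union, with one check per type. The sketch below is a minimal TypeScript illustration, not the PR's implementation: the `Reviewer`, `AgentOutput` and `Judge` names and fields are hypothetical, and the LLM judge is an injected function because the real model call is out of scope.

```typescript
// Hypothetical reviewer shapes: the field names are illustrative and
// are not taken from the PR's actual types.
type Reviewer =
  | { type: "exactMatch"; expected: string }
  | { type: "contains"; text: string }
  | { type: "toolUsed"; toolName: string }
  | { type: "llmJudge"; criteria: string }

// What one agent run produced: its final text and the names of tools it called.
interface AgentOutput {
  text: string
  toolsCalled: string[]
}

// The LLM judge needs a model call, so it is injected to keep the sketch runnable.
type Judge = (criteria: string, output: AgentOutput) => Promise<boolean>

async function review(
  reviewer: Reviewer,
  output: AgentOutput,
  judge: Judge
): Promise<boolean> {
  switch (reviewer.type) {
    case "exactMatch":
      // The whole response must equal the expected string
      return output.text === reviewer.expected
    case "contains":
      // The expected text may appear anywhere in the response
      return output.text.includes(reviewer.text)
    case "toolUsed":
      // Passes if the agent called the named tool at least once
      return output.toolsCalled.includes(reviewer.toolName)
    case "llmJudge":
      // Delegates the pass/fail decision to a model prompted with the criteria
      return judge(reviewer.criteria, output)
  }
}
```

The union keeps each reviewer's configuration next to its type tag, so the compiler can check that every reviewer type is handled in the `switch`.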

For each agent, you can create a suite of evaluations, each containing multiple test cases. Each test case can check for several of the conditions listed above.

All suites are run against the instructions present in the Configuration tab at the moment the suite is run.
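Here is a minimal sketch of how a suite, its test cases and the instruction snapshot might fit together. The `TestCase`, `EvalSuite`, `SuiteRun` and `runSuite` names are hypothetical, checks are plain predicates, and treating a case as passed only when every check passes is an assumption. The instructions are read once at the start, so every case in a run uses the same version.

```typescript
// Hypothetical data model: names are illustrative, not the PR's document schema.
interface TestCase {
  name: string
  input: string
  // Assumed semantics: a case passes only if every one of its checks passes
  checks: Array<(response: string) => boolean>
}

interface EvalSuite {
  agentId: string
  cases: TestCase[]
}

interface CaseResult {
  name: string
  passed: boolean
  failedChecks: number
}

interface SuiteRun {
  // Snapshot of the instructions the whole run was executed against
  instructions: string
  results: CaseResult[]
  passed: number
  total: number
}

// Stand-in for invoking the agent with a given set of instructions
type RunAgent = (instructions: string, input: string) => Promise<string>

async function runSuite(
  suite: EvalSuite,
  readInstructions: () => string,
  runAgent: RunAgent
): Promise<SuiteRun> {
  // Read the current instructions once so every case in this run sees the same version
  const instructions = readInstructions()
  const results: CaseResult[] = []
  for (const testCase of suite.cases) {
    const response = await runAgent(instructions, testCase.input)
    const failedChecks = testCase.checks.filter(check => !check(response)).length
    results.push({ name: testCase.name, passed: failedChecks === 0, failedChecks })
  }
  return {
    instructions,
    results,
    passed: results.filter(r => r.passed).length,
    total: results.length,
  }
}
```

Storing the instruction snapshot on the run makes it possible to tell which version of the prompt a given result was produced against.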

It is currently gated behind the AI_EVALS feature flag.
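A minimal sketch of that gating, assuming a handler that receives the set of enabled flags directly. Budibase's own feature-flag helpers are not reproduced here. `requireFlag` and `getEvals` are hypothetical names, and the 403 status matches what the API returns when AI_EVALS is disabled.

```typescript
// Hypothetical: enabled flags are passed in as a Set so the sketch runs on its own;
// it does not use Budibase's actual feature-flag helpers.
interface HttpResult {
  status: number
  body: unknown
}

const EVALS_FLAG = "AI_EVALS"

// Returns a 403 result when the flag is off, or null when the request may proceed
function requireFlag(enabledFlags: ReadonlySet<string>, flag: string): HttpResult | null {
  if (enabledFlags.has(flag)) {
    return null
  }
  return { status: 403, body: { message: `${flag} is not enabled` } }
}

// Hypothetical handler for GET /api/agent/:agentId/evals
function getEvals(enabledFlags: ReadonlySet<string>, agentId: string): HttpResult {
  const denied = requireFlag(enabledFlags, EVALS_FLAG)
  if (denied) {
    return denied
  }
  // A real handler would load the agent's suites here; an empty list stands in
  return { status: 200, body: { agentId, suites: [] } }
}
```

Checking the flag before any other work means a disabled tenant never reaches the suite-loading or run logic.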

Screenshots

[3 screenshots]

Launchcontrol

  • Adds the ability to run evaluations on your agent prompt and tools

Summary by cubic

Adds agent evaluations (Evals) so you can define test cases, run suites, and review results for agents. The feature is behind the AI_EVALS flag and includes new UI, API, SDK, and logs support.

  • New Features

    • Builder UI to create/edit cases, run suites, and inspect results with per-case reviewers and snapshots; a new Evals tab appears only when AI_EVALS is enabled.
    • Reviewer types: exact match, contains text, tool used, and LLM judge.
    • API and frontend-core client: GET/PUT /api/agent/:agentId/evals, POST /api/agent/:agentId/evals/run, with validation and AI_EVALS gating (403 when disabled).
    • Server SDK for eval suites (CRUD), running suites, and result aggregation; new doc IDs and types for suites and runs.
    • Logs classify “Eval” sessions separately; suite runs and latest case results are surfaced in both UI and API.
  • Migration

    • Enable AI_EVALS to use the feature.
    • Duration utils moved to @budibase/shared-core (Duration, DurationType); update imports from backend-core to @budibase/shared-core.
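This is a hypothetical client for the three endpoints listed under New Features. The paths come from the summary. The request and response bodies are typed as `unknown` because their shapes are not documented here, and `EvalsClient` is an illustrative name, not the frontend-core client's actual API. `fetch` is injected so the sketch can be exercised without a server.

```typescript
// Minimal subset of fetch() used here; injecting it keeps the sketch testable.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string | undefined }
) => Promise<{ status: number; ok: boolean; json(): Promise<unknown> }>

// Hypothetical client for the evals endpoints; bodies are untyped on purpose.
class EvalsClient {
  private readonly baseUrl: string
  private readonly fetchFn: FetchLike

  constructor(baseUrl: string, fetchFn: FetchLike) {
    this.baseUrl = baseUrl
    this.fetchFn = fetchFn
  }

  private async request(method: string, path: string, body?: unknown): Promise<unknown> {
    const res = await this.fetchFn(`${this.baseUrl}${path}`, {
      method,
      headers: { "Content-Type": "application/json" },
      body: body === undefined ? undefined : JSON.stringify(body),
    })
    // The API answers 403 when the AI_EVALS flag is disabled
    if (res.status === 403) {
      throw new Error("AI_EVALS is not enabled")
    }
    if (!res.ok) {
      throw new Error(`Request failed with status ${res.status}`)
    }
    return res.json()
  }

  private evalsPath(agentId: string): string {
    return `/api/agent/${encodeURIComponent(agentId)}/evals`
  }

  // GET /api/agent/:agentId/evals
  getEvals(agentId: string): Promise<unknown> {
    return this.request("GET", this.evalsPath(agentId))
  }

  // PUT /api/agent/:agentId/evals
  saveEvals(agentId: string, evals: unknown): Promise<unknown> {
    return this.request("PUT", this.evalsPath(agentId), evals)
  }

  // POST /api/agent/:agentId/evals/run
  runEvals(agentId: string, params: unknown): Promise<unknown> {
    return this.request("POST", `${this.evalsPath(agentId)}/run`, params)
  }
}
```

In Node 18+ or a browser, the global `fetch` should be assignable to `FetchLike`, so `new EvalsClient(origin, fetch)` would issue real requests.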

Written for commit 32ce9b2. Summary will update on new commits.

@github-actions bot added the firestorm (Data/Infra/Revenue Team) and size/xl labels Mar 30, 2026

@PClmnt (Collaborator, Author) commented:

This code was simply moved from another location into shared-core.

@PClmnt marked this pull request as ready for review March 30, 2026 13:36
@PClmnt requested a review from a team as a code owner March 30, 2026 13:36
@PClmnt removed the request for review from a team March 30, 2026 13:36
@PClmnt requested a review from adrinr March 30, 2026 13:36

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e86592388e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions bot added the stale label Apr 6, 2026

Labels

firestorm (Data/Infra/Revenue Team), size/xl, stale

Projects

None yet

1 participant