Skip to content

Complete vMCP circuit breaker configuration and testing#3615

Open
yrobla wants to merge 2 commits intomainfrom
issue-3036-v1-new
Open

Complete vMCP circuit breaker configuration and testing#3615
yrobla wants to merge 2 commits intomainfrom
issue-3036-v1-new

Conversation

@yrobla
Copy link
Contributor

@yrobla yrobla commented Feb 5, 2026

Wire circuit breaker configuration from VirtualMCPServer CRD to health monitoring system and add comprehensive E2E test coverage.

Circuit breaker implementation:

  • Wire CircuitBreakerConfig from VirtualMCPServer spec to health monitor
  • Add HealthCheckTimeout field to FailureHandlingConfig (default: 10s)
  • Configure failure threshold, timeout, and enabled flag from CRD
  • Add info logging when circuit breaker is enabled

Configuration changes:

  • pkg/vmcp/config/config.go: Add HealthCheckTimeout field
  • cmd/vmcp/app/commands.go: Wire both circuit breaker config and timeout to health monitor, converting from config types to health types

Related-to: #3036

Wire circuit breaker configuration from VirtualMCPServer CRD to health
monitoring system and add comprehensive E2E test coverage.

Circuit breaker implementation:
  - Wire CircuitBreakerConfig from VirtualMCPServer spec to health monitor
  - Add HealthCheckTimeout field to FailureHandlingConfig (default: 10s)
  - Configure failure threshold, timeout, and enabled flag from CRD
  - Add info logging when circuit breaker is enabled

Configuration changes:
  - pkg/vmcp/config/config.go: Add HealthCheckTimeout field
  - cmd/vmcp/app/commands.go: Wire both circuit breaker config and timeout
    to health monitor, converting from config types to health types

Related-to: #3036
@yrobla yrobla requested review from Copilot and removed request for ChrisJBurns, JAORMX, amirejaz, jerm-dro and jhrozek February 5, 2026 09:42
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Feb 5, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the circuit breaker implementation for the Virtual MCP Server by wiring configuration from the VirtualMCPServer CRD to the health monitoring system and adding comprehensive end-to-end tests.

Changes:

  • Added HealthCheckTimeout field to FailureHandlingConfig for configuring health check timeouts
  • Wired circuit breaker configuration (enabled flag, failure threshold, timeout) from CRD to health monitor
  • Added comprehensive E2E test suite for circuit breaker lifecycle (failure detection, circuit opening, recovery)
  • Fixed condition type references to use constants instead of string literals in helper functions

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pkg/vmcp/config/config.go Added HealthCheckTimeout field to FailureHandlingConfig with proper documentation and kubebuilder annotations
cmd/vmcp/app/commands.go Wired circuit breaker configuration from config types to health monitor types with validation and logging
deploy/charts/operator-crds/templates/toolhive.stacklok.dev_virtualmcpservers.yaml Updated CRD schema to include healthCheckTimeout field with default value and validation pattern
deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_virtualmcpservers.yaml Updated CRD file with healthCheckTimeout field (mirrors template changes)
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go Added comprehensive E2E test for circuit breaker lifecycle including failure detection, circuit opening, and recovery scenarios
test/e2e/thv-operator/virtualmcp/helpers.go Fixed to use ConditionTypeVirtualMCPServerReady constant instead of string literal
test/e2e/thv-operator/virtualmcp/virtualmcp_status_reporting_test.go Fixed to use ConditionTypeVirtualMCPServerReady constant instead of string literal

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla requested a review from Copilot February 5, 2026 10:05
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 5, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link

codecov bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 21.73913% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.61%. Comparing base (8e25a44) to head (a0f1282).
⚠️ Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
cmd/vmcp/app/commands.go 0.00% 26 Missing ⚠️
pkg/vmcp/health/monitor.go 50.00% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3615      +/-   ##
==========================================
- Coverage   65.68%   65.61%   -0.08%     
==========================================
  Files         408      410       +2     
  Lines       40416    40655     +239     
==========================================
+ Hits        26549    26677     +128     
- Misses      11805    11899      +94     
- Partials     2062     2079      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-new branch from 437603a to 6955d9c Compare February 5, 2026 11:42
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 5, 2026
@yrobla yrobla requested a review from Copilot February 5, 2026 11:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla requested a review from amirejaz February 5, 2026 12:08
@yrobla yrobla force-pushed the issue-3036-v1-new branch from 6955d9c to 925b6b6 Compare February 5, 2026 13:26
@yrobla yrobla requested a review from Copilot February 5, 2026 13:26
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 5, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-new branch from 925b6b6 to 9c01491 Compare February 5, 2026 14:01
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 5, 2026
@yrobla yrobla requested a review from Copilot February 5, 2026 14:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-new branch from 9c01491 to 0a0b2a8 Compare February 5, 2026 14:19
@github-actions github-actions bot removed the size/L Large PR: 600-999 lines changed label Feb 5, 2026
@yrobla yrobla requested a review from Copilot February 5, 2026 14:20
@github-actions github-actions bot added the size/L Large PR: 600-999 lines changed label Feb 5, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-new branch from 0a0b2a8 to 160dd4d Compare February 5, 2026 15:19
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 5, 2026
Fix circuit breaker E2E test recovery and improve test reliability

Improve the circuit breaker recovery test to handle stuck pods in
ImagePullBackOff state. When an MCPServer image is changed from
a non-existent image back to a valid image, Kubernetes doesn't
automatically recreate the StatefulSet pod - it stays in
ImagePullBackOff. The test now deletes stuck pending pods to force
recreation with the correct image.

Also simplify the failure simulation approach:
- Before: Scaled deployments/statefulsets to 0 replicas
- After: Change image to non-existent (matches status reporting test pattern)

Changes:
- Add pod deletion step in recovery test to handle ImagePullBackOff
- Switch from scaling approach to image change approach (cleaner)
- Remove unused imports (appsv1, labels, ptr)
- Regenerate CRD docs with healthCheckTimeout field

All 5 circuit breaker E2E tests now pass reliably.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add validation for health check timeout vs interval constraint

Validate that HealthCheckTimeout is less than HealthCheckInterval to prevent
health checks from queuing up. This enforces the constraint documented in
config.go but was not previously validated at runtime.

The validation returns a clear error message when the constraint is violated:
  health check timeout (Xs) must be less than check interval (Ys)
  to prevent checks from queuing up

Default values are safe (timeout: 10s < interval: 30s), but this prevents
invalid user configurations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Use cbFailureThreshold constant in circuit breaker error message

Replace hardcoded threshold value "3" with the cbFailureThreshold constant
for consistency and maintainability. This ensures the error message stays
accurate if the test configuration changes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fix circuit breaker state enum mismatch between Go and CRD

Change CircuitHalfOpen constant from "half_open" (underscore) to "half-open"
(hyphen) to match the CRD enum validation in VirtualMCPServer status.

Without this fix, when the circuit breaker transitions to half-open state,
the status update would be rejected by the Kubernetes API server because
"half_open" doesn't match the CRD enum which requires "half-open".

The CRD enum is:
  - closed
  - open
  - half-open

All code uses the CircuitHalfOpen constant, so this change is safe and only
affects the string value used in status reporting.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add circuit breaker configuration validation

Add runtime and CRD validation for circuit breaker configuration to prevent
invalid values that could cause unexpected behavior.

Runtime validation (cmd/vmcp/app/commands.go):
- FailureThreshold must be >= 1 (can't open circuit with 0 failures)
- Timeout must be >= 1s when set (prevents thrashing with rapid recovery attempts)
- Validation only runs when circuit breaker is enabled

CRD validation (pkg/vmcp/config/config.go):
- Add +kubebuilder:validation:Minimum=1 for FailureThreshold
- Update documentation to clarify constraints
- Kubernetes API will reject invalid configurations at creation time

Benefits:
- Prevents silent failures from misconfiguration
- Clear error messages guide users to correct values
- Defense-in-depth: validation at both CRD and runtime levels

Example error messages:
- "circuit breaker failure threshold must be >= 1, got 0"
- "circuit breaker timeout must be >= 1s to prevent thrashing, got 500ms"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@yrobla yrobla force-pushed the issue-3036-v1-new branch from 160dd4d to a0f1282 Compare February 5, 2026 15:33
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 5, 2026
@amirejaz amirejaz requested a review from Copilot February 5, 2026 16:25
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants