Skip to content

Backend Timeout Functionality#891

Open
O-sura wants to merge 8 commits intowso2:mainfrom
O-sura:backend-timeout
Open

Backend Timeout Functionality#891
O-sura wants to merge 8 commits intowso2:mainfrom
O-sura:backend-timeout

Conversation

@O-sura
Copy link
Contributor

@O-sura O-sura commented Feb 2, 2026

Purpose

This PR adds backend timeout support by introducing upstreamDefinitions under the API spec. APIs can reference a named definition with spec.upstream.main.ref (or sandbox ref); each definition may define a timeout block with optional connect, request, and idle duration strings. When a ref is used, the translator resolves it to the definition’s URL and timeout; if a definition has no timeout (or the upstream uses a direct url), the gateway falls back to the global router config timeouts so existing behaviour is unchanged.

Fix for:

Summary by CodeRabbit

  • New Features

    • Added upstream definitions feature enabling reusable upstream configurations within APIs, including per-upstream timeout settings for connection, request, and idle durations.
    • Enhanced JWT authentication to support custom user ID claims for analytics extraction.
  • Configuration Changes

    • Timeout configuration values changed from seconds to milliseconds for consistency and precision.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 2, 2026

Walkthrough

Adds support for reusable upstream definitions with per-upstream timeout configuration; updates all timeout configuration keys from seconds to milliseconds; introduces validation for upstream definitions and references; extends XDS translator to resolve upstream definitions and apply timeouts; adds integration test coverage for backend timeout behavior.

Changes

Cohort / File(s) Summary
Schema & Validation
gateway/gateway-controller/api/openapi.yaml, gateway/gateway-controller/pkg/config/api_validator.go, gateway/gateway-controller/pkg/config/validator_test.go
Introduces UpstreamDefinition and UpstreamTimeout public schemas; adds validation for upstream definitions (names, duplicates, URLs, weights, timeout formats); extends upstream validation to support references to definitions.
Runtime Translation
gateway/gateway-controller/pkg/xds/translator.go, gateway/gateway-controller/pkg/xds/translator_test.go
Implements upstream definition resolution with per-upstream timeout extraction; extends resolveUpstreamCluster to handle definition references and return resolved timeouts; threads timeout configuration through route and cluster creation; adds helper functions for definition lookup and timeout parsing.
Configuration & Constants
gateway/gateway-controller/pkg/config/config.go, gateway/gateway-controller/pkg/config/config_test.go, gateway/gateway-controller/pkg/constants/constants.go, gateway/configs/config-template.toml, gateway/it/test-config.toml, gateway/it/test-config.yaml
Renames timeout configuration from seconds to milliseconds (RouteTimeoutInSeconds → RouteTimeoutInMs, etc.); introduces EnvoyUpstreamClusterConfig with connect timeout; updates default values and validation; renames MaxReasonableTimeoutSeconds constant to MaxReasonableTimeoutMs.
Integration Testing
gateway/it/features/backend_timeout.feature, gateway/it/suite_test.go, gateway/it/steps_backend_timeout.go
Adds backend timeout feature with two test scenarios validating timeout enforcement via upstream definitions; registers step definitions for time recording and elapsed duration assertions with tolerance.
Documentation
docs/gateway/policies/jwt-authentication.md
Adds userIdClaim parameter for custom user ID extraction in analytics; updates examples to emphasize analytics-focused usage; renumbers use cases to highlight analytics capabilities.

Sequence Diagram

sequenceDiagram
    participant Config as Configuration
    participant Validator as APIValidator
    participant Translator as XDS Translator
    participant RouteCluster as Route/Cluster

    Config->>Validator: Validate API with UpstreamDefinitions
    Validator->>Validator: validateUpstreamDefinitions()
    Validator->>Validator: Check names, duplicates, URLs, weights, timeout formats
    alt Valid definitions
        Validator-->>Config: ✓ Validation passed
    else Invalid
        Validator-->>Config: ✗ Validation errors
    end

    Config->>Translator: Translate API config with UpstreamDefinitions
    Translator->>Translator: resolveUpstreamCluster(upstream ref, definitions)
    alt Upstream has reference
        Translator->>Translator: resolveUpstreamDefinition(ref name)
        Translator->>Translator: Extract URL and Timeout from definition
        Translator->>Translator: resolveTimeoutFromDefinition()
        Translator->>Translator: parseTimeout() - parse duration strings
        Translator-->>Translator: Return (clusterName, URL, resolvedTimeout, error)
    else Direct URL
        Translator-->>Translator: Return (clusterName, URL, nil timeout, nil error)
    end

    Translator->>RouteCluster: createCluster(URL, connectTimeout)
    Translator->>RouteCluster: createRoute(context, method, path, timeoutCfg)
    RouteCluster->>RouteCluster: Apply per-route timeouts from config or defaults
    RouteCluster-->>Translator: ✓ Cluster and Route configured
    Translator-->>Config: ✓ XDS translation complete
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Behold the hops we've made today,
With timeouts measured the modern way—
Milliseconds now, not seconds of yore,
Upstream definitions we can reuse and explore! ⏱️
Per-route timing flows like carrot juice,
Analytics shine with every new deuce! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete. While it provides Purpose, Goals, and fixes referenced issues, it lacks required sections: Approach (no implementation details), User stories, Documentation, Automation tests (unit/integration), Security checks, Samples, Related PRs, and Test environment from the template. Complete the PR description by adding all missing required sections from the template: detailed Approach, User stories, Documentation links, Automation tests with coverage details, Security checks (secure coding, FindSecurityBugs, secrets verification), Samples, Related PRs, and Test environment specifics.
Docstring Coverage ⚠️ Warning Docstring coverage is 22.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Backend Timeout Functionality' clearly summarizes the main change: adding backend timeout support to the API gateway, which is the primary focus of all modifications across documentation, API definitions, validators, and translator logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@gateway/gateway-controller/pkg/config/api_validator.go`:
- Around line 257-272: The validator currently can mark empty names as
duplicates because namesSeen[def.Name] is being set even when def.Name is empty;
update the loop in the upstream definition validation so you only record
namesSeen for non-empty names (e.g., move or guard the namesSeen[def.Name] =
true assignment behind a check that def.Name != ""), keeping the existing
ValidationError append for empty names (uses ValidationError, def.Name,
namesSeen) and preserving the duplicate-name check logic.

In `@gateway/gateway-controller/pkg/xds/translator.go`:
- Around line 1351-1369: The resolvedTimeout.Connect value is parsed but never
applied; update createCluster to accept a connect timeout parameter and pass
resolvedTimeout.Connect from where createCluster is invoked (or, if you prefer
the other option, remove the Connect field from resolvedTimeout and all related
parsing). Specifically, change the createCluster signature to include a
connectTimeout (or an options struct) and use it instead of the hardcoded 5 *
time.Second inside createCluster, and update all call sites that construct
clusters to forward the resolvedTimeout.Connect (the code that currently
computes requestTimeout/idleTimeout and the resolvedTimeout struct). Ensure
resolvedTimeout.Connect is respected consistently or removed along with its
parsing logic.
🧹 Nitpick comments (4)
gateway/gateway-controller/pkg/config/validator_test.go (1)

700-721: Consider adding a test for missing URL scheme.

The current TestValidateUpstreamDefinitions_InvalidURL test uses "not-a-valid-url" which doesn't have a scheme. This validates the scheme check, but consider adding a separate test case for a URL with missing host (e.g., "http://") to ensure full coverage of the URL validation logic.

gateway/it/features/backend_timeout.feature (1)

64-96: Consider verifying global timeout value for more meaningful test.

The "global defaults" scenario waits the same 2 seconds as the explicit timeout scenario. If the global default timeout is also around 2 seconds or less, this test passes but doesn't truly validate the "defaults" behavior.

Consider either:

  1. Documenting what the expected global default timeout value is
  2. Adjusting wait time based on known global defaults
  3. Adding a comment explaining that this test assumes global defaults will also cause a timeout within 2 seconds for an unreachable backend
gateway/gateway-controller/pkg/xds/translator.go (2)

556-559: Only the first URL from the first upstream target is used.

The current implementation extracts only definition.Upstreams[0].Urls[0]. If the upstream definition has multiple targets or URLs for weighted load balancing, they are ignored. This appears intentional for initial implementation, but consider documenting this limitation or adding a TODO for future enhancement.


2477-2490: Consider validating that parsed timeouts are positive.

time.ParseDuration accepts zero and negative values (e.g., "0s", "-5s"). A zero timeout effectively means "expire immediately" and negative values are likely configuration errors. Consider adding validation:

🛡️ Proposed validation
 func parseTimeout(timeoutStr *string) (*time.Duration, error) {
 	if timeoutStr == nil || strings.TrimSpace(*timeoutStr) == "" {
 		return nil, nil
 	}

 	duration, err := time.ParseDuration(strings.TrimSpace(*timeoutStr))
 	if err != nil {
 		return nil, fmt.Errorf("invalid timeout format: %w", err)
 	}

+	if duration <= 0 {
+		return nil, fmt.Errorf("timeout must be positive, got: %v", duration)
+	}
+
 	return &duration, nil
 }

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@gateway/gateway-controller/api/openapi.yaml`:
- Around line 1857-1895: The OpenAPI schema declares
UpstreamDefinition.upstreams[].urls as an array but the translator currently
reads only definition.Upstreams[0].Urls[0] (see TODO "Support multiple upstream
targets"); update the implementation in
gateway/gateway-controller/pkg/xds/translator.go to iterate all
UpstreamDefinition.Upstreams and each Upstream.Urls, creating backend endpoints
(respecting per-upstream weight and default weight) and aggregating them into
the cluster/load-balancer config instead of using only the first URL, then
remove the TODO; alternatively, if you prefer to keep single-target behavior,
update the OpenAPI UpstreamDefinition.upstreams.items.properties.urls
description to state only the first URL is used and restrict the schema to a
single string field to match current behavior.
- Around line 1897-1915: The OpenAPI schema for UpstreamTimeout currently
restricts duration units in the pattern on properties connect, request, and
idle; update the regex used for those properties' pattern to allow Go's
supported units by including ns and us (e.g., change the pattern from
^\d+(\.\d+)?(ms|s|m|h)$ to a pattern that includes ns and us such as
^\d+(\.\d+)?(ns|us|ms|s|m|h)$) so the OpenAPI validation matches
time.ParseDuration() behavior for the UpstreamTimeout object and its fields
(connect, request, idle).

In `@gateway/it/steps_backend_timeout.go`:
- Around line 34-37: The step that asserts elapsed time hardcodes the key
"request_start" while the recorder step (ctx.Step `I record the current time as
"([^"]*)"$`) accepts any key; update the assertion step(s) to accept a key
parameter instead of the hardcoded "request_start" and use that key with
state.GetContextValue/GetContextTimestamp (or the existing state helper) and
time.Since to compute elapsed time; locate the assertion ctx.Step(s) that
currently reference "request_start" (and the similar one noted for lines 44-46)
and change their regex to capture a key string, replace literal "request_start"
references with the captured key variable, and keep the same assertion logic.

In `@gateway/it/test-config.toml`:
- Around line 120-124: The comment says "6s route timeout" but
route_timeout_in_milliseconds = 5000 (5s) in the
gateway_controller.router.envoy_upstream.timeouts block; update either the
comment to "5s route timeout" to match route_timeout_in_milliseconds (and
max_route_timeout_in_milliseconds) or change route_timeout_in_milliseconds and
max_route_timeout_in_milliseconds to 6000 to match the "6s" comment (ensure
consistency with gateway/it/test-config.yaml which uses 6000ms).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@gateway/gateway-controller/pkg/config/api_validator.go`:
- Around line 207-242: The validator currently treats a present-but-blank ref as
valid; update validateUpstreamRef so that only a nil ref is ignored but a ref
that trims to an empty string produces a ValidationError. Specifically, in
APIValidator.validateUpstreamRef check ref == nil and return no errors, but if
strings.TrimSpace(*ref) == "" append a ValidationError for field
"spec.upstream."+label+".ref" indicating the reference must not be empty; keep
the subsequent logic that checks upstreamDefinitions and looks up def.Name ==
refName when refName is non-empty.

In `@gateway/gateway-controller/pkg/xds/translator.go`:
- Around line 2503-2520: The parseTimeout function currently rejects zero
durations (it checks duration <= 0); change that check in parseTimeout to only
reject negative durations (duration < 0) so explicit "0s" is accepted by route
request/idle/connect timeouts, and add a unit test for parseTimeout covering
"0s" (and a negative value) to document behavior; alternatively, if you intend
to keep rejecting zero, update parseTimeout's doc comment to explain why zero is
disallowed and add a test asserting that "0s" errors. Mention parseTimeout by
name in both the code change and test update.
🧹 Nitpick comments (2)
gateway/gateway-controller/pkg/config/config.go (1)

205-226: Good addition of cluster-level upstream defaults.

As per coding guidelines: When modifying code in gateway components (gateway-controller, policy-engine, router, etc., excluding gateway/it/), rebuild Docker images using cd gateway && make build-local.

gateway/gateway-controller/pkg/xds/translator_test.go (1)

134-141: Consider adding a test for zero duration.

The implementation rejects duration <= 0, but there's no test explicitly verifying this behavior for "0s". Adding such a test would document the expected behavior and prevent regressions.

📝 Suggested test addition
func TestParseTimeout_ZeroDuration(t *testing.T) {
	zero := "0s"
	duration, err := parseTimeout(&zero)

	assert.Error(t, err)
	assert.Nil(t, duration)
	assert.Contains(t, err.Error(), "timeout must be positive")
}

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gateway/gateway-builder/go.mod (1)

11-16: ⚠️ Potential issue | 🟡 Minor

Move the testify dependency to the first require block.

The testify dependency (line 12) is imported directly in multiple test files throughout this module (assert and require packages), making it a direct dependency. According to Go conventions, it should be moved to the first require block (lines 5-9) alongside other direct dependencies, rather than placed in the second block which is for indirect dependencies.

The second block should contain only indirect dependencies marked with // indirect.

🤖 Fix all issues with AI agents
In `@gateway/gateway-controller/pkg/config/api_validator.go`:
- Around line 334-373: The timeout validator currently calls time.ParseDuration
on def.Timeout.Request (and similarly def.Timeout.Connect and def.Timeout.Idle)
which accepts disallowed units like ns/us; instead pre-validate the trimmed
timeout string against the OpenAPI regex that only allows units ms|s|m|h (e.g. a
compiled regexp matching ^\d+(ms|s|m|h)$ or your schema pattern) and if it
doesn't match append a ValidationError for the corresponding field
(spec.upstreamDefinitions[%d].timeout.request / .connect / .idle); only call
time.ParseDuration for strings that pass the regex and keep the existing
ParseDuration error handling as a fallback.

In `@gateway/it/features/backend_timeout.feature`:
- Around line 65-96: Update the inline comment that currently states the
global-default is 5s to reflect the actual default of 6s used in the test (the
scenario "RestApi without upstream timeout uses global defaults" asserts 6s and
the config uses 6000ms); find the comment above that scenario and change "5s" to
"6s" (or reword to "6s default") so the comment matches the assertion and
config.
🧹 Nitpick comments (5)
gateway/gateway-controller/pkg/xds/translator_test.go (1)

158-341: Prefer constructing Translator via helpers instead of &Translator{}.

Direct instantiation can become brittle if resolveUpstreamCluster starts reading config/logger. Using createTestTranslator() (or NewTranslator) keeps tests future‑proof.

♻️ Suggested tweak (apply consistently in this block)
-	translator := &Translator{}
+	translator := createTestTranslator()
gateway/gateway-controller/pkg/config/api_validator.go (1)

418-425: Remember to rebuild gateway images after gateway-controller changes.

This change touches gateway-controller validation paths, so ensure images are rebuilt.

As per coding guidelines, when modifying code in gateway components (gateway-controller, policy-engine, router, etc., excluding gateway/it/), rebuild Docker images using cd gateway && make build-local.

gateway/gateway-controller/pkg/config/validator_test.go (1)

771-796: Consider adding a disallowed-unit timeout case (ns/us).

Since the schema restricts units to ms|s|m|h, it would help to assert that 10us/10ns are rejected.

Based on learnings, the UpstreamTimeout schema intentionally restricts duration units to ms, s, m, h (excluding ns and us).

gateway/gateway-controller/pkg/config/config.go (1)

205-215: Reminder: rebuild gateway Docker images after gateway component changes.
Run cd gateway && make build-local before validation/release.

As per coding guidelines: gateway/**/*.{go,yaml,yml,Dockerfile} → rebuild Docker images using cd gateway && make build-local.

gateway/gateway-controller/pkg/xds/translator.go (1)

85-91: Minor typo: escaped quotes in comment.

The comment contains escaped quotes \"no override\" which should be regular quotes.

📝 Suggested fix
-// resolvedTimeout represents parsed timeout values for an upstream.
-// All fields are optional; nil means \"no override\" (use global defaults).
+// resolvedTimeout represents parsed timeout values for an upstream.
+// All fields are optional; nil means "no override" (use global defaults).
 type resolvedTimeout struct {

route_idle_timeout_in_milliseconds = 300000

[gateway_controller.router.envoy_upstream_cluster]
connect_timeout_in_milliseconds = 5000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this config

@renuka-fernando
Copy link
Contributor

renuka-fernando commented Feb 6, 2026

As timeout and idle timeout are route-level configs (Downstream), it is not correct to read them from Upstream Definition and configure routes. Shall we remove them then?
cc: @Krishanx92 @malinthaprasan

We may get an issue when we are gonig to support dynamic routing

@@ -94,9 +94,12 @@ verify_host_name = true
disable_ssl_verification = false

[gateway_controller.router.envoy_upstream.timeouts]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[gateway_controller.router.envoy_upstream.timeouts]
[gateway_controller.router.envoy_downstream.timeouts]

max_route_timeout_in_milliseconds = 60000
route_idle_timeout_in_milliseconds = 300000

[gateway_controller.router.envoy_upstream_cluster]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[gateway_controller.router.envoy_upstream_cluster]
[gateway_controller.router.envoy_upstream.timeout]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants