
feat(sinks): introduce RetryStrategy / Config for http based sinks #25057

Open
ndrsg wants to merge 6 commits into vectordotdev:master from ndrsg:ndrsg/10870

ndrsg commented Mar 27, 2026

Summary

This PR introduces a configurable retry_strategy for Vector's shared HTTP retry logic and exposes it on the http sink plus the sinks built on the same HTTP retry helpers.

It adds support for default, none, all, and custom retry modes, wires that behavior through the affected HTTP-based sinks, and adds documentation plus test coverage for custom status-code retries. The default strategy now consistently retries 408, 429, and 5xx responses across the shared HTTP retry implementations.
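To make the four modes concrete, here is a minimal sketch of how such a strategy could map a status code to a retry decision. It is not the PR's implementation: the `Decision` enum, the `decide` method, and the use of plain `u16` status codes are assumptions made for illustration, as is the choice that `all` retries every non-2xx response.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the PR's retry strategy; names and shapes are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
enum RetryStrategy {
    Default,
    None,
    All,
    Custom { status_codes: BTreeSet<u16> },
}

/// Hypothetical, simplified outcome type (the real code uses RetryAction).
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Successful,
    Retry,
    DontRetry,
}

impl RetryStrategy {
    /// Decide what to do with an HTTP response status under this strategy.
    fn decide(&self, status: u16) -> Decision {
        // Assumption: 2xx is treated as success in every mode.
        if (200..300).contains(&status) {
            return Decision::Successful;
        }
        let retry = match self {
            // 408, 429, and 5xx, as stated in the summary.
            RetryStrategy::Default => {
                status == 408 || status == 429 || (500..600).contains(&status)
            }
            RetryStrategy::None => false,
            // Assumption: "all" means every non-success status.
            RetryStrategy::All => true,
            RetryStrategy::Custom { status_codes } => status_codes.contains(&status),
        };
        if retry {
            Decision::Retry
        } else {
            Decision::DontRetry
        }
    }
}

fn main() {
    let custom = RetryStrategy::Custom {
        status_codes: [408u16, 425, 429, 503].iter().copied().collect(),
    };
    for status in [200u16, 425, 500, 503] {
        println!(
            "{}: default={:?} custom={:?}",
            status,
            RetryStrategy::Default.decide(status),
            custom.decide(status)
        );
    }
}
```

Under these assumptions, a `custom` list of 408, 425, 429, and 503 retries 425 (which `default` does not) and stops retrying 500 (which `default` does).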

Affected sinks

Configurable sinks using HttpStatusRetryLogic:

  • http
  • axiom
  • opentelemetry
  • appsignal
  • azure_logs_ingestion
  • azure_monitor_logs
  • datadog_events
  • gcp_stackdriver_logs
  • gcp_stackdriver_metrics
  • honeycomb
  • keep
  • prometheus_remote_write

Sinks that still use HttpRetryLogic and are not made configurable by this PR:

  • clickhouse
  • greptimedb_logs
  • influxdb_metrics
  • sematext_metrics
  • splunk_hec_logs
  • splunk_hec_metrics

Vector configuration

I used the new HTTP sink example configuration, which is also checked in under config/examples, to validate the custom retry behavior:

data_dir: "/var/lib/vector"

sources:
  demo_logs:
    type: "demo_logs"
    format: "json"
    interval: 1

sinks:
  http_out:
    type: "http"
    inputs: [ "demo_logs" ]
    uri: "https://example.com/ingest"
    method: "post"

    healthcheck:
      enabled: false

    framing:
      method: "newline_delimited"
    encoding:
      codec: "json"

    request:
      timeout_secs: 60
      retry_attempts: 8
      retry_initial_backoff_secs: 2
      retry_max_duration_secs: 30

    retry_strategy:
      type: "custom"
      status_codes: [ 408, 425, 429, 503 ]

How did you test this PR?

  • Added unit coverage for the shared RetryStrategy behavior in src/sinks/util/http.rs.
  • Added HTTP sink tests covering both retryable configured status codes and non-retryable unconfigured status codes in src/sinks/http/tests.rs.
  • Built the Vector binary locally and validated config/examples/http_sink_custom_retry.yaml against it.
  • Ran make check-clippy locally.
  • Validated the changelog fragment with ./scripts/check_changelog_fragments.sh.

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

ndrsg requested a review from a team as a code owner on March 27, 2026 16:12
github-actions bot added the "domain: sinks" label on Mar 27, 2026
github-actions bot (Contributor) commented Mar 27, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

ndrsg (Author) commented Mar 27, 2026

I have read the CLA Document and I hereby sign the CLA

pront (Member) commented Apr 2, 2026

@codex review - draw context from #10870 (comment)

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27feabc878

Comment on lines 667 to +668
     fn should_retry_response(&self, response: &Self::Response) -> RetryAction<Self::Request> {
-        let status = response.status();
-
-        match status {
-            StatusCode::TOO_MANY_REQUESTS => RetryAction::Retry("too many requests".into()),
-            StatusCode::REQUEST_TIMEOUT => RetryAction::Retry("request timeout".into()),
-            StatusCode::NOT_IMPLEMENTED => {
-                RetryAction::DontRetry("endpoint not implemented".into())
-            }
-            _ if status.is_server_error() => RetryAction::Retry(
-                format!("{}: {}", status, String::from_utf8_lossy(response.body())).into(),
-            ),
-            _ if status.is_success() => RetryAction::Successful,
-            _ => RetryAction::DontRetry(format!("response status: {status}").into()),
-        }
+        self.retry_strategy.retry_action(response.status())

P1: Honor retry_strategy = none for transport failures

The new RetryStrategy::None mode is introduced as "Don't retry any errors," but this implementation only applies the strategy to HTTP response statuses; request/transport errors are still retried because is_retriable_error remains unconditional elsewhere in the retry logic. In practice, sinks configured with retry_strategy.type = "none" will still back off/retry on connection, DNS, TLS, or timeout errors, so the new mode cannot actually disable retries end-to-end.
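One way to picture the gap and a possible fix is the sketch below. It is a simplified model, not Vector's actual retry trait: `StatusRetryLogic`, `TransportError`, and both method names are hypothetical, and only the `None` and `Default` modes are modeled.

```rust
/// Hypothetical, reduced strategy enum; only the modes needed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryStrategy {
    Default,
    None,
}

/// Hypothetical transport-level failures (no HTTP response was received).
#[derive(Debug)]
enum TransportError {
    ConnectionRefused,
    Dns,
    Tls,
    Timeout,
}

/// Hypothetical holder for the configured strategy.
struct StatusRetryLogic {
    retry_strategy: RetryStrategy,
}

impl StatusRetryLogic {
    /// Behavior described in the review: transport errors are always retried,
    /// regardless of the configured strategy.
    fn is_retriable_error_unconditional(&self, _error: &TransportError) -> bool {
        true
    }

    /// Possible fix: consult the strategy so that `none` also disables
    /// retries for transport errors.
    fn is_retriable_error(&self, _error: &TransportError) -> bool {
        self.retry_strategy != RetryStrategy::None
    }
}

fn main() {
    let none = StatusRetryLogic { retry_strategy: RetryStrategy::None };
    let default = StatusRetryLogic { retry_strategy: RetryStrategy::Default };
    let errors = [
        TransportError::ConnectionRefused,
        TransportError::Dns,
        TransportError::Tls,
        TransportError::Timeout,
    ];
    for error in errors.iter() {
        println!(
            "{:?}: unconditional={} none={} default={}",
            error,
            none.is_retriable_error_unconditional(error),
            none.is_retriable_error(error),
            default.is_retriable_error(error)
        );
    }
}
```

Whether transport errors should follow `retry_strategy` at all, or get their own switch, is a design decision for the PR; the sketch only shows the gating approach.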

Comment on lines +624 to +625
+        if status.is_server_error() {
+            RetryAction::Retry(reason)

P2: Keep 501 out of the default retriable status set

The default branch now retries any 5xx via status.is_server_error(), which newly makes 501 Not Implemented retriable. Previously this code path treated 501 as non-retriable, and 501 is usually a permanent incompatibility (unsupported method/endpoint) rather than a transient outage. Retrying it consumes the full retry budget and delays rejection on misconfiguration; this should stay excluded unless users explicitly opt into all or custom.
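As a rough illustration of the suggested behavior, the default branch could special-case 501 before the general 5xx check. The function name `default_retry_action`, the local `RetryAction` enum, and the `u16` status are assumptions for a self-contained sketch; the messages mirror those in the removed code.

```rust
/// Hypothetical local copy of the retry outcome type, simplified to String reasons.
#[derive(Debug, PartialEq, Eq)]
enum RetryAction {
    Successful,
    Retry(String),
    DontRetry(String),
}

/// Sketch of a default strategy that retries 408, 429, and 5xx except 501.
fn default_retry_action(status: u16) -> RetryAction {
    match status {
        200..=299 => RetryAction::Successful,
        408 => RetryAction::Retry("request timeout".to_string()),
        429 => RetryAction::Retry("too many requests".to_string()),
        // Checked before the general 5xx arm so that 501 is never retried.
        501 => RetryAction::DontRetry("endpoint not implemented".to_string()),
        500..=599 => RetryAction::Retry(format!("server error: {}", status)),
        _ => RetryAction::DontRetry(format!("response status: {}", status)),
    }
}

fn main() {
    for status in [200u16, 404, 408, 429, 500, 501, 503] {
        println!("{} -> {:?}", status, default_retry_action(status));
    }
}
```

Users who do want 501 retried could still opt in through `all` or a `custom` list that includes it.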
