feat: CloudEvents ingest compliance and add OpenTelemetry distributed tracing#3961

Open
kaio6fellipe wants to merge 5 commits into argoproj:master from kaio6fellipe:feat/cloudevents-compliance-otel-tracing

Conversation

@kaio6fellipe kaio6fellipe commented Mar 26, 2026

Summary

Add CloudEvents compliance and OpenTelemetry distributed tracing to Argo Events.

CloudEvents ingest compliance (Closes #2011, Related: #1015, #983)

Argo Events currently discards all incoming CloudEvent metadata and regenerates it with internal values. This PR preserves the original CloudEvent attributes (id, source, type, subject, time) and extension attributes (e.g., traceparent, tracestate) when an incoming HTTP request is a valid CloudEvent.

  • Add WithCloudEvent and individual attribute Option constructors
  • Detect incoming CloudEvents via CE SDK HTTP binding in webhook handler
  • Add Extensions field to EventContext for pipeline-wide extension propagation
  • Preserve extensions in sensor convertEvent()

OpenTelemetry Distributed Tracing (Closes #1111)

Add opt-in distributed tracing across the event pipeline (EventSource → EventBus → Sensor → Trigger) using OpenTelemetry with W3C Trace Context propagation.

  • New pkg/shared/tracing package with InitTracer, SpanFromCloudEvent, InjectTraceIntoCloudEvent
  • eventsource.publish span wrapping event bus publish
  • sensor.trigger span wrapping trigger execution with error recording
  • Configuration via standard OTel env vars (OTEL_EXPORTER_OTLP_ENDPOINT) — fully opt-in, no-op when unset
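The context that links spans across hops travels in the W3C `traceparent` header. The real code uses OTel's propagator for this; the stdlib-only sketch below (with an illustrative `traceIDFromTraceparent` helper and only a partial validity check) just makes the wire format concrete.

```go
package main

import (
	"fmt"
	"strings"
)

// A W3C traceparent header has the form:
//   version "-" trace-id "-" parent-id "-" trace-flags
// e.g. 00-a613bf5b2cccbb628b67bacb729a0e46-b7ad6b7169203331-01
// The 32-hex-char trace-id is what groups all hops into one trace.
func traceIDFromTraceparent(tp string) (string, bool) {
	parts := strings.Split(tp, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

func main() {
	tp := "00-a613bf5b2cccbb628b67bacb729a0e46-b7ad6b7169203331-01"
	id, ok := traceIDFromTraceparent(tp)
	fmt.Println(id, ok)
}
```

Every service that extracts this header and reuses its trace-id contributes spans to the same trace, which is what produces the single connected trace shown in the test results below.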

Backward Compatibility

  • Zero breaking changes — existing YAML resources work without modification
  • All 30+ event sources that don't receive CloudEvents behave identically
  • Tracing has zero performance impact when disabled
  • EventContext.Extensions is optional (omitempty) with next available protobuf field number

Tests and behavior

The implementation was tested on a real GKE environment with Jaeger All-in-One as the trace backend. The test setup consists of three interconnected EventSource/Sensor pairs spanning two namespaces (sandbox and platform), exercising the full event pipeline including retries and dead-letter queue (DLQ) routing.

Test environment

  • Argo Events deployed via Helm chart with custom image containing all changes
  • Jaeger All-in-One deployed in observability namespace as OTLP collector + trace UI
  • EventBus: Kafka (managed with Strimzi)
  • OTel tracing enabled via OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME env vars on EventSource/Sensor pod templates
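For reference, enabling tracing on a Sensor might look like the sketch below. The field paths follow the Argo Events pod-template convention, but the endpoint and names are placeholder values from this test setup, not prescribed configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: helloworld-http
spec:
  template:
    container:
      env:
        # Opt-in: tracing is a no-op unless this endpoint is set.
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://jaeger.observability.svc:4317
        - name: OTEL_SERVICE_NAME
          value: helloworld-http
```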

Test scenario

A single HTTP POST triggers a chain across 3 EventSources and 3 Sensors:

  1. another-webhook (EventSource) → another-sensor (Sensor) — forwards via HTTP trigger
  2. helloworld-webhook (EventSource) → helloworld-http (Sensor) — processor endpoint is down, retries 3x, then fires DLQ trigger
  3. dlqueue-webhook (EventSource) → dlqueue-processor (Sensor) — processes the dead letter
```mermaid
graph TD
    A["<b>eventsource.publish</b><br/>another-webhook<br/>event.id: ba49ae95...<br/>47.0ms ✅"]
    B["<b>sensor.trigger</b><br/>another-sensor → another-sensor<br/>6.3ms ✅"]
    C["<b>eventsource.publish</b><br/>helloworld-webhook<br/>event.id: d14e2273...<br/>3.9ms ✅"]
    D["<b>sensor.trigger</b><br/>helloworld-http → helloworld-http<br/>31.9ms ❌ retry 1"]
    E["<b>sensor.trigger</b><br/>helloworld-http → helloworld-http<br/>0.8ms ❌ retry 2"]
    F["<b>sensor.trigger</b><br/>helloworld-http → helloworld-http<br/>0.7ms ❌ retry 3"]
    G["<b>sensor.trigger</b><br/>helloworld-http → dlq-http-trigger<br/>7.2ms ✅"]
    H["<b>eventsource.publish</b><br/>dlqueue-webhook<br/>event.id: 7ed81bf0...<br/>4.6ms ✅"]
    I["<b>sensor.trigger</b><br/>dlqueue-processor → dlqueue-processor<br/>2.6ms ✅"]

    A --> B
    B -->|"HTTP POST"| C
    C --> D
    C --> E
    C --> F
    C -->|"retries exhausted"| G
    G -->|"HTTP POST to DLQ"| H
    H --> I

    style A fill:#2d6a4f,color:#fff
    style B fill:#2d6a4f,color:#fff
    style C fill:#2d6a4f,color:#fff
    style D fill:#ae2012,color:#fff
    style E fill:#ae2012,color:#fff
    style F fill:#ae2012,color:#fff
    style G fill:#2d6a4f,color:#fff
    style H fill:#2d6a4f,color:#fff
    style I fill:#2d6a4f,color:#fff
```

Jaeger trace view

The entire chain appears as a single connected trace (a613bf5b2cccbb628b67bacb729a0e46) with 9 spans across 5 services:

[screenshot: Jaeger trace view]

Jaeger service map

The service dependency graph shows the flow between all instrumented services:

[screenshot: Jaeger service map]

Argo Events resources

EventSource and Sensors:

[screenshot: EventSource and Sensor resources]

Key observations

  1. End-to-end trace propagation works — W3C traceparent is injected by the HTTP trigger into outgoing requests and extracted by downstream webhook EventSources, linking all hops into a single trace.

  2. OTEL_SERVICE_NAME is respected — Each EventSource/Sensor appears with its configured service name in Jaeger (e.g., helloworld-webhook, dlqueue-processor) rather than the generic defaults.

  3. Retries are visible — Failed trigger attempts appear as individual sensor.trigger spans with ERROR status, making it easy to identify retry behavior and eventual DLQ routing.

  4. event.id changes at each EventSource — This is correct per the CloudEvents spec. Each EventSource is an independent event producer and generates a new ID. The trace context (traceparent) is what links the hops, not the event ID.

  5. Zero impact when disabled — When OTEL_EXPORTER_OTLP_ENDPOINT is not set, the OTel tracer is a no-op with no performance overhead.


@kaio6fellipe kaio6fellipe marked this pull request as ready for review March 27, 2026 03:18
@kaio6fellipe kaio6fellipe requested a review from whynowy as a code owner March 27, 2026 03:18
@kaio6fellipe kaio6fellipe changed the title feat: CloudEvents compliance and add OpenTelemetry distributed tracing feat: CloudEvents ingest compliance and add OpenTelemetry distributed tracing Mar 27, 2026
When an incoming HTTP request is a valid CloudEvent (binary or structured
content mode), preserve the original attributes (id, source, type, subject,
time) and extension attributes (e.g., traceparent) instead of discarding
them and generating new values.

- Add WithSource, WithType, WithSubject, WithTime, WithExtension, and
  WithCloudEvent Option constructors
- Thread Options through webhook dispatch channel
- Detect incoming CloudEvents via CE SDK HTTP binding in webhook handler
- Add Extensions field to EventContext (protobuf field 8)
- Preserve extensions in sensor convertEvent()

Closes argoproj#2011
Related: argoproj#1015, argoproj#983

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Add opt-in distributed tracing across the event pipeline
(EventSource -> EventBus -> Sensor -> Trigger) using OpenTelemetry.

- Create pkg/shared/tracing with InitTracer, SpanFromCloudEvent, and
  InjectTraceIntoCloudEvent using W3C Trace Context propagation
- Add eventsource.publish span wrapping event bus publish
- Add sensor.trigger span wrapping trigger execution
- Initialize tracers in EventSource and Sensor pod entrypoints
- Configuration via standard OTel env vars (OTEL_EXPORTER_OTLP_ENDPOINT)
- Fully opt-in: no-op when endpoint is unset, zero performance impact

Closes argoproj#1111

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…EL_SERVICE_NAME

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
@kaio6fellipe kaio6fellipe force-pushed the feat/cloudevents-compliance-otel-tracing branch from bf28d85 to f2b4be7 Compare April 2, 2026 17:08
@whynowy whynowy requested a review from eduardodbr April 3, 2026 06:30
