feat: CloudEvents ingest compliance and add OpenTelemetry distributed tracing #3961
Open
kaio6fellipe wants to merge 5 commits into argoproj:master from
Conversation
mario-turno approved these changes on Mar 30, 2026
When an incoming HTTP request is a valid CloudEvent (binary or structured content mode), preserve the original attributes (id, source, type, subject, time) and extension attributes (e.g., traceparent) instead of discarding them and generating new values.

- Add WithSource, WithType, WithSubject, WithTime, WithExtension, and WithCloudEvent Option constructors
- Thread Options through webhook dispatch channel
- Detect incoming CloudEvents via CE SDK HTTP binding in webhook handler
- Add Extensions field to EventContext (protobuf field 8)
- Preserve extensions in sensor convertEvent()

Closes argoproj#2011
Related: argoproj#1015, argoproj#983

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Add opt-in distributed tracing across the event pipeline (EventSource -> EventBus -> Sensor -> Trigger) using OpenTelemetry.

- Create pkg/shared/tracing with InitTracer, SpanFromCloudEvent, and InjectTraceIntoCloudEvent using W3C Trace Context propagation
- Add eventsource.publish span wrapping event bus publish
- Add sensor.trigger span wrapping trigger execution
- Initialize tracers in EventSource and Sensor pod entrypoints
- Configuration via standard OTel env vars (OTEL_EXPORTER_OTLP_ENDPOINT)
- Fully opt-in: no-op when endpoint is unset, zero performance impact

Closes argoproj#1111

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
…EL_SERVICE_NAME

Signed-off-by: Kaio Fellipe <kaio6fellipe@gmail.com>
Force-pushed bf28d85 to f2b4be7
Summary
Add CloudEvents compliance and OpenTelemetry distributed tracing to Argo Events.
CloudEvents ingest compliance (Closes #2011, Related: #1015, #983)
Argo Events currently discards all incoming CloudEvent metadata and regenerates it with internal values. This PR preserves the original CloudEvent attributes (`id`, `source`, `type`, `subject`, `time`) and extension attributes (e.g., `traceparent`, `tracestate`) when an incoming HTTP request is a valid CloudEvent.

- `WithCloudEvent` and individual attribute Option constructors
- `Extensions` field on `EventContext` for pipeline-wide extension propagation
- Extensions preserved in the sensor's `convertEvent()`

OpenTelemetry Distributed Tracing (Closes #1111)
Add opt-in distributed tracing across the event pipeline (EventSource → EventBus → Sensor → Trigger) using OpenTelemetry with W3C Trace Context propagation.
- `pkg/shared/tracing` package with `InitTracer`, `SpanFromCloudEvent`, `InjectTraceIntoCloudEvent`
- `eventsource.publish` span wrapping event bus publish
- `sensor.trigger` span wrapping trigger execution with error recording
- Configuration via standard OTel env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`); fully opt-in, no-op when unset

Backward Compatibility
- `EventContext.Extensions` is optional (`omitempty`) with the next available protobuf field number

Tests and behavior
The implementation was tested on a real GKE environment with Jaeger All-in-One as the trace backend. The test setup consists of three interconnected EventSource/Sensor pairs spanning two namespaces (`sandbox` and `platform`), exercising the full event pipeline including retries and dead-letter queue (DLQ) routing.

Test environment
- Jaeger All-in-One in the `observability` namespace as OTLP collector + trace UI
- `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_SERVICE_NAME` env vars on EventSource/Sensor pod templates

Test scenario
A single HTTP POST triggers a chain across 3 EventSources and 3 Sensors:
- `another-webhook` (EventSource) → `another-sensor` (Sensor) — forwards via HTTP trigger
- `helloworld-webhook` (EventSource) → `helloworld-http` (Sensor) — processor endpoint is down, retries 3x, then fires DLQ trigger
- `dlqueue-webhook` (EventSource) → `dlqueue-processor` (Sensor) — processes the dead letter

```mermaid
graph TD
    A["<b>eventsource.publish</b><br/>another-webhook<br/>event.id: ba49ae95...<br/>47.0ms ✅"]
    B["<b>sensor.trigger</b><br/>another-sensor → another-sensor<br/>6.3ms ✅"]
    C["<b>eventsource.publish</b><br/>helloworld-webhook<br/>event.id: d14e2273...<br/>3.9ms ✅"]
    D["<b>sensor.trigger</b><br/>helloworld-http → helloworld-http<br/>31.9ms ❌ retry 1"]
    E["<b>sensor.trigger</b><br/>helloworld-http → helloworld-http<br/>0.8ms ❌ retry 2"]
    F["<b>sensor.trigger</b><br/>helloworld-http → helloworld-http<br/>0.7ms ❌ retry 3"]
    G["<b>sensor.trigger</b><br/>helloworld-http → dlq-http-trigger<br/>7.2ms ✅"]
    H["<b>eventsource.publish</b><br/>dlqueue-webhook<br/>event.id: 7ed81bf0...<br/>4.6ms ✅"]
    I["<b>sensor.trigger</b><br/>dlqueue-processor → dlqueue-processor<br/>2.6ms ✅"]
    A --> B
    B -->|"HTTP POST"| C
    C --> D
    C --> E
    C --> F
    C -->|"retries exhausted"| G
    G -->|"HTTP POST to DLQ"| H
    H --> I
    style A fill:#2d6a4f,color:#fff
    style B fill:#2d6a4f,color:#fff
    style C fill:#2d6a4f,color:#fff
    style D fill:#ae2012,color:#fff
    style E fill:#ae2012,color:#fff
    style F fill:#ae2012,color:#fff
    style G fill:#2d6a4f,color:#fff
    style H fill:#2d6a4f,color:#fff
    style I fill:#2d6a4f,color:#fff
```

Jaeger trace view
The entire chain appears as a single connected trace (`a613bf5b2cccbb628b67bacb729a0e46`) with 9 spans across 5 services:

Jaeger service map
The service dependency graph shows the flow between all instrumented services:
Argo Events resources
EventSources and Sensors:
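As an illustration of how the test resources wire in the OTel configuration, a Sensor from this setup might look like the fragment below. The manifest is a hypothetical reconstruction from the description above (names, namespace, and endpoint are assumptions), not a file from the PR.

```yaml
# Hypothetical excerpt: OTel env vars on a Sensor pod template.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: helloworld-http
  namespace: sandbox
spec:
  template:
    container:
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://jaeger.observability.svc:4317
        - name: OTEL_SERVICE_NAME
          value: helloworld-http
```

Omitting `OTEL_EXPORTER_OTLP_ENDPOINT` leaves the tracer as a no-op, so existing manifests need no changes.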
Key observations
- **End-to-end trace propagation works** — W3C `traceparent` is injected by the HTTP trigger into outgoing requests and extracted by downstream webhook EventSources, linking all hops into a single trace.
- **`OTEL_SERVICE_NAME` is respected** — Each EventSource/Sensor appears with its configured service name in Jaeger (e.g., `helloworld-webhook`, `dlqueue-processor`) rather than the generic defaults.
- **Retries are visible** — Failed trigger attempts appear as individual `sensor.trigger` spans with `ERROR` status, making it easy to identify retry behavior and eventual DLQ routing.
- **`event.id` changes at each EventSource** — This is correct per the CloudEvents spec. Each EventSource is an independent event producer and generates a new ID. The trace context (`traceparent`) is what links the hops, not the event ID.
- **Zero impact when disabled** — When `OTEL_EXPORTER_OTLP_ENDPOINT` is not set, the OTel tracer is a no-op with no performance overhead.

Checklist: