Internal telemetry routing ** WIP DRAFT ** #1672
Conversation
…d/otel_sdk_logs_bridge
Codecov Report

❌ Patch coverage is

Additional details and impacted files

Coverage Diff against main (#1672):

| | main | #1672 | +/- |
| --- | --- | --- | --- |
| Coverage | 84.27% | 84.02% | -0.25% |
| Files | 464 | 471 | +7 |
| Lines | 133941 | 135276 | +1335 |
| Hits | 112884 | 113672 | +788 |
| Misses | 20523 | 21070 | +547 |
| Partials | 534 | 534 | |
> internal telemetry pipeline is rigorously safeguarded against these
> pitfalls through:
>
> - OTAP-Dataflow components downstream of an ITR cannot be configured
ITR - first time usage. Define.
> with:
>
> - No-op logging
> - Raw logging
If "Raw logging" means console output, It should be through the SDK. It is yet another exporter, and should not be considered by separate.
For stdout/console, we cannot (and should not) use OTel. We need to use tracing::fmt or develop a custom one.
Yes, the OTel exporter is not specified as a stable output format by OTel.
I see that it is a JSON-line-oriented exporter, and I would welcome such a thing (@gouslu and I discussed deriving such a thing from the AzMon exporter code, which is similar to JSON-line).
In this PR, as a proof of concept, I replaced tracing::fmt and developed an OTLP-bytes-to-console exporter; that's essentially what you're saying, I think, @cijothomas.
Actually, I've seen logging SDKs emit directly to protobuf bytes before! One reason we can't use tracing::fmt over the OTLP bytes representation in this case (AFAICT) is that the tracing Event struct does not contain a timestamp, so there is no way to format a log statement recorded in the past. This is not the case for the OTel SDK, which is why in this proposal we are able to reconstruct an OTel SDK log event and process/export it as an alternative to the ITR.
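For illustration, a minimal sketch of why the timestamp matters: if the record captures its own timestamp when it is recorded, it can be formatted or exported at any later point. The type and field names below are hypothetical, not the actual crate API.

```rust
// Hypothetical sketch: capture the timestamp when the event is recorded, so the
// record can be formatted/exported later (something a tracing `Event` alone
// cannot support, since it carries no timestamp).
use std::time::{SystemTime, UNIX_EPOCH};

struct BufferedLogRecord {
    time_unix_nano: u64, // captured at record time
    severity_text: String,
    body: String,
}

impl BufferedLogRecord {
    fn capture(severity_text: &str, body: &str) -> Self {
        let time_unix_nano = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0);
        Self {
            time_unix_nano,
            severity_text: severity_text.to_owned(),
            body: body.to_owned(),
        }
    }
}
```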
> @@ -0,0 +1,72 @@
> // Copyright The OpenTelemetry Authors
avoid the "mod.rs" format. Use a file internal_telemetry_receiver.rs at the same level as lib.rs instead.
I apologize: the code in this PR is not ready for review; we may view it as a feasibility study to accompany the ARCHITECTURE.md document.
> pub max_record_bytes: usize,
>
> /// Maximum number of records in the bounded channel.
> /// When full, new records fall back to raw console logger.
Why fall back here, instead of keeping a count of the drops?
> #[serde(default = "default_max_record_bytes")]
> pub max_record_bytes: usize,
>
> /// Maximum number of records in the bounded channel.
"bounded" channel feels like internal implementation details; so we should avoid exposing them to public config..
> pub(crate) resource_bytes: OtlpProtoBytes,
> pub(crate) scope_name: String,
> pub(crate) flush_threshold_bytes: usize,
> pub(crate) overflow_sender: mpsc::UnboundedSender<Bytes>,
Is it okay to use an unbounded sender for overflow purposes?
This PR is a draft, so it is probably a temporary shortcut. The rule I personally follow is to systematically use bounded channels together with a policy that specifies what should happen in case of overflow, that is, drop the incoming message or block the sender. This policy can be configurable, or chosen directly in the code depending on whether it is worth making it configurable.
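A minimal sketch of that rule, assuming a tokio bounded channel; the `OverflowPolicy` enum and `send_with_policy` helper are hypothetical names, not part of the PR.

```rust
// Bounded channel plus an explicit overflow policy: drop the incoming message
// (and count it) or block the sender until capacity is available.
use tokio::sync::mpsc;

#[derive(Clone, Copy)]
enum OverflowPolicy {
    /// Drop the incoming message when the channel is full.
    DropIncoming,
    /// Block (await) the sender until capacity is available.
    Block,
}

async fn send_with_policy<T>(
    tx: &mpsc::Sender<T>,
    msg: T,
    policy: OverflowPolicy,
    dropped: &mut u64,
) {
    match policy {
        OverflowPolicy::DropIncoming => {
            if tx.try_send(msg).is_err() {
                // Channel full or closed: record the drop instead of blocking.
                *dropped += 1;
            }
        }
        OverflowPolicy::Block => {
            // Waits for capacity; fails only if the receiver is gone.
            let _ = tx.send(msg).await;
        }
    }
}
```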
> /// Internal collection for component-level logs.
> ///
> /// When enabled, component logs (otel_info!, otel_warn!, etc.) are routed through
> /// an internal telemetry receiver in the OTAP pipeline, allowing use of built-in
> /// batch processors, retry, and exporters (console, OTLP, etc.).
I'm a big fan of this approach.
lquerel left a comment
I mainly focused on the architecture document for now, which is very detailed and informative. A few general remarks:

- In the long term, we should be able to define `TelemetrySettings` at a global level, at the engine level, with the possibility to override them in a specific pipeline configuration.
- In the case of a tight integration with the otap-engine, I think we need to find a way to reuse the `AttributeSet` concept already used for metrics. This offers several advantages:
  - we get a common language and definition for attributes,
  - they are populated by the engine and registered only once, which avoids redefining them on every call. We only need to provide an attribute set ID, which is essentially just a number. Additional "dynamic" attributes could be added when needed.
- We should be able to compile the engine in a way that eliminates all overhead for a given logging level, essentially the same behavior we have today with the current macros. I believe this is already the plan, but I wanted to make this rule explicit (see the sketch below).
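On the last point, a minimal sketch of compile-time elimination, assuming the maximum level is a crate-level const (possibly driven by a feature flag); the macro name and emit path are placeholders, not the engine's actual macros.

```rust
// Sketch only: with the level threshold known at compile time, the disabled
// branch is removed entirely by the compiler, so a `debug` call costs nothing
// when the engine is built for `info` and above.
pub const MAX_LEVEL: u8 = 3; // 3 = info in this sketch; could come from a feature

#[macro_export]
macro_rules! otel_debug_sketch {
    ($($arg:tt)*) => {
        if $crate::MAX_LEVEL >= 4 {
            // 4 = debug; dead code when MAX_LEVEL = 3
            println!($($arg)*); // placeholder for the real emit path
        }
    };
}
```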
> /// When disabled (default), component logs are routed to the OpenTelemetry SDK,
> /// using the same export path as 3rd party logs from tokio-tracing-rs.
> #[serde(default)]
> pub internal_collection: InternalCollectionConfig,
Long term, I think this approach will apply not only to logs, but also to metrics and traces. At some point, we might promote this configuration to a more general level.

For now, I suggest that this field be:

- an option, so we can remove the `enabled` field from `InternalCollectionConfig`, which I think would simplify things (sketched below)
- renamed to something more explicit. I'm not fully satisfied with my proposal `otap_pipeline`; there is probably a better name to find
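A hedged sketch of that suggestion: making the field an `Option` removes the need for a separate `enabled` flag. The field names follow the PR, but the exact shape is illustrative.

```rust
use serde::Deserialize;

#[derive(Debug, Default, Deserialize)]
pub struct InternalCollectionConfig {
    // receiver/pipeline settings, with no `enabled` field
}

#[derive(Debug, Deserialize)]
pub struct LogsConfig {
    pub level: String,
    /// Present => routed through the internal (OTAP pipeline) path;
    /// absent (default) => routed to the OpenTelemetry SDK path.
    #[serde(default)]
    pub internal_collection: Option<InternalCollectionConfig>,
}
```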
> ///
> /// This method never fails - errors are silently dropped to prevent recursion.
> /// If the telemetry buffer is not configured, this is a no-op.
> pub fn log_event(&self, log_record: &impl otap_df_pdata::views::logs::LogRecordView) {
We need to find a way to reuse the `NodeAttributeSet` that we already use for metrics. That will let every log emitted by a node share a common context with the metrics.
> ## Internal telemetry receiver
>
> The Internal Telemetry Receiver or "ITR" is an OTAP-Dataflow receiver
> component that produces telemetry from internal sources. An internal
"receives" instead of "produces"
> the connected processor and exporter components reachable from ITR
> source nodes.
>
> To begin with, every OTAP-Dataflow comonent is configured with an
*component
> second party as it is responsible for routing internal telemetry. The
> ITR cannot use the internal telemetry SDK itself, an invisible member
> of the pipeline. The ITR can be instrumented using third-party
> instrumentation (e.g., `tracing`, `log` crates) provided it can
why? Can we leave 3p instrumentation mechanisms out? That could go out of control pretty easily
> to send to an ITR node. This avoids a direct feedback cycle for
> internal telemetry because the components cannot reach
> themselves. For example, ITR and downstream components may be
> configured for raw logging, no metrics, etc.
Can you elaborate on "raw logging"? They are just another component, so they might be "configured" to produce telemetry. What the framework cannot do is to route its telemetry to an ITR.
> configured for raw logging, no metrics, etc.
> - ITR instances share access to one or more threads with associated
> async runtime. They use these dedicated threads to isolate internal
> telemetry processes that use third-party instrumentation.
I don't follow this part. The ITR is just another receiver, so it has its own threads like any other receiver.
> instrumentation in dedicated internal telemetry threads. Internal
> telemetry threads automatically configure a safe configuration.
> - Components under observation (non-ITR components) have internal
> telemetry events routed queues in the OTAP-Dataflow pipeline on the
When you say "queues", are you talking about channels? => flume?
> same core, this avoids blocking the engine. First-party
> instrumentation will be handled on the CPU core that produced the
> telemetry under normal circumstances. This isolates cores that are
> able to process their own internal telemetry.
I'm not following this part. Can you elaborate?
> telemetry under normal circumstances. This isolates cores that are
> able to process their own internal telemetry.
> - Option to fall back to no-op, a non-blocking global provider, and/or
> raw logging.
Same here. Can you elaborate?
> ## OTLP-bytes first
>
> As a key design decision, the OTAP-Dataflow internal telemetry data
This also implies that we will create and send one message (OTLP) per event (will not batch them).
> ## Raw logging
>
> We support formatting events for direct printing to the console from
Why? This should be configurable, for example if the user does not want verbose output or does not have any reliable mechanism to capture it (Kubernetes scenario).
In general, console should be considered "yet another exporter"
> The two internal logs data paths are:
>
> - Third-party: Tokio `tracing` global subscriber: third-party log
Why not get rid of the global subscriber? We can have more control over it. We can pass the telemetry objects wherever they are needed (from a thread-local variable?)
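A minimal sketch of the thread-variable idea, assuming the engine installs a handle into each of its worker threads; `TelemetryHandle`, `install`, and `with_telemetry` are hypothetical names, not part of the PR.

```rust
use std::cell::RefCell;

/// Placeholder for the per-thread telemetry state (channel toward the ITR,
/// attribute set id, etc.).
pub struct TelemetryHandle;

thread_local! {
    static TELEMETRY: RefCell<Option<TelemetryHandle>> = RefCell::new(None);
}

/// Called by the engine when it starts a worker thread.
pub fn install(handle: TelemetryHandle) {
    TELEMETRY.with(|t| *t.borrow_mut() = Some(handle));
}

/// Instrumentation points use the thread-local handle instead of a global
/// subscriber; returns None on threads where telemetry was never installed.
pub fn with_telemetry<R>(f: impl FnOnce(&TelemetryHandle) -> R) -> Option<R> {
    TELEMETRY.with(|t| t.borrow().as_ref().map(f))
}
```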
> `otap_df_ptdata` with an OTLP bytes encoder for its views interfaces.
>
> Then, `TracingLogRecord` implements the log record view, we will encode
> the reocrd as OTLP bytes by encoding the view.
*record
> Then, `TracingLogRecord` implements the log record view, we will encode
> the reocrd as OTLP bytes by encoding the view.
>
> ### Stateful OTLP bytes encoder for repeated LogRecordViews
Which component does this? The ITR? Is it just another processor added after it? Is it during production? (the macro) => this might not be the same thread
It would be in the effect handler.
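To make the effect-handler placement concrete, here is a hedged sketch of a stateful encoder that holds pre-encoded resource/scope state and hands back accumulated log-record bytes once a threshold is crossed; the type and method names are illustrative, not the PR's actual API.

```rust
/// Sketch of a stateful encoder owned by the effect handler.
pub struct StatefulLogsEncoder {
    resource_bytes: Vec<u8>,      // Resource encoded once, reused for every batch
    scope_name: String,           // instrumentation scope, written once per flush
    buffer: Vec<u8>,              // accumulated encoded LogRecords
    flush_threshold_bytes: usize, // flush once the buffer grows past this size
}

impl StatefulLogsEncoder {
    /// Append one already-encoded record; returns the accumulated bytes when a
    /// flush is due, so the caller can frame them (resource/scope) and forward
    /// them toward the ITR.
    pub fn append(&mut self, encoded_record: &[u8]) -> Option<Vec<u8>> {
        self.buffer.extend_from_slice(encoded_record);
        if self.buffer.len() >= self.flush_threshold_bytes {
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}
```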
> thread-local state to prevent logging in its own export path
>
> The global logs collection thread is configured as one (or more, if
> needed) instances consuming logs from the global Tokio `tracing`
What if another library is used by the 3p component? I think those should be left out so we can have more control over the telemetry.
In #1741 I emphasized the two different diagnostic paths, one for 3rd party instrumentation and one for components to use directly. Definitely agree.
> logs:
>   level: info
>   internal_collection:
>     enabled: true
It is preferred, at the configuration level, that presence implies enabled (instead of the flag).
Moving to #1735.
> thread, a raw logger, or do nothing (dropping the internal log
> record).
>
> ## Example configuration
This might be more at the component level, right? (the telemetry producer)
Related to #1663.
This is a proposal to support an internal route for telemetry that lets us process OTAP-Dataflow telemetry using our own pipeline support. This requires special protections against self-induced telemetry, and it requires options to route telemetry in many ways so that we remain compatible with OpenTelemetry and keep our options open. The proposal accompanying this PR is in the ARCHITECTURE.md draft.