feat: add OpenTelemetry telemetry module for SDK instrumentation (#164) by 57Ajay · Pull Request #454 · kubeflow/sdk

57Ajay · 2026-04-11T15:47:29Z

description

Adds kubeflow/common/telemetry.py - the foundational module for integrating OpenTelemetry into the Kubeflow SDK, as tracked in #164.

why

The SDK currently has no structured observability. For example, if TrainerClient.train() takes 45 minutes or OptimizerClient.optimize() stalls across 50 trials, there is no way to pinpoint whether the bottleneck is K8s API calls, CRD validation, polling, or pod scheduling. This module is the first step toward vendor-neutral, opt-in distributed tracing and metrics across all SDK clients.

design decisions

API-only dependency: Following OTel library instrumentation best practices, we depend only on opentelemetry-api, not opentelemetry-sdk or any exporter. Users choose their own SDK and export backend.
Near-zero overhead when disabled: When opentelemetry-api is not installed, _setup() catches ImportError once, and all subsequent sdk_span() calls yield None after a single flag check. No import-time side effects.
Module-level functions, not a class: OTel tracers and meters are process-global singletons, so a wrapper class would add indirection with no benefit. This mirrors how opentelemetry.trace.get_tracer() itself works.
Follows existing optional dependency pattern: Same approach as ModelRegistryClient with model-registry and SparkClient with pyspark-connect.

usage

from kubeflow.common.telemetry import sdk_span

def train(self, ...):
    with sdk_span("kubeflow.trainer.train", attributes={
        "kubeflow.namespace": self.namespace,
    }) as span:
        # ... existing logic ...
        if span:
            span.set_attribute("kubeflow.trainjob.name", name)

next steps

This is the foundation for client-by-client instrumentation. Later I plan to instrument TrainerClient operations in KubernetesBackend using sdk_span().

Signed-off-by: 57Ajay <57ajay.u@gmail.com>

google-oss-prow · 2026-04-11T15:47:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-04-11T15:47:43Z

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience and #kubeflow-trainer Slack channels
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

Adds an opt-in, API-only OpenTelemetry integration point for Kubeflow SDK so clients can create spans/metrics without requiring opentelemetry-sdk or exporters.

Changes:

Introduces kubeflow/common/telemetry.py with lazy, optional initialization of OTel tracer/meter and an sdk_span() context manager.
Adds unit tests for both “OTel installed” (via injected modules) and “OTel absent” scenarios.
Adds a telemetry extra (opentelemetry-api) and updates the lockfile accordingly.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `uv.lock` | Adds `opentelemetry-api` (and transitive deps) and registers the `telemetry` extra in the lock resolution. |
| `pyproject.toml` | Defines `telemetry = ["opentelemetry-api>=1.40.0"]` optional extra. |
| `kubeflow/common/telemetry.py` | Implements lazy tracer/meter initialization plus `sdk_span()` helper for client instrumentation. |
| `kubeflow/common/telemetry_test.py` | Adds tests for both presence/absence of OTel using sys.modules injection and import blocking. |

Copilot · 2026-04-11T15:51:07Z

+# Instrumentation identity. The version tracks the SDK version that introduced
+# telemetry support and is bumped when the instrumentation surface changes
+# (new span names, attribute renames, metric additions).
+_INSTRUMENTATION_NAME = "kubeflow.sdk"
+_INSTRUMENTATION_VERSION = "0.5.0"
+


_INSTRUMENTATION_VERSION is set to "0.5.0", but the package version is currently 0.4.0 (kubeflow/__init__.py), so the reported instrumentation metadata will be misleading. Derive it from kubeflow.__version__, or set it to the release version that actually introduces telemetry, and update the tests that assert this value.

Copilot · 2026-04-11T15:51:08Z

+    global _tracer, _meter, _initialized  # noqa: PLW0603
+    if _initialized:
+        return
+    _initialized = True
+
+    try:
+        from opentelemetry import metrics, trace
+
+        _tracer = trace.get_tracer(_INSTRUMENTATION_NAME, _INSTRUMENTATION_VERSION)
+        _meter = metrics.get_meter(_INSTRUMENTATION_NAME, _INSTRUMENTATION_VERSION)
+    except ImportError:


_setup() sets _initialized = True before the import and singleton assignment complete, which can race under concurrent calls: another thread may observe _initialized=True and return early while _tracer and _meter are still None, even though OTel is available. Guard initialization with a lock, or only flip _initialized after _tracer/_meter have been set.

Copilot · 2026-04-11T15:51:08Z

+    original_import = (
+        __builtins__["__import__"] if isinstance(__builtins__, dict) else __builtins__.__import__
+    )
+
+    def blocking_import(name, *args, **kwargs):
+        if name.startswith("opentelemetry"):
+            raise ImportError(f"No module named '{name}'")
+        return original_import(name, *args, **kwargs)
+
+    if isinstance(__builtins__, dict):
+        __builtins__["__import__"] = blocking_import
+    else:
+        __builtins__.__import__ = blocking_import
+
+    yield
+
+    if isinstance(__builtins__, dict):
+        __builtins__["__import__"] = original_import
+    else:
+        __builtins__.__import__ = original_import


The mock_no_otel fixture patches builtins.__import__ directly. If anything fails after the patch but before teardown, the patched import can leak into other tests and cause hard-to-debug import behavior. Prefer pytest's monkeypatch fixture (or patch builtins.__import__) to guarantee restoration.
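
A sketch of the self-restoring alternative using only the standard library (unittest.mock.patch.object). The helper name block_opentelemetry is hypothetical; the blocking logic mirrors the fixture in the diff.

```python
import builtins
from contextlib import contextmanager
from unittest import mock


@contextmanager
def block_opentelemetry():
    """Make any import of opentelemetry* fail, restoring the real import on exit."""
    original_import = builtins.__import__

    def blocking_import(name, *args, **kwargs):
        if name.startswith("opentelemetry"):
            raise ImportError(f"No module named '{name}'")
        return original_import(name, *args, **kwargs)

    # patch.object undoes the patch even if the body of the with-block raises.
    with mock.patch.object(builtins, "__import__", blocking_import):
        yield


# Usage: inside the block, OTel imports fail; other imports still work.
with block_opentelemetry():
    import json  # unaffected

    try:
        from opentelemetry import trace  # noqa: F401
        otel_blocked = False
    except ImportError:
        otel_blocked = True
```

In a pytest fixture the same effect should be achievable with monkeypatch.setattr(builtins, "__import__", blocking_import), which pytest reverts automatically at teardown.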

feat: add OpenTelemetry telemetry module

cade772

Signed-off-by: 57Ajay <57ajay.u@gmail.com>

Copilot AI review requested due to automatic review settings April 11, 2026 15:47

google-oss-prow bot requested review from andreyvelich, kramaranya and szaher April 11, 2026 15:47

google-oss-prow bot added the size/L label Apr 11, 2026

Copilot started reviewing on behalf of 57Ajay April 11, 2026 15:48

Copilot AI reviewed Apr 11, 2026

57Ajay changed the title ~~add OpenTelemetry telemetry module for SDK instrumentation (#164)~~ feat: add OpenTelemetry telemetry module for SDK instrumentation (#164) Apr 11, 2026

57Ajay wants to merge 1 commit into kubeflow:main from 57Ajay:feat/164-telemetry-module
