feat: add OpenTelemetry telemetry module for SDK instrumentation (#164)#454
feat: add OpenTelemetry telemetry module for SDK instrumentation (#164)#45457Ajay wants to merge 1 commit intokubeflow:mainfrom
Conversation
Signed-off-by: 57Ajay <57ajay.u@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Adds an opt-in, API-only OpenTelemetry integration point for Kubeflow SDK so clients can create spans/metrics without requiring opentelemetry-sdk or exporters.
Changes:
- Introduces
kubeflow/common/telemetry.pywith lazy, optional initialization of OTel tracer/meter and ansdk_span()context manager. - Adds unit tests for both “OTel installed” (via injected modules) and “OTel absent” scenarios.
- Adds a
telemetryextra (opentelemetry-api) and updates the lockfile accordingly.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
uv.lock |
Adds opentelemetry-api (and transitive deps) and registers the telemetry extra in the lock resolution. |
pyproject.toml |
Defines telemetry = ["opentelemetry-api>=1.40.0"] optional extra. |
kubeflow/common/telemetry.py |
Implements lazy tracer/meter initialization plus sdk_span() helper for client instrumentation. |
kubeflow/common/telemetry_test.py |
Adds tests for both presence/absence of OTel using sys.modules injection and import blocking. |
| # Instrumentation identity. The version tracks the SDK version that introduced | ||
| # telemetry support and is bumped when the instrumentation surface changes | ||
| # (new span names, attribute renames, metric additions). | ||
| _INSTRUMENTATION_NAME = "kubeflow.sdk" | ||
| _INSTRUMENTATION_VERSION = "0.5.0" | ||
|
|
There was a problem hiding this comment.
_INSTRUMENTATION_VERSION is set to "0.5.0", but the package version is currently 0.4.0 (kubeflow/init.py); this will report misleading instrumentation metadata, so derive it from kubeflow.version or set it to the release version that actually introduces telemetry (and update the tests that assert this value).
| global _tracer, _meter, _initialized # noqa: PLW0603 | ||
| if _initialized: | ||
| return | ||
| _initialized = True | ||
|
|
||
| try: | ||
| from opentelemetry import metrics, trace | ||
|
|
||
| _tracer = trace.get_tracer(_INSTRUMENTATION_NAME, _INSTRUMENTATION_VERSION) | ||
| _meter = metrics.get_meter(_INSTRUMENTATION_NAME, _INSTRUMENTATION_VERSION) | ||
| except ImportError: |
There was a problem hiding this comment.
_setup() sets _initialized = True before completing the import and singleton assignment, which can race under concurrent calls (another thread may observe _initialized=True and return None tracers/meters even though OTel is available); guard initialization with a lock or only flip _initialized after _tracer/_meter have been set.
| original_import = ( | ||
| __builtins__["__import__"] if isinstance(__builtins__, dict) else __builtins__.__import__ | ||
| ) | ||
|
|
||
| def blocking_import(name, *args, **kwargs): | ||
| if name.startswith("opentelemetry"): | ||
| raise ImportError(f"No module named '{name}'") | ||
| return original_import(name, *args, **kwargs) | ||
|
|
||
| if isinstance(__builtins__, dict): | ||
| __builtins__["__import__"] = blocking_import | ||
| else: | ||
| __builtins__.__import__ = blocking_import | ||
|
|
||
| yield | ||
|
|
||
| if isinstance(__builtins__, dict): | ||
| __builtins__["__import__"] = original_import | ||
| else: | ||
| __builtins__.__import__ = original_import |
There was a problem hiding this comment.
The mock_no_otel fixture patches builtins.import directly; if anything fails after the patch but before teardown, it can leak into other tests and cause hard-to-debug import behavior, so prefer pytest's monkeypatch fixture (or patch builtins.import) to guarantee restoration.
ref #164
description
Adds
kubeflow/common/telemetry.py- the foundational module for integrating OpenTelemetry into the Kubeflow SDK, as tracked in #164.why
The SDK currently has no structured observability. For example what if
TrainerClient.train()takes 45 minutes orOptimizerClient.optimize()stalls across 50 trials, we currently have no way to pinpoint whether the bottleneck is K8s API calls, CRD validation, polling, or pod scheduling. This module is the first step toward vendor-neutral, opt-in distributed tracing and metrics across all SDK clients.design decisions
opentelemetry-api, notopentelemetry-sdkor any exporter. Users choose their own SDK and export backend.opentelemetry-apiis not installed,_setup()catchesImportErroronce and all subsequentsdk_span()calls yieldNonewith no cost. No import-time side effects.opentelemetry.trace.get_tracer()itself works.ModelRegistryClientwithmodel-registryandSparkClientwithpyspark-connect.usage
next steps
This is the foundation for client-by-client instrumentation. Later I plan to instrument
TrainerClientoperations inKubernetesBackendusingsdk_span().