Environment
- How did you deploy Kubeflow Pipelines (KFP)? Standard KFP v2 with Argo Workflows
- KFP version: 2.14.x / 2.15.x
- KFP SDK version: kfp 2.x
Steps to reproduce
- Create a pipeline component that will trigger a pod-level failure.
  - For OOMKilled: use a component that allocates more memory than the container limit.
  - For ImagePullBackOff: set an invalid or non-existent image in the component spec.
- Run the pipeline and wait for the task to fail.
- Once the run fails, check the MLMD execution record for that task - either via
the KFP API or by querying the MLMD gRPC server directly.
- Observe that the execution's error details contain only a generic launcher error
string, not the actual pod failure reason (e.g., OOMKilled, exit code 137).
- Compare with `kubectl describe pod <failed-pod>` - the actual reason is visible
  there but is absent from MLMD.
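To reproduce the OOMKilled case, the component body only needs to allocate past the container limit. A minimal sketch (illustrative names; in a real KFP v2 pipeline this function would be wrapped with `@dsl.component` and the task capped with `.set_memory_limit("512M")` or similar):

```python
# Illustrative component body: allocate memory in 1 MiB chunks until the
# requested total is reached. Run with a total above the container memory
# limit (e.g. 1024 MiB against a 512M limit) so the kernel OOM-kills the pod.
def allocate_beyond_limit(total_mib: int) -> int:
    chunks = []
    for _ in range(total_mib):
        chunks.append(bytearray(1024 * 1024))  # hold 1 MiB per iteration
    return len(chunks)
```

In an actual run the pod is killed by the kernel before the function returns, so the task fails with `OOMKilled` (exit code 137) rather than a Python exception.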
Expected result
When a pod fails due to a platform-level reason (OOMKilled, ImagePullBackOff,
CrashLoopBackOff, Evicted, etc.), the MLMD execution record for that task should
include structured error details capturing:
- The pod failure reason (e.g., `OOMKilled`)
- The container exit code where available (e.g., 137 for `OOMKilled`)
- Optionally: the pod name and namespace for traceability
This would allow the backend to propagate a meaningful failure reason to the UI
and give users actionable information without requiring access to kubectl.
Current behavior
The launcher (`backend/src/v2/component/launcher_v2.go`) publishes execution state
on failure, but the error context it records comes only from its own runtime error,
not from the Kubernetes pod status. As a result, the MLMD execution record for a
pod-killed task contains a generic error string while the actual failure reason
(e.g., `OOMKilled` with exit code 137, or `ImagePullBackOff` with the image name)
is silently dropped.
`kubectl describe pod` is currently the only way to see the actual failure reason.
Materials and Reference
Affected code paths:
- `backend/src/v2/component/launcher_v2.go` — failure publication path
- `backend/src/v2/driver/driver.go` — where pod status could be read and normalized
- `backend/src/v2/metadata/client.go` — where execution error details are written to MLMD
Related issues:
Impacted by this bug? Give it a 👍.