[bug] Pod-level failure reason (OOMKilled, ImagePullBackOff) not recorded in MLMD execution error details in KFP v2 #13192

@sarika-03

Description

Environment

  • How did you deploy Kubeflow Pipelines (KFP)? Standard KFP v2 with Argo Workflows
  • KFP version: 2.14.x / 2.15.x
  • KFP SDK version: kfp 2.x

Steps to reproduce

  1. Create a pipeline component that will trigger a pod-level failure.
    For OOMKilled: use a component that allocates more memory than the container limit.
    For ImagePullBackOff: set an invalid or non-existent image in the component spec.
  2. Run the pipeline and wait for the task to fail.
  3. Once the run fails, check the MLMD execution record for that task - either via
    the KFP API or by querying the MLMD gRPC server directly.
  4. Observe that the execution's error details contain only a generic launcher error
    string, not the actual pod failure reason (e.g., OOMKilled, exit code 137).
  5. Compare with kubectl describe pod <failed-pod> - the actual reason is visible
    there but is absent from MLMD.
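For the OOMKilled case in step 1, the component body just needs to allocate past the container's memory limit; the kernel then kills the process with SIGKILL (exit code 137 = 128 + 9). A minimal sketch of such an allocator — the helper name and sizes are illustrative, not KFP API:

```python
# Minimal memory hog a component body could run to trigger OOMKilled.
# Allocating past the container's memory limit gets the process killed
# by the kernel with exit code 137 (128 + SIGKILL).
def allocate_mib(total_mib: int, chunk_mib: int = 64) -> list:
    """Allocate roughly total_mib MiB in chunk_mib-sized bytearrays."""
    chunks = []
    allocated = 0
    while allocated < total_mib:
        # bytearray zero-fills, so the pages are actually committed.
        chunks.append(bytearray(chunk_mib * 1024 * 1024))
        allocated += chunk_mib
    return chunks

if __name__ == "__main__":
    # In a real component, set total_mib well above the container limit,
    # e.g. allocate_mib(4096) against a 512Mi limit.
    chunks = allocate_mib(128)
    print(f"allocated ~{len(chunks) * 64} MiB")
```

Wrapped in a `@dsl.component` with a low `memory_limit`, this reliably reproduces the OOMKilled termination.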

Expected result

When a pod fails due to a platform-level reason (OOMKilled, ImagePullBackOff,
CrashLoopBackOff, Evicted, etc.), the MLMD execution record for that task should
include structured error details capturing:

  • The pod failure reason (e.g., OOMKilled)
  • The container exit code where available (e.g., 137 for OOMKilled)
  • Optionally: the pod name and namespace for traceability

This would allow the backend to propagate a meaningful failure reason to the UI
and give users actionable information without requiring access to kubectl.
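The information above is already present in the Pod's status. As a sketch of what "structured error details" could contain, the snippet below extracts the failure reason and exit code from a Pod's JSON (the shape `kubectl get pod -o json` returns). The field names (`status.containerStatuses[*].state.terminated.reason` / `exitCode`, and `waiting.reason` for ImagePullBackOff) come from the Kubernetes Pod API; the helper `pod_failure_details` is hypothetical and not part of the KFP codebase:

```python
import json

def pod_failure_details(pod_json: str) -> dict:
    """Extract a structured failure record from a Pod's JSON status.

    Checks containerStatuses for a terminated state (OOMKilled etc.)
    and falls back to waiting reasons such as ImagePullBackOff.
    """
    pod = json.loads(pod_json)
    details = {
        "pod_name": pod["metadata"]["name"],
        "namespace": pod["metadata"].get("namespace", "default"),
        "reason": None,
        "exit_code": None,
    }
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        if "terminated" in state:
            details["reason"] = state["terminated"].get("reason")
            details["exit_code"] = state["terminated"].get("exitCode")
            break
        if "waiting" in state:
            details["reason"] = state["waiting"].get("reason")
            break
    return details

# Example with an OOMKilled container status:
sample = json.dumps({
    "metadata": {"name": "train-pod", "namespace": "kubeflow"},
    "status": {"containerStatuses": [
        {"state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    ]},
})
print(pod_failure_details(sample))
```

A record of this shape, written into the execution's custom properties or error message, would carry everything the Expected result asks for.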

Current behavior

The launcher (backend/src/v2/component/launcher_v2.go) publishes execution state
on failure, but the error context it records comes only from its own runtime error —
not from the Kubernetes pod status. As a result, the MLMD execution record for a
pod-killed task contains a generic error string while the actual failure reason
(e.g., OOMKilled with exit code 137, or ImagePullBackOff with the image name)
is silently dropped.

kubectl describe pod is currently the only way to see the actual failure reason.

Materials and Reference

Affected code paths:

  • backend/src/v2/component/launcher_v2.go — failure publication path
  • backend/src/v2/driver/driver.go — where pod status could be read and normalized
  • backend/src/v2/metadata/client.go — where execution error details are written to MLMD

Related issues:


Impacted by this bug? Give it a 👍.
