fix: [TKC-5014] check workflow services health at the end of test workflow execution by de-wim · Pull Request #7572 · kubeshop/testkube

de-wim · 2026-04-23T15:47:00Z

Pull request description

This PR adds a health check before killing workflow services, detecting OOMKilled (or otherwise crashed) containers by inspecting pod restart counts. Previously, once a service passed its readiness probe, no further health validation occurred, so post-startup crashes went unnoticed. The stop-services stage is also made non-optional so detected failures properly mark the execution as failed.

Checklist (choose whats happened)

Breaking changes

None

Changes

Executions are now marked as failures when a workflow service crashes during testing

Fixes

TKC-5014

…kflow execution

otherwise, it wouldn't be able to turn the execution red.

de-wim · 2026-04-23T16:05:12Z

@greptileai please take a look

greptile-apps · 2026-04-23T16:07:55Z

Greptile Summary

This PR adds a post-startup health check to the workflow service kill phase by inspecting pod restart counts, and makes the "Stop services" stage non-optional so OOMKilled services cause executions to fail. The refactoring into RunKill/RunKillWithOptions is well-structured, but there is a blocking issue:

ui.Failf exits the process instead of returning an error (kill.go:188). ui.Failf calls os.Exit(1), so RunKill never returns an error for health check failures. This breaks the function's documented contract and makes TestServiceOOMKilledDetectedAtStop_Integration permanently non-functional — the test process exits before assert.Error(t, err) is reached. The fix is return fmt.Errorf(...) instead; the CLI path already has ui.ExitOnError at the call site to handle the exit.

Confidence Score: 3/5

Not safe to merge — the health-check error path exits the process instead of returning an error, breaking both the API contract and the integration test that validates TKC-5014.

One P1 bug: ui.Failf (os.Exit) in RunKill means health errors can never be returned as errors to callers, making RunKillWithOptions and its integration test non-functional. Score sits at 3/5 (below the 4/5 P1 ceiling) because the broken path is exactly the new safety-net being introduced.

cmd/tcl/testworkflow-toolkit/commands/kill.go — the ui.Failf call at line 188 must be replaced with return fmt.Errorf(...) before merging.

Important Files Changed

Filename	Overview
cmd/tcl/testworkflow-toolkit/commands/kill.go	Refactors kill logic into `RunKill`/`RunKillWithOptions` and adds OOM health check; health error path calls `ui.Failf` (os.Exit) instead of returning an error, breaking the function contract and the accompanying integration test.
pkg/tcl/testworkflowstcl/testworkflowprocessor/operations.go	Removes `SetOptional(true)` from the Stop services stage so health-check failures propagate and mark executions as failed — straightforward, correct change.
test/integration/components/toolkit/services_oomkill_test.go	Adds integration tests for OOM detection; `TestServiceOOMKilledDetectedAtStop_Integration` calls `RunKillWithOptions` and asserts on the returned error, but will never pass because `ui.Failf` exits the process before the error can be returned.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NewKillCmd / RunKillWithOptions] --> B[RunKill]
    B --> C[worker.List service instances]
    C --> D{items found?}
    D -- yes --> E[Resolve namespace from items]
    D -- no --> F[Skip namespace resolution]
    E --> G
    F --> G[For each item: getServiceRestarts]
    G --> H{restarts > 0?}
    H -- yes --> I[Append to healthErrors]
    H -- no --> J[Continue]
    I --> K
    J --> K{len conditions > 0 && machine != nil?}
    K -- yes --> L[Collect logs for matching services]
    K -- no --> M[Skip log collection]
    L --> N
    M --> N[worker.DestroyGroup]
    N --> O{healthErrors?}
    O -- yes --> P["ui.Failf → os.Exit(1) ⚠️\nNEVER returns error"]
    O -- no --> Q[return nil]
    P -.->|"Should be: return fmt.Errorf(...)"|Q

_{Reviews (2): Last reviewed commit: "use ui.fail instead of returning an erro..." | Re-trigger Greptile}

greptile-apps · 2026-04-23T16:08:01Z

+// KillDependencies holds the dependencies needed for the kill operation.
+type KillDependencies struct {
+	Worker    executionworkertypes.Worker
+	Namespace string
+	Ref       string
+}


Unused exported type (KillDependencies)

KillDependencies is defined and exported here but is never referenced anywhere in the codebase. It appears to be a leftover from a refactoring draft and should be removed.

…ions

greptile-apps · 2026-04-28T14:40:35Z

+	if len(healthErrors) > 0 {
+		ui.Failf("unhealthy services detected: %s", strings.Join(healthErrors, "; "))
+	}


ui.Failf exits the process instead of returning an error

ui.Failf calls os.Exit(1) unconditionally (confirmed in pkg/ui/printers.go:171). When health errors are found, RunKill exits the process rather than returning an error. This breaks the contract stated in RunKillWithOptions's doc comment ("Returns an error if any service has failed") and makes TestServiceOOMKilledDetectedAtStop_Integration non-functional: the test process exits at this line, so assert.Error(t, err) is never reached and the test always crashes rather than passing.

The CLI path already has ui.ExitOnError("stopping services", err) at the call site, so returning a proper error is sufficient to cause the process to exit with code 1 there.

Suggested change

if len(healthErrors) > 0 {

ui.Failf("unhealthy services detected: %s", strings.Join(healthErrors, "; "))

}

if len(healthErrors) > 0 {

return fmt.Errorf("unhealthy services detected: %s", strings.Join(healthErrors, "; "))

}

Copilot

Pull request overview

Adds post-startup service health validation so workflow service crashes (e.g., OOMKilled) are detected at teardown and cause the overall workflow to fail.

Changes:

Add a restart-count-based health check to the kill command before destroying service resources.
Make the “Stop services” stage non-optional so teardown failures mark executions as failed.
Add integration tests covering OOMKill-after-readiness scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
test/integration/components/toolkit/services_oomkill_test.go	New integration tests for detecting OOMKilled services (monitor vs. stop/kill safety net).
pkg/tcl/testworkflowstcl/testworkflowprocessor/operations.go	Makes the “Stop services” stage non-optional to propagate failures.
cmd/tcl/testworkflow-toolkit/commands/kill.go	Refactors kill logic into `RunKill*` helpers and adds Kubernetes restart-count health checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

courageousillumination

Is restarts the only way we have to look at this? Seems a bit indirect, but if that's all that's available this is fine with me.

@tkonieczny do you think this is a breaking change or fixing a bug? It might cause some tests that were "passing" before to now fail (if they were passing without relying on the services). I think it's probably fine, but I don't want anyone to be confused about the new behavior.

de-wim · 2026-04-28T16:11:25Z

Is restarts the only way we have to look at this? Seems a bit indirect, but if that's all that's available this is fine with me.

I know, it feels a bit... hacky. But I tried a few other things, such as notifications, but those didn't work.

I'm exploring using events now.

…ination reason and current state of pod

de-wim added 3 commits April 23, 2026 17:24

fix: [TKC-5014] check workflow services health at the end of test wor…

1ad960d

…kflow execution

fix: lint: len(conditions) also covers conditions != nil

ef874b2

fix: make the stop services stage non-optional

b40e5ab

otherwise, it wouldn't be able to turn the execution red.

greptile-apps Bot reviewed Apr 23, 2026

View reviewed changes

de-wim added 3 commits April 23, 2026 20:11

fix: address review comments

ca9d572

fix: check restart count for workflow service counters, not notificat…

d71fc9e

…ions

use ui.fail instead of returning an error if service pods restarted.

cc6dc38

de-wim marked this pull request as ready for review April 28, 2026 14:37

Copilot AI review requested due to automatic review settings April 28, 2026 14:37

de-wim requested review from a team as code owners April 28, 2026 14:37

de-wim requested review from povilasv and tkonieczny April 28, 2026 14:37

Copilot started reviewing on behalf of de-wim April 28, 2026 14:37 View session

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Copilot AI reviewed Apr 28, 2026

View reviewed changes

fix: address review comments

5601a8c

courageousillumination approved these changes Apr 28, 2026

View reviewed changes

fix: add waits for OOMKill in integration test

0c31480

chore: replace restart count health check with checking previous term…

a9d17ff

…ination reason and current state of pod

de-wim merged commit 44fd076 into main May 5, 2026
15 of 16 checks passed

de-wim deleted the tkc-5014-service-oomkilled-after-readiness-service-is-marked-as-green branch May 5, 2026 14:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: [TKC-5014] check workflow services health at the end of test workflow execution#7572

fix: [TKC-5014] check workflow services health at the end of test workflow execution#7572
de-wim merged 9 commits intomainfrom
tkc-5014-service-oomkilled-after-readiness-service-is-marked-as-green

de-wim commented Apr 23, 2026 •

edited

Loading

Uh oh!

de-wim commented Apr 23, 2026

Uh oh!

greptile-apps Bot commented Apr 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

greptile-apps Bot Apr 23, 2026

Uh oh!

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

courageousillumination left a comment

Uh oh!

de-wim commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

de-wim commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull request description

Checklist (choose whats happened)

Breaking changes

Changes

Fixes

Uh oh!

de-wim commented Apr 23, 2026

Uh oh!

greptile-apps Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

greptile-apps Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

courageousillumination left a comment

Choose a reason for hiding this comment

Uh oh!

de-wim commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

de-wim commented Apr 23, 2026 •

edited

Loading

greptile-apps Bot commented Apr 23, 2026 •

edited

Loading