Skip to content

fix: [TKC-5014] check workflow services health at the end of test workflow execution#7572

Merged
de-wim merged 9 commits intomainfrom
tkc-5014-service-oomkilled-after-readiness-service-is-marked-as-green
May 5, 2026
Merged

fix: [TKC-5014] check workflow services health at the end of test workflow execution#7572
de-wim merged 9 commits intomainfrom
tkc-5014-service-oomkilled-after-readiness-service-is-marked-as-green

Conversation

@de-wim
Copy link
Copy Markdown
Contributor

@de-wim de-wim commented Apr 23, 2026

Pull request description

This PR adds a health check before killing workflow services, detecting OOMKilled (or otherwise crashed) containers by inspecting pod restart counts. Previously, once a service passed its readiness probe, no further health validation occurred, so post-startup crashes went unnoticed. The stop-services stage is also made non-optional so detected failures properly mark the execution as failed.

Checklist (choose whats happened)

  • breaking change! (describe)
  • tested locally
  • tested on cluster
  • added new dependencies
  • updated the docs
  • added a test

Breaking changes

  • None

Changes

  • Executions are now marked as failures when a workflow service crashes during testing

Fixes

@de-wim
Copy link
Copy Markdown
Contributor Author

de-wim commented Apr 23, 2026

@greptileai please take a look

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR adds a post-startup health check to the workflow service kill phase by inspecting pod restart counts, and makes the "Stop services" stage non-optional so OOMKilled services cause executions to fail. The refactoring into RunKill/RunKillWithOptions is well-structured, but there is a blocking issue:

  • ui.Failf exits the process instead of returning an error (kill.go:188). ui.Failf calls os.Exit(1), so RunKill never returns an error for health check failures. This breaks the function's documented contract and makes TestServiceOOMKilledDetectedAtStop_Integration permanently non-functional — the test process exits before assert.Error(t, err) is reached. The fix is return fmt.Errorf(...) instead; the CLI path already has ui.ExitOnError at the call site to handle the exit.

Confidence Score: 3/5

Not safe to merge — the health-check error path exits the process instead of returning an error, breaking both the API contract and the integration test that validates TKC-5014.

One P1 bug: ui.Failf (os.Exit) in RunKill means health errors can never be returned as errors to callers, making RunKillWithOptions and its integration test non-functional. Score sits at 3/5 (below the 4/5 P1 ceiling) because the broken path is exactly the new safety-net being introduced.

cmd/tcl/testworkflow-toolkit/commands/kill.go — the ui.Failf call at line 188 must be replaced with return fmt.Errorf(...) before merging.

Important Files Changed

Filename Overview
cmd/tcl/testworkflow-toolkit/commands/kill.go Refactors kill logic into RunKill/RunKillWithOptions and adds OOM health check; health error path calls ui.Failf (os.Exit) instead of returning an error, breaking the function contract and the accompanying integration test.
pkg/tcl/testworkflowstcl/testworkflowprocessor/operations.go Removes SetOptional(true) from the Stop services stage so health-check failures propagate and mark executions as failed — straightforward, correct change.
test/integration/components/toolkit/services_oomkill_test.go Adds integration tests for OOM detection; TestServiceOOMKilledDetectedAtStop_Integration calls RunKillWithOptions and asserts on the returned error, but will never pass because ui.Failf exits the process before the error can be returned.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NewKillCmd / RunKillWithOptions] --> B[RunKill]
    B --> C[worker.List service instances]
    C --> D{items found?}
    D -- yes --> E[Resolve namespace from items]
    D -- no --> F[Skip namespace resolution]
    E --> G
    F --> G[For each item: getServiceRestarts]
    G --> H{restarts > 0?}
    H -- yes --> I[Append to healthErrors]
    H -- no --> J[Continue]
    I --> K
    J --> K{len conditions > 0 && machine != nil?}
    K -- yes --> L[Collect logs for matching services]
    K -- no --> M[Skip log collection]
    L --> N
    M --> N[worker.DestroyGroup]
    N --> O{healthErrors?}
    O -- yes --> P["ui.Failf → os.Exit(1) ⚠️\nNEVER returns error"]
    O -- no --> Q[return nil]
    P -.->|"Should be: return fmt.Errorf(...)"|Q
Loading

Reviews (2): Last reviewed commit: "use ui.fail instead of returning an erro..." | Re-trigger Greptile

Comment thread cmd/tcl/testworkflow-toolkit/commands/kill.go
Comment on lines +71 to +76
// KillDependencies holds the dependencies needed for the kill operation.
type KillDependencies struct {
Worker executionworkertypes.Worker
Namespace string
Ref string
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unused exported type (KillDependencies)

KillDependencies is defined and exported here but is never referenced anywhere in the codebase. It appears to be a leftover from a refactoring draft and should be removed.

Comment thread test/integration/components/toolkit/services_oomkill_test.go Outdated
@de-wim de-wim marked this pull request as ready for review April 28, 2026 14:37
Copilot AI review requested due to automatic review settings April 28, 2026 14:37
@de-wim de-wim requested review from a team as code owners April 28, 2026 14:37
@de-wim de-wim requested review from povilasv and tkonieczny April 28, 2026 14:37
Comment on lines +187 to +189
if len(healthErrors) > 0 {
ui.Failf("unhealthy services detected: %s", strings.Join(healthErrors, "; "))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ui.Failf exits the process instead of returning an error

ui.Failf calls os.Exit(1) unconditionally (confirmed in pkg/ui/printers.go:171). When health errors are found, RunKill exits the process rather than returning an error. This breaks the contract stated in RunKillWithOptions's doc comment ("Returns an error if any service has failed") and makes TestServiceOOMKilledDetectedAtStop_Integration non-functional: the test process exits at this line, so assert.Error(t, err) is never reached and the test always crashes rather than passing.

The CLI path already has ui.ExitOnError("stopping services", err) at the call site, so returning a proper error is sufficient to cause the process to exit with code 1 there.

Suggested change
if len(healthErrors) > 0 {
ui.Failf("unhealthy services detected: %s", strings.Join(healthErrors, "; "))
}
if len(healthErrors) > 0 {
return fmt.Errorf("unhealthy services detected: %s", strings.Join(healthErrors, "; "))
}

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds post-startup service health validation so workflow service crashes (e.g., OOMKilled) are detected at teardown and cause the overall workflow to fail.

Changes:

  • Add a restart-count-based health check to the kill command before destroying service resources.
  • Make the “Stop services” stage non-optional so teardown failures mark executions as failed.
  • Add integration tests covering OOMKill-after-readiness scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
test/integration/components/toolkit/services_oomkill_test.go New integration tests for detecting OOMKilled services (monitor vs. stop/kill safety net).
pkg/tcl/testworkflowstcl/testworkflowprocessor/operations.go Makes the “Stop services” stage non-optional to propagate failures.
cmd/tcl/testworkflow-toolkit/commands/kill.go Refactors kill logic into RunKill* helpers and adds Kubernetes restart-count health checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/integration/components/toolkit/services_oomkill_test.go Outdated
Comment thread test/integration/components/toolkit/services_oomkill_test.go
Comment thread cmd/tcl/testworkflow-toolkit/commands/kill.go Outdated
Comment thread cmd/tcl/testworkflow-toolkit/commands/kill.go
Comment thread cmd/tcl/testworkflow-toolkit/commands/kill.go Outdated
Comment thread cmd/tcl/testworkflow-toolkit/commands/kill.go Outdated
Copy link
Copy Markdown
Contributor

@courageousillumination courageousillumination left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is restarts the only way we have to look at this? Seems a bit indirect, but if that's all that's available this is fine with me.

@tkonieczny do you think this is a breaking change or fixing a bug? It might cause some tests that were "passing" before to now fail (if they were passing without relying on the services). I think it's probably fine, but I don't want anyone to be confused about the new behavior.

@de-wim
Copy link
Copy Markdown
Contributor Author

de-wim commented Apr 28, 2026

Is restarts the only way we have to look at this? Seems a bit indirect, but if that's all that's available this is fine with me.

I know, it feels a bit... hacky. But I tried a few other things, such as notifications, but those didn't work.

I'm exploring using events now.

@de-wim de-wim merged commit 44fd076 into main May 5, 2026
15 of 16 checks passed
@de-wim de-wim deleted the tkc-5014-service-oomkilled-after-readiness-service-is-marked-as-green branch May 5, 2026 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants