Conversation
otherwise, it wouldn't be able to turn the execution red.
|
@greptileai please take a look |
Greptile SummaryThis PR adds a post-startup health check to the workflow service kill phase by inspecting pod restart counts, and makes the "Stop services" stage non-optional so OOMKilled services cause executions to fail. The refactoring into
Confidence Score: 3/5Not safe to merge — the health-check error path exits the process instead of returning an error, breaking both the API contract and the integration test that validates TKC-5014. One P1 bug: cmd/tcl/testworkflow-toolkit/commands/kill.go — the Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[NewKillCmd / RunKillWithOptions] --> B[RunKill]
B --> C[worker.List service instances]
C --> D{items found?}
D -- yes --> E[Resolve namespace from items]
D -- no --> F[Skip namespace resolution]
E --> G
F --> G[For each item: getServiceRestarts]
G --> H{restarts > 0?}
H -- yes --> I[Append to healthErrors]
H -- no --> J[Continue]
I --> K
J --> K{len conditions > 0 && machine != nil?}
K -- yes --> L[Collect logs for matching services]
K -- no --> M[Skip log collection]
L --> N
M --> N[worker.DestroyGroup]
N --> O{healthErrors?}
O -- yes --> P["ui.Failf → os.Exit(1) ⚠️\nNEVER returns error"]
O -- no --> Q[return nil]
P -.->|"Should be: return fmt.Errorf(...)"|Q
Reviews (2): Last reviewed commit: "use ui.fail instead of returning an erro..." | Re-trigger Greptile |
| // KillDependencies holds the dependencies needed for the kill operation. | ||
| type KillDependencies struct { | ||
| Worker executionworkertypes.Worker | ||
| Namespace string | ||
| Ref string | ||
| } |
| if len(healthErrors) > 0 { | ||
| ui.Failf("unhealthy services detected: %s", strings.Join(healthErrors, "; ")) | ||
| } |
There was a problem hiding this comment.
ui.Failf exits the process instead of returning an error
ui.Failf calls os.Exit(1) unconditionally (confirmed in pkg/ui/printers.go:171). When health errors are found, RunKill exits the process rather than returning an error. This breaks the contract stated in RunKillWithOptions's doc comment ("Returns an error if any service has failed") and makes TestServiceOOMKilledDetectedAtStop_Integration non-functional: the test process exits at this line, so assert.Error(t, err) is never reached and the test always crashes rather than passing.
The CLI path already has ui.ExitOnError("stopping services", err) at the call site, so returning a proper error is sufficient to cause the process to exit with code 1 there.
| if len(healthErrors) > 0 { | |
| ui.Failf("unhealthy services detected: %s", strings.Join(healthErrors, "; ")) | |
| } | |
| if len(healthErrors) > 0 { | |
| return fmt.Errorf("unhealthy services detected: %s", strings.Join(healthErrors, "; ")) | |
| } |
There was a problem hiding this comment.
Pull request overview
Adds post-startup service health validation so workflow service crashes (e.g., OOMKilled) are detected at teardown and cause the overall workflow to fail.
Changes:
- Add a restart-count-based health check to the
killcommand before destroying service resources. - Make the “Stop services” stage non-optional so teardown failures mark executions as failed.
- Add integration tests covering OOMKill-after-readiness scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/integration/components/toolkit/services_oomkill_test.go | New integration tests for detecting OOMKilled services (monitor vs. stop/kill safety net). |
| pkg/tcl/testworkflowstcl/testworkflowprocessor/operations.go | Makes the “Stop services” stage non-optional to propagate failures. |
| cmd/tcl/testworkflow-toolkit/commands/kill.go | Refactors kill logic into RunKill* helpers and adds Kubernetes restart-count health checks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
courageousillumination
left a comment
There was a problem hiding this comment.
Is restarts the only way we have to look at this? Seems a bit indirect, but if that's all that's available this is fine with me.
@tkonieczny do you think this is a breaking change or fixing a bug? It might cause some tests that were "passing" before to now fail (if they were passing without relying on the services). I think it's probably fine, but I don't want anyone to be confused about the new behavior.
I know, it feels a bit... hacky. But I tried a few other things, such as notifications, but those didn't work. I'm exploring using events now. |
…ination reason and current state of pod
Pull request description
This PR adds a health check before killing workflow services, detecting OOMKilled (or otherwise crashed) containers by inspecting pod restart counts. Previously, once a service passed its readiness probe, no further health validation occurred, so post-startup crashes went unnoticed. The stop-services stage is also made non-optional so detected failures properly mark the execution as failed.
Checklist (choose whats happened)
Breaking changes
Changes
Fixes