Skip to content

fix: fix resource leaks — response bodies, signals, connections, file descriptors, data structures#7094

Open
Copilot wants to merge 8 commits intomainfrom
copilot/check-resource-leaks
Open

fix: fix resource leaks — response bodies, signals, connections, file descriptors, data structures#7094
Copilot wants to merge 8 commits intomainfrom
copilot/check-resource-leaks

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 27, 2026

Systematic audit and fix of resource leaks across the codebase: unclosed HTTP response bodies, unregistered signal channels, goroutine leaks, TCP connection leaks, file descriptor leaks, and data structures that grow without bound.

Pull request description

HTTP response body leaks

  • pkg/runner/executionlogswriter.gores.Body never closed after httpClient.Do() in Save(); now drains body on non-200 via io.Copy(io.Discard, ...) to allow HTTP connection reuse
  • pkg/cloud/data/artifact/artifacts_storage.gorsp.Body not closed on non-200 status in getObject(); now reads up to 4KB of error body via io.LimitReader and includes it in the error message for better diagnostics
  • pkg/api/v1/client/direct_client.goresp.Body leaked in baseExec() on 4xx/5xx; callers early-return before defer resp.Body.Close() is set
  • pkg/controlplaneclient/execution.go — now drains response body on non-200 via io.Copy(io.Discard, ...) before returning to enable TCP connection reuse

signal.Notify without signal.Stop

Added defer signal.Stop(ch) in all seven call sites:
pkg/mcp/server.go, cmd/api-server/commons/commons.go, cmd/testworkflow-toolkit/commands/root.go, cmd/kubectl-testkube/commands/root.go, cmd/kubectl-testkube/commands/dashboard.go, cmd/tcl/kubectl-testkube/devbox/command.go

Goroutine leaks in signal handlers

cmd/testworkflow-init/runner/runner.go and cmd/tcl/kubectl-testkube/devbox/command.go had goroutines blocked on <-sigCh with no ctx.Done() exit path — leaked whenever the outer function returned normally without a signal. Changed to select { case <-sigCh: … case <-ctx.Done(): }. Also added defer ctxCancel() in devbox command.go to ensure the context is always cancelled on function exit, allowing the signal-wait goroutine to terminate.

External connection leaks

  • cmd/testworkflow-init/control/client.goClose() only called c.cancel() (cancels dial context, does not close an already-established TCP connection); added _ = c.conn.Close()
  • pkg/tmp/files.goos.CreateTemp FD leaked; added _ = tmpfile.Close() before returning the path
  • cmd/tcl/kubectl-testkube/devbox/devutils/crdfssync.go — bare file.Close() at end of loadFile(); converted to defer

Data structure leaks

  • pkg/cache/inmem.goInMemoryCache entries stored by Set() but never re-fetched via Get() accumulated in the sync.Map forever (no active eviction). Added flush() triggered every 10 Set() calls using an atomic counter; uses collect-then-delete to avoid mutating the map during Range:
    if c.setCount.Add(1)%flushInterval == 0 {
        c.flush()
    }
  • pkg/testworkflows/testworkflowexecutor/testworkflowfetcher.gocontainsSameMap had inverted boolean logic: return true fired on key mismatch instead of match. The prefetchedLabels dedup guard never worked, appending duplicates on every call. Fixed with a local equal flag and break-on-mismatch.

Checklist (choose whats happened)

  • breaking change! (describe)
  • tested locally
  • tested on cluster
  • added new dependencies
  • updated the docs
  • added a test

Breaking changes

Changes

  • Closed HTTP response bodies in all error and success paths
  • Drain HTTP response bodies on non-200 error paths to enable TCP connection reuse
  • Read and include error body content (up to 4KB) in presigned URL error messages for better diagnostics
  • Registered signal.Stop for every signal.Notify call site
  • Fixed goroutine signal-handler pattern to exit on ctx.Done()
  • Added defer ctxCancel() in devbox command to ensure context cleanup on normal exit
  • control.Client.Close() now closes the underlying TCP connection
  • tmp.Name() closes the temp file descriptor before returning
  • crdfssync.loadFile uses defer file.Close()
  • InMemoryCache.Set triggers periodic sweep of expired entries
  • containsSameMap logic corrected to return true on match

Fixes


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
Copilot AI changed the title [WIP] Check code for resource leaks fix: close HTTP response bodies to prevent resource leaks Feb 27, 2026
@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Feb 27, 2026

@greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 27, 2026

Greptile Summary

This PR systematically addresses resource leaks across the codebase, covering HTTP response bodies, signal handlers, goroutines, TCP connections, file descriptors, and data structures. The changes are well-focused and follow Go best practices for resource cleanup.

Key improvements:

  • HTTP response bodies: Added defer Body.Close() and drains error responses to enable TCP connection reuse
  • Signal handling: Registered signal.Stop() for all signal.Notify() calls to prevent channel leaks
  • Goroutine leaks: Fixed signal handler pattern with select and ctx.Done() case to allow graceful termination
  • Connection cleanup: TCP connections and file descriptors now properly closed
  • Data structure fixes: InMemoryCache now has periodic cleanup, containsSameMap logic corrected

All fixes are correct and follow Go idioms. No issues found.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • All changes are focused resource leak fixes following Go best practices. The fixes are minimal, well-tested patterns (defer, signal.Stop, context cancellation), and address real memory/connection leaks without changing business logic.
  • No files require special attention

Important Files Changed

Filename Overview
pkg/cache/inmem.go Added periodic flush mechanism to prevent unbounded memory growth by removing expired entries every 10 Set operations
pkg/testworkflows/testworkflowexecutor/testworkflowfetcher.go Fixed inverted boolean logic in containsSameMap that was causing duplicate entries to accumulate
pkg/runner/executionlogswriter.go Added defer res.Body.Close() and drains response body on error to enable HTTP connection reuse
pkg/cloud/data/artifact/artifacts_storage.go Added response body reading (up to 4KB) and closing on error, includes error body in diagnostics
pkg/api/v1/client/direct_client.go Fixed response body leak in error path and added file descriptor cleanup in GetFile
cmd/testworkflow-init/runner/runner.go Fixed goroutine leak by adding ctx.Done() case to signal handler and defer signal.Stop()
cmd/tcl/kubectl-testkube/devbox/command.go Fixed goroutine leak with ctx.Done() case, added defer ctxCancel() and defer signal.Stop()
cmd/testworkflow-init/control/client.go Added TCP connection close in Close() method to prevent connection leak

Last reviewed commit: adf422a

Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
Copilot AI changed the title fix: close HTTP response bodies to prevent resource leaks fix: fix resource leaks (tickers, file descriptors, HTTP response bodies) Feb 28, 2026
…eak in RunInit)

Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
Copilot AI changed the title fix: fix resource leaks (tickers, file descriptors, HTTP response bodies) fix: fix memory leaks (signal.Notify without signal.Stop, goroutine leak) Feb 28, 2026
Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
Copilot AI changed the title fix: fix memory leaks (signal.Notify without signal.Stop, goroutine leak) fix: fix resource leaks (response bodies, signal channels, goroutines) Feb 28, 2026
@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Feb 28, 2026

@greptile

@vsukhin vsukhin marked this pull request as ready for review February 28, 2026 08:48
Copilot AI review requested due to automatic review settings February 28, 2026 08:48
@vsukhin vsukhin requested a review from a team as a code owner February 28, 2026 08:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets resource-leak fixes across the Go codebase, primarily around properly releasing OS/HTTP resources and cleaning up signal/ticker handling to avoid leaked goroutines and blocked channels during shutdown.

Changes:

  • Close leaked resources (files, HTTP response bodies) and stop tickers to prevent FD/goroutine leaks.
  • Add signal.Stop(...) to signal.Notify(...) call sites and adjust signal-handling goroutines to respect ctx.Done().
  • Improve file download write-paths by ensuring created files are always closed (including close-error propagation in some cases).

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/utilization/core/writers.go Ensure opened file is closed in Print() to avoid FD leaks.
pkg/triggers/scraper.go Stop time.Ticker on exit to avoid ticker resource leak.
pkg/runner/executionlogswriter.go Close HTTP response body after presigned PUT request.
pkg/mcp/server.go Stop signal notifications for graceful shutdown handlers.
pkg/event/emitter.go Stop tickers in long-running loops to avoid ticker leaks.
pkg/controlplaneclient/execution.go Close HTTP response body after presigned PUT request.
pkg/cloud/data/artifact/artifacts_storage.go Close response body on non-200 in getObject().
pkg/cloud/client/rest.go Close HTTP response bodies in REST client methods.
pkg/cloud/client/agents.go Close HTTP response body in GetSecretKey().
pkg/api/v1/client/direct_client.go Close body on error path in baseExec(); ensure created file is closed in GetFile().
pkg/api/v1/client/cloud_client.go Ensure created file is closed (with close-error propagation) in GetFile().
cmd/testworkflow-toolkit/commands/root.go Add signal.Stop(stopSignal) to signal handler goroutine.
cmd/testworkflow-init/runner/runner.go Add signal.Stop and make signal goroutine exit on ctx.Done().
cmd/tcl/kubectl-testkube/devbox/devutils/binarystorage.go Close HTTP response bodies on additional paths in upload logic.
cmd/tcl/kubectl-testkube/devbox/command.go Add signal.Stop and make signal goroutine exit on ctx.Done().
cmd/tcl/devbox-binary-storage/main.go Close opened file in HTTP GET handler.
cmd/kubectl-testkube/commands/testworkflowtemplates/create.go Close opened input file for --file mode.
cmd/kubectl-testkube/commands/testworkflows/create.go Close opened input file for --file mode.
cmd/kubectl-testkube/commands/root.go Add signal.Stop(stopSignal) to signal handler goroutine.
cmd/kubectl-testkube/commands/pro/login.go Close HTTP response bodies after reading public-info.
cmd/kubectl-testkube/commands/dashboard.go Stop signal notifications for dashboard port-forward stop channel.
cmd/api-server/commons/commons.go Add signal.Stop(stopSignal) in HandleCancelSignal.
Comments suppressed due to low confidence (3)

cmd/api-server/commons/commons.go:75

  • defer signal.Stop(stopSignal) runs when this handler returns (including immediately after the first signal), which disables further signal delivery. That prevents the nested goroutine waiting for a second signal from ever unblocking, so the intended forced os.Exit(137) on repeated signals won’t work. Consider stopping notifications only after shutdown completes (e.g., defer signal.Stop at the call site that owns the lifecycle).
		defer signal.Stop(stopSignal)
		select {
		case <-ctx.Done():
			return nil
		case sig := <-stopSignal:

cmd/testworkflow-toolkit/commands/root.go:57

  • defer signal.Stop(stopSignal) is inside the errgroup goroutine. When the first signal is received, this goroutine returns and signal.Stop runs immediately, which prevents the nested “second signal => os.Exit(137)” goroutine from ever receiving a second signal (it will block forever on <-stopSignal). Move signal.Stop(stopSignal) to a defer in Execute() (outside the goroutine), or otherwise ensure notifications stay active until the process exits/shutdown completes.
		defer signal.Stop(stopSignal)
		select {
		case <-ctx.Done():
			return nil
		case sig := <-stopSignal:

cmd/kubectl-testkube/commands/root.go:218

  • defer signal.Stop(stopSignal) is executed when this errgroup goroutine returns (e.g., right after receiving the first signal). That stops further signal delivery and breaks the intended “second SIGINT/SIGTERM forces os.Exit(137)” behavior (the inner goroutine will block on <-stopSignal). Prefer deferring signal.Stop(stopSignal) in Execute() itself, after the command finishes, so signal delivery remains active during shutdown.
		defer signal.Stop(stopSignal)
		select {
		case <-ctx.Done():
			return nil
		case sig := <-stopSignal:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 73 to +82
ctx, ctxCancel := context.WithCancel(context.Background())
stopSignal := make(chan os.Signal, 1)
signal.Notify(stopSignal, syscall.SIGINT, syscall.SIGTERM)
defer signal.Stop(stopSignal)
go func() {
<-stopSignal
ctxCancel()
select {
case <-stopSignal:
ctxCancel()
case <-ctx.Done():
}
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctxCancel is only invoked on a signal, so if the command exits normally the context is never canceled and the signal-wait goroutine can remain blocked (even with the new select { ... case <-ctx.Done() }). Add defer ctxCancel() right after context.WithCancel(...) so the goroutine can always exit and any context-bound work is properly canceled.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added defer ctxCancel() right after context.WithCancel(...) in adf422a.

return errors.Wrap(err, "failed to save file in the object storage")
}
defer res.Body.Close()
if res.StatusCode != http.StatusOK {
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On non-200 responses this returns without draining the response body. Closing without reading to EOF prevents HTTP connection reuse in Go’s net/http transport. Consider reading/discarding the body (and optionally including a small response snippet in the error) before returning.

Suggested change
if res.StatusCode != http.StatusOK {
if res.StatusCode != http.StatusOK {
// Drain the response body to allow HTTP connection reuse.
_, _ = io.Copy(io.Discard, res.Body)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added io.Copy(io.Discard, res.Body) before the error return in adf422a.

return errors.Wrap(err, "failed to save file in cloud storage")
}
defer res.Body.Close()
if res.StatusCode != http.StatusOK {
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On non-200 responses this returns after defer res.Body.Close() but without draining the body. In Go net/http, not reading to EOF typically prevents the underlying TCP connection from being reused. Consider draining/reading the response body before returning the status error.

Suggested change
if res.StatusCode != http.StatusOK {
if res.StatusCode != http.StatusOK {
// Drain response body to allow the underlying TCP connection to be reused.
_, _ = io.Copy(io.Discard, res.Body)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added io.Copy(io.Discard, res.Body) before the error return in adf422a.

Comment on lines 172 to +173
if rsp.StatusCode != http.StatusOK {
_ = rsp.Body.Close()
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When StatusCode != 200, the body is closed without being drained/read. In Go’s net/http this can inhibit connection reuse and also discards potentially useful error details from the response. Consider reading/discarding the body (possibly with a reasonable size limit) before closing and returning the error.

Suggested change
if rsp.StatusCode != http.StatusOK {
_ = rsp.Body.Close()
if rsp.StatusCode != http.StatusOK {
const maxErrBodyBytes = 4 * 1024
bodyBytes, _ := io.ReadAll(io.LimitReader(rsp.Body, maxErrBodyBytes))
_ = rsp.Body.Close()
bodyMsg := strings.TrimSpace(string(bodyBytes))
if bodyMsg != "" {
return nil, errors.Errorf("error getting file from presigned url: expected 200 OK response code, got %d: %s", rsp.StatusCode, bodyMsg)
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now reads up to 4KB of the error body via io.LimitReader and includes it in the error message when non-empty. Fixed in adf422a.

…on-deferred file close)

Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Feb 28, 2026

@greptile

…SameMap inverted logic)

Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
Copilot AI changed the title fix: fix resource leaks (response bodies, signal channels, goroutines) fix: fix resource leaks — response bodies, signals, connections, file descriptors, data structures Feb 28, 2026
@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Feb 28, 2026

@greptile

@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Feb 28, 2026

@copilot please recheck all above with Opus model

…add defer ctxCancel()

Co-authored-by: vsukhin <5984962+vsukhin@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Feb 28, 2026

@copilot please recheck all above with Opus model

Rechecked all review comments and addressed each one in adf422a:

  • cmd/tcl/kubectl-testkube/devbox/command.go: added defer ctxCancel() after context.WithCancel() so the signal-wait goroutine can always exit
  • pkg/runner/executionlogswriter.go: now drains response body on non-200 via io.Copy(io.Discard, ...) before returning error, enabling HTTP connection reuse
  • pkg/controlplaneclient/execution.go: same drain-before-return pattern added
  • pkg/cloud/data/artifact/artifacts_storage.go: now reads up to 4KB of error body via io.LimitReader and includes the content in the error message for better diagnostics

@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Feb 28, 2026

@greptile

@olensmar
Copy link
Copy Markdown
Member

olensmar commented Mar 6, 2026

@vsukhin ready to merge?

@vsukhin
Copy link
Copy Markdown
Collaborator

vsukhin commented Mar 6, 2026

@vsukhin ready to merge?

No, some of them look safe, but other need to be carefully checked

Copy link
Copy Markdown
Contributor

@mgazza mgazza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — comprehensive resource leak sweep. All fixes are correct. Notable: the containsSameMap fix is a genuine logic bug (was returning true on mismatch instead of continuing). InMemoryCache flush, ticker stops, signal stops, resp.Body closes, and file handle closes are all solid. Good companion to cloud-api#3136.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants