Fix TektonInstallerSet deadlock when resources have deletionTimestamp #3217

Open
jkhelil wants to merge 1 commit into tektoncd:main from jkhelil:fix_deadlock

@jkhelil
Member

@jkhelil jkhelil commented Feb 14, 2026

Changes

  • Adds an installerSetName parameter to ensureResources() and updates all callers
  • Checks OwnerReferences to determine whether a resource belongs to this InstallerSet
  • Waits only for terminating resources that this InstallerSet owns and skips the others
  • Lets reconciliation continue even when CRDs are in the Terminating state
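
The ownership check in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `OwnerReference` and `Resource` are simplified stand-ins for the Kubernetes metadata types, `isOwnedBy` and `shouldWait` are hypothetical helper names, and the InstallerSet names are made up.

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for the Kubernetes owner
// reference type (assumption: only Kind and Name matter for this check).
type OwnerReference struct {
	Kind string
	Name string
}

// Resource is a hypothetical, reduced view of a live cluster object.
type Resource struct {
	Name            string
	Deleting        bool // true when metadata.deletionTimestamp is set
	OwnerReferences []OwnerReference
}

// isOwnedBy reports whether res lists the named TektonInstallerSet as an owner.
func isOwnedBy(res Resource, installerSetName string) bool {
	for _, ref := range res.OwnerReferences {
		if ref.Kind == "TektonInstallerSet" && ref.Name == installerSetName {
			return true
		}
	}
	return false
}

// shouldWait reports whether reconciliation should pause for res: only a
// terminating resource owned by this InstallerSet blocks progress.
func shouldWait(res Resource, installerSetName string) bool {
	return res.Deleting && isOwnedBy(res, installerSetName)
}

func main() {
	crd := Resource{
		Name:     "tasks.tekton.dev",
		Deleting: true,
		OwnerReferences: []OwnerReference{
			{Kind: "TektonInstallerSet", Name: "old-installer-set"},
		},
	}
	// The CRD is terminating but belongs to a different InstallerSet.
	fmt.Println("wait for new-installer-set:", shouldWait(crd, "new-installer-set"))
	fmt.Println("wait for old-installer-set:", shouldWait(crd, "old-installer-set"))
}
```

Keeping the predicate separate from the reconcile loop makes the skip decision easy to unit test without a cluster.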

Fixes #2474

The operator enters a deadlock when any resource (e.g., CRD) has a deletionTimestamp during InstallerSet reconciliation. The current code immediately aborts the entire reconciliation phase, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created.

Symptoms:

  • No ServiceAccounts created in openshift-pipelines namespace
  • InstallerSets stuck with "reconcile again and proceed"
  • No component pods running (Deployments fail: serviceaccount not found)
  • Webhooks unavailable → TektonConfig can't reconcile
  • Operator logs show infinite CRD fetching loop

Impact: Complete operator failure during installations, upgrades, downgrades, or recovery operations.

Root Cause

Location: pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168
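
The code at that location is not reproduced here. As a rough model of the behavior described above, the sketch below contrasts a pass that aborts on the first terminating resource with one that skips terminating resources it does not own. All names (`resource`, `ensureStrict`, `ensureOwnedOnly`) are hypothetical, and the real reconciler works in phases rather than over one flat list.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical, simplified view of one manifest entry.
type resource struct {
	kind        string
	name        string
	terminating bool // the live object has a deletionTimestamp
	owned       bool // the live object is owned by the InstallerSet being reconciled
}

// errReconcileAgain mirrors the message reported in the symptoms.
var errReconcileAgain = errors.New("reconcile again and proceed")

// ensureStrict models the described root cause: the first terminating
// resource aborts the pass, so nothing after it is applied.
func ensureStrict(resources []resource) ([]string, error) {
	var applied []string
	for _, r := range resources {
		if r.terminating {
			return applied, errReconcileAgain
		}
		applied = append(applied, r.kind+"/"+r.name)
	}
	return applied, nil
}

// ensureOwnedOnly models the proposed behavior: wait only when the
// terminating resource is owned by this InstallerSet, otherwise skip it.
func ensureOwnedOnly(resources []resource) ([]string, error) {
	var applied []string
	for _, r := range resources {
		if r.terminating {
			if r.owned {
				return applied, errReconcileAgain
			}
			continue // being deleted by another owner: skip and move on
		}
		applied = append(applied, r.kind+"/"+r.name)
	}
	return applied, nil
}

func main() {
	manifest := []resource{
		{kind: "CustomResourceDefinition", name: "tasks.tekton.dev", terminating: true},
		{kind: "ServiceAccount", name: "tekton-pipelines-controller"},
	}
	applied, err := ensureStrict(manifest)
	fmt.Println("strict:", applied, err)
	applied, err = ensureOwnedOnly(manifest)
	fmt.Println("owned-only:", applied, err)
}
```

In the strict variant the ServiceAccount is never reached while the CRD stays in Terminating, which matches the "serviceaccount not found" symptom listed above.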

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Release Notes

NONE

@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label Feb 14, 2026
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from jkhelil after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 14, 2026
@jkhelil
Member Author

jkhelil commented Feb 17, 2026

@vdemeester @anithapriyanatarajan PTAL

@jkhelil jkhelil closed this Feb 17, 2026
@jkhelil jkhelil reopened this Feb 17, 2026
@anithapriyanatarajan
Contributor

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 17, 2026
@anithapriyanatarajan
Contributor

@jkhelil - could you help with steps to reproduce the issue?

}

// Resource is being deleted by another controller/InstallerSet, skip it
ressourceLogger.Debug("resource is being deleted by another owner, skipping")
Contributor

Would it be ok to log the finalizer name as well and Deletion time stamp here?
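
One way the suggested log fields could be assembled is sketched below. It only builds the message text with the standard library and does not use the operator's logger; `terminatingFields` and `exampleDeletedAt` are hypothetical names, and the finalizer value is an example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// exampleDeletedAt is a made-up deletion timestamp used only for the demo.
var exampleDeletedAt = time.Date(2026, time.February, 18, 10, 0, 0, 0, time.UTC)

// terminatingFields formats the details a reader needs to see why a
// resource is stuck: its name, deletion timestamp, and pending finalizers.
func terminatingFields(name string, finalizers []string, deletedAt time.Time) string {
	return fmt.Sprintf("resource=%s deletionTimestamp=%s finalizers=[%s]",
		name,
		deletedAt.UTC().Format(time.RFC3339),
		strings.Join(finalizers, ","))
}

func main() {
	// Example finalizer as typically seen on a CRD that still has instances.
	finalizers := []string{"customresourcecleanup.apiextensions.k8s.io"}
	msg := terminatingFields("tasks.tekton.dev", finalizers, exampleDeletedAt)
	fmt.Println("resource is being deleted by another owner, skipping:", msg)
}
```

With a structured logger the same three values would more naturally be attached as key/value fields than concatenated into the message.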

@jkhelil
Member Author

jkhelil commented Feb 18, 2026

@jkhelil - could you help with steps to reproduce the issue?
It's a somewhat strange error that we've been carrying across several versions. You may remember all the issues people encountered during upgrades.

What happens:
Some TaskRuns or PipelineRuns have finalizers. A user deletes TektonConfig to uninstall the operator, either because they notice a problem during an upgrade or because teams are experiencing etcd or infrastructure issues.
When TektonConfig is deleted, the CRDs are deleted as well, but their deletion gets stuck because of the finalizers on the TaskRuns. The user doesn't see this because they delete TektonConfig and don't monitor the remaining resources. TektonConfig appears as deleted.
The user then tries to reinstall or upgrade using a new CSV, but it doesn't work. The installer set is stuck in a loop: it finds a deletion timestamp on the CRDs and requeues the reconciliation, over and over. It never proceeds to the next step (installing the service accounts, RBAC, etc.).
The user ends up seeing an error related to the webhook service account, but in reality there is nothing wrong with the webhook itself. The installer set is simply stuck reconciling and never creates the service account.

To reproduce: install once, run some workloads, check that the finalizers are present, delete TektonConfig, then reinstall or upgrade.

Development

Successfully merging this pull request may close these issues.

Detect *stale* tekton objects prior to a upgrade / reconciliaton

3 participants