### Describe the bug
When a DBInstance CR is deleted, the ACK RDS controller removes the Kubernetes finalizer
(`finalizers.rds.services.k8s.aws/DBInstance`) and allows the CR to be garbage collected before
the underlying AWS RDS instance has fully transitioned to the `deleted` state in AWS. As a result,
the DBSubnetGroup that the instance belongs to becomes blocked during its own deletion for an
extended and indeterminate period.

AWS rejects the `DeleteDBSubnetGroup` API call with:

```
InvalidDBSubnetGroupStateFault: The DB subnet group '<name>' cannot be deleted
because it is still in use by DB instance '<instance-name>'
```
However, the Kubernetes CR for `<instance-name>` no longer exists
(`kubectl get dbinstance <name> -n <namespace>` returns NotFound). The ACK RDS controller
has lost all tracking state for the orphaned AWS instance. The DBSubnetGroup CR remains in
a Terminating state — it retries on exponential backoff but is blocked until AWS completes
the asynchronous instance deletion (typically 5–15+ minutes), compounded by controller backoff
delays that can reach 16+ minutes between retries. While it does eventually self-resolve, the
failure mode has several problems:

- There is no K8s CR to wait on for the orphaned instance — the controller cannot proactively
  poll the AWS instance status to know when to retry
- Exponential backoff means the total wall-clock time can far exceed the actual AWS deletion
  time, as retries become increasingly infrequent
- The `ACK.Recoverable: True` condition gives no indication that the dependency is an orphaned
  AWS resource with no corresponding K8s CR — operators have no visibility into the root cause
### Steps to reproduce
- Create an Aurora PostgreSQL cluster with ACK: `DBCluster` + 2× `DBInstance` + `DBSubnetGroup`
- Delete all resources simultaneously (e.g. via namespace deletion or KRO instance deletion)
- The RDS controller initiates deletion: instances first, then subnet group
- Observe that the `DBInstance` CRs are deleted from Kubernetes (finalizer cleared) while AWS is
  still processing the deletion asynchronously
- The `DBSubnetGroup` deletion is attempted, but AWS returns `InvalidDBSubnetGroupStateFault`
- Confirm the instance CR is gone:

  ```
  $ kubectl get dbinstance <instance-name> -n <namespace>
  Error from server (NotFound): dbinstances.rds.services.k8s.aws "<instance-name>" not found
  ```

- The `DBSubnetGroup` remains in `Terminating` for an extended period — it eventually self-resolves
  once AWS completes the async instance deletion, but the total wait time is compounded by
  exponential backoff (observed: 15–30+ minutes total)
### Expected outcome
The ACK RDS controller should not remove the `finalizers.rds.services.k8s.aws/DBInstance`
finalizer from a DBInstance CR until the AWS RDS API confirms the instance is fully
deleted — i.e., `DescribeDBInstances` returns the instance in the `deleted` state or returns
`DBInstanceNotFound`.

This ensures that dependent resources such as DBSubnetGroup can only be deleted after their
AWS-side dependencies are fully gone — not merely after their K8s CRs are removed.
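A minimal sketch of the proposed gate, with hypothetical names (`canRemoveFinalizer`, the state constants) standing in for the actual ACK reconciler code rather than quoting it:

```go
package main

import "fmt"

// instanceState summarizes what DescribeDBInstances reports for the
// instance being deleted. The enum values here are illustrative.
type instanceState int

const (
	stateDeleting instanceState = iota // API still returns the instance as "deleting"
	stateDeleted                       // API returns the instance in the "deleted" state
	stateNotFound                      // API returns a DBInstanceNotFound error
)

// canRemoveFinalizer returns true only once AWS confirms the instance is
// fully gone, so the CR (and its finalizer) outlives the async deletion
// and dependent resources like DBSubnetGroup are never deleted too early.
func canRemoveFinalizer(s instanceState) bool {
	return s == stateDeleted || s == stateNotFound
}

func main() {
	fmt.Println(canRemoveFinalizer(stateDeleting)) // keep finalizer, requeue
	fmt.Println(canRemoveFinalizer(stateNotFound)) // safe to remove finalizer
}
```

With this gate, a reconcile that sees `stateDeleting` would keep the finalizer and requeue, so the DBInstance CR remains visible (and pollable) until AWS finishes.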
### Observed conditions
Stuck DBSubnetGroup conditions (from `kubectl get dbsubnetgroup <name> -n <ns> -o jsonpath='{.status.conditions}'`):

```json
[
  {
    "lastTransitionTime": "2026-03-05T05:57:31Z",
    "message": "InvalidDBSubnetGroupStateFault: The DB subnet group 'spoke1-dev-subnet-group' cannot be deleted because it is still in use by DB instance 'spoke1-dev-instance-2'",
    "reason": "InvalidDBSubnetGroupStateFault",
    "status": "True",
    "type": "ACK.Recoverable"
  },
  {
    "lastTransitionTime": "2026-03-05T05:57:31Z",
    "message": "Resource synced successfully",
    "reason": "",
    "status": "False",
    "type": "ACK.ResourceSynced"
  },
  {
    "lastTransitionTime": "2026-03-05T06:27:31Z",
    "message": "Selected",
    "reason": "roleARN: arn:aws:iam::123456789012:role/spoke1-spoke-role, selectorName: spoke1-ack-role",
    "status": "True",
    "type": "ACK.IAMRoleSelected"
  }
]
```

DBInstance CR state at the time of the DBSubnetGroup failure:

```
$ kubectl get dbinstance spoke1-dev-instance-2 -n spoke1
Error from server (NotFound): dbinstances.rds.services.k8s.aws "spoke1-dev-instance-2" not found
```

The AWS RDS instance still existed in AWS at this point (deletion was in progress), but the
Kubernetes CR had already been removed with its finalizer cleared.
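Until the controller surfaces the root cause, an operator-side check can at least flag this pattern. The helper below (names are illustrative, not part of any ACK tooling) scans conditions JSON like the `jsonpath` output above for the fault:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Condition mirrors the relevant fields of an ACK status condition.
type Condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

// blockedBySubnetGroupFault reports whether the conditions show an
// ACK.Recoverable=True condition caused by InvalidDBSubnetGroupStateFault,
// i.e. the orphaned-instance pattern described in this issue.
func blockedBySubnetGroupFault(raw []byte) (bool, error) {
	var conds []Condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return false, err
	}
	for _, c := range conds {
		if c.Type == "ACK.Recoverable" && c.Status == "True" &&
			strings.Contains(c.Message, "InvalidDBSubnetGroupStateFault") {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	raw := []byte(`[{"type":"ACK.Recoverable","status":"True",` +
		`"message":"InvalidDBSubnetGroupStateFault: still in use"}]`)
	blocked, _ := blockedBySubnetGroupFault(raw)
	fmt.Println(blocked)
}
```

Piping the same `jsonpath` query into a check like this makes the stuck-on-orphaned-instance state visible instead of being buried in a generic `ACK.Recoverable` condition.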
### Environment
- Kubernetes version: v1.35.0-eks-3a10415
- Using EKS: Yes, EKS v1.35
- AWS service targeted: RDS (Aurora PostgreSQL)
- ACK RDS controller version: `public.ecr.aws/aws-controllers-k8s/rds-controller:1.7.6`
- Other ACK controllers involved: `ec2-controller:1.9.2`, `eks-controller:1.11.1`,
  `iam-controller:1.6.1`, `kms-controller:1.2.1`, `s3-controller:1.3.1`,
  `secretsmanager-controller:1.2.1`
- KRO version: v0.8.5 (kubernetes-sigs/kro)
- Aurora engine: `aurora-postgresql`, version 14.19
- Aurora instance class: `db.r6g.large`
- Cluster topology: 1 `DBCluster` + 2 `DBInstance` + 1 `DBSubnetGroup`
- Deletion trigger: Kubernetes namespace deletion (all ACK CRs deleted simultaneously
  by KRO `kro.run/finalizer` cleanup)
- CARM configuration: `featureGates.IAMRoleSelector: "true"`, `enableCARM: false`
- ResourceGraphDefinition file: `awsgen3infra1flat-rg.yaml`