
DBInstance K8s CR is removed before AWS instance finishes deleting causing DBSubnetGroup to be permanently stuck on deletion #2799

@jayadeyemi

Description


Describe the bug

When a DBInstance CR is deleted, the ACK RDS controller removes the Kubernetes finalizer
(finalizers.rds.services.k8s.aws/DBInstance) and allows the CR to be garbage collected before
the underlying AWS RDS instance has fully transitioned to the deleted state in AWS. As a
result, deletion of the DBSubnetGroup that the instance belongs to is blocked for an extended
and indeterminate period.

AWS rejects the DeleteDBSubnetGroup API call with:

InvalidDBSubnetGroupStateFault: The DB subnet group '<name>' cannot be deleted
because it is still in use by DB instance '<instance-name>'

However, the Kubernetes CR for <instance-name> no longer exists
(kubectl get dbinstance <name> -n <namespace> returns NotFound), so the ACK RDS controller
has lost all tracking state for the orphaned AWS instance. The DBSubnetGroup CR remains in
a Terminating state — it retries with exponential backoff but stays blocked until AWS completes
the asynchronous instance deletion (typically 5–15+ minutes), compounded by controller backoff
delays that can reach 16+ minutes between retries. While it does eventually self-resolve, this
failure mode has several problems:

  1. There is no K8s CR to wait on for the orphaned instance — the controller cannot proactively
    poll the AWS instance status to know when to retry
  2. Exponential backoff means the total wall-clock time can far exceed the actual AWS deletion
    time, as retries become increasingly infrequent
  3. The ACK.Recoverable: True condition gives no indication that the dependency is an orphaned
    AWS resource with no corresponding K8s CR — operators have no visibility into the root cause
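The backoff arithmetic behind point 2 can be sketched as follows. This is an illustrative Python model, not the controller's actual code: it assumes the client-go default per-item exponential rate limiter (base 5 ms, doubling per failure, capped at 1000 s ≈ 16.7 min, which is consistent with the 16+ minute gaps observed), and the function names are hypothetical.

```python
# Sketch of client-go's default ItemExponentialFailureRateLimiter behavior:
# delay = base * 2^retries, capped. Exact ACK/controller-runtime settings may
# differ; this only illustrates why a ~10-minute AWS deletion can cost far
# more wall-clock time once retries become infrequent.

BASE_S = 0.005   # 5 ms base delay
CAP_S = 1000.0   # 1000 s cap (~16.7 min), matching the observed 16+ min gaps

def delay(retry: int) -> float:
    """Requeue delay before retry number `retry` (0-based)."""
    return min(BASE_S * 2 ** retry, CAP_S)

def wall_clock_until_success(aws_deletion_s: float) -> float:
    """Elapsed time until a retry first lands after AWS finishes deleting."""
    t = 0.0
    retry = 0
    while t < aws_deletion_s:
        t += delay(retry)
        retry += 1
    return t

# By retry ~18 the gap between attempts has hit the 1000 s cap:
print(delay(20))                             # 1000.0

# A 600 s (10-minute) AWS-side deletion: the succeeding retry lands at
# ~655 s because the final backoff gap alone is ~328 s.
print(round(wall_clock_until_success(600)))  # 655
```

Deletions that outlast the doubling phase fare worse: once gaps reach the cap, each miss adds another ~16.7 minutes of waiting.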

Steps to reproduce

  1. Create an Aurora PostgreSQL cluster with ACK: DBCluster + 2× DBInstance + DBSubnetGroup
  2. Delete all resources simultaneously (e.g. via namespace deletion or KRO instance deletion)
  3. The RDS controller initiates deletion: instances first, then subnet group
  4. Observe that the DBInstance CRs are deleted from Kubernetes (finalizer cleared) while AWS
    is still processing the deletion asynchronously
  5. The DBSubnetGroup deletion is attempted but AWS returns InvalidDBSubnetGroupStateFault
  6. Confirm the instance CR is gone:
    $ kubectl get dbinstance <instance-name> -n <namespace>
    Error from server (NotFound): dbinstances.rds.services.k8s.aws "<instance-name>" not found
    
  7. DBSubnetGroup remains in Terminating for an extended period — it eventually self-resolves
    once AWS completes the async instance deletion, but total wait time is compounded by
    exponential backoff (observed: 15–30+ minutes total)

Expected outcome

The ACK RDS controller should not remove the finalizers.rds.services.k8s.aws/DBInstance
finalizer from a DBInstance CR until the AWS RDS API confirms the instance is fully
deleted — i.e., until DescribeDBInstances either no longer lists the instance or returns
a DBInstanceNotFound error.

This ensures that dependent resources such as DBSubnetGroup can only be deleted after their
AWS-side dependencies are fully gone — not merely after their K8s CRs have been removed.
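A minimal sketch of this expected delete path, assuming a poll-until-gone loop. This is illustrative Python with a stubbed RDS client — FakeRDS, reconcile_delete, and DBInstanceNotFound are hypothetical names, not the controller's actual Go implementation:

```python
# Hypothetical model of the desired behavior: the finalizer is retained
# (blocking CR garbage collection) until the AWS API confirms the instance
# no longer exists.

class DBInstanceNotFound(Exception):
    pass

class FakeRDS:
    """Stub RDS API: the instance reports 'deleting' for a few polls."""
    def __init__(self, polls_until_gone: int):
        self.polls_until_gone = polls_until_gone

    def describe_db_instance(self, name: str) -> str:
        if self.polls_until_gone <= 0:
            raise DBInstanceNotFound(name)
        self.polls_until_gone -= 1
        return "deleting"

FINALIZER = "finalizers.rds.services.k8s.aws/DBInstance"

def reconcile_delete(cr: dict, rds: FakeRDS) -> str:
    """One reconcile pass for a CR that has a deletion timestamp."""
    try:
        status = rds.describe_db_instance(cr["name"])
    except DBInstanceNotFound:
        cr["finalizers"].remove(FINALIZER)  # safe: AWS-side resource is gone
        return "finalizer removed"
    return f"requeue (instance still {status!r})"

cr = {"name": "spoke1-dev-instance-2", "finalizers": [FINALIZER]}
rds = FakeRDS(polls_until_gone=3)
while FINALIZER in cr["finalizers"]:
    print(reconcile_delete(cr, rds))
# prints three "requeue (instance still 'deleting')" lines,
# then "finalizer removed"
```

Because the CR survives until the terminal API answer, the controller keeps a resource to poll, and the DBSubnetGroup's dependency check can observe an accurate picture of what still exists in AWS.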


Observed conditions

Stuck DBSubnetGroup conditions (from kubectl get dbsubnetgroup <name> -n <ns> -o jsonpath='{.status.conditions}'):

[
  {
    "lastTransitionTime": "2026-03-05T05:57:31Z",
    "message": "InvalidDBSubnetGroupStateFault: The DB subnet group 'spoke1-dev-subnet-group' cannot be deleted because it is still in use by DB instance 'spoke1-dev-instance-2'",
    "reason": "InvalidDBSubnetGroupStateFault",
    "status": "True",
    "type": "ACK.Recoverable"
  },
  {
    "lastTransitionTime": "2026-03-05T05:57:31Z",
    "message": "Resource synced successfully",
    "reason": "",
    "status": "False",
    "type": "ACK.ResourceSynced"
  },
  {
    "lastTransitionTime": "2026-03-05T06:27:31Z",
    "message": "Selected",
    "reason": "roleARN: arn:aws:iam::123456789012:role/spoke1-spoke-role, selectorName: spoke1-ack-role",
    "status": "True",
    "type": "ACK.IAMRoleSelected"
  }
]

DBInstance CR state at time of DBSubnetGroup failure:

$ kubectl get dbinstance spoke1-dev-instance-2 -n spoke1
Error from server (NotFound): dbinstances.rds.services.k8s.aws "spoke1-dev-instance-2" not found

The AWS RDS instance still existed in AWS at this point (deletion was in progress) but the
Kubernetes CR had already been removed with its finalizer cleared.


Environment

  • Kubernetes version: v1.35.0-eks-3a10415
  • Using EKS: Yes, EKS v1.35
  • AWS service targeted: RDS (Aurora PostgreSQL)
  • ACK RDS controller version: public.ecr.aws/aws-controllers-k8s/rds-controller:1.7.6
  • Other ACK controllers involved:
    • ec2-controller:1.9.2
    • eks-controller:1.11.1
    • iam-controller:1.6.1
    • kms-controller:1.2.1
    • s3-controller:1.3.1
    • secretsmanager-controller:1.2.1
  • KRO version: v0.8.5 (kubernetes-sigs/kro)
  • Aurora engine: aurora-postgresql, version 14.19
  • Aurora instance class: db.r6g.large
  • Cluster topology: 1 DBCluster + 2 DBInstance + 1 DBSubnetGroup
  • Deletion trigger: Kubernetes namespace deletion (all ACK CRs deleted simultaneously
    by KRO kro.run/finalizer cleanup)
  • CARM configuration: featureGates.IAMRoleSelector: "true", enableCARM: false
  • ResourceGraphDefinition file: awsgen3infra1flat-rg.yaml


Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    service/rds: Indicates issues or PRs that are related to rds-controller.
