
DBInstance K8s CR is removed before AWS instance finishes deleting causing DBSubnetGroup to be permanently stuck on deletion #2799

@jayadeyemi

Description


Describe the bug

When a DBInstance CR is deleted, the ACK RDS controller removes the Kubernetes finalizer
(finalizers.rds.services.k8s.aws/DBInstance) and allows the CR to be garbage collected before
the underlying AWS RDS instance has fully transitioned to the deleted state in AWS. As a
result, deletion of the DBSubnetGroup that the instance belongs to is blocked for an extended
and indeterminate period.

AWS rejects the DeleteDBSubnetGroup API call with:

InvalidDBSubnetGroupStateFault: The DB subnet group '<name>' cannot be deleted
because it is still in use by DB instance '<instance-name>'

However, the Kubernetes CR for <instance-name> no longer exists
(kubectl get dbinstance <name> -n <namespace> returns NotFound), so the ACK RDS controller
has lost all tracking state for the orphaned AWS instance. The DBSubnetGroup CR remains in
a Terminating state — it retries with exponential backoff but stays blocked until AWS completes
the asynchronous instance deletion (typically 5–15+ minutes), compounded by controller backoff
delays that can reach 16+ minutes between retries. While it does eventually self-resolve, this
failure mode has several problems:

  1. There is no K8s CR to wait on for the orphaned instance — the controller cannot proactively
    poll the AWS instance status to know when to retry
  2. Exponential backoff means the total wall-clock time can far exceed the actual AWS deletion
    time, as retries become increasingly infrequent
  3. The ACK.Recoverable: True condition gives no indication that the dependency is an orphaned
    AWS resource with no corresponding K8s CR — operators have no visibility into the root cause
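The backoff arithmetic behind point 2 can be sketched as follows. This is an illustrative Python model, not the controller's actual code: it assumes the client-go default per-item exponential rate limiter (base 5 ms, doubling per failure, capped at 1000 s ≈ 16.7 min, which is consistent with the 16+ minute gaps observed), and the function names are hypothetical.

```python
# Sketch of client-go's default ItemExponentialFailureRateLimiter behavior:
# delay = base * 2^retries, capped. Exact ACK/controller-runtime settings may
# differ; this only illustrates why a ~10-minute AWS deletion can cost far
# more wall-clock time once retries become infrequent.

BASE_S = 0.005   # 5 ms base delay
CAP_S = 1000.0   # 1000 s cap (~16.7 min), matching the observed 16+ min gaps

def delay(retry: int) -> float:
    """Requeue delay before retry number `retry` (0-based)."""
    return min(BASE_S * 2 ** retry, CAP_S)

def wall_clock_until_success(aws_deletion_s: float) -> float:
    """Elapsed time until a retry first lands after AWS finishes deleting."""
    t = 0.0
    retry = 0
    while t < aws_deletion_s:
        t += delay(retry)
        retry += 1
    return t

# By retry ~18 the gap between attempts has hit the 1000 s cap:
print(delay(20))                             # 1000.0

# A 600 s (10-minute) AWS-side deletion: the succeeding retry lands at
# ~655 s because the final backoff gap alone is ~328 s.
print(round(wall_clock_until_success(600)))  # 655
```

Deletions that outlast the doubling phase fare worse: once gaps reach the cap, each miss adds another ~16.7 minutes of waiting.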

Steps to reproduce

  1. Create an Aurora PostgreSQL cluster with ACK: DBCluster + 2× DBInstance + DBSubnetGroup
  2. Delete all resources simultaneously (e.g. via namespace deletion or KRO instance deletion)
  3. The RDS controller initiates deletion: instances first, then subnet group
  4. Observe that the DBInstance CRs are deleted from Kubernetes (finalizer cleared) while AWS
    is still processing the deletion asynchronously
  5. The DBSubnetGroup deletion is attempted but AWS returns InvalidDBSubnetGroupStateFault
  6. Confirm the instance CR is gone:
    $ kubectl get dbinstance <instance-name> -n <namespace>
    Error from server (NotFound): dbinstances.rds.services.k8s.aws "<instance-name>" not found
    
  7. DBSubnetGroup remains in Terminating for an extended period — it eventually self-resolves
    once AWS completes the async instance deletion, but total wait time is compounded by
    exponential backoff (observed: 15–30+ minutes total)

Expected outcome

The ACK RDS controller should not remove the finalizers.rds.services.k8s.aws/DBInstance
finalizer from a DBInstance CR until the AWS RDS API confirms the instance is fully
deleted — i.e., until DescribeDBInstances either no longer lists the instance or returns
a DBInstanceNotFound error.

This ensures that dependent resources such as DBSubnetGroup can only be deleted after their
AWS-side dependencies are fully gone — not merely after their K8s CRs have been removed.
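A minimal sketch of this expected delete path, assuming a poll-until-gone loop. This is illustrative Python with a stubbed RDS client — FakeRDS, reconcile_delete, and DBInstanceNotFound are hypothetical names, not the controller's actual Go implementation:

```python
# Hypothetical model of the desired behavior: the finalizer is retained
# (blocking CR garbage collection) until the AWS API confirms the instance
# no longer exists.

class DBInstanceNotFound(Exception):
    pass

class FakeRDS:
    """Stub RDS API: the instance reports 'deleting' for a few polls."""
    def __init__(self, polls_until_gone: int):
        self.polls_until_gone = polls_until_gone

    def describe_db_instance(self, name: str) -> str:
        if self.polls_until_gone <= 0:
            raise DBInstanceNotFound(name)
        self.polls_until_gone -= 1
        return "deleting"

FINALIZER = "finalizers.rds.services.k8s.aws/DBInstance"

def reconcile_delete(cr: dict, rds: FakeRDS) -> str:
    """One reconcile pass for a CR that has a deletion timestamp."""
    try:
        status = rds.describe_db_instance(cr["name"])
    except DBInstanceNotFound:
        cr["finalizers"].remove(FINALIZER)  # safe: AWS-side resource is gone
        return "finalizer removed"
    return f"requeue (instance still {status!r})"

cr = {"name": "spoke1-dev-instance-2", "finalizers": [FINALIZER]}
rds = FakeRDS(polls_until_gone=3)
while FINALIZER in cr["finalizers"]:
    print(reconcile_delete(cr, rds))
# prints three "requeue (instance still 'deleting')" lines,
# then "finalizer removed"
```

Because the CR survives until the terminal API answer, the controller keeps a resource to poll, and the DBSubnetGroup's dependency check can observe an accurate picture of what still exists in AWS.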


Observed conditions

Stuck DBSubnetGroup conditions (from kubectl get dbsubnetgroup <name> -n <ns> -o jsonpath='{.status.conditions}'):

[
  {
    "lastTransitionTime": "2026-03-05T05:57:31Z",
    "message": "InvalidDBSubnetGroupStateFault: The DB subnet group 'spoke1-dev-subnet-group' cannot be deleted because it is still in use by DB instance 'spoke1-dev-instance-2'",
    "reason": "InvalidDBSubnetGroupStateFault",
    "status": "True",
    "type": "ACK.Recoverable"
  },
  {
    "lastTransitionTime": "2026-03-05T05:57:31Z",
    "message": "Resource synced successfully",
    "reason": "",
    "status": "False",
    "type": "ACK.ResourceSynced"
  },
  {
    "lastTransitionTime": "2026-03-05T06:27:31Z",
    "message": "Selected",
    "reason": "roleARN: arn:aws:iam::123456789012:role/spoke1-spoke-role, selectorName: spoke1-ack-role",
    "status": "True",
    "type": "ACK.IAMRoleSelected"
  }
]

DBInstance CR state at time of DBSubnetGroup failure:

$ kubectl get dbinstance spoke1-dev-instance-2 -n spoke1
Error from server (NotFound): dbinstances.rds.services.k8s.aws "spoke1-dev-instance-2" not found

The AWS RDS instance still existed in AWS at this point (deletion was in progress) but the
Kubernetes CR had already been removed with its finalizer cleared.


Environment

  • Kubernetes version: v1.35.0-eks-3a10415
  • Using EKS: Yes, EKS v1.35
  • AWS service targeted: RDS (Aurora PostgreSQL)
  • ACK RDS controller version: public.ecr.aws/aws-controllers-k8s/rds-controller:1.7.6
  • Other ACK controllers involved:
    • ec2-controller:1.9.2
    • eks-controller:1.11.1
    • iam-controller:1.6.1
    • kms-controller:1.2.1
    • s3-controller:1.3.1
    • secretsmanager-controller:1.2.1
  • KRO version: v0.8.5 (kubernetes-sigs/kro)
  • Aurora engine: aurora-postgresql, version 14.19
  • Aurora instance class: db.r6g.large
  • Cluster topology: 1 DBCluster + 2 DBInstance + 1 DBSubnetGroup
  • Deletion trigger: Kubernetes namespace deletion (all ACK CRs deleted simultaneously
    by KRO kro.run/finalizer cleanup)
  • CARM configuration: featureGates.IAMRoleSelector: "true", enableCARM: false
  • ResourceGraphDefinition file: awsgen3infra1flat-rg.yaml


Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    service/rds: Indicates issues or PRs that are related to rds-controller.
