RedshiftCreateClusterOperator leaks Redshift cluster on failure with partial IAM permissions #61930
SameerMesiah97 asked this question in General
Use setup and teardown tasks to spin up and clean up resources. There is no need to implement such logic per operator.
This issue was converted to a discussion.
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon>=9.21.0rc1
Apache Airflow version
main
Operating System
Debian GNU/Linux 12 (bookworm)
Deployment
Other
Deployment details
No response
What happened
When using RedshiftCreateClusterOperator, a Redshift cluster may be successfully created even when the AWS execution role has only partial Redshift permissions, for example when it lacks redshift:DescribeClusters.

In this scenario, the operator successfully calls create_cluster and the Redshift cluster begins provisioning in AWS. However, subsequent steps, such as waiting for the cluster to become available when wait_for_completion=True, fail due to insufficient permissions. The Airflow task then fails, but the Redshift cluster continues provisioning or remains active in AWS, resulting in leaked infrastructure and ongoing cost.
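For illustration, an identity policy with the permission split that triggers this failure (allow redshift:CreateCluster, explicitly deny redshift:DescribeClusters) could be generated as below. The function name, the Sid values and the use of "Resource": "*" are assumptions made for the sketch; a real execution role may need further permissions depending on how the cluster is configured.

```python
import json


def build_partial_redshift_policy() -> dict:
    """Hypothetical helper: an IAM policy that can create clusters but not describe them."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets create_cluster succeed, so the cluster starts provisioning.
                "Sid": "AllowCreateCluster",
                "Effect": "Allow",
                "Action": "redshift:CreateCluster",
                "Resource": "*",
            },
            {
                # An explicit Deny overrides any Allow, so the availability waiter fails.
                "Sid": "DenyDescribeClusters",
                "Effect": "Deny",
                "Action": "redshift:DescribeClusters",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_partial_redshift_policy(), indent=2))
```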
This can occur, for example, when the execution role allows redshift:CreateCluster but explicitly denies redshift:DescribeClusters, which is required by the waiter used to monitor cluster availability.

What you think should happen instead
If the operator fails after successfully initiating cluster creation (for example, due to missing DescribeClusters or other follow-up permissions), it should make a best-effort attempt to clean up the partially created resource by deleting the cluster.

Cleanup should be attempted opportunistically (i.e. only if the cluster identifier is known and the necessary permissions are available), and a failure to clean up should not mask or replace the original exception.
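One possible shape of this best-effort cleanup, sketched in plain Python against a boto3-style Redshift client (create_cluster, get_waiter("cluster_available"), delete_cluster). The helper create_cluster_with_cleanup and the FakeRedshiftClient stand-in are hypothetical and are not the provider's implementation; deleting with SkipFinalClusterSnapshot=True is an assumption that suits a cluster that never became usable.

```python
import logging

log = logging.getLogger(__name__)


def create_cluster_with_cleanup(client, cluster_identifier, wait_for_completion=True, **create_kwargs):
    """Hypothetical helper: create a cluster and delete it best-effort if a later step fails."""
    # If CreateCluster itself fails, nothing was created, so there is nothing to clean up.
    client.create_cluster(ClusterIdentifier=cluster_identifier, **create_kwargs)
    try:
        if wait_for_completion:
            # The "cluster_available" waiter polls DescribeClusters.
            client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_identifier)
    except Exception:
        # The identifier is known because create_cluster returned, so try to delete.
        try:
            client.delete_cluster(ClusterIdentifier=cluster_identifier, SkipFinalClusterSnapshot=True)
        except Exception:
            # A cleanup failure (e.g. no redshift:DeleteCluster) is only logged...
            log.exception("Best-effort deletion of Redshift cluster %s failed", cluster_identifier)
        # ...and the original error is re-raised unchanged.
        raise


class _DeniedWaiter:
    """Stand-in waiter that fails the way a denied DescribeClusters call would."""

    def wait(self, **kwargs):
        raise PermissionError("AccessDenied: redshift:DescribeClusters")


class FakeRedshiftClient:
    """Offline stand-in for a boto3 Redshift client that records the calls it receives."""

    def __init__(self, allow_delete=True):
        self.allow_delete = allow_delete
        self.calls = []

    def create_cluster(self, **kwargs):
        self.calls.append(("create_cluster", kwargs["ClusterIdentifier"]))
        return {"Cluster": {"ClusterIdentifier": kwargs["ClusterIdentifier"]}}

    def get_waiter(self, name):
        self.calls.append(("get_waiter", name))
        return _DeniedWaiter()

    def delete_cluster(self, **kwargs):
        self.calls.append(("delete_cluster", kwargs["ClusterIdentifier"]))
        if not self.allow_delete:
            raise PermissionError("AccessDenied: redshift:DeleteCluster")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    fake = FakeRedshiftClient(allow_delete=False)
    try:
        create_cluster_with_cleanup(fake, "example-cluster", NodeType="ra3.xlplus")
    except PermissionError as err:
        print("original error preserved:", err)
    print(fake.calls)
```

The nested try/except is what keeps the cleanup opportunistic: whether or not the delete succeeds, the caller sees the DescribeClusters error that actually caused the task to fail.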
How to reproduce
Create an IAM role that allows redshift:CreateCluster but denies redshift:DescribeClusters.
Configure an AWS connection in Airflow using this role. (The connection ID aws_test_conn is used for this reproduction.)
Ensure a valid Redshift cluster subnet group exists (for example, example-subnet-group).
Use the following DAG:
Observed Behaviour
The task fails due to missing redshift:DescribeClusters permissions, but the Redshift cluster is successfully created and remains active in AWS. The cluster is not cleaned up automatically and continues incurring cost.

Anything else
Redshift clusters begin incurring cost immediately once creation starts, even if the cluster never reaches an available state. When post-creation failures occur, leaked clusters can therefore result in unexpected and ongoing cost.

This issue follows a broader pattern across AWS operators where resources are created successfully but not cleaned up when subsequent steps fail. Apache Airflow has been introducing best-effort cleanup behavior to address this class of problems consistently across providers.
Are you willing to submit PR?
Code of Conduct