prombench malformed PR runs after Exhaustion errors (or other start errors) #936

@bwplotka

We can't run more than 3 benchmarks. When we try to run more, some node pools fail with an exhaustion error.

This causes prometheus_start to never finish: it keeps waiting for the node pool.

prometheus_stop then never finishes either:

waiting for nodepools to be deleted
infra gke nodes check-deleted -a *** \
	-v ZONE:europe-west3-a -v GKE_PROJECT_ID:macro-mile-203600 \
	-v EKS_WORKER_ROLE_ARN: -v EKS_CLUSTER_ROLE_ARN: \
	-v EKS_SUBNET_IDS: -v SEPARATOR: \
	-v CLUSTER_NAME:test-infra -v PR_NUMBER:18000 \
	-f ./manifests/prombench/nodes_gke.yaml
11:35:06 gke.go:517: nodepool running name: prometheus-18000
make: *** [Makefile:120: all_nodes_deleted] Error 1

There are no timeouts on either of those jobs (I see it has been running for 2h just fine): https://github.com/prometheus/prometheus/actions/runs/21665627212/job/62460234151

We need to make it robust so that:

  • We have some eventual timeouts
  • Running cancel while start is still running cancels the start
  • restart == start, ideally
  • We don't need to manually remove the prometheus-xyz node pool
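
For the "eventual timeouts" point, a minimal sketch of what a bounded wait could look like at the shell/Makefile level. This is not how the `infra` tool or the Makefile currently work; `wait_until` is a hypothetical helper, and the timeout and interval values are placeholders.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the deadline passes.
# Returns 0 on success and 124 on timeout (the exit code GNU timeout uses).
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  while :; do
    # Try the check first, so a zero timeout still gets one attempt.
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      echo "wait_until: timed out after ${wu_timeout}s: $*" >&2
      return 124
    fi
    sleep "$wu_interval"
  done
}
```

Something like `wait_until 1800 30 <node pool check>` would then fail the job after roughly 30 minutes instead of hanging, assuming the wrapped check exits non-zero while the node pool is still running. Independently of that, GitHub Actions supports a job-level `timeout-minutes` setting, which might be the simplest backstop for both jobs.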
