Commit 9524b56

Add observability (#37716)
* Add Observability Metrics
* Update script
* update readme
* fix readme
* update readme
* update port
* Update examples/terraform/envoy-ratelimiter/deploy.sh

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* fix gemini review

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 67c3183 commit 9524b56

5 files changed

+177
-66
lines changed

examples/terraform/envoy-ratelimiter/README.md

Lines changed: 48 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ Example Beam Java Pipelines using it:
3838
- **Cloud NAT (Prerequisite)**: Allows private nodes to pull Docker images.
3939
- **Envoy Rate Limit Service**: A stateless Go/gRPC service that handles rate limit logic.
4040
- **Redis**: Stores the rate limit counters.
41-
- **StatsD Exporter**: Sidecar container that converts StatsD metrics to Prometheus format, exposed on port `9102`.
41+
- **Prometheus Metrics**: Exposes Prometheus metrics on port `9090`. These metrics are exported to Google Cloud Monitoring.
4242
- **Internal Load Balancer**: A Google Cloud TCP Load Balancer exposing the Rate Limit service internally within the VPC.
4343

4444
## Prerequisites:
@@ -82,7 +82,7 @@ cluster_name = "ratelimit-cluster" # Name of the GKE cluster
8282
deletion_protection = true # Prevent accidental cluster deletion (set "true" for prod)
8383
control_plane_cidr = "172.16.0.0/28" # CIDR for GKE control plane (must not overlap with subnet)
8484
namespace = "envoy-ratelimiter" # Kubernetes namespace for deployment
85-
enable_metrics = false # Deploy statsd-exporter sidecar
85+
enable_metrics = true # Enable metrics export to Google Cloud Monitoring
8686
ratelimit_replicas = 1 # Initial number of Rate Limit pods
8787
min_replicas = 1 # Minimum HPA replicas
8888
max_replicas = 5 # Maximum HPA replicas
@@ -110,25 +110,34 @@ EOF
110110
```
111111
112112
# Deploy Envoy Rate Limiter:
113-
1. Initialize Terraform to download providers and modules:
113+
114+
1. **Deploy Script (Recommended)**:
115+
Run the helper script to handle the deployment process automatically:
114116
```bash
115-
terraform init
117+
./deploy.sh
116118
```
119+
The script prints the IP address of the load balancer once the deployment is complete.
117120

118-
2. Plan and apply the changes:
121+
2. **Deploy (Manual Alternative)**:
122+
If you prefer running Terraform manually, you can use the following commands:
119123
```bash
120-
terraform plan -out=tfplan
121-
terraform apply tfplan
124+
# Step 1: Initialize Terraform
125+
terraform init
126+
127+
# Step 2: Create Cluster
128+
terraform apply -target=time_sleep.wait_for_cluster
129+
130+
# Step 3: Create Resources
131+
terraform apply
122132
```
123133

124-
3. Connect to the service:
125134
After deployment, get the **Internal** IP address:
126135
```bash
127136
terraform output load_balancer_ip
128137
```
129138
The service is accessible **only from within the VPC** (e.g., via Dataflow workers or GCE instances in the same network) at `<INTERNAL_IP>:8081`.
130139

131-
4. **Test with Dataflow Workflow**:
140+
3. **Test with Dataflow Workflow**:
132141
Verify connectivity and rate limiting logic by running the example Dataflow pipeline.
133142

134143
```bash
@@ -150,11 +159,40 @@ The service is accessible **only from within the VPC** (e.g., via Dataflow worke
150159
```
151160

152161

162+
# Observability & Metrics:
163+
This module supports exporting native Prometheus metrics to **Google Cloud Monitoring**.
164+
165+
`enable_metrics` is set to `true` by default.
166+
167+
### Sample Metrics
168+
| Metric Name | Description |
169+
| :--- | :--- |
170+
| `ratelimit_service_rate_limit_total_hits` | Total rate limit requests received. |
171+
| `ratelimit_service_rate_limit_over_limit` | Requests that exceeded the limit (HTTP 429). |
172+
| `ratelimit_service_rate_limit_near_limit` | Requests that are approaching the limit. |
173+
| `ratelimit_service_call_should_rate_limit` | Total valid gRPC calls to the service. |
174+
175+
*Note: You will also see many other Go runtime metrics (`go_*`) and Redis client metrics (`redis_*`).*
176+
177+
### Viewing in Google Cloud Console
178+
1. Go to **Monitoring** > **Metrics Explorer**.
179+
2. Click **Select a metric**.
180+
3. Search for `ratelimit` and select **Prometheus Target** > **ratelimit**.
181+
4. Select a metric (e.g., `ratelimit_service_rate_limit_over_limit`) and click **Apply**.
182+
5. Use **Filters** to drill down by `domain`, `key`, and `value` (e.g., `key=database`, `value=users`).
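
Before the metrics appear in Metrics Explorer, it can help to confirm that a pod is actually serving them on port `9090`. The following is a minimal sketch and is not part of this commit: `filter_ratelimit_metrics` is a hypothetical helper name, and the commented usage assumes `kubectl` is configured for the cluster, the default `envoy-ratelimiter` namespace, pods labeled `app=ratelimit` (the label the PodMonitoring selector in `ratelimit.tf` matches), and series names that match the Sample Metrics table.

```bash
#!/bin/bash
# Hypothetical helper: keep only rate limit service samples from
# Prometheus text-format input on stdin, dropping "# HELP"/"# TYPE"
# comments and the go_* / redis_* series.
filter_ratelimit_metrics() {
  # grep exits 1 when nothing matches; "|| true" keeps "set -e" scripts alive.
  grep -E '^ratelimit_service_' || true
}

# Usage sketch (assumes a reachable cluster):
#   POD=$(kubectl -n envoy-ratelimiter get pods -l app=ratelimit -o name | head -n 1)
#   kubectl -n envoy-ratelimiter port-forward "$POD" 9090:9090 &
#   curl -s http://localhost:9090/metrics | filter_ratelimit_metrics
```

If the filter prints nothing while `go_*` series are present, the service may simply not have received any rate limit requests yet.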
183+
153184
# Clean up resources:
154185
To destroy the cluster and all created resources:
186+
187+
```bash
188+
./deploy.sh destroy
189+
```
190+
191+
Alternatively:
155192
```bash
156193
terraform destroy
157194
```
195+
158196
*Note: If `deletion_protection` was enabled, you must set it to `false` in `terraform.tfvars` before destroying.*
159197

160198
# Variables description:
@@ -169,7 +207,7 @@ terraform destroy
169207
|control_plane_cidr |CIDR block for GKE control plane |172.16.0.0/28 |
170208
|cluster_name |Name of the GKE cluster |ratelimit-cluster |
171209
|namespace |Kubernetes namespace to deploy resources into |envoy-ratelimiter |
172-
|enable_metrics |Deploy statsd-exporter sidecar |false |
210+
|enable_metrics |Enable metrics export to Google Cloud Monitoring |true |
173211
|deletion_protection |Prevent accidental cluster deletion |false |
174212
|ratelimit_replicas |Initial number of Rate Limit pods |1 |
175213
|min_replicas |Minimum HPA replicas |1 |

examples/terraform/envoy-ratelimiter/deploy.sh

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
#!/bin/bash
2+
#
3+
# Licensed to the Apache Software Foundation (ASF) under one or more
4+
# contributor license agreements. See the NOTICE file distributed with
5+
# this work for additional information regarding copyright ownership.
6+
# The ASF licenses this file to You under the Apache License, Version 2.0
7+
# (the "License"); you may not use this file except in compliance with
8+
# the License. You may obtain a copy of the License at
9+
#
10+
# http://www.apache.org/licenses/LICENSE-2.0
11+
#
12+
# Unless required by applicable law or agreed to in writing, software
13+
# distributed under the License is distributed on an "AS IS" BASIS,
14+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
# See the License for the specific language governing permissions and
16+
# limitations under the License.
17+
#
18+
19+
# This script deploys the Envoy Rate Limiter on GKE.
20+
21+
set -e
22+
23+
COMMAND=${1:-"apply"}
24+
25+
# 1. Initialize Terraform
26+
if [ ! -d ".terraform" ]; then
27+
echo "Initializing Terraform..."
28+
terraform init
29+
else
30+
# Verify terraform initialization is valid, or re-initialize
31+
terraform init -upgrade=false >/dev/null 2>&1 || terraform init
32+
fi
33+
34+
if [ "$COMMAND" = "destroy" ]; then
35+
echo "Destroying Envoy Rate Limiter Resources..."
36+
terraform destroy -auto-approve
37+
exit $?
38+
fi
39+
40+
if [ "$COMMAND" = "apply" ]; then
41+
echo "Deploying Envoy Rate Limiter..."
42+
43+
echo "--------------------------------------------------"
44+
echo "Creating/Updating GKE Cluster..."
45+
echo "--------------------------------------------------"
46+
# Deploy the cluster and wait for it to be ready.
47+
terraform apply -target=time_sleep.wait_for_cluster -auto-approve
48+
49+
echo ""
50+
echo "--------------------------------------------------"
51+
echo "Deploying Application Resources..."
52+
echo "--------------------------------------------------"
53+
# Deploy the rest of the resources
54+
terraform apply -auto-approve
55+
56+
echo ""
57+
echo "Deployment Complete!"
58+
echo "Cluster Name: $(terraform output -raw cluster_name)"
59+
echo "Load Balancer IP: $(terraform output -raw load_balancer_ip)"
60+
exit 0
61+
fi
62+
63+
echo "Usage:"
64+
echo " ./deploy.sh [apply] # Initialize and deploy resources (Default)"
65+
echo " ./deploy.sh destroy # Destroy resources"
66+
exit 1
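
As written, the script runs `terraform init` before it checks `COMMAND`, so an unrecognized argument still triggers initialization before the usage text is printed. One possible variation, sketched here on the assumption that failing fast is preferred, validates the argument first. `validate_command` is a hypothetical name, not something this commit defines.

```bash
#!/bin/bash
# Hypothetical helper: succeed only for the two commands deploy.sh handles.
validate_command() {
  case "$1" in
    apply|destroy) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch: place near the top of deploy.sh, before "terraform init".
#   COMMAND=${1:-"apply"}
#   if ! validate_command "$COMMAND"; then
#     echo "Usage: ./deploy.sh [apply|destroy]"
#     exit 1
#   fi
```

A `case` statement keeps the accepted commands in one place, which may be easier to extend than the separate `if` blocks.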

examples/terraform/envoy-ratelimiter/prerequisites.tf

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ resource "google_project_service" "required" {
2121
"container",
2222
"iam",
2323
"compute",
24+
"monitoring",
2425
])
2526

2627
service = "${each.key}.googleapis.com"

examples/terraform/envoy-ratelimiter/ratelimit.tf

Lines changed: 60 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -158,11 +158,36 @@ resource "kubernetes_deployment" "ratelimit" {
158158
port {
159159
container_port = 6070
160160
}
161+
dynamic "port" {
162+
for_each = var.enable_metrics ? [1] : []
163+
content {
164+
name = "metrics"
165+
container_port = 9090
166+
}
167+
}
161168

162169
env {
163-
name = "USE_STATSD"
170+
name = "USE_PROMETHEUS"
164171
value = var.enable_metrics ? "true" : "false"
165172
}
173+
dynamic "env" {
174+
for_each = var.enable_metrics ? [1] : []
175+
content {
176+
name = "PROMETHEUS_ADDR"
177+
value = ":9090"
178+
}
179+
}
180+
dynamic "env" {
181+
for_each = var.enable_metrics ? [1] : []
182+
content {
183+
name = "PROMETHEUS_PATH"
184+
value = "/metrics"
185+
}
186+
}
187+
env {
188+
name = "USE_STATSD"
189+
value = "false"
190+
}
166191
env {
167192
name = "DISABLE_STATS"
168193
value = var.enable_metrics ? "false" : "true"
@@ -203,14 +228,6 @@ resource "kubernetes_deployment" "ratelimit" {
203228
name = "CONFIG_TYPE"
204229
value = "FILE"
205230
}
206-
env {
207-
name = "STATSD_HOST"
208-
value = "localhost"
209-
}
210-
env {
211-
name = "STATSD_PORT"
212-
value = "9125"
213-
}
214231
env {
215232
name = "GRPC_MAX_CONNECTION_AGE"
216233
value = var.ratelimit_grpc_max_connection_age
@@ -231,41 +248,7 @@ resource "kubernetes_deployment" "ratelimit" {
231248
}
232249
}
233250

234-
dynamic "container" {
235-
for_each = var.enable_metrics ? [1] : []
236-
content {
237-
name = "statsd-exporter"
238-
image = var.statsd_exporter_image
239-
args = ["--log.format=json"]
240-
241-
dynamic "port" {
242-
for_each = var.enable_metrics ? [1] : []
243-
content {
244-
name = "metrics"
245-
container_port = 9102
246-
}
247-
}
248-
dynamic "port" {
249-
for_each = var.enable_metrics ? [1] : []
250-
content {
251-
name = "statsd-udp"
252-
container_port = 9125
253-
protocol = "UDP"
254-
}
255-
}
256-
# statsd-exporter does not use much resources, so setting resources to the minimum
257-
resources {
258-
requests = {
259-
cpu = "50m"
260-
memory = "64Mi"
261-
}
262-
limits = {
263-
cpu = "100m"
264-
memory = "128Mi"
265-
}
266-
}
267-
}
268-
}
251+
269252

270253
volume {
271254
name = "config-volume"
@@ -361,8 +344,8 @@ resource "kubernetes_service" "ratelimit" {
361344
for_each = var.enable_metrics ? [1] : []
362345
content {
363346
name = "metrics"
364-
port = 9102
365-
target_port = 9102
347+
port = 9090
348+
target_port = 9090
366349
}
367350
}
368351
}
@@ -398,15 +381,38 @@ resource "kubernetes_service" "ratelimit_external" {
398381
port = 6070
399382
target_port = 6070
400383
}
401-
dynamic "port" {
402-
for_each = var.enable_metrics ? [1] : []
403-
content {
404-
name = "metrics"
405-
port = 9102
406-
target_port = 9102
407-
}
408-
}
384+
409385
}
410386

411387
depends_on = [kubernetes_namespace.ratelimit_namespace]
412388
}
389+
390+
# Pod Monitoring
391+
resource "kubernetes_manifest" "ratelimit_pod_monitoring" {
392+
manifest = {
393+
apiVersion = "monitoring.googleapis.com/v1"
394+
kind = "PodMonitoring"
395+
metadata = {
396+
name = "ratelimit-monitoring"
397+
namespace = var.namespace
398+
}
399+
spec = {
400+
selector = {
401+
matchLabels = {
402+
app = "ratelimit"
403+
}
404+
}
405+
endpoints = [
406+
{
407+
port = "metrics"
408+
path = "/metrics"
409+
interval = "15s"
410+
}
411+
]
412+
}
413+
}
414+
depends_on = [
415+
kubernetes_deployment.ratelimit,
416+
time_sleep.wait_for_cluster
417+
]
418+
}

examples/terraform/envoy-ratelimiter/variables.tf

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -183,7 +183,7 @@ variable "namespace" {
183183
}
184184

185185
variable "enable_metrics" {
186-
description = "Whether to deploy the statsd-exporter sidecar for Prometheus metrics"
186+
description = "Enable metrics export to Google Cloud Monitoring"
187187
type = bool
188-
default = false
188+
default = true
189189
}
