A complete, production‑grade guide for deploying Kubernetes on AWS using kubeadm, including Backup + Disaster Recovery (DR) best practices and AWS Private Subnet Networking.
This guide helps you:
- 🎯 Deploy a Kubernetes cluster using kubeadm on AWS
- 🔐 Back up critical Kubernetes components
- 🔄 Restore and recover the cluster during a failure
- 🌐 Configure secure private subnet networking
- 🛡️ Troubleshoot common AWS networking issues
- 🐧 Ubuntu 20.04 or later
- 🧑‍💻 `sudo` privileges
- 🌐 Internet access (via NAT Gateway for private subnets)
- 💻 EC2 instance type: t2.medium or higher (t4g.medium for ARM/Graviton)
- 🛡 All nodes in the same Security Group with proper rules (see Security section)
- 🌐 Private subnet deployment with NAT Gateway for outbound traffic
- 🧩 Custom ENI with a static private IP, created and attached to the Master
- 🔓 Security Group rules configured for VPC internal communication
Critical: All nodes must be in the same Security Group with these inbound rules:
| Port/Protocol | Service | Source | Purpose |
|---|---|---|---|
| 22 | SSH | <ADMIN_IP>/32 | Secure admin access 🔥 |
| 6443 | kube-apiserver | <VPC_CIDR> | API server access |
| 10250 | kubelet | <VPC_CIDR> | Pod exec/logs |
| 179 | Calico BGP | <VPC_CIDR> | Calico routing |
| All TCP | Node-to-Node | <VPC_CIDR> | Internal communication |
| All UDP | Node-to-Node | <VPC_CIDR> | Internal communication |
```bash
# Example AWS CLI commands to configure the Security Group
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
# Replace <ADMIN_IP> with your public IP (e.g., 203.0.113.25)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxxx \
  --ip-permissions \
    IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges='[{CidrIp=<ADMIN_IP>/32}]' \
    IpProtocol=tcp,FromPort=6443,ToPort=6443,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=10250,ToPort=10250,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=179,ToPort=179,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=udp,FromPort=0,ToPort=65535,IpRanges='[{CidrIp=<VPC_CIDR>}]'
```
⚠️ Security Warning: NEVER open port 6443 to 0.0.0.0/0 (the public internet)
⚠️ SSH Access: Restrict SSH (port 22) to your admin IP only
| Component | Subnet Type | Security |
|---|---|---|
| Master Node (API @ <ENI_PRIVATE_IP> / ens6) | Private | VPC only |
| Worker Node(s) | Private | Internal communication |
| Calico CNI | Pod network 192.168.0.0/16 | Fully working |
| CoreDNS | Private | Resolves cluster.local |
| NAT Gateway/Instance | Public → Private | Outbound only ✔ |
| Future Nginx LB EC2 | Public | Port 80/443 to world |
| CloudFlare DNS | External | Points to Nginx LB EIP |
Before creating the Master node:
- 🧩 Create an ENI in the private subnet
- 🔐 Assign a static private IP (e.g., <ENI_PRIVATE_IP>)
- 🔗 Attach the ENI to the Master as a secondary network interface
- ▶ Use the ENI private IP for `kubeadm init`

🔹 Prevents IP/certificate conflicts during DR
📝 Example: If your VPC is 10.0.0.0/16 and your subnet is 10.0.1.0/24, you might use 10.0.1.160 as your ENI IP
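The ENI steps above can be sketched with the AWS CLI. This is a minimal sketch, not a definitive script: the subnet, security group, and instance IDs are placeholders, and the private IP follows the example above.

```shell
# Hypothetical IDs -- replace with your own subnet, SG, and instance IDs
SUBNET_ID="subnet-xxxxxxxx"
SG_ID="sg-xxxxxxxxx"
MASTER_INSTANCE_ID="i-xxxxxxxxxxxxxxxxx"
ENI_PRIVATE_IP="10.0.1.160"

# Create the ENI with a fixed private IP in the private subnet
ENI_ID=$(aws ec2 create-network-interface \
  --subnet-id "$SUBNET_ID" \
  --groups "$SG_ID" \
  --private-ip-address "$ENI_PRIVATE_IP" \
  --description "k8s-master-static" \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)

# Attach it to the Master as a secondary interface (device index 1 -> ens6)
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id "$MASTER_INSTANCE_ID" \
  --device-index 1
```

Because the ENI exists independently of the instance, it survives instance termination, which is what the DR procedure later relies on.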
Run on all nodes 👇

```bash
# Disable swap (required by kubelet; also remove any swap entry
# from /etc/fstab so it stays off after a reboot)
sudo swapoff -a

# Load kernel modules required by containerd and Kubernetes networking
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# Enable bridged traffic filtering and IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
```

```bash
# Install containerd from the Docker apt repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y containerd.io

# Switch containerd to the systemd cgroup driver (required by kubelet)
containerd config default | sed -e 's/SystemdCgroup = false/SystemdCgroup = true/' | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable --now containerd
```

```bash
# Install kubelet, kubeadm, kubectl (and etcdctl for backups) from the v1.29 repo
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
sudo curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /" | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl etcd-client
sudo apt-mark hold kubelet kubeadm kubectl
```

🔹 All nodes must run the SAME Kubernetes version!
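A quick way to confirm the version pin took effect, run on every node; the versions reported should match across the cluster:

```shell
# Report installed tool versions and confirm the packages are held
kubeadm version -o short
kubectl version --client
apt-mark showhold    # should list kubeadm, kubectl, kubelet
```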
```bash
sudo hostnamectl set-hostname master-cp
echo "127.0.0.1 master-cp" | sudo tee -a /etc/hosts
```

⚠️ Required for Disaster Recovery

```bash
sudo kubeadm init \
  --apiserver-advertise-address=<ENI-PRIVATE-IP> \
  --pod-network-cidr=192.168.0.0/16
```

CRITICAL for DR: Configure kubelet to register with the ENI IP instead of the ephemeral IP.
```bash
# Create kubelet extra args configuration
sudo tee /etc/default/kubelet > /dev/null <<EOF
KUBELET_EXTRA_ARGS="--node-ip=<ENI-PRIVATE-IP>"
EOF

# Restart kubelet to apply changes
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```
⚠️ Important: Replace `<ENI-PRIVATE-IP>` with your actual ENI static IP (e.g., 10.0.1.160)
🔹 This ensures the node registers with the static IP, preventing certificate mismatches during DR
```bash
# Check that the node is registered with the ENI IP
kubectl get node master-cp -o jsonpath='{.status.addresses}' | jq
```

You should see the ENI IP as the InternalIP.
```bash
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```

```bash
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml
```

CRITICAL FIX for Multi-ENI Nodes: When the master has multiple network interfaces (ens5 + ens6), Calico may autodetect the wrong IP.
```bash
# Configure Calico to use the correct network interface
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=<VPC_CIDR>

# Restart Calico nodes to apply changes
kubectl delete pod -n kube-system -l k8s-app=calico-node
```

🔹 This tells Calico to use the interface that can reach your VPC CIDR
⚠️ Without this fix, Calico will be stuck at 0/1 Ready and pod networking will fail

After fixing Calico networking, restart CoreDNS to refresh service routing:

```bash
kubectl -n kube-system rollout restart deploy/coredns
```

Generate the join command on the master:

```bash
kubeadm token create --print-join-command
```

Add:
- `sudo` at the beginning
- the `--cri-socket "unix:///run/containerd/containerd.sock"` flag
- `--v=5` at the end (optional, for verbose logging)
Example:

```bash
sudo kubeadm join <ENI-IP>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --cri-socket "unix:///run/containerd/containerd.sock" --v=5
```

```bash
# Label the worker node for better visibility
kubectl label node <worker-node-name> node-role.kubernetes.io/worker=worker
```

```bash
# Check all nodes are Ready
kubectl get nodes -o wide

# Check all system pods are Running
kubectl get pods -A -o wide

# Test DNS resolution inside a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
```

✔ All nodes should be Ready
✔ Master node should show the ENI IP
✔ All pods should be Running
✔ DNS should resolve successfully
| Component | Path | Purpose |
|---|---|---|
| 💾 ETCD Snapshot | /var/lib/etcd | Cluster state |
| 🔐 Kubernetes Configs | /etc/kubernetes/ | API certs & configs |
| 🆔 Kubelet Identity | /var/lib/kubelet | Node certificates |
| ⚙️ Kubelet Config | /etc/default/kubelet | Node IP configuration |
```bash
# Create the backup directory first
sudo mkdir -p /root/k8s-backup

# Snapshot etcd
sudo ETCDCTL_API=3 etcdctl snapshot save /root/k8s-backup/etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

```bash
# Copy Kubernetes configs and kubelet identity
sudo cp -r /etc/kubernetes /root/k8s-backup/kubernetes
sudo cp -r /var/lib/kubelet /root/k8s-backup/kubelet
sudo cp /etc/default/kubelet /root/k8s-backup/kubelet-default
```

```bash
sudo -i
cd /root
tar czf k8s-backup.tar.gz k8s-backup
```

🔹 If this EC2 instance has an IAM Role with S3 permissions — no `aws configure` is required
🔹 If restoring from a laptop or a non‑role instance — run `aws configure` first
AWS Credential Requirements

| Environment | Need aws configure? | Why |
|---|---|---|
| EC2 with IAM Role | ❌ No | Auto temporary credentials ✔ |
| EC2 without IAM Role | ✅ Yes | No automatic credentials |
| Laptop / external machine | ✅ Yes | Needs manual keys |
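To confirm which credentials the CLI will actually use (instance role or configured keys), the standard `sts get-caller-identity` call works in every environment in the table:

```shell
# Prints the account ID and ARN of whatever credentials the AWS CLI
# resolves (instance role, environment variables, or ~/.aws/credentials).
# Fails with an error if no credentials are available.
aws sts get-caller-identity
```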
```bash
aws s3 cp /root/k8s-backup.tar.gz s3://<bucket>/k8s-backups/$(date +%F-%H%M).tar.gz
```

🔹 Requires the AWS CLI & an IAM role with S3 permissions
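The `date +%F-%H%M` in the upload command produces a sortable timestamp, so backups list chronologically in the bucket. A minimal sketch of how the key is built (the bucket name stays a placeholder):

```shell
# Build the timestamped S3 key, e.g. k8s-backups/2024-05-01-0300.tar.gz
KEY="k8s-backups/$(date +%F-%H%M).tar.gz"
echo "$KEY"
# aws s3 cp /root/k8s-backup.tar.gz "s3://<bucket>/${KEY}"   # actual upload
```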
Launch replacement Master with:
- Same ENI ⚙️
- Same hostname 🏷
- Same Security Group
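Reusing the same ENI is what lets the restore succeed without re-issuing certificates. A hedged sketch of moving the surviving ENI to the replacement instance (the IDs are placeholders; the ENI must be detached first if the failed instance still holds it):

```shell
ENI_ID="eni-xxxxxxxxxxxxxxxxx"          # the surviving master ENI
NEW_INSTANCE_ID="i-xxxxxxxxxxxxxxxxx"   # replacement master instance

# Detach from the dead instance if an attachment still exists
ATTACHMENT_ID=$(aws ec2 describe-network-interfaces \
  --network-interface-ids "$ENI_ID" \
  --query 'NetworkInterfaces[0].Attachment.AttachmentId' --output text)
if [ "$ATTACHMENT_ID" != "None" ]; then
  aws ec2 detach-network-interface --attachment-id "$ATTACHMENT_ID" --force
fi

# Attach to the replacement master at the same device index (ens6)
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id "$NEW_INSTANCE_ID" \
  --device-index 1
```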
```bash
aws s3 cp s3://<bucket>/k8s-backups/<file>.tar.gz /root/k8s-backup.tar.gz
```

```bash
sudo systemctl stop kubelet

# Unpack the backup first (the archive contains the k8s-backup/ directory,
# created from /root)
sudo tar xzf /root/k8s-backup.tar.gz -C /root

# Restore configs and kubelet identity
sudo cp -r /root/k8s-backup/kubernetes/. /etc/kubernetes/
sudo cp -r /root/k8s-backup/kubelet/. /var/lib/kubelet/
sudo cp /root/k8s-backup/kubelet-default /etc/default/kubelet

# Restore etcd from the snapshot
sudo rm -rf /var/lib/etcd
sudo ETCDCTL_API=3 etcdctl snapshot restore /root/k8s-backup/etcd.db --data-dir=/var/lib/etcd

sudo systemctl restart kubelet
```

```bash
kubectl get nodes -o wide
kubectl get pods -A
kubectl get svc -A
```

✔ Cluster restored automatically in 30–60 sec
✔ Master node should show the ENI IP
| Issue | Root Cause | Fix Applied | Status |
|---|---|---|---|
| Node communication failing | Security Group blocked node-to-node | Allow all TCP/UDP from <VPC_CIDR> | ✔ Fixed |
| Unable to exec/log into pods → kubelet timeout | Port 10250 blocked by SG | Allow TCP 10250 from <VPC_CIDR> | ✔ Fixed |
| Calico not ready (0/1), wrong IP detected | Master has two ENIs, Calico autodetected incorrectly | Set Calico IP autodetection: can-reach=<VPC_CIDR> | ✔ Fixed |
| CoreDNS 0/1 running or failing | Calico not routing Pod IPs correctly | After fixing Calico, CoreDNS became ready | ✔ Fixed |
| DNS inside pods failing (NXDOMAIN) | CoreDNS could not reach the API service | Allow TCP 6443 from <VPC_CIDR> | ✔ Fixed |
| API server publicly open (0.0.0.0/0) | Wrong SG inbound config | Restrict 6443 to <VPC_CIDR> + SSH from <ADMIN_IP> | ✔ Secured |
| Pod DNS resolution failing | CoreDNS needed restart after network fix | Restart CoreDNS deployment | ✔ Fixed |
| Worker node role missing | kubeadm join doesn't auto-tag labels | Label node: worker=worker | ✔ Fixed |
```bash
# Configure Calico to detect the correct interface
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=<VPC_CIDR>

# Restart Calico nodes
kubectl delete pod -n kube-system -l k8s-app=calico-node
```

```bash
kubectl -n kube-system rollout restart deploy/coredns
```

```bash
kubectl label node <node-name> node-role.kubernetes.io/worker=worker
```

```bash
# Configure kubelet to use the ENI IP
sudo tee /etc/default/kubelet > /dev/null <<EOF
KUBELET_EXTRA_ARGS="--node-ip=<YOUR-ENI-IP>"
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify
kubectl get node master-cp -o jsonpath='{.status.addresses}' | jq
```

```bash
# Quick DNS test
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Expected output:
# Server:    10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# Name:      kubernetes.default
# Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
```

```bash
# List Security Group rules
aws ec2 describe-security-groups --group-ids sg-xxxxxxxxx

# Check that a specific port is open
nc -zv <master-ip> 6443
nc -zv <master-ip> 10250
```

| Category | Required |
|---|---|
| Swap disabled | ✔️ |
| Static ENI private IP | ✔️ |
| Same hostname (master-cp) | ✔️ |
| Kubelet configured with --node-ip | ✔️ |
| Calico IP autodetection configured | ✔️ |
| Security Group rules configured | ✔️ |
| Port 6443 restricted to VPC only | ✔️ |
| Port 10250 open from VPC | ✔️ |
| SSH restricted to admin IP | ✔️ |
| Same Kubernetes versions | ✔️ |
| CoreDNS running and healthy | ✔️ |
| Backup stored safely | ✔️ |
| No kubeadm init during DR | ✔️ |
| Nodes Ready after restore | ✔️ |
You now have:
- 🔐 Highly available cluster design
- 💾 Reliable backup workflow
- 🔄 Fully tested DR procedure
- ⚙️ Proper node IP configuration for DR
- 🌐 Secure private networking with AWS VPC
- 🛡️ Hardened security group configuration
- 🔧 Troubleshooting knowledge for common issues
✨ You're production‑ready!
Automation scripts included in this repository help streamline Kubernetes lifecycle operations.
| Script | Path | Run On | Purpose |
|---|---|---|---|
| install-common.sh | scripts/ | Master + Worker | Installs containerd, kubeadm, and required settings |
| master-setup.sh | scripts/ | Master | Initializes control plane with ENI private IP |
| worker-join.sh | scripts/ | Worker | Automatically joins worker to cluster |
| backup.sh | scripts/ | Master | Creates etcd + Kubernetes secrets backup and uploads to S3 |
| restore.sh | scripts/ | Restore Master | Automates Disaster Recovery restore process |
```bash
cd scripts
chmod +x install-common.sh
./install-common.sh
```

```bash
chmod +x master-setup.sh
./master-setup.sh
```

(Ensure join-command.sh is copied from the master)

```bash
chmod +x worker-join.sh
./worker-join.sh
```

```bash
chmod +x backup.sh
./backup.sh
```

```bash
chmod +x restore.sh
./restore.sh
```