A complete, production‑grade guide for deploying Kubernetes on AWS using kubeadm, including Backup + Disaster Recovery (DR) best practices and AWS Private Subnet Networking.
This guide helps you:
- 🎯 Deploy a Kubernetes cluster using kubeadm on AWS
- 🔐 Back up critical Kubernetes components
- 🔄 Restore and recover the cluster during a failure
- 🌐 Configure secure private subnet networking
- 🛡️ Troubleshoot common AWS networking issues
- 🐧 Ubuntu 20.04 or later
- 🧑‍💻 `sudo` privileges
- 🌐 Internet access (via NAT Gateway for private subnets)
- 💻 EC2 instance type: t2.medium or higher (t4g.medium for ARM/Graviton)
- 🛡 All nodes in the same Security Group with proper rules (see Security section)
- 🌐 Private subnet deployment with NAT Gateway for outbound traffic
- 🧩 Custom ENI with a static private IP, created and attached to the Master
- 🔓 Security Group rules configured for VPC internal communication
Critical: All nodes must be in the same Security Group with these inbound rules:
| Port/Protocol | Service | Source | Purpose |
|---|---|---|---|
| 22 | SSH | <ADMIN_IP>/32 | Secure admin access 🔥 |
| 6443 | kube-apiserver | <VPC_CIDR> | API server access |
| 10250 | kubelet | <VPC_CIDR> | Pod exec/logs |
| 179 | Calico BGP | <VPC_CIDR> | Calico routing |
| All TCP | Node-to-Node | <VPC_CIDR> | Internal communication |
| All UDP | Node-to-Node | <VPC_CIDR> | Internal communication |
```bash
# Example AWS CLI commands to configure the Security Group
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
# Replace <ADMIN_IP> with your public IP (e.g., 203.0.113.25)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxxx \
  --ip-permissions \
    IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges='[{CidrIp=<ADMIN_IP>/32}]' \
    IpProtocol=tcp,FromPort=6443,ToPort=6443,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=10250,ToPort=10250,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=179,ToPort=179,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=udp,FromPort=0,ToPort=65535,IpRanges='[{CidrIp=<VPC_CIDR>}]'
```
⚠️ Security Warning: NEVER open port 6443 to 0.0.0.0/0 (the public internet)
⚠️ SSH Access: Restrict SSH (port 22) to your admin IP only
| Component | Subnet Type | Security |
|---|---|---|
| Master Node (API @ <ENI_PRIVATE_IP> / ens6) | Private | VPC only |
| Worker Node(s) | Private | Internal communication |
| Calico CNI | Pod network 192.168.0.0/16 | Fully working |
| CoreDNS | Private | Resolves cluster.local |
| NAT Gateway/Instance | Public → Private | Outbound only ✔ |
| Future Nginx LB EC2 | Public | Port 80/443 to world |
| CloudFlare DNS | External | Points to Nginx LB EIP |
Before creating the Master node:
- 🧩 Create an ENI in the private subnet
- 🔐 Assign a static private IP (e.g., <ENI_PRIVATE_IP>)
- 🔗 Attach the ENI to the Master as a secondary network interface
- ▶ Use the ENI private IP for `kubeadm init`

🔹 Prevents IP/certificate conflicts during DR
📝 Example: If your VPC is 10.0.0.0/16 and your subnet is 10.0.1.0/24, you might use 10.0.1.160 as your ENI IP
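The ENI steps above can be sketched with the AWS CLI. This is a minimal sketch, not a definitive script: the subnet, security group, and instance IDs are placeholders, and the private IP follows the example above.

```shell
# Hypothetical IDs -- replace with your own subnet, SG, and instance IDs
SUBNET_ID="subnet-xxxxxxxx"
SG_ID="sg-xxxxxxxxx"
MASTER_INSTANCE_ID="i-xxxxxxxxxxxxxxxxx"
ENI_PRIVATE_IP="10.0.1.160"

# Create the ENI with a fixed private IP in the private subnet
ENI_ID=$(aws ec2 create-network-interface \
  --subnet-id "$SUBNET_ID" \
  --groups "$SG_ID" \
  --private-ip-address "$ENI_PRIVATE_IP" \
  --description "k8s-master-static" \
  --query 'NetworkInterface.NetworkInterfaceId' --output text)

# Attach it to the Master as a secondary interface (device index 1 -> ens6)
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id "$MASTER_INSTANCE_ID" \
  --device-index 1
```

Because the ENI exists independently of the instance, it survives instance termination, which is what the DR procedure later relies on.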
Run on all nodes 👇

```bash
# Disable swap (required by kubelet; also remove any swap entry
# from /etc/fstab so it stays off after a reboot)
sudo swapoff -a

# Load kernel modules required by containerd and Kubernetes networking
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# Enable bridged traffic filtering and IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
```

```bash
# Install containerd from the Docker apt repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y containerd.io

# Switch containerd to the systemd cgroup driver (required by kubelet)
containerd config default | sed -e 's/SystemdCgroup = false/SystemdCgroup = true/' | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable --now containerd
```

```bash
# Install kubelet, kubeadm, kubectl (and etcdctl for backups) from the v1.29 repo
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
sudo curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /" | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl etcd-client
sudo apt-mark hold kubelet kubeadm kubectl
```

🔹 All nodes must run the SAME Kubernetes version!
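A quick way to confirm the version pin took effect, run on every node; the versions reported should match across the cluster:

```shell
# Report installed tool versions and confirm the packages are held
kubeadm version -o short
kubectl version --client
apt-mark showhold    # should list kubeadm, kubectl, kubelet
```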
```bash
sudo hostnamectl set-hostname master-cp
echo "127.0.0.1 master-cp" | sudo tee -a /etc/hosts
```

⚠️ Required for Disaster Recovery

```bash
sudo kubeadm init \
  --apiserver-advertise-address=<ENI-PRIVATE-IP> \
  --pod-network-cidr=192.168.0.0/16
```

CRITICAL for DR: Configure kubelet to register with the ENI IP instead of the ephemeral IP.
```bash
# Create kubelet extra args configuration
sudo tee /etc/default/kubelet > /dev/null <<EOF
KUBELET_EXTRA_ARGS="--node-ip=<ENI-PRIVATE-IP>"
EOF

# Restart kubelet to apply changes
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```
⚠️ Important: Replace `<ENI-PRIVATE-IP>` with your actual ENI static IP (e.g., 10.0.1.160)
🔹 This ensures the node registers with the static IP, preventing certificate mismatches during DR
```bash
# Check that the node is registered with the ENI IP
kubectl get node master-cp -o jsonpath='{.status.addresses}' | jq
```

You should see the ENI IP as the InternalIP.
```bash
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```

```bash
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml
```

CRITICAL FIX for Multi-ENI Nodes: When the master has multiple network interfaces (ens5 + ens6), Calico may autodetect the wrong IP.
```bash
# Configure Calico to use the correct network interface
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=<VPC_CIDR>

# Restart Calico nodes to apply changes
kubectl delete pod -n kube-system -l k8s-app=calico-node
```

🔹 This tells Calico to use the interface that can reach your VPC CIDR
⚠️ Without this fix, Calico will be stuck at 0/1 Ready and pod networking will fail

After fixing Calico networking, restart CoreDNS to refresh service routing:

```bash
kubectl -n kube-system rollout restart deploy/coredns
```

Generate the join command on the master:

```bash
kubeadm token create --print-join-command
```

Add:
- `sudo` at the beginning
- the `--cri-socket "unix:///run/containerd/containerd.sock"` flag
- `--v=5` at the end (optional, for verbose logging)
Example:

```bash
sudo kubeadm join <ENI-IP>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --cri-socket "unix:///run/containerd/containerd.sock" --v=5
```

```bash
# Label the worker node for better visibility
kubectl label node <worker-node-name> node-role.kubernetes.io/worker=worker
```

```bash
# Check all nodes are Ready
kubectl get nodes -o wide

# Check all system pods are Running
kubectl get pods -A -o wide

# Test DNS resolution inside a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
```

✔ All nodes should be Ready
✔ Master node should show the ENI IP
✔ All pods should be Running
✔ DNS should resolve successfully
| Component | Path | Purpose |
|---|---|---|
| 💾 ETCD Snapshot | /var/lib/etcd | Cluster state |
| 🔐 Kubernetes Configs | /etc/kubernetes/ | API certs & configs |
| 🆔 Kubelet Identity | /var/lib/kubelet | Node certificates |
| ⚙️ Kubelet Config | /etc/default/kubelet | Node IP configuration |
```bash
# Create the backup directory first
sudo mkdir -p /root/k8s-backup

# Snapshot etcd
sudo ETCDCTL_API=3 etcdctl snapshot save /root/k8s-backup/etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

```bash
# Copy Kubernetes configs and kubelet identity
sudo cp -r /etc/kubernetes /root/k8s-backup/kubernetes
sudo cp -r /var/lib/kubelet /root/k8s-backup/kubelet
sudo cp /etc/default/kubelet /root/k8s-backup/kubelet-default
```

```bash
sudo -i
cd /root
tar czf k8s-backup.tar.gz k8s-backup
```

🔹 If this EC2 instance has an IAM Role with S3 permissions — no `aws configure` is required
🔹 If restoring from a laptop or a non‑role instance — run `aws configure` first
AWS Credential Requirements

| Environment | Need aws configure? | Why |
|---|---|---|
| EC2 with IAM Role | ❌ No | Auto temporary credentials ✔ |
| EC2 without IAM Role | ✅ Yes | No automatic credentials |
| Laptop / external machine | ✅ Yes | Needs manual keys |
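To confirm which credentials the CLI will actually use (instance role or configured keys), the standard `sts get-caller-identity` call works in every environment in the table:

```shell
# Prints the account ID and ARN of whatever credentials the AWS CLI
# resolves (instance role, environment variables, or ~/.aws/credentials).
# Fails with an error if no credentials are available.
aws sts get-caller-identity
```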
```bash
aws s3 cp /root/k8s-backup.tar.gz s3://<bucket>/k8s-backups/$(date +%F-%H%M).tar.gz
```

🔹 Requires the AWS CLI & an IAM role with S3 permissions
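The `date +%F-%H%M` in the upload command produces a sortable timestamp, so backups list chronologically in the bucket. A minimal sketch of how the key is built (the bucket name stays a placeholder):

```shell
# Build the timestamped S3 key, e.g. k8s-backups/2024-05-01-0300.tar.gz
KEY="k8s-backups/$(date +%F-%H%M).tar.gz"
echo "$KEY"
# aws s3 cp /root/k8s-backup.tar.gz "s3://<bucket>/${KEY}"   # actual upload
```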
Launch replacement Master with:
- Same ENI ⚙️
- Same hostname 🏷
- Same Security Group
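Reusing the same ENI is what lets the restore succeed without re-issuing certificates. A hedged sketch of moving the surviving ENI to the replacement instance (the IDs are placeholders; the ENI must be detached first if the failed instance still holds it):

```shell
ENI_ID="eni-xxxxxxxxxxxxxxxxx"          # the surviving master ENI
NEW_INSTANCE_ID="i-xxxxxxxxxxxxxxxxx"   # replacement master instance

# Detach from the dead instance if an attachment still exists
ATTACHMENT_ID=$(aws ec2 describe-network-interfaces \
  --network-interface-ids "$ENI_ID" \
  --query 'NetworkInterfaces[0].Attachment.AttachmentId' --output text)
if [ "$ATTACHMENT_ID" != "None" ]; then
  aws ec2 detach-network-interface --attachment-id "$ATTACHMENT_ID" --force
fi

# Attach to the replacement master at the same device index (ens6)
aws ec2 attach-network-interface \
  --network-interface-id "$ENI_ID" \
  --instance-id "$NEW_INSTANCE_ID" \
  --device-index 1
```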
```bash
aws s3 cp s3://<bucket>/k8s-backups/<file>.tar.gz /root/k8s-backup.tar.gz
```

```bash
sudo systemctl stop kubelet

# Unpack the backup first (the archive contains the k8s-backup/ directory,
# created from /root)
sudo tar xzf /root/k8s-backup.tar.gz -C /root

# Restore configs and kubelet identity
sudo cp -r /root/k8s-backup/kubernetes/. /etc/kubernetes/
sudo cp -r /root/k8s-backup/kubelet/. /var/lib/kubelet/
sudo cp /root/k8s-backup/kubelet-default /etc/default/kubelet

# Restore etcd from the snapshot
sudo rm -rf /var/lib/etcd
sudo ETCDCTL_API=3 etcdctl snapshot restore /root/k8s-backup/etcd.db --data-dir=/var/lib/etcd

sudo systemctl restart kubelet
```

```bash
kubectl get nodes -o wide
kubectl get pods -A
kubectl get svc -A
```

✔ Cluster restored automatically in 30–60 sec
✔ Master node should show the ENI IP
| Issue | Root Cause | Fix Applied | Status |
|---|---|---|---|
| Node communication failing | Security Group blocked node-to-node | Allow all TCP/UDP from <VPC_CIDR> | ✔ Fixed |
| Unable to exec/log into pods → kubelet timeout | Port 10250 blocked by SG | Allow TCP 10250 from <VPC_CIDR> | ✔ Fixed |
| Calico not ready (0/1), wrong IP detected | Master has two ENIs, Calico autodetected incorrectly | Set Calico IP autodetection: can-reach=<VPC_CIDR> | ✔ Fixed |
| CoreDNS 0/1 running or failing | Calico not routing Pod IPs correctly | After fixing Calico, CoreDNS became ready | ✔ Fixed |
| DNS inside pods failing (NXDOMAIN) | CoreDNS could not reach the API service | Allow TCP 6443 from <VPC_CIDR> | ✔ Fixed |
| API server publicly open (0.0.0.0/0) | Wrong SG inbound config | Restrict 6443 to <VPC_CIDR> + SSH from <ADMIN_IP> | ✔ Secured |
| Pod DNS resolution failing | CoreDNS needed restart after network fix | Restart CoreDNS deployment | ✔ Fixed |
| Worker node role missing | kubeadm join doesn't auto-tag labels | Label node: worker=worker | ✔ Fixed |
```bash
# Configure Calico to detect the correct interface
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=<VPC_CIDR>

# Restart Calico nodes
kubectl delete pod -n kube-system -l k8s-app=calico-node
```

```bash
kubectl -n kube-system rollout restart deploy/coredns
```

```bash
kubectl label node <node-name> node-role.kubernetes.io/worker=worker
```

```bash
# Configure kubelet to use the ENI IP
sudo tee /etc/default/kubelet > /dev/null <<EOF
KUBELET_EXTRA_ARGS="--node-ip=<YOUR-ENI-IP>"
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify
kubectl get node master-cp -o jsonpath='{.status.addresses}' | jq
```

```bash
# Quick DNS test
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Expected output:
# Server:    10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# Name:      kubernetes.default
# Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
```

```bash
# List Security Group rules
aws ec2 describe-security-groups --group-ids sg-xxxxxxxxx

# Check that a specific port is open
nc -zv <master-ip> 6443
nc -zv <master-ip> 10250
```

| Category | Required |
|---|---|
| Swap disabled | ✔️ |
| Static ENI private IP | ✔️ |
| Same hostname (master-cp) | ✔️ |
| Kubelet configured with --node-ip | ✔️ |
| Calico IP autodetection configured | ✔️ |
| Security Group rules configured | ✔️ |
| Port 6443 restricted to VPC only | ✔️ |
| Port 10250 open from VPC | ✔️ |
| SSH restricted to admin IP | ✔️ |
| Same Kubernetes versions | ✔️ |
| CoreDNS running and healthy | ✔️ |
| Backup stored safely | ✔️ |
| No kubeadm init during DR | ✔️ |
| Nodes Ready after restore | ✔️ |
You now have:
- 🔐 Highly available cluster design
- 💾 Reliable backup workflow
- 🔄 Fully tested DR procedure
- ⚙️ Proper node IP configuration for DR
- 🌐 Secure private networking with AWS VPC
- 🛡️ Hardened security group configuration
- 🔧 Troubleshooting knowledge for common issues
✨ You're production‑ready!
Automation scripts included in this repository help streamline Kubernetes lifecycle operations.
| Script | Path | Run On | Purpose |
|---|---|---|---|
| install-common.sh | scripts/ | Master + Worker | Installs containerd, kubeadm, and required settings |
| master-setup.sh | scripts/ | Master | Initializes control plane with ENI private IP |
| worker-join.sh | scripts/ | Worker | Automatically joins worker to cluster |
| backup.sh | scripts/ | Master | Creates etcd + Kubernetes secrets backup and uploads to S3 |
| restore.sh | scripts/ | Restore Master | Automates Disaster Recovery restore process |
```bash
cd scripts
chmod +x install-common.sh
./install-common.sh
```

```bash
chmod +x master-setup.sh
./master-setup.sh
```

(Ensure join-command.sh is copied from the master)

```bash
chmod +x worker-join.sh
./worker-join.sh
```

```bash
chmod +x backup.sh
./backup.sh
```

```bash
chmod +x restore.sh
./restore.sh
```