NitinBisht28/Kubeadm-Prod-Setup

🚀 Kubernetes Production Cluster Setup

Installation, Backup & Disaster Recovery with kubeadm

A complete, production‑grade guide for deploying Kubernetes on AWS using kubeadm, including Backup + Disaster Recovery (DR) best practices and AWS Private Subnet Networking.


1️⃣ Introduction

This guide helps you:

  • 🎯 Deploy a Kubernetes cluster using kubeadm on AWS
  • 🔐 Back up critical Kubernetes components
  • 🔄 Restore and recover the cluster during a failure
  • 🌐 Configure secure private subnet networking
  • 🛡️ Troubleshoot common AWS networking issues

🧰 Prerequisites

  • 🐧 Ubuntu 20.04 or later
  • 🧑‍💻 sudo privileges
  • 🌐 Internet access (via NAT Gateway for private subnets)
  • 💻 EC2 instance type: t2.medium or higher (t4g.medium for ARM/Graviton)

☁️ AWS Setup Overview

  • 🛡 All nodes in the same Security Group with the required rules (see the Security section)
  • 🌐 Private Subnet deployment with NAT Gateway for outbound
  • 🧩 Create + attach a custom ENI with static private IP to Master
  • 🔓 Security Group rules configured for VPC internal communication
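The NAT path mentioned above can be sketched with the AWS CLI. This is a hedged outline, not a script from this repo: the function name and the `${AWS:-aws}` indirection are illustrative, and the subnet/route-table IDs are placeholders for your own values.

```shell
# Sketch: give the private subnet outbound internet access via a NAT Gateway.
# All IDs below are placeholders; the NAT Gateway itself must live in a PUBLIC subnet.
create_nat_for_private_subnet() {
  local public_subnet="$1" private_rtb="$2"
  # Allocate an Elastic IP for the NAT Gateway
  alloc_id=$("${AWS:-aws}" ec2 allocate-address --domain vpc \
      --query 'AllocationId' --output text)
  # Create the NAT Gateway in the public subnet
  nat_id=$("${AWS:-aws}" ec2 create-nat-gateway \
      --subnet-id "$public_subnet" --allocation-id "$alloc_id" \
      --query 'NatGateway.NatGatewayId' --output text)
  # Send the private subnet's default route through the NAT Gateway
  "${AWS:-aws}" ec2 create-route --route-table-id "$private_rtb" \
      --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$nat_id"
}
```

Run it as `create_nat_for_private_subnet <public-subnet-id> <private-route-table-id>` once, then all nodes in the private subnet get outbound-only internet access.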

2️⃣ AWS Security Group Configuration

🔐 Required Security Rules for Private Kubernetes Cluster

Critical: All nodes must be in the same Security Group with these inbound rules:

| Port/Protocol | Service | Source | Purpose |
|---|---|---|---|
| 22/TCP | SSH | <ADMIN_IP>/32 | Secure admin access 🔥 |
| 6443/TCP | kube-apiserver | <VPC_CIDR> | API server access |
| 10250/TCP | kubelet | <VPC_CIDR> | Pod exec/logs |
| 179/TCP | Calico BGP | <VPC_CIDR> | Calico routing |
| All TCP | Node-to-Node | <VPC_CIDR> | Internal communication |
| All UDP | Node-to-Node | <VPC_CIDR> | Internal communication |
# Example AWS CLI commands to configure Security Group
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
# Replace <ADMIN_IP> with your public IP (e.g., 203.0.113.25)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxxx \
  --ip-permissions \
    IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges='[{CidrIp=<ADMIN_IP>/32}]' \
    IpProtocol=tcp,FromPort=6443,ToPort=6443,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=10250,ToPort=10250,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=179,ToPort=179,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges='[{CidrIp=<VPC_CIDR>}]' \
    IpProtocol=udp,FromPort=0,ToPort=65535,IpRanges='[{CidrIp=<VPC_CIDR>}]'

⚠️ Security Warning: NEVER open port 6443 to 0.0.0.0/0 (public internet)
⚠️ SSH Access: Restrict SSH (port 22) to your admin IP only

📌 Final Working Network Architecture

| Component | Subnet Type | Security |
|---|---|---|
| Master Node (API @ <ENI_PRIVATE_IP> / ens6) | Private | VPC only |
| Worker Node(s) | Private | Internal communication |
| Calico CNI | Pod network 192.168.0.0/16 | Fully working |
| CoreDNS | Private | Resolves cluster.local |
| NAT Gateway/Instance | Public → Private | Outbound only ✔ |
| Future Nginx LB EC2 | Public | Port 80/443 to world |
| CloudFlare DNS | External | Points to Nginx LB EIP |

3️⃣ Kubernetes Installation & Cluster Setup

🌐 AWS Networking Preparation

Before creating the Master node:

  1. 🧩 Create ENI in private subnet
  2. 🔐 Assign static private IP (e.g., <ENI_PRIVATE_IP>)
  3. 🔗 Attach ENI to Master as secondary network interface
  4. ▶ Use ENI private IP for kubeadm init

🔹 Prevents IP/cert conflicts during DR
📝 Example: If your VPC is 10.0.0.0/16 and your subnet is 10.0.1.0/24, you might use 10.0.1.160 as the ENI IP
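Steps 1–3 above can be sketched with the AWS CLI. A hedged sketch, not a repo script: the function name and `${AWS:-aws}` indirection are ours, and the subnet/SG/instance IDs and the 10.0.1.160 address are placeholders.

```shell
# Sketch: create a static-IP ENI in the private subnet and attach it to the master.
create_and_attach_eni() {
  local subnet_id="$1" sg_id="$2" instance_id="$3" private_ip="$4"
  # Steps 1-2: create the ENI with a fixed private IP
  eni_id=$("${AWS:-aws}" ec2 create-network-interface \
      --subnet-id "$subnet_id" --groups "$sg_id" \
      --private-ip-address "$private_ip" \
      --query 'NetworkInterface.NetworkInterfaceId' --output text)
  # Step 3: attach as a secondary interface (device-index 1 typically appears as ens6)
  "${AWS:-aws}" ec2 attach-network-interface \
      --network-interface-id "$eni_id" \
      --instance-id "$instance_id" --device-index 1
}
```

Example: `create_and_attach_eni subnet-0abc123 sg-0abc123 i-0abc123 10.0.1.160`. Because the ENI survives instance termination, a replacement master can reuse the exact same IP during DR (step 4 then feeds it to kubeadm init).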


🔄 Common Setup (Master & Worker Nodes)

Run on all nodes 👇

🔕 Disable Swap

sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab   # keep swap disabled across reboots

🔧 Load Kernel Modules

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

🌐 Apply Sysctl Params

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system

📦 Install containerd

sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y containerd.io

containerd config default | sed -e 's/SystemdCgroup = false/SystemdCgroup = true/' | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable --now containerd

🚀 Install Kubernetes Components (v1.29)

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl

sudo curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | \
sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /" | \
sudo tee /etc/apt/sources.list.d/kubernetes.list

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl etcd-client
sudo apt-mark hold kubelet kubeadm kubectl

🔹 All nodes must run the SAME Kubernetes version!


🖥 Master Node Setup

🏷 Set Hostname

sudo hostnamectl set-hostname master-cp
echo "127.0.0.1 master-cp" | sudo tee -a /etc/hosts

⚠️ Required for Disaster Recovery

🚀 Initialize the Control Plane

sudo kubeadm init --apiserver-advertise-address=<ENI-PRIVATE-IP> --pod-network-cidr=192.168.0.0/16

🔧 Configure Kubelet to Use ENI IP

CRITICAL for DR: Configure kubelet to register with the ENI IP instead of ephemeral IP

# Create kubelet extra args configuration
sudo tee /etc/default/kubelet > /dev/null <<EOF
KUBELET_EXTRA_ARGS="--node-ip=<ENI-PRIVATE-IP>"
EOF

# Restart kubelet to apply changes
sudo systemctl daemon-reload
sudo systemctl restart kubelet

⚠️ Important: Replace <ENI-PRIVATE-IP> with your actual ENI static IP (e.g., 10.0.1.160)
🔹 This ensures the node registers with the static IP, preventing certificate mismatches during DR

🔍 Verify Node IP

# Check that the node is registered with the ENI IP (requires jq: sudo apt-get install -y jq)
kubectl get node master-cp -o jsonpath='{.status.addresses}' | jq

You should see the ENI IP as the InternalIP.

🔑 Configure kubeconfig

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

🌐 Install CNI (Calico) with Multi-ENI Fix

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml

CRITICAL FIX for Multi-ENI Nodes: When master has multiple network interfaces (ens5 + ens6), Calico may autodetect the wrong IP.

# Configure Calico to use the correct network interface
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=<VPC_CIDR>

# Restart Calico nodes to apply changes
kubectl delete pod -n kube-system -l k8s-app=calico-node

🔹 This tells Calico to use the interface that can reach your VPC CIDR
⚠️ Without this fix, Calico will be stuck at 0/1 Ready and pod networking will fail

🔄 Restart CoreDNS

After fixing Calico networking, restart CoreDNS to refresh service routing:

kubectl -n kube-system rollout restart deploy/coredns

🔗 Get Worker Join Command

kubeadm token create --print-join-command
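To get the join command onto the workers, it can be saved to a file and copied over. A hypothetical helper (the `export_join_command` name and `${KUBEADM:-kubeadm}` indirection are ours; the file name matches the join-command.sh that worker-join.sh expects later in this repo):

```shell
# Write the worker join command to join-command.sh so it can be shipped to workers.
export_join_command() {
  "${KUBEADM:-kubeadm}" token create --print-join-command > join-command.sh
  chmod +x join-command.sh
}

# Then copy it to each worker, e.g.:
#   scp join-command.sh ubuntu@<WORKER_IP>:~/
```

Tokens expire (24h by default), so regenerate the file whenever a new worker joins later.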

👷 Worker Node Setup

Take the join command printed on the master and run it on each worker, adding:

  • sudo at the beginning
  • the --cri-socket "unix:///run/containerd/containerd.sock" flag
  • --v=5 at the end (optional, for verbose logging)

Example:

sudo kubeadm join <ENI-IP>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --cri-socket "unix:///run/containerd/containerd.sock" --v=5

🏷 Label Worker Node (Optional but Recommended)

# Label the worker node for better visibility
kubectl label node <worker-node-name> node-role.kubernetes.io/worker=worker

🔍 Verify Cluster Health

# Check all nodes are Ready
kubectl get nodes -o wide

# Check all system pods are Running
kubectl get pods -A -o wide

# Test DNS resolution inside a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

✔ All nodes should be Ready
✔ Master node should show the ENI IP
✔ All pods should be Running
✔ DNS should resolve successfully


4️⃣ Backup Strategy (Master Only)

📦 What to Backup

| Component | Path | Purpose |
|---|---|---|
| 💾 ETCD Snapshot | /var/lib/etcd | Cluster state |
| 🔐 Kubernetes Configs | /etc/kubernetes/ | API certs & configs |
| 🆔 Kubelet Identity | /var/lib/kubelet | Node certificates |
| ⚙️ Kubelet Config | /etc/default/kubelet | Node IP configuration |

⏺ Take ETCD Snapshot

sudo mkdir -p /root/k8s-backup   # create the backup directory first
sudo ETCDCTL_API=3 etcdctl snapshot save /root/k8s-backup/etcd.db \
 --endpoints=https://127.0.0.1:2379 \
 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
 --cert=/etc/kubernetes/pki/etcd/server.crt \
 --key=/etc/kubernetes/pki/etcd/server.key

🗂 Backup configs

sudo mkdir -p /root/k8s-backup
sudo cp -r /etc/kubernetes /root/k8s-backup/kubernetes
sudo cp -r /var/lib/kubelet /root/k8s-backup/kubelet
sudo cp /etc/default/kubelet /root/k8s-backup/kubelet-default
sudo tar czf /root/k8s-backup.tar.gz -C /root k8s-backup

☁ Upload to S3

🔹 If this EC2 instance has an IAM Role with S3 permissions — NO aws configure is required
🔹 If restoring from a laptop or non‑role instance — run aws configure first

AWS Credential Requirements

| Environment | Need aws configure? | Why |
|---|---|---|
| EC2 with IAM Role | ❌ No | Auto temporary credentials ✔ |
| EC2 w/out IAM Role | ✅ Yes | No automatic credentials |
| Laptop / external machine | ✅ Yes | Needs manual keys |

aws s3 cp /root/k8s-backup.tar.gz s3://<bucket>/k8s-backups/$(date +%F-%H%M).tar.gz

🔹 Requires AWS CLI & IAM role with S3 permissions
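Backups are only useful if they run regularly. A hypothetical cron entry could schedule a nightly run — the /root/scripts/backup.sh path is an assumption based on this repo's scripts/ folder; adjust it to wherever you installed the script:

```
# Run the backup every night at 02:00 and append output to a log
0 2 * * * /root/scripts/backup.sh >> /var/log/k8s-backup.log 2>&1
```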


5️⃣ Disaster Recovery — Master Failure

Launch replacement Master with:

  • Same ENI ⚙️
  • Same hostname 🏷
  • Same Security Group

📥 Download Backup

aws s3 cp s3://<bucket>/k8s-backups/<file>.tar.gz /root/k8s-backup.tar.gz

🔄 Restore Data

# Unpack the backup first so the snapshot and configs are available
sudo tar xzf /root/k8s-backup.tar.gz -C /root
sudo systemctl stop kubelet
sudo rm -rf /var/lib/etcd
sudo ETCDCTL_API=3 etcdctl snapshot restore /root/k8s-backup/etcd.db --data-dir=/var/lib/etcd
# Put the saved configs back in place
sudo cp -r /root/k8s-backup/kubernetes/. /etc/kubernetes/
sudo cp -r /root/k8s-backup/kubelet/. /var/lib/kubelet/
sudo cp /root/k8s-backup/kubelet-default /etc/default/kubelet
sudo systemctl restart kubelet

🔍 Validate Recovery

kubectl get nodes -o wide
kubectl get pods -A
kubectl get svc -A

✔ Cluster restored automatically in 30–60 sec
✔ Master node should show the ENI IP


6️⃣ Troubleshooting Common Issues

✅ Kubernetes Cluster Issues & Solutions

| Issue | Root Cause | Fix Applied | Status |
|---|---|---|---|
| Nodes communication failing | Security Group blocked node-to-node | Allow All TCP/UDP from <VPC_CIDR> | ✔ Fixed |
| Unable to exec/log into pods → kubelet timeout | Port 10250 blocked by SG | Allow TCP 10250 from <VPC_CIDR> | ✔ Fixed |
| Calico not ready (0/1), wrong IP detected | Master has two ENIs, Calico autodetected incorrectly | Set Calico IP autodetection: can-reach=<VPC_CIDR> | ✔ Fixed |
| CoreDNS 0/1 running or failing | Calico not routing Pod IP correctly | After fixing Calico, CoreDNS became ready | ✔ Fixed |
| DNS inside pods failing (NXDOMAIN) | CoreDNS could not reach API service | Enable TCP 6443 from <VPC_CIDR> | ✔ Fixed |
| API server publicly open (0.0.0.0/0) | Wrong SG inbound config | Restrict 6443 to <VPC_CIDR> + SSH from <ADMIN_IP> | ✔ Secured |
| Pod DNS resolution failing | CoreDNS needed restart after network fix | Restart CoreDNS deployment | ✔ Fixed |
| Worker node role missing | kubeadm join doesn't auto-tag labels | Label node: worker=worker | ✔ Fixed |

🔧 Critical Fix Commands

Fix Calico IP Autodetection (Multi-ENI Nodes)

# Configure Calico to detect correct interface
# Replace <VPC_CIDR> with your VPC CIDR (e.g., 10.0.0.0/16)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=<VPC_CIDR>

# Restart Calico nodes
kubectl delete pod -n kube-system -l k8s-app=calico-node

Restart CoreDNS After Network Changes

kubectl -n kube-system rollout restart deploy/coredns

Label Worker Node

kubectl label node <node-name> node-role.kubernetes.io/worker=worker

Fix Node Showing Ephemeral IP

# Configure kubelet to use ENI IP
sudo tee /etc/default/kubelet > /dev/null <<EOF
KUBELET_EXTRA_ARGS="--node-ip=<YOUR-ENI-IP>"
EOF

sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Verify
kubectl get node master-cp -o jsonpath='{.status.addresses}' | jq

Test DNS Resolution

# Quick DNS test
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Expected output:
# Server:    10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# Name:      kubernetes.default
# Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

Verify Security Group Rules

# List Security Group rules
aws ec2 describe-security-groups --group-ids sg-xxxxxxxxx

# Check specific port is open
nc -zv <master-ip> 6443
nc -zv <master-ip> 10250
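Checking ports one at a time gets tedious; a small hypothetical wrapper (the `check_ports` name is ours, not a repo script) loops over all the cluster ports at once:

```shell
# Probe a list of TCP ports on a host and report open/closed for each.
check_ports() {
  local host="$1"; shift
  for port in "$@"; do
    # -z: scan only, -w2: two-second timeout per port
    if nc -zw2 "$host" "$port" 2>/dev/null; then
      echo "port $port: open"
    else
      echo "port $port: CLOSED"
    fi
  done
}

# Usage: check_ports <master-ip> 22 179 6443 10250
```

Run it from a worker node to confirm the Security Group rules from section 2 are actually in effect.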

7️⃣ Final Production Checklist

| Category | Required |
|---|---|
| Swap disabled | ✔️ |
| Static ENI private IP | ✔️ |
| Same hostname (master-cp) | ✔️ |
| Kubelet configured with --node-ip | ✔️ |
| Calico IP autodetection configured | ✔️ |
| Security Group rules configured | ✔️ |
| Port 6443 restricted to VPC only | ✔️ |
| Port 10250 open from VPC | ✔️ |
| SSH restricted to admin IP | ✔️ |
| Same Kubernetes versions | ✔️ |
| CoreDNS running and healthy | ✔️ |
| Backup stored safely | ✔️ |
| No kubeadm init during DR | ✔️ |
| Nodes Ready after restore | ✔️ |
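A couple of these checklist items can be verified locally from the shell. A minimal sketch — the function names are ours, and it assumes the master-cp hostname used earlier in this guide:

```shell
# Minimal local preflight covering two checklist items; extend as needed.
check_swap_disabled() {
  # Passes when no swap devices are active
  if [ -z "$(swapon --show --noheadings 2>/dev/null)" ]; then
    echo "swap: disabled"
  else
    echo "swap: STILL ENABLED"
  fi
}

check_hostname() {
  # The DR procedure relies on the master hostname staying master-cp
  if [ "$(hostname)" = "master-cp" ]; then
    echo "hostname: master-cp"
  else
    echo "hostname: $(hostname) (expected master-cp on the master)"
  fi
}

check_swap_disabled
check_hostname
```

The remaining items (SG rules, CoreDNS health, node readiness) are covered by the kubectl, aws, and nc commands shown in earlier sections.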

🎯 Conclusion

You now have:

  • 🔐 Highly available cluster design
  • 💾 Reliable backup workflow
  • 🔄 Fully tested DR procedure
  • ⚙️ Proper node IP configuration for DR
  • 🌐 Secure private networking with AWS VPC
  • 🛡️ Hardened security group configuration
  • 🔧 Troubleshooting knowledge for common issues

✨ You're production‑ready!


📁 Scripts Overview

Automation scripts included in this repository help streamline Kubernetes lifecycle operations.

| Script | Path | Run On | Purpose |
|---|---|---|---|
| install-common.sh | scripts/ | Master + Worker | Installs containerd, kubeadm, and required settings |
| master-setup.sh | scripts/ | Master | Initializes control plane with ENI private IP |
| worker-join.sh | scripts/ | Worker | Automatically joins worker to cluster |
| backup.sh | scripts/ | Master | Creates etcd + Kubernetes secrets backup and uploads to S3 |
| restore.sh | scripts/ | Restore Master | Automates Disaster Recovery restore process |

Script Usage

1) Install common components on master and worker nodes

cd scripts
chmod +x install-common.sh
./install-common.sh

2) Initialize Kubernetes master

chmod +x master-setup.sh
./master-setup.sh

3) Join worker node to the cluster

(Ensure join-command.sh is copied from master)

chmod +x worker-join.sh
./worker-join.sh

4) Backup Kubernetes state to S3

chmod +x backup.sh
./backup.sh

5) Restore cluster using Disaster Recovery process

chmod +x restore.sh
./restore.sh

About

Production-grade Kubernetes setup using kubeadm on AWS (1 Master + 1 Worker). Includes backup & disaster recovery using etcd + S3, static ENI IP for stability, and cost-optimized DR flow. Great for DevOps learners and real-world cluster practice.
