Kubernetes Mastery — From Fundamentals to Advanced Operations
Prerequisites: Familiarity with Linux command line, basic networking (TCP/IP, DNS, HTTP), containers and Docker fundamentals (images, containers, Dockerfiles), YAML syntax, and a general understanding of distributed systems concepts.
Why Kubernetes — The Problem It Solves
Containers changed how we package software — a single image that runs the same way on a laptop and in production. But the moment you go from one container to dozens (or thousands), a new class of problems appears. Which server should run each container? What happens when a container crashes at 3 AM? How do services find each other as instances come and go? Kubernetes exists to answer these questions.
Before diving into architecture and YAML manifests, it is worth understanding why this system was built and what life looks like without it. That context makes every Kubernetes concept that follows feel like a solution to a problem you already recognize.
The Pain of Manual Container Orchestration
Imagine you run a web application with three services: an API, a background worker, and a Redis cache. You have four servers. Without an orchestrator, you are the orchestrator. Here is what a typical deployment script looks like when humans manage containers directly:
#!/bin/bash
# deploy.sh — manual container deployment across 4 servers
# 1. Decide which server has capacity (check manually or guess)
ssh server-02 "docker stats --no-stream" | grep -v "0.00%"
# 2. Pull the new image on each target server
for host in server-01 server-02 server-03; do
  ssh "$host" "docker pull registry.example.com/api:v2.4.1"
done
# 3. Stop old containers one at a time (hope nobody notices)
ssh server-01 "docker stop api-1 && docker rm api-1"
ssh server-01 "docker run -d --name api-1 -p 8080:8080 \
  -e DB_HOST=10.0.1.50 -e DB_PASSWORD=s3cret \
  registry.example.com/api:v2.4.1"
# 4. Repeat for server-02 and server-03...
# 5. Update the load balancer config manually
# 6. Pray that DNS propagation is fast enough
# 7. Set up a cron job to restart crashed containers (?)
This script has serious problems. The database password is hardcoded in plain text. There is no health check — if the new version crashes on startup, traffic still routes to it. Rolling back means re-running the script with the old image tag and hoping the state is clean. Scaling up means provisioning a new server, installing Docker, copying SSH keys, and updating the script. At 3 AM, when server-02 dies, nobody restarts those containers until a human wakes up.
The Five Hard Problems at Scale
The script above is not a strawman — it is how many teams actually operated in 2014–2016. The problems it exposes fall into a set of recurring categories that every container orchestrator must solve; the five most fundamental are examined below.
mindmap
  root((Container Orchestration))
    Scheduling
      Bin-packing onto nodes
      Resource-aware placement
      Affinity and anti-affinity
    Self-Healing
      Restart crashed containers
      Replace unresponsive nodes
      Health check enforcement
    Service Discovery
      Stable DNS names
      Load balancing across replicas
      Dynamic IP handling
    Rolling Updates
      Zero-downtime deploys
      Automatic rollback
      Canary and blue-green
    Configuration
      Secret management
      Environment-specific config
      Hot-reload support
    Scaling
      Horizontal pod autoscaling
      Cluster autoscaling
      Scale to zero
    Storage
      Persistent volumes
      Dynamic provisioning
      Storage class abstraction
1. Scheduling
Given 50 containers and 10 servers, which container goes where? You need to consider CPU and memory availability, disk I/O requirements, data locality, and constraints like "don't put two replicas of the same service on the same host." Doing this by hand is error-prone. Doing it automatically, thousands of times per day, is a scheduling problem that requires a dedicated system.
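The core of that placement decision can be sketched as a filter-and-score loop. This is a toy model — the real kube-scheduler runs a much richer plugin pipeline — and every name, capacity, and rule here is illustrative:

```python
# Toy scheduler: filter nodes by free resources and a simple anti-affinity
# rule ("no two replicas of the same app on one node"), then score.
# Illustrative only — not the kube-scheduler algorithm.

def schedule(pod, nodes, placements):
    """Return the best node name for `pod`, or None if nothing fits."""
    candidates = []
    for node in nodes:
        placed = placements.get(node["name"], [])
        used_cpu = sum(p["cpu"] for p in placed)
        used_mem = sum(p["mem"] for p in placed)
        # Filter: does the node have enough free CPU and memory?
        if used_cpu + pod["cpu"] > node["cpu"] or used_mem + pod["mem"] > node["mem"]:
            continue
        # Filter: anti-affinity — skip nodes already running this app
        if any(p["app"] == pod["app"] for p in placed):
            continue
        # Score: prefer the node with the most CPU left after placement
        free_cpu = node["cpu"] - used_cpu - pod["cpu"]
        candidates.append((free_cpu, node["name"]))
    return max(candidates)[1] if candidates else None

nodes = [{"name": "node-1", "cpu": 4.0, "mem": 8192},
         {"name": "node-2", "cpu": 4.0, "mem": 8192}]
placements = {}
for _ in range(2):
    pod = {"app": "api", "cpu": 1.0, "mem": 1024}
    chosen = schedule(pod, nodes, placements)
    placements.setdefault(chosen, []).append(pod)

print(placements)  # the two 'api' replicas land on different nodes
```

Note how anti-affinity forces the second replica onto the other node — the same constraint you would express in Kubernetes with `podAntiAffinity`.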
2. Service Discovery and Load Balancing
Containers get new IP addresses every time they restart. If your API talks to a Redis instance at a hardcoded IP, that breaks the moment Redis is rescheduled to a different node. You need a dynamic registry that tracks where each service is running and routes traffic accordingly — without requiring code changes.
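The idea behind such a registry can be sketched in a few lines — a simplified stand-in for what Kubernetes Services and Endpoints provide, with invented names and addresses:

```python
# Toy service registry: callers resolve a stable name; the registry
# round-robins across whatever endpoints are currently alive.
# Simplified stand-in for Kubernetes Services/Endpoints, not the real thing.

class Registry:
    def __init__(self):
        self.endpoints = {}  # service name -> list of "ip:port"
        self.counters = {}   # per-service round-robin counter

    def register(self, service, addr):
        self.endpoints.setdefault(service, []).append(addr)

    def deregister(self, service, addr):
        self.endpoints[service].remove(addr)

    def resolve(self, service):
        """Round-robin across live endpoints — callers keep a stable name."""
        addrs = self.endpoints.get(service, [])
        if not addrs:
            raise LookupError(f"no endpoints for {service}")
        c = self.counters.get(service, 0)
        self.counters[service] = c + 1
        return addrs[c % len(addrs)]

reg = Registry()
reg.register("redis", "10.244.1.7:6379")
reg.register("redis", "10.244.2.3:6379")
first = reg.resolve("redis")
reg.deregister("redis", "10.244.1.7:6379")  # instance rescheduled elsewhere
second = reg.resolve("redis")               # callers never change their config
```

The client only ever asks for `"redis"`; the registry absorbs the churn of instances coming and going — exactly the contract a Kubernetes Service offers via its stable DNS name.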
3. Self-Healing
Containers crash. Nodes go offline. Disks fill up. A production-grade system must detect these failures and recover automatically — restart the container, reschedule it to a healthy node, and stop sending traffic to instances that are not ready. The goal is to match a desired state continuously, not just at deployment time.
4. Rolling Updates and Rollbacks
Deploying a new version should not cause downtime. You need to bring up new containers, verify they are healthy, shift traffic, and drain old containers — all without dropping requests. When a deployment goes wrong, you need to revert to the previous version in seconds, not minutes.
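A toy model of a one-replica-at-a-time rollout (roughly `maxUnavailable: 1`) with rollback on a failed health check — illustrative only, not how the Deployment controller is actually implemented:

```python
# Toy rolling update: replace one replica at a time, verify health before
# continuing, and restore the previous state if verification fails.
# Illustrative sketch — not the Deployment controller's real logic.

def rolling_update(replicas, new_version, healthy):
    """replicas: mutable list of versions; healthy: version -> bool check."""
    old = list(replicas)                   # snapshot for rollback
    for i in range(len(replicas)):
        replicas[i] = new_version          # replace one replica
        if not healthy(new_version):       # verify before moving on
            replicas[:] = old              # revert the whole fleet
            return "rolled-back"
    return "rolled-out"

fleet = ["v2.4.0", "v2.4.0", "v2.4.0"]
status = rolling_update(fleet, "v2.4.1", healthy=lambda v: True)
# fleet is now all v2.4.1; a failing health check would restore v2.4.0
```

Because at most one replica is out of service at any moment, the other two keep serving traffic throughout — the zero-downtime property the text describes.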
5. Secret and Configuration Management
Database passwords, API keys, and TLS certificates cannot live in Docker images or shell scripts. You need a way to inject secrets at runtime, rotate them without redeploying, and ensure that only authorized services can access them.
From Google Borg to Kubernetes
Kubernetes did not appear from a vacuum. Google had been running containers at scale since the early 2000s — long before Docker existed. Their internal systems, Borg and its successor Omega, orchestrated billions of containers per week across Google's global infrastructure. Gmail, Search, YouTube, and Maps all run on Borg.
In 2014, Google open-sourced a system that captured the core design principles of Borg and Omega but was rebuilt from scratch for the broader community. Three key ideas carried over:
- Declarative configuration. You describe what you want (3 replicas of this service, 512 MiB of RAM each), not how to get there. The system continuously works to make reality match your declaration.
- API-driven everything. Every operation — deploying, scaling, inspecting — goes through a versioned REST API. There is no SSH, no special CLI magic. The kubectl command is just an API client.
- Reconciliation loops. Controllers constantly compare actual state against desired state and take corrective action. This is what makes the system self-healing: it is always converging, not just executing a one-time script.
In 2015, Google donated Kubernetes to the newly formed Cloud Native Computing Foundation (CNCF). It was the first project to graduate from CNCF, and it catalyzed an entire ecosystem — from container runtimes (containerd, CRI-O) to service meshes (Istio, Linkerd) to observability tools (Prometheus, Jaeger).
The name "Kubernetes" comes from the Greek word κυβερνήτης, meaning "helmsman" or "pilot." The abbreviation K8s replaces the eight letters between "K" and "s." The seven-sided logo represents the original project codename: "Project Seven" (a reference to Seven of Nine from Star Trek).
The Declarative Model: Tell It What, Not How
The most important mental shift when learning Kubernetes is moving from imperative commands to declarative specifications. Instead of scripting a sequence of steps, you write a document that describes your desired end state and hand it to the system.
Here is the contrast. The imperative approach you saw earlier — SSH into servers, run Docker commands, update load balancers — is a recipe of how to deploy. If any step fails, the system is in an unknown state. The declarative approach is fundamentally different:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2.4.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
Apply this file with a single command — kubectl apply -f api-deployment.yaml — and Kubernetes handles everything: finding nodes with available resources, pulling the image, starting containers, running health checks, configuring rolling updates, and injecting the secret. If a container crashes, Kubernetes restarts it. If a node dies, Kubernetes reschedules the pods elsewhere. The YAML file becomes the single source of truth.
Before and After: A Side-by-Side Comparison
| Operation | Manual (Bash + SSH + Docker) | Kubernetes |
|---|---|---|
| Deploy 3 replicas | SSH into 3 servers, docker run on each, update load balancer config | kubectl apply -f deployment.yaml — scheduler places pods automatically |
| Scale to 10 replicas | Provision servers, install Docker, update deploy script, run it | kubectl scale deployment/api --replicas=10 or edit the YAML |
| Rolling update | Stop/start containers one by one, manually verify, update LB between each | Change the image tag in YAML and apply — Kubernetes handles the rollout |
| Rollback | Re-run deploy script with old tag, hope state is clean | kubectl rollout undo deployment/api — instant revert |
| Self-healing | Cron job checking docker ps, manual restart, pager alert at 3 AM | Built-in: kubelet restarts failed containers, scheduler replaces lost pods |
| Service discovery | Hardcoded IPs in config files, consul/etcd bolted on separately | Built-in DNS: api.default.svc.cluster.local resolves automatically |
| Secrets | Environment variables in scripts, .env files on disk, plaintext | Kubernetes Secrets, injected as volumes or env vars, RBAC-controlled |
| Resource limits | Hope nobody deploys a memory-leaking container that kills the host | Per-container CPU/memory requests and limits, enforced by cgroups |
Kubernetes vs. the Alternatives
Kubernetes is not the only container orchestrator. Docker Swarm, HashiCorp Nomad, and Amazon ECS all solve overlapping problems. Choosing between them depends on your scale, team expertise, and operational requirements.
| Criteria | Kubernetes | Docker Swarm | Nomad | Amazon ECS |
|---|---|---|---|---|
| Learning curve | Steep — many concepts and abstractions | Gentle — extends Docker CLI naturally | Moderate — simpler model, HCL config | Moderate — AWS console or Terraform |
| Ecosystem | Massive — CNCF landscape, Helm charts, operators | Limited — mostly Docker-native tools | Growing — integrates with Vault, Consul | AWS-only — tight Fargate, ALB, CloudWatch integration |
| Multi-cloud | Yes — runs on any cloud or bare metal | Yes — but declining adoption | Yes — cloud-agnostic | No — locked to AWS |
| Workload types | Containers (primary), VMs via KubeVirt | Containers only | Containers, VMs, Java JARs, binaries | Containers only |
| Auto-scaling | HPA, VPA, Cluster Autoscaler, KEDA | Manual or basic rules | Built-in autoscaler | Application Auto Scaling, Fargate auto |
| Production adoption | De facto standard — used or evaluated by ~96% of organizations in CNCF's annual survey | Declining — Docker Inc. pivoted away | Niche — strong at companies using HashiCorp stack | Significant — dominant within AWS-only shops |
If you run fewer than five services on a single cloud provider and your team has no Kubernetes experience, starting with ECS (on AWS), Cloud Run (on GCP), or Azure Container Apps can get you to production faster. Kubernetes shines when you need multi-cloud portability, complex networking policies, custom operators, or you are scaling beyond what managed PaaS platforms handle well. Don't adopt it for résumé-driven development.
The API-Driven Architecture
Everything in Kubernetes flows through its API server. When you run kubectl apply, you are making an HTTP request to the API. When a controller restarts a crashed pod, it reads from and writes to the same API. When a CI/CD pipeline deploys a new version, it hits the same endpoint. This uniform interface is what makes Kubernetes so extensible.
# kubectl is just an API client — you can do the same with curl
kubectl get pods -o json
# Equivalent raw API call (with authentication)
curl -k "https://<api-server>:6443/api/v1/namespaces/default/pods" \
  --header "Authorization: Bearer $TOKEN"
# Watch for real-time changes (the same mechanism controllers use)
curl -k "https://<api-server>:6443/api/v1/namespaces/default/pods?watch=true" \
  --header "Authorization: Bearer $TOKEN"
This design means you can build your own tools, dashboards, and automation on top of Kubernetes without special access. Custom Resource Definitions (CRDs) let you extend the API with your own object types, and operators — custom controllers that watch those objects — let you encode domain-specific operational knowledge into the cluster itself. You can define a PostgresCluster resource and let an operator handle backups, failovers, and upgrades automatically.
The Reconciliation Loop: Why Declarative Wins
The power of the declarative model becomes clear when things go wrong. In an imperative system, a failed step leaves you in a partially applied state — some servers have the new version, others have the old one. Recovery means writing more imperative logic to detect and fix the inconsistency.
In Kubernetes, every controller runs a continuous loop: observe the current state, compare it to the desired state, and act to close the gap. This loop runs constantly — not just at deploy time.
flowchart LR
A["Desired State\n(YAML in etcd)"] -->|compare| B{"Diff?"}
B -->|No difference| C["Do nothing\n(system is healthy)"]
B -->|Drift detected| D["Take action\n(create, update, delete)"]
D --> E["Actual State\n(running containers)"]
E -->|observe| B
C -->|wait, re-check| B
style A fill:#4a9eff,color:#fff,stroke:#3380cc
style B fill:#f5a623,color:#fff,stroke:#c4851c
style D fill:#e74c3c,color:#fff,stroke:#b83a2e
style E fill:#2ecc71,color:#fff,stroke:#25a25a
style C fill:#2ecc71,color:#fff,stroke:#25a25a
Suppose you declare 3 replicas of your API. A node crashes, taking one replica with it. The Deployment controller notices the actual count (2) does not match the desired count (3), and it schedules a new pod on a healthy node. No human intervention. No pager alert. The system converges back to the desired state automatically. This is the fundamental difference between running containers and orchestrating them.
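The loop itself is simple enough to sketch. This toy reconciler compares a desired replica count against the pods it observes and emits corrective actions — real controllers react to API watch events rather than being called directly, and the names here are made up:

```python
# Toy reconciliation step: observe, diff against desired state, act.
# Real controllers do this continuously via API watches; this sketch
# just computes the corrective actions for one pass.

def reconcile(desired_replicas, running_pods):
    """Return the actions needed to close the gap between actual and desired."""
    actions = []
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        actions += ["create-pod"] * diff      # e.g., a node died and took pods with it
    elif diff < 0:
        actions += ["delete-pod"] * (-diff)   # e.g., the user scaled down
    return actions                            # empty list means: converged, do nothing

# A node crash leaves 2 of 3 replicas running — one pass closes the gap:
print(reconcile(3, ["api-1", "api-2"]))  # ['create-pod']
print(reconcile(3, ["api-1", "api-2", "api-3"]))  # [] — healthy, nothing to do
```

The crucial property is idempotence: running the loop again after convergence produces no actions, so the system can re-check forever without doing harm.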
Kubernetes handles container-level failures automatically, but it does not replace monitoring, alerting, or capacity planning. If your cluster runs out of nodes, the scheduler cannot place new pods. If your application has a logic bug, Kubernetes will faithfully keep running the broken code. Declarative orchestration handles infrastructure-level reliability — application-level reliability is still your responsibility.
What You Will Build on This Foundation
This section answered the "why." You now understand the five core problems — scheduling, service discovery, self-healing, rolling updates, and configuration management — and why a declarative, API-driven system is the right abstraction for solving them at scale. You have seen how Kubernetes inherits battle-tested ideas from a decade of Google's internal infrastructure and where it fits relative to alternatives.
In the next section, Cluster Architecture at a Glance, you will see how Kubernetes is structured — the control plane components that make decisions and the node components that execute them. Every concept from this section — the API server, the reconciliation loop, the scheduler — maps directly to a specific component in the architecture.
Cluster Architecture at a Glance
A Kubernetes cluster is a set of machines — physical or virtual — organized into two distinct roles: control plane nodes that make decisions about the cluster, and worker nodes that run your application workloads. Every interaction you have with Kubernetes, from deploying an app to scaling a service, flows through this architecture.
Understanding this split is foundational. The control plane is the brain; worker nodes are the muscle. They communicate over well-defined APIs, and every component has a single, clear responsibility. This section maps out every piece so you know exactly what happens when you run kubectl apply.
The Two Halves of a Cluster
A Kubernetes cluster divides cleanly into the control plane and the data plane (worker nodes). The control plane manages cluster state — it decides what should run and where. Worker nodes execute those decisions — they actually run your containers. Here is what lives on each side:
| Control Plane Components | Role |
|---|---|
| kube-apiserver | Front door to the cluster. All reads/writes to cluster state go through it. Exposes the REST API that kubectl, controllers, and kubelets all talk to. |
| etcd | Distributed key-value store. The single source of truth for all cluster data — every Pod spec, Service definition, and ConfigMap lives here. |
| kube-scheduler | Watches for newly created Pods with no assigned node and selects the best node based on resource requirements, affinity rules, and constraints. |
| kube-controller-manager | Runs controller loops (Deployment controller, ReplicaSet controller, Node controller, etc.) that continuously reconcile desired state with actual state. |
| cloud-controller-manager | Integrates with cloud provider APIs for load balancers, routes, and node lifecycle. Only present when running on a cloud platform. |
| Worker Node Components | Role |
|---|---|
| kubelet | Agent on every node. Watches the API server for Pods assigned to its node, then instructs the container runtime to start/stop containers. Reports node and Pod status back. |
| kube-proxy | Maintains network rules (iptables or IPVS) on each node so that Service ClusterIPs and NodePorts route traffic to the correct Pod endpoints. |
| Container Runtime | The software that actually runs containers. Kubernetes talks to it through the Container Runtime Interface (CRI). Common runtimes: containerd, CRI-O. |
Cluster Architecture Diagram
The diagram below shows how control plane and worker node components connect. Notice that every component communicates through the API server — it is the single hub. No component talks directly to etcd except the API server, and no component talks directly to another component's internal state.
graph TB
subgraph CP["Control Plane Node(s)"]
API["kube-apiserver"]
ETCD["etcd"]
SCHED["kube-scheduler"]
CM["kube-controller-manager"]
CCM["cloud-controller-manager"]
end
subgraph W1["Worker Node 1"]
KL1["kubelet"]
KP1["kube-proxy"]
CR1["Container Runtime"]
P1A["Pod A"]
P1B["Pod B"]
end
subgraph W2["Worker Node 2"]
KL2["kubelet"]
KP2["kube-proxy"]
CR2["Container Runtime"]
P2A["Pod C"]
P2B["Pod D"]
end
USER["👤 User / CI"]
KUBECTL["kubectl"]
USER --> KUBECTL
KUBECTL -->|"HTTPS REST"| API
API <-->|"read/write state"| ETCD
SCHED -->|"watch & bind Pods"| API
CM -->|"watch & reconcile"| API
CCM -->|"cloud provider calls"| API
KL1 -->|"watch assigned Pods"| API
KL2 -->|"watch assigned Pods"| API
KL1 --> CR1
KL2 --> CR2
CR1 --> P1A
CR1 --> P1B
CR2 --> P2A
CR2 --> P2B
KP1 -->|"watch Services/Endpoints"| API
KP2 -->|"watch Services/Endpoints"| API
No other component reads from or writes to etcd directly. The kube-scheduler, controller manager, kubelets, and kube-proxy all interact with etcd indirectly through the API server. This is a deliberate design choice — it centralizes authentication, authorization (RBAC), admission control, and validation in a single place.
Desired State vs. Actual State
Kubernetes operates on a declarative model. You tell it what you want (desired state), and the system continuously works to make reality match. You never say "start 3 Pods" imperatively — you declare "there should be 3 replicas" and Kubernetes figures out the rest.
The desired state lives in etcd as resource specs (Deployments, Services, ConfigMaps). The actual state is the real-time status of nodes and containers as reported by kubelets. Controller loops in the controller manager are the bridge — they watch for drift between the two and take corrective action. If a Pod crashes, the ReplicaSet controller sees the replica count is below the desired number and tells the API server to create a new Pod.
Anatomy of a Request: From kubectl apply to Running Containers
Understanding the end-to-end request flow demystifies Kubernetes. Here is exactly what happens when you apply a Deployment manifest:
sequenceDiagram
actor User
participant kubectl
participant API as kube-apiserver
participant etcd
participant CM as controller-manager
participant Sched as kube-scheduler
participant KL as kubelet
participant CRT as Container Runtime
User->>kubectl: kubectl apply -f deployment.yaml
kubectl->>API: POST /apis/apps/v1/namespaces/default/deployments
API->>API: Authenticate, Authorize (RBAC), Admission Control
API->>etcd: Persist Deployment object
etcd-->>API: Write confirmed
Note over CM: Deployment controller watch triggers
CM->>API: Read new Deployment, create ReplicaSet
API->>etcd: Persist ReplicaSet object
CM->>API: ReplicaSet controller creates Pod objects
API->>etcd: Persist Pod objects (nodeName = "")
Note over Sched: Scheduler watch triggers on unbound Pods
Sched->>API: Read unbound Pods, evaluate node fitness
Sched->>API: Bind Pod → selected Node (set nodeName)
API->>etcd: Update Pod with nodeName
Note over KL: Kubelet watch triggers for its node
KL->>API: Read Pod spec assigned to this node
KL->>CRT: Pull image, create & start containers
CRT-->>KL: Containers running
KL->>API: Report Pod status = Running
API->>etcd: Update Pod status
The entire flow is asynchronous and event-driven. No component blocks waiting for the next one — they all use watches (long-lived HTTP connections to the API server) to react to changes. This is why Kubernetes can handle thousands of Pods across hundreds of nodes without a central orchestration bottleneck.
The Seven Stages in Plain Language
1. Submission. You run kubectl apply. kubectl reads your YAML, validates it client-side, and sends an HTTPS request to the API server.
2. Admission & Persistence. The API server authenticates your identity, checks RBAC authorization, runs admission webhooks (e.g., injecting sidecar containers), validates the schema, and writes the Deployment object to etcd.
3. Controller Reconciliation. The Deployment controller notices the new Deployment and creates a ReplicaSet. The ReplicaSet controller sees the new ReplicaSet and creates the specified number of Pod objects — but these Pods have no nodeName yet.
4. Scheduling. The scheduler watches for Pods without a node assignment. It evaluates each worker node against the Pod's resource requests, node selectors, affinity/anti-affinity rules, and taints/tolerations. It picks the best node and writes the binding back to the API server.
5. Kubelet Execution. The kubelet on the chosen node sees a new Pod assigned to it. It pulls the container image (if not cached), creates the container sandbox via the Container Runtime Interface (CRI), and starts the containers.
6. Status Reporting. The kubelet continuously reports Pod status (Pending → Running → Succeeded/Failed) back to the API server, which persists it in etcd.
7. Ongoing Reconciliation. Controllers keep watching. If a container crashes, the kubelet restarts it (based on the Pod's restartPolicy). If a node goes down, the node controller marks it as NotReady, and the ReplicaSet controller creates replacement Pods on healthy nodes.
Seeing It in Practice
You can observe this flow in real time. Open two terminal windows — one to watch events as they happen, and another to trigger the deployment:
# Terminal 1 — watch cluster events in real time
kubectl get events --watch --sort-by='.lastTimestamp'
# Terminal 2 — apply a simple deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
EOF
In the events window, you will see the exact sequence play out: the Deployment is created, then a ReplicaSet, then three Pods. Each Pod goes through Scheduled → Pulling → Pulled → Created → Started events. You can also inspect each layer directly:
# See the ownership chain: Deployment → ReplicaSet → Pod
kubectl get deploy nginx-demo -o wide
kubectl get rs -l app=nginx-demo -o wide
kubectl get pods -l app=nginx-demo -o wide
# Check which node each Pod was scheduled to
kubectl get pods -l app=nginx-demo -o custom-columns=\
NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Managed Kubernetes vs. Self-Managed Clusters
When you run Kubernetes, you have a choice: manage the control plane yourself, or let a cloud provider handle it. This decision affects your operational burden, cost model, and level of control. In production, most teams choose managed Kubernetes unless they have specific compliance or customization requirements.
| Aspect | Managed (EKS, GKE, AKS) | Self-Managed (kubeadm, k3s, Rancher) |
|---|---|---|
| Control Plane | Provider runs and patches the API server, etcd, scheduler, and controller manager. You never SSH into control plane nodes. | You install, configure, upgrade, and monitor every control plane component yourself. |
| etcd | Fully managed, backed up automatically, and replicated across availability zones. Invisible to you. | You manage etcd cluster health, backups, compaction, and disaster recovery. |
| Upgrades | One-click or automated control plane upgrades. You still manage worker node upgrades (often via managed node groups). | You plan and execute upgrades for both control plane and worker nodes, one minor version at a time. |
| Networking | Integrated with cloud VPCs, load balancers, and DNS. CNI plugins often pre-configured or tightly integrated (e.g., VPC-native Pods on GKE). | You choose and configure the CNI plugin (Calico, Cilium, Flannel), set up load balancers, and manage DNS. |
| Cost | Control plane fee (e.g., ~$0.10/hr on EKS, AKS, and GKE Standard; GKE's free-tier credit covers one zonal or Autopilot cluster) + worker node compute costs. | No control plane fee, but you pay for the compute of control plane nodes and the engineering time to manage them. |
| Customization | Limited API server flags and admission controller choices. You use what the provider supports. | Full control over every flag, plugin, and configuration file. You can run custom schedulers, admission webhooks, and etcd topologies. |
| Best For | Teams that want to focus on applications, not infrastructure. Most production workloads. | Air-gapped environments, edge deployments, custom compliance requirements, or deep Kubernetes learning. |
Even with managed Kubernetes, you are responsible for worker node updates, Pod security, RBAC policies, network policies, application-level monitoring, and backup of your own workload state. The provider manages the control plane — everything else is still on you.
Quick Cluster Inspection Commands
Regardless of whether your cluster is managed or self-managed, these commands give you immediate visibility into the architecture of any cluster you connect to:
# Cluster info: API server endpoint, CoreDNS, and add-on URLs
kubectl cluster-info
# All nodes with roles, version, and OS info
kubectl get nodes -o wide
# Control plane components health (self-managed clusters)
kubectl get componentstatuses # deprecated but still works in many clusters
kubectl get --raw='/healthz?verbose'
# See system Pods running the control plane and node agents
kubectl get pods -n kube-system -o wide
# Detailed view of a specific node's capacity and allocatable resources
kubectl describe node <node-name> | grep -A 6 "Capacity\|Allocatable"
If you run kubectl get pods -n kube-system on EKS, GKE, or AKS, you will not see the API server, scheduler, or controller manager Pods — the provider runs them outside your cluster's visibility. You will see kube-proxy, CoreDNS, and any CNI plugin DaemonSets, but the core control plane is abstracted away.
Putting It All Together
The architecture of a Kubernetes cluster is deliberate in its separation of concerns. The API server is the single communication hub. etcd is the single source of truth. The scheduler makes placement decisions. Controllers reconcile desired state with reality. Kubelets do the work of running containers. Every piece is independently replaceable and horizontally scalable.
This architecture is why Kubernetes can self-heal: if a node fails, the node controller detects it, the ReplicaSet controller creates replacement Pods, the scheduler places them on healthy nodes, and the kubelets start them — all automatically, with no human intervention. The next section digs deeper into each control plane component and how they work internally.
Control Plane Components — Under the Hood
The control plane is the brain of a Kubernetes cluster. It makes global decisions about scheduling, detects and responds to cluster events, and serves as the single source of truth for the desired state of every object. Understanding how each component works — and how they interact — is essential for debugging production issues and designing resilient clusters.
This section dissects four components: the API server (the front door), etcd (the memory), the scheduler (the matchmaker), and the controller manager (the reconciliation engine). We will look at internals, trace real request flows, and use kubectl to inspect each component in a live cluster.
flowchart TB
subgraph ControlPlane["Control Plane"]
API["kube-apiserver\n(REST API Gateway)"]
ETCD["etcd\n(Distributed KV Store)"]
SCHED["kube-scheduler\n(Pod Placement)"]
CM["kube-controller-manager\n(Reconciliation Loops)"]
end
USER["kubectl / Client"] -->|"REST + AuthN/AuthZ\n+ Admission"| API
API <-->|"gRPC\n(reads & writes)"| ETCD
API -->|"Watch: unscheduled Pods"| SCHED
SCHED -->|"Bind Pod → Node"| API
API -->|"Watch: resource changes"| CM
CM -->|"Create/Update objects"| API
subgraph Nodes["Worker Nodes"]
K1["kubelet"]
K2["kubelet"]
end
API -->|"Watch: Pod specs"| K1
API -->|"Watch: Pod specs"| K2
style ControlPlane fill:#1a1a2e,stroke:#4a9eff,stroke-width:2px,color:#e0e0e0
style Nodes fill:#1a1a2e,stroke:#50c878,stroke-width:2px,color:#e0e0e0
style API fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
style ETCD fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
style SCHED fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
style CM fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
kube-apiserver — The Front Door to Everything
Every interaction with your cluster — whether from kubectl, the dashboard, a CI/CD pipeline, or an internal controller — goes through the API server. It is the only component that talks to etcd directly. Nothing else reads or writes persistent state without going through this gateway first.
The API server is a stateless REST API built on HTTP/2. You can scale it horizontally by running multiple instances behind a load balancer. Each instance is functionally identical because etcd holds all the state.
The Request Pipeline: AuthN → AuthZ → Admission → etcd
Every request that hits the API server passes through a well-defined pipeline. Understanding this pipeline is critical for debugging 403 Forbidden errors, webhook failures, and mysterious object mutations.
flowchart LR
REQ["Incoming\nRequest"] --> AUTHN["Authentication\n(Who are you?)"]
AUTHN --> AUTHZ["Authorization\n(Can you do this?)"]
AUTHZ --> MUT["Mutating\nAdmission"]
MUT --> SCHEMA["Object Schema\nValidation"]
SCHEMA --> VAL["Validating\nAdmission"]
VAL --> ETCD["Persist\nto etcd"]
style REQ fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
style AUTHN fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
style AUTHZ fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
style MUT fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
style SCHEMA fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
style VAL fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
style ETCD fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
Authentication determines who is making the request. The API server evaluates multiple authenticators in sequence — client certificates, bearer tokens, OIDC tokens, service account tokens — and stops at the first one that succeeds. The result is a username, UID, and group membership.
Authorization determines what the authenticated identity is allowed to do. Kubernetes supports multiple authorizers: Node, ABAC, RBAC, and Webhook. In practice, nearly every cluster uses RBAC. The API server consults each authorizer in order: the first explicit "allow" grants the request, the first explicit "deny" rejects it, and if every authorizer abstains, the request is denied by default.
Admission control happens in two phases. Mutating admission webhooks and built-in plugins run first — they can modify the object (e.g., injecting sidecar containers, setting default resource requests). Then validating admission webhooks run — they can accept or reject but cannot modify. This two-phase design means validators always see the final, mutated object.
# Check which admission plugins are enabled on your API server
kubectl get pod kube-apiserver-controlplane -n kube-system \
-o jsonpath='{.spec.containers[0].command}' | tr ',' '\n' | grep enable-admission
# List all registered mutating and validating webhook configurations
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
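Passing the authorization step in practice means RBAC objects exist that permit the verb. As an illustrative sketch (the role name, binding name, and user are hypothetical), here is a Role plus RoleBinding that would let a user manage Deployments in a single namespace:

```yaml
# Hypothetical example — role, binding, and user names are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-editor
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-editor-binding
  namespace: default
subjects:
- kind: User
  name: jane            # the username produced by the authentication step
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-editor
  apiGroup: rbac.authorization.k8s.io
```

You can check the effect without making a real request: kubectl auth can-i create deployments --as=jane -n default.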
API Groups and Versioning
Kubernetes organizes its API into groups and versions. The core group (Pods, Services, ConfigMaps) lives at /api/v1. Named groups live under /apis/<group>/<version> — for example, /apis/apps/v1 for Deployments, or /apis/batch/v1 for Jobs. Every resource has a stability level indicated by its version: v1 (GA), v1beta1 (beta), or v1alpha1 (alpha).
# Discover all API groups and their preferred versions
kubectl api-versions
# List all resources in the 'apps' group
kubectl api-resources --api-group=apps
# Make a raw API call — bypasses kubectl abstractions
kubectl get --raw /apis/apps/v1/namespaces/default/deployments | jq '.items[].metadata.name'
# Check API server health endpoints
kubectl get --raw /healthz
kubectl get --raw /readyz
The API server supports long-lived watch connections. Instead of polling, clients (schedulers, controllers, kubelets) open a watch stream and receive a real-time event feed of changes. This is how Kubernetes achieves near-instant reaction to state changes — and why the API server is the central nervous system of the cluster.
etcd — The Cluster's Source of Truth
etcd is a distributed, strongly consistent key-value store. Every Kubernetes object — every Pod, Service, Secret, ConfigMap, and CustomResource — is serialized (typically as protobuf) and stored in etcd. If etcd is lost and unrecoverable, the cluster's desired state is gone. This makes etcd the single most critical piece of infrastructure in any Kubernetes deployment.
Data Model and Key Structure
Kubernetes stores objects in etcd under a predictable key hierarchy. The default prefix is /registry. A Deployment named nginx in the default namespace is stored at /registry/deployments/default/nginx. Understanding this structure is useful when inspecting etcd directly for disaster recovery or debugging.
# List top-level keys in etcd (run on a control plane node)
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only --limit=20 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Look up a specific object
ETCDCTL_API=3 etcdctl get /registry/deployments/default/nginx \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Raft Consensus
etcd uses the Raft consensus algorithm to replicate data across its cluster members. One member is elected the leader; all writes go through the leader and are replicated to followers. A write is committed only when a majority (quorum) of members acknowledge it. This means a 3-node etcd cluster tolerates 1 failure, and a 5-node cluster tolerates 2 failures.
flowchart LR
C["API Server\n(Client)"] -->|"Write request"| L["etcd Leader"]
L -->|"Replicate log entry"| F1["etcd Follower 1"]
L -->|"Replicate log entry"| F2["etcd Follower 2"]
F1 -->|"Acknowledge"| L
F2 -->|"Acknowledge"| L
L -->|"Commit (quorum reached)\nReturn success"| C
style C fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
style L fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
style F1 fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
style F2 fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
| etcd Cluster Size | Quorum Required | Fault Tolerance | Recommendation |
|---|---|---|---|
| 1 | 1 | 0 failures | Development only |
| 3 | 2 | 1 failure | Standard production |
| 5 | 3 | 2 failures | High-availability production |
| 7 | 4 | 3 failures | Rarely needed — added latency |
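The quorum column follows directly from the majority rule: a write commits only when floor(n/2) + 1 members acknowledge it, so fault tolerance is n minus that quorum. A quick shell sketch of the arithmetic:

```shell
# etcd quorum math: quorum = floor(n/2) + 1, tolerance = n - quorum
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerance=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerates=$tolerance failure(s)"
done
```

The same arithmetic shows why even member counts buy nothing: a 4-member cluster needs a quorum of 3 and still tolerates only one failure, while adding a member to reach quorum slows every write.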
Compaction, Defragmentation, and Backup
etcd keeps a history of all key revisions to support the watch mechanism. Over time, this history grows and consumes disk space. Compaction discards revisions older than a given point, while defragmentation reclaims the freed disk space. Kubernetes runs automatic compaction (default: every 5 minutes, retaining roughly the last 5 minutes of history), but defragmentation must be triggered manually or via a cron job.
# Check etcd cluster health and member list
ETCDCTL_API=3 etcdctl endpoint health --cluster \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
ETCDCTL_API=3 etcdctl member list --write-out=table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check current database size and revision
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Snapshot backup — THE most important operational task for etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot is valid
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20240115.db \
--write-out=table
# Restore from snapshot (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240115.db \
--data-dir=/var/lib/etcd-restored \
--name=controlplane \
--initial-cluster=controlplane=https://192.168.1.10:2380 \
--initial-advertise-peer-urls=https://192.168.1.10:2380
If you run a production Kubernetes cluster and you are not backing up etcd on a regular schedule (at minimum daily, ideally hourly), you are one disk failure away from losing your entire cluster state. Automate it. Verify restores periodically. Store backups off-cluster and encrypted.
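One way to automate this is a CronJob pinned to a control plane node. The sketch below is illustrative, not a drop-in manifest — the etcd image tag, hostPath locations, and schedule are assumptions you would adapt to your cluster:

```yaml
# Illustrative hourly etcd backup CronJob — image tag, paths, and
# schedule are assumptions; adapt to your control plane layout.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"              # hourly
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0   # version assumption
            command:
            - /bin/sh
            - -c
            - >
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /backup      # ship these off-node with a separate job
```

A hostPath backup on the same node is only step one; the advice above still applies — copy the snapshots off-cluster and test restores.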
kube-scheduler — Deciding Where Pods Run
When you create a Pod (directly or via a Deployment), it initially has no spec.nodeName. The scheduler watches for these unscheduled Pods, evaluates every available node against a series of constraints and preferences, and then binds the Pod to the best node by writing the node name back to the API server. The kubelet on that node then picks up the Pod and starts its containers.
The Two-Phase Scheduling Cycle
The scheduler operates in two distinct phases for each unscheduled Pod. First, it filters out nodes that cannot run the Pod. Then, it scores the remaining candidates to find the best fit. This filter-then-score approach keeps the algorithm efficient — scoring only runs on nodes that have already passed all hard constraints.
flowchart LR
P["Unscheduled Pod"] --> F["Filter Phase\n(Predicates)"]
F -->|"Feasible nodes"| S["Score Phase\n(Priorities)"]
S -->|"Highest score"| B["Bind to Node"]
F -.- F1["NodeResourcesFit"]
F -.- F2["NodeAffinity"]
F -.- F3["TaintToleration"]
F -.- F4["PodTopologySpread"]
S -.- S1["LeastAllocated"]
S -.- S2["BalancedAllocation"]
S -.- S3["ImageLocality"]
S -.- S4["NodeAffinityPriority"]
style P fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
style F fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
style S fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
style B fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
| Phase | Purpose | Key Plugins | Effect |
|---|---|---|---|
| Filter | Hard constraints — eliminate unsuitable nodes | NodeResourcesFit, NodeAffinity, TaintToleration, PodTopologySpread | Node is either feasible or not |
| Score | Soft preferences — rank the remaining candidates | LeastAllocated, BalancedAllocation, ImageLocality, InterPodAffinity | Each node gets a score 0–100 |
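The filter-then-score pass can be illustrated with a toy shell simulation. This is not scheduler code — node names, free capacities, and the least-allocated formula below are simplified assumptions — but it shows why scoring only runs on feasible nodes:

```shell
# Toy filter-then-score pass for one Pod requesting 500m CPU.
request=500          # millicores requested by the Pod
allocatable=4000     # assume each node has 4000m allocatable
best=""; best_score=-1
for node in "worker-1 300" "worker-2 1200" "worker-3 2600"; do
  set -- $node
  name=$1; free=$2
  # Filter phase: hard constraint — the node must fit the request
  if [ "$free" -lt "$request" ]; then
    echo "$name: filtered out (insufficient cpu)"
    continue
  fi
  # Score phase: least-allocated — more free capacity scores higher
  score=$(( free * 100 / allocatable ))
  echo "$name: feasible, score $score"
  if [ "$score" -gt "$best_score" ]; then
    best_score=$score; best=$name
  fi
done
echo "bind pod to $best"
```

Here worker-1 never reaches the scoring phase at all — exactly the efficiency the two-phase design buys.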
Resource-Based Scheduling
The scheduler uses requests (not limits) to determine if a node has enough capacity. If a Pod requests 500m CPU and 256Mi memory, the scheduler only considers nodes with at least that much allocatable capacity remaining. This is why setting accurate resource requests is critical — overestimate and you waste capacity, underestimate and Pods get scheduled onto overloaded nodes.
# See allocatable resources vs current requests on each node
kubectl describe nodes | grep -A 5 "Allocated resources"
# More precise: get node capacity, allocatable, and current allocation
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_CAP:.status.capacity.cpu,\
MEM_CAP:.status.capacity.memory,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory
# Why was a Pod not scheduled? Check events.
kubectl describe pod <pod-name> | grep -A 10 "Events"
# Inspect scheduler decisions in the scheduler logs
kubectl logs -n kube-system kube-scheduler-controlplane --tail=50
Node Affinity and Anti-Affinity
Node affinity lets you constrain which nodes a Pod can be scheduled on based on node labels. It comes in two flavors: requiredDuringSchedulingIgnoredDuringExecution (hard requirement — filter phase) and preferredDuringSchedulingIgnoredDuringExecution (soft preference — scoring phase). The "IgnoredDuringExecution" suffix means the rule only applies at scheduling time; if a node's labels change later, already-running Pods are not evicted.
apiVersion: v1
kind: Pod
metadata:
name: gpu-workload
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu-type
operator: In
values: ["nvidia-a100", "nvidia-v100"]
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: ["us-east-1a"]
containers:
- name: training
image: ml-training:v2
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: "1"
Taints and Tolerations
While affinity pulls Pods toward nodes, taints push Pods away. A taint on a node repels all Pods that do not have a matching toleration. This is how Kubernetes keeps user workloads off control plane nodes (they carry a node-role.kubernetes.io/control-plane:NoSchedule taint) and how you can dedicate nodes for specific purposes.
There are three taint effects: NoSchedule (hard — never schedule here without a toleration), PreferNoSchedule (soft — try to avoid), and NoExecute (hard — evict already-running Pods that do not tolerate the taint).
# View taints on all nodes
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
TAINTS:.spec.taints
# Add a taint to dedicate nodes for a team
kubectl taint nodes worker-3 team=ml:NoSchedule
# Remove a taint (note the trailing dash)
kubectl taint nodes worker-3 team=ml:NoSchedule-
# Check which Pods tolerate control-plane taints
kubectl get pods -A -o json | jq -r '
.items[] |
select(.spec.tolerations[]? | .key == "node-role.kubernetes.io/control-plane") |
"\(.metadata.namespace)/\(.metadata.name)"'
# Pod that tolerates the ML team taint
apiVersion: v1
kind: Pod
metadata:
name: ml-job
spec:
tolerations:
- key: "team"
operator: "Equal"
value: "ml"
effect: "NoSchedule"
containers:
- name: training
image: pytorch:latest
kube-controller-manager — Relentless Reconciliation
The controller manager is a single binary that bundles dozens of independent control loops (controllers). Each controller watches a specific set of resources through the API server and continuously drives the actual state toward the desired state. This is the heart of the Kubernetes declarative model: you declare what you want, and controllers make it happen.
The Controller Pattern
Every controller follows the same fundamental loop: observe (watch the API server for changes), diff (compare desired state vs. actual state), and act (make API calls to close the gap). If a controller crashes and restarts, it simply re-reads the current state from the API server and picks up where it left off. No state is stored locally.
flowchart TB
WATCH["Watch API Server\n(Informer Cache)"] --> DIFF["Compare Desired\nvs. Actual State"]
DIFF -->|"Drift detected"| ACT["Take Action\n(Create / Update / Delete)"]
ACT --> API["API Server"]
API --> WATCH
DIFF -->|"In sync"| WATCH
style WATCH fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
style DIFF fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
style ACT fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
style API fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
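The observe-diff-act loop can be sketched as a toy shell script. This is a simulation of the ReplicaSet controller's logic for illustration only — real controllers read both states from the API server through informer caches and work queues, not local variables:

```shell
# Toy reconcile loop: drive the actual replica count toward desired.
desired=5
actual=2
while [ "$actual" -ne "$desired" ]; do
  if [ "$actual" -lt "$desired" ]; then
    actual=$(( actual + 1 ))          # act: create a missing Pod
    echo "created pod ($actual/$desired)"
  else
    actual=$(( actual - 1 ))          # act: delete a surplus Pod
    echo "deleted pod ($actual/$desired)"
  fi
done
echo "in sync ($actual/$desired)"
```

Note that the loop is level-triggered: it acts on the current gap, not on the event that caused it, which is why a restarted controller can simply re-read state and continue.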
Key Controllers and What They Do
| Controller | Watches | Reconciles | What It Does |
|---|---|---|---|
| Deployment | Deployments | ReplicaSets | Creates/updates ReplicaSets during rollouts; manages rollout history, rollback, and scaling |
| ReplicaSet | ReplicaSets | Pods | Ensures the correct number of Pod replicas exist; creates or deletes Pods to match .spec.replicas |
| Node | Nodes | Node status, taints | Monitors node heartbeats; adds NoExecute taints to unreachable nodes; triggers Pod eviction |
| ServiceAccount | Namespaces | ServiceAccounts | Creates a default ServiceAccount in every new namespace |
| Namespace | Namespaces | All namespaced resources | Handles namespace deletion — garbage-collects all resources within a terminating namespace |
| Job | Jobs | Pods | Creates Pods to completion; tracks success/failure counts; handles parallelism and backoff |
| EndpointSlice | Services, Pods | EndpointSlices | Maintains the mapping of Service selectors to Pod IPs |
| GarbageCollector | Owner references | Orphaned objects | Cascading deletion — when you delete a Deployment, it deletes owned ReplicaSets and Pods |
# Observe the controller pattern in action — scale a Deployment
# and watch the ReplicaSet controller respond
kubectl scale deployment/nginx --replicas=5
kubectl get events --watch --field-selector reason=SuccessfulCreate
# See the ownership chain: Deployment → ReplicaSet → Pod
kubectl get rs -o custom-columns=\
NAME:.metadata.name,\
OWNER:.metadata.ownerReferences[0].name,\
DESIRED:.spec.replicas,\
READY:.status.readyReplicas
# Inspect controller manager logs
kubectl logs -n kube-system kube-controller-manager-controlplane --tail=100
# Check which controllers are enabled
kubectl get pod kube-controller-manager-controlplane -n kube-system \
-o jsonpath='{.spec.containers[0].command}' | tr ',' '\n' | grep controllers
Leader Election
In a highly available cluster with multiple control plane nodes, you have multiple instances of the controller manager and scheduler running. But only one instance of each should be actively reconciling at any time — otherwise you'd get duplicate actions. Kubernetes solves this with leader election.
Each component races to acquire a Lease object in the kube-system namespace. The winner becomes the active leader and periodically renews the lease. If the leader crashes, it stops renewing, and another instance acquires the lease within seconds. You can inspect the current leaders directly.
# Check which node is the current leader for each component
kubectl get lease -n kube-system
# Detailed view — see holderIdentity and renewal time
kubectl get lease kube-controller-manager -n kube-system -o yaml
kubectl get lease kube-scheduler -n kube-system -o yaml
# Example output (holderIdentity shows the current leader):
# holderIdentity: controlplane-1_xxxxxxxx-xxxx
# leaseDurationSeconds: 15
# renewTime: "2024-01-15T10:23:45.000000Z"
Putting It All Together: The Life of a Deployment
To solidify how these components interact, let's trace what happens end-to-end when you run kubectl apply -f deployment.yaml for a new 3-replica Deployment.
kubectl → API Server
kubectl serializes your YAML, sends a POST to /apis/apps/v1/namespaces/default/deployments. The API server authenticates (client cert), authorizes (RBAC — does your user have create on deployments?), runs mutating admission (e.g., injects default labels), validates the schema, runs validating admission, and persists the Deployment object to etcd.
Deployment Controller → ReplicaSet
The Deployment controller (inside kube-controller-manager) receives a watch event: "new Deployment created." It creates a ReplicaSet with .spec.replicas: 3 and an owner reference pointing back to the Deployment. The ReplicaSet is written to the API server and persisted to etcd.
ReplicaSet Controller → Pods
The ReplicaSet controller receives a watch event: "new ReplicaSet with 3 desired replicas, 0 current." It creates 3 Pod objects, each with no spec.nodeName and an owner reference to the ReplicaSet. These Pods are persisted to etcd via the API server.
Scheduler → Bind Pods to Nodes
The scheduler receives watch events for 3 unscheduled Pods. For each Pod, it runs the filter phase (eliminating nodes without enough resources, wrong taints, etc.) and the score phase (preferring nodes with balanced allocation). It writes the selected spec.nodeName back to each Pod object via the API server.
Kubelet → Container Runtime
The kubelet on each selected node receives a watch event: "Pod assigned to me." It pulls the container image (if not cached), creates the sandbox via the container runtime (containerd), starts the containers, sets up networking, and begins reporting Pod status back to the API server.
# Watch the entire chain in real-time in a second terminal
kubectl get events --watch
# Then trigger a Deployment in your first terminal
kubectl create deployment nginx --image=nginx:1.25 --replicas=3
# You will see events flow through:
# Deployment created
# ReplicaSet created (by deployment-controller)
# Pods created (by replicaset-controller)
# Pods scheduled (by default-scheduler)
# Containers pulling image, started (by kubelet)
When a Pod is stuck in Pending, the problem is usually between steps 3 and 4 — the scheduler cannot find a feasible node. Run kubectl describe pod <name> and look at the Events section. The scheduler reports exactly which filter plugins failed and on how many nodes (e.g., "0/3 nodes are available: 3 Insufficient cpu").
Node Components — Kubelet, Kube-Proxy, and Container Runtime
The control plane makes decisions, but it is the node components that do the actual work. Every worker node in a Kubernetes cluster runs three core components: the kubelet, which manages pod lifecycle; kube-proxy, which handles network routing for Services; and a container runtime, which pulls images and runs containers. Understanding how these three collaborate is essential for debugging node-level issues and optimizing cluster performance.
Each of these components operates independently but communicates through well-defined interfaces. The kubelet talks to the container runtime via the Container Runtime Interface (CRI). Kube-proxy watches the API server for Service and Endpoint changes, then programs kernel-level routing rules. The container runtime handles the low-level mechanics of creating and destroying containers. Together, they turn a bare Linux machine into a functioning Kubernetes node.
Worker Node Architecture
The following diagram shows how the three node components interact on a single worker node when traffic arrives for a Kubernetes Service and when the kubelet manages a pod's lifecycle.
flowchart TB
API["API Server\n(Control Plane)"]
subgraph WorkerNode["Worker Node"]
direction TB
KL["kubelet"]
KP["kube-proxy"]
CR["Container Runtime\n(containerd / CRI-O)"]
cAdvisor["cAdvisor\n(embedded)"]
IPTABLES["iptables / IPVS\n(kernel)"]
subgraph Pod1["Pod A"]
C1["Container 1"]
C2["Container 2"]
end
subgraph Pod2["Pod B"]
C3["Container 3"]
end
end
API -- "pod specs,\ndesired state" --> KL
KL -- "status reports,\nnode heartbeat" --> API
KL -- "CRI gRPC calls\n(RunPodSandbox,\nCreateContainer)" --> CR
CR --> Pod1
CR --> Pod2
KL -- "metrics" --> cAdvisor
cAdvisor -.-> Pod1
cAdvisor -.-> Pod2
API -- "Service &\nEndpointSlice watches" --> KP
KP -- "programs rules" --> IPTABLES
IPTABLES -- "DNAT to\npod IP:port" --> Pod1
IPTABLES -- "DNAT to\npod IP:port" --> Pod2
Kubelet — The Node Agent
The kubelet is a daemon that runs on every node (including control plane nodes). It is the sole authority responsible for ensuring that the containers described in a Pod spec are running and healthy. The kubelet does not manage containers that were not created by Kubernetes — it only cares about Pods assigned to its node by the API server (or defined as static pods on disk).
Pod Lifecycle Management
When the scheduler assigns a Pod to a node, the kubelet picks up the assignment by watching the API server. It then drives the Pod through its lifecycle: pulling images, creating the pod sandbox (network namespace), starting init containers in sequence, starting app containers, running startup/liveness/readiness probes, and ultimately tearing everything down when the Pod is deleted. Each transition is reported back to the API server as a pod status update.
The kubelet's main sync loop is primarily event-driven, supplemented by a periodic full resync (configurable via --sync-frequency; the default is one minute). On each iteration, it compares the desired state from the API server against the actual state on the node and reconciles any differences. If a container crashes, the kubelet restarts it according to the pod's restartPolicy, with an exponential backoff that caps at 5 minutes.
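The backoff sequence behind CrashLoopBackOff is simple doubling from a 10-second base up to the 5-minute ceiling, which a few lines of shell make concrete:

```shell
# CrashLoopBackOff delay sequence: 10s base, doubled per restart,
# capped at 300s (5 minutes).
delay=10
for restart in $(seq 1 8); do
  echo "restart $restart: back off ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

After roughly five failed restarts a crashing container is already waiting the full five minutes between attempts, which is why a Pod can sit in CrashLoopBackOff for a while even after you fix the underlying bug (the backoff resets after a container runs successfully for a time).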
You can inspect the kubelet's view of what is running on a node by querying its read-only status port or using kubectl:
# List all pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-1
# Check kubelet health on the node itself
systemctl status kubelet
# View kubelet logs for pod lifecycle events
journalctl -u kubelet --no-pager --since "10 minutes ago" | grep -i "SyncLoop"
# Inspect the kubelet's configuration
kubectl get --raw "/api/v1/nodes/worker-1/proxy/configz" | jq .
Static Pods
Static pods are managed directly by the kubelet on a specific node, without the API server scheduling them. The kubelet watches a directory on disk (default: /etc/kubernetes/manifests/) and automatically creates or destroys pods when YAML files are added to or removed from that directory. This is how kubeadm-based clusters run control plane components — etcd, kube-apiserver, kube-controller-manager, and kube-scheduler all run as static pods.
The kubelet creates a mirror pod in the API server for each static pod so that kubectl get pods can show them. However, you cannot delete a mirror pod through the API — you must remove the manifest file from disk.
# Find the static pod manifest directory
ps aux | grep kubelet | grep -- "--pod-manifest-path"
# Or check the kubelet config file
cat /var/lib/kubelet/config.yaml | grep staticPodPath
# List static pod manifests (on a kubeadm control plane node)
ls -la /etc/kubernetes/manifests/
# etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
You can create your own static pods by placing a manifest in the static pod directory. The kubelet will pick it up within seconds:
# /etc/kubernetes/manifests/debug-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: debug-static
namespace: default
spec:
containers:
- name: debug
image: busybox:1.36
command: ["sleep", "3600"]
CRI — Container Runtime Interface
The kubelet does not start containers directly. Instead, it communicates with the container runtime through the Container Runtime Interface (CRI), a gRPC-based API defined as a set of protobuf services. CRI has two services: the RuntimeService (for managing pod sandboxes and containers) and the ImageService (for pulling, listing, and removing images).
This abstraction is what allows Kubernetes to support multiple container runtimes interchangeably. The kubelet connects to the CRI endpoint via a Unix socket — typically /run/containerd/containerd.sock for containerd or /var/run/crio/crio.sock for CRI-O.
# Check which CRI endpoint the kubelet is using
ps aux | grep kubelet | grep -- "container-runtime-endpoint"
# Use crictl to interact with the runtime directly (works with any CRI runtime)
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
# List running containers through CRI
crictl ps
# List pod sandboxes
crictl pods
# Pull an image through CRI
crictl pull nginx:1.25
# Inspect a specific container
crictl inspect <CONTAINER_ID>
crictl is the standard CLI for debugging CRI-compatible runtimes. It ships with most Kubernetes distributions and uses the same CRI gRPC calls as the kubelet. Configure its default endpoint in /etc/crictl.yaml so you don't have to pass --runtime-endpoint every time.
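A minimal /etc/crictl.yaml looks like this — the socket path below assumes containerd; swap in the CRI-O socket if that is your runtime:

```yaml
# /etc/crictl.yaml — sets crictl defaults so no --runtime-endpoint
# flag is needed on every invocation (containerd path assumed).
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10        # seconds to wait for runtime responses
debug: false
```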
cAdvisor Metrics and Resource Monitoring
The kubelet embeds cAdvisor (Container Advisor), which collects CPU, memory, filesystem, and network usage metrics for every running container. These metrics serve two critical purposes: they power the kubectl top command (via the Metrics Server, which scrapes the kubelet's /metrics/resource endpoint), and they inform the kubelet's own eviction decisions when a node runs low on resources.
# View resource usage per pod on a node (requires Metrics Server)
kubectl top pods -n default
# View resource usage per node
kubectl top nodes
# Query the kubelet's metrics endpoint directly (from the node)
curl -sk https://localhost:10250/metrics/resource \
--cert /var/lib/kubelet/pki/kubelet-client-current.pem \
--key /var/lib/kubelet/pki/kubelet-client-current.pem | head -30
# Check kubelet's summary API for detailed per-container stats
kubectl get --raw "/api/v1/nodes/worker-1/proxy/stats/summary" | jq '.pods[0]'
Garbage Collection
Over time, terminated containers and unused images accumulate on each node, consuming disk space. The kubelet runs garbage collection for both. Container garbage collection removes dead containers based on three settings: MaxPerPodContainer (default 1 — keep the last terminated container per pod), MaxContainers (total dead containers on the node), and a minimum age threshold. Image garbage collection triggers based on disk usage thresholds.
# Key kubelet config fields for garbage collection (/var/lib/kubelet/config.yaml)
imageMinimumGCAge: 2m # Don't GC images younger than 2 minutes
imageGCHighThresholdPercent: 85 # Start removing images when disk hits 85%
imageGCLowThresholdPercent: 80 # Stop removing images when disk drops to 80%
evictionHard:
imagefs.available: "15%" # Evict pods if image filesystem < 15% free
memory.available: "100Mi" # Evict pods if available memory < 100Mi
nodefs.available: "10%" # Evict pods if root filesystem < 10% free
# Check current image disk usage on a node
crictl images | tail -5
crictl imagefsinfo
# Manually trigger image cleanup (remove unused images)
crictl rmi --prune
# See if the node is under disk pressure (triggers eviction)
kubectl describe node worker-1 | grep -A 5 "Conditions"
Kube-Proxy — Service Networking at the Kernel Level
When you create a Kubernetes Service, it gets a stable virtual IP address (ClusterIP) that doesn't map to any network interface. Kube-proxy is the component that makes this virtual IP actually work. It runs on every node, watches the API server for Service and EndpointSlice objects, and programs the node's kernel networking stack to translate Service IPs into real pod IPs using destination NAT (DNAT).
Despite its name, kube-proxy does not proxy traffic in the data path. It is better thought of as a per-node control loop for Service networking — it programs the kernel's rules and then steps aside. All actual packet forwarding happens in the Linux kernel.
How Service Routing Works
flowchart LR
Client["Client Pod\n10.244.1.5"]
SVC["Service ClusterIP\n10.96.0.100:80"]
KERNEL["Linux Kernel\n(iptables / IPVS)"]
EP1["Pod A\n10.244.2.10:8080"]
EP2["Pod B\n10.244.3.22:8080"]
EP3["Pod C\n10.244.1.18:8080"]
WATCH["kube-proxy\n(watches API server)"]
APISVR["API Server"]
Client -- "dst: 10.96.0.100:80" --> KERNEL
KERNEL -- "DNAT to\n10.244.2.10:8080" --> EP1
KERNEL -. "or" .-> EP2
KERNEL -. "or" .-> EP3
APISVR -- "Service &\nEndpointSlice changes" --> WATCH
WATCH -- "programs\nrouting rules" --> KERNEL
The flow works like this: a client pod sends a packet to the Service ClusterIP. The kernel intercepts the packet before it leaves the node (using PREROUTING or OUTPUT chains) and rewrites the destination address to one of the backing pod IPs. The response packet goes directly back to the client with the source address rewritten to the Service IP, so the client never knows which pod it hit.
Proxy Modes Compared
Kube-proxy supports three modes for programming these routing rules. The mode determines which kernel subsystem does the actual work. Each has meaningful trade-offs at scale.
| Feature | iptables (default) | IPVS | nftables |
|---|---|---|---|
| Kernel subsystem | netfilter / iptables | Linux Virtual Server (LVS) | nftables (netfilter successor) |
| Status | Default since Kubernetes 1.2 | Stable since 1.11 | Alpha in 1.29, beta in 1.31 |
| Rule complexity | O(n) — chain of rules | O(1) — hash table lookup | O(1) — optimized rulesets |
| Load balancing | Random (equal probability) | Round-robin, least-conn, weighted, and more | Random (equal probability) |
| Performance at scale | Degrades at >5,000 Services | Handles 10,000+ Services well | Better than iptables, comparable to IPVS |
| Rule update speed | Full chain rewrite on change | Incremental updates | Incremental, transactional updates |
| Session affinity | Yes (via recent module) | Yes (built-in persistence) | Yes |
| Best for | Small-to-medium clusters | Large clusters with many Services | Modern kernels, future default |
If you are running a cluster with more than a few thousand Services, switch to IPVS mode. The iptables mode rewrites the entire rule chain on every Service or Endpoint change, which causes noticeable latency spikes in large clusters. IPVS uses a kernel-level hash table and supports incremental updates.
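The "random (equal probability)" load balancing in iptables mode is implemented as a cascade of statistic-mode rules: rule i matches with conditional probability 1/(n-i), so every endpoint ends up equally likely. A sketch of the arithmetic (the rule layout is as kube-proxy programs it; the script itself is just illustration):

```shell
# Rule i fires with probability 1/(n-i) *given* earlier rules missed,
# so each endpoint's overall selection probability works out to 1/n.
n=3
summary=""
for i in $(seq 0 $(( n - 1 ))); do
  line=$(awk -v n="$n" -v i="$i" 'BEGIN {
    overall = 1 / (n - i)                   # this rule, if reached
    for (j = 0; j < i; j++)
      overall *= (n - j - 1) / (n - j)      # chance earlier rules missed
    printf "endpoint %d: overall probability %.5f", i, overall
  }')
  echo "$line"
  summary="$summary $line"
done
```

This cascade is also why the full chain must be rewritten whenever an endpoint is added or removed — every probability in it changes.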
Inspecting Kube-Proxy Rules
Understanding what kube-proxy has actually programmed into the kernel is one of the most valuable debugging skills for Service connectivity issues. The commands differ by proxy mode.
# Check which proxy mode is active
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
# --- iptables mode ---
# List all KUBE-SERVICES chains (one per ClusterIP:port)
iptables -t nat -L KUBE-SERVICES -n | head -20
# Trace rules for a specific Service ClusterIP
iptables -t nat -L -n -v | grep "10.96.0.100"
# Show the full DNAT chain for a service
# (Follow KUBE-SVC-xxx -> KUBE-SEP-xxx chains to see pod endpoints)
iptables -t nat -L KUBE-SVC-XXXXXX -n -v
# --- IPVS mode ---
# List all virtual servers (Service IPs) and their real servers (Pod IPs)
ipvsadm -Ln
# Show stats for a specific Service ClusterIP
ipvsadm -Ln -t 10.96.0.100:80
# --- nftables mode ---
# List nftables rules managed by kube-proxy
nft list table ip kube-proxy
Session Affinity
By default, kube-proxy distributes traffic across all healthy endpoints with no stickiness. If you need a client to consistently reach the same backend pod (for example, for in-memory sessions), you can enable session affinity on the Service. Kubernetes supports ClientIP affinity, which routes all requests from the same source IP to the same pod for a configurable timeout (default: 10,800 seconds / 3 hours).
apiVersion: v1
kind: Service
metadata:
name: web-app
spec:
selector:
app: web
ports:
- port: 80
targetPort: 8080
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 1800 # 30 minutes
# Verify session affinity is set on a Service
kubectl describe svc web-app | grep -i "session"
# Test affinity — repeated requests should hit the same pod
for i in $(seq 1 5); do
kubectl exec test-pod -- curl -s web-app.default.svc.cluster.local | grep "pod-name"
done
Kube-Proxy Logs and Debugging
# kube-proxy usually runs as a DaemonSet — find the pod on your node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
# View kube-proxy logs
kubectl logs -n kube-system kube-proxy-abc12 --tail=50
# Check kube-proxy's current configuration
kubectl get configmap kube-proxy -n kube-system -o yaml
# Verify kube-proxy is correctly watching EndpointSlices
kubectl logs -n kube-system kube-proxy-abc12 | grep -i "endpoints"
# Check conntrack entries for active NAT translations
conntrack -L -d 10.96.0.100 2>/dev/null | head -10
Container Runtime — Where Containers Actually Run
The container runtime is the software that actually executes containers. It pulls images from registries, creates filesystem layers, sets up namespaces and cgroups, and starts the process inside the container. Kubernetes interacts with the runtime exclusively through the CRI specification, which means any runtime implementing the CRI gRPC API works with Kubernetes.
The CRI Specification
CRI defines two gRPC services. The RuntimeService handles the pod and container lifecycle — creating sandboxes, starting and stopping containers, executing commands, attaching to running containers, and port forwarding. The ImageService handles image operations — pulling, listing, inspecting, and removing images. This clean separation means the kubelet never needs to know the implementation details of how containers or images are managed under the hood.
| CRI Service | Key RPCs | Purpose |
|---|---|---|
| RuntimeService | RunPodSandbox | Create the pod's network namespace and infrastructure container |
| | CreateContainer | Create a container within an existing pod sandbox |
| | StartContainer | Start a previously created container |
| | StopContainer | Gracefully stop a running container |
| | RemoveContainer | Remove a stopped container |
| | ExecSync | Execute a command in a container synchronously |
| ImageService | PullImage | Pull an image from a registry |
| | ListImages | List images on the node |
| | RemoveImage | Remove an image from the node |
| | ImageFsInfo | Report filesystem usage for image storage |
containerd — The Default Runtime
containerd is the default container runtime for most Kubernetes distributions, including kubeadm clusters, GKE, EKS, and AKS. It was originally extracted from Docker as a standalone daemon and donated to the CNCF. containerd handles the full container lifecycle: image transfer and storage, container execution (via runc as the low-level OCI runtime), snapshotting (filesystem layering), and network namespace management.
The architecture is layered: the kubelet calls containerd's CRI plugin over gRPC, containerd manages image and container metadata, and it delegates actual container creation to an OCI runtime (usually runc, but gVisor's runsc and Kata Containers' kata-runtime are also supported).
# Check containerd status
systemctl status containerd
# View containerd version and runtime info
ctr version
# Use the containerd-specific CLI to list containers (in the k8s.io namespace)
ctr -n k8s.io containers list
# List images managed by containerd for Kubernetes
ctr -n k8s.io images list | head -10
# View the containerd configuration
cat /etc/containerd/config.toml | grep -A 5 "plugins.*cri"
# Check which OCI runtime containerd is using
cat /etc/containerd/config.toml | grep -A 3 "runtimes.runc"
CRI-O — The Kubernetes-Native Alternative
CRI-O is a lightweight container runtime built specifically for Kubernetes. Unlike containerd, whose scope extends beyond Kubernetes (it is also the runtime underneath Docker), CRI-O implements only the CRI interface and nothing more. It follows a strict version-locking policy with Kubernetes: CRI-O 1.29.x supports Kubernetes 1.29.x, CRI-O 1.30.x supports Kubernetes 1.30.x, and so on. This makes it a popular choice in OpenShift clusters and environments that want the thinnest possible runtime layer.
# Check CRI-O status
systemctl status crio
# View CRI-O version and config
crio --version
crio config --default | head -30
# On a CRI-O node, crictl works exactly the same way
crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps
crictl --runtime-endpoint unix:///var/run/crio/crio.sock images
The Dockershim Deprecation
Before Kubernetes 1.24, the kubelet could talk to Docker Engine via a built-in adapter called dockershim. This adapter translated CRI calls into Docker API calls. The chain was wasteful: kubelet → dockershim → Docker Engine → containerd → runc. Kubernetes 1.24 removed dockershim entirely. Clusters that previously used Docker Engine as their runtime had to migrate to a CRI-compliant runtime — typically containerd, which Docker itself uses internally. Container images remain 100% compatible because both Docker and Kubernetes build and run OCI-compliant images.
Image Pulling and Caching
When a pod is scheduled to a node, the kubelet instructs the container runtime to pull any images not already present on the node. Images are stored in a content-addressable store and shared across containers — if two pods use the same image, it is stored only once. The pull behavior is controlled by the pod spec's imagePullPolicy.
| imagePullPolicy | Behavior | When to use |
|---|---|---|
| Always | Always contacts the registry (uses cached layers if digest matches) | Tags that can change, like latest or dev |
| IfNotPresent | Pulls only if the image is not on the node | Immutable tags like v1.2.3 (default for tagged images) |
| Never | Never pulls — fails if image is missing | Pre-loaded images, air-gapped environments |
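The decision in the table reduces to a few lines of logic. This Python sketch is illustrative (the helper names are invented, and it ignores registry ports and digest references), but it captures both the per-policy behavior and how Kubernetes defaults the policy from the tag.

```python
def should_pull(policy, image_present_on_node):
    """Decide whether the runtime must contact the registry."""
    if policy == "Always":
        return True            # always contact the registry (layers may be cached)
    if policy == "IfNotPresent":
        return not image_present_on_node
    if policy == "Never":
        return False           # the pod fails if the image is missing
    raise ValueError(f"unknown policy: {policy}")

def default_policy(image_ref):
    """Kubernetes defaults to Always for :latest (or untagged) images,
    IfNotPresent for any other tag. Simplified: no registry ports/digests."""
    tag = image_ref.rsplit(":", 1)[-1] if ":" in image_ref else "latest"
    return "Always" if tag == "latest" else "IfNotPresent"

assert default_policy("nginx") == "Always"            # untagged -> :latest
assert default_policy("nginx:1.27") == "IfNotPresent"
assert should_pull("IfNotPresent", image_present_on_node=True) is False
```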
# List images cached on the node
crictl images
# Check the disk space used by images
crictl imagefsinfo
# Pre-pull an image to avoid cold-start latency
crictl pull gcr.io/my-project/my-app:v2.1.0
# Debug image pull failures — check kubelet logs
journalctl -u kubelet --no-pager --since "5 minutes ago" | grep -i "pull"
# Check events on a pod stuck in ImagePullBackOff
kubectl describe pod my-pod | grep -A 10 "Events"
Avoid using the :latest tag in production. With imagePullPolicy: Always (the default for :latest), every pod start contacts the registry, adding latency and creating a hard dependency on registry availability. With IfNotPresent, different nodes may run different versions of :latest. Always use immutable, versioned tags like v1.4.2 or full digest references like sha256:abc123....
Putting It All Together — Node Status and Debugging
The kubelet reports the overall health of all three components back to the API server as node conditions. When any component is unhealthy, it surfaces as a condition change on the node object. Understanding these conditions is the starting point for diagnosing node-level problems.
# Full node inspection — allocatable resources, conditions, images, and running pods
kubectl describe node worker-1
# Check node conditions specifically
kubectl get node worker-1 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
# Verify the container runtime the node is using
kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
# Output: containerd://1.7.11
# Comprehensive node debug — creates a privileged debug pod on the node
kubectl debug node/worker-1 -it --image=ubuntu:22.04
# Once inside the debug pod:
# chroot /host
# systemctl status kubelet
# systemctl status containerd
# journalctl -u kubelet --since "30 minutes ago" | tail -50
| Node Condition | Healthy Value | What It Means When Unhealthy |
|---|---|---|
| Ready | True | Kubelet cannot communicate with the API server or the container runtime is down |
| MemoryPressure | False | Node memory is running low — kubelet will start evicting pods |
| DiskPressure | False | Root or image filesystem is nearly full — image GC and pod eviction triggered |
| PIDPressure | False | Too many processes on the node — kubelet refuses new pods |
| NetworkUnavailable | False | CNI plugin has not configured networking — pods cannot communicate |
With a solid understanding of these three components, you can trace any node-level issue from symptom to root cause. A pod stuck in ContainerCreating? Check the kubelet logs and the container runtime. A Service not reachable? Inspect kube-proxy's iptables or IPVS rules. An ImagePullBackOff? Look at the kubelet events and test the image pull with crictl. These components are the foundation — everything else in Kubernetes runs on top of them.
The Kubernetes API and kubectl Essentials
Every interaction with a Kubernetes cluster — whether you type a kubectl command, deploy from CI/CD, or a controller reconciles state — goes through the Kubernetes API server. The API is the single source of truth. Understanding its structure is not optional background knowledge; it is the foundation for everything else you will do with Kubernetes.
kubectl is the primary CLI client for this API. It translates your commands into HTTP requests against the API server, then formats the responses for your terminal. This section covers the API's structure, how to explore it, and the essential kubectl commands organized by real-world workflow.
The Kubernetes API Structure
The Kubernetes API is a RESTful HTTP API organized into API groups. Each group contains a set of related resources, and each resource has one or more supported versions. When you create a Deployment, you are making an HTTP POST to a specific URL path that encodes the group, version, and resource type.
API Groups
Resources are organized into groups to keep the API modular and allow independent versioning. The core group (also called the legacy group) has no explicit group name in its URL path — its resources live directly under /api/v1. All other groups are under /apis/<group-name>/<version>.
| API Group | URL Path Prefix | Key Resources | Purpose |
|---|---|---|---|
| core (legacy) | /api/v1 | Pod, Service, ConfigMap, Secret, Namespace, Node, PersistentVolume | Foundational cluster primitives |
| apps | /apis/apps/v1 | Deployment, StatefulSet, DaemonSet, ReplicaSet | Workload controllers |
| batch | /apis/batch/v1 | Job, CronJob | Run-to-completion and scheduled workloads |
| networking.k8s.io | /apis/networking.k8s.io/v1 | Ingress, NetworkPolicy, IngressClass | Network routing and access control |
| rbac.authorization.k8s.io | /apis/rbac.authorization.k8s.io/v1 | Role, ClusterRole, RoleBinding, ClusterRoleBinding | Access control |
| storage.k8s.io | /apis/storage.k8s.io/v1 | StorageClass, CSIDriver, VolumeAttachment | Storage provisioning |
| autoscaling | /apis/autoscaling/v2 | HorizontalPodAutoscaler | Automatic scaling |
The full API URL for a namespaced resource follows this pattern: /apis/{group}/{version}/namespaces/{namespace}/{resource}. For example, to list Deployments in the production namespace, the API server handles a GET request to /apis/apps/v1/namespaces/production/deployments.
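The path-building rule can be captured in a short function. This is a hypothetical helper for illustration (not a kubectl or client-go API); the core group is the special case, living under /api/v1 with no group segment.

```python
def resource_path(group, version, resource, namespace=None, name=None):
    """Assemble the API server URL path for a resource.
    group == "" means the core (legacy) group."""
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace:                       # namespaced resources get a namespace segment
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name:                            # a single object vs. a collection
        parts.append(name)
    return "/".join(parts)

# Deployments in the production namespace (apps group)
assert resource_path("apps", "v1", "deployments", "production") == \
    "/apis/apps/v1/namespaces/production/deployments"
# A single core-group Pod
assert resource_path("", "v1", "pods", "default", "nginx") == \
    "/api/v1/namespaces/default/pods/nginx"
# Cluster-scoped resource: no namespace segment
assert resource_path("", "v1", "nodes") == "/api/v1/nodes"
```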
Resource Versioning
Each API group version indicates its stability level. This is not just a label — it carries guarantees about backward compatibility and how long the version will be supported.
| Version Label | Stability | Meaning |
|---|---|---|
| v1, v2 | GA (stable) | Fully supported, backward-compatible changes only. Safe for production. |
| v1beta1, v2beta1 | Beta | Well-tested; usually enabled by default, but details may change between releases. |
| v1alpha1 | Alpha | Disabled by default. May be removed without notice. Never use in production. |
You specify the API version in every manifest's apiVersion field. For core group resources, you write apiVersion: v1. For named groups, you write apiVersion: group/version — for example, apiVersion: apps/v1 for a Deployment.
Exploring the API with kubectl
You do not need to memorize the full API table. Two commands let you explore it interactively. kubectl api-resources lists every resource type available in your cluster, and kubectl explain gives you inline documentation for any resource or field.
# List all available resource types with their API group and kind
kubectl api-resources
# Filter to only namespaced resources
kubectl api-resources --namespaced=true
# Filter by API group
kubectl api-resources --api-group=apps
# Show supported API versions
kubectl api-versions
The output of api-resources shows each resource's short name (useful for saving keystrokes), the API group it belongs to, whether it is namespaced, and its Kind. For example, you will see that deploy is the short name for deployments in the apps group.
# Get top-level documentation for a resource
kubectl explain deployment
# Drill into a specific field path
kubectl explain deployment.spec.strategy
# Show the full recursive structure
kubectl explain deployment.spec --recursive
# Specify an API version explicitly
kubectl explain cronjob --api-version=batch/v1
Use kubectl explain as your first reference instead of searching the web. It reflects the exact API version running on your cluster, which may differ from online documentation. Combine it with --recursive to see the full field hierarchy, then drill into specific paths for descriptions and types.
Kubeconfig, Contexts, and Namespace Management
Before you can talk to a cluster, kubectl needs to know which cluster, which user credentials, and which namespace to use by default. All of this is stored in the kubeconfig file, typically located at ~/.kube/config. The file has three main sections: clusters, users, and contexts.
Kubeconfig Structure
apiVersion: v1
kind: Config
current-context: dev-cluster
clusters:
- name: dev-cluster
  cluster:
    server: https://dev-k8s.example.com:6443
    certificate-authority-data: LS0tLS1CRUd...
- name: prod-cluster
  cluster:
    server: https://prod-k8s.example.com:6443
    certificate-authority-data: LS0tLS1CRUd...
users:
- name: dev-admin
  user:
    client-certificate-data: LS0tLS1CRUd...
    client-key-data: LS0tLS1CRUd...
- name: prod-reader
  user:
    token: eyJhbGciOiJSUzI1NiIs...
contexts:
- name: dev-cluster
  context:
    cluster: dev-cluster
    user: dev-admin
    namespace: default
- name: prod-readonly
  context:
    cluster: prod-cluster
    user: prod-reader
    namespace: monitoring
A context is a triple of (cluster, user, namespace). It gives you a named shortcut so you can switch between environments with a single command instead of specifying credentials and endpoints every time.
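Resolving that triple is mechanical, which this Python sketch makes concrete. It is an illustrative model of the lookup (real parsing is done by kubectl/client-go); the nested-dict shape and `resolve` helper are invented for the example.

```python
# Simplified in-memory model of the kubeconfig shown above
kubeconfig = {
    "current-context": "dev-cluster",
    "clusters": {"dev-cluster": {"server": "https://dev-k8s.example.com:6443"},
                 "prod-cluster": {"server": "https://prod-k8s.example.com:6443"}},
    "users": {"dev-admin": {"auth": "client-cert"},
              "prod-reader": {"auth": "token"}},
    "contexts": {"dev-cluster": {"cluster": "dev-cluster", "user": "dev-admin",
                                 "namespace": "default"},
                 "prod-readonly": {"cluster": "prod-cluster", "user": "prod-reader",
                                   "namespace": "monitoring"}},
}

def resolve(cfg, context_name=None):
    """Return the (server, user, namespace) triple a command would use."""
    ctx = cfg["contexts"][context_name or cfg["current-context"]]
    return (cfg["clusters"][ctx["cluster"]]["server"],
            cfg["users"][ctx["user"]],
            ctx.get("namespace", "default"))   # fall back to 'default'

server, user, ns = resolve(kubeconfig)                   # uses current-context
assert ns == "default" and "dev-k8s" in server
server, user, ns = resolve(kubeconfig, "prod-readonly")  # explicit context switch
assert ns == "monitoring" and "prod-k8s" in server
```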
Context Switching
# See all available contexts
kubectl config get-contexts
# Check which context is currently active
kubectl config current-context
# Switch to a different context
kubectl config use-context prod-readonly
# Set the default namespace for the current context
kubectl config set-context --current --namespace=kube-system
# View the full merged kubeconfig
kubectl config view
Namespace Management
Namespaces provide scope for resource names and are the primary boundary for access control policies and resource quotas. Every namespaced command defaults to the namespace set in your current context (or default if none is set). You can override this on any command with the -n flag.
# List all namespaces
kubectl get namespaces
# Run a command in a specific namespace
kubectl get pods -n kube-system
# Query across ALL namespaces
kubectl get pods --all-namespaces
kubectl get pods -A # shorthand
# Create a new namespace
kubectl create namespace staging
Essential kubectl Commands by Workflow
Rather than listing commands alphabetically, let's organize them by the workflow stages you will actually use every day: create, inspect, update, delete, and debug.
Creating Resources
There are two primary ways to create resources: kubectl apply (declarative) and kubectl create (imperative). The declarative approach with apply is strongly preferred for anything beyond quick experiments, because it tracks the desired state and enables repeatable updates. The next section on the declarative model covers the "why" in depth.
# Declarative: apply a manifest file (create or update)
kubectl apply -f deployment.yaml
# Apply an entire directory of manifests
kubectl apply -f ./k8s/
# Apply from a URL
kubectl apply -f https://raw.githubusercontent.com/org/repo/main/manifests/app.yaml
# Dry-run to validate without creating anything
kubectl apply -f deployment.yaml --dry-run=client
kubectl apply -f deployment.yaml --dry-run=server # server-side validation
# Imperative: create resources directly (errors if resource already exists)
kubectl create deployment nginx --image=nginx:1.27 --replicas=3
kubectl create service clusterip nginx --tcp=80:80
kubectl create configmap app-config --from-file=config.properties
kubectl create secret generic db-creds --from-literal=password=s3cret
kubectl create fails if the resource already exists. kubectl apply creates the resource if it does not exist and updates it if it does. In production workflows, apply is almost always what you want because it is idempotent — you can run it repeatedly with the same result.
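The idempotency contrast can be modeled against a toy in-memory "cluster". This Python sketch is illustrative only (the real API server also performs three-way merges, covered later): `create` rejects duplicates, while `apply` upserts and is safe to repeat.

```python
cluster = {}   # (kind, name) -> object; stands in for etcd

def create(obj):
    """Imperative create: errors if the object already exists."""
    key = (obj["kind"], obj["metadata"]["name"])
    if key in cluster:
        raise RuntimeError(f"AlreadyExists: {key}")
    cluster[key] = obj

def apply(obj):
    """Declarative apply: create if missing, merge/update if present."""
    key = (obj["kind"], obj["metadata"]["name"])
    cluster[key] = {**cluster.get(key, {}), **obj}

deploy = {"kind": "Deployment", "metadata": {"name": "nginx"}, "replicas": 3}
apply(deploy)
apply(deploy)                       # safe to repeat -- idempotent
apply({**deploy, "replicas": 5})    # update in place
assert cluster[("Deployment", "nginx")]["replicas"] == 5

create({"kind": "Service", "metadata": {"name": "nginx"}})
raised = False
try:
    create({"kind": "Service", "metadata": {"name": "nginx"}})
except RuntimeError:
    raised = True
assert raised                       # create is not idempotent
```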
Inspecting Resources
Inspection is where you will spend most of your kubectl time. The core commands are get (list resources), describe (detailed view with events), and logs (container output). Each serves a different purpose in your investigation workflow.
# List pods with status information
kubectl get pods
kubectl get pods -o wide # includes Node, IP, and nominated node
kubectl get pods --show-labels # show all labels as columns
kubectl get pods -l app=frontend # filter by label selector
kubectl get pods --field-selector=status.phase=Running
# List multiple resource types at once
kubectl get pods,services,deployments
# Describe gives you the full picture: spec, status, conditions, and events
kubectl describe pod nginx-7d6b8f5c9-x2k4m
kubectl describe deployment frontend
kubectl describe node worker-01
# Container logs
kubectl logs nginx-7d6b8f5c9-x2k4m
kubectl logs nginx-7d6b8f5c9-x2k4m -c sidecar # specific container
kubectl logs nginx-7d6b8f5c9-x2k4m --previous # logs from crashed container
kubectl logs -f nginx-7d6b8f5c9-x2k4m # follow (stream) logs
kubectl logs -l app=frontend --all-containers # logs from all matching pods
# Execute commands inside a running container
kubectl exec nginx-7d6b8f5c9-x2k4m -- ls /etc/nginx
kubectl exec -it nginx-7d6b8f5c9-x2k4m -- /bin/sh # interactive shell
The difference between get and describe is critical. get gives you a tabular summary — perfect for scanning across many resources. describe gives you the full story of a single resource, including the Events section at the bottom, which is often the first place you find the reason for a failure (image pull errors, insufficient resources, failed health checks).
Updating Resources
For updates, the declarative approach is to modify your YAML file and run kubectl apply again. But kubectl also provides imperative update commands that are useful for quick changes during development or incident response.
# Edit a resource in your default editor (opens YAML in $EDITOR)
kubectl edit deployment frontend
# Patch a resource with a JSON merge patch
kubectl patch deployment frontend \
-p '{"spec":{"replicas":5}}'
# Strategic merge patch (default) — merges arrays intelligently
kubectl patch deployment frontend --type=strategic \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"app","image":"app:v2.1"}]}}}}'
# JSON patch — precise array operations
kubectl patch deployment frontend --type=json \
-p '[{"op":"replace","path":"/spec/replicas","value":5}]'
# Quick replica scaling
kubectl scale deployment frontend --replicas=10
# Update the container image directly
kubectl set image deployment/frontend app=myapp:v2.1
# Add or update labels and annotations
kubectl label pods nginx-7d6b8f5c9-x2k4m env=staging
kubectl annotate deployment frontend team=platform
Deleting Resources
Deletion can be targeted or broad. By default, kubectl delete waits for the resource to be fully terminated (which respects graceful shutdown periods). Use --wait=false to return immediately.
# Delete a specific resource
kubectl delete pod nginx-7d6b8f5c9-x2k4m
kubectl delete deployment frontend
# Delete using the same manifest file used to create it
kubectl delete -f deployment.yaml
# Delete by label selector
kubectl delete pods -l app=frontend
# Delete all resources of a type in a namespace
kubectl delete pods --all -n staging
# Force-delete a stuck pod (skips graceful shutdown)
kubectl delete pod stuck-pod --grace-period=0 --force
# Delete a namespace and ALL resources within it
kubectl delete namespace staging
Deleting a Pod directly does not prevent it from being recreated. If the Pod is managed by a Deployment or ReplicaSet, the controller will immediately create a replacement. To remove Pods permanently, delete the controller (Deployment, StatefulSet, etc.) that owns them.
Debugging
When things go wrong — and they will — kubectl provides tools to dig deeper without modifying production workloads. Port-forwarding and proxying let you reach cluster-internal endpoints from your local machine. The debug command lets you attach diagnostic containers to running pods.
# Forward a local port to a pod (access pod's port 8080 at localhost:8080)
kubectl port-forward pod/frontend-5d7b8c9f-k2m4x 8080:8080
# Forward to a service (kubectl picks a backing pod)
kubectl port-forward svc/frontend 8080:80
# Start a proxy to the entire API server (accessible at localhost:8001)
kubectl proxy
# Attach a debug container to a running pod (ephemeral container)
kubectl debug -it frontend-5d7b8c9f-k2m4x --image=busybox --target=app
# Create a copy of a pod with a debug container for troubleshooting
kubectl debug frontend-5d7b8c9f-k2m4x -it --copy-to=debug-pod --image=ubuntu
# Debug a node directly (creates a privileged pod on the node)
kubectl debug node/worker-01 -it --image=ubuntu
Output Formatting
The default table output is fine for quick glances, but real-world workflows often require structured data — piping to jq, feeding into scripts, or extracting a single value for a CI variable. The -o flag controls the output format.
Standard Formats
# Full YAML output — great for seeing the complete resource spec
kubectl get deployment frontend -o yaml
# Full JSON output — ideal for piping to jq
kubectl get pods -o json | jq '.items[].metadata.name'
# Wide table with extra columns (Node, IP, etc.)
kubectl get pods -o wide
# Just the resource names (useful for scripting)
kubectl get pods -o name
JSONPath — Extracting Specific Fields
JSONPath expressions let you extract exactly the fields you need without an external JSON processor. The syntax uses curly braces around the path expression, starting from the root object.
# Get the IP address of a specific pod
kubectl get pod nginx-7d6b8f5c9-x2k4m \
-o jsonpath='{.status.podIP}'
# List all container images across all pods
kubectl get pods \
-o jsonpath='{.items[*].spec.containers[*].image}'
# Format with newlines for readability
kubectl get pods \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
# Get the node port of a NodePort service
kubectl get svc frontend \
-o jsonpath='{.spec.ports[0].nodePort}'
# Extract a secret value (base64-encoded)
kubectl get secret db-creds \
-o jsonpath='{.data.password}' | base64 -d
Custom Columns — Readable Tables from Arbitrary Fields
When you want a table format but with different columns than the default, custom-columns lets you define exactly what to show. Each column has a header and a JSONPath expression.
# Custom columns: pod name, node, status, and restart count
kubectl get pods -o custom-columns=\
'NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount'
# Deployments with image and replica info
kubectl get deployments -o custom-columns=\
'NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas'
Real-World Workflow Patterns
Individual commands are building blocks. In practice, you chain them together to accomplish real tasks. Here are patterns you will use regularly.
Investigating a Failing Deployment
When a deployment is not reaching its desired replica count, work through the resource hierarchy: Deployment → ReplicaSet → Pod → Container logs → Events.
# Step 1: Check the deployment status
kubectl get deployment frontend
kubectl describe deployment frontend | tail -20
# Step 2: Check the ReplicaSet it created
kubectl get replicasets -l app=frontend
# Step 3: Find pods that are not Ready
kubectl get pods -l app=frontend --field-selector=status.phase!=Running
# Step 4: Describe the failing pod for events and conditions
kubectl describe pod frontend-5d7b8c9f-crash1
# Step 5: Check container logs (including previous crashed container)
kubectl logs frontend-5d7b8c9f-crash1 --previous
# Step 6: Check cluster-wide events sorted by time
kubectl get events --sort-by='.lastTimestamp' -n default | tail -20
Quick Rollback After a Bad Deploy
# View rollout history
kubectl rollout history deployment/frontend
# Check the current rollout status
kubectl rollout status deployment/frontend
# Roll back to the previous version
kubectl rollout undo deployment/frontend
# Roll back to a specific revision
kubectl rollout undo deployment/frontend --to-revision=3
# Pause a rollout (prevent further changes from progressing)
kubectl rollout pause deployment/frontend
# Resume a paused rollout
kubectl rollout resume deployment/frontend
Generating Manifests Without Memorizing YAML
One of the most practical kubectl tricks: use imperative commands with --dry-run=client -o yaml to generate manifest templates. This saves you from writing YAML from scratch and ensures the basic structure is correct.
# Generate a Deployment manifest
kubectl create deployment web --image=nginx:1.27 --replicas=3 \
--dry-run=client -o yaml > deployment.yaml
# Generate a Service manifest
kubectl create service clusterip web --tcp=80:8080 \
--dry-run=client -o yaml > service.yaml
# Generate a Job manifest
# (kubectl flags must come BEFORE the "--" separator; everything after
# "--" becomes the container command)
kubectl create job backup --image=postgres:16 \
--dry-run=client -o yaml \
-- pg_dump -h db myapp > job.yaml
# Generate a CronJob manifest
kubectl create cronjob cleanup --image=busybox \
--schedule="0 2 * * *" --dry-run=client -o yaml \
-- /bin/sh -c "echo cleaning up" > cronjob.yaml
Working Across Multiple Resources
# Get a comprehensive view of everything in a namespace
kubectl get all -n production
# Watch resources in real time (updates as state changes)
kubectl get pods -w
# Get top resource consumers (requires metrics-server)
kubectl top pods --sort-by=memory
kubectl top nodes
# Diff local changes against the live cluster state before applying
kubectl diff -f deployment.yaml
# Note: kubectl diff exits non-zero when differences exist, so chaining it
# with && would run apply only when there is nothing to change. Review the
# diff output, then apply as a separate step:
kubectl apply -f deployment.yaml
# Bulk operations: restart all pods in a deployment (rolling restart)
kubectl rollout restart deployment/frontend
# Copy files to/from a container
kubectl cp ./local-file.txt frontend-pod:/tmp/remote-file.txt
kubectl cp frontend-pod:/var/log/app.log ./app.log
Use kubectl diff -f before every kubectl apply in production. It shows you exactly what will change, just like git diff before a commit. This habit prevents surprises — especially when multiple people manage the same cluster.
Quick Reference: Common Short Names and Flags
Typing full resource names gets tedious. Kubernetes provides short names for frequently used resources, and kubectl supports shorthand flags. Here are the ones you will use most.
| Resource | Short Name | Example |
|---|---|---|
| pods | po | kubectl get po |
| services | svc | kubectl get svc |
| deployments | deploy | kubectl get deploy |
| replicasets | rs | kubectl get rs |
| configmaps | cm | kubectl get cm |
| namespaces | ns | kubectl get ns |
| nodes | no | kubectl get no |
| persistentvolumeclaims | pvc | kubectl get pvc |
| persistentvolumes | pv | kubectl get pv |
| serviceaccounts | sa | kubectl get sa |
| horizontalpodautoscalers | hpa | kubectl get hpa |
| ingresses | ing | kubectl get ing |
| Flag | Short | Purpose |
|---|---|---|
| --namespace | -n | Target a specific namespace |
| --all-namespaces | -A | Query all namespaces |
| --selector | -l | Filter by label selector |
| --output | -o | Set output format |
| --follow | -f | Stream logs in real time |
| --watch | -w | Watch for resource changes |
| --container | -c | Target a specific container in a pod |
The Declarative Model and Reconciliation Loops
Most infrastructure tools ask you to write a script of commands: "create this, then modify that, then delete the other thing." Kubernetes works differently. You hand the cluster a document that says what you want, and a fleet of controllers figure out how to make it happen. This is the declarative model, and it is the single most important design decision in Kubernetes.
The declarative approach turns cluster management into a convergence problem. You declare a desired state — three replicas of your web server, a load balancer on port 443, a 10Gi persistent volume — and the system continuously works to make reality match that declaration. If something drifts (a pod crashes, a node goes down, someone manually deletes a resource), the controllers detect the gap and close it automatically.
Imperative vs. Declarative: Two Mental Models
Kubernetes supports both imperative and declarative workflows through kubectl, but they represent fundamentally different ways of thinking about infrastructure. Understanding the distinction is critical to using Kubernetes effectively.
Imperative commands tell Kubernetes exactly what to do right now. You issue a verb — create, scale, delete — and the action executes immediately. There is no record of intent beyond the resulting object in the cluster.
# Imperative: issue commands one at a time
kubectl create deployment nginx --image=nginx:1.27
kubectl scale deployment nginx --replicas=3
kubectl set image deployment/nginx nginx=nginx:1.28
kubectl delete deployment nginx
Declarative configuration tells Kubernetes what the end state should look like. You write a manifest (a YAML file), and kubectl apply sends it to the API server. Kubernetes computes the difference between what exists and what you declared, then makes only the necessary changes.
# deployment.yaml — the desired state
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.28
# Declarative: apply the manifest — works for create AND update
kubectl apply -f deployment.yaml
| Aspect | Imperative (kubectl create, kubectl run) | Declarative (kubectl apply) |
|---|---|---|
| Mental model | "Do this action now" | "Make reality match this file" |
| Idempotency | No — running create twice errors | Yes — apply is safe to run repeatedly |
| Change tracking | None — history lives in your terminal | Manifests are version-controlled in Git |
| Collaboration | Hard to share or review shell commands | Pull requests on YAML files |
| Drift correction | Manual — you must re-run commands | Re-apply the manifest to restore desired state |
| Best for | Quick experiments, one-off debugging | Production workloads, GitOps, automation |
kubectl apply stores the last-applied configuration as an annotation on the object (kubectl.kubernetes.io/last-applied-configuration). This is how it computes three-way diffs — comparing your new manifest, the last-applied manifest, and the live object in the cluster. If you mix create and apply on the same resource, this annotation is missing and diff calculations can produce unexpected results.
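The three-way logic is worth seeing concretely. This Python sketch is simplified to flat dicts (the real apply machinery merges nested objects and lists field-by-field), but it shows the three inputs and the two key outcomes: fields the user deleted from the manifest are removed, while fields owned by other actors survive.

```python
def three_way_apply(last_applied, live, new_manifest):
    """Return the updated live object after applying new_manifest."""
    result = dict(live)
    # Fields in the new manifest win
    result.update(new_manifest)
    # Fields present in last-applied but absent from the new manifest were
    # deliberately deleted by the user -- remove them from the live object.
    for field in last_applied:
        if field not in new_manifest:
            result.pop(field, None)
    # Fields set by other actors (in live, never in our manifests) survive.
    return result

last = {"replicas": 3, "image": "app:v1"}
live = {"replicas": 3, "image": "app:v1", "clusterIP": "10.96.0.12"}  # set by the cluster
new  = {"image": "app:v2"}                                            # user removed replicas

merged = three_way_apply(last, live, new)
assert merged["image"] == "app:v2"          # updated field applied
assert "replicas" not in merged             # removed because the user deleted it
assert merged["clusterIP"] == "10.96.0.12"  # cluster-owned field preserved
```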
The Reconciliation Loop: Observe \u2192 Diff \u2192 Act
Declaring desired state is only half the story. The other half is reconciliation — the continuous process by which Kubernetes controllers drive actual state toward desired state. Every controller in Kubernetes follows the same three-phase pattern.
- Observe — Read the current state of the world from the API server (the objects the controller is responsible for, plus any dependent resources).
- Diff — Compare the observed state against the desired state declared in the object's spec.
- Act — If there is a gap, take the minimum set of actions to close it: create missing resources, update drifted ones, or delete ones that should no longer exist. Then update the object's status subresource to reflect the new reality.
This loop runs continuously. It is not a one-shot process triggered by your kubectl apply. After you apply a Deployment with replicas: 3, the Deployment controller does not just create three pods and walk away. It keeps watching. If a pod dies at 3 AM, the controller notices and creates a replacement — no human intervention required.
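The observe/diff/act cycle is compact enough to sketch (a toy model of the ReplicaSet controller's replica-count reconciliation, not real controller code):

```python
def reconcile(desired_replicas: int, pods: list) -> list:
    """One pass of a toy ReplicaSet reconciliation: observe, diff, act."""
    observed = len(pods)                      # Observe: current state
    diff = desired_replicas - observed        # Diff: gap vs. desired state
    if diff > 0:                              # Act: create missing pods
        pods = pods + [f"pod-new-{i}" for i in range(diff)]
    elif diff < 0:                            # Act: delete surplus pods
        pods = pods[:desired_replicas]
    return pods

pods = ["pod-a", "pod-b", "pod-c"]
pods.remove("pod-b")          # simulate a crash at 3 AM
pods = reconcile(3, pods)     # the next pass notices and closes the gap
print(pods)  # ['pod-a', 'pod-c', 'pod-new-0']
```

In a real controller this function runs forever, triggered by watch events rather than called once; the point is that each pass only ever closes the gap between observed and desired counts.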
stateDiagram-v2
[*] --> DesiredStateDeclared: User applies manifest
DesiredStateDeclared --> Watching: Controller starts watch
Watching --> DriftDetected: Actual ≠ Desired
Watching --> Watching: Actual = Desired (no-op)
DriftDetected --> Acting: Create / Update / Delete resources
Acting --> StatusUpdate: Write status to API server
StatusUpdate --> Watching: Resume watch loop
The reconciliation loop. Controllers cycle between watching for changes and acting on detected drift. The loop never terminates \u2014 it runs for the entire lifetime of the cluster.
Watches: Event-Driven, Not Polling
A naive implementation of the reconciliation loop would poll the API server on a timer — "check every 5 seconds if anything changed." Kubernetes is smarter than that. Controllers use the API server's watch mechanism, a long-lived HTTP connection (using chunked transfer encoding) that streams change events in real time.
When a controller starts, it performs a list operation to get the full current state, then opens a watch starting from the resource version returned by the list. The API server pushes ADDED, MODIFIED, and DELETED events as they happen. This design means controllers react to changes within milliseconds, not on some polling interval.
# You can see the watch mechanism in action with kubectl
# This opens a long-lived connection and streams events as pods change
kubectl get pods --watch
# Raw API equivalent — note the watch=true query parameter
kubectl get pods --watch --v=8 2>&1 | grep "GET"
# Output shows: GET https://<api-server>/api/v1/namespaces/default/pods?watch=true
Internally, the client-go library (used by all built-in controllers and most custom ones) wraps this list+watch pattern in a component called an Informer. An Informer maintains a local in-memory cache of the resources it watches, so controllers can read state without hitting the API server on every reconciliation. It also de-duplicates events and feeds them into a work queue, ensuring each resource is reconciled at most once at a time.
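The list+watch bookkeeping can be modeled in a few lines (a toy cache, standing in for client-go's real Informer machinery):

```python
class TinyInformer:
    """Toy model of an Informer: seed a local cache from a list call,
    then apply watch events (ADDED / MODIFIED / DELETED) to it."""

    def __init__(self, initial: dict, resource_version: int):
        self.cache = dict(initial)          # local cache from the list
        self.resource_version = resource_version

    def handle(self, kind: str, rv: int, name: str, obj=None):
        if rv <= self.resource_version:     # stale or duplicate event: skip
            return
        if kind == "DELETED":
            self.cache.pop(name, None)
        else:                               # ADDED or MODIFIED
            self.cache[name] = obj
        self.resource_version = rv          # resume point after a disconnect

# List returns one pod at resourceVersion 10; the watch streams from there.
inf = TinyInformer({"pod-a": {"phase": "Running"}}, resource_version=10)
inf.handle("ADDED", 11, "pod-b", {"phase": "Pending"})
inf.handle("DELETED", 12, "pod-a")
print(sorted(inf.cache))  # ['pod-b']
```

Tracking the last-seen resource version is what lets a controller resume a dropped watch without re-listing the whole world, and the local cache is why controllers can reconcile without hammering the API server.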
Eventual Consistency: The Kubernetes Bargain
Kubernetes is an eventually consistent system. When you kubectl apply a manifest, the API server acknowledges the write to etcd and returns immediately. But the actual resources — the pods, the endpoints, the iptables rules — are not yet created. That work happens asynchronously as each controller picks up the change and acts on it.
This means there is always a window of time where actual state does not match desired state. A Deployment can exist in etcd before its pods are running. A Service can be created before its endpoints are populated. The gap is normally small (seconds), but under heavy load or during node failures, convergence can take longer.
This is a deliberate tradeoff. Kubernetes optimizes for resilience and scalability over immediate consistency. In a system managing thousands of nodes and tens of thousands of pods, attempting synchronous, transactional updates across all components would be impossibly slow and fragile. Instead, each controller independently converges its slice of the world, and the system as a whole settles into the desired state.
Do not write scripts that kubectl apply a resource and immediately assume it is ready. Use kubectl wait or kubectl rollout status to block until the system has converged. For example: kubectl rollout status deployment/nginx --timeout=120s.
Drift Detection and Self-Healing in Practice
The real power of the reconciliation model becomes visible when things go wrong. The following scenarios demonstrate how Kubernetes detects drift between desired and actual state, then automatically heals.
Scenario 1: Pod Deletion — The Controller Recreates It
Start with a Deployment running three replicas. Manually delete one pod. The ReplicaSet controller notices the count dropped to 2, compares it against the desired count of 3, and immediately creates a replacement.
# Create a deployment with 3 replicas
kubectl apply -f deployment.yaml
# Confirm 3 pods are running
kubectl get pods -l app=nginx
# NAME READY STATUS AGE
# nginx-7c5ddbdf54-abc12 1/1 Running 30s
# nginx-7c5ddbdf54-def34 1/1 Running 30s
# nginx-7c5ddbdf54-ghi56 1/1 Running 30s
# Delete one pod manually — simulating a crash
kubectl delete pod nginx-7c5ddbdf54-abc12
# Within seconds, a replacement appears
kubectl get pods -l app=nginx
# NAME READY STATUS AGE
# nginx-7c5ddbdf54-def34 1/1 Running 60s
# nginx-7c5ddbdf54-ghi56 1/1 Running 60s
# nginx-7c5ddbdf54-xyz99 1/1 Running 3s <-- new pod
Scenario 2: Imperative Drift — Declarative Correction
What happens if someone uses kubectl scale to manually change the replica count on a live Deployment? The cluster state drifts from what the manifest declares. The next kubectl apply of the original manifest detects the difference and corrects it. This is why imperative edits on declaratively managed resources are dangerous — the next apply will overwrite them.
# Someone scales the deployment imperatively
kubectl scale deployment/nginx --replicas=5
# The cluster now has 5 pods. But the manifest says 3.
# Re-apply the manifest to restore desired state:
kubectl apply -f deployment.yaml
# The controller terminates the 2 extra pods
kubectl get pods -l app=nginx
# Back to 3 pods
Scenario 3: Node Failure — Rescheduling to Healthy Nodes
When a node becomes unreachable, the node controller marks it NotReady after a timeout (default: 40 seconds) and taints it with node.kubernetes.io/unreachable. Once the pods' toleration of that taint expires (default: 5 minutes), the pods bound to the node are evicted and begin terminating. The ReplicaSet controller sees replica counts drop below the desired number and creates replacements, which the scheduler places on healthy nodes — all without human intervention.
# Monitor the self-healing process during a node failure
kubectl get nodes --watch
# NAME STATUS ROLES AGE VERSION
# node-1 Ready <none> 10d v1.30.2
# node-2 NotReady <none> 10d v1.30.2 <-- node went down
kubectl get pods -l app=nginx -o wide --watch
# Pods on node-2 enter Terminating status
# New pods are scheduled on node-1 (or other healthy nodes)
# Check events to see the reconciliation in action
kubectl get events --sort-by=.lastTimestamp | tail -5
# 2m Normal SuccessfulCreate replicaset/nginx-7c5ddbdf54 Created pod: nginx-7c5ddbdf54-new01
Multiple Controllers, One Convergence
A single kubectl apply of a Deployment triggers a cascade of controllers, each reconciling its own piece of the puzzle. No single controller handles the full lifecycle — they compose together through the objects they create and own.
- The Deployment controller sees the new Deployment object. It creates (or updates) a ReplicaSet to match the pod template.
- The ReplicaSet controller sees the new ReplicaSet. It counts existing pods matching the selector. If the count is too low, it creates new Pod objects.
- The Scheduler sees unscheduled Pods (those with no nodeName). It assigns each pod to a node by writing the nodeName field.
- The kubelet on each assigned node watches for pods bound to it. It pulls the container image and starts the container through the container runtime.
- The Endpoint/EndpointSlice controller sees pods become Ready. It adds their IPs to the corresponding Service's endpoint list.
- kube-proxy (or the CNI plugin) picks up the new endpoints and updates the node\u2019s network rules so traffic can reach the new pods.
Each controller independently converges its own resources. The Deployment controller does not know or care about iptables rules. The scheduler does not know about Services. Yet the final result — a fully networked, load-balanced set of running containers — emerges from their independent reconciliation loops.
Use kubectl get events --sort-by=.lastTimestamp to trace the reconciliation cascade after applying a manifest. You will see events from the Deployment controller, the ReplicaSet controller, the scheduler, and the kubelet — in order — as each one does its part.
Why This Matters
The declarative model with reconciliation loops gives you three properties that are hard to achieve with imperative scripting:
- Self-healing. The system continuously repairs itself. Crashed pods are restarted, failed nodes are drained, and missing resources are recreated. You do not need monitoring scripts or cron jobs to do this — it is built into the architecture.
- Idempotent operations. Running kubectl apply -f manifest.yaml ten times has the same result as running it once. This makes automation safe and CI/CD pipelines reliable. You never have to worry about "is this resource already created?"
- Git as the source of truth. Because the desired state is a YAML file, it lives in version control. You get history, code review, rollback, and audit trails for free. This is the foundation of the GitOps workflow pattern covered later in this guide.
Pods — The Smallest Deployable Unit
A Pod is the smallest object you can create in Kubernetes. It represents a group of one or more containers that share a network identity, an IPC namespace, and optionally a set of storage volumes. When Kubernetes schedules work onto a node, it schedules entire Pods — never individual containers.
Most Pods you encounter in the wild contain a single container. But the abstraction exists as a group because some workloads genuinely need tightly coupled processes: a web server and a log-shipping sidecar, or an application container alongside a service-mesh proxy. These containers are co-located, co-scheduled, and run in a shared context.
What Containers in a Pod Share
Containers inside the same Pod are not isolated from each other the way separate Pods are. They share three key namespaces, which has concrete implications for how you design your applications.
| Shared Resource | What This Means | Practical Implication |
|---|---|---|
| Network namespace | All containers share the same IP address and port space | Containers talk to each other via localhost. Two containers cannot bind to the same port. |
| IPC namespace | Containers can use System V IPC or POSIX shared memory | Useful for legacy apps that communicate via shared memory segments or semaphores. |
| Volumes | Volumes defined at the Pod level can be mounted into any container | A sidecar can read log files written by the main container to a shared emptyDir volume. |
Each Pod gets its own unique cluster IP address. Other Pods in the cluster communicate with it using that IP, regardless of how many containers are running inside. From the network’s perspective, a Pod is a single host.
Containers within a Pod share the network and IPC namespaces, but they each have their own filesystem. A file written inside one container is not visible to another unless they both mount the same volume at the relevant path.
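As an illustration of the shared-volume pattern, here is a minimal sketch of a Pod whose sidecar tails a log file written by the main container (the image names and paths are assumptions, not from this guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  volumes:
    - name: logs
      emptyDir: {}                  # shared scratch space, lives as long as the Pod
  containers:
    - name: app
      image: my-app:1.0             # hypothetical application image
      volumeMounts:
        - name: logs
          mountPath: /var/log/app   # the app writes its log files here
    - name: log-shipper
      image: busybox
      command: ["sh", "-c", "tail -F /logs/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /logs          # same volume, mounted at a different path
```

Note that the two containers mount the same volume at different paths — the volume is shared, but each container's filesystem layout is its own.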
Pod Lifecycle
Every Pod moves through a defined set of phases from creation to termination. Understanding these phases is critical for debugging — when a Pod is stuck, its phase tells you where in the lifecycle it stalled.
stateDiagram-v2
[*] --> Pending
Pending --> Running : Scheduled, images pulled, containers starting
Running --> Succeeded : All containers exit with code 0
Running --> Failed : All containers terminated, at least one with non-zero code
Pending --> Failed : Image pull fails or cannot be scheduled
Running --> Unknown : Node becomes unreachable
Unknown --> Running : Node reconnects
Unknown --> Failed : Node stays unreachable beyond timeout
Succeeded --> [*]
Failed --> [*]
Pod Phases
| Phase | Description |
|---|---|
| Pending | The Pod is accepted by the cluster but one or more containers are not yet running. This includes time spent waiting for scheduling, pulling images, and initializing init containers. |
| Running | The Pod has been bound to a node and all containers have been created. At least one container is running, starting, or restarting. |
| Succeeded | All containers in the Pod terminated with exit code 0 and will not be restarted. Typical for Jobs and batch workloads. |
| Failed | All containers have terminated, and at least one exited with a non-zero exit code or was terminated by the system. |
| Unknown | The state of the Pod cannot be determined, usually because the kubelet on the node has stopped reporting. |
Pod Conditions
While the phase gives you a high-level summary, conditions provide a more granular picture. Each condition is a boolean with a reason and a timestamp. You can inspect them with kubectl describe pod or query them in JSON output.
| Condition | Meaning |
|---|---|
| PodScheduled | The Pod has been assigned to a node. |
| Initialized | All init containers have completed successfully. |
| ContainersReady | All containers in the Pod are ready (passed readiness probes). |
| Ready | The Pod is ready to serve traffic and should be added to Service endpoints. |
Container States
Each container inside a Pod has its own state, independent of the Pod phase. Kubernetes tracks three possible states per container:
- Waiting — The container is not yet running. The reason field tells you why: ContainerCreating, ImagePullBackOff, CrashLoopBackOff, etc.
- Running — The container is executing. The startedAt timestamp tells you when it began.
- Terminated — The container finished execution. You get the exitCode, reason, and both startedAt and finishedAt timestamps.
Restart Policies
The restartPolicy field in the Pod spec controls what the kubelet does when a container exits. It applies to all containers in the Pod — you cannot set different restart policies for different containers. The default is Always.
| Policy | Behavior | Best For |
|---|---|---|
| Always | Restart the container regardless of exit code. Uses exponential backoff (10s, 20s, 40s, … up to 5 min). | Long-running services managed by Deployments, StatefulSets, DaemonSets. |
| OnFailure | Restart only if the container exits with a non-zero code. Containers that exit 0 stay terminated. | Jobs and batch tasks that should retry on failure but stop on success. |
| Never | Never restart, regardless of exit code. | One-shot diagnostic or debug Pods where you want to inspect the exit state. |
When a container repeatedly crashes, the kubelet delays restarts using exponential backoff. This is the infamous CrashLoopBackOff status — it means the kubelet is waiting before retrying. The delay caps at 5 minutes.
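The delay schedule is easy to compute. A sketch of the doubling-with-cap behavior described above (the 10-second base and 5-minute cap follow the text; treat exact kubelet timings as approximate):

```python
def crashloop_delay(restart_count: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Seconds the kubelet waits before the Nth restart of a crashing
    container: 10s, 20s, 40s, ... doubling until it caps at 5 minutes."""
    return min(base * (2 ** restart_count), cap)

delays = [crashloop_delay(n) for n in range(7)]
print(delays)  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

This is why a Pod stuck in CrashLoopBackOff can sit idle for up to five minutes between attempts: the kubelet is deliberately waiting, not hung.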
Anatomy of a Pod Spec
A Pod spec is more than just a list of containers. It includes scheduling constraints, security settings, volume definitions, and metadata that controllers and the scheduler use to make decisions. Here is an annotated example that shows the most commonly used fields.
apiVersion: v1
kind: Pod
metadata:
name: my-app
namespace: default
labels:
app: my-app
version: v1
spec:
serviceAccountName: my-app-sa # Identity for RBAC and API access
restartPolicy: Always
# --- Scheduling constraints ---
nodeSelector:
disk: ssd # Only nodes with this label
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule" # Tolerate the gpu taint
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: my-app
topologyKey: kubernetes.io/hostname # Spread across nodes
# --- Security ---
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
# --- Volumes ---
volumes:
- name: config-vol
configMap:
name: my-app-config
- name: data-vol
persistentVolumeClaim:
claimName: my-app-data
- name: tmp
emptyDir: {}
# --- Containers ---
containers:
- name: app
image: my-app:1.2.0
ports:
- containerPort: 8080
env:
- name: DB_HOST
valueFrom:
secretKeyRef:
name: db-credentials
key: host
volumeMounts:
- name: config-vol
mountPath: /etc/app/config
readOnly: true
- name: data-vol
mountPath: /var/data
- name: tmp
mountPath: /tmp
resources:
requests:
cpu: "250m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 15
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 5
A few things to note in this spec. The serviceAccountName determines what Kubernetes API permissions the Pod has — always set it explicitly rather than relying on the default service account. The securityContext at the Pod level applies to all containers; you can override it per-container for finer control. Scheduling fields like nodeSelector, tolerations, and affinity work together to control where the Pod lands.
Creating and Inspecting Pods
You will rarely create bare Pods in production — Deployments and Jobs handle that for you. But during development and debugging, working directly with Pods is essential. Here are the most common operations.
Create a Pod Imperatively
The fastest way to spin up a Pod for quick testing. The --rm and -it flags make it behave like a temporary interactive shell.
# Run a one-off Pod (deleted automatically when you exit)
kubectl run tmp-shell --rm -it --image=alpine -- /bin/sh
# Run a Pod with a specific command
kubectl run dns-test --image=busybox --restart=Never -- nslookup kubernetes.default
# Generate a Pod YAML without creating it (dry-run)
kubectl run my-app --image=nginx:1.25 --port=80 --dry-run=client -o yaml > pod.yaml
Create a Pod Declaratively
# Apply a Pod manifest
kubectl apply -f pod.yaml
# Watch the Pod come up
kubectl get pod my-app -w
Inspecting Pod State
kubectl get shows a summary. kubectl describe gives you the full picture: events, conditions, container states, and reasons for failures. When something goes wrong, describe is always your first stop.
# Summary with status, restarts, age, IP, and node
kubectl get pod my-app -o wide
# Full details: events, conditions, container states
kubectl describe pod my-app
# Logs from the primary container
kubectl logs my-app
# Logs from a specific container in a multi-container Pod
kubectl logs my-app -c sidecar
# Stream logs in real time
kubectl logs my-app -f --tail=50
# Check container states as JSON
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[*].state}'
Debugging Pods
When a container is running but misbehaving, kubectl exec lets you open a shell inside it. This works only if the container has a shell binary — distroless and minimal images often do not. That is where ephemeral debug containers come in.
Exec into a Running Container
# Open an interactive shell
kubectl exec -it my-app -- /bin/sh
# Run a one-off command
kubectl exec my-app -- cat /etc/app/config/settings.yaml
# Exec into a specific container in a multi-container Pod
kubectl exec -it my-app -c sidecar -- /bin/bash
Ephemeral Debug Containers
Introduced as stable in Kubernetes 1.25, ephemeral containers solve the “distroless debugging” problem. You inject a temporary container — with the tools you need — into a running Pod. The debug container shares the Pod’s network namespace, so it sees the same network interfaces; with the --target flag it can also share a container’s process namespace, letting it inspect that container’s processes and filesystem through /proc.
# Attach a debug container that shares the target container's PID namespace
kubectl debug -it my-app --image=busybox --target=app
# Debug a CrashLoopBackOff Pod by copying it with a different command
kubectl debug my-app -it --copy-to=my-app-debug --container=app -- /bin/sh
# Debug at the node level (creates a privileged Pod on the node)
kubectl debug node/worker-1 -it --image=ubuntu
The --target flag in kubectl debug makes the ephemeral container share the process namespace of the specified container. This means you can use ps aux from the debug container to see the target’s processes, inspect /proc, and even attach a debugger.
Deleting Pods
When you delete a Pod, Kubernetes sends a SIGTERM to every container and waits up to the terminationGracePeriodSeconds (default: 30 seconds) for a clean shutdown. If containers are still running after that deadline, it sends SIGKILL.
# Graceful delete (waits for terminationGracePeriodSeconds)
kubectl delete pod my-app
# Force delete (skip the grace period — use sparingly)
kubectl delete pod my-app --grace-period=0 --force
# Delete all Pods matching a label
kubectl delete pods -l app=my-app
# Delete a Pod from a YAML file
kubectl delete -f pod.yaml
Force-deleting a Pod (--grace-period=0 --force) does not wait for confirmation that the containers have actually stopped. The Pod object is removed from etcd immediately, but the container processes may still be running on the node. Use this only when a Pod is stuck in Terminating and you are certain the node is unreachable or the workload is safe to abandon.
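On the application side, graceful deletion depends on the process actually handling SIGTERM. A minimal sketch (the flag-polling loop is illustrative; real servers use their framework's shutdown hook):

```python
import os
import signal
import time

shutting_down = False

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first. Stop accepting new work, finish
    # in-flight requests, and exit before terminationGracePeriodSeconds
    # elapses; otherwise SIGKILL ends the process abruptly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, on_sigterm)

os.kill(os.getpid(), signal.SIGTERM)   # simulate the kubelet's signal
while not shutting_down:               # toy main loop: poll the flag
    time.sleep(0.05)
print("clean shutdown")
```

A process that ignores SIGTERM always burns the full grace period and then dies to SIGKILL, which slows every rollout and risks dropping in-flight requests.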
ReplicaSets and Deployments — Managing Pod Replicas
In the previous section, you learned that a Pod is the smallest deployable unit in Kubernetes. But here is the thing: you almost never create Pods directly. A bare Pod has no self-healing capability — if it crashes, gets evicted, or its node goes down, it is gone forever. Nothing recreates it.
This is where ReplicaSets and Deployments come in. They form a two-layer abstraction that keeps your application running at the desired scale and gives you controlled rollout and rollback capabilities. Understanding how these layers interact is essential to operating anything in production on Kubernetes.
ReplicaSets: The Replication Engine
A ReplicaSet has one job: ensure that a specified number of identical Pod replicas are running at all times. It does this through a continuous reconciliation loop. The ReplicaSet controller watches the cluster state, compares the current number of matching Pods to the desired count in the spec, and creates or deletes Pods to close the gap.
The "matching" part is critical. A ReplicaSet finds its Pods using a label selector, not by tracking specific Pod names. It looks for Pods whose labels match its spec.selector, counts them, and acts accordingly. This decoupled relationship means a ReplicaSet can even adopt pre-existing Pods if their labels match.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: web-frontend
spec:
replicas: 3
selector:
matchLabels:
app: web-frontend
template:
metadata:
labels:
app: web-frontend
spec:
containers:
- name: nginx
image: nginx:1.25
ports:
- containerPort: 80
Three key fields define a ReplicaSet: spec.replicas sets the desired Pod count, spec.selector defines which labels to match, and spec.template provides the Pod blueprint for creating new replicas. The labels in template.metadata.labels must match the selector — Kubernetes validates this and rejects the object if they do not align.
While ReplicaSets handle replication, they have no concept of updates or rollbacks. If you change the Pod template on a ReplicaSet, existing Pods are not replaced — only new Pods created after the change use the updated template. This is why Deployments exist.
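The label matching that drives this adoption behavior is a plain subset test. A sketch covering matchLabels only (real selectors also support matchExpressions):

```python
def matches(selector: dict, labels: dict) -> bool:
    """True if every key/value in the matchLabels selector appears in
    the pod's labels. Extra labels on the pod do not matter."""
    return all(labels.get(k) == v for k, v in selector.items())

selector = {"app": "web-frontend"}
pods = [
    {"name": "web-1", "labels": {"app": "web-frontend", "tier": "fe"}},
    {"name": "db-1",  "labels": {"app": "postgres"}},
    {"name": "web-2", "labels": {"app": "web-frontend"}},
]
# The ReplicaSet "owns" every pod whose labels satisfy the selector,
# regardless of who created the pod or what it is named.
owned = [p["name"] for p in pods if matches(selector, p["labels"])]
print(owned)  # ['web-1', 'web-2']
```

This subset semantics is also why a pre-existing pod with matching labels gets adopted: the controller counts it like any other replica.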
Deployments: The Abstraction You Actually Use
A Deployment is a higher-level controller that manages ReplicaSets on your behalf. When you create a Deployment, it creates a ReplicaSet. When you update the Deployment's Pod template (for example, changing the container image), the Deployment creates a new ReplicaSet with the updated template and gradually scales it up while scaling the old one down. This is how rolling updates work.
The ownership chain is: Deployment → ReplicaSet → Pods. The Deployment never manages Pods directly. It manages ReplicaSets, and each ReplicaSet manages its own set of Pods. Old ReplicaSets are kept around (scaled to zero) so that rollbacks can reactivate them instantly.
graph TD
D["Deployment<br/><strong>web-app</strong>"]
RS1["ReplicaSet · revision 1<br/><em>nginx:1.24</em><br/>replicas: 0"]
RS2["ReplicaSet · revision 2<br/><em>nginx:1.25</em><br/>replicas: 3"]
P1["Pod web-app-a7x2k"]
P2["Pod web-app-b9m3p"]
P3["Pod web-app-c4n8q"]
D --> RS1
D --> RS2
RS2 --> P1
RS2 --> P2
RS2 --> P3
style D fill:#326ce5,stroke:#fff,color:#fff
style RS1 fill:#666,stroke:#999,color:#ccc
style RS2 fill:#1a73e8,stroke:#fff,color:#fff
style P1 fill:#4caf50,stroke:#fff,color:#fff
style P2 fill:#4caf50,stroke:#fff,color:#fff
style P3 fill:#4caf50,stroke:#fff,color:#fff
Deployment Spec Anatomy
A Deployment spec is structurally similar to a ReplicaSet — it has replicas, selector, and template — but adds fields for update strategy, revision history, and rollout behavior. Here is a production-ready Deployment with every important field annotated.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
labels:
app: web-app
spec:
replicas: 3
revisionHistoryLimit: 5 # Keep 5 old ReplicaSets for rollback
selector:
matchLabels:
app: web-app
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # At most 1 extra Pod during update
maxUnavailable: 0 # Never drop below desired count
template:
metadata:
labels:
app: web-app
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.4.1
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
Deployment Strategies
Kubernetes offers two built-in strategies for how a Deployment replaces old Pods with new ones. Choosing the right one depends on whether your application can tolerate running two versions simultaneously.
RollingUpdate (Default)
The RollingUpdate strategy incrementally replaces Pods. Kubernetes creates new Pods from the updated ReplicaSet and terminates old Pods from the previous ReplicaSet in a controlled sequence. At no point does the total count of available Pods drop to zero (assuming sane settings), which makes this the standard choice for stateless web services.
Two parameters control the pace of the rollout:
| Parameter | What It Controls | Default | Example |
|---|---|---|---|
| maxSurge | How many extra Pods (above desired count) can exist during the update | 25% | With 4 replicas and maxSurge: 1, up to 5 Pods can run simultaneously |
| maxUnavailable | How many Pods can be unavailable (not ready) during the update | 25% | With 4 replicas and maxUnavailable: 0, all 4 must stay ready throughout |
Setting maxSurge: 1 and maxUnavailable: 0 is the most conservative configuration — Kubernetes spins up one new Pod, waits for it to become ready, then terminates one old Pod, and repeats. This is slower but guarantees zero capacity loss. Setting both to higher values (or percentages) speeds up the rollout at the cost of temporary over-provisioning or reduced capacity.
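The resulting pod-count bounds are simple arithmetic. A sketch (following the Kubernetes rounding convention: a percentage maxSurge rounds up, a percentage maxUnavailable rounds down):

```python
import math

def rollout_bounds(replicas: int, max_surge, max_unavailable):
    """(max total pods, min available pods) during a RollingUpdate.
    Values may be absolute ints or percentage strings like '25%'."""
    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            frac = int(value[:-1]) / 100 * replicas
            return math.ceil(frac) if round_up else math.floor(frac)
        return value
    surge = resolve(max_surge, round_up=True)          # surge rounds up
    unavailable = resolve(max_unavailable, round_up=False)  # rounds down
    return replicas + surge, replicas - unavailable

# Defaults (25% / 25%) with 4 replicas: between 3 and 5 pods exist
print(rollout_bounds(4, "25%", "25%"))   # (5, 3)
# The conservative setting: never below 3 ready, at most 4 total
print(rollout_bounds(3, 1, 0))           # (4, 3)
```

The asymmetric rounding is deliberate: it prevents a small percentage on a small Deployment from resolving both knobs to zero, which would deadlock the rollout.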
Recreate
The Recreate strategy is simpler and more brutal: it kills all existing Pods before creating any new ones. This means a period of complete downtime during the update. Use this only when your application cannot run two versions side-by-side — for example, when a new version requires an exclusive database migration lock or uses an incompatible on-disk format.
spec:
strategy:
type: Recreate # No rollingUpdate block needed
Rollout Management
Every change to a Deployment’s .spec.template triggers a new rollout. Kubernetes tracks these rollouts as revisions, and you can inspect, pause, and undo them using kubectl rollout commands. Changes to fields outside the template (like replicas) do not trigger a new rollout.
Watching a Rollout in Progress
# Trigger a rollout by updating the image
kubectl set image deployment/web-app web-app=myregistry/web-app:v2.5.0
# Watch the rollout progress in real-time
kubectl rollout status deployment/web-app
# Waiting for deployment "web-app" rollout to finish:
# 1 out of 3 new replicas have been updated...
# 2 out of 3 new replicas have been updated...
# 3 out of 3 new replicas have been updated...
# deployment "web-app" successfully rolled out
Viewing Rollout History
Each rollout creates a new revision. You can inspect the full history and see what changed in each revision. The --record flag once captured the command that triggered each revision, but it is deprecated; set the kubernetes.io/change-cause annotation on the Deployment instead to populate the CHANGE-CAUSE column.
# List all revisions
kubectl rollout history deployment/web-app
# REVISION CHANGE-CAUSE
# 1 kubectl apply --filename=web-app.yaml
# 2 kubectl set image deployment/web-app web-app=myregistry/web-app:v2.4.1
# 3 kubectl set image deployment/web-app web-app=myregistry/web-app:v2.5.0
# Inspect a specific revision’s Pod template
kubectl rollout history deployment/web-app --revision=2
Rolling Back
When a new release is broken, you can instantly revert to a previous revision. The rollback does not re-deploy the old image from scratch — it reactivates the old ReplicaSet (which was kept around at zero replicas) and scales it back up. This makes rollbacks fast.
# Roll back to the previous revision
kubectl rollout undo deployment/web-app
# Roll back to a specific revision
kubectl rollout undo deployment/web-app --to-revision=2
# Verify the rollback completed
kubectl rollout status deployment/web-app
Revision History Limits
Every old ReplicaSet consumes a small amount of etcd storage and clutters kubectl get rs output. The spec.revisionHistoryLimit field controls how many old ReplicaSets are retained. The default is 10. Setting it to 0 disables rollback entirely because no old ReplicaSets are preserved. A value between 3 and 10 is practical for most workloads.
Scaling
Scaling a Deployment changes its spec.replicas field, and the Deployment controller propagates the new count to the active ReplicaSet. You can do this imperatively with kubectl scale or declaratively by updating the manifest. Since scaling does not change the Pod template, it does not trigger a new rollout or create a new revision.
# Imperative scaling
kubectl scale deployment/web-app --replicas=5
# Verify the new replica count
kubectl get deployment web-app
# NAME READY UP-TO-DATE AVAILABLE AGE
# web-app 5/5 5 5 12d
For declarative scaling, update spec.replicas in your YAML and apply it. For automatic scaling based on CPU or memory utilization, use a HorizontalPodAutoscaler (HPA) — which adjusts the replica count on the Deployment dynamically. When using an HPA, you typically omit spec.replicas from your manifest to avoid conflicts between the HPA controller and your declared value.
# Create an HPA targeting 70% CPU utilization, scaling between 3 and 10 replicas
kubectl autoscale deployment/web-app --min=3 --max=10 --cpu-percent=70
Real-World Patterns: Blue-Green and Canary Deployments
The built-in RollingUpdate strategy works well for most scenarios, but some teams need finer control over traffic shifting during deploys. Kubernetes does not have first-class "blue-green" or "canary" resources, but you can implement both patterns using native primitives: Deployments, Services, and label selectors.
Blue-Green Deployments
In a blue-green deployment, you run two complete environments side by side — "blue" (current) and "green" (new). All traffic goes to one environment at a time. Once the green environment is validated, you switch traffic instantly by updating the Service selector. If something goes wrong, you switch back.
# deployment-blue.yaml — the currently active version
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app-blue
spec:
replicas: 3
selector:
matchLabels:
app: web-app
version: blue
template:
metadata:
labels:
app: web-app
version: blue
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.4.1
---
# deployment-green.yaml — the new version, deployed alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app-green
spec:
replicas: 3
selector:
matchLabels:
app: web-app
version: green
template:
metadata:
labels:
app: web-app
version: green
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.5.0
---
# service.yaml — traffic switch controlled by the 'version' label
apiVersion: v1
kind: Service
metadata:
name: web-app
spec:
selector:
app: web-app
version: blue # ← Change to "green" to switch traffic
ports:
- port: 80
targetPort: 8080
The cutover is a single operation: patch the Service selector from blue to green. Traffic shifts immediately because the Service updates its endpoint list. To roll back, patch the selector back to blue. Once you are confident the green version is stable, delete the blue Deployment.
# Switch traffic from blue to green
kubectl patch svc web-app -p '{"spec":{"selector":{"version":"green"}}}'
# Rollback: switch traffic back to blue
kubectl patch svc web-app -p '{"spec":{"selector":{"version":"blue"}}}'
Canary Deployments
A canary deployment routes a small percentage of traffic to the new version while the majority continues hitting the stable version. In native Kubernetes, you achieve this by running two Deployments with a shared label that the Service selects on. Traffic is distributed across all matching Pods proportionally.
# Stable deployment — 9 replicas serving ~90% of traffic
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app-stable
spec:
replicas: 9
selector:
matchLabels:
app: web-app
track: stable
template:
metadata:
labels:
app: web-app # ← Shared label the Service selects on
track: stable
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.4.1
---
# Canary deployment — 1 replica serving ~10% of traffic
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app-canary
spec:
replicas: 1
selector:
matchLabels:
app: web-app
track: canary
template:
metadata:
labels:
app: web-app # ← Same shared label
track: canary
spec:
containers:
- name: web-app
image: myregistry/web-app:v2.5.0
---
# Service selects ONLY on app: web-app — matches both Deployments
apiVersion: v1
kind: Service
metadata:
name: web-app
spec:
selector:
app: web-app # ← Does NOT include 'track', so both match
ports:
- port: 80
targetPort: 8080
Since the Service selects on app: web-app only, it targets all 10 Pods (9 stable + 1 canary). Kubernetes distributes requests roughly evenly across endpoints, so about 10% of traffic hits the canary. You control the ratio by adjusting replica counts. To promote the canary, update the stable Deployment’s image and scale the canary back to zero.
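The replica arithmetic is simple enough to sanity-check in a shell sketch, using the 9/1 split from the manifests above:

```shell
# Sketch: the canary's approximate traffic share is just its Pod ratio
stable=9
canary=1
share=$((100 * canary / (stable + canary)))
echo "canary receives ~${share}% of traffic"
```

To move to a roughly 25% canary, you would scale the canary Deployment to 3 replicas against 9 stable ones, always adjusting counts rather than Service configuration.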
Native Kubernetes canary deployments control traffic split by Pod ratio, which is coarse-grained. If you need header-based routing, precise percentage-based traffic splitting (like 1% to canary), or automatic rollback on error rate thresholds, look into a service mesh like Istio or a progressive delivery controller like Argo Rollouts or Flagger.
Putting It All Together
Here is a summary of the relationship between the resources covered in this section and when to reach for each pattern:
| Resource / Pattern | What It Does | When to Use |
|---|---|---|
| ReplicaSet | Maintains a fixed number of Pod replicas via label selectors | Never directly — always through a Deployment |
| Deployment (RollingUpdate) | Incrementally rolls out new Pod versions with zero downtime | Default choice for stateless applications |
| Deployment (Recreate) | Terminates all old Pods before creating new ones | Applications that cannot run two versions simultaneously |
| Blue-Green | Instant traffic cutover between two full environments | When you need instant rollback and can afford double the resources |
| Canary | Routes a fraction of traffic to the new version for validation | High-risk changes where you want gradual exposure |
In the next section, you will learn about StatefulSets — the controller designed for applications that need stable network identities and persistent storage, where the interchangeable nature of Deployment-managed Pods breaks down.
StatefulSets — Stable Identity for Stateful Applications
Not every workload is stateless. Databases need persistent disk, message queues need stable network addresses, and distributed stores need to know which replica they are. A Deployment treats every Pod as interchangeable — when a Pod dies, it gets a new random name, a new IP, and its local storage vanishes. That model breaks fundamentally for anything that stores data or participates in a cluster protocol.
StatefulSets solve this by giving each Pod a persistent identity that survives rescheduling. The Pod name, its DNS hostname, and its storage volume are all stable. Pod mysql-0 is always mysql-0, whether it runs on node A today or node B tomorrow.
Why Deployments Fall Short for Stateful Workloads
Consider a 3-node PostgreSQL cluster running in streaming replication. The primary (node 0) writes the WAL, and replicas (nodes 1 and 2) connect to the primary by hostname to stream changes. If you use a Deployment, you immediately hit three problems: Pod names are random (so replicas cannot find the primary), storage is ephemeral (so a restarted Pod loses all data), and Pods are created and destroyed in any order (so the primary might not exist when replicas start).
| Behavior | Deployment | StatefulSet |
|---|---|---|
| Pod naming | Random suffix (app-7b9f4d) | Ordinal index (app-0, app-1) |
| Network identity | Changes on every reschedule | Stable DNS per Pod via headless Service |
| Persistent storage | Shared PVC or ephemeral | Dedicated PVC per Pod via volumeClaimTemplates |
| Startup/shutdown order | All Pods created in parallel | Sequential by ordinal (0 → 1 → 2) |
| Rolling update | Any order, surge allowed | Reverse ordinal order (2 → 1 → 0) |
| Pod replacement | New Pod with new identity | Replacement reuses same name and PVC |
The Three Guarantees of StatefulSets
1. Stable Network Identity
Each Pod in a StatefulSet gets a predictable hostname following the pattern <statefulset-name>-<ordinal>. A StatefulSet named postgres with 3 replicas creates Pods named postgres-0, postgres-1, and postgres-2. When postgres-1 is rescheduled to a different node, it keeps the name postgres-1 — and critically, its DNS record points to the new IP.
This identity is paired with a headless Service (covered below) to give each Pod a stable DNS name like postgres-0.postgres-headless.default.svc.cluster.local. Other components can connect to a specific replica by name, which is exactly what replication protocols require.
2. Stable Persistent Storage
StatefulSets use volumeClaimTemplates to create a dedicated PersistentVolumeClaim for each Pod. When the StatefulSet creates postgres-0, it also creates a PVC named data-postgres-0. If postgres-0 is deleted and recreated, the new Pod reattaches to the same data-postgres-0 PVC — your data survives.
3. Ordered Deployment and Scaling
By default, StatefulSets create Pods sequentially. Pod 0 must be Running and Ready before Pod 1 starts. This ordering matters for leader-election or primary-replica setups where the first node must initialize the cluster before replicas join. Scaling down reverses the order: the highest-ordinal Pod is terminated first.
flowchart LR
subgraph "StatefulSet: postgres (replicas=3)"
direction LR
P0["postgres-0<br/><small>Created first</small>"] -->|"Ready ✓"| P1["postgres-1<br/><small>Created second</small>"]
P1 -->|"Ready ✓"| P2["postgres-2<br/><small>Created third</small>"]
end
subgraph "PersistentVolumeClaims"
PVC0["data-postgres-0<br/>10Gi"]
PVC1["data-postgres-1<br/>10Gi"]
PVC2["data-postgres-2<br/>10Gi"]
end
P0 -.->|"bound"| PVC0
P1 -.->|"bound"| PVC1
P2 -.->|"bound"| PVC2
The ordering guarantee applies to startup and shutdown, not to steady-state operation. Once all Pods are Running and Ready, they operate independently. If postgres-1 crashes, only postgres-1 is restarted — it does not wait for postgres-0 or affect postgres-2.
Headless Services — Direct Pod DNS
A normal ClusterIP Service gives you a single virtual IP that load-balances across all Pods. That is useless when you need to connect to a specific Pod — you cannot tell a replica "connect to the primary at this VIP" because the VIP might route to any backend Pod.
A headless Service is a Service with clusterIP: None. Instead of creating a single VIP, it creates individual DNS A records for each Pod. This gives every StatefulSet Pod a predictable, resolvable hostname.
apiVersion: v1
kind: Service
metadata:
name: postgres-headless
labels:
app: postgres
spec:
clusterIP: None # This makes it headless
selector:
app: postgres
ports:
- port: 5432
targetPort: 5432
With this Service in place, each Pod gets a DNS entry following the pattern:
# Pattern:
# <pod-name>.<headless-service>.<namespace>.svc.cluster.local
postgres-0.postgres-headless.default.svc.cluster.local → 10.244.1.5
postgres-1.postgres-headless.default.svc.cluster.local → 10.244.2.8
postgres-2.postgres-headless.default.svc.cluster.local → 10.244.3.3
The StatefulSet's spec.serviceName field must reference this headless Service. This is how Kubernetes knows which Service to register Pod DNS records with. Without this link, your Pods will not get individual DNS entries.
Anatomy of a StatefulSet Manifest
A StatefulSet spec looks similar to a Deployment, with two important additions: the serviceName field that links to the headless Service, and the volumeClaimTemplates section that defines per-Pod storage. Here is the complete structure for a 3-replica PostgreSQL cluster:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: postgres-headless # Must match the headless Service name
replicas: 3
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:16
ports:
- containerPort: 5432
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
volumeClaimTemplates: # One PVC per Pod
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: standard
resources:
requests:
storage: 10Gi
The volumeClaimTemplates section acts as a PVC template. For each Pod ordinal, Kubernetes creates a PVC named <template-name>-<statefulset-name>-<ordinal> — in this case data-postgres-0, data-postgres-1, and data-postgres-2. These PVCs are not deleted when Pods are rescheduled, which is exactly what you want for a database.
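The naming convention is deterministic, so the PVC names this manifest will generate can be enumerated ahead of time; a quick sketch:

```shell
# Sketch: enumerate the PVC names a 3-replica StatefulSet produces,
# following the <template-name>-<statefulset-name>-<ordinal> convention
tmpl="data"; sts="postgres"; replicas=3
names=""
for i in $(seq 0 $((replicas - 1))); do
  names="${names}${tmpl}-${sts}-${i} "
done
echo "$names"
```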
Practical Example: PostgreSQL with Streaming Replication
A real-world replicated PostgreSQL setup needs each Pod to know its role (primary vs. replica) and configure itself accordingly. The ordinal index makes this straightforward: Pod 0 initializes as primary, all others connect to Pod 0 as replicas.
Start by creating the Secret and headless Service:
# Create the password secret
kubectl create secret generic postgres-secret \
--from-literal=password='S3cur3P@ss' \
--from-literal=replication-password='R3plP@ss'
# Apply the headless Service
kubectl apply -f postgres-headless-svc.yaml
Next, use a ConfigMap to hold an init script that detects whether the Pod is the primary or a replica based on its hostname ordinal:
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-init
data:
setup.sh: |
#!/bin/bash
set -e
# Extract ordinal from hostname (e.g., "postgres-0" -> "0")
ORDINAL="${HOSTNAME##*-}"
if [ "$ORDINAL" -eq 0 ]; then
echo "Initializing as PRIMARY"
# Primary-specific config: enable replication slots, WAL shipping
cat >> /var/lib/postgresql/data/pgdata/postgresql.conf <<CONF
wal_level = replica
max_wal_senders = 5
max_replication_slots = 5
CONF
else
echo "Initializing as REPLICA — connecting to postgres-0"
# Use pg_basebackup to clone data from the primary
pg_basebackup -h postgres-0.postgres-headless \
-U replicator -D /var/lib/postgresql/data/pgdata \
-Fp -Xs -R
fi
The key insight is the HOSTNAME environment variable. Kubernetes sets it to the Pod name (postgres-0, postgres-1, etc.), so you can parse the ordinal and branch your initialization logic. Replicas connect to postgres-0.postgres-headless — the stable DNS name — regardless of what node the primary is running on.
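The parameter expansion used in setup.sh can be tried outside the cluster; here is the same idiom run against sample hostnames:

```shell
# Sketch: the ordinal-extraction idiom from setup.sh, run against
# sample hostnames instead of the real $HOSTNAME set by Kubernetes
for host in postgres-0 postgres-1 postgres-2; do
  ordinal="${host##*-}"   # strip everything through the last '-'
  if [ "$ordinal" -eq 0 ]; then
    echo "$host: PRIMARY"
  else
    echo "$host: REPLICA (follows postgres-0)"
  fi
done
```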
Pod Management Policy
The default podManagementPolicy is OrderedReady, which enforces the sequential startup and shutdown behavior described above. For workloads that do not need ordering — for example, a distributed cache where all nodes are peers — you can set it to Parallel to launch all Pods at once.
spec:
podManagementPolicy: Parallel # All Pods start/stop simultaneously
replicas: 5
serviceName: memcached-headless
| Policy | Startup | Shutdown | Use case |
|---|---|---|---|
| OrderedReady | Sequential (0 → 1 → 2) | Reverse (2 → 1 → 0) | Databases, leader-follower systems |
| Parallel | All Pods at once | All Pods at once | Peer-to-peer caches, stateful workers |
Update Strategies
StatefulSets support two update strategies that control how Pods are replaced when you change the Pod template (e.g., updating the container image).
RollingUpdate (default)
Pods are updated one at a time in reverse ordinal order (highest ordinal first). This is intentional — in most primary/replica setups, replicas have higher ordinals and should be updated before the primary. Each Pod must become Running and Ready before the next one is updated.
RollingUpdate with Partition (Canary Deploys)
The partition parameter lets you perform a staged rollout. Only Pods with an ordinal greater than or equal to the partition value are updated. This is a powerful canary mechanism — you can test a new image on the last replica before rolling it out to the entire set.
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 2 # Only pods with ordinal >= 2 are updated
# Step 1: Set partition to 2, update the image
# Only postgres-2 gets the new image
kubectl patch statefulset postgres -p \
'{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/postgres postgres=postgres:17
# Step 2: Verify postgres-2 is healthy with the new version
kubectl get pod postgres-2 -o jsonpath='{.spec.containers[0].image}'
# Step 3: Lower partition to 0 to roll out to all Pods
kubectl patch statefulset postgres -p \
'{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
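The rule the controller applies is a simple ordinal comparison; a sketch of which Pods a given partition value touches, assuming the 3-replica set above:

```shell
# Sketch: with partition=2 on a 3-replica set, only ordinals >= 2 update
replicas=3; partition=2
updated=""
for i in $(seq 0 $((replicas - 1))); do
  if [ "$i" -ge "$partition" ]; then
    updated="${updated}postgres-${i} "
  fi
done
echo "updated: ${updated}"
```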
OnDelete
With OnDelete, Kubernetes does not automatically update Pods when you change the template. Instead, you manually delete Pods one by one, and each replacement is created with the new template. This gives you full control over the update pace and order, which can be critical for databases where you must manually verify replication health between each step.
spec:
updateStrategy:
type: OnDelete
Common Pitfalls
PVC Retention: Volumes Outlive the StatefulSet
When you delete a StatefulSet or scale it down, the PVCs are not automatically deleted. This is a safety feature — you do not want to lose a database volume because someone ran kubectl delete statefulset. However, it means orphaned PVCs accumulate and continue to consume storage (and cost money) unless you clean them up.
# List PVCs that belonged to a deleted StatefulSet
kubectl get pvc -l app=postgres
# Manually delete orphaned PVCs after confirming data is backed up
kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2
You can configure automatic PVC cleanup using persistentVolumeClaimRetentionPolicy (beta and enabled by default since Kubernetes 1.27, stable since 1.32):
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Delete # Delete PVCs when StatefulSet is deleted
whenScaled: Retain # Keep PVCs when scaling down (safe default)
Ordering Dependencies and Stuck Rollouts
With OrderedReady, if Pod 0 fails to become Ready, the entire StatefulSet is stuck — Pods 1 and 2 will never be created. This cascading failure is one of the most common StatefulSet debugging scenarios. Always check the Pod 0 logs and events first when a StatefulSet is not scaling up.
# Diagnose a stuck StatefulSet
kubectl describe statefulset postgres
kubectl get pods -l app=postgres
kubectl logs postgres-0 --previous # Check crash logs
kubectl describe pod postgres-0 # Check events for scheduling/volume issues
Scaling Down Safely
Scaling down removes Pods in reverse ordinal order (highest first), which is usually the safest order for primary-replica setups. However, Kubernetes does not understand your application's replication state. If postgres-2 holds data that has not been replicated elsewhere, scaling from 3 to 2 can cause data loss. Always verify your application's replication health before scaling down.
Do not treat deleting a StatefulSet as a complete cleanup. Deleting it (even with --cascade=foreground) removes the Pods but leaves the PVCs behind; note also that deletion does not guarantee ordered Pod termination, so scale to 0 first if your application needs an orderly shutdown. If you later recreate the StatefulSet, the old PVCs will be reattached, and they may contain stale data that conflicts with a fresh initialization. Either delete the PVCs explicitly after a backup, or use persistentVolumeClaimRetentionPolicy.
When to Use Operators Instead
Raw StatefulSets give you identity, storage, and ordering — but they do not understand your application. They cannot perform automated failover when a PostgreSQL primary dies, rebalance partitions in a Kafka cluster, or trigger a backup before scaling down. For production databases and complex distributed systems, a Kubernetes Operator wraps the StatefulSet with application-specific automation.
| Scenario | Recommendation |
|---|---|
| Learning / dev environments | Raw StatefulSet is fine — keep it simple |
| Single-node database (no replication) | StatefulSet with 1 replica works well |
| Replicated database in production | Use an Operator (CloudNativePG, Zalando Postgres Operator, MySQL Operator) |
| Kafka, Elasticsearch, Cassandra | Use the vendor-supported Operator for lifecycle management |
| Custom stateful app with simple needs | StatefulSet + init containers + readiness probes |
Even when using an Operator, understanding StatefulSets is essential. Operators build on top of StatefulSets, and debugging a misbehaving Operator almost always means inspecting the underlying StatefulSet, Pods, and PVCs. The concepts from this section — stable identity, headless Services, ordered scaling, and PVC lifecycle — remain the foundation.
DaemonSets, Jobs, and CronJobs — Specialized Workload Controllers
Deployments and StatefulSets cover most workloads, but not all work fits the “run N replicas forever” model. Some pods need to run on every node — log shippers, monitoring agents, network plugins. Others need to run once, finish, and exit. Still others need to fire on a schedule, like a nightly database backup. Kubernetes provides three specialized controllers for exactly these patterns: DaemonSets, Jobs, and CronJobs.
Each controller builds on the same reconciliation loop that powers Deployments, but they differ in when pods are created, where they are placed, and what happens when a pod finishes. Understanding these differences lets you model your entire workload landscape without resorting to external cron daemons, systemd services, or manual node-by-node deployments.
DaemonSets — One Pod Per Node
A DaemonSet ensures that every node (or a targeted subset) runs exactly one copy of a pod. When a new node joins the cluster, the DaemonSet controller automatically schedules a pod on it. When a node is removed, the pod is garbage-collected. You never specify a replica count — the node count is the replica count.
This makes DaemonSets the natural choice for node-level infrastructure that must be present everywhere in the cluster. The most common use cases include:
- Log collectors — Fluentd, Fluent Bit, or Logstash agents that tail container logs from /var/log and forward them to a central store.
- Monitoring agents — Prometheus node-exporter for hardware and OS metrics, Datadog agents, or New Relic infrastructure agents.
- CNI / network plugins — Calico, Cilium, and AWS VPC CNI all run as DaemonSets to configure networking on each node.
- Storage daemons — Ceph OSDs, Longhorn engine processes, or CSI node plugins that need access to the host’s disk and device tree.
Basic DaemonSet YAML
The following manifest deploys Fluent Bit as a log collector across every node. Notice there is no replicas field — the DaemonSet controller handles pod placement automatically.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: logging
labels:
app: fluent-bit
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
containers:
- name: fluent-bit
image: fluent/fluent-bit:3.1
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
memory: 256Mi
volumeMounts:
- name: varlog
mountPath: /var/log
readOnly: true
- name: containers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: varlog
hostPath:
path: /var/log
- name: containers
hostPath:
path: /var/lib/docker/containers
Targeting a Subset of Nodes
You don’t always want a pod on every node. Maybe your GPU monitoring agent should only run on nodes with GPUs, or a storage daemon should only run on nodes with SSDs. Use nodeSelector or nodeAffinity inside the pod template to restrict placement. The DaemonSet controller will only schedule pods on nodes that match.
# Inside spec.template.spec:
nodeSelector:
  hardware-type: gpu   # custom label, e.g. kubectl label node <node> hardware-type=gpu
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
Tolerations are equally important. Control-plane nodes are typically tainted with node-role.kubernetes.io/control-plane:NoSchedule. If you need your DaemonSet to run on control-plane nodes too (for example, a CNI plugin), you must add a matching toleration. Without it, the DaemonSet simply skips those nodes.
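For completeness, a toleration of that control-plane taint might look like this inside spec.template.spec (a sketch; clusters older than Kubernetes 1.24 may use the legacy node-role.kubernetes.io/master key instead):

```yaml
# Allows DaemonSet pods to schedule onto tainted control-plane nodes
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule
```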
Update Strategies
DaemonSets support two update strategies that control how pods are replaced when you change the pod template:
| Strategy | Behavior | Best For |
|---|---|---|
| RollingUpdate | Automatically terminates old pods and creates new ones, one node at a time. Respects maxUnavailable (default: 1) and maxSurge (default: 0). | Most workloads — log collectors, monitoring agents, anything that can tolerate brief gaps on individual nodes. |
| OnDelete | Does nothing automatically. New pods are created only when you manually delete existing ones. | Critical infrastructure like CNI plugins or storage daemons where you want full manual control over the rollout sequence. |
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # update one node at a time
maxSurge: 0 # don't create new pod before old is terminated
DaemonSets do keep revision history via ControllerRevisions, so a bad image pushed through RollingUpdate can be reverted with kubectl rollout undo daemonset/<name>. Even so, for mission-critical DaemonSets (especially CNI plugins), OnDelete gives you the safety of testing the new version node by node before deleting old pods.
Jobs — Run-to-Completion Workloads
A Job creates one or more pods and ensures that a specified number of them successfully terminate. Once the required number of completions is reached, the Job is considered complete. Unlike Deployments, a Job never restarts a pod that exited with status 0 — the work is done.
Jobs are the right tool for batch processing, data migrations, one-off scripts, report generation, and any task that has a clear beginning and end. The controller handles retries on failure, parallelism, deadlines, and cleanup — you declare the desired behavior, and Kubernetes orchestrates it.
Basic Job YAML
This Job runs a database migration. It creates a single pod, allows up to 4 retries on failure, and enforces a 10-minute deadline.
apiVersion: batch/v1
kind: Job
metadata:
name: db-migrate
spec:
backoffLimit: 4
activeDeadlineSeconds: 600
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: Never
containers:
- name: migrate
image: myapp/migrations:v2.5.0
command: ["python", "manage.py", "migrate", "--no-input"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
Key Job Configuration Fields
Jobs expose several fields that control execution, failure handling, and lifecycle. Understanding each one prevents common pitfalls like runaway retry loops or orphaned completed pods consuming resources.
| Field | Default | Description |
|---|---|---|
| completions | 1 | Total number of pods that must succeed. Set to N for batch work that needs N successful completions. |
| parallelism | 1 | Maximum number of pods running concurrently. Set higher than 1 to process items in parallel. |
| backoffLimit | 6 | Number of retries before the Job is marked as failed. Each retry uses exponential backoff (10s, 20s, 40s, … capped at 6 minutes). |
| activeDeadlineSeconds | none | Hard time limit for the entire Job. If the Job is still running after this duration, all pods are terminated and the Job is marked as failed. |
| ttlSecondsAfterFinished | none | Automatically deletes the Job (and its pods) this many seconds after completion. Requires the TTL-after-finished controller (enabled by default since Kubernetes 1.23). |
Always set ttlSecondsAfterFinished on Jobs. Without it, completed Job objects and their pods stay in the cluster forever, cluttering kubectl get pods output and consuming etcd storage. A value of 3600 (1 hour) gives you time to inspect results before automatic cleanup kicks in.
Parallel Jobs with Completions
When you need to process multiple independent items — say, rendering 50 video chunks — set completions to the total number of items and parallelism to how many can run concurrently. The Job controller creates new pods as existing ones succeed, until all completions are reached.
apiVersion: batch/v1
kind: Job
metadata:
name: video-render
spec:
completions: 50
parallelism: 10
backoffLimit: 5
template:
spec:
restartPolicy: Never
containers:
- name: renderer
image: myapp/renderer:v1.2
command: ["./render.sh"]
env:
- name: QUEUE_URL
value: "sqs://render-jobs"
In this pattern, each pod pulls work from an external queue (SQS, Redis, RabbitMQ). The Job doesn’t know what each pod processes — it only ensures 50 total pods succeed and keeps up to 10 running at any time.
Indexed Jobs for Unique Work Assignment
Indexed Jobs (stable since Kubernetes 1.24) assign each pod a unique index from 0 to completions - 1, exposed via the JOB_COMPLETION_INDEX environment variable. This eliminates the need for an external work queue when each unit of work can be identified by a simple integer — processing file shards, database partition ranges, or test suite splits.
apiVersion: batch/v1
kind: Job
metadata:
name: data-process
spec:
completionMode: Indexed
completions: 8
parallelism: 4
template:
spec:
restartPolicy: Never
containers:
- name: processor
image: myapp/data-pipeline:v3.0
command:
- python
- process_shard.py
- --shard-index=$(JOB_COMPLETION_INDEX)
- --total-shards=8
Pod 0 gets JOB_COMPLETION_INDEX=0, pod 1 gets 1, and so on. If pod 3 fails and is retried, the replacement pod still receives index 3. Retries mean a given index may execute more than once, so make the per-index work idempotent; the Job then guarantees each index completes successfully exactly once.
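To make the index concrete, here is a sketch of how a pod might turn its completion index into a work range. The 800-row total and the hard-coded index are invented for illustration; in a real Indexed Job pod, JOB_COMPLETION_INDEX is injected by Kubernetes.

```shell
# Sketch: map JOB_COMPLETION_INDEX to a contiguous range of rows
TOTAL_ROWS=800
TOTAL_SHARDS=8
JOB_COMPLETION_INDEX=3   # set by Kubernetes in a real Indexed Job pod
rows_per_shard=$((TOTAL_ROWS / TOTAL_SHARDS))
start=$((JOB_COMPLETION_INDEX * rows_per_shard))
end=$((start + rows_per_shard - 1))
echo "shard ${JOB_COMPLETION_INDEX}: rows ${start}-${end}"
```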
CronJobs — Scheduled Jobs
A CronJob creates a Job on a repeating schedule, using the same cron expression format that Linux administrators have used for decades. It’s the Kubernetes-native replacement for crontab entries, with the added benefit of running inside the cluster where it has access to Kubernetes secrets, volumes, and service networking.
Schedule Syntax
The schedule field uses standard five-field cron syntax. Each field can be a value, range, step, or wildcard:
| Field | Allowed Values | Example | Meaning |
|---|---|---|---|
| Minute | 0–59 | 30 | At minute 30 |
| Hour | 0–23 | */6 | Every 6 hours |
| Day of month | 1–31 | 1,15 | 1st and 15th |
| Month | 1–12 | * | Every month |
| Day of week | 0–6 (Sun=0) | 1-5 | Monday–Friday |
Some common schedules for reference: "0 * * * *" (every hour on the hour), "*/15 * * * *" (every 15 minutes), "0 2 * * *" (daily at 2 AM), "0 0 * * 0" (weekly on Sunday at midnight), and "0 0 1 * *" (first day of every month).
Basic CronJob YAML
This CronJob creates a nightly database backup at 2:30 AM. The jobTemplate section is identical to a standalone Job spec — the CronJob controller simply stamps out a new Job on each trigger.
apiVersion: batch/v1
kind: CronJob
metadata:
name: db-backup
spec:
schedule: "30 2 * * *"
timeZone: "America/New_York"
concurrencyPolicy: Forbid
startingDeadlineSeconds: 300
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5
jobTemplate:
spec:
activeDeadlineSeconds: 1800
ttlSecondsAfterFinished: 86400
template:
spec:
restartPolicy: OnFailure
containers:
- name: backup
image: myapp/db-backup:v1.4
command:
- /bin/sh
- -c
- pg_dump $DATABASE_URL | gzip > /backups/db-$(date +%Y%m%d).sql.gz
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-credentials
key: url
volumeMounts:
- name: backup-storage
mountPath: /backups
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
CronJob Configuration Fields
Beyond the schedule, CronJobs offer several fields that control overlap behavior, failure tolerance, and history management. Getting these right is important — a misconfigured CronJob can silently skip runs, stack up parallel executions, or leave hundreds of completed Job objects in the cluster.
| Field | Default | Description |
|---|---|---|
| concurrencyPolicy | Allow | Allow: multiple Jobs can run simultaneously. Forbid: skip the new run if the previous one is still active. Replace: terminate the still-running Job and start a new one. |
| startingDeadlineSeconds | none | If the CronJob misses its scheduled time (e.g., controller was down), it still creates the Job as long as the delay is within this window. If unset and more than 100 runs are missed, the CronJob stops scheduling entirely. |
| suspend | false | When set to true, no new Jobs are created on schedule. Already-running Jobs are unaffected. Useful for temporarily pausing a CronJob without deleting it. |
| successfulJobsHistoryLimit | 3 | Number of completed Job objects to retain. Older successful Jobs and their pods are automatically deleted. |
| failedJobsHistoryLimit | 1 | Number of failed Job objects to retain. Keep this higher (e.g., 5) so you can inspect recent failures. |
| timeZone | controller’s TZ (usually UTC) | IANA time zone name (e.g., America/New_York). Stable since Kubernetes 1.27. Without it, the schedule runs in the kube-controller-manager’s time zone, which is almost always UTC. |
Choosing a Concurrency Policy
The concurrencyPolicy field is the most consequential CronJob setting. The right choice depends entirely on whether your task is safe to run in parallel and what should happen when a run takes longer than the interval between schedules.
- Allow — Use when each run is independent and overlapping is harmless. Example: sending a periodic health check report where two overlapping reports are fine.
- Forbid — Use when concurrent runs would conflict or corrupt shared resources. Example: a database backup that acquires an advisory lock — running two in parallel would cause one to fail. Missed schedules are simply skipped.
- Replace — Use when only the latest run matters and stale runs should be stopped. Example: a cache warm-up job where the newest data supersedes anything the previous run was building.
Always set startingDeadlineSeconds on CronJobs. If the CronJob controller is unavailable (during a control-plane upgrade, for example) and misses more than 100 consecutive schedules, the CronJob permanently stops scheduling and logs a "Cannot determine if job needs to be started" error. Setting startingDeadlineSeconds prevents this by limiting how far back the controller looks for missed schedules.
Choosing the Right Controller
When you’re modeling a new workload, ask two questions: Does it need to run continuously or to completion? Does it need to run on specific nodes, or wherever the scheduler decides? The answer maps directly to one of the five workload controllers:
| Controller | Lifecycle | Pod Placement | Typical Use Case |
|---|---|---|---|
| Deployment | Run forever | Scheduler decides | Stateless web services, APIs |
| StatefulSet | Run forever | Ordered, stable identity | Databases, distributed systems |
| DaemonSet | Run forever | One per node (or subset) | Node agents, log shippers, CNI |
| Job | Run to completion | Scheduler decides | Migrations, batch processing |
| CronJob | Run to completion, on schedule | Scheduler decides | Backups, reports, cleanup tasks |
Multi-Container Pod Patterns — Init, Sidecar, Ambassador, Adapter
A Pod is not limited to a single container. When you place multiple containers in the same Pod, they share two critical resources: a network namespace (they all reach each other on localhost) and optionally storage volumes (they can read and write the same files). This is the foundation for every multi-container pattern in Kubernetes.
You would never cram unrelated services into one Pod — that defeats the purpose of microservices. Multi-container Pods exist for tightly coupled helpers that genuinely need to share fate and resources with the main application container. Kubernetes formalizes this with four well-established patterns: Init, Sidecar, Ambassador, and Adapter.
flowchart TB
subgraph pod ["Pod Boundary — shared localhost + volumes"]
direction TB
subgraph init ["① Init Containers (sequential, run-to-completion)"]
I1["init-1: wait-for-db"] --> I2["init-2: run-migrations"]
end
subgraph runtime ["② App + Helper Containers (run concurrently)"]
direction LR
APP["app container\n:8080"]
subgraph sidecar ["Sidecar"]
S["log-shipper\nreads shared volume"]
end
subgraph ambassador ["Ambassador"]
A["db-proxy\nlocalhost:5432"]
end
subgraph adapter ["Adapter"]
D["metrics-adapter\nPrometheus format"]
end
end
init --> runtime
end
style pod fill:#1a1a2e,stroke:#6366f1,stroke-width:2px,color:#e2e8f0
style init fill:#1e293b,stroke:#f59e0b,stroke-width:1px,color:#e2e8f0
style runtime fill:#1e293b,stroke:#22d3ee,stroke-width:1px,color:#e2e8f0
style sidecar fill:#0f172a,stroke:#a78bfa,stroke-width:1px,color:#e2e8f0
style ambassador fill:#0f172a,stroke:#34d399,stroke-width:1px,color:#e2e8f0
style adapter fill:#0f172a,stroke:#fb923c,stroke-width:1px,color:#e2e8f0
Why Containers in the Same Pod?
Containers in the same Pod share a network namespace — they communicate over localhost without any Service or DNS lookup. If your app container listens on port 8080, a sidecar container can reach it at localhost:8080. They also share the same IP address, so external callers see one network identity.
Shared volumes let containers exchange files without network overhead. An app can write log files to an emptyDir volume, and a sidecar can tail those same files and ship them to a logging backend. This cooperation model is what makes the four patterns below so effective.
Init Containers
Init containers run before any app container starts. They execute sequentially — the second init container will not start until the first exits with code 0. If any init container fails, the kubelet retries it according to the Pod's restartPolicy. The Pod stays in Pending state until all init containers succeed.
This makes them perfect for setup tasks that must complete before your application is ready: waiting for a database to accept connections, running schema migrations, cloning a config repository, or downloading ML model weights. The app container is guaranteed that these preconditions are met by the time it starts.
Real-World Example: Wait for a Database, Then Migrate
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  initContainers:
    # Init container 1: block until PostgreSQL accepts connections
    - name: wait-for-db
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          until nc -z postgres-svc 5432; do
            echo "Waiting for database..."
            sleep 2
          done
          echo "Database is ready"
    # Init container 2: run schema migrations
    - name: run-migrations
      image: myapp/migrator:1.4.0
      command: ["./migrate", "--source", "file:///migrations", "--database", "$(DATABASE_URL)", "up"]
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
  containers:
    - name: web
      image: myapp/web:2.1.0
      ports:
        - containerPort: 8080
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
The execution order is strict: wait-for-db loops until PostgreSQL is reachable, then run-migrations applies any pending schema changes, and only then does the web container start. If the migration fails (non-zero exit), the kubelet restarts the init container — the app never starts in a broken state.
Init containers respect the Pod's restartPolicy. With the default Always policy, a failed init container is retried with exponential backoff (10s, 20s, 40s, capped at 5 minutes). If restartPolicy is Never, a failed init container causes the entire Pod to fail permanently. Init containers also re-run if the Pod is restarted — they are not skipped on subsequent starts.
Sidecar Containers
Sidecar containers run alongside the main application container for the entire lifetime of the Pod. They extend the app's capabilities without modifying its code — shipping logs, injecting TLS, reloading configuration files, or proxying traffic through a service mesh like Istio's Envoy.
Historically, sidecars were just regular containers listed in spec.containers. The problem: Kubernetes had no way to distinguish a helper from the primary workload. Sidecars could start after the app, and worse, they could prevent a Job Pod from completing because they never exited.
Native Sidecar Support (Kubernetes 1.28+)
Kubernetes 1.28 introduced native sidecar containers (beta in 1.29, stable in 1.31). The mechanism is elegant: you declare a container in initContainers with restartPolicy: Always. This tells the kubelet to start it before the app containers and keep it running for the Pod's entire lifetime. Native sidecars start in init container order, are guaranteed running before app containers launch, and are terminated after the main containers exit.
Real-World Example: Log Shipper Sidecar
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-shipper
spec:
  initContainers:
    # Native sidecar: starts before app, stays running, stops after app
    - name: log-shipper
      image: fluent/fluent-bit:3.1
      restartPolicy: Always # This makes it a native sidecar
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
      env:
        - name: FLUENTBIT_OUTPUT
          value: "elasticsearch"
        - name: ES_HOST
          value: "elasticsearch.logging.svc.cluster.local"
  containers:
    - name: app
      image: myapp/api:3.0.0
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs
      emptyDir: {}
The app container writes structured JSON logs to /var/log/app/. The log-shipper sidecar tails those files and forwards them to Elasticsearch. Because it is declared as a native sidecar (restartPolicy: Always in initContainers), the kubelet guarantees it starts before app and terminates after app exits. No lost log lines on shutdown.
Before native sidecars, a Job Pod with an Istio Envoy proxy would never complete because the proxy container ran indefinitely. With native sidecars, the kubelet shuts down sidecar containers after the main container exits, so the Pod terminates cleanly. If you run Jobs in a service mesh, native sidecars are essential.
Ambassador Pattern
The Ambassador pattern places a proxy container in the Pod that handles outbound connections on behalf of the app. The app connects to localhost on a well-known port, and the ambassador container handles the complexity of routing, connection pooling, authentication, or protocol translation to the actual remote service.
This cleanly separates connection logic from business logic. The app does not need to know about TLS certificates, connection pool sizes, retry policies, or even which database host it is talking to — it just connects to localhost:5432 and the ambassador handles the rest.
Real-World Example: Cloud SQL Auth Proxy
apiVersion: v1
kind: Pod
metadata:
  name: app-with-db-proxy
spec:
  serviceAccountName: cloud-sql-sa
  containers:
    - name: app
      image: myapp/api:3.0.0
      ports:
        - containerPort: 8080
      env:
        # App connects to localhost — no knowledge of Cloud SQL
        - name: DATABASE_HOST
          value: "127.0.0.1"
        - name: DATABASE_PORT
          value: "5432"
        - name: DATABASE_NAME
          value: "myapp_production"
    # Ambassador: proxies localhost:5432 to Cloud SQL over IAM auth
    - name: cloud-sql-proxy
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.13.0
      args:
        - "--structured-logs"
        - "--auto-iam-authn"
        - "--port=5432"
        - "my-project:us-central1:prod-db"
      securityContext:
        runAsNonRoot: true
      resources:
        requests:
          cpu: 50m
          memory: 64Mi
Google's Cloud SQL Auth Proxy is the canonical ambassador example. The app container connects to 127.0.0.1:5432 as if it were a local PostgreSQL server. The cloud-sql-proxy ambassador terminates that connection and establishes an authenticated, encrypted tunnel to the actual Cloud SQL instance. The app needs zero Cloud SQL awareness — no special drivers, no IAM token management, no TLS certificate handling.
Adapter Pattern
The Adapter pattern normalizes or transforms output from the main container into a format that external systems expect. The most common use case is metrics: your application might expose metrics in a proprietary format, and an adapter container converts them into a Prometheus-compatible /metrics endpoint so your monitoring stack can scrape them uniformly.
Like the Ambassador, the Adapter runs alongside the app and communicates over localhost or shared volumes. The difference is directional: Ambassadors proxy outbound traffic, while Adapters transform output data.
Real-World Example: Redis Metrics Adapter for Prometheus
apiVersion: v1
kind: Pod
metadata:
  name: redis-with-exporter
  labels:
    app: redis
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9121"
spec:
  containers:
    - name: redis
      image: redis:7.2-alpine
      ports:
        - containerPort: 6379
    # Adapter: reads Redis INFO command, exposes Prometheus metrics
    - name: redis-exporter
      image: oliver006/redis_exporter:v1.62.0
      ports:
        - containerPort: 9121
      env:
        - name: REDIS_ADDR
          value: "localhost:6379"
      resources:
        requests:
          cpu: 25m
          memory: 32Mi
Redis does not natively expose Prometheus metrics. The redis-exporter adapter container connects to Redis on localhost:6379, runs the INFO command, and translates the output into Prometheus exposition format on port 9121. Prometheus scrapes the adapter, not Redis directly. The same pattern works for MySQL, PostgreSQL, NGINX, and dozens of other services that have community-maintained exporters.
Choosing the Right Pattern
| Pattern | Runs | Direction | Typical Use Cases |
|---|---|---|---|
| Init Container | Before app, sequentially | — | DB readiness checks, schema migrations, config cloning, permission setup |
| Sidecar | Alongside app, full lifetime | Varies | Log shippers, service mesh proxies (Envoy), config reloaders, TLS termination |
| Ambassador | Alongside app, full lifetime | Outbound proxy | Database proxies (Cloud SQL, PgBouncer), API gateway sidecars, connection pooling |
| Adapter | Alongside app, full lifetime | Output transformation | Prometheus exporters, log format converters, protocol translators |
Every container you add to a Pod shares its lifecycle — if the Pod is evicted or rescheduled, all containers move together. Sidecars also consume resources counted against the Pod's requests and limits. If a helper container does not need localhost access or shared volumes with your app, deploy it as a separate Pod behind a Service instead. The Ambassador and Adapter patterns are only justified when the tight coupling genuinely simplifies your architecture.
The Kubernetes Networking Model and CNI Plugins
Networking is arguably the most complex piece of a Kubernetes cluster, yet it rests on a deceptively simple foundation. Before a single packet flows, Kubernetes mandates three non-negotiable rules about how Pods, nodes, and the network interact. Understanding these rules — and the pluggable system that implements them — is essential for debugging connectivity issues, choosing the right network plugin, and reasoning about performance.
The Three Fundamental Networking Rules
The Kubernetes networking model is defined by three invariants. These are not suggestions — every conformant cluster implementation must satisfy all three. They are spelled out in the official Kubernetes documentation and form the contract that every CNI plugin must uphold.
- Every Pod gets its own IP address. Each Pod is assigned a unique, cluster-routable IP from the Pod CIDR range. Containers within the same Pod share this IP and communicate over localhost.
- Pods can communicate with all other Pods without NAT. Any Pod can reach any other Pod in the cluster using the destination Pod's IP directly. Source and destination addresses are never translated in transit.
- Nodes can communicate with all Pods (and vice versa) without NAT. A process running on a node can reach any Pod IP directly, and Pods can reach node IPs without address translation.
The result is a flat network — every Pod and node exists in a single, shared address space. There is no port-mapping layer between Pods, no manual link configuration, and no NAT rewriting packet headers. If Pod A knows Pod B's IP, it can just send a packet there.
Kubernetes deliberately does not specify how these rules are implemented — only that they must hold. This is why the networking layer is pluggable. The implementation is delegated to a CNI plugin, which can use overlay networks, BGP routing, eBPF, or anything else that satisfies the contract.
How This Differs from Docker's Default Networking
If you are coming from Docker, this model will feel unfamiliar. Docker's default bridge networking takes a fundamentally different approach: containers on a single host share a private bridge network (typically 172.17.0.0/16), and communication with the outside world requires explicit port mapping via -p flags. This means two containers on different hosts cannot talk to each other by default — they need port forwarding, link aliases, or an overlay network like Docker Swarm's.
| Property | Docker Bridge (Default) | Kubernetes Flat Network |
|---|---|---|
| IP scope | Private to each host | Cluster-wide, routable |
| Cross-host communication | Requires port mapping or overlay setup | Works out of the box via Pod IPs |
| NAT involved | Yes — SNAT/DNAT for external traffic | No NAT between Pods or between Pods and nodes |
| Port conflicts | Mapped ports must be unique per host | Each Pod has its own IP — no port conflicts |
| Service discovery | Manual, or Docker's embedded DNS (user-defined networks only) | Built-in DNS via CoreDNS + Services |
Kubernetes chose the flat model because it eliminates an entire class of problems. Port conflicts disappear. Applications do not need to know whether they are running in a container. Network policies can reason about Pod IPs as stable identities rather than chasing ephemeral port mappings.
How Pods Get Their IPs — Under the Hood
When the kubelet on a node needs to start a new Pod, the sequence works like this: the container runtime creates the Pod's network namespace (an isolated Linux network stack), then calls the configured CNI plugin. The plugin assigns an IP address from the node's allocated Pod CIDR subnet, creates a virtual ethernet (veth) pair, connects one end to the Pod's namespace and the other to the host network, and sets up routes so traffic can flow.
You can verify a Pod's IP and network namespace from the node itself:
# Get the IP assigned to a Pod
kubectl get pod my-app -o wide
# NAME READY STATUS IP NODE
# my-app 1/1 Running 10.244.1.23 worker-01
# From worker-01, inspect the veth pair
ip link show type veth
# You'll see interfaces like cali* (Calico), lxc* (Cilium), or vethXXXX
# Trace the route to a Pod on another node
ip route get 10.244.2.45
# 10.244.2.45 via 192.168.1.12 dev eth0 (routed to worker-02)
Pod-to-Pod Communication Across Nodes
When Pod A on Node 1 sends a packet to Pod B on Node 2, the packet travels through several layers. It exits Pod A's network namespace via the veth pair, hits the host network stack on Node 1, gets routed to Node 2 (via overlay encapsulation or direct routing depending on the CNI plugin), enters Node 2's host network stack, and is finally delivered into Pod B's namespace.
flowchart LR
subgraph Node1["Node 1 (192.168.1.11)"]
direction TB
PodA["Pod A\n10.244.1.23"]
veth1["veth pair"]
Host1["Host Network Stack\n+ routing table"]
PodA --- veth1 --- Host1
end
subgraph Node2["Node 2 (192.168.1.12)"]
direction TB
Host2["Host Network Stack\n+ routing table"]
veth2["veth pair"]
PodB["Pod B\n10.244.2.45"]
Host2 --- veth2 --- PodB
end
Host1 -- "Overlay (VXLAN/GENEVE)\nor BGP direct route" --> Host2
style PodA fill:#4a9eff,stroke:#2d7cd4,color:#fff
style PodB fill:#4a9eff,stroke:#2d7cd4,color:#fff
style Host1 fill:#f0f4f8,stroke:#9aa5b4
style Host2 fill:#f0f4f8,stroke:#9aa5b4
style veth1 fill:#ffd43b,stroke:#e6b800,color:#333
style veth2 fill:#ffd43b,stroke:#e6b800,color:#333
Pod-to-Pod communication across nodes. The CNI plugin determines whether traffic is encapsulated in an overlay tunnel or routed directly via BGP.
The critical detail is that Pod A's source IP is preserved — Pod B sees the real IP of Pod A, not a translated address. This is what makes network policies, access logs, and distributed tracing work correctly.
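Because the source IP survives, a standard NetworkPolicy can match on Pod identity rather than on translated addresses. A minimal sketch — the labels and name are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend      # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: api                  # Pods this policy protects
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # only frontend Pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

The CNI plugin enforces this by inspecting the real source Pod IP of each incoming packet — which only works because no NAT has rewritten it along the way.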
CNI — The Container Network Interface
CNI is a specification, not a piece of software. It defines a minimal contract between a container runtime and a network plugin: "here's a network namespace, set it up" and "here's a network namespace, tear it down." The spec was originally developed by CoreOS and is now maintained by the CNCF. Kubernetes consumes it through its container runtimes, and other container tools such as Podman and runtimes like CRI-O use it directly as well.
A CNI plugin is just a binary that the container runtime executes. The runtime passes information through environment variables and stdin (a JSON configuration), and the plugin responds by configuring the network namespace and returning the result (including the assigned IP) as JSON on stdout.
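For a successful ADD, the plugin prints a result object on stdout. A sketch of its shape under CNI spec version 1.0 — the interface name, namespace path, and addresses are illustrative:

```json
{
  "cniVersion": "1.0.0",
  "interfaces": [
    { "name": "eth0", "sandbox": "/var/run/netns/cni-abc123" }
  ],
  "ips": [
    { "address": "10.244.1.23/24", "gateway": "10.244.1.1", "interface": 0 }
  ],
  "routes": [
    { "dst": "0.0.0.0/0" }
  ],
  "dns": {}
}
```

The runtime parses this JSON to learn the Pod's assigned IP, which the kubelet then reports in the Pod's status.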
Plugin Configuration
CNI plugins are configured via two things on each node:
- Plugin binaries in /opt/cni/bin/ — the actual executables (e.g., calico, flannel, bridge, loopback).
- Configuration files in /etc/cni/net.d/ — JSON or conflist files that tell the runtime which plugin to invoke and with what parameters.
The container runtime (containerd, CRI-O) reads the configuration directory, picks the first file alphabetically, and uses it for all Pod network setup. Here is a typical configuration file:
{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "type": "calico",
  "ipam": {
    "type": "calico-ipam"
  },
  "policy": {
    "type": "k8s"
  },
  "log_level": "info"
}
The type field maps directly to a binary name in /opt/cni/bin/. The ipam block configures IP address management — how the plugin allocates and tracks Pod IPs. Many plugins include their own IPAM, but you can also use standalone IPAM plugins like host-local or whereabouts.
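As a sketch of what a standalone IPAM configuration looks like, here is the ipam block for the host-local plugin — the subnet is illustrative:

```json
{
  "ipam": {
    "type": "host-local",
    "ranges": [
      [ { "subnet": "10.244.1.0/24" } ]
    ],
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
```

host-local tracks allocations in files on the node's disk, which is why each node must own a disjoint subnet — two nodes allocating from the same range would hand out duplicate IPs.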
The Plugin Lifecycle: ADD, DEL, CHECK
The CNI spec defines three operations that a runtime can invoke on a plugin. These map directly to the lifecycle of a Pod's network namespace:
| Operation | When It Runs | What It Does |
|---|---|---|
| ADD | Pod is being created | Configures the network namespace: creates interfaces, assigns IP, sets up routes. Returns the assigned IP and other details as JSON. |
| DEL | Pod is being destroyed | Tears down the network configuration: removes interfaces, releases the IP back to the pool, cleans up routes. |
| CHECK | On demand, between ADD and DEL | Validates that the network setup is still correct. Returns an error if something is wrong (e.g., interface missing, IP conflict). Optional — not all plugins implement it. |
You can inspect what happens during these operations by looking at kubelet logs on a node when a Pod is scheduled:
# Watch CNI activity in kubelet logs
journalctl -u kubelet -f | grep -i cni
# List installed CNI plugins on a node
ls /opt/cni/bin/
# bandwidth bridge calico calico-ipam flannel host-local loopback portmap
# View active CNI configuration
cat /etc/cni/net.d/10-calico.conflist | jq '.plugins[].type'
# "calico"
# "bandwidth"
# "portmap"
Comparing Popular CNI Plugins
The CNI plugin you choose has a direct impact on performance, features, operational complexity, and what network policies you can enforce. There is no single "best" plugin — the right choice depends on your cluster size, performance requirements, and whether you need advanced features like encryption or deep observability.
| Plugin | Dataplane | Network Policies | Encryption | Best For |
|---|---|---|---|---|
| Calico | BGP (native routing) or VXLAN overlay | Full Kubernetes + extended Calico policies | WireGuard | Production clusters needing strong policy support and flexibility |
| Cilium | eBPF (kernel-level) | Kubernetes + L7-aware policies (HTTP, gRPC, Kafka) | WireGuard / IPsec | High-performance clusters, deep observability, service mesh replacement |
| Flannel | VXLAN overlay (default), host-gw | None built-in (pair with Calico for policies) | None | Simple clusters, learning environments, minimal overhead |
| Weave Net | VXLAN overlay with mesh routing | Basic Kubernetes network policies | IPsec (sleeve mode) | Small clusters with easy setup and built-in encryption |
Calico
Calico is the most widely deployed CNI plugin in production Kubernetes clusters. In its default mode, it uses BGP (Border Gateway Protocol) to distribute Pod routes directly across nodes — no encapsulation overhead, no tunnel interfaces. Each node acts as a BGP peer and announces its Pod CIDR to the rest of the cluster. This approach gives near bare-metal networking performance.
When BGP is not feasible (for example, in cloud VPCs that block BGP), Calico falls back to VXLAN overlay mode. Calico also includes the most mature network policy implementation in the ecosystem, supporting both standard Kubernetes NetworkPolicy resources and its own extended GlobalNetworkPolicy CRD for cluster-wide rules.
# Install Calico (operator-based)
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/custom-resources.yaml
# Verify Calico Pods are running
kubectl get pods -n calico-system
# NAME READY STATUS RESTARTS
# calico-kube-controllers-7c5f8db89c-x2g4l 1/1 Running 0
# calico-node-abcde 1/1 Running 0
# calico-typha-6f8b5c9d4f-k8m2n 1/1 Running 0
# Check BGP peering status
sudo calicoctl node status
# IPv4 BGP status
# +--------------+-------------------+-------+----------+-------------+
# | PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
# +--------------+-------------------+-------+----------+-------------+
# | 192.168.1.12 | node-to-node mesh | up | 10:23:45 | Established |
# +--------------+-------------------+-------+----------+-------------+
Cilium
Cilium takes a radically different approach by moving networking logic into the Linux kernel using eBPF (extended Berkeley Packet Filter). Instead of configuring iptables rules (which become a bottleneck at scale), Cilium attaches eBPF programs directly to network interfaces. This results in lower latency, higher throughput, and — crucially — L7 visibility into application-layer protocols.
Cilium can enforce network policies that understand HTTP methods, gRPC services, Kafka topics, and DNS queries — not just IP addresses and ports. It also includes Hubble, an observability platform that gives you a real-time network traffic flow map of your entire cluster.
# Install Cilium via Helm
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.15.0 \
--namespace kube-system \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true
# Check Cilium status
cilium status
# Cilium health daemon: Ok
# IPAM: IPv4: 12/254 allocated
# BandwidthManager: Disabled
# Encryption: Disabled
# Observe live traffic flows with Hubble
hubble observe --namespace default --protocol HTTP
# TIMESTAMP SOURCE DESTINATION TYPE VERDICT
# Jan 15 10:00:01.234 default/frontend default/api-server L7/HTTP FORWARDED
# GET /api/v1/users => 200
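The L7 policies mentioned above use Cilium's own CRD. A sketch of a CiliumNetworkPolicy that allows only GET requests to a path prefix — the names and labels are illustrative, and the CRD is installed as part of the Cilium deployment:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-policy           # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: api-server           # Pods this policy protects
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend       # only frontend Pods may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/.*"   # regex on the request path
```

A frontend Pod issuing GET /api/v1/users is forwarded; a POST to the same path, or any request from another Pod, is dropped at the eBPF layer — something a plain IP/port NetworkPolicy cannot express.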
Flannel
Flannel is the simplest CNI plugin and often the first one people encounter. It creates a VXLAN overlay network by default: each node gets a subnet from the cluster's Pod CIDR, and traffic between nodes is encapsulated in VXLAN packets. There is no support for network policies — if you need policies with Flannel, you typically pair it with Calico in a configuration called "Canal."
Flannel's simplicity is both its strength and its limitation. It is ideal for development clusters, CI/CD environments, and learning Kubernetes. For production workloads where you need policy enforcement or performance optimization, you will outgrow it.
# Install Flannel
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
# Verify — Flannel runs as a DaemonSet on every node
kubectl get pods -n kube-flannel
# NAME READY STATUS RESTARTS
# kube-flannel-ds-abc12 1/1 Running 0
# kube-flannel-ds-def34 1/1 Running 0
# Inspect the VXLAN interface Flannel creates
ip -d link show flannel.1
# flannel.1: <BROADCAST,MULTICAST,UP> mtu 1450 ...
# vxlan id 1 ... dstport 8472
Weave Net
Weave Net builds a mesh overlay network between nodes using VXLAN (fast datapath) or a user-space "sleeve" mode that can traverse firewalls and NATs. Its standout feature is zero-configuration encryption — enable it with a single password and all inter-node traffic is encrypted (NaCl crypto in sleeve mode, IPsec ESP on the fast datapath). Weave also supports basic Kubernetes network policies.
Weave is a solid choice for small- to medium-sized clusters where ease of setup and built-in encryption matter more than raw performance. However, it has seen less active development compared to Calico and Cilium in recent years.
Starting out or running a dev cluster? Use Flannel — it works and stays out of your way. Need network policies and production reliability? Calico is the safest bet with the largest user base. Want cutting-edge performance, L7 policies, and deep observability? Cilium is the direction the ecosystem is heading — it is the default CNI in GKE Dataplane V2 and was adopted as a CNCF graduated project.
Debugging CNI Issues
When a Pod is stuck in ContainerCreating and the events show a CNI error, the problem is almost always in one of three places: the CNI binary is missing, the configuration file is malformed, or the IPAM pool is exhausted. Here is a quick diagnostic checklist:
# 1. Check if the CNI config exists
ls /etc/cni/net.d/
# If empty, no CNI plugin is installed — Pods will stay in ContainerCreating
# 2. Check if the CNI binary exists
ls /opt/cni/bin/ | grep calico
# If the type in your config doesn't match a binary here, ADD will fail
# 3. Look at kubelet logs for the specific error
journalctl -u kubelet --since "5 min ago" | grep -i "cni\|network"
# 4. Check Pod events for the error message
kubectl describe pod stuck-pod | tail -20
# Warning FailedCreatePodSandBox kubelet ...
# failed to set up sandbox container network:
# plugin type="calico" failed: ...
# 5. Verify the CNI plugin DaemonSet is healthy
kubectl get ds -n calico-system # or kube-flannel, kube-system, etc.
If your Pod CIDR is too small for the number of Pods you're running, the IPAM plugin will run out of IPs and new Pods on the affected nodes will be stuck in ContainerCreating. Check your cluster's --cluster-cidr and each node's podCIDR allocation with kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'. A /24 per node gives you 254 Pod IPs — enough for most workloads, but tight on nodes that pack many small Pods. (Extra containers inside a Pod do not consume extra IPs — all containers in a Pod share one address.)
Services — ClusterIP, NodePort, and LoadBalancer
Every Pod in Kubernetes gets its own IP address, but that IP is ephemeral. When a Pod is killed and replaced — whether by a Deployment rollout, a node failure, or a scaling event — the new Pod receives a completely different IP. Any client that was using the old IP now has a broken connection. This is the fundamental problem that Services solve.
A Service provides a stable virtual IP address (called a ClusterIP) and a DNS name that remain constant for the lifetime of the Service object. Behind the scenes, the Service uses label selectors to track which Pods should receive traffic, and kube-proxy programs networking rules on every node to route packets to healthy backends. The result is a reliable, load-balanced endpoint that decouples clients from the volatile lifecycle of Pods.
Kubernetes offers four Service types — ClusterIP, NodePort, LoadBalancer, and ExternalName — each building on the one before it. ClusterIP is the default. NodePort adds external access via a static port. LoadBalancer adds a cloud-managed LB on top of NodePort. ExternalName is a special case that creates a DNS CNAME alias.
ClusterIP — The Default Service Type
When you create a Service without specifying a type, you get a ClusterIP service. Kubernetes allocates a virtual IP from the cluster’s service CIDR range (configured at cluster creation, e.g., 10.96.0.0/12). This IP is not bound to any network interface or node — it exists only in the cluster’s networking rules. Pods and other Services within the cluster can reach it, but nothing outside the cluster can.
The virtual IP acts as a stable front door. When a packet is sent to the ClusterIP on a given port, kube-proxy intercepts it and forwards it to one of the backing Pods. The selection of which Pod receives the packet depends on the proxy mode in use.
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: ecommerce
spec:
  type: ClusterIP # default — can be omitted
  selector:
    app: order-api
    tier: backend
  ports:
    - name: http
      protocol: TCP
      port: 80         # port exposed on the ClusterIP
      targetPort: 8080 # port the Pods are listening on
The selector field is what links the Service to its Pods. Kubernetes continuously watches for Pods matching the labels app: order-api and tier: backend, and adds their IPs to the Service’s endpoint list. When a Pod becomes unready or is deleted, it is removed automatically.
How kube-proxy Makes It Work
kube-proxy runs on every node and watches the API server for Service and Endpoint changes. When it detects an update, it programs the node’s networking layer to perform the actual packet rewriting. There are three proxy modes, with iptables being the most common.
| Proxy Mode | Mechanism | Load Balancing | Performance |
|---|---|---|---|
| iptables (default) | Inserts NAT rules into the kernel's netfilter tables | Random selection via --probability chains | Good for up to ~5,000 Services; O(n) rule evaluation |
| ipvs | Programs the kernel's IPVS virtual server table | Round-robin, least-connections, source-hash, and more | O(1) lookup via hash tables; scales to 10,000+ Services |
| nftables | Uses nftables (successor to iptables) with native maps | Random with nftables probability | O(1) lookup; beta since Kubernetes v1.31 |
In iptables mode, kube-proxy creates a chain of rules for each Service. A packet destined for the ClusterIP is DNAT’d (destination NAT) to a randomly selected Pod IP. Return traffic is automatically reverse-NAT’d via conntrack, so the client sees responses from the ClusterIP — not the Pod IP. In IPVS mode, the same concept applies but the kernel’s built-in load balancer handles the forwarding with better performance at scale.
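Switching proxy modes is done through kube-proxy's configuration file rather than per-Service. A sketch of the KubeProxyConfiguration fragment that selects IPVS with a least-connections scheduler — the scheduler choice is illustrative:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"   # least-connections instead of the default round-robin
```

This file is typically mounted into the kube-proxy DaemonSet (via its ConfigMap in kubeadm clusters) and passed with the --config flag; the mode applies cluster-wide.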
NodePort — Exposing Services Outside the Cluster
A NodePort service builds on top of ClusterIP. Kubernetes still allocates a virtual ClusterIP, but additionally opens a static port — the NodePort — on every node in the cluster. Any traffic arriving at <NodeIP>:<NodePort> is forwarded to the Service, which then load-balances it to a backing Pod. The NodePort range is 30000–32767 by default (configurable via the API server’s --service-node-port-range flag).
This means external clients can reach the service by hitting any node’s IP on the allocated port. The node does not need to be running the backing Pod — kube-proxy on every node has rules to forward NodePort traffic to the correct Pod, even if it is on a different node.
apiVersion: v1
kind: Service
metadata:
  name: order-service-nodeport
spec:
  type: NodePort
  selector:
    app: order-api
  ports:
    - name: http
      protocol: TCP
      port: 80         # ClusterIP port (internal)
      targetPort: 8080 # Pod port
      nodePort: 30080  # static port on every node (optional — auto-assigned if omitted)
With this Service, traffic flows through three layers: the external client connects to 192.168.1.10:30080 (any node IP), kube-proxy forwards it to the ClusterIP 10.96.x.x:80, and then DNAT sends it to a Pod on port 8080. If you omit the nodePort field, Kubernetes auto-assigns one from the available range.
LoadBalancer — Cloud-Integrated External Access
The LoadBalancer type extends NodePort by instructing the cloud provider’s controller to provision an external load balancer (an AWS NLB/ALB, GCP TCP LB, Azure LB, etc.). The cloud LB receives a public or internal IP and forwards traffic to the NodePorts on your cluster nodes. Kubernetes automatically configures health checks and backend pools.
This is the simplest way to expose a Service to the internet in a cloud environment. However, each LoadBalancer Service gets its own cloud LB — which means its own IP address and its own billing line item. For clusters with many externally-facing services, an Ingress controller (covered in the next section) is more cost-effective.
apiVersion: v1
kind: Service
metadata:
  name: order-service-lb
  annotations:
    # AWS-specific: request an NLB instead of a Classic LB
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: order-api
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8080
  # Optional: restrict source IPs allowed through the LB
  loadBalancerSourceRanges:
    - 203.0.113.0/24
After creation, the status.loadBalancer.ingress field is populated with the external IP or hostname assigned by the cloud provider. This can take 30 seconds to a few minutes depending on the cloud. Use kubectl get svc order-service-lb -w to watch for it.
ExternalName — DNS Alias for External Services
ExternalName is fundamentally different from the other three types. It does not create a ClusterIP, does not configure kube-proxy rules, and does not proxy any traffic. Instead, it creates a DNS CNAME record in the cluster’s DNS (CoreDNS) that maps the Service name to an external hostname.
apiVersion: v1
kind: Service
metadata:
  name: legacy-payments
  namespace: ecommerce
spec:
  type: ExternalName
  externalName: payments.legacy-datacenter.example.com
When a Pod resolves legacy-payments.ecommerce.svc.cluster.local, CoreDNS returns a CNAME to payments.legacy-datacenter.example.com. This is useful during migrations — your application code references a Kubernetes Service name, and you can later swap the ExternalName for a ClusterIP Service pointing to in-cluster Pods without changing application configuration.
Traffic Flow Through Each Service Type
The following diagram shows how traffic traverses the networking layers for each Service type. Notice how each type builds on the previous one — LoadBalancer wraps NodePort, which wraps ClusterIP.
flowchart LR
subgraph External
Client(["External Client"])
CloudLB(["Cloud Load Balancer"])
end
subgraph Cluster ["Kubernetes Cluster"]
subgraph Node1 ["Node 1 — 192.168.1.10"]
KP1["kube-proxy
iptables / IPVS rules"]
Pod1(["Pod A
10.244.1.5:8080"])
end
subgraph Node2 ["Node 2 — 192.168.1.11"]
KP2["kube-proxy
iptables / IPVS rules"]
Pod2(["Pod B
10.244.2.8:8080"])
end
SvcIP["ClusterIP
10.96.47.12:80"]
end
Client -- "1 LoadBalancer
external-ip:80" --> CloudLB
CloudLB -- "2 NodePort
any-node:30080" --> KP1
KP1 -- "3 ClusterIP
10.96.47.12:80" --> SvcIP
SvcIP -. "DNAT to Pod" .-> Pod1
SvcIP -. "DNAT to Pod" .-> Pod2
KP2 -- "3 ClusterIP" --> SvcIP
InternalPod(["In-Cluster Pod"]) -- "ClusterIP only
10.96.47.12:80" --> SvcIP
style SvcIP fill:#4a90d9,color:#fff,stroke:#2a6cb8
style CloudLB fill:#f5a623,color:#fff,stroke:#d4891a
style Pod1 fill:#7ed321,color:#fff,stroke:#5ea318
style Pod2 fill:#7ed321,color:#fff,stroke:#5ea318
Endpoints and EndpointSlices
When you create a Service with a selector, Kubernetes automatically creates a companion Endpoints object with the same name. This object contains a flat list of IP:port pairs for every Pod that matches the selector and has passed its readiness probe. kube-proxy watches Endpoints objects to know where to send traffic.
# View the Endpoints for a Service
kubectl get endpoints order-service -n ecommerce
# NAME ENDPOINTS AGE
# order-service 10.244.1.5:8080,10.244.2.8:8080 4m
# Detailed view shows ready and not-ready addresses
kubectl describe endpoints order-service -n ecommerce
The problem with Endpoints is scalability. A single Endpoints object stores every backend IP in one resource. For Services with thousands of Pods, any single Pod change triggers an update to the entire Endpoints object, which must be transmitted to every node running kube-proxy. Update traffic therefore grows with Pods × nodes, which is effectively quadratic as a cluster scales on both axes.
EndpointSlices (stable since Kubernetes v1.21) fix this by splitting the backend list into multiple slices, each holding up to 100 endpoints by default. When a Pod changes, only the affected EndpointSlice is updated and propagated. This dramatically reduces API server load and network bandwidth in large clusters.
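A back-of-envelope sketch makes the difference concrete. The per-entry byte size below is an assumed round number for illustration, not a measured value:

```python
# Back-of-envelope sketch of why monolithic Endpoints objects scale badly.
# ENTRY_BYTES is an illustrative assumption, not a measured serialized size.

ENTRY_BYTES = 100          # assumed size of one endpoint entry on the wire
SLICE_CAPACITY = 100       # default max endpoints per EndpointSlice

def update_bytes_endpoints(pods: int, nodes: int) -> int:
    # One Pod change rewrites the whole Endpoints object, pushed to every node.
    return pods * ENTRY_BYTES * nodes

def update_bytes_endpointslices(pods: int, nodes: int) -> int:
    # Only the one affected slice (at most 100 entries) is pushed to every node.
    return min(pods, SLICE_CAPACITY) * ENTRY_BYTES * nodes

pods, nodes = 5000, 1000
print(update_bytes_endpoints(pods, nodes) // 10**6, "MB per Pod change")       # 500 MB
print(update_bytes_endpointslices(pods, nodes) // 10**6, "MB per Pod change")  # 10 MB
```

Under these assumed numbers, a single Pod restart in a 5000-Pod Service costs 500 MB of propagation with Endpoints but only 10 MB with EndpointSlices.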
# List EndpointSlices for a Service
kubectl get endpointslices -l kubernetes.io/service-name=order-service -n ecommerce
# Inspect a specific EndpointSlice
kubectl describe endpointslice order-service-abc12 -n ecommerce
Headless Services
Sometimes you do not want Kubernetes to load-balance for you. You want the actual Pod IPs — for example, when running a database cluster where each node has a distinct identity, or when implementing client-side load balancing with gRPC. A headless Service is created by setting clusterIP: None.
With a headless Service, Kubernetes does not allocate a virtual IP and kube-proxy does not create any forwarding rules. Instead, a DNS lookup for the Service name returns A records for each individual Pod IP. If the Service is combined with a StatefulSet, each Pod also gets a stable DNS hostname like pod-0.my-service.namespace.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
spec:
  type: ClusterIP
  clusterIP: None  # makes it headless
  selector:
    app: postgres
  ports:
    - name: tcp-postgres
      port: 5432
      targetPort: 5432
# DNS lookup returns individual Pod IPs (no ClusterIP)
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup postgres-headless.default.svc.cluster.local
# Server: 10.96.0.10
# Name: postgres-headless.default.svc.cluster.local
# Address 1: 10.244.1.12
# Address 2: 10.244.2.19
# Address 3: 10.244.3.7
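With the Pod IPs in hand, the client does its own balancing. A minimal Python sketch of client-side round-robin over the resolved A records (the IPs are the example values from the lookup above; a real client would re-resolve periodically as Pods come and go):

```python
from itertools import cycle

def round_robin(ips: list[str], n: int) -> list[str]:
    """Distribute n connection attempts across the resolved Pod IPs in turn."""
    backend = cycle(ips)
    return [next(backend) for _ in range(n)]

# A records returned for postgres-headless in the example lookup above
resolved_ips = ["10.244.1.12", "10.244.2.19", "10.244.3.7"]
for target in round_robin(resolved_ips, 5):
    print(f"connecting to {target}:5432")
```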
Session Affinity
By default, kube-proxy distributes traffic to backends with no stickiness — each new connection may land on a different Pod. If your application requires that all requests from the same client go to the same Pod (e.g., for in-memory session state), you can enable session affinity.
Kubernetes supports one type of session affinity: ClientIP. When enabled, kube-proxy creates affinity rules based on the client’s source IP address. All connections from the same IP are routed to the same Pod for a configurable timeout (default: 10,800 seconds / 3 hours).
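The same-IP-to-same-Pod property can be modeled with a stable hash. Note this is an illustration of the behavior, not the mechanism: kube-proxy actually tracks affinity with per-source state in the kernel (the iptables recent match, or IPVS persistence), and the Pod names below are made up for the example.

```python
# Illustrative model of ClientIP affinity: every connection from the same
# source IP lands on the same backend for as long as the backend set is stable.
import hashlib

def pick_backend(client_ip: str, backends: list[str]) -> str:
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]

pods = ["10.244.1.5:3000", "10.244.2.8:3000", "10.244.3.2:3000"]
print(pick_backend("203.0.113.7", pods))  # same IP always maps to the same Pod
```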
apiVersion: v1
kind: Service
metadata:
  name: session-app
spec:
  selector:
    app: web-frontend
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1-hour sticky sessions
  ports:
    - port: 80
      targetPort: 3000
Session affinity is a pragmatic escape hatch, not a best practice. It reduces the effectiveness of load balancing and breaks when Pods are rescheduled. For production workloads, store session state in Redis or a database, and keep your Pods stateless. Save ClientIP affinity for legacy apps that cannot be easily refactored.
Internal and External Traffic Policies
By default, when traffic arrives at a node, kube-proxy may forward it to a Pod on any node in the cluster. This adds an extra network hop and obscures the client’s source IP (because the packet is SNAT’d by the forwarding node). Two fields let you control this behavior.
externalTrafficPolicy
For NodePort and LoadBalancer Services, setting externalTrafficPolicy: Local tells kube-proxy to only forward to Pods running on the same node that received the traffic. This preserves the client’s source IP and eliminates the extra hop, but it means nodes without matching Pods will fail health checks and receive no traffic from the load balancer. You must ensure your Pods are reasonably spread across nodes.
internalTrafficPolicy
Similarly, internalTrafficPolicy: Local (GA since Kubernetes v1.26) restricts in-cluster traffic to Pods on the same node as the client. This is useful for node-local caches or logging agents where you want each Pod to talk to the agent on its own node.
apiVersion: v1
kind: Service
metadata:
  name: order-service-local
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # preserve source IP, avoid extra hops
  internalTrafficPolicy: Cluster  # default — route to any node
  selector:
    app: order-api
  ports:
    - port: 80
      targetPort: 8080
Service Type Comparison
| Feature | ClusterIP | NodePort | LoadBalancer | ExternalName |
|---|---|---|---|---|
| Virtual IP allocated | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Accessible from inside cluster | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Via DNS |
| Accessible from outside cluster | ❌ No | ✅ Via NodeIP:port | ✅ Via external IP | N/A |
| Requires cloud provider | No | No | Yes | No |
| Port range | Any | 30000–32767 | Any (LB frontend) | N/A |
| Typical use case | Internal microservices | Dev/test, bare-metal | Production external access | External DB, SaaS aliases |
| kube-proxy rules | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
Essential kubectl Commands for Services
# Create a ClusterIP service imperatively
kubectl expose deployment order-api --port=80 --target-port=8080 --name=order-service
# Create a NodePort service
kubectl expose deployment order-api --type=NodePort --port=80 --target-port=8080
# List all Services with wide output (shows ClusterIP + external IP)
kubectl get svc -o wide
# Watch for LoadBalancer external IP assignment
kubectl get svc order-service-lb -w
# Debug: check which Pods are behind a Service
kubectl get endpoints order-service
kubectl get endpointslices -l kubernetes.io/service-name=order-service
# Debug: verify kube-proxy iptables rules for a ClusterIP
sudo iptables -t nat -L KUBE-SERVICES -n | grep order-service
# Temporary local access: port-forward through a Service
kubectl port-forward svc/order-service 8080:80
# DNS resolution test from inside the cluster
kubectl run dns-debug --rm -it --image=busybox:1.36 --restart=Never -- nslookup order-service.ecommerce.svc.cluster.local
Exposing a NodePort directly to the internet requires clients to know individual node IPs, provides no SSL termination, and offers no health-check-based routing. In production, always front NodePort with a load balancer — whether cloud-managed (LoadBalancer type) or self-hosted (e.g., MetalLB for bare-metal clusters). Use NodePort alone only for development, debugging, or tightly controlled internal access.
Ingress, Ingress Controllers, and the Gateway API
In the previous section, you saw how Services expose Pods inside and outside the cluster via ClusterIP, NodePort, and LoadBalancer. These abstractions work well for raw TCP/UDP connectivity — but they fall apart the moment you need HTTP-aware routing. A LoadBalancer Service gives you a single external IP mapped to a single backend. If you run 20 microservices, you get 20 cloud load balancers, 20 public IPs, and 20 separate bills.
Real-world HTTP traffic demands more: routing requests to different backends based on the hostname (api.example.com vs. app.example.com) or URL path (/api/v1 vs. /static), terminating TLS at the edge, injecting headers, enforcing rate limits, and performing canary releases. These are Layer 7 concerns, and Services operate at Layer 4. This gap is exactly what Ingress — and its successor, the Gateway API — exist to fill.
The Ingress Resource
An Ingress is a Kubernetes API object that declares HTTP and HTTPS routing rules. You define which hostnames and paths map to which backend Services, and an Ingress controller (a separate component you must install) reads those rules and configures a reverse proxy accordingly. The Ingress resource itself does nothing without a controller — it is purely declarative configuration.
Here is a minimal Ingress that routes traffic for two hostnames to two different Services:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80
The spec.ingressClassName field tells Kubernetes which Ingress controller should handle this resource. Before Kubernetes 1.18, this was done via the kubernetes.io/ingress.class annotation — you will still see that pattern in older configurations.
Path Types
Kubernetes supports three pathType values, and the distinction matters for how requests are matched:
| Path Type | Matching Behavior | Example |
|---|---|---|
| Exact | Only matches the exact URL path | /api matches /api but not /api/ or /api/users |
| Prefix | Matches based on URL path prefix split by / | /api matches /api, /api/, and /api/users |
| ImplementationSpecific | Matching depends on the Ingress controller | NGINX treats it as a regex-capable path; others may differ |
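The table's matching semantics can be sketched in a few lines of Python. The key detail is that Prefix matching is segment-wise on /, so /api matches /api/users but never /apis:

```python
# Sketch of Exact vs Prefix semantics. Prefix compares "/"-separated path
# segments, not raw characters, which is why /api never matches /apis.

def matches(path_type: str, rule: str, request: str) -> bool:
    if path_type == "Exact":
        return request == rule
    if path_type == "Prefix":
        rule_parts = [p for p in rule.split("/") if p]
        req_parts = [p for p in request.split("/") if p]
        return req_parts[: len(rule_parts)] == rule_parts
    raise ValueError("ImplementationSpecific matching depends on the controller")

print(matches("Exact", "/api", "/api"))          # True
print(matches("Exact", "/api", "/api/users"))    # False
print(matches("Prefix", "/api", "/api/users"))   # True
print(matches("Prefix", "/api", "/apis"))        # False
```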
TLS Termination
Ingress supports TLS termination at the edge by referencing a Kubernetes Secret that contains the certificate and private key. The controller terminates HTTPS and forwards plain HTTP to your backend Services.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tls-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
The Secret must be of type kubernetes.io/tls and contain tls.crt and tls.key data fields. In practice, most teams use cert-manager to automatically provision and renew TLS certificates from Let's Encrypt, eliminating manual Secret management entirely.
Annotations: The Escape Hatch
The Ingress spec is deliberately minimal: it covers hosts, paths, backends, and TLS. Everything else (rate limiting, CORS headers, authentication, WebSocket support, custom timeouts) must be configured through controller-specific annotations. This is the practical reality of the Ingress API: annotations are where most of your configuration lives.
metadata:
  annotations:
    # NGINX Ingress Controller annotations
    nginx.ingress.kubernetes.io/limit-rps: "100"   # requests/second per client IP
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
Annotations are scoped to a specific controller implementation. If you migrate from NGINX Ingress Controller to Traefik, every annotation must be rewritten. This tight coupling is one of the primary motivations behind the Gateway API.
Comparing Popular Ingress Controllers
Kubernetes does not ship with an Ingress controller. You must install one, and the choice significantly affects your operational experience. Each controller is a reverse proxy that watches for Ingress resources and translates them into its own configuration format.
| Controller | Proxy Engine | Best For | Key Strength | Watch Out For |
|---|---|---|---|---|
| NGINX Ingress Controller | NGINX / OpenResty | General purpose, high traffic | Mature, huge community, extensive annotation catalog | Two forks exist (community kubernetes/ingress-nginx vs. NGINX Inc's nginxinc/kubernetes-ingress) — don't confuse them |
| Traefik | Traefik (Go) | Dynamic environments, automatic TLS | Built-in Let's Encrypt, dashboard, middleware chains, native CRDs | Performance slightly lower than NGINX under extreme load; Traefik v2/v3 CRD changes can be disruptive |
| HAProxy Ingress | HAProxy | Ultra-low latency, TCP workloads | Battle-tested proxy, excellent for mixed TCP/HTTP traffic | Smaller community, fewer ready-made examples |
| AWS Load Balancer Controller (formerly ALB Ingress Controller) | AWS ALB (cloud-native) | AWS-native deployments | Provisions actual AWS ALBs; uses target groups for Pod-level routing | AWS-only, costs per ALB, slower to provision than in-cluster proxies |
| Envoy-based (Contour, Emissary) | Envoy | Service mesh integration, gRPC | xDS-based dynamic config, HTTP/2 and gRPC first-class support | Higher resource footprint, steeper learning curve |
If you have no strong preference, start with the community NGINX Ingress Controller (kubernetes/ingress-nginx). It has the broadest documentation, the most Stack Overflow answers, and handles the vast majority of workloads well. Move to specialized controllers when you have a specific need: Envoy for gRPC-heavy traffic, AWS ALB for deep AWS integration, or Traefik if you want automatic Let's Encrypt with zero configuration.
Limitations of the Ingress API
After years of production use, the Kubernetes community identified several fundamental problems with the Ingress API that no amount of annotation hacking could fix:
- Lowest common denominator spec. The Ingress spec only covers basic host/path routing and TLS. Every controller extends it differently through annotations, making manifests non-portable.
- No role separation. A single Ingress resource mixes infrastructure concerns (which ports to listen on, what TLS policy to use) with application concerns (which paths route where). Both the platform team and the app developer edit the same object.
- No support for non-HTTP protocols. TCP, UDP, and gRPC routing require controller-specific CRDs or annotations — there is no standard way to express them.
- Header-based routing is impossible. You cannot route based on HTTP headers, query parameters, or request methods in the Ingress spec.
- Traffic splitting is absent. Canary deployments, A/B testing, and weighted routing all require annotations or custom CRDs.
- Single resource, single namespace. An Ingress resource can only reference backend Services in its own namespace, making cross-namespace routing cumbersome.
These limitations led to the development of the Gateway API — not as a patch to Ingress, but as a ground-up redesign.
The Gateway API
The Gateway API is a collection of Kubernetes CRDs that provide expressive, extensible, and role-oriented routing. It graduated to GA (v1.0) in October 2023 for its core HTTP routing features. Unlike Ingress, the Gateway API was designed by a multi-vendor working group with explicit goals: portability across implementations, a rich feature set without annotations, and clear separation of concerns between personas.
The Three-Resource Model
The Gateway API splits what was a single Ingress resource into three distinct resource types, each managed by a different persona:
flowchart TB
subgraph infra["Infrastructure Provider"]
GC["GatewayClass
Defines the controller implementation
e.g., Envoy, NGINX, Cilium"]
end
subgraph platform["Cluster Operator / Platform Team"]
GW["Gateway
Declares listeners: ports, protocols, TLS
e.g., HTTPS on port 443"]
end
subgraph appdev["Application Developer"]
HR["HTTPRoute
Defines host/path matching,
backends, traffic splitting"]
end
GC -->|"referenced by"| GW
GW -->|"routes attach to"| HR
| Resource | Managed By | Responsibility |
|---|---|---|
| GatewayClass | Infrastructure provider (cloud vendor, mesh operator) | Defines which controller implementation to use — analogous to StorageClass for storage |
| Gateway | Cluster operator / platform team | Declares listeners (ports, protocols, TLS certificates), namespaces allowed to attach routes |
| HTTPRoute | Application developer | Specifies host and path matching rules, backend Services, filters (header modification, redirects, mirroring), and traffic weights |
This separation is powerful. The platform team configures TLS policies and which namespaces are allowed to bind routes. App developers create HTTPRoutes in their own namespace without needing to touch infrastructure configuration or request cluster-admin privileges. Each persona controls only what they own.
Gateway API in Practice
Here is a complete example showing all three resources working together. The GatewayClass is typically provided by the controller installation; you will usually only create the Gateway and HTTPRoute.
# 1. GatewayClass — installed by the infrastructure provider
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
# 2. Gateway — created by the platform team
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: infra
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-tls
            kind: Secret
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same
---
# 3. HTTPRoute — created by the app developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-routes
  namespace: store
spec:
  parentRefs:
    - name: main-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api
          port: 8080
          weight: 90
        - name: store-api-canary
          port: 8080
          weight: 10
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: store-frontend
          port: 3000
Notice the weight fields on the first rule — 90% of /api traffic goes to the stable backend and 10% to the canary. This traffic splitting is a first-class feature of the Gateway API, not an annotation hack. Also note that the HTTPRoute lives in the store namespace but attaches to a Gateway in the infra namespace via parentRefs. Cross-namespace routing is built in.
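One way to model that 90/10 split, independent of any particular controller implementation, is weighted random backend selection. The sketch below uses the backend names from the HTTPRoute above:

```python
# Weighted backend selection, as a controller might realize HTTPRoute weights.
# random.choices picks proportionally to the supplied weights.
import random

backends = [("store-api:8080", 90), ("store-api-canary:8080", 10)]

def route(rng: random.Random) -> str:
    names, weights = zip(*backends)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so the demo is reproducible
sample = [route(rng) for _ in range(10_000)]
canary_share = sample.count("store-api-canary:8080") / len(sample)
print(f"canary share: {canary_share:.1%}")  # close to 10%
```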
HTTPRoute Features Beyond Basic Routing
The HTTPRoute resource supports matching and filtering capabilities that are impossible in the Ingress API without annotations:
rules:
  # Header-based routing
  - matches:
      - headers:
          - name: x-api-version
            value: "v2"
        path:
          type: PathPrefix
          value: /api
    backendRefs:
      - name: api-v2
        port: 8080
  # Request redirect
  - matches:
      - path:
          type: Exact
          value: /old-page
    filters:
      - type: RequestRedirect
        requestRedirect:
          scheme: https
          statusCode: 301
          path:
            type: ReplaceFullPath
            replaceFullPath: /new-page
  # Header modification
  - matches:
      - path:
          type: PathPrefix
          value: /internal
    filters:
      - type: RequestHeaderModifier
        requestHeaderModifier:
          add:
            - name: X-Internal-Request
              value: "true"
          remove:
            - X-Debug
    backendRefs:
      - name: internal-service
        port: 8080
Other Route Types
The Gateway API is not limited to HTTP. Alongside HTTPRoute, the specification defines additional route types for other protocols:
| Route Type | Protocol | Status (as of v1.2) | Use Case |
|---|---|---|---|
| HTTPRoute | HTTP / HTTPS | GA (v1.0+) | Web applications, REST APIs |
| GRPCRoute | gRPC | GA (v1.1+) | gRPC services with method-level routing |
| TLSRoute | TLS passthrough | Experimental | SNI-based routing without termination |
| TCPRoute | Raw TCP | Experimental | Databases, custom TCP protocols |
| UDPRoute | Raw UDP | Experimental | DNS, gaming, streaming |
Ingress vs. Gateway API: Side-by-Side
| Capability | Ingress | Gateway API |
|---|---|---|
| Host-based routing | ✅ Built-in | ✅ Built-in |
| Path-based routing | ✅ Built-in | ✅ Built-in |
| TLS termination | ✅ Built-in | ✅ Built-in |
| Header-based routing | ❌ Annotations only | ✅ Built-in |
| Traffic splitting / canary | ❌ Annotations only | ✅ Built-in weights |
| Request redirect / rewrite | ❌ Annotations only | ✅ Built-in filters |
| Cross-namespace routing | ❌ Not supported | ✅ Via parentRefs |
| Role-based ownership | ❌ Single resource | ✅ GatewayClass → Gateway → Route |
| TCP/UDP routing | ❌ Not in spec | ✅ TCPRoute / UDPRoute (experimental) |
| gRPC routing | ❌ Not in spec | ✅ GRPCRoute (GA) |
| Portability across controllers | ⚠️ Spec only; annotations break | ✅ Conformance tests enforce it |
| Maturity / ecosystem | ✅ Widely deployed, massive docs | ⚠️ Growing rapidly; most major controllers support it |
For new projects, prefer the Gateway API. Its core features are GA, every major controller supports it (NGINX, Envoy Gateway, Cilium, Traefik, Istio, Kong), and it is the official successor to Ingress. For existing clusters with Ingress already in production, there is no rush to migrate — Ingress is not deprecated and will be supported for the foreseeable future. When you do migrate, most controllers support both APIs simultaneously, so you can run them in parallel.
Installing and Using the Gateway API
The Gateway API CRDs are not bundled with Kubernetes itself. You install them separately before creating any Gateway or HTTPRoute resources.
# Install the Gateway API CRDs (standard channel — GA resources only)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml
# Or include experimental resources (TCPRoute, UDPRoute, TLSRoute)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/experimental-install.yaml
# Verify the CRDs are installed
kubectl get crds | grep gateway.networking.k8s.io
After installing the CRDs, you still need a controller that implements them. Envoy Gateway, Cilium, Istio, and the NGINX Gateway Fabric are all popular choices. Each controller installs its own GatewayClass — check the controller's documentation for setup instructions.
In the next section, you will see how DNS resolution works inside the cluster via CoreDNS — the mechanism that lets Pods discover Services by name, which is ultimately what Ingress and Gateway backends rely on to forward traffic.
DNS Resolution in Kubernetes with CoreDNS
Every time a Pod connects to a Service by name — curl http://api-server or a database connection string like postgres://db-primary:5432 — it relies on DNS resolution happening inside the cluster. Kubernetes uses CoreDNS as its default cluster DNS server (replacing kube-dns since v1.13). CoreDNS runs as a Deployment in the kube-system namespace and provides name resolution for Services, Pods, and external domains to every workload in the cluster.
Understanding how cluster DNS works is essential because DNS misconfiguration is one of the most common causes of connectivity failures in Kubernetes. A surprising amount of latency problems also trace back to how DNS search domains and the ndots setting interact with external lookups.
How CoreDNS Is Deployed
CoreDNS runs as a Deployment (typically two replicas for high availability) behind a Service named kube-dns in the kube-system namespace. The Service gets a well-known ClusterIP, by convention the tenth address of the Service CIDR (usually 10.96.0.10; the first usable address, 10.96.0.1, is reserved for the kubernetes API Service). This IP is critical because the kubelet on every node writes it into the /etc/resolv.conf of every Pod it creates.
# Inspect the CoreDNS Deployment and Service
kubectl get deploy coredns -n kube-system
kubectl get svc kube-dns -n kube-system
# Check what a Pod sees as its DNS config
kubectl exec -it my-pod -- cat /etc/resolv.conf
A typical Pod's /etc/resolv.conf looks like this:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
Three things to note here: the nameserver points to the kube-dns Service ClusterIP, the search line lists domain suffixes that get appended during resolution, and ndots:5 controls when those suffixes are used. All three are configured by the kubelet based on the Pod's namespace and DNS policy.
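The interaction of search and ndots can be sketched in Python. The function below (an illustration of glibc-style resolver behavior, not CoreDNS code) builds the list of queries attempted for a given name, matching the flow in the sequence diagram below:

```python
# Sketch: how a resolver expands a name using the search list and ndots from
# the resolv.conf above. Names with fewer than ndots dots try the search
# suffixes first; absolute names (trailing dot) skip search expansion entirely.

SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

def candidate_queries(name: str) -> list[str]:
    if name.endswith("."):  # absolute (FQDN with trailing dot)
        return [name.rstrip(".")]
    expanded = [f"{name}.{suffix}" for suffix in SEARCH]
    # Fewer than ndots dots: search suffixes first, bare name last.
    return expanded + [name] if name.count(".") < NDOTS else [name] + expanded

print(candidate_queries("api-server"))       # 3 cluster names, then bare name
print(candidate_queries("api.github.com"))   # 3 wasted lookups before success
print(candidate_queries("api.github.com."))  # absolute: one lookup only
```

This is also why external lookups from Pods can be slow: every external domain with fewer than five dots burns three NXDOMAIN round trips before the real query. Appending a trailing dot in the application, or lowering ndots via dnsConfig, avoids the extra lookups.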
DNS Resolution Flow
sequenceDiagram
participant Pod
participant resolv as /etc/resolv.conf
participant CoreDNS as CoreDNS (kube-dns)
participant Upstream as Upstream DNS
Pod->>resolv: resolve "api-server"
resolv->>CoreDNS: api-server.default.svc.cluster.local?
CoreDNS->>CoreDNS: Look up in cluster zone
CoreDNS-->>Pod: 10.100.23.45 (ClusterIP)
Note over Pod,Upstream: External domain resolution
Pod->>resolv: resolve "api.github.com"
resolv->>CoreDNS: api.github.com.default.svc.cluster.local?
CoreDNS-->>resolv: NXDOMAIN
resolv->>CoreDNS: api.github.com.svc.cluster.local?
CoreDNS-->>resolv: NXDOMAIN
resolv->>CoreDNS: api.github.com.cluster.local?
CoreDNS-->>resolv: NXDOMAIN
resolv->>CoreDNS: api.github.com?
CoreDNS->>Upstream: api.github.com?
Upstream-->>CoreDNS: 140.82.121.6
CoreDNS-->>Pod: 140.82.121.6
DNS Naming Conventions
Kubernetes creates DNS records automatically when you create Services and (optionally) Pods. The naming follows a strict hierarchy rooted at cluster.local (the default cluster domain). Understanding these patterns lets you address any resource by its fully qualified domain name (FQDN).
Service DNS Records
Every Service gets an A/AAAA record in the format:
<service-name>.<namespace>.svc.cluster.local
For a ClusterIP Service, this resolves to the virtual ClusterIP. For a headless Service (clusterIP: None), it resolves to the set of Pod IPs backing the Service. Headless Services also create individual A records for each Pod when used with a StatefulSet:
# Regular Service
my-service.production.svc.cluster.local → 10.100.23.45 (ClusterIP)
# Headless Service with StatefulSet
redis-0.redis-headless.production.svc.cluster.local → 10.244.1.12 (Pod IP)
redis-1.redis-headless.production.svc.cluster.local → 10.244.2.8 (Pod IP)
redis-2.redis-headless.production.svc.cluster.local → 10.244.3.19 (Pod IP)
Services with named ports also get SRV records, which encode both the port number and the protocol:
# SRV record format
_<port-name>._<protocol>.<service>.<namespace>.svc.cluster.local
# Example: Service with port named "http" on TCP
_http._tcp.my-service.production.svc.cluster.local → 0 0 8080 my-service.production.svc.cluster.local
Pod DNS Records
Pods get DNS records based on their IP address, with dots replaced by dashes:
# Pod DNS record format
<pod-ip-dashed>.<namespace>.pod.cluster.local
# Example: Pod with IP 10.244.1.12 in "production" namespace
10-244-1-12.production.pod.cluster.local → 10.244.1.12
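The record name is pure string manipulation, as a quick sketch shows:

```python
# The Pod record pattern above: dots in the Pod IP become dashes, then the
# namespace and the pod.cluster.local zone are appended.

def pod_dns_name(pod_ip: str, namespace: str, zone: str = "cluster.local") -> str:
    return f"{pod_ip.replace('.', '-')}.{namespace}.pod.{zone}"

print(pod_dns_name("10.244.1.12", "production"))
# 10-244-1-12.production.pod.cluster.local
```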
In practice, you rarely query Pod DNS records directly. Service DNS is the primary mechanism for discovery. The table below summarizes the key record types.
| Record Type | DNS Name Pattern | Resolves To |
|---|---|---|
| Service (ClusterIP) | svc.ns.svc.cluster.local | ClusterIP address |
| Service (Headless) | svc.ns.svc.cluster.local | Set of Pod IPs |
| StatefulSet Pod | pod-0.svc.ns.svc.cluster.local | Individual Pod IP |
| Pod | 10-244-1-12.ns.pod.cluster.local | Pod IP |
| SRV | _port._proto.svc.ns.svc.cluster.local | Port number + target |
You don't have to use the full FQDN. Within the same namespace, my-service works. Across namespaces, my-service.other-namespace suffices. The search line in /etc/resolv.conf automatically appends .default.svc.cluster.local, .svc.cluster.local, and .cluster.local during resolution.
DNS Policies
Not every Pod should resolve DNS the same way. A Pod running a node-level agent might need the host's DNS, while a regular application Pod should use cluster DNS. Kubernetes provides four DNS policies, set via the dnsPolicy field in the Pod spec, that control how /etc/resolv.conf is constructed.
| Policy | Behavior | Use Case |
|---|---|---|
| ClusterFirst | Queries go to CoreDNS. Non-cluster domains are forwarded upstream. This is the default. | Standard application Pods |
| Default | Pod inherits the DNS config from the node it runs on (/etc/resolv.conf of the host). | Pods that need host-level DNS without cluster resolution |
| ClusterFirstWithHostNet | Same as ClusterFirst, but explicitly required for Pods using hostNetwork: true (which would otherwise default to Default). | Host-networked Pods that still need cluster DNS (e.g., ingress controllers) |
| None | Kubernetes does not set up DNS at all. You must supply your own config via dnsConfig. | Custom DNS setups, split-horizon DNS, or environments with special resolvers |
The None policy is often paired with dnsConfig to build a completely custom resolver setup. Here's an example that uses a corporate DNS server alongside CoreDNS:
apiVersion: v1
kind: Pod
metadata:
  name: custom-dns-pod
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.96.0.10   # CoreDNS
      - 172.16.0.53  # Corporate DNS
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - corp.example.com
    options:
      - name: ndots
        value: "3"
  containers:
    - name: app
      image: nginx:1.27
The Corefile — CoreDNS Configuration
CoreDNS is configured through a file called the Corefile, stored in a ConfigMap named coredns in the kube-system namespace. The Corefile is a chain of server blocks, each declaring a DNS zone and the plugins that handle it. Plugins execute in the order they are listed.
# View the current Corefile
kubectl get configmap coredns -n kube-system -o yaml
A typical default Corefile looks like this:
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
Key Plugins Explained
| Plugin | Purpose | Details |
|---|---|---|
| `kubernetes` | Cluster DNS records | Reads Services and Pods from the Kubernetes API and serves A, AAAA, SRV, and PTR records for the cluster.local zone. |
| `forward` | Upstream resolution | Forwards queries that don't match the cluster zone to upstream nameservers. `/etc/resolv.conf` refers to the node's resolver. You can also specify explicit IPs like 8.8.8.8. |
| `cache` | Response caching | Caches DNS responses for the specified TTL (in seconds). Reduces load on upstream servers and speeds up repeated lookups. |
| `loop` | Loop detection | Detects forwarding loops (CoreDNS forwarding to itself) and halts the server to prevent infinite recursion. Don't remove this. |
| `reload` | Hot reload | Watches the Corefile for changes and reloads the configuration without restarting the Pod. Typically checks every 30 seconds. |
| `errors` | Error logging | Logs errors to stdout, which is picked up by the container's log stream. |
| `health` | Health endpoint | Exposes `http://:8080/health` for liveness probes. The `lameduck` option adds a grace period before reporting unhealthy during shutdown. |
| `ready` | Readiness endpoint | Exposes `http://:8181/ready`. Returns 200 only when all plugins are operational. |
| `prometheus` | Metrics | Exposes Prometheus metrics on :9153. Key metrics include `coredns_dns_requests_total` and `coredns_dns_responses_total`. |
| `loadbalance` | Round-robin | Randomizes the order of A records in each response, providing basic client-side load balancing. |
Customizing the Corefile
Two common customizations are forwarding specific domains to custom DNS servers and adding extra DNS entries. You edit the coredns ConfigMap directly. The reload plugin picks up changes automatically.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health { lameduck 5s }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4 {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
    # Forward corp.example.com to internal DNS
    corp.example.com:53 {
        errors
        cache 30
        forward . 172.16.0.53 172.16.0.54
    }
The second server block above handles all queries for corp.example.com and its subdomains, forwarding them to corporate DNS servers at 172.16.0.53 and 172.16.0.54. Everything else goes through the main block. This pattern is common in hybrid environments where internal services have their own DNS zone.
The ndots:5 Problem
The ndots:5 setting in /etc/resolv.conf is Kubernetes' most misunderstood DNS behavior. It means: if a name has fewer than 5 dots, treat it as a relative name and try all search domains before querying it as-is. The intent is to let you write short names like my-service or my-service.other-namespace and have them resolve through the search list.
The problem emerges with external domains. A name like api.github.com has only 2 dots — fewer than 5 — so the resolver doesn't query it directly. Instead, it tries the search domains first:
- `api.github.com.default.svc.cluster.local` → NXDOMAIN
- `api.github.com.svc.cluster.local` → NXDOMAIN
- `api.github.com.cluster.local` → NXDOMAIN
- `api.github.com` → ✔ resolved
That's four DNS queries instead of one for every external domain lookup. In applications that make heavy use of external APIs, this creates significant unnecessary DNS traffic and adds latency (typically 2–10ms per extra query). Multiply that across thousands of Pods making frequent HTTP calls and the load on CoreDNS becomes substantial.
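To make the cost concrete, here is a small self-contained shell sketch that mimics the resolver's search-list expansion rule (the search domains are the defaults a Pod sees; `expand_query` is a hypothetical helper for illustration, not a real resolver):

```shell
#!/bin/sh
# Mimic the resolver's rule: names with fewer than $ndots dots are tried
# against each search domain before being queried as-is.
expand_query() {
  name=$1
  ndots=$2
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')
  if [ "$dots" -lt "$ndots" ]; then
    for search in default.svc.cluster.local svc.cluster.local cluster.local; do
      echo "$name.$search"
    done
  fi
  echo "$name"
}

expand_query api.github.com 5   # 4 candidate queries under ndots:5
expand_query api.github.com 2   # 1 query under ndots:2
```

Running this shows four candidate queries under `ndots:5` but only one under `ndots:2`, which is exactly the traffic reduction the mitigations below aim for.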
Mitigations
You have several options to reduce the overhead, depending on how much control you have over the application and the Pod spec.
Option 1: Use FQDNs with a trailing dot. A trailing dot tells the resolver the name is absolute — skip the search list entirely. This requires changing application code or configuration.
# Instead of this (triggers search list):
api.github.com
# Use this (absolute, no search list):
api.github.com.
Option 2: Lower ndots in the Pod spec. With ndots: 2, any name with at least two dots (like api.github.com) is queried directly first, while cross-namespace short names like my-service.other-namespace have only one dot and still resolve through the search list. Going all the way to ndots: 1 would break those cross-namespace names unless you append .svc.cluster.local explicitly.
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots-pod
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
  containers:
  - name: app
    image: my-app:1.0
Option 3: Use NodeLocal DNSCache. This DaemonSet runs a DNS caching agent on every node, reducing the latency and load caused by repeated queries. It doesn't eliminate extra queries, but it makes them much cheaper by serving cached NXDOMAIN responses locally.
For most workloads, lowering ndots to 2 is the best balance. Single-name Service lookups (my-service) and cross-namespace lookups (my-service.other-ns) still work through the search list, but external domains like api.github.com resolve in a single query. Only deep subdomains like a.b.c.example.com would still trigger extra lookups.
Debugging DNS Issues
When Pods can't connect to Services or external endpoints, DNS is one of the first things to investigate. The fastest approach is to launch a temporary Pod with DNS tools and query CoreDNS directly.
Step 1: Launch a debug Pod
The busybox image includes nslookup, and the registry.k8s.io/e2e-test-images/jessie-dnsutils image includes both nslookup and dig. Use whichever is appropriate:
# Quick debug pod with nslookup
kubectl run dns-debug --rm -it --restart=Never \
--image=busybox:1.36 -- sh
# Full debug pod with dig + nslookup
kubectl run dns-debug --rm -it --restart=Never \
--image=registry.k8s.io/e2e-test-images/jessie-dnsutils -- bash
Step 2: Test cluster DNS resolution
# Resolve a Service in the same namespace
nslookup my-service
# Resolve a Service in another namespace
nslookup my-service.other-namespace.svc.cluster.local
# Resolve an external domain
nslookup google.com
# Use dig for detailed output (shows query path, TTL, response code)
dig my-service.default.svc.cluster.local
# Query CoreDNS directly by IP
dig @10.96.0.10 kubernetes.default.svc.cluster.local
Step 3: Check CoreDNS health
# Are CoreDNS pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Verify the kube-dns Service has endpoints
kubectl get endpoints kube-dns -n kube-system
Common DNS Problems and Causes
| Symptom | Likely Cause | Investigation |
|---|---|---|
| `nslookup` times out | CoreDNS pods are down or the kube-dns Service has no endpoints | Check `kubectl get pods -n kube-system -l k8s-app=kube-dns` |
| Cluster names resolve but external domains don't | Upstream forwarder is misconfigured or unreachable from the node | Check the `forward` plugin in the Corefile; test upstream connectivity from the node |
| Intermittent SERVFAIL responses | UDP conntrack race condition (Linux kernel < 5.0) causing dropped packets | Enable the `use-vc` resolver option via `dnsConfig` to force TCP, or deploy NodeLocal DNSCache |
| Slow external lookups | `ndots:5` causing unnecessary search domain queries | Run `dig +search api.example.com` and count queries; lower `ndots` |
| Wrong Service IP returned | Stale DNS cache or Service was recreated with a new ClusterIP | Restart the client Pod to flush its local resolver cache |
| CoreDNS CrashLoopBackOff | Forwarding loop detected by the `loop` plugin (often caused by host `/etc/resolv.conf` pointing to 127.0.0.1) | Check CoreDNS logs; fix the `forward` directive to point to a real upstream server |
On nodes where /etc/resolv.conf points to 127.0.0.1 or 127.0.0.53 (common with systemd-resolved), the default forward . /etc/resolv.conf in the Corefile can create a loop: CoreDNS forwards to localhost, which forwards back to CoreDNS. The loop plugin detects this and crashes CoreDNS on purpose. Fix it by pointing forward to explicit upstream servers like 8.8.8.8 or your infrastructure's DNS.
Network Policies — Microsegmentation in Kubernetes
By default, every Pod in a Kubernetes cluster can talk to every other Pod — across namespaces, across nodes, with zero restrictions. This is the "flat network" model baked into the Kubernetes networking specification. It makes getting started easy, but it is a serious security liability in any real environment.
Imagine a compromised frontend Pod freely reaching your database Pod, or a rogue workload in a shared cluster scanning every service on the network. Without network policies, there is nothing stopping lateral movement once an attacker gains a foothold in any Pod. NetworkPolicy resources give you microsegmentation — fine-grained firewall rules that restrict which Pods can communicate with which, on which ports, and in which direction.
flowchart LR
    subgraph "Default: No Network Policies"
        A[Pod A<br/>frontend] <-->|allowed| B[Pod B<br/>api]
        A <-->|allowed| C[Pod C<br/>database]
        B <-->|allowed| C
        D[Pod D<br/>untrusted] <-->|allowed| C
        D <-->|allowed| B
        D <-->|allowed| A
    end
    style D fill:#e74c3c,color:#fff,stroke:#c0392b
    style C fill:#2ecc71,color:#fff,stroke:#27ae60
In the diagram above, every Pod can reach every other Pod. The untrusted Pod has full access to the database — exactly the scenario Network Policies are designed to prevent.
The NetworkPolicy Resource
A NetworkPolicy is a namespaced resource that defines traffic rules for a group of Pods. It has three core building blocks: a pod selector that identifies which Pods the policy targets, ingress rules that control incoming traffic, and egress rules that control outgoing traffic. Here is the skeleton:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
This policy targets all Pods with app: api in the production namespace. It allows inbound traffic only from Pods labeled app: frontend on port 8080, and permits outbound traffic only to Pods labeled app: database on port 5432. All other traffic to and from the api Pods is denied.
spec.podSelector — Targeting Pods
The podSelector field uses standard label selectors to determine which Pods in the policy's namespace are affected. An empty selector (podSelector: {}) matches all Pods in the namespace — this is how you build default deny rules. A selector like matchLabels: {app: api, tier: backend} targets only Pods carrying both labels.
Ingress Rules — Controlling Inbound Traffic
The ingress array defines who is allowed to send traffic to the selected Pods. Each rule in the array is an independent allow rule. Within a single rule, the from array has three types of selectors that can be combined:
| Selector | Scope | Example Use Case |
|---|---|---|
podSelector | Pods in the same namespace as the policy | Allow frontend Pods to reach API Pods |
namespaceSelector | All Pods in namespaces matching the label selector | Allow monitoring namespace to scrape metrics |
ipBlock | CIDR ranges (external or internal IPs) | Allow traffic from corporate VPN 10.0.0.0/8 |
Items within a single from entry are ANDed. Separate entries in the from array are ORed. Placing podSelector and namespaceSelector in the same entry means "Pods matching this label in namespaces matching that label." Placing them as separate entries means "Pods matching this label or any Pod in namespaces matching that label." Getting this wrong silently opens or closes traffic you did not intend.
Here is the difference spelled out in YAML. The first example is an AND (both conditions must be true). The second is an OR (either condition allows traffic):
# AND — Pods labeled app:prometheus IN namespaces labeled team:monitoring
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        team: monitoring
    podSelector:
      matchLabels:
        app: prometheus

# OR — Pods labeled app:prometheus OR any Pod in team:monitoring namespaces
ingress:
- from:
  - namespaceSelector:
      matchLabels:
        team: monitoring
  - podSelector:
      matchLabels:
        app: prometheus
Egress Rules — Controlling Outbound Traffic
Egress rules mirror ingress exactly, but use the to field instead of from. They support the same three selectors: podSelector, namespaceSelector, and ipBlock. Egress policies are critical for preventing compromised Pods from exfiltrating data or reaching external command-and-control servers.
egress:
# Allow DNS resolution (critical — almost always needed)
- to:
  - namespaceSelector: {}
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
# Allow outbound to the database
- to:
  - podSelector:
      matchLabels:
        app: database
  ports:
  - protocol: TCP
    port: 5432
When you add egress restrictions, you almost certainly need to explicitly allow DNS (UDP and TCP port 53) to the kube-system namespace or to all namespaces. Without this, Pods cannot resolve Service names and virtually everything breaks — even though the underlying IP might be allowed.
Port Specifications
Both ingress and egress rules support a ports array. Each entry specifies a protocol (TCP, UDP, or SCTP) and a port number or named port. You can also use endPort to specify a range. If you omit the ports field entirely, the rule applies to all ports.
ports:
- protocol: TCP
  port: 8080        # single port
- protocol: TCP
  port: 9090
  endPort: 9099     # port range 9090-9099
- protocol: TCP
  port: http        # named port from Pod spec
How Policies Combine — The Additive Model
NetworkPolicies are additive. When multiple policies select the same Pod, the effective set of allowed connections is the union of all those policies. You cannot write a policy that removes access granted by another policy. This is by design — it makes the system predictable and prevents accidental lockouts from conflicting rules.
Here is how it works in practice: if Policy A allows ingress from the frontend on port 8080, and Policy B allows ingress from the monitoring namespace on port 9090, then both types of traffic are allowed. There is no priority or ordering between policies. The only way to restrict traffic is to not include it in any policy that selects those Pods — which is why starting with a deny-all baseline is so important.
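Sketched as manifests (the policy names are hypothetical), the two policies described above combine into a union: a Pod labeled `app: api` accepts both frontend traffic on 8080 and monitoring traffic on 9090, with no ordering between the policies.

```yaml
# Policy A — allow frontend on 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend          # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
---
# Policy B — allow monitoring namespaces on 9090.
# The effective rule set is the union of A and B.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring        # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: monitoring
    ports:
    - protocol: TCP
      port: 9090
```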
Default Deny Policies
The strongest pattern for securing a namespace is to start with a deny-all policy and then layer on explicit allows. The moment any NetworkPolicy selects a Pod, all traffic not explicitly permitted by some policy is denied. A policy with an empty podSelector and empty rules achieves exactly this.
Default Deny All Ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}   # selects ALL Pods in namespace
  policyTypes:
  - Ingress         # no ingress rules = deny all inbound
This selects every Pod in the production namespace and declares Ingress as a policy type — but provides zero ingress rules. The result: all inbound traffic to every Pod in the namespace is blocked. Outbound traffic is unaffected because Egress is not listed in policyTypes.
Default Deny All Egress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress          # no egress rules = deny all outbound
Default Deny Both Directions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
With this applied, no Pod in the namespace can send or receive any traffic until you explicitly add policies that allow it. This is the zero-trust starting point.
CNI Plugin Requirements
Here is the part that catches many teams off guard: NetworkPolicy resources are only enforced if your CNI plugin supports them. The Kubernetes API server will happily accept your NetworkPolicy manifests regardless — no errors, no warnings — but they are silently ignored if the CNI cannot enforce them.
| CNI Plugin | NetworkPolicy Support | Notes |
|---|---|---|
| Calico | ✅ Full support | Also supports its own extended GlobalNetworkPolicy CRD |
| Cilium | ✅ Full support | eBPF-based; also supports L7 policies (HTTP, gRPC) |
| Weave Net | ✅ Full support | Standard Kubernetes NetworkPolicy |
| Antrea | ✅ Full support | Also offers tiered policy CRDs |
| Flannel | ❌ No support | Provides connectivity only; pair with Calico ("Canal") for policies |
| AWS VPC CNI | ⚠️ Partial | Requires enabling the network policy agent add-on on EKS |
After applying a deny-all policy, test it. Run kubectl exec into a Pod and try to curl a service that should now be blocked. If the request succeeds, your CNI is not enforcing policies. This is a critical validation step — do not assume policies are active just because the resource was created.
Practical Patterns
Pattern 1: Isolating a Namespace
The most common starting point is locking down an entire namespace so that only Pods within it can communicate with each other. External namespaces cannot reach in, and Pods inside cannot reach out to other namespaces (except DNS).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow traffic only from within this namespace
  - from:
    - podSelector: {}
  egress:
  # Allow traffic within this namespace
  - to:
    - podSelector: {}
  # Allow DNS resolution
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
Every Pod in tenant-a can reach other Pods in tenant-a and resolve DNS. Nothing else — no cross-namespace communication, no internet access.
Pattern 2: Deny All, Then Allow Specific Service-to-Service Communication
This is the zero-trust approach: start with a full deny, then create targeted policies for each legitimate communication path. The following example models a three-tier application where the frontend talks to the API, and the API talks to the database. Nothing else is permitted.
# 1. Deny all traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# 2. Allow DNS for all Pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
---
# 3. Frontend can receive external ingress, send to API
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - ports:
    - protocol: TCP
      port: 80
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 8080
---
# 4. API receives from frontend, sends to database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
---
# 5. Database receives from API only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 5432
flowchart LR
    EXT[External<br/>Traffic] -->|":80"| FE[frontend]
    FE -->|":8080"| API[api]
    API -->|":5432"| DB[database]
    FE x--x|blocked| DB
    EXT x--x|blocked| DB
    EXT x--x|blocked| API
    style EXT fill:#95a5a6,color:#fff,stroke:#7f8c8d
    style FE fill:#3498db,color:#fff,stroke:#2980b9
    style API fill:#f39c12,color:#fff,stroke:#e67e22
    style DB fill:#2ecc71,color:#fff,stroke:#27ae60
This is five manifests, but the pattern is systematic. Each service gets exactly the connections it needs and nothing more. If the API is compromised, it can reach the database — but the database policy restricts it to port 5432, and the API cannot reach the internet or other namespaces.
Pattern 3: Allowing Cross-Namespace Monitoring
A common real-world requirement is allowing your monitoring stack (Prometheus, Grafana) to scrape metrics from application namespaces. Label the monitoring namespace and use a namespaceSelector:
# First, label the monitoring namespace:
# kubectl label namespace monitoring purpose=monitoring
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scraping
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          purpose: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 9090
    - protocol: TCP
      port: 9091
Notice this uses the AND form — namespaceSelector and podSelector are in the same from entry. This means only Pods labeled app: prometheus within namespaces labeled purpose: monitoring can reach the metrics ports. Any other Pod in the monitoring namespace is still blocked.
Pattern 4: Restricting Egress to External CIDRs
Sometimes a Pod needs to reach an external service (a SaaS API, a managed database outside the cluster). Use ipBlock to allow traffic only to specific IP ranges:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-external-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
  - Egress
  egress:
  # Allow DNS
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
  # Allow Stripe API (example IPs)
  - to:
    - ipBlock:
        cidr: 54.187.174.169/32
    - ipBlock:
        cidr: 54.187.205.235/32
    ports:
    - protocol: TCP
      port: 443
The payment-service can reach exactly two external IP addresses on port 443 (HTTPS) and nothing else. Even if the service is compromised, data exfiltration to arbitrary internet hosts is blocked.
Debugging Network Policies
When traffic is unexpectedly blocked (or unexpectedly allowed), use this checklist:
- Verify your CNI supports NetworkPolicy. Run `kubectl get pods -n kube-system` and check which CNI is running.
- List policies affecting a Pod. Use `kubectl get networkpolicy -n <namespace>` and check each policy's `podSelector` against your Pod's labels.
- Check label accuracy. A typo in a label selector silently matches zero Pods. Run `kubectl get pods --show-labels` to verify.
- Test connectivity directly. Exec into a Pod and use `curl`, `wget`, or `nc` to test reachability: `kubectl exec -it <pod> -- curl -m 3 <target>:<port>`
- Remember DNS. If egress is restricted and DNS is not explicitly allowed, name resolution fails and every connection attempt by hostname fails with it.
Volumes — Attaching Storage to Pods
Every container in Kubernetes starts with a fresh, isolated filesystem layered from its image. When that container restarts — whether from a crash, a liveness probe failure, or a rolling update — everything written to that filesystem is gone. This is by design: containers are ephemeral. But most real applications need to persist data, share files between containers, or load configuration from external sources.
Kubernetes volumes solve this by providing a directory that is mounted into a container's filesystem and whose lifecycle is tied to the Pod (not the container). A volume outlives container restarts within the same Pod, and different volume types connect you to everything from temporary scratch space to cloud provider block storage.
The Ephemeral Filesystem Problem
To understand why volumes matter, consider what happens without them. A container writes log files, caches data, or stores user uploads to its local filesystem. When the kubelet restarts that container (say, after an OOMKill), the replacement container starts from a clean image layer. All written data is lost. If you have a sidecar container that needs to read those same log files, it cannot — each container has its own isolated filesystem root.
Volumes address both problems. They survive container restarts within a Pod, and they can be mounted into multiple containers simultaneously, enabling shared-storage patterns like the sidecar log collector or the adapter pattern.
Volume Lifecycle
A Kubernetes volume is declared at the Pod level (under spec.volumes) and mounted into one or more containers (under spec.containers[].volumeMounts). The critical rule: a volume's lifetime is tied to the Pod's lifetime. When the Pod is deleted, the volume is cleaned up — unless it points to external, persistent storage (which we cover in the next section on PersistentVolumes).
The volume types covered here (emptyDir, configMap, secret, etc.) are Pod-scoped. They are created when the Pod is scheduled and destroyed when the Pod is removed. PersistentVolumes and PersistentVolumeClaims, covered in the next section, decouple storage lifecycle from Pod lifecycle entirely.
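The declare-then-mount relationship looks like this in skeleton form (names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-skeleton       # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.27
    volumeMounts:             # per-container: where the volume appears
    - name: data
      mountPath: /data
  volumes:                    # Pod-level: what the volume is
  - name: data
    emptyDir: {}
```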
emptyDir — Temporary Shared Storage
An emptyDir volume starts as an empty directory when a Pod is assigned to a node. It exists for the lifetime of that Pod and is shared across all containers in the Pod. This is the workhorse volume for multi-container patterns: a main application container writes to the volume, and a sidecar reads from it.
By default, emptyDir is backed by the node's disk (whatever medium backs the node's filesystem). You can set medium: Memory to use a tmpfs RAM-backed filesystem instead, which is faster but counts against the container's memory limit and disappears on node reboot.
apiVersion: v1
kind: Pod
metadata:
  name: log-collector
spec:
  containers:
  - name: app
    image: nginx:1.27
    volumeMounts:
    - name: shared-logs
      mountPath: /var/log/nginx
  - name: sidecar
    image: busybox:1.36
    command: ["sh", "-c", "tail -F /logs/access.log"]
    volumeMounts:
    - name: shared-logs
      mountPath: /logs
      readOnly: true
  volumes:
  - name: shared-logs
    emptyDir: {}
In this example, Nginx writes access logs to /var/log/nginx. The sidecar container sees those same files at /logs because both mount the same emptyDir volume. The sidecar mounts it read-only since it only needs to tail the logs.
For a RAM-backed scratch volume with a size cap:
volumes:
- name: scratch-space
  emptyDir:
    medium: Memory
    sizeLimit: 128Mi
hostPath — Access to the Node Filesystem
A hostPath volume mounts a file or directory from the host node's filesystem directly into the Pod. This gives containers access to node-level resources like Docker's socket, system logs, or device files. It is powerful but dangerous — you are breaking the isolation boundary between your Pod and the underlying node.
apiVersion: v1
kind: Pod
metadata:
  name: node-log-reader
spec:
  containers:
  - name: log-reader
    image: busybox:1.36
    command: ["sh", "-c", "cat /host-logs/syslog"]
    volumeMounts:
    - name: host-logs
      mountPath: /host-logs
      readOnly: true
  volumes:
  - name: host-logs
    hostPath:
      path: /var/log
      type: Directory
The type field adds safety checks before the mount. Common values include Directory (must already exist as a directory), File (must already exist as a file), DirectoryOrCreate (create if missing), and "" (empty string, no checks — the default).
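For example, a sketch using DirectoryOrCreate so the mount succeeds even on nodes where the directory does not yet exist (the path is illustrative):

```yaml
volumes:
- name: agent-state
  hostPath:
    path: /var/lib/my-agent   # hypothetical path; created if missing
    type: DirectoryOrCreate
```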
A Pod with a writable hostPath mount to / has full read/write access to the entire node filesystem — effectively root on the host. In production, restrict hostPath usage via Pod Security Admission or OPA/Gatekeeper policies. Most workloads should never need it. The exception is system-level DaemonSets (log agents, monitoring agents, CSI drivers) that intentionally need node access.
configMap and secret — Mount Configuration as Files
ConfigMaps and Secrets are Kubernetes API objects that store key-value data. When you mount them as volumes, each key becomes a file in the mounted directory, with the value as the file content. This is how you inject configuration files, TLS certificates, or credential files into containers without baking them into the image.
ConfigMap Volume
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    server {
      listen 80;
      location / {
        root /usr/share/nginx/html;
      }
    }
  extra-settings.conf: |
    gzip on;
---
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    volumeMounts:
    - name: config-vol
      mountPath: /etc/nginx/conf.d
  volumes:
  - name: config-vol
    configMap:
      name: nginx-config
After mounting, the container's /etc/nginx/conf.d directory contains two files: nginx.conf and extra-settings.conf. If you only need specific keys, use the items field to select them and optionally remap filenames:
volumes:
- name: config-vol
  configMap:
    name: nginx-config
    items:
    - key: nginx.conf
      path: default.conf
This mounts only the nginx.conf key, but names the resulting file default.conf inside the mount directory.
Secret Volume
Secret volumes work identically to ConfigMap volumes, except the data comes from a Secret object. Kubernetes mounts Secret files with 0644 permissions by default; you can tighten this with defaultMode. The files are backed by a tmpfs in memory — they are never written to the node's disk.
apiVersion: v1
kind: Pod
metadata:
  name: tls-app
spec:
  containers:
  - name: app
    image: my-app:2.1
    volumeMounts:
    - name: tls-certs
      mountPath: /etc/tls
      readOnly: true
  volumes:
  - name: tls-certs
    secret:
      secretName: app-tls
      defaultMode: 0400
The Secret app-tls might contain keys tls.crt and tls.key. After mounting, the container finds them at /etc/tls/tls.crt and /etc/tls/tls.key, both with restrictive 0400 permissions (owner read-only).
downwardAPI — Expose Pod Metadata as Files
The downwardAPI volume projects Pod metadata — labels, annotations, resource limits, the Pod name, namespace, and node name — into files inside the container. This is useful when your application needs self-awareness without querying the Kubernetes API directly.
apiVersion: v1
kind: Pod
metadata:
  name: self-aware-app
  labels:
    app: backend
    version: v2
  annotations:
    owner: platform-team
spec:
  containers:
  - name: app
    image: my-app:2.1
    volumeMounts:
    - name: pod-info
      mountPath: /etc/podinfo
  volumes:
  - name: pod-info
    downwardAPI:
      items:
      - path: labels
        fieldRef:
          fieldPath: metadata.labels
      - path: annotations
        fieldRef:
          fieldPath: metadata.annotations
      - path: cpu-limit
        resourceFieldRef:
          containerName: app
          resource: limits.cpu
Inside the container, /etc/podinfo/labels contains the content app="backend"\nversion="v2". The cpu-limit file contains the numeric CPU limit. Labels and annotations are kept in sync — if an annotation changes on the running Pod, the mounted file updates automatically (with a short delay).
projected — Combine Multiple Sources
A projected volume merges multiple volume sources — Secrets, ConfigMaps, downwardAPI fields, and serviceAccountToken — into a single mount point. Without projected volumes, you would need separate volume declarations and separate mount paths for each source. This quickly becomes unwieldy when a container needs a TLS cert from a Secret, a config file from a ConfigMap, and a service account token all in the same directory.
apiVersion: v1
kind: Pod
metadata:
  name: combined-config
spec:
  containers:
  - name: app
    image: my-app:2.1
    volumeMounts:
    - name: all-config
      mountPath: /etc/app-config
      readOnly: true
  volumes:
  - name: all-config
    projected:
      sources:
      - configMap:
          name: app-settings
          items:
          - key: app.yaml
            path: app.yaml
      - secret:
          name: app-tls
          items:
          - key: tls.crt
            path: certs/tls.crt
          - key: tls.key
            path: certs/tls.key
      - downwardAPI:
          items:
          - path: pod-name
            fieldRef:
              fieldPath: metadata.name
      - serviceAccountToken:
          path: token
          expirationSeconds: 3600
          audience: vault
The resulting directory at /etc/app-config contains app.yaml (from the ConfigMap), certs/tls.crt and certs/tls.key (from the Secret), pod-name (from the downward API), and token (a bound service account token with a 1-hour expiry). The serviceAccountToken source is particularly useful — it generates short-lived, audience-scoped tokens, which is the preferred alternative to the long-lived tokens from the default service account Secret.
subPath — Mount a Single File Without Hiding a Directory
When you mount a volume to a container path, it replaces the entire directory at that path. If the container image already has files in /etc/nginx/conf.d and you mount a ConfigMap there, the original files disappear. The subPath field solves this by mounting a single file (or subdirectory) from the volume into the target path, leaving the rest of the directory untouched.
apiVersion: v1
kind: Pod
metadata:
  name: subpath-demo
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    volumeMounts:
    - name: custom-config
      mountPath: /etc/nginx/conf.d/custom.conf
      subPath: custom.conf
  volumes:
  - name: custom-config
    configMap:
      name: nginx-custom
Now only custom.conf is placed at /etc/nginx/conf.d/custom.conf. The image's default default.conf and any other files in that directory remain intact.
ConfigMap and Secret volumes normally auto-update when the underlying object changes (with a delay of up to the kubelet sync period, ~60 seconds by default). When you use subPath, this auto-update does not happen. The file is mounted as a bind mount, not a symlink, so the kubelet cannot swap it. If you need live-reloading config, mount the full volume to a separate directory and symlink or have your application watch that path.
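A sketch of that live-reload workaround, reusing the nginx-custom ConfigMap from above (the Pod name and side-directory path are illustrative): mount the full volume at its own directory, then have the application include or watch files there.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: live-reload-demo   # illustrative name
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    volumeMounts:
    - name: custom-config
      mountPath: /etc/nginx/conf.d-live   # whole-volume mount: files here auto-update
  volumes:
  - name: custom-config
    configMap:
      name: nginx-custom
```

Because the whole volume is mounted (no subPath), the kubelet can atomically swap the symlinked data directory, and changes to the ConfigMap appear in /etc/nginx/conf.d-live within the sync delay.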
volumeMounts Options Reference
Beyond mountPath and subPath, the volumeMounts field supports several options that control how the volume behaves inside the container.
| Field | Type | Description |
|---|---|---|
| `mountPath` | string | Absolute path inside the container where the volume is mounted. Required. |
| `readOnly` | bool | Mount the volume as read-only. Defaults to `false`. Use for secrets, config, and any data the container should not modify. |
| `subPath` | string | Mount a single file or subdirectory from the volume instead of the root. Prevents hiding existing directory contents. |
| `subPathExpr` | string | Like `subPath` but supports environment variable expansion (`$(VAR_NAME)`). Useful for per-Pod paths in StatefulSets. |
| `mountPropagation` | string | Controls whether mounts made inside the container are visible to the host and vice versa. Values: `None` (default), `HostToContainer`, `Bidirectional`. Only needed for CSI drivers and system-level Pods. |
Volume Type Comparison
Choosing the right volume type depends on what data you need and how long it should survive. This table summarizes the core Pod-level volume types at a glance.
| Volume Type | Data Source | Lifetime | Writable | Primary Use Case |
|---|---|---|---|---|
| `emptyDir` | Empty (node disk or RAM) | Pod | Yes | Scratch space, inter-container sharing |
| `hostPath` | Node filesystem | Node | Yes | System DaemonSets, node-level access |
| `configMap` | ConfigMap object | Pod | No* | Config files, env-specific settings |
| `secret` | Secret object | Pod | No* | TLS certs, credentials, tokens |
| `downwardAPI` | Pod metadata | Pod | No | Expose labels, annotations, resource limits |
| `projected` | Multiple sources combined | Pod | No* | Unified mount for config + secrets + tokens |
* ConfigMap, Secret, downwardAPI, and projected volumes are mounted read-only by the kubelet (enforced since Kubernetes 1.9.4); writes from inside the container fail with a read-only filesystem error. This is separate from the optional `immutable: true` field on ConfigMaps and Secrets (stable since 1.21), which prevents updates to the API object itself rather than to the mounted files.
Putting It Together — A Complete Multi-Volume Pod
Real-world Pods often combine several volume types. Here is a complete example: a web application Pod that loads its config from a ConfigMap, its TLS certificates from a Secret, writes temporary cache data to an emptyDir, and exposes Pod labels to the application.
apiVersion: v1
kind: Pod
metadata:
  name: production-web
  labels:
    app: web
    tier: frontend
spec:
  containers:
  - name: web
    image: my-web-app:3.4
    ports:
    - containerPort: 8443
    volumeMounts:
    - name: app-config
      mountPath: /etc/app/config.yaml
      subPath: config.yaml
      readOnly: true
    - name: tls
      mountPath: /etc/tls
      readOnly: true
    - name: cache
      mountPath: /tmp/cache
    - name: pod-meta
      mountPath: /etc/podinfo
      readOnly: true
    resources:
      limits:
        memory: 256Mi
        cpu: 500m
  volumes:
  - name: app-config
    configMap:
      name: web-app-config
  - name: tls
    secret:
      secretName: web-tls-cert
      defaultMode: 0400
  - name: cache
    emptyDir:
      sizeLimit: 100Mi
  - name: pod-meta
    downwardAPI:
      items:
      - path: labels
        fieldRef:
          fieldPath: metadata.labels
This pattern — config from a ConfigMap, secrets from a Secret, ephemeral cache from emptyDir, and metadata from the downward API — covers the majority of volume needs for stateless applications. When you need storage that survives Pod deletion, that is when you reach for PersistentVolumes, covered next.
PersistentVolumes and PersistentVolumeClaims
Containers are ephemeral, but data often is not. A database, a file upload service, or a message queue all need storage that survives Pod restarts, rescheduling, and even node failures. Kubernetes solves this with a two-object model: PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). This separation cleanly divides infrastructure provisioning from application consumption.
The Separation of Concerns
A PersistentVolume (PV) is a cluster-level resource representing a piece of physical or cloud storage — an NFS export, an AWS EBS volume, a GCE Persistent Disk, or a local SSD. PVs are created by cluster administrators (or dynamically by a provisioner) and exist independently of any Pod. Think of a PV as the actual hard drive sitting in a rack.
A PersistentVolumeClaim (PVC) is a namespaced request for storage made by a user or workload. It specifies how much storage is needed and how it will be accessed, without caring about where the storage comes from. The PVC is the purchase order; the PV is the inventory item that fulfills it.
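The shape of that purchase order is deliberately simple. A minimal PVC sketch (the name and namespace here are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: reports-data     # illustrative name
  namespace: analytics   # illustrative namespace
spec:
  accessModes:
  - ReadWriteOnce        # how the storage will be accessed
  resources:
    requests:
      storage: 10Gi      # how much storage is needed
```

Nothing in this manifest says NFS, EBS, or local disk. That is the point: the claim describes requirements, and the cluster decides which PV fulfills them.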
PVs are cluster-scoped — they do not belong to any namespace. PVCs are namespace-scoped — they live alongside the Pods that use them. This is the core of the separation: admins manage PVs globally, developers request PVCs within their namespace.
The PV Lifecycle
A PersistentVolume moves through four distinct phases from creation to cleanup. Understanding this lifecycle is critical for debugging storage issues and planning capacity.
stateDiagram-v2
[*] --> Available : Provisioning (Static or Dynamic)
Available --> Bound : PVC matches PV
Bound --> Released : PVC is deleted
Released --> Available : Reclaim policy = Recycle (deprecated)
Released --> [*] : Reclaim policy = Delete
Released --> Released : Reclaim policy = Retain (manual cleanup)
1. Provisioning
Before a PV can be used, it must exist. There are two provisioning strategies, and most production clusters use both.
Static provisioning means a cluster administrator creates PV objects by hand, each pointing to a specific backing storage resource. This is common with on-premises infrastructure — NFS servers, iSCSI targets, or local disks — where automated provisioning is not available.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-01
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.5
    path: /exports/data
Dynamic provisioning eliminates the need for pre-created PVs. When a PVC references a StorageClass, Kubernetes automatically provisions a matching PV through the class's provisioner plugin. This is the default model on cloud providers (AWS, GCP, Azure) and CSI-based storage systems.
2. Binding
When a PVC is created, the control plane searches for an available PV that satisfies the claim's requirements: sufficient capacity, compatible access modes, matching StorageClass, and any label selectors. If a match is found, the PVC and PV are bound in a one-to-one relationship. This binding is exclusive — once a PV is bound to a PVC, no other PVC can use it, even if the PV has more capacity than the claim requested.
If no matching PV exists and no StorageClass can dynamically provision one, the PVC remains in a Pending state indefinitely until a suitable PV becomes available.
3. Using
Once bound, a Pod can mount the PVC as a volume. The cluster looks up the PVC, finds the bound PV, and mounts the underlying storage into the Pod's container at the specified path. Multiple Pods can use the same PVC simultaneously if the access mode permits it (e.g., ReadWriteMany).
apiVersion: v1
kind: Pod
metadata:
  name: app-server
spec:
  containers:
  - name: app
    image: myapp:3.2
    volumeMounts:
    - mountPath: /var/data
      name: data-volume
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: app-data-pvc
4. Reclaiming
When a PVC is deleted, its bound PV enters the Released phase. What happens next depends on the PV's reclaim policy. This is where storage cleanup strategy is configured, and getting it wrong can mean either data loss or orphaned cloud volumes accumulating cost.
Access Modes
Access modes define how a volume can be mounted by nodes. They are constraints on node-level access, not Pod-level — a volume with ReadWriteOnce can be mounted by multiple Pods, but only if they all run on the same node.
| Access Mode | Abbreviation | Meaning | Typical Use Case |
|---|---|---|---|
ReadWriteOnce | RWO | Read-write by a single node | Databases, single-instance apps |
ReadOnlyMany | ROX | Read-only by many nodes | Shared config, static assets |
ReadWriteMany | RWX | Read-write by many nodes | Shared file uploads, CMS content |
ReadWriteOncePod | RWOP | Read-write by a single Pod (stable in K8s 1.29) | Strict single-writer guarantees |
Not every storage backend supports every access mode. Block storage (EBS, GCE PD, Azure Disk) typically supports only RWO. File-based storage (NFS, EFS, CephFS) can support RWX. Always check your storage provider's documentation.
ReadWriteOncePod was introduced as stable in Kubernetes 1.29. Unlike ReadWriteOnce, which restricts to a single node, RWOP restricts to a single Pod across the entire cluster. This is the right choice when exactly one writer must exist — for example, a leader-elected process writing to a WAL.
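A sketch of a PVC using that mode (the claim name is illustrative); note that RWOP is only honored by CSI drivers that support it, not by in-tree plugins:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wal-writer-data    # illustrative name
spec:
  accessModes:
  - ReadWriteOncePod       # exactly one Pod cluster-wide may mount this volume
  resources:
    requests:
      storage: 5Gi
```

If a second Pod tries to use the claim, it fails scheduling with an unschedulable event rather than silently sharing the volume.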
Reclaim Policies
The reclaim policy determines what happens to a PV (and its underlying storage) after its bound PVC is deleted. This is set on the PV, not the PVC.
| Policy | What Happens | Data Preserved? | When to Use |
|---|---|---|---|
Retain | PV moves to Released but is not cleaned up. The underlying storage and its data remain intact. An admin must manually reclaim the PV. | Yes | Production databases, any data you cannot afford to lose |
Delete | PV object and the underlying storage resource (e.g., the EBS volume) are both deleted automatically. | No | Dynamically provisioned volumes for stateless or easily-reproducible workloads |
Recycle | Runs rm -rf /thevolume/* on the volume and makes the PV available again. | No | Deprecated. Do not use. Use dynamic provisioning instead. |
Dynamically provisioned PVs inherit the reclaim policy from their StorageClass. The default for most cloud StorageClasses is Delete. If you are running stateful workloads like databases, you should either change the StorageClass default to Retain or patch individual PVs after creation.
# Patch an existing PV to Retain so deleting the PVC won't destroy data
kubectl patch pv my-database-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
Volume Binding Modes
Volume binding mode controls when a PV is bound to a PVC. This is configured on the StorageClass, not on the PV or PVC directly. The choice has real implications for scheduling and data locality.
Immediate (the default) — The PV is provisioned and bound as soon as the PVC is created, regardless of whether any Pod has requested it yet. This works fine for storage that is accessible from any node (like network-attached storage). But it can cause problems with topology-constrained storage: if an EBS volume is provisioned in us-east-1a but the Pod gets scheduled to a node in us-east-1b, the Pod will be stuck in Pending.
WaitForFirstConsumer — The PV binding and provisioning are delayed until a Pod that uses the PVC is scheduled. The scheduler considers the Pod's node assignment first, then provisions storage in the same topology zone. This is the recommended mode for zone-constrained block storage like EBS, GCE PD, and Azure Disk.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "4000"
If you see Pods stuck in Pending with events like "volume node affinity conflict", the most common cause is using Immediate binding mode with zone-constrained storage. Switch the StorageClass to WaitForFirstConsumer to fix it.
How PV-PVC Matching Works
When a PVC is created with no specific volumeName, the control plane runs a matching algorithm to find the best available PV. The criteria are evaluated in this order:
- StorageClass — The PVC's `storageClassName` must exactly match the PV's `storageClassName`. A PVC with `storageClassName: ""` (empty string) only matches PVs with no class. A PVC with no `storageClassName` field uses the cluster's default StorageClass.
- Access Modes — The PV must support at least the access modes requested by the PVC. A PV offering `[RWO, ROX]` satisfies a PVC requesting `[RWO]`.
- Capacity — The PV's capacity must be greater than or equal to the PVC's requested storage. Kubernetes picks the smallest PV that satisfies the claim to minimize waste.
- Label Selectors — If the PVC defines a `selector` with `matchLabels` or `matchExpressions`, only PVs matching those labels are considered.
- Volume Name — If the PVC specifies `volumeName`, it skips all other matching logic and binds directly to that specific PV (assuming access modes and capacity are compatible).
Putting It All Together: Static Provisioning Example
The following example shows the complete workflow: an admin creates a PV backed by NFS, a developer creates a PVC that matches it, and a Deployment mounts the claim.
# 1. Admin creates the PersistentVolume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-nfs-pv
  labels:
    environment: production
    tier: storage
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""  # Empty string = no dynamic provisioning
  nfs:
    server: 10.0.0.5
    path: /exports/shared-data
---
# 2. Developer creates a PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data-pvc
  namespace: web-app
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: ""  # Must match the PV's storageClassName
  selector:
    matchLabels:
      environment: production
---
# 3. Deployment mounts the PVC
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        volumeMounts:
        - mountPath: /usr/share/nginx/html
          name: shared-content
      volumes:
      - name: shared-content
        persistentVolumeClaim:
          claimName: shared-data-pvc
Dynamic Provisioning Example
With dynamic provisioning, you skip the PV creation entirely. The PVC references a StorageClass, and Kubernetes creates the PV automatically. This is the standard approach on cloud-managed clusters.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: fast-ssd  # References the StorageClass provisioner
Once this PVC is applied and a Pod referencing it is scheduled, the fast-ssd StorageClass provisioner creates a 20Gi volume in the same availability zone as the node (because we set WaitForFirstConsumer earlier). A corresponding PV object is automatically created and bound to this PVC.
Inspecting PVs and PVCs
When debugging storage issues, these commands show you the current state and help identify mismatches.
# List all PVs with their status, capacity, access modes, and bound claim
kubectl get pv
# List PVCs in a namespace with their bound PV
kubectl get pvc -n database
# Describe a PVC to see events — useful when a PVC is stuck in Pending
kubectl describe pvc postgres-data -n database
# Check why a PV is in Released state and not reusable
kubectl get pv shared-nfs-pv -o yaml | grep -A 5 claimRef
A common gotcha: when a PV has reclaim policy Retain and its PVC is deleted, the PV moves to Released but still holds a claimRef pointing to the old PVC. No new PVC can bind to it until you manually clear that reference:
# Remove the stale claimRef so the PV becomes Available again
kubectl patch pv shared-nfs-pv --type json \
-p '[{"op": "remove", "path": "/spec/claimRef"}]'
StorageClasses and Dynamic Provisioning
In the previous section, you created PersistentVolumes by hand and then matched them with PersistentVolumeClaims. That workflow is fine for a handful of volumes, but it collapses under real-world conditions. If you have 50 microservices each needing their own volume — across dev, staging, and production — someone has to manually create and manage 150+ PV manifests. Worse, those PVs must be pre-provisioned in the cloud provider before a workload can claim them.
StorageClasses solve this by letting you declare what kind of storage you want. When a PVC references a StorageClass, Kubernetes automatically calls the appropriate provisioner plugin to create the underlying volume, wraps it in a PV object, and binds it to the PVC — all without human intervention. This is dynamic provisioning, and it is how virtually every production cluster manages storage.
The StorageClass Spec
A StorageClass is a cluster-scoped resource (not namespaced) that acts as a template for volume creation. It tells Kubernetes which provisioner to call and what parameters to pass. Here is the anatomy of a StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com            # Which plugin creates the volume
parameters:                             # Provider-specific settings
  type: gp3
  fsType: ext4
  iopsPerGB: "50"
reclaimPolicy: Delete                   # What happens when PVC is deleted
volumeBindingMode: WaitForFirstConsumer # When to provision
allowVolumeExpansion: true              # Can PVCs grow after creation?
Each field controls a different aspect of the provisioning lifecycle. Let's walk through them.
provisioner
The provisioner field identifies the volume plugin responsible for creating and deleting the actual storage backend. Every cloud provider and storage system has its own provisioner. The older "in-tree" provisioners (like kubernetes.io/aws-ebs) are built into Kubernetes itself, but they are deprecated. Modern clusters use out-of-tree CSI drivers instead.
| Cloud / System | Legacy In-Tree Provisioner | Modern CSI Provisioner |
|---|---|---|
| AWS EBS | kubernetes.io/aws-ebs | ebs.csi.aws.com |
| GCP Persistent Disk | kubernetes.io/gce-pd | pd.csi.storage.gke.io |
| Azure Disk | kubernetes.io/azure-disk | disk.csi.azure.com |
| Azure File | kubernetes.io/azure-file | file.csi.azure.com |
| Rancher Local Path | — | rancher.io/local-path |
| Ceph RBD | kubernetes.io/rbd | rbd.csi.ceph.com |
Kubernetes has been migrating all in-tree volume plugins to CSI drivers since v1.23. As of v1.31, in-tree AWS EBS and GCE PD code is removed from the core codebase. Always use the CSI provisioner for new clusters. If you see kubernetes.io/* provisioners in existing manifests, plan a migration to the CSI equivalents.
parameters
The parameters map is passed directly to the provisioner. Its contents are entirely provider-specific — Kubernetes does not validate them. Here are common parameters for the major cloud providers:
| Parameter | AWS EBS (CSI) | GCP PD (CSI) | Description |
|---|---|---|---|
type | gp3, gp2, io2, st1 | pd-ssd, pd-standard, pd-balanced | Volume type / performance tier |
fsType | ext4, xfs | ext4, xfs | Filesystem to format the volume with |
iopsPerGB | "50" (io1/io2 only) | — | Provisioned IOPS per GiB |
encrypted | "true" | — | Enable encryption at rest |
replication-type | — | regional-pd | Enable regional replication on GCP |
All parameter values must be strings — even numeric ones. Writing iopsPerGB: 50 without quotes will cause YAML to pass an integer, which some provisioners reject. Always quote: iopsPerGB: "50".
reclaimPolicy
The reclaimPolicy determines what happens to the underlying volume when the PVC is deleted. There are two options:
- `Delete` (default): The PV and its backing storage are destroyed. This is the right choice for ephemeral or easily-recreatable data.
- `Retain`: The PV is kept and its status changes to `Released`. The actual cloud volume is not deleted. You must manually reclaim or clean it up. Use this for databases and anything you cannot afford to lose.
Note that Recycle (which ran rm -rf on the volume) is deprecated and unsupported by CSI drivers. If you need similar behavior, use a Delete policy and rely on backups.
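A sketch of a Retain-policy class for durable data (the class name is illustrative; the provisioner and parameters follow the AWS examples used in this chapter):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: durable-ssd          # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain        # deleting the PVC leaves the PV and EBS volume intact
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Every PV dynamically provisioned through this class inherits `persistentVolumeReclaimPolicy: Retain`, so a deleted claim never destroys data.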
volumeBindingMode
This field controls when the volume is actually provisioned and bound. It has a significant impact on scheduling and is the most commonly misconfigured StorageClass field.
| Mode | Behavior | When to Use |
|---|---|---|
| `Immediate` | The volume is provisioned the moment the PVC is created, before any Pod references it. | Only when your storage is available in all zones (e.g., NFS, network-attached storage). |
| `WaitForFirstConsumer` | Provisioning is delayed until a Pod that uses the PVC is scheduled. The volume is created in the same zone as the node the Pod lands on. | Cloud block storage (EBS, PD, Azure Disk) — virtually always. |
Cloud block volumes like EBS and GCP PD are zonal — an EBS volume in us-east-1a cannot be attached to a node in us-east-1b. With Immediate binding, the volume might get created in zone A while the scheduler places the Pod in zone B, causing a permanent scheduling failure. WaitForFirstConsumer avoids this entirely by letting the scheduler pick the node first and then creating the volume in the correct zone.
allowVolumeExpansion
When set to true, you can increase the size of an existing PVC by editing its spec.resources.requests.storage field. The CSI driver will expand the underlying volume and, if needed, resize the filesystem. Most cloud CSI drivers support this. You cannot shrink a volume — only grow it.
The Default StorageClass
When a PVC does not specify a storageClassName, Kubernetes assigns it the default StorageClass. A StorageClass is marked as default using the storageclass.kubernetes.io/is-default-class annotation. Most managed Kubernetes clusters (EKS, GKE, AKS) come with a default StorageClass pre-configured.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
You can check which StorageClass is the default with kubectl get sc. The default will have (default) next to its name:
$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE AGE
fast-ssd ebs.csi.aws.com Delete WaitForFirstConsumer 5d
standard (default) ebs.csi.aws.com Delete WaitForFirstConsumer 30d
If you have more than one StorageClass annotated as default, the behavior is undefined — some admission controllers will reject PVCs, while others will pick one arbitrarily. Keep exactly one default per cluster.
How Dynamic Provisioning Works End-to-End
Understanding the full lifecycle clarifies what Kubernetes does behind the scenes when you create a PVC. Here is the sequence:
- You create a PVC that references a StorageClass (either explicitly via `storageClassName` or implicitly via the default).
- The PVC enters the `Pending` state. If the binding mode is `WaitForFirstConsumer`, it stays Pending until a Pod claims it.
- A Pod is scheduled that mounts the PVC. The scheduler picks a node, taking storage topology into account.
- The provisioner plugin fires. It calls the cloud provider API (e.g., `ec2:CreateVolume`) to create a volume in the appropriate zone with the specified parameters.
- Kubernetes creates a PV object that represents the newly provisioned volume. The PV's `spec.claimRef` is set to the PVC.
- The PVC is bound to the PV. Both transition to `Bound` status.
- The kubelet mounts the volume into the Pod's container at the specified `mountPath`.
When the PVC is deleted, the process runs in reverse: the volume is unmounted, the PV is removed, and (if reclaimPolicy: Delete) the cloud volume is destroyed.
Example: AWS EBS with the CSI Driver
This is the most common setup on EKS. You define a StorageClass for gp3 SSD volumes and create a PVC that triggers dynamic provisioning.
# storageclass-aws.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
  encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# pvc-aws.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 20Gi
Apply both and watch the PVC wait for a consumer:
$ kubectl apply -f storageclass-aws.yaml -f pvc-aws.yaml
$ kubectl get pvc app-data
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
app-data Pending ebs-gp3 5s
# Pending — waiting for a Pod to be scheduled (WaitForFirstConsumer)
$ kubectl run test --image=nginx --overrides='{
"spec": {"containers": [{"name": "nginx", "image": "nginx",
"volumeMounts": [{"name": "data", "mountPath": "/data"}]}],
"volumes": [{"name": "data",
"persistentVolumeClaim": {"claimName": "app-data"}}]}}'
$ kubectl get pvc app-data
NAME STATUS VOLUME CAPACITY STORAGECLASS AGE
app-data Bound pvc-a1b2c3d4-5678-90ef-ghij-klmnopqrstuv 20Gi ebs-gp3 45s
The PVC transitions from Pending to Bound once the Pod is scheduled. A PV named pvc-a1b2c3d4-... was automatically created by the EBS CSI driver, backed by a real EBS volume in the same availability zone as the node.
Example: GCP Persistent Disk with the CSI Driver
On GKE, the pattern is identical — only the provisioner name and parameters differ.
# storageclass-gcp.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  fsType: ext4
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# pvc-gcp.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-storage
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: pd-ssd
  resources:
    requests:
      storage: 50Gi
For workloads that need replication across zones (e.g., a regional GKE cluster), add the replication-type parameter:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd-regional
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Example: Local Path Provisioner for Development
Cloud provisioners don't work on local clusters like kind, minikube, or k3s. The Rancher Local Path Provisioner fills this gap by creating volumes as directories on the node's filesystem. It ships by default with k3s and can be installed on any cluster.
# This StorageClass comes pre-installed on k3s clusters
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
PVCs referencing local-path work exactly like cloud-backed PVCs. The provisioner creates a directory under /opt/local-path-provisioner on the node, and the PV points to that host path. This gives you a fully functional dynamic provisioning loop for development and CI without any cloud infrastructure.
# Install local-path-provisioner on kind or minikube
$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml
# Verify it's running
$ kubectl -n local-path-storage get pod
NAME READY STATUS AGE
local-path-provisioner-7745554f7f-k8x2q 1/1 Running 30s
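A PVC against this class looks no different from the cloud-backed ones earlier. A sketch (the claim name is illustrative); note that because local-path is backed by a single node's filesystem, only ReadWriteOnce is supported:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dev-scratch        # illustrative name
spec:
  accessModes:
  - ReadWriteOnce          # local-path is node-local, so RWO only
  storageClassName: local-path
  resources:
    requests:
      storage: 2Gi
```

The claim stays Pending (WaitForFirstConsumer) until a Pod mounts it, at which point the provisioner creates a directory on that Pod's node and binds a PV to it.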
Expanding a Volume After Creation
If your StorageClass has allowVolumeExpansion: true, you can grow a PVC in place. Edit the PVC and increase the storage request — the CSI driver handles the rest.
# Expand from 20Gi to 50Gi
$ kubectl patch pvc app-data -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
# Watch the resize progress
$ kubectl get pvc app-data -o jsonpath='{.status.conditions[*].type}'
FileSystemResizePending
# After a moment (once the Pod remounts or the filesystem is resized online):
$ kubectl get pvc app-data
NAME STATUS VOLUME CAPACITY STORAGECLASS AGE
app-data Bound pvc-a1b2c3d4 50Gi ebs-gp3 2h
Some CSI drivers resize the filesystem online (while the Pod is running). Others require the Pod to be restarted for the resize to take effect. Check your driver's documentation. AWS EBS CSI and GCP PD CSI both support online expansion.
A well-organized cluster typically has 2–4 StorageClasses: a general-purpose default (e.g., gp3), a high-performance class for databases (e.g., io2 with provisioned IOPS), and optionally a cheap throughput class for logs (st1). Name them descriptively — fast-ssd, high-iops, bulk-hdd — so developers can pick the right tier without knowing cloud-specific details.
CSI Drivers and Storage Patterns for Production
Before Kubernetes v1.13, every storage backend — AWS EBS, GCE PD, Cinder, Ceph — was compiled directly into the Kubernetes codebase as an "in-tree" volume plugin. This meant that adding a new storage driver or fixing a bug required a full Kubernetes release. CSI (Container Storage Interface) decouples storage from core Kubernetes, letting vendors ship and update their own drivers on their own schedule.
CSI is not Kubernetes-specific. It is a cross-platform specification (also adopted by Mesos and Cloud Foundry) that defines a standard gRPC interface between a container orchestrator and a storage provider. In Kubernetes, CSI drivers run as a set of sidecar containers alongside a driver-specific plugin, all deployed as regular Pods.
Why In-Tree Plugins Had to Go
The in-tree model created three compounding problems. First, release coupling: a storage vendor could not ship a fix without waiting for the next Kubernetes minor release cycle (roughly every four months). Second, binary bloat: every kubelet and controller-manager binary contained code for every supported storage backend, even the ones that cluster never used. Third, privileged access: storage code ran inside core Kubernetes components, which meant a bug in a volume plugin could crash the kubelet or controller-manager.
CSI solves all three. Drivers are independently versioned container images. They run in their own Pods with scoped privileges. And the core Kubernetes codebase no longer needs to know anything about the underlying storage technology — it just speaks the CSI gRPC protocol.
As of Kubernetes 1.26, in-tree volume plugins for AWS EBS, GCE PD, Azure Disk, and others have their CSI migration permanently enabled. The in-tree code delegates all operations to the corresponding CSI driver. You should ensure the CSI driver is installed even if you previously relied on in-tree plugins.
CSI Architecture
A CSI driver deployment consists of two parts: a controller component (typically a Deployment or StatefulSet with one replica) that handles cluster-level operations like provisioning and attaching volumes, and a node component (a DaemonSet) that runs on every node to handle mounting and unmounting. Both parts are composed of Kubernetes-maintained sidecar containers plus the vendor-specific CSI driver container.
flowchart LR
PVC["PVC Created"] --> EP["external-provisioner"]
EP -->|"CreateVolume gRPC"| Driver["CSI Driver\n(Controller)"]
Driver --> Backend["Storage Backend\n(EBS, Ceph, NFS...)"]
Backend -->|"Volume Ready"| EA["external-attacher"]
EA -->|"ControllerPublishVolume\ngRPC"| Driver
Driver --> Attach["Volume Attached\nto Node"]
Attach --> Kubelet["kubelet"]
Kubelet -->|"NodeStageVolume\nNodePublishVolume"| NodeDriver["CSI Driver\n(Node DaemonSet)"]
NodeDriver --> Mount["Volume Mounted\nin Container"]
style PVC fill:#e0e7ff,stroke:#4f46e5,color:#1e1b4b
style Backend fill:#fef3c7,stroke:#d97706,color:#78350f
style Mount fill:#d1fae5,stroke:#059669,color:#064e3b
The diagram above shows the full lifecycle of a dynamically provisioned volume. The external sidecars watch for Kubernetes API events (new PVC, new VolumeAttachment) and translate them into gRPC calls to the CSI driver. The driver then talks to the actual storage backend.
CSI Sidecar Containers
Kubernetes maintains a set of standard sidecar containers that handle the orchestrator-side logic. These run alongside the vendor's CSI driver container within the same Pod. Here is what each one does:
| Sidecar | Runs In | Watches | Calls (gRPC) |
|---|---|---|---|
| external-provisioner | Controller | PersistentVolumeClaim objects | CreateVolume / DeleteVolume |
| external-attacher | Controller | VolumeAttachment objects | ControllerPublishVolume / ControllerUnpublishVolume |
| external-snapshotter | Controller | VolumeSnapshot objects | CreateSnapshot / DeleteSnapshot |
| external-resizer | Controller | PVC size changes | ControllerExpandVolume |
| node-driver-registrar | Node (DaemonSet) | N/A (registers driver with kubelet) | Registers CSI driver socket path |
| livenessprobe | Both | N/A | Probe (health check endpoint) |
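To make the controller half of this pattern concrete, here is a hedged sketch of a controller Deployment (the driver name, images, and versions are illustrative, not taken from any specific vendor). The key idea is that the sidecars and the vendor plugin share a Pod and communicate over a Unix socket on an emptyDir volume:

```yaml
# Illustrative CSI controller Deployment: Kubernetes-maintained sidecars
# share a Pod (and a Unix socket) with the vendor's driver container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-csi-controller   # illustrative name
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-csi-controller
  template:
    metadata:
      labels:
        app: example-csi-controller
    spec:
      containers:
      - name: csi-driver                 # vendor-specific plugin
        image: example.com/csi-driver:v1.0.0
        args: ["--endpoint=unix:///csi/csi.sock"]
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      - name: external-provisioner       # watches PVCs, calls CreateVolume
        image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.0
        args: ["--csi-address=/csi/csi.sock"]
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      - name: external-attacher          # watches VolumeAttachments
        image: registry.k8s.io/sig-storage/csi-attacher:v4.5.0
        args: ["--csi-address=/csi/csi.sock"]
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      volumes:
      - name: socket-dir
        emptyDir: {}                     # holds the shared csi.sock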
How CSI Drivers Are Deployed
Most CSI drivers ship as Helm charts or kustomize manifests. Under the hood, the deployment always follows the same two-part pattern: a controller Deployment and a node DaemonSet. Here is a simplified view of the node DaemonSet for the AWS EBS CSI driver:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: ebs-csi-node
namespace: kube-system
spec:
selector:
matchLabels:
app: ebs-csi-node
template:
metadata:
labels:
app: ebs-csi-node
spec:
containers:
- name: ebs-plugin
image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.28.0
volumeMounts:
- name: kubelet-dir
mountPath: /var/lib/kubelet
mountPropagation: Bidirectional
- name: plugin-dir
mountPath: /csi
- name: node-driver-registrar
image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
args:
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock"
volumeMounts:
- name: plugin-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
volumes:
- name: kubelet-dir
hostPath:
path: /var/lib/kubelet
type: Directory
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins/ebs.csi.aws.com/
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
Notice the mountPropagation: Bidirectional on the kubelet directory — this is critical. It allows the CSI driver to mount volumes inside its own mount namespace and have those mounts propagate to the kubelet, which then bind-mounts them into application containers. The node-driver-registrar sidecar registers the driver's Unix socket with the kubelet so the kubelet knows how to reach it.
Volume Snapshots
Volume snapshots bring point-in-time copy capabilities to Kubernetes. They are modeled as three API resources that mirror the PV/PVC pattern: VolumeSnapshotClass defines how to snapshot (which CSI driver, what parameters), VolumeSnapshot is the user-facing request (like a PVC), and VolumeSnapshotContent is the actual snapshot object bound to the backend (like a PV).
Volume snapshots require the CSI driver to implement the snapshot capability, and you must install the snapshot CRDs and the snapshot-controller separately — they are not part of core Kubernetes. Some managed Kubernetes offerings pre-install these (GKE and AKS do); on others, such as EKS, you install them yourself. Verify with kubectl get crd volumesnapshots.snapshot.storage.k8s.io before relying on them.
# 1. Define a VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Retain
---
# 2. Take a snapshot of an existing PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: db-snapshot-2024-01-15
spec:
volumeSnapshotClassName: ebs-snapshot-class
source:
persistentVolumeClaimName: postgres-data
The deletionPolicy controls what happens to the underlying snapshot in the storage backend when you delete the VolumeSnapshot object. Retain keeps the backend snapshot (safer for backups), while Delete removes it. Always use Retain for any snapshot that serves as a backup.
Restoring from a Snapshot
To restore, you create a new PVC with a dataSource pointing to the snapshot. Kubernetes provisions a new volume pre-populated with the snapshot data:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-restored
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3-encrypted
resources:
requests:
storage: 100Gi
dataSource:
name: db-snapshot-2024-01-15
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
Volume Cloning
Volume cloning creates a duplicate of an existing PVC without going through a snapshot intermediate. The clone is a new, independent volume with the same data as the source at the time of the clone request. It is useful for spinning up test environments from production data or creating read replicas.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data-clone
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3-encrypted
resources:
requests:
storage: 100Gi
dataSource:
name: postgres-data # source PVC
kind: PersistentVolumeClaim
The source and clone must be in the same namespace and use the same StorageClass. The clone size must be equal to or larger than the source. Some CSI drivers (like AWS EBS CSI) implement cloning via an internal snapshot-and-restore, so it can take time proportional to volume size.
Volume Expansion
CSI drivers that support the ControllerExpandVolume and NodeExpandVolume RPCs allow you to grow PVCs without downtime. The StorageClass must set allowVolumeExpansion: true, and then you simply patch the PVC's spec.resources.requests.storage to a larger value:
# Expand a PVC from 100Gi to 200Gi
kubectl patch pvc postgres-data -p \
'{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
# Monitor the resize progress
kubectl get pvc postgres-data -o jsonpath='{.status.conditions[*]}'
Expansion happens in two phases: the controller expands the underlying volume in the storage backend, then the node resizes the filesystem when the volume is next mounted (or immediately, for online expansion). You can track progress via the FileSystemResizePending condition on the PVC.
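For the expansion flow above to work, the StorageClass must opt in. Here is a minimal sketch of an expandable StorageClass matching the gp3-encrypted class used in the earlier PVC examples (the parameters assume the AWS EBS CSI driver; adjust for your backend):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted          # name referenced by the PVCs in this chapter
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true     # without this, the PVC patch is rejected
parameters:
  type: gp3
  encrypted: "true"
```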
Production Storage Patterns
ReadWriteMany with NFS, CephFS, or EFS
Most block storage (EBS, Azure Disk, GCE PD) only supports ReadWriteOnce — a single node can mount the volume for read-write at a time. When multiple Pods across different nodes need to read and write the same data (shared uploads, CMS content, ML training datasets), you need a storage backend that supports ReadWriteMany (RWX).
| Option | Protocol | Performance | Best For |
|---|---|---|---|
| AWS EFS (efs.csi.aws.com) | NFSv4.1 | Moderate latency, elastic throughput | Shared config, media uploads |
| CephFS (Rook) | Ceph native / FUSE | High throughput, tunable | Data-intensive shared workloads |
| NFS Server + nfs-subdir-external-provisioner | NFSv3/v4 | Depends on server hardware | Dev/test, legacy integration |
| Azure Files | SMB / NFS | Moderate | Cross-node file sharing on AKS |
# StorageClass for AWS EFS (ReadWriteMany)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: efs-sc
provisioner: efs.csi.aws.com
parameters:
provisioningMode: efs-ap # uses EFS Access Points
fileSystemId: fs-0a1b2c3d4e5f
directoryPerms: "700"
basePath: "/dynamic_provisioning"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: shared-uploads
spec:
accessModes:
- ReadWriteMany
storageClassName: efs-sc
resources:
requests:
storage: 50Gi
Block Storage vs. Filesystem Storage
CSI supports two volume modes: Filesystem (the default) and Block. With Filesystem mode, the CSI driver creates a filesystem (ext4, xfs) on the volume and mounts it as a directory. With Block mode, the raw block device is exposed directly to the container as a device file at the specified devicePath.
Block mode is used by databases that manage their own storage layout (like certain configurations of Oracle, or high-performance key-value stores), and by applications that need direct I/O without filesystem overhead. Most workloads should use Filesystem mode.
# Raw block volume PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: raw-block-pvc
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block # not Filesystem
storageClassName: gp3-encrypted
resources:
requests:
storage: 50Gi
---
# Pod consuming a raw block device
apiVersion: v1
kind: Pod
metadata:
name: block-consumer
spec:
containers:
- name: app
image: myapp:latest
volumeDevices: # not volumeMounts
- name: data
devicePath: /dev/xvda
volumes:
- name: data
persistentVolumeClaim:
claimName: raw-block-pvc
Ephemeral CSI Volumes
Some CSI drivers support ephemeral inline volumes — volumes defined directly in the Pod spec rather than through a PVC. These volumes have the same lifecycle as the Pod: they are created when the Pod starts and deleted when the Pod is removed. This pattern is ideal for injecting short-lived secrets, certificates, or identity tokens from external systems (e.g., HashiCorp Vault via the Secrets Store CSI Driver).
apiVersion: v1
kind: Pod
metadata:
name: app-with-secrets
spec:
containers:
- name: app
image: myapp:latest
volumeMounts:
- name: vault-secrets
mountPath: /mnt/secrets
readOnly: true
volumes:
- name: vault-secrets
csi:
driver: secrets-store.csi.k8s.io
readOnly: true
volumeAttributes:
secretProviderClass: vault-db-creds
Generic Ephemeral Volumes
Generic ephemeral volumes (GA since Kubernetes 1.23) combine the lifecycle of ephemeral volumes with the full power of PVCs. You embed a PVC template directly inside the Pod spec. Kubernetes creates a real PVC for each Pod, provisions a volume through the normal StorageClass flow, and automatically deletes the PVC when the Pod terminates.
This is powerful for workloads that need fast scratch space backed by real block storage (not just emptyDir on the node's disk). Think ML training jobs that need 500 GiB of NVMe-backed scratch, or CI runners that need isolated high-IOPS build volumes.
apiVersion: v1
kind: Pod
metadata:
name: ml-training-job
spec:
containers:
- name: trainer
image: ml-framework:latest
volumeMounts:
- name: scratch
mountPath: /scratch
volumes:
- name: scratch
ephemeral:
volumeClaimTemplate:
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: io2-high-iops
resources:
requests:
storage: 500Gi
Storage Best Practices for Production
Capacity Planning
Do not wait for PersistentVolume fullness alerts to plan capacity. Set up monitoring that tracks volume utilization percentage (available via the kubelet's /metrics endpoint as kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes). Alert at 70% utilization to give yourself time to expand. Combine this with volume expansion so that responding to an alert is a single kubectl patch rather than a data migration.
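These kubelet metrics plug directly into a Prometheus alert. A hedged sketch, assuming you run the Prometheus Operator and already scrape kubelet metrics (the rule and alert names are illustrative):

```yaml
# Illustrative PrometheusRule: fire when any PVC crosses 70% utilization.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-capacity-alerts
spec:
  groups:
  - name: storage.capacity
    rules:
    - alert: PersistentVolumeFillingUp
      expr: |
        kubelet_volume_stats_used_bytes
          / kubelet_volume_stats_capacity_bytes > 0.70
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is over 70% full"
```

Responding to this alert is then a single kubectl patch to grow the PVC, provided your StorageClass allows expansion.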
IOPS and Throughput Requirements
Storage performance problems are among the most common causes of unexplained application slowness in Kubernetes. Cloud block storage IOPS often scales with volume size (e.g., AWS gp3 provides a baseline of 3,000 IOPS at any size, while io2 scales up to 64,000). Match your StorageClass to your workload profile:
| Workload | IOPS Profile | Recommended Storage |
|---|---|---|
| PostgreSQL / MySQL (OLTP) | High random read/write | io2 Block Express, local NVMe |
| Elasticsearch / Kafka | High sequential throughput | gp3 (tuned throughput), st1 |
| WordPress / CMS | Low, bursty | gp3 default |
| ML training scratch | Very high sequential write | Local NVMe, instance store |
Backup Strategies with Snapshots
Snapshots are not backups by themselves — they often live in the same storage system as the original volume (e.g., EBS snapshots are in the same region). A production backup strategy should include: (1) scheduled VolumeSnapshots via a CronJob or a tool like Velero, (2) cross-region copy of snapshots for disaster recovery, and (3) periodic restore tests to verify snapshot integrity. Here is a minimal CronJob-based snapshot approach:
apiVersion: batch/v1
kind: CronJob
metadata:
name: db-snapshot-daily
spec:
schedule: "0 2 * * *" # 2 AM daily
jobTemplate:
spec:
template:
spec:
serviceAccountName: snapshot-creator
containers:
- name: snapshot
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
SNAP_NAME="db-snap-$(date +%Y%m%d-%H%M%S)"
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: ${SNAP_NAME}
spec:
volumeSnapshotClassName: ebs-snapshot-class
source:
persistentVolumeClaimName: postgres-data
EOF
echo "Created snapshot: ${SNAP_NAME}"
restartPolicy: OnFailure
For production databases, use Velero instead of hand-rolled CronJobs. Velero coordinates application-consistent snapshots (with pre/post hooks to flush writes), manages snapshot retention policies, and handles cross-region copies — all features you would otherwise need to build yourself.
Cross-AZ Considerations
Block storage volumes (EBS, Azure Disk, GCE PD) are tied to a single availability zone. If a Pod is scheduled on a node in us-east-1a but its PV lives in us-east-1b, the volume attach will fail. Kubernetes handles this through topology-aware scheduling: when a StorageClass sets volumeBindingMode: WaitForFirstConsumer, the PV is not provisioned until the Pod is scheduled, ensuring the volume is created in the same AZ as the node.
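Enabling topology-aware scheduling is a one-line change on the StorageClass. A sketch, assuming the AWS EBS CSI driver (the class name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology-aware     # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # delay provisioning until a Pod is scheduled
parameters:
  type: gp3
```

With the default mode (Immediate), the volume is provisioned as soon as the PVC is created, in an AZ chosen without knowledge of where the Pod will land.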
For StatefulSets, this means each replica can land in a different AZ with its volume in the matching zone — exactly what you want for high availability. However, if a node in one AZ goes down, the Pod cannot be rescheduled to another AZ because its volume is stuck in the original zone. Plan for this by either using replicated storage (Ceph, Longhorn) that spans AZs, or by accepting that recovery requires restoring from a snapshot in the new AZ.
Testing Storage Failover
Do not wait for a real outage to find out your storage layer fails ungracefully. Test these scenarios regularly:
- Node drain: Run `kubectl drain` on a node with active PVs. Verify the volume detaches cleanly and the replacement Pod mounts it on the new node within your SLA (typical: 1–3 minutes for EBS).
- Force-delete a Pod: Use `kubectl delete pod --force --grace-period=0`. Confirm the volume attachment is released and the new Pod is not stuck in `ContainerCreating` waiting for the stale VolumeAttachment to expire (the default force-detach timeout is 6 minutes).
- AZ failure simulation: Cordon all nodes in one AZ. Verify that StatefulSet Pods with `WaitForFirstConsumer` volumes do not automatically recover (expected behavior), and that you can restore from a snapshot in another AZ.
- Snapshot restore: Restore a VolumeSnapshot into a new PVC and verify data integrity. If you cannot read back what you wrote, your backup strategy is broken.
When a node fails abruptly (power loss, kernel panic), Kubernetes waits for the node.kubernetes.io/unreachable taint timeout (default 5 minutes) before marking Pods for eviction. Combined with the volume force-detach timeout, a StatefulSet Pod with a block volume can take 6–11 minutes to recover on a new node. Factor this into your availability SLAs.
Popular CSI Drivers
The CSI ecosystem is mature, with production-grade drivers available for all major cloud providers and several open-source storage systems. Here is a quick reference for the most widely deployed drivers:
| Driver | Provider | Access Modes | Snapshots | Expansion | Use Case |
|---|---|---|---|---|---|
| ebs.csi.aws.com | AWS | RWO | Yes | Yes | General-purpose block storage on EKS |
| efs.csi.aws.com | AWS | RWX | No | N/A | Shared file storage (NFS-backed) on EKS |
| pd.csi.storage.gke.io | GCP | RWO / ROX | Yes | Yes | Block storage on GKE |
| disk.csi.azure.com | Azure | RWO | Yes | Yes | Managed disks on AKS |
| file.csi.azure.com | Azure | RWX | No | Yes | Azure Files (SMB/NFS) on AKS |
| Rook-Ceph (rbd / cephfs) | Open Source | RWO / RWX | Yes | Yes | Software-defined storage, multi-AZ replication |
| Longhorn | SUSE/Rancher | RWO / RWX | Yes | Yes | Lightweight replicated storage for edge/bare-metal |
| secrets-store.csi.k8s.io | SIG Auth | Ephemeral | N/A | N/A | Mount secrets from Vault, AWS SM, Azure KV |
When choosing a CSI driver, verify that it supports the specific features you need (snapshots, expansion, cloning, RWX) and check its Kubernetes version compatibility matrix. Run the driver's conformance tests in your CI pipeline before upgrading versions in production.
ConfigMaps and Secrets — Externalizing Configuration
The twelve-factor app methodology defines a strict boundary: configuration that varies between deployments must live outside your code. Database URLs, feature flags, API keys, TLS certificates — none of these belong in a container image. An image built once should run identically in dev, staging, and production. The only thing that changes is the configuration injected at runtime.
Kubernetes operationalizes this principle through two first-class objects: ConfigMaps for non-sensitive data and Secrets for sensitive data. Both decouple configuration from your container image, but they differ in how the cluster handles, stores, and exposes the data they carry.
ConfigMaps: Non-Sensitive Configuration
A ConfigMap is a namespaced Kubernetes object that stores key-value pairs. The values can be short strings (a feature flag, a log level) or entire files (an nginx.conf, a .properties file). The maximum size of a ConfigMap is 1 MiB.
Creating ConfigMaps
You can create ConfigMaps from multiple sources. Understanding the distinction matters because it determines how keys are named inside the resulting object.
From literal key-value pairs — useful for simple settings:
kubectl create configmap app-settings \
--from-literal=LOG_LEVEL=info \
--from-literal=MAX_RETRIES=3 \
--from-literal=FEATURE_DARK_MODE=true
From a file — the filename becomes the key, the file content becomes the value:
# Key will be "nginx.conf", value will be the file contents
kubectl create configmap nginx-config --from-file=nginx.conf
# Override the key name explicitly
kubectl create configmap nginx-config --from-file=my-custom-key=nginx.conf
From a directory — each file in the directory becomes a key-value pair:
# Every file in ./config/ becomes a key
kubectl create configmap app-config --from-file=./config/
From an env file — parses KEY=VALUE lines (ignores comments and blank lines):
kubectl create configmap app-env --from-env-file=app.env
The declarative equivalent in YAML gives you version-controlled, reproducible configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-settings
namespace: production
data:
LOG_LEVEL: "info"
MAX_RETRIES: "3"
nginx.conf: |
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://backend:8080;
}
}
Even though MAX_RETRIES looks like an integer and FEATURE_DARK_MODE looks like a boolean, ConfigMap values are always stored as strings. Your application code is responsible for parsing them into the correct type. For binary data, use the binaryData field with base64-encoded content.
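For completeness, a hedged sketch of the binaryData field (the key name and payload are illustrative; the payload here is just the two bytes 0x00 0x01, base64-encoded):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-binary-assets      # illustrative name
binaryData:
  # raw bytes, base64-encoded, e.g. produced with: base64 -w0 magic.bin
  magic.bin: AAE=
```

Keys in data and binaryData share one namespace: the same key cannot appear in both.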
Consuming ConfigMaps as Environment Variables
There are two approaches to injecting ConfigMap data as environment variables, and choosing between them affects maintainability and naming control.
envFrom injects every key in the ConfigMap as an environment variable. It is convenient but gives you less control — if someone adds a key to the ConfigMap, it automatically appears in your container's environment.
apiVersion: v1
kind: Pod
metadata:
name: web-app
spec:
containers:
- name: app
image: myapp:1.4.0
envFrom:
- configMapRef:
name: app-settings
prefix: CFG_ # optional: prepends CFG_ to every key
Individual env entries with valueFrom give you explicit control. You pick exactly which keys to expose and can rename them in the process.
spec:
containers:
- name: app
image: myapp:1.4.0
env:
- name: APP_LOG_LEVEL
valueFrom:
configMapKeyRef:
name: app-settings
key: LOG_LEVEL
optional: true # pod starts even if key is missing
| Aspect | envFrom | Individual env |
|---|---|---|
| Ease of use | One line imports all keys | Each key must be listed explicitly |
| Naming control | Only global prefix | Full control over env var names |
| Invalid keys | Silently skipped (keys not valid as env vars) | You choose only valid keys |
| Auditability | Harder to trace where a variable comes from | Clear mapping in pod spec |
| Best for | Apps expecting many config values | Precise injection of a few values |
Consuming ConfigMaps as Volume Mounts
When your application reads configuration from files (not environment variables), mount the ConfigMap as a volume. Each key in the ConfigMap becomes a file in the mount directory, and the value becomes the file's content.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx:1.25
volumeMounts:
- name: config-volume
mountPath: /etc/nginx/conf.d
readOnly: true
volumes:
- name: config-volume
configMap:
name: nginx-config
items: # optional: select specific keys
- key: nginx.conf
path: default.conf # rename inside the mount
The items field lets you project specific keys and rename them in the mount. Without items, every key in the ConfigMap appears as a file. You can also use subPath to mount a single file without overwriting the entire target directory — but be aware that subPath mounts do not receive automatic updates.
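A sketch of the subPath variant, which overlays a single file onto an existing directory instead of replacing the whole directory (and, per the caveat above, forgoes live updates):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-subpath
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    volumeMounts:
    - name: config-volume
      mountPath: /etc/nginx/nginx.conf   # target is a single file
      subPath: nginx.conf                # key inside the ConfigMap volume
      readOnly: true
  volumes:
  - name: config-volume
    configMap:
      name: nginx-config
```

The rest of /etc/nginx (mime.types, conf.d/, and so on) remains untouched from the image.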
When you update a ConfigMap, volume-mounted files are eventually updated by the kubelet — typically within 30–60 seconds (controlled by --sync-frequency and the kubelet's ConfigMap cache TTL). However, environment variables are never updated after pod startup. You must restart the pod (or use a rolling restart) to pick up env var changes. This makes volume mounts the better choice for applications that watch config files for changes.
ConfigMaps as Command-Line Arguments
You can reference ConfigMap values in a container's command or args by first mapping them to environment variables and then using the $(VAR_NAME) substitution syntax.
spec:
containers:
- name: worker
image: myworker:2.1.0
env:
- name: LOG_LEVEL
valueFrom:
configMapKeyRef:
name: app-settings
key: LOG_LEVEL
command: ["./worker"]
args: ["--log-level", "$(LOG_LEVEL)", "--max-retries", "5"]
Secrets: Sensitive Configuration
Secrets look almost identical to ConfigMaps in terms of consumption (env vars, volumes, command args). The critical difference is intent: Secrets signal to the cluster that the data is sensitive. Kubernetes responds by storing them in tmpfs on nodes (never written to disk on the node), restricting access through RBAC, and — if configured — encrypting them at rest in etcd.
Secret Types
Kubernetes defines several built-in Secret types. The type constrains what keys the Secret must contain and enables specialized behavior in controllers and the kubelet.
| Type | Purpose | Required Keys |
|---|---|---|
| Opaque | Arbitrary user-defined data (default type) | None — any keys allowed |
| kubernetes.io/dockerconfigjson | Private registry credentials for image pulls | .dockerconfigjson |
| kubernetes.io/tls | TLS certificate and private key pairs | tls.crt, tls.key |
| kubernetes.io/basic-auth | Basic authentication credentials | username, password |
| kubernetes.io/ssh-auth | SSH private key credentials | ssh-privatekey |
| kubernetes.io/service-account-token | Service account tokens (legacy, auto-created) | token, ca.crt, namespace |
Creating Secrets
The imperative approach mirrors ConfigMap creation. Kubernetes automatically base64-encodes the values when you use kubectl create secret.
# Generic (Opaque) secret
kubectl create secret generic db-credentials \
--from-literal=username=admin \
--from-literal=password='S3cur3P@ss!'
# TLS secret from certificate files
kubectl create secret tls app-tls \
--cert=tls.crt \
--key=tls.key
# Docker registry secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=deploy \
--docker-password=token123
In declarative YAML, you must base64-encode values yourself in the data field — or use stringData for plain text that Kubernetes encodes on submission:
apiVersion: v1
kind: Secret
metadata:
name: db-credentials
type: Opaque
# data: values must be base64-encoded
data:
username: YWRtaW4= # echo -n 'admin' | base64
password: UzNjdXIzUEBzcyE= # echo -n 'S3cur3P@ss!' | base64
---
apiVersion: v1
kind: Secret
metadata:
name: db-credentials-easy
type: Opaque
# stringData: plain text — Kubernetes encodes it for you
stringData:
username: admin
password: "S3cur3P@ss!"
A common misconception: base64-encoding a Secret does not protect it. Anyone with kubectl get secret db-credentials -o yaml access can decode the values instantly with echo 'YWRtaW4=' | base64 -d. Base64 exists purely for safe transport of binary data — it provides zero confidentiality. Real protection comes from RBAC (restricting who can read Secrets), encryption at rest, and external secret management.
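To see how thin this layer really is, decode the password from the manifest above in plain shell, with no cluster and no credentials required:

```shell
# The encoded value copied from the Secret manifest above.
encoded="UzNjdXIzUEBzcyE="

# base64 -d reverses the encoding instantly: no key, no secret material needed.
decoded=$(printf '%s' "$encoded" | base64 -d)

printf '%s\n' "$decoded"   # prints: S3cur3P@ss!
```

Anyone with read access to the Secret object can do exactly this, which is why RBAC on Secrets matters far more than the encoding.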
Consuming Secrets
Secrets are consumed identically to ConfigMaps — through envFrom, individual env entries, and volume mounts. The only difference in the YAML is that you reference secretKeyRef instead of configMapKeyRef, or use a secret volume source.
apiVersion: v1
kind: Pod
metadata:
name: api-server
spec:
containers:
- name: api
image: myapi:3.0.0
env:
- name: DB_USERNAME
valueFrom:
secretKeyRef:
name: db-credentials
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-credentials
key: password
volumeMounts:
- name: tls-certs
mountPath: /etc/tls
readOnly: true
volumes:
- name: tls-certs
secret:
secretName: app-tls
defaultMode: 0400 # restrict file permissions
Setting defaultMode: 0400 ensures that only the container's user can read the mounted Secret files. This is a simple but effective defense-in-depth measure. For registry credentials, reference the Secret in the pod's imagePullSecrets field rather than mounting it.
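The imagePullSecrets pattern looks like this — a sketch reusing the regcred Secret created in the earlier kubectl example (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-app      # illustrative name
spec:
  imagePullSecrets:
  - name: regcred              # docker-registry Secret from the example above
  containers:
  - name: app
    image: registry.example.com/myapp:1.0.0
```

The kubelet uses the referenced credentials when pulling the image; the Secret is never exposed inside the container.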
Encryption at Rest
By default, Secrets are stored unencrypted in etcd. Anyone with direct etcd access can read every Secret in the cluster. To fix this, you configure an EncryptionConfiguration on the API server that encrypts Secret data before it is written to etcd.
# /etc/kubernetes/enc/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- aescbc:
keys:
- name: key1
secret: <base64-encoded-32-byte-key>
- identity: {} # fallback: read unencrypted data
The API server loads this file via the --encryption-provider-config flag. The providers list is ordered: the first provider is used for writing, and all providers are tried for reading (which is how you rotate keys gracefully). Common providers include aescbc, aesgcm, secretbox, and kms (for integrating with cloud KMS services like AWS KMS, GCP Cloud KMS, or Azure Key Vault).
After enabling encryption, existing Secrets remain unencrypted until you re-write them. Force re-encryption of all Secrets with:
kubectl get secrets --all-namespaces -o json | \
kubectl replace -f -
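Key rotation follows directly from the provider ordering: add the new key in the first position (so it is used for writes), keep the old key for reads, re-encrypt all Secrets with the command above, then remove the old key. A sketch of the intermediate configuration during rotation:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key2                           # new key: used for all writes
        secret: <base64-encoded-32-byte-key>
      - name: key1                           # old key: still accepted for reads
        secret: <base64-encoded-32-byte-key>
  - identity: {}
```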
External Secret Management
For production clusters, encryption at rest is the minimum. Most teams go further by keeping secret values entirely outside of Kubernetes and syncing them in at runtime. This avoids having sensitive values in etcd at all, and centralizes secret lifecycle management (rotation, auditing, access control) in a dedicated vault system.
| Solution | How It Works | Best For |
|---|---|---|
| External Secrets Operator | CRD-based operator that syncs secrets from external stores (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault) into native Kubernetes Secrets | Multi-cloud teams; standardized CRD interface across providers |
| Sealed Secrets | Encrypts Secrets client-side with a cluster-specific public key; only the in-cluster controller can decrypt. Safe to commit to Git. | GitOps workflows where you want secrets in version control |
| HashiCorp Vault + Agent Injector | Vault sidecar agent fetches secrets and writes them to a shared in-memory volume. Secrets never become Kubernetes Secret objects. | Strict compliance; dynamic short-lived credentials; database credential rotation |
Here is a minimal External Secrets Operator example syncing a database password from AWS Secrets Manager:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: db-credentials # resulting K8s Secret name
creationPolicy: Owner
data:
- secretKey: username # key in the K8s Secret
remoteRef:
key: prod/db-credentials # path in AWS Secrets Manager
property: username
- secretKey: password
remoteRef:
key: prod/db-credentials
property: password
Immutable ConfigMaps and Secrets
Kubernetes 1.21 graduated the immutable field to stable. Setting immutable: true on a ConfigMap or Secret provides two concrete benefits:
- Performance — The kubelet stops polling the API server for updates to that object. In clusters with thousands of ConfigMaps, this significantly reduces API server load and watch traffic.
- Safety — Accidental or malicious edits are blocked. The only way to change the configuration is to create a new object with a new name and update the pods referencing it.
apiVersion: v1
kind: ConfigMap
metadata:
name: app-settings-v3
immutable: true
data:
LOG_LEVEL: "warn"
MAX_RETRIES: "5"
---
apiVersion: v1
kind: Secret
metadata:
name: db-credentials-v2
immutable: true
type: Opaque
stringData:
username: admin
password: "NewS3cur3P@ss!"
The common pattern is to append a version suffix or content hash to the name (app-settings-v3, app-settings-a8f3d). Tools like Kustomize automate this with configMapGenerator and the --append-hash behavior, producing names like app-settings-k5m8h and updating all Deployment references automatically. This triggers a rolling update whenever config changes — giving you the equivalent of a config-driven redeployment.
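A sketch of that Kustomize pattern (file and resource names are illustrative; the hash suffix is Kustomize's default behavior):

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
configMapGenerator:
- name: app-settings
  literals:
  - LOG_LEVEL=warn
  - MAX_RETRIES=5
# The generated ConfigMap is named app-settings-<content-hash>, and every
# reference to app-settings in deployment.yaml is rewritten to match,
# so changing a literal triggers a rolling update on the next apply.
```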
Putting It All Together
Here is a realistic Deployment that uses a ConfigMap for application settings, a Secret for database credentials, and a volume-mounted ConfigMap for a custom configuration file — all in one spec.
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 3
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: app
image: order-service:4.2.1
ports:
- containerPort: 8080
envFrom:
- configMapRef:
name: order-settings # LOG_LEVEL, MAX_RETRIES, etc.
env:
- name: DB_HOST
valueFrom:
configMapKeyRef:
name: order-settings
key: DB_HOST
- name: DB_USERNAME
valueFrom:
secretKeyRef:
name: order-db-creds
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: order-db-creds
key: password
volumeMounts:
- name: app-config
mountPath: /etc/order-service
readOnly: true
- name: tls-certs
mountPath: /etc/tls
readOnly: true
volumes:
- name: app-config
configMap:
name: order-service-config
- name: tls-certs
secret:
secretName: order-tls
defaultMode: 0400
If your application only reads config at startup (environment variables or non-watched files), use kubectl rollout restart deployment/order-service after updating a ConfigMap or Secret to trigger a zero-downtime rolling restart. For fully automated config-driven rollouts, use a tool like Reloader that watches ConfigMaps and Secrets and automatically triggers rolling updates on the associated Deployments.
Resource Requests, Limits, and Quality of Service
Every container in a Kubernetes cluster competes for finite CPU and memory on the nodes it runs on. Without explicit resource declarations, the scheduler is flying blind — it cannot make intelligent placement decisions, and the kubelet cannot protect one workload from another's greed. Resource requests and limits are how you communicate your workload's needs to Kubernetes, and they directly determine scheduling, runtime enforcement, and eviction priority.
Requests vs. Limits — Two Different Guarantees
A request is the amount of CPU or memory that Kubernetes guarantees to a container. The scheduler uses requests to decide which node has enough room to place a pod. If a node has 4 CPU cores and existing pods have requested 3.5 cores total, the scheduler will only place a new pod there if its CPU request is 0.5 cores or less.
A limit is the maximum amount a container is allowed to consume at runtime. The kubelet enforces limits through kernel mechanisms — Linux cgroups. A container can use resources up to its limit, but never beyond. The request is a floor for scheduling; the limit is a ceiling for enforcement.
| Aspect | Request | Limit |
|---|---|---|
| Purpose | Scheduling guarantee — "I need at least this much" | Runtime ceiling — "I must never exceed this" |
| When it matters | Pod scheduling (kube-scheduler) | Runtime enforcement (kubelet / kernel) |
| Enforcement mechanism | Node allocatable capacity accounting | Linux cgroups (CFS quota for CPU, OOM killer for memory) |
| Can exceed? | Yes — containers can use more than requested if available | No — hard enforcement at runtime |
| Default if omitted | 0 (no guarantee), unless LimitRange sets a default | Unbounded (no cap), unless LimitRange sets a default |
Here is a pod spec that sets both requests and limits for CPU and memory:
apiVersion: v1
kind: Pod
metadata:
name: web-server
spec:
containers:
- name: nginx
image: nginx:1.27
resources:
requests:
cpu: "250m" # 0.25 CPU cores guaranteed
memory: "128Mi" # 128 MiB guaranteed
limits:
cpu: "500m" # Can burst up to 0.5 cores
memory: "256Mi" # Hard cap — OOMKilled if exceeded
CPU Resources: Millicores and CFS Throttling
CPU is measured in millicores (or millicpu). One core equals 1000m. You can express CPU as a decimal (0.5) or in millicores (500m) — they are equivalent. Unlike memory, CPU is a compressible resource: when a container hits its CPU limit, it is not killed — it is throttled.
Throttling is enforced through the Linux kernel's Completely Fair Scheduler (CFS) bandwidth control. The CFS works in periods (typically 100ms). If a container has a CPU limit of 500m, it receives a quota of 50ms per 100ms period. Once the container exhausts its quota within a period, the kernel suspends its threads until the next period begins. The container stays alive but gets slower — its processes stall mid-execution waiting for their next CPU slice.
A container with a 100m CPU limit gets only 10ms of CPU per 100ms period. If a request handler needs 15ms of CPU time, it will be paused partway through and forced to wait for the next period — turning a 15ms operation into one that spans two periods (~105ms wall-clock time: 10ms of work, a 90ms stall, then the final 5ms). This is why latency-sensitive services often experience tail-latency spikes with tight CPU limits. You can monitor throttling via container_cpu_cfs_throttled_periods_total in Prometheus.
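The arithmetic above can be sketched directly — a minimal shell calculation using the default 100ms CFS period and the 100m-limit example from the text:

```shell
#!/bin/sh
# CFS bandwidth arithmetic for a tight CPU limit (illustrative numbers).
PERIOD_MS=100                  # default CFS enforcement period
LIMIT_MILLICORES=100           # limits.cpu: "100m"
QUOTA_MS=$((LIMIT_MILLICORES * PERIOD_MS / 1000))
echo "quota per period: ${QUOTA_MS}ms"          # 10ms of CPU per 100ms

WORK_MS=15                     # handler needs 15ms of CPU time
# Periods fully exhausted before the work can finish:
FULL_PERIODS=$(( (WORK_MS - 1) / QUOTA_MS ))
REMAINDER_MS=$(( WORK_MS - FULL_PERIODS * QUOTA_MS ))
WALL_MS=$(( FULL_PERIODS * PERIOD_MS + REMAINDER_MS ))
echo "wall-clock latency: ${WALL_MS}ms"         # ~105ms instead of 15ms
```

The same formula shows why raising the limit (bigger per-period quota) shrinks the stall: at 500m the handler's 15ms fits inside a single 50ms quota and is never throttled.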
Memory Resources: Bytes and OOMKill
Memory is measured in bytes, using standard suffixes: Ki, Mi, Gi (power-of-two) or K, M, G (power-of-ten). Always use the binary suffixes (Mi, Gi) to avoid confusion — 128Mi is 134,217,728 bytes, while 128M is 128,000,000 bytes. Memory is incompressible: unlike CPU, you cannot simply slow a process down when it uses too much memory. The kernel must reclaim the memory, and it does so violently.
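The gap between the binary and decimal suffixes is easy to quantify — a quick sketch of the 128Mi vs. 128M comparison from the text:

```shell
#!/bin/sh
# Binary (Mi) vs decimal (M) memory suffixes. The ~4.8% difference
# matters when limits are sized close to an application's real footprint.
MI_BYTES=$((128 * 1024 * 1024))     # 128Mi
M_BYTES=$((128 * 1000 * 1000))      # 128M
echo "128Mi = ${MI_BYTES} bytes"    # 134217728
echo "128M  = ${M_BYTES} bytes"     # 128000000
echo "shortfall if you meant Mi but wrote M: $((MI_BYTES - M_BYTES)) bytes"
```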
When a container's memory usage exceeds its limit, the Linux kernel's OOM (Out of Memory) killer terminates the container's process. Kubernetes reports this as a container failure with exit code 137 (128 + 9, the SIGKILL signal number) and reason OOMKilled. If the pod's restartPolicy allows it, the kubelet restarts the container — but if it keeps getting OOMKilled, Kubernetes applies exponential back-off (CrashLoopBackOff).
A common misconception: setting a memory request does not reserve that memory exclusively for your container. Requests only affect scheduling decisions. If your container allocates more memory than its limit allows, it will be OOMKilled — regardless of its request value. If no limit is set, the container can grow until the node itself runs out of memory, at which point the kubelet's eviction manager starts killing pods based on QoS priority.
Quality of Service Classes
Kubernetes automatically assigns every pod a QoS class based on the resource requests and limits of its containers. The QoS class determines eviction priority when a node is under memory pressure — Kubernetes kills lower-priority pods first to free resources for higher-priority ones. You never set the QoS class directly; it is computed from your resource configuration.
| QoS Class | Condition | Eviction Priority | Use Case |
|---|---|---|---|
| Guaranteed | Every container sets requests equal to limits for both CPU and memory | Last to be evicted (highest priority) | Databases, payment services, control plane components |
| Burstable | At least one container has a request or limit set, but they are not all equal | Evicted after BestEffort pods | Web servers, API backends, most production workloads |
| BestEffort | No container sets any request or limit | First to be evicted (lowest priority) | Batch jobs, dev/test workloads, non-critical tasks |
The following diagram shows how Kubernetes determines the QoS class:
flowchart TD
A["Pod Created"] --> B{"All containers have CPU & memory requests AND limits set?"}
B -- No --> C{"Any container has at least one request or limit set?"}
B -- Yes --> D{"requests == limits for every container?"}
C -- No --> E["BestEffort<br/>Lowest priority — evicted first"]
C -- Yes --> F["Burstable<br/>Medium priority"]
D -- Yes --> G["Guaranteed<br/>Highest priority — evicted last"]
D -- No --> F
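The decision logic above can be sketched as a small shell function — a simplified single-container classifier; real clusters evaluate every container in the pod, and it also ignores the API-server defaulting that copies limits into omitted requests (which itself produces Guaranteed pods):

```shell
#!/bin/sh
# Simplified QoS classification for one container.
# An empty string means the field was omitted in the spec.
qos_class() {
  req_cpu=$1; req_mem=$2; lim_cpu=$3; lim_mem=$4
  if [ -z "$req_cpu$req_mem$lim_cpu$lim_mem" ]; then
    echo BestEffort                     # nothing set at all
  elif [ -n "$req_cpu" ] && [ -n "$req_mem" ] && \
       [ "$req_cpu" = "$lim_cpu" ] && [ "$req_mem" = "$lim_mem" ]; then
    echo Guaranteed                     # requests == limits for CPU and memory
  else
    echo Burstable                      # something set, but not all equal
  fi
}
qos_class ""   ""    ""  ""      # BestEffort
qos_class 250m 256Mi 1   512Mi   # Burstable
qos_class 1    512Mi 1   512Mi   # Guaranteed
```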
Guaranteed QoS Example
For Guaranteed QoS, every container must declare both CPU and memory with requests exactly equal to limits. This is the most predictable configuration — the container gets a fixed, reserved allocation.
apiVersion: v1
kind: Pod
metadata:
name: payment-processor
spec:
containers:
- name: processor
image: payments:3.2.1
resources:
requests:
cpu: "1"
memory: "512Mi"
limits:
cpu: "1" # Same as request → Guaranteed
memory: "512Mi" # Same as request → Guaranteed
Burstable QoS Example
Most production workloads fall into Burstable. You set requests to size for typical load and limits to allow headroom for spikes.
apiVersion: v1
kind: Pod
metadata:
name: api-server
spec:
containers:
- name: api
image: myapp-api:2.1.0
resources:
requests:
cpu: "250m" # Typical steady-state usage
memory: "256Mi"
limits:
cpu: "1" # Allow 4x burst for traffic spikes
memory: "512Mi" # Double the request for safety
BestEffort QoS Example
A pod with no resource declarations at all gets BestEffort. It can use whatever is available but will be the first to be evicted under memory pressure.
apiVersion: v1
kind: Pod
metadata:
name: batch-worker
spec:
containers:
- name: worker
image: data-cruncher:latest
# No resources block at all → BestEffort
You can verify a pod's QoS class at any time:
kubectl get pod payment-processor -o jsonpath='{.status.qosClass}'
# Output: Guaranteed
LimitRanges — Default and Boundary Guardrails
Relying on every developer to remember to set requests and limits is fragile. A LimitRange is a namespace-scoped policy that defines default values, minimum values, and maximum values for resource requests and limits on containers and pods. When a pod is created without explicit resource declarations, the LimitRange admission controller injects the defaults automatically.
apiVersion: v1
kind: LimitRange
metadata:
name: container-limits
namespace: production
spec:
limits:
- type: Container
default: # Applied as limits if not specified
cpu: "500m"
memory: "256Mi"
defaultRequest: # Applied as requests if not specified
cpu: "100m"
memory: "128Mi"
min: # Reject pods requesting less than this
cpu: "50m"
memory: "64Mi"
max: # Reject pods requesting more than this
cpu: "4"
memory: "4Gi"
- type: Pod
max: # Total across all containers in a pod
cpu: "8"
memory: "8Gi"
If a developer deploys a container with no resource configuration into the production namespace, the LimitRange controller automatically sets requests.cpu: 100m, requests.memory: 128Mi, limits.cpu: 500m, and limits.memory: 256Mi. If they try to request cpu: 10 (10 cores), the API server rejects the pod with a validation error because it exceeds the max constraint.
ResourceQuotas — Namespace-Level Budgets
While LimitRanges constrain individual containers and pods, ResourceQuotas constrain an entire namespace's aggregate consumption. They prevent a single team or application from monopolizing cluster resources. A ResourceQuota specifies the total amount of compute resources, storage, and even object counts that all pods in a namespace can collectively consume.
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-alpha-quota
namespace: team-alpha
spec:
hard:
# Compute totals
requests.cpu: "20" # Max 20 CPU cores of requests
requests.memory: "40Gi" # Max 40 GiB of memory requests
limits.cpu: "40" # Max 40 CPU cores of limits
limits.memory: "80Gi" # Max 80 GiB of memory limits
# Object counts
pods: "50" # Max 50 pods in namespace
services: "20" # Max 20 services
configmaps: "30" # Max 30 ConfigMaps
persistentvolumeclaims: "10"
secrets: "30"
# Storage
requests.storage: "200Gi" # Total PVC storage
Check current usage against the quota:
kubectl describe resourcequota team-alpha-quota -n team-alpha
# Name: team-alpha-quota
# Resource Used Hard
# -------- ---- ----
# limits.cpu 12 40
# limits.memory 28Gi 80Gi
# pods 18 50
# requests.cpu 6 20
# requests.memory 14Gi 40Gi
When a ResourceQuota is active in a namespace and it specifies compute limits (e.g., requests.cpu), every new pod must declare matching resource requests — otherwise the API server rejects it. This is why you should always pair a ResourceQuota with a LimitRange that provides sensible defaults. The LimitRange injects defaults for pods that omit resources, and the ResourceQuota enforces the namespace-wide budget.
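Quota headroom is simple arithmetic over the Used/Hard columns — a sketch using the illustrative numbers from the team-alpha output above and a hypothetical new pod requesting 250m:

```shell
#!/bin/sh
# How many more 250m-request pods fit under the quota? The binding
# constraint is whichever dimension runs out first.
HARD_REQ_CPU_M=20000     # requests.cpu: "20" → 20000m
USED_REQ_CPU_M=6000      # Used: 6
POD_REQ_CPU_M=250        # each new pod requests 250m
HARD_PODS=50
USED_PODS=18

BY_CPU=$(( (HARD_REQ_CPU_M - USED_REQ_CPU_M) / POD_REQ_CPU_M ))   # 56 by CPU
BY_COUNT=$(( HARD_PODS - USED_PODS ))                             # 32 by pod count
FIT=$BY_CPU
if [ "$BY_COUNT" -lt "$FIT" ]; then FIT=$BY_COUNT; fi
echo "pods that still fit: $FIT"     # the pod-count quota binds first
```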
Ephemeral Storage Requests and Limits
Beyond CPU and memory, containers consume local disk space for logs, temporary files, and writable layers. Kubernetes tracks this as ephemeral-storage — space consumed on the node's root filesystem. You can set requests and limits for ephemeral storage the same way you do for CPU and memory. If a container exceeds its ephemeral-storage limit, the kubelet evicts the pod.
apiVersion: v1
kind: Pod
metadata:
name: log-processor
spec:
containers:
- name: processor
image: log-processor:1.4.0
resources:
requests:
cpu: "500m"
memory: "256Mi"
ephemeral-storage: "1Gi" # Need at least 1 GiB disk
limits:
cpu: "1"
memory: "512Mi"
ephemeral-storage: "2Gi" # Evicted if exceeds 2 GiB
Ephemeral-storage accounting includes the container's writable layer, its log files (stored by the kubelet on the node, under /var/log/pods), and emptyDir volumes (unless they are backed by tmpfs or a dedicated medium). This is particularly important for workloads that write large temporary files — image processing pipelines, build agents, or log-heavy applications — where unchecked disk usage could fill the node's filesystem and impact all pods on that node.
The CPU Limits Debate
There is an active and legitimate debate in the Kubernetes community about whether you should set CPU limits at all. Both sides have valid arguments, and the right answer depends on your workload characteristics and priorities.
The Case Against CPU Limits
The argument: if a node has idle CPU capacity, why prevent a container from using it? CPU limits enforce a hard ceiling via CFS quota, meaning a container gets throttled even when no other workload wants those CPU cycles. This leads to wasted capacity and unnecessary latency. Companies like Google (in Borg, the predecessor to Kubernetes) and several prominent community voices recommend setting only CPU requests and omitting CPU limits entirely.
The Case For CPU Limits
The counter-argument: without limits, a single misbehaving container (an infinite loop, a regex backtrack, a memory leak driving constant garbage collection) can consume all available CPU on a node, degrading every other pod's performance. CPU requests only guarantee a minimum share — they do not prevent a container from consuming far more during contention. Limits provide predictability and isolation, which matters when you have mixed workloads from different teams.
| Factor | No CPU Limits | With CPU Limits |
|---|---|---|
| Resource efficiency | Higher — idle CPU is available to any container | Lower — CPU goes unused if limit not reached |
| Latency predictability | Less predictable — noisy neighbors possible | More predictable — CFS throttling is deterministic |
| Tail latency (p99) | Risk of spikes from bursts by other pods | Risk of spikes from CFS throttling within the pod |
| Noisy neighbor protection | Weak — relies on kernel CFS fair shares only | Strong — hard caps prevent resource hogging |
| Capacity planning | Harder — actual usage can vary wildly from requests | Easier — limits provide a bounded upper estimate |
| Multi-tenant clusters | Risky without strong trust boundaries | Recommended for isolation |
Always set memory limits — memory is incompressible, and an unbounded container can trigger node-wide OOM events. For CPU, start with limits set (especially in multi-tenant clusters), then remove them selectively for latency-sensitive services where throttling is a bigger problem than noisy neighbors. Monitor container_cpu_cfs_throttled_periods_total to detect when throttling is harming your workloads. If throttling is high and the node has spare capacity, removing the CPU limit for that workload is a reasonable decision.
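The throttle ratio behind that decision is the throttled-period counter divided by the total-period counter (container_cpu_cfs_periods_total) — a sketch with illustrative counter values; in practice you would compute rate() over both series in Prometheus rather than raw totals:

```shell
#!/bin/sh
# Fraction of CFS periods in which the container was throttled.
THROTTLED_PERIODS=4200    # container_cpu_cfs_throttled_periods_total (example)
TOTAL_PERIODS=12000       # container_cpu_cfs_periods_total (example)
RATIO_PCT=$(( THROTTLED_PERIODS * 100 / TOTAL_PERIODS ))
echo "throttled in ${RATIO_PCT}% of periods"
```

A sustained double-digit percentage on a latency-sensitive service is a strong signal that the CPU limit is too tight.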
Putting It All Together
In practice, you rarely configure resources in isolation. A well-managed namespace uses all three primitives together: LimitRange for sane defaults, ResourceQuota for budget enforcement, and explicit resource declarations in your pod specs for precision. Here is a complete namespace setup:
apiVersion: v1
kind: Namespace
metadata:
name: team-backend
---
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: team-backend
spec:
limits:
- type: Container
default:
cpu: "500m"
memory: "256Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
max:
cpu: "4"
memory: "4Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: team-backend
spec:
hard:
requests.cpu: "16"
requests.memory: "32Gi"
limits.cpu: "32"
limits.memory: "64Gi"
pods: "40"
With this configuration, any pod deployed to team-backend without resource declarations automatically gets requests of 100m CPU / 128Mi memory and limits of 500m CPU / 256Mi memory. No single container can exceed 4 cores or 4Gi memory. The namespace as a whole cannot exceed 16 cores of CPU requests or 40 pods total. This layered approach gives you safety by default while letting individual workloads override values within the defined guardrails.
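The interplay of the two guardrails is worth working out — a sketch using the numbers from the manifests above, for pods that receive the LimitRange's default 100m request:

```shell
#!/bin/sh
# Which guardrail binds first for default-sized pods in team-backend?
DEFAULT_REQ_CPU_M=100     # LimitRange defaultRequest.cpu
QUOTA_REQ_CPU_M=16000     # ResourceQuota requests.cpu: "16"
QUOTA_PODS=40             # ResourceQuota pods: "40"

BY_CPU=$(( QUOTA_REQ_CPU_M / DEFAULT_REQ_CPU_M ))   # 160 pods fit by CPU requests
CAP=$BY_CPU
if [ "$QUOTA_PODS" -lt "$CAP" ]; then CAP=$QUOTA_PODS; fi
echo "effective ceiling for default-sized pods: $CAP"   # pod-count quota binds first
```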
Namespaces, Labels, and Annotations — Organizing Resources
A running Kubernetes cluster quickly accumulates hundreds — sometimes thousands — of resources. Without organizational structure, finding the right Deployment or debugging a failing Service becomes a needle-in-a-haystack problem. Kubernetes provides three complementary mechanisms to keep things manageable: Namespaces partition the cluster logically, Labels tag resources for selection and grouping, and Annotations attach arbitrary metadata for tools and humans.
These three features operate at different levels. Namespaces are a hard boundary enforced by the API server — they affect resource visibility, access control, and resource quotas. Labels are soft, queryable tags that controllers and kubectl use to match resources together. Annotations are freeform metadata that have no effect on selection or scheduling but carry essential information for external tools, operators, and auditing.
Namespaces: Logical Cluster Partitions
A namespace is a virtual cluster inside your physical cluster. Resources in one namespace are invisible to kubectl commands scoped to another namespace (unless you explicitly ask). This isolation makes namespaces the primary tool for multi-tenancy, environment separation, and organizational boundaries.
The Four Built-in Namespaces
Every Kubernetes cluster ships with four namespaces out of the box, each serving a specific purpose:
| Namespace | Purpose | What Lives Here |
|---|---|---|
| default | The catch-all for resources created without specifying a namespace | Your workloads, if you don't create custom namespaces |
| kube-system | Control plane and core cluster components | CoreDNS, kube-proxy, metrics-server, CNI pods |
| kube-public | Publicly readable resources (even by unauthenticated users) | The cluster-info ConfigMap used during bootstrapping |
| kube-node-lease | Lightweight heartbeats for node health | One Lease object per node, updated every few seconds |
It is tempting to drop cluster-wide tools (monitoring agents, log collectors) into kube-system. Resist this. That namespace is managed by the cluster itself, and upgrades or managed-Kubernetes providers may overwrite or garbage-collect unexpected resources. Create a dedicated namespace like monitoring or infra instead.
When to Create New Namespaces
The right namespace strategy depends on your organization. There is no single correct answer, but three patterns cover most real-world cases:
- Per-team — team-backend, team-frontend, team-data. Good when teams own distinct services and you want to apply separate RBAC policies and resource quotas per team.
- Per-environment — dev, staging, production. Simple, but only works well for small clusters. In larger organizations, environments typically live in separate clusters entirely.
- Per-application — checkout-service, payment-service. Provides fine-grained isolation and makes it easy to tear down everything related to one application at once.
Creating a namespace is straightforward:
# Imperative
kubectl create namespace team-backend
# Declarative (preferred for GitOps)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
name: team-backend
labels:
team: backend
environment: production
EOF
Namespace-Scoped vs. Cluster-Scoped Resources
Not every Kubernetes resource lives inside a namespace. Resources that apply to the entire cluster — like Nodes, PersistentVolumes, ClusterRoles, and Namespaces themselves — are cluster-scoped. Most workload resources (Pods, Deployments, Services, ConfigMaps, Secrets) are namespace-scoped.
# List all namespace-scoped resource types
kubectl api-resources --namespaced=true
# List all cluster-scoped resource types
kubectl api-resources --namespaced=false
Cross-Namespace Communication
Namespaces are a logical boundary, not a network boundary. By default, a Pod in namespace team-backend can freely communicate with a Service in namespace team-frontend. The key is DNS: Kubernetes DNS gives every Service a fully qualified name following the pattern <service>.<namespace>.svc.cluster.local.
# Within the same namespace, short names work
curl http://payment-api:8080/health
# From a different namespace, use the FQDN
curl http://payment-api.team-backend.svc.cluster.local:8080/health
If you need actual network isolation between namespaces, you must use NetworkPolicies. Without them, namespaces provide organizational separation only — every Pod can still reach every other Pod across the cluster.
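The FQDN pattern is mechanical enough to capture in a helper — a small sketch assuming the default cluster domain cluster.local (clusters can be configured with a different domain):

```shell
#!/bin/sh
# Build the in-cluster DNS name for a Service:
# <service>.<namespace>.svc.<cluster-domain>
svc_fqdn() {
  service=$1
  namespace=$2
  echo "${service}.${namespace}.svc.cluster.local"
}
svc_fqdn payment-api team-backend   # payment-api.team-backend.svc.cluster.local
```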
Labels: Identifying and Selecting Resources
Labels are key-value pairs attached to any Kubernetes object's metadata. Unlike namespaces, which provide hard partitions, labels are a flexible tagging system. Their real power comes from selectors — the mechanism that controllers, Services, and kubectl use to find matching resources.
Every major Kubernetes abstraction depends on labels. A Service routes traffic to Pods that match its selector. A Deployment manages ReplicaSets through label matching. A DaemonSet picks which nodes to run on using node labels. Understanding labels means understanding how Kubernetes wires everything together.
Label Syntax Rules
Labels follow strict formatting requirements. Keys can have an optional prefix (a DNS subdomain up to 253 characters) separated by a / from the name. The name portion must be 63 characters or fewer and match the pattern [a-z0-9A-Z] with -, _, and . allowed in the middle. Values follow the same rules but can also be empty.
metadata:
labels:
app: payment-api # simple key
version: v2.3.1 # version tracking
tier: backend # architectural layer
app.kubernetes.io/name: payment-api # recommended label (prefixed)
app.kubernetes.io/component: server
app.kubernetes.io/managed-by: helm
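The name-portion rules translate directly into a regular expression — a validation sketch covering the 63-character limit and the begin/end-alphanumeric constraint (it checks non-empty names only; label values may additionally be empty):

```shell
#!/bin/sh
# Validate the name portion of a label key, or a non-empty label value:
# 1–63 chars, alphanumeric at both ends, with - _ . allowed in between.
valid_label() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?$'
}
valid_label "payment-api" && echo "payment-api: ok"
valid_label "v2.3.1"      && echo "v2.3.1: ok"
valid_label "-bad-start"  || echo "-bad-start: rejected"
```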
Recommended Labels
Kubernetes defines a set of recommended labels under the app.kubernetes.io prefix. These labels are not enforced by the system, but they create a shared vocabulary that tools like Helm, ArgoCD, and various dashboards understand.
| Label | Example Value | Purpose |
|---|---|---|
| app.kubernetes.io/name | payment-api | The name of the application |
| app.kubernetes.io/instance | payment-api-prod | A unique instance identifier |
| app.kubernetes.io/version | 3.1.0 | The current application version |
| app.kubernetes.io/component | database | The component within the architecture |
| app.kubernetes.io/part-of | e-commerce | The higher-level application this belongs to |
| app.kubernetes.io/managed-by | helm | The tool managing this resource |
Label Selectors
Selectors are how Kubernetes answers the question "which resources match?" There are two flavors: equality-based and set-based. Equality-based selectors use =, ==, and !=. Set-based selectors use in, notin, and exists. Both can be combined in a single query — all conditions must be satisfied (logical AND).
# Equality-based: find all pods for the payment-api app
kubectl get pods -l app=payment-api
# Inequality: everything except the frontend tier
kubectl get pods -l tier!=frontend
# Set-based: pods in either staging or production
kubectl get pods -l 'environment in (staging, production)'
# Set-based: pods that have a "release" label (any value)
kubectl get pods -l 'release'
# Set-based: pods without a "canary" label
kubectl get pods -l '!canary'
# Combining selectors (AND logic)
kubectl get pods -l 'app=payment-api,environment in (production)'
How Controllers Use Selectors
The selector mechanism is the glue that holds Kubernetes' declarative model together. A Deployment does not directly manage Pods — it manages ReplicaSets, which in turn manage Pods. The connection at each level is made through label selectors. Here is a concrete example showing the full chain:
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-api
spec:
replicas: 3
selector:
matchLabels: # Deployment finds ReplicaSets with these labels
app: payment-api
template:
metadata:
labels: # Pods created with these labels — must match selector above
app: payment-api
version: v2.3.1
spec:
containers:
- name: api
image: myregistry/payment-api:2.3.1
---
apiVersion: v1
kind: Service
metadata:
name: payment-api
spec:
selector: # Service routes to Pods matching these labels
app: payment-api
ports:
- port: 80
targetPort: 8080
The Deployment's spec.selector.matchLabels must be a subset of the Pod template's metadata.labels. If they don't match, the API server rejects the manifest. The Service's spec.selector independently matches Pods by label — it has no knowledge of the Deployment at all. This loose coupling is intentional: a Service can route to Pods managed by different Deployments, StatefulSets, or even bare Pods, as long as the labels match.
Annotations: Metadata for Tools and Humans
Annotations look like labels — they are key-value pairs in metadata — but they serve a fundamentally different purpose. You cannot select or filter resources by annotations. Instead, annotations carry non-identifying information that tools, controllers, and operators read at runtime. Think of them as a structured comment attached to a resource.
Annotations have relaxed constraints compared to labels. Values can be much larger (up to 256 KB total for all annotations on a resource) and can contain any UTF-8 characters, including JSON, URLs, and multi-line strings.
Common Annotations in the Wild
| Annotation | Set By | Purpose |
|---|---|---|
| kubectl.kubernetes.io/last-applied-configuration | kubectl apply | Stores the full JSON of the last applied manifest for three-way merge diffs |
| kubernetes.io/change-cause | User | Records why a rollout was triggered (shown in kubectl rollout history) |
| prometheus.io/scrape | User | Tells Prometheus to scrape metrics from this Pod |
| prometheus.io/port | User | Specifies the port Prometheus should scrape |
| nginx.ingress.kubernetes.io/rewrite-target | User | Configures URL rewriting in the NGINX Ingress Controller |
| checksum/config | Helm | A hash of the ConfigMap contents, used to trigger Pod restarts on config changes |
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-api
annotations:
kubernetes.io/change-cause: "Upgraded to v2.3.1 — fixes CVE-2024-1234"
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
containers:
- name: api
image: myregistry/payment-api:2.3.1
Labels vs. Annotations: When to Use Which
The decision rule is simple: if Kubernetes or your tooling needs to select or group resources by this piece of metadata, use a label. If the data is informational, consumed by external tools, or too large for a label, use an annotation.
| Criterion | Labels | Annotations |
|---|---|---|
| Primary purpose | Identification and selection | Non-identifying metadata |
| Used by selectors | Yes — Services, Deployments, kubectl -l | No — never queryable via selectors |
| Value constraints | ≤ 63 characters, alphanumeric + -_. | Up to 256 KB, any UTF-8 string |
| Examples | app: nginx, tier: frontend | Build SHA, config hash, Ingress hints |
| Indexable by API server | Yes — efficient lookups | No — stored but not indexed |
| Typical consumers | Kubernetes controllers, schedulers | CI/CD tools, monitoring agents, operators |
Start with a label. If you find yourself exceeding the 63-character value limit, needing to store structured data (JSON, URLs), or the value has no relevance to selection, move it to an annotation. A common mistake is putting build hashes or Git SHAs in labels — they are not useful for selection and are better suited as annotations.
Practical: Managing Labels with kubectl
Beyond filtering with -l, kubectl gives you commands to add, update, and remove labels on live resources — useful for quick operational tasks like marking a Pod for debugging or excluding it from a Service's traffic.
# Add a label to a running Pod
kubectl label pod payment-api-7d6f8b4c5-x9kzq debug=true
# Overwrite an existing label (--overwrite is required)
kubectl label pod payment-api-7d6f8b4c5-x9kzq version=v2.4.0 --overwrite
# Remove a label (use the key followed by a minus sign)
kubectl label pod payment-api-7d6f8b4c5-x9kzq debug-
# Show labels as columns in output
kubectl get pods --show-labels
# Show specific labels as their own output columns
kubectl get pods -L app,version
# Label all pods in a namespace at once
kubectl label pods --all environment=staging -n team-backend
Similarly, you can manage annotations imperatively:
# Add an annotation
kubectl annotate deployment payment-api kubernetes.io/change-cause="Rollback to v2.2.0"
# Remove an annotation
kubectl annotate deployment payment-api kubernetes.io/change-cause-
# View annotations on a resource
kubectl get deployment payment-api -o jsonpath='{.metadata.annotations}'
Putting It All Together
In practice, you use all three mechanisms in concert. Namespaces divide ownership and access. Labels connect resources and enable querying. Annotations carry the metadata that your CI/CD pipeline, monitoring stack, and Ingress controllers rely on.
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-api
namespace: team-backend # Namespace: ownership boundary
labels: # Labels: identification & selection
app.kubernetes.io/name: payment-api
app.kubernetes.io/version: "2.3.1"
app.kubernetes.io/component: server
app.kubernetes.io/part-of: e-commerce
app.kubernetes.io/managed-by: argocd
annotations: # Annotations: tool metadata
kubernetes.io/change-cause: "Release 2.3.1"
argocd.argoproj.io/sync-wave: "2"
git.commit/sha: "a1b2c3d4e5f6"
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: payment-api
template:
metadata:
labels:
app.kubernetes.io/name: payment-api
app.kubernetes.io/version: "2.3.1"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
containers:
- name: api
image: myregistry/payment-api:2.3.1
ports:
- containerPort: 8080
- containerPort: 9090
name: metrics
With this foundation — namespaces for boundaries, labels for selection, and annotations for metadata — you have the organizational toolkit to keep a complex cluster understandable as it scales. The next section builds directly on namespaces by introducing RBAC and ServiceAccounts, which control who can access the resources inside each namespace.
RBAC and ServiceAccounts — Who Can Do What
Every request that hits the Kubernetes API server passes through three gates: authentication (who are you?), authorization (are you allowed to do this?), and admission control (should we modify or reject this request?). This section focuses on the first two gates — specifically, how Kubernetes identifies callers and how RBAC grants or denies their actions.
Getting RBAC wrong is one of the fastest paths to a cluster security incident. Overly permissive ClusterRoleBindings to cluster-admin are disturbingly common in production. Understanding the model — and knowing how to scope permissions tightly — is essential.
Authentication — Proving Who You Are
Kubernetes does not have a built-in user database. It does not store usernames and passwords. Instead, the API server delegates identity verification to external systems through pluggable authentication strategies. When a request arrives, the API server tries each configured authenticator in order until one succeeds.
| Strategy | How It Works | Typical Use Case |
|---|---|---|
| X.509 Client Certificates | Client presents a TLS certificate signed by the cluster CA. The Common Name (CN) becomes the username; Organization (O) fields become groups. | Admin users, kubeadm-bootstrapped clusters |
| Bearer Tokens | A static token or ServiceAccount JWT is sent in the Authorization: Bearer <token> header. | ServiceAccounts, legacy static token files |
| OpenID Connect (OIDC) | API server validates a JWT issued by an external identity provider (Google, Azure AD, Keycloak, Dex). | Enterprise SSO for human users |
| Webhook Token Authentication | API server sends the token to an external webhook service that responds with the user's identity. | Custom auth integrations, cloud-specific identity |
Kubernetes has two categories of identity: normal users (humans, managed externally — no User API object exists) and ServiceAccounts (managed as Kubernetes objects in namespaces). You cannot create a "User" resource with kubectl. User identity comes entirely from the authentication layer.
Authorization — Deciding What You Can Do
Once the API server knows who you are, it determines what you can do. Kubernetes supports multiple authorization modes, configured via the --authorization-mode flag on the API server. The modes are evaluated in order, and the first one that makes a decision (allow or deny) wins.
| Mode | Description | Status |
|---|---|---|
| RBAC | Role-based access control using Role/ClusterRole and Binding objects. The standard. | Default on virtually all clusters |
| ABAC | Attribute-based access control using a static policy file. Requires API server restart to change. | Legacy — not recommended |
| Webhook | Delegates authorization decisions to an external HTTP service. | Used alongside RBAC for custom policies |
| Node | Special-purpose authorizer that grants kubelets the minimum permissions they need. | Always enabled alongside RBAC |
In practice, you will work with RBAC on every cluster. The rest of this section is a deep dive into the RBAC model.
The RBAC Model
RBAC in Kubernetes is built from four resource types that connect in a clear pattern: Roles define what actions are allowed, and Bindings connect those roles to subjects (users, groups, or ServiceAccounts). The "namespace vs. cluster" axis gives you two levels of scope.
erDiagram
Role {
string namespace
string name
list rules
}
ClusterRole {
string name
list rules
}
RoleBinding {
string namespace
string roleRef
list subjects
}
ClusterRoleBinding {
string roleRef
list subjects
}
User {
string name
}
Group {
string name
}
ServiceAccount {
string namespace
string name
}
Role ||--o{ RoleBinding : "referenced by"
ClusterRole ||--o{ RoleBinding : "referenced by"
ClusterRole ||--o{ ClusterRoleBinding : "referenced by"
RoleBinding }o--|| User : "grants to"
RoleBinding }o--|| Group : "grants to"
RoleBinding }o--|| ServiceAccount : "grants to"
ClusterRoleBinding }o--|| User : "grants to"
ClusterRoleBinding }o--|| Group : "grants to"
ClusterRoleBinding }o--|| ServiceAccount : "grants to"
There are four key resources to understand:
| Resource | Scope | Purpose |
|---|---|---|
| Role | Namespace | Defines permissions (verbs on resources) within a single namespace |
| ClusterRole | Cluster | Defines permissions cluster-wide, or for cluster-scoped resources (nodes, PVs, namespaces) |
| RoleBinding | Namespace | Binds a Role or ClusterRole to subjects within a namespace |
| ClusterRoleBinding | Cluster | Binds a ClusterRole to subjects across all namespaces |
A subtle but powerful pattern: a RoleBinding can reference a ClusterRole. This lets you define a reusable set of permissions once as a ClusterRole and then grant it in specific namespaces through RoleBindings. The subject only gets those permissions within the RoleBinding's namespace — not cluster-wide.
RBAC Verbs
Each rule in a Role or ClusterRole specifies which verbs (actions) are allowed on which resources in which API groups. Kubernetes defines eight verbs that map directly to HTTP methods on the API server.
| Verb | HTTP Method | Description |
|---|---|---|
| get | GET (single) | Read a single resource by name |
| list | GET (collection) | List all resources of a type |
| watch | GET (streaming) | Stream real-time changes to resources |
| create | POST | Create a new resource |
| update | PUT | Replace an entire resource |
| patch | PATCH | Modify specific fields of a resource |
| delete | DELETE (single) | Delete a single resource by name |
| deletecollection | DELETE (collection) | Delete all resources matching a selector |
Common groupings: read-only access typically means get, list, watch. Full management adds create, update, patch, delete. The wildcard * matches all verbs — use it sparingly.
Role and ClusterRole
A Role grants permissions within a specific namespace. Each rule in the rules array specifies API groups, resources, and the verbs allowed on those resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: staging
name: deployment-manager
rules:
# Full control over Deployments
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Read-only access to Pods and their logs
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
The apiGroups field determines which API group the resources belong to. Core resources (Pods, Services, ConfigMaps) use "" (the empty string). Resources in named groups use the group name — "apps" for Deployments, "batch" for Jobs, "networking.k8s.io" for NetworkPolicies.
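As an illustration, a single Role can mix API groups freely. Here is a hypothetical rules fragment spanning the core group, batch, and networking.k8s.io:

```yaml
rules:
# Core group resources use the empty string
- apiGroups: [""]
  resources: ["services", "configmaps"]
  verbs: ["get", "list", "watch"]
# Jobs and CronJobs live in the "batch" group
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: ["get", "list", "watch", "create"]
# NetworkPolicies live in "networking.k8s.io"
- apiGroups: ["networking.k8s.io"]
  resources: ["networkpolicies"]
  verbs: ["get", "list", "watch"]
```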
A ClusterRole looks identical but has no namespace field and can also grant access to cluster-scoped resources like Nodes, PersistentVolumes, and Namespaces themselves.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: node-reader
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["namespaces"]
verbs: ["get", "list"]
You can also restrict access to specific resource names using the resourceNames field. This is useful for granting access to a particular ConfigMap or Secret without opening up all resources of that type.
rules:
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["app-config", "feature-flags"]
verbs: ["get", "update"]
RoleBinding and ClusterRoleBinding
Roles and ClusterRoles are inert until you bind them to subjects. A RoleBinding grants the permissions defined in a Role (or ClusterRole) to one or more subjects — within a single namespace. A ClusterRoleBinding grants permissions cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: deploy-manager-binding
namespace: staging
subjects:
# A specific user (from certificate CN or OIDC)
- kind: User
name: jane.doe
apiGroup: rbac.authorization.k8s.io
# A group (from certificate O field or OIDC groups claim)
- kind: Group
name: platform-team
apiGroup: rbac.authorization.k8s.io
# A ServiceAccount in the same namespace
- kind: ServiceAccount
name: ci-deployer
namespace: staging
roleRef:
kind: Role
name: deployment-manager
apiGroup: rbac.authorization.k8s.io
The roleRef is immutable — once a Binding is created, you cannot change which Role it references. To point it at a different Role, delete and recreate the Binding. This prevents privilege escalation through in-place modification.
Here is a ClusterRoleBinding that grants node-reader to the monitoring group across the entire cluster:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: monitoring-node-reader
subjects:
- kind: Group
name: monitoring
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: node-reader
apiGroup: rbac.authorization.k8s.io
ServiceAccounts
ServiceAccounts are the identity mechanism for workloads running inside the cluster. Every namespace has a default ServiceAccount, and every Pod runs as some ServiceAccount. When a Pod makes API calls (e.g., a controller querying Pods, or a CI tool creating Deployments), it authenticates using its ServiceAccount token.
Automatic Token Mounting
By default, Kubernetes mounts a ServiceAccount token into every Pod at /var/run/secrets/kubernetes.io/serviceaccount/. Since Kubernetes 1.22, these are bound service account tokens — time-limited JWTs projected through the TokenRequest API, scoped to a specific audience and expiration (default: 1 hour, auto-refreshed by the kubelet).
# Inspect the projected token volume inside a running Pod
kubectl exec my-pod -- ls /var/run/secrets/kubernetes.io/serviceaccount/
# Output: ca.crt namespace token
# Decode the JWT to see its claims (bound, time-limited)
kubectl exec my-pod -- cat /var/run/secrets/kubernetes.io/serviceaccount/token \
| cut -d'.' -f2 | base64 -d 2>/dev/null | python3 -m json.tool
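One caveat: JWT segments use unpadded base64url, which plain base64 -d can reject. A small illustrative Python helper (not part of any Kubernetes tooling) that restores the padding before decoding:

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the claims segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; restore padding before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# A toy, unsigned token for illustration (header.payload.signature)
header = base64.urlsafe_b64encode(b'{"alg":"none"}').rstrip(b"=").decode()
claims = base64.urlsafe_b64encode(
    b'{"sub":"system:serviceaccount:staging:ci-deployer","exp":3600}'
).rstrip(b"=").decode()
token = f"{header}.{claims}.sig"

print(decode_jwt_payload(token)["sub"])
# -> system:serviceaccount:staging:ci-deployer
```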
Disabling Automatic Token Mounting
Most application Pods never call the Kubernetes API. Mounting a token into these Pods is an unnecessary attack surface — if the Pod is compromised, the attacker gets a valid API token. Disable it at the ServiceAccount level, the Pod level, or both.
# Option 1: Disable at the ServiceAccount level (affects all Pods using it)
apiVersion: v1
kind: ServiceAccount
metadata:
name: my-app
namespace: production
automountServiceAccountToken: false
---
# Option 2: Disable at the Pod spec level (overrides ServiceAccount setting)
apiVersion: v1
kind: Pod
metadata:
name: my-app-pod
spec:
serviceAccountName: my-app
automountServiceAccountToken: false
containers:
- name: app
image: my-app:1.4.0
Set automountServiceAccountToken: false on the default ServiceAccount in every namespace. Create dedicated ServiceAccounts with explicit permissions only for Pods that actually need API access. The default ServiceAccount should never have RoleBindings attached to it.
Bound Service Account Token Volume Projection
Modern Kubernetes clusters (1.22+) use projected volumes to mount ServiceAccount tokens. Unlike the legacy approach of long-lived Secret-based tokens, projected tokens are issued on-demand through the TokenRequest API with three important properties:
- Time-limited — tokens expire (default 1 hour) and are automatically rotated by the kubelet before expiration.
- Audience-bound — tokens are scoped to a specific audience (typically the API server), preventing misuse with other services.
- Object-bound — tokens are tied to the specific Pod; if the Pod is deleted, the token becomes invalid immediately.
You can also request tokens with custom audiences and expiration times using a projected volume explicitly:
apiVersion: v1
kind: Pod
metadata:
name: vault-consumer
spec:
serviceAccountName: vault-auth
containers:
- name: app
image: my-app:2.0.0
volumeMounts:
- name: vault-token
mountPath: /var/run/secrets/vault
readOnly: true
volumes:
- name: vault-token
projected:
sources:
- serviceAccountToken:
path: token
expirationSeconds: 3600
audience: vault
Aggregated ClusterRoles
Aggregated ClusterRoles let you compose a ClusterRole from multiple smaller ClusterRoles using label selectors. Instead of maintaining one monolithic role, you define granular roles and aggregate them automatically. Kubernetes' built-in admin, edit, and view ClusterRoles use this mechanism — which is why installing a CRD can automatically make its resources visible in those roles.
# Parent: aggregates all ClusterRoles with the matching label
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-aggregate
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.example.com/aggregate-to-monitoring: "true"
rules: [] # Rules are auto-populated by the controller
---
# Child: contributes rules to the aggregate
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-metrics-reader
labels:
rbac.example.com/aggregate-to-monitoring: "true"
rules:
- apiGroups: ["metrics.k8s.io"]
resources: ["pods", "nodes"]
verbs: ["get", "list", "watch"]
---
# Another child: automatically merged into the aggregate
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-events-reader
labels:
rbac.example.com/aggregate-to-monitoring: "true"
rules:
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "list", "watch"]
The parent ClusterRole's rules field is automatically filled by the RBAC controller. Leave it as an empty array — any manually added rules will be overwritten. Adding or removing a child ClusterRole with the matching label dynamically updates the aggregate.
Auditing Permissions with kubectl auth
You do not need to read through dozens of Roles and Bindings to understand what a subject can do. The kubectl auth can-i command answers permission questions directly.
# Can I create deployments in the staging namespace?
kubectl auth can-i create deployments --namespace staging
# yes
# Can the "ci-deployer" ServiceAccount list pods in staging?
kubectl auth can-i list pods \
--namespace staging \
--as system:serviceaccount:staging:ci-deployer
# yes
# List ALL permissions for a specific ServiceAccount
kubectl auth can-i --list \
--namespace staging \
--as system:serviceaccount:staging:ci-deployer
# Check cluster-scoped permissions: can user jane delete nodes?
kubectl auth can-i delete nodes --as jane.doe
# no
# Check who you are currently authenticated as (Kubernetes 1.27+)
kubectl auth whoami
The --as flag triggers impersonation, which lets cluster admins test permissions from another identity's perspective. ServiceAccounts use the format system:serviceaccount:<namespace>:<name>. The --list flag outputs a table of all allowed actions — invaluable for auditing.
Practical Example: CI/CD Pipeline ServiceAccount
Here is a complete, production-ready example: a ServiceAccount for a CI/CD pipeline (like ArgoCD or a GitHub Actions runner) that can manage Deployments and Services in the staging namespace — and nothing else.
# 1. Create a dedicated ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
name: ci-deployer
namespace: staging
---
# 2. Define the minimum required permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: ci-deploy-role
namespace: staging
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["services"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments/status"]
verbs: ["get"]
---
# 3. Bind the Role to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ci-deploy-binding
namespace: staging
subjects:
- kind: ServiceAccount
name: ci-deployer
namespace: staging
roleRef:
kind: Role
name: ci-deploy-role
apiGroup: rbac.authorization.k8s.io
After applying this, verify the permissions are correct:
# Apply the manifests
kubectl apply -f ci-deployer-rbac.yaml
# Verify: can it create deployments in staging?
kubectl auth can-i create deployments \
--namespace staging \
--as system:serviceaccount:staging:ci-deployer
# yes
# Verify: can it delete secrets? (should be denied)
kubectl auth can-i delete secrets \
--namespace staging \
--as system:serviceaccount:staging:ci-deployer
# no
# Verify: can it create deployments in production? (should be denied)
kubectl auth can-i create deployments \
--namespace production \
--as system:serviceaccount:staging:ci-deployer
# no
# List all granted permissions
kubectl auth can-i --list \
--namespace staging \
--as system:serviceaccount:staging:ci-deployer
Built-in ClusterRoles Reference
Kubernetes ships with several default ClusterRoles. Understanding these helps you decide when to use a built-in role versus creating your own.
| ClusterRole | Permissions | When to Use |
|---|---|---|
| cluster-admin | Full access to all resources in all namespaces. Equivalent to root. | Break-glass emergency access only. Never for day-to-day use. |
| admin | Full access within a namespace (Roles, RoleBindings, most resources). Cannot modify ResourceQuotas or the namespace itself. | Namespace owners, team leads. |
| edit | Read/write access to most resources in a namespace. No access to Roles or RoleBindings. | Developers deploying applications. |
| view | Read-only access to most resources. No access to Secrets. | Observers, dashboard users, auditors. |
Instead of creating per-namespace Roles, use a RoleBinding that references the built-in edit or view ClusterRole. For example, binding ClusterRole/edit via a RoleBinding in the dev namespace gives the dev team edit access in dev only — without needing a custom Role.
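A sketch of that pattern, assuming a dev namespace and a dev-team group from your identity provider:

```yaml
# Grants the built-in "edit" ClusterRole, but only inside "dev"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-edit
  namespace: dev
subjects:
- kind: Group
  name: dev-team                 # group name is illustrative
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                     # built-in ClusterRole
  apiGroup: rbac.authorization.k8s.io
```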
Pod Security Standards and Pod Security Admission
For years, PodSecurityPolicy (PSP) was the built-in mechanism for controlling what pods could and couldn't do on a Kubernetes cluster. It was powerful but deeply flawed — difficult to reason about, impossible to dry-run, and full of surprising interactions with RBAC. In Kubernetes 1.21, PSP was officially deprecated. In Kubernetes 1.25, it was removed entirely.
Its replacement is a two-part system: Pod Security Standards (PSS), which define three tiered security profiles, and Pod Security Admission (PSA), a built-in admission controller that enforces those standards at the namespace level. Together, they give you a simpler, more predictable way to prevent dangerous pod configurations from running on your cluster.
Why PodSecurityPolicy Had to Go
PSP suffered from fundamental design problems that made it unreliable in practice. Understanding why it failed helps you appreciate what PSA does differently.
- Confusing policy selection. When multiple PSPs existed, Kubernetes would silently pick one based on alphabetical ordering and mutation rules. Administrators couldn't easily predict which policy applied to a given pod.
- Tight RBAC coupling. PSPs were authorized through RBAC bindings on the `use` verb. This meant the effective policy depended on the identity creating the pod — not on the namespace or workload. A Deployment created via a controller used the controller's ServiceAccount, not the user's, leading to subtle bypasses.
- No dry-run or audit mode. You couldn't test a policy before enforcing it. Rolling out a new PSP to an existing cluster was a high-risk operation with no way to preview what would break.
- Mutation side effects. PSPs could mutate pod specs (setting default seccomp profiles, dropping capabilities), which made debugging unpredictable and conflicted with GitOps workflows expecting immutable manifests.
PSA intentionally does not mutate pods. It only validates. If a pod violates the policy, it's rejected, warned about, or logged — but never silently modified. This is a deliberate design choice that makes behavior predictable and audit trails trustworthy.
The Three Pod Security Standards
Pod Security Standards define three progressively restrictive security profiles. Each level's checks are a superset of the previous level's: Restricted enforces everything Baseline checks, and Baseline enforces everything Privileged checks (which is nothing, since Privileged applies no checks at all). Think of them as presets, not custom policies.
| Level | Intent | Typical Use Case |
|---|---|---|
| Privileged | Unrestricted. No checks applied. | System-level workloads like CNI plugins, storage drivers, logging agents that require host access |
| Baseline | Prevents known privilege escalations while remaining broadly compatible | General application workloads — the sensible default for most namespaces |
| Restricted | Heavily locked-down, follows current pod hardening best practices | Security-sensitive workloads, multi-tenant environments, compliance-mandated clusters |
Privileged — The Escape Hatch
The Privileged level applies zero restrictions. Any pod configuration is allowed: privileged containers, host namespaces, host paths — everything. This exists because certain infrastructure components (CNI plugins like Calico, storage provisioners like Longhorn, monitoring agents like the node exporter) genuinely need elevated access to the host.
Apply this level only to dedicated infrastructure namespaces like kube-system, and keep those namespaces tightly controlled via RBAC. Never use Privileged as the default for application namespaces.
Baseline — Sensible Defaults
Baseline blocks the most dangerous pod configurations while remaining compatible with the vast majority of application workloads. It prevents things like privileged containers, host networking, and dangerous capability additions, but still allows running as root and most volume types. Most off-the-shelf Helm charts and container images will pass Baseline without modification.
Restricted — Hardened Workloads
Restricted builds on Baseline and adds requirements that many existing images don't meet out of the box: pods must run as non-root, must drop ALL capabilities, must set a seccomp profile, and can only use a limited set of volume types. This is the level you want for multi-tenant clusters or environments with compliance requirements, but expect to update your Dockerfiles and pod specs to conform.
What Each Level Actually Checks
The specific checks at each level are defined in the Kubernetes documentation and are version-pinned. Here is a breakdown of the key controls and which level enforces them.
| Check | Privileged | Baseline | Restricted |
|---|---|---|---|
| HostProcess (Windows) | Allowed | Disallowed | Disallowed |
| Host namespaces (hostNetwork, hostPID, hostIPC) | Allowed | Disallowed | Disallowed |
| Privileged containers | Allowed | Disallowed | Disallowed |
| Capabilities (beyond a safe set) | Allowed | Additions limited to a safe default set | Must drop ALL; only NET_BIND_SERVICE may be added back |
| HostPath volumes | Allowed | Disallowed | Disallowed |
| Host ports | Allowed | Disallowed (or limited range) | Disallowed |
| /proc mount type | Allowed | Must be Default | Must be Default |
| Seccomp profile | Any | Must not be set to Unconfined | Must be RuntimeDefault or Localhost |
| Sysctls | Any | Only safe set allowed | Only safe set allowed |
| Volume types | Any | Broad set (no hostPath) | Limited: configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret |
| runAsNonRoot | Not required | Not required | Must be true |
| Run as non-root user (UID) | Not required | Not required | runAsUser must not be 0 |
| Privilege escalation (allowPrivilegeEscalation) | Allowed | Allowed | Must be false |
Enforcement Modes: Enforce, Audit, and Warn
PSA does not just have an on/off switch. It provides three modes that control what happens when a pod violates the configured security level. You can combine multiple modes on the same namespace, and this is in fact the recommended approach for gradual rollouts.
| Mode | Effect | Visibility |
|---|---|---|
| enforce | Rejects pods that violate the policy | API request fails with an error — the pod is never created |
| audit | Allows the pod but records the violation in the API server audit log | Visible only in audit logs — invisible to the user |
| warn | Allows the pod but sends a warning back to the API client | Displayed as a warning in kubectl output |
The key insight is that audit and warn let you preview what would break before you turn on enforcement. You can set enforce: baseline (to block the worst offenders now) while simultaneously setting warn: restricted and audit: restricted (to surface everything that is not yet hardened). Users see warnings in their terminal, and your security team can query the audit log for violations.
Applying PSA with Namespace Labels
PSA is configured entirely through labels on namespace objects. There are no separate policy resources to create — you just label the namespace. The label format is:
pod-security.kubernetes.io/<MODE>: <LEVEL>
pod-security.kubernetes.io/<MODE>-version: <VERSION>
Where MODE is enforce, audit, or warn; LEVEL is privileged, baseline, or restricted; and VERSION is a Kubernetes minor version like v1.30 or latest. Pinning a version ensures that policy checks do not change under your feet when you upgrade the cluster.
Here is a namespace configured to enforce Baseline and warn on Restricted:
apiVersion: v1
kind: Namespace
metadata:
name: my-app
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: v1.30
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: v1.30
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: v1.30
You can also apply labels imperatively with kubectl:
# Enforce baseline, warn and audit on restricted
kubectl label namespace my-app \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
What Happens When a Pod Violates the Policy
Let's see PSA in action. Suppose the my-app namespace enforces baseline. If someone tries to create a privileged container, the request is rejected:
# This pod will be REJECTED in a baseline-enforced namespace
apiVersion: v1
kind: Pod
metadata:
name: dangerous-pod
namespace: my-app
spec:
hostNetwork: true
containers:
- name: app
image: nginx:1.27
securityContext:
privileged: true
Error from server (Forbidden): pods "dangerous-pod" is forbidden:
violates PodSecurity "baseline:v1.30":
host namespaces (hostNetwork=true),
privileged (container "app" must not set securityContext.privileged=true)
Notice the error message is specific — it tells you exactly which checks failed and which container caused the violation. With warn mode on the same namespace, the pod would be created but you would see a warning in the kubectl output:
Warning: would violate PodSecurity "restricted:v1.30":
allowPrivilegeEscalation != false (container "app" ...),
unrestricted capabilities (container "app" ...),
runAsNonRoot != true (pod or container "app" ...),
seccompProfile (pod or container "app" ...)
Writing Pods That Pass the Restricted Level
The Restricted level is where most teams hit friction. It requires explicit security context settings that many container images and Helm charts do not include by default. Here is a pod spec that satisfies every Restricted check:
apiVersion: v1
kind: Pod
metadata:
name: secure-app
namespace: my-app
spec:
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: my-registry/app:v2.1.0
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
ports:
- containerPort: 8080
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
Let's break down each requirement and why it matters:
- runAsNonRoot: true — Ensures the container process does not run as UID 0. Your container image must define a USER directive in the Dockerfile, or you set runAsUser to a non-zero UID.
- allowPrivilegeEscalation: false — Prevents child processes from gaining more privileges than the parent via setuid binaries or filesystem capabilities.
- capabilities.drop: ["ALL"] — Drops all Linux capabilities. You can selectively add back NET_BIND_SERVICE if the container needs to bind to ports below 1024.
- seccompProfile.type: RuntimeDefault — Applies the container runtime's default seccomp filter, which blocks around 40-60 dangerous syscalls while allowing normal application behavior.
- readOnlyRootFilesystem: true — Not strictly required by Restricted, but a strong best practice. Use emptyDir mounts for directories that need writes (like /tmp).
Set pod-level securityContext for settings that apply to all containers (runAsNonRoot, seccompProfile), and container-level securityContext for per-container settings (allowPrivilegeEscalation, capabilities). If both are set, the container-level value takes precedence.
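A short illustration of that precedence, with hypothetical image names — the sidecar overrides the pod-level UID while inheriting everything else:

```yaml
spec:
  securityContext:                 # pod-level defaults for all containers
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: my-app:2.1.0            # inherits runAsUser 1000
  - name: sidecar
    image: my-sidecar:0.3.0
    securityContext:
      runAsUser: 2000              # container-level value takes precedence
```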
Dry-Running PSA Checks with kubectl
Before changing enforcement labels on a live namespace, you can test what would fail. The --dry-run=server flag processes the request through admission but does not persist the object. Combined with warn mode, this lets you check any pod spec against a PSA level without risk.
# Dry-run: will this pod be accepted in a restricted namespace?
kubectl apply -f secure-app.yaml --dry-run=server
# Check all existing pods in a namespace against restricted level
kubectl label --dry-run=server --overwrite namespace my-app \
pod-security.kubernetes.io/enforce=restricted
The second command is especially powerful. When you label a namespace with --dry-run=server, Kubernetes evaluates all existing pods against the new level and returns warnings for any violations — without actually changing the namespace label. This is how you safely audit before flipping the switch.
Migrating from PodSecurityPolicy to PSA
If you are running a cluster that used PSPs, migrating to PSA requires a methodical approach. The two systems are fundamentally different — PSPs are cluster-scoped resources bound via RBAC, while PSA is namespace-scoped via labels. There is no automatic conversion.
Step 1: Audit Your Existing PSPs
Start by understanding what your current PSPs actually allow. Map each PSP to the closest PSS level:
# List all PSPs and their key settings
kubectl get psp -o custom-columns=\
NAME:.metadata.name,\
PRIV:.spec.privileged,\
HOST_NET:.spec.hostNetwork,\
HOST_PID:.spec.hostPID,\
RUN_AS:.spec.runAsUser.rule,\
VOLUMES:.spec.volumes
Step 2: Label Namespaces with Warn and Audit First
Do not jump to enforcement. Start by adding warn and audit labels to every namespace at the level you intend to enforce. Let the warnings accumulate for a week or two.
# Phase 1: warn and audit only — nothing is blocked
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
kubectl label namespace "$ns" \
pod-security.kubernetes.io/warn=baseline \
pod-security.kubernetes.io/audit=baseline \
--overwrite
done
Step 3: Fix Violations in Workloads
Review the warnings in kubectl output and the violations in audit logs. Update your Deployments, StatefulSets, and DaemonSets to pass the target level. Common fixes include removing hostNetwork: true, dropping capabilities, adding seccompProfile, and ensuring containers do not run as root.
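As one illustrative shape for these fixes, here is a strategic-merge patch (applied with something like kubectl patch deployment my-app --patch-file harden.yaml, assuming a container named app) that layers in the restricted-level requirements:

```yaml
spec:
  template:
    spec:
      hostNetwork: false
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app                  # merged by name with the existing container
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
```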
Step 4: Enable Enforcement
Once violations are resolved, add the enforce label. If you are aiming for Restricted long-term, a phased approach works well:
# Phase 2: enforce baseline, warn+audit on restricted
apiVersion: v1
kind: Namespace
metadata:
name: my-app
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: v1.30
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/audit: restricted
---
# Phase 3 (later): enforce restricted
apiVersion: v1
kind: Namespace
metadata:
name: my-app
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: v1.30
Step 5: Remove PSP Resources
After enforcement is active and stable, clean up the old PSP resources:
# Delete all PodSecurityPolicies (already non-functional on 1.25+)
kubectl delete psp --all
# Remove RBAC bindings that referenced PSPs
kubectl delete clusterrole psp:privileged psp:restricted --ignore-not-found
kubectl delete clusterrolebinding psp:privileged psp:restricted --ignore-not-found
Cluster-Wide Defaults with the Admission Configuration
Namespace labels give you per-namespace control, but you can also configure cluster-wide defaults and exemptions by passing an AdmissionConfiguration file to the API server. This lets you define a default level for all namespaces, exempt specific users or namespaces, and avoid the need to label every namespace individually.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
configuration:
apiVersion: pod-security.admission.config.k8s.io/v1
kind: PodSecurityConfiguration
defaults:
enforce: baseline
enforce-version: latest
warn: restricted
warn-version: latest
audit: restricted
audit-version: latest
exemptions:
usernames: []
runtimeClasses: []
namespaces:
- kube-system
- kube-node-lease
- cert-manager
This configuration is passed to the kube-apiserver via the --admission-control-config-file flag. Namespaces listed under exemptions.namespaces bypass PSA checks entirely, which is how you carve out space for infrastructure workloads that need elevated privileges.
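On a kubeadm-style control plane, wiring this up means editing the kube-apiserver static Pod manifest. A sketch, with illustrative file paths:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --admission-control-config-file=/etc/kubernetes/psa/admission.yaml
    # ...existing flags unchanged...
    volumeMounts:
    - name: psa-config
      mountPath: /etc/kubernetes/psa
      readOnly: true
  volumes:
  - name: psa-config
    hostPath:
      path: /etc/kubernetes/psa
      type: Directory
```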
PSA only evaluates pods at creation time. It does not retroactively evict existing pods when you change a namespace label. If you tighten enforcement on a namespace, existing non-compliant pods continue running until they are restarted or rescheduled. Always check running workloads with kubectl label --dry-run=server before assuming compliance.
PSA Limitations and When You Need More
Pod Security Admission is intentionally simple. It covers the most impactful pod-level security controls, but it does not handle everything. Understanding its boundaries helps you decide when to supplement it with additional tools.
- No fine-grained policies. You cannot say "allow hostNetwork only for pods with label X" or "restrict image registries." It is all-or-nothing within a level.
- No mutation. PSA will not auto-inject security contexts or seccomp profiles. If you want defaulting behavior, use a mutating webhook.
- Namespace-scoped only. There is no way to target individual workloads within a namespace — the level applies to every pod in that namespace.
- No image policy. PSA does not validate image signatures, enforce registry allowlists, or check for CVEs.
For more granular control, consider policy engines like Kyverno or OPA Gatekeeper, which can express arbitrary admission rules via custom policies. These complement PSA well — use PSA as the baseline floor, and layer custom policies on top for organization-specific rules. The next section on Admission Controllers and Dynamic Webhooks covers exactly how these tools integrate with the admission pipeline.
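For a flavor of what such a policy engine can express beyond PSA, here is a sketch of a Kyverno ClusterPolicy enforcing a registry allowlist (the registry name is illustrative; consult the Kyverno documentation for the current schema):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce   # reject, rather than just report
  rules:
  - name: allowed-registries
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Images must be pulled from registry.example.com"
      pattern:
        spec:
          containers:
          - image: "registry.example.com/*"
```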
Admission Controllers and Dynamic Webhooks
Every request to the Kubernetes API server passes through a pipeline before any object is persisted to etcd. After authentication confirms who you are and authorization confirms what you can do, the request enters a third stage: admission control. This is where the cluster enforces policies, injects defaults, and applies guardrails that RBAC alone cannot express.
Admission controllers are plugins compiled into the API server binary. They intercept requests after authN/authZ but before the object is written to storage. Some are mutating (they modify the request), some are validating (they approve or reject it), and some do both. Understanding this pipeline is essential for anyone operating production clusters or building platform engineering tooling.
The Admission Pipeline
The diagram below shows the full lifecycle of an API request. Notice that mutating webhooks run before validating webhooks — this ordering is intentional. Mutations happen first so that validators see the final version of the object. If any stage rejects the request, the entire operation fails and nothing is written to etcd.
sequenceDiagram
participant Client as kubectl / Client
participant Auth as Authentication
participant Authz as Authorization
participant MW as Mutating Admission
participant SV as Schema Validation
participant VW as Validating Admission
participant etcd as etcd
Client->>Auth: API Request
Auth->>Auth: Verify identity (certs, tokens)
alt Auth fails
Auth-->>Client: 401 Unauthorized
end
Auth->>Authz: Authenticated request
Authz->>Authz: Check RBAC policies
alt Authz fails
Authz-->>Client: 403 Forbidden
end
Authz->>MW: Authorized request
MW->>MW: Mutate (inject defaults, sidecars, labels)
alt Mutation webhook rejects
MW-->>Client: 400/500 Rejected
end
MW->>SV: Mutated object
SV->>SV: Validate against OpenAPI schema
alt Schema validation fails
SV-->>Client: 422 Unprocessable
end
SV->>VW: Schema-valid object
VW->>VW: Validate policies (image registry, labels)
alt Validation webhook rejects
VW-->>Client: 403 Denied by policy
end
VW->>etcd: Persist object
etcd-->>Client: 200 OK / 201 Created
Built-in Admission Controllers
Kubernetes ships with dozens of admission controllers, and the API server enables a recommended set by default. You can inspect the default set with kube-apiserver --help | grep enable-admission-plugins. Here are the most important ones you should understand:
| Controller | Type | Purpose |
|---|---|---|
| NamespaceLifecycle | Validating | Prevents creating objects in namespaces that are being terminated, and rejects requests to delete the default, kube-system, and kube-public namespaces. |
| LimitRanger | Mutating + Validating | Enforces LimitRange objects — injects default CPU/memory requests and limits into Pods that don't specify them, and rejects Pods that exceed range constraints. |
| ResourceQuota | Validating | Tracks and enforces resource consumption against ResourceQuota objects. Rejects requests that would cause a namespace to exceed its quota. |
| ServiceAccount | Mutating | Automatically assigns the default ServiceAccount to Pods that don't specify one, and mounts the corresponding API token. |
| DefaultStorageClass | Mutating | Assigns the default StorageClass to PersistentVolumeClaim objects that don't request a specific class. |
| PodSecurity | Validating | Enforces Pod Security Standards (Privileged, Baseline, Restricted) at the namespace level. Replaced PodSecurityPolicy, which was removed in Kubernetes 1.25. |
| MutatingAdmissionWebhook | Mutating | Calls external webhook services to modify incoming objects. This is the gateway for dynamic mutation — sidecar injection, label addition, default overrides. |
| ValidatingAdmissionWebhook | Validating | Calls external webhook services to approve or deny requests. This is the gateway for dynamic policy enforcement without recompiling the API server. |
The last two controllers in the table — MutatingAdmissionWebhook and ValidatingAdmissionWebhook — are the built-in controllers that dispatch to your custom webhook services. They are the bridge between the static admission pipeline and your dynamic, external logic.
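Adjusting the built-in set is done with two kube-apiserver flags. The values below are illustrative, not a recommendation for every cluster:

```shell
# Enable plugins beyond the default set (NodeRestriction is commonly added)
--enable-admission-plugins=NodeRestriction,PodSecurity
# Explicitly disable a default plugin (rarely needed; shown for illustration)
--disable-admission-plugins=DefaultStorageClass
```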
Dynamic Admission Webhooks
Built-in controllers cover common defaults, but real-world clusters need custom policies: "all images must come from our private registry," "every Deployment must have an owner label," or "inject an Envoy sidecar into every Pod in the mesh namespace." Dynamic webhooks let you implement these rules as external HTTPS services and register them with the API server — no recompilation required.
MutatingWebhookConfiguration
A mutating webhook intercepts API requests and modifies the object before it continues through the pipeline. The webhook receives the object as JSON, changes it, and returns a JSON Patch (or the modified object). The API server applies the patch to the original request. Classic use cases include sidecar injection (Istio, Linkerd), adding default labels or annotations, and setting security context defaults.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: sidecar-injector
webhooks:
- name: sidecar.example.com
admissionReviewVersions: ["v1"]
sideEffects: None
clientConfig:
service:
name: sidecar-injector
namespace: webhook-system
path: "/inject"
caBundle: LS0tLS1C... # Base64-encoded CA cert
rules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE"]
resources: ["pods"]
namespaceSelector:
matchLabels:
sidecar-injection: enabled
failurePolicy: Ignore
timeoutSeconds: 5
The key fields to understand: rules determines which API requests trigger the webhook (here, only Pod CREATE operations). The namespaceSelector restricts the webhook to namespaces with a specific label, which prevents it from intercepting system Pods. The clientConfig tells the API server where to send the admission review — either a service reference (for in-cluster webhooks) or a url (for external endpoints).
ValidatingWebhookConfiguration
A validating webhook cannot modify objects — it can only approve or deny them. It receives the final (post-mutation) version of the object and returns an allowed: true or allowed: false response with an optional message. This is the right tool for enforcing organizational policies: image source restrictions, required labels, prohibited configurations, and resource naming conventions.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: image-policy
webhooks:
- name: image-policy.example.com
admissionReviewVersions: ["v1"]
sideEffects: None
clientConfig:
service:
name: image-policy-webhook
namespace: webhook-system
path: "/validate"
caBundle: LS0tLS1C...
rules:
- apiGroups: ["apps"]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["deployments"]
failurePolicy: Fail
timeoutSeconds: 10
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "webhook-system"]
Webhook Configuration Details
Failure Policy: Fail vs. Ignore
The failurePolicy field controls what happens when the webhook is unreachable or returns an error (not a clean "deny," but an actual failure like a timeout or 500 response). You have two choices, and the right one depends on the webhook's purpose:
| Policy | Behavior on Failure | Best For |
|---|---|---|
| Fail | The API request is rejected. Nothing gets through if the webhook is down. | Security-critical validations (image policies, compliance checks). You would rather block all changes than risk a policy bypass. |
| Ignore | The API request is allowed through as if the webhook didn't exist. | Non-critical mutations (label injection, observability sidecars). Availability matters more than enforcement. |
A webhook with failurePolicy: Fail that becomes unavailable will block all matching API requests cluster-wide. If your webhook matches Pod creations in kube-system, a webhook outage can prevent critical system Pods from starting — cascading into a full cluster failure. Always exclude system namespaces with namespaceSelector, and keep webhook timeout values low (5–10 seconds).
Timeout and Matching Configuration
The timeoutSeconds field (default: 10, max: 30) controls how long the API server waits for a webhook response. Mutating webhooks run sequentially (each sees the output of the previous one), so their timeouts are additive. Validating webhooks run in parallel, so only the slowest one matters. Keep timeouts low — webhook latency directly adds to every API call that matches.
Use namespaceSelector and objectSelector to precisely target the resources your webhook cares about. The namespaceSelector matches labels on the namespace of the target object, while objectSelector matches labels on the object itself. Combine both to minimize unnecessary webhook invocations.
# Only match Pods that have the "validate: true" label
# in namespaces labeled "environment: production"
objectSelector:
matchLabels:
validate: "true"
namespaceSelector:
matchLabels:
environment: production
# Also supports matchExpressions for exclusions:
# namespaceSelector:
# matchExpressions:
# - key: kubernetes.io/metadata.name
# operator: NotIn
# values: ["kube-system", "kube-node-lease"]
Building a Validating Webhook — Practical Example
Let's build a validating webhook that enforces a simple policy: every Deployment must have a team label. This is a common organizational requirement for tracking ownership. The webhook is a small HTTPS server that receives AdmissionReview requests from the API server and responds with allow/deny decisions.
Step 1: The Webhook Server
The webhook server is a standard HTTP handler that parses the AdmissionReview request, inspects the object, and returns a response. Here is a minimal Go implementation:
package main
import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func validateDeployment(w http.ResponseWriter, r *http.Request) {
	// Read and decode the AdmissionReview sent by the API server.
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "failed to read request body", http.StatusBadRequest)
		return
	}
	var review admissionv1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "invalid AdmissionReview payload", http.StatusBadRequest)
		return
	}
	var deployment appsv1.Deployment
	if err := json.Unmarshal(review.Request.Object.Raw, &deployment); err != nil {
		http.Error(w, "could not decode Deployment", http.StatusBadRequest)
		return
	}
	allowed := true
	message := "Deployment is valid"
	if _, ok := deployment.Labels["team"]; !ok {
		allowed = false
		message = fmt.Sprintf(
			"Deployment %q denied: missing required label 'team'",
			deployment.Name,
		)
	}
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: allowed,
		Result:  &metav1.Status{Message: message},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", validateDeployment)
	fmt.Println("Webhook server listening on :8443")
	log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}
Step 2: TLS Certificates
Webhooks must be served over HTTPS — the API server will not call HTTP endpoints. For production, use cert-manager to automatically provision and rotate certificates. For development, you can generate self-signed certs:
# Generate a CA and signed certificate for the webhook service
openssl genrsa -out ca.key 2048
openssl req -x509 -new -key ca.key -days 365 -out ca.crt \
-subj "/CN=webhook-ca"
openssl genrsa -out tls.key 2048
openssl req -new -key tls.key \
-subj "/CN=label-validator.webhook-system.svc" | \
openssl x509 -req -CA ca.crt -CAkey ca.key \
-CAcreateserial -days 365 -out tls.crt \
-extfile <(echo "subjectAltName=DNS:label-validator.webhook-system.svc")
# Create the TLS secret in the webhook namespace
kubectl create namespace webhook-system
kubectl -n webhook-system create secret tls webhook-certs \
--cert=tls.crt --key=tls.key
# Base64-encode the CA for the webhook configuration
export CA_BUNDLE=$(cat ca.crt | base64 | tr -d '\n')
echo "caBundle: $CA_BUNDLE"
Step 3: Deploy the Webhook and Register It
Package the Go server into a container image, deploy it as a Kubernetes Service, and register the ValidatingWebhookConfiguration with the API server.
apiVersion: apps/v1
kind: Deployment
metadata:
name: label-validator
namespace: webhook-system
spec:
replicas: 2
selector:
matchLabels:
app: label-validator
template:
metadata:
labels:
app: label-validator
spec:
containers:
- name: webhook
image: registry.example.com/label-validator:v1
ports:
- containerPort: 8443
volumeMounts:
- name: certs
mountPath: /certs
readOnly: true
volumes:
- name: certs
secret:
secretName: webhook-certs
---
apiVersion: v1
kind: Service
metadata:
name: label-validator
namespace: webhook-system
spec:
selector:
app: label-validator
ports:
- port: 443
targetPort: 8443
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: label-validator
webhooks:
- name: label-validator.example.com
admissionReviewVersions: ["v1"]
sideEffects: None
clientConfig:
service:
name: label-validator
namespace: webhook-system
path: "/validate"
caBundle: ${CA_BUNDLE} # Replace with base64-encoded CA cert
rules:
- apiGroups: ["apps"]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["deployments"]
failurePolicy: Fail
timeoutSeconds: 5
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "kube-node-lease", "webhook-system"]
Step 4: Test the Webhook
Create a Deployment without the team label and verify it is rejected:
# This should be REJECTED — no "team" label
kubectl create deployment nginx-bad --image=nginx
# Error: admission webhook "label-validator.example.com" denied the request:
# Deployment "nginx-bad" denied: missing required label 'team'
# This should SUCCEED — "team" label present
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-good
labels:
team: platform
spec:
replicas: 1
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.27
EOF
# deployment.apps/nginx-good created
ValidatingAdmissionPolicy — CEL-Based Validation Without Webhooks
Dynamic webhooks are powerful but operationally expensive. You need to build, deploy, and maintain an HTTPS service with TLS certificates, handle high availability, and worry about latency. ValidatingAdmissionPolicy — introduced as alpha in Kubernetes 1.26, promoted to beta in 1.28, and GA in 1.30 — lets you write validation rules directly as CEL (Common Expression Language) expressions, with no external webhook needed.
CEL expressions run inside the API server process itself. This eliminates network hops, TLS management, and the risk of webhook unavailability. For straightforward validation rules (required labels, image prefix checks, resource limit enforcement), ValidatingAdmissionPolicy is the better choice.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
name: require-team-label
spec:
failurePolicy: Fail
matchConstraints:
resourceRules:
- apiGroups: ["apps"]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["deployments"]
validations:
- expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
message: "All Deployments must have a 'team' label"
- expression: "object.metadata.labels['team'].size() > 0"
message: "The 'team' label must not be empty"
A policy on its own does nothing — you need a ValidatingAdmissionPolicyBinding to activate it and specify which resources or namespaces it applies to:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
name: require-team-label-binding
spec:
policyName: require-team-label
validationActions:
- Deny # Reject non-compliant requests
matchResources:
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "kube-node-lease"]
You can also set validationActions to [Warn] during rollout to log violations without blocking requests — useful for gradually introducing a new policy. The Audit action records violations in the API server audit log without any user-facing warning.
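A staged rollout can be expressed by changing only the binding's validationActions. A sketch (binding name illustrative), reusing the require-team-label policy:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label-rollout
spec:
  policyName: require-team-label
  validationActions:
  - Warn   # client sees a warning; the request still succeeds
  - Audit  # violation recorded in the API server audit log
  matchResources: {}
# Once violations drop to zero, switch validationActions to [Deny].
```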
CEL Expression Examples
CEL is concise but expressive. Here are common patterns you can use in validations[].expression:
| Policy | CEL Expression |
|---|---|
| Image must come from private registry | object.spec.template.spec.containers.all(c, c.image.startsWith('registry.example.com/')) |
| Replicas must be at least 2 | object.spec.replicas >= 2 |
| Must not use the latest tag | object.spec.template.spec.containers.all(c, !c.image.endsWith(':latest')) |
| Memory limit is required | object.spec.template.spec.containers.all(c, has(c.resources.limits) && has(c.resources.limits.memory)) |
| No hostNetwork | !has(object.spec.template.spec.hostNetwork) || object.spec.template.spec.hostNetwork == false |
Webhooks vs. ValidatingAdmissionPolicy — When to Use Which
With two mechanisms for custom validation, how do you choose? The decision hinges on complexity and operational maturity:
| Criteria | ValidatingAdmissionPolicy (CEL) | Dynamic Webhooks |
|---|---|---|
| Complexity | Simple field checks, label/annotation rules, value constraints | Complex logic: external lookups, cross-resource validation, stateful decisions |
| Operational cost | Zero — runs in the API server | High — deploy, maintain, and monitor an HTTPS service |
| Latency | Microseconds (in-process) | Milliseconds to seconds (network hop + TLS) |
| Availability risk | None — always available with the API server | Webhook outage can block the cluster if failurePolicy: Fail |
| Mutation support | No — validation only | Yes — mutating webhooks can modify objects |
| Kubernetes version | 1.26+ (alpha), 1.28+ (beta), 1.30+ (GA) | 1.16+ (stable) |
Start with ValidatingAdmissionPolicy for pure validation rules. Reserve webhooks for mutations (sidecar injection, defaulting) and complex validations that need external data or multi-resource checks. If you are running Kubernetes 1.30+, CEL-based policies should be your default choice for validation.
Policy Engines: OPA/Gatekeeper and Kyverno
Building individual webhooks per policy doesn't scale. Policy engines provide a framework for managing many policies declaratively — they handle the webhook infrastructure and let you focus on writing rules.
OPA Gatekeeper uses Open Policy Agent with the Rego language. You define ConstraintTemplate resources (parameterized policy templates) and Constraint resources (instances of those templates applied to specific resources). Gatekeeper runs as a validating webhook and caches replicated cluster data for cross-resource checks.
Kyverno takes a Kubernetes-native approach — policies are YAML resources with no separate language to learn. It supports both validation and mutation, can generate resources, and integrates with the Kubernetes API directly. Kyverno policies use familiar patterns like match/exclude blocks and JSON patches for mutations.
Both engines are CNCF projects (Gatekeeper is part of OPA, a graduated project; Kyverno is an incubating project). For teams that need dozens of admission policies and want auditing, dry-run modes, and centralized reporting, a policy engine is far more practical than hand-rolling webhooks.
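For a feel of the Kyverno style, here is the "require a team label" rule from the webhook example earlier in this section expressed as a declarative policy — a sketch, not a drop-in manifest:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-team-label
    match:
      any:
      - resources:
          kinds:
          - Deployment
    validate:
      message: "All Deployments must have a 'team' label"
      pattern:
        metadata:
          labels:
            team: "?*"  # Kyverno wildcard: any non-empty value
```

No webhook server, no TLS certificates, no Go code — Kyverno's controllers handle the admission plumbing.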
Cluster Hardening and Security Best Practices
Kubernetes clusters expose a large attack surface by default. The API server accepts requests, etcd stores every secret in the cluster, kubelets run arbitrary containers on nodes, and the flat pod network lets any workload talk to any other. Hardening is the process of systematically reducing that surface — disabling what you don't need, encrypting what you can't disable, and restricting everything else to least privilege.
The CIS Kubernetes Benchmark is the industry-standard checklist for cluster security. It covers the control plane, worker nodes, policies, and managed services with specific, auditable recommendations. Every hardening measure in this section maps to one or more CIS controls. Tools like kube-bench can automatically audit your cluster against the benchmark and flag gaps.
No single control protects a cluster. Security is layered: API authentication stops unauthorized users, RBAC limits what authorized users can do, network policies restrict pod communication, runtime security constrains what containers can execute, and audit logging records what actually happened. Each layer compensates for failures in the others.
API Server Hardening
The API server is the front door to your cluster. Every kubectl command, every controller reconciliation loop, and every kubelet heartbeat passes through it. If an attacker gains unrestricted API access, they effectively control the entire cluster. Hardening the API server means limiting who can reach it, how they authenticate, and what gets logged.
Start with three critical flags. Disable anonymous authentication so every request must present valid credentials. Enable audit logging so you have a forensic trail of every API call. And restrict which encryption ciphers the API server accepts — older TLS cipher suites have known vulnerabilities.
# /etc/kubernetes/manifests/kube-apiserver.yaml (static pod manifest)
apiVersion: v1
kind: Pod
metadata:
name: kube-apiserver
namespace: kube-system
spec:
containers:
- name: kube-apiserver
command:
- kube-apiserver
# Disable anonymous authentication (CIS 1.2.1)
- --anonymous-auth=false
# Enable audit logging (CIS 1.2.22–1.2.25)
- --audit-policy-file=/etc/kubernetes/audit-policy.yaml
- --audit-log-path=/var/log/kubernetes/audit.log
- --audit-log-maxage=30
- --audit-log-maxbackup=10
- --audit-log-maxsize=100
# Restrict TLS ciphers (CIS 1.2.31)
- --tls-min-version=VersionTLS12
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
# The insecure port was removed entirely in Kubernetes 1.24.
# On clusters older than 1.24, explicitly disable it:
# - --insecure-port=0
# Enable RBAC and Node authorization
- --authorization-mode=Node,RBAC
# Set request timeout
- --request-timeout=300s
Audit Policy
An audit policy tells the API server what to log and at what detail level. The four levels are None, Metadata, Request, and RequestResponse. Logging every request body is expensive, so a good policy logs metadata for most resources and full request/response only for sensitive operations like Secrets access and RBAC changes.
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Skip read-only requests to endpoints or events (high volume, low value)
- level: None
resources:
- group: ""
resources: ["endpoints", "events"]
verbs: ["get", "list", "watch"]
# Log full request+response for Secrets (sensitive data access)
- level: RequestResponse
resources:
- group: ""
resources: ["secrets"]
# Log full request+response for RBAC changes
- level: RequestResponse
resources:
- group: "rbac.authorization.k8s.io"
resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
# Log request body for all write operations
- level: Request
verbs: ["create", "update", "patch", "delete"]
# Log metadata for everything else
- level: Metadata
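Once the audit log is flowing, entries are JSON lines that standard tools can query. For example, to see who touched Secrets (assuming jq is installed and the log path from the API server flags above):

```shell
jq -c 'select(.objectRef.resource == "secrets")
       | {user: .user.username, verb: .verb, object: .objectRef.name}' \
  /var/log/kubernetes/audit.log
```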
Restricting API Access with Firewall Rules
Beyond authentication, limit network-level access to the API server. On cloud providers, use security groups or firewall rules to restrict port 6443 to known CIDR ranges — your office network, VPN endpoints, and CI/CD runner IPs. On bare metal, use iptables or firewall rules on the control plane nodes.
# Restrict API server access to specific CIDR ranges (iptables example)
iptables -A INPUT -p tcp --dport 6443 -s 10.0.0.0/16 -j ACCEPT # Pod/node network
iptables -A INPUT -p tcp --dport 6443 -s 192.168.1.0/24 -j ACCEPT # Admin VPN
iptables -A INPUT -p tcp --dport 6443 -j DROP # Drop all other
# On GKE, use master-authorized-networks
gcloud container clusters update my-cluster \
--enable-master-authorized-networks \
--master-authorized-networks 203.0.113.0/24,198.51.100.0/24
Securing etcd
etcd is the cluster's brain — it stores every object including Secrets, ConfigMaps, RBAC rules, and service account tokens. If an attacker reads etcd directly, they bypass all Kubernetes authorization. Two protections are essential: encrypt data at rest so raw etcd snapshots are useless, and enforce mTLS so only authenticated clients (the API server) can connect.
Encrypting Data at Rest
By default, Kubernetes stores Secrets in etcd unencrypted — the base64 you see in a Secret manifest is an encoding, not encryption. Anyone with access to the etcd data directory or a backup can read every secret in the cluster. An EncryptionConfiguration file tells the API server to encrypt specified resources before writing them to etcd.
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
- configmaps
providers:
# aescbc encrypts with a local key — simple but you manage rotation
- aescbc:
keys:
- name: key-2024
secret: <base64-encoded 32-byte key> # generate with: head -c 32 /dev/urandom | base64
# identity is the fallback — allows reading unencrypted data
- identity: {}
Pass this configuration to the API server with --encryption-provider-config=/etc/kubernetes/encryption-config.yaml. After enabling encryption, re-encrypt existing Secrets by reading and writing them back:
# Re-encrypt all existing secrets with the new encryption key
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
# Verify a secret is encrypted in etcd (run on control plane node)
ETCDCTL_API=3 etcdctl get /registry/secrets/default/my-secret \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key | hexdump -C | head
etcd mTLS and Access Restriction
etcd should never be exposed to anything except the API server. Configure mTLS so etcd requires client certificates for every connection, and bind etcd to a private interface — not 0.0.0.0. The CIS Benchmark (Section 2) explicitly requires that etcd's client and peer communication use TLS with valid certificates.
# etcd static pod manifest — TLS flags (CIS 2.1–2.6)
spec:
containers:
- name: etcd
command:
- etcd
# Client-server TLS (CIS 2.1, 2.2)
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --client-cert-auth=true
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
# Peer TLS (CIS 2.4, 2.5)
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-client-cert-auth=true
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
# Bind to private interface only
- --listen-client-urls=https://10.0.1.10:2379
- --listen-peer-urls=https://10.0.1.10:2380
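You can verify that TLS is actually enforced with etcdctl from a control plane node: with the client certificate the health check succeeds, and without it the handshake should fail (endpoints and certificate paths match the manifest above):

```shell
# Authenticated health check — should report the endpoint as healthy
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://10.0.1.10:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Without a client certificate, the TLS handshake should be rejected
ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://10.0.1.10:2379
```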
Kubelet Security
The kubelet runs on every node and has the power to create, destroy, and inspect containers. It exposes an HTTPS API (port 10250) that can return container logs, execute commands inside pods, and list running workloads. If anonymous access is enabled — which it is by default on many installations — anyone who can reach a node's IP can exploit this API.
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Disable anonymous access (CIS 4.2.1)
authentication:
anonymous:
enabled: false
webhook:
enabled: true # Delegate auth to the API server
x509:
clientCAFile: /etc/kubernetes/pki/ca.crt
# Use Webhook authorization, not AlwaysAllow (CIS 4.2.2)
authorization:
mode: Webhook
# Disable read-only port (CIS 4.2.4)
readOnlyPort: 0
# Rotate kubelet certificates automatically
rotateCertificates: true
# Protect kernel defaults
protectKernelDefaults: true
# Limit the rate at which this kubelet creates events
eventRecordQPS: 5
The Webhook authorization mode is critical: it makes the kubelet ask the API server "is this caller allowed to do this?" for every request. Combined with the Node authorizer on the API server side, this ensures kubelets can only access the Secrets, ConfigMaps, and PersistentVolumes bound to pods scheduled on their node — not resources belonging to other nodes.
The kubelet's read-only port (10255) serves metrics and pod listings without any authentication. It is often left enabled for legacy monitoring setups. Set readOnlyPort: 0 and migrate monitoring to the authenticated port 10250, or use the /metrics endpoint with proper bearer token authentication.
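A quick check confirms the hardening took effect: before the change the read-only port answers without credentials, and after setting readOnlyPort: 0 and restarting the kubelet the same request should fail to connect (node IP is a placeholder):

```shell
# Unauthenticated pod listing via the legacy read-only port
curl -s http://<node-ip>:10255/pods
# After hardening, expect the connection to be refused
curl -s --max-time 3 http://<node-ip>:10255/pods || echo "read-only port closed"
```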
Network-Level Security
By default, every pod in a Kubernetes cluster can communicate with every other pod — across namespaces, across nodes, no restrictions. This flat networking model is great for getting started, but in production it means a compromised pod in the frontend namespace can directly reach your database pods in backend. Network policies let you enforce segmentation at the cluster level.
Default Deny Policy
The most impactful single security measure for networking is a default-deny ingress policy in every namespace. This inverts the model: instead of "everything allowed unless denied," you get "nothing allowed unless explicitly permitted." Apply this first, then add targeted allow rules.
# Default deny all ingress traffic in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: production
spec:
podSelector: {} # Selects ALL pods in this namespace
policyTypes:
- Ingress # No ingress rules = deny all inbound
---
# Allow frontend pods to receive traffic on port 8080 only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-ingress
namespace: production
spec:
podSelector:
matchLabels:
app: frontend
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- protocol: TCP
port: 8080
Blocking Cloud Metadata Service Access
On AWS, GCP, and Azure, the instance metadata service at 169.254.169.254 is a high-value target. A compromised pod can query it to steal node IAM credentials, read instance identity tokens, or discover internal network topology. Block this with a NetworkPolicy that denies egress to the metadata IP.
# Block access to cloud metadata service from all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-metadata-service
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
# Allow DNS resolution (required for most workloads)
- to:
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
# Allow all egress EXCEPT the metadata service
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 169.254.169.254/32
Service Mesh mTLS
NetworkPolicies operate at L3/L4 — they filter by IP and port but cannot inspect or encrypt traffic. A service mesh like Istio, Linkerd, or Cilium adds L7 policy enforcement and automatic mTLS between pods. Every pod-to-pod connection is encrypted and authenticated by the mesh sidecar, so even if an attacker is on the pod network, they cannot eavesdrop on or impersonate legitimate services.
| Mechanism | Layer | Encryption | Identity Verification | Complexity |
|---|---|---|---|---|
| NetworkPolicy | L3/L4 | None | IP/label-based | Low |
| Service mesh mTLS | L4/L7 | TLS 1.2/1.3 | Certificate-based (SPIFFE) | Medium-High |
| Cilium with WireGuard | L3 | WireGuard | Node-level identity | Medium |
Image Security and Supply Chain
The container image is the package that runs in your cluster. If an attacker pushes a malicious image to your registry or you pull an image with a known CVE, no amount of runtime security will fully protect you. Image security is about controlling what enters the cluster: which registries you trust, which images are allowed, and whether you can verify they haven't been tampered with.
Use Image Digests, Not Tags
Tags are mutable pointers. nginx:1.25 can point to a different image tomorrow if someone pushes a new build with the same tag. Digests are immutable content hashes — they guarantee you get exactly the image you tested and approved. In production, always pin images by digest.
# BAD — tag is mutable, could be overwritten
containers:
- name: app
image: myregistry.io/myapp:v2.1.0
# GOOD — digest is immutable, cryptographically verified
containers:
- name: app
image: myregistry.io/myapp@sha256:a3ed95caeb02ffe68cdd9fd844066...
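To pin by digest you first need to resolve it. Any of these approaches works, depending on your tooling (image and pod names illustrative):

```shell
# With crane (from go-containerregistry)
crane digest myregistry.io/myapp:v2.1.0

# With docker buildx
docker buildx imagetools inspect myregistry.io/myapp:v2.1.0

# From a running cluster — the kubelet records the resolved digest in status
kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].imageID}'
```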
Enforcing Image Policies with Kyverno
You can't rely on developers to always use digests or approved registries. Policy engines like Kyverno and OPA Gatekeeper run as admission webhooks — they intercept every pod creation and reject those that violate your rules. Here is a Kyverno policy that requires images from trusted registries and blocks the latest tag:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: >-
        Images must come from approved registries.
      pattern:
        spec:
          containers:
          - image: "gcr.io/my-project/* | myregistry.io/*"
  - name: block-latest-tag
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "The 'latest' tag is not allowed. Use a specific version or digest."
      pattern:
        spec:
          containers:
          - image: "!*:latest"
Image Scanning
Integrate vulnerability scanning into your CI pipeline and your cluster admission flow. Tools like Trivy, Grype, and Snyk scan images for known CVEs in OS packages and application dependencies. Use them at two points: in CI (to catch vulnerabilities before merge) and as an admission webhook (to block deployment of images with critical findings).
# Scan an image for vulnerabilities — fail CI if HIGH/CRITICAL found
trivy image --severity HIGH,CRITICAL --exit-code 1 myregistry.io/myapp:v2.1.0
# Scan and output results as a table for human review
trivy image --format table --severity HIGH,CRITICAL myregistry.io/myapp:v2.1.0
# Generate an SBOM (Software Bill of Materials) for audit trails
trivy image --format spdx-json --output sbom.json myregistry.io/myapp:v2.1.0
Runtime Security
Runtime security is the last line of defense. Even after you have locked down the API server, encrypted etcd, restricted the network, and verified your images, a container might still be compromised through an application-level vulnerability. Runtime controls restrict what a container process can do once it is running: which system calls it can make, which files it can access, and what kernel capabilities it holds.
Security Context: The Foundation
Every pod spec should include a securityContext that enforces the principle of least privilege. Run as a non-root user, drop all Linux capabilities, set the root filesystem to read-only, and prevent privilege escalation. These four settings alone block the majority of container breakout techniques.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  # Pod-level security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 10000
    runAsGroup: 10000
    fsGroup: 10000
    seccompProfile:
      type: RuntimeDefault  # Apply default Seccomp profile
  containers:
  - name: app
    image: myregistry.io/myapp@sha256:abc123...
    # Container-level security context
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL  # Drop every Linux capability
    # Use emptyDir for writable temp directories
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: cache
      mountPath: /var/cache
  volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 100Mi
  - name: cache
    emptyDir:
      sizeLimit: 200Mi
Seccomp Profiles
Seccomp (Secure Computing Mode) filters system calls at the kernel level. The RuntimeDefault profile blocks roughly 44 of the ~300+ Linux syscalls, including dangerous ones like ptrace, mount, and reboot. For higher security, create a custom profile that allows only the exact syscalls your application needs.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "clone", "close", "connect",
        "epoll_ctl", "epoll_wait", "execve", "exit_group",
        "fcntl", "fstat", "futex", "getpid", "getsockopt",
        "ioctl", "listen", "mmap", "mprotect", "nanosleep",
        "openat", "read", "recvfrom", "rt_sigaction",
        "sendto", "setsockopt", "socket", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Place custom Seccomp profiles in /var/lib/kubelet/seccomp/ on each node, then reference them in your pod spec:
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/my-strict-profile.json
AppArmor and SELinux
AppArmor (Ubuntu/Debian) and SELinux (RHEL/CentOS) are Linux Security Modules (LSMs) that confine processes to a defined set of file paths, network operations, and capabilities. They operate below Kubernetes and enforce policies even if a container escapes its cgroup or namespace. As of Kubernetes 1.30, AppArmor support is GA with the appArmorProfile field in the security context.
# AppArmor (Kubernetes 1.30+ GA field)
securityContext:
  appArmorProfile:
    type: RuntimeDefault  # Use container runtime's default profile
---
# SELinux — assign an MCS label to isolate containers
securityContext:
  seLinuxOptions:
    level: "s0:c123,c456"  # Multi-Category Security label
    type: "container_t"
Hardening Checklist
Use this table as a quick-reference audit sheet. Each item maps to a CIS Kubernetes Benchmark section. Run kube-bench to automate the audit.
| Area | Control | CIS Section | Priority |
|---|---|---|---|
| API Server | Disable anonymous auth (--anonymous-auth=false) | 1.2.1 | Critical |
| API Server | Enable audit logging | 1.2.22–1.2.25 | Critical |
| API Server | Use Node,RBAC authorization mode | 1.2.8 | Critical |
| etcd | Encrypt secrets at rest | 1.2.33 | Critical |
| etcd | Enable client-cert auth (--client-cert-auth=true) | 2.2 | Critical |
| Kubelet | Disable anonymous auth, use Webhook mode | 4.2.1–4.2.2 | Critical |
| Kubelet | Disable read-only port | 4.2.4 | High |
| Network | Apply default-deny NetworkPolicies | 5.3.2 | High |
| Network | Block cloud metadata service (169.254.169.254) | — | High |
| Images | Use image digests, not tags | — | High |
| Images | Restrict to trusted registries (Kyverno/OPA) | 5.5.1 | High |
| Runtime | Run as non-root, drop all capabilities | 5.2.6–5.2.9 | Critical |
| Runtime | Read-only root filesystem | 5.2.4 | High |
| Runtime | Apply Seccomp RuntimeDefault profile | 5.7.2 | High |
Run kube-bench run --targets master,node,etcd,policies regularly — ideally as a CronJob in your cluster or as part of your CI pipeline. It checks your cluster against the CIS Benchmark and produces a pass/fail report for every control. Address CRITICAL and HIGH findings first; they represent the most exploitable gaps.
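A minimal sketch of running it on a schedule, assuming the upstream aquasec/kube-bench image (real deployments usually mount additional host paths such as /etc/kubernetes, per the kube-bench documentation):

```yaml
# Sketch: weekly kube-bench audit as a CronJob (image, args, and mounts
# are assumptions — consult the kube-bench docs for your distribution)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kube-bench
  namespace: kube-system
spec:
  schedule: "0 3 * * 0"           # weekly, Sunday 03:00
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true            # kube-bench inspects host processes
          restartPolicy: Never
          containers:
          - name: kube-bench
            image: aquasec/kube-bench:latest
            args: ["run", "--targets", "node"]
            volumeMounts:
            - name: var-lib-kubelet
              mountPath: /var/lib/kubelet
              readOnly: true
          volumes:
          - name: var-lib-kubelet
            hostPath:
              path: /var/lib/kubelet
```

Check the Job's pod logs after each run; failing controls appear in the report with their CIS section numbers, which map directly onto the checklist above.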
Health Checks — Liveness, Readiness, and Startup Probes
Kubernetes can restart failed containers and reschedule pods onto healthy nodes — but only if it knows something is wrong. Without health checks, the kubelet has a single signal: whether the container process is running. A process can be alive yet completely stuck — deadlocked, out of file descriptors, wedged in an infinite loop. From the outside, the container appears healthy. Requests pile up, users see timeouts, and nothing self-heals.
Health check probes give the kubelet fine-grained insight into your application's actual state. They are the mechanism behind self-healing and zero-downtime deployments. Configure them wrong (or skip them entirely), and you undermine two of the most valuable things Kubernetes offers.
The Three Probe Types
Kubernetes provides three distinct probes. Each answers a different question about your container, and each triggers a different response from the system. Understanding the distinction is essential — mixing up their purposes is one of the most common sources of production outages.
Liveness Probe — "Is the process stuck?"
The liveness probe detects situations where your application is running but can no longer make progress. Deadlocks, infinite loops, and corrupted internal state are classic examples. When a liveness probe fails beyond its configured threshold, the kubelet kills the container and restarts it. This is the nuclear option — a full container restart — so the probe must only check the application's own internal health.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
In this example, the kubelet waits 15 seconds after the container starts, then hits /healthz every 10 seconds. If three consecutive checks fail (or time out after 3 seconds each), the container is killed and restarted according to the pod's restartPolicy.
Never have your liveness probe check a database connection, an external API, or a downstream service. If the database goes down, your liveness probe fails, Kubernetes restarts every pod, the pods come back up, can't reach the database, fail again — and you've created a cascading restart storm that makes a partial outage into a total one. Liveness probes must only check the process itself.
Readiness Probe — "Can this pod serve traffic?"
The readiness probe controls whether a pod's IP address appears in a Service's Endpoints object. When the probe fails, the pod is removed from the load balancer — it stops receiving new requests but keeps running. When the probe passes again, traffic resumes. Unlike liveness, this is a gentle, reversible action.
This is the probe where you should check dependencies. If your app needs a database connection to serve requests, let the readiness probe verify it. A failing readiness probe during a rolling update prevents the new version from receiving traffic until it's truly ready, which is how Kubernetes achieves zero-downtime deployments.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 2
Notice the shorter initialDelaySeconds and periodSeconds compared to liveness. You want to detect unreadiness quickly so the Service routes traffic only to healthy pods. The successThreshold of 1 means a single passing check brings the pod back into rotation.
Startup Probe — "Has the application finished starting?"
Some applications — legacy Java apps with heavy classloading, applications that run database migrations on boot, or ML services loading large models — can take minutes to start. Without a startup probe, you'd have to set a massive initialDelaySeconds on the liveness probe, which also delays detection of stuck containers after they've started.
The startup probe solves this cleanly. While it runs, both liveness and readiness probes are disabled. Once the startup probe succeeds, it never runs again and the other probes take over. If the startup probe exhausts its failure threshold, the container is killed — the application is considered broken, not just slow.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # 10s × 30 = up to 5 minutes to start
The math here is straightforward: periodSeconds × failureThreshold gives the maximum startup window. In this example, the application gets up to 5 minutes to start. Once /healthz returns a 200, the startup probe is done and the liveness/readiness probes begin their normal cycles.
Probe Mechanisms
Every probe type supports four mechanisms for checking container health. The mechanism you choose depends on what your application exposes and how much control you need.
| Mechanism | How It Works | Success Criteria | Best For |
|---|---|---|---|
| httpGet | Sends an HTTP GET to a path and port | Status code 200–399 | Web servers and APIs with a health endpoint |
| tcpSocket | Attempts a TCP connection to a port | Port accepts the connection | Non-HTTP services (databases, caches, TCP servers) |
| exec | Runs a command inside the container | Command exits with code 0 | Custom checks, script-based validation, sidecar health |
| grpc | Calls the gRPC Health Checking Protocol | Response status is SERVING | gRPC services implementing the standard health protocol |
Here's an example of each mechanism in a probe definition:
# HTTP — most common for web applications
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: X-Custom-Header
      value: probe-check
# TCP — useful when there's no HTTP endpoint
livenessProbe:
  tcpSocket:
    port: 3306
# exec — run arbitrary commands
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
# gRPC — native support since Kubernetes 1.27 (stable)
livenessProbe:
  grpc:
    port: 50051
    service: my.package.MyService  # optional, defaults to ""
Each exec probe forks a new process inside the container. For high-frequency probes across many pods, this can add non-trivial CPU and PID pressure. Prefer httpGet or tcpSocket when possible. If you must use exec, keep the command lightweight and avoid shell invocations like sh -c "...".
Configuration Parameters Explained
All three probe types share the same set of timing and threshold parameters. Getting these values right is the difference between a resilient deployment and one that either ignores failures or restarts too aggressively.
| Parameter | Default | Description |
|---|---|---|
| initialDelaySeconds | 0 | Seconds to wait after the container starts before the first probe. Use this if you aren't using a startup probe. |
| periodSeconds | 10 | How often (in seconds) the probe is executed. Lower values detect failures faster but increase load. |
| timeoutSeconds | 1 | Seconds to wait for a probe response before counting it as a failure. Should be less than periodSeconds so checks don't overlap. |
| successThreshold | 1 | Consecutive successes required to mark the probe as passing. Must be 1 for liveness and startup probes. |
| failureThreshold | 3 | Consecutive failures before taking action (restart for liveness, removal from endpoints for readiness). |
The total time before action is taken is approximately initialDelaySeconds + (periodSeconds × failureThreshold) for a container that is unhealthy from the moment it starts. For example, with initialDelaySeconds: 10, periodSeconds: 5, and failureThreshold: 3, such a container is restarted roughly 25 seconds after it starts. A container that fails after running normally is acted on in about periodSeconds × failureThreshold = 15 seconds, since the initial delay has already elapsed.
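As a quick sanity check, the same arithmetic in a few lines of Python (a throwaway sketch, not part of any Kubernetes tooling):

```python
def seconds_until_action(period, failure_threshold, initial_delay=0):
    """Approximate worst-case delay before the kubelet acts on a failing probe.

    Include initial_delay for a container unhealthy from the moment it
    starts; for one that fails after running normally, the initial delay
    has already elapsed, so leave it at 0.
    """
    return initial_delay + period * failure_threshold

# Unhealthy from the start: 10 + 5 * 3 = 25 seconds
print(seconds_until_action(period=5, failure_threshold=3, initial_delay=10))  # 25
# Fails after running normally: 5 * 3 = 15 seconds
print(seconds_until_action(period=5, failure_threshold=3))  # 15
```

Keeping this worst-case window in mind when tuning failureThreshold tells you how long users can be exposed to a broken pod before Kubernetes reacts.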
Full Working Example
Here's a production-grade pod spec for a web application that uses all three probes together. The startup probe gives the app up to 3 minutes to initialize. Once startup succeeds, the liveness probe watches for hangs and the readiness probe controls traffic flow.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:2.4.1
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 18  # 10s × 18 = 3 min max startup
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3  # restart after ~30s of failures
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 2
          failureThreshold: 2  # remove from Service after ~10s
          successThreshold: 1
Note that the liveness probe hits /healthz (a lightweight internal check) while the readiness probe hits /ready (which may verify database connectivity, cache availability, or other dependencies). These are deliberately different endpoints with different responsibilities.
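A minimal sketch of such a split in application code, using only the Python standard library (endpoint paths follow the manifest above; the dependency check is a placeholder you would replace with real database or cache pings):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok():
    # Placeholder: a real readiness check would ping the database,
    # cache, or anything else the service needs to serve requests.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: answer from process state only -- no I/O here.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: dependency checks belong here. A failure only
            # removes the pod from the Service; it never restarts it.
            self.send_response(200 if dependencies_ok() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe chatter out of the application logs

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    print(urllib.request.urlopen(base + "/healthz").status)  # 200
    print(urllib.request.urlopen(base + "/ready").status)    # 200
    server.shutdown()
```

The key design point is visible in the handler: the liveness branch never performs I/O, so a database outage can degrade /ready without ever tripping /healthz.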
Common Anti-Patterns to Avoid
Misconfigured probes cause more outages than missing probes. These are the mistakes that show up repeatedly in incident postmortems.
1. Using the same endpoint for liveness and readiness
If your /health endpoint checks the database and you use it for both probes, a database outage will trigger liveness failures. Kubernetes restarts all your pods — which still can't reach the database — creating a restart loop. Split your endpoints: /healthz for liveness (internal-only checks) and /ready for readiness (dependency checks).
2. Setting timeoutSeconds too low
The default timeoutSeconds is 1 second. If your health endpoint does any I/O — even a trivial database ping — it can occasionally exceed 1 second under load. This triggers spurious failures and restarts. Set timeoutSeconds to at least 2–3 seconds for endpoints that touch any external resource, and always keep your liveness endpoint free of I/O.
3. Missing readiness probes during rolling updates
Without a readiness probe, Kubernetes considers a pod ready the moment its container starts. During a rolling update, the old pod is terminated as soon as the new one's container is running — even if the new application hasn't opened its listening socket yet. Users hit connection refused errors. Always define a readiness probe on pods behind a Service.
4. Using initialDelaySeconds instead of a startup probe
A long initialDelaySeconds (e.g., 120 seconds) on the liveness probe means that if the application deadlocks within those first 120 seconds after startup, it won't be detected. The startup probe is strictly better: it protects slow starts while keeping liveness detection responsive once the application is running.
When a probe fails, Kubernetes emits an event on the pod. Run kubectl describe pod <name> and check the Events section for messages like Liveness probe failed: HTTP probe failed with statuscode: 503. For deeper debugging, temporarily exec into the container and call the health endpoint manually: kubectl exec -it <pod> -- curl -v localhost:8080/healthz.
Logging Architecture — From Containers to Centralized Storage
Kubernetes does not ship with a built-in log aggregation system. It gives you the primitives — container stdout/stderr capture, node-level log files, and API access — but the responsibility of collecting, shipping, and storing those logs at scale falls squarely on you. Understanding how logs flow through the system is the first step to building a reliable observability stack.
Logging in Kubernetes happens at three distinct levels, each building on the one below it: container-level (what your application writes), node-level (how the kubelet and container runtime manage log files on disk), and cluster-level (how you centralize logs from every node into a queryable backend). We will cover all three, then walk through practical stack deployments.
Level 1: Container Logs — stdout and stderr
The simplest form of logging in Kubernetes: your application writes to stdout and stderr, and the container runtime (containerd, CRI-O) captures those streams and writes them to log files on the node's filesystem. This is the 12-Factor App approach to logging — treat logs as event streams, not files.
You access these logs with kubectl logs. Under the hood, the kubelet reads the log files written by the container runtime and streams them back through the Kubernetes API.
# View logs from a running pod
kubectl logs my-app-pod-7f8b9c6d4-x2k9z
# Follow logs in real-time (like tail -f)
kubectl logs -f my-app-pod-7f8b9c6d4-x2k9z
# View logs from a specific container in a multi-container pod
kubectl logs my-app-pod-7f8b9c6d4-x2k9z -c sidecar-logger
# View logs from a previous container instance (after a crash restart)
kubectl logs my-app-pod-7f8b9c6d4-x2k9z --previous
# View last 100 lines from the past hour
kubectl logs my-app-pod-7f8b9c6d4-x2k9z --tail=100 --since=1h
kubectl logs is effective for interactive debugging, but it has hard limits. It queries one pod at a time (unless you use label selectors with --selector), it only shows logs still on the node (subject to rotation), and it cannot search across the cluster. For anything beyond "what is this one pod doing right now?", you need centralized logging.
Level 2: Node-Level Logging — Where Log Files Live
When a container writes to stdout/stderr, the container runtime doesn't just hold it in memory — it writes each log line to a file on the node in the CRI logging format (a timestamp, the stream name, a partial/full flag, and the message). The standard path is /var/log/containers/, which contains symlinks to the actual log files under /var/log/pods/. The filename encodes the pod name, namespace, container name, and container ID.
# Symlink structure on a node
ls /var/log/containers/
# my-app-7f8b9c6d4-x2k9z_default_app-abc123def456.log -> /var/log/pods/default_my-app-.../app/0.log
# The actual log file is in the CRI logging format — one entry per line:
# timestamp, stream, partial/full flag (P/F), then the message
cat /var/log/pods/default_my-app-7f8b9c6d4-x2k9z_uid/app/0.log
# 2024-11-15T10:23:01.234Z stdout F Starting server on port 8080
# 2024-11-15T10:23:05.891Z stderr F Error: connection refused
The kubelet manages log rotation for container logs. Two kubelet flags control this behavior: --container-log-max-size (default 10Mi) sets the maximum size of each log file before rotation, and --container-log-max-files (default 5) sets how many rotated files to keep. When a log file hits the size limit, the runtime rotates it (e.g., 0.log becomes 0.log.20241115-102301.gz) and starts a fresh file.
With default settings, each container can use up to 50 MiB of disk for logs (5 files × 10 MiB). On a node running 50 pods, that is 2.5 GiB of log storage. High-throughput applications can burn through these limits fast, causing old logs to disappear before anyone reads them. This is the core reason you need cluster-level log shipping — node-level storage is ephemeral and bounded.
Level 3: Cluster-Level Logging — Centralized Aggregation
Cluster-level logging is not a Kubernetes feature — it is an architecture pattern you implement yourself. The goal: ship logs from every node to a central backend where they can be searched, filtered, and retained independently of node lifecycle. Kubernetes documentation describes three architectures for achieving this.
flowchart LR
subgraph Node["Worker Node"]
App["App Container<br/>stdout/stderr"]
CR["Container Runtime"]
LF["/var/log/containers/"]
FB["Fluent Bit<br/>DaemonSet"]
end
App -->|writes| CR
CR -->|"JSON log files"| LF
FB -->|tails log files| LF
FB -->|"enriches with<br/>K8s metadata"| FB
subgraph Backend["Centralized Backend"]
ES["Elasticsearch<br/>or Loki"]
UI["Kibana<br/>or Grafana"]
end
FB -->|ships logs| ES
ES -->|queries| UI
Architecture 1: Node-Level DaemonSet Agent (Most Common)
A logging agent runs as a DaemonSet — one pod per node — tailing the log files from /var/log/containers/. The agent parses the CRI log format, enriches each line with Kubernetes metadata (pod name, namespace, labels, annotations), and forwards the processed logs to a centralized backend. This is by far the most widely used pattern because it requires zero changes to your application code.
Popular DaemonSet agents include Fluent Bit (lightweight, C-based, low memory footprint), Fluentd (Ruby-based, plugin-rich, more flexible routing), Vector (Rust-based, high performance), and Promtail (purpose-built for Grafana Loki). The agent reads from the node filesystem, so the DaemonSet needs a volume mount to /var/log.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:3.1
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            memory: 128Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: config
        configMap:
          name: fluent-bit-config
Architecture 2: Sidecar Container Streaming to stdout
Some applications write logs to files inside the container (e.g., /var/log/app/access.log and /var/log/app/error.log) rather than stdout. A sidecar container can tail those files and re-emit them to its own stdout, making them available to the node-level DaemonSet agent via the standard /var/log/containers/ path.
This pattern is useful when you cannot modify an application to log to stdout, or when you need to split multiple log streams from a single container into separate streams. The downside is resource overhead — each sidecar consumes CPU and memory, and you double the disk I/O since logs are written twice.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  - name: app
    image: legacy-app:2.1
    volumeMounts:
    - name: log-volume
      mountPath: /var/log/app
  - name: log-streamer
    image: busybox:1.36
    args:
    - /bin/sh
    - -c
    - tail -F /var/log/app/access.log
    volumeMounts:
    - name: log-volume
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: log-volume
    emptyDir: {}
Architecture 3: Direct Push from Application
The application pushes logs directly to a logging backend — bypassing the node filesystem entirely. This is common with application-level logging libraries (e.g., a Java app using Logback with an Elasticsearch appender, or a Go service pushing to Loki via HTTP). You get maximum control over format and routing, but you tightly couple your application to a specific logging infrastructure and lose the ability to collect logs from application crashes or containers that fail before the logging library initializes.
| Architecture | Application Changes | Resource Cost | Crash Log Coverage | Best For |
|---|---|---|---|---|
| DaemonSet Agent | None (log to stdout) | 1 agent per node | Full — logs persisted on disk | Most workloads |
| Sidecar Streaming | None | 1 sidecar per pod | Full | Legacy apps writing to files |
| Direct Push | Logging library config | In-app overhead | Partial — crash logs may be lost | High-cardinality custom routing |
Log Aggregation Stacks
Two stacks dominate the Kubernetes logging landscape. Your choice between them depends on scale, query patterns, and infrastructure budget.
EFK Stack: Elasticsearch + Fluent Bit/Fluentd + Kibana
The EFK stack is the traditional enterprise choice. Elasticsearch indexes every log line as a full-text searchable document, making it excellent for ad-hoc queries across high-cardinality fields. Fluent Bit (or Fluentd) ships the logs. Kibana provides dashboards, saved searches, and alerting. The cost: Elasticsearch is resource-hungry. A production cluster typically needs dedicated nodes with fast SSDs and significant memory for the JVM heap.
# Fluent Bit ConfigMap — collect, parse, enrich, and ship to Elasticsearch
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush            5
        Log_Level        info
        Parsers_File     parsers.conf

    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Parser           cri
        Tag              kube.*
        Mem_Buf_Limit    5MB
        Skip_Long_Lines  On
        Refresh_Interval 10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On

    [OUTPUT]
        Name            es
        Match           kube.*
        Host            elasticsearch.logging.svc
        Port            9200
        Logstash_Format On
        Logstash_Prefix k8s-logs
        Retry_Limit     False

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
Loki + Promtail + Grafana Stack
Grafana Loki takes a fundamentally different approach. Instead of indexing the full text of every log line (like Elasticsearch), Loki indexes only the metadata labels (namespace, pod, container, custom labels) and stores the raw log content in compressed chunks on cheap object storage (S3, GCS, MinIO). This makes Loki dramatically cheaper to run — often 10x less infrastructure than Elasticsearch for the same log volume.
The trade-off: you cannot do full-text search across all logs. Queries in Loki always start with a label selector ({namespace="production", app="api-gateway"}) to narrow the stream, then optionally apply line filters or regex. If your debugging workflow is "show me all logs from this service in the last hour", Loki is ideal. If you need "find every log line containing this UUID across the entire cluster", Elasticsearch is faster.
# Promtail DaemonSet config snippet — ships logs to Loki
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki.logging.svc:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        pipeline_stages:
          - cri: {}
          - json:
              expressions:
                level: level
          - labels:
              level:
| Characteristic | EFK (Elasticsearch) | Loki + Grafana |
|---|---|---|
| Indexing | Full-text on every field | Labels only, raw log chunks |
| Query speed (full text) | Fast — inverted index | Slower — brute-force scan within stream |
| Query speed (by label) | Fast | Fast |
| Storage cost | High (SSD-backed indices) | Low (object storage like S3) |
| Resource footprint | Heavy (JVM heap, CPU) | Light (single binary or microservices) |
| Setup complexity | Moderate to high | Low (especially via Helm) |
| Best for | Security/compliance, high-cardinality search | Developer debugging, cost-sensitive teams |
Structured Logging and Log Levels
Unstructured log lines like User login failed for john@example.com are human-readable but machine-hostile. When you have thousands of pods producing millions of log lines, you need logs that can be parsed, filtered, and aggregated automatically. Structured logging means emitting each log entry as a JSON object with consistent, typed fields.
{
  "timestamp": "2024-11-15T10:23:05.891Z",
  "level": "error",
  "message": "User login failed",
  "service": "auth-api",
  "user_email": "john@example.com",
  "reason": "invalid_password",
  "request_id": "req-a4f8c2e1-9b3d",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "duration_ms": 42
}
With structured logs, Fluent Bit's Merge_Log option (or Promtail's json pipeline stage) parses the JSON and promotes fields like level, service, and request_id into indexed or queryable labels. You can then filter dashboards by level=error, trace a request across services with request_id, or correlate logs with distributed traces via trace_id.
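Emitting such entries requires only a thin helper. A minimal Python sketch (field names mirror the example above; in practice you would use a structured-logging library rather than hand-rolling this):

```python
import datetime
import json
import sys

def log(level, message, **fields):
    """Emit one structured log line to stdout as a single JSON object."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,  # arbitrary typed context: request_id, duration_ms, ...
    }
    print(json.dumps(entry), file=sys.stdout)

log("error", "User login failed",
    service="auth-api", reason="invalid_password",
    request_id="req-a4f8c2e1-9b3d")
```

Because every entry is one JSON object per line on stdout, the DaemonSet agent can parse it with zero extra configuration beyond the JSON merge/parse stage already shown.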
Log Levels
Use standard severity levels consistently across all services. This lets you filter noise and alert on errors at the infrastructure level rather than parsing strings.
| Level | When to Use | Example |
|---|---|---|
| debug | Verbose internals — disabled in production by default | Cache key lookup, SQL query parameters |
| info | Normal operational events | Server started, request handled, job completed |
| warn | Unexpected but recoverable situations | Retry attempt, deprecated API call, slow query |
| error | Failures that need attention | Database connection failed, upstream 5xx, unhandled exception |
| fatal | Unrecoverable — process will exit | Missing required config, binding port already in use |
Correlation IDs in Microservices
In a microservices architecture, a single user request might traverse five or more services. Without a shared identifier, debugging a failure means manually stitching logs together by timestamp — which is error-prone and slow. The solution: generate a unique request_id (or correlation_id) at the edge gateway and propagate it through every service via HTTP headers (X-Request-ID) or gRPC metadata.
Every service includes this ID in every log line it emits. To find the full story of a failed request, you query your logging backend with that single ID and see the complete chain — from ingress to database and back. If you are also using distributed tracing (OpenTelemetry), include the trace_id in your logs to link them directly to trace spans.
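The propagation logic itself is small. A Python sketch (the helper names are illustrative, not from any particular framework; only the X-Request-ID header convention comes from the text above):

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"

def ensure_request_id(headers):
    """Reuse the inbound ID if an upstream hop already set one;
    otherwise mint a new one (normally the edge gateway's job)."""
    rid = headers.get(REQUEST_ID_HEADER)
    return rid if rid else str(uuid.uuid4())

def outbound_headers(request_id, extra=None):
    """Headers for the next downstream call, carrying the ID forward."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

Call ensure_request_id on every inbound request, include the result in every log line, and attach outbound_headers to every downstream call; the chain then reconstructs itself from a single ID query in your logging backend.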
Practical: Deploying the Loki Stack with Helm
The fastest way to stand up centralized logging in a Kubernetes cluster is the Loki stack via Helm. This deploys Loki (log storage), Promtail (DaemonSet agent), and connects to an existing or new Grafana instance.
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki (single-binary mode for dev/small clusters)
helm install loki grafana/loki \
--namespace logging --create-namespace \
--set loki.auth_enabled=false \
--set singleBinary.replicas=1 \
--set loki.storage.type=filesystem
# Install Promtail (DaemonSet log shipper)
helm install promtail grafana/promtail \
--namespace logging \
--set "config.clients[0].url=http://loki:3100/loki/api/v1/push"
# Install Grafana (if not already running)
helm install grafana grafana/grafana \
--namespace logging \
--set persistence.enabled=true \
--set adminPassword='your-secure-password'
After installation, add Loki as a data source in Grafana (URL: http://loki:3100), open the Explore panel, and query with LogQL:
# All logs from the "production" namespace
{namespace="production"}
# Error-level logs from a specific app
{namespace="production", app="api-gateway"} |= "error"
# Parse JSON logs and filter by status code
{namespace="production", app="api-gateway"} | json | status_code >= 500
# Count error rate per service over 5-minute windows
sum(rate({namespace="production"} |= "level=error" [5m])) by (app)
Log Retention Strategies
Logs are only useful if they exist when you need them — but storing everything forever is prohibitively expensive. A good retention strategy balances compliance requirements, debugging needs, and storage costs.
| Tier | Retention | Storage Type | Use Case |
|---|---|---|---|
| Hot | 3–7 days | SSD / local disk | Active debugging, real-time dashboards |
| Warm | 30–90 days | HDD / standard cloud storage | Incident investigation, trend analysis |
| Cold/Archive | 1–7 years | Object storage (S3 Glacier, GCS Coldline) | Compliance, audit, legal hold |
In Elasticsearch, use Index Lifecycle Management (ILM) policies to automatically roll over indices by age or size, transition them to cheaper storage tiers, and delete them on schedule. In Loki, configure the compactor component with retention_enabled: true and set per-tenant or global retention periods via limits_config.retention_period.
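As an illustration of the Elasticsearch side, an ILM policy implementing a hot/warm/delete lifecycle might look like the following. The phase names are ILM's own; the ages and sizes are example values you would tune to the tiers in the table above.

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "7d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "shrink": { "number_of_shards": 1 } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```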
# Loki retention configuration snippet
limits_config:
retention_period: 720h # 30 days global default
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
Drop debug-level logs at the agent level before they reach your backend. In Fluent Bit, use a grep filter to exclude lines matching debug. In Promtail, use a drop pipeline stage. This alone can reduce log volume — and storage costs — by 40–60% in verbose applications.
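A minimal Promtail pipeline for this, assuming your applications emit JSON log lines with a `level` field (adjust the extraction expression to your actual log format):

```yaml
# Promtail pipeline sketch — discard debug lines before they are shipped.
pipeline_stages:
  - json:
      expressions:
        level: level      # extract "level" from each JSON log line
  - drop:
      source: level
      value: debug        # drop any line whose level is exactly "debug"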
Key Takeaways
- Always log to stdout/stderr. This is the Kubernetes-native convention. The container runtime, kubelet, and DaemonSet agents all expect it.
- DaemonSet agents are the default choice. Deploy Fluent Bit or Promtail as a DaemonSet — it covers every pod on every node with zero application changes.
- Choose Loki for cost efficiency, Elasticsearch for search power. Most teams that are not in regulated industries start with Loki and move to Elasticsearch only if query patterns demand full-text indexing.
- Emit structured JSON logs. Consistent fields like level, message, service, and request_id transform logs from noise into a queryable observability signal.
- Plan retention from day one. Tiered storage with automatic lifecycle policies prevents runaway costs and keeps you compliant.
Monitoring with Prometheus, Grafana, and Metrics Server
Running containers without monitoring is flying blind. Kubernetes orchestrates hundreds or thousands of Pods across a fleet of nodes, and without visibility into CPU usage, memory pressure, error rates, and API server health, you won’t know something is wrong until users start complaining. This section covers the entire Kubernetes monitoring stack — from the lightweight Metrics Server that powers kubectl top, to the Prometheus ecosystem that gives you deep, long-term observability.
The monitoring landscape in Kubernetes splits into two categories: core metrics (the minimal set required by Kubernetes itself for scheduling and autoscaling) and full monitoring pipelines (Prometheus, Grafana, Alertmanager) that give you complete observability. Understanding both — and how they complement each other — is essential for production operations.
The Monitoring Stack at a Glance
Before diving into individual components, here is how all the pieces fit together. Every metric in the Kubernetes ecosystem starts at a source (kubelet, kube-state-metrics, node-exporter, or your application), gets scraped by Prometheus, stored in its time-series database, queried by Grafana for dashboards, and evaluated by Alertmanager for alerts.
flowchart LR
subgraph Sources["Metric Sources"]
KSM["kube-state-metrics<br/>(Deployment, Pod, Node state)"]
NE["node-exporter<br/>(CPU, memory, disk, network)"]
CA["kubelet / cAdvisor<br/>(container-level metrics)"]
APP["Application Pods<br/>(/metrics endpoints)"]
end
subgraph Prom["Prometheus"]
SM["ServiceMonitor /<br/>PodMonitor CRDs"]
SCRAPE["Scrape Engine<br/>(pull-based)"]
TSDB["Time-Series DB<br/>(local storage)"]
end
subgraph Viz["Visualization & Alerting"]
GF["Grafana<br/>(dashboards)"]
AM["Alertmanager<br/>(routing & notifications)"]
SLACK["Slack / PagerDuty /<br/>Email / Webhook"]
end
MS["Metrics Server<br/>(in-memory, real-time)"]
HPA["HPA / VPA /<br/>kubectl top"]
KSM --> SCRAPE
NE --> SCRAPE
CA --> SCRAPE
APP --> SCRAPE
SM -.->|"defines targets"| SCRAPE
SCRAPE --> TSDB
TSDB --> GF
TSDB -->|"alert rules"| AM
AM --> SLACK
CA --> MS
MS --> HPA
Notice the two independent paths. Metrics Server feeds the Kubernetes control plane (HPA, VPA, kubectl top) with real-time, in-memory metrics. Prometheus scrapes the same sources (plus many more) and stores them for querying, dashboarding, and alerting. They serve different purposes and you need both.
Metrics Server — Lightweight, Real-Time, Ephemeral
Metrics Server is a cluster-wide aggregator of resource usage data. It collects CPU and memory metrics from every kubelet’s built-in cAdvisor, holds them in memory only, and exposes them through the Kubernetes Metrics API (metrics.k8s.io). It is the component that makes kubectl top nodes and kubectl top pods work.
| Characteristic | Metrics Server |
|---|---|
| Storage | In-memory only — no historical data, no persistence |
| Metrics scope | CPU and memory usage at node and pod level |
| Scrape interval | ~15 seconds (configurable) |
| Primary consumers | HPA, VPA, kubectl top, Kubernetes scheduler |
| Not designed for | Long-term storage, dashboards, alerting, custom metrics |
Most managed Kubernetes services (EKS, GKE, AKS) pre-install Metrics Server. For self-managed clusters, deploying it is straightforward:
# Install Metrics Server (official manifest)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify it is running
kubectl get deployment metrics-server -n kube-system
# Now you can check resource usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu
Metrics Server keeps only the latest data point — it has no history. It exists solely to feed Kubernetes internal components like the Horizontal Pod Autoscaler. For dashboards, alerting, and historical analysis, you need Prometheus. Think of Metrics Server as the speedometer in your car and Prometheus as the full telemetry system.
Prometheus — The Pillar of Kubernetes Monitoring
Prometheus is an open-source monitoring system purpose-built for dynamic, cloud-native environments. It is the de facto standard for Kubernetes monitoring. Unlike traditional push-based monitoring tools (where agents send data to a central server), Prometheus uses a pull-based model — it actively scrapes HTTP endpoints (/metrics) on a configurable interval.
This pull model is a deliberate design choice that works exceptionally well in Kubernetes. Pods come and go, IPs change, replicas scale up and down. Prometheus uses service discovery to dynamically find all scrape targets, so it automatically adapts to the changing cluster topology without reconfiguration.
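To make the pull model concrete, here is a toy scrape target built on the Python standard library — it renders an in-process counter in the Prometheus text exposition format at `/metrics`. A real application would use an official client library such as `prometheus_client`; the registry and helper names here are illustrative.

```python
# Sketch of a pull-model scrape target: an HTTP handler that renders
# counters in the Prometheus text exposition format.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-process counter registry: {(metric_name, label_string): value}
COUNTERS = {}

def inc(name: str, labels: str = "", amount: float = 1.0) -> None:
    """Increment a counter, creating it on first use."""
    COUNTERS[(name, labels)] = COUNTERS.get((name, labels), 0.0) + amount

def render_metrics() -> str:
    """Render all counters as Prometheus exposition text."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(COUNTERS.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_part = f"{{{labels}}}" if labels else ""
        lines.append(f"{name}{label_part} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then simply does an HTTP GET against this endpoint on every scrape interval — the application never pushes anything.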
Core Concepts
| Concept | Description |
|---|---|
| Metric | A named time series with labels. Example: container_cpu_usage_seconds_total{namespace="prod", pod="api-7b5f8"} |
| Scrape target | An HTTP endpoint that exposes metrics in Prometheus format. Every Kubernetes component exposes one. |
| Scrape interval | How often Prometheus fetches metrics (typically 15–30 seconds). |
| TSDB | Prometheus stores all data in a local time-series database, optimized for append-heavy writes and label-based queries. |
| PromQL | The query language for selecting, aggregating, and computing over time-series data. |
| Recording rules | Pre-computed PromQL expressions stored as new time series — reduces query-time computation for dashboards. |
| Alert rules | PromQL expressions that fire when a condition is true for a specified duration. |
Installing with kube-prometheus-stack
The recommended way to deploy Prometheus on Kubernetes is through the kube-prometheus-stack Helm chart (formerly prometheus-operator). This single chart installs Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics, and a set of pre-configured alert rules and dashboards.
# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the full stack into a dedicated namespace
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set grafana.adminPassword=your-secure-password
# Verify all components are running
kubectl get pods -n monitoring
This gives you a fully operational monitoring stack out of the box: Prometheus scraping all Kubernetes components, Grafana loaded with dashboards, Alertmanager configured with sensible default alerts, and exporters collecting node and cluster state metrics.
ServiceMonitor and PodMonitor CRDs
The Prometheus Operator introduces Custom Resource Definitions that let you declaratively define scrape targets. Instead of editing Prometheus configuration files, you create a ServiceMonitor (targets a Kubernetes Service) or a PodMonitor (targets Pods directly). The Operator watches these CRDs and automatically updates the Prometheus scrape configuration.
# ServiceMonitor — scrape metrics from a Service's endpoints
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app-metrics
namespace: monitoring
labels:
release: kube-prometheus-stack # must match Prometheus selector
spec:
namespaceSelector:
matchNames:
- production
selector:
matchLabels:
app: my-app
endpoints:
- port: http-metrics # named port on the Service
interval: 30s
path: /metrics
# PodMonitor — scrape metrics directly from Pod ports
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: envoy-sidecar-metrics
namespace: monitoring
spec:
namespaceSelector:
matchNames:
- production
selector:
matchLabels:
sidecar: envoy
podMetricsEndpoints:
- port: admin
interval: 15s
path: /stats/prometheus
The key distinction: use a ServiceMonitor when your Pods are fronted by a Service (the common case). Use a PodMonitor when you need to scrape individual Pods directly — for example, sidecar proxies that don’t have their own Service, or when you need per-Pod label resolution.
Metric Sources — kube-state-metrics and node-exporter
Prometheus scrapes metrics, but it needs something to expose them. In a Kubernetes cluster, three primary metric sources cover the full picture: the kubelet’s built-in cAdvisor, kube-state-metrics, and node-exporter. Each serves a distinct purpose.
| Source | What It Exposes | Example Metrics |
|---|---|---|
| kubelet / cAdvisor | Container-level resource usage — CPU, memory, filesystem, and network I/O for every running container | container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_network_receive_bytes_total |
| kube-state-metrics | Cluster object state from the Kubernetes API — Deployments, Pods, Nodes, Jobs. Answers "what is the desired vs. actual state?" | kube_pod_status_phase, kube_deployment_spec_replicas, kube_node_status_condition, kube_job_status_failed |
| node-exporter | Host-level hardware and OS metrics — CPU load, memory, disk space, network interfaces, filesystem usage | node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_filesystem_avail_bytes, node_disk_io_time_seconds_total |
kube-state-metrics is particularly important because it bridges the gap between the Kubernetes API and Prometheus. cAdvisor tells you "this container is using 200m CPU." kube-state-metrics tells you "this Deployment wants 3 replicas but only 2 are available" or "this Pod has been in CrashLoopBackOff for 10 minutes." Without it, you cannot alert on most operational problems.
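Both of those situations translate directly into PromQL against kube-state-metrics series (the metric names are real kube-state-metrics metrics; the namespace label is illustrative):

```
# Deployments whose available replicas lag behind the desired count
kube_deployment_spec_replicas{namespace="production"}
  - kube_deployment_status_replicas_available{namespace="production"} > 0

# Containers currently waiting in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
```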
node-exporter runs as a DaemonSet (one Pod per node) and exposes hardware-level metrics that cAdvisor does not cover. While cAdvisor reports per-container metrics, node-exporter reports the overall health of the underlying machine — disk IOPS, network errors, CPU steal time, and available memory across the entire node.
Key Metrics to Monitor
With hundreds of metrics available, it helps to organize your monitoring strategy around two proven frameworks: the USE method (Utilization, Saturation, Errors) for infrastructure resources, and the RED method (Rate, Errors, Duration) for services.
USE Method — For Nodes and Infrastructure
| Signal | What to Measure | Key Metrics |
|---|---|---|
| Utilization | How busy is the resource? | node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_filesystem_avail_bytes |
| Saturation | How much extra work is queued? | node_load1 / node_load15, node_disk_io_time_weighted_seconds_total |
| Errors | Are there failures? | node_network_receive_errs_total, node_disk_io_time_seconds_total (anomalies) |
RED Method — For Services and APIs
| Signal | What to Measure | Key Metrics |
|---|---|---|
| Rate | Requests per second | apiserver_request_total, application-specific request counters |
| Errors | Failed requests per second | apiserver_request_total{code=~"5.."}, grpc_server_handled_total{grpc_code!="OK"} |
| Duration | Latency distribution | apiserver_request_duration_seconds_bucket, application histogram metrics |
PromQL — Querying Kubernetes Metrics
PromQL is Prometheus’s query language. It operates on time series identified by a metric name and a set of key-value labels. Learning a handful of patterns covers 90% of what you need for Kubernetes monitoring.
Essential Queries
# CPU usage per Pod (cores) — averaged over 5 minutes
rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])
# CPU usage as a percentage of the Pod's request
sum by (namespace, pod) (
rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])
)
/
sum by (namespace, pod) (
kube_pod_container_resource_requests{resource="cpu"}
) * 100
# Memory working set per Pod (the metric that triggers OOMKill)
container_memory_working_set_bytes{image!="", container!="POD"}
# Pods not in Running phase, grouped by namespace and phase
sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"})
# API server request rate by verb and response code
sum by (verb, code) (rate(apiserver_request_total[5m]))
# API server error rate (5xx responses)
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/
sum(rate(apiserver_request_total[5m])) * 100
# Node CPU utilization percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Available disk space per node (percentage)
(node_filesystem_avail_bytes{mountpoint="/"}
/ node_filesystem_size_bytes{mountpoint="/"}) * 100
# Pods in CrashLoopBackOff (restart count increasing)
sum by (namespace, pod) (
increase(kube_pod_container_status_restarts_total[1h])
) > 5
A few key patterns to remember. rate() calculates the per-second rate of a counter over a time window — you will use this on every _total metric. sum by (label) aggregates across dimensions. increase() shows the total increase over a window, which is useful for low-frequency events like Pod restarts.
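The semantics of rate() over a counter can be sketched in a few lines of Python. This is a simplification — real Prometheus also extrapolates to the edges of the query window — but it shows the two essential behaviors: per-second normalization and counter-reset handling.

```python
# Illustrative sketch of PromQL rate() over (timestamp, value) samples.
def counter_increase(samples: list[tuple[float, float]]) -> float:
    """Total increase of a counter across samples, treating any drop
    in value as a counter reset (e.g. the target process restarted)."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts near zero, so the whole
        # new value counts as increase.
        increase += curr if curr < prev else curr - prev
    return increase

def rate(samples: list[tuple[float, float]]) -> float:
    """Per-second rate over the sampled window, like PromQL rate()."""
    window = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / window if window > 0 else 0.0
```

For a counter sampled at 100, 130, 160 over 30 seconds, rate() yields 2.0 per second; if the third sample drops back to 10, the reset logic counts it as +10 rather than -120.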
When querying cAdvisor metrics, include {image!="", container!="POD"} in your label selector. The container="POD" entries represent the pause container (the network namespace holder) — not your actual workload. Omitting this filter inflates your results with meaningless data.
Grafana — Dashboards for Kubernetes
Grafana is the visualization layer. It connects to Prometheus as a data source and lets you build dashboards with graphs, tables, heatmaps, and stat panels. The kube-prometheus-stack Helm chart installs Grafana pre-configured with a Prometheus data source and a comprehensive set of dashboards.
Accessing Grafana
# Port-forward Grafana to your local machine
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# Open http://localhost:3000
# Default credentials: admin / your-secure-password (set during helm install)
Essential Dashboards
The kube-prometheus-stack ships with dozens of dashboards. Focus on these four categories first — they cover the most critical operational views:
| Dashboard | What It Shows | When to Use |
|---|---|---|
| Kubernetes / Cluster Overview | Total cluster CPU/memory usage, node count, Pod count, failed Pods, and overall resource allocation vs. capacity | Daily health check, capacity planning, initial incident triage |
| Node Exporter / Nodes | Per-node CPU, memory, disk I/O, network throughput, system load, and filesystem usage | Investigating node-level performance issues, disk pressure, or network saturation |
| Kubernetes / Pods | Per-pod CPU/memory usage vs. requests/limits, container restarts, network traffic, and OOMKill events | Debugging application performance issues, right-sizing resource requests |
| Kubernetes / API Server | Request rate, error rate, latency percentiles, inflight requests, and etcd request durations | Diagnosing control plane slowness or API server overload |
Creating a Custom Dashboard Panel
You can define Grafana dashboards as code using ConfigMaps. The Grafana sidecar in kube-prometheus-stack watches for ConfigMaps with a specific label and automatically imports them. Here is an example that creates a namespace resource usage dashboard:
apiVersion: v1
kind: ConfigMap
metadata:
name: custom-namespace-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1" # sidecar picks up ConfigMaps with this label
data:
namespace-resources.json: |
{
"title": "Namespace Resource Usage",
"uid": "ns-resource-usage",
"panels": [
{
"title": "CPU Usage by Namespace",
"type": "timeseries",
"datasource": "Prometheus",
"targets": [{
"expr": "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))",
"legendFormat": "{{ namespace }}"
}],
"gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
},
{
"title": "Memory Usage by Namespace",
"type": "timeseries",
"datasource": "Prometheus",
"targets": [{
"expr": "sum by (namespace) (container_memory_working_set_bytes)",
"legendFormat": "{{ namespace }}"
}],
"gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
}
],
"schemaVersion": 39,
"version": 1
}
Alerting with Alertmanager
Dashboards are useful when someone is looking at them. Alerts are what wake you up at 3 AM when something is actually broken. In the Prometheus ecosystem, alerting works in two stages: Prometheus evaluates alert rules (PromQL expressions with thresholds and durations) and fires alerts when conditions are met. Alertmanager receives those fired alerts, deduplicates them, groups related alerts, and routes them to the right notification channel.
Alert Rules — Defining What to Alert On
Alert rules are defined as PrometheusRule CRDs (another Prometheus Operator resource). Each rule specifies a PromQL expression, a for duration (how long the condition must be true before firing), and labels/annotations that control routing and provide context.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: kubernetes-critical-alerts
namespace: monitoring
labels:
release: kube-prometheus-stack
spec:
groups:
- name: kubernetes.pod.alerts
rules:
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Container {{ $labels.container }} restarted {{ $value }} times in the last hour."
- alert: PodNotReady
expr: |
kube_pod_status_ready{condition="true"} == 0
for: 15m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 15m"
- name: kubernetes.node.alerts
rules:
- alert: NodeNotReady
expr: |
kube_node_status_condition{condition="Ready", status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is NotReady"
description: "Node has been NotReady for more than 5 minutes."
- alert: HighNodeMemoryUsage
expr: |
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 10m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} memory above 90%"
- name: kubernetes.apiserver.alerts
rules:
- alert: APIServerHighErrorRate
expr: |
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/ sum(rate(apiserver_request_total[5m])) * 100 > 3
for: 10m
labels:
severity: critical
annotations:
summary: "API server error rate above 3%"
Alertmanager Configuration — Routing, Receivers, and Silences
Alertmanager handles the "what happens after an alert fires" logic. Its configuration defines receivers (where to send notifications), routes (which alerts go to which receivers), inhibition rules (suppress lower-priority alerts when a higher-priority one is firing), and silences (temporarily mute specific alerts during maintenance).
# Alertmanager config (set via Helm values under alertmanager.config)
alertmanager:
config:
global:
resolve_timeout: 5m
route:
receiver: default-slack
group_by: [alertname, namespace]
group_wait: 30s # wait before sending first notification
group_interval: 5m # wait between grouped notifications
repeat_interval: 4h # re-notify interval for unresolved alerts
routes:
- match:
severity: critical
receiver: pagerduty-critical
continue: false
- match:
severity: warning
receiver: slack-warnings
receivers:
- name: default-slack
slack_configs:
- api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
channel: "#k8s-alerts"
title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
text: >-
{{ range .Alerts }}*{{ .Annotations.summary }}*
{{ .Annotations.description }}{{ end }}
- name: pagerduty-critical
pagerduty_configs:
- service_key: your-pagerduty-service-key
severity: critical
- name: slack-warnings
slack_configs:
- api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
channel: "#k8s-warnings"
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: [alertname, namespace]
The inhibition rule at the bottom is important: if a critical alert fires for a given alertname and namespace, it suppresses the corresponding warning-level alert. This prevents alert floods — you don’t want both a "node memory high" warning and a "node memory critical" alert hitting your Slack channel simultaneously.
Key Alertmanager Concepts
| Concept | Purpose | Example |
|---|---|---|
| Grouping | Combines related alerts into a single notification | group_by: [alertname, namespace] — all PodCrashLooping alerts in the same namespace arrive as one message |
| Routing | Directs alerts to different receivers based on labels | Critical → PagerDuty, Warning → Slack |
| Inhibition | Suppresses lower-severity alerts when a related higher-severity alert is active | NodeNotReady (critical) suppresses HighNodeCPU (warning) for the same node |
| Silences | Temporarily mutes alerts matching specific labels (created via UI or API) | Silence all alerts for namespace=staging during a maintenance window |
The most common monitoring anti-pattern is alerting on everything. If your team receives 50 notifications per day, they will start ignoring all of them. Only alert on conditions that require human intervention. Metrics that are "nice to know" belong on dashboards, not in alert rules. A good rule of thumb: every alert should have a clear runbook explaining what action to take.
Putting It All Together — A Monitoring Checklist
Here is a practical checklist for production Kubernetes monitoring. Deploy the kube-prometheus-stack with sensible defaults, customize these key areas, and you will have solid observability coverage:
- Deploy the full stack. Use helm install kube-prometheus-stack to get Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in one shot.
- Verify Metrics Server is running for HPA and kubectl top functionality.
- Create ServiceMonitors for every application that exposes a /metrics endpoint. This is how Prometheus discovers your custom metrics.
- Set up alert rules for the essentials: PodCrashLooping, NodeNotReady, HighMemoryUsage, APIServerErrors, PersistentVolumeFillingUp.
- Configure Alertmanager routing so critical alerts go to PagerDuty (or your on-call tool) and warnings go to Slack.
- Add inhibition rules to prevent alert storms when a single root cause triggers multiple symptoms.
- Review Grafana dashboards daily. Use the cluster overview for capacity planning and the Pod dashboard for right-sizing resource requests.
- Configure persistent storage for Prometheus (storageSpec in Helm values) so you retain metrics across Pod restarts. 15–30 days of retention is typical.
Troubleshooting — A Systematic Guide to Common Issues
Every Kubernetes operator eventually stares at a Pod stuck in Pending, a Service that refuses to route traffic, or a node that drops to NotReady at 2 AM. The difference between a 10-minute fix and a 3-hour ordeal is almost always methodology — knowing which commands to run, in which order, and what the output actually means.
This section gives you a systematic approach to diagnosing the most common Kubernetes failures. Each issue category follows the same pattern: understand why the failure happens, then follow concrete debugging steps with real command output to identify and resolve the root cause.
The Debugging Toolkit
Before diving into specific issues, you need to internalize the five core debugging commands. These cover 90% of Kubernetes troubleshooting. Everything else is built on top of them.
| Command | What It Tells You | When to Use |
|---|---|---|
kubectl describe <resource> <name> | Full resource spec + conditions + events (the most useful part). Shows scheduling decisions, image pulls, mount errors, and probe failures. | First command for any resource-level issue. Always start here. |
kubectl logs <pod> [-c container] | Container stdout/stderr output. Add --previous to see logs from the last crashed container. | When a container is crashing or misbehaving. Use -f to stream. |
kubectl get events --sort-by='.lastTimestamp' | Cluster-wide event stream — scheduling, pulling, mounting, killing, scaling events across all resources. | When you need the big picture. Great for correlating issues across resources. |
kubectl exec -it <pod> -- /bin/sh | Interactive shell inside a running container. Test DNS, connectivity, filesystem, environment variables. | When you need to verify the container's runtime environment. |
kubectl debug node/<name> -it --image=busybox | Launches a privileged debugging Pod on a specific node with access to the host filesystem and network. | When the issue is at the node level — disk, networking, kubelet, or container runtime. |
When a Pod is not running, kubectl logs often returns nothing — there's no container to produce output yet. Always run kubectl describe pod <name> first. The Events section at the bottom tells you what Kubernetes tried to do and why it failed — image pull errors, scheduling failures, volume mount issues, and probe failures all surface here before any container log exists.
The Troubleshooting Decision Tree
When something goes wrong, the fastest path to a fix is to identify the category of failure first. Pod status is your primary signal — it immediately narrows your search space to one of a few well-understood failure modes.
flowchart TD
START["Pod not working as expected"] --> STATUS{"What is the<br/>Pod status?"}
STATUS -->|"Pending"| PEND{"Check events<br/>with describe"}
PEND -->|"No nodes match"| FIX_SCHED["Fix nodeSelector,<br/>tolerations, or affinity"]
PEND -->|"Insufficient CPU/memory"| FIX_RES["Reduce requests or<br/>add cluster capacity"]
PEND -->|"PVC not bound"| FIX_PVC["Fix StorageClass<br/>or provision PV"]
STATUS -->|"ImagePullBackOff"| IMG{"Check image<br/>name & registry"}
IMG -->|"Wrong tag/name"| FIX_IMG["Fix image reference<br/>in Pod spec"]
IMG -->|"Auth required"| FIX_SEC["Create or fix<br/>imagePullSecrets"]
STATUS -->|"CrashLoopBackOff"| CRASH{"Check logs<br/>--previous"}
CRASH -->|"OOMKilled"| FIX_OOM["Increase memory<br/>limits"]
CRASH -->|"App error"| FIX_APP["Fix application<br/>code or config"]
CRASH -->|"Bad command/args"| FIX_CMD["Fix command<br/>or entrypoint"]
STATUS -->|"Running but<br/>not working"| RUNNING{"Check Service<br/>& networking"}
RUNNING -->|"No endpoints"| FIX_LBL["Fix label selector<br/>or targetPort"]
RUNNING -->|"DNS failure"| FIX_DNS["Check CoreDNS<br/>and DNS policy"]
RUNNING -->|"Probe failing"| FIX_PROBE["Fix readiness/<br/>liveness probe"]
STATUS -->|"Evicted"| EVICT["Check node<br/>pressure conditions"]
style START fill:#f8fafc,stroke:#334155,color:#0f172a
style FIX_SCHED fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_RES fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_PVC fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_IMG fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_SEC fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_OOM fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_APP fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_CMD fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_LBL fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_DNS fill:#dcfce7,stroke:#16a34a,color:#14532d
style FIX_PROBE fill:#dcfce7,stroke:#16a34a,color:#14532d
style EVICT fill:#fef9c3,stroke:#ca8a04,color:#713f12
With this mental model, let's walk through each failure category in detail.
Pod Issues
Pods are the atomic unit of scheduling in Kubernetes, and they're where most failures surface. The Pod's status.phase and status.containerStatuses[].state fields are your first diagnostic signals. Run kubectl get pods to see the high-level status, then drill in with describe.
ImagePullBackOff
ImagePullBackOff means the kubelet tried to pull the container image and failed. After the initial failure (ErrImagePull), Kubernetes backs off exponentially — retrying at increasing intervals up to 5 minutes. The three most common causes are: a misspelled image name or tag, a private registry that requires authentication, and a missing imagePullSecret.
# Step 1: Identify the failing Pod
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# api-server-7f8b9d6c4-x2k 0/1 ImagePullBackOff 0 3m
# Step 2: Get the exact error from events
kubectl describe pod api-server-7f8b9d6c4-x2k | tail -10
# Events:
# Warning Failed pull image "myregistry.io/api-server:v2.1.0":
# rpc error: code = Unknown desc = failed to pull and unpack image:
# 401 Unauthorized
# Step 3: Verify the image exists (from your local machine)
docker pull myregistry.io/api-server:v2.1.0
# Step 4: If auth is the issue, create the pull secret
kubectl create secret docker-registry regcred \
--docker-server=myregistry.io \
--docker-username=deploy-bot \
--docker-password="${REGISTRY_TOKEN}" \
--docker-email=deploy@example.com
# Step 5: Patch the Deployment to use the secret
kubectl patch deployment api-server -p \
'{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
Quick checklist for ImagePullBackOff:
- Is the image name and tag spelled correctly? (Watch for typos like ngixn vs nginx.)
- Does the tag actually exist in the registry? (latest may not exist on every image.)
- Is the registry private? If so, does the namespace have the correct imagePullSecret?
- Is the node able to reach the registry? (Network policies, firewall rules, proxy settings.)
- Is the imagePullPolicy set to Always when it should be IfNotPresent (or vice versa)?
CrashLoopBackOff
CrashLoopBackOff means the container starts, then exits, and Kubernetes keeps restarting it with increasing backoff delays (10s, 20s, 40s, up to 5 minutes). The image was pulled successfully — the problem is inside the container. This is where kubectl logs --previous becomes essential, because it captures stdout/stderr from the last terminated container instance.
# Check the restart count and termination reason
kubectl get pod payment-svc-5d4f8b7a9-m3j \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated}' | jq .
# {
# "exitCode": 137,
# "reason": "OOMKilled",
# "startedAt": "2024-11-15T08:23:01Z",
# "finishedAt": "2024-11-15T08:23:44Z"
# }
# Grab logs from the previous (crashed) container
kubectl logs payment-svc-5d4f8b7a9-m3j --previous
# If the container exits too fast for logs, override the entrypoint
# to keep it alive for inspection:
kubectl debug payment-svc-5d4f8b7a9-m3j \
-it --copy-to=debug-pod --container=payment \
-- /bin/sh -c "sleep 3600"
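The backoff schedule described above is easy to reason about numerically. A minimal sketch, assuming the 10-second base, doubling interval, and 5-minute cap stated above (the exact timings are kubelet implementation details):

```python
def crashloop_backoff_delays(restarts: int, base: float = 10.0, cap: float = 300.0) -> list[float]:
    """Delay before each restart attempt: base doubles each time, capped at `cap`."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# The first restarts wait 10s, 20s, 40s, ... then stay pinned at 300s.
print(crashloop_backoff_delays(7))
```

The practical consequence: after a handful of crashes, you wait up to five minutes between attempts, so a fix you deploy may take several minutes to be picked up unless you delete the Pod to reset the backoff.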
The three dominant causes of CrashLoopBackOff:
| Cause | Indicator | Fix |
|---|---|---|
| OOMKilled | Exit code 137, reason OOMKilled in lastState.terminated | Increase resources.limits.memory in the Pod spec, or fix the memory leak in the application. |
| Application error | Non-zero exit code (1, 2, etc.), error messages in kubectl logs --previous | Fix the application bug. Check environment variables, ConfigMap mounts, database connection strings, and missing dependencies. |
| Wrong command/args | Exit code 126 (permission denied) or 127 (command not found) | Verify the command and args fields in the container spec. Remember: command overrides the image's ENTRYPOINT, and args overrides CMD. |
Exit code 137 with reason OOMKilled means the container exceeded its memory limit and the kernel's OOM killer terminated it. This is different from node-level memory pressure, which causes Pod eviction (covered below). For OOMKilled, the fix is always at the container level — raise the limit or reduce memory consumption. Don't confuse the two.
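Exit codes above 128 follow the Unix convention of 128 + signal number, which is how 137 decodes to SIGKILL (9). A small sketch of that decoding, including the 126/127 cases from the table above:

```python
import signal

def decode_exit_code(code: int) -> str:
    """Map a container exit code to a human-readable cause."""
    if code > 128:
        signum = code - 128
        name = signal.Signals(signum).name  # e.g. SIGKILL for 9
        return f"terminated by signal {signum} ({name})"
    if code == 126:
        return "command found but not executable (permission denied)"
    if code == 127:
        return "command not found"
    return f"application exited with code {code}"

print(decode_exit_code(137))  # terminated by signal 9 (SIGKILL)
print(decode_exit_code(143))  # terminated by signal 15 (SIGTERM)
```

Exit code 143 (SIGTERM) is worth recognizing too: it usually means the container was shut down gracefully, for example during a rolling update, not that it crashed.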
Pending Pods
A Pod in Pending status has been accepted by the API server and stored in etcd, but the scheduler cannot place it on any node. The Pod remains Pending indefinitely until the underlying constraint is resolved. The Events section in kubectl describe pod almost always tells you why.
# Diagnose why a Pod is Pending
kubectl describe pod ml-training-pod-8x4r2 | grep -A 5 "Events"
# Events:
# Warning FailedScheduling 0/3 nodes are available:
# 1 node(s) had untolerated taint {gpu=true},
# 2 node(s) didn't match Pod's node affinity/selector.
# Check resource availability across all nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# Compare Pod requests against node capacity
kubectl get pod ml-training-pod-8x4r2 \
-o jsonpath='{.spec.containers[*].resources}' | jq .
# Check for unbound PVCs (common with StatefulSets)
kubectl get pvc
# NAME STATUS VOLUME CAPACITY STORAGECLASS AGE
# data-redis-0 Pending fast-ssd 5m
Common causes and their fixes:
- Insufficient resources: No node has enough allocatable CPU or memory for the Pod's requests. Either reduce the requests, add nodes, or evict lower-priority workloads.
- Node selector mismatch: The Pod specifies nodeSelector: {disktype: ssd} but no node has that label. Add the label with kubectl label node <name> disktype=ssd.
- Taints without tolerations: All available nodes have taints the Pod doesn't tolerate. Add the matching tolerations to the Pod spec.
- Unbound PVC: The Pod references a PersistentVolumeClaim that hasn't been provisioned yet. See the Storage Issues subsection below.
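The scheduler's filtering phase checks exactly these constraints for every node. A toy sketch of the predicate, assuming simplified dict-shaped Pods and nodes (the real scheduler evaluates many more predicates, such as affinity and volume topology):

```python
def node_fits(pod: dict, node: dict) -> bool:
    """Toy scheduler filter: resource fit, nodeSelector match, taint toleration."""
    # 1. Resource fit: requests must fit into allocatable minus already-allocated
    for res, req in pod.get("requests", {}).items():
        free = node["allocatable"].get(res, 0) - node.get("allocated", {}).get(res, 0)
        if req > free:
            return False
    # 2. nodeSelector: every required label must be present with the same value
    for key, val in pod.get("nodeSelector", {}).items():
        if node.get("labels", {}).get(key) != val:
            return False
    # 3. Taints: every node taint must appear in the Pod's tolerations
    for taint in node.get("taints", []):
        if taint not in pod.get("tolerations", []):
            return False
    return True

node = {"allocatable": {"cpu": 4000, "memory": 8192},  # millicores / MiB
        "allocated": {"cpu": 3500},
        "labels": {"disktype": "ssd"}, "taints": []}
print(node_fits({"requests": {"cpu": 250}}, node))   # True: 500m free
print(node_fits({"requests": {"cpu": 1000}}, node))  # False: only 500m free
```

If this predicate returns False for every node, the Pod stays Pending, and the FailedScheduling event enumerates how many nodes failed each check.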
Evicted Pods
Eviction happens when a node runs critically low on resources — disk space, memory, or process IDs. The kubelet monitors these thresholds and evicts Pods to protect node stability. Evicted Pods are not restarted on the same node; the owning controller (Deployment, ReplicaSet) creates replacements that get scheduled elsewhere.
# Find evicted Pods (they linger in Failed status)
kubectl get pods --field-selector=status.phase=Failed \
-o custom-columns=NAME:.metadata.name,REASON:.status.reason,NODE:.spec.nodeName
# NAME REASON NODE
# logger-5c7f8d-xq9k2 Evicted worker-03
# cache-warmup-b8d4f-r3j7n Evicted worker-03
# Check node conditions for pressure signals
kubectl describe node worker-03 | grep -A 3 "Conditions"
# Conditions:
# MemoryPressure True KubeletHasInsufficientMemory
# DiskPressure False KubeletHasNoDiskPressure
# PIDPressure False KubeletHasSufficientPID
# Ready True KubeletReady
# Clean up lingering evicted Pod objects
kubectl delete pods --field-selector=status.phase=Failed
Evictions are a symptom, not the root cause. If Pods keep getting evicted from the same node, investigate: is a Pod consuming unbounded memory (no limits set)? Are log files filling the disk? Is emptyDir storage growing unchecked? Set proper resource limits and configure log rotation to prevent recurrence.
Service Issues
A Kubernetes Service provides a stable network identity (ClusterIP, DNS name) for a set of Pods. When a Service doesn't route traffic correctly, the problem almost always comes down to one thing: the Service can't find its backend Pods. Kubernetes uses label selectors to associate a Service with Pods, and the Endpoints (or EndpointSlice) object is the link between them.
Endpoints Not Populating
When kubectl get endpoints <service-name> shows <none>, the Service has no backend Pods. Traffic sent to the Service's ClusterIP goes nowhere. This is the single most common Service debugging scenario.
# Step 1: Check if the Service has endpoints
kubectl get endpoints order-service
# NAME ENDPOINTS AGE
# order-service <none> 12m <-- problem!
# Step 2: Compare the Service selector with Pod labels
kubectl get svc order-service -o jsonpath='{.spec.selector}' | jq .
# { "app": "order-svc" }
kubectl get pods --show-labels | grep order
# order-svc-7f8b9c-x2k 1/1 Running app=orders <-- mismatch!
# The Service selects "app=order-svc" but Pods have "app=orders"
# Fix: update either the Service selector or Pod labels
# Step 3: Also verify the port mapping
kubectl get svc order-service -o jsonpath='{.spec.ports[*]}' | jq .
# { "port": 80, "targetPort": 3000, "protocol": "TCP" }
# Confirm the container actually listens on port 3000
kubectl exec order-svc-7f8b9c-x2k -- ss -tlnp
The three-point Service health check: (1) Do the label selectors match? (2) Does targetPort match the port the container is actually listening on? (3) Are the backend Pods in Ready state? A Pod that fails its readiness probe is automatically removed from the Endpoints list — it's running, but the Service won't send traffic to it.
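Endpoint membership is just label subset matching plus readiness. A sketch of the rule, using the hypothetical Pods from the example above:

```python
def service_endpoints(selector: dict, pods: list[dict]) -> list[str]:
    """A Pod backs a Service if the selector is a subset of its labels
    and the Pod is Ready (readiness gates endpoint membership)."""
    return [
        p["name"] for p in pods
        if selector.items() <= p["labels"].items() and p.get("ready", False)
    ]

pods = [
    {"name": "order-svc-x2k", "labels": {"app": "orders"}, "ready": True},
    {"name": "order-svc-y9m", "labels": {"app": "order-svc"}, "ready": True},
    {"name": "order-svc-z3q", "labels": {"app": "order-svc"}, "ready": False},
]
# Only the second Pod qualifies: the first has mismatched labels,
# the third matches but is not Ready.
print(service_endpoints({"app": "order-svc"}, pods))  # ['order-svc-y9m']
```

Note the subset semantics: a Pod may carry extra labels beyond what the selector names; only the selector's keys must match.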
External Access Not Working
When a LoadBalancer or NodePort Service isn't reachable from outside the cluster, the issue is often at the infrastructure layer rather than Kubernetes itself.
# Check if the LoadBalancer has an external IP assigned
kubectl get svc frontend -o wide
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# frontend LoadBalancer 10.96.45.12 <pending> 80:31247/TCP
# <pending> means the cloud controller hasn't provisioned an LB yet.
# Possible causes:
# - No cloud-controller-manager running (bare-metal cluster)
# - Quota exhausted for load balancers in your cloud account
# - Missing annotation required by your cloud provider
# For NodePort, verify the port is reachable on the node
kubectl get svc frontend -o jsonpath='{.spec.ports[0].nodePort}'
# 31247
# Test from outside: curl http://<node-external-ip>:31247
# If blocked, check security group / firewall rules
# for the node port range (default 30000-32767)
Networking Issues
Kubernetes networking relies on a flat network model — every Pod gets its own IP, and every Pod can reach every other Pod without NAT. In practice, this works through a CNI (Container Network Interface) plugin. When networking breaks, it's usually DNS resolution, CNI misconfiguration, or NetworkPolicy rules blocking traffic.
DNS Resolution Failures
Every Pod in a Kubernetes cluster relies on CoreDNS for service discovery. When DNS breaks, Pods can't resolve Service names, and cascading failures follow quickly. The telltale signs: applications log "name resolution failed" or "host not found" errors while IP-based connectivity works fine.
# Step 1: Test DNS from inside a Pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
nslookup kubernetes.default.svc.cluster.local
# If this fails, DNS is broken cluster-wide.
# Step 2: Check CoreDNS Pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# NAME READY STATUS RESTARTS
# coredns-5d78c9869d-4xk2m 1/1 Running 0
# coredns-5d78c9869d-r8j7n 1/1 Running 0
# Step 3: Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Step 4: Verify the kube-dns Service has endpoints
kubectl get endpoints -n kube-system kube-dns
# Should show CoreDNS Pod IPs on ports 53 (UDP+TCP)
# Step 5: Check the Pod's DNS config
kubectl exec <your-pod> -- cat /etc/resolv.conf
# nameserver 10.96.0.10 <-- should be the kube-dns ClusterIP
# search default.svc.cluster.local svc.cluster.local cluster.local
If CoreDNS Pods are healthy but DNS is still failing, check whether a NetworkPolicy is blocking UDP/TCP port 53 traffic to the kube-system namespace. Also verify that the Pod's dnsPolicy is set correctly — the default ClusterFirst routes queries to CoreDNS, while Default uses the node's /etc/resolv.conf instead.
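The search line in resolv.conf is why short names work inside the cluster: the resolver expands an unqualified name through each search domain in order. A simplified sketch of that expansion (real resolvers also apply an ndots threshold, ignored here):

```python
def dns_candidates(name: str, search: list[str]) -> list[str]:
    """Names ending in '.' are absolute; otherwise each search domain
    is appended in order, then the name is tried as given."""
    if name.endswith("."):
        return [name.rstrip(".")]
    return [f"{name}.{domain}" for domain in search] + [name]

# The default search list for a Pod in the "default" namespace:
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in dns_candidates("api-gateway.backend", search):
    print(candidate)
```

This is why "api-gateway.backend" resolves from any namespace: the second candidate, api-gateway.backend.svc.cluster.local, matches the Service named api-gateway in namespace backend.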
Cross-Namespace Communication
Pods can reach Services in other namespaces by using the fully qualified domain name: <service-name>.<namespace>.svc.cluster.local. If this fails while same-namespace resolution works, the cause is almost always a NetworkPolicy restricting ingress or egress between namespaces.
# Test cross-namespace connectivity
kubectl exec -n frontend deploy/web-app -- \
wget -qO- --timeout=3 http://api-gateway.backend.svc.cluster.local/health
# If it times out, check NetworkPolicies in the target namespace
kubectl get networkpolicy -n backend
# NAME POD-SELECTOR AGE
# restrict-all app=api-gw 2d
# Inspect the policy — does it allow ingress from the frontend namespace?
kubectl describe networkpolicy restrict-all -n backend
# NetworkPolicy that allows traffic from the "frontend" namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-ingress
namespace: backend
spec:
podSelector:
matchLabels:
app: api-gw
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: frontend
ports:
- protocol: TCP
port: 8080
Storage Issues
Persistent storage in Kubernetes is mediated by two resources: PersistentVolumeClaims (PVCs) that Pods request, and PersistentVolumes (PVs) that represent the actual storage. A provisioner (usually via a StorageClass) dynamically creates PVs to satisfy PVC requests. When this chain breaks, Pods that depend on the volume get stuck in Pending.
PVC Stuck in Pending
# Check PVC status
kubectl get pvc
# NAME STATUS STORAGECLASS CAPACITY AGE
# postgres-data Pending fast-ssd 8m
# Get the reason from events
kubectl describe pvc postgres-data
# Events:
# Warning ProvisioningFailed storageclass.storage.k8s.io "fast-ssd" not found
# List available StorageClasses
kubectl get storageclass
# NAME PROVISIONER RECLAIMPOLICY
# standard kubernetes.io/gce-pd Delete
# premium-rwo pd.csi.storage.gke.io Delete
# Fix: delete the PVC and recreate with a valid StorageClass
# (storageClassName is immutable after creation)
kubectl delete pvc postgres-data
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: premium-rwo
resources:
requests:
storage: 20Gi
EOF
Other causes of Pending PVCs: the provisioner Pod itself is not running (check kube-system), the cloud provider hit a quota limit on disk volumes, or the PVC requests a specific accessMode (like ReadWriteMany) that the provisioner doesn't support. For WaitForFirstConsumer binding mode, the PVC intentionally stays Pending until a Pod that references it is scheduled — this is normal behavior, not an error.
Volume Mount Failures
Even after a PVC is bound, the volume can fail to mount on the node. This typically causes the Pod to be stuck in ContainerCreating status with a FailedAttachVolume or FailedMount event.
# Pod stuck in ContainerCreating
kubectl describe pod postgres-0 | grep -A 10 "Events"
# Events:
# Warning FailedAttachVolume AttachVolume.Attach failed for volume "pvc-9a8b7c":
# rpc error: code = Internal desc = Could not attach volume:
# volume is already attached to node "worker-01"
# This happens when a ReadWriteOnce volume is still attached to another node.
# Common during node drains or Pod rescheduling.
# Check which node currently has the volume attached
kubectl get volumeattachment | grep pvc-9a8b7c
# If the old Pod is gone but the attachment lingers, force-detach:
kubectl delete volumeattachment <attachment-name>
Node Issues
Nodes are the physical (or virtual) machines that run your workloads. When a node has problems, every Pod on it is affected. The kubelet reports node health via conditions, and the node controller in the control plane takes action when conditions degrade — marking nodes as NotReady, evicting Pods, or preventing new scheduling.
NotReady Nodes
A NotReady node means the kubelet has stopped reporting healthy heartbeats to the API server. After the node-monitor-grace-period (default: 40 seconds), the node controller marks it NotReady. After pod-eviction-timeout (default: 5 minutes), Pods are evicted and rescheduled to healthy nodes.
# Identify NotReady nodes
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# worker-01 Ready <none> 45d v1.29.2
# worker-02 NotReady <none> 45d v1.29.2
# worker-03 Ready <none> 45d v1.29.2
# Get conditions and recent events for the bad node
kubectl describe node worker-02 | grep -A 20 "Conditions"
# If you can SSH into the node, check kubelet status
# ssh worker-02
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" --no-pager | tail -30
# Common causes:
# - kubelet process crashed or was OOM-killed
# - Container runtime (containerd/CRI-O) unresponsive
# - Certificate expired (kubelet can't authenticate to API server)
# - Network partition between node and control plane
# Restart kubelet as a quick recovery step
sudo systemctl restart kubelet
Node Resource Pressure
Even when a node is Ready, it can be under resource pressure — triggering Pod evictions. The kubelet monitors three thresholds and sets corresponding conditions when they're breached.
| Condition | Default Threshold | What Triggers It | Recovery Action |
|---|---|---|---|
| MemoryPressure | memory.available < 100Mi | Too many Pods without memory limits, memory leaks, or undersized nodes | Evict BestEffort Pods first, then Burstable. Set proper requests and limits. |
| DiskPressure | nodefs.available < 10% | Container images filling disk, large log files, or emptyDir volumes growing unbounded | Prune unused images (crictl rmi --prune), configure log rotation, set emptyDir.sizeLimit. |
| PIDPressure | pid.available < 1000 | Applications forking too many processes | Set pids-limit in container runtime config. Investigate the process-leaking container. |
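The eviction order under pressure depends on each Pod's QoS class, and the classification rules are simple enough to sketch: Guaranteed when every container sets CPU and memory limits equal to its requests, BestEffort when no container sets any requests or limits, Burstable otherwise. A simplified version of the standard rules:

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod by the standard Kubernetes QoS rules (simplified)."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and "cpu" in c["requests"] and "memory" in c["requests"]
        and all(c["requests"].get(r) == c["limits"].get(r) for r in ("cpu", "memory"))
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "500m"}}]))                   # Burstable
print(qos_class([{}]))                                              # BestEffort
```

This is why setting requests and limits is the single most effective eviction defense: Guaranteed Pods are the last candidates when the kubelet needs to reclaim resources.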
# Check actual vs. allocatable resources on a specific node
kubectl describe node worker-03 | grep -A 8 "Allocated resources"
# Allocated resources:
# (Total limits may be over 100 percent.)
# Resource Requests Limits
# -------- -------- ------
# cpu 3800m (95%) 7200m (180%)
# memory 6Gi (78%) 12Gi (150%)
# ephemeral-storage 0 (0%) 0 (0%)
# This node is at 95% CPU requests — no new Pods can schedule here.
# Limits over 100% mean overcommit — safe until Pods actually burst.
If you're troubleshooting resource pressure regularly, you need better observability — not faster debugging. Set up Prometheus alerts on kube_node_status_condition for pressure conditions, node_memory_MemAvailable_bytes for memory headroom, and node_filesystem_avail_bytes for disk. Catch these issues at 80% utilization, not at 100%. The previous section on Monitoring with Prometheus and Grafana covers this setup in detail.
Putting It All Together: A Real Debugging Session
Real incidents rarely involve a single, obvious failure. Here's a realistic end-to-end example: a new deployment rollout appears stuck, and you need to find out why.
# 1. What's the Deployment status?
kubectl rollout status deployment/checkout-api --timeout=10s
# Waiting for deployment "checkout-api" rollout to finish:
# 1 out of 3 new replicas have been updated...
# 2. Which Pods are problematic?
kubectl get pods -l app=checkout-api
# NAME READY STATUS RESTARTS
# checkout-api-6b8f9d7c4-old1 1/1 Running 0
# checkout-api-6b8f9d7c4-old2 1/1 Running 0
# checkout-api-6b8f9d7c4-old3 1/1 Running 0
# checkout-api-85d4a3f1b-new1 0/1 CrashLoopBackOff 4
# 3. Why is the new Pod crashing?
kubectl logs checkout-api-85d4a3f1b-new1 --previous
# Error: FATAL: password authentication failed for user "checkout"
# Connection to database refused
# 4. Check what secret the new Pod references
kubectl get pod checkout-api-85d4a3f1b-new1 -o jsonpath=\
'{.spec.containers[0].envFrom}' | jq .
# [{ "secretRef": { "name": "checkout-db-creds-v2" } }]
# 5. Does that secret exist?
kubectl get secret checkout-db-creds-v2
# Error from server (NotFound): secrets "checkout-db-creds-v2" not found
# Root cause: the new Deployment version references a secret that
# hasn't been created yet. Create it and the rollout will proceed.
This five-step flow — rollout status → identify failing Pods → check logs → inspect config → trace the dependency — works for the vast majority of deployment issues. The key habit is to let each command's output guide your next command, rather than guessing randomly. Systematic beats fast every time.
Horizontal and Vertical Pod Autoscaling (HPA & VPA)
Kubernetes offers two complementary autoscaling dimensions. Horizontal Pod Autoscaling (HPA) adjusts the number of Pod replicas — more traffic means more Pods. Vertical Pod Autoscaling (VPA) adjusts the resource requests and limits on each Pod — the same number of Pods, but each one gets more (or less) CPU and memory. Together they let your workloads respond to demand without manual intervention.
Understanding when to use each — and when not to combine them — is the key to a stable, cost-efficient cluster. This section walks through the algorithms, APIs, configuration knobs, and YAML manifests for both.
graph LR
M["Metrics Server /
Custom Metrics Adapter"] -->|current metrics| HPA["HPA Controller"]
M -->|resource usage| VPA["VPA Recommender"]
HPA -->|scale replicas| D["Deployment / ReplicaSet"]
VPA -->|update requests & limits| D
D --> P1["Pod"]
D --> P2["Pod"]
D --> P3["Pod +/-"]
style HPA fill:#3b82f6,color:#fff,stroke:#2563eb
style VPA fill:#8b5cf6,color:#fff,stroke:#7c3aed
style M fill:#f59e0b,color:#fff,stroke:#d97706
Horizontal Pod Autoscaler (HPA)
The HPA is a built-in Kubernetes controller that periodically (every 15 seconds by default) fetches metrics, computes the desired replica count, and patches the target workload's scale subresource. It ships with the control plane — no extra installation is required. You only need a Metrics Server (for CPU/memory) or a custom metrics adapter (for application-level metrics) to supply the data it reads.
The HPA Algorithm
The core formula is deceptively simple:
desiredReplicas = ceil( currentReplicas x ( currentMetricValue / desiredMetricValue ) )
For example, if you have 3 replicas, current average CPU utilization is 80%, and the target is 50%, the calculation is ceil(3 x (80 / 50)) = ceil(4.8) = 5. The HPA scales you to 5 replicas. When multiple metrics are configured, the HPA computes a desired replica count for each metric and takes the maximum — the most aggressive scale-up wins.
Pods that are not yet ready, or that have no metrics (just started), are handled conservatively. During a scale-down calculation, Pods with missing metrics are assumed to be consuming 100% of the target; during a scale-up, 0%. Both assumptions dampen the magnitude of the change, preventing premature scale-down and overshooting scale-up.
The HPA does not act on every tiny fluctuation. It has a default tolerance of 0.1 (10%). If the ratio currentMetric / desiredMetric is within [0.9, 1.1], no scaling action is taken. This prevents thrashing on noisy metrics.
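The formula and tolerance band can be written out directly. A sketch (the controller's real logic adds readiness and missing-metric handling on top):

```python
import math

def desired_replicas(current: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """desiredReplicas = ceil(current * currentMetric / targetMetric),
    with no action when the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:  # within [0.9, 1.1]: no scaling
        return current
    return math.ceil(current * ratio)

print(desired_replicas(3, 80, 50))  # ceil(3 * 1.6) = 5 — the worked example above
print(desired_replicas(5, 52, 50))  # ratio 1.04, inside the tolerance band -> stays 5
```

Note the asymmetry built into ceil: the function rounds up, so the HPA errs toward slightly more capacity rather than slightly less.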
HPA v2 API — Metric Types
The modern HPA API (autoscaling/v2, stable since Kubernetes 1.23) supports four distinct metric sources. This is a major upgrade over v1, which only supported CPU percentage.
| Metric Type | Source | Use Case |
|---|---|---|
| Resource | Metrics Server (metrics.k8s.io) | Scale on CPU or memory utilization as a percentage of requests. |
| Pods | Custom Metrics Adapter (custom.metrics.k8s.io) | Scale on a per-pod metric like requests_per_second or queue_depth from Prometheus. |
| Object | Custom Metrics Adapter (custom.metrics.k8s.io) | Scale on a metric attached to another Kubernetes object, like an Ingress's requests-per-second. |
| External | External Metrics Adapter (external.metrics.k8s.io) | Scale on a metric from outside the cluster — an SQS queue length, a Pub/Sub subscription backlog, etc. |
Here is a complete HPA manifest that uses three of these metric types simultaneously. The HPA evaluates all of them and picks the one that demands the most replicas:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: order-service
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
minReplicas: 3
maxReplicas: 30
metrics:
# 1. Resource metric — keep average CPU at 60%
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
# 2. Pods metric — custom metric from Prometheus adapter
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
# 3. External metric — SQS queue depth
- type: External
external:
metric:
name: sqs_queue_length
selector:
matchLabels:
queue: order-processing
target:
type: Value
value: "50"
The scaleTargetRef points at the workload to scale. minReplicas and maxReplicas set the guardrails — the HPA will never go below 3 or above 30 regardless of what the metrics say. Each item in the metrics array independently computes a desired replica count, and the largest value wins.
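The multi-metric rule plus the min/max guardrails can be sketched in a few lines:

```python
import math

def hpa_decision(current: int, metrics: list[tuple[float, float]],
                 min_replicas: int, max_replicas: int) -> int:
    """Each (current, target) metric pair proposes a replica count;
    the largest proposal wins, clamped to the configured bounds."""
    proposals = [math.ceil(current * (cur / tgt)) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))

# CPU proposes 4 replicas, requests-per-second proposes 8, queue depth
# proposes 6 -> the most aggressive proposal (8) wins.
print(hpa_decision(4, [(60, 60), (2000, 1000), (75, 50)], 3, 30))  # 8
```

The "max wins" rule is deliberate: if any single metric says the workload is under-provisioned, the HPA believes it, even when the other metrics look healthy.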
Scaling Behavior — Stabilization and Policies
Raw autoscaling with no throttling can be dangerous. A brief CPU spike could add 20 Pods, then immediately remove them when the spike subsides, causing cascading restarts. The behavior field gives you fine-grained control over how fast the HPA scales in each direction.
There are two key concepts: stabilization windows decide how long the HPA looks back to avoid reacting to transient spikes. Scaling policies limit how many replicas can be added or removed in a given time period, expressed as either an absolute number (Pods) or a percentage (Percent) of current replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: order-service
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
minReplicas: 3
maxReplicas: 30
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
behavior:
scaleUp:
stabilizationWindowSeconds: 0 # React immediately to scale-up need
policies:
- type: Percent
value: 100 # Allow doubling every 60s
periodSeconds: 60
- type: Pods
value: 5 # Or add at least 5 pods per 60s
periodSeconds: 60
selectPolicy: Max # Use whichever policy allows MORE pods
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 10 # Remove at most 10% per minute
periodSeconds: 60
selectPolicy: Min # Use the most conservative policy
This configuration follows a common production pattern: scale up aggressively, scale down conservatively. The scale-up has no stabilization window and allows doubling, so the workload reacts quickly to genuine traffic surges. The scale-down uses a 5-minute stabilization window and limits removal to 10% of replicas per minute, preventing capacity from dropping too fast after a brief traffic dip.
| Behavior Field | Default (Scale Up) | Default (Scale Down) | Purpose |
|---|---|---|---|
| stabilizationWindowSeconds | 0 | 300 (5 min) | Looks back over this window and picks the highest (up) or lowest (down) recommended replica count. |
| policies[].type | — | — | Pods (absolute count) or Percent (of current replicas). |
| policies[].value | — | — | The max pods/percent that can change per periodSeconds. |
| policies[].periodSeconds | — | — | Time window for the policy (1–1800 seconds). |
| selectPolicy | Max | Max | Max picks the policy that allows the most change. Min picks the most restrictive. Disabled blocks scaling in that direction entirely. |
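The interaction of Percent and Pods policies with selectPolicy can be sketched as a per-period cap on how many replicas may be added. A simplified model (the real controller tracks change over the full periodSeconds window):

```python
import math

def max_scale_up(current: int, policies: list[dict], select: str = "Max") -> int:
    """How many replicas the HPA may add within one policy period.
    Each policy produces a cap; selectPolicy chooses among the caps."""
    caps = []
    for p in policies:
        if p["type"] == "Pods":
            caps.append(p["value"])
        elif p["type"] == "Percent":
            caps.append(math.ceil(current * p["value"] / 100))
    return max(caps) if select == "Max" else min(caps)

# The scaleUp policies from the manifest above: Percent 100 and Pods 5.
policies = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 5}]
print(max_scale_up(3, policies))   # Percent allows +3, Pods allows +5 -> 5
print(max_scale_up(20, policies))  # Percent allows +20, Pods allows +5 -> 20
```

This is exactly why the manifest pairs the two policy types: at small replica counts the Pods policy dominates (so scaling isn't glacial), while at large counts the Percent policy takes over (so scaling stays proportional).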
Inspecting HPA Status
After creating an HPA, you can see what the controller is doing at any time:
# Quick overview — shows targets, current values, and replica count
kubectl get hpa order-service -n production
# Detailed status — shows each metric, conditions, and events
kubectl describe hpa order-service -n production
# Watch scaling decisions in real time
kubectl get hpa order-service -n production -w
If the TARGETS column shows <unknown>/60%, it means the metrics pipeline is broken. Check that Metrics Server is running (kubectl get pods -n kube-system | grep metrics-server) and that your Pods have resources.requests set — the HPA cannot compute utilization percentage without a denominator.
Vertical Pod Autoscaler (VPA)
While HPA answers "how many Pods?", VPA answers "how big should each Pod be?" The Vertical Pod Autoscaler watches actual resource consumption over time and adjusts the requests and limits on containers. This is critical for workloads where developers have no idea what to request — and in practice, initial resource estimates are almost always wrong.
VPA is not built into the control plane. You install it separately (the project lives at kubernetes/autoscaler on GitHub). It consists of three components:
| Component | Role |
|---|---|
| Recommender | Reads historical and real-time resource usage from the Metrics Server. Computes recommended requests for each container. |
| Updater | Checks running Pods against the recommendation. If a Pod's requests are significantly off, it evicts the Pod so it gets recreated with new values. |
| Admission Controller | Intercepts Pod creation and mutates the requests/limits fields to match the current recommendation — so new Pods start right-sized. |
VPA Update Modes
The updatePolicy.updateMode field controls how aggressively the VPA acts. Choosing the right mode depends on your tolerance for Pod restarts.
| Mode | Behavior | Pod Restarts? | Best For |
|---|---|---|---|
| Off | Produces recommendations only. Does not change Pod resources. | No | Observation and right-sizing analysis. Safe starting point. |
| Initial | Applies recommendations at Pod creation time only. Does not touch running Pods. | No (existing Pods) | Workloads where you control rollout timing (e.g., via CI/CD deploys). |
| Auto | Applies recommendations at creation and evicts running Pods to update them. This is the fully automatic mode. | Yes | Non-critical workloads that tolerate occasional restarts. |
Here is a VPA resource in Off mode — the safest way to start. It will produce recommendations you can review without affecting any running Pods:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: order-service-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
updatePolicy:
updateMode: "Off"
resourcePolicy:
containerPolicies:
- containerName: order-service
minAllowed:
cpu: 50m
memory: 64Mi
maxAllowed:
cpu: 2
memory: 2Gi
controlledResources: ["cpu", "memory"]
The resourcePolicy sets guardrails. Without minAllowed and maxAllowed, the VPA could recommend absurdly small or large values. The controlledResources field lets you limit VPA to only CPU or only memory if needed.
Once the recommender has gathered enough data (give it at least a few hours, ideally 24 hours), inspect the recommendations:
kubectl describe vpa order-service-vpa -n production
The output includes four recommendation tiers: lowerBound, target, uncappedTarget (ignores your min/max constraints), and upperBound. In most cases, you want to use the target value when manually adjusting manifests.
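The relationship between uncappedTarget and target is a straightforward clamp against the containerPolicies bounds. A sketch, assuming values already normalized to plain numbers (millicores for CPU, MiB for memory):

```python
def capped_target(uncapped: dict, min_allowed: dict, max_allowed: dict) -> dict:
    """The `target` tier is `uncappedTarget` clamped into [minAllowed, maxAllowed]."""
    return {
        res: max(min_allowed.get(res, 0), min(max_allowed.get(res, float("inf")), val))
        for res, val in uncapped.items()
    }

# Recommender wants 30m CPU / 3Gi memory, but the policy floors CPU at
# 50m and caps memory at 2Gi (2048 MiB):
print(capped_target({"cpu": 30, "memory": 3072},
                    {"cpu": 50, "memory": 64},
                    {"cpu": 2000, "memory": 2048}))
# {'cpu': 50, 'memory': 2048}
```

Comparing target against uncappedTarget in the VPA status tells you whether your guardrails are actively constraining the recommendation, which is itself a useful signal that the bounds may need revisiting.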
VPA in Auto Mode
When you are confident in the VPA's recommendations and your workload can handle Pod evictions, switch to Auto:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: order-service-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: order-service
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 4Gi
controlledResources: ["cpu", "memory"]
In Auto mode, the Updater component periodically compares each Pod's actual requests against the recommendation. If the difference exceeds a threshold, it evicts the Pod. The Deployment controller creates a replacement, and the VPA Admission Controller mutates the new Pod's resource requests to the recommended values. This means there will be brief disruptions — make sure you have a PodDisruptionBudget in place.
Why You Should Not Use HPA and VPA on CPU Together
This is one of the most common autoscaling mistakes. At first glance it seems logical: let HPA scale replica count based on CPU, and let VPA right-size each Pod's CPU requests. In practice, it creates a feedback loop that makes both controllers fight each other.
graph TD
A["CPU usage rises"] --> B["HPA adds replicas"]
B --> C["CPU usage per pod drops"]
C --> D["VPA lowers CPU requests"]
D --> E["Utilization % appears higher
(same usage, lower request)"]
E --> A
style A fill:#ef4444,color:#fff,stroke:#dc2626
style D fill:#8b5cf6,color:#fff,stroke:#7c3aed
style B fill:#3b82f6,color:#fff,stroke:#2563eb
Here is the conflict step by step: HPA computes utilization as currentUsage / request. When VPA lowers the request value, the utilization percentage jumps — even though the actual CPU consumption has not changed. The HPA sees high utilization and adds more replicas. More replicas reduce actual per-pod usage, so VPA lowers requests further. This cycle continues until you hit maxReplicas with tiny per-pod resource requests.
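The loop is visible in three lines of arithmetic: utilization is usage divided by request, so lowering the request raises the percentage even when absolute usage is flat. A sketch:

```python
def utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """HPA-style utilization: actual usage as a percentage of the request."""
    return 100 * usage_millicores / request_millicores

# Same absolute CPU usage (250m), three successive VPA request reductions:
for request in (500, 400, 300):
    print(f"request={request}m -> utilization={utilization_pct(250, request):.1f}%")
# 50% -> 62.5% -> 83.3%: the HPA sees "rising" utilization and adds
# replicas, even though the container never used more than 250 millicores.
```

With a 60% target, the second VPA reduction alone is enough to push the workload over the scale-up threshold without any change in real load.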
If you need both, follow these rules: (1) Use HPA on a custom metric (like requests-per-second or queue depth) — not CPU or memory. (2) Let VPA manage CPU and memory requests. This way the two controllers operate on completely independent signals with no overlap. Alternatively, use VPA in Off mode purely for recommendations and manage requests manually.
HPA vs. VPA — When to Use Which
| Criteria | HPA | VPA |
|---|---|---|
| Workload can be horizontally scaled (stateless, no sticky sessions) | ✅ Primary choice | Use alongside for right-sizing |
| Workload cannot add replicas (single-instance DB, singleton worker) | ❌ Not applicable | ✅ Primary choice |
| Traffic is spiky and unpredictable | ✅ Reacts in seconds | Slower — requires eviction |
| Resource requests are unknown or drifting | Not its job | ✅ Built for this |
| Cost optimization / right-sizing | Prevents over-provisioned replica counts | ✅ Prevents over-provisioned per-pod resources |
| Zero disruption requirement | ✅ No Pod eviction | ⚠️ Auto mode evicts Pods (use Off or Initial) |
Putting It Together — A Production-Ready Example
A real-world setup often combines HPA on a custom metric with VPA in Off or Initial mode. Below is a Deployment with resource requests, an HPA that scales on HTTP request rate, and a VPA in recommendation-only mode to continuously right-size the requests over time.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-gateway
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: api-gateway
template:
metadata:
labels:
app: api-gateway
spec:
containers:
- name: api-gateway
image: myregistry/api-gateway:v2.4.1
ports:
- containerPort: 8080
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
---
# hpa.yaml — scale on custom metric, NOT cpu
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-gateway
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-gateway
minReplicas: 3
maxReplicas: 50
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "500"
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
---
# vpa.yaml — recommendation only, no conflicts with HPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-gateway-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-gateway
updatePolicy:
updateMode: "Off"
resourcePolicy:
containerPolicies:
- containerName: api-gateway
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 2
memory: 2Gi
With this setup, the HPA handles scaling replica count based on actual traffic (requests per second). The VPA in Off mode watches resource usage and continuously generates right-sizing recommendations. Periodically — during a maintenance window or as part of a deployment cycle — you read the VPA recommendations and update the Deployment's resources.requests accordingly. No feedback loop, no conflicts, the best of both worlds.
You can build a CI/CD step that queries VPA recommendations via kubectl get vpa api-gateway-vpa -o jsonpath='{.status.recommendation}' and opens a pull request to update the Deployment manifest. This gives you the benefits of VPA's analysis without any runtime Pod evictions — a pattern sometimes called "VPA as a recommender."
Cluster Autoscaler and KEDA — Infrastructure and Event-Driven Scaling
HPA and VPA scale your Pods — they add replicas or resize containers. But what happens when the cluster itself runs out of room? If there are no nodes with enough CPU or memory to schedule a new Pod, HPA's additional replicas sit in Pending forever. This is where infrastructure-level scaling steps in.
Two tools dominate this space. The Cluster Autoscaler watches for unschedulable Pods and provisions new nodes from your cloud provider. KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with external event sources — message queues, databases, cron schedules — and adds the ability to scale workloads to and from zero. Together they close the loop: KEDA and HPA scale workloads, and the Cluster Autoscaler scales the infrastructure beneath them.
How the Cluster Autoscaler Works
The Cluster Autoscaler runs as a Deployment inside your cluster (typically in kube-system). It performs two operations on a continuous loop: scale-up when Pods can't be scheduled, and scale-down when nodes are underutilized. It does not watch CPU or memory metrics directly — it watches for Pod scheduling failures.
Scale-up is triggered when the scheduler cannot place a Pod on any existing node because of insufficient resources, taints, affinity rules, or other constraints. The Cluster Autoscaler simulates adding a node from each configured node group and picks the one that would allow the pending Pod(s) to schedule. It then calls the cloud provider API to add that node.
Scale-down is triggered when a node's resource utilization (based on requests, not actual usage) falls below a configurable threshold for a sustained period. Before removing a node, the autoscaler checks that all Pods on it can be rescheduled elsewhere, that none are controlled by a controller that would block eviction (e.g., Pods with local storage, PodDisruptionBudgets that can't be satisfied), and that the node isn't annotated to prevent scale-down.
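PodDisruptionBudgets deserve special mention because they are the most common reason an apparently idle node refuses to drain. A minimal example (the app name is illustrative): with three replicas and this PDB, the autoscaler may evict a Pod only when at least two others remain available elsewhere.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-api-pdb
  namespace: production
spec:
  minAvailable: 2        # Evictions that would leave fewer than 2 ready Pods are refused
  selector:
    matchLabels:
      app: order-api
```

Conversely, a node can be exempted from scale-down entirely by annotating it with cluster-autoscaler.kubernetes.io/scale-down-disabled=true.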
flowchart TB
Start["Autoscaler Loop\n(every scan-interval)"] --> CheckPending{"Unschedulable\nPods exist?"}
CheckPending -->|Yes| Simulate["Simulate scheduling\nagainst each node group"]
Simulate --> Expand["Select best node group\n(expander strategy)"]
Expand --> ScaleUp["Call cloud API:\nincrease node group size"]
ScaleUp --> Start
CheckPending -->|No| CheckUtil{"Any node below\nutilization threshold?"}
CheckUtil -->|Yes| CheckSafe{"All Pods safely\nreschedulable?"}
CheckSafe -->|Yes| Drain["Cordon & drain node"]
Drain --> ScaleDown["Call cloud API:\ndecrease node group size"]
ScaleDown --> Start
CheckSafe -->|No| Start
CheckUtil -->|No| Start
Cloud Provider Integration
The Cluster Autoscaler doesn't manage VMs directly. It talks to your cloud provider's node group abstraction — the mechanism that manages a pool of identically configured machines. Each provider uses a different primitive, but the concept is the same: a group of nodes that can be scaled by changing a "desired count" value.
| Cloud Provider | Node Group Primitive | How Autoscaler Interacts |
|---|---|---|
| AWS (EKS) | Auto Scaling Groups (ASGs) | Modifies the ASG's DesiredCapacity. Each ASG maps to a node group with a specific instance type, AMI, and launch template. Supports mixed instance types via ASG mixed instance policies. |
| GCP (GKE) | Managed Instance Groups (MIGs) | Adjusts the MIG's target size. GKE's built-in autoscaler uses the same logic but is integrated natively — you enable it via gcloud or the console rather than deploying a separate controller. |
| Azure (AKS) | Virtual Machine Scale Sets (VMSSs) | Changes the VMSS instance count. AKS integrates the Cluster Autoscaler natively — you configure it per node pool with az aks nodepool update --enable-cluster-autoscaler. |
On GKE and AKS, the Cluster Autoscaler is a native feature you enable per node pool — there's no need to deploy the autoscaler yourself. On EKS (and self-managed clusters), you install it as a Helm chart or Deployment. In all cases, the underlying logic is the same open-source Cluster Autoscaler project.
Configuring the Cluster Autoscaler
The Cluster Autoscaler exposes several key parameters that control how aggressively it scales up and how cautiously it scales down. Getting these right is the difference between responsive scaling and either wasted spend or prolonged scheduling delays.
| Parameter | Default | What It Controls |
|---|---|---|
| --scan-interval | 10s | How often the autoscaler checks for unschedulable Pods and underutilized nodes. Lower values react faster but increase API server load. |
| --scale-down-delay-after-add | 10m | Cooldown after a scale-up before scale-down is considered. Prevents thrashing when a newly added node is still stabilizing. |
| --scale-down-delay-after-delete | 0s (scan-interval) | Cooldown after a node is removed before another can be removed. Controls how quickly the cluster shrinks. |
| --scale-down-unneeded-time | 10m | How long a node must remain underutilized before it becomes eligible for removal. Guards against premature removal from temporary dips. |
| --scale-down-utilization-threshold | 0.5 | A node is considered underutilized if the sum of requested resources (CPU or memory) is below this fraction of its capacity. 0.5 means <50% utilized. |
| --max-graceful-termination-sec | 600 | Maximum time to wait for Pods to terminate gracefully during node drain before forceful eviction. |
| --skip-nodes-with-local-storage | true | When true, nodes with Pods using emptyDir volumes won't be removed. Set to false if your workloads can tolerate local data loss. |
Here's a Helm values file that configures the Cluster Autoscaler for a production EKS cluster with tuned parameters:
# cluster-autoscaler-values.yaml
autoDiscovery:
clusterName: my-production-cluster
tags:
- k8s.io/cluster-autoscaler/enabled
- k8s.io/cluster-autoscaler/my-production-cluster
extraArgs:
scan-interval: 10s
scale-down-delay-after-add: 10m
scale-down-delay-after-delete: 0s
scale-down-unneeded-time: 10m
scale-down-utilization-threshold: "0.5"
skip-nodes-with-local-storage: "false"
expander: least-waste
balance-similar-node-groups: "true"
max-node-provision-time: 15m
rbac:
create: true
serviceAccount:
create: true
name: cluster-autoscaler
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ClusterAutoscalerRole
# Install with Helm
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
-f cluster-autoscaler-values.yaml
Node Group Auto-Discovery
Rather than hard-coding every ASG or MIG name, you can tell the Cluster Autoscaler to discover node groups automatically based on resource tags (AWS, Azure) or labels (GCP). This is the recommended approach — when your infrastructure team creates a new node group with the right tags, the autoscaler picks it up automatically.
# AWS: Tag your ASGs with these two tags
# Key: k8s.io/cluster-autoscaler/enabled Value: true
# Key: k8s.io/cluster-autoscaler/<cluster-name> Value: owned
# The autoscaler discovers them with:
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled=true,\
k8s.io/cluster-autoscaler/my-cluster=owned
# Min/max size come from the ASG configuration itself. For scaling a node
# group up from zero, you can hint a node's resources via template tags:
# Key: k8s.io/cluster-autoscaler/node-template/resources/cpu
# Key: k8s.io/cluster-autoscaler/node-template/resources/memory
Expander Strategies
When multiple node groups could satisfy the pending Pod(s), the expander decides which one to grow. This is a critical choice that directly affects cost and bin-packing efficiency. You set it with --expander=<strategy>.
| Strategy | How It Chooses | Best For |
|---|---|---|
random | Picks a node group at random from the candidates. Simple and fast. | Homogeneous clusters where all node groups have the same instance type. Spreads load evenly by chance. |
most-pods | Picks the node group whose new node would schedule the most pending Pods. | Batch workloads with many identical small Pods. Maximizes the impact of each new node. |
least-waste | Picks the node group whose new node would have the least idle resources after scheduling pending Pods. Calculates waste as unused CPU + unused memory fractions. | Mixed workloads with varying resource requests. Optimizes for cost by minimizing leftover capacity. Recommended for most production clusters. |
priority | Uses a ConfigMap to define an ordered priority list of node groups. Falls back to lower-priority groups when higher-priority ones are at max capacity. | Clusters with spot/preemptible nodes alongside on-demand nodes. You prioritize cheap capacity and fall back to expensive capacity. |
Here's the ConfigMap used by the priority expander. Higher numbers mean higher priority:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-autoscaler-priority-expander
namespace: kube-system
data:
priorities: |-
50:
- .*spot.* # Prefer spot node groups (regex match on ASG name)
30:
- .*arm64-ondemand.* # Fall back to cheaper ARM on-demand
10:
- .*x86-ondemand.* # Last resort: x86 on-demand
KEDA — Scaling Based on External Events
The standard HPA scales based on CPU, memory, or custom metrics exposed via the Kubernetes metrics API. But many real-world scaling decisions depend on signals that live outside the cluster: the depth of a Kafka topic, the length of a RabbitMQ queue, a Prometheus query result, or a cron schedule. KEDA (Kubernetes Event-Driven Autoscaling) bridges this gap.
KEDA is a lightweight component that acts as a Kubernetes metrics adapter. It reads external event sources (called scalers) and feeds those metrics into the standard HPA machinery. This means you get all of HPA's stabilization, scaling policies, and behavior — but driven by any event source KEDA supports. Crucially, KEDA adds one capability that HPA alone cannot provide: scaling to and from zero replicas.
flowchart LR
subgraph External["External Event Sources"]
Kafka["Kafka Topic"]
RMQ["RabbitMQ Queue"]
Prom["Prometheus"]
SQS["AWS SQS"]
Cron["Cron Schedule"]
end
subgraph KEDA_NS["KEDA Components"]
Operator["KEDA Operator"]
Adapter["Metrics Adapter"]
end
subgraph K8s["Kubernetes"]
HPA["HPA"]
Deploy["Deployment / Job"]
Pods["Pods (0 to N)"]
end
Kafka --> Operator
RMQ --> Operator
Prom --> Operator
SQS --> Operator
Cron --> Operator
Operator -->|"creates & manages"| HPA
Operator -->|"scale 0 to 1"| Deploy
Adapter -->|"serves metrics"| HPA
HPA -->|"scale 1 to N"| Deploy
Deploy --> Pods
KEDA splits the scaling responsibility. The KEDA operator handles the zero-to-one and one-to-zero transitions (since HPA requires at least 1 replica to calculate metrics). Once there's at least one replica, the HPA takes over for the one-to-N scaling, using metrics fed by KEDA's metrics adapter. This architecture means KEDA doesn't replace HPA — it enhances it.
Installing KEDA
KEDA installs cleanly via Helm and runs in its own namespace. It deploys three components: the operator, the metrics API server, and an admission webhook for validating ScaledObject configurations.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--version 2.16.0
# Verify the installation
kubectl get pods -n keda
ScaledObject — Scaling Deployments and StatefulSets
A ScaledObject is KEDA's primary CRD. It binds an external event source to a Kubernetes workload (Deployment, StatefulSet, or any resource with a /scale subresource). When you create a ScaledObject, KEDA automatically creates and manages an HPA behind the scenes.
Here's a ScaledObject that scales a consumer Deployment based on a Kafka topic's consumer lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-processor-scaler
namespace: production
spec:
scaleTargetRef:
name: order-processor # Deployment name
pollingInterval: 15 # Check trigger every 15s
cooldownPeriod: 120 # Wait 120s before scaling to zero
  minReplicaCount: 0               # Minimum replicas (0 = scale-to-zero)
maxReplicaCount: 50 # Maximum replicas
fallback:
failureThreshold: 3 # After 3 failed polls...
replicas: 5 # ...fall back to 5 replicas
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: order-processor-group
topic: orders
lagThreshold: "10" # Scale up when lag > 10 per partition
activationLagThreshold: "1" # Activate (0 to 1) when lag > 1
Setting minReplicaCount to 0 is the key to scale-to-zero. When the Kafka lag drops below activationLagThreshold, KEDA waits for cooldownPeriod seconds, then scales the Deployment to zero. When new messages arrive, KEDA scales from zero to one and hands off to HPA for further scaling. KEDA also offers a separate idleReplicaCount field, but the only value it currently supports is 0, and it requires minReplicaCount to be greater than zero; for plain scale-to-zero, minReplicaCount: 0 is all you need.
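To see what the HPA side computes once KEDA has activated the workload, here is a simplified sketch of the kafka trigger's replica math (the partition cap reflects the scaler's default behavior of not running more consumers than partitions; real KEDA does more, this is an illustration):

```python
import math

def kafka_scaler_replicas(total_lag: int, lag_threshold: int,
                          partitions: int, max_replicas: int) -> int:
    # HPA AverageValue target: desired = ceil(metric / threshold),
    # capped by partition count (extra consumers would sit idle)
    # and by the ScaledObject's maxReplicaCount.
    desired = math.ceil(total_lag / lag_threshold)
    return min(desired, partitions, max_replicas)

print(kafka_scaler_replicas(total_lag=200, lag_threshold=10,
                            partitions=32, max_replicas=50))    # 20
print(kafka_scaler_replicas(total_lag=9000, lag_threshold=10,
                            partitions=32, max_replicas=50))    # 32 (partition cap)
```

The second call shows why adding partitions is sometimes the real scaling lever: past 32 consumers, extra replicas would have no partition to read from.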
ScaledJob — Scaling Kubernetes Jobs
Not every workload is a long-running Deployment. For batch processing — where each item in a queue should trigger a discrete Job that runs to completion — KEDA offers the ScaledJob CRD. Instead of scaling replicas, it creates new Job instances proportional to the event source's backlog.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
name: video-encoder
namespace: media
spec:
jobTargetRef:
template:
spec:
containers:
- name: encoder
image: myregistry/video-encoder:3.2
envFrom:
- secretRef:
name: sqs-credentials
restartPolicy: Never
backoffLimit: 3
pollingInterval: 10
maxReplicaCount: 20 # Max 20 concurrent Jobs
successfulJobsHistoryLimit: 10
failedJobsHistoryLimit: 5
scalingStrategy:
strategy: accurate # Create exactly as many Jobs as queue items
triggers:
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/video-jobs
queueLength: "1" # 1 Job per message
awsRegion: us-east-1
authenticationRef:
name: aws-credentials
Common KEDA Triggers
KEDA ships with 60+ built-in scalers. Here are the most commonly used triggers and when you'd reach for each one:
| Trigger | Event Source | Typical Use Case |
|---|---|---|
| kafka | Apache Kafka consumer group lag | Scale consumers to keep up with message throughput. Scales per-partition. |
| rabbitmq | RabbitMQ queue depth | Scale workers processing task queues. Supports both AMQP and HTTP API protocols. |
| redis | Redis list length or stream pending count | Scale based on Redis-backed job queues (Sidekiq, Celery with Redis broker). |
| prometheus | Any Prometheus query result | Scale on custom business metrics — request latency, error rate, active users. Very flexible. |
| cron | Time-based schedule | Pre-scale before known traffic peaks (e.g., scale up at 8 AM, scale down at 8 PM). |
| aws-sqs-queue | AWS SQS queue depth | Scale processors for SQS-based job queues. Integrates with IRSA for auth. |
| azure-servicebus | Azure Service Bus queue/topic message count | Scale handlers for Azure messaging workloads. |
| postgresql | PostgreSQL query result | Scale based on pending row count in a work table. Useful for database-driven job patterns. |
| metrics-api | Any HTTP JSON endpoint | Scale on custom API responses — anything that returns a number. |
Prometheus Trigger Example
The prometheus trigger is the most versatile — if you can express it as a PromQL query that returns a scalar, you can scale on it. This example scales an API gateway based on the request rate and includes a cron trigger for predictive pre-scaling:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: api-gateway-scaler
namespace: production
spec:
scaleTargetRef:
name: api-gateway
minReplicaCount: 2 # Always keep at least 2 replicas
maxReplicaCount: 30
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
query: |
sum(rate(http_requests_total{service="api-gateway"}[2m]))
threshold: "100" # Scale up when RPS exceeds 100 per replica
activationThreshold: "5" # Only activate from zero at 5 RPS
- type: cron # Combine triggers for pre-scaling
metadata:
timezone: America/New_York
start: 0 7 * * 1-5 # 7 AM weekdays
end: 0 9 * * 1-5 # 9 AM weekdays
desiredReplicas: "10" # Pre-scale for morning traffic
A ScaledObject can have multiple triggers — KEDA uses the highest replica count recommended by any trigger. This is powerful for layering strategies: use a Prometheus trigger for reactive scaling and a cron trigger for predictive pre-scaling. The replica count will be the greater of the two at any given moment.
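The max rule is simple to sketch. With the numbers from this manifest, 600 RPS at 8 AM yields ceil(600/100) = 6 replicas from the Prometheus trigger while the active cron window asks for 10, so 10 wins (an illustrative helper, not KEDA code):

```python
import math

def desired_from_triggers(rps: float, rps_per_replica: float,
                          cron_active: bool, cron_replicas: int,
                          min_r: int = 2, max_r: int = 30) -> int:
    candidates = [math.ceil(rps / rps_per_replica)]   # prometheus trigger
    if cron_active:
        candidates.append(cron_replicas)              # cron trigger window
    # KEDA/HPA take the highest recommendation, clamped to min/max
    return max(min_r, min(max(candidates), max_r))

print(desired_from_triggers(600, 100, cron_active=True, cron_replicas=10))   # 10
print(desired_from_triggers(600, 100, cron_active=False, cron_replicas=10))  # 6
print(desired_from_triggers(40, 100, cron_active=False, cron_replicas=10))   # 2
```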
KEDA Authentication
Many external event sources require credentials. KEDA handles this through TriggerAuthentication and ClusterTriggerAuthentication CRDs, which decouple credentials from the ScaledObject. This lets you reuse auth configurations and keep secrets out of your scaling manifests.
# TriggerAuthentication pulling from a Kubernetes Secret
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: kafka-auth
namespace: production
spec:
secretTargetRef:
- parameter: sasl # KEDA trigger parameter name
name: kafka-credentials # Kubernetes Secret name
key: sasl-config # Key inside the Secret
---
# Reference it in the ScaledObject trigger:
# triggers:
# - type: kafka
# authenticationRef:
# name: kafka-auth
# metadata:
# bootstrapServers: kafka:9092
How Cluster Autoscaler and KEDA Work Together
The real power of these tools emerges when they operate as a pipeline. KEDA detects that your Kafka lag is climbing and tells HPA to increase replicas from 3 to 20. The scheduler tries to place 17 new Pods but only 5 fit on existing nodes — the remaining 12 stay Pending. The Cluster Autoscaler detects the unschedulable Pods, provisions 3 new nodes from the appropriate node group, and the scheduler places the Pods as nodes become ready.
sequenceDiagram
participant ES as Event Source (Kafka)
participant KEDA as KEDA Operator
participant HPA as HPA
participant Sched as Scheduler
participant CA as Cluster Autoscaler
participant Cloud as Cloud Provider
ES->>KEDA: Consumer lag = 200
KEDA->>HPA: Target metric = 200, threshold = 10
HPA->>HPA: Desired replicas = 20 (currently 3)
HPA->>Sched: Create 17 new Pods
Sched->>Sched: 5 scheduled, 12 unschedulable
Note over Sched: Insufficient CPU/memory
CA->>CA: Detects 12 Pending Pods
CA->>Cloud: Add 3 nodes to node group
Cloud-->>CA: Nodes provisioned
Sched->>Sched: Place 12 Pods on new nodes
The total time from event spike to all Pods running is dominated by node provisioning — typically 2-5 minutes depending on the cloud provider and instance type. This latency is why it's important to keep a small buffer of headroom capacity, either via a higher minReplicaCount or by using pause Pods (low-priority Pods that reserve node capacity and get preempted when real workloads need the space).
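The pause-Pod pattern can be sketched with a negative-priority PriorityClass and a placeholder Deployment (all names and sizes here are illustrative, not a standard install). The placeholder reserves capacity; when real Pods arrive, the scheduler preempts it, and the evicted placeholder's own rescheduling is what triggers the autoscaler to add a node:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                     # Lower than the default (0), so these Pods are preempted first
globalDefault: false
description: "Placeholder Pods that reserve headroom capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
  namespace: kube-system
spec:
  replicas: 2                  # Two chunks of standby headroom
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # Does nothing; only holds the reservation
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```

Size the placeholder requests to roughly one workload Pod's worth each, so preemption frees exactly the capacity a real Pod needs.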
The Cluster Autoscaler makes decisions based on resource requests, not actual usage. If your Pods don't have CPU and memory requests, the scheduler considers them zero-cost, packs unlimited Pods onto each node, and the Cluster Autoscaler never triggers because Pods are technically "schedulable." This leads to OOM kills and CPU starvation with no automatic remediation. Set realistic requests on every container.
Debugging Autoscaler Behavior
When scaling isn't happening as expected, use these commands to diagnose. The Cluster Autoscaler writes its decision logic into a ConfigMap, and KEDA creates standard HPA objects you can inspect directly.
# ---- Cluster Autoscaler ----
# Check the autoscaler's status ConfigMap for its latest decisions
kubectl get cm cluster-autoscaler-status -n kube-system -o yaml
# View autoscaler logs for scale-up/down events
kubectl logs -n kube-system -l app.kubernetes.io/name=cluster-autoscaler \
--tail=100 | grep -E "Scale|Expanding|Removing"
# See why a Pod is unschedulable
kubectl describe pod <pending-pod-name> | grep -A 5 "Events"
# ---- KEDA ----
# List all ScaledObjects and their status
kubectl get scaledobjects -A
# Inspect the HPA that KEDA created (named keda-hpa-<scaledobject-name>)
kubectl get hpa -A | grep keda
# Check KEDA operator logs for trigger errors
kubectl logs -n keda -l app=keda-operator --tail=50
# Describe a ScaledObject for detailed status
kubectl describe scaledobject order-processor-scaler -n production
Resource Optimization and Cost Management
Kubernetes makes it easy to run workloads — and just as easy to waste money doing it. Studies consistently show that the average Kubernetes cluster runs at 20–35% CPU utilization, meaning most organizations are paying for 2–3x more compute than they actually need. The root causes are predictable: over-provisioned resource requests, idle dev/staging environments running 24/7, and nodes sized without considering pod density tradeoffs.
This section gives you a concrete framework for cutting Kubernetes costs without sacrificing reliability. You will learn to right-size workloads, pack nodes efficiently, leverage cheaper compute for the right workloads, and get visibility into where every dollar goes.
Right-Sizing Workloads with VPA Recommendations
Over-provisioning is the single biggest source of Kubernetes waste. Developers set resource requests and limits during initial deployment — often by guessing — and never revisit them. A pod requesting 1 CPU and 1Gi of memory but actually using 50m CPU and 128Mi is wasting 95% of its reserved CPU and 87.5% of its reserved memory. Those ghost resources block scheduling and inflate your node count.
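The waste percentages are simple arithmetic, and worth checking explicitly since they drive every right-sizing decision that follows:

```python
cpu_used_m, cpu_req_m = 50, 1000     # using 50m of a 1 CPU (1000m) request
mem_used_mi, mem_req_mi = 128, 1024  # using 128Mi of a 1Gi (1024Mi) request

cpu_waste = 1 - cpu_used_m / cpu_req_m      # fraction of reserved CPU sitting idle
mem_waste = 1 - mem_used_mi / mem_req_mi    # fraction of reserved memory sitting idle

print(f"{cpu_waste:.0%} CPU wasted, {mem_waste:.1%} memory wasted")
# 95% CPU wasted, 87.5% memory wasted
```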
The Vertical Pod Autoscaler (VPA) solves this by observing actual resource consumption and recommending — or automatically applying — right-sized requests. Even if you don't enable auto-updating mode, running VPA in recommendation-only mode gives you a data-driven starting point.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Off" # Recommendation-only — no live changes
resourcePolicy:
containerPolicies:
- containerName: api-server
minAllowed:
cpu: 50m
memory: 64Mi
maxAllowed:
cpu: 2
memory: 4Gi
After running VPA for a few days, query the recommendations to see what your workloads actually need:
# View VPA recommendations for a deployment
kubectl get vpa api-server-vpa -o jsonpath='{.status.recommendation}' | jq .
# Example output:
# {
# "containerRecommendations": [{
# "containerName": "api-server",
# "lowerBound": { "cpu": "80m", "memory": "180Mi" },
# "target": { "cpu": "120m", "memory": "256Mi" },
# "upperBound": { "cpu": "350m", "memory": "512Mi" }
# }]
# }
Set resource requests to the VPA target value and limits to near the upperBound. Using lowerBound as your request leaves zero headroom for traffic spikes. The target value already accounts for the p90 usage with a safety margin built in.
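That policy (requests from target, limits near upperBound) is straightforward to mechanize. Below is a sketch that parses the recommendation JSON shown above; the field names match the VPA status, while the quantity helpers are simplified and handle only the suffixes used here:

```python
import json

def cpu_millicores(q: str) -> int:
    # Minimal parser: "120m" -> 120, "2" -> 2000 (not full K8s quantity syntax)
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def mem_mi(q: str) -> int:
    # Minimal parser: "256Mi" -> 256, "2Gi" -> 2048
    if q.endswith("Gi"): return int(float(q[:-2]) * 1024)
    if q.endswith("Mi"): return int(float(q[:-2]))
    raise ValueError(f"unsupported quantity: {q}")

status = json.loads("""
{"containerRecommendations": [{
  "containerName": "api-server",
  "lowerBound": {"cpu": "80m",  "memory": "180Mi"},
  "target":     {"cpu": "120m", "memory": "256Mi"},
  "upperBound": {"cpu": "350m", "memory": "512Mi"}
}]}
""")

rec = status["containerRecommendations"][0]
resources = {
    "requests": {"cpu": f'{cpu_millicores(rec["target"]["cpu"])}m',
                 "memory": f'{mem_mi(rec["target"]["memory"])}Mi'},
    "limits":   {"cpu": f'{cpu_millicores(rec["upperBound"]["cpu"])}m',
                 "memory": f'{mem_mi(rec["upperBound"]["memory"])}Mi'},
}
print(resources)
# {'requests': {'cpu': '120m', 'memory': '256Mi'},
#  'limits': {'cpu': '350m', 'memory': '512Mi'}}
```

A CI step could feed this dict into a Kustomize patch or Helm values file, closing the loop described earlier in "VPA as a recommender."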
ResourceQuotas and LimitRanges: Guardrails Against Waste
Right-sizing individual workloads is not enough if any team can deploy unlimited resources into a shared cluster. ResourceQuotas cap the total resources a namespace can consume, while LimitRanges set per-pod and per-container defaults and ceilings. Together, they prevent resource hoarding and ensure fair sharing across teams.
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-backend-quota
namespace: team-backend
spec:
hard:
requests.cpu: "8" # Total CPU requests across all pods
requests.memory: 16Gi
limits.cpu: "16"
limits.memory: 32Gi
pods: "40" # Max 40 pods in this namespace
persistentvolumeclaims: "10"
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: team-backend
spec:
limits:
- type: Container
default: # Applied if no limits are set
cpu: 500m
memory: 512Mi
defaultRequest: # Applied if no requests are set
cpu: 100m
memory: 128Mi
max:
cpu: 2
memory: 4Gi
min:
cpu: 50m
memory: 64Mi
When a ResourceQuota is active in a namespace, every pod must specify resource requests and limits — otherwise the API server rejects it. The LimitRange fills in defaults for pods that don't specify them, so developers aren't blocked. This combination gives you a safety net: quotas prevent namespace-level runaway, and LimitRanges prevent individual container-level extremes.
Spot and Preemptible Instances
Cloud providers offer spare compute capacity at 60–90% discounts under names like Spot Instances (AWS), Preemptible VMs (GCP), and Spot VMs (Azure). The tradeoff: the provider can reclaim these nodes with as little as 30 seconds notice. This makes them a poor fit for stateful databases but excellent for fault-tolerant workloads that Kubernetes can reschedule automatically.
| Workload Type | Spot-Friendly? | Why |
|---|---|---|
| Stateless APIs behind an HPA | ✅ Yes | HPA replaces lost pods instantly; multiple replicas absorb the loss |
| Batch jobs / CronJobs | ✅ Yes | Jobs have built-in retry; partial progress can be checkpointed |
| CI/CD build runners | ✅ Yes | Builds are idempotent; a failed build simply re-queues |
| Dev/staging environments | ✅ Yes | Brief interruptions are acceptable; nobody carries a pager for staging |
| Stateful databases (Postgres, Redis) | ❌ No | Data loss risk; failover adds latency and complexity |
| Single-replica critical services | ❌ No | No redundancy to absorb the eviction |
The standard pattern is to run a mixed cluster with an on-demand node pool for critical workloads and a spot node pool for everything else. Use taints and tolerations to control which workloads land on spot nodes:
# Deployment tolerating spot node taints + preferring spot nodes
apiVersion: apps/v1
kind: Deployment
metadata:
name: batch-processor
spec:
replicas: 6
selector:
matchLabels:
app: batch-processor
template:
metadata:
labels:
app: batch-processor
spec:
tolerations:
- key: "cloud.google.com/gke-spot"
operator: "Equal"
value: "true"
effect: "NoSchedule"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 90
preference:
matchExpressions:
- key: cloud.google.com/gke-spot
operator: In
values: ["true"]
terminationGracePeriodSeconds: 25 # Less than the 30s eviction notice
containers:
- name: processor
image: myapp/batch-processor:v2.1
resources:
requests:
cpu: 250m
memory: 512Mi
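For a workload like this, the container itself must also cooperate: on SIGTERM it should stop pulling new work and finish the in-flight item within the grace period. A minimal sketch of that shutdown pattern (the queue and work loop are stand-ins; the simulated signal just demonstrates the flow):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True          # stop pulling new work; let the current item finish

signal.signal(signal.SIGTERM, handle_sigterm)

def run(queue):
    done = []
    while queue and not shutting_down:
        done.append(queue.pop(0))                # stand-in for one unit of work
        if len(done) == 1:                       # simulate the eviction notice
            signal.raise_signal(signal.SIGTERM)  # arriving mid-run
    return done, queue                           # leftovers get redelivered elsewhere

done, leftover = run(["a", "b", "c"])
print(done, leftover)   # ['a'] ['b', 'c']
```

Unprocessed items stay in the external queue (SQS, Kafka, etc.), so another replica, possibly on a fresh node, picks them up after the eviction.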
Bin Packing: Maximizing Node Utilization
The default Kubernetes scheduler spreads pods across nodes to maximize availability. This is great for resilience but terrible for cost — you end up with many lightly loaded nodes instead of fewer well-packed ones. Bin packing is the opposite strategy: pack as many pods as possible onto existing nodes before adding new ones.
You can tune the scheduler's scoring plugin to prefer nodes that already have pods on them. The NodeResourcesFit plugin supports a MostAllocated strategy that scores partially-filled nodes higher:
# KubeSchedulerConfiguration — bin packing profile
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: NodeResourcesFit
args:
scoringStrategy:
type: MostAllocated # Pack pods tightly onto existing nodes
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
Bin packing works especially well when paired with the Cluster Autoscaler. The autoscaler removes underutilized nodes when their pods can be rescheduled elsewhere. With bin packing pushing pods onto fewer nodes, the autoscaler has more opportunities to drain and terminate empty or near-empty nodes — directly reducing your bill.
Packing many pods onto one node means losing that node takes out more workloads simultaneously. Only use MostAllocated for workloads that can tolerate brief disruptions (batch jobs, dev environments, stateless services with multiple replicas). For mission-critical production services, keep the default spread strategy and pair it with PodDisruptionBudgets.
Pod Density, Node Size, and Cluster Overhead
Every Kubernetes node reserves a chunk of resources for system daemons — the kubelet, container runtime, OS kernel, and kube-proxy. This is called system reservation, and it is not available to your workloads. The overhead is roughly fixed per node regardless of size, which creates a key tradeoff:
| Strategy | Many Small Nodes | Fewer Large Nodes |
|---|---|---|
| System overhead ratio | High — the per-node reservation is roughly fixed, so it consumes a large fraction of a small node's capacity | Low — same fixed cost amortized over more allocatable resources |
| Blast radius | Small — losing one node affects few pods | Large — losing one node affects many pods |
| Pod density | Limited by max-pods-per-node (default 110) and allocatable resources | Can fit many more pods per node |
| IP address usage | More nodes = more IPs consumed by node networking | Fewer IPs wasted on node overhead |
| Scaling granularity | Fine-grained — can add small increments of capacity | Coarse — each new node adds a large block of capacity |
| Best for | Varied workloads, strict isolation requirements | Homogeneous workloads, cost-optimized steady state |
Here is a concrete example. On a t3.medium (2 vCPU, 4Gi RAM) in AWS EKS, system reservation takes approximately 70m CPU and 574Mi memory. That leaves only 1.93 vCPU and 3.4Gi for pods — a 14% memory overhead. On a m5.4xlarge (16 vCPU, 64Gi RAM), the reservation is about 110m CPU and 1.7Gi memory — a 2.6% memory overhead. Running 8 small nodes instead of 1 large node wastes roughly 3Gi of memory to system reservation alone.
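The same arithmetic works for any node shape. Here is a quick check of the figures above (reservations vary by provider, instance type, and max-pods setting; these numbers match the EKS examples quoted):

```python
def mem_overhead(total_mi: float, reserved_mi: float) -> float:
    # Fraction of a node's memory lost to system reservation
    return reserved_mi / total_mi

small = mem_overhead(4 * 1024, 574)           # t3.medium:  ~14% of memory reserved
large = mem_overhead(64 * 1024, 1.7 * 1024)   # m5.4xlarge: ~2.7%

# 8 small nodes vs 1 large node, both 16 vCPU total:
extra_gi = (8 * 574 - 1.7 * 1024) / 1024      # extra memory lost to overhead
print(f"{small:.1%} {large:.1%} {extra_gi:.1f}Gi")   # 14.0% 2.7% 2.8Gi
```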
graph LR
subgraph Small["8 x t3.medium (2 CPU, 4Gi each)"]
S_Total["Total: 16 CPU, 32Gi"]
S_Reserved["Reserved: 0.56 CPU, 4.6Gi"]
S_Allocatable["Allocatable: 15.4 CPU, 27.4Gi"]
end
subgraph Large["1 x m5.4xlarge (16 CPU, 64Gi)"]
L_Total["Total: 16 CPU, 64Gi"]
L_Reserved["Reserved: 0.11 CPU, 1.7Gi"]
L_Allocatable["Allocatable: 15.9 CPU, 62.3Gi"]
end
S_Total --> S_Reserved --> S_Allocatable
L_Total --> L_Reserved --> L_Allocatable
Cost Visibility: Know Where the Money Goes
You cannot optimize what you cannot measure. Cloud provider bills show you total Kubernetes spend, but they cannot break costs down to the namespace, team, or workload level. Dedicated cost tools fill this gap by combining resource usage metrics with real pricing data.
| Tool | Type | Key Strength | Best For |
|---|---|---|---|
| Kubecost | Commercial (free tier available) | Real-time cost allocation per namespace/label/deployment with savings recommendations | Teams that want a turnkey dashboard with actionable alerts |
| OpenCost | Open source (CNCF sandbox) | Kubecost's cost-allocation engine, fully open. Exposes a cost API you can query programmatically | Teams that want to build custom dashboards or integrate cost into CI/CD |
| AWS Cost Explorer + CUR | Cloud-native | Tag-based cost allocation with EKS split-cost allocation for per-pod costs | AWS-only shops already using Cost Explorer |
| GKE Cost Estimation | Cloud-native | Built into GKE console; breaks down cost by namespace and workload | GCP-only teams wanting zero extra tooling |
| Prometheus + custom metrics | DIY | Full control; combine container_cpu_usage_seconds_total with pricing data | Teams with existing Prometheus and Grafana stacks |
The quickest way to start is deploying OpenCost alongside your existing Prometheus installation. It reads resource metrics, maps them to on-demand pricing, and exposes a cost allocation API:
# Add the OpenCost chart repository, then install via Helm
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
helm install opencost opencost/opencost \
  --namespace opencost --create-namespace \
  --set opencost.prometheus.internal.serviceName=prometheus-server \
  --set opencost.prometheus.internal.namespaceName=monitoring
# Query cost allocation per namespace for the last 24h
kubectl port-forward -n opencost svc/opencost 9090:9090 &
curl -s "http://localhost:9090/allocation/compute?window=24h&aggregate=namespace" | jq '
.data[0] | to_entries[] | {
namespace: .key,
cpu_cost: .value.cpuCost,
memory_cost: .value.ramCost,
total: .value.totalCost
}'
Dev/Staging Environment Strategies
Non-production environments are often the largest source of hidden waste. A staging cluster that mirrors production at full scale runs 24/7 but is only actively used 8–10 hours on weekdays. That is 70% idle time at production-grade cost. Here are three strategies, ordered from easiest to most aggressive:
1. Reduce Replica Counts
The simplest approach: run 1 replica instead of 3 in non-production. Create a Kustomize overlay or Helm values file per environment:
# kustomize/overlays/staging/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
spec:
replicas: 1 # Production uses 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: worker
spec:
replicas: 1 # Production uses 5
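For pure replica changes you can even skip the patch file: Kustomize has a built-in replicas transformer. A minimal overlay kustomization.yaml might look like this (the ../../base path is a hypothetical layout):

```yaml
# kustomize/overlays/staging/kustomization.yaml — sketch
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base        # hypothetical base directory holding the full manifests
replicas:
  - name: api-server  # matches the Deployment name
    count: 1
  - name: worker
    count: 1
```

Apply it with kubectl apply -k kustomize/overlays/staging — no templating, just a declarative override.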
2. Namespace-Level Shutdown
Scale all deployments in a namespace to zero replicas during off-hours. A simple CronJob or CI pipeline can handle this:
# Save current replica counts as annotations, then scale to zero
for deploy in $(kubectl get deploy -n staging -o name); do
replicas=$(kubectl get "$deploy" -n staging -o jsonpath='{.spec.replicas}')
kubectl annotate "$deploy" -n staging original-replicas="$replicas" --overwrite
kubectl scale "$deploy" -n staging --replicas=0
done
# Restore original replica counts in the morning
for deploy in $(kubectl get deploy -n staging -o name); do
replicas=$(kubectl get "$deploy" -n staging \
-o jsonpath='{.metadata.annotations.original-replicas}')
kubectl scale "$deploy" -n staging --replicas="${replicas:-1}"
done
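To run the scale-down on a schedule, the loop above can be wrapped in a CronJob. This is a sketch: the scaler ServiceAccount (with RBAC to get, annotate, and scale Deployments in staging) and the kubectl image tag are assumptions, not part of the original script:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: staging-scale-down
  namespace: staging
spec:
  schedule: "0 19 * * 1-5"            # 7 PM, Monday through Friday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumed SA with RBAC on deployments
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:1.29   # illustrative kubectl image
              command: ["/bin/sh", "-c"]
              args:
                - |
                  for deploy in $(kubectl get deploy -n staging -o name); do
                    replicas=$(kubectl get "$deploy" -n staging -o jsonpath='{.spec.replicas}')
                    kubectl annotate "$deploy" -n staging original-replicas="$replicas" --overwrite
                    kubectl scale "$deploy" -n staging --replicas=0
                  done
```

A mirror-image CronJob scheduled for the morning runs the restore loop.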
3. Cluster Hibernation
For environments not needed overnight or on weekends, scale the entire cluster's node pool to zero. Most managed Kubernetes services support this — your control plane stays up (and costs almost nothing on GKE, about $0.10/hr on EKS) while worker nodes are fully removed. On resume, the Cluster Autoscaler brings nodes back as pods become schedulable.
# GKE: Scale node pool to zero
gcloud container clusters resize staging-cluster \
--node-pool default-pool --num-nodes 0 --zone us-central1-a --quiet
# EKS: Set desired capacity to zero via ASG
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name eks-staging-node-group \
--min-size 0 --desired-capacity 0
Node Type and Size Decision Framework
Choosing the right node type is one of the highest-leverage cost decisions you can make. The wrong choice locks in waste for months. Use this framework to narrow down the decision based on your workload characteristics.
flowchart TD
A["What is your dominant workload pattern?"] --> B{"CPU-bound or\nMemory-bound?"}
B -->|CPU-bound| C{"Burstable or\nsteady utilization?"}
B -->|Memory-bound| D["Memory-optimized instances\n(r-series / e2-highmem)"]
B -->|Balanced| E["General-purpose instances\n(m-series / e2-standard)"]
C -->|Burstable| F["Burstable instances\n(t3/t3a on AWS, e2 on GCP)"]
C -->|Steady| G["Compute-optimized instances\n(c-series / c2/c3)"]
F --> H{"Many small pods\nor few large pods?"}
G --> H
D --> H
E --> H
H -->|"Many small pods\n(under 0.5 CPU each)"| I["Larger nodes for better density\n(8+ vCPU, 16+ Gi RAM)"]
H -->|"Few large pods\n(2+ CPU each)"| J["Size nodes to fit 3-5 pods\nwith 10-15% headroom"]
I --> K["Enable bin packing\n+ Cluster Autoscaler"]
J --> K
Practical Sizing Rules of Thumb
| Rule | Details |
|---|---|
| Target 70–80% allocatable utilization | Below 60% means you are paying for idle capacity. Above 85% leaves no room for traffic spikes or rolling updates (which temporarily double pod count). |
| Size nodes for your largest pod x 3–5 | If your biggest pod requests 2 CPU and 4Gi, nodes should have at least 8 CPU and 16Gi allocatable so you can fit multiple pods and maintain flexibility. |
| Use 2+ node pools | One for general workloads (general-purpose instances) and one for specialized workloads (GPU, memory-heavy). This avoids paying GPU prices for a pod that only needs CPU. |
| Do not go below 2 vCPU / 4Gi per node | After system reservations, very small nodes have too little allocatable capacity. DaemonSets (logging, monitoring, CNI) eat a large percentage of tiny nodes. |
| Account for DaemonSet overhead | Logging agents, monitoring exporters, and CNI plugins run on every node. On a 4Gi node, DaemonSets consuming 300Mi is a 7.5% tax. On a 32Gi node, it is under 1%. |
During a rolling deployment with maxSurge: 25%, Kubernetes creates new pods before terminating old ones. A 10-replica deployment briefly runs 12–13 pods. If your nodes are at 95% allocation, those surge pods cannot schedule and the rollout stalls. Keep at least 20–30% headroom cluster-wide, or size maxSurge and maxUnavailable to work within your capacity.
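If you cannot spare that headroom, constrain the surge instead. A sketch of a Deployment strategy that trades rollout speed for schedulability on tightly packed nodes:

```yaml
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during rollout (11 total)
      maxUnavailable: 1    # tolerate 1 pod below the desired count
```

The rollout takes longer, but it never needs more than one spare pod's worth of capacity.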
Putting It All Together: A Cost Optimization Checklist
Cost optimization is not a one-time project — it is an ongoing practice. Use this as a recurring checklist to keep your cluster lean:
- Deploy VPA in recommendation mode on all namespaces. Review and apply right-sized requests quarterly.
- Set ResourceQuotas and LimitRanges on every namespace. No namespace should have unlimited access to cluster resources.
- Move fault-tolerant workloads to spot instances. Target at least 40–60% of non-critical workloads on spot for maximum savings.
- Enable bin packing if your workloads are predominantly stateless, and pair it with Cluster Autoscaler for automatic node removal.
- Deploy a cost visibility tool (OpenCost or Kubecost). Set alerts for namespaces exceeding their budget by more than 10%.
- Implement off-hours shutdown for dev/staging environments. A scheduled scale-down from 7pm to 8am on weekdays plus full weekends cuts compute costs by 65%.
- Review node sizing annually. Workload profiles change over time. What was optimal six months ago may be wasteful now.
Helm — Kubernetes Package Management
Deploying a real application to Kubernetes usually means managing a collection of interconnected manifests — Deployments, Services, ConfigMaps, Secrets, Ingresses, ServiceAccounts, RBAC rules. These files share values like the application name, image tag, and replica count. Copying and pasting those values across a dozen YAML files is fragile. When you need the same app deployed to staging and production with different configurations, plain YAML falls apart fast.
Helm solves this by introducing templating, packaging, and release management for Kubernetes manifests. Think of it as the apt or brew of Kubernetes — it bundles manifests into reusable, versioned, configurable packages called charts, and tracks every deployment as a release with full rollback history.
Helm v3 Architecture
Helm v2 required a server-side component called Tiller running inside the cluster. Tiller held broad permissions and was a well-known security concern — any user who could reach Tiller could deploy anything to any namespace. Helm v3, released in November 2019, removed Tiller entirely.
In Helm v3, the CLI talks directly to the Kubernetes API server using your existing kubeconfig credentials. Release state — the record of what was installed, which revision is active, and the rendered manifests — is stored as Kubernetes Secrets (by default) or ConfigMaps in the release's target namespace. This means Helm respects your existing RBAC rules with zero extra infrastructure.
graph LR
USER["👤 Developer / CI"]
HELM["Helm CLI"]
API["kube-apiserver"]
SEC["Release Secrets<br/>(namespace-scoped)"]
RES["Deployed Resources<br/>(Pods, Services, etc.)"]
REPO["Chart Repository<br/>(OCI / HTTP)"]
USER --> HELM
HELM -->|"helm install / upgrade"| API
HELM -->|"helm pull / search"| REPO
API --> SEC
API --> RES
SEC -.->|"stores release history<br/>revisions, values, manifests"| API
Each Helm release stores its history as Secrets named sh.helm.release.v1.<release-name>.v<revision> in the namespace where the release is deployed. This makes namespace-scoped RBAC sufficient to control who can install or modify releases — no cluster-admin required for namespace-bound operations.
Core Concepts
Helm has four foundational concepts you need to internalize before running any commands: charts, releases, revisions, and repositories. Every Helm operation maps back to these.
| Concept | What It Is | Analogy |
|---|---|---|
| Chart | A versioned package containing templated Kubernetes manifests, default values, metadata, and optional dependencies. A directory or .tgz archive. | A Debian .deb package or a Homebrew formula |
| Release | A specific instance of a chart deployed to a cluster. One chart can produce multiple releases (e.g., myapp-staging and myapp-prod) with different values. | An installed instance of a package on your system |
| Revision | A snapshot of a release at a point in time. Each helm install, helm upgrade, or helm rollback increments the revision number. Enables full rollback. | A Git commit for your deployment |
| Repository | An HTTP server or OCI registry that hosts packaged charts. Public repos include Artifact Hub, Bitnami, and official project charts. | A package registry like npm or PyPI |
Chart Directory Structure
A Helm chart is a directory with a specific layout. The structure is convention over configuration — Helm looks for files in exact locations. Here is what a well-formed chart looks like:
myapp/
├── Chart.yaml # Chart metadata: name, version, appVersion, dependencies
├── Chart.lock # Locked dependency versions (generated by helm dependency update)
├── values.yaml # Default configuration values
├── values.schema.json # Optional: JSON Schema to validate values
├── templates/ # Kubernetes manifest templates (Go templates)
│ ├── _helpers.tpl # Named template definitions (partials)
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── configmap.yaml
│ ├── hpa.yaml
│ ├── serviceaccount.yaml
│ ├── tests/ # Helm test Pod definitions
│ │ └── test-connection.yaml
│ └── NOTES.txt # Post-install usage instructions (rendered and shown to user)
├── charts/ # Dependency subcharts (populated by helm dependency update)
└── .helmignore # Files to exclude when packaging
Chart.yaml — The Manifest
Chart.yaml is the identity card of your chart. It declares metadata, the chart version (for the package itself), the app version (for the software being deployed), and any dependencies on other charts.
apiVersion: v2
name: myapp
description: A web application with Redis caching
type: application # "application" or "library"
version: 1.2.0 # Chart version — follows SemVer
appVersion: "3.5.1" # Version of the app being deployed
keywords:
- web
- api
maintainers:
- name: Platform Team
email: platform@example.com
dependencies:
- name: redis
version: "18.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: redis.enabled
values.yaml — Default Configuration
values.yaml is where you define every configurable parameter with sensible defaults. Users override these at install time with --set flags or custom value files. Keep it well-commented — this file is the primary interface for anyone consuming your chart.
# -- Number of application replicas
replicaCount: 2
image:
# -- Container image repository
repository: ghcr.io/myorg/myapp
# -- Image pull policy
pullPolicy: IfNotPresent
# -- Overrides the image tag (default is the chart appVersion)
tag: ""
service:
type: ClusterIP
port: 80
ingress:
enabled: false
className: nginx
hosts:
- host: myapp.example.com
paths:
- path: /
pathType: Prefix
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
redis:
enabled: true
Go Template Syntax in Helm
Helm templates are standard Go templates augmented with the Sprig function library and Helm-specific objects. Templates receive a top-level context object (referred to as .) that contains several built-in objects you access with dot notation.
| Object | Contains | Example Usage |
|---|---|---|
| .Values | Merged values from values.yaml and user overrides | {{ .Values.image.repository }} |
| .Release | Release metadata: .Name, .Namespace, .IsInstall, .IsUpgrade, .Revision | {{ .Release.Name }} |
| .Chart | Contents of Chart.yaml: .Name, .Version, .AppVersion | {{ .Chart.AppVersion }} |
| .Capabilities | Cluster info: .APIVersions, .KubeVersion | {{ .Capabilities.KubeVersion.Minor }} |
| .Template | Current template info: .Name, .BasePath | {{ .Template.Name }} |
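To see several of these objects side by side, here is a hypothetical ConfigMap template that stamps release and chart metadata into the cluster (the file name and every key under data: are illustrative, not a Helm convention):

```yaml
# templates/release-info.yaml — sketch exercising the built-in objects above
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-release-info
  namespace: {{ .Release.Namespace }}
data:
  chartName: {{ .Chart.Name | quote }}
  chartVersion: {{ .Chart.Version | quote }}
  appVersion: {{ .Chart.AppVersion | quote }}
  kubeVersion: {{ .Capabilities.KubeVersion.Version | quote }}
  renderedFrom: {{ .Template.Name | quote }}
```

Rendering this with helm template is a quick way to confirm what each object resolves to in your environment.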
Templates in Action
Here is a real templates/deployment.yaml that uses conditionals, ranges, includes, and value references. This single template can produce different Deployment manifests depending on the values passed in.
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "myapp.fullname" . }}
labels:
{{- include "myapp.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount }}
selector:
matchLabels:
{{- include "myapp.selectorLabels" . | nindent 6 }}
template:
metadata:
labels:
{{- include "myapp.selectorLabels" . | nindent 8 }}
spec:
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
{{- if .Values.resources }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
{{- end }}
{{- if .Values.env }}
env:
{{- range $key, $value := .Values.env }}
- name: {{ $key }}
value: {{ $value | quote }}
{{- end }}
{{- end }}
A few details to note about the syntax. The {{- with a dash trims whitespace before the directive, and -}} trims whitespace after — this keeps your rendered YAML clean. The nindent function adds a newline and indentation, solving the most common frustration with Helm: getting YAML indentation right inside templates. The pipe operator | chains functions left to right, just like Unix pipes.
Named Templates with _helpers.tpl
The _helpers.tpl file (the leading underscore tells Helm not to render it as a standalone manifest) defines reusable template snippets using the define action. You call these with include from other templates. This eliminates duplication across your manifests.
{{/*
Expand the name of the chart.
*/}}
{{- define "myapp.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a fully qualified app name, truncated to 63 chars (K8s label limit).
*/}}
{{- define "myapp.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{/*
Chart name and version, used in the helm.sh/chart label.
*/}}
{{- define "myapp.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels applied to every resource.
*/}}
{{- define "myapp.labels" -}}
helm.sh/chart: {{ include "myapp.chart" . }}
{{ include "myapp.selectorLabels" . }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector labels — must be identical on Deployment and Service.
*/}}
{{- define "myapp.selectorLabels" -}}
app.kubernetes.io/name: {{ include "myapp.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
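The payoff shows up when other templates consume these helpers. Here is a sketch of a templates/service.yaml that reuses them — the selector comes from the same myapp.selectorLabels define used in the Deployment, guaranteeing the two always match:

```yaml
# templates/service.yaml — sketch reusing the helpers above
apiVersion: v1
kind: Service
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      protocol: TCP
  selector:
    {{- include "myapp.selectorLabels" . | nindent 4 }}
```

Renaming the app or overriding fullnameOverride now propagates everywhere automatically.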
Essential Helm Commands
The Helm CLI covers the full lifecycle: searching for charts, installing releases, upgrading, rolling back, inspecting, and cleaning up. Here are the commands you will use daily, grouped by operation.
Install, Upgrade, and Rollback
# Add a chart repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Install a chart as a new release
helm install my-redis bitnami/redis \
--namespace caching --create-namespace \
--set auth.password=secretpass \
--set replica.replicaCount=3
# Install from a local chart directory
helm install myapp ./myapp -f values-prod.yaml
# Upgrade an existing release (creates a new revision)
helm upgrade myapp ./myapp \
--set image.tag=3.6.0 \
--reuse-values # keep previously supplied values
# Install OR upgrade in one command (idempotent — great for CI/CD)
helm upgrade --install myapp ./myapp -f values-prod.yaml
# Rollback to a previous revision
helm rollback myapp 2 # roll back to revision 2
# Uninstall a release and all its resources
helm uninstall myapp --namespace default
Inspect and Debug
# Render templates locally WITHOUT deploying (essential for debugging)
helm template myapp ./myapp -f values-staging.yaml
# Render templates and validate against the cluster's API (catches schema errors)
helm template myapp ./myapp --validate
# Dry-run an install/upgrade against the live cluster
helm install myapp ./myapp --dry-run --debug
# Lint a chart for best practices and syntax errors
helm lint ./myapp --strict
# Show computed values for a deployed release
helm get values myapp --all
# Show the rendered manifests of a deployed release
helm get manifest myapp
# List all releases across namespaces
helm list --all-namespaces
# Show release history (all revisions)
helm history myapp
Dependencies
# Download dependencies declared in Chart.yaml into charts/
helm dependency update ./myapp
# List dependency status
helm dependency list ./myapp
# Rebuild charts/ from the existing Chart.lock (reproducible, unlike update)
helm dependency build ./myapp
helm upgrade --install in CI/CD Pipelines
The --install flag makes helm upgrade idempotent: it installs the release if it doesn't exist, or upgrades it if it does. This eliminates the need for conditional logic in your pipeline scripts. Combine it with --atomic to auto-rollback on failure and --wait to block until all resources are ready.
Helm Hooks
Hooks let you run Kubernetes resources at specific points in the release lifecycle — before install, after upgrade, before deletion, and more. A hook is any template with a helm.sh/hook annotation. Common use cases include running database migrations before an upgrade, populating seed data after install, or performing cleanup before uninstall.
| Hook | When It Fires | Typical Use Case |
|---|---|---|
| pre-install | After templates render, before any resources are created | Create a database schema, run preflight checks |
| post-install | After all resources are created | Register with a service mesh, send a Slack notification |
| pre-upgrade | After templates render, before any resources are updated | Run database migrations, take a backup |
| post-upgrade | After all resources are updated | Clear a CDN cache, warm application caches |
| pre-delete | Before any release resources are deleted | Drain connections, export data |
| post-delete | After all resources are deleted | Remove DNS records, clean up external resources |
| pre-rollback | Before rollback | Reverse a database migration |
| post-rollback | After rollback | Notify monitoring systems |
| test | When helm test is invoked | Smoke tests, connectivity checks |
Here is a concrete example — a Job that runs database migrations before each upgrade:
# templates/migrate-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "myapp.fullname" . }}-migrate
annotations:
"helm.sh/hook": pre-upgrade,pre-install
"helm.sh/hook-weight": "-5" # lower weight runs first
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
backoffLimit: 1
template:
spec:
restartPolicy: Never
containers:
- name: migrate
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
command: ["./migrate", "--target", "latest"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: {{ include "myapp.fullname" . }}-db
key: url
The hook-weight controls ordering when multiple hooks fire at the same point — lower numbers run first. The hook-delete-policy controls when the hook resource is cleaned up: before-hook-creation deletes any prior instance before creating the new one, and hook-succeeded deletes it after successful completion. When no policy is specified, Helm applies before-hook-creation by default, so the most recent hook resource (and its logs) lingers until the next release operation replaces it.
Helm Tests
Helm tests are Pods defined in templates/tests/ with the "helm.sh/hook": test annotation. They run when you invoke helm test <release> and validate that a deployed release is actually working — not just that resources were created, but that the application responds correctly.
# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
name: {{ include "myapp.fullname" . }}-test
annotations:
"helm.sh/hook": test
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
restartPolicy: Never
containers:
- name: curl-test
image: curlimages/curl:8.5.0
command:
- sh
- -c
- |
echo "Testing {{ include "myapp.fullname" . }} health endpoint..."
curl --fail --silent --max-time 10 \
http://{{ include "myapp.fullname" . }}:{{ .Values.service.port }}/healthz
echo "Test passed!"
# Run tests against a deployed release
helm test myapp --timeout 60s
# Run tests and view logs on failure
helm test myapp --logs
Subcharts and Dependencies
A chart can depend on other charts. This is how you compose complex stacks — your application chart might depend on a Redis chart and a PostgreSQL chart. Dependencies are declared in Chart.yaml and downloaded into the charts/ directory.
You pass configuration to subcharts by nesting values under the dependency name. The parent chart's values.yaml controls everything — the subchart doesn't need modification.
# Chart.yaml — declaring dependencies
apiVersion: v2
name: myapp
version: 1.2.0
dependencies:
- name: postgresql
version: "13.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled # toggle with values
- name: redis
version: "18.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: redis.enabled
# values.yaml — configuring subcharts via nested keys
postgresql:
enabled: true
auth:
postgresPassword: "changeme"
database: "myapp_db"
primary:
resources:
requests:
cpu: 250m
memory: 256Mi
redis:
enabled: true
auth:
password: "redis-secret"
replica:
replicaCount: 2
# Download/update dependencies
helm dependency update ./myapp
# Install the full stack — app + PostgreSQL + Redis
helm install myapp ./myapp -f values-prod.yaml
# Deploy without the Redis subchart
helm install myapp ./myapp --set redis.enabled=false
Building a Custom Chart from Scratch
The best way to understand Helm is to build a chart end to end. Below is a walkthrough that creates a chart for a Go API server, scaffolds it, customizes the templates, lints, renders, and deploys it.
# 1. Scaffold a new chart
helm create order-api
# 2. Examine what was generated
tree order-api/
# 3. Edit Chart.yaml — set your app metadata
cat > order-api/Chart.yaml <<'EOF'
apiVersion: v2
name: order-api
description: Order processing API service
type: application
version: 0.1.0
appVersion: "1.0.0"
EOF
# 4. Define your values
cat > order-api/values.yaml <<'EOF'
replicaCount: 2
image:
repository: ghcr.io/myorg/order-api
pullPolicy: IfNotPresent
tag: ""
service:
type: ClusterIP
port: 8080
ingress:
enabled: true
className: nginx
hosts:
- host: orders.example.com
paths:
- path: /api/orders
pathType: Prefix
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
probes:
liveness:
path: /healthz
port: 8080
readiness:
path: /readyz
port: 8080
EOF
# 5. Lint the chart
helm lint ./order-api --strict
# 6. Render templates locally to verify output
helm template order-api ./order-api -f order-api/values.yaml
# 7. Dry-run against the cluster (validates API compatibility)
helm install order-api ./order-api --dry-run --debug
# 8. Deploy for real
helm install order-api ./order-api --namespace orders --create-namespace
# 9. Verify the release
helm list -n orders
helm get values order-api -n orders
kubectl get all -n orders
# 10. Run tests
helm test order-api -n orders
The Release Lifecycle Flow
Understanding how install, upgrade, rollback, and uninstall interact with revisions is critical for production operations. Every mutation creates a new revision, and Helm keeps a configurable history depth (default: 10) for rollback.
stateDiagram-v2
[*] --> Deployed : helm install (rev 1)
Deployed --> Deployed : helm upgrade (rev N+1)
Deployed --> Superseded : new revision deployed
Deployed --> Uninstalled : helm uninstall
Deployed --> PendingUpgrade : helm upgrade (in progress)
PendingUpgrade --> Deployed : success
PendingUpgrade --> Failed : error / timeout
Failed --> Deployed : helm rollback (rev N+1)
Failed --> Deployed : helm upgrade --force (rev N+1)
Superseded --> Deployed : helm rollback to this rev
Uninstalled --> [*]
Helm vs. Alternatives
Helm is the most widely adopted Kubernetes packaging tool, but it is not the only option. Each alternative makes different trade-offs between power, complexity, and approach. Your choice depends on team size, use case, and how much you value DRY templating versus straightforward patching.
| Tool | Approach | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Helm | Go templating + packaging + release management | Huge ecosystem of prebuilt charts. Built-in versioning, rollback, and dependency management. OCI registry support. | Go templates can be hard to read. Debugging whitespace issues is frustrating. Complex charts become a maintenance burden. | Teams consuming third-party software (databases, monitoring stacks) and shipping reusable internal charts. |
| Kustomize | Overlay-based patching of plain YAML — no templating | Built into kubectl (kubectl apply -k). No new syntax to learn — just YAML patches. Easy to understand diffs. No logic layer. | No packaging or versioning. No release management. Repetitive for more than a handful of environments. No dependency concept. | Teams managing a small number of environments for their own services. Works well alongside Helm (render with helm template, then patch with kustomize). |
| Jsonnet / Tanka | A data-templating language that generates JSON/YAML programmatically | Full programming language with functions, imports, conditionals, comprehensions. Excellent for complex, highly parameterized configurations. | Steep learning curve. Small ecosystem. Not widely adopted outside of observability teams (Grafana, Prometheus). | Power users generating complex, deeply nested configurations — especially monitoring and alerting stacks. |
| cdk8s | Define Kubernetes resources using TypeScript, Python, Java, or Go — generates YAML | Full programming language. IDE autocompletion. Type safety. Reuse existing test frameworks. | Requires a build step. Overkill for simple deployments. Smaller community than Helm. | Development teams who prefer writing infrastructure in the same language as their application code. |
A common production pattern is to render Helm charts with helm template, then apply Kustomize overlays on top for environment-specific patches. This gives you Helm's ecosystem for third-party charts and Kustomize's simplicity for last-mile customization. ArgoCD and Flux both support this workflow natively.
Production Best Practices
Helm is straightforward to start with but has sharp edges at scale. These practices come from production experience managing hundreds of releases across clusters.
- Pin chart versions in CI/CD. Never use helm install bitnami/redis without a --version flag. A new upstream chart version can break your deployment with zero warning.
- Use values.schema.json. Define a JSON Schema for your chart's values. Helm validates user-supplied values against the schema at install/upgrade time, catching typos and invalid configurations before they reach the cluster.
- Always use --atomic and --timeout in CI/CD. The --atomic flag automatically rolls back if an upgrade fails, preventing half-deployed states. Set --timeout to a reasonable value so failures don't hang your pipeline.
- Limit release history. Set --history-max 5 on upgrades. Each revision stores the full rendered manifest as a Secret, and clusters with hundreds of revisions accumulate significant etcd storage.
- Template locally before deploying. Run helm template and helm lint --strict in CI before any helm upgrade. Catch errors in minutes, not in production.
- Use library charts for shared templates. If multiple charts share the same label conventions, RBAC patterns, or monitoring annotations, extract them into a library chart (type: library in Chart.yaml) and import it as a dependency.
- Store custom charts in an OCI registry. Helm v3.8+ supports OCI registries (e.g., ECR, GHCR, ACR, Harbor) as first-class chart repositories. Use helm push and helm pull oci:// for versioned, authenticated chart distribution.
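The values.schema.json practice deserves a concrete sketch. This minimal schema matches the example values.yaml shown earlier in the chapter; which fields to mark required and how strict to be are illustrative choices, not requirements:

```json
{
  "$schema": "https://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "replicaCount": { "type": "integer", "minimum": 1 },
    "image": {
      "type": "object",
      "required": ["repository"],
      "properties": {
        "repository": { "type": "string" },
        "pullPolicy": { "enum": ["Always", "IfNotPresent", "Never"] },
        "tag": { "type": "string" }
      }
    }
  }
}
```

With this file at the chart root, helm install --set replicaCount=two fails immediately with a validation error instead of producing a broken Deployment.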
Custom Resource Definitions — Extending the Kubernetes API
Kubernetes ships with a rich set of built-in resources — Pods, Deployments, Services, ConfigMaps — but real-world platforms inevitably need domain-specific abstractions. A Custom Resource Definition (CRD) lets you register an entirely new resource type with the API server, so it can be created, listed, watched, and deleted with kubectl just like any native object.
CRDs are the foundation of the Kubernetes extension model. When you install Prometheus via the Operator, it registers CRDs like Prometheus, ServiceMonitor, and AlertmanagerConfig. When you deploy Istio, it brings VirtualService and DestinationRule. Every major ecosystem tool extends Kubernetes this way. Understanding CRDs is the prerequisite for building operators and designing platform APIs.
How CRDs Extend the API Server
When you apply a CRD manifest, the API server dynamically creates a new RESTful endpoint. No recompilation, no restart — the new resource type is available within seconds. The API server handles storage (in etcd), RBAC authorization, admission control, and watch notifications for your custom resource exactly as it does for built-in types.
sequenceDiagram
participant U as User / kubectl
participant A as API Server
participant E as etcd
U->>A: Apply CRD manifest (kind: CustomResourceDefinition)
A->>E: Store CRD definition
A-->>A: Register new REST endpoint
A-->>U: CRD created
Note over A: New endpoint is now live
U->>A: kubectl apply -f myapp.yaml (kind: MyApp)
A-->>A: Validate against OpenAPI schema in CRD
A->>E: Store MyApp instance
A-->>U: myapp.apps.example.com/my-sample created
U->>A: kubectl get myapps
A->>E: List MyApp objects
A-->>U: Return list with printer columns
The key insight is that a CRD is just data stored in etcd — it tells the API server the shape and rules for your custom resource. The actual behavior (reconciliation, automation) comes from a controller or operator watching those custom resources. CRDs without controllers are still useful for configuration storage, but the real power emerges when you pair them with custom controllers.
Anatomy of a CRD
Every CRD defines four fundamental properties that determine how the resource appears in the API. These map directly to the resource's API path and how users interact with it.
| Field | Purpose | Example |
|---|---|---|
| Group | API group the resource belongs to. Use a domain you own to avoid collisions. | apps.example.com |
| Version | API version string. Follows Kubernetes conventions: v1alpha1 → v1beta1 → v1. | v1alpha1 |
| Kind | The PascalCase name of your resource type as it appears in manifests. | WebApplication |
| Scope | Namespaced or Cluster. Determines whether instances live inside a namespace or are cluster-wide. | Namespaced |
The combination of group, version, and kind (GVK) uniquely identifies a resource type across the entire cluster. The API path follows the pattern /apis/{group}/{version}/namespaces/{ns}/{plural} for namespaced resources, or /apis/{group}/{version}/{plural} for cluster-scoped ones.
Choose Namespaced for resources that belong to a team or application (most CRDs). Choose Cluster only for resources that are inherently global — like cluster-wide policies, storage classes, or infrastructure definitions that span namespaces. You cannot change the scope after creation without deleting and recreating the CRD.
Practical Example: A WebApplication CRD
Let's build a complete CRD for a WebApplication resource. This custom type will represent a web application deployment with its image, replicas, and ingress configuration — the kind of abstraction a platform team might offer to developers.
Step 1 — Define the CRD
This manifest registers the WebApplication type with the API server. Pay attention to the openAPIV3Schema section — it defines exactly what fields users can set and their types.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: webapplications.apps.example.com
spec:
group: apps.example.com
scope: Namespaced
names:
plural: webapplications
singular: webapplication
kind: WebApplication
shortNames:
- webapp
- wa
categories:
- all # Show in `kubectl get all`
versions:
- name: v1alpha1
served: true
storage: true
# --- Schema validation ---
schema:
openAPIV3Schema:
type: object
required: ["spec"]
properties:
spec:
type: object
required: ["image", "replicas"]
properties:
image:
type: string
description: "Container image in repository:tag format."
replicas:
type: integer
minimum: 1
maximum: 100
default: 2
port:
type: integer
minimum: 1
maximum: 65535
default: 8080
ingress:
type: object
properties:
host:
type: string
tlsSecret:
type: string
env:
type: array
items:
type: object
required: ["name", "value"]
properties:
name:
type: string
value:
type: string
status:
type: object
properties:
readyReplicas:
type: integer
conditions:
type: array
items:
type: object
properties:
type:
type: string
status:
type: string
enum: ["True", "False", "Unknown"]
lastTransitionTime:
type: string
format: date-time
message:
type: string
# --- Printer columns for kubectl get ---
additionalPrinterColumns:
- name: Image
type: string
jsonPath: .spec.image
- name: Replicas
type: integer
jsonPath: .spec.replicas
- name: Ready
type: integer
jsonPath: .status.readyReplicas
- name: Host
type: string
jsonPath: .spec.ingress.host
priority: 1 # Only shown with -o wide
- name: Age
type: date
jsonPath: .metadata.creationTimestamp
# --- Subresources ---
subresources:
status: {} # Enable /status subresource
scale: # Enable /scale subresource
specReplicasPath: .spec.replicas
statusReplicasPath: .status.readyReplicas
Step 2 — Apply the CRD and Verify
Once applied, the API server immediately recognizes the new type. You can verify this by checking the API resources list.
# Register the CRD
kubectl apply -f webapplication-crd.yaml
# Verify the CRD is established
kubectl get crd webapplications.apps.example.com
# Confirm it appears in API resources
kubectl api-resources | grep webapp
# webapplications webapp,wa apps.example.com/v1alpha1 true WebApplication
# Inspect the CRD details
kubectl describe crd webapplications.apps.example.com
Step 3 — Create an Instance
With the CRD registered, you create instances of WebApplication using standard kubectl apply. The API server validates the manifest against the OpenAPI schema you defined — try submitting an invalid field and it will be rejected.
apiVersion: apps.example.com/v1alpha1
kind: WebApplication
metadata:
name: frontend
namespace: production
spec:
image: my-company/frontend:2.4.1
replicas: 3
port: 3000
ingress:
host: app.example.com
tlsSecret: app-tls-cert
env:
- name: NODE_ENV
value: production
- name: API_URL
value: https://api.example.com
# Create the custom resource
kubectl apply -f frontend-webapp.yaml
# List all WebApplications (using the short name)
kubectl get webapp -n production
# NAME IMAGE REPLICAS READY AGE
# frontend my-company/frontend:2.4.1 3 <none> 5s
# Detailed view with priority columns
kubectl get webapp -n production -o wide
# Full YAML output
kubectl get webapp frontend -n production -o yaml
Schema Validation with OpenAPI v3
Since Kubernetes 1.16, CRDs require structural schemas. A structural schema means every field must have a declared type, no untyped objects are allowed at any nesting level, and the schema must be self-contained (no external $ref pointers). This isn't just a formality — structural schemas are what enable server-side validation, pruning of unknown fields, and defaulting.
Key Schema Features
| Feature | Schema Keyword | Example |
|---|---|---|
| Required fields | required | required: ["image", "replicas"] |
| Default values | default | default: 2 |
| Range constraints | minimum, maximum | minimum: 1, maximum: 100 |
| String patterns | pattern | pattern: "^[a-z0-9-]+$" |
| Enum values | enum | enum: ["True", "False", "Unknown"] |
| String formats | format | format: date-time |
| Preserve unknown fields | x-kubernetes-preserve-unknown-fields | Set to true to opt a subtree out of pruning and allow arbitrary JSON |
| Immutable fields | x-kubernetes-validations | CEL rule preventing changes after creation |
When a user submits a custom resource, the API server validates it against the schema and prunes any fields not declared in the schema. This means typos like replcias: 3 are silently removed rather than stored — which can be confusing. You can catch these issues by using kubectl apply --dry-run=server -o yaml to see exactly what the API server will store.
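For instance, this manifest (a hypothetical example assuming the WebApplication CRD defined above is installed) is accepted without error, but the misspelled field never reaches etcd:

```yaml
apiVersion: apps.example.com/v1alpha1
kind: WebApplication
metadata:
  name: typo-demo
spec:
  image: my-company/frontend:2.4.1
  replicas: 3
  replcias: 5   # Typo: not declared in the schema, silently pruned on admission
```

A server-side dry run shows the object as it would be stored, with the unknown field already gone, which is the quickest way to spot the mistake before it bites.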
CEL Validation Rules (Kubernetes 1.25+)
OpenAPI schemas handle type-level validation well, but they cannot express cross-field constraints like "if ingress is set, host must not be empty" or "replicas must be odd for quorum-based workloads." Starting in Kubernetes 1.25 (beta) and GA in 1.29, you can embed Common Expression Language (CEL) rules directly in the schema using x-kubernetes-validations.
# Add these to the spec-level schema in your CRD
spec:
type: object
x-kubernetes-validations:
# Cross-field validation: ingress requires a host
- rule: "!has(self.ingress) || has(self.ingress.host)"
message: "ingress.host is required when ingress is specified"
# Enforce image tag is not 'latest'
- rule: "!self.image.endsWith(':latest')"
message: "Using :latest tag is not allowed; pin to a specific version"
properties:
image:
type: string
x-kubernetes-validations:
# Ensure image contains a tag
- rule: "self.contains(':')"
message: "Image must include a tag (e.g., myapp:v1.0)"
replicas:
type: integer
x-kubernetes-validations:
# Transition rule: prevent scaling down more than 50% at once
- rule: "self >= oldSelf / 2"
message: "Cannot scale down by more than 50% in a single update"
CEL rules using oldSelf are transition rules — they compare the new value against the previous one and only apply during updates, not creation. This is how you enforce constraints like "this field is immutable" (self == oldSelf) or "replicas can only increase" (self >= oldSelf) without a validating webhook.
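As a concrete sketch, the two transition rules mentioned above look like this inside a CRD schema (field names illustrative, not part of the WebApplication example):

```yaml
properties:
  storageClass:
    type: string
    x-kubernetes-validations:
      # Immutable after creation: on update, the new value must equal the old
      - rule: "self == oldSelf"
        message: "storageClass is immutable"
  minReplicas:
    type: integer
    x-kubernetes-validations:
      # Monotonic: the value may grow on update but never shrink
      - rule: "self >= oldSelf"
        message: "minReplicas can only increase"
```

Because oldSelf is undefined on creation, these rules only constrain updates — a new object can set any value the type-level schema allows.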
Additional Printer Columns
By default, kubectl get for custom resources shows only NAME and AGE. The additionalPrinterColumns field in the CRD lets you surface important fields in the table output — making your custom resources feel like first-class citizens.
additionalPrinterColumns:
- name: Image
type: string
jsonPath: .spec.image
- name: Replicas
type: integer
jsonPath: .spec.replicas
- name: Ready
type: integer
jsonPath: .status.readyReplicas
- name: Host
type: string
jsonPath: .spec.ingress.host
priority: 1 # Only visible with -o wide
- name: Age
type: date
jsonPath: .metadata.creationTimestamp
Each column maps a jsonPath expression to a named column. The priority field controls visibility: columns with priority: 0 (default) appear in normal output, while higher priorities only show with -o wide. Use this to keep the default view clean while still exposing detailed information when requested.
Subresources: Status and Scale
Subresources give your custom resource separate API endpoints for specific concerns. Kubernetes supports two CRD subresources: status and scale. Without them, the entire resource is a single blob — any controller or user can overwrite any field.
The Status Subresource
Enabling the status subresource splits the resource into two independently updatable halves. Updates to /status can only modify the .status field, and updates to the main resource ignore changes to .status. This separation is critical for the controller pattern: users own .spec (desired state), controllers own .status (observed state).
subresources:
status: {} # Enables PUT /apis/apps.example.com/v1alpha1/.../frontend/status
scale:
specReplicasPath: .spec.replicas
statusReplicasPath: .status.readyReplicas
# labelSelectorPath: .status.selector # Optional, needed for HPA
The Scale Subresource
The scale subresource makes your custom resource compatible with kubectl scale and the Horizontal Pod Autoscaler (HPA). It exposes a standard Scale object at /scale, mapping your spec and status fields to the canonical replica count fields.
# Scale using kubectl (works because of the scale subresource)
kubectl scale webapp frontend --replicas=5 -n production
# The HPA can also target your custom resource
# (CPU-based scaling additionally requires labelSelectorPath in the scale subresource)
kubectl autoscale webapp frontend --min=2 --max=10 --cpu-percent=70 -n production
Versioning Strategies
APIs evolve. You will rename fields, add required properties, or restructure the schema entirely. CRDs support multiple versions in the versions array, each with its own schema, printer columns, and subresource configuration. But only one version can be the storage version — the one used to persist objects in etcd.
Single Version (Simple Case)
If your CRD is internal to your team or still in early development, a single version is fine. Use v1alpha1 to signal instability. Promote to v1beta1 and then v1 as the API stabilizes. When you have only one version, there is no conversion needed.
Multiple Versions with Conversion Webhooks
When you need to serve two versions simultaneously — for example, v1alpha1 for existing users and v1beta1 with a breaking schema change — you deploy a conversion webhook. The API server calls this webhook to translate objects between versions on the fly.
flowchart LR
A["Client requests v1beta1"] --> B["API Server"]
B --> C{"Object stored as\nv1alpha1 in etcd"}
C --> D["Conversion Webhook"]
D --> E["Returns v1beta1\nrepresentation"]
E --> B
B --> A
style D fill:#f9f0ff,stroke:#7c3aed,stroke-width:2px
spec:
conversion:
strategy: Webhook
webhook:
conversionReviewVersions: ["v1"]
clientConfig:
service:
name: webapp-conversion
namespace: webapp-system
path: /convert
port: 443
caBundle: <base64-encoded-CA-cert>
versions:
- name: v1alpha1
served: true
storage: true # Currently the storage version
schema:
openAPIV3Schema:
# ... v1alpha1 schema ...
- name: v1beta1
served: true
storage: false # Served but not stored
schema:
openAPIV3Schema:
# ... v1beta1 schema (may have different field names) ...
The conversion webhook receives a ConversionReview object containing the objects to convert and the target version. It must be able to convert between any two served versions — not just adjacent ones. A common pattern is to use a "hub" version internally and convert to/from every other version through that hub.
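The exchange has this rough shape (UIDs and object payloads abbreviated for illustration; the ellipses stand in for full object bodies):

```yaml
# What the API server sends to the webhook
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
request:
  uid: "f70c2a1e-..."              # Must be echoed back in the response
  desiredAPIVersion: apps.example.com/v1beta1
  objects:
    - apiVersion: apps.example.com/v1alpha1
      kind: WebApplication
      # ... full object as stored in etcd ...
---
# What the webhook must return
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
response:
  uid: "f70c2a1e-..."              # Same UID as the request
  convertedObjects:
    - apiVersion: apps.example.com/v1beta1
      kind: WebApplication
      # ... converted object; metadata must be preserved ...
  result:
    status: Success
```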
Conversion webhooks add operational complexity — they must be highly available (if the webhook is down, all reads and writes to the custom resource fail). For additive, non-breaking changes (adding optional fields, adding a new version with the same schema), you don't need a webhook. Use strategy: None and let the API server round-trip the same data to both versions.
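For that non-breaking case, the conversion stanza collapses to a single line:

```yaml
spec:
  conversion:
    strategy: None   # API server serves the same stored data under every version
```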
CRD Best Practices
| Practice | Why It Matters |
|---|---|
| Always define a structural schema | Without it, you get no validation, no pruning, no defaulting. Raw x-kubernetes-preserve-unknown-fields at the root defeats the purpose. |
| Enable the status subresource | Without it, users can accidentally overwrite controller-managed status, and controllers can clobber user-specified spec fields. |
| Use shortNames and categories | Short names improve daily usability. Adding to the all category means kubectl get all includes your resources. |
| Pin your CRD naming to a domain you own | Prevents naming collisions when multiple tools install CRDs in the same cluster. |
| Set meaningful printer columns | Users should be able to assess resource health from kubectl get output without needing -o yaml. |
| Add CEL validations for business rules | Catch invalid configurations at admission time, not at reconciliation time when the error is harder to surface. |
| Start with v1alpha1 | Signals to users that the API may change. Promote versions deliberately following Kubernetes API conventions. |
Running kubectl delete crd webapplications.apps.example.com immediately removes the CRD and every WebApplication instance in every namespace. There is no confirmation prompt and no undo. In production, protect CRDs with RBAC and consider adding a finalizer that blocks deletion until instances are migrated.
Building Operators — Encoding Operational Knowledge
A Kubernetes Operator is a CRD paired with a custom controller that watches it — together, they encode human operational knowledge into software. Instead of an engineer running a runbook to install, upgrade, back up, or failover a complex stateful system like PostgreSQL or Kafka, the Operator does it automatically. The operator is the runbook, compiled into a reconciliation loop.
The concept was introduced by CoreOS in 2016 and has since become the standard pattern for managing stateful and complex workloads on Kubernetes. If you have already read the section on Custom Resource Definitions, you know how to extend the Kubernetes API with new resource types. Operators take the next step: they give those custom resources a brain.
The Core Pattern: CRD + Controller = Operator
Every Operator follows the same fundamental structure. A Custom Resource Definition declares a new API type (e.g., PostgresCluster), and a custom controller watches instances of that type and acts on them. The controller continuously compares the desired state (what the user declared in the CR) with the actual state (what is running in the cluster), and takes action to close the gap.
graph LR
User["👤 User"] -->|"kubectl apply"| API["API Server"]
API -->|"stores"| ETCD["etcd"]
subgraph Operator["Operator Pod"]
CTRL["Controller / Reconciler"]
end
API -->|"watch events"| CTRL
CTRL -->|"read CR spec"| API
CTRL -->|"create/update owned resources"| API
CTRL -->|"write CR status"| API
CTRL -->|"manages"| DEP["Deployment"]
CTRL -->|"manages"| SVC["Service"]
CTRL -->|"manages"| CM["ConfigMap"]
CTRL -->|"manages"| PVC["PVC"]
The user interacts only with the custom resource. They declare replicas: 3 and version: "15.4" on a PostgresCluster CR — the Operator translates that into the dozens of Kubernetes primitives (StatefulSets, Services, ConfigMaps, PVCs, Jobs) required to make it real. This is the key value proposition: the Operator abstracts away operational complexity behind a clean, domain-specific API.
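From the user's side, that interaction is a single manifest. The following is a hypothetical PostgresCluster CR — field names are illustrative of the pattern, not CloudNativePG's actual API:

```yaml
apiVersion: databases.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  version: "15.4"
  replicas: 3
  storage: 100Gi
  backup:
    schedule: "0 2 * * *"            # Nightly backup at 02:00
    destination: s3://backups/orders-db
```

Ten lines of intent; the operator turns them into StatefulSets, Services, ConfigMaps, PVCs, and backup Jobs, and keeps them all converged.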
The Operator Maturity Model
Not all Operators are created equal. The Operator Capability Model, originally defined by the Operator Framework project, classifies operators into five levels based on the depth of operational knowledge they encode. Each level subsumes the previous one.
| Level | Name | Capabilities | Example |
|---|---|---|---|
| 1 | Basic Install | Automated provisioning via CR. Installs the application with sensible defaults. No lifecycle management beyond initial deployment. | Helm-based operator that templates and applies manifests |
| 2 | Seamless Upgrades | Supports version upgrades and configuration changes with minimal disruption. Handles rollbacks on failure. | Operator that performs rolling upgrades of a database cluster |
| 3 | Full Lifecycle | Backup, restore, and disaster recovery. The Operator can recreate the application's state from a snapshot. | CloudNativePG: continuous WAL archiving + point-in-time recovery |
| 4 | Deep Insights | Exposes operational metrics, logs, and alerts. Integrates with Prometheus, dashboards, and alerting pipelines. | Prometheus Operator: auto-generates scrape configs and alert rules |
| 5 | Auto Pilot | Automatic scaling, self-healing, tuning, and anomaly detection. Makes operational decisions without human input. | Operator that auto-scales read replicas based on query latency |
Reaching Level 5 (Auto Pilot) requires encoding deep domain expertise — auto-tuning PostgreSQL's shared_buffers or rebalancing Kafka partitions based on broker load. Very few operators reach this level. If you are building an operator, aim for Level 3 as a solid production baseline: install, upgrade, backup, and restore.
The Reconciliation Loop — How Controllers Think
At the heart of every Operator is the reconciliation loop, powered by the controller-runtime library. This is the same pattern used by Kubernetes' built-in controllers (Deployment controller, ReplicaSet controller), but applied to your custom resources. The loop follows a precise sequence.
flowchart TD
A["Informer watches API Server"] -->|"Event: Create/Update/Delete"| B["Work Queue"]
B -->|"Dequeue item"| C["Reconcile(request)"]
C --> D{"Desired == Actual?"}
D -->|"Yes"| E["Return success — done"]
D -->|"No"| F["Take corrective action"]
F --> G["Create / Update / Delete owned resources"]
G --> H["Update CR status"]
H --> I{"Error?"}
I -->|"Yes"| J["Requeue with backoff"]
I -->|"No"| E
J --> B
The controller does not receive a stream of events and react to each one. Instead, controller-runtime uses informers (cached watches) and a work queue to deduplicate and batch events. Your Reconcile function receives only a name and namespace — it must fetch the current state itself and decide what to do. This design demands idempotent reconciliation: calling Reconcile ten times in a row must produce the same result as calling it once.
Key Principles of Reconciliation
- Level-triggered, not edge-triggered. Your reconciler reacts to the current state of the world, not to "what changed." It re-reads the resource on every invocation and computes the full diff.
- Idempotent. If the reconciler creates a Service that already exists, it should update it or skip it — never crash or duplicate it.
- Optimistic concurrency. Kubernetes uses resourceVersion on every object. If two reconcilers try to update the same resource, one gets a conflict error and requeues.
- Requeue on failure. If any step fails, return an error. The controller-runtime will requeue the item with exponential backoff (default: 5ms to 16 minutes).
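These principles can be seen in miniature with a toy reconciler over in-memory maps. This is a sketch of the level-triggered, idempotent pattern, not controller-runtime code: each pass compares full desired state against full actual state, takes corrective actions to close the gap, and a second pass over converged state is a no-op.

```go
package main

import "fmt"

// desired and actual map a resource name to its replica count.
// reconcile closes the gap between them and returns how many
// corrective actions it took. It is level-triggered: it looks only
// at current state, never at "what changed".
func reconcile(desired, actual map[string]int) int {
	actions := 0
	// Create anything missing; update anything that drifted.
	for name, want := range desired {
		if got, ok := actual[name]; !ok || got != want {
			actual[name] = want
			actions++
		}
	}
	// Delete anything not in the desired state (garbage collection).
	for name := range actual {
		if _, ok := desired[name]; !ok {
			delete(actual, name)
			actions++
		}
	}
	return actions
}

func main() {
	desired := map[string]int{"frontend": 3, "worker": 2}
	actual := map[string]int{"frontend": 1, "old-svc": 1}

	fmt.Println("first pass, actions:", reconcile(desired, actual))  // prints 3: fix drift, create, delete
	fmt.Println("second pass, actions:", reconcile(desired, actual)) // prints 0: already converged
}
```

Calling reconcile repeatedly is safe precisely because it computes the full diff every time — the same property your Reconcile function must have when the work queue hands it the same name twice.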
Comparing Operator Frameworks
You do not build an Operator from scratch. Several frameworks scaffold the boilerplate — project layout, RBAC manifests, CRD generation, leader election setup — so you focus on writing the Reconcile function. Here is how the major frameworks compare.
| Framework | Language / Approach | Strengths | Best For |
|---|---|---|---|
| Kubebuilder | Go (controller-runtime) | The upstream standard. Generates CRD YAMLs from Go types via markers. Tight integration with controller-runtime and controller-tools. Used by most production operators. | Teams comfortable with Go who need full control over reconciliation logic. |
| Operator SDK | Go, Ansible, or Helm | Builds on Kubebuilder for Go operators, but adds first-class support for Ansible playbooks and Helm charts as operator backends. Includes OLM (Operator Lifecycle Manager) integration and scorecard testing. | Teams that want Ansible/Helm-based operators (Level 1–2), or Go operators that integrate with OLM for marketplace distribution. |
| KUDO | Declarative YAML (plans & steps) | Define operator behavior entirely in YAML — no code. Uses "plans" (install, upgrade, backup) composed of "steps" and "tasks." Good for encoding multi-step procedures. | Operations teams without Go expertise who need to encode multi-step Day-2 workflows. |
| Metacontroller | Any language (webhook-based) | You write a sync webhook in any language (Python, Node.js, etc.). Metacontroller handles the watch/queue/reconcile infrastructure and calls your webhook with the parent resource and its children. | Polyglot teams, rapid prototyping, or when Go is not an option. |
Kubebuilder is the upstream project that Operator SDK's Go support is built on. If you are writing a Go-based operator, starting with Kubebuilder gives you the thinnest abstraction layer and the broadest community support. Use Operator SDK if you specifically need Ansible/Helm operator types or OLM integration. Use Metacontroller if your team does not write Go.
Common Operator Patterns
Regardless of which framework you choose, production operators share a set of recurring implementation patterns. These patterns solve real problems around ownership, cleanup, status reporting, and high availability.
Owned Resources and OwnerReferences
When your operator creates a Deployment, Service, or ConfigMap on behalf of a CR, it sets an ownerReference on the child resource pointing back to the CR. This gives you two things for free: garbage collection (when the CR is deleted, Kubernetes automatically deletes all owned resources) and watch filtering (controller-runtime can map events on owned resources back to the parent CR for re-reconciliation).
// Set the CR as the owner of the child Deployment
if err := ctrl.SetControllerReference(myCR, deployment, r.Scheme); err != nil {
return ctrl.Result{}, err
}
// Now if myCR is deleted, this Deployment is garbage-collected automatically
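On the child object, the result is an ownerReferences entry in metadata like the following (UID illustrative; the shape follows the standard Kubernetes ownerReference fields):

```yaml
metadata:
  name: my-sample
  ownerReferences:
    - apiVersion: myapp.example.com/v1
      kind: MyApp
      name: my-sample
      uid: "3f2d9c7b-..."         # UID of the owning custom resource
      controller: true             # marks this as the managing controller
      blockOwnerDeletion: true     # foreground deletion waits for this child
```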
Status Conditions
Operators report health and progress through status conditions — a standardized pattern borrowed from core Kubernetes resources (Pods have Ready, Initialized, etc.). Each condition has a type, status (True/False/Unknown), reason, and message. This lets users and monitoring tools query the CR's status programmatically.
status:
conditions:
- type: Ready
status: "True"
reason: AllReplicasRunning
message: "3/3 replicas are running and healthy"
lastTransitionTime: "2024-11-15T10:30:00Z"
- type: BackupComplete
status: "True"
reason: ScheduledBackupSucceeded
message: "Last backup completed at 2024-11-15T06:00:00Z"
lastTransitionTime: "2024-11-15T06:00:12Z"
Finalizers
Sometimes deleting a CR requires cleanup that goes beyond Kubernetes — removing external DNS records, deprovisioning cloud resources, or flushing data to object storage. Finalizers solve this. Your operator adds a finalizer string to the CR's metadata when it creates external resources. When a user deletes the CR, Kubernetes sets the deletionTimestamp but does not remove the object until all finalizers are cleared. Your reconciler detects the deletion, performs cleanup, then removes the finalizer to let the delete proceed.
const finalizerName = "myapp.example.com/cleanup"
// In Reconcile:
if myCR.ObjectMeta.DeletionTimestamp.IsZero() {
// CR is NOT being deleted — ensure finalizer is present
if !controllerutil.ContainsFinalizer(myCR, finalizerName) {
controllerutil.AddFinalizer(myCR, finalizerName)
return ctrl.Result{}, r.Update(ctx, myCR)
}
} else {
// CR IS being deleted — run cleanup logic
if controllerutil.ContainsFinalizer(myCR, finalizerName) {
if err := r.deleteExternalResources(ctx, myCR); err != nil {
return ctrl.Result{}, err // requeue until cleanup succeeds
}
controllerutil.RemoveFinalizer(myCR, finalizerName)
return ctrl.Result{}, r.Update(ctx, myCR)
}
}
Leader Election
Operators typically run as a Deployment with multiple replicas for availability. But you do not want two replicas simultaneously reconciling the same resource — that causes conflicts and race conditions. Leader election ensures only one replica actively reconciles at a time. The others remain on hot standby and take over if the leader fails. Controller-runtime provides built-in leader election using a Lease resource in the operator's namespace.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
LeaderElection: true,
LeaderElectionID: "myapp-operator-lock",
// Only the leader processes reconcile events.
// Standbys maintain informer caches for fast failover.
})
A Simplified Go Operator: Reconciling a WebApp CR
The following example shows a minimal but realistic operator reconciler built with Kubebuilder. It watches a custom WebApp resource and ensures a matching Deployment and Service exist with the correct replica count and container image. This is a Level 1 operator — basic install and configuration.
package controllers
import (
"context"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
myappv1 "github.com/example/webapp-operator/api/v1"
)
type WebAppReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// +kubebuilder:rbac:groups=myapp.example.com,resources=webapps,verbs=get;list;watch;create;update;patch
// +kubebuilder:rbac:groups=myapp.example.com,resources=webapps/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// 1. Fetch the WebApp CR
var webapp myappv1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil // CR deleted, owned resources auto-cleaned
}
return ctrl.Result{}, err
}
// 2. Define the desired Deployment
replicas := int32(webapp.Spec.Replicas)
deploy := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
Spec: appsv1.DeploymentSpec{
Replicas: &replicas,
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{"app": webapp.Name},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{"app": webapp.Name},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{{
Name: "web",
Image: webapp.Spec.Image,
Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
}},
},
},
},
}
// 3. Set owner reference for garbage collection
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
return ctrl.Result{}, err
}
// 4. Create or update the Deployment
var existing appsv1.Deployment
err := r.Get(ctx, client.ObjectKeyFromObject(deploy), &existing)
if errors.IsNotFound(err) {
log.Info("Creating Deployment", "name", deploy.Name)
return ctrl.Result{}, r.Create(ctx, deploy)
} else if err != nil {
return ctrl.Result{}, err
}
// Update if spec drifted
existing.Spec.Replicas = &replicas
existing.Spec.Template.Spec.Containers[0].Image = webapp.Spec.Image
log.Info("Updating Deployment", "name", deploy.Name)
return ctrl.Result{}, r.Update(ctx, &existing)
}
// SetupWithManager registers watches for WebApp and owned Deployments
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&myappv1.WebApp{}). // watch WebApp CRs
Owns(&appsv1.Deployment{}). // watch owned Deployments
Complete(r)
}
Notice the structure: fetch the CR, build the desired state, set owner references, then create-or-update. The SetupWithManager method at the bottom is where the magic happens — For() tells controller-runtime to watch WebApp resources, and Owns() means "also watch Deployments that have an ownerReference pointing back to a WebApp, and if they change, re-reconcile the parent WebApp." This is how drift detection works automatically.
Notable Production Operators
Before building your own operator, check if one already exists. The ecosystem has mature, battle-tested operators for the most common stateful workloads. Studying their source code is also one of the best ways to learn operator patterns.
| Operator | Manages | Maturity Level | Key Capabilities |
|---|---|---|---|
| Prometheus Operator | Prometheus, Alertmanager, Thanos | Level 4 (Deep Insights) | CRDs for ServiceMonitor, PrometheusRule, AlertmanagerConfig. Auto-generates scrape configs and alerting rules from CRs. Powers the kube-prometheus-stack. |
| cert-manager | TLS Certificates | Level 3 (Full Lifecycle) | Automates certificate issuance and renewal via Let's Encrypt, Vault, or custom CAs. CRDs: Certificate, Issuer, ClusterIssuer. Handles ACME challenges automatically. |
| Strimzi | Apache Kafka | Level 4 (Deep Insights) | Manages Kafka brokers, ZooKeeper (or KRaft), MirrorMaker, Kafka Connect, and Schema Registry. Handles rolling upgrades, topic management, user authentication, and rack-aware replication. |
| CloudNativePG | PostgreSQL | Level 4–5 | Manages primary + replicas with streaming replication. Continuous WAL archiving to S3/GCS, point-in-time recovery, automated failover, connection pooling via PgBouncer, and declarative backup schedules. |
An Operator is a program running in your cluster with elevated RBAC permissions — it can create, modify, and delete resources on your behalf. A buggy reconciler can cause cascading deletions or infinite update loops. Before deploying any operator, review its RBAC scope, test it in a staging cluster, and monitor its reconciliation error rate. Building your own operator is a serious commitment: you are writing infrastructure software that must handle edge cases, API version skew, and partial failures gracefully.
Putting It Together: When to Build vs. When to Use
Build a custom operator when you have domain-specific operational logic that cannot be expressed with standard Kubernetes primitives or Helm charts — multi-step upgrade procedures, custom health checks, cross-resource coordination, or integration with external systems. A Helm chart can install an application; an operator can operate it through its full lifecycle.
Do not build an operator when a simpler tool will do. If your application is a stateless web service that just needs a Deployment and a Service, a Helm chart or a Kustomize overlay is the right answer. Operators shine for stateful, complex systems where Day-2 operations (upgrades, backup, failover, scaling) are the hard part — and those operations follow well-defined, automatable procedures. The next section explores how to deliver both applications and operators to clusters using GitOps with ArgoCD and Flux.
GitOps with ArgoCD and Flux — Declarative Continuous Delivery
GitOps is an operational model that takes the declarative philosophy Kubernetes was built on and extends it all the way to your delivery pipeline. Instead of running kubectl apply from a CI job or an engineer's laptop, you store your desired cluster state in Git and let an in-cluster agent continuously reconcile reality to match. The result is an auditable, reversible, and fully automated delivery system.
This approach solves a class of problems that traditional CI/CD pipelines struggle with: configuration drift, lack of auditability, credential sprawl, and the gap between "what we deployed" and "what's actually running." Two tools dominate the Kubernetes GitOps landscape — ArgoCD and Flux — and this section covers both in depth.
The Four Principles of GitOps
The OpenGitOps project (a CNCF Sandbox project) formalized GitOps into four principles. These aren't aspirational guidelines — they are concrete architectural constraints that your tooling must enforce.
| Principle | What It Means | In Practice |
|---|---|---|
| Declarative | The entire system's desired state is expressed declaratively | All Kubernetes manifests, Helm values, and Kustomize overlays live as files — no imperative scripts that "create if not exists" |
| Versioned & Immutable | The desired state is stored in a version-controlled source of truth | Git provides history, blame, branching, and the ability to revert any change to any prior commit |
| Pulled Automatically | Agents automatically pull the desired state and apply it | An in-cluster controller (ArgoCD or Flux) watches the Git repo and syncs changes without external triggers |
| Continuously Reconciled | Agents observe actual state and correct drift | If someone manually edits a Deployment via kubectl edit, the GitOps agent reverts the change to match Git |
Storing YAML in Git and having Jenkins kubectl apply it is not GitOps. True GitOps requires a pull-based reconciliation loop running inside the cluster. The distinction matters because the pull model eliminates the need for external systems to hold cluster credentials and enables continuous drift detection — not just deploy-time synchronization.
Push-Based CI/CD vs. Pull-Based GitOps
The fundamental architectural difference between traditional CI/CD and GitOps is who initiates the deployment and where the credentials live. In push-based delivery, your CI server (Jenkins, GitHub Actions, GitLab CI) holds a kubeconfig or service account token and pushes changes into the cluster. In pull-based GitOps, an agent running inside the cluster pulls changes from Git.
flowchart LR
subgraph PUSH["Push-Based CI/CD"]
direction LR
DEV1["Developer"] -->|git push| REPO1["Git Repo"]
REPO1 -->|webhook| CI["CI Server<br/>Jenkins / GH Actions"]
CI -->|"kubectl apply<br/>holds cluster creds"| K8S1["Kubernetes Cluster"]
end
subgraph PULL["Pull-Based GitOps"]
direction LR
DEV2["Developer"] -->|git push| REPO2["Git Repo"]
AGENT["GitOps Agent<br/>ArgoCD / Flux"] -->|poll / webhook| REPO2
AGENT -->|"reconcile<br/>in-cluster access"| K8S2["Kubernetes Cluster"]
end
PUSH ~~~ PULL
| Aspect | Push-Based (CI/CD) | Pull-Based (GitOps) |
|---|---|---|
| Credential location | CI server needs cluster credentials (kubeconfig, tokens) | Agent runs in-cluster — uses Kubernetes RBAC, no external credentials needed |
| Drift detection | None — cluster can diverge silently between pipeline runs | Continuous — agent detects and optionally corrects drift in real time |
| Deployment trigger | Pipeline run (event-driven, one-shot) | Reconciliation loop (continuous, polling or webhook-triggered) |
| Rollback | Re-run an older pipeline or write rollback logic | git revert — the agent syncs the previous state automatically |
| Audit trail | CI logs (may expire or be incomplete) | Git history — immutable, signed commits, PR approvals |
| Multi-cluster | CI needs credentials for every cluster | Each cluster runs its own agent pointing at the same (or different) repo paths |
In practice, most teams use a hybrid: CI handles build, test, and image push, then updates a Git repo (via automated PR or commit), which triggers the GitOps agent to deploy. This keeps the boundary clean — CI owns the artifact pipeline, GitOps owns the delivery pipeline.
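As a sketch of that boundary (GitHub Actions syntax; the image name, repo, paths, and the MANIFESTS_TOKEN secret are all illustrative, not part of this guide's setup), CI builds and pushes the artifact, then commits the new tag to the manifests repo, and the GitOps agent does the rest:

```yaml
# Sketch of the CI half of the hybrid model.
# CI owns build + push; the commit to the manifests repo hands off to GitOps.
name: build-and-promote
on:
  push:
    tags: ["v*"]
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t "ghcr.io/myorg/api:${GITHUB_REF_NAME}" .
          docker push "ghcr.io/myorg/api:${GITHUB_REF_NAME}"
      - name: Commit new tag to the manifests repo
        run: |
          git clone "https://x-access-token:${{ secrets.MANIFESTS_TOKEN }}@github.com/myorg/k8s-manifests.git"
          cd k8s-manifests
          git config user.name "ci-bot" && git config user.email "ci@myorg.com"
          (
            cd apps/my-app/overlays/staging
            kustomize edit set image "ghcr.io/myorg/api=ghcr.io/myorg/api:${GITHUB_REF_NAME}"
          )
          git commit -am "ci: deploy api ${GITHUB_REF_NAME} to staging"
          git push
```

Note that the workflow never touches the cluster: it has Git credentials, not a kubeconfig, which is exactly the credential boundary the pull model is meant to enforce.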
ArgoCD Deep Dive
ArgoCD is the most widely adopted GitOps tool in the Kubernetes ecosystem. It provides a declarative, Kubernetes-native continuous delivery engine with a rich Web UI, RBAC, SSO, and multi-cluster support out of the box. ArgoCD is a CNCF Graduated project.
Architecture Overview
ArgoCD runs as a set of controllers in your cluster. The API Server exposes gRPC and REST APIs (and serves the Web UI). The Repository Server clones Git repos, renders Helm charts, runs Kustomize, and returns plain manifests. The Application Controller continuously compares the rendered manifests against the live cluster state and performs sync operations.
flowchart TB
subgraph ARGO["ArgoCD (argocd namespace)"]
API["API Server<br/>gRPC / REST / Web UI"]
REPO["Repository Server<br/>clones Git, renders manifests"]
CTRL["Application Controller<br/>reconciliation loop"]
REDIS["Redis<br/>caching layer"]
DEX["Dex (optional)<br/>SSO / OIDC"]
end
GIT["Git Repository"]
K8S["Target Cluster(s)"]
USER["User / CLI / Web UI"]
USER --> API
API --> REPO
API --> CTRL
CTRL --> REPO
REPO -->|"clone & render"| GIT
CTRL -->|"compare & sync"| K8S
API --> REDIS
DEX --> API
The Application CRD
Everything in ArgoCD revolves around the Application custom resource. An Application defines what to deploy (a path in a Git repo) and where to deploy it (a target cluster and namespace). ArgoCD watches these CRDs and reconciles accordingly.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: my-app
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/k8s-manifests.git
targetRevision: main
path: apps/my-app/overlays/production
destination:
server: https://kubernetes.default.svc # in-cluster
namespace: my-app
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Revert manual changes in the cluster
syncOptions:
- CreateNamespace=true
retry:
limit: 3
backoff:
duration: 5s
factor: 2
maxDuration: 3m
The source can point to plain YAML, a Kustomize directory, a Helm chart in a Git repo, or a chart from a Helm repository. ArgoCD auto-detects the format based on the directory contents (kustomization.yaml, Chart.yaml, etc.).
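For instance, pulling a chart straight from a Helm repository swaps the path field for chart (the ingress-nginx chart and version below are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://kubernetes.github.io/ingress-nginx  # Helm repository, not Git
    chart: ingress-nginx          # Chart name replaces the Git path
    targetRevision: 4.10.0        # Chart version instead of a branch/tag
    helm:
      values: |
        controller:
          replicaCount: 2
  destination:
    server: https://kubernetes.default.svc
    namespace: ingress-nginx
```

ArgoCD renders the chart server-side in the Repository Server, so what gets tracked and diffed is always plain manifests, regardless of the source format.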
Sync Policies: Automated, Self-Heal, and Prune
Sync policies control how aggressively ArgoCD reconciles. The right combination depends on your risk tolerance and environment maturity. Here's what each option does and when to use it.
| Policy | Behavior | Recommendation |
|---|---|---|
| automated | ArgoCD syncs automatically when it detects Git has diverged from the cluster | Enable for staging and production once you trust your review process |
| selfHeal | If someone kubectl edits a resource, ArgoCD reverts it to match Git | Always enable in production — prevents drift from manual interventions |
| prune | Resources deleted from Git are removed from the cluster | Enable with care — a bad merge can delete production resources |
| Manual sync | ArgoCD detects drift and marks the app "OutOfSync" but waits for a human to click Sync | Good starting point for teams new to GitOps |
You can also add sync waves and hooks via annotations. Sync waves control ordering (e.g., create the namespace before the deployment), and hooks run Jobs at specific phases (PreSync for schema migrations, PostSync for smoke tests, SyncFail for alerting).
# Run a database migration before syncing the app
apiVersion: batch/v1
kind: Job
metadata:
name: db-migrate
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
template:
spec:
containers:
- name: migrate
image: myorg/db-migrate:v2.1.0
command: ["./migrate", "--target", "latest"]
restartPolicy: Never
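Sync waves are plain annotations. A minimal sketch in which the namespace (wave -1) is created before the ConfigMap (wave 0), which in turn precedes the Service (wave 1):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # Namespace exists before anything else
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config
  namespace: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # Default wave; shown for clarity
data:
  LOG_LEVEL: info
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # Applied only after earlier waves are healthy
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

ArgoCD waits for all resources in a wave to be healthy before starting the next wave, so waves give you ordering with health gating, not just ordering.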
App-of-Apps Pattern
Managing dozens of Application CRDs individually becomes unwieldy. The app-of-apps pattern solves this: you create a single "root" Application that points to a directory containing other Application manifests. When ArgoCD syncs the root app, it creates all the child Applications, which then sync their own targets.
# Root Application — manages all other Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: platform-root
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/k8s-manifests.git
targetRevision: main
path: argocd-apps/ # Directory of Application YAMLs
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
selfHeal: true
prune: true
The argocd-apps/ directory contains individual Application YAMLs — one for cert-manager, one for ingress-nginx, one for each microservice, etc. Adding a new service to the platform is a single Git commit that drops a new Application YAML into this directory.
ApplicationSets for Multi-Cluster and Templating
ApplicationSet is a more powerful evolution of app-of-apps. It uses generators to produce Application resources dynamically from data sources like Git directory structure, cluster lists, pull requests, or external APIs. This eliminates boilerplate when you deploy the same application across many clusters or environments.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: my-app-multi-cluster
namespace: argocd
spec:
generators:
- clusters:
selector:
matchLabels:
env: production
template:
metadata:
name: 'my-app-{{name}}'
spec:
project: default
source:
repoURL: https://github.com/myorg/k8s-manifests.git
targetRevision: main
path: 'apps/my-app/overlays/{{metadata.labels.region}}'
destination:
server: '{{server}}'
namespace: my-app
This single ApplicationSet generates one Application per production cluster registered with ArgoCD. The {{name}}, {{server}}, and {{metadata.labels.region}} template variables are populated from the cluster registration data. Other generators include git (one app per directory), list (explicit values), pullRequest (preview environments per PR), and matrix (combine generators).
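The git directory generator, for example, stamps out one Application per matching directory. In this sketch the repo layout follows the monorepo structure shown later in this section, and the {{path[1]}} segment index (the second path component, i.e. the app name) is an assumption about that layout:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: staging-apps
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/myorg/k8s-manifests.git
        revision: main
        directories:
          - path: apps/*/overlays/staging   # One Application per matching directory
  template:
    metadata:
      name: '{{path[1]}}-staging'           # path[1] = app name (e.g. "frontend")
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/k8s-manifests.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'
```

Adding a new app's overlay directory to the repo is then enough to get it deployed: no per-app Application YAML needs to be written at all.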
RBAC and SSO
ArgoCD ships with its own RBAC system that controls who can view, sync, or override applications. Policies are defined in a ConfigMap using a Casbin-style syntax. Combined with SSO via Dex or a direct OIDC provider, this lets you map your identity provider groups to ArgoCD roles.
# argocd-rbac-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-rbac-cm
namespace: argocd
data:
policy.csv: |
# Role: team-frontend can sync only their apps
p, role:team-frontend, applications, get, default/frontend-*, allow
p, role:team-frontend, applications, sync, default/frontend-*, allow
# Role: platform-admin has full access
p, role:platform-admin, applications, *, */*, allow
p, role:platform-admin, clusters, *, *, allow
# Map OIDC groups to roles
g, oidc-group:frontend-devs, role:team-frontend
g, oidc-group:platform-team, role:platform-admin
policy.default: role:readonly
The pattern default/frontend-* scopes access to applications in the default ArgoCD project whose names start with frontend-. The policy.default key ensures that authenticated users who don't match any group get read-only access — a safe default for visibility without risk.
The Web UI
ArgoCD's Web UI is one of its biggest differentiators. It provides a real-time visualization of your application's resource tree — every Deployment, ReplicaSet, Pod, Service, and Ingress is shown with its sync status and health. You can see diffs between Git and the live state, trigger syncs, view logs, and even exec into Pods. For teams that need operational visibility without deep kubectl fluency, the UI dramatically lowers the barrier to understanding what's running in the cluster.
Flux Deep Dive
Flux takes a fundamentally different architectural approach from ArgoCD. Rather than a monolithic application, Flux is a set of composable, single-purpose controllers that each manage one aspect of the GitOps pipeline. You install only the controllers you need, and they coordinate through Kubernetes custom resources. Flux is also a CNCF Graduated project.
Core Controllers and CRDs
Flux's architecture follows the Unix philosophy — small tools that do one thing well. Each controller watches specific CRDs and produces outputs that other controllers consume.
flowchart LR
subgraph SOURCES["Source Controllers"]
GR["GitRepository"]
HR["HelmRepository"]
OCR["OCIRepository"]
BUCKET["Bucket (S3)"]
end
subgraph DEPLOY["Deployment Controllers"]
KS["Kustomization<br/>Controller"]
HC["Helm<br/>Controller"]
end
subgraph AUTO["Automation"]
IAC["Image Reflector<br/>Controller"]
IAU["Image Automation<br/>Controller"]
end
subgraph NOTIFY["Notifications"]
NP["Notification<br/>Provider"]
NA["Alert"]
NR["Receiver"]
end
GR -->|artifact| KS
GR -->|artifact| HC
HR -->|chart| HC
OCR -->|artifact| KS
KS -->|apply manifests| CLUSTER["Kubernetes<br/>Cluster"]
HC -->|helm install/upgrade| CLUSTER
IAC -->|latest image tag| IAU
IAU -->|commit update| GIT["Git Repo"]
NR -->|webhook trigger| GR
KS --> NA
HC --> NA
NA --> NP
| Controller | CRDs | Responsibility |
|---|---|---|
| Source Controller | GitRepository, HelmRepository, OCIRepository, Bucket | Fetches artifacts from external sources and produces versioned tarballs for other controllers to consume |
| Kustomize Controller | Kustomization | Applies Kustomize overlays or plain YAML from a source artifact to the cluster |
| Helm Controller | HelmRelease | Manages Helm chart lifecycle — install, upgrade, rollback, test, uninstall |
| Image Reflector | ImageRepository, ImagePolicy | Scans container registries for new image tags matching a policy |
| Image Automation | ImageUpdateAutomation | Commits image tag updates back to Git when new images match the policy |
| Notification Controller | Provider, Alert, Receiver | Sends alerts to Slack/Teams/PagerDuty and receives webhooks to trigger reconciliation |
GitRepository and Kustomization
The two most fundamental Flux CRDs are GitRepository (fetch the source) and Kustomization (apply it). Together, they form the minimum viable GitOps pipeline. The GitRepository polls your repo at a configurable interval, and the Kustomization controller applies the resulting manifests.
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: my-app
namespace: flux-system
spec:
interval: 5m
url: https://github.com/myorg/k8s-manifests.git
ref:
branch: main
secretRef:
name: git-credentials # SSH key or token for private repos
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: my-app
namespace: flux-system
spec:
interval: 10m
retryInterval: 2m
targetNamespace: my-app
sourceRef:
kind: GitRepository
name: my-app
path: ./apps/my-app/overlays/production
prune: true # Remove resources deleted from Git
healthChecks:
- apiVersion: apps/v1
kind: Deployment
name: my-app
namespace: my-app
timeout: 3m
Note that Flux's Kustomization CRD is not the same as a kustomization.yaml file. The Flux Kustomization is a controller configuration that tells Flux what to apply. If the target path contains a kustomization.yaml, Flux runs Kustomize on it. If it doesn't, Flux generates one automatically from all the YAML files in the directory. The healthChecks field is powerful — Flux will wait for the specified resources to become healthy before marking the reconciliation as successful.
HelmRelease
For teams using Helm, Flux provides a HelmRelease CRD that manages the full chart lifecycle. It supports values from ConfigMaps, Secrets, inline YAML, or values files in Git. Flux handles install, upgrade, rollback on failure, and uninstall — all declaratively.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: bitnami
namespace: flux-system
spec:
interval: 1h
url: https://charts.bitnami.com/bitnami
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: redis
namespace: flux-system
spec:
interval: 30m
chart:
spec:
chart: redis
version: "18.x" # Semver range
sourceRef:
kind: HelmRepository
name: bitnami
targetNamespace: cache
install:
createNamespace: true
remediation:
retries: 3
upgrade:
remediation:
retries: 3
remediateLastFailure: true # Rollback on failed upgrade
values:
architecture: replication
replica:
replicaCount: 3
auth:
existingSecret: redis-credentials
Image Automation
Flux's image automation controllers close the loop between CI and GitOps. When your CI pipeline pushes a new container image, Flux detects it, updates the image tag in Git, and then syncs the new manifests to the cluster. This eliminates the manual step of updating image tags in your manifests repo.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
name: my-app
namespace: flux-system
spec:
image: ghcr.io/myorg/my-app
interval: 5m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
name: my-app
namespace: flux-system
spec:
imageRepositoryRef:
name: my-app
policy:
semver:
range: "1.x" # Only pick tags matching semver 1.x
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
name: my-app
namespace: flux-system
spec:
interval: 30m
sourceRef:
kind: GitRepository
name: my-app
git:
checkout:
ref:
branch: main
commit:
author:
name: flux-bot
email: flux@myorg.com
messageTemplate: "chore: update {{.AutomationObject}} images"
push:
branch: main
update:
path: ./apps/my-app
strategy: Setters
In your deployment manifest, you mark which image fields to update using a special comment marker:
containers:
- name: my-app
image: ghcr.io/myorg/my-app:1.4.2 # {"$imagepolicy": "flux-system:my-app"}
When a new tag like 1.5.0 appears in the registry and matches the 1.x semver policy, Flux automatically commits an update changing 1.4.2 to 1.5.0 in your Git repo. The Kustomization controller then picks up the commit and deploys it.
Multi-Tenancy with Flux
Flux has first-class multi-tenancy support. Each tenant (team or project) gets their own GitRepository and Kustomization scoped to specific namespaces. A platform team manages the "root" Kustomization that bootstraps tenant Kustomizations, and Kubernetes RBAC ensures tenants can only deploy to their own namespaces.
# Tenant Kustomization — scoped to team-alpha's namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: team-alpha-apps
namespace: flux-system
spec:
interval: 10m
sourceRef:
kind: GitRepository
name: team-alpha-repo
path: ./deploy
prune: true
targetNamespace: team-alpha
serviceAccountName: team-alpha-sa # RBAC-scoped SA
The serviceAccountName field is the key to multi-tenancy. The Kustomization controller impersonates this service account when applying resources, so it can only create or modify resources that the service account has RBAC access to. A tenant cannot accidentally (or maliciously) modify resources in another team's namespace.
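What backs that service account is plain Kubernetes RBAC. A minimal sketch, assuming Flux's convention that the impersonated ServiceAccount lives in the Kustomization's own namespace (flux-system here), while the Role and RoleBinding live in the tenant namespace; names are illustrative, and real tenants usually get a narrower rule set than the wildcard shown:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: team-alpha-sa
  namespace: flux-system        # Same namespace as the Kustomization that references it
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-alpha-deployer
  namespace: team-alpha
rules:
  # Broad for illustration; scope down to the resource kinds the tenant deploys
  - apiGroups: ["", "apps", "batch"]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-deployer
  namespace: team-alpha         # Grants access ONLY inside team-alpha
subjects:
  - kind: ServiceAccount
    name: team-alpha-sa
    namespace: flux-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: team-alpha-deployer
```

Because the RoleBinding exists only in team-alpha, any manifest in the tenant repo that targets another namespace is rejected by the API server during reconciliation.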
ArgoCD vs. Flux — Choosing Your Tool
Both ArgoCD and Flux are CNCF Graduated projects with large communities and production deployments at scale. The choice between them often comes down to team preferences, existing tooling, and operational philosophy rather than a clear technical winner.
| Dimension | ArgoCD | Flux |
|---|---|---|
| Architecture | Monolithic application with API server, controller, repo server | Composable toolkit of independent controllers |
| Web UI | Rich built-in UI with resource visualization, diffs, logs | No built-in UI — use Weave GitOps UI or Capacitor as add-ons |
| CLI | argocd CLI for app management and admin tasks | flux CLI for bootstrapping and troubleshooting |
| Multi-cluster | Centralized — one ArgoCD manages many clusters | Decentralized — each cluster runs its own Flux, or use Flux + cluster API |
| Helm support | Renders Helm charts to plain manifests, tracks via Application CRD | Native HelmRelease CRD with full lifecycle (install, upgrade, rollback, test) |
| Image automation | Via ArgoCD Image Updater (separate project, less mature) | Built-in image reflector and automation controllers |
| Multi-tenancy | AppProjects with RBAC policies and source/destination restrictions | Service account impersonation per Kustomization with native K8s RBAC |
| Notifications | Built-in notification engine with triggers and templates | Notification controller with providers and alerts |
| Learning curve | Lower — the UI helps visualize state and debug issues | Higher — requires comfort with CRDs and CLI-based debugging |
| Resource footprint | Heavier — runs Redis, Dex, repo server, API server | Lighter — only install the controllers you need |
Choose ArgoCD if your team values a visual dashboard, you manage multiple clusters from a central hub, or your developers are not deeply comfortable with kubectl. Choose Flux if you prefer a composable toolkit, want tighter Kubernetes-native RBAC integration, need built-in image automation, or run a platform where each team manages their own GitOps pipeline with strong tenant isolation.
Repository Structure Best Practices
Your Git repository layout determines how cleanly you can manage environments, teams, and promotion workflows. There's no universally correct structure, but two patterns dominate — and each serves different organizational needs.
Monorepo vs. Polyrepo
| Pattern | Structure | Best For | Watch Out For |
|---|---|---|---|
| Monorepo | One repo with all manifests, separated by directory | Small-to-medium teams, strong shared standards, easy cross-cutting changes | Merge conflicts at scale, RBAC requires path-level Git permissions (e.g., CODEOWNERS) |
| Polyrepo | Separate repos per team or per app | Large orgs with autonomous teams, strict access control | Harder to make platform-wide changes, more repos to manage |
| Hybrid | One repo for platform/infra, separate repos per team for apps | Platform engineering model — central team controls shared infra | Requires clear ownership boundaries |
Here's a recommended monorepo structure that works well with both ArgoCD and Flux. It separates concerns by layer (infrastructure vs. applications), uses Kustomize overlays for environment differentiation, and keeps the GitOps tool configuration in its own directory.
k8s-manifests/
├── infrastructure/ # Shared cluster infrastructure
│ ├── base/
│ │ ├── cert-manager/
│ │ ├── ingress-nginx/
│ │ ├── monitoring/
│ │ └── sealed-secrets/
│ ├── staging/
│ │ └── kustomization.yaml # Patches for staging
│ └── production/
│ └── kustomization.yaml # Patches for production
├── apps/ # Application workloads
│ ├── base/
│ │ ├── frontend/
│ │ │ ├── deployment.yaml
│ │ │ ├── service.yaml
│ │ │ └── kustomization.yaml
│ │ └── backend/
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ └── kustomization.yaml
│ ├── staging/
│ │ ├── frontend/
│ │ │ └── kustomization.yaml # image tag, replicas, env vars
│ │ └── backend/
│ │ └── kustomization.yaml
│ └── production/
│ ├── frontend/
│ │ └── kustomization.yaml
│ └── backend/
│ └── kustomization.yaml
└── clusters/ # GitOps tool configuration
├── staging/
│ ├── infrastructure.yaml # ArgoCD App or Flux Kustomization
│ └── apps.yaml
└── production/
├── infrastructure.yaml
└── apps.yaml
Environment Promotion Patterns
Promoting a change from staging to production should be a deliberate, reviewable action. Two promotion patterns are common, and they work differently with your Git workflow.
Pattern 1: Branch-per-environment. The main branch represents staging, and a production branch represents production. You promote by merging main into production. This is simple but fragile — merge conflicts accumulate, and the branches inevitably diverge in ways that are hard to reason about.
Pattern 2: Directory-per-environment (recommended). A single branch (main) contains directories for each environment, using Kustomize overlays to vary configuration. Promotion is a PR that updates the production overlay — typically changing an image tag or a Kustomize patch. This is easier to audit and less error-prone.
# apps/production/backend/kustomization.yaml
# Promoting v1.5.0 to production = changing this image tag
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: backend
resources:
- ../../base/backend
images:
- name: ghcr.io/myorg/backend
newTag: "1.5.0" # <-- promotion happens here
patches:
- patch: |
apiVersion: apps/v1
kind: Deployment
metadata:
name: backend
spec:
replicas: 5 # Production runs more replicas
Automating promotion is possible too. After staging passes health checks, a CI job can open a PR that bumps the production image tag. The PR goes through code review, merges, and the GitOps agent deploys it — keeping the human-in-the-loop for production changes while automating the toil.
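Such a promotion job might look like the following sketch, assuming a checkout of the manifests repo plus the kustomize and GitHub gh CLIs; the repo layout matches the monorepo structure above, and the names are illustrative:

```shell
# Illustrative promotion script, run by CI after staging health checks pass.
set -euo pipefail
NEW_TAG="1.5.0"

git checkout -b "promote-backend-${NEW_TAG}"

# Bump the production overlay's image tag in place
(
  cd apps/production/backend
  kustomize edit set image "ghcr.io/myorg/backend=ghcr.io/myorg/backend:${NEW_TAG}"
)

git commit -am "chore: promote backend ${NEW_TAG} to production"
git push -u origin "promote-backend-${NEW_TAG}"

# Open the human-in-the-loop PR; merging it is what triggers the GitOps agent
gh pr create \
  --title "Promote backend ${NEW_TAG} to production" \
  --body "Staging passed health checks; bumping the production image tag."
```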
Practical Setup: Installing ArgoCD
Here's a working setup that gets ArgoCD running in your cluster and deploys an application from Git. This uses the non-HA manifest for simplicity — production clusters should use the HA manifest or the Helm chart.
# 1. Install ArgoCD into its own namespace
kubectl create namespace argocd
kubectl apply -n argocd \
-f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# 2. Wait for all components to be ready
kubectl wait --for=condition=available deployment --all -n argocd --timeout=300s
# 3. Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d && echo
# 4. Port-forward the API server to access the UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# 5. Login with the CLI (optional — you can also use the Web UI)
argocd login localhost:8080 --username admin --insecure
# 6. Change the default password immediately
argocd account update-password
Once ArgoCD is running, create your first Application — either through the Web UI, the CLI, or by applying an Application YAML like the one shown in the Application CRD section above. Within seconds, ArgoCD will clone your repo, render the manifests, and show you the sync status.
# Create an Application via CLI
argocd app create guestbook \
--repo https://github.com/argoproj/argocd-example-apps.git \
--path guestbook \
--dest-server https://kubernetes.default.svc \
--dest-namespace default
# Check sync status
argocd app get guestbook
# Trigger a sync
argocd app sync guestbook
Practical Setup: Bootstrapping Flux
Flux bootstraps itself — the flux bootstrap command installs Flux into your cluster and commits its own configuration to your Git repo. From that point on, Flux manages itself through GitOps. This is one of Flux's most elegant design choices.
# 1. Install the Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash
# 2. Check prerequisites (kubectl context, Git access, cluster version)
flux check --pre
# 3. Bootstrap Flux with GitHub (creates repo structure and installs controllers)
export GITHUB_TOKEN=<your-pat-token>
flux bootstrap github \
--owner=myorg \
--repository=k8s-manifests \
--branch=main \
--path=clusters/staging \
--personal
# 4. Verify all controllers are running
flux check
# 5. Check the state of all Flux resources
flux get all
After bootstrap, Flux has committed its own controller manifests to clusters/staging/flux-system/ in your Git repo. To deploy an application, you add GitRepository and Kustomization YAMLs to the clusters/staging/ path and push. Flux picks them up automatically.
# Create a source and kustomization via CLI (generates YAML and commits to Git)
flux create source git my-app \
--url=https://github.com/myorg/my-app-manifests \
--branch=main \
--interval=5m \
--export > ./clusters/staging/my-app-source.yaml
flux create kustomization my-app \
--source=GitRepository/my-app \
--path="./overlays/staging" \
--prune=true \
--interval=10m \
--export > ./clusters/staging/my-app-kustomization.yaml
# Commit and push — Flux reconciles automatically
git add -A && git commit -m "feat: add my-app to staging" && git push
# Watch the reconciliation
flux get kustomizations --watch
Never commit plain Kubernetes Secrets to your GitOps repo. Use Sealed Secrets (Bitnami), SOPS (Mozilla — natively supported by Flux), or an External Secrets Operator that syncs secrets from Vault, AWS Secrets Manager, or GCP Secret Manager. ArgoCD works with all three approaches; Flux has built-in SOPS decryption in its Kustomize controller.
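On the Flux side, SOPS decryption is a small addition to the Kustomization spec. This sketch assumes an age private key stored in a Secret named sops-age (the Secret name is illustrative; Flux expects the key material under well-known keys such as age.agekey):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./apps/my-app/overlays/production
  prune: true
  decryption:
    provider: sops        # Decrypt SOPS-encrypted files before applying
    secretRef:
      name: sops-age      # Secret in flux-system holding the age private key
```

Developers encrypt secrets locally with the public key and commit the ciphertext; only the in-cluster controller ever holds the private key.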
High Availability and Disaster Recovery
A Kubernetes cluster that runs only when everything is perfect is not production-ready. Real infrastructure experiences node failures, network partitions, data center outages, and human errors. High availability (HA) is about surviving component failures without downtime. Disaster recovery (DR) is about getting back to a working state when HA is not enough — when you lose an entire cluster, a region, or corrupt critical data.
This section walks through HA at every layer of the stack — control plane, application, and data — then covers the DR strategies that protect you when the worst happens. Each concept is paired with the practical configuration that implements it.
Control Plane High Availability
The control plane is the brain of your cluster. If it goes down, no new Pods can be scheduled, no Deployments can roll out, and no self-healing can occur. Existing workloads keep running (kubelets operate autonomously), but the cluster is effectively frozen. A production control plane must tolerate the loss of at least one node without interruption.
API Server: Stateless and Load-Balanced
The kube-apiserver is stateless — it reads and writes all data to etcd, holding nothing in memory between requests. This makes it the easiest control plane component to scale. You run multiple replicas (typically 3) behind a load balancer, and any instance can serve any request. If one crashes, the load balancer routes traffic to the surviving instances.
| Load Balancer Option | Best For | Notes |
|---|---|---|
| Cloud LB (AWS NLB, GCP ILB) | Managed Kubernetes (EKS, GKE) | Handled automatically by the cloud provider. Zero config on your part. |
| HAProxy / Nginx | Self-managed clusters | Run on dedicated hosts or as a keepalived VIP pair for the LB itself. |
| kube-vip | Bare-metal clusters | Runs as a static Pod on control plane nodes. Provides a virtual IP via ARP or BGP. |
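For the HAProxy option, the core of the configuration is a TCP frontend on 6443 balancing across the API servers. A minimal sketch with illustrative addresses (TLS passthrough, since the API server terminates its own TLS):

```
# /etc/haproxy/haproxy.cfg (fragment): TCP load balancing for kube-apiserver
frontend kube_apiserver
    bind *:6443
    mode tcp
    option tcplog
    default_backend control_plane

backend control_plane
    mode tcp
    balance roundrobin
    option tcp-check                  # Basic TCP health check per member
    server cp1 10.0.0.11:6443 check
    server cp2 10.0.0.12:6443 check
    server cp3 10.0.0.13:6443 check
```

The frontend address (or a keepalived VIP in front of an HAProxy pair) is what you put in every kubeconfig and in each kubelet's --kubeconfig server field, so a failed API server is invisible to clients.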
etcd: The Quorum Problem
etcd is the single source of truth for all cluster state. Unlike the API server, etcd is stateful — it uses the Raft consensus protocol, which requires a majority quorum to accept writes. This means the number of nodes you run directly determines how many failures you can tolerate.
| etcd Nodes | Quorum Required | Tolerated Failures |
|---|---|---|
| 1 | 1 | 0 — any failure loses the cluster |
| 3 | 2 | 1 node |
| 5 | 3 | 2 nodes |
| 7 | 4 | 3 nodes (rarely needed — latency increases) |
Three nodes is the minimum for production. Five nodes are appropriate when you need to survive two simultaneous failures — common in multi-AZ deployments where an entire availability zone might go down. Going beyond five is almost never justified because each additional member increases write latency (every write must be replicated to a majority).
An even number (e.g., 4) gives you the same fault tolerance as one fewer node (3), but with higher write latency. Four nodes still require a quorum of 3, so you can only lose 1 — the same as a 3-node cluster. The extra node adds cost and latency with no resilience benefit.
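The arithmetic behind both the table and the even-number caveat is simple enough to check in a couple of lines of shell:

```shell
# Raft fault tolerance: quorum = floor(n/2) + 1, tolerated failures = n - quorum.
# Note that 3 and 4 members tolerate the same single failure.
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated=$tolerated"
done
```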
Scheduler and Controller Manager: Leader Election
Unlike the API server, the kube-scheduler and kube-controller-manager cannot run in active-active mode. If two schedulers tried to bind the same Pod to different nodes simultaneously, the cluster would enter a conflicted state. Instead, these components use leader election.
All replicas start, but only the elected leader actively does work. The others are hot standbys that watch the leader lease. If the leader crashes or its lease expires (default: 15 seconds), a standby wins the election and takes over. The mechanism uses a Kubernetes Lease object stored in the API server, so it piggybacks on the same HA infrastructure you already have for etcd and the API server.
# Check which node currently holds the scheduler and controller-manager leases
kubectl get lease -n kube-system kube-scheduler -o jsonpath='{.spec.holderIdentity}'
kubectl get lease -n kube-system kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
HA Control Plane Architecture
The following diagram shows a production-grade 3-node control plane. Each node runs all control plane components. The API servers sit behind a shared load balancer, while etcd forms a Raft cluster across all three nodes. The scheduler and controller-manager elect a single leader.
graph TB
LB["Load Balancer<br/>(VIP / Cloud LB)"]
subgraph CP1["Control Plane Node 1"]
API1["kube-apiserver"]
ETCD1["etcd member-1"]
S1["scheduler (leader)"]
CM1["controller-manager<br/>(standby)"]
end
subgraph CP2["Control Plane Node 2"]
API2["kube-apiserver"]
ETCD2["etcd member-2"]
S2["scheduler (standby)"]
CM2["controller-manager<br/>(leader)"]
end
subgraph CP3["Control Plane Node 3"]
API3["kube-apiserver"]
ETCD3["etcd member-3"]
S3["scheduler (standby)"]
CM3["controller-manager<br/>(standby)"]
end
LB --> API1
LB --> API2
LB --> API3
ETCD1 <-->|"Raft"| ETCD2
ETCD2 <-->|"Raft"| ETCD3
ETCD1 <-->|"Raft"| ETCD3
API1 --> ETCD1
API2 --> ETCD2
API3 --> ETCD3
Workers["Worker Nodes"] --> LB
Application-Level High Availability
A highly available control plane means nothing if your application runs as a single Pod on one node. Application HA requires three things: multiple replicas, intelligent placement across failure domains, and controlled disruption during maintenance. Kubernetes gives you specific primitives for each.
Pod Anti-Affinity: Don't Put All Eggs in One Basket
Pod anti-affinity rules tell the scheduler to avoid placing Pods that match a label selector on the same node (or in the same zone). This ensures that a single node failure does not take out all replicas of a service. The requiredDuringSchedulingIgnoredDuringExecution variant is a hard rule — the Pod will not schedule if it cannot be satisfied. The preferredDuringSchedulingIgnoredDuringExecution variant is a soft hint.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-frontend
spec:
replicas: 3
selector:
matchLabels:
app: web-frontend
template:
metadata:
labels:
app: web-frontend
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-frontend
topologyKey: kubernetes.io/hostname
containers:
- name: frontend
image: myapp/frontend:2.4.1
resources:
requests:
cpu: 250m
memory: 256Mi
This configuration guarantees that no two web-frontend Pods land on the same node. Change the topologyKey to topology.kubernetes.io/zone to spread across availability zones instead — though a hard zone anti-affinity requirement with 3 replicas requires at least 3 zones.
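The soft variant mentioned above looks like this, a sketch of the same rule expressed as a preference. The weight (1-100) biases the scheduler toward spreading but never blocks placement:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web-frontend
        topologyKey: kubernetes.io/hostname
```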
Topology Spread Constraints: Even Distribution
Anti-affinity is binary: same node or different node. Topology spread constraints give you finer control — you can specify the maximum allowed skew between zones or nodes. This is critical for large deployments where you want even distribution, not just non-colocation.
spec:
template:
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: web-frontend
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: web-frontend
This example combines two constraints. The first is a hard rule: Pods must be evenly distributed across zones (no zone can have more than 1 extra Pod compared to the least-populated zone). The second is a soft rule: try to spread across nodes within a zone, but do not block scheduling if it cannot be perfectly balanced.
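One way to verify the resulting distribution (a sketch; the awk field position assumes the default kubectl get pods -o wide column layout, where NODE is the seventh column):

```shell
# Count web-frontend Pods per node; with maxSkew: 1 on the hostname key,
# the per-node counts should differ by at most one
kubectl get pods -l app=web-frontend -o wide --no-headers \
  | awk '{print $7}' | sort | uniq -c
```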
Pod Disruption Budgets: Controlled Maintenance
When a node is drained for maintenance (kubectl drain), upgrades, or autoscaler scale-down, Kubernetes evicts Pods. Without guardrails, a drain operation could evict all replicas of a critical service simultaneously. A Pod Disruption Budget (PDB) tells Kubernetes how many Pods of a given set must remain available (or how many can be unavailable) during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-frontend-pdb
spec:
minAvailable: 2 # at least 2 Pods must stay running
selector:
matchLabels:
app: web-frontend
With 3 replicas and minAvailable: 2, Kubernetes will only allow 1 Pod to be evicted at a time. If a second drain would violate the budget, it blocks until the first evicted Pod is rescheduled and healthy. You can alternatively use maxUnavailable — for example, maxUnavailable: 1 achieves the same result and is often clearer. You can also use percentage values like maxUnavailable: "25%" for larger deployments.
# Check PDB status — ALLOWED DISRUPTIONS shows how many more Pods can be evicted
kubectl get pdb web-frontend-pdb
# NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
# web-frontend-pdb 2 N/A 1 5m
A PDB guards against drains, evictions, and voluntary maintenance. It does not prevent a node from crashing, the OOM killer from terminating a container, or a Pod from failing its health checks. PDBs complement — but do not replace — replicas, anti-affinity, and proper resource limits.
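The percentage form mentioned earlier, sketched for a larger deployment (replica count and percentage are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  maxUnavailable: "25%"   # with 8 replicas, at most 2 may be disrupted at once
  selector:
    matchLabels:
      app: web-frontend
```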
etcd Backup and Restore
HA protects you from individual component failures, but it cannot protect against data corruption, accidental mass deletion (kubectl delete ns production), or a bug that writes bad data to etcd. For those scenarios, you need backups. etcd is the single store of all cluster state, so backing it up means backing up your entire cluster configuration.
Manual Snapshot with etcdctl
The etcdctl snapshot save command creates a point-in-time snapshot of the etcd data directory. This is the foundation of every etcd backup strategy. You must provide the etcd TLS certificates because etcd requires client authentication.
# Take an etcd snapshot (run on a control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20240115-030000.db --write-out=table
# +----------+----------+------------+------------+
# | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | 6d15a4e2 | 1284592 | 1438 | 5.2 MB |
# +----------+----------+------------+------------+
Automated Backup with a CronJob
Manual backups do not scale. The following CronJob runs every 6 hours, creates an etcd snapshot, and uploads it to an S3-compatible object store. It runs on a control plane node (via nodeSelector and toleration) and mounts the host etcd certificates.
apiVersion: batch/v1
kind: CronJob
metadata:
name: etcd-backup
namespace: kube-system
spec:
schedule: "0 */6 * * *" # Every 6 hours
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
nodeSelector:
node-role.kubernetes.io/control-plane: ""
tolerations:
- key: node-role.kubernetes.io/control-plane
effect: NoSchedule
containers:
- name: etcd-backup
image: bitnami/etcd:3.5
command:
- /bin/sh
- -c
- |
SNAPSHOT="/tmp/etcd-$(date +%Y%m%d-%H%M%S).db"
etcdctl snapshot save "$SNAPSHOT" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd-certs/ca.crt \
--cert=/etc/etcd-certs/server.crt \
--key=/etc/etcd-certs/server.key
              # Upload to S3 with aws-cli (or mc for MinIO) — note that the
              # stock bitnami/etcd image does not bundle either client, so in
              # practice bake one into a custom backup image
aws s3 cp "$SNAPSHOT" s3://my-etcd-backups/
envFrom:
- secretRef:
name: aws-backup-credentials
volumeMounts:
- name: etcd-certs
mountPath: /etc/etcd-certs
readOnly: true
volumes:
- name: etcd-certs
hostPath:
path: /etc/kubernetes/pki/etcd
restartPolicy: OnFailure
Restoring from a Snapshot
Restoring etcd is a disruptive operation — you stop the current etcd cluster and replace its data. This is a last-resort procedure, typically performed when the cluster has lost quorum or data has been corrupted. The restored cluster starts with a new cluster ID, so all etcd members must be restored from the same snapshot.
# 1. Stop the kube-apiserver and etcd (move their static Pod manifests)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# 2. Move aside the old etcd data directory (etcdctl refuses to restore
#    into an existing data directory)
sudo mv /var/lib/etcd /var/lib/etcd.bak
# 3. Restore the snapshot to a new data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240115-030000.db \
--data-dir=/var/lib/etcd \
--name=cp-1 \
--initial-cluster=cp-1=https://10.0.1.10:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# 4. Restart etcd and the API server
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 5. Verify the cluster is healthy
kubectl get nodes
kubectl get pods -A
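On a multi-member control plane, the same snapshot must be restored on every member. A sketch of the second member's restore; the member names, IPs, and snapshot path are illustrative and match the example above:

```shell
# On control plane node 2 (cp-2, 10.0.1.11) — pass this member's own name
# and peer URL, plus the full initial cluster map listing all three members
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240115-030000.db \
  --data-dir=/var/lib/etcd \
  --name=cp-2 \
  --initial-cluster=cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.11:2380
# Repeat on cp-3 with --name=cp-3 and its own peer URL
```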
Disaster Recovery Strategies
HA handles individual component failures. Disaster recovery handles catastrophic scenarios: a full cluster loss, a region outage, a ransomware attack, or a cascading failure that corrupts data. Your DR strategy determines your Recovery Time Objective (RTO) — how long until you are back online — and your Recovery Point Objective (RPO) — how much data you can afford to lose.
graph LR
subgraph Strategies["DR Strategy Spectrum"]
direction LR
A["etcd Snapshots<br/>+ GitOps Rebuild"]
B["Velero<br/>App-Level Backup"]
C["Active-Passive<br/>Multi-Cluster"]
D["Active-Active<br/>Multi-Cluster"]
end
A -.-|"RTO: hours / RPO: hours"| B
B -.-|"RTO: 30-60 min / RPO: minutes"| C
C -.-|"RTO: minutes / RPO: seconds"| D
style A fill:#fef3c7,stroke:#d97706
style B fill:#fef3c7,stroke:#d97706
style C fill:#dbeafe,stroke:#2563eb
style D fill:#d1fae5,stroke:#059669
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| etcd snapshot + GitOps rebuild | Hours | Last snapshot (hours) | Low | Low |
| Velero scheduled backups | 30-60 minutes | Last backup (minutes-hours) | Low-Medium | Medium |
| Active-Passive multi-cluster | Minutes | Replication lag (seconds-minutes) | High | High |
| Active-Active multi-cluster | Near-zero | Near-zero | Very High | Very High |
GitOps-Based Recovery: Rebuild from Git
If you follow GitOps (see the previous section on ArgoCD and Flux), your entire cluster desired state already lives in Git. Your DR process becomes: provision a new cluster, point your GitOps tool at the same repository, and let it reconcile. This rebuilds all namespaces, Deployments, Services, ConfigMaps, RBAC policies, and network policies automatically.
The gap in GitOps-only recovery is runtime state: PersistentVolume data, Secrets not stored in Git, Custom Resource instances created by operators, and any state held in databases running inside the cluster. GitOps gives you the skeleton; you need Velero or database-native replication to restore the flesh.
Velero: Application-Level Backup and Restore
Velero (formerly Heptio Ark) is the standard tool for backing up and restoring Kubernetes resources and persistent volumes. It works at the Kubernetes API level — it queries the API server for resources, serializes them to JSON, and stores them in object storage (S3, GCS, Azure Blob). For volumes, it can take CSI snapshots or use Restic/Kopia for file-level backups.
graph LR
V["Velero Server<br/>(in-cluster)"]
API["kube-apiserver"]
S3["Object Storage<br/>(S3 / GCS / MinIO)"]
SNAP["Volume Snapshots<br/>(CSI / Cloud)"]
V -->|"1. List resources"| API
V -->|"2. Store manifests"| S3
V -->|"3. Snapshot PVs"| SNAP
S3 -->|"4. Restore manifests"| V
SNAP -->|"5. Restore volumes"| V
V -->|"6. Create resources"| API
Installing and Configuring Velero
# Install Velero with AWS S3 as the backup storage location
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket my-velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero \
--use-node-agent \
--default-volumes-to-fs-backup
Velero: Backup, Restore, and Schedule
Velero operates through three core workflows: on-demand backups for immediate snapshots, scheduled backups for automated protection, and restores to recover from a backup. Each can target specific namespaces, label selectors, or resource types.
# --- On-demand backup of the "production" namespace ---
velero backup create prod-backup-manual \
--include-namespaces production \
--wait
# Check backup status
velero backup describe prod-backup-manual --details
# --- Schedule automatic daily backups with 7-day retention ---
velero schedule create prod-daily \
--schedule="0 2 * * *" \
--include-namespaces production \
--ttl 168h
# List scheduled backups
velero schedule get
# --- Restore from a backup ---
# Restore everything from the backup
velero restore create --from-backup prod-backup-manual --wait
# Restore only specific resources (e.g., just Deployments and Services)
velero restore create --from-backup prod-backup-manual \
--include-resources deployments,services \
--wait
# Restore to a different namespace (rename on restore)
velero restore create --from-backup prod-backup-manual \
--namespace-mappings production:production-restored \
--wait
Velero Backup as a Kubernetes Resource
Behind the CLI, Velero creates Custom Resources. Here is what a Schedule and BackupStorageLocation look like as YAML — useful when you manage Velero through GitOps rather than the CLI.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: default
namespace: velero
spec:
provider: aws
objectStorage:
bucket: my-velero-backups
prefix: cluster-prod
config:
region: us-east-1
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: full-cluster-daily
namespace: velero
spec:
schedule: "0 3 * * *"
template:
ttl: 720h # 30-day retention
includedNamespaces:
- "*"
excludedNamespaces:
- kube-system
- velero
snapshotVolumes: true
defaultVolumesToFsBackup: false # prefer CSI snapshots
Multi-Cluster Disaster Recovery
For organizations that cannot tolerate the recovery time of a single-cluster strategy, multi-cluster DR provides faster failover by running a second cluster that is ready (or already serving traffic) when the primary fails.
| Pattern | How It Works | Trade-offs |
|---|---|---|
| Active-Passive | A standby cluster in a second region receives replicated data (via Velero cross-cluster restore, database replication, or storage-level replication). DNS or a global load balancer switches traffic during failover. | The passive cluster consumes resources but serves no traffic until failover. Data replication lag determines RPO. Failover can be automated with health checks on the global LB. |
| Active-Active | Both clusters serve production traffic simultaneously. A global load balancer (e.g., AWS Global Accelerator, Cloudflare LB) routes users to the nearest healthy cluster. Data stores use multi-region replication (e.g., CockroachDB, Cassandra, or cloud-managed databases with multi-region writes). | Near-zero RTO/RPO, but significantly more complex. You must handle data consistency, conflict resolution, and split-brain scenarios. Stateless workloads are easy; stateful workloads are hard. |
Most teams should start with GitOps + Velero scheduled backups. This combination covers 90% of disaster scenarios at low cost. Move to active-passive multi-cluster only when your SLAs demand RTO under 15 minutes. Move to active-active only when you need five-nines availability across regions — and be prepared for a significant jump in operational complexity.
Putting It All Together: An HA Checklist
High availability and disaster recovery are not features you bolt on later — they are architectural decisions made early. Here is a practical checklist to audit your cluster resilience:
| Layer | Requirement | How to Verify |
|---|---|---|
| Control Plane | 3+ API server replicas behind a load balancer | kubectl get pods -n kube-system -l component=kube-apiserver |
| etcd | 3 or 5 members with quorum health checks | etcdctl endpoint health --cluster |
| etcd Backups | Automated snapshots every 1-6 hours, stored offsite | Check CronJob history or Velero schedule status |
| Leader Election | Scheduler and controller-manager run on multiple nodes | kubectl get lease -n kube-system |
| App Replicas | Critical services have 3+ replicas | kubectl get deploy -o wide |
| Pod Spreading | Anti-affinity or topology spread constraints in place | Review Deployment specs for affinity or topologySpreadConstraints |
| PDBs | PDBs defined for every critical workload | kubectl get pdb --all-namespaces |
| DR Tested | Restore procedure tested in the last 30 days | Check your runbook and last restore drill date |
The last row is the most important. A backup that has never been tested is not a backup — it is a hope. Schedule quarterly DR drills where you restore to a fresh cluster and verify that applications are functional. The time you invest in testing is repaid the first time a real disaster strikes.
Multi-Tenancy Patterns — Sharing Clusters Safely
Running a separate Kubernetes cluster for every team, environment, or customer is simple to reason about — but expensive and operationally painful to maintain. Multi-tenancy lets multiple tenants share a single cluster while keeping their workloads isolated from each other. The challenge is finding the right balance between isolation strength, operational overhead, and cost efficiency.
There is no single "correct" multi-tenancy model. The right choice depends on your trust boundaries, compliance requirements, and how many clusters you are willing to manage. This section walks through the full spectrum — from lightweight namespace isolation to fully virtual clusters — and gives you the concrete Kubernetes primitives to implement each one.
The Multi-Tenancy Spectrum
Multi-tenancy in Kubernetes exists on a spectrum. On one end, namespaces provide logical separation within a shared cluster. On the other end, each tenant gets a dedicated cluster with complete physical isolation. In between sits a newer approach: virtual clusters that simulate a full cluster inside namespaces of a host cluster.
graph LR
subgraph SOFT["Soft Multi-Tenancy"]
NS["Namespace per Tenant<br/>RBAC + Quotas + NetworkPolicy"]
end
subgraph VIRTUAL["Virtual Clusters"]
VC["vCluster per Tenant<br/>Dedicated API server<br/>Shared worker nodes"]
end
subgraph HARD["Hard Multi-Tenancy"]
HC["Cluster per Tenant<br/>Full physical isolation"]
end
NS -->|"More isolation"| VC
VC -->|"More isolation"| HC
style SOFT fill:#e8f5e9,stroke:#388e3c,color:#1b5e20
style VIRTUAL fill:#fff3e0,stroke:#f57c00,color:#e65100
style HARD fill:#fce4ec,stroke:#c62828,color:#b71c1c
| Model | Isolation Level | Operational Cost | Tenant Self-Service | Use Case |
|---|---|---|---|---|
| Namespace per Tenant | Logical (kernel shared) | Low — single cluster to manage | Limited — tenants cannot create CRDs or cluster-scoped resources | Internal teams within the same org, dev/staging environments |
| Virtual Cluster (vCluster) | Strong logical (separate API server) | Medium — vClusters are lightweight but add a management layer | High — tenants get their own API server, can install CRDs | Platform teams offering Kubernetes-as-a-Service, CI/CD ephemeral clusters |
| Cluster per Tenant | Physical (separate nodes, etcd, control plane) | High — N clusters to upgrade, monitor, secure | Full — tenant has complete cluster admin | Regulated industries, untrusted tenants, strict compliance |
Namespace-Level Isolation (Soft Multi-Tenancy)
Namespace-based isolation is the most common multi-tenancy pattern. Each tenant gets one or more namespaces, and you enforce boundaries using five Kubernetes-native mechanisms: RBAC, ResourceQuotas, LimitRanges, NetworkPolicies, and Pod Security Standards. None of these is sufficient alone — you need all five working together to create a meaningful isolation boundary.
1. RBAC per Namespace
RBAC confines each tenant to their own namespace. You create a Role (namespace-scoped) with the permissions tenants need, then bind it to their identity with a RoleBinding. Tenants should never receive ClusterRole bindings unless you explicitly want them to access cluster-wide resources.
# Role: allow full control within the tenant namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: tenant-developer
namespace: team-payments
rules:
- apiGroups: ["", "apps", "batch"]
resources: ["pods", "deployments", "services", "configmaps", "secrets", "jobs", "cronjobs"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
resources: ["pods/log", "pods/exec"]
verbs: ["get", "create"]
---
# Bind the role to the team's group identity
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: team-payments-developers
namespace: team-payments
subjects:
- kind: Group
name: payments-team
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: tenant-developer
apiGroup: rbac.authorization.k8s.io
Notice that this Role does not include access to nodes, namespaces, persistentvolumes, or any cluster-scoped resource. Tenants can work freely within team-payments but cannot see or affect anything outside it. You should also avoid granting escalate, bind, or impersonate verbs — these allow privilege escalation.
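The boundary can be verified with impersonation. A sketch, assuming a user jane in the payments-team group (the caller needs impersonation rights, which cluster-admin has):

```shell
# List everything the tenant identity can do inside its own namespace
kubectl auth can-i --list -n team-payments --as=jane --as-group=payments-team
# Confirm cross-namespace access is denied (should print "no")
kubectl auth can-i list pods -n kube-system --as=jane --as-group=payments-team
```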
2. ResourceQuotas
RBAC controls what a tenant can do, but it does not limit how much. A single tenant could consume all the cluster's CPU and memory, starving everyone else. ResourceQuotas set hard limits on the total resources a namespace can consume and the number of objects it can create.
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-quota
namespace: team-payments
spec:
hard:
requests.cpu: "8"
requests.memory: 16Gi
limits.cpu: "16"
limits.memory: 32Gi
pods: "40"
services: "10"
persistentvolumeclaims: "5"
secrets: "20"
configmaps: "20"
Once a ResourceQuota that tracks compute resources is active, every new Pod in the namespace must specify requests and limits for those resources — otherwise the API server rejects the creation. This is by design: Kubernetes cannot enforce quotas if it does not know how much a Pod plans to consume. (The LimitRange defaults in the next section satisfy this requirement automatically.)
3. LimitRanges
ResourceQuotas cap the total for the namespace. LimitRanges cap individual Pods and containers — they set default, minimum, and maximum resource values. This prevents a single Pod from requesting 100 CPUs within a namespace that has a 16-CPU quota.
apiVersion: v1
kind: LimitRange
metadata:
name: tenant-limits
namespace: team-payments
spec:
limits:
- type: Container
default: # Applied when no limits are specified
cpu: "500m"
memory: 256Mi
defaultRequest: # Applied when no requests are specified
cpu: "100m"
memory: 128Mi
max:
cpu: "2"
memory: 2Gi
min:
cpu: "50m"
memory: 64Mi
- type: Pod
max:
cpu: "4"
memory: 4Gi
4. NetworkPolicies
By default, every Pod in a Kubernetes cluster can communicate with every other Pod — across namespaces. In a multi-tenant setup, this is unacceptable. NetworkPolicies act as namespace-level firewalls, restricting which Pods can talk to each other. They require a CNI plugin that implements them (Calico, Cilium, and Antrea do; simpler plugins such as flannel and the legacy kubenet do not, and silently ignore the policies).
# Default deny all ingress and egress in the tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: team-payments
spec:
podSelector: {} # Applies to ALL pods in the namespace
policyTypes:
- Ingress
- Egress
---
# Allow pods within the same namespace to talk to each other
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-intra-namespace
namespace: team-payments
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector: {}
egress:
- to:
- podSelector: {}
- to: # Allow DNS resolution (kube-system)
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
Always start with a default-deny policy and then add specific allow rules. If you skip the default-deny, NetworkPolicies are purely additive — any Pod without a matching policy can still communicate freely with the entire cluster. The deny-all policy closes this gap.
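With default-deny in place, each legitimate external path needs its own allow rule. A sketch that admits traffic from an ingress controller; the ingress-nginx namespace name, the payment-api label, and the port are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-controller
  namespace: team-payments
spec:
  podSelector:
    matchLabels:
      app: payment-api       # only the Pods that actually serve HTTP
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
```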
5. Pod Security Standards
Pod Security Standards (PSS) replaced PodSecurityPolicies, which were removed in Kubernetes 1.25. They are enforced at the namespace level using labels (via the built-in Pod Security Admission controller) and prevent tenants from deploying privileged containers, mounting host paths, or escalating privileges. Three built-in profiles exist:
| Profile | What It Allows | Use Case |
|---|---|---|
| privileged | No restrictions — full access to host namespaces, capabilities, and volumes | System-level workloads (CNI plugins, monitoring agents in kube-system) |
| baseline | Prevents known privilege escalations — blocks hostNetwork, hostPID, privileged containers, and dangerous capabilities | General-purpose workloads that don't need special privileges |
| restricted | Heavily restricted — requires running as non-root, encourages a read-only root filesystem, drops ALL capabilities, and restricts volume types | Multi-tenant namespaces with untrusted or third-party workloads |
# Enforce the restricted profile on a tenant namespace
kubectl label namespace team-payments \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
With enforce=restricted, the admission controller rejects any Pod that violates the restricted profile. The warn and audit labels produce warnings and audit log entries respectively — useful for gradual rollout where you enable warn first, fix violations, and then switch to enforce.
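Before flipping enforce on an existing namespace, you can preview what would break with a server-side dry run: the admission controller reports every running Pod that violates the target profile without changing anything. The warning text shown is illustrative:

```shell
# Evaluate existing Pods against the restricted profile without applying the label
kubectl label --dry-run=server --overwrite namespace team-payments \
  pod-security.kubernetes.io/enforce=restricted
# Violating Pods surface as warnings, e.g.:
# Warning: existing pods in namespace "team-payments" violate the new
# PodSecurity enforce level "restricted:latest"
```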
Hierarchical Namespace Controller (HNC)
In large organizations, the platform team typically creates namespaces and RBAC bindings for each tenant. This becomes a bottleneck when you have hundreds of teams. The Hierarchical Namespace Controller (HNC) solves this by letting you define parent-child relationships between namespaces. A parent namespace can propagate Roles, RoleBindings, NetworkPolicies, ResourceQuotas, and other objects down to child (sub) namespaces automatically.
# Install HNC and define a hierarchy
# Parent namespace: team-payments
# Child namespaces: team-payments-staging, team-payments-dev
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
name: team-payments-staging
namespace: team-payments # The parent namespace
---
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
name: team-payments-dev
namespace: team-payments
With this hierarchy, any Role, RoleBinding, or NetworkPolicy you create in team-payments is automatically propagated to team-payments-staging and team-payments-dev. You configure which resource types propagate using HNCConfiguration. The team lead for team-payments can create sub-namespaces without involving the platform team — true self-service, scoped safely to their subtree.
# Install the HNC controller into the cluster (the kubectl-hns plugin ships separately, e.g. via krew)
kubectl apply -f https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/default.yaml
# View the hierarchy tree
kubectl hns tree team-payments
# Output:
# team-payments
# ├── team-payments-staging
# └── team-payments-dev
# Check which objects are propagated
kubectl hns describe team-payments
Virtual Clusters with vCluster
Namespace-level isolation has a fundamental limitation: tenants cannot create cluster-scoped resources like CRDs, ClusterRoles, or admission webhooks. If a tenant needs to install a Helm chart that includes CRDs, namespace isolation is not enough. This is where virtual clusters fill the gap.
vCluster (by Loft Labs) creates lightweight virtual Kubernetes clusters that run inside namespaces of a host cluster. Each vCluster has its own API server and a separate etcd (or SQLite/PostgreSQL) backing store. Tenants interact with their vCluster as if it were a real cluster — they can create CRDs, namespaces, ClusterRoles — but the actual workload Pods run on the host cluster's worker nodes.
graph TB
subgraph HOST["Host Cluster"]
HAPI["Host API Server"]
subgraph NS1["Namespace: vc-tenant-a"]
VCA_API["vCluster A<br/>API Server"]
VCA_SYNC["Syncer"]
VCA_STORE["etcd / SQLite"]
VCA_API --> VCA_STORE
end
subgraph NS2["Namespace: vc-tenant-b"]
VCB_API["vCluster B<br/>API Server"]
VCB_SYNC["Syncer"]
VCB_STORE["etcd / SQLite"]
VCB_API --> VCB_STORE
end
W1["Worker Node 1<br/>Runs Pods from all vClusters"]
W2["Worker Node 2<br/>Runs Pods from all vClusters"]
end
TENANT_A["Tenant A<br/>kubectl"] -->|"kubeconfig"| VCA_API
TENANT_B["Tenant B<br/>kubectl"] -->|"kubeconfig"| VCB_API
VCA_SYNC -->|"sync Pods,<br/>Services"| HAPI
VCB_SYNC -->|"sync Pods,<br/>Services"| HAPI
HAPI --> W1
HAPI --> W2
# Install vCluster CLI
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster && sudo mv vcluster /usr/local/bin/
# Create a virtual cluster for tenant-a in the host cluster
vcluster create tenant-a --namespace vc-tenant-a
# Connect to the virtual cluster (switches kubeconfig context)
vcluster connect tenant-a --namespace vc-tenant-a
# Inside the vCluster — tenant sees a "clean" cluster
kubectl get namespaces # Only default, kube-system, kube-public
kubectl create namespace my-app
kubectl apply -f my-crd.yaml # CRDs work — this is a full API server
# Disconnect and return to host cluster context
vcluster disconnect
The syncer component is the bridge. It watches for Pods and Services created inside the vCluster and replicates them as real objects in the host namespace. The host scheduler places these Pods on real nodes, but the tenant only sees them through their virtual API server. This gives you strong API-level isolation with minimal infrastructure overhead — a vCluster typically consumes about 200Mi of memory for its control plane.
Pods from different vClusters still share the same host kernel and worker nodes. A container escape in one vCluster can potentially affect another. If you need kernel-level isolation between tenants, combine vCluster with node isolation (dedicated node pools per tenant) or use a sandboxed runtime like gVisor or Kata Containers.
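A sandboxed runtime is wired in through a RuntimeClass. A sketch, assuming gVisor's runsc handler is already installed and configured in containerd on the nodes (the Pod name and image are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc               # must match the handler name configured in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: gvisor   # this Pod's containers run inside the gVisor sandbox
  containers:
  - name: app
    image: nginx:1.25
```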
Node Isolation with Node Pools and Taints
Namespaces and virtual clusters still leave tenants sharing worker-node kernels. For tenants who require stronger isolation, you can dedicate physical (or virtual) nodes to specific tenants. This ensures one tenant's workloads never share a kernel with another's — eliminating noisy-neighbor performance issues and reducing the blast radius of a container escape.
The mechanism is a combination of taints on nodes (to repel all Pods by default) and tolerations on tenant Pods (to explicitly allow scheduling). Pair this with nodeAffinity to guarantee Pods land only on designated nodes.
# Label and taint nodes for a specific tenant
kubectl label nodes node-pool-a-1 node-pool-a-2 tenant=payments
kubectl taint nodes node-pool-a-1 node-pool-a-2 tenant=payments:NoSchedule
# Tenant Pod spec — tolerate the taint AND require the node label
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-api
namespace: team-payments
spec:
replicas: 3
selector:
matchLabels:
app: payment-api
template:
metadata:
labels:
app: payment-api
spec:
tolerations:
- key: "tenant"
operator: "Equal"
value: "payments"
effect: "NoSchedule"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: tenant
operator: In
values: ["payments"]
containers:
- name: api
image: payments/api:2.1.0
resources:
requests:
cpu: "250m"
memory: 256Mi
The taint prevents other tenants' Pods from landing on these nodes. The nodeAffinity prevents this tenant's Pods from landing on other tenants' nodes. You need both — a toleration alone only means a Pod can run on a tainted node, it does not prevent it from running elsewhere. On managed Kubernetes (EKS, GKE, AKS), you typically configure this through dedicated node pools with auto-applied taints and labels.
Cost Allocation per Tenant
Sharing a cluster saves money, but you need to know who is consuming how much. Kubernetes does not have built-in cost tracking, but the combination of consistent labeling and a cost allocation tool like Kubecost (or OpenCost, the CNCF sandbox project that Kubecost contributed) gives you per-tenant cost visibility.
Step 1: Establish a Labeling Convention
The foundation of cost allocation is consistent labels. Every tenant workload must carry labels that identify the owning team, the environment, and the cost center. Enforce this with an admission webhook (like OPA/Gatekeeper or Kyverno) that rejects resources missing required labels.
# Standard labels for cost allocation
metadata:
labels:
app.kubernetes.io/name: payment-api
app.kubernetes.io/part-of: payments-platform
cost-center: cc-4200
team: payments
env: production
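The admission-webhook enforcement mentioned above can be sketched as a Kyverno ClusterPolicy. The policy name, matched kinds, and message are illustrative; the label keys follow the convention above:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-team-and-cost-center
    match:
      any:
      - resources:
          kinds:
          - Deployment
          - StatefulSet
    validate:
      message: "team and cost-center labels are required for cost allocation"
      pattern:
        spec:
          template:
            metadata:
              labels:
                team: "?*"          # ?* = any non-empty value
                cost-center: "?*"
```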
Step 2: Deploy Kubecost / OpenCost
# Add the OpenCost chart repository, then install
# (OpenCost is a CNCF sandbox project, fully open-source)
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
--namespace opencost --create-namespace \
--set opencost.prometheus.internal.enabled=true
# Expose the OpenCost API locally (the API listens on 9003; the UI is on 9090)
kubectl port-forward -n opencost service/opencost 9003:9003 &
# Query per-namespace cost allocation via the API
curl -s "http://localhost:9003/allocation/compute?window=7d&aggregate=namespace" | jq '.data'
# Query per-label cost breakdown (e.g., by team label)
curl -s "http://localhost:9003/allocation/compute?window=30d&aggregate=label:team" | jq '.data'
Kubecost and OpenCost work by correlating Prometheus metrics for CPU, memory, storage, and network usage with the actual pricing from your cloud provider. They allocate shared costs (like the control plane fee or idle resources) using configurable strategies — proportional to usage, evenly split, or custom weights. The output is a per-namespace or per-label cost report you can feed into chargeback dashboards or internal billing systems.
Multi-Tenancy Decision Matrix
Choosing the right model depends on three factors: how much you trust the tenant, how strong the isolation must be, and how much operational complexity you can absorb. Use this matrix as a starting point.
| Factor | Namespace Isolation | vCluster | Cluster per Tenant |
|---|---|---|---|
| Trust level required | High — tenants are internal teams within the same org | Medium — tenants need more autonomy but you still control infrastructure | Low — tenants are external, untrusted, or subject to strict regulatory boundaries |
| API isolation | Shared API server — tenants can see that other namespaces exist | Separate API server per tenant — full cluster illusion | Fully separate — no shared components |
| CRD support | No — CRDs are cluster-scoped, affect all tenants | Yes — each vCluster has its own CRD registry | Yes — full cluster admin |
| Kernel isolation | None — shared kernel on worker nodes | None by default — add node pools or sandboxed runtimes | Full — dedicated nodes and control plane |
| Cluster count | 1 | 1 host + N lightweight virtual | N real clusters |
| Upgrade burden | 1 cluster to upgrade | 1 host cluster + vCluster versions (independent) | N clusters to upgrade |
| Resource efficiency | Highest — full bin-packing across all tenants | High — small overhead per vCluster (~200Mi) | Lowest — each cluster has its own control plane overhead |
| Setup complexity | Low — native Kubernetes primitives | Medium — requires vCluster operator | High — requires fleet management (Fleet, Rancher, ArgoCD multi-cluster) |
| Blast radius | Namespace — but a kernel exploit affects all | vCluster — but shared kernel still a risk | Fully contained to one cluster |
For most organizations, namespace-level isolation with RBAC, quotas, network policies, and Pod Security Standards is sufficient and dramatically simpler to operate. Move to vCluster when tenants need CRDs or cluster-admin-like autonomy. Move to dedicated clusters only when compliance or zero-trust requirements demand it — the operational cost of managing many clusters is significant and often underestimated.
Putting It All Together: A Tenant Onboarding Checklist
When you onboard a new tenant using namespace-level isolation, apply all five isolation mechanisms as a cohesive unit. Missing any one of them leaves a gap that undermines the others.
TENANT="team-orders"
# 1. Create namespace with Pod Security Standards
kubectl create namespace $TENANT
kubectl label namespace $TENANT \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted
# 2. Apply ResourceQuota
kubectl apply -n $TENANT -f quota.yaml
# 3. Apply LimitRange
kubectl apply -n $TENANT -f limitrange.yaml
# 4. Apply default-deny NetworkPolicy
kubectl apply -n $TENANT -f network-policy-deny-all.yaml
kubectl apply -n $TENANT -f network-policy-allow-intra-ns.yaml
# 5. Create RBAC Role and RoleBinding
kubectl apply -n $TENANT -f rbac-role.yaml
kubectl apply -n $TENANT -f rbac-binding.yaml
# Verify the setup
kubectl get quota,limitrange,networkpolicy,role,rolebinding -n $TENANT
In practice, you should codify this as a templated Helm chart, a Kustomize overlay, or a Crossplane Composition — so every new tenant gets identical isolation guarantees, and no step is accidentally skipped. The previous section on High Availability ensures the cluster itself is resilient; this section ensures tenants within it cannot interfere with each other. The next section ties both together with a production readiness checklist that covers the remaining operational concerns.
Production Readiness Checklist and Common Pitfalls
Running Kubernetes in development is forgiving. Running it in production is not. The gap between a cluster that works in a demo and one that survives real traffic, security audits, and 3 AM incidents is bridged by deliberate, systematic preparation. This section distills that preparation into a concrete checklist organized by category, followed by the ten most common mistakes teams make — and how to avoid every one of them.
Treat this as your pre-flight checklist. Before promoting any cluster or workload to production, walk through each category. A single missed item — an absent resource limit, a missing network policy, a skipped etcd backup test — can be the difference between a minor alert and a major outage.
graph LR
subgraph Checklist["Production Readiness Categories"]
direction TB
W["Workloads"]
S["Security"]
N["Networking"]
ST["Storage"]
O["Observability"]
C["Cluster Ops"]
end
W --> READY["Production Ready"]
S --> READY
N --> READY
ST --> READY
O --> READY
C --> READY
1. Workloads
Workload configuration is where most production incidents originate. A Pod without resource limits can starve its neighbors. A Deployment without probes can route traffic to containers that aren't ready. Anti-affinity rules and Pod Disruption Budgets are not optional — they are the mechanisms that keep your application available during node failures and cluster upgrades.
Set Resource Requests and Limits on Every Container
Resource requests tell the scheduler how much CPU and memory a Pod needs; limits cap what it can consume. Without requests, the scheduler makes blind placement decisions. Without limits, a single runaway container can trigger the kernel's OOM killer against everything else on the node. Always set both, and base them on observed usage, not guesses.
containers:
- name: api-server
image: myapp/api:v2.4.1
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
Even with developer discipline, someone will forget to set resources. Apply a LimitRange per namespace to inject default requests/limits on any container that omits them, and a ResourceQuota to cap total namespace consumption. This prevents any single team from monopolizing cluster resources.
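A minimal LimitRange in this spirit looks like the following; the specific values are illustrative defaults, not sizing recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:        # injected when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # injected when a container omits limits
        cpu: 500m
        memory: 256Mi
```

Defaults apply only to containers that specify nothing; any container that declares its own requests and limits is left untouched.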
Configure Liveness and Readiness Probes
Readiness probes gate traffic — a Pod won't receive Service traffic until its readiness probe passes. Liveness probes detect deadlocks — if a container is alive but stuck, the kubelet restarts it. A startup probe gives slow-starting containers (like Java apps) extra time before liveness checks begin. Use all three where appropriate.
containers:
- name: api-server
image: myapp/api:v2.4.1
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 2 # up to 60s to start
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 10
failureThreshold: 3
Use Anti-Affinity and Topology Spread Constraints
Running all replicas of a Deployment on a single node defeats the purpose of high availability. Pod anti-affinity rules ensure replicas spread across nodes, while topologySpreadConstraints give finer control — you can spread across zones, racks, or any custom topology domain.
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: api-server
topologyKey: kubernetes.io/hostname
Define Pod Disruption Budgets (PDBs)
During voluntary disruptions — node drains, cluster upgrades, autoscaler scale-downs — Kubernetes respects PDBs to ensure a minimum number of Pods remain available. Without a PDB, a node drain can evict all replicas simultaneously, causing downtime even when you have plenty of replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-server-pdb
spec:
minAvailable: 2 # or use maxUnavailable: 1
selector:
matchLabels:
app: api-server
2. Security
Kubernetes is secure-by-capability but permissive-by-default. A fresh cluster allows Pods to run as root, talk to any other Pod, and access the full Kubernetes API from within the cluster. Production hardening means flipping those defaults: deny everything, then allow only what's explicitly needed.
Enforce Pod Security Standards
Pod Security Standards (PSS), enforced through Pod Security Admission (PSA), replace the deprecated PodSecurityPolicy. There are three levels: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (hardened, follows security best practices). Apply at least baseline in enforce mode on every production namespace, and target restricted for workloads that don't need host access.
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
Your Pod specs must match the enforced standard. Under restricted, every container needs an explicit security context:
securityContext:
runAsNonRoot: true
runAsUser: 10001
seccompProfile:
type: RuntimeDefault
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
readOnlyRootFilesystem: true
RBAC, Image Scanning, and API Access
RBAC should follow the principle of least privilege — grant only the verbs and resources each service account actually needs. Never bind cluster-admin to application workloads. Scan every container image in your CI pipeline using tools like Trivy, Grype, or Snyk. Restrict API server access with firewall rules and disable anonymous authentication.
# Least-privilege Role: only read Pods and ConfigMaps
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: production
name: app-reader
rules:
- apiGroups: [""]
resources: ["pods", "configmaps"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
namespace: production
name: api-server-binding
subjects:
- kind: ServiceAccount
name: api-server
namespace: production
roleRef:
kind: Role
name: app-reader
apiGroup: rbac.authorization.k8s.io
Rotate Secrets and Use External Secret Stores
Kubernetes Secrets are base64-encoded, not encrypted at rest by default. Enable etcd encryption at rest, and prefer external secret managers (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager) with the Secrets Store CSI Driver or External Secrets Operator. Automate rotation so credentials are never stale.
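As a hedged sketch of the External Secrets Operator approach: assuming the operator is installed and a SecretStore named `vault-backend` already points at your Vault instance (both assumptions, as is the Vault path below), an ExternalSecret syncs the credential into a regular Kubernetes Secret and re-reads it on an interval, which is what makes automated rotation possible:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h                  # re-sync hourly to pick up rotations
  secretStoreRef:
    name: vault-backend                # assumed pre-existing SecretStore
    kind: SecretStore
  target:
    name: db-credentials               # Kubernetes Secret to create/update
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: secret/data/production/db # illustrative Vault path
        property: password
```

Pods then consume `db-credentials` as an ordinary Secret; the external store remains the single source of truth.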
3. Networking
By default, every Pod can talk to every other Pod in the cluster — across namespaces, across teams, across trust boundaries. This flat network is convenient for development and catastrophic for production. Network policies are your firewall rules within the cluster.
Apply NetworkPolicies with Default-Deny
Start by denying all ingress and egress traffic in each namespace, then explicitly allow only the communication paths your application requires. This mirrors the approach used in traditional firewalls: deny by default, allow by exception.
# Default deny all traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Allow specific traffic: api-server receives from ingress controller
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-to-api
namespace: production
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- protocol: TCP
port: 8080
Configure Ingress TLS and Plan IP Ranges
Terminate TLS at the Ingress controller using cert-manager for automated certificate provisioning and renewal. Plan your Pod CIDR, Service CIDR, and node CIDR ranges before cluster creation — changing them later requires a full cluster rebuild. Use non-overlapping ranges with your corporate network if you need VPN or VPC peering.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-ingress
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- api.example.com
secretName: api-tls-cert
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-server
port:
number: 8080
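The `letsencrypt-prod` ClusterIssuer referenced by the annotation above is not built in — you define it once per cluster. A typical cert-manager definition looks like the following (the contact email is a placeholder, and the HTTP-01 solver assumes the nginx ingress class):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com     # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # ACME account key storage
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```

cert-manager then watches Ingress resources carrying the annotation, solves the ACME challenge, and renews certificates before expiry with no manual steps.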
4. Storage
Stateful workloads need careful planning. Losing a PersistentVolume means losing data — there is no controller loop to reconcile that back into existence. Use dynamic provisioning to avoid manual volume management, and validate your backup and disaster recovery procedures before you need them.
Use Dynamic Provisioning with Appropriate StorageClasses
Define StorageClasses that match your performance and durability requirements. Use reclaimPolicy: Retain for any data you can't afford to lose — the default Delete policy destroys the underlying volume when the PVC is deleted.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-retain
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "4000"
throughput: "250"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Backup PersistentVolumes and Test Restores
Use Velero or a CSI snapshot controller to back up PVs on a schedule. A backup you've never restored is a backup you don't have. Run quarterly disaster recovery drills: delete a PVC, restore from backup, and verify data integrity. Document the procedure so any team member can execute it at 3 AM.
# Create a Velero backup of the entire production namespace
velero backup create prod-daily-$(date +%Y%m%d) \
--include-namespaces production \
--snapshot-volumes=true \
--ttl 720h
# Verify backup completed successfully
velero backup describe prod-daily-$(date +%Y%m%d)
# Restore to a test namespace to validate integrity
velero restore create --from-backup prod-daily-$(date +%Y%m%d) \
--namespace-mappings production:production-restore-test
5. Observability
You cannot manage what you cannot see. A production cluster without monitoring is flying blind. You need three pillars — metrics, logs, and traces — plus alerting that tells you about problems before your users do.
The Observability Stack
| Pillar | Tool Options | What to Monitor |
|---|---|---|
| Metrics | Prometheus + Grafana, Datadog, New Relic | CPU/memory usage, Pod restarts, request latency (p50/p95/p99), error rates, node disk pressure, PVC usage |
| Logs | Loki + Grafana, EFK (Elasticsearch + Fluentd + Kibana), CloudWatch | Application logs, kubelet logs, audit logs, ingress access logs |
| Traces | Jaeger, Tempo, Zipkin, OpenTelemetry Collector | Request flow across microservices, latency breakdown per hop, error propagation |
| Alerting | Alertmanager, PagerDuty, Opsgenie | Pod CrashLoopBackOff, node NotReady, certificate expiry, persistent volume >85% full, HPA at max replicas |
Key Alerts Every Cluster Needs
Don't wait for users to report issues. These Prometheus alerting rules cover the most critical failure modes. Deploy them with the kube-prometheus-stack Helm chart, which includes Prometheus, Alertmanager, Grafana, and a comprehensive set of recording and alerting rules out of the box.
groups:
- name: critical-cluster-alerts
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
for: 1h
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
- alert: PVCAlmostFull
expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
for: 10m
labels:
severity: warning
annotations:
summary: "PVC {{ $labels.persistentvolumeclaim }} is >85% full"
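Two more of the failure modes listed in the table above — certificate expiry and an HPA pinned at max replicas — can be covered in the same format. The metric names below assume cert-manager and kube-state-metrics are exporting to Prometheus:

```yaml
groups:
  - name: additional-cluster-alerts
    rules:
      - alert: CertificateExpiringSoon
        # fires when a cert-manager certificate expires within 7 days
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in under 7 days"
      - alert: HPAMaxedOut
        # an HPA stuck at max replicas can no longer absorb load growth
        expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 15 minutes"
```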
6. Cluster Operations
A production cluster is a living system. Kubernetes releases a new minor version every four months, nodes need patching, and etcd — the single source of truth for all cluster state — needs regular backup and tested restore procedures. Automate as much as possible, and never skip the testing step.
Automate Cluster Upgrades
Kubernetes supports upgrading one minor version at a time (e.g., 1.29 to 1.30, not 1.28 to 1.30). On managed clusters, use the provider's upgrade mechanism. On self-managed clusters, follow the documented kubeadm upgrade workflow. Always upgrade the control plane first, then worker nodes.
# 1. Check current version and available upgrades
kubeadm upgrade plan
# 2. Upgrade control plane (run on each control plane node)
sudo kubeadm upgrade apply v1.30.2
# 3. Drain a worker node before upgrading it
kubectl drain node-3 --ignore-daemonsets --delete-emptydir-data
# 4. Upgrade kubeadm on the worker and apply the node upgrade
sudo apt-get update && sudo apt-get install -y kubeadm='1.30.2-*'
sudo kubeadm upgrade node
# 5. Upgrade kubelet and kubectl, then restart the kubelet
sudo apt-get install -y kubelet='1.30.2-*' kubectl='1.30.2-*'
sudo systemctl daemon-reload && sudo systemctl restart kubelet
# 6. Uncordon the node to resume scheduling
kubectl uncordon node-3
Backup and Restore etcd
etcd holds every resource definition in the cluster. If etcd is lost without a backup, the entire cluster state is gone — every Deployment, Service, Secret, and ConfigMap. On self-managed clusters, back up etcd at least every hour to an off-cluster location.
# Snapshot etcd to a file
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-*.db -w table
# Restore from snapshot (stop kube-apiserver and etcd first)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20240115-0300.db \
--data-dir=/var/lib/etcd-restored
Plan Capacity and Enable Autoscaling
Monitor resource utilization at the node level. If nodes consistently run above 70% CPU or memory, add capacity before a traffic spike pushes them over the edge. Use the Cluster Autoscaler (or Karpenter on AWS) to automatically add or remove nodes based on pending Pod demand. Pair it with Horizontal Pod Autoscaling (HPA) for end-to-end elastic scaling.
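A minimal `autoscaling/v2` HorizontalPodAutoscaler for the api-server Deployment used throughout this section might look like this (the replica bounds and utilization target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requested CPU
```

The HPA adds Pods under load; if those Pods become unschedulable, the Cluster Autoscaler or Karpenter adds nodes — that pairing is what the paragraph above means by end-to-end elastic scaling.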
The Full Checklist
Here is the complete checklist condensed into a single reference table. Pin it to your team wiki or add it to your deployment pipeline as a validation gate.
| Category | Item | Priority |
|---|---|---|
| Workloads | Resource requests and limits set on every container | 🔴 Critical |
| | Liveness, readiness, and startup probes configured | 🔴 Critical |
| | Anti-affinity or topology spread constraints for HA workloads | 🟠 High |
| | PodDisruptionBudgets defined for all critical services | 🟠 High |
| | Graceful shutdown handled (preStop hooks, terminationGracePeriodSeconds) | 🟠 High |
| Security | RBAC enforced with least-privilege Roles/ClusterRoles | 🔴 Critical |
| | Pod Security Standards enforced (baseline or restricted) | 🔴 Critical |
| | Container images scanned in CI pipeline | 🟠 High |
| | Secrets encrypted at rest; external secret store integrated | 🟠 High |
| | API server access restricted to trusted networks | 🟠 High |
| Networking | Default-deny NetworkPolicies in every namespace | 🔴 Critical |
| | TLS termination at Ingress with automated certificate renewal | 🔴 Critical |
| | Pod, Service, and Node CIDRs planned and non-overlapping | 🟠 High |
| Storage | Dynamic provisioning with appropriate StorageClasses | 🟠 High |
| | PV backups scheduled and tested with restore drills | 🔴 Critical |
| | reclaimPolicy: Retain for critical volumes | 🟠 High |
| Observability | Metrics collection (Prometheus or equivalent) deployed | 🔴 Critical |
| | Centralized logging with retention policy | 🟠 High |
| | Alerting configured for critical failure modes (CrashLoop, NotReady, PVC full) | 🔴 Critical |
| Cluster Ops | Cluster upgrade process documented and tested | 🟠 High |
| | etcd backup/restore tested quarterly | 🔴 Critical |
| | Cluster Autoscaler or Karpenter configured for elastic capacity | 🟡 Medium |
Top 10 Production Pitfalls (and How to Avoid Them)
These are the mistakes that show up in postmortems again and again. Every one of them is avoidable.
1. No Resource Limits — Noisy Neighbor Outages
A single Pod without memory limits can consume all available memory on a node, triggering the kernel's OOM killer against every other Pod running there. The fix: enforce LimitRange objects in every namespace so that even forgotten containers get default limits applied automatically.
2. Liveness Probes That Check Dependencies
A liveness probe that queries the database will restart your Pod every time the database has a blip — turning a transient issue into a cascading failure across every replica. Liveness probes should check only whether the process itself is alive and responsive. Use readiness probes for dependency checks.
3. No PodDisruptionBudget — Upgrades Cause Downtime
During a kubectl drain, every Pod on the node is evicted. If all your replicas happen to be on that node (because you also forgot anti-affinity), your service goes down. A PDB with minAvailable: 1 ensures at least one replica stays running throughout the disruption.
4. Using latest Image Tags
The latest tag is mutable — it can point to a different image after every push. This means two Pods in the same Deployment can run different code. Worse, rollbacks don't work because Kubernetes sees no spec change. Always use immutable tags: digests (myapp@sha256:abc...) or explicit version tags (myapp:v2.4.1).
5. Running as Root with No Security Context
Without an explicit securityContext, most container images run as root (UID 0). A container escape exploit running as root gives an attacker root access to the host node. Set runAsNonRoot: true, drop all capabilities, and make the filesystem read-only.
6. Flat Network with No NetworkPolicies
Without network policies, a compromised Pod in the dev namespace can reach the database in the production namespace. The fix: default-deny policies in every namespace. This takes 10 minutes to implement and closes one of the largest attack surface areas in Kubernetes.
7. Not Testing etcd / PV Backup Restores
Teams that back up but never restore are surprised to discover their backups are corrupted, incomplete, or that the restore procedure takes four hours instead of 20 minutes. Schedule quarterly restore drills. Make it a calendar event. Verify data integrity after every restore.
8. Ignoring Pod Topology and Affinity
The scheduler may place all three replicas of your critical service on the same node or in the same availability zone. When that node or zone fails, all replicas go down simultaneously. Use topologySpreadConstraints with topologyKey: topology.kubernetes.io/zone for zone-level spread.
9. Skipping Graceful Shutdown Configuration
When Kubernetes terminates a Pod, it sends SIGTERM and waits for terminationGracePeriodSeconds (default: 30s). If your app doesn't handle SIGTERM — or if it needs more than 30 seconds to drain connections — requests are dropped. Implement a signal handler, add a preStop lifecycle hook, and increase the grace period if needed.
spec:
terminationGracePeriodSeconds: 60
containers:
- name: api-server
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 5"] # allow endpoints to de-register
10. No Observability Until the First Outage
Teams often delay deploying monitoring until after their first production incident, then scramble to debug blind. Deploy Prometheus, Grafana, and Alertmanager (or your stack of choice) as part of the initial cluster setup — before any workload goes live. The cost is minimal. The payoff is immediate.
These pitfalls rarely cause outages in isolation. The worst incidents combine multiple gaps: no resource limits plus no monitoring plus no PDB means a single memory leak cascades into a cluster-wide outage that nobody sees until customers call. Close every gap systematically — the checklist exists for exactly this reason.
Closing: From Knowledge to Practice
This page has taken you from the architecture of a Kubernetes cluster through workload design, networking, storage, security, and multi-tenancy. This final checklist ties every concept together into actionable preparation. Kubernetes rewards operators who are deliberate, systematic, and who automate their best practices into policy.
Don't try to implement every item at once. Prioritize the Critical items first — resource limits, probes, RBAC, network policies, TLS, and monitoring. Those alone will prevent the vast majority of production incidents. Then work through the High items as your operational maturity grows.
The best checklist is one that enforces itself. Use policy engines like Kyverno or OPA Gatekeeper to reject deployments that lack resource limits, probes, or security contexts. Use LimitRange and ResourceQuota as namespace-level guardrails. Shift compliance left so that developers get immediate feedback in CI, not a rejection in production.
Production Kubernetes is not a destination — it's an ongoing practice. Review this checklist before every major deployment, after every incident, and at the start of every quarter. The clusters that run reliably are the ones operated by teams that never stop asking: "What could go wrong, and have we prepared for it?"