Kubernetes Mastery — From Fundamentals to Advanced Operations

Prerequisites: Familiarity with Linux command line, basic networking (TCP/IP, DNS, HTTP), containers and Docker fundamentals (images, containers, Dockerfiles), YAML syntax, and a general understanding of distributed systems concepts.

Why Kubernetes — The Problem It Solves

Containers changed how we package software — a single image that runs the same way on a laptop and in production. But the moment you go from one container to dozens (or thousands), a new class of problems appears. Which server should run each container? What happens when a container crashes at 3 AM? How do services find each other as instances come and go? Kubernetes exists to answer these questions.

Before diving into architecture and YAML manifests, it is worth understanding why this system was built and what life looks like without it. That context makes every Kubernetes concept that follows feel like a solution to a problem you already recognize.

The Pain of Manual Container Orchestration

Imagine you run a web application with three services: an API, a background worker, and a Redis cache. You have four servers. Without an orchestrator, you are the orchestrator. Here is what a typical deployment script looks like when humans manage containers directly:

```bash
#!/bin/bash
# deploy.sh — manual container deployment across 4 servers

# 1. Decide which server has capacity (check manually or guess)
ssh server-02 "docker stats --no-stream" | grep -v "0.00%"

# 2. Pull the new image on each target server
for host in server-01 server-02 server-03; do
  ssh $host "docker pull registry.example.com/api:v2.4.1"
done

# 3. Stop old containers one at a time (hope nobody notices)
ssh server-01 "docker stop api-1 && docker rm api-1"
ssh server-01 "docker run -d --name api-1 -p 8080:8080 \
  -e DB_HOST=10.0.1.50 -e DB_PASSWORD=s3cret \
  registry.example.com/api:v2.4.1"

# 4. Repeat for server-02 and server-03...
# 5. Update the load balancer config manually
# 6. Pray that DNS propagation is fast enough
# 7. Set up a cron job to restart crashed containers (?)
```
This script has serious problems. The database password is hardcoded in plain text. There is no health check — if the new version crashes on startup, traffic still routes to it. Rolling back means re-running the script with the old image tag and hoping the state is clean. Scaling up means provisioning a new server, installing Docker, copying SSH keys, and updating the script. At 3 AM, when server-02 dies, nobody restarts those containers until a human wakes up.

The Five Hard Problems at Scale

The script above is not a strawman — it is how many teams actually operated in 2014–2016. The problems it exposes fall into five categories that every container orchestrator must solve.

```mermaid
mindmap
  root((Container
    Orchestration))
    Scheduling
      Bin-packing onto nodes
      Resource-aware placement
      Affinity and anti-affinity
    Service Discovery
      Stable DNS names
      Load balancing across replicas
      Dynamic IP handling
    Self-Healing
      Restart crashed containers
      Replace unresponsive nodes
      Health check enforcement
    Rolling Updates
      Zero-downtime deploys
      Automatic rollback
      Canary and blue-green
    Configuration
      Secret management
      Environment-specific config
      Hot-reload support
```

1. Scheduling

Given 50 containers and 10 servers, which container goes where? You need to consider CPU and memory availability, disk I/O requirements, data locality, and constraints like "don't put two replicas of the same service on the same host." Doing this by hand is error-prone. Doing it automatically, thousands of times per day, is a scheduling problem that requires a dedicated system.
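To make the scheduling problem concrete, here is a deliberately tiny greedy scheduler in Python. It is an illustration only (the real kube-scheduler runs a much richer filtering-and-scoring pipeline), and every node name, pod name, and label in it is invented:

```python
# Toy scheduler sketch: resource-aware placement plus a simple
# anti-affinity rule. Illustrative only -- not the real kube-scheduler.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cpu_free: int                       # millicores
    mem_free: int                       # MiB
    pods: list = field(default_factory=list)

def schedule(pod_name, cpu_req, mem_req, app_label, nodes):
    """Pick the feasible node with the most free CPU, refusing to
    co-locate two replicas of the same app (anti-affinity)."""
    feasible = [
        n for n in nodes
        if n.cpu_free >= cpu_req
        and n.mem_free >= mem_req
        and app_label not in (lbl for _, lbl in n.pods)
    ]
    if not feasible:
        return None                     # pod stays Pending
    best = max(feasible, key=lambda n: n.cpu_free)
    best.cpu_free -= cpu_req
    best.mem_free -= mem_req
    best.pods.append((pod_name, app_label))
    return best.name

cluster = [Node("node-1", 2000, 4096), Node("node-2", 2000, 4096)]
print(schedule("api-1", 250, 256, "api", cluster))  # node-1
print(schedule("api-2", 250, 256, "api", cluster))  # node-2 (anti-affinity)
print(schedule("api-3", 250, 256, "api", cluster))  # None: no feasible node
```

Even this toy version shows why hand-placement breaks down: every placement changes the inputs to the next decision, and the constraints interact.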

2. Service Discovery and Load Balancing

Containers get new IP addresses every time they restart. If your API talks to a Redis instance at a hardcoded IP, that breaks the moment Redis is rescheduled to a different node. You need a dynamic registry that tracks where each service is running and routes traffic accordingly — without requiring code changes.
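As a sketch of the idea (not Kubernetes code), here is a minimal registry in Python that maps a stable service name onto whatever addresses are currently alive. The `Registry` class and the IPs are invented for illustration:

```python
# Dynamic service discovery sketch: a stable name resolves to the
# current set of pod IPs, round-robin across replicas.
import itertools

class Registry:
    def __init__(self):
        self.endpoints = {}   # service name -> list of live pod IPs
        self._rr = {}         # service name -> round-robin iterator

    def update(self, service, ips):
        """Called whenever pods come or go (Kubernetes tracks this via
        the Endpoints/EndpointSlice objects behind each Service)."""
        self.endpoints[service] = list(ips)
        self._rr[service] = itertools.cycle(self.endpoints[service])

    def resolve(self, service):
        return next(self._rr[service])   # spread load across replicas

reg = Registry()
reg.update("redis", ["10.0.1.7"])
print(reg.resolve("redis"))              # 10.0.1.7
reg.update("redis", ["10.0.2.9"])        # pod rescheduled, new IP
print(reg.resolve("redis"))              # 10.0.2.9 -- no client change needed
```

Clients only ever ask for "redis"; the registry absorbs every IP change. Kubernetes Services plus cluster DNS give you exactly this decoupling out of the box.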

3. Self-Healing

Containers crash. Nodes go offline. Disks fill up. A production-grade system must detect these failures and recover automatically — restart the container, reschedule it to a healthy node, and stop sending traffic to instances that are not ready. The goal is to match a desired state continuously, not just at deployment time.

4. Rolling Updates and Rollbacks

Deploying a new version should not cause downtime. You need to bring up new containers, verify they are healthy, shift traffic, and drain old containers — all without dropping requests. When a deployment goes wrong, you need to revert to the previous version in seconds, not minutes.
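The arithmetic behind a zero-downtime rollout can be sketched as a small simulation. This is not the real Deployment controller; it is an illustration of the two invariants that `maxSurge` and `maxUnavailable` impose (total pods never exceed `replicas + maxSurge`; ready pods never fall below `replicas - maxUnavailable`, assuming every new pod becomes ready):

```python
# Illustrative only: walks the old/new replica counts through a rollout
# while honoring the maxSurge / maxUnavailable budgets.
def rollout_steps(replicas, max_surge, max_unavailable):
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("Kubernetes rejects maxSurge == maxUnavailable == 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while not (old == 0 and new == replicas):
        # scale the new version up as far as the surge budget allows
        new += min(replicas - new, replicas + max_surge - (old + new))
        # retire old pods only while enough pods stay available
        old -= min(old, (old + new) - (replicas - max_unavailable))
        steps.append((old, new))
    return steps

# 3 replicas, surge 1, at most 1 unavailable: never fewer than 2 pods
# serving, never more than 4 running at once
print(rollout_steps(3, 1, 1))   # [(3, 0), (1, 1), (0, 3)]
```

A rollback is the same walk with old and new swapped, which is why Kubernetes can revert in seconds: both ReplicaSets still exist, and only the counts change.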

5. Secret and Configuration Management

Database passwords, API keys, and TLS certificates cannot live in Docker images or shell scripts. You need a way to inject secrets at runtime, rotate them without redeploying, and ensure that only authorized services can access them.

From Google Borg to Kubernetes

Kubernetes did not emerge from a vacuum. Google had been running containers at scale since the early 2000s — long before Docker existed. Google's internal systems, Borg and its successor Omega, orchestrated billions of containers per week across the company's global infrastructure. Gmail, Search, YouTube, and Maps all run on Borg.

In 2014, Google open-sourced a system that captured the core design principles of Borg and Omega but was rebuilt from scratch for the broader community. Three key ideas carried over:

  • Declarative configuration. You describe what you want (3 replicas of this service, 512 MiB of RAM each), not how to get there. The system continuously works to make reality match your declaration.
  • API-driven everything. Every operation — deploying, scaling, inspecting — goes through a versioned REST API. There is no SSH, no special CLI magic. The kubectl command is just an API client.
  • Reconciliation loops. Controllers constantly compare actual state against desired state and take corrective action. This is what makes the system self-healing: it is always converging, not just executing a one-time script.

In 2015, Google donated Kubernetes to the newly formed Cloud Native Computing Foundation (CNCF). It was the first project to graduate from CNCF, and it catalyzed an entire ecosystem — from container runtimes (containerd, CRI-O) to service meshes (Istio, Linkerd) to observability tools (Prometheus, Jaeger).

Note

The name "Kubernetes" comes from the Greek word κυβερνήτης, meaning "helmsman" or "pilot." The abbreviation K8s replaces the eight letters between "K" and "s." The seven-spoked helm in the logo is a nod to the original project codename: "Project Seven" (a reference to Seven of Nine from Star Trek).

The Declarative Model: Tell It What, Not How

The most important mental shift when learning Kubernetes is moving from imperative commands to declarative specifications. Instead of scripting a sequence of steps, you write a document that describes your desired end state and hand it to the system.

Here is the contrast. The imperative approach you saw earlier — SSH into servers, run Docker commands, update load balancers — is a recipe of how to deploy. If any step fails, the system is in an unknown state. The declarative approach is fundamentally different:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2.4.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
```
Apply this file with a single command — kubectl apply -f api-deployment.yaml — and Kubernetes handles everything: finding nodes with available resources, pulling the image, starting containers, running health checks, configuring rolling updates, and injecting the secret. If a container crashes, Kubernetes restarts it. If a node dies, Kubernetes reschedules the pods elsewhere. The YAML file becomes the single source of truth.

Before and After: A Side-by-Side Comparison

| Operation | Manual (Bash + SSH + Docker) | Kubernetes |
| --- | --- | --- |
| Deploy 3 replicas | SSH into 3 servers, `docker run` on each, update load balancer config | `kubectl apply -f deployment.yaml` — scheduler places pods automatically |
| Scale to 10 replicas | Provision servers, install Docker, update deploy script, run it | `kubectl scale deployment/api --replicas=10` or edit the YAML |
| Rolling update | Stop/start containers one by one, manually verify, update LB between each | Change the image tag in YAML and apply — Kubernetes handles the rollout |
| Rollback | Re-run deploy script with old tag, hope state is clean | `kubectl rollout undo deployment/api` — instant revert |
| Self-healing | Cron job checking `docker ps`, manual restart, pager alert at 3 AM | Built-in: kubelet restarts failed containers, scheduler replaces lost pods |
| Service discovery | Hardcoded IPs in config files, consul/etcd bolted on separately | Built-in DNS: `api.default.svc.cluster.local` resolves automatically |
| Secrets | Environment variables in scripts, `.env` files on disk, plaintext | Kubernetes Secrets, injected as volumes or env vars, RBAC-controlled |
| Resource limits | Hope nobody deploys a memory-leaking container that kills the host | Per-container CPU/memory requests and limits, enforced by cgroups |

Kubernetes vs. the Alternatives

Kubernetes is not the only container orchestrator. Docker Swarm, HashiCorp Nomad, and Amazon ECS all solve overlapping problems. Choosing between them depends on your scale, team expertise, and operational requirements.

| Criteria | Kubernetes | Docker Swarm | Nomad | Amazon ECS |
| --- | --- | --- | --- | --- |
| Learning curve | Steep — many concepts and abstractions | Gentle — extends Docker CLI naturally | Moderate — simpler model, HCL config | Moderate — AWS console or Terraform |
| Ecosystem | Massive — CNCF landscape, Helm charts, operators | Limited — mostly Docker-native tools | Growing — integrates with Vault, Consul | AWS-only — tight Fargate, ALB, CloudWatch integration |
| Multi-cloud | Yes — runs on any cloud or bare metal | Yes — but declining adoption | Yes — cloud-agnostic | No — locked to AWS |
| Workload types | Containers (primary), VMs via KubeVirt | Containers only | Containers, VMs, Java JARs, binaries | Containers only |
| Auto-scaling | HPA, VPA, Cluster Autoscaler, KEDA | Manual or basic rules | Built-in autoscaler | Application Auto Scaling, Fargate auto |
| Production adoption | De facto standard — 96% of organizations using or evaluating it in CNCF's 2021 survey | Declining — Docker Inc. pivoted away | Niche — strong at companies using the HashiCorp stack | Significant — dominant within AWS-only shops |
When Kubernetes is overkill

If you run fewer than five services on a single cloud provider and your team has no Kubernetes experience, starting with ECS (on AWS), Cloud Run (on GCP), or Azure Container Apps can get you to production faster. Kubernetes shines when you need multi-cloud portability, complex networking policies, custom operators, or you are scaling beyond what managed PaaS platforms handle well. Don't adopt it for résumé-driven development.

The API-Driven Architecture

Everything in Kubernetes flows through its API server. When you run kubectl apply, you are making an HTTP request to the API. When a controller restarts a crashed pod, it reads from and writes to the same API. When a CI/CD pipeline deploys a new version, it hits the same endpoint. This uniform interface is what makes Kubernetes so extensible.

```bash
# kubectl is just an API client — you can do the same with curl
kubectl get pods -o json

# Equivalent raw API call (with authentication); quote the URL so the
# shell does not interpret <, >, or ? characters
curl -k "https://<api-server>:6443/api/v1/namespaces/default/pods" \
  --header "Authorization: Bearer $TOKEN"

# Watch for real-time changes (the same mechanism controllers use)
curl -k "https://<api-server>:6443/api/v1/namespaces/default/pods?watch=true" \
  --header "Authorization: Bearer $TOKEN"
```

This design means you can build your own tools, dashboards, and automation on top of Kubernetes without special access. Custom Resource Definitions (CRDs) let you extend the API with your own object types, and operators — custom controllers that watch those objects — let you encode domain-specific operational knowledge into the cluster itself. You can define a PostgresCluster resource and let an operator handle backups, failovers, and upgrades automatically.
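To make the operator idea concrete, here is a toy event loop in Python. The event dicts stand in for the API server's watch stream, and the "provisioning" is just a dictionary write; everything here is an invented sketch, not client-go or a real operator framework:

```python
# Minimal watch-loop sketch: an operator consumes an event stream for a
# custom resource and reconciles after every event. Illustrative only.
desired = {}   # name -> spec declared in the custom objects
actual = {}    # name -> what the operator has provisioned so far

def handle(event):
    kind, name = event["type"], event["name"]
    if kind in ("ADDED", "MODIFIED"):
        desired[name] = event["spec"]
    elif kind == "DELETED":
        desired.pop(name, None)
    reconcile(name)

def reconcile(name):
    """Drive actual state toward desired state for one object."""
    if name in desired and actual.get(name) != desired[name]:
        actual[name] = desired[name]     # e.g. create or resize a DB cluster
    elif name not in desired and name in actual:
        del actual[name]                 # e.g. tear the cluster down

for ev in [
    {"type": "ADDED",    "name": "pg-main", "spec": {"replicas": 2}},
    {"type": "MODIFIED", "name": "pg-main", "spec": {"replicas": 3}},
    {"type": "DELETED",  "name": "pg-main"},
]:
    handle(ev)

print(actual)   # {} -- object deleted, operator cleaned up after it
```

Real operators follow exactly this shape: watch, record the declared spec, reconcile. The operational knowledge lives in what `reconcile` does for the domain in question.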

The Reconciliation Loop: Why Declarative Wins

The power of the declarative model becomes clear when things go wrong. In an imperative system, a failed step leaves you in a partially applied state — some servers have the new version, others have the old one. Recovery means writing more imperative logic to detect and fix the inconsistency.

In Kubernetes, every controller runs a continuous loop: observe the current state, compare it to the desired state, and act to close the gap. This loop runs constantly — not just at deploy time.

```mermaid
flowchart LR
    A["Desired State\n(YAML in etcd)"] -->|compare| B{"Diff?"}
    B -->|No difference| C["Do nothing\n(system is healthy)"]
    B -->|Drift detected| D["Take action\n(create, update, delete)"]
    D --> E["Actual State\n(running containers)"]
    E -->|observe| B
    C -->|wait, re-check| B

    style A fill:#4a9eff,color:#fff,stroke:#3380cc
    style B fill:#f5a623,color:#fff,stroke:#c4851c
    style D fill:#e74c3c,color:#fff,stroke:#b83a2e
    style E fill:#2ecc71,color:#fff,stroke:#25a25a
    style C fill:#2ecc71,color:#fff,stroke:#25a25a
```

Suppose you declare 3 replicas of your API. A node crashes, taking one replica with it. The Deployment controller notices the actual count (2) does not match the desired count (3), and it schedules a new pod on a healthy node. No human intervention. No pager alert. The system converges back to the desired state automatically. This is the fundamental difference between running containers and orchestrating them.
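That convergence step can be expressed in a few lines. The following Python sketch is illustrative only; a real controller reacts to watch events from the API server rather than being called directly:

```python
# One pass of a replica-count control loop (illustrative sketch).
def reconcile_replicas(desired, running):
    diff = desired - len(running)
    if diff > 0:
        return ("create", diff)      # schedule replacement pods
    if diff < 0:
        return ("delete", -diff)     # scale down surplus pods
    return ("noop", 0)

pods = ["api-1", "api-2", "api-3"]
pods.remove("api-2")                 # a node crash takes one replica down
print(reconcile_replicas(3, pods))   # ('create', 1)
pods.append("api-4")                 # controller creates a replacement
print(reconcile_replicas(3, pods))   # ('noop', 0)
```

The loop never asks how the drift happened; it only compares counts and acts, which is why crashes, node losses, and manual deletions are all handled by the same code path.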

Declarative does not mean "set and forget"

Kubernetes handles container-level failures automatically, but it does not replace monitoring, alerting, or capacity planning. If your cluster runs out of nodes, the scheduler cannot place new pods. If your application has a logic bug, Kubernetes will faithfully keep running the broken code. Declarative orchestration handles infrastructure-level reliability — application-level reliability is still your responsibility.

What You Will Build on This Foundation

This section answered the "why." You now understand the five core problems — scheduling, service discovery, self-healing, rolling updates, and configuration management — and why a declarative, API-driven system is the right abstraction for solving them at scale. You have seen how Kubernetes inherits battle-tested ideas from a decade of Google's internal infrastructure and where it fits relative to alternatives.

In the next section, Cluster Architecture at a Glance, you will see how Kubernetes is structured — the control plane components that make decisions and the node components that execute them. Every concept from this section — the API server, the reconciliation loop, the scheduler — maps directly to a specific component in the architecture.

Cluster Architecture at a Glance

A Kubernetes cluster is a set of machines — physical or virtual — organized into two distinct roles: control plane nodes that make decisions about the cluster, and worker nodes that run your application workloads. Every interaction you have with Kubernetes, from deploying an app to scaling a service, flows through this architecture.

Understanding this split is foundational. The control plane is the brain; worker nodes are the muscle. They communicate over well-defined APIs, and every component has a single, clear responsibility. This section maps out every piece so you know exactly what happens when you run kubectl apply.

The Two Halves of a Cluster

A Kubernetes cluster divides cleanly into the control plane and the data plane (worker nodes). The control plane manages cluster state — it decides what should run and where. Worker nodes execute those decisions — they actually run your containers. Here is what lives on each side:

| Control Plane Component | Role |
| --- | --- |
| kube-apiserver | Front door to the cluster. All reads/writes to cluster state go through it. Exposes the REST API that kubectl, controllers, and kubelets all talk to. |
| etcd | Distributed key-value store. The single source of truth for all cluster data — every Pod spec, Service definition, and ConfigMap lives here. |
| kube-scheduler | Watches for newly created Pods with no assigned node and selects the best node based on resource requirements, affinity rules, and constraints. |
| kube-controller-manager | Runs controller loops (Deployment controller, ReplicaSet controller, Node controller, etc.) that continuously reconcile desired state with actual state. |
| cloud-controller-manager | Integrates with cloud provider APIs for load balancers, routes, and node lifecycle. Only present when running on a cloud platform. |

| Worker Node Component | Role |
| --- | --- |
| kubelet | Agent on every node. Watches the API server for Pods assigned to its node, then instructs the container runtime to start/stop containers. Reports node and Pod status back. |
| kube-proxy | Maintains network rules (iptables or IPVS) on each node so that Service ClusterIPs and NodePorts route traffic to the correct Pod endpoints. |
| Container Runtime | The software that actually runs containers. Kubernetes talks to it through the Container Runtime Interface (CRI). Common runtimes: containerd, CRI-O. |

Cluster Architecture Diagram

The diagram below shows how control plane and worker node components connect. Notice that every component communicates through the API server — it is the single hub. No component talks directly to etcd except the API server, and no component talks directly to another component's internal state.

```mermaid
graph TB
    subgraph CP["Control Plane Node(s)"]
        API["kube-apiserver"]
        ETCD["etcd"]
        SCHED["kube-scheduler"]
        CM["kube-controller-manager"]
        CCM["cloud-controller-manager"]
    end

    subgraph W1["Worker Node 1"]
        KL1["kubelet"]
        KP1["kube-proxy"]
        CR1["Container Runtime"]
        P1A["Pod A"]
        P1B["Pod B"]
    end

    subgraph W2["Worker Node 2"]
        KL2["kubelet"]
        KP2["kube-proxy"]
        CR2["Container Runtime"]
        P2A["Pod C"]
        P2B["Pod D"]
    end

    USER["👤 User / CI"]
    KUBECTL["kubectl"]

    USER --> KUBECTL
    KUBECTL -->|"HTTPS REST"| API
    API <-->|"read/write state"| ETCD
    SCHED -->|"watch & bind Pods"| API
    CM -->|"watch & reconcile"| API
    CCM -->|"cloud provider calls"| API

    KL1 -->|"watch assigned Pods"| API
    KL2 -->|"watch assigned Pods"| API
    KL1 --> CR1
    KL2 --> CR2
    CR1 --> P1A
    CR1 --> P1B
    CR2 --> P2A
    CR2 --> P2B
    KP1 -->|"watch Services/Endpoints"| API
    KP2 -->|"watch Services/Endpoints"| API
```
The API Server Is the Only etcd Client

No other component reads from or writes to etcd directly. The kube-scheduler, controller manager, kubelets, and kube-proxy all interact with etcd indirectly through the API server. This is a deliberate design choice — it centralizes authentication, authorization (RBAC), admission control, and validation in a single place.

Desired State vs. Actual State

Kubernetes operates on a declarative model. You tell it what you want (desired state), and the system continuously works to make reality match. You never say "start 3 Pods" imperatively — you declare "there should be 3 replicas" and Kubernetes figures out the rest.

The desired state lives in etcd as resource specs (Deployments, Services, ConfigMaps). The actual state is the real-time status of nodes and containers as reported by kubelets. Controller loops in the controller manager are the bridge — they watch for drift between the two and take corrective action. If a Pod crashes, the ReplicaSet controller sees the replica count is below the desired number and tells the API server to create a new Pod.

Anatomy of a Request: From kubectl apply to Running Containers

Understanding the end-to-end request flow demystifies Kubernetes. Here is exactly what happens when you apply a Deployment manifest:

```mermaid
sequenceDiagram
    actor User
    participant kubectl
    participant API as kube-apiserver
    participant etcd
    participant CM as controller-manager
    participant Sched as kube-scheduler
    participant KL as kubelet
    participant CRT as Container Runtime

    User->>kubectl: kubectl apply -f deployment.yaml
    kubectl->>API: POST /apis/apps/v1/namespaces/default/deployments
    API->>API: Authenticate, Authorize (RBAC), Admission Control
    API->>etcd: Persist Deployment object
    etcd-->>API: Write confirmed

    Note over CM: Deployment controller watch triggers
    CM->>API: Read new Deployment, create ReplicaSet
    API->>etcd: Persist ReplicaSet object
    CM->>API: ReplicaSet controller creates Pod objects
    API->>etcd: Persist Pod objects (nodeName = "")

    Note over Sched: Scheduler watch triggers on unbound Pods
    Sched->>API: Read unbound Pods, evaluate node fitness
    Sched->>API: Bind Pod → selected Node (set nodeName)
    API->>etcd: Update Pod with nodeName

    Note over KL: Kubelet watch triggers for its node
    KL->>API: Read Pod spec assigned to this node
    KL->>CRT: Pull image, create & start containers
    CRT-->>KL: Containers running
    KL->>API: Report Pod status = Running
    API->>etcd: Update Pod status
```

The entire flow is asynchronous and event-driven. No component blocks waiting for the next one — they all use watches (long-lived HTTP connections to the API server) to react to changes. This is why Kubernetes can handle thousands of Pods across hundreds of nodes without a central orchestration bottleneck.

The Seven Stages in Plain Language

  1. Submission. You run kubectl apply. kubectl reads your YAML, validates it client-side, and sends an HTTPS request to the API server.
  2. Admission & Persistence. The API server authenticates your identity, checks RBAC authorization, runs admission webhooks (e.g., injecting sidecar containers), validates the schema, and writes the Deployment object to etcd.
  3. Controller Reconciliation. The Deployment controller notices the new Deployment and creates a ReplicaSet. The ReplicaSet controller sees the new ReplicaSet and creates the specified number of Pod objects — but these Pods have no nodeName yet.
  4. Scheduling. The scheduler watches for Pods without a node assignment. It evaluates each worker node against the Pod's resource requests, node selectors, affinity/anti-affinity rules, and taints/tolerations. It picks the best node and writes the binding back to the API server.
  5. Kubelet Execution. The kubelet on the chosen node sees a new Pod assigned to it. It pulls the container image (if not cached), creates the container sandbox via the Container Runtime Interface (CRI), and starts the containers.
  6. Status Reporting. The kubelet continuously reports Pod status (Pending → Running → Succeeded/Failed) back to the API server, which persists it in etcd.
  7. Ongoing Reconciliation. Controllers keep watching. If a container crashes, the kubelet restarts it (based on the Pod's restartPolicy). If a node goes down, the node controller marks it as NotReady, and the ReplicaSet controller creates replacement Pods on healthy nodes.

Seeing It in Practice

You can observe this flow in real time. Open two terminal windows — one to watch events as they happen, and another to trigger the deployment:

```bash
# Terminal 1 — watch cluster events in real time
kubectl get events --watch --sort-by='.lastTimestamp'
```
```bash
# Terminal 2 — apply a simple deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
EOF
```

In the events window, you will see the exact sequence play out: the Deployment is created, then a ReplicaSet, then three Pods. Each Pod goes through Scheduled → Pulling → Pulled → Created → Started events. You can also inspect each layer directly:

```bash
# See the ownership chain: Deployment → ReplicaSet → Pod
kubectl get deploy nginx-demo -o wide
kubectl get rs -l app=nginx-demo -o wide
kubectl get pods -l app=nginx-demo -o wide

# Check which node each Pod was scheduled to
kubectl get pods -l app=nginx-demo -o custom-columns=\
NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
```

Managed Kubernetes vs. Self-Managed Clusters

When you run Kubernetes, you have a choice: manage the control plane yourself, or let a cloud provider handle it. This decision affects your operational burden, cost model, and level of control. In production, most teams choose managed Kubernetes unless they have specific compliance or customization requirements.

| Aspect | Managed (EKS, GKE, AKS) | Self-Managed (kubeadm, k3s, Rancher) |
| --- | --- | --- |
| Control Plane | Provider runs and patches the API server, etcd, scheduler, and controller manager. You never SSH into control plane nodes. | You install, configure, upgrade, and monitor every control plane component yourself. |
| etcd | Fully managed, backed up automatically, and replicated across availability zones. Invisible to you. | You manage etcd cluster health, backups, compaction, and disaster recovery. |
| Upgrades | One-click or automated control plane upgrades. You still manage worker node upgrades (often via managed node groups). | You plan and execute upgrades for both control plane and worker nodes, one minor version at a time. |
| Networking | Integrated with cloud VPCs, load balancers, and DNS. CNI plugins often pre-configured or tightly integrated (e.g., VPC-native Pods on GKE). | You choose and configure the CNI plugin (Calico, Cilium, Flannel), set up load balancers, and manage DNS. |
| Cost | Control plane fee (e.g., ~$0.10/hr on EKS/AKS, free on GKE Autopilot for basic tier) + worker node compute costs. | No control plane fee, but you pay for the compute of control plane nodes and the engineering time to manage them. |
| Customization | Limited API server flags and admission controller choices. You use what the provider supports. | Full control over every flag, plugin, and configuration file. You can run custom schedulers, admission webhooks, and etcd topologies. |
| Best For | Teams that want to focus on applications, not infrastructure. Most production workloads. | Air-gapped environments, edge deployments, custom compliance requirements, or deep Kubernetes learning. |
Managed Does Not Mean Fully Hands-Off

Even with managed Kubernetes, you are responsible for worker node updates, Pod security, RBAC policies, network policies, application-level monitoring, and backup of your own workload state. The provider manages the control plane — everything else is still on you.

Quick Cluster Inspection Commands

Regardless of whether your cluster is managed or self-managed, these commands give you immediate visibility into the architecture of any cluster you connect to:

```bash
# Cluster info: API server endpoint, CoreDNS, and add-on URLs
kubectl cluster-info

# All nodes with roles, version, and OS info
kubectl get nodes -o wide

# Control plane components health (self-managed clusters)
kubectl get componentstatuses   # deprecated but still works in many clusters
kubectl get --raw='/healthz?verbose'

# See system Pods running the control plane and node agents
kubectl get pods -n kube-system -o wide

# Detailed view of a specific node's capacity and allocatable resources
kubectl describe node <node-name> | grep -A 6 "Capacity\|Allocatable"
```
On Managed Clusters, Control Plane Pods Are Hidden

If you run kubectl get pods -n kube-system on EKS, GKE, or AKS, you will not see the API server, scheduler, or controller manager Pods — the provider runs them outside your cluster's visibility. You will see kube-proxy, CoreDNS, and any CNI plugin DaemonSets, but the core control plane is abstracted away.

Putting It All Together

The architecture of a Kubernetes cluster is deliberate in its separation of concerns. The API server is the single communication hub. etcd is the single source of truth. The scheduler makes placement decisions. Controllers reconcile desired state with reality. Kubelets do the work of running containers. Every piece is independently replaceable and horizontally scalable.

This architecture is why Kubernetes can self-heal: if a node fails, the node controller detects it, the ReplicaSet controller creates replacement Pods, the scheduler places them on healthy nodes, and the kubelets start them — all automatically, with no human intervention. The next section digs deeper into each control plane component and how they work internally.

Control Plane Components — Under the Hood

The control plane is the brain of a Kubernetes cluster. It makes global decisions about scheduling, detects and responds to cluster events, and serves as the single source of truth for the desired state of every object. Understanding how each component works — and how they interact — is essential for debugging production issues and designing resilient clusters.

This section dissects four components: the API server (the front door), etcd (the memory), the scheduler (the matchmaker), and the controller manager (the reconciliation engine). We will look at internals, trace real request flows, and use kubectl to inspect each component in a live cluster.

```mermaid
flowchart TB
    subgraph ControlPlane["Control Plane"]
        API["kube-apiserver\n(REST API Gateway)"]
        ETCD["etcd\n(Distributed KV Store)"]
        SCHED["kube-scheduler\n(Pod Placement)"]
        CM["kube-controller-manager\n(Reconciliation Loops)"]
    end

    USER["kubectl / Client"] -->|"REST + AuthN/AuthZ\n+ Admission"| API
    API <-->|"gRPC\n(reads & writes)"| ETCD
    API -->|"Watch: unscheduled Pods"| SCHED
    SCHED -->|"Bind Pod → Node"| API
    API -->|"Watch: resource changes"| CM
    CM -->|"Create/Update objects"| API

    subgraph Nodes["Worker Nodes"]
        K1["kubelet"]
        K2["kubelet"]
    end

    API -->|"Watch: Pod specs"| K1
    API -->|"Watch: Pod specs"| K2

    style ControlPlane fill:#1a1a2e,stroke:#4a9eff,stroke-width:2px,color:#e0e0e0
    style Nodes fill:#1a1a2e,stroke:#50c878,stroke-width:2px,color:#e0e0e0
    style API fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
    style ETCD fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    style SCHED fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    style CM fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
```

kube-apiserver — The Front Door to Everything

Every interaction with your cluster — whether from kubectl, the dashboard, a CI/CD pipeline, or an internal controller — goes through the API server. It is the only component that talks to etcd directly. Nothing else reads or writes persistent state without going through this gateway first.

The API server is a stateless REST API built on HTTP/2. You can scale it horizontally by running multiple instances behind a load balancer. Each instance is functionally identical because etcd holds all the state.

The Request Pipeline: AuthN → AuthZ → Admission → etcd

Every request that hits the API server passes through a well-defined pipeline. Understanding this pipeline is critical for debugging 403 Forbidden errors, webhook failures, and mysterious object mutations.

mermaid
flowchart LR
    REQ["Incoming\nRequest"] --> AUTHN["Authentication\n(Who are you?)"]
    AUTHN --> AUTHZ["Authorization\n(Can you do this?)"]
    AUTHZ --> MUT["Mutating\nAdmission"]
    MUT --> SCHEMA["Object Schema\nValidation"]
    SCHEMA --> VAL["Validating\nAdmission"]
    VAL --> ETCD["Persist\nto etcd"]

    style REQ fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
    style AUTHN fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    style AUTHZ fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    style MUT fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    style SCHEMA fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
    style VAL fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    style ETCD fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    

Authentication determines who is making the request. The API server evaluates multiple authenticators in sequence — client certificates, bearer tokens, OIDC tokens, service account tokens — and stops at the first one that succeeds. The result is a username, UID, and group membership.

Authorization determines what the authenticated identity is allowed to do. Kubernetes supports multiple authorizers: Node, ABAC, RBAC, and Webhook. In practice, nearly every cluster uses RBAC. The API server consults each authorizer in order and stops at the first explicit decision: an "allow" admits the request, a "deny" rejects it immediately, and if every authorizer abstains, the request is denied by default.

Admission control happens in two phases. Mutating admission webhooks and built-in plugins run first — they can modify the object (e.g., injecting sidecar containers, setting default resource requests). Then validating admission webhooks run — they can accept or reject but cannot modify. This two-phase design means validators always see the final, mutated object.
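The two-phase ordering can be sketched in a few lines of Python. This is a toy model with invented plugin functions, not the real handler chain, but it shows why validators always see the final object:

```python
# Toy model of the admission phases (plugin functions are invented).
# Mutators run first and may rewrite the object; validators then see the
# final, mutated object and can only accept or reject.

def mutate_defaults(obj):
    # e.g. a mutating plugin injecting a default CPU request
    obj.setdefault("resources", {"requests": {"cpu": "100m"}})
    return obj

def validate_has_image(obj):
    # e.g. a validating plugin rejecting Pods without an image
    if "image" not in obj:
        raise ValueError("admission denied: missing image")

def admit(obj, mutators, validators):
    for m in mutators:       # phase 1: mutating admission
        obj = m(obj)
    for v in validators:     # phase 2: validating admission
        v(obj)
    return obj               # this is what gets persisted to etcd

pod = {"image": "nginx:1.25"}
print(admit(pod, [mutate_defaults], [validate_has_image]))
# -> {'image': 'nginx:1.25', 'resources': {'requests': {'cpu': '100m'}}}
```

Because validation runs last, a validating webhook can safely enforce policy on fields that a mutating webhook may have just injected.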

bash
# Check which admission plugins are enabled on your API server
kubectl get pod kube-apiserver-controlplane -n kube-system \
  -o jsonpath='{.spec.containers[0].command}' | tr ',' '\n' | grep enable-admission

# List all registered mutating and validating webhook configurations
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

API Groups and Versioning

Kubernetes organizes its API into groups and versions. The core group (Pods, Services, ConfigMaps) lives at /api/v1. Named groups live under /apis/<group>/<version> — for example, /apis/apps/v1 for Deployments, or /apis/batch/v1 for Jobs. Every resource has a stability level indicated by its version: v1 (GA), v1beta1 (beta), or v1alpha1 (alpha).

bash
# Discover all API groups and their preferred versions
kubectl api-versions

# List all resources in the 'apps' group
kubectl api-resources --api-group=apps

# Make a raw API call — bypasses kubectl abstractions
kubectl get --raw /apis/apps/v1/namespaces/default/deployments | jq '.items[].metadata.name'

# Check API server health endpoints
kubectl get --raw /healthz
kubectl get --raw /readyz

Watch Mechanism

The API server supports long-lived watch connections. Instead of polling, clients (schedulers, controllers, kubelets) open a watch stream and receive a real-time event feed of changes. This is how Kubernetes achieves near-instant reaction to state changes — and why the API server is the central nervous system of the cluster.
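The difference from polling can be illustrated with a toy event log. The resourceVersion semantics here are heavily simplified (real clients use list-then-watch via client-go informers), but the resume-from-a-version idea is the core of it:

```python
# Toy event log: every state change carries a monotonically increasing
# resourceVersion, and a watch resumes from the last version the client saw
# (simplified; real clients use informers with list-then-watch and bookmarks).

events = [
    (1, "ADDED",    "pod/api-1"),
    (2, "MODIFIED", "pod/api-1"),
    (3, "DELETED",  "pod/api-1"),
]

def watch(since_rv):
    """Yield only events newer than the given resourceVersion."""
    for rv, kind, obj in events:
        if rv > since_rv:
            yield rv, kind, obj

# A controller that last saw revision 1 catches up without re-listing everything:
for rv, kind, obj in watch(since_rv=1):
    print(rv, kind, obj)
# -> 2 MODIFIED pod/api-1
# -> 3 DELETED pod/api-1
```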

etcd — The Cluster's Source of Truth

etcd is a distributed, strongly consistent key-value store. Every Kubernetes object — every Pod, Service, Secret, ConfigMap, and CustomResource — is serialized (typically as protobuf) and stored in etcd. If etcd is lost and unrecoverable, the cluster's desired state is gone. This makes etcd the single most critical piece of infrastructure in any Kubernetes deployment.

Data Model and Key Structure

Kubernetes stores objects in etcd under a predictable key hierarchy. The default prefix is /registry. A Deployment named nginx in the default namespace is stored at /registry/deployments/default/nginx. Understanding this structure is useful when inspecting etcd directly for disaster recovery or debugging.

bash
# List top-level keys in etcd (run on a control plane node)
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only --limit=20 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Look up a specific object
ETCDCTL_API=3 etcdctl get /registry/deployments/default/nginx \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Raft Consensus

etcd uses the Raft consensus algorithm to replicate data across its cluster members. One member is elected the leader; all writes go through the leader and are replicated to followers. A write is committed only when a majority (quorum) of members acknowledge it. This means a 3-node etcd cluster tolerates 1 failure, and a 5-node cluster tolerates 2 failures.

mermaid
flowchart LR
    C["API Server\n(Client)"] -->|"Write request"| L["etcd Leader"]
    L -->|"Replicate log entry"| F1["etcd Follower 1"]
    L -->|"Replicate log entry"| F2["etcd Follower 2"]
    F1 -->|"Acknowledge"| L
    F2 -->|"Acknowledge"| L
    L -->|"Commit (quorum reached)\nReturn success"| C

    style C fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
    style L fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    style F1 fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    style F2 fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    
| etcd Cluster Size | Quorum Required | Fault Tolerance | Recommendation |
| --- | --- | --- | --- |
| 1 | 1 | 0 failures | Development only |
| 3 | 2 | 1 failure | Standard production |
| 5 | 3 | 2 failures | High-availability production |
| 7 | 4 | 3 failures | Rarely needed — added latency |
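The quorum arithmetic behind this table is plain majority math, floor(n/2) + 1, sketched here:

```python
# Majority quorum: a write commits once floor(n/2) + 1 members acknowledge it,
# so a cluster of n members survives n - quorum failures.

def quorum(members: int) -> int:
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    return members - quorum(members)

for n in (1, 3, 5, 7):
    print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# -> 1 members: quorum 1, tolerates 0 failure(s)
# -> 3 members: quorum 2, tolerates 1 failure(s)
# -> 5 members: quorum 3, tolerates 2 failure(s)
# -> 7 members: quorum 4, tolerates 3 failure(s)
```

Note that an even-sized cluster buys nothing: 4 members need a quorum of 3 and still tolerate only 1 failure, which is why etcd clusters are always odd-sized.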

Compaction, Defragmentation, and Backup

etcd keeps a history of all key revisions to support the watch mechanism. Over time, this history grows and consumes disk space. Compaction discards revisions older than a given point, while defragmentation reclaims the freed disk space. Kubernetes runs automatic compaction (default: every 5 minutes, retaining roughly the last 5 minutes of history), but defragmentation must be triggered manually or via a cron job.

bash
# Check etcd cluster health and member list
ETCDCTL_API=3 etcdctl endpoint health --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

ETCDCTL_API=3 etcdctl member list --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check current database size and revision
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
bash
# Snapshot backup — THE most important operational task for etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is valid
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20240115.db \
  --write-out=table

# Restore from snapshot (disaster recovery)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240115.db \
  --data-dir=/var/lib/etcd-restored \
  --name=controlplane \
  --initial-cluster=controlplane=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380

etcd Backup Is Non-Negotiable

If you run a production Kubernetes cluster and you are not backing up etcd on a regular schedule (at minimum daily, ideally hourly), you are one disk failure away from losing your entire cluster state. Automate it. Verify restores periodically. Store backups off-cluster and encrypted.

kube-scheduler — Deciding Where Pods Run

When you create a Pod (directly or via a Deployment), it initially has no spec.nodeName. The scheduler watches for these unscheduled Pods, evaluates every available node against a series of constraints and preferences, and then binds the Pod to the best node by writing the node name back to the API server. The kubelet on that node then picks up the Pod and starts its containers.

The Two-Phase Scheduling Cycle

The scheduler operates in two distinct phases for each unscheduled Pod. First, it filters out nodes that cannot run the Pod. Then, it scores the remaining candidates to find the best fit. This filter-then-score approach keeps the algorithm efficient — scoring only runs on nodes that have already passed all hard constraints.
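A toy version of one filter-then-score pass makes the division of labor concrete. The plugin behavior here is illustrative only: a NodeResourcesFit-style filter and a LeastAllocated-style score, with invented node data:

```python
# Toy filter-then-score pass (a NodeResourcesFit-style filter followed by a
# LeastAllocated-style score; node data and the scoring formula are invented).

nodes = [
    {"name": "worker-1", "cpu_free": 2.0, "mem_free": 8.0},
    {"name": "worker-2", "cpu_free": 0.2, "mem_free": 1.0},
    {"name": "worker-3", "cpu_free": 4.0, "mem_free": 16.0},
]
pod = {"cpu_req": 0.5, "mem_req": 2.0}

# Filter phase: hard constraints, drop nodes that cannot fit the Pod's requests
feasible = [n for n in nodes
            if n["cpu_free"] >= pod["cpu_req"] and n["mem_free"] >= pod["mem_req"]]

# Score phase: soft preference, here preferring the node with the most CPU left
best = max(feasible, key=lambda n: n["cpu_free"] - pod["cpu_req"])
print(best["name"])  # -> worker-3 (worker-2 never reached the scoring phase)
```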

mermaid
flowchart LR
    P["Unscheduled Pod"] --> F["Filter Phase\n(Predicates)"]
    F -->|"Feasible nodes"| S["Score Phase\n(Priorities)"]
    S -->|"Highest score"| B["Bind to Node"]

    F -.- F1["NodeResourcesFit"]
    F -.- F2["NodeAffinity"]
    F -.- F3["TaintToleration"]
    F -.- F4["PodTopologySpread"]

    S -.- S1["LeastAllocated"]
    S -.- S2["BalancedAllocation"]
    S -.- S3["ImageLocality"]
    S -.- S4["NodeAffinityPriority"]

    style P fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
    style F fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    style S fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    style B fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
    
| Phase | Purpose | Key Plugins | Effect |
| --- | --- | --- | --- |
| Filter | Hard constraints — eliminate unsuitable nodes | NodeResourcesFit, NodeAffinity, TaintToleration, PodTopologySpread | Node is either feasible or not |
| Score | Soft preferences — rank the remaining candidates | LeastAllocated, BalancedAllocation, ImageLocality, InterPodAffinity | Each node gets a score 0–100 |

Resource-Based Scheduling

The scheduler uses requests (not limits) to determine if a node has enough capacity. If a Pod requests 500m CPU and 256Mi memory, the scheduler only considers nodes with at least that much allocatable capacity remaining. This is why setting accurate resource requests is critical — overestimate and you waste capacity, underestimate and Pods get scheduled onto overloaded nodes.

bash
# See allocatable resources vs current requests on each node
kubectl describe nodes | grep -A 5 "Allocated resources"

# More precise: get node capacity, allocatable, and current allocation
kubectl get nodes -o custom-columns=\
  NAME:.metadata.name,\
  CPU_CAP:.status.capacity.cpu,\
  MEM_CAP:.status.capacity.memory,\
  CPU_ALLOC:.status.allocatable.cpu,\
  MEM_ALLOC:.status.allocatable.memory

# Why was a Pod not scheduled? Check events.
kubectl describe pod <pod-name> | grep -A 10 "Events"

# Inspect scheduler decisions in the scheduler logs
kubectl logs -n kube-system kube-scheduler-controlplane --tail=50

Node Affinity and Anti-Affinity

Node affinity lets you constrain which nodes a Pod can be scheduled on based on node labels. It comes in two flavors: requiredDuringSchedulingIgnoredDuringExecution (hard requirement — filter phase) and preferredDuringSchedulingIgnoredDuringExecution (soft preference — scoring phase). The "IgnoredDuringExecution" suffix means the rule only applies at scheduling time; if a node's labels change later, already-running Pods are not evicted.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type
                operator: In
                values: ["nvidia-a100", "nvidia-v100"]
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]
  containers:
    - name: training
      image: ml-training:v2
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"

Taints and Tolerations

While affinity pulls Pods toward nodes, taints push Pods away. A taint on a node repels all Pods that do not have a matching toleration. This is how Kubernetes keeps user workloads off control plane nodes (they carry a node-role.kubernetes.io/control-plane:NoSchedule taint) and how you can dedicate nodes for specific purposes.

There are three taint effects: NoSchedule (hard — never schedule here without a toleration), PreferNoSchedule (soft — try to avoid), and NoExecute (hard — evict already-running Pods that do not tolerate the taint).

bash
# View taints on all nodes
kubectl get nodes -o custom-columns=\
  NAME:.metadata.name,\
  TAINTS:.spec.taints

# Add a taint to dedicate nodes for a team
kubectl taint nodes worker-3 team=ml:NoSchedule

# Remove a taint (note the trailing dash)
kubectl taint nodes worker-3 team=ml:NoSchedule-

# Check which Pods tolerate control-plane taints
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.tolerations[]? | .key == "node-role.kubernetes.io/control-plane") |
  "\(.metadata.namespace)/\(.metadata.name)"'
yaml
# Pod that tolerates the ML team taint
apiVersion: v1
kind: Pod
metadata:
  name: ml-job
spec:
  tolerations:
    - key: "team"
      operator: "Equal"
      value: "ml"
      effect: "NoSchedule"
  containers:
    - name: training
      image: pytorch:latest

kube-controller-manager — Relentless Reconciliation

The controller manager is a single binary that bundles dozens of independent control loops (controllers). Each controller watches a specific set of resources through the API server and continuously drives the actual state toward the desired state. This is the heart of the Kubernetes declarative model: you declare what you want, and controllers make it happen.

The Controller Pattern

Every controller follows the same fundamental loop: observe (watch the API server for changes), diff (compare desired state vs. actual state), and act (make API calls to close the gap). If a controller crashes and restarts, it simply re-reads the current state from the API server and picks up where it left off. No state is stored locally.
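The observe/diff/act loop can be sketched as a toy ReplicaSet-style reconciler. Plain lists and callbacks stand in for the API server; the point is that one idempotent pass closes the gap regardless of which direction it drifted:

```python
# Toy ReplicaSet-style reconciler: observe actual state, diff against desired,
# act to close the gap. Plain lists and callbacks stand in for the API server.

def reconcile(desired, actual_pods, create_pod, delete_pod):
    """One observe/diff/act pass; idempotent, so it is safe to rerun anytime."""
    diff = desired - len(actual_pods)
    if diff > 0:
        for _ in range(diff):            # too few replicas: create the missing Pods
            create_pod()
    elif diff < 0:
        for pod in actual_pods[diff:]:   # too many: delete the surplus Pods
            delete_pod(pod)

pods = ["pod-a"]
reconcile(3, list(pods), lambda: pods.append(f"pod-{len(pods)}"), pods.remove)
print(pods)  # -> ['pod-a', 'pod-1', 'pod-2']
```

Because the loop stores nothing between passes, a crash costs nothing: the next pass re-reads reality and converges again.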

mermaid
flowchart TB
    WATCH["Watch API Server\n(Informer Cache)"] --> DIFF["Compare Desired\nvs. Actual State"]
    DIFF -->|"Drift detected"| ACT["Take Action\n(Create / Update / Delete)"]
    ACT --> API["API Server"]
    API --> WATCH

    DIFF -->|"In sync"| WATCH

    style WATCH fill:#2d3a5c,stroke:#4a9eff,color:#e0e0e0
    style DIFF fill:#2d3a5c,stroke:#f5a623,color:#e0e0e0
    style ACT fill:#2d3a5c,stroke:#50c878,color:#e0e0e0
    style API fill:#2d3a5c,stroke:#c77dba,color:#e0e0e0
    

Key Controllers and What They Do

| Controller | Watches | Reconciles | What It Does |
| --- | --- | --- | --- |
| Deployment | Deployments | ReplicaSets | Creates/updates ReplicaSets during rollouts; manages rollout history, rollback, and scaling |
| ReplicaSet | ReplicaSets | Pods | Ensures the correct number of Pod replicas exist; creates or deletes Pods to match .spec.replicas |
| Node | Nodes | Node status, taints | Monitors node heartbeats; adds NoExecute taints to unreachable nodes; triggers Pod eviction |
| ServiceAccount | Namespaces | ServiceAccounts | Creates a default ServiceAccount in every new namespace |
| Namespace | Namespaces | All namespaced resources | Handles namespace deletion — garbage-collects all resources within a terminating namespace |
| Job | Jobs | Pods | Creates Pods to completion; tracks success/failure counts; handles parallelism and backoff |
| EndpointSlice | Services, Pods | EndpointSlices | Maintains the mapping of Service selectors to Pod IPs |
| GarbageCollector | Owner references | Orphaned objects | Cascading deletion — when you delete a Deployment, it deletes owned ReplicaSets and Pods |
bash
# Observe the controller pattern in action — scale a Deployment
# and watch the ReplicaSet controller respond
kubectl scale deployment/nginx --replicas=5
kubectl get events --watch --field-selector reason=SuccessfulCreate

# See the ownership chain: Deployment → ReplicaSet → Pod
kubectl get rs -o custom-columns=\
  NAME:.metadata.name,\
  OWNER:.metadata.ownerReferences[0].name,\
  DESIRED:.spec.replicas,\
  READY:.status.readyReplicas

# Inspect controller manager logs
kubectl logs -n kube-system kube-controller-manager-controlplane --tail=100

# Check which controllers are enabled
kubectl get pod kube-controller-manager-controlplane -n kube-system \
  -o jsonpath='{.spec.containers[0].command}' | tr ',' '\n' | grep controllers

Leader Election

In a highly available cluster with multiple control plane nodes, you have multiple instances of the controller manager and scheduler running. But only one instance of each should be actively reconciling at any time — otherwise you'd get duplicate actions. Kubernetes solves this with leader election.

Each component races to acquire a Lease object in the kube-system namespace. The winner becomes the active leader and periodically renews the lease. If the leader crashes, it stops renewing, and another instance acquires the lease within seconds. You can inspect the current leaders directly.
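A toy version of the acquire-or-renew logic shows why failover takes only seconds. A plain dict stands in for the Lease object and timestamps are floats; the real API is coordination.k8s.io/v1:

```python
# Toy Lease-based leader election: a dict stands in for the Lease object and
# timestamps are plain floats (the real API is coordination.k8s.io/v1).

LEASE_DURATION = 15.0  # seconds, mirrors leaseDurationSeconds

def try_acquire(lease, candidate, now):
    """Become (or stay) leader if the lease is free, expired, or already ours."""
    expired = now - lease.get("renewTime", float("-inf")) > LEASE_DURATION
    if lease.get("holderIdentity") is None or expired:
        lease["holderIdentity"] = candidate   # take over the lease
    if lease["holderIdentity"] == candidate:
        lease["renewTime"] = now              # the active leader keeps renewing
        return True
    return False                              # a live leader holds the lease

lease = {}
assert try_acquire(lease, "controlplane-1", now=0.0)      # wins the initial race
assert not try_acquire(lease, "controlplane-2", now=5.0)  # lease still live
assert try_acquire(lease, "controlplane-2", now=20.1)     # holder stopped renewing
print(lease["holderIdentity"])  # -> controlplane-2
```

In the real implementation the compare-and-swap happens server-side via the object's resourceVersion, so two candidates can never both win the same race.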

bash
# Check which node is the current leader for each component
kubectl get lease -n kube-system

# Detailed view — see holderIdentity and renewal time
kubectl get lease kube-controller-manager -n kube-system -o yaml
kubectl get lease kube-scheduler -n kube-system -o yaml

# Example output (holderIdentity shows the current leader):
# holderIdentity: controlplane-1_xxxxxxxx-xxxx
# leaseDurationSeconds: 15
# renewTime: "2024-01-15T10:23:45.000000Z"

Putting It All Together: The Life of a Deployment

To solidify how these components interact, let's trace what happens end-to-end when you run kubectl apply -f deployment.yaml for a new 3-replica Deployment.

Step 1: kubectl → API Server

kubectl serializes your YAML, sends a POST to /apis/apps/v1/namespaces/default/deployments. The API server authenticates (client cert), authorizes (RBAC — does your user have create on deployments?), runs mutating admission (e.g., injects default labels), validates the schema, runs validating admission, and persists the Deployment object to etcd.

Step 2: Deployment Controller → ReplicaSet

The Deployment controller (inside kube-controller-manager) receives a watch event: "new Deployment created." It creates a ReplicaSet with .spec.replicas: 3 and an owner reference pointing back to the Deployment. The ReplicaSet is written to the API server and persisted to etcd.

Step 3: ReplicaSet Controller → Pods

The ReplicaSet controller receives a watch event: "new ReplicaSet with 3 desired replicas, 0 current." It creates 3 Pod objects, each with no spec.nodeName and an owner reference to the ReplicaSet. These Pods are persisted to etcd via the API server.

Step 4: Scheduler → Bind Pods to Nodes

The scheduler receives watch events for 3 unscheduled Pods. For each Pod, it runs the filter phase (eliminating nodes without enough resources, wrong taints, etc.) and the score phase (preferring nodes with balanced allocation). It writes the selected spec.nodeName back to each Pod object via the API server.

Step 5: Kubelet → Container Runtime

The kubelet on each selected node receives a watch event: "Pod assigned to me." It pulls the container image (if not cached), creates the sandbox via the container runtime (containerd), starts the containers, sets up networking, and begins reporting Pod status back to the API server.

bash
# Watch the entire chain in real-time in a second terminal
kubectl get events --watch

# Then trigger a Deployment in your first terminal
kubectl create deployment nginx --image=nginx:1.25 --replicas=3

# You will see events flow through:
#   Deployment created
#   ReplicaSet created (by deployment-controller)
#   Pods created (by replicaset-controller)
#   Pods scheduled (by default-scheduler)
#   Containers pulling image, started (by kubelet)

Debugging Tip

When a Pod is stuck in Pending, the problem is usually between steps 3 and 4 — the scheduler cannot find a feasible node. Run kubectl describe pod <name> and look at the Events section. The scheduler reports exactly which filter plugins failed and on how many nodes (e.g., "0/3 nodes are available: 3 Insufficient cpu").

Node Components — Kubelet, Kube-Proxy, and Container Runtime

The control plane makes decisions, but it is the node components that do the actual work. Every worker node in a Kubernetes cluster runs three core components: the kubelet, which manages pod lifecycle; kube-proxy, which handles network routing for Services; and a container runtime, which pulls images and runs containers. Understanding how these three collaborate is essential for debugging node-level issues and optimizing cluster performance.

Each of these components operates independently but communicates through well-defined interfaces. The kubelet talks to the container runtime via the Container Runtime Interface (CRI). Kube-proxy watches the API server for Service and Endpoint changes, then programs kernel-level routing rules. The container runtime handles the low-level mechanics of creating and destroying containers. Together, they turn a bare Linux machine into a functioning Kubernetes node.

Worker Node Architecture

The following diagram shows how the three node components interact on a single worker node when traffic arrives for a Kubernetes Service and when the kubelet manages a pod's lifecycle.

mermaid
flowchart TB
    API["API Server\n(Control Plane)"]

    subgraph WorkerNode["Worker Node"]
        direction TB
        KL["kubelet"]
        KP["kube-proxy"]
        CR["Container Runtime\n(containerd / CRI-O)"]
        cAdvisor["cAdvisor\n(embedded)"]
        IPTABLES["iptables / IPVS\n(kernel)"]

        subgraph Pod1["Pod A"]
            C1["Container 1"]
            C2["Container 2"]
        end

        subgraph Pod2["Pod B"]
            C3["Container 3"]
        end
    end

    API -- "pod specs,\ndesired state" --> KL
    KL -- "status reports,\nnode heartbeat" --> API
    KL -- "CRI gRPC calls\n(RunPodSandbox,\nCreateContainer)" --> CR
    CR --> Pod1
    CR --> Pod2
    KL -- "metrics" --> cAdvisor
    cAdvisor -.-> Pod1
    cAdvisor -.-> Pod2
    API -- "Service &\nEndpointSlice watches" --> KP
    KP -- "programs rules" --> IPTABLES
    IPTABLES -- "DNAT to\npod IP:port" --> Pod1
    IPTABLES -- "DNAT to\npod IP:port" --> Pod2
    

Kubelet — The Node Agent

The kubelet is a daemon that runs on every node (including control plane nodes). It is the sole authority responsible for ensuring that the containers described in a Pod spec are running and healthy. The kubelet does not manage containers that were not created by Kubernetes — it only cares about Pods assigned to its node by the API server (or defined as static pods on disk).

Pod Lifecycle Management

When the scheduler assigns a Pod to a node, the kubelet picks up the assignment by watching the API server. It then drives the Pod through its lifecycle: pulling images, creating the pod sandbox (network namespace), starting init containers in sequence, starting app containers, running startup/liveness/readiness probes, and ultimately tearing everything down when the Pod is deleted. Each transition is reported back to the API server as a pod status update.

The kubelet's sync loop is event-driven: it reacts to watch events and probe results as they arrive, and additionally re-synchronizes every Pod at least once per sync period (default 1 minute, configurable via --sync-frequency). On each pass, it compares the desired state from the API server against the actual state on the node and reconciles any differences. If a container crashes, the kubelet restarts it according to the pod's restartPolicy with an exponential backoff that caps at 5 minutes.
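The crash-restart backoff can be sketched numerically, assuming the commonly documented 10-second base delay that doubles per restart up to the 5-minute cap (the kubelet also resets the backoff after a container runs cleanly for a while, which this sketch omits):

```python
# Back-of-envelope view of CrashLoopBackOff delays: a 10 s base delay that
# doubles per restart, capped at 300 s (reset behavior omitted for brevity).

def restart_delays(crashes, base=10.0, cap=300.0):
    return [min(base * 2 ** i, cap) for i in range(crashes)]

print(restart_delays(7))
# -> [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

This is why a pod stuck in CrashLoopBackOff settles into one restart attempt every five minutes.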

You can inspect the kubelet's view of what is running on a node by querying its read-only status port or using kubectl:

bash
# List all pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-1

# Check kubelet health on the node itself
systemctl status kubelet

# View kubelet logs for pod lifecycle events
journalctl -u kubelet --no-pager --since "10 minutes ago" | grep -i "SyncLoop"

# Inspect the kubelet's configuration
kubectl get --raw "/api/v1/nodes/worker-1/proxy/configz" | jq .

Static Pods

Static pods are managed directly by the kubelet on a specific node, without the API server scheduling them. The kubelet watches a directory on disk (default: /etc/kubernetes/manifests/) and automatically creates or destroys pods when YAML files are added to or removed from that directory. This is how kubeadm-based clusters run control plane components — etcd, kube-apiserver, kube-controller-manager, and kube-scheduler all run as static pods.

The kubelet creates a mirror pod in the API server for each static pod so that kubectl get pods can show them. Deleting the mirror pod through the API does not stop the static pod, however; the kubelet simply recreates the mirror object. To remove a static pod, delete its manifest file from disk.

bash
# Find the static pod manifest directory
ps aux | grep kubelet | grep -- "--pod-manifest-path"

# Or check the kubelet config file
cat /var/lib/kubelet/config.yaml | grep staticPodPath

# List static pod manifests (on a kubeadm control plane node)
ls -la /etc/kubernetes/manifests/
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

You can create your own static pods by placing a manifest in the static pod directory. The kubelet will pick it up within seconds:

yaml
# /etc/kubernetes/manifests/debug-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-static
  namespace: default
spec:
  containers:
  - name: debug
    image: busybox:1.36
    command: ["sleep", "3600"]

CRI — Container Runtime Interface

The kubelet does not start containers directly. Instead, it communicates with the container runtime through the Container Runtime Interface (CRI), a gRPC-based API defined as a set of protobuf services. CRI has two services: the RuntimeService (for managing pod sandboxes and containers) and the ImageService (for pulling, listing, and removing images).

This abstraction is what allows Kubernetes to support multiple container runtimes interchangeably. The kubelet connects to the CRI endpoint via a Unix socket — typically /run/containerd/containerd.sock for containerd or /var/run/crio/crio.sock for CRI-O.

bash
# Check which CRI endpoint the kubelet is using
ps aux | grep kubelet | grep -- "container-runtime-endpoint"

# Use crictl to interact with the runtime directly (works with any CRI runtime)
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info

# List running containers through CRI
crictl ps

# List pod sandboxes
crictl pods

# Pull an image through CRI
crictl pull nginx:1.25

# Inspect a specific container
crictl inspect <CONTAINER_ID>

Note

crictl is the standard CLI for debugging CRI-compatible runtimes. It ships with most Kubernetes distributions and uses the same CRI gRPC calls as the kubelet. Configure its default endpoint in /etc/crictl.yaml so you don't have to pass --runtime-endpoint every time.

cAdvisor Metrics and Resource Monitoring

The kubelet embeds cAdvisor (Container Advisor), which collects CPU, memory, filesystem, and network usage metrics for every running container. These metrics serve two critical purposes: they power the kubectl top command (via the Metrics Server, which scrapes the kubelet's /metrics/resource endpoint), and they inform the kubelet's own eviction decisions when a node runs low on resources.

bash
# View resource usage per pod on a node (requires Metrics Server)
kubectl top pods -n default

# View resource usage per node
kubectl top nodes

# Query the kubelet's metrics endpoint directly (from the node)
curl -sk https://localhost:10250/metrics/resource \
  --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
  --key /var/lib/kubelet/pki/kubelet-client-current.pem | head -30

# Check kubelet's summary API for detailed per-container stats
kubectl get --raw "/api/v1/nodes/worker-1/proxy/stats/summary" | jq '.pods[0]'

Garbage Collection

Over time, terminated containers and unused images accumulate on each node, consuming disk space. The kubelet runs garbage collection for both. Container garbage collection removes dead containers based on three settings: MaxPerPodContainer (default 1 — keep the last terminated container per pod), MaxContainers (total dead containers on the node), and a minimum age threshold. Image garbage collection triggers based on disk usage thresholds.

yaml
# Key kubelet config fields for garbage collection (/var/lib/kubelet/config.yaml)
imageMinimumGCAge: 2m           # Don't GC images younger than 2 minutes
imageGCHighThresholdPercent: 85 # Start removing images when disk hits 85%
imageGCLowThresholdPercent: 80  # Stop removing images when disk drops to 80%
evictionHard:
  imagefs.available: "15%"      # Evict pods if image filesystem < 15% free
  memory.available: "100Mi"     # Evict pods if available memory < 100Mi
  nodefs.available: "10%"       # Evict pods if root filesystem < 10% free
bash
# Check current image disk usage on a node
crictl images | tail -5
crictl imagefsinfo

# Manually trigger image cleanup (remove unused images)
crictl rmi --prune

# See if the node is under disk pressure (triggers eviction)
kubectl describe node worker-1 | grep -A 5 "Conditions"
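The high/low threshold pair implements hysteresis: cleanup starts only above the high-water mark and continues until usage falls below the low-water mark, which prevents the collector from thrashing around a single threshold. A toy sketch of the logic, using the defaults from the config above:

```python
# Toy hysteresis loop matching imageGCHighThresholdPercent=85 and
# imageGCLowThresholdPercent=80 (sizes are percents of disk, oldest first).

def gc_images(usage_pct, image_sizes_pct, high=85, low=80):
    """If usage exceeds `high`, remove images until it drops to `low`."""
    removed = []
    if usage_pct <= high:
        return removed, usage_pct            # below the trigger: nothing to do
    for i, size in enumerate(image_sizes_pct):
        if usage_pct <= low:
            break
        usage_pct -= size                    # "delete" the next-oldest image
        removed.append(i)
    return removed, usage_pct

print(gc_images(90, [3, 3, 3, 3]))  # -> ([0, 1, 2, 3], 78)
print(gc_images(82, [3, 3, 3, 3]))  # -> ([], 82)  (82% is under the trigger)
```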

Kube-Proxy — Service Networking at the Kernel Level

When you create a Kubernetes Service, it gets a stable virtual IP address (ClusterIP) that doesn't map to any network interface. Kube-proxy is the component that makes this virtual IP actually work. It runs on every node, watches the API server for Service and EndpointSlice objects, and programs the node's kernel networking stack to translate Service IPs into real pod IPs using destination NAT (DNAT).

Kube-proxy does not proxy traffic itself in the data path (despite its name). It is a control plane component for node networking — it configures the rules and then steps aside. All actual packet forwarding happens in the Linux kernel.

How Service Routing Works

mermaid
flowchart LR
    Client["Client Pod\n10.244.1.5"]
    SVC["Service ClusterIP\n10.96.0.100:80"]
    KERNEL["Linux Kernel\n(iptables / IPVS)"]
    EP1["Pod A\n10.244.2.10:8080"]
    EP2["Pod B\n10.244.3.22:8080"]
    EP3["Pod C\n10.244.1.18:8080"]

    WATCH["kube-proxy\n(watches API server)"]
    APISVR["API Server"]

    Client -- "dst: 10.96.0.100:80" --> KERNEL
    KERNEL -- "DNAT to\n10.244.2.10:8080" --> EP1
    KERNEL -. "or" .-> EP2
    KERNEL -. "or" .-> EP3
    APISVR -- "Service &\nEndpointSlice changes" --> WATCH
    WATCH -- "programs\nrouting rules" --> KERNEL
    

The flow works like this: a client pod sends a packet to the Service ClusterIP. The kernel intercepts the packet before it leaves the node (using the PREROUTING or OUTPUT chains) and rewrites the destination address to one of the backing pod IPs. On the way back, connection tracking (conntrack) reverses the translation, rewriting the source address to the Service IP, so the client never knows which pod it hit.
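In iptables mode, equal distribution comes from a cascade of statistic-match rules: the rule for endpoint i matches with probability 1/(n-i), which works out to a uniform 1/n per endpoint overall. A toy simulation of that cascade (an illustration only; kube-proxy emits real iptables rules, not Python):

```python
# Toy simulation of the iptables statistic-match cascade: rule i matches with
# probability 1/(n-i), giving each endpoint a uniform 1/n share overall.
import random

def pick_endpoint(endpoints):
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if random.random() < 1.0 / (n - i):  # final rule matches unconditionally
            return ep

random.seed(42)
counts = {"pod-a": 0, "pod-b": 0, "pod-c": 0}
for _ in range(30000):
    counts[pick_endpoint(list(counts))] += 1
print(counts)  # each endpoint lands near 10000, i.e. one third of the traffic
```

The math checks out: the second endpoint is reached with probability (2/3) x (1/2) = 1/3, and the third with (2/3) x (1/2) x 1 = 1/3.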

Proxy Modes Compared

Kube-proxy supports three modes for programming these routing rules. The mode determines which kernel subsystem does the actual work. Each has meaningful trade-offs at scale.

| Feature | iptables (default) | IPVS | nftables |
| --- | --- | --- | --- |
| Kernel subsystem | netfilter / iptables | Linux Virtual Server (LVS) | nftables (netfilter successor) |
| Default since | Kubernetes 1.2 | Stable since 1.11 | Alpha in 1.29, beta in 1.31 |
| Rule complexity | O(n) — chain of rules | O(1) — hash table lookup | O(1) — optimized rulesets |
| Load balancing | Random (equal probability) | Round-robin, least-conn, weighted, and more | Random (equal probability) |
| Performance at scale | Degrades at >5,000 Services | Handles 10,000+ Services well | Better than iptables, comparable to IPVS |
| Rule update speed | Full chain rewrite on change | Incremental updates | Incremental, transactional updates |
| Session affinity | Yes (via recent module) | Yes (built-in persistence) | Yes |
| Best for | Small-to-medium clusters | Large clusters with many Services | Modern kernels, future default |

Tip

If you are running a cluster with more than a few thousand Services, switch to IPVS mode. The iptables mode rewrites the entire rule chain on every Service or Endpoint change, which causes noticeable latency spikes in large clusters. IPVS uses a kernel-level hash table and supports incremental updates.

Inspecting Kube-Proxy Rules

Understanding what kube-proxy has actually programmed into the kernel is one of the most valuable debugging skills for Service connectivity issues. The commands differ by proxy mode.

bash
# Check which proxy mode is active
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# --- iptables mode ---
# List all KUBE-SERVICES chains (one per ClusterIP:port)
iptables -t nat -L KUBE-SERVICES -n | head -20

# Trace rules for a specific Service ClusterIP
iptables -t nat -L -n -v | grep "10.96.0.100"

# Show the full DNAT chain for a service
# (Follow KUBE-SVC-xxx -> KUBE-SEP-xxx chains to see pod endpoints)
iptables -t nat -L KUBE-SVC-XXXXXX -n -v

# --- IPVS mode ---
# List all virtual servers (Service IPs) and their real servers (Pod IPs)
ipvsadm -Ln

# Show stats for a specific Service ClusterIP
ipvsadm -Ln -t 10.96.0.100:80

# --- nftables mode ---
# List nftables rules managed by kube-proxy
nft list table ip kube-proxy

Session Affinity

By default, kube-proxy distributes traffic across all healthy endpoints with no stickiness. If you need a client to consistently reach the same backend pod (for example, for in-memory sessions), you can enable session affinity on the Service. Kubernetes supports ClientIP affinity, which routes all requests from the same source IP to the same pod for a configurable timeout (default: 10,800 seconds / 3 hours).

yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 1800  # 30 minutes
bash
# Verify session affinity is set on a Service
kubectl describe svc web-app | grep -i "session"

# Test affinity — repeated requests should hit the same pod
for i in $(seq 1 5); do
  kubectl exec test-pod -- curl -s web-app.default.svc.cluster.local | grep "pod-name"
done

Kube-Proxy Logs and Debugging

bash
# kube-proxy usually runs as a DaemonSet — find the pod on your node
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# View kube-proxy logs
kubectl logs -n kube-system kube-proxy-abc12 --tail=50

# Check kube-proxy's current configuration
kubectl get configmap kube-proxy -n kube-system -o yaml

# Verify kube-proxy is correctly watching EndpointSlices
kubectl logs -n kube-system kube-proxy-abc12 | grep -i "endpoints"

# Check conntrack entries for active NAT translations
conntrack -L -d 10.96.0.100 2>/dev/null | head -10

Container Runtime — Where Containers Actually Run

The container runtime is the software that actually executes containers. It pulls images from registries, creates filesystem layers, sets up namespaces and cgroups, and starts the process inside the container. Kubernetes interacts with the runtime exclusively through the CRI specification, which means any runtime implementing the CRI gRPC API works with Kubernetes.

The CRI Specification

CRI defines two gRPC services. The RuntimeService handles the pod and container lifecycle — creating sandboxes, starting and stopping containers, executing commands, attaching to running containers, and port forwarding. The ImageService handles image operations — pulling, listing, inspecting, and removing images. This clean separation means the kubelet never needs to know the implementation details of how containers or images are managed under the hood.

| CRI Service | Key RPCs | Purpose |
| --- | --- | --- |
| RuntimeService | RunPodSandbox | Create the pod's network namespace and infrastructure container |
| | CreateContainer | Create a container within an existing pod sandbox |
| | StartContainer | Start a previously created container |
| | StopContainer | Gracefully stop a running container |
| | RemoveContainer | Remove a stopped container |
| | ExecSync | Execute a command in a container synchronously |
| ImageService | PullImage | Pull an image from a registry |
| | ListImages | List images on the node |
| | RemoveImage | Remove an image from the node |
| | ImageFsInfo | Report filesystem usage for image storage |

containerd — The Default Runtime

containerd is the default container runtime for most Kubernetes distributions, including kubeadm clusters, GKE, EKS, and AKS. It was originally extracted from Docker as a standalone daemon and donated to the CNCF. containerd handles the full container lifecycle: image transfer and storage, container execution (via runc as the low-level OCI runtime), snapshotting (filesystem layering), and network namespace management.

The architecture is layered: the kubelet calls containerd's CRI plugin over gRPC, containerd manages image and container metadata, and it delegates actual container creation to an OCI runtime (usually runc, but gVisor's runsc and Kata Containers' kata-runtime are also supported).

bash
# Check containerd status
systemctl status containerd

# View containerd version and runtime info
ctr version

# Use the containerd-specific CLI to list containers (in the k8s.io namespace)
ctr -n k8s.io containers list

# List images managed by containerd for Kubernetes
ctr -n k8s.io images list | head -10

# View the containerd configuration
cat /etc/containerd/config.toml | grep -A 5 "plugins.*cri"

# Check which OCI runtime containerd is using
cat /etc/containerd/config.toml | grep -A 3 "runtimes.runc"

CRI-O — The Kubernetes-Native Alternative

CRI-O is a lightweight container runtime built specifically for Kubernetes. Unlike containerd, which serves a broader ecosystem (it also backs Docker, among other consumers), CRI-O implements only the CRI interface and nothing more. It follows a strict version-locking policy with Kubernetes: CRI-O 1.29.x supports Kubernetes 1.29.x, CRI-O 1.30.x supports Kubernetes 1.30.x, and so on. This makes it a popular choice in OpenShift clusters and environments that want the thinnest possible runtime layer.

bash
# Check CRI-O status
systemctl status crio

# View CRI-O version and config
crio --version
crio config --default | head -30

# On a CRI-O node, crictl works exactly the same way
crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps
crictl --runtime-endpoint unix:///var/run/crio/crio.sock images

The Dockershim Deprecation

Before Kubernetes 1.24, the kubelet could talk to Docker Engine via a built-in adapter called dockershim. This adapter translated CRI calls into Docker API calls. The chain was wasteful: kubelet → dockershim → Docker Engine → containerd → runc. Kubernetes 1.24 removed dockershim entirely. If your cluster previously used Docker as its runtime, it was migrated to containerd (which Docker itself used internally). Your container images remain 100% compatible because both Docker and Kubernetes build and run OCI-compliant images.

Image Pulling and Caching

When a pod is scheduled to a node, the kubelet instructs the container runtime to pull any images not already present on the node. Images are stored in a content-addressable store and shared across containers — if two pods use the same image, it is stored only once. The pull behavior is controlled by the pod spec's imagePullPolicy.

| imagePullPolicy | Behavior | When to use |
| --- | --- | --- |
| Always | Always contacts the registry (uses cached layers if digest matches) | Tags that can change, like latest or dev |
| IfNotPresent | Pulls only if the image is not on the node | Immutable tags like v1.2.3 (default for tagged images) |
| Never | Never pulls — fails if image is missing | Pre-loaded images, air-gapped environments |
bash
# List images cached on the node
crictl images

# Check the disk space used by images
crictl imagefsinfo

# Pre-pull an image to avoid cold-start latency
crictl pull gcr.io/my-project/my-app:v2.1.0

# Debug image pull failures — check kubelet logs
journalctl -u kubelet --no-pager --since "5 minutes ago" | grep -i "pull"

# Check events on a pod stuck in ImagePullBackOff
kubectl describe pod my-pod | grep -A 10 "Events"
Warning

Avoid using the :latest tag in production. With imagePullPolicy: Always (the default for :latest), every pod start contacts the registry, adding latency and creating a hard dependency on registry availability. With IfNotPresent, different nodes may run different versions of :latest. Always use immutable, versioned tags like v1.4.2 or full digest references like sha256:abc123....
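For example, a digest-pinned container spec might look like this (the registry path and digest below are illustrative placeholders):

```yaml
# Digest-pinned image — the registry path and digest here are made up
spec:
  containers:
  - name: api
    image: registry.example.com/api@sha256:9b2e6c...   # content-addressed, immutable
    imagePullPolicy: IfNotPresent
```

Unlike a tag, a digest reference can never be re-pointed at different content, so every node that pulls it runs a byte-identical image.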

Putting It All Together — Node Status and Debugging

The kubelet reports the overall health of all three components back to the API server as node conditions. When any component is unhealthy, it surfaces as a condition change on the node object. Understanding these conditions is the starting point for diagnosing node-level problems.

bash
# Full node inspection — allocatable resources, conditions, images, and running pods
kubectl describe node worker-1

# Check node conditions specifically
kubectl get node worker-1 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'

# Verify the container runtime the node is using
kubectl get node worker-1 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
# Example output: containerd://1.7.11

# Comprehensive node debug — creates a privileged debug pod on the node
kubectl debug node/worker-1 -it --image=ubuntu:22.04
# Once inside the debug pod:
#   chroot /host
#   systemctl status kubelet
#   systemctl status containerd
#   journalctl -u kubelet --since "30 minutes ago" | tail -50

| Node Condition | Healthy Value | What It Means When Unhealthy |
| --- | --- | --- |
| Ready | True | Kubelet cannot communicate with the API server or the container runtime is down |
| MemoryPressure | False | Node memory is running low — kubelet will start evicting pods |
| DiskPressure | False | Root or image filesystem is nearly full — image GC and pod eviction triggered |
| PIDPressure | False | Too many processes on the node — kubelet refuses new pods |
| NetworkUnavailable | False | CNI plugin has not configured networking — pods cannot communicate |

With a solid understanding of these three components, you can trace any node-level issue from symptom to root cause. A pod stuck in ContainerCreating? Check the kubelet logs and the container runtime. A Service not reachable? Inspect kube-proxy's iptables or IPVS rules. An ImagePullBackOff? Look at the kubelet events and test the image pull with crictl. These components are the foundation — everything else in Kubernetes runs on top of them.

The Kubernetes API and kubectl Essentials

Every interaction with a Kubernetes cluster — whether you type a kubectl command, deploy from CI/CD, or a controller reconciles state — goes through the Kubernetes API server. The API is the single source of truth. Understanding its structure is not optional background knowledge; it is the foundation for everything else you will do with Kubernetes.

kubectl is the primary CLI client for this API. It translates your commands into HTTP requests against the API server, then formats the responses for your terminal. This section covers the API's structure, how to explore it, and the essential kubectl commands organized by real-world workflow.

The Kubernetes API Structure

The Kubernetes API is a RESTful HTTP API organized into API groups. Each group contains a set of related resources, and each resource has one or more supported versions. When you create a Deployment, you are making an HTTP POST to a specific URL path that encodes the group, version, and resource type.

API Groups

Resources are organized into groups to keep the API modular and allow independent versioning. The core group (also called the legacy group) has no explicit group name in its URL path — its resources live directly under /api/v1. All other groups are under /apis/<group-name>/<version>.

| API Group | URL Path Prefix | Key Resources | Purpose |
| --- | --- | --- | --- |
| core (legacy) | /api/v1 | Pod, Service, ConfigMap, Secret, Namespace, Node, PersistentVolume | Foundational cluster primitives |
| apps | /apis/apps/v1 | Deployment, StatefulSet, DaemonSet, ReplicaSet | Workload controllers |
| batch | /apis/batch/v1 | Job, CronJob | Run-to-completion and scheduled workloads |
| networking.k8s.io | /apis/networking.k8s.io/v1 | Ingress, NetworkPolicy, IngressClass | Network routing and access control |
| rbac.authorization.k8s.io | /apis/rbac.authorization.k8s.io/v1 | Role, ClusterRole, RoleBinding, ClusterRoleBinding | Access control |
| storage.k8s.io | /apis/storage.k8s.io/v1 | StorageClass, CSIDriver, VolumeAttachment | Storage provisioning |
| autoscaling | /apis/autoscaling/v2 | HorizontalPodAutoscaler | Automatic scaling |

The full API URL for a namespaced resource follows this pattern: /apis/{group}/{version}/namespaces/{namespace}/{resource}. For example, to list Deployments in the production namespace, the API server handles a GET request to /apis/apps/v1/namespaces/production/deployments.
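As a quick illustration, the path can be assembled from its parts (the names match the example above):

```shell
# Build the REST path for "list Deployments in namespace 'production'"
group=apps version=v1 namespace=production resource=deployments
printf '/apis/%s/%s/namespaces/%s/%s\n' "$group" "$version" "$namespace" "$resource"
# prints: /apis/apps/v1/namespaces/production/deployments
```

With kubectl proxy running, the same path is reachable directly: curl http://localhost:8001/apis/apps/v1/namespaces/production/deployments returns the same list that kubectl get deployments -n production does.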

Resource Versioning

Each API group version indicates its stability level. This is not just a label — it carries guarantees about backward compatibility and how long the version will be supported.

| Version Label | Stability | Meaning |
| --- | --- | --- |
| v1, v2 | GA (stable) | Fully supported, backward-compatible changes only. Safe for production. |
| v1beta1, v2beta1 | Beta | Enabled by default but may have breaking changes between releases. Feature is well-tested. |
| v1alpha1 | Alpha | Disabled by default. May be removed without notice. Never use in production. |

You specify the API version in every manifest's apiVersion field. For core group resources, you write apiVersion: v1. For named groups, you write apiVersion: group/version — for example, apiVersion: apps/v1 for a Deployment.
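The two forms side by side (any core-group and named-group resources work; Service and Deployment are shown here):

```yaml
# Core (legacy) group: no group name, just the version
apiVersion: v1
kind: Service
---
# Named group: group/version
apiVersion: apps/v1
kind: Deployment
```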

Exploring the API with kubectl

You do not need to memorize the full API table. Two commands let you explore it interactively. kubectl api-resources lists every resource type available in your cluster, and kubectl explain gives you inline documentation for any resource or field.

bash
# List all available resource types with their API group and kind
kubectl api-resources

# Filter to only namespaced resources
kubectl api-resources --namespaced=true

# Filter by API group
kubectl api-resources --api-group=apps

# Show supported API versions
kubectl api-versions

The output of api-resources shows each resource's short name (useful for saving keystrokes), the API group it belongs to, whether it is namespaced, and its Kind. For example, you will see that deploy is the short name for deployments in the apps group.

bash
# Get top-level documentation for a resource
kubectl explain deployment

# Drill into a specific field path
kubectl explain deployment.spec.strategy

# Show the full recursive structure
kubectl explain deployment.spec --recursive

# Specify an API version explicitly
kubectl explain cronjob --api-version=batch/v1
Tip

Use kubectl explain as your first reference instead of searching the web. It reflects the exact API version running on your cluster, which may differ from online documentation. Combine it with --recursive to see the full field hierarchy, then drill into specific paths for descriptions and types.

Kubeconfig, Contexts, and Namespace Management

Before you can talk to a cluster, kubectl needs to know which cluster, which user credentials, and which namespace to use by default. All of this is stored in the kubeconfig file, typically located at ~/.kube/config. The file has three main sections: clusters, users, and contexts.

Kubeconfig Structure

yaml
apiVersion: v1
kind: Config
current-context: dev-cluster

clusters:
- name: dev-cluster
  cluster:
    server: https://dev-k8s.example.com:6443
    certificate-authority-data: LS0tLS1CRUd...
- name: prod-cluster
  cluster:
    server: https://prod-k8s.example.com:6443
    certificate-authority-data: LS0tLS1CRUd...

users:
- name: dev-admin
  user:
    client-certificate-data: LS0tLS1CRUd...
    client-key-data: LS0tLS1CRUd...
- name: prod-reader
  user:
    token: eyJhbGciOiJSUzI1NiIs...

contexts:
- name: dev-cluster
  context:
    cluster: dev-cluster
    user: dev-admin
    namespace: default
- name: prod-readonly
  context:
    cluster: prod-cluster
    user: prod-reader
    namespace: monitoring

A context is a triple of (cluster, user, namespace). It gives you a named shortcut so you can switch between environments with a single command instead of specifying credentials and endpoints every time.

Context Switching

bash
# See all available contexts
kubectl config get-contexts

# Check which context is currently active
kubectl config current-context

# Switch to a different context
kubectl config use-context prod-readonly

# Set the default namespace for the current context
kubectl config set-context --current --namespace=kube-system

# View the full merged kubeconfig
kubectl config view

Namespace Management

Namespaces provide scope for resource names and are the primary boundary for access control policies and resource quotas. Every namespaced command defaults to the namespace set in your current context (or default if none is set). You can override this on any command with the -n flag.

bash
# List all namespaces
kubectl get namespaces

# Run a command in a specific namespace
kubectl get pods -n kube-system

# Query across ALL namespaces
kubectl get pods --all-namespaces
kubectl get pods -A   # shorthand

# Create a new namespace
kubectl create namespace staging

Essential kubectl Commands by Workflow

Rather than listing commands alphabetically, let's organize them by the workflow stages you will actually use every day: create, inspect, update, delete, and debug.

Creating Resources

There are two primary ways to create resources: kubectl apply (declarative) and kubectl create (imperative). The declarative approach with apply is strongly preferred for anything beyond quick experiments, because it tracks the desired state and enables repeatable updates. The next section on the declarative model covers the "why" in depth.

bash
# Declarative: apply a manifest file (create or update)
kubectl apply -f deployment.yaml

# Apply an entire directory of manifests
kubectl apply -f ./k8s/

# Apply from a URL
kubectl apply -f https://raw.githubusercontent.com/org/repo/main/manifests/app.yaml

# Dry-run to validate without creating anything
kubectl apply -f deployment.yaml --dry-run=client
kubectl apply -f deployment.yaml --dry-run=server  # server-side validation

# Imperative: create resources directly (errors if resource already exists)
kubectl create deployment nginx --image=nginx:1.27 --replicas=3
kubectl create service clusterip nginx --tcp=80:80
kubectl create configmap app-config --from-file=config.properties
kubectl create secret generic db-creds --from-literal=password=s3cret
apply vs create

kubectl create fails if the resource already exists. kubectl apply creates the resource if it does not exist and updates it if it does. In production workflows, apply is almost always what you want because it is idempotent — you can run it repeatedly with the same result.

Inspecting Resources

Inspection is where you will spend most of your kubectl time. The core commands are get (list resources), describe (detailed view with events), and logs (container output). Each serves a different purpose in your investigation workflow.

bash
# List pods with status information
kubectl get pods
kubectl get pods -o wide            # includes Node, IP, and nominated node
kubectl get pods --show-labels       # append a LABELS column showing each pod's labels
kubectl get pods -l app=frontend     # filter by label selector
kubectl get pods --field-selector=status.phase=Running

# List multiple resource types at once
kubectl get pods,services,deployments

# Describe gives you the full picture: spec, status, conditions, and events
kubectl describe pod nginx-7d6b8f5c9-x2k4m
kubectl describe deployment frontend
kubectl describe node worker-01

# Container logs
kubectl logs nginx-7d6b8f5c9-x2k4m
kubectl logs nginx-7d6b8f5c9-x2k4m -c sidecar   # specific container
kubectl logs nginx-7d6b8f5c9-x2k4m --previous     # logs from crashed container
kubectl logs -f nginx-7d6b8f5c9-x2k4m              # follow (stream) logs
kubectl logs -l app=frontend --all-containers       # logs from all matching pods

# Execute commands inside a running container
kubectl exec nginx-7d6b8f5c9-x2k4m -- ls /etc/nginx
kubectl exec -it nginx-7d6b8f5c9-x2k4m -- /bin/sh  # interactive shell

The difference between get and describe is critical. get gives you a tabular summary — perfect for scanning across many resources. describe gives you the full story of a single resource, including the Events section at the bottom, which is often the first place you find the reason for a failure (image pull errors, insufficient resources, failed health checks).

Updating Resources

For updates, the declarative approach is to modify your YAML file and run kubectl apply again. But kubectl also provides imperative update commands that are useful for quick changes during development or incident response.

bash
# Edit a resource in your default editor (opens YAML in $EDITOR)
kubectl edit deployment frontend

# Patch a resource with a JSON merge patch
kubectl patch deployment frontend \
  -p '{"spec":{"replicas":5}}'

# Strategic merge patch (default) — merges arrays intelligently
kubectl patch deployment frontend --type=strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","image":"app:v2.1"}]}}}}'

# JSON patch — precise array operations
kubectl patch deployment frontend --type=json \
  -p '[{"op":"replace","path":"/spec/replicas","value":5}]'

# Quick replica scaling
kubectl scale deployment frontend --replicas=10

# Update the container image directly
kubectl set image deployment/frontend app=myapp:v2.1

# Add or update labels and annotations
kubectl label pods nginx-7d6b8f5c9-x2k4m env=staging
kubectl annotate deployment frontend team=platform

Deleting Resources

Deletion can be targeted or broad. By default, kubectl delete waits for the resource to be fully terminated (which respects graceful shutdown periods). Use --wait=false to return immediately.

bash
# Delete a specific resource
kubectl delete pod nginx-7d6b8f5c9-x2k4m
kubectl delete deployment frontend

# Delete using the same manifest file used to create it
kubectl delete -f deployment.yaml

# Delete by label selector
kubectl delete pods -l app=frontend

# Delete all resources of a type in a namespace
kubectl delete pods --all -n staging

# Force-delete a stuck pod (skips graceful shutdown)
kubectl delete pod stuck-pod --grace-period=0 --force

# Delete a namespace and ALL resources within it
kubectl delete namespace staging
Warning

Deleting a Pod directly does not prevent it from being recreated. If the Pod is managed by a Deployment or ReplicaSet, the controller will immediately create a replacement. To remove Pods permanently, delete the controller (Deployment, StatefulSet, etc.) that owns them.

Debugging

When things go wrong — and they will — kubectl provides tools to dig deeper without modifying production workloads. Port-forwarding and proxying let you reach cluster-internal endpoints from your local machine. The debug command lets you attach diagnostic containers to running pods.

bash
# Forward a local port to a pod (access pod's port 8080 at localhost:8080)
kubectl port-forward pod/frontend-5d7b8c9f-k2m4x 8080:8080

# Forward to a service (kubectl picks a backing pod)
kubectl port-forward svc/frontend 8080:80

# Start a proxy to the entire API server (accessible at localhost:8001)
kubectl proxy

# Attach a debug container to a running pod (ephemeral container)
kubectl debug -it frontend-5d7b8c9f-k2m4x --image=busybox --target=app

# Create a copy of a pod with a debug container for troubleshooting
kubectl debug frontend-5d7b8c9f-k2m4x -it --copy-to=debug-pod --image=ubuntu

# Debug a node directly (creates a privileged pod on the node)
kubectl debug node/worker-01 -it --image=ubuntu

Output Formatting

The default table output is fine for quick glances, but real-world workflows often require structured data — piping to jq, feeding into scripts, or extracting a single value for a CI variable. The -o flag controls the output format.

Standard Formats

bash
# Full YAML output — great for seeing the complete resource spec
kubectl get deployment frontend -o yaml

# Full JSON output — ideal for piping to jq
kubectl get pods -o json | jq '.items[].metadata.name'

# Wide table with extra columns (Node, IP, etc.)
kubectl get pods -o wide

# Just the resource names (useful for scripting)
kubectl get pods -o name

JSONPath — Extracting Specific Fields

JSONPath expressions let you extract exactly the fields you need without an external JSON processor. The syntax uses curly braces around the path expression, starting from the root object.

bash
# Get the IP address of a specific pod
kubectl get pod nginx-7d6b8f5c9-x2k4m \
  -o jsonpath='{.status.podIP}'

# List all container images across all pods
kubectl get pods \
  -o jsonpath='{.items[*].spec.containers[*].image}'

# Format with newlines for readability
kubectl get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

# Get the node port of a NodePort service
kubectl get svc frontend \
  -o jsonpath='{.spec.ports[0].nodePort}'

# Extract a secret value (base64-encoded)
kubectl get secret db-creds \
  -o jsonpath='{.data.password}' | base64 -d

Custom Columns — Readable Tables from Arbitrary Fields

When you want a table format but with different columns than the default, custom-columns lets you define exactly what to show. Each column has a header and a JSONPath expression.

bash
# Custom columns: pod name, node, status, and restart count
kubectl get pods -o custom-columns=\
'NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount'

# Deployments with image and replica info
kubectl get deployments -o custom-columns=\
'NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas'

Real-World Workflow Patterns

Individual commands are building blocks. In practice, you chain them together to accomplish real tasks. Here are patterns you will use regularly.

Investigating a Failing Deployment

When a deployment is not reaching its desired replica count, work through the resource hierarchy: Deployment → ReplicaSet → Pod → Container logs → Events.

bash
# Step 1: Check the deployment status
kubectl get deployment frontend
kubectl describe deployment frontend | tail -20

# Step 2: Check the ReplicaSet it created
kubectl get replicasets -l app=frontend

# Step 3: Find pods that are not Ready
kubectl get pods -l app=frontend --field-selector=status.phase!=Running

# Step 4: Describe the failing pod for events and conditions
kubectl describe pod frontend-5d7b8c9f-crash1

# Step 5: Check container logs (including previous crashed container)
kubectl logs frontend-5d7b8c9f-crash1 --previous

# Step 6: Check cluster-wide events sorted by time
kubectl get events --sort-by='.lastTimestamp' -n default | tail -20

Quick Rollback After a Bad Deploy

bash
# View rollout history
kubectl rollout history deployment/frontend

# Check the current rollout status
kubectl rollout status deployment/frontend

# Roll back to the previous version
kubectl rollout undo deployment/frontend

# Roll back to a specific revision
kubectl rollout undo deployment/frontend --to-revision=3

# Pause a rollout (prevent further changes from progressing)
kubectl rollout pause deployment/frontend

# Resume a paused rollout
kubectl rollout resume deployment/frontend

Generating Manifests Without Memorizing YAML

One of the most practical kubectl tricks: use imperative commands with --dry-run=client -o yaml to generate manifest templates. This saves you from writing YAML from scratch and ensures the basic structure is correct.

bash
# Generate a Deployment manifest
kubectl create deployment web --image=nginx:1.27 --replicas=3 \
  --dry-run=client -o yaml > deployment.yaml

# Generate a Service manifest
kubectl create service clusterip web --tcp=80:8080 \
  --dry-run=client -o yaml > service.yaml

# Generate a Job manifest (kubectl flags must come before the -- command separator)
kubectl create job backup --image=postgres:16 \
  --dry-run=client -o yaml \
  -- pg_dump -h db myapp > job.yaml

# Generate a CronJob manifest
kubectl create cronjob cleanup --image=busybox \
  --schedule="0 2 * * *" --dry-run=client -o yaml \
  -- /bin/sh -c "echo cleaning up" > cronjob.yaml

Working Across Multiple Resources

bash
# Get a comprehensive view of everything in a namespace
kubectl get all -n production

# Watch resources in real time (updates as state changes)
kubectl get pods -w

# Get top resource consumers (requires metrics-server)
kubectl top pods --sort-by=memory
kubectl top nodes

# Diff local changes against the live cluster state before applying
kubectl diff -f deployment.yaml

# Apply changes only when the diff looks right
kubectl diff -f deployment.yaml && kubectl apply -f deployment.yaml

# Bulk operations: restart all pods in a deployment (rolling restart)
kubectl rollout restart deployment/frontend

# Copy files to/from a container
kubectl cp ./local-file.txt frontend-pod:/tmp/remote-file.txt
kubectl cp frontend-pod:/var/log/app.log ./app.log
Tip

Use kubectl diff -f before every kubectl apply in production. It shows you exactly what will change, just like git diff before a commit. This habit prevents surprises — especially when multiple people manage the same cluster.

Quick Reference: Common Short Names and Flags

Typing full resource names gets tedious. Kubernetes provides short names for frequently used resources, and kubectl supports shorthand flags. Here are the ones you will use most.

| Resource | Short Name | Example |
| --- | --- | --- |
| pods | po | kubectl get po |
| services | svc | kubectl get svc |
| deployments | deploy | kubectl get deploy |
| replicasets | rs | kubectl get rs |
| configmaps | cm | kubectl get cm |
| namespaces | ns | kubectl get ns |
| nodes | no | kubectl get no |
| persistentvolumeclaims | pvc | kubectl get pvc |
| persistentvolumes | pv | kubectl get pv |
| serviceaccounts | sa | kubectl get sa |
| horizontalpodautoscalers | hpa | kubectl get hpa |
| ingresses | ing | kubectl get ing |

| Flag | Short | Purpose |
| --- | --- | --- |
| --namespace | -n | Target a specific namespace |
| --all-namespaces | -A | Query all namespaces |
| --selector | -l | Filter by label selector |
| --output | -o | Set output format |
| --follow | -f | Stream logs in real time |
| --watch | -w | Watch for resource changes |
| --container | -c | Target a specific container in a pod |

The Declarative Model and Reconciliation Loops

Most infrastructure tools ask you to write a script of commands: "create this, then modify that, then delete the other thing." Kubernetes works differently. You hand the cluster a document that says what you want, and a fleet of controllers figure out how to make it happen. This is the declarative model, and it is the single most important design decision in Kubernetes.

The declarative approach turns cluster management into a convergence problem. You declare a desired state — three replicas of your web server, a load balancer on port 443, a 10Gi persistent volume — and the system continuously works to make reality match that declaration. If something drifts (a pod crashes, a node goes down, someone manually deletes a resource), the controllers detect the gap and close it automatically.

Imperative vs. Declarative: Two Mental Models

Kubernetes supports both imperative and declarative workflows through kubectl, but they represent fundamentally different ways of thinking about infrastructure. Understanding the distinction is critical to using Kubernetes effectively.

Imperative commands tell Kubernetes exactly what to do right now. You issue a verb — create, scale, delete — and the action executes immediately. There is no record of intent beyond the resulting object in the cluster.

bash
# Imperative: issue commands one at a time
kubectl create deployment nginx --image=nginx:1.27
kubectl scale deployment nginx --replicas=3
kubectl set image deployment/nginx nginx=nginx:1.28
kubectl delete deployment nginx

Declarative configuration tells Kubernetes what the end state should look like. You write a manifest (a YAML file), and kubectl apply sends it to the API server. Kubernetes computes the difference between what exists and what you declared, then makes only the necessary changes.

yaml
# deployment.yaml — the desired state
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.28
bash
# Declarative: apply the manifest — works for create AND update
kubectl apply -f deployment.yaml
Aspect | Imperative (kubectl create, kubectl run) | Declarative (kubectl apply)
Mental model | "Do this action now" | "Make reality match this file"
Idempotency | No — running create twice errors | Yes — apply is safe to run repeatedly
Change tracking | None — history lives in your terminal | Manifests are version-controlled in Git
Collaboration | Hard to share or review shell commands | Pull requests on YAML files
Drift correction | Manual — you must re-run commands | Re-apply the manifest to restore desired state
Best for | Quick experiments, one-off debugging | Production workloads, GitOps, automation
Note

kubectl apply stores the last-applied configuration as an annotation on the object (kubectl.kubernetes.io/last-applied-configuration). This is how it computes three-way diffs — comparing your new manifest, the last-applied manifest, and the live object in the cluster. If you mix create and apply on the same resource, this annotation is missing and diff calculations can produce unexpected results.

The Reconciliation Loop: Observe → Diff → Act

Declaring desired state is only half the story. The other half is reconciliation — the continuous process by which Kubernetes controllers drive actual state toward desired state. Every controller in Kubernetes follows the same three-phase pattern.

  1. Observe — Read the current state of the world from the API server (the objects the controller is responsible for, plus any dependent resources).
  2. Diff — Compare the observed state against the desired state declared in the object’s spec.
  3. Act — If there is a gap, take the minimum set of actions to close it: create missing resources, update drifted ones, or delete ones that should no longer exist. Then update the object’s status subresource to reflect the new reality.

This loop runs continuously. It is not a one-shot process triggered by your kubectl apply. After you apply a Deployment with replicas: 3, the Deployment controller does not just create three pods and walk away. It keeps watching. If a pod dies at 3 AM, the controller notices and creates a replacement — no human intervention required.
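The observe, diff, act cycle can be sketched in plain shell. This toy reconciler is purely illustrative (real controllers use watches against the API server, not local variables), but the convergence logic has the same shape:

```bash
#!/bin/sh
# Toy reconciler (illustrative only -- real controllers are event-driven
# and talk to the API server, not shell variables).
desired=3
actual=1

reconcile() {
  gap=$((desired - actual))            # Diff: desired vs. observed state
  if [ "$gap" -gt 0 ]; then            # Act: create what is missing
    echo "creating $gap pod(s)"
    actual=$((actual + gap))
  elif [ "$gap" -lt 0 ]; then          # Act: delete the surplus
    echo "deleting $((0 - gap)) pod(s)"
    actual=$((actual + gap))
  else
    echo "in sync"                     # No gap: the loop is a no-op
  fi
}

reconcile   # first pass closes the gap
reconcile   # second pass finds nothing to do
```

Note that reconcile acts on the measured gap, not on a remembered history of commands; that is what makes re-running it safe.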

stateDiagram-v2
    [*] --> DesiredStateDeclared: User applies manifest
    DesiredStateDeclared --> Watching: Controller starts watch
    Watching --> DriftDetected: Actual ≠ Desired
    Watching --> Watching: Actual = Desired (no-op)
    DriftDetected --> Acting: Create / Update / Delete resources
    Acting --> StatusUpdate: Write status to API server
    StatusUpdate --> Watching: Resume watch loop
    

The reconciliation loop. Controllers cycle between watching for changes and acting on detected drift. The loop never terminates — it runs for the entire lifetime of the cluster.

Watches: Event-Driven, Not Polling

A naive implementation of the reconciliation loop would poll the API server on a timer — "check every 5 seconds if anything changed." Kubernetes is smarter than that. Controllers use the API server’s watch mechanism, a long-lived HTTP connection (using chunked transfer encoding) that streams change events in real time.

When a controller starts, it performs a list operation to get the full current state, then opens a watch starting from the resource version returned by the list. The API server pushes ADDED, MODIFIED, and DELETED events as they happen. This design means controllers react to changes within milliseconds, not on some polling interval.

bash
# You can see the watch mechanism in action with kubectl
# This opens a long-lived connection and streams events as pods change
kubectl get pods --watch

# Raw API equivalent — note the watch=true query parameter
# (--watch is required; a plain get issues a one-shot LIST instead)
kubectl get pods --watch --v=8 2>&1 | grep "GET"
# Verbose output includes: GET https://<api-server>/api/v1/namespaces/default/pods?...&watch=true

Internally, the client-go library (used by all built-in controllers and most custom ones) wraps this list+watch pattern in a component called an Informer. An Informer maintains a local in-memory cache of the resources it watches, so controllers can read state without hitting the API server on every reconciliation. It also de-duplicates events and feeds them into a work queue, ensuring that each resource is being reconciled by at most one worker at a time.

Eventual Consistency: The Kubernetes Bargain

Kubernetes is an eventually consistent system. When you kubectl apply a manifest, the API server acknowledges the write to etcd and returns immediately. But the actual resources — the pods, the endpoints, the iptables rules — are not yet created. That work happens asynchronously as each controller picks up the change and acts on it.

This means there is always a window of time where actual state does not match desired state. A Deployment can exist in etcd before its pods are running. A Service can be created before its endpoints are populated. The gap is normally small (seconds), but under heavy load or during node failures, convergence can take longer.

This is a deliberate tradeoff. Kubernetes optimizes for resilience and scalability over immediate consistency. In a system managing thousands of nodes and tens of thousands of pods, attempting synchronous, transactional updates across all components would be impossibly slow and fragile. Instead, each controller independently converges its slice of the world, and the system as a whole settles into the desired state.

Warning

Do not write scripts that kubectl apply a resource and immediately assume it is ready. Use kubectl wait or kubectl rollout status to block until the system has converged. For example: kubectl rollout status deployment/nginx --timeout=120s.

Drift Detection and Self-Healing in Practice

The real power of the reconciliation model becomes visible when things go wrong. The following scenarios demonstrate how Kubernetes detects drift between desired and actual state, then automatically heals.

Scenario 1: Pod Deletion — The Controller Recreates It

Start with a Deployment running three replicas. Manually delete one pod. The ReplicaSet controller notices the count dropped to 2, compares it against the desired count of 3, and immediately creates a replacement.

bash
# Create a deployment with 3 replicas
kubectl apply -f deployment.yaml

# Confirm 3 pods are running
kubectl get pods -l app=nginx
# NAME                     READY   STATUS    AGE
# nginx-7c5ddbdf54-abc12   1/1     Running   30s
# nginx-7c5ddbdf54-def34   1/1     Running   30s
# nginx-7c5ddbdf54-ghi56   1/1     Running   30s

# Delete one pod manually \u2014 simulating a crash
kubectl delete pod nginx-7c5ddbdf54-abc12

# Within seconds, a replacement appears
kubectl get pods -l app=nginx
# NAME                     READY   STATUS    AGE
# nginx-7c5ddbdf54-def34   1/1     Running   60s
# nginx-7c5ddbdf54-ghi56   1/1     Running   60s
# nginx-7c5ddbdf54-xyz99   1/1     Running   3s    <-- new pod

Scenario 2: Imperative Drift — Declarative Correction

What happens if someone uses kubectl scale to manually change the replica count on a live Deployment? The cluster state drifts from what the manifest declares. The next kubectl apply of the original manifest detects the difference and corrects it. This is why imperative edits on declaratively managed resources are dangerous — the next apply will overwrite them.

bash
# Someone scales the deployment imperatively
kubectl scale deployment/nginx --replicas=5

# The cluster now has 5 pods. But the manifest says 3.
# Re-apply the manifest to restore desired state:
kubectl apply -f deployment.yaml

# The controller terminates the 2 extra pods
kubectl get pods -l app=nginx
# Back to 3 pods

Scenario 3: Node Failure — Rescheduling to Healthy Nodes

When a node becomes unreachable, the node controller marks it as NotReady after a timeout (default: 40 seconds) and applies the node.kubernetes.io/unreachable taint. Once the pods' toleration for that taint expires (default: 300 seconds), the pods bound to the node are evicted and begin terminating. The ReplicaSet controller sees replica counts drop below the desired number and schedules replacements on healthy nodes — all without human intervention.

bash
# Monitor the self-healing process during a node failure
kubectl get nodes --watch
# NAME     STATUS     ROLES    AGE   VERSION
# node-1   Ready      <none>   10d   v1.30.2
# node-2   NotReady   <none>   10d   v1.30.2   <-- node went down

kubectl get pods -l app=nginx -o wide --watch
# Pods on node-2 enter Terminating status
# New pods are scheduled on node-1 (or other healthy nodes)

# Check events to see the reconciliation in action
kubectl get events --sort-by=.lastTimestamp | tail -5
# 2m   Normal   SuccessfulCreate   replicaset/nginx-7c5ddbdf54   Created pod: nginx-7c5ddbdf54-new01

Multiple Controllers, One Convergence

A single kubectl apply of a Deployment triggers a cascade of controllers, each reconciling its own piece of the puzzle. No single controller handles the full lifecycle — they compose together through the objects they create and own.

  1. The Deployment controller sees the new Deployment object. It creates (or updates) a ReplicaSet to match the pod template.
  2. The ReplicaSet controller sees the new ReplicaSet. It counts existing pods matching the selector. If the count is too low, it creates new Pod objects.
  3. The Scheduler sees unscheduled Pods (those with no nodeName). It assigns each pod to a node by writing the nodeName field.
  4. The kubelet on each assigned node watches for pods bound to it. It pulls the container image and starts the container through the container runtime.
  5. The Endpoint/EndpointSlice controller sees pods become Ready. It adds their IPs to the corresponding Service’s endpoint list.
  6. kube-proxy (or the CNI plugin) picks up the new endpoints and updates the node\u2019s network rules so traffic can reach the new pods.

Each controller independently converges its own resources. The Deployment controller does not know or care about iptables rules. The scheduler does not know about Services. Yet the final result — a fully networked, load-balanced set of running containers — emerges from their independent reconciliation loops.

Tip

Use kubectl get events --sort-by=.lastTimestamp to trace the reconciliation cascade after applying a manifest. You will see events from the Deployment controller, the ReplicaSet controller, the scheduler, and the kubelet — in order — as each one does its part.

Why This Matters

The declarative model with reconciliation loops gives you three properties that are hard to achieve with imperative scripting:

  • Self-healing. The system continuously repairs itself. Crashed pods are restarted, failed nodes are drained, and missing resources are recreated. You do not need monitoring scripts or cron jobs to do this — it is built into the architecture.
  • Idempotent operations. Running kubectl apply -f manifest.yaml ten times has the same result as running it once. This makes automation safe and CI/CD pipelines reliable. You never have to worry about "is this resource already created?"
  • Git as the source of truth. Because the desired state is a YAML file, it lives in version control. You get history, code review, rollback, and audit trails for free. This is the foundation of the GitOps workflow pattern covered later in this guide.

Pods — The Smallest Deployable Unit

A Pod is the smallest object you can create in Kubernetes. It represents a group of one or more containers that share a network identity, an IPC namespace, and optionally a set of storage volumes. When Kubernetes schedules work onto a node, it schedules entire Pods — never individual containers.

Most Pods you encounter in the wild contain a single container. But the abstraction exists as a group because some workloads genuinely need tightly coupled processes: a web server and a log-shipping sidecar, or an application container alongside a service-mesh proxy. These containers are co-located, co-scheduled, and run in a shared context.

What Containers in a Pod Share

Containers inside the same Pod are not isolated from each other the way separate Pods are. They share three key namespaces, which has concrete implications for how you design your applications.

Shared Resource | What This Means | Practical Implication
Network namespace | All containers share the same IP address and port space | Containers talk to each other via localhost. Two containers cannot bind to the same port.
IPC namespace | Containers can use System V IPC or POSIX shared memory | Useful for legacy apps that communicate via shared memory segments or semaphores.
Volumes | Volumes defined at the Pod level can be mounted into any container | A sidecar can read log files written by the main container to a shared emptyDir volume.

Each Pod gets its own unique cluster IP address. Other Pods in the cluster communicate with it using that IP, regardless of how many containers are running inside. From the network’s perspective, a Pod is a single host.

Note

Containers within a Pod share the network and IPC namespaces, but they each have their own filesystem. A file written inside one container is not visible to another unless they both mount the same volume at the relevant path.
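The shared-volume pattern can be made concrete with a two-container Pod; the names, images, and paths below are illustrative:

```yaml
# Illustrative sidecar Pod: the app writes to a shared emptyDir volume,
# and the sidecar tails the same file through its own mount of that volume.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  volumes:
    - name: logs
      emptyDir: {}
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "while true; do date >> /var/log/app.log; sleep 5; done"]
      volumeMounts:
        - name: logs
          mountPath: /var/log
    - name: log-tailer
      image: busybox
      command: ["sh", "-c", "tail -F /var/log/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log
```

Because the two containers also share a network namespace, the sidecar could equally reach the app over localhost instead of a shared file.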

Pod Lifecycle

Every Pod moves through a defined set of phases from creation to termination. Understanding these phases is critical for debugging — when a Pod is stuck, its phase tells you where in the lifecycle it stalled.

stateDiagram-v2
    [*] --> Pending
    Pending --> Running : Scheduled, images pulled, containers starting
    Running --> Succeeded : All containers exit with code 0
    Running --> Failed : All containers terminated, at least one non-zero
    Pending --> Failed : Pod evicted or deleted before containers start
    Running --> Unknown : Node becomes unreachable
    Unknown --> Running : Node reconnects
    Unknown --> Failed : Node stays unreachable beyond timeout
    Succeeded --> [*]
    Failed --> [*]
    

Pod Phases

Phase | Description
Pending | The Pod is accepted by the cluster but one or more containers are not yet running. This includes time spent waiting for scheduling, pulling images, and initializing init containers.
Running | The Pod has been bound to a node and all containers have been created. At least one container is running, starting, or restarting.
Succeeded | All containers in the Pod terminated with exit code 0 and will not be restarted. Typical for Jobs and batch workloads.
Failed | All containers have terminated, and at least one exited with a non-zero exit code or was terminated by the system.
Unknown | The state of the Pod cannot be determined, usually because the kubelet on the node has stopped reporting.

Pod Conditions

While the phase gives you a high-level summary, conditions provide a more granular picture. Each condition is a boolean with a reason and a timestamp. You can inspect them with kubectl describe pod or query them in JSON output.

Condition | Meaning
PodScheduled | The Pod has been assigned to a node.
Initialized | All init containers have completed successfully.
ContainersReady | All containers in the Pod are ready (passed readiness probes).
Ready | The Pod is ready to serve traffic and should be added to Service endpoints.

Container States

Each container inside a Pod has its own state, independent of the Pod phase. Kubernetes tracks three possible states per container:

  • Waiting — The container is not yet running. The reason field tells you why: ContainerCreating, ImagePullBackOff, CrashLoopBackOff, etc.
  • Running — The container is executing. The startedAt timestamp tells you when it began.
  • Terminated — The container finished execution. You get the exitCode, reason, and both startedAt and finishedAt timestamps.

Restart Policies

The restartPolicy field in the Pod spec controls what the kubelet does when a container exits. It applies to all containers in the Pod — you cannot set different restart policies for different containers. The default is Always.

Policy | Behavior | Best For
Always | Restart the container regardless of exit code. Uses exponential backoff (10s, 20s, 40s, … up to 5 min). | Long-running services managed by Deployments, StatefulSets, DaemonSets.
OnFailure | Restart only if the container exits with a non-zero code. Containers that exit 0 stay terminated. | Jobs and batch tasks that should retry on failure but stop on success.
Never | Never restart, regardless of exit code. | One-shot diagnostic or debug Pods where you want to inspect the exit state.

When a container repeatedly crashes, the kubelet delays restarts using exponential backoff. This is the infamous CrashLoopBackOff status — it means the kubelet is waiting before retrying. The delay caps at 5 minutes.
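As a concrete sketch, here is a one-shot Pod that retries on failure but stays terminated on success; the name and command are illustrative:

```yaml
# Illustrative batch-style Pod using restartPolicy: OnFailure.
apiVersion: v1
kind: Pod
metadata:
  name: db-migrate
spec:
  restartPolicy: OnFailure    # retry non-zero exits; exit 0 leaves the Pod in Succeeded
  containers:
    - name: migrate
      image: busybox
      command: ["sh", "-c", "echo applying migration && exit 0"]
```

If the command exited non-zero instead, the kubelet would restart it with the same exponential backoff described above.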

Anatomy of a Pod Spec

A Pod spec is more than just a list of containers. It includes scheduling constraints, security settings, volume definitions, and metadata that controllers and the scheduler use to make decisions. Here is an annotated example that shows the most commonly used fields.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app
    version: v1
spec:
  serviceAccountName: my-app-sa        # Identity for RBAC and API access
  restartPolicy: Always

  # --- Scheduling constraints ---
  nodeSelector:
    disk: ssd                           # Only nodes with this label
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"             # Tolerate the gpu taint
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname  # Spread across nodes

  # --- Security ---
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000

  # --- Volumes ---
  volumes:
    - name: config-vol
      configMap:
        name: my-app-config
    - name: data-vol
      persistentVolumeClaim:
        claimName: my-app-data
    - name: tmp
      emptyDir: {}

  # --- Containers ---
  containers:
    - name: app
      image: my-app:1.2.0
      ports:
        - containerPort: 8080
      env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: host
      volumeMounts:
        - name: config-vol
          mountPath: /etc/app/config
          readOnly: true
        - name: data-vol
          mountPath: /var/data
        - name: tmp
          mountPath: /tmp
      resources:
        requests:
          cpu: "250m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 15
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5

A few things to note in this spec. The serviceAccountName determines what Kubernetes API permissions the Pod has — always set it explicitly rather than relying on the default service account. The securityContext at the Pod level applies to all containers; you can override it per-container for finer control. Scheduling fields like nodeSelector, tolerations, and affinity work together to control where the Pod lands.
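For instance, a container can tighten or override the Pod-level securityContext with settings of its own; this fragment is illustrative and assumes the spec above:

```yaml
# Per-container securityContext (fields here exist only at container scope
# or override their Pod-level counterparts).
containers:
  - name: app
    securityContext:
      runAsUser: 1001                  # overrides the Pod-level runAsUser: 1000
      allowPrivilegeEscalation: false  # container-only field
      readOnlyRootFilesystem: true     # container-only field
```

Where a field exists at both scopes, the container-level value wins for that container.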

Creating and Inspecting Pods

You will rarely create bare Pods in production — Deployments and Jobs handle that for you. But during development and debugging, working directly with Pods is essential. Here are the most common operations.

Create a Pod Imperatively

The fastest way to spin up a Pod for quick testing. The --rm and -it flags make it behave like a temporary interactive shell.

bash
# Run a one-off Pod (deleted automatically when you exit)
kubectl run tmp-shell --rm -it --image=alpine -- /bin/sh

# Run a Pod with a specific command
kubectl run dns-test --image=busybox --restart=Never -- nslookup kubernetes.default

# Generate a Pod YAML without creating it (dry-run)
kubectl run my-app --image=nginx:1.25 --port=80 --dry-run=client -o yaml > pod.yaml

Create a Pod Declaratively

bash
# Apply a Pod manifest
kubectl apply -f pod.yaml

# Watch the Pod come up
kubectl get pod my-app -w

Inspecting Pod State

kubectl get shows a summary. kubectl describe gives you the full picture: events, conditions, container states, and reasons for failures. When something goes wrong, describe is always your first stop.

bash
# Summary with status, restarts, age, IP, and node
kubectl get pod my-app -o wide

# Full details: events, conditions, container states
kubectl describe pod my-app

# Logs from the primary container
kubectl logs my-app

# Logs from a specific container in a multi-container Pod
kubectl logs my-app -c sidecar

# Stream logs in real time
kubectl logs my-app -f --tail=50

# Check container states as JSON
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[*].state}'

Debugging Pods

When a container is running but misbehaving, kubectl exec lets you open a shell inside it. This works only if the container has a shell binary — distroless and minimal images often do not. That is where ephemeral debug containers come in.

Exec into a Running Container

bash
# Open an interactive shell
kubectl exec -it my-app -- /bin/sh

# Run a one-off command
kubectl exec my-app -- cat /etc/app/config/settings.yaml

# Exec into a specific container in a multi-container Pod
kubectl exec -it my-app -c sidecar -- /bin/bash

Ephemeral Debug Containers

Introduced as stable in Kubernetes 1.25, ephemeral containers solve the “distroless debugging” problem. You inject a temporary container — with the tools you need — into a running Pod. The debug container shares the Pod’s namespaces, so it can see the same network interfaces, processes, and file systems.

bash
# Attach a debug container that shares the target container's PID namespace
kubectl debug -it my-app --image=busybox --target=app

# Debug a CrashLoopBackOff Pod by copying it with a different command
kubectl debug my-app -it --copy-to=my-app-debug --container=app -- /bin/sh

# Debug at the node level (creates a privileged Pod on the node)
kubectl debug node/worker-1 -it --image=ubuntu
Tip

The --target flag in kubectl debug makes the ephemeral container share the process namespace of the specified container. This means you can use ps aux from the debug container to see the target’s processes, inspect /proc, and even attach a debugger.

Deleting Pods

When you delete a Pod, Kubernetes sends a SIGTERM to every container and waits up to the terminationGracePeriodSeconds (default: 30 seconds) for a clean shutdown. If containers are still running after that deadline, it sends SIGKILL.
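If your application needs longer to drain, you can extend the budget and run a hook before SIGTERM is delivered; the values below are illustrative:

```yaml
# Illustrative shutdown tuning: a longer grace period plus a preStop hook.
spec:
  terminationGracePeriodSeconds: 60   # total shutdown budget, preStop included
  containers:
    - name: app
      image: my-app:1.2.0
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]  # runs before SIGTERM, e.g. to let a load balancer drain
```

The preStop hook and the SIGTERM handler share the same grace period, so budget for both.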

bash
# Graceful delete (waits for terminationGracePeriodSeconds)
kubectl delete pod my-app

# Force delete (skip the grace period — use sparingly)
kubectl delete pod my-app --grace-period=0 --force

# Delete all Pods matching a label
kubectl delete pods -l app=my-app

# Delete a Pod from a YAML file
kubectl delete -f pod.yaml
Warning

Force-deleting a Pod (--grace-period=0 --force) does not wait for confirmation that the containers have actually stopped. The Pod object is removed from etcd immediately, but the container processes may still be running on the node. Use this only when a Pod is stuck in Terminating and you are certain the node is unreachable or the workload is safe to abandon.

ReplicaSets and Deployments — Managing Pod Replicas

In the previous section, you learned that a Pod is the smallest deployable unit in Kubernetes. But here is the thing: you almost never create Pods directly. A bare Pod has no self-healing capability — if it crashes, gets evicted, or its node goes down, it is gone forever. Nothing recreates it.

This is where ReplicaSets and Deployments come in. They form a two-layer abstraction that keeps your application running at the desired scale and gives you controlled rollout and rollback capabilities. Understanding how these layers interact is essential to operating anything in production on Kubernetes.

ReplicaSets: The Replication Engine

A ReplicaSet has one job: ensure that a specified number of identical Pod replicas are running at all times. It does this through a continuous reconciliation loop. The ReplicaSet controller watches the cluster state, compares the current number of matching Pods to the desired count in the spec, and creates or deletes Pods to close the gap.

The "matching" part is critical. A ReplicaSet finds its Pods using a label selector, not by tracking specific Pod names. It looks for Pods whose labels match its spec.selector, counts them, and acts accordingly. This decoupled relationship means a ReplicaSet can even adopt pre-existing Pods if their labels match.

yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80

Three key fields define a ReplicaSet: spec.replicas sets the desired Pod count, spec.selector defines which labels to match, and spec.template provides the Pod blueprint for creating new replicas. The labels in template.metadata.labels must match the selector — Kubernetes validates this and rejects the object if they do not align.
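Besides matchLabels, selectors also support set-based matchExpressions; this fragment is an illustrative variant of the selector above:

```yaml
# Set-based selector: match Pods whose app label is web-frontend
# and whose tier label (if any) is not canary.
selector:
  matchExpressions:
    - key: app
      operator: In
      values: [web-frontend]
    - key: tier
      operator: NotIn
      values: [canary]
```

matchLabels and matchExpressions can be combined; a Pod must satisfy every clause to be selected.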

You should not create ReplicaSets directly either

While ReplicaSets handle replication, they have no concept of updates or rollbacks. If you change the Pod template on a ReplicaSet, existing Pods are not replaced — only new Pods created after the change use the updated template. This is why Deployments exist.

Deployments: The Abstraction You Actually Use

A Deployment is a higher-level controller that manages ReplicaSets on your behalf. When you create a Deployment, it creates a ReplicaSet. When you update the Deployment's Pod template (for example, changing the container image), the Deployment creates a new ReplicaSet with the updated template and gradually scales it up while scaling the old one down. This is how rolling updates work.

The ownership chain is: Deployment → ReplicaSet → Pods. The Deployment never manages Pods directly. It manages ReplicaSets, and each ReplicaSet manages its own set of Pods. Old ReplicaSets are kept around (scaled to zero) so that rollbacks can reactivate them instantly.

graph TD
    D["Deployment<br/><strong>web-app</strong>"]
    RS1["ReplicaSet · revision 1<br/><em>nginx:1.24</em><br/>replicas: 0"]
    RS2["ReplicaSet · revision 2<br/><em>nginx:1.25</em><br/>replicas: 3"]
    P1["Pod web-app-a7x2k"]
    P2["Pod web-app-b9m3p"]
    P3["Pod web-app-c4n8q"]

    D --> RS1
    D --> RS2
    RS2 --> P1
    RS2 --> P2
    RS2 --> P3

    style D fill:#326ce5,stroke:#fff,color:#fff
    style RS1 fill:#666,stroke:#999,color:#ccc
    style RS2 fill:#1a73e8,stroke:#fff,color:#fff
    style P1 fill:#4caf50,stroke:#fff,color:#fff
    style P2 fill:#4caf50,stroke:#fff,color:#fff
    style P3 fill:#4caf50,stroke:#fff,color:#fff
    

Deployment Spec Anatomy

A Deployment spec is structurally similar to a ReplicaSet — it has replicas, selector, and template — but adds fields for update strategy, revision history, and rollout behavior. Here is a production-ready Deployment with every important field annotated.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  revisionHistoryLimit: 5          # Keep 5 old ReplicaSets for rollback
  selector:
    matchLabels:
      app: web-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # At most 1 extra Pod during update
      maxUnavailable: 0            # Never drop below desired count
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.4.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

Deployment Strategies

Kubernetes offers two built-in strategies for how a Deployment replaces old Pods with new ones. Choosing the right one depends on whether your application can tolerate running two versions simultaneously.

RollingUpdate (Default)

The RollingUpdate strategy incrementally replaces Pods. Kubernetes creates new Pods from the updated ReplicaSet and terminates old Pods from the previous ReplicaSet in a controlled sequence. At no point does the total count of available Pods drop to zero (assuming sane settings), which makes this the standard choice for stateless web services.

Two parameters control the pace of the rollout:

Parameter | What It Controls | Default | Example
maxSurge | How many extra Pods (above desired count) can exist during the update | 25% | With 4 replicas and maxSurge: 1, up to 5 Pods can run simultaneously
maxUnavailable | How many Pods can be unavailable (not ready) during the update | 25% | With 4 replicas and maxUnavailable: 0, all 4 must stay ready throughout

Setting maxSurge: 1 and maxUnavailable: 0 is the most conservative configuration — Kubernetes spins up one new Pod, waits for it to become ready, then terminates one old Pod, and repeats. This is slower but guarantees zero capacity loss. Setting both to higher values (or percentages) speeds up the rollout at the cost of temporary over-provisioning or reduced capacity.
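For comparison, a faster but less conservative configuration uses percentages; the values are illustrative:

```yaml
# Percentage-based pacing: with 4 replicas, up to 1 extra Pod may surge
# and up to 1 Pod may be not-ready at any point during the rollout.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 25%
```

Percentages are computed against the desired replica count and rounded (up for maxSurge, down for maxUnavailable), so they scale with the Deployment.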

Recreate

The Recreate strategy is simpler and more brutal: it kills all existing Pods before creating any new ones. This means a period of complete downtime during the update. Use this only when your application cannot run two versions side-by-side — for example, when a new version requires an exclusive database migration lock or uses an incompatible on-disk format.

yaml
spec:
  strategy:
    type: Recreate     # No rollingUpdate block needed

Rollout Management

Every change to a Deployment’s .spec.template triggers a new rollout. Kubernetes tracks these rollouts as revisions, and you can inspect, pause, and undo them using kubectl rollout commands. Changes to fields outside the template (like replicas) do not trigger a new rollout.

Watching a Rollout in Progress

bash
# Trigger a rollout by updating the image
kubectl set image deployment/web-app web-app=myregistry/web-app:v2.5.0

# Watch the rollout progress in real-time
kubectl rollout status deployment/web-app
# Waiting for deployment "web-app" rollout to finish:
#   1 out of 3 new replicas have been updated...
#   2 out of 3 new replicas have been updated...
#   3 out of 3 new replicas have been updated...
#   deployment "web-app" successfully rolled out

Viewing Rollout History

Each rollout creates a new revision. You can inspect the full history and see what changed in each revision. The --record flag used to capture the command that triggered each revision, but it is deprecated; on current kubectl versions, set the kubernetes.io/change-cause annotation on the Deployment to populate the CHANGE-CAUSE column.

bash
# List all revisions
kubectl rollout history deployment/web-app
# REVISION  CHANGE-CAUSE
# 1         kubectl apply --filename=web-app.yaml
# 2         kubectl set image deployment/web-app web-app=myregistry/web-app:v2.4.1
# 3         kubectl set image deployment/web-app web-app=myregistry/web-app:v2.5.0

# Inspect a specific revision’s Pod template
kubectl rollout history deployment/web-app --revision=2

Rolling Back

When a new release is broken, you can instantly revert to a previous revision. The rollback does not re-deploy the old image from scratch — it reactivates the old ReplicaSet (which was kept around at zero replicas) and scales it back up. This makes rollbacks fast.

bash
# Roll back to the previous revision
kubectl rollout undo deployment/web-app

# Roll back to a specific revision
kubectl rollout undo deployment/web-app --to-revision=2

# Verify the rollback completed
kubectl rollout status deployment/web-app

Revision History Limits

Every old ReplicaSet consumes a small amount of etcd storage and clutters kubectl get rs output. The spec.revisionHistoryLimit field controls how many old ReplicaSets are retained. The default is 10. Setting it to 0 disables rollback entirely because no old ReplicaSets are preserved. A value between 3 and 10 is practical for most workloads.
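
For example, to keep only the last five revisions:

yaml
spec:
  revisionHistoryLimit: 5    # Retain 5 old ReplicaSets for rollback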

Scaling

Scaling a Deployment changes its spec.replicas field, and the Deployment controller propagates the new count to the active ReplicaSet. You can do this imperatively with kubectl scale or declaratively by updating the manifest. Since scaling does not change the Pod template, it does not trigger a new rollout or create a new revision.

bash
# Imperative scaling
kubectl scale deployment/web-app --replicas=5

# Verify the new replica count
kubectl get deployment web-app
# NAME      READY   UP-TO-DATE   AVAILABLE   AGE
# web-app   5/5     5            5           12d

For declarative scaling, update spec.replicas in your YAML and apply it. For automatic scaling based on CPU or memory utilization, use a HorizontalPodAutoscaler (HPA) — which adjusts the replica count on the Deployment dynamically. When using an HPA, you typically omit spec.replicas from your manifest to avoid conflicts between the HPA controller and your declared value.

bash
# Create an HPA targeting 70% CPU utilization, scaling between 3 and 10 replicas
kubectl autoscale deployment/web-app --min=3 --max=10 --cpu-percent=70
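
The equivalent declarative HPA uses the autoscaling/v2 API — a sketch targeting the same Deployment:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70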

Real-World Patterns: Blue-Green and Canary Deployments

The built-in RollingUpdate strategy works well for most scenarios, but some teams need finer control over traffic shifting during deploys. Kubernetes does not have first-class "blue-green" or "canary" resources, but you can implement both patterns using native primitives: Deployments, Services, and label selectors.

Blue-Green Deployments

In a blue-green deployment, you run two complete environments side by side — "blue" (current) and "green" (new). All traffic goes to one environment at a time. Once the green environment is validated, you switch traffic instantly by updating the Service selector. If something goes wrong, you switch back.

yaml
# deployment-blue.yaml — the currently active version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      version: blue
  template:
    metadata:
      labels:
        app: web-app
        version: blue
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.4.1
---
# deployment-green.yaml — the new version, deployed alongside blue
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      version: green
  template:
    metadata:
      labels:
        app: web-app
        version: green
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.5.0
---
# service.yaml — traffic switch controlled by the 'version' label
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
    version: blue      # ← Change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 8080

The cutover is a single operation: patch the Service selector from blue to green. Traffic shifts immediately because the Service updates its endpoint list. To roll back, patch the selector back to blue. Once you are confident the green version is stable, delete the blue Deployment.

bash
# Switch traffic from blue to green
kubectl patch svc web-app -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback: switch traffic back to blue
kubectl patch svc web-app -p '{"spec":{"selector":{"version":"blue"}}}'

Canary Deployments

A canary deployment routes a small percentage of traffic to the new version while the majority continues hitting the stable version. In native Kubernetes, you achieve this by running two Deployments with a shared label that the Service selects on. Traffic is distributed across all matching Pods proportionally.

yaml
# Stable deployment — 9 replicas serving ~90% of traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-app
      track: stable
  template:
    metadata:
      labels:
        app: web-app        # ← Shared label the Service selects on
        track: stable
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.4.1
---
# Canary deployment — 1 replica serving ~10% of traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
      track: canary
  template:
    metadata:
      labels:
        app: web-app        # ← Same shared label
        track: canary
    spec:
      containers:
        - name: web-app
          image: myregistry/web-app:v2.5.0
---
# Service selects ONLY on app: web-app — matches both Deployments
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app            # ← Does NOT include 'track', so both match
  ports:
    - port: 80
      targetPort: 8080

Since the Service selects on app: web-app only, it targets all 10 Pods (9 stable + 1 canary). Kubernetes distributes requests roughly evenly across endpoints, so about 10% of traffic hits the canary. You control the ratio by adjusting replica counts. To promote the canary, update the stable Deployment’s image and scale the canary back to zero.
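
The promotion step, sketched as commands (image tags are the ones used in this example):

bash
# Promote: roll the stable track to the canary's image
kubectl set image deployment/web-app-stable web-app=myregistry/web-app:v2.5.0
kubectl rollout status deployment/web-app-stable

# Then retire the canary
kubectl scale deployment/web-app-canary --replicas=0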

When native canaries are not enough

Native Kubernetes canary deployments control traffic split by Pod ratio, which is coarse-grained. If you need header-based routing, precise percentage-based traffic splitting (like 1% to canary), or automatic rollback on error rate thresholds, look into a service mesh like Istio or a progressive delivery controller like Argo Rollouts or Flagger.

Putting It All Together

Here is a summary of the relationship between the resources covered in this section and when to reach for each pattern:

| Resource / Pattern | What It Does | When to Use |
| --- | --- | --- |
| ReplicaSet | Maintains a fixed number of Pod replicas via label selectors | Never directly — always through a Deployment |
| Deployment (RollingUpdate) | Incrementally rolls out new Pod versions with zero downtime | Default choice for stateless applications |
| Deployment (Recreate) | Terminates all old Pods before creating new ones | Applications that cannot run two versions simultaneously |
| Blue-Green | Instant traffic cutover between two full environments | When you need instant rollback and can afford double the resources |
| Canary | Routes a fraction of traffic to the new version for validation | High-risk changes where you want gradual exposure |

In the next section, you will learn about StatefulSets — the controller designed for applications that need stable network identities and persistent storage, where the interchangeable nature of Deployment-managed Pods breaks down.

StatefulSets — Stable Identity for Stateful Applications

Not every workload is stateless. Databases need persistent disk, message queues need stable network addresses, and distributed stores need to know which replica they are. A Deployment treats every Pod as interchangeable — when a Pod dies, it gets a new random name, a new IP, and its local storage vanishes. That model breaks fundamentally for anything that stores data or participates in a cluster protocol.

StatefulSets solve this by giving each Pod a persistent identity that survives rescheduling. The Pod name, its DNS hostname, and its storage volume are all stable. Pod mysql-0 is always mysql-0, whether it runs on node A today or node B tomorrow.

Why Deployments Fall Short for Stateful Workloads

Consider a 3-node PostgreSQL cluster running in streaming replication. The primary (node 0) writes the WAL, and replicas (nodes 1 and 2) connect to the primary by hostname to stream changes. If you use a Deployment, you immediately hit three problems: Pod names are random (so replicas cannot find the primary), storage is ephemeral (so a restarted Pod loses all data), and Pods are created and destroyed in any order (so the primary might not exist when replicas start).

| Behavior | Deployment | StatefulSet |
| --- | --- | --- |
| Pod naming | Random suffix (app-7b9f4d) | Ordinal index (app-0, app-1) |
| Network identity | Changes on every reschedule | Stable DNS per Pod via headless Service |
| Persistent storage | Shared PVC or ephemeral | Dedicated PVC per Pod via volumeClaimTemplates |
| Startup/shutdown order | All Pods created in parallel | Sequential by ordinal (0 → 1 → 2) |
| Rolling update | Any order, surge allowed | Reverse ordinal order (2 → 1 → 0) |
| Pod replacement | New Pod with new identity | Replacement reuses same name and PVC |

The Three Guarantees of StatefulSets

1. Stable Network Identity

Each Pod in a StatefulSet gets a predictable hostname following the pattern <statefulset-name>-<ordinal>. A StatefulSet named postgres with 3 replicas creates Pods named postgres-0, postgres-1, and postgres-2. When postgres-1 is rescheduled to a different node, it keeps the name postgres-1 — and critically, its DNS record points to the new IP.

This identity is paired with a headless Service (covered below) to give each Pod a stable DNS name like postgres-0.postgres-headless.default.svc.cluster.local. Other components can connect to a specific replica by name, which is exactly what replication protocols require.
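
You can verify the per-Pod records from a throwaway debug Pod — a sketch; the Service and namespace names match the example below, and the busybox tag is illustrative:

bash
# Resolve a specific replica by its stable DNS name
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- \
  nslookup postgres-0.postgres-headless.default.svc.cluster.local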

2. Stable Persistent Storage

StatefulSets use volumeClaimTemplates to create a dedicated PersistentVolumeClaim for each Pod. When the StatefulSet creates postgres-0, it also creates a PVC named data-postgres-0. If postgres-0 is deleted and recreated, the new Pod reattaches to the same data-postgres-0 PVC — your data survives.

3. Ordered Deployment and Scaling

By default, StatefulSets create Pods sequentially. Pod 0 must be Running and Ready before Pod 1 starts. This ordering matters for leader-election or primary-replica setups where the first node must initialize the cluster before replicas join. Scaling down reverses the order: the highest-ordinal Pod is terminated first.

mermaid
flowchart LR
    subgraph "StatefulSet: postgres (replicas=3)"
        direction LR
        P0["postgres-0<br/><small>Created first</small>"] -->|"Ready ✓"| P1["postgres-1<br/><small>Created second</small>"]
        P1 -->|"Ready ✓"| P2["postgres-2<br/><small>Created third</small>"]
    end

    subgraph "PersistentVolumeClaims"
        PVC0["data-postgres-0<br/>10Gi"]
        PVC1["data-postgres-1<br/>10Gi"]
        PVC2["data-postgres-2<br/>10Gi"]
    end

    P0 -.->|"bound"| PVC0
    P1 -.->|"bound"| PVC1
    P2 -.->|"bound"| PVC2
    
Note

The ordering guarantee applies to startup and shutdown, not to steady-state operation. Once all Pods are Running and Ready, they operate independently. If postgres-1 crashes, only postgres-1 is restarted — it does not wait for postgres-0 or affect postgres-2.

Headless Services — Direct Pod DNS

A normal ClusterIP Service gives you a single virtual IP that load-balances across all Pods. That is useless when you need to connect to a specific Pod — you cannot tell a replica "connect to the primary at this VIP" because the VIP might route to any backend Pod.

A headless Service is a Service with clusterIP: None. Instead of creating a single VIP, it creates individual DNS A records for each Pod. This gives every StatefulSet Pod a predictable, resolvable hostname.

yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  labels:
    app: postgres
spec:
  clusterIP: None          # This makes it headless
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432

With this Service in place, each Pod gets a DNS entry following the pattern:

text
# Pattern:
# <pod-name>.<headless-service>.<namespace>.svc.cluster.local

postgres-0.postgres-headless.default.svc.cluster.local  → 10.244.1.5
postgres-1.postgres-headless.default.svc.cluster.local  → 10.244.2.8
postgres-2.postgres-headless.default.svc.cluster.local  → 10.244.3.3

The StatefulSet's spec.serviceName field must reference this headless Service. This is how Kubernetes knows which Service to register Pod DNS records with. Without this link, your Pods will not get individual DNS entries.

Anatomy of a StatefulSet Manifest

A StatefulSet spec looks similar to a Deployment, with two important additions: the serviceName field that links to the headless Service, and the volumeClaimTemplates section that defines per-Pod storage. Here is the complete structure for a 3-replica PostgreSQL cluster:

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless   # Must match the headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
  volumeClaimTemplates:            # One PVC per Pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi

The volumeClaimTemplates section acts as a PVC template. For each Pod ordinal, Kubernetes creates a PVC named <template-name>-<statefulset-name>-<ordinal> — in this case data-postgres-0, data-postgres-1, and data-postgres-2. These PVCs are not deleted when Pods are rescheduled, which is exactly what you want for a database.
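
Once the StatefulSet is up, you can see the per-Pod claims it created:

bash
# One PVC per ordinal, named <template-name>-<statefulset-name>-<ordinal>
kubectl get pvc -l app=postgres
# Expect data-postgres-0, data-postgres-1, data-postgres-2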

Practical Example: PostgreSQL with Streaming Replication

A real-world replicated PostgreSQL setup needs each Pod to know its role (primary vs. replica) and configure itself accordingly. The ordinal index makes this straightforward: Pod 0 initializes as primary, all others connect to Pod 0 as replicas.

Start by creating the Secret and headless Service:

bash
# Create the password secret
kubectl create secret generic postgres-secret \
  --from-literal=password='S3cur3P@ss' \
  --from-literal=replication-password='R3plP@ss'

# Apply the headless Service
kubectl apply -f postgres-headless-svc.yaml

Next, use a ConfigMap to hold an init script that detects whether the Pod is the primary or a replica based on its hostname ordinal:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-init
data:
  setup.sh: |
    #!/bin/bash
    set -e

    # Extract ordinal from hostname (e.g., "postgres-0" -> "0")
    ORDINAL="${HOSTNAME##*-}"

    if [ "$ORDINAL" -eq 0 ]; then
      echo "Initializing as PRIMARY"
      # Primary-specific config: enable replication slots, WAL shipping
      cat >> /var/lib/postgresql/data/pgdata/postgresql.conf <<CONF
    wal_level = replica
    max_wal_senders = 5
    max_replication_slots = 5
    CONF
    else
      echo "Initializing as REPLICA — connecting to postgres-0"
      # Use pg_basebackup to clone data from the primary
      pg_basebackup -h postgres-0.postgres-headless \
        -U replicator -D /var/lib/postgresql/data/pgdata \
        -Fp -Xs -R
    fi

The key insight is the HOSTNAME environment variable. Kubernetes sets it to the Pod name (postgres-0, postgres-1, etc.), so you can parse the ordinal and branch your initialization logic. Replicas connect to postgres-0.postgres-headless — the stable DNS name — regardless of what node the primary is running on.

Pod Management Policy

The default podManagementPolicy is OrderedReady, which enforces the sequential startup and shutdown behavior described above. For workloads that do not need ordering — for example, a distributed cache where all nodes are peers — you can set it to Parallel to launch all Pods at once.

yaml
spec:
  podManagementPolicy: Parallel    # All Pods start/stop simultaneously
  replicas: 5
  serviceName: memcached-headless

| Policy | Startup | Shutdown | Use case |
| --- | --- | --- | --- |
| OrderedReady | Sequential (0 → 1 → 2) | Reverse (2 → 1 → 0) | Databases, leader-follower systems |
| Parallel | All Pods at once | All Pods at once | Peer-to-peer caches, stateful workers |

Update Strategies

StatefulSets support two update strategies that control how Pods are replaced when you change the Pod template (e.g., updating the container image).

RollingUpdate (default)

Pods are updated one at a time in reverse ordinal order (highest ordinal first). This is intentional — in most primary/replica setups, replicas have higher ordinals and should be updated before the primary. Each Pod must become Running and Ready before the next one is updated.

RollingUpdate with Partition (Canary Deploys)

The partition parameter lets you perform a staged rollout. Only Pods with an ordinal greater than or equal to the partition value are updated. This is a powerful canary mechanism — you can test a new image on the last replica before rolling it out to the entire set.

yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2         # Only pods with ordinal >= 2 are updated

bash
# Step 1: Set partition to 2, update the image
# Only postgres-2 gets the new image
kubectl patch statefulset postgres -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/postgres postgres=postgres:17

# Step 2: Verify postgres-2 is healthy with the new version
kubectl get pod postgres-2 -o jsonpath='{.spec.containers[0].image}'

# Step 3: Lower partition to 0 to roll out to all Pods
kubectl patch statefulset postgres -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'

OnDelete

With OnDelete, Kubernetes does not automatically update Pods when you change the template. Instead, you manually delete Pods one by one, and each replacement is created with the new template. This gives you full control over the update pace and order, which can be critical for databases where you must manually verify replication health between each step.

yaml
spec:
  updateStrategy:
    type: OnDelete
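
With OnDelete, a manual rollout looks like this — update the template, then delete Pods one at a time, verifying health between steps:

bash
# Record the new image in the template (no Pods change yet)
kubectl set image statefulset/postgres postgres=postgres:17

# Replace the highest ordinal first; the replacement uses the new template
kubectl delete pod postgres-2
kubectl wait --for=condition=Ready pod/postgres-2 --timeout=300s

# Verify replication health here, then repeat for postgres-1 and postgres-0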

Common Pitfalls

PVC Retention: Volumes Outlive the StatefulSet

When you delete a StatefulSet or scale it down, the PVCs are not automatically deleted. This is a safety feature — you do not want to lose a database volume because someone ran kubectl delete statefulset. However, it means orphaned PVCs accumulate and continue to consume storage (and cost money) unless you clean them up.

bash
# List PVCs that belonged to a deleted StatefulSet
kubectl get pvc -l app=postgres

# Manually delete orphaned PVCs after confirming data is backed up
kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2

Starting with Kubernetes 1.27 (beta, enabled by default; stable in 1.32), you can configure automatic PVC cleanup using persistentVolumeClaimRetentionPolicy:

yaml
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete        # Delete PVCs when StatefulSet is deleted
    whenScaled: Retain         # Keep PVCs when scaling down (safe default)

Ordering Dependencies and Stuck Rollouts

With OrderedReady, if Pod 0 fails to become Ready, the entire StatefulSet is stuck — Pods 1 and 2 will never be created. This cascading failure is one of the most common StatefulSet debugging scenarios. Always check the Pod 0 logs and events first when a StatefulSet is not scaling up.

bash
# Diagnose a stuck StatefulSet
kubectl describe statefulset postgres
kubectl get pods -l app=postgres
kubectl logs postgres-0 --previous    # Check crash logs
kubectl describe pod postgres-0       # Check events for scheduling/volume issues

Scaling Down Safely

Scaling down removes Pods in reverse ordinal order (highest first), which is usually the safest order for primary-replica setups. However, Kubernetes does not understand your application's replication state. If postgres-2 holds data that has not been replicated elsewhere, scaling from 3 to 2 can cause data loss. Always verify your application's replication health before scaling down.
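
A sketch of a safe scale-down for the PostgreSQL example — check replication state on the primary before removing the highest-ordinal replica (assumes psql is available in the container image):

bash
# Confirm all replicas are streaming and caught up (run against the primary)
kubectl exec postgres-0 -- psql -U postgres \
  -c "SELECT application_name, state, replay_lag FROM pg_stat_replication;"

# Only then remove the highest-ordinal Pod (postgres-2)
kubectl scale statefulset/postgres --replicas=2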

Warning

Scaling a StatefulSet to 0 and then deleting it is not a complete "cleanup": the PVCs remain no matter how you delete the StatefulSet, and a plain delete gives no guarantee of ordered Pod termination (scale to 0 first if shutdown order matters). If you later recreate the StatefulSet, the old PVCs are reattached, and they may contain stale data that conflicts with a fresh initialization. Either delete the PVCs explicitly after backing up, or use persistentVolumeClaimRetentionPolicy.

When to Use Operators Instead

Raw StatefulSets give you identity, storage, and ordering — but they do not understand your application. They cannot perform automated failover when a PostgreSQL primary dies, rebalance partitions in a Kafka cluster, or trigger a backup before scaling down. For production databases and complex distributed systems, a Kubernetes Operator wraps the StatefulSet with application-specific automation.

| Scenario | Recommendation |
| --- | --- |
| Learning / dev environments | Raw StatefulSet is fine — keep it simple |
| Single-node database (no replication) | StatefulSet with 1 replica works well |
| Replicated database in production | Use an Operator (CloudNativePG, Zalando Postgres Operator, MySQL Operator) |
| Kafka, Elasticsearch, Cassandra | Use the vendor-supported Operator for lifecycle management |
| Custom stateful app with simple needs | StatefulSet + init containers + readiness probes |

Tip

Even when using an Operator, understanding StatefulSets is essential. Operators build on top of StatefulSets, and debugging a misbehaving Operator almost always means inspecting the underlying StatefulSet, Pods, and PVCs. The concepts from this section — stable identity, headless Services, ordered scaling, and PVC lifecycle — remain the foundation.

DaemonSets, Jobs, and CronJobs — Specialized Workload Controllers

Deployments and StatefulSets cover most workloads, but not all work fits the “run N replicas forever” model. Some pods need to run on every node — log shippers, monitoring agents, network plugins. Others need to run once, finish, and exit. Still others need to fire on a schedule, like a nightly database backup. Kubernetes provides three specialized controllers for exactly these patterns: DaemonSets, Jobs, and CronJobs.

Each controller builds on the same reconciliation loop that powers Deployments, but they differ in when pods are created, where they are placed, and what happens when a pod finishes. Understanding these differences lets you model your entire workload landscape without resorting to external cron daemons, systemd services, or manual node-by-node deployments.

DaemonSets — One Pod Per Node

A DaemonSet ensures that every node (or a targeted subset) runs exactly one copy of a pod. When a new node joins the cluster, the DaemonSet controller automatically schedules a pod on it. When a node is removed, the pod is garbage-collected. You never specify a replica count — the node count is the replica count.

This makes DaemonSets the natural choice for node-level infrastructure that must be present everywhere in the cluster. The most common use cases include:

  • Log collectors — Fluentd, Fluent Bit, or Logstash agents that tail container logs from /var/log and forward them to a central store.
  • Monitoring agents — Prometheus node-exporter for hardware and OS metrics, Datadog agents, or New Relic infrastructure agents.
  • CNI / network plugins — Calico, Cilium, and AWS VPC CNI all run as DaemonSets to configure networking on each node.
  • Storage daemons — Ceph OSDs, Longhorn engine processes, or CSI node plugins that need access to the host’s disk and device tree.

Basic DaemonSet YAML

The following manifest deploys Fluent Bit as a log collector across every node. Notice there is no replicas field — the DaemonSet controller handles pod placement automatically.

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers

Targeting a Subset of Nodes

You don’t always want a pod on every node. Maybe your GPU monitoring agent should only run on nodes with GPUs, or a storage daemon should only run on nodes with SSDs. Use nodeSelector or nodeAffinity inside the pod template to restrict placement. The DaemonSet controller will only schedule pods on nodes that match.

yaml
# Inside spec.template.spec:
nodeSelector:
  node.kubernetes.io/instance-type: gpu

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Tolerations are equally important. Control-plane nodes are typically tainted with node-role.kubernetes.io/control-plane:NoSchedule. If you need your DaemonSet to run on control-plane nodes too (for example, a CNI plugin), you must add a matching toleration. Without it, the DaemonSet simply skips those nodes.
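
For example, a CNI-style DaemonSet that must also run on control-plane nodes would carry a toleration matching that taint:

yaml
# Inside spec.template.spec:
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule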

Update Strategies

DaemonSets support two update strategies that control how pods are replaced when you change the pod template:

| Strategy | Behavior | Best For |
| --- | --- | --- |
| RollingUpdate | Automatically terminates old pods and creates new ones, one node at a time. Respects maxUnavailable (default: 1) and maxSurge (default: 0). | Most workloads — log collectors, monitoring agents, anything that can tolerate brief gaps on individual nodes. |
| OnDelete | Does nothing automatically. New pods are created only when you manually delete existing ones. | Critical infrastructure like CNI plugins or storage daemons where you want full manual control over the rollout sequence. |

yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # update one node at a time
      maxSurge: 0         # don't create a new pod before the old one is terminated

Note

Unlike Deployments, DaemonSets do not use ReplicaSets; their revision history is stored as ControllerRevisions, and kubectl rollout history and kubectl rollout undo work on them. An undo, however, is just another rolling update across every node, so for mission-critical DaemonSets (especially CNI plugins) OnDelete gives you the safety of testing the new version node-by-node before deleting old pods.

Jobs — Run-to-Completion Workloads

A Job creates one or more pods and ensures that a specified number of them successfully terminate. Once the required number of completions is reached, the Job is considered complete. Unlike Deployments, a Job never restarts a pod that exited with status 0 — the work is done.

Jobs are the right tool for batch processing, data migrations, one-off scripts, report generation, and any task that has a clear beginning and end. The controller handles retries on failure, parallelism, deadlines, and cleanup — you declare the desired behavior, and Kubernetes orchestrates it.

Basic Job YAML

This Job runs a database migration. It creates a single pod, allows up to 4 retries on failure, and enforces a 10-minute deadline.

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 4
  activeDeadlineSeconds: 600
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp/migrations:v2.5.0
          command: ["python", "manage.py", "migrate", "--no-input"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url

Key Job Configuration Fields

Jobs expose several fields that control execution, failure handling, and lifecycle. Understanding each one prevents common pitfalls like runaway retry loops or orphaned completed pods consuming resources.

| Field | Default | Description |
| --- | --- | --- |
| completions | 1 | Total number of pods that must succeed. Set to N for batch work that needs N successful completions. |
| parallelism | 1 | Maximum number of pods running concurrently. Set higher than 1 to process items in parallel. |
| backoffLimit | 6 | Number of retries before the Job is marked as failed. Each retry uses exponential backoff (10s, 20s, 40s, … capped at 6 minutes). |
| activeDeadlineSeconds | none | Hard time limit for the entire Job. If the Job is still running after this duration, all pods are terminated and the Job is marked as failed. |
| ttlSecondsAfterFinished | none | Automatically deletes the Job (and its pods) this many seconds after completion. Requires the TTL-after-finished controller (enabled by default since Kubernetes 1.23). |

Tip

Always set ttlSecondsAfterFinished on Jobs. Without it, completed Job objects and their pods stay in the cluster forever, cluttering kubectl get pods output and consuming etcd storage. A value of 3600 (1 hour) gives you time to inspect results before automatic cleanup kicks in.

Parallel Jobs with Completions

When you need to process multiple independent items — say, rendering 50 video chunks — set completions to the total number of items and parallelism to how many can run concurrently. The Job controller creates new pods as existing ones succeed, until all completions are reached.

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: video-render
spec:
  completions: 50
  parallelism: 10
  backoffLimit: 5
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: renderer
          image: myapp/renderer:v1.2
          command: ["./render.sh"]
          env:
            - name: QUEUE_URL
              value: "sqs://render-jobs"

In this pattern, each pod pulls work from an external queue (SQS, Redis, RabbitMQ). The Job doesn’t know what each pod processes — it only ensures 50 total pods succeed and keeps up to 10 running at any time.
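
You can watch the Job's progress and block until it finishes — useful in CI pipelines:

bash
# Live view of completions (e.g., 37/50)
kubectl get job video-render -w

# Block until all 50 completions succeed (or the wait times out)
kubectl wait --for=condition=complete job/video-render --timeout=2h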

Indexed Jobs for Unique Work Assignment

Indexed Jobs (stable since Kubernetes 1.24) assign each pod a unique index from 0 to completions - 1, exposed via the JOB_COMPLETION_INDEX environment variable. This eliminates the need for an external work queue when each unit of work can be identified by a simple integer — processing file shards, database partition ranges, or test suite splits.

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-process
spec:
  completionMode: Indexed
  completions: 8
  parallelism: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: processor
          image: myapp/data-pipeline:v3.0
          command:
            - python
            - process_shard.py
            - --shard-index=$(JOB_COMPLETION_INDEX)
            - --total-shards=8

Pod 0 gets JOB_COMPLETION_INDEX=0, pod 1 gets 1, and so on. If pod 3 fails and is retried, the replacement pod still receives index 3. Retries mean an index can run more than once, so keep the per-index work idempotent to get effectively exactly-once results.

CronJobs — Scheduled Jobs

A CronJob creates a Job on a repeating schedule, using the same cron expression format that Linux administrators have used for decades. It’s the Kubernetes-native replacement for crontab entries, with the added benefit of running inside the cluster where it has access to Kubernetes secrets, volumes, and service networking.

Schedule Syntax

The schedule field uses standard five-field cron syntax. Each field can be a value, range, step, or wildcard:

| Field | Allowed Values | Example | Meaning |
|---|---|---|---|
| Minute | 0–59 | 30 | At minute 30 |
| Hour | 0–23 | */6 | Every 6 hours |
| Day of month | 1–31 | 1,15 | 1st and 15th |
| Month | 1–12 | * | Every month |
| Day of week | 0–6 (Sun=0) | 1-5 | Monday–Friday |

Some common schedules for reference: "0 * * * *" (every hour on the hour), "*/15 * * * *" (every 15 minutes), "0 2 * * *" (daily at 2 AM), "0 0 * * 0" (weekly on Sunday at midnight), and "0 0 1 * *" (first day of every month).

Basic CronJob YAML

This CronJob creates a nightly database backup at 2:30 AM. The jobTemplate section is identical to a standalone Job spec — the CronJob controller simply stamps out a new Job on each trigger.

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "30 2 * * *"
  timeZone: "America/New_York"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: myapp/db-backup:v1.4
              command:
                - /bin/sh
                - -c
                - pg_dump $DATABASE_URL | gzip > /backups/db-$(date +%Y%m%d).sql.gz
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc

CronJob Configuration Fields

Beyond the schedule, CronJobs offer several fields that control overlap behavior, failure tolerance, and history management. Getting these right is important — a misconfigured CronJob can silently skip runs, stack up parallel executions, or leave hundreds of completed Job objects in the cluster.

| Field | Default | Description |
|---|---|---|
| concurrencyPolicy | Allow | Allow: multiple Jobs can run simultaneously. Forbid: skip the new run if the previous one is still active. Replace: terminate the still-running Job and start a new one. |
| startingDeadlineSeconds | none | If the CronJob misses its scheduled time (e.g., controller was down), it still creates the Job as long as the delay is within this window. If unset and more than 100 runs are missed, the CronJob stops scheduling entirely. |
| suspend | false | When set to true, no new Jobs are created on schedule. Already-running Jobs are unaffected. Useful for temporarily pausing a CronJob without deleting it. |
| successfulJobsHistoryLimit | 3 | Number of completed Job objects to retain. Older successful Jobs and their pods are automatically deleted. |
| failedJobsHistoryLimit | 1 | Number of failed Job objects to retain. Keep this higher (e.g., 5) so you can inspect recent failures. |
| timeZone | unset (controller's TZ) | IANA time zone name (e.g., America/New_York). Stable since Kubernetes 1.27. Without it, the schedule runs in the kube-controller-manager's time zone, which is almost always UTC. |

Choosing a Concurrency Policy

The concurrencyPolicy field is the most consequential CronJob setting. The right choice depends entirely on whether your task is safe to run in parallel and what should happen when a run takes longer than the interval between schedules.

  • Allow — Use when each run is independent and overlapping is harmless. Example: sending a periodic health check report where two overlapping reports are fine.
  • Forbid — Use when concurrent runs would conflict or corrupt shared resources. Example: a database backup that acquires an advisory lock — running two in parallel would cause one to fail. Missed schedules are simply skipped.
  • Replace — Use when only the latest run matters and stale runs should be stopped. Example: a cache warm-up job where the newest data supersedes anything the previous run was building.
Warning

Always set startingDeadlineSeconds on CronJobs. If the CronJob controller is unavailable (during a control-plane upgrade, for example) and misses more than 100 consecutive schedules, the CronJob permanently stops scheduling and logs a "Cannot determine if job needs to be started" error. Setting startingDeadlineSeconds prevents this by limiting how far back the controller looks for missed schedules.

Choosing the Right Controller

When you’re modeling a new workload, ask two questions: Does it need to run continuously or to completion? Does it need to run on specific nodes, or wherever the scheduler decides? The answer maps directly to one of the five workload controllers:

| Controller | Lifecycle | Pod Placement | Typical Use Case |
|---|---|---|---|
| Deployment | Run forever | Scheduler decides | Stateless web services, APIs |
| StatefulSet | Run forever | Ordered, stable identity | Databases, distributed systems |
| DaemonSet | Run forever | One per node (or subset) | Node agents, log shippers, CNI |
| Job | Run to completion | Scheduler decides | Migrations, batch processing |
| CronJob | Run to completion, on schedule | Scheduler decides | Backups, reports, cleanup tasks |

Multi-Container Pod Patterns — Init, Sidecar, Ambassador, Adapter

A Pod is not limited to a single container. When you place multiple containers in the same Pod, they share two critical resources: a network namespace (they all reach each other on localhost) and optionally storage volumes (they can read and write the same files). This is the foundation for every multi-container pattern in Kubernetes.

You would never cram unrelated services into one Pod — that defeats the purpose of microservices. Multi-container Pods exist for tightly coupled helpers that genuinely need to share fate and resources with the main application container. Kubernetes formalizes this with four well-established patterns: Init, Sidecar, Ambassador, and Adapter.

flowchart TB
    subgraph pod ["Pod Boundary — shared localhost + volumes"]
        direction TB

        subgraph init ["① Init Containers (sequential, run-to-completion)"]
            I1["init-1: wait-for-db"] --> I2["init-2: run-migrations"]
        end

        subgraph runtime ["② App + Helper Containers (run concurrently)"]
            direction LR
            APP["app container\n:8080"]

            subgraph sidecar ["Sidecar"]
                S["log-shipper\nreads shared volume"]
            end
            subgraph ambassador ["Ambassador"]
                A["db-proxy\nlocalhost:5432"]
            end
            subgraph adapter ["Adapter"]
                D["metrics-adapter\nPrometheus format"]
            end
        end

        init --> runtime
    end

    style pod fill:#1a1a2e,stroke:#6366f1,stroke-width:2px,color:#e2e8f0
    style init fill:#1e293b,stroke:#f59e0b,stroke-width:1px,color:#e2e8f0
    style runtime fill:#1e293b,stroke:#22d3ee,stroke-width:1px,color:#e2e8f0
    style sidecar fill:#0f172a,stroke:#a78bfa,stroke-width:1px,color:#e2e8f0
    style ambassador fill:#0f172a,stroke:#34d399,stroke-width:1px,color:#e2e8f0
    style adapter fill:#0f172a,stroke:#fb923c,stroke-width:1px,color:#e2e8f0
    

Why Containers in the Same Pod?

Containers in the same Pod share a network namespace — they communicate over localhost without any Service or DNS lookup. If your app container listens on port 8080, a sidecar container can reach it at localhost:8080. They also share the same IP address, so external callers see one network identity.

Shared volumes let containers exchange files without network overhead. An app can write log files to an emptyDir volume, and a sidecar can tail those same files and ship them to a logging backend. This cooperation model is what makes the four patterns below so effective.

Init Containers

Init containers run before any app container starts. They execute sequentially — the second init container will not start until the first exits with code 0. If any init container fails, the kubelet retries it according to the Pod's restartPolicy. The Pod stays in Pending state until all init containers succeed.

This makes them perfect for setup tasks that must complete before your application is ready: waiting for a database to accept connections, running schema migrations, cloning a config repository, or downloading ML model weights. The app container is guaranteed that these preconditions are met by the time it starts.

Real-World Example: Wait for a Database, Then Migrate

yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  initContainers:
    # Init container 1: block until PostgreSQL accepts connections
    - name: wait-for-db
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          until nc -z postgres-svc 5432; do
            echo "Waiting for database..."
            sleep 2
          done
          echo "Database is ready"

    # Init container 2: run schema migrations
    - name: run-migrations
      image: myapp/migrator:1.4.0
      command: ["./migrate", "--source", "file:///migrations", "--database", "$(DATABASE_URL)", "up"]
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url

  containers:
    - name: web
      image: myapp/web:2.1.0
      ports:
        - containerPort: 8080
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url

The execution order is strict: wait-for-db loops until PostgreSQL is reachable, then run-migrations applies any pending schema changes, and only then does the web container start. If the migration fails (non-zero exit), the kubelet restarts the init container — the app never starts in a broken state.

Init container retry behavior

Init containers respect the Pod's restartPolicy. With the default Always policy, a failed init container is retried with exponential backoff (10s, 20s, 40s, capped at 5 minutes). If restartPolicy is Never, a failed init container causes the entire Pod to fail permanently. Init containers also re-run if the Pod is restarted — they are not skipped on subsequent starts.
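A quick sketch of that delay schedule, assuming the delay simply doubles between the documented points until it hits the cap:

```bash
# Retry delays for a repeatedly failing init container: double each time,
# clamped at 300 seconds (5 minutes).
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "retry $attempt after ${delay}s"
  delay=$(( delay * 2 ))
  [ "$delay" -gt 300 ] && delay=300
done
# prints 10s, 20s, 40s, 80s, 160s, 300s, 300s
```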

Sidecar Containers

Sidecar containers run alongside the main application container for the entire lifetime of the Pod. They extend the app's capabilities without modifying its code — shipping logs, injecting TLS, reloading configuration files, or proxying traffic through a service mesh like Istio's Envoy.

Historically, sidecars were just regular containers listed in spec.containers. The problem: Kubernetes had no way to distinguish a helper from the primary workload. Sidecars could start after the app, and worse, they could prevent a Job Pod from completing because they never exited.

Native Sidecar Support (Kubernetes 1.28+)

Kubernetes 1.28 introduced native sidecar containers (beta in 1.29, stable in 1.33). The mechanism is elegant: you declare a container in initContainers with restartPolicy: Always. This tells the kubelet to start it before the app containers and keep it running for the Pod's entire lifetime. Native sidecars start in init container order, are guaranteed running before app containers launch, and are terminated after the main containers exit.

Real-World Example: Log Shipper Sidecar

yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-shipper
spec:
  initContainers:
    # Native sidecar: starts before app, stays running, stops after app
    - name: log-shipper
      image: fluent/fluent-bit:3.1
      restartPolicy: Always          # This makes it a native sidecar
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
      env:
        - name: FLUENTBIT_OUTPUT
          value: "elasticsearch"
        - name: ES_HOST
          value: "elasticsearch.logging.svc.cluster.local"

  containers:
    - name: app
      image: myapp/api:3.0.0
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app

  volumes:
    - name: app-logs
      emptyDir: {}

The app container writes structured JSON logs to /var/log/app/. The log-shipper sidecar tails those files and forwards them to Elasticsearch. Because it is declared as a native sidecar (restartPolicy: Always in initContainers), the kubelet guarantees it starts before app and terminates after app exits. No lost log lines on shutdown.

Native sidecars fix the Job completion problem

Before native sidecars, a Job Pod with an Istio Envoy proxy would never complete because the proxy container ran indefinitely. With native sidecars, the kubelet shuts down sidecar containers after the main container exits, so the Pod terminates cleanly. If you run Jobs in a service mesh, native sidecars are essential.

Ambassador Pattern

The Ambassador pattern places a proxy container in the Pod that handles outbound connections on behalf of the app. The app connects to localhost on a well-known port, and the ambassador container handles the complexity of routing, connection pooling, authentication, or protocol translation to the actual remote service.

This cleanly separates connection logic from business logic. The app does not need to know about TLS certificates, connection pool sizes, retry policies, or even which database host it is talking to — it just connects to localhost:5432 and the ambassador handles the rest.

Real-World Example: Cloud SQL Auth Proxy

yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-db-proxy
spec:
  serviceAccountName: cloud-sql-sa
  containers:
    - name: app
      image: myapp/api:3.0.0
      ports:
        - containerPort: 8080
      env:
        # App connects to localhost — no knowledge of Cloud SQL
        - name: DATABASE_HOST
          value: "127.0.0.1"
        - name: DATABASE_PORT
          value: "5432"
        - name: DATABASE_NAME
          value: "myapp_production"

    # Ambassador: proxies localhost:5432 to Cloud SQL over IAM auth
    - name: cloud-sql-proxy
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.13.0
      args:
        - "--structured-logs"
        - "--auto-iam-authn"
        - "--port=5432"
        - "my-project:us-central1:prod-db"
      securityContext:
        runAsNonRoot: true
      resources:
        requests:
          cpu: 50m
          memory: 64Mi

Google's Cloud SQL Auth Proxy is the canonical ambassador example. The app container connects to 127.0.0.1:5432 as if it were a local PostgreSQL server. The cloud-sql-proxy ambassador terminates that connection and establishes an authenticated, encrypted tunnel to the actual Cloud SQL instance. The app needs zero Cloud SQL awareness — no special drivers, no IAM token management, no TLS certificate handling.

Adapter Pattern

The Adapter pattern normalizes or transforms output from the main container into a format that external systems expect. The most common use case is metrics: your application might expose metrics in a proprietary format, and an adapter container converts them into a Prometheus-compatible /metrics endpoint so your monitoring stack can scrape them uniformly.

Like the Ambassador, the Adapter runs alongside the app and communicates over localhost or shared volumes. The difference is directional: Ambassadors proxy outbound traffic, while Adapters transform output data.

Real-World Example: Redis Metrics Adapter for Prometheus

yaml
apiVersion: v1
kind: Pod
metadata:
  name: redis-with-exporter
  labels:
    app: redis
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9121"
spec:
  containers:
    - name: redis
      image: redis:7.2-alpine
      ports:
        - containerPort: 6379

    # Adapter: reads Redis INFO command, exposes Prometheus metrics
    - name: redis-exporter
      image: oliver006/redis_exporter:v1.62.0
      ports:
        - containerPort: 9121
      env:
        - name: REDIS_ADDR
          value: "localhost:6379"
      resources:
        requests:
          cpu: 25m
          memory: 32Mi

Redis does not natively expose Prometheus metrics. The redis-exporter adapter container connects to Redis on localhost:6379, runs the INFO command, and translates the output into Prometheus exposition format on port 9121. Prometheus scrapes the adapter, not Redis directly. The same pattern works for MySQL, PostgreSQL, NGINX, and dozens of other services that have community-maintained exporters.

Choosing the Right Pattern

| Pattern | Runs | Direction | Typical Use Cases |
|---|---|---|---|
| Init Container | Before app, sequentially | — | DB readiness checks, schema migrations, config cloning, permission setup |
| Sidecar | Alongside app, full lifetime | Varies | Log shippers, service mesh proxies (Envoy), config reloaders, TLS termination |
| Ambassador | Alongside app, full lifetime | Outbound proxy | Database proxies (Cloud SQL, PgBouncer), API gateway sidecars, connection pooling |
| Adapter | Alongside app, full lifetime | Output transformation | Prometheus exporters, log format converters, protocol translators |
Don't overuse multi-container Pods

Every container you add to a Pod shares its lifecycle — if the Pod is evicted or rescheduled, all containers move together. Sidecars also consume resources counted against the Pod's requests and limits. If a helper container does not need localhost access or shared volumes with your app, deploy it as a separate Pod behind a Service instead. The Ambassador and Adapter patterns are only justified when the tight coupling genuinely simplifies your architecture.

The Kubernetes Networking Model and CNI Plugins

Networking is arguably the most complex piece of a Kubernetes cluster, yet it rests on a deceptively simple foundation. Before a single packet flows, Kubernetes mandates three non-negotiable rules about how Pods, nodes, and the network interact. Understanding these rules — and the pluggable system that implements them — is essential for debugging connectivity issues, choosing the right network plugin, and reasoning about performance.

The Three Fundamental Networking Rules

The Kubernetes networking model is defined by three invariants. These are not suggestions — every conformant cluster implementation must satisfy all three. They are spelled out in the official Kubernetes documentation and form the contract that every CNI plugin must uphold.

  1. Every Pod gets its own IP address. Each Pod is assigned a unique, cluster-routable IP from the Pod CIDR range. Containers within the same Pod share this IP and communicate over localhost.
  2. Pods can communicate with all other Pods without NAT. Any Pod can reach any other Pod in the cluster using the destination Pod's IP directly. Source and destination addresses are never translated in transit.
  3. Nodes can communicate with all Pods (and vice versa) without NAT. A process running on a node can reach any Pod IP directly, and Pods can reach node IPs without address translation.

The result is a flat network — every Pod and node exists in a single, shared address space. There is no port-mapping layer between Pods, no manual link configuration, and no NAT rewriting packet headers. If Pod A knows Pod B's IP, it can just send a packet there.

Note

Kubernetes deliberately does not specify how these rules are implemented — only that they must hold. This is why the networking layer is pluggable. The implementation is delegated to a CNI plugin, which can use overlay networks, BGP routing, eBPF, or anything else that satisfies the contract.

How This Differs from Docker's Default Networking

If you are coming from Docker, this model will feel unfamiliar. Docker's default bridge networking takes a fundamentally different approach: containers on a single host share a private bridge network (typically 172.17.0.0/16), and communication with the outside world requires explicit port mapping via -p flags. This means two containers on different hosts cannot talk to each other by default — they need port forwarding, link aliases, or an overlay network like Docker Swarm's.

| Property | Docker Bridge (Default) | Kubernetes Flat Network |
|---|---|---|
| IP scope | Private to each host | Cluster-wide, routable |
| Cross-host communication | Requires port mapping or overlay setup | Works out of the box via Pod IPs |
| NAT involved | Yes — SNAT/DNAT for external traffic | No NAT between Pods or between Pods and nodes |
| Port conflicts | Mapped ports must be unique per host | Each Pod has its own IP — no port conflicts |
| Service discovery | Manual or via Docker DNS (Compose only) | Built-in DNS via CoreDNS + Services |

Kubernetes chose the flat model because it eliminates an entire class of problems. Port conflicts disappear. Applications do not need to know whether they are running in a container. Network policies can reason about Pod IPs as stable identities rather than chasing ephemeral port mappings.

How Pods Get Their IPs — Under the Hood

When the kubelet on a node needs to start a new Pod, the sequence works like this: the container runtime creates the Pod's network namespace (an isolated Linux network stack), then calls the configured CNI plugin. The plugin assigns an IP address from the node's allocated Pod CIDR subnet, creates a virtual ethernet (veth) pair, connects one end to the Pod's namespace and the other to the host network, and sets up routes so traffic can flow.

You can verify a Pod's IP and network namespace from the node itself:

bash
# Get the IP assigned to a Pod
kubectl get pod my-app -o wide
# NAME     READY   STATUS    IP            NODE
# my-app   1/1     Running   10.244.1.23   worker-01

# From worker-01, inspect the veth pair
ip link show type veth
# You'll see interfaces like cali* (Calico), lxc* (Cilium), or vethXXXX

# Trace the route to a Pod on another node
ip route get 10.244.2.45
# 10.244.2.45 via 192.168.1.12 dev eth0  (routed to worker-02)

Pod-to-Pod Communication Across Nodes

When Pod A on Node 1 sends a packet to Pod B on Node 2, the packet travels through several layers. It exits Pod A's network namespace via the veth pair, hits the host network stack on Node 1, gets routed to Node 2 (via overlay encapsulation or direct routing depending on the CNI plugin), enters Node 2's host network stack, and is finally delivered into Pod B's namespace.

flowchart LR
    subgraph Node1["Node 1 (192.168.1.11)"]
        direction TB
        PodA["Pod A\n10.244.1.23"]
        veth1["veth pair"]
        Host1["Host Network Stack\n+ routing table"]
        PodA --- veth1 --- Host1
    end

    subgraph Node2["Node 2 (192.168.1.12)"]
        direction TB
        Host2["Host Network Stack\n+ routing table"]
        veth2["veth pair"]
        PodB["Pod B\n10.244.2.45"]
        Host2 --- veth2 --- PodB
    end

    Host1 -- "Overlay (VXLAN/GENEVE)\nor BGP direct route" --> Host2

    style PodA fill:#4a9eff,stroke:#2d7cd4,color:#fff
    style PodB fill:#4a9eff,stroke:#2d7cd4,color:#fff
    style Host1 fill:#f0f4f8,stroke:#9aa5b4
    style Host2 fill:#f0f4f8,stroke:#9aa5b4
    style veth1 fill:#ffd43b,stroke:#e6b800,color:#333
    style veth2 fill:#ffd43b,stroke:#e6b800,color:#333
        

Pod-to-Pod communication across nodes. The CNI plugin determines whether traffic is encapsulated in an overlay tunnel or routed directly via BGP.

The critical detail is that Pod A's source IP is preserved — Pod B sees the real IP of Pod A, not a translated address. This is what makes network policies, access logs, and distributed tracing work correctly.

CNI — The Container Network Interface

CNI is a specification, not a piece of software. It defines a minimal contract between a container runtime and a network plugin: "here's a network namespace, set it up" and "here's a network namespace, tear it down." The spec was originally developed by CoreOS and is now maintained by the CNCF. Kubernetes uses it, and so do other container tools such as Podman and CRI-O.

A CNI plugin is just a binary that the container runtime executes. The runtime passes information through environment variables and stdin (a JSON configuration), and the plugin responds by configuring the network namespace and returning the result (including the assigned IP) as JSON on stdout.
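To make the contract concrete, here is a toy plugin sketch (purely illustrative; it configures nothing) showing the shape of the exchange: the operation arrives in CNI_COMMAND, the JSON config on stdin, and the result leaves on stdout:

```bash
#!/bin/sh
# Toy "CNI plugin" demonstrating only the calling convention:
# operation in $CNI_COMMAND, JSON config on stdin, JSON result on stdout.
toy_cni() {
  config=$(cat)     # a real plugin parses this network config JSON
  case "$CNI_COMMAND" in
    ADD)
      # A real plugin would create a veth pair in $CNI_NETNS and assign an
      # IP from IPAM; this one returns a hard-coded, illustrative result.
      echo '{"cniVersion":"1.0.0","ips":[{"address":"10.244.1.23/24"}]}'
      ;;
    DEL)
      :             # teardown: success is simply exit code 0, no output
      ;;
  esac
}

# Simulate the runtime invoking ADD for a new Pod sandbox
RESULT=$(CNI_COMMAND=ADD toy_cni <<'EOF'
{ "cniVersion": "1.0.0", "name": "k8s-pod-network", "type": "toy" }
EOF
)
echo "$RESULT"
```

A real plugin such as calico does the same dance, just with actual interface and route configuration between reading stdin and writing its result.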

Plugin Configuration

CNI plugins are configured via two things on each node:

  • Plugin binaries in /opt/cni/bin/ — the actual executables (e.g., calico, flannel, bridge, loopback).
  • Configuration files in /etc/cni/net.d/ — JSON or conflist files that tell the runtime which plugin to invoke and with what parameters.

The container runtime (containerd, CRI-O) reads the configuration directory, picks the first file alphabetically, and uses it for all Pod network setup. Here is a typical configuration file:

json
{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "type": "calico",
  "ipam": {
    "type": "calico-ipam"
  },
  "policy": {
    "type": "k8s"
  },
  "log_level": "info"
}

The type field maps directly to a binary name in /opt/cni/bin/. The ipam block configures IP address management — how the plugin allocates and tracks Pod IPs. Many plugins include their own IPAM, but you can also use standalone IPAM plugins like host-local or whereabouts.
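For comparison, here is what a configuration pairing the generic bridge plugin with standalone host-local IPAM might look like (the subnet and values are illustrative, not taken from a real cluster):

```json
{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "type": "bridge",
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```

Here host-local hands out addresses from the node's own 10.244.1.0/24 range and records its allocations on local disk, which is exactly the "assign and track Pod IPs" job the ipam block delegates.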

The Plugin Lifecycle: ADD, DEL, CHECK

The CNI spec defines three operations that a runtime can invoke on a plugin. These map directly to the lifecycle of a Pod's network namespace:

| Operation | When It Runs | What It Does |
|---|---|---|
| ADD | Pod is being created | Configures the network namespace: creates interfaces, assigns IP, sets up routes. Returns the assigned IP and other details as JSON. |
| DEL | Pod is being destroyed | Tears down the network configuration: removes interfaces, releases the IP back to the pool, cleans up routes. |
| CHECK | Health verification (periodic) | Validates that the network setup is still correct. Returns an error if something is wrong (e.g., interface missing, IP conflict). Optional — not all plugins implement it. |

You can inspect what happens during these operations by looking at kubelet logs on a node when a Pod is scheduled:

bash
# Watch CNI activity in kubelet logs
journalctl -u kubelet -f | grep -i cni

# List installed CNI plugins on a node
ls /opt/cni/bin/
# bandwidth  bridge  calico  calico-ipam  flannel  host-local  loopback  portmap

# View active CNI configuration
cat /etc/cni/net.d/10-calico.conflist | jq '.plugins[].type'
# "calico"
# "bandwidth"
# "portmap"

Comparing Popular CNI Plugins

The CNI plugin you choose has a direct impact on performance, features, operational complexity, and what network policies you can enforce. There is no single "best" plugin — the right choice depends on your cluster size, performance requirements, and whether you need advanced features like encryption or deep observability.

| Plugin | Dataplane | Network Policies | Encryption | Best For |
|---|---|---|---|---|
| Calico | BGP (native routing) or VXLAN overlay | Full Kubernetes + extended Calico policies | WireGuard | Production clusters needing strong policy support and flexibility |
| Cilium | eBPF (kernel-level) | Kubernetes + L7-aware policies (HTTP, gRPC, Kafka) | WireGuard / IPsec | High-performance clusters, deep observability, service mesh replacement |
| Flannel | VXLAN overlay (default), host-gw | None built-in (pair with Calico for policies) | None | Simple clusters, learning environments, minimal overhead |
| Weave Net | VXLAN overlay with mesh routing | Basic Kubernetes network policies | IPsec (fast datapath) / NaCl (sleeve) | Small clusters with easy setup and built-in encryption |

Calico

Calico is the most widely deployed CNI plugin in production Kubernetes clusters. In its default mode, it uses BGP (Border Gateway Protocol) to distribute Pod routes directly across nodes — no encapsulation overhead, no tunnel interfaces. Each node acts as a BGP peer and announces its Pod CIDR to the rest of the cluster. This approach gives near bare-metal networking performance.

When BGP is not feasible (for example, in cloud VPCs that block BGP), Calico falls back to VXLAN overlay mode. Calico also includes the most mature network policy implementation in the ecosystem, supporting both standard Kubernetes NetworkPolicy resources and its own extended GlobalNetworkPolicy CRD for cluster-wide rules.

bash
# Install Calico (operator-based)
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/custom-resources.yaml

# Verify Calico Pods are running
kubectl get pods -n calico-system
# NAME                                       READY   STATUS    RESTARTS
# calico-kube-controllers-7c5f8db89c-x2g4l   1/1     Running   0
# calico-node-abcde                           1/1     Running   0
# calico-typha-6f8b5c9d4f-k8m2n              1/1     Running   0

# Check BGP peering status
sudo calicoctl node status
# IPv4 BGP status
# +--------------+-------------------+-------+----------+-------------+
# | PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
# +--------------+-------------------+-------+----------+-------------+
# | 192.168.1.12 | node-to-node mesh | up    | 10:23:45 | Established |
# +--------------+-------------------+-------+----------+-------------+

Cilium

Cilium takes a radically different approach by moving networking logic into the Linux kernel using eBPF (extended Berkeley Packet Filter). Instead of configuring iptables rules (which become a bottleneck at scale), Cilium attaches eBPF programs directly to network interfaces. This results in lower latency, higher throughput, and — crucially — L7 visibility into application-layer protocols.

Cilium can enforce network policies that understand HTTP methods, gRPC services, Kafka topics, and DNS queries — not just IP addresses and ports. It also includes Hubble, an observability platform that gives you a real-time network traffic flow map of your entire cluster.

bash
# Install Cilium via Helm
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.15.0 \
  --namespace kube-system \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

# Check Cilium status
cilium status
# Cilium health daemon:  Ok
# IPAM:                  IPv4: 12/254 allocated
# BandwidthManager:      Disabled
# Encryption:            Disabled

# Observe live traffic flows with Hubble
hubble observe --namespace default --protocol HTTP
# TIMESTAMP             SOURCE             DESTINATION          TYPE     VERDICT
# Jan 15 10:00:01.234   default/frontend   default/api-server   L7/HTTP  FORWARDED
#                        GET /api/v1/users => 200

Flannel

Flannel is the simplest CNI plugin and often the first one people encounter. It creates a VXLAN overlay network by default: each node gets a subnet from the cluster's Pod CIDR, and traffic between nodes is encapsulated in VXLAN packets. There is no support for network policies — if you need policies with Flannel, you typically pair it with Calico in a configuration called "Canal."

Flannel's simplicity is both its strength and its limitation. It is ideal for development clusters, CI/CD environments, and learning Kubernetes. For production workloads where you need policy enforcement or performance optimization, you will outgrow it.

bash
# Install Flannel
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Verify — Flannel runs as a DaemonSet on every node
kubectl get pods -n kube-flannel
# NAME                    READY   STATUS    RESTARTS
# kube-flannel-ds-abc12   1/1     Running   0
# kube-flannel-ds-def34   1/1     Running   0

# Inspect the VXLAN interface Flannel creates
ip -d link show flannel.1
# flannel.1: <BROADCAST,MULTICAST,UP> mtu 1450 ...
#     vxlan id 1 ... dstport 8472

Weave Net

Weave Net builds a mesh overlay network between nodes using VXLAN (fast datapath) or a user-space "sleeve" mode that can traverse firewalls and NATs. Its standout feature is zero-configuration encryption — enable it with a single password and all inter-node traffic is encrypted (IPsec ESP in fast datapath mode, NaCl in sleeve mode). Weave also supports basic Kubernetes network policies.

Weave is workable for small- to medium-sized clusters where ease of setup and built-in encryption matter more than raw performance. Be aware, however, that active development has largely ceased since Weaveworks shut down in early 2024, so for anything long-lived, Calico or Cilium are safer choices.

Choosing a CNI plugin

Starting out or running a dev cluster? Use Flannel — it works and stays out of your way. Need network policies and production reliability? Calico is the safest bet with the largest user base. Want cutting-edge performance, L7 policies, and deep observability? Cilium is the direction the ecosystem is heading — it is the default CNI in GKE Dataplane V2 and a graduated CNCF project.

Debugging CNI Issues

When a Pod is stuck in ContainerCreating and the events show a CNI error, the problem is almost always in one of three places: the CNI binary is missing, the configuration file is malformed, or the IPAM pool is exhausted. Here is a quick diagnostic checklist:

bash
# 1. Check if the CNI config exists
ls /etc/cni/net.d/
# If empty, no CNI plugin is installed — Pods will stay in ContainerCreating

# 2. Check if the CNI binary exists
ls /opt/cni/bin/ | grep calico
# If the type in your config doesn't match a binary here, ADD will fail

# 3. Look at kubelet logs for the specific error
journalctl -u kubelet --since "5 min ago" | grep -i "cni\|network"

# 4. Check Pod events for the error message
kubectl describe pod stuck-pod | tail -20
# Warning  FailedCreatePodSandBox  kubelet  ...
# failed to set up sandbox container network:
# plugin type="calico" failed: ...

# 5. Verify the CNI plugin DaemonSet is healthy
kubectl get ds -n calico-system   # or kube-flannel, kube-system, etc.

Watch out for IPAM exhaustion

If your Pod CIDR is too small for the number of Pods you're running, the IPAM plugin will run out of IPs and new Pods on the affected node will fail to start. Check your cluster's --cluster-cidr and each node's podCIDR allocation with kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'. A /24 per node gives you 254 Pod IPs — enough for most workloads, but tight on large nodes that pack hundreds of small Pods. (Sidecar containers do not consume extra IPs; all containers in a Pod share the Pod's single IP.)
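The per-node capacity figures are simple arithmetic — each subnet loses its network and broadcast addresses:

```shell
# Usable Pod IPs for a given per-node podCIDR prefix length.
for prefix in 24 25 26; do
  usable=$(( (1 << (32 - prefix)) - 2 ))
  echo "/$prefix -> $usable Pod IPs per node"
done
# /24 -> 254 Pod IPs per node
# /25 -> 126 Pod IPs per node
# /26 -> 62 Pod IPs per node
```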

Services — ClusterIP, NodePort, and LoadBalancer

Every Pod in Kubernetes gets its own IP address, but that IP is ephemeral. When a Pod is killed and replaced — whether by a Deployment rollout, a node failure, or a scaling event — the new Pod receives a completely different IP. Any client that was using the old IP now has a broken connection. This is the fundamental problem that Services solve.

A Service provides a stable virtual IP address (called a ClusterIP) and a DNS name that remain constant for the lifetime of the Service object. Behind the scenes, the Service uses label selectors to track which Pods should receive traffic, and kube-proxy programs networking rules on every node to route packets to healthy backends. The result is a reliable, load-balanced endpoint that decouples clients from the volatile lifecycle of Pods.

The Four Service Types

Kubernetes offers four Service types — ClusterIP, NodePort, LoadBalancer, and ExternalName — each building on the one before it. ClusterIP is the default. NodePort adds external access via a static port. LoadBalancer adds a cloud-managed LB on top of NodePort. ExternalName is a special case that creates a DNS CNAME alias.

ClusterIP — The Default Service Type

When you create a Service without specifying a type, you get a ClusterIP service. Kubernetes allocates a virtual IP from the cluster’s service CIDR range (configured at cluster creation, e.g., 10.96.0.0/12). This IP is not bound to any network interface or node — it exists only in the cluster’s networking rules. Pods and other Services within the cluster can reach it, but nothing outside the cluster can.

The virtual IP acts as a stable front door. When a packet is sent to the ClusterIP on a given port, the kernel rules that kube-proxy has programmed rewrite its destination to one of the backing Pods. Which Pod receives the packet depends on the proxy mode in use.

yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: ecommerce
spec:
  type: ClusterIP          # default — can be omitted
  selector:
    app: order-api
    tier: backend
  ports:
    - name: http
      protocol: TCP
      port: 80              # port exposed on the ClusterIP
      targetPort: 8080      # port the Pods are listening on

The selector field is what links the Service to its Pods. Kubernetes continuously watches for Pods matching the labels app: order-api and tier: backend, and adds their IPs to the Service’s endpoint list. When a Pod becomes unready or is deleted, it is removed automatically.

How kube-proxy Makes It Work

kube-proxy runs on every node and watches the API server for Service and Endpoint changes. When it detects an update, it programs the node’s networking layer to perform the actual packet rewriting. There are three proxy modes, with iptables being the most common.

| Proxy Mode | Mechanism | Load Balancing | Performance |
| --- | --- | --- | --- |
| iptables (default) | Inserts NAT rules into the kernel’s netfilter tables | Random selection via --probability chains | Good for up to ~5,000 Services; O(n) rule evaluation |
| ipvs | Programs the kernel’s IPVS virtual server table | Round-robin, least-connections, source-hash, and more | O(1) lookup via hash tables; scales to 10,000+ Services |
| nftables | Uses nftables (successor to iptables) with native maps | Random with nftables probability | O(1) lookup; alpha in Kubernetes v1.29, beta since v1.31 |

In iptables mode, kube-proxy creates a chain of rules for each Service. A packet destined for the ClusterIP is DNAT’d (destination NAT) to a randomly selected Pod IP. Return traffic is automatically reverse-NAT’d via conntrack, so the client sees responses from the ClusterIP — not the Pod IP. In IPVS mode, the same concept applies but the kernel’s built-in load balancer handles the forwarding with better performance at scale.
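The random selection in iptables mode is built from sequential statistic-module rules. This sketch (not kube-proxy's actual rule generator) prints the probabilities that give N backends an equal 1/N share: rule i matches 1/(N-i) of the packets that reach it, and the last rule catches everything left.

```shell
# Probabilities kube-proxy-style iptables rules use to spread
# traffic evenly across N backends.
N=3
for i in $(seq 0 $((N - 1))); do
  p=$(awk -v n="$N" -v i="$i" 'BEGIN { printf "%.5f", 1 / (n - i) }')
  echo "backend $i: --mode random --probability $p"
done
# backend 0: --mode random --probability 0.33333
# backend 1: --mode random --probability 0.50000
# backend 2: --mode random --probability 1.00000
```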

NodePort — Exposing Services Outside the Cluster

A NodePort service builds on top of ClusterIP. Kubernetes still allocates a virtual ClusterIP, but additionally opens a static port — the NodePort — on every node in the cluster. Any traffic arriving at <NodeIP>:<NodePort> is forwarded to the Service, which then load-balances it to a backing Pod. The NodePort range is 30000–32767 by default (configurable via the API server’s --service-node-port-range flag).

This means external clients can reach the service by hitting any node’s IP on the allocated port. The node does not need to be running the backing Pod — kube-proxy on every node has rules to forward NodePort traffic to the correct Pod, even if it is on a different node.

yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service-nodeport
spec:
  type: NodePort
  selector:
    app: order-api
  ports:
    - name: http
      protocol: TCP
      port: 80              # ClusterIP port (internal)
      targetPort: 8080      # Pod port
      nodePort: 30080       # static port on every node (optional — auto-assigned if omitted)

With this Service, traffic flows through three layers: the external client connects to 192.168.1.10:30080 (any node IP), kube-proxy forwards it to the ClusterIP 10.96.x.x:80, and then DNAT sends it to a Pod on port 8080. If you omit the nodePort field, Kubernetes auto-assigns one from the available range.

LoadBalancer — Cloud-Integrated External Access

The LoadBalancer type extends NodePort by instructing the cloud provider’s controller to provision an external load balancer (an AWS NLB/ALB, GCP TCP LB, Azure LB, etc.). The cloud LB receives a public or internal IP and forwards traffic to the NodePorts on your cluster nodes. Kubernetes automatically configures health checks and backend pools.

This is the simplest way to expose a Service to the internet in a cloud environment. However, each LoadBalancer Service gets its own cloud LB — which means its own IP address and its own billing line item. For clusters with many externally-facing services, an Ingress controller (covered in the next section) is more cost-effective.

yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service-lb
  annotations:
    # AWS-specific: request an NLB instead of a Classic LB
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: order-api
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8080
  # Optional: restrict source IPs allowed through the LB
  loadBalancerSourceRanges:
    - 203.0.113.0/24

After creation, the status.loadBalancer.ingress field is populated with the external IP or hostname assigned by the cloud provider. This can take 30 seconds to a few minutes depending on the cloud. Use kubectl get svc order-service-lb -w to watch for it.

ExternalName — DNS Alias for External Services

ExternalName is fundamentally different from the other three types. It does not create a ClusterIP, does not configure kube-proxy rules, and does not proxy any traffic. Instead, it creates a DNS CNAME record in the cluster’s DNS (CoreDNS) that maps the Service name to an external hostname.

yaml
apiVersion: v1
kind: Service
metadata:
  name: legacy-payments
  namespace: ecommerce
spec:
  type: ExternalName
  externalName: payments.legacy-datacenter.example.com

When a Pod resolves legacy-payments.ecommerce.svc.cluster.local, CoreDNS returns a CNAME to payments.legacy-datacenter.example.com. This is useful during migrations — your application code references a Kubernetes Service name, and you can later swap the ExternalName for a ClusterIP Service pointing to in-cluster Pods without changing application configuration.
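Once the workload moves in-cluster, a drop-in replacement might look like this (a sketch — the app: payments label and ports are hypothetical). Because the Service name is unchanged, clients keep resolving legacy-payments with no configuration change:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: legacy-payments        # same name, so the DNS name stays stable
  namespace: ecommerce
spec:
  type: ClusterIP
  selector:
    app: payments              # hypothetical label on the migrated Pods
  ports:
    - port: 443
      targetPort: 8443
```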

Traffic Flow Through Each Service Type

The following diagram shows how traffic traverses the networking layers for each Service type. Notice how each type builds on the previous one — LoadBalancer wraps NodePort, which wraps ClusterIP.

flowchart LR
    subgraph External
        Client(["External Client"])
        CloudLB(["Cloud Load Balancer"])
    end

    subgraph Cluster ["Kubernetes Cluster"]
        subgraph Node1 ["Node 1 — 192.168.1.10"]
            KP1["kube-proxy
iptables / IPVS rules"]
            Pod1(["Pod A
10.244.1.5:8080"])
        end
        subgraph Node2 ["Node 2 — 192.168.1.11"]
            KP2["kube-proxy
iptables / IPVS rules"]
            Pod2(["Pod B
10.244.2.8:8080"])
        end
        SvcIP["ClusterIP
10.96.47.12:80"]
    end

    Client -- "1 LoadBalancer
external-ip:80" --> CloudLB
    CloudLB -- "2 NodePort
any-node:30080" --> KP1
    KP1 -- "3 ClusterIP
10.96.47.12:80" --> SvcIP
    SvcIP -. "DNAT to Pod" .-> Pod1
    SvcIP -. "DNAT to Pod" .-> Pod2
    KP2 -- "3 ClusterIP" --> SvcIP

    InternalPod(["In-Cluster Pod"]) -- "ClusterIP only
10.96.47.12:80" --> SvcIP

    style SvcIP fill:#4a90d9,color:#fff,stroke:#2a6cb8
    style CloudLB fill:#f5a623,color:#fff,stroke:#d4891a
    style Pod1 fill:#7ed321,color:#fff,stroke:#5ea318
    style Pod2 fill:#7ed321,color:#fff,stroke:#5ea318
    

Endpoints and EndpointSlices

When you create a Service with a selector, Kubernetes automatically creates a companion Endpoints object with the same name. This object contains a flat list of IP:port pairs for every Pod that matches the selector and has passed its readiness probe. kube-proxy watches Endpoints objects to know where to send traffic.

bash
# View the Endpoints for a Service
kubectl get endpoints order-service -n ecommerce
# NAME            ENDPOINTS                                AGE
# order-service   10.244.1.5:8080,10.244.2.8:8080          4m

# Detailed view shows ready and not-ready addresses
kubectl describe endpoints order-service -n ecommerce

The problem with Endpoints is scalability. A single Endpoints object stores every backend IP in one resource. For Services with thousands of Pods, any single Pod change triggers an update to the entire Endpoints object, which must be transmitted to every node running kube-proxy. This creates a quadratic scaling problem.

EndpointSlices (stable since Kubernetes v1.21) fix this by splitting the backend list into multiple slices, each holding up to 100 endpoints by default. When a Pod changes, only the affected EndpointSlice is updated and propagated. This dramatically reduces API server load and network bandwidth in large clusters.

bash
# List EndpointSlices for a Service
kubectl get endpointslices -l kubernetes.io/service-name=order-service -n ecommerce

# Inspect a specific EndpointSlice
kubectl describe endpointslice order-service-abc12 -n ecommerce
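The propagation savings are easy to quantify. With the default cap of 100 endpoints per slice (configurable via kube-controller-manager's --max-endpoints-per-slice flag), the number of slices — and therefore the blast radius of a single Pod change — is a ceiling division:

```shell
# EndpointSlices needed for a Service at the default cap of 100
# endpoints per slice (ceiling division).
max=100
for endpoints in 80 100 250 1000; do
  slices=$(( (endpoints + max - 1) / max ))
  echo "$endpoints endpoints -> $slices slice(s)"
done
# 80 endpoints -> 1 slice(s)
# 100 endpoints -> 1 slice(s)
# 250 endpoints -> 3 slice(s)
# 1000 endpoints -> 10 slice(s)
```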

Headless Services

Sometimes you do not want Kubernetes to load-balance for you. You want the actual Pod IPs — for example, when running a database cluster where each node has a distinct identity, or when implementing client-side load balancing with gRPC. A headless Service is created by setting clusterIP: None.

With a headless Service, Kubernetes does not allocate a virtual IP and kube-proxy does not create any forwarding rules. Instead, a DNS lookup for the Service name returns A records for each individual Pod IP. If the Service is combined with a StatefulSet, each Pod also gets a stable DNS hostname like pod-0.my-service.namespace.svc.cluster.local.

yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
spec:
  type: ClusterIP
  clusterIP: None            # makes it headless
  selector:
    app: postgres
  ports:
    - name: tcp-postgres
      port: 5432
      targetPort: 5432

bash
# DNS lookup returns individual Pod IPs (no ClusterIP)
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup postgres-headless.default.svc.cluster.local
# Server:    10.96.0.10
# Name:      postgres-headless.default.svc.cluster.local
# Address 1: 10.244.1.12
# Address 2: 10.244.2.19
# Address 3: 10.244.3.7

Session Affinity

By default, kube-proxy distributes traffic to backends with no stickiness — each new connection may land on a different Pod. If your application requires that all requests from the same client go to the same Pod (e.g., for in-memory session state), you can enable session affinity.

Kubernetes supports one type of session affinity: ClientIP. When enabled, kube-proxy creates affinity rules based on the client’s source IP address. All connections from the same IP are routed to the same Pod for a configurable timeout (default: 10,800 seconds / 3 hours).
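A toy sketch of the idea, not kube-proxy's actual mechanism (kube-proxy tracks affinity state per client and expires it after the configured timeout): any deterministic mapping from source IP to backend sends repeat connections to the same Pod.

```shell
# Toy ClientIP affinity: hash the source IP to pick a stable backend.
backends=3
pick() { printf '%s' "$1" | cksum | awk -v n="$backends" '{ print $1 % n }'; }
echo "203.0.113.7  -> backend $(pick 203.0.113.7)"
echo "203.0.113.7  -> backend $(pick 203.0.113.7)"   # same client, same backend
echo "198.51.100.9 -> backend $(pick 198.51.100.9)"
```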

yaml
apiVersion: v1
kind: Service
metadata:
  name: session-app
spec:
  selector:
    app: web-frontend
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600    # 1-hour sticky sessions
  ports:
    - port: 80
      targetPort: 3000

Prefer external session stores

Session affinity is a pragmatic escape hatch, not a best practice. It reduces the effectiveness of load balancing and breaks when Pods are rescheduled. For production workloads, store session state in Redis or a database, and keep your Pods stateless. Save ClientIP affinity for legacy apps that cannot be easily refactored.

Internal and External Traffic Policies

By default, when traffic arrives at a node, kube-proxy may forward it to a Pod on any node in the cluster. This adds an extra network hop and obscures the client’s source IP (because the packet is SNAT’d by the forwarding node). Two fields let you control this behavior.

externalTrafficPolicy

For NodePort and LoadBalancer Services, setting externalTrafficPolicy: Local tells kube-proxy to only forward to Pods running on the same node that received the traffic. This preserves the client’s source IP and eliminates the extra hop, but it means nodes without matching Pods will fail health checks and receive no traffic from the load balancer. You must ensure your Pods are reasonably spread across nodes.

internalTrafficPolicy

Similarly, internalTrafficPolicy: Local (available since Kubernetes v1.26) restricts in-cluster traffic to Pods on the same node as the client. This is useful for node-local caches or logging agents where you want each Pod to talk to the agent on its own node.

yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service-local
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # preserve source IP, avoid extra hops
  internalTrafficPolicy: Cluster  # default — route to any node
  selector:
    app: order-api
  ports:
    - port: 80
      targetPort: 8080

Service Type Comparison

| Feature | ClusterIP | NodePort | LoadBalancer | ExternalName |
| --- | --- | --- | --- | --- |
| Virtual IP allocated | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Accessible from inside cluster | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Via DNS |
| Accessible from outside cluster | ❌ No | ✅ Via NodeIP:port | ✅ Via external IP | N/A |
| Requires cloud provider | No | No | Yes | No |
| Port range | Any | 30000–32767 | Any (LB frontend) | N/A |
| Typical use case | Internal microservices | Dev/test, bare-metal | Production external access | External DB, SaaS aliases |
| kube-proxy rules | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |

Essential kubectl Commands for Services

bash
# Create a ClusterIP service imperatively
kubectl expose deployment order-api --port=80 --target-port=8080 --name=order-service

# Create a NodePort service
kubectl expose deployment order-api --type=NodePort --port=80 --target-port=8080

# List all Services with wide output (shows ClusterIP + external IP)
kubectl get svc -o wide

# Watch for LoadBalancer external IP assignment
kubectl get svc order-service-lb -w

# Debug: check which Pods are behind a Service
kubectl get endpoints order-service
kubectl get endpointslices -l kubernetes.io/service-name=order-service

# Debug: verify kube-proxy iptables rules for a ClusterIP
sudo iptables -t nat -L KUBE-SERVICES -n | grep order-service

# Temporary local access: port-forward through a Service
kubectl port-forward svc/order-service 8080:80

# DNS resolution test from inside the cluster
kubectl run dns-debug --rm -it --image=busybox:1.36 --restart=Never -- nslookup order-service.ecommerce.svc.cluster.local

NodePort is not production-ready on its own

Exposing a NodePort directly to the internet requires clients to know individual node IPs, provides no SSL termination, and offers no health-check-based routing. In production, always front NodePort with a load balancer — whether cloud-managed (LoadBalancer type) or self-hosted (e.g., MetalLB for bare-metal clusters). Use NodePort alone only for development, debugging, or tightly controlled internal access.

Ingress, Ingress Controllers, and the Gateway API

In the previous section, you saw how Services expose Pods inside and outside the cluster via ClusterIP, NodePort, and LoadBalancer. These abstractions work well for raw TCP/UDP connectivity — but they fall apart the moment you need HTTP-aware routing. A LoadBalancer Service gives you a single external IP mapped to a single backend. If you run 20 microservices, you get 20 cloud load balancers, 20 public IPs, and 20 separate bills.

Real-world HTTP traffic demands more: routing requests to different backends based on the hostname (api.example.com vs. app.example.com) or URL path (/api/v1 vs. /static), terminating TLS at the edge, injecting headers, enforcing rate limits, and performing canary releases. These are Layer 7 concerns, and Services operate at Layer 4. This gap is exactly what Ingress — and its successor, the Gateway API — exist to fill.

The Ingress Resource

An Ingress is a Kubernetes API object that declares HTTP and HTTPS routing rules. You define which hostnames and paths map to which backend Services, and an Ingress controller (a separate component you must install) reads those rules and configures a reverse proxy accordingly. The Ingress resource itself does nothing without a controller — it is purely declarative configuration.

Here is a minimal Ingress that routes traffic for two hostnames to two different Services:

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80

The spec.ingressClassName field tells Kubernetes which Ingress controller should handle this resource. Before Kubernetes 1.18, this was done via the kubernetes.io/ingress.class annotation — you will still see that pattern in older configurations.

Path Types

Kubernetes supports three pathType values, and the distinction matters for how requests are matched:

| Path Type | Matching Behavior | Example |
| --- | --- | --- |
| Exact | Only matches the exact URL path | /api matches /api but not /api/ or /api/users |
| Prefix | Matches based on URL path prefix split by / | /api matches /api, /api/, and /api/users |
| ImplementationSpecific | Matching depends on the Ingress controller | NGINX treats it as a regex-capable path; others may differ |
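These rules can be expressed as a small matcher. A sketch of the two portable path types (ImplementationSpecific is left out since it varies per controller):

```shell
# match TYPE RULE_PATH REQUEST_PATH : exits 0 when the request matches.
match() {
  case "$1" in
    Exact)
      [ "$3" = "$2" ] ;;
    Prefix)
      # "/" matches everything; otherwise match whole path elements.
      [ "$2" = "/" ] || [ "$3" = "$2" ] ||
        case "$3" in "$2"/*) true ;; *) false ;; esac ;;
  esac
}
match Prefix /api /api/users && echo "Prefix /api matches /api/users"
match Exact  /api /api/users || echo "Exact /api does not match /api/users"
```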

TLS Termination

Ingress supports TLS termination at the edge by referencing a Kubernetes Secret that contains the certificate and private key. The controller terminates HTTPS and forwards plain HTTP to your backend Services.

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tls-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80

The Secret must be of type kubernetes.io/tls and contain tls.crt and tls.key data fields. In practice, most teams use cert-manager to automatically provision and renew TLS certificates from Let's Encrypt, eliminating manual Secret management entirely.
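If you do manage the Secret yourself, this is its shape (placeholder values, not real key material):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: api-tls-secret
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTi...   # base64-encoded PEM certificate
  tls.key: LS0tLS1CRUdJTi...   # base64-encoded PEM private key
```

In practice, kubectl create secret tls api-tls-secret --cert=server.crt --key=server.key generates this object directly from PEM files.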

Annotations: The Escape Hatch

The Ingress spec is deliberately minimal — it covers hosts, paths, backends, and TLS. Everything else (rate limiting, CORS headers, authentication, WebSocket support, custom timeouts) must be configured through controller-specific annotations. This is the Ingress API's biggest practical reality: annotations are where most of your configuration lives.

yaml
metadata:
  annotations:
    # NGINX Ingress Controller annotations
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

Annotations are not portable

Annotations are scoped to a specific controller implementation. If you migrate from NGINX Ingress Controller to Traefik, every annotation must be rewritten. This tight coupling is one of the primary motivations behind the Gateway API.

Comparing Popular Ingress Controllers

Kubernetes does not ship with an Ingress controller. You must install one, and the choice significantly affects your operational experience. Each controller is a reverse proxy that watches for Ingress resources and translates them into its own configuration format.

| Controller | Proxy Engine | Best For | Key Strength | Watch Out For |
| --- | --- | --- | --- | --- |
| NGINX Ingress Controller | NGINX / OpenResty | General purpose, high traffic | Mature, huge community, extensive annotation catalog | Two forks exist (community kubernetes/ingress-nginx vs. NGINX Inc's nginxinc/kubernetes-ingress) — don't confuse them |
| Traefik | Traefik (Go) | Dynamic environments, automatic TLS | Built-in Let's Encrypt, dashboard, middleware chains, native CRDs | Performance slightly lower than NGINX under extreme load; Traefik v2/v3 CRD changes can be disruptive |
| HAProxy Ingress | HAProxy | Ultra-low latency, TCP workloads | Battle-tested proxy, excellent for mixed TCP/HTTP traffic | Smaller community, fewer ready-made examples |
| AWS Load Balancer Controller (formerly AWS ALB Ingress Controller) | AWS ALB (cloud-native) | AWS-native deployments | Provisions actual AWS ALBs; uses target groups for Pod-level routing | AWS-only, costs per ALB, slower to provision than in-cluster proxies |
| Envoy-based (Contour, Emissary) | Envoy | Service mesh integration, gRPC | xDS-based dynamic config, HTTP/2 and gRPC first-class support | Higher resource footprint, steeper learning curve |

Choosing a controller

If you have no strong preference, start with the community NGINX Ingress Controller (kubernetes/ingress-nginx). It has the broadest documentation, the most Stack Overflow answers, and handles the vast majority of workloads well. Move to specialized controllers when you have a specific need: Envoy for gRPC-heavy traffic, AWS ALB for deep AWS integration, or Traefik if you want automatic Let's Encrypt with zero configuration.

Limitations of the Ingress API

After years of production use, the Kubernetes community identified several fundamental problems with the Ingress API that no amount of annotation hacking could fix:

  • Lowest common denominator spec. The Ingress spec only covers basic host/path routing and TLS. Every controller extends it differently through annotations, making manifests non-portable.
  • No role separation. A single Ingress resource mixes infrastructure concerns (which ports to listen on, what TLS policy to use) with application concerns (which paths route where). Both the platform team and the app developer edit the same object.
  • No support for non-HTTP protocols. TCP, UDP, and gRPC routing require controller-specific CRDs or annotations — there is no standard way to express them.
  • Header-based routing is impossible. You cannot route based on HTTP headers, query parameters, or request methods in the Ingress spec.
  • Traffic splitting is absent. Canary deployments, A/B testing, and weighted routing all require annotations or custom CRDs.
  • Single resource, single namespace. An Ingress resource can only reference backend Services in its own namespace, making cross-namespace routing cumbersome.

These limitations led to the development of the Gateway API — not as a patch to Ingress, but as a ground-up redesign.

The Gateway API

The Gateway API is a collection of Kubernetes CRDs that provide expressive, extensible, and role-oriented routing. It graduated to GA (v1.0) in October 2023 for its core HTTP routing features. Unlike Ingress, the Gateway API was designed by a multi-vendor working group with explicit goals: portability across implementations, a rich feature set without annotations, and clear separation of concerns between personas.

The Three-Resource Model

The Gateway API splits what was a single Ingress resource into three distinct resource types, each managed by a different persona:

flowchart TB
    subgraph infra["Infrastructure Provider"]
        GC["GatewayClass
Defines the controller implementation
e.g., Envoy, NGINX, Cilium"]
    end
    subgraph platform["Cluster Operator / Platform Team"]
        GW["Gateway
Declares listeners: ports, protocols, TLS
e.g., HTTPS on port 443"]
    end
    subgraph appdev["Application Developer"]
        HR["HTTPRoute
Defines host/path matching,
backends, traffic splitting"]
    end
    GC -->|"referenced by"| GW
    GW -->|"routes attach to"| HR

| Resource | Managed By | Responsibility |
| --- | --- | --- |
| GatewayClass | Infrastructure provider (cloud vendor, mesh operator) | Defines which controller implementation to use — analogous to StorageClass for storage |
| Gateway | Cluster operator / platform team | Declares listeners (ports, protocols, TLS certificates), namespaces allowed to attach routes |
| HTTPRoute | Application developer | Specifies host and path matching rules, backend Services, filters (header modification, redirects, mirroring), and traffic weights |

This separation is powerful. The platform team configures TLS policies and which namespaces are allowed to bind routes. App developers create HTTPRoutes in their own namespace without needing to touch infrastructure configuration or request cluster-admin privileges. Each persona controls only what they own.

Gateway API in Practice

Here is a complete example showing all three resources working together. The GatewayClass is typically provided by the controller installation; you will usually only create the Gateway and HTTPRoute.

yaml
# 1. GatewayClass — installed by the infrastructure provider
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

yaml
# 2. Gateway — created by the platform team
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: infra
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-tls
            kind: Secret
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same

yaml
# 3. HTTPRoute — created by the app developer
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-routes
  namespace: store
spec:
  parentRefs:
    - name: main-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api
          port: 8080
          weight: 90
        - name: store-api-canary
          port: 8080
          weight: 10
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: store-frontend
          port: 3000

Notice the weight fields on the first rule — 90% of /api traffic goes to the stable backend and 10% to the canary. This traffic splitting is a first-class feature of the Gateway API, not an annotation hack. Also note that the HTTPRoute lives in the store namespace but attaches to a Gateway in the infra namespace via parentRefs. Cross-namespace routing is built in.
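Weights are relative rather than percentages: each backend's share is its weight divided by the sum of all weights in the rule. For the 90/10 rule above, the shares work out as you would expect:

```shell
# Share of traffic per backendRef = weight / sum(weights).
total=$((90 + 10))
for w in 90 10; do
  awk -v w="$w" -v t="$total" \
    'BEGIN { printf "weight %d -> %.0f%% of requests\n", w, 100 * w / t }'
done
# weight 90 -> 90% of requests
# weight 10 -> 10% of requests
```

The same arithmetic means weights of 9 and 1, or 45 and 5, produce an identical split.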

HTTPRoute Features Beyond Basic Routing

The HTTPRoute resource supports matching and filtering capabilities that are impossible in the Ingress API without annotations:

yaml
rules:
  # Header-based routing
  - matches:
      - headers:
          - name: x-api-version
            value: "v2"
        path:
          type: PathPrefix
          value: /api
    backendRefs:
      - name: api-v2
        port: 8080

  # Request redirect
  - matches:
      - path:
          type: Exact
          value: /old-page
    filters:
      - type: RequestRedirect
        requestRedirect:
          scheme: https
          statusCode: 301
          path:
            type: ReplaceFullPath
            replaceFullPath: /new-page

  # Header modification
  - matches:
      - path:
          type: PathPrefix
          value: /internal
    filters:
      - type: RequestHeaderModifier
        requestHeaderModifier:
          add:
            - name: X-Internal-Request
              value: "true"
          remove:
            - X-Debug
    backendRefs:
      - name: internal-service
        port: 8080

Other Route Types

The Gateway API is not limited to HTTP. Alongside HTTPRoute, the specification defines additional route types for other protocols:

| Route Type | Protocol | Status (as of v1.2) | Use Case |
| --- | --- | --- | --- |
| HTTPRoute | HTTP / HTTPS | GA (v1.0+) | Web applications, REST APIs |
| GRPCRoute | gRPC | GA (v1.1+) | gRPC services with method-level routing |
| TLSRoute | TLS passthrough | Experimental | SNI-based routing without termination |
| TCPRoute | Raw TCP | Experimental | Databases, custom TCP protocols |
| UDPRoute | Raw UDP | Experimental | DNS, gaming, streaming |

Ingress vs. Gateway API: Side-by-Side

| Capability | Ingress | Gateway API |
| --- | --- | --- |
| Host-based routing | ✅ Built-in | ✅ Built-in |
| Path-based routing | ✅ Built-in | ✅ Built-in |
| TLS termination | ✅ Built-in | ✅ Built-in |
| Header-based routing | ❌ Annotations only | ✅ Built-in |
| Traffic splitting / canary | ❌ Annotations only | ✅ Built-in weights |
| Request redirect / rewrite | ❌ Annotations only | ✅ Built-in filters |
| Cross-namespace routing | ❌ Not supported | ✅ Via parentRefs |
| Role-based ownership | ❌ Single resource | ✅ GatewayClass → Gateway → Route |
| TCP/UDP routing | ❌ Not in spec | ✅ TCPRoute / UDPRoute (experimental) |
| gRPC routing | ❌ Not in spec | ✅ GRPCRoute (GA) |
| Portability across controllers | ⚠️ Spec only; annotations break | ✅ Conformance tests enforce it |
| Maturity / ecosystem | ✅ Widely deployed, massive docs | ⚠️ Growing rapidly; most major controllers support it |

Which should you use today?

For new projects, prefer the Gateway API. Its core features are GA, every major controller supports it (NGINX, Envoy Gateway, Cilium, Traefik, Istio, Kong), and it is the official successor to Ingress. For existing clusters with Ingress already in production, there is no rush to migrate — Ingress is not deprecated and will be supported for the foreseeable future. When you do migrate, most controllers support both APIs simultaneously, so you can run them in parallel.
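
The built-in traffic weights are one of the clearest wins over annotation-based canaries. A minimal sketch of a 90/10 canary split — the gateway and service names (api-stable, api-canary) are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-split
spec:
  parentRefs:
    - name: example-gateway      # illustrative Gateway name
  rules:
    - backendRefs:
        - name: api-stable
          port: 8080
          weight: 90             # ~90% of requests
        - name: api-canary
          port: 8080
          weight: 10             # ~10% of requests
```

Weights are relative, not percentages; 9 and 1 would produce the same split. Shifting traffic during a rollout is a one-field change.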

Installing and Using the Gateway API

The Gateway API CRDs are not bundled with Kubernetes itself. You install them separately before creating any Gateway or HTTPRoute resources.

bash
# Install the Gateway API CRDs (standard channel — GA resources only)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml

# Or include experimental resources (TCPRoute, UDPRoute, TLSRoute)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/experimental-install.yaml

# Verify the CRDs are installed
kubectl get crds | grep gateway.networking.k8s.io

After installing the CRDs, you still need a controller that implements them. Envoy Gateway, Cilium, Istio, and the NGINX Gateway Fabric are all popular choices. Each controller installs its own GatewayClass — check the controller's documentation for setup instructions.
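
Once a controller is running, you create a Gateway that references its GatewayClass. A sketch — the gatewayClassName depends entirely on which controller you installed, so treat the value below as a placeholder:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
  namespace: default
spec:
  gatewayClassName: envoy-gateway   # placeholder — use the class your controller registers
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same              # only Routes in this namespace may attach
```

HTTPRoutes then attach to this Gateway via parentRefs, as in the examples earlier in this section.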

In the next section, you will see how DNS resolution works inside the cluster via CoreDNS — the mechanism that lets Pods discover Services by name, which is ultimately what Ingress and Gateway backends rely on to forward traffic.

DNS Resolution in Kubernetes with CoreDNS

Every time a Pod connects to a Service by name — curl http://api-server or a database connection string like postgres://db-primary:5432 — it relies on DNS resolution happening inside the cluster. Kubernetes uses CoreDNS as its default cluster DNS server (replacing kube-dns since v1.13). CoreDNS runs as a Deployment in the kube-system namespace and provides name resolution for Services, Pods, and external domains to every workload in the cluster.

Understanding how cluster DNS works is essential because DNS misconfiguration is one of the most common causes of connectivity failures in Kubernetes. A surprising amount of latency problems also trace back to how DNS search domains and the ndots setting interact with external lookups.

How CoreDNS Is Deployed

CoreDNS runs as a Deployment (typically two replicas for high availability) behind a Service named kube-dns in the kube-system namespace (the Service keeps the legacy kube-dns name for compatibility). The Service gets a well-known ClusterIP — usually 10.96.0.10, which by convention is the tenth address of the default Service CIDR. This IP is critical because the kubelet on every node writes it into the /etc/resolv.conf of every Pod it creates (for Pods using the default ClusterFirst DNS policy).

bash
# Inspect the CoreDNS Deployment and Service
kubectl get deploy coredns -n kube-system
kubectl get svc kube-dns -n kube-system

# Check what a Pod sees as its DNS config
kubectl exec -it my-pod -- cat /etc/resolv.conf

A typical Pod's /etc/resolv.conf looks like this:

text
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Three things to note here: the nameserver points to the kube-dns Service ClusterIP, the search line lists domain suffixes that get appended during resolution, and ndots:5 controls when those suffixes are used. All three are configured by the kubelet based on the Pod's namespace and DNS policy.

DNS Resolution Flow

sequenceDiagram
    participant Pod
    participant resolv as /etc/resolv.conf
    participant CoreDNS as CoreDNS (kube-dns)
    participant Upstream as Upstream DNS

    Pod->>resolv: resolve "api-server"
    resolv->>CoreDNS: api-server.default.svc.cluster.local?
    CoreDNS->>CoreDNS: Look up in cluster zone
    CoreDNS-->>Pod: 10.100.23.45 (ClusterIP)

    Note over Pod,Upstream: External domain resolution

    Pod->>resolv: resolve "api.github.com"
    resolv->>CoreDNS: api.github.com.default.svc.cluster.local?
    CoreDNS-->>resolv: NXDOMAIN
    resolv->>CoreDNS: api.github.com.svc.cluster.local?
    CoreDNS-->>resolv: NXDOMAIN
    resolv->>CoreDNS: api.github.com.cluster.local?
    CoreDNS-->>resolv: NXDOMAIN
    resolv->>CoreDNS: api.github.com?
    CoreDNS->>Upstream: api.github.com?
    Upstream-->>CoreDNS: 140.82.121.6
    CoreDNS-->>Pod: 140.82.121.6
    

DNS Naming Conventions

Kubernetes creates DNS records automatically when you create Services and (optionally) Pods. The naming follows a strict hierarchy rooted at cluster.local (the default cluster domain). Understanding these patterns lets you address any resource by its fully qualified domain name (FQDN).

Service DNS Records

Every Service gets an A/AAAA record in the format:

text
<service-name>.<namespace>.svc.cluster.local

For a ClusterIP Service, this resolves to the virtual ClusterIP. For a headless Service (clusterIP: None), it resolves to the set of Pod IPs backing the Service. Headless Services also create individual A records for each Pod when used with a StatefulSet:

text
# Regular Service
my-service.production.svc.cluster.local → 10.100.23.45 (ClusterIP)

# Headless Service with StatefulSet
redis-0.redis-headless.production.svc.cluster.local → 10.244.1.12 (Pod IP)
redis-1.redis-headless.production.svc.cluster.local → 10.244.2.8  (Pod IP)
redis-2.redis-headless.production.svc.cluster.local → 10.244.3.19 (Pod IP)

Services with named ports also get SRV records, which encode both the port number and the protocol:

text
# SRV record format
_<port-name>._<protocol>.<service>.<namespace>.svc.cluster.local

# Example: Service with port named "http" on TCP
_http._tcp.my-service.production.svc.cluster.local → 0 0 8080 my-service.production.svc.cluster.local
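
The SRV naming pattern is mechanical enough to sketch as a small shell function — srv_name is a hypothetical helper for illustration, not a kubectl command:

```shell
#!/usr/bin/env bash
# Sketch: build the SRV record name for a named Service port.
# srv_name is an illustrative helper, not part of any Kubernetes tooling.
srv_name() {
  local port_name="$1" proto="$2" svc="$3" ns="$4"
  echo "_${port_name}._${proto}.${svc}.${ns}.svc.cluster.local"
}

srv_name http tcp my-service production
# → _http._tcp.my-service.production.svc.cluster.local
```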

Pod DNS Records

Pods get DNS records based on their IP address, with dots replaced by dashes:

text
# Pod DNS record format
<pod-ip-dashed>.<namespace>.pod.cluster.local

# Example: Pod with IP 10.244.1.12 in "production" namespace
10-244-1-12.production.pod.cluster.local → 10.244.1.12
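
The dots-to-dashes transformation can be sketched as a one-line shell function — pod_dns_name is an illustrative helper, not a real command:

```shell
#!/usr/bin/env bash
# Sketch: derive a Pod's cluster DNS name from its IP and namespace.
# pod_dns_name is an illustrative helper, not part of kubectl.
pod_dns_name() {
  local ip="$1" ns="$2"
  echo "${ip//./-}.${ns}.pod.cluster.local"   # replace every dot with a dash
}

pod_dns_name 10.244.1.12 production
# → 10-244-1-12.production.pod.cluster.local
```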

In practice, you rarely query Pod DNS records directly. Service DNS is the primary mechanism for discovery. The table below summarizes the key record types.

| Record Type | DNS Name Pattern | Resolves To |
|---|---|---|
| Service (ClusterIP) | svc.ns.svc.cluster.local | ClusterIP address |
| Service (Headless) | svc.ns.svc.cluster.local | Set of Pod IPs |
| StatefulSet Pod | pod-0.svc.ns.svc.cluster.local | Individual Pod IP |
| Pod | 10-244-1-12.ns.pod.cluster.local | Pod IP |
| SRV | _port._proto.svc.ns.svc.cluster.local | Port number + target |
Short names and search domains

You don't have to use the full FQDN. Within the same namespace, my-service works. Across namespaces, my-service.other-namespace suffices. The search line in /etc/resolv.conf appends the namespace-scoped suffix (default.svc.cluster.local for a Pod in the default namespace), then svc.cluster.local and cluster.local during resolution.

DNS Policies

Not every Pod should resolve DNS the same way. A Pod running a node-level agent might need the host's DNS, while a regular application Pod should use cluster DNS. Kubernetes provides four DNS policies, set via the dnsPolicy field in the Pod spec, that control how /etc/resolv.conf is constructed.

| Policy | Behavior | Use Case |
|---|---|---|
| ClusterFirst | Queries go to CoreDNS. Non-cluster domains are forwarded upstream. This is the default. | Standard application Pods |
| Default | Pod inherits the DNS config from the node it runs on (/etc/resolv.conf of the host). | Pods that need host-level DNS without cluster resolution |
| ClusterFirstWithHostNet | Same as ClusterFirst, but explicitly required for Pods using hostNetwork: true (which would otherwise default to Default). | Host-networked Pods that still need cluster DNS (e.g., ingress controllers) |
| None | Kubernetes does not set up DNS at all. You must supply your own config via dnsConfig. | Custom DNS setups, split-horizon DNS, or environments with special resolvers |

The None policy is often paired with dnsConfig to build a completely custom resolver setup. Here's an example that uses a corporate DNS server alongside CoreDNS:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-dns-pod
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 10.96.0.10        # CoreDNS
      - 172.16.0.53       # Corporate DNS
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - corp.example.com
    options:
      - name: ndots
        value: "3"
  containers:
    - name: app
      image: nginx:1.27

The Corefile — CoreDNS Configuration

CoreDNS is configured through a file called the Corefile, stored in a ConfigMap named coredns in the kube-system namespace. The Corefile is a chain of server blocks, each declaring a DNS zone and the plugins that handle it. Plugins execute in the order they are listed.

bash
# View the current Corefile
kubectl get configmap coredns -n kube-system -o yaml

A typical default Corefile looks like this:

text
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

Key Plugins Explained

| Plugin | Purpose | Details |
|---|---|---|
| kubernetes | Cluster DNS records | Reads Services and Pods from the Kubernetes API and serves A, AAAA, SRV, and PTR records for the cluster.local zone. |
| forward | Upstream resolution | Forwards queries that don't match the cluster zone to upstream nameservers. /etc/resolv.conf refers to the node's resolver. You can also specify explicit IPs like 8.8.8.8. |
| cache | Response caching | Caches DNS responses for the specified TTL (in seconds). Reduces load on upstream servers and speeds up repeated lookups. |
| loop | Loop detection | Detects forwarding loops (CoreDNS forwarding to itself) and halts the server to prevent infinite recursion. Don't remove this. |
| reload | Hot reload | Watches the Corefile for changes and reloads the configuration without restarting the Pod. Typically checks every 30 seconds. |
| errors | Error logging | Logs errors to stdout, which is picked up by the container's log stream. |
| health | Health endpoint | Exposes http://:8080/health for liveness probes. The lameduck option adds a grace period before reporting unhealthy during shutdown. |
| ready | Readiness endpoint | Exposes http://:8181/ready. Returns 200 only when all plugins are operational. |
| prometheus | Metrics | Exposes Prometheus metrics on :9153. Key metrics include coredns_dns_requests_total and coredns_dns_responses_total. |
| loadbalance | Round-robin | Randomizes the order of A records in each response, providing basic client-side load balancing. |

Customizing the Corefile

Two common customizations are forwarding specific domains to custom DNS servers and adding extra DNS entries. You edit the coredns ConfigMap directly. The reload plugin picks up changes automatically.

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health { lameduck 5s }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4 {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
    # Forward corp.example.com to internal DNS
    corp.example.com:53 {
        errors
        cache 30
        forward . 172.16.0.53 172.16.0.54
    }

The second server block above handles all queries for corp.example.com and its subdomains, forwarding them to corporate DNS servers at 172.16.0.53 and 172.16.0.54. Everything else goes through the main block. This pattern is common in hybrid environments where internal services have their own DNS zone.

The ndots:5 Problem

The ndots:5 setting in /etc/resolv.conf is Kubernetes' most misunderstood DNS behavior. It means: if a name has fewer than 5 dots, treat it as a relative name and try all search domains before querying it as-is. The intent is to let you write short names like my-service or my-service.other-namespace and have them resolve through the search list.

The problem emerges with external domains. A name like api.github.com has only 2 dots — fewer than 5 — so the resolver doesn't query it directly. Instead, it tries the search domains first:

  1. api.github.com.default.svc.cluster.local → NXDOMAIN
  2. api.github.com.svc.cluster.local → NXDOMAIN
  3. api.github.com.cluster.local → NXDOMAIN
  4. api.github.com → ✔ resolved

That's four DNS queries instead of one for every external domain lookup. In applications that make heavy use of external APIs, this creates significant unnecessary DNS traffic and adds latency (typically 2–10ms per extra query). Multiply that across thousands of Pods making frequent HTTP calls and the load on CoreDNS becomes substantial.
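
The expansion above can be sketched as a small shell function. This mimics the resolver's ordering for the fewer-than-ndots case only — candidates is an illustrative helper, not a real resolver API:

```shell
#!/usr/bin/env bash
# Sketch: the query order the resolver tries when a name has fewer
# dots than ndots — search domains first, then the name as-is.
# (candidates is an illustrative helper, not a real resolver API.)
candidates() {
  local name="$1" ndots="$2"; shift 2
  local dots="${name//[^.]/}"          # keep only the dots to count them
  if [ "${#dots}" -lt "$ndots" ]; then
    local d
    for d in "$@"; do echo "${name}.${d}"; done
  fi
  echo "$name"
}

# With ndots:5, api.github.com (2 dots) expands to four queries:
candidates api.github.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```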

Mitigations

You have several options to reduce the overhead, depending on how much control you have over the application and the Pod spec.

Option 1: Use FQDNs with a trailing dot. A trailing dot tells the resolver the name is absolute — skip the search list entirely. This requires changing application code or configuration.

text
# Instead of this (triggers search list):
api.github.com

# Use this (absolute, no search list):
api.github.com.

Option 2: Lower ndots in the Pod spec. Setting ndots:1 means any name with at least one dot (like api.github.com) is queried directly first. The trade-off is that cross-namespace short names like my-service.other-namespace require you to add .svc.cluster.local explicitly.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots-pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: my-app:1.0

Option 3: Use NodeLocal DNSCache. This DaemonSet runs a DNS caching agent on every node, reducing the latency and load caused by repeated queries. It doesn't eliminate extra queries, but it makes them much cheaper by serving cached NXDOMAIN responses locally.

Recommended approach

For most workloads, lowering ndots to 2 is the best balance. Single-name Service lookups (my-service) and cross-namespace lookups (my-service.other-ns) still work through the search list, but external domains like api.github.com resolve in a single query. Only deep subdomains like a.b.c.example.com would still trigger extra lookups.

Debugging DNS Issues

When Pods can't connect to Services or external endpoints, DNS is one of the first things to investigate. The fastest approach is to launch a temporary Pod with DNS tools and query CoreDNS directly.

Step 1: Launch a debug Pod

The busybox image includes nslookup, and the registry.k8s.io/e2e-test-images/jessie-dnsutils image includes both nslookup and dig. Use whichever is appropriate:

bash
# Quick debug pod with nslookup
kubectl run dns-debug --rm -it --restart=Never \
  --image=busybox:1.36 -- sh

# Full debug pod with dig + nslookup
kubectl run dns-debug --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils -- bash

Step 2: Test cluster DNS resolution

bash
# Resolve a Service in the same namespace
nslookup my-service

# Resolve a Service in another namespace
nslookup my-service.other-namespace.svc.cluster.local

# Resolve an external domain
nslookup google.com

# Use dig for detailed output (shows query path, TTL, response code)
dig my-service.default.svc.cluster.local

# Query CoreDNS directly by IP
dig @10.96.0.10 kubernetes.default.svc.cluster.local

Step 3: Check CoreDNS health

bash
# Are CoreDNS pods running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Verify the kube-dns Service has endpoints
kubectl get endpoints kube-dns -n kube-system

Common DNS Problems and Causes

| Symptom | Likely Cause | Investigation |
|---|---|---|
| nslookup times out | CoreDNS pods are down or the kube-dns Service has no endpoints | Check kubectl get pods -n kube-system -l k8s-app=kube-dns |
| Cluster names resolve but external domains don't | Upstream forwarder is misconfigured or unreachable from the node | Check the forward plugin in the Corefile; test upstream connectivity from the node |
| Intermittent SERVFAIL responses | UDP conntrack race condition (Linux kernel < 5.0) causing dropped packets | Enable the use-vc option in dnsConfig to force TCP, or deploy NodeLocal DNSCache |
| Slow external lookups | ndots:5 causing unnecessary search domain queries | Run dig +search api.example.com and count queries; lower ndots |
| Wrong Service IP returned | Stale DNS cache or Service was recreated with a new ClusterIP | Restart the client Pod to flush its local resolver cache |
| CoreDNS CrashLoopBackOff | Forwarding loop detected by the loop plugin (often caused by host /etc/resolv.conf pointing to 127.0.0.1) | Check CoreDNS logs; fix the forward directive to point to a real upstream server |
The forwarding loop trap

On nodes where /etc/resolv.conf points to 127.0.0.1 or 127.0.0.53 (common with systemd-resolved), the default forward . /etc/resolv.conf in the Corefile can create a loop: CoreDNS forwards to localhost, which forwards back to CoreDNS. The loop plugin detects this and crashes CoreDNS on purpose. Fix it by pointing forward to explicit upstream servers like 8.8.8.8 or your infrastructure's DNS.
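
The loop condition is easy to spot from a node's resolver config. A sketch — has_loopback_ns is an illustrative helper that flags resolv.conf contents whose nameservers point at loopback:

```shell
#!/usr/bin/env bash
# Sketch: detect resolv.conf contents that would bounce CoreDNS's
# forward . /etc/resolv.conf back onto itself.
# (has_loopback_ns is an illustrative helper.)
has_loopback_ns() {
  printf '%s\n' "$1" | grep -Eq '^nameserver +(127\.|::1)'
}

if has_loopback_ns "nameserver 127.0.0.53"; then
  echo "loop risk: point forward at an explicit upstream instead"
fi
```

Run the check against the actual /etc/resolv.conf on each node; if it flags loopback, replace the forward target in the Corefile with a real upstream.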

Network Policies — Microsegmentation in Kubernetes

By default, every Pod in a Kubernetes cluster can talk to every other Pod — across namespaces, across nodes, with zero restrictions. This is the "flat network" model baked into the Kubernetes networking specification. It makes getting started easy, but it is a serious security liability in any real environment.

Imagine a compromised frontend Pod freely reaching your database Pod, or a rogue workload in a shared cluster scanning every service on the network. Without network policies, there is nothing stopping lateral movement once an attacker gains a foothold in any Pod. NetworkPolicy resources give you microsegmentation — fine-grained firewall rules that restrict which Pods can communicate with which, on which ports, and in which direction.

flowchart LR
    subgraph "Default: No Network Policies"
        A[Pod A<br>frontend] <-->|allowed| B[Pod B<br>api]
        A <-->|allowed| C[Pod C<br>database]
        B <-->|allowed| C
        D[Pod D<br>untrusted] <-->|allowed| C
        D <-->|allowed| B
        D <-->|allowed| A
    end
    style D fill:#e74c3c,color:#fff,stroke:#c0392b
    style C fill:#2ecc71,color:#fff,stroke:#27ae60

In the diagram above, every Pod can reach every other Pod. The untrusted Pod has full access to the database — exactly the scenario Network Policies are designed to prevent.

The NetworkPolicy Resource

A NetworkPolicy is a namespaced resource that defines traffic rules for a group of Pods. It has three core building blocks: a pod selector that identifies which Pods the policy targets, ingress rules that control incoming traffic, and egress rules that control outgoing traffic. Here is the skeleton:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432

This policy targets all Pods with app: api in the production namespace. It allows inbound traffic only from Pods labeled app: frontend on port 8080, and permits outbound traffic only to Pods labeled app: database on port 5432. All other traffic to and from the api Pods is denied.

spec.podSelector — Targeting Pods

The podSelector field uses standard label selectors to determine which Pods in the policy's namespace are affected. An empty selector (podSelector: {}) matches all Pods in the namespace — this is how you build default deny rules. A selector like matchLabels: {app: api, tier: backend} targets only Pods carrying both labels.

Ingress Rules — Controlling Inbound Traffic

The ingress array defines who is allowed to send traffic to the selected Pods. Each rule in the array is an independent allow rule. Within a single rule, the from array has three types of selectors that can be combined:

| Selector | Scope | Example Use Case |
|---|---|---|
| podSelector | Pods in the same namespace as the policy | Allow frontend Pods to reach API Pods |
| namespaceSelector | All Pods in namespaces matching the label selector | Allow monitoring namespace to scrape metrics |
| ipBlock | CIDR ranges (external or internal IPs) | Allow traffic from corporate VPN 10.0.0.0/8 |
AND vs OR — a common gotcha

Items within a single from entry are ANDed. Separate entries in the from array are ORed. Placing podSelector and namespaceSelector in the same entry means "Pods matching this label in namespaces matching that label." Placing them as separate entries means "Pods matching this label or any Pod in namespaces matching that label." Getting this wrong silently opens or closes traffic you did not intend.

Here is the difference spelled out in YAML. The first example is an AND (both conditions must be true). The second is an OR (either condition allows traffic):

yaml
# AND — Pods labeled app:prometheus IN namespaces labeled team:monitoring
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            team: monitoring
        podSelector:
          matchLabels:
            app: prometheus

# OR — Pods labeled app:prometheus OR any Pod in team:monitoring namespaces
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            team: monitoring
      - podSelector:
          matchLabels:
            app: prometheus

Egress Rules — Controlling Outbound Traffic

Egress rules mirror ingress exactly, but use the to field instead of from. They support the same three selectors: podSelector, namespaceSelector, and ipBlock. Egress policies are critical for preventing compromised Pods from exfiltrating data or reaching external command-and-control servers.

yaml
egress:
  # Allow DNS resolution (critical — almost always needed)
  - to:
      - namespaceSelector: {}
    ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
  # Allow outbound to the database
  - to:
      - podSelector:
          matchLabels:
            app: database
    ports:
      - protocol: TCP
        port: 5432
Don't forget DNS

When you add egress restrictions, you almost certainly need to explicitly allow DNS (UDP and TCP port 53) to the kube-system namespace or to all namespaces. Without this, Pods cannot resolve Service names and virtually everything breaks — even though the underlying IP might be allowed.

Port Specifications

Both ingress and egress rules support a ports array. Each entry specifies a protocol (TCP, UDP, or SCTP) and a port number or named port. You can also use endPort to specify a range. If you omit the ports field entirely, the rule applies to all ports.

yaml
ports:
  - protocol: TCP
    port: 8080            # single port
  - protocol: TCP
    port: 9090
    endPort: 9099         # port range 9090-9099
  - protocol: TCP
    port: http            # named port from Pod spec

How Policies Combine — The Additive Model

NetworkPolicies are additive. When multiple policies select the same Pod, the effective set of allowed connections is the union of all those policies. You cannot write a policy that removes access granted by another policy. This is by design — it makes the system predictable and prevents accidental lockouts from conflicting rules.

Here is how it works in practice: if Policy A allows ingress from the frontend on port 8080, and Policy B allows ingress from the monitoring namespace on port 9090, then both types of traffic are allowed. There is no priority or ordering between policies. The only way to restrict traffic is to not include it in any policy that selects those Pods — which is why starting with a deny-all baseline is so important.
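
As a sketch, the two policies just described might look like this — the purpose: monitoring namespace label is an assumption for illustration. Applied together, api Pods accept both kinds of traffic:

```yaml
# Policy A: frontend → api on 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
---
# Policy B: monitoring namespaces → api on 9090
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-monitoring
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              purpose: monitoring   # assumed namespace label
      ports:
        - protocol: TCP
          port: 9090
```

The effective rule set for app: api Pods is the union: frontend on 8080 or monitoring namespaces on 9090. Deleting either policy removes only its own allowance.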

Default Deny Policies

The strongest pattern for securing a namespace is to start with a deny-all policy and then layer on explicit allows. The moment any NetworkPolicy selects a Pod, all traffic not explicitly permitted by some policy is denied. A policy with an empty podSelector and empty rules achieves exactly this.

Default Deny All Ingress

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}    # selects ALL Pods in namespace
  policyTypes:
    - Ingress        # no ingress rules = deny all inbound

This selects every Pod in the production namespace and declares Ingress as a policy type — but provides zero ingress rules. The result: all inbound traffic to every Pod in the namespace is blocked. Outbound traffic is unaffected because Egress is not listed in policyTypes.

Default Deny All Egress

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress         # no egress rules = deny all outbound

Default Deny Both Directions

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

With this applied, no Pod in the namespace can send or receive any traffic until you explicitly add policies that allow it. This is the zero-trust starting point.

CNI Plugin Requirements

Here is the part that catches many teams off guard: NetworkPolicy resources are only enforced if your CNI plugin supports them. The Kubernetes API server will happily accept your NetworkPolicy manifests regardless — no errors, no warnings — but they are silently ignored if the CNI cannot enforce them.

| CNI Plugin | NetworkPolicy Support | Notes |
|---|---|---|
| Calico | ✅ Full support | Also supports its own extended GlobalNetworkPolicy CRD |
| Cilium | ✅ Full support | eBPF-based; also supports L7 policies (HTTP, gRPC) |
| Weave Net | ✅ Full support | Standard Kubernetes NetworkPolicy |
| Antrea | ✅ Full support | Also offers tiered policy CRDs |
| Flannel | ❌ No support | Provides connectivity only; pair with Calico ("Canal") for policies |
| AWS VPC CNI | ⚠️ Partial | Requires enabling the network policy agent add-on on EKS |
Verify enforcement before relying on it

After applying a deny-all policy, test it. Run kubectl exec into a Pod and try to curl a service that should now be blocked. If the request succeeds, your CNI is not enforcing policies. This is a critical validation step — do not assume policies are active just because the resource was created.

Practical Patterns

Pattern 1: Isolating a Namespace

The most common starting point is locking down an entire namespace so that only Pods within it can communicate with each other. External namespaces cannot reach in, and Pods inside cannot reach out to other namespaces (except DNS).

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow traffic only from within this namespace
    - from:
        - podSelector: {}
  egress:
    # Allow traffic within this namespace
    - to:
        - podSelector: {}
    # Allow DNS resolution
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Every Pod in tenant-a can reach other Pods in tenant-a and resolve DNS. Nothing else — no cross-namespace communication, no internet access.

Pattern 2: Deny All, Then Allow Specific Service-to-Service Communication

This is the zero-trust approach: start with a full deny, then create targeted policies for each legitimate communication path. The following example models a three-tier application where the frontend talks to the API, and the API talks to the database. Nothing else is permitted.

yaml
# 1. Deny all traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# 2. Allow DNS for all Pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# 3. Frontend can receive external ingress, send to API
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - ports:
        - protocol: TCP
          port: 80
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 8080
---
# 4. API receives from frontend, sends to database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
---
# 5. Database receives from API only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432

flowchart LR
    EXT[External<br>Traffic] -->|":80"| FE[frontend]
    FE -->|":8080"| API[api]
    API -->|":5432"| DB[database]
    FE x--x|blocked| DB
    EXT x--x|blocked| DB
    EXT x--x|blocked| API
    style EXT fill:#95a5a6,color:#fff,stroke:#7f8c8d
    style FE fill:#3498db,color:#fff,stroke:#2980b9
    style API fill:#f39c12,color:#fff,stroke:#e67e22
    style DB fill:#2ecc71,color:#fff,stroke:#27ae60

This is five manifests, but the pattern is systematic. Each service gets exactly the connections it needs and nothing more. If the API is compromised, it can reach the database — but the database policy restricts it to port 5432, and the API cannot reach the internet or other namespaces.

Pattern 3: Allowing Cross-Namespace Monitoring

A common real-world requirement is allowing your monitoring stack (Prometheus, Grafana) to scrape metrics from application namespaces. Label the monitoring namespace and use a namespaceSelector:

yaml
# First, label the monitoring namespace:
# kubectl label namespace monitoring purpose=monitoring

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scraping
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              purpose: monitoring
          podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 9090
        - protocol: TCP
          port: 9091

Notice this uses the AND form — namespaceSelector and podSelector are in the same from entry. This means only Pods labeled app: prometheus within namespaces labeled purpose: monitoring can reach the metrics ports. Any other Pod in the monitoring namespace is still blocked.

Pattern 4: Restricting Egress to External CIDRs

Sometimes a Pod needs to reach an external service (a SaaS API, a managed database outside the cluster). Use ipBlock to allow traffic only to specific IP ranges:

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-external-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Egress
  egress:
    # Allow DNS
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    # Allow Stripe API (example IPs)
    - to:
        - ipBlock:
            cidr: 54.187.174.169/32
        - ipBlock:
            cidr: 54.187.205.235/32
      ports:
        - protocol: TCP
          port: 443

The payment-service can reach exactly two external IP addresses on port 443 (HTTPS) and nothing else. Even if the service is compromised, data exfiltration to arbitrary internet hosts is blocked.

Debugging Network Policies

When traffic is unexpectedly blocked (or unexpectedly allowed), use this checklist:

  • Verify your CNI supports NetworkPolicy. Run kubectl get pods -n kube-system and check which CNI is running.
  • List policies affecting a Pod. Use kubectl get networkpolicy -n <namespace> and check each policy's podSelector against your Pod's labels.
  • Check label accuracy. A typo in a label selector silently matches zero Pods. Run kubectl get pods --show-labels to verify.
  • Test connectivity directly. Exec into a Pod and use curl, wget, or nc to test reachability: kubectl exec -it <pod> -- curl -m 3 <target>:<port>
  • Remember DNS. If egress is restricted and DNS is not explicitly allowed, name resolution fails and every connection attempt by hostname fails with it.
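When you need a scratch Pod to run these connectivity tests from, a throwaway Pod with network tools works well. A minimal sketch — the nicolaka/netshoot image is one common choice, not something your cluster necessarily uses; any image with curl and nc will do:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: netpol-debug
  namespace: production       # run it in the namespace under test
  labels:
    app: api                  # carry the same labels as the workload you simulate
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: nicolaka/netshoot:latest   # assumption: swap for any tools image you trust
      command: ["sleep", "3600"]
```

Because policies select Pods by label, give the debug Pod the same labels as the client you are simulating; otherwise you are testing a different rule set than the one your application actually hits.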

Volumes — Attaching Storage to Pods

Every container in Kubernetes starts with a fresh, isolated filesystem layered from its image. When that container restarts — whether from a crash, a liveness probe failure, or a rolling update — everything written to that filesystem is gone. This is by design: containers are ephemeral. But most real applications need to persist data, share files between containers, or load configuration from external sources.

Kubernetes volumes solve this by providing a directory that is mounted into a container's filesystem and whose lifecycle is tied to the Pod (not the container). A volume outlives container restarts within the same Pod, and different volume types connect you to everything from temporary scratch space to cloud provider block storage.

The Ephemeral Filesystem Problem

To understand why volumes matter, consider what happens without them. A container writes log files, caches data, or stores user uploads to its local filesystem. When the kubelet restarts that container (say, after an OOMKill), the replacement container starts from a clean image layer. All written data is lost. If you have a sidecar container that needs to read those same log files, it cannot — each container has its own isolated filesystem root.

Volumes address both problems. They survive container restarts within a Pod, and they can be mounted into multiple containers simultaneously, enabling shared-storage patterns like the sidecar log collector or the adapter pattern.

Volume Lifecycle

A Kubernetes volume is declared at the Pod level (under spec.volumes) and mounted into one or more containers (under spec.containers[].volumeMounts). The critical rule: a volume's lifetime is tied to the Pod's lifetime. When the Pod is deleted, the volume is cleaned up — unless it points to external, persistent storage (which we cover in the next section on PersistentVolumes).

Volume ≠ PersistentVolume

The volume types covered here (emptyDir, configMap, secret, etc.) are Pod-scoped. They are created when the Pod is scheduled and destroyed when the Pod is removed. PersistentVolumes and PersistentVolumeClaims, covered in the next section, decouple storage lifecycle from Pod lifecycle entirely.

emptyDir — Temporary Shared Storage

An emptyDir volume starts as an empty directory when a Pod is assigned to a node. It exists for the lifetime of that Pod and is shared across all containers in the Pod. This is the workhorse volume for multi-container patterns: a main application container writes to the volume, and a sidecar reads from it.

By default, emptyDir is backed by the node's disk (whatever medium backs the node's filesystem). You can set medium: Memory to use a tmpfs RAM-backed filesystem instead, which is faster but counts against the container's memory limit and disappears on node reboot.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-collector
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/nginx
    - name: sidecar
      image: busybox:1.36
      command: ["sh", "-c", "tail -F /logs/access.log"]
      volumeMounts:
        - name: shared-logs
          mountPath: /logs
          readOnly: true
  volumes:
    - name: shared-logs
      emptyDir: {}

In this example, Nginx writes access logs to /var/log/nginx. The sidecar container sees those same files at /logs because both mount the same emptyDir volume. The sidecar mounts it read-only since it only needs to tail the logs.

For a RAM-backed scratch volume with a size cap:

yaml
volumes:
  - name: scratch-space
    emptyDir:
      medium: Memory
      sizeLimit: 128Mi

hostPath — Access to the Node Filesystem

A hostPath volume mounts a file or directory from the host node's filesystem directly into the Pod. This gives containers access to node-level resources like Docker's socket, system logs, or device files. It is powerful but dangerous — you are breaking the isolation boundary between your Pod and the underlying node.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-log-reader
spec:
  containers:
    - name: log-reader
      image: busybox:1.36
      command: ["sh", "-c", "cat /host-logs/syslog"]
      volumeMounts:
        - name: host-logs
          mountPath: /host-logs
          readOnly: true
  volumes:
    - name: host-logs
      hostPath:
        path: /var/log
        type: Directory

The type field adds safety checks before the mount. Common values include Directory (must already exist as a directory), File (must already exist as a file), DirectoryOrCreate (create if missing), and "" (empty string, no checks — the default).
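As an illustration of the create-if-missing variant, a hostPath volume for a node-local cache directory might look like this (the path is hypothetical):

```yaml
volumes:
  - name: node-cache
    hostPath:
      path: /var/cache/my-app    # hypothetical path; one directory per node
      type: DirectoryOrCreate    # kubelet creates the directory if it is missing
```

Remember that the data lives on whichever node the Pod lands on — reschedule the Pod to another node and the cache starts empty there.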

Security risk with hostPath

A Pod with a writable hostPath mount to / has full read/write access to the entire node filesystem — effectively root on the host. In production, restrict hostPath usage via Pod Security Admission or OPA/Gatekeeper policies. Most workloads should never need it. The exception is system-level DaemonSets (log agents, monitoring agents, CSI drivers) that intentionally need node access.

configMap and secret — Mount Configuration as Files

ConfigMaps and Secrets are Kubernetes API objects that store key-value data. When you mount them as volumes, each key becomes a file in the mounted directory, with the value as the file content. This is how you inject configuration files, TLS certificates, or credential files into containers without baking them into the image.

ConfigMap Volume

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    server {
      listen 80;
      location / {
        root /usr/share/nginx/html;
      }
    }
  extra-settings.conf: |
    gzip on;
---
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  containers:
    - name: nginx
      image: nginx:1.27
      volumeMounts:
        - name: config-vol
          mountPath: /etc/nginx/conf.d
  volumes:
    - name: config-vol
      configMap:
        name: nginx-config

After mounting, the container's /etc/nginx/conf.d directory contains two files: nginx.conf and extra-settings.conf. If you only need specific keys, use the items field to select them and optionally remap filenames:

yaml
volumes:
  - name: config-vol
    configMap:
      name: nginx-config
      items:
        - key: nginx.conf
          path: default.conf

This mounts only the nginx.conf key, but names the resulting file default.conf inside the mount directory.

Secret Volume

Secret volumes work identically to ConfigMap volumes, except the data comes from a Secret object. Kubernetes mounts Secret files with 0644 permissions by default; you can tighten this with defaultMode. The files are backed by a tmpfs in memory — they are never written to the node's disk.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: tls-app
spec:
  containers:
    - name: app
      image: my-app:2.1
      volumeMounts:
        - name: tls-certs
          mountPath: /etc/tls
          readOnly: true
  volumes:
    - name: tls-certs
      secret:
        secretName: app-tls
        defaultMode: 0400

The Secret app-tls might contain keys tls.crt and tls.key. After mounting, the container finds them at /etc/tls/tls.crt and /etc/tls/tls.key, both with restrictive 0400 permissions (owner read-only).

downwardAPI — Expose Pod Metadata as Files

The downwardAPI volume projects Pod metadata — labels, annotations, resource limits, the Pod name, namespace, and node name — into files inside the container. This is useful when your application needs self-awareness without querying the Kubernetes API directly.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: self-aware-app
  labels:
    app: backend
    version: v2
  annotations:
    owner: platform-team
spec:
  containers:
    - name: app
      image: my-app:2.1
      volumeMounts:
        - name: pod-info
          mountPath: /etc/podinfo
  volumes:
    - name: pod-info
      downwardAPI:
        items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations
          - path: cpu-limit
            resourceFieldRef:
              containerName: app
              resource: limits.cpu

Inside the container, /etc/podinfo/labels contains the content app="backend"\nversion="v2". The cpu-limit file contains the numeric CPU limit. Labels and annotations are kept in sync — if an annotation changes on the running Pod, the mounted file updates automatically (with a short delay).
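The same metadata can also be exposed as environment variables using fieldRef under env instead of a volume. The volume form is preferable when you need live updates — environment variables are fixed at container start — but the env form is simpler for static values like the Pod name:

```yaml
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: CPU_LIMIT
    valueFrom:
      resourceFieldRef:
        containerName: app
        resource: limits.cpu
```

Note that labels and annotations cannot be exposed this way as a whole map — only individual fields — which is another reason the volume form exists.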

projected — Combine Multiple Sources

A projected volume merges multiple volume sources — Secrets, ConfigMaps, downwardAPI fields, and serviceAccountToken — into a single mount point. Without projected volumes, you would need separate volume declarations and separate mount paths for each source. This quickly becomes unwieldy when a container needs a TLS cert from a Secret, a config file from a ConfigMap, and a service account token all in the same directory.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: combined-config
spec:
  containers:
    - name: app
      image: my-app:2.1
      volumeMounts:
        - name: all-config
          mountPath: /etc/app-config
          readOnly: true
  volumes:
    - name: all-config
      projected:
        sources:
          - configMap:
              name: app-settings
              items:
                - key: app.yaml
                  path: app.yaml
          - secret:
              name: app-tls
              items:
                - key: tls.crt
                  path: certs/tls.crt
                - key: tls.key
                  path: certs/tls.key
          - downwardAPI:
              items:
                - path: pod-name
                  fieldRef:
                    fieldPath: metadata.name
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
              audience: vault

The resulting directory at /etc/app-config contains app.yaml (from the ConfigMap), certs/tls.crt and certs/tls.key (from the Secret), pod-name (from the downward API), and token (a bound service account token with a 1-hour expiry). The serviceAccountToken source is particularly useful — it generates short-lived, audience-scoped tokens, which is the preferred alternative to the long-lived tokens from the default service account Secret.

subPath — Mount a Single File Without Hiding a Directory

When you mount a volume to a container path, it replaces the entire directory at that path. If the container image already has files in /etc/nginx/conf.d and you mount a ConfigMap there, the original files disappear. The subPath field solves this by mounting a single file (or subdirectory) from the volume into the target path, leaving the rest of the directory untouched.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: subpath-demo
spec:
  containers:
    - name: nginx
      image: nginx:1.27
      volumeMounts:
        - name: custom-config
          mountPath: /etc/nginx/conf.d/custom.conf
          subPath: custom.conf
  volumes:
    - name: custom-config
      configMap:
        name: nginx-custom

Now only custom.conf is placed at /etc/nginx/conf.d/custom.conf. The image's default default.conf and any other files in that directory remain intact.

subPath blocks automatic updates

ConfigMap and Secret volumes normally auto-update when the underlying object changes (with a delay of up to the kubelet sync period, ~60 seconds by default). When you use subPath, this auto-update does not happen. The file is mounted as a bind mount, not a symlink, so the kubelet cannot swap it. If you need live-reloading config, mount the full volume to a separate directory and symlink or have your application watch that path.
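One way to keep live reloading while avoiding directory shadowing, following the suggestion above, is to mount the full volume at a dedicated path and point the application at it. A sketch:

```yaml
containers:
  - name: nginx
    image: nginx:1.27
    volumeMounts:
      - name: custom-config
        mountPath: /config        # dedicated directory; /etc/nginx/conf.d is untouched
volumes:
  - name: custom-config
    configMap:
      name: nginx-custom
```

Because no subPath is involved, the kubelet keeps /config/custom.conf in sync with the ConfigMap. The application (or an init-time symlink into its expected config directory) must be told to read from /config — that wiring is yours to provide.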

volumeMounts Options Reference

Beyond mountPath and subPath, the volumeMounts field supports several options that control how the volume behaves inside the container.

Field Type Description
mountPath string Absolute path inside the container where the volume is mounted. Required.
readOnly bool Mount the volume as read-only. Defaults to false. Use for secrets, config, and any data the container should not modify.
subPath string Mount a single file or subdirectory from the volume instead of the root. Prevents hiding existing directory contents.
subPathExpr string Like subPath but supports environment variable expansion ($(VAR_NAME)). Useful for per-Pod paths in StatefulSets.
mountPropagation string Controls whether mounts made inside the container are visible to the host and vice versa. Values: None (default), HostToContainer, Bidirectional. Only needed for CSI drivers and system-level Pods.
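As a sketch of subPathExpr, here a Pod writes into a per-Pod subdirectory of a shared hostPath volume, with POD_NAME expanded from the downward API (the node path is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: logger
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo hello > /logs/out.txt && sleep 3600"]
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumeMounts:
        - name: node-logs
          mountPath: /logs
          subPathExpr: $(POD_NAME)    # each Pod writes to its own subdirectory
  volumes:
    - name: node-logs
      hostPath:
        path: /var/log/pods-demo      # hypothetical node path
        type: DirectoryOrCreate
```

Only environment variables declared in the container's env can be referenced in subPathExpr, so the downward API fieldRef above is doing real work here.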

Volume Type Comparison

Choosing the right volume type depends on what data you need and how long it should survive. This table summarizes the core Pod-level volume types at a glance.

Volume Type Data Source Lifetime Writable Primary Use Case
emptyDir Empty (node disk or RAM) Pod Yes Scratch space, inter-container sharing
hostPath Node filesystem Node Yes System DaemonSets, node-level access
configMap ConfigMap object Pod No* Config files, env-specific settings
secret Secret object Pod No* TLS certs, credentials, tokens
downwardAPI Pod metadata Pod No Expose labels, annotations, resource limits
projected Multiple sources combined Pod No* Unified mount for config + secrets + tokens

* ConfigMap, Secret, downwardAPI, and projected volumes are mounted read-only; writes from inside the container fail with a read-only filesystem error. (The separate immutable field on ConfigMaps and Secrets — stable since Kubernetes 1.21 — prevents updates to the API object itself; it is unrelated to the mount's writability.)

Putting It Together — A Complete Multi-Volume Pod

Real-world Pods often combine several volume types. Here is a complete example: a web application Pod that loads its config from a ConfigMap, its TLS certificates from a Secret, writes temporary cache data to an emptyDir, and exposes Pod labels to the application.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: production-web
  labels:
    app: web
    tier: frontend
spec:
  containers:
    - name: web
      image: my-web-app:3.4
      ports:
        - containerPort: 8443
      volumeMounts:
        - name: app-config
          mountPath: /etc/app/config.yaml
          subPath: config.yaml
          readOnly: true
        - name: tls
          mountPath: /etc/tls
          readOnly: true
        - name: cache
          mountPath: /tmp/cache
        - name: pod-meta
          mountPath: /etc/podinfo
          readOnly: true
      resources:
        limits:
          memory: 256Mi
          cpu: 500m
  volumes:
    - name: app-config
      configMap:
        name: web-app-config
    - name: tls
      secret:
        secretName: web-tls-cert
        defaultMode: 0400
    - name: cache
      emptyDir:
        sizeLimit: 100Mi
    - name: pod-meta
      downwardAPI:
        items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels

This pattern — config from a ConfigMap, secrets from a Secret, ephemeral cache from emptyDir, and metadata from the downward API — covers the majority of volume needs for stateless applications. When you need storage that survives Pod deletion, that is when you reach for PersistentVolumes, covered next.

PersistentVolumes and PersistentVolumeClaims

Containers are ephemeral, but data often is not. A database, a file upload service, or a message queue all need storage that survives Pod restarts, rescheduling, and even node failures. Kubernetes solves this with a two-object model: PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). This separation cleanly divides infrastructure provisioning from application consumption.

The Separation of Concerns

A PersistentVolume (PV) is a cluster-level resource representing a piece of physical or cloud storage — an NFS export, an AWS EBS volume, a GCE Persistent Disk, or a local SSD. PVs are created by cluster administrators (or dynamically by a provisioner) and exist independently of any Pod. Think of a PV as the actual hard drive sitting in a rack.

A PersistentVolumeClaim (PVC) is a namespaced request for storage made by a user or workload. It specifies how much storage is needed and how it will be accessed, without caring about where the storage comes from. The PVC is the purchase order; the PV is the inventory item that fulfills it.

Note

PVs are cluster-scoped — they do not belong to any namespace. PVCs are namespace-scoped — they live alongside the Pods that use them. This is the core of the separation: admins manage PVs globally, developers request PVCs within their namespace.

The PV Lifecycle

A PersistentVolume moves through four distinct phases from creation to cleanup. Understanding this lifecycle is critical for debugging storage issues and planning capacity.

stateDiagram-v2
    [*] --> Available : Provisioning (Static or Dynamic)
    Available --> Bound : PVC matches PV
    Bound --> Released : PVC is deleted
    Released --> Available : Reclaim policy = Recycle (deprecated)
    Released --> [*] : Reclaim policy = Delete
    Released --> Released : Reclaim policy = Retain (manual cleanup)
    

1. Provisioning

Before a PV can be used, it must exist. There are two provisioning strategies, and most production clusters use both.

Static provisioning means a cluster administrator creates PV objects by hand, each pointing to a specific backing storage resource. This is common with on-premises infrastructure — NFS servers, iSCSI targets, or local disks — where automated provisioning is not available.

yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-01
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.5
    path: /exports/data

Dynamic provisioning eliminates the need for pre-created PVs. When a PVC references a StorageClass, Kubernetes automatically provisions a matching PV through the class's provisioner plugin. This is the default model on cloud providers (AWS, GCP, Azure) and CSI-based storage systems.

2. Binding

When a PVC is created, the control plane searches for an available PV that satisfies the claim's requirements: sufficient capacity, compatible access modes, matching StorageClass, and any label selectors. If a match is found, the PVC and PV are bound in a one-to-one relationship. This binding is exclusive — once a PV is bound to a PVC, no other PVC can use it, even if the PV has more capacity than the claim requested.

If no matching PV exists and no StorageClass can dynamically provision one, the PVC remains in a Pending state indefinitely until a suitable PV becomes available.

3. Using

Once bound, a Pod can mount the PVC as a volume. The cluster looks up the PVC, finds the bound PV, and mounts the underlying storage into the Pod's container at the specified path. Multiple Pods can use the same PVC simultaneously if the access mode permits it (e.g., ReadWriteMany).

yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-server
spec:
  containers:
    - name: app
      image: myapp:3.2
      volumeMounts:
        - mountPath: /var/data
          name: data-volume
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: app-data-pvc

4. Reclaiming

When a PVC is deleted, its bound PV enters the Released phase. What happens next depends on the PV's reclaim policy. This is where storage cleanup strategy is configured, and getting it wrong can mean either data loss or orphaned cloud volumes accumulating cost.

Access Modes

Access modes define how a volume can be mounted by nodes. They are constraints on node-level access, not Pod-level — a volume with ReadWriteOnce can be mounted by multiple Pods, but only if they all run on the same node.

Access Mode Abbreviation Meaning Typical Use Case
ReadWriteOnce RWO Read-write by a single node Databases, single-instance apps
ReadOnlyMany ROX Read-only by many nodes Shared config, static assets
ReadWriteMany RWX Read-write by many nodes Shared file uploads, CMS content
ReadWriteOncePod RWOP Read-write by a single Pod (K8s 1.27+) Strict single-writer guarantees

Not every storage backend supports every access mode. Block storage (EBS, GCE PD, Azure Disk) typically supports only RWO. File-based storage (NFS, EFS, CephFS) can support RWX. Always check your storage provider's documentation.

Note

ReadWriteOncePod was introduced as stable in Kubernetes 1.29. Unlike ReadWriteOnce, which restricts to a single node, RWOP restricts to a single Pod across the entire cluster. This is the right choice when exactly one writer must exist — for example, a leader-elected process writing to a WAL.
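Requesting RWOP looks like any other access mode. A sketch of a claim for such a single-writer volume (names are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wal-writer-pvc
spec:
  accessModes:
    - ReadWriteOncePod    # the CSI driver must support RWOP
  resources:
    requests:
      storage: 10Gi
```

If a second Pod tries to use this claim while the first is running, it stays unschedulable rather than risking two writers.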

Reclaim Policies

The reclaim policy determines what happens to a PV (and its underlying storage) after its bound PVC is deleted. This is set on the PV, not the PVC.

Policy What Happens Data Preserved? When to Use
Retain PV moves to Released but is not cleaned up; the underlying storage and its data remain intact. An admin must manually reclaim the PV. Yes Production databases, any data you cannot afford to lose
Delete PV object and the underlying storage resource (e.g., the EBS volume) are both deleted automatically. No Dynamically provisioned volumes for stateless or easily-reproducible workloads
Recycle Runs rm -rf /thevolume/* on the volume and makes the PV available again. No Deprecated. Do not use. Use dynamic provisioning instead.

Dynamically provisioned PVs inherit the reclaim policy from their StorageClass. The default for most cloud StorageClasses is Delete. If you are running stateful workloads like databases, you should either change the StorageClass default to Retain or patch individual PVs after creation.

bash
# Patch an existing PV to Retain so deleting the PVC won't destroy data
kubectl patch pv my-database-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

Volume Binding Modes

Volume binding mode controls when a PV is bound to a PVC. This is configured on the StorageClass, not on the PV or PVC directly. The choice has real implications for scheduling and data locality.

Immediate (the default) — The PV is provisioned and bound as soon as the PVC is created, regardless of whether any Pod has requested it yet. This works fine for storage that is accessible from any node (like network-attached storage). But it can cause problems with topology-constrained storage: if an EBS volume is provisioned in us-east-1a but the Pod gets scheduled to a node in us-east-1b, the Pod will be stuck in Pending.

WaitForFirstConsumer — The PV binding and provisioning are delayed until a Pod that uses the PVC is scheduled. The scheduler considers the Pod's node assignment first, then provisions storage in the same topology zone. This is the recommended mode for zone-constrained block storage like EBS, GCE PD, and Azure Disk.

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "4000"

Tip

If you see Pods stuck in Pending with events like "volume node affinity conflict", the most common cause is using Immediate binding mode with zone-constrained storage. Switch the StorageClass to WaitForFirstConsumer to fix it.

How PV-PVC Matching Works

When a PVC is created with no specific volumeName, the control plane runs a matching algorithm to find the best available PV. The criteria are evaluated in this order:

  1. StorageClass — The PVC's storageClassName must exactly match the PV's storageClassName. A PVC with storageClassName: "" (empty string) only matches PVs with no class. A PVC with no storageClassName field uses the cluster's default StorageClass.
  2. Access Modes — The PV must support at least the access modes requested by the PVC. A PV offering [RWO, ROX] satisfies a PVC requesting [RWO].
  3. Capacity — The PV's capacity must be greater than or equal to the PVC's requested storage. Kubernetes picks the smallest PV that satisfies the claim to minimize waste.
  4. Label Selectors — If the PVC defines a selector with matchLabels or matchExpressions, only PVs matching those labels are considered.
  5. Volume Name — If the PVC specifies volumeName, it skips all other matching logic and binds directly to that specific PV (assuming access modes and capacity are compatible).
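For example, to pin a claim to one specific PV and bypass the matching search, set volumeName directly — here using the nfs-pv-01 volume from the static provisioning example above:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pinned-claim
  namespace: web-app
spec:
  volumeName: nfs-pv-01      # bind directly to this PV, skipping matching
  storageClassName: ""       # must still agree with the PV's class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
```

Access modes and capacity must still be compatible with the named PV, or the claim stays Pending.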

Putting It All Together: Static Provisioning Example

The following example shows the complete workflow: an admin creates a PV backed by NFS, a developer creates a PVC that matches it, and a Deployment mounts the claim.

yaml
# 1. Admin creates the PersistentVolume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-nfs-pv
  labels:
    environment: production
    tier: storage
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""          # Empty string = no dynamic provisioning
  nfs:
    server: 10.0.0.5
    path: /exports/shared-data
yaml
# 2. Developer creates a PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data-pvc
  namespace: web-app
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: ""          # Must match the PV's storageClassName
  selector:
    matchLabels:
      environment: production
yaml
# 3. Deployment mounts the PVC
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - mountPath: /usr/share/nginx/html
              name: shared-content
      volumes:
        - name: shared-content
          persistentVolumeClaim:
            claimName: shared-data-pvc

Dynamic Provisioning Example

With dynamic provisioning, you skip the PV creation entirely. The PVC references a StorageClass, and Kubernetes creates the PV automatically. This is the standard approach on cloud-managed clusters.

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: fast-ssd     # References the StorageClass provisioner

Once this PVC is applied and a Pod referencing it is scheduled, the fast-ssd StorageClass provisioner creates a 20Gi volume in the same availability zone as the node (because we set WaitForFirstConsumer earlier). A corresponding PV object is automatically created and bound to this PVC.

Inspecting PVs and PVCs

When debugging storage issues, these commands show you the current state and help identify mismatches.

bash
# List all PVs with their status, capacity, access modes, and bound claim
kubectl get pv

# List PVCs in a namespace with their bound PV
kubectl get pvc -n database

# Describe a PVC to see events — useful when a PVC is stuck in Pending
kubectl describe pvc postgres-data -n database

# Check why a PV is in Released state and not reusable
kubectl get pv shared-nfs-pv -o yaml | grep -A 5 claimRef

A common gotcha: when a PV has reclaim policy Retain and its PVC is deleted, the PV moves to Released but still holds a claimRef pointing to the old PVC. No new PVC can bind to it until you manually clear that reference:

bash
# Remove the stale claimRef so the PV becomes Available again
kubectl patch pv shared-nfs-pv --type json \
  -p '[{"op": "remove", "path": "/spec/claimRef"}]'

StorageClasses and Dynamic Provisioning

In the previous section, you created PersistentVolumes by hand and then matched them with PersistentVolumeClaims. That workflow is fine for a handful of volumes, but it collapses under real-world conditions. If you have 50 microservices each needing their own volume — across dev, staging, and production — someone has to manually create and manage 150+ PV manifests. Worse, those PVs must be pre-provisioned in the cloud provider before a workload can claim them.

StorageClasses solve this by letting you declare what kind of storage you want. When a PVC references a StorageClass, Kubernetes automatically calls the appropriate provisioner plugin to create the underlying volume, wraps it in a PV object, and binds it to the PVC — all without human intervention. This is dynamic provisioning, and it is how virtually every production cluster manages storage.

The StorageClass Spec

A StorageClass is a cluster-scoped resource (not namespaced) that acts as a template for volume creation. It tells Kubernetes which provisioner to call and what parameters to pass. Here is the anatomy of a StorageClass:

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com        # Which plugin creates the volume
parameters:                          # Provider-specific settings
  type: gp3
  fsType: ext4
  iopsPerGB: "50"
reclaimPolicy: Delete                # What happens when PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # When to provision
allowVolumeExpansion: true           # Can PVCs grow after creation?

Each field controls a different aspect of the provisioning lifecycle. Let's walk through them.

provisioner

The provisioner field identifies the volume plugin responsible for creating and deleting the actual storage backend. Every cloud provider and storage system has its own provisioner. The older "in-tree" provisioners (like kubernetes.io/aws-ebs) are built into Kubernetes itself, but they are deprecated. Modern clusters use out-of-tree CSI drivers instead.

Cloud / System Legacy In-Tree Provisioner Modern CSI Provisioner
AWS EBS kubernetes.io/aws-ebs ebs.csi.aws.com
GCP Persistent Disk kubernetes.io/gce-pd pd.csi.storage.gke.io
Azure Disk kubernetes.io/azure-disk disk.csi.azure.com
Azure File kubernetes.io/azure-file file.csi.azure.com
Rancher Local Path (none) rancher.io/local-path
Ceph RBD kubernetes.io/rbd rbd.csi.ceph.com
In-tree provisioners are deprecated

Kubernetes has been migrating all in-tree volume plugins to CSI drivers since v1.23. As of v1.31, in-tree AWS EBS and GCE PD code is removed from the core codebase. Always use the CSI provisioner for new clusters. If you see kubernetes.io/* provisioners in existing manifests, plan a migration to the CSI equivalents.

parameters

The parameters map is passed directly to the provisioner. Its contents are entirely provider-specific — Kubernetes does not validate them. Here are common parameters for the major cloud providers:

Parameter        | AWS EBS (CSI)       | GCP PD (CSI)                     | Description
type             | gp3, gp2, io2, st1  | pd-ssd, pd-standard, pd-balanced | Volume type / performance tier
fsType           | ext4, xfs           | ext4, xfs                        | Filesystem to format the volume with
iopsPerGB        | "50" (io1/io2 only) | (n/a)                            | Provisioned IOPS per GiB
encrypted        | "true"              | (n/a)                            | Enable encryption at rest
replication-type | (n/a)               | regional-pd                      | Enable regional replication on GCP

All parameter values must be strings — even numeric ones. Writing iopsPerGB: 50 without quotes will cause YAML to pass an integer, which some provisioners reject. Always quote: iopsPerGB: "50".

reclaimPolicy

The reclaimPolicy determines what happens to the underlying volume when the PVC is deleted. There are two options:

  • Delete (default): The PV and its backing storage are destroyed. This is the right choice for ephemeral or easily-recreatable data.
  • Retain: The PV is kept and its status changes to Released. The actual cloud volume is not deleted. You must manually reclaim or clean it up. Use this for databases and anything you cannot afford to lose.

Note that Recycle (which ran rm -rf on the volume) is deprecated and unsupported by CSI drivers. If you need similar behavior, use a Delete policy and rely on backups.
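
If a class was created with Delete but a specific volume now holds data worth keeping, you can flip the policy on the already-provisioned PV before removing the claim. A minimal sketch; the PV name is hypothetical (substitute the auto-generated name shown by kubectl get pv):

bash
# Flip an existing PV from Delete to Retain before deleting its PVC
# (PV name is illustrative; find yours with `kubectl get pv`)
kubectl patch pv pvc-a1b2c3d4-5678-90ef-1234-56789abcdef0 \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'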

volumeBindingMode

This field controls when the volume is actually provisioned and bound. It has a significant impact on scheduling and is the most commonly misconfigured StorageClass field.

Mode                 | Behavior                                                                                                                                 | When to Use
Immediate            | The volume is provisioned the moment the PVC is created, before any Pod references it.                                                   | Only when your storage is available in all zones (e.g., NFS, network-attached storage).
WaitForFirstConsumer | Provisioning is delayed until a Pod that uses the PVC is scheduled; the volume is created in the same zone as the node the Pod lands on. | Cloud block storage (EBS, PD, Azure Disk) — virtually always.
Why WaitForFirstConsumer matters

Cloud block volumes like EBS and GCP PD are zonal — an EBS volume in us-east-1a cannot be attached to a node in us-east-1b. With Immediate binding, the volume might get created in zone A while the scheduler places the Pod in zone B, causing a permanent scheduling failure. WaitForFirstConsumer avoids this entirely by letting the scheduler pick the node first and then creating the volume in the correct zone.

allowVolumeExpansion

When set to true, you can increase the size of an existing PVC by editing its spec.resources.requests.storage field. The CSI driver will expand the underlying volume and, if needed, resize the filesystem. Most cloud CSI drivers support this. You cannot shrink a volume — only grow it.

The Default StorageClass

When a PVC does not specify a storageClassName, Kubernetes assigns it the default StorageClass. A StorageClass is marked as default using the storageclass.kubernetes.io/is-default-class annotation. Most managed Kubernetes clusters (EKS, GKE, AKS) come with a default StorageClass pre-configured.

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

You can check which StorageClass is the default with kubectl get sc. The default will have (default) next to its name:

bash
$ kubectl get storageclass
NAME                 PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE       AGE
fast-ssd             ebs.csi.aws.com   Delete          WaitForFirstConsumer    5d
standard (default)   ebs.csi.aws.com   Delete          WaitForFirstConsumer    30d

If you have more than one StorageClass annotated as default, the behavior is undefined — some admission controllers will reject PVCs, while others will pick one arbitrarily. Keep exactly one default per cluster.

How Dynamic Provisioning Works End-to-End

Understanding the full lifecycle clarifies what Kubernetes does behind the scenes when you create a PVC. Here is the sequence:

  1. You create a PVC that references a StorageClass (either explicitly via storageClassName or implicitly via the default).
  2. The PVC enters Pending state. If the binding mode is WaitForFirstConsumer, it stays Pending until a Pod claims it.
  3. A Pod is scheduled that mounts the PVC. The scheduler picks a node, taking storage topology into account.
  4. The provisioner plugin fires. It calls the cloud provider API (e.g., ec2:CreateVolume) to create a volume in the appropriate zone with the specified parameters.
  5. Kubernetes creates a PV object that represents the newly provisioned volume. The PV's spec.claimRef is set to the PVC.
  6. The PVC is bound to the PV. Both transition to Bound status.
  7. The kubelet mounts the volume into the Pod's container at the specified mountPath.

When the PVC is deleted, the process runs in reverse: the volume is unmounted, the PV is removed, and (if reclaimPolicy: Delete) the cloud volume is destroyed.
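
Concretely, the PV that step 5 produces looks roughly like the following. This is an illustrative sketch of a controller-generated object (the name, claim reference, and volume handle are made up); you never write it by hand:

yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-a1b2c3d4-5678-90ef-1234-56789abcdef0   # generated from the PVC's UID
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete   # inherited from the StorageClass
  storageClassName: fast-ssd
  claimRef:                               # back-reference that binds this PV to the PVC
    namespace: default
    name: app-data
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc123def456789a   # cloud volume ID (illustrative)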

Example: AWS EBS with the CSI Driver

This is the most common setup on EKS. You define a StorageClass for gp3 SSD volumes and create a PVC that triggers dynamic provisioning.

yaml
# storageclass-aws.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
  encrypted: "true"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# pvc-aws.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 20Gi

Apply both and watch the PVC wait for a consumer:

bash
$ kubectl apply -f storageclass-aws.yaml -f pvc-aws.yaml

$ kubectl get pvc app-data
NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
app-data   Pending                                      ebs-gp3        5s
# Pending — waiting for a Pod to be scheduled (WaitForFirstConsumer)

$ kubectl run test --image=nginx --overrides='{
  "spec": {"containers": [{"name": "nginx", "image": "nginx",
    "volumeMounts": [{"name": "data", "mountPath": "/data"}]}],
    "volumes": [{"name": "data",
      "persistentVolumeClaim": {"claimName": "app-data"}}]}}'

$ kubectl get pvc app-data
NAME       STATUS   VOLUME                                     CAPACITY   STORAGECLASS   AGE
app-data   Bound    pvc-a1b2c3d4-5678-90ef-1234-56789abcdef0   20Gi       ebs-gp3        45s

The PVC transitions from Pending to Bound once the Pod is scheduled. A PV named pvc-a1b2c3d4-... was automatically created by the EBS CSI driver, backed by a real EBS volume in the same availability zone as the node.

Example: GCP Persistent Disk with the CSI Driver

On GKE, the pattern is identical — only the provisioner name and parameters differ.

yaml
# storageclass-gcp.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  fsType: ext4
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# pvc-gcp.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: pd-ssd
  resources:
    requests:
      storage: 50Gi

For workloads that need replication across zones (e.g., a regional GKE cluster), add the replication-type parameter:

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd-regional
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Example: Local Path Provisioner for Development

Cloud provisioners don't work on local clusters like kind, minikube, or k3s. The Rancher Local Path Provisioner fills this gap by creating volumes as directories on the node's filesystem. It ships by default with k3s and can be installed on any cluster.

yaml
# This StorageClass comes pre-installed on k3s clusters
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

PVCs referencing local-path work exactly like cloud-backed PVCs. The provisioner creates a directory under /opt/local-path-provisioner on the node, and the PV points to that host path. This gives you a fully functional dynamic provisioning loop for development and CI without any cloud infrastructure.

bash
# Install local-path-provisioner on kind or minikube
$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml

# Verify it's running
$ kubectl -n local-path-storage get pod
NAME                                      READY   STATUS    AGE
local-path-provisioner-7745554f7f-k8x2q   1/1     Running   30s
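
With the provisioner running, a claim against the local-path class is written exactly like a cloud-backed one (the claim name here is illustrative):

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dev-scratch
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 2Gi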

Expanding a Volume After Creation

If your StorageClass has allowVolumeExpansion: true, you can grow a PVC in place. Edit the PVC and increase the storage request — the CSI driver handles the rest.

bash
# Expand from 20Gi to 50Gi
$ kubectl patch pvc app-data -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'

# Watch the resize progress
$ kubectl get pvc app-data -o jsonpath='{.status.conditions[*].type}'
FileSystemResizePending

# After a moment (once the Pod remounts or the filesystem is resized online):
$ kubectl get pvc app-data
NAME       STATUS   VOLUME          CAPACITY   STORAGECLASS   AGE
app-data   Bound    pvc-a1b2c3d4    50Gi       ebs-gp3        2h

Some CSI drivers resize the filesystem online (while the Pod is running). Others require the Pod to be restarted for the resize to take effect. Check your driver's documentation. AWS EBS CSI and GCP PD CSI both support online expansion.

Define multiple StorageClasses for different tiers

A well-organized cluster typically has 2–4 StorageClasses: a general-purpose default (e.g., gp3), a high-performance class for databases (e.g., io2 with provisioned IOPS), and optionally a cheap throughput class for logs (st1). Name them descriptively — fast-ssd, high-iops, bulk-hdd — so developers can pick the right tier without knowing cloud-specific details.
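
On AWS, those tiers might be declared like this. The class names and parameter values are illustrative starting points, not prescriptions:

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: high-iops                  # databases, latency-sensitive stores
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
reclaimPolicy: Retain              # never auto-delete database volumes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: bulk-hdd                   # logs, archives, cold data
provisioner: ebs.csi.aws.com
parameters:
  type: st1
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true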

CSI Drivers and Storage Patterns for Production

Before Kubernetes v1.13, every storage backend — AWS EBS, GCE PD, Cinder, Ceph — was compiled directly into the Kubernetes codebase as an "in-tree" volume plugin. This meant that adding a new storage driver or fixing a bug required a full Kubernetes release. CSI (Container Storage Interface) decouples storage from core Kubernetes, letting vendors ship and update their own drivers on their own schedule.

CSI is not Kubernetes-specific. It is a cross-platform specification (also adopted by Mesos and Cloud Foundry) that defines a standard gRPC interface between a container orchestrator and a storage provider. In Kubernetes, CSI drivers run as a set of sidecar containers alongside a driver-specific plugin, all deployed as regular Pods.

Why In-Tree Plugins Had to Go

The in-tree model created three compounding problems. First, release coupling: a storage vendor could not ship a fix without waiting for the next Kubernetes minor release cycle (roughly every four months). Second, binary bloat: every kubelet and controller-manager binary contained code for every supported storage backend, even the ones that cluster never used. Third, privileged access: storage code ran inside core Kubernetes components, which meant a bug in a volume plugin could crash the kubelet or controller-manager.

CSI solves all three. Drivers are independently versioned container images. They run in their own Pods with scoped privileges. And the core Kubernetes codebase no longer needs to know anything about the underlying storage technology — it just speaks the CSI gRPC protocol.

Note

As of Kubernetes 1.26, in-tree volume plugins for AWS EBS, GCE PD, Azure Disk, and others have their CSI migration permanently enabled. The in-tree code delegates all operations to the corresponding CSI driver. You should ensure the CSI driver is installed even if you previously relied on in-tree plugins.
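
You can confirm which CSI drivers are registered in a cluster by listing the CSIDriver objects. The output below is illustrative and abbreviated:

bash
$ kubectl get csidrivers
NAME              ATTACHREQUIRED   PODINFOONMOUNT   MODES        AGE
ebs.csi.aws.com   true             false            Persistent   30d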

CSI Architecture

A CSI driver deployment consists of two parts: a controller component (typically a Deployment or StatefulSet with one replica) that handles cluster-level operations like provisioning and attaching volumes, and a node component (a DaemonSet) that runs on every node to handle mounting and unmounting. Both parts are composed of Kubernetes-maintained sidecar containers plus the vendor-specific CSI driver container.

mermaid
flowchart LR
    PVC["PVC Created"] --> EP["external-provisioner"]
    EP -->|"CreateVolume gRPC"| Driver["CSI Driver\n(Controller)"]
    Driver --> Backend["Storage Backend\n(EBS, Ceph, NFS...)"]
    Backend -->|"Volume Ready"| EA["external-attacher"]
    EA -->|"ControllerPublishVolume\ngRPC"| Driver
    Driver --> Attach["Volume Attached\nto Node"]
    Attach --> Kubelet["kubelet"]
    Kubelet -->|"NodeStageVolume\nNodePublishVolume"| NodeDriver["CSI Driver\n(Node DaemonSet)"]
    NodeDriver --> Mount["Volume Mounted\nin Container"]

    style PVC fill:#e0e7ff,stroke:#4f46e5,color:#1e1b4b
    style Backend fill:#fef3c7,stroke:#d97706,color:#78350f
    style Mount fill:#d1fae5,stroke:#059669,color:#064e3b
    

The diagram above shows the full lifecycle of a dynamically provisioned volume. The external sidecars watch for Kubernetes API events (new PVC, new VolumeAttachment) and translate them into gRPC calls to the CSI driver. The driver then talks to the actual storage backend.

CSI Sidecar Containers

Kubernetes maintains a set of standard sidecar containers that handle the orchestrator-side logic. These run alongside the vendor's CSI driver container within the same Pod. Here is what each one does:

Sidecar               | Runs In          | Watches                             | Calls (gRPC)
external-provisioner  | Controller       | PersistentVolumeClaim objects       | CreateVolume / DeleteVolume
external-attacher     | Controller       | VolumeAttachment objects            | ControllerPublishVolume / ControllerUnpublishVolume
external-snapshotter  | Controller       | VolumeSnapshot objects              | CreateSnapshot / DeleteSnapshot
external-resizer      | Controller       | PVC size changes                    | ControllerExpandVolume
node-driver-registrar | Node (DaemonSet) | N/A (registers driver with kubelet) | Registers CSI driver socket path
livenessprobe         | Both             | N/A                                 | Probe (health check endpoint)

How CSI Drivers Are Deployed

Most CSI drivers ship as Helm charts or kustomize manifests. Under the hood, the deployment always follows the same two-part pattern: a controller Deployment and a node DaemonSet. Here is a simplified view of the node DaemonSet for the AWS EBS CSI driver:

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebs-csi-node
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ebs-csi-node
  template:
    metadata:
      labels:
        app: ebs-csi-node
    spec:
      containers:
        - name: ebs-plugin
          image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.28.0
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: Bidirectional
            - name: plugin-dir
              mountPath: /csi
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
          args:
            - "--csi-address=/csi/csi.sock"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock"
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/ebs.csi.aws.com/
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry/
            type: Directory

Notice the mountPropagation: Bidirectional on the kubelet directory — this is critical. It allows the CSI driver to mount volumes inside its own mount namespace and have those mounts propagate to the kubelet, which then bind-mounts them into application containers. The node-driver-registrar sidecar registers the driver's Unix socket with the kubelet so the kubelet knows how to reach it.

Volume Snapshots

Volume snapshots bring point-in-time copy capabilities to Kubernetes. They are modeled as three API resources that mirror the PV/PVC pattern: VolumeSnapshotClass defines how to snapshot (which CSI driver, what parameters), VolumeSnapshot is the user-facing request (like a PVC), and VolumeSnapshotContent is the actual snapshot object bound to the backend (like a PV).

Volume snapshots require the CSI driver to implement the snapshot capability, and you must install the snapshot CRDs and the snapshot-controller separately — they are not part of core Kubernetes. GKE and AKS pre-install them; on EKS the snapshot controller is available as a managed add-on that you enable.
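
A quick way to check whether the snapshot CRDs are installed before creating any VolumeSnapshot objects (timestamps in the output are illustrative):

bash
$ kubectl get crd | grep snapshot.storage.k8s.io
volumesnapshotclasses.snapshot.storage.k8s.io    2024-01-10T08:00:00Z
volumesnapshotcontents.snapshot.storage.k8s.io   2024-01-10T08:00:00Z
volumesnapshots.snapshot.storage.k8s.io          2024-01-10T08:00:00Z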

yaml
# 1. Define a VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Retain
---
# 2. Take a snapshot of an existing PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-snapshot-2024-01-15
spec:
  volumeSnapshotClassName: ebs-snapshot-class
  source:
    persistentVolumeClaimName: postgres-data

The deletionPolicy controls what happens to the underlying snapshot in the storage backend when you delete the VolumeSnapshot object. Retain keeps the backend snapshot (safer for backups), while Delete removes it. Always use Retain for any snapshot that serves as a backup.
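
Before restoring or relying on a snapshot, wait for its READYTOUSE column to become true. The output below is illustrative, with some columns trimmed:

bash
$ kubectl get volumesnapshot db-snapshot-2024-01-15
NAME                     READYTOUSE   SOURCEPVC       RESTORESIZE   SNAPSHOTCLASS        AGE
db-snapshot-2024-01-15   true         postgres-data   100Gi         ebs-snapshot-class   2m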

Restoring from a Snapshot

To restore, you create a new PVC with a dataSource pointing to the snapshot. Kubernetes provisions a new volume pre-populated with the snapshot data:

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: db-snapshot-2024-01-15
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

Volume Cloning

Volume cloning creates a duplicate of an existing PVC without going through a snapshot intermediate. The clone is a new, independent volume with the same data as the source at the time of the clone request. It is useful for spinning up test environments from production data or creating read replicas.

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-clone
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-data        # source PVC
    kind: PersistentVolumeClaim

The source and clone must be in the same namespace and use the same StorageClass. The clone size must be equal to or larger than the source. Some CSI drivers (like AWS EBS CSI) implement cloning via an internal snapshot-and-restore, so it can take time proportional to volume size.

Volume Expansion

CSI drivers that support the ControllerExpandVolume and NodeExpandVolume RPCs allow you to grow PVCs without downtime. The StorageClass must set allowVolumeExpansion: true, and then you simply patch the PVC's spec.resources.requests.storage to a larger value:

bash
# Expand a PVC from 100Gi to 200Gi
kubectl patch pvc postgres-data -p \
  '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Monitor the resize progress
kubectl get pvc postgres-data -o jsonpath='{.status.conditions[*]}'

Expansion happens in two phases: the controller expands the underlying volume in the storage backend, then the node resizes the filesystem when the volume is next mounted (or immediately, for online expansion). You can track progress via the FileSystemResizePending condition on the PVC.

Production Storage Patterns

ReadWriteMany with NFS, CephFS, or EFS

Most block storage (EBS, Azure Disk, GCE PD) only supports ReadWriteOnce — only one node can mount the volume read-write at a time. When multiple Pods across different nodes need to read and write the same data (shared uploads, CMS content, ML training datasets), you need a storage backend that supports ReadWriteMany (RWX).

Option                                       | Protocol           | Performance                          | Best For
AWS EFS (efs.csi.aws.com)                    | NFSv4.1            | Moderate latency, elastic throughput | Shared config, media uploads
CephFS (Rook)                                | Ceph native / FUSE | High throughput, tunable             | Data-intensive shared workloads
NFS Server + nfs-subdir-external-provisioner | NFSv3/v4           | Depends on server hardware           | Dev/test, legacy integration
Azure Files                                  | SMB / NFS          | Moderate                             | Cross-node file sharing on AKS
yaml
# StorageClass for AWS EFS (ReadWriteMany)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap       # uses EFS Access Points
  fileSystemId: fs-0a1b2c3d4e5f
  directoryPerms: "700"
  basePath: "/dynamic_provisioning"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 50Gi

Block Storage vs. Filesystem Storage

CSI supports two volume modes: Filesystem (the default) and Block. With Filesystem mode, the CSI driver creates a filesystem (ext4, xfs) on the volume and mounts it as a directory. With Block mode, the raw block device is exposed directly to the container as a device file at the specified devicePath.

Block mode is used by databases that manage their own storage layout (like certain configurations of Oracle, or high-performance key-value stores), and by applications that need direct I/O without filesystem overhead. Most workloads should use Filesystem mode.

yaml
# Raw block volume PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block              # not Filesystem
  storageClassName: gp3-encrypted
  resources:
    requests:
      storage: 50Gi
---
# Pod consuming a raw block device
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
    - name: app
      image: myapp:latest
      volumeDevices:             # not volumeMounts
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc

Ephemeral CSI Volumes

Some CSI drivers support ephemeral inline volumes — volumes defined directly in the Pod spec rather than through a PVC. These volumes have the same lifecycle as the Pod: they are created when the Pod starts and deleted when the Pod is removed. This pattern is ideal for injecting short-lived secrets, certificates, or identity tokens from external systems (e.g., HashiCorp Vault via the Secrets Store CSI Driver).

yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-secrets
spec:
  containers:
    - name: app
      image: myapp:latest
      volumeMounts:
        - name: vault-secrets
          mountPath: /mnt/secrets
          readOnly: true
  volumes:
    - name: vault-secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: vault-db-creds

Generic Ephemeral Volumes

Generic ephemeral volumes (GA since Kubernetes 1.23) combine the lifecycle of ephemeral volumes with the full power of PVCs. You embed a PVC template directly inside the Pod spec. Kubernetes creates a real PVC for each Pod, provisions a volume through the normal StorageClass flow, and automatically deletes the PVC when the Pod terminates.

This is powerful for workloads that need fast scratch space backed by real block storage (not just emptyDir on the node's disk). Think ML training jobs that need 500 GiB of NVMe-backed scratch, or CI runners that need isolated high-IOPS build volumes.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-job
spec:
  containers:
    - name: trainer
      image: ml-framework:latest
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: io2-high-iops
            resources:
              requests:
                storage: 500Gi

Storage Best Practices for Production

Capacity Planning

Do not wait for PersistentVolume fullness alerts to plan capacity. Set up monitoring that tracks volume utilization percentage (available via the kubelet's /metrics endpoint as kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes). Alert at 70% utilization to give yourself time to expand. Combine this with volume expansion so that responding to an alert is a single kubectl patch rather than a data migration.
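
If you run the Prometheus Operator, that alert can be expressed as a PrometheusRule. This is a sketch assuming the operator's CRDs are installed; the rule name and duration are illustrative:

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-capacity-alerts
spec:
  groups:
    - name: storage
      rules:
        - alert: PersistentVolumeFillingUp
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.70
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 70% full"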

IOPS and Throughput Requirements

Storage performance problems are the number one cause of unexplained application slowness in Kubernetes. Cloud block storage IOPS scales with volume size (e.g., AWS gp3 provides a baseline of 3,000 IOPS for any size, but io2 scales up to 64,000). Match your StorageClass to your workload profile:

Workload                  | IOPS Profile               | Recommended Storage
PostgreSQL / MySQL (OLTP) | High random read/write     | io2 Block Express, local NVMe
Elasticsearch / Kafka     | High sequential throughput | gp3 (tuned throughput), st1
WordPress / CMS           | Low, bursty                | gp3 default
ML training scratch       | Very high sequential write | Local NVMe, instance store

Backup Strategies with Snapshots

Snapshots are not backups by themselves — they often live in the same storage system as the original volume (e.g., EBS snapshots are in the same region). A production backup strategy should include: (1) scheduled VolumeSnapshots via a CronJob or a tool like Velero, (2) cross-region copy of snapshots for disaster recovery, and (3) periodic restore tests to verify snapshot integrity. Here is a minimal CronJob-based snapshot approach:

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-snapshot-daily
spec:
  schedule: "0 2 * * *"           # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: snapshot-creator
          containers:
            - name: snapshot
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  SNAP_NAME="db-snap-$(date +%Y%m%d-%H%M%S)"
                  cat <<EOF | kubectl apply -f -
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: ${SNAP_NAME}
                  spec:
                    volumeSnapshotClassName: ebs-snapshot-class
                    source:
                      persistentVolumeClaimName: postgres-data
                  EOF
                  echo "Created snapshot: ${SNAP_NAME}"
          restartPolicy: OnFailure
Tip

For production databases, use Velero instead of hand-rolled CronJobs. Velero coordinates application-consistent snapshots (with pre/post hooks to flush writes), manages snapshot retention policies, and handles cross-region copies — all features you would otherwise need to build yourself.

Cross-AZ Considerations

Block storage volumes (EBS, Azure Disk, GCE PD) are tied to a single availability zone. If a Pod is scheduled on a node in us-east-1a but its PV lives in us-east-1b, the volume attach will fail. Kubernetes handles this through topology-aware scheduling: when a StorageClass sets volumeBindingMode: WaitForFirstConsumer, the PV is not provisioned until the Pod is scheduled, ensuring the volume is created in the same AZ as the node.

For StatefulSets, this means each replica can land in a different AZ with its volume in the matching zone — exactly what you want for high availability. However, if a node in one AZ goes down, the Pod cannot be rescheduled to another AZ because its volume is stuck in the original zone. Plan for this by either using replicated storage (Ceph, Longhorn) that spans AZs, or by accepting that recovery requires restoring from a snapshot in the new AZ.

Testing Storage Failover

Do not wait for a real outage to find out your storage layer fails ungracefully. Test these scenarios regularly:

  • Node drain: Run kubectl drain on a node with active PVs. Verify the volume detaches cleanly and the replacement Pod mounts it on the new node within your SLA (typical: 1–3 minutes for EBS).
  • Force-delete a Pod: Use kubectl delete pod --force --grace-period=0. Confirm the volume attachment is released and the new Pod is not stuck in ContainerCreating waiting for the stale VolumeAttachment to expire (the default force-detach timeout is 6 minutes).
  • AZ failure simulation: Cordon all nodes in one AZ. Verify that StatefulSet Pods with WaitForFirstConsumer volumes do not automatically recover (expected behavior), and that you can restore from a snapshot in another AZ.
  • Snapshot restore: Restore a VolumeSnapshot into a new PVC and verify data integrity. If you cannot read back what you wrote, your backup strategy is broken.
Warning

When a node fails abruptly (power loss, kernel panic), Kubernetes waits for the node.kubernetes.io/unreachable taint timeout (default 5 minutes) before marking Pods for eviction. Combined with the volume force-detach timeout, a StatefulSet Pod with a block volume can take 6–11 minutes to recover on a new node. Factor this into your availability SLAs.

Popular CSI Drivers

The CSI ecosystem is mature, with production-grade drivers available for all major cloud providers and several open-source storage systems. Here is a quick reference for the most widely deployed drivers:

Driver                   | Provider     | Access Modes | Snapshots | Expansion | Use Case
ebs.csi.aws.com          | AWS          | RWO          | Yes       | Yes       | General-purpose block storage on EKS
efs.csi.aws.com          | AWS          | RWX          | No        | N/A       | Shared file storage (NFS-backed) on EKS
pd.csi.storage.gke.io    | GCP          | RWO / ROX    | Yes       | Yes       | Block storage on GKE
disk.csi.azure.com       | Azure        | RWO          | Yes       | Yes       | Managed disks on AKS
file.csi.azure.com       | Azure        | RWX          | No        | Yes       | Azure Files (SMB/NFS) on AKS
Rook-Ceph (rbd / cephfs) | Open Source  | RWO / RWX    | Yes       | Yes       | Software-defined storage, multi-AZ replication
Longhorn                 | SUSE/Rancher | RWO / RWX    | Yes       | Yes       | Lightweight replicated storage for edge/bare-metal
secrets-store.csi.k8s.io | SIG Auth     | Ephemeral    | N/A       | N/A       | Mount secrets from Vault, AWS Secrets Manager, Azure Key Vault

When choosing a CSI driver, verify that it supports the specific features you need (snapshots, expansion, cloning, RWX) and check its Kubernetes version compatibility matrix. Run the driver's conformance tests in your CI pipeline before upgrading versions in production.

ConfigMaps and Secrets — Externalizing Configuration

The twelve-factor app methodology defines a strict boundary: configuration that varies between deployments must live outside your code. Database URLs, feature flags, API keys, TLS certificates — none of these belong in a container image. An image built once should run identically in dev, staging, and production. The only thing that changes is the configuration injected at runtime.

Kubernetes operationalizes this principle through two first-class objects: ConfigMaps for non-sensitive data and Secrets for sensitive data. Both decouple configuration from your container image, but they differ in how the cluster handles, stores, and exposes the data they carry.

ConfigMaps: Non-Sensitive Configuration

A ConfigMap is a namespaced Kubernetes object that stores key-value pairs. The values can be short strings (a feature flag, a log level) or entire files (an nginx.conf, a .properties file). The maximum size of a ConfigMap is 1 MiB.

Creating ConfigMaps

You can create ConfigMaps from multiple sources. Understanding the distinction matters because it determines how keys are named inside the resulting object.

From literal key-value pairs — useful for simple settings:

bash
kubectl create configmap app-settings \
  --from-literal=LOG_LEVEL=info \
  --from-literal=MAX_RETRIES=3 \
  --from-literal=FEATURE_DARK_MODE=true

From a file — the filename becomes the key, the file content becomes the value:

bash
# Key will be "nginx.conf", value will be the file contents
kubectl create configmap nginx-config --from-file=nginx.conf

# Override the key name explicitly
kubectl create configmap nginx-config --from-file=my-custom-key=nginx.conf

From a directory — each file in the directory becomes a key-value pair:

bash
# Every file in ./config/ becomes a key
kubectl create configmap app-config --from-file=./config/

From an env file — parses KEY=VALUE lines (ignores comments and blank lines):

bash
kubectl create configmap app-env --from-env-file=app.env

The declarative equivalent in YAML gives you version-controlled, reproducible configuration:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings
  namespace: production
data:
  LOG_LEVEL: "info"
  MAX_RETRIES: "3"
  nginx.conf: |
    server {
        listen 80;
        server_name example.com;
        location / {
            proxy_pass http://backend:8080;
        }
    }

All ConfigMap values are strings

Even though MAX_RETRIES looks like an integer and FEATURE_DARK_MODE looks like a boolean, ConfigMap values are always stored as strings. Your application code is responsible for parsing them into the correct type. For binary data, use the binaryData field with base64-encoded content.
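As an illustration (the key names and payload here are made up), string and binary data can coexist in one ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-assets
data:
  LOG_LEVEL: "info"     # stored as a string; the app parses it
binaryData:
  logo.bin: AQIDBA==    # base64 of the raw bytes 0x01 0x02 0x03 0x04
```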

Consuming ConfigMaps as Environment Variables

There are two approaches to injecting ConfigMap data as environment variables, and choosing between them affects maintainability and naming control.

envFrom injects every key in the ConfigMap as an environment variable. It is convenient but gives you less control — if someone adds a key to the ConfigMap, it automatically appears in your container's environment.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: app
      image: myapp:1.4.0
      envFrom:
        - configMapRef:
            name: app-settings
          prefix: CFG_  # optional: prepends CFG_ to every key

Individual env entries with valueFrom give you explicit control. You pick exactly which keys to expose and can rename them in the process.

yaml
spec:
  containers:
    - name: app
      image: myapp:1.4.0
      env:
        - name: APP_LOG_LEVEL
          valueFrom:
            configMapKeyRef:
              name: app-settings
              key: LOG_LEVEL
              optional: true  # pod starts even if key is missing

| Aspect | envFrom | Individual env |
|---|---|---|
| Ease of use | One line imports all keys | Each key must be listed explicitly |
| Naming control | Only global prefix | Full control over env var names |
| Invalid keys | Silently skipped (keys not valid as env vars) | You choose only valid keys |
| Auditability | Harder to trace where a variable comes from | Clear mapping in pod spec |
| Best for | Apps expecting many config values | Precise injection of a few values |

Consuming ConfigMaps as Volume Mounts

When your application reads configuration from files (not environment variables), mount the ConfigMap as a volume. Each key in the ConfigMap becomes a file in the mount directory, and the value becomes the file's content.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx:1.25
      volumeMounts:
        - name: config-volume
          mountPath: /etc/nginx/conf.d
          readOnly: true
  volumes:
    - name: config-volume
      configMap:
        name: nginx-config
        items:                  # optional: select specific keys
          - key: nginx.conf
            path: default.conf  # rename inside the mount

The items field lets you project specific keys and rename them in the mount. Without items, every key in the ConfigMap appears as a file. You can also use subPath to mount a single file without overwriting the entire target directory — but be aware that subPath mounts do not receive automatic updates.
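The subPath pattern mentioned above looks like this — a sketch that mounts one ConfigMap key as a single file, leaving the rest of /etc/nginx intact (and, as noted, never receiving automatic updates):

```yaml
spec:
  containers:
    - name: nginx
      image: nginx:1.25
      volumeMounts:
        - name: config-volume
          mountPath: /etc/nginx/nginx.conf  # target is a single file
          subPath: nginx.conf               # key within the ConfigMap
          readOnly: true
  volumes:
    - name: config-volume
      configMap:
        name: nginx-config
```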

Auto-update behavior

When you update a ConfigMap, volume-mounted files are eventually updated by the kubelet — typically within 30–60 seconds (controlled by --sync-frequency and the kubelet's ConfigMap cache TTL). However, environment variables are never updated after pod startup. You must restart the pod (or use a rolling restart) to pick up env var changes. This makes volume mounts the better choice for applications that watch config files for changes.

ConfigMaps as Command-Line Arguments

You can reference ConfigMap values in a container's command or args by first mapping them to environment variables and then using the $(VAR_NAME) substitution syntax.

yaml
spec:
  containers:
    - name: worker
      image: myworker:2.1.0
      env:
        - name: LOG_LEVEL
          valueFrom:
            configMapKeyRef:
              name: app-settings
              key: LOG_LEVEL
      command: ["./worker"]
      args: ["--log-level", "$(LOG_LEVEL)", "--max-retries", "5"]

Secrets: Sensitive Configuration

Secrets look almost identical to ConfigMaps in terms of consumption (env vars, volumes, command args). The critical difference is intent: Secrets signal to the cluster that the data is sensitive. Kubernetes responds by storing them in tmpfs on nodes (never written to disk on the node), restricting access through RBAC, and — if configured — encrypting them at rest in etcd.

Secret Types

Kubernetes defines several built-in Secret types. The type constrains what keys the Secret must contain and enables specialized behavior in controllers and the kubelet.

| Type | Purpose | Required Keys |
|---|---|---|
| Opaque | Arbitrary user-defined data (default type) | None — any keys allowed |
| kubernetes.io/dockerconfigjson | Private registry credentials for image pulls | .dockerconfigjson |
| kubernetes.io/tls | TLS certificate and private key pairs | tls.crt, tls.key |
| kubernetes.io/basic-auth | Basic authentication credentials | username, password |
| kubernetes.io/ssh-auth | SSH private key credentials | ssh-privatekey |
| kubernetes.io/service-account-token | Service account tokens (legacy, auto-created) | token, ca.crt, namespace |

Creating Secrets

The imperative approach mirrors ConfigMap creation. Kubernetes automatically base64-encodes the values when you use kubectl create secret.

bash
# Generic (Opaque) secret
kubectl create secret generic db-credentials \
  --from-literal=username=admin \
  --from-literal=password='S3cur3P@ss!'

# TLS secret from certificate files
kubectl create secret tls app-tls \
  --cert=tls.crt \
  --key=tls.key

# Docker registry secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=deploy \
  --docker-password=token123

In declarative YAML, you must base64-encode values yourself in the data field — or use stringData for plain text that Kubernetes encodes on submission:

yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
# data: values must be base64-encoded
data:
  username: YWRtaW4=           # echo -n 'admin' | base64
  password: UzNjdXIzUEBzcyE=  # echo -n 'S3cur3P@ss!' | base64
---
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials-easy
type: Opaque
# stringData: plain text — Kubernetes encodes it for you
stringData:
  username: admin
  password: "S3cur3P@ss!"

Base64 is encoding, not encryption

A common misconception: base64-encoding a Secret does not protect it. Anyone with kubectl get secret db-credentials -o yaml access can decode the values instantly with echo 'YWRtaW4=' | base64 -d. Base64 exists purely for safe transport of binary data — it provides zero confidentiality. Real protection comes from RBAC (restricting who can read Secrets), encryption at rest, and external secret management.
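You can demonstrate this in any shell, no cluster required:

```shell
# Encode the way kubectl does, then decode — no key, no secrecy
encoded=$(printf '%s' 'admin' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$encoded -> $decoded"   # YWRtaW4= -> admin
```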

Consuming Secrets

Secrets are consumed identically to ConfigMaps — through envFrom, individual env entries, and volume mounts. The only difference in the YAML is that you reference secretKeyRef instead of configMapKeyRef, or use a secret volume source.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
    - name: api
      image: myapi:3.0.0
      env:
        - name: DB_USERNAME
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
      volumeMounts:
        - name: tls-certs
          mountPath: /etc/tls
          readOnly: true
  volumes:
    - name: tls-certs
      secret:
        secretName: app-tls
        defaultMode: 0400  # restrict file permissions

Setting defaultMode: 0400 ensures that only the container's user can read the mounted Secret files. This is a simple but effective defense-in-depth measure. For registry credentials, reference the Secret in the pod's imagePullSecrets field rather than mounting it.

Encryption at Rest

By default, Secrets are stored unencrypted in etcd. Anyone with direct etcd access can read every Secret in the cluster. To fix this, you configure an EncryptionConfiguration on the API server that encrypts Secret data before it is written to etcd.

yaml
# /etc/kubernetes/enc/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}  # fallback: read unencrypted data

The API server loads this file via the --encryption-provider-config flag. The providers list is ordered: the first provider is used for writing, and all providers are tried for reading (which is how you rotate keys gracefully). Common providers include aescbc, aesgcm, secretbox, and kms (for integrating with cloud KMS services like AWS KMS, GCP Cloud KMS, or Azure Key Vault).
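A suitable 32-byte key for the aescbc provider can be generated with standard tools (a sketch; paste the output into the keys list above and keep the config file readable only by root):

```shell
# 32 random bytes, base64-encoded for the EncryptionConfiguration file
key=$(head -c 32 /dev/urandom | base64)
# Sanity check: decoding must yield exactly 32 bytes
printf '%s' "$key" | base64 -d | wc -c
```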

After enabling encryption, existing Secrets remain unencrypted until you re-write them. Force re-encryption of all Secrets with:

bash
kubectl get secrets --all-namespaces -o json | \
  kubectl replace -f -

External Secret Management

For production clusters, encryption at rest is the minimum. Most teams go further by keeping secret values entirely outside of Kubernetes and syncing them in at runtime. This avoids having sensitive values in etcd at all, and centralizes secret lifecycle management (rotation, auditing, access control) in a dedicated vault system.

| Solution | How It Works | Best For |
|---|---|---|
| External Secrets Operator | CRD-based operator that syncs secrets from external stores (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault) into native Kubernetes Secrets | Multi-cloud teams; standardized CRD interface across providers |
| Sealed Secrets | Encrypts Secrets client-side with a cluster-specific public key; only the in-cluster controller can decrypt. Safe to commit to Git. | GitOps workflows where you want secrets in version control |
| HashiCorp Vault + Agent Injector | Vault sidecar agent fetches secrets and writes them to a shared in-memory volume. Secrets never become Kubernetes Secret objects. | Strict compliance; dynamic short-lived credentials; database credential rotation |

Here is a minimal External Secrets Operator example syncing a database password from AWS Secrets Manager:

yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials          # resulting K8s Secret name
    creationPolicy: Owner
  data:
    - secretKey: username         # key in the K8s Secret
      remoteRef:
        key: prod/db-credentials  # path in AWS Secrets Manager
        property: username
    - secretKey: password
      remoteRef:
        key: prod/db-credentials
        property: password

Immutable ConfigMaps and Secrets

Kubernetes 1.21 graduated the immutable field to stable. Setting immutable: true on a ConfigMap or Secret provides two concrete benefits:

  • Performance — The kubelet stops polling the API server for updates to that object. In clusters with thousands of ConfigMaps, this significantly reduces API server load and watch traffic.
  • Safety — Accidental or malicious edits are blocked. The only way to change the configuration is to create a new object with a new name and update the pods referencing it.
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings-v3
immutable: true
data:
  LOG_LEVEL: "warn"
  MAX_RETRIES: "5"
---
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials-v2
immutable: true
type: Opaque
stringData:
  username: admin
  password: "NewS3cur3P@ss!"

The common pattern is to append a version suffix or content hash to the name (app-settings-v3, app-settings-a8f3d). Tools like Kustomize automate this with configMapGenerator, which appends a content hash to the generated name by default, producing names like app-settings-k5m8h and updating all Deployment references automatically. This triggers a rolling update whenever config changes — giving you the equivalent of a config-driven redeployment.
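A minimal kustomization.yaml sketch showing the generator (the file and resource names are illustrative):

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml        # references the ConfigMap "app-settings" by name
configMapGenerator:
  - name: app-settings     # generated as app-settings-<content-hash>
    literals:
      - LOG_LEVEL=warn
      - MAX_RETRIES=5
```

Running kubectl kustomize . emits the hashed ConfigMap and rewrites every reference to it, so any change to the literals produces a new name and, with it, a rolling update.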

Putting It All Together

Here is a realistic Deployment that uses a ConfigMap for application settings, a Secret for database credentials, and a volume-mounted ConfigMap for a custom configuration file — all in one spec.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: app
          image: order-service:4.2.1
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: order-settings   # LOG_LEVEL, MAX_RETRIES, etc.
          env:
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: order-settings
                  key: DB_HOST
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: order-db-creds
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: order-db-creds
                  key: password
          volumeMounts:
            - name: app-config
              mountPath: /etc/order-service
              readOnly: true
            - name: tls-certs
              mountPath: /etc/tls
              readOnly: true
      volumes:
        - name: app-config
          configMap:
            name: order-service-config
        - name: tls-certs
          secret:
            secretName: order-tls
            defaultMode: 0400

Rolling restart on config change

If your application only reads config at startup (environment variables or non-watched files), use kubectl rollout restart deployment/order-service after updating a ConfigMap or Secret to trigger a zero-downtime rolling restart. For fully automated config-driven rollouts, use a tool like Reloader that watches ConfigMaps and Secrets and automatically triggers rolling updates on the associated Deployments.
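With Reloader installed, opting a workload in is one annotation — a sketch using its catch-all annotation (Reloader also supports per-resource annotations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  annotations:
    reloader.stakater.com/auto: "true"  # roll pods when referenced ConfigMaps/Secrets change
spec:
  # ...rest of the Deployment spec unchanged
```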

Resource Requests, Limits, and Quality of Service

Every container in a Kubernetes cluster competes for finite CPU and memory on the nodes it runs on. Without explicit resource declarations, the scheduler is flying blind — it cannot make intelligent placement decisions, and the kubelet cannot protect one workload from another's greed. Resource requests and limits are how you communicate your workload's needs to Kubernetes, and they directly determine scheduling, runtime enforcement, and eviction priority.

Requests vs. Limits — Two Different Guarantees

A request is the amount of CPU or memory that Kubernetes guarantees to a container. The scheduler uses requests to decide which node has enough room to place a pod. If a node has 4 CPU cores and existing pods have requested 3.5 cores total, the scheduler will only place a new pod there if its CPU request is 0.5 cores or less.

A limit is the maximum amount a container is allowed to consume at runtime. The kubelet enforces limits through kernel mechanisms — Linux cgroups. A container can use resources up to its limit, but never beyond. The request is a floor for scheduling; the limit is a ceiling for enforcement.

| Aspect | Request | Limit |
|---|---|---|
| Purpose | Scheduling guarantee — "I need at least this much" | Runtime ceiling — "I must never exceed this" |
| When it matters | Pod scheduling (kube-scheduler) | Runtime enforcement (kubelet / kernel) |
| Enforcement mechanism | Node allocatable capacity accounting | Linux cgroups (CFS quota for CPU, OOM killer for memory) |
| Can exceed? | Yes — containers can use more than requested if available | No — hard enforcement at runtime |
| Default if omitted | 0 (no guarantee), unless LimitRange sets a default | Unbounded (no cap), unless LimitRange sets a default |

Here is a pod spec that sets both requests and limits for CPU and memory:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  containers:
    - name: nginx
      image: nginx:1.27
      resources:
        requests:
          cpu: "250m"       # 0.25 CPU cores guaranteed
          memory: "128Mi"   # 128 MiB guaranteed
        limits:
          cpu: "500m"       # Can burst up to 0.5 cores
          memory: "256Mi"   # Hard cap — OOMKilled if exceeded

CPU Resources: Millicores and CFS Throttling

CPU is measured in millicores (or millicpu). One core equals 1000m. You can express CPU as a decimal (0.5) or in millicores (500m) — they are equivalent. Unlike memory, CPU is a compressible resource: when a container hits its CPU limit, it is not killed — it is throttled.

Throttling is enforced through the Linux kernel's Completely Fair Scheduler (CFS) bandwidth control. The CFS works in periods (typically 100ms). If a container has a CPU limit of 500m, it receives a quota of 50ms per 100ms period. Once the container exhausts its quota within a period, the kernel suspends its threads until the next period begins. The container stays alive but gets slower — its processes stall mid-execution waiting for their next CPU slice.

CFS throttling can cause latency spikes

A container with a 100m CPU limit gets only 10ms of CPU per 100ms period. If a request handler needs 15ms of CPU time, it will be paused partway through and forced to wait for the next period — a 15ms operation can stretch past 100ms of wall-clock time (about 105ms if it starts at a period boundary). This is why latency-sensitive services often experience tail-latency spikes with tight CPU limits. You can monitor throttling via container_cpu_cfs_throttled_periods_total in Prometheus.
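The quota arithmetic behind this is easy to check — a sketch assuming a single-threaded handler that starts right at a period boundary:

```shell
period_ms=100                                        # CFS enforcement period
limit_millicores=100                                 # CPU limit of 100m
quota_ms=$((limit_millicores * period_ms / 1000))    # CPU budget per period
work_ms=15                                           # CPU time the handler needs
# Runs quota_ms, stalls until the next period begins, finishes the remainder
wall_ms=$((period_ms + work_ms - quota_ms))
echo "quota=${quota_ms}ms wall=${wall_ms}ms"         # quota=10ms wall=105ms
```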

Memory Resources: Bytes and OOMKill

Memory is measured in bytes, using standard suffixes: Ki, Mi, Gi (power-of-two) or k, M, G (power-of-ten). Always use the binary suffixes (Mi, Gi) to avoid confusion — 128Mi is 134,217,728 bytes, while 128M is 128,000,000 bytes. Memory is incompressible: unlike CPU, you cannot simply slow a process down when it uses too much memory. The kernel must reclaim the memory, and it does so violently.
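The difference between the two suffix families is easy to quantify:

```shell
mi_bytes=$((128 * 1024 * 1024))   # 128Mi: power-of-two mebibytes
m_bytes=$((128 * 1000 * 1000))    # 128M: power-of-ten megabytes
echo "$mi_bytes vs $m_bytes"      # 134217728 vs 128000000
```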

When a container's memory usage exceeds its limit, the Linux kernel's OOM (Out of Memory) killer terminates the container's process. Kubernetes sees this as a container failure with exit code 137 (128 + SIGKILL signal 9) and reason OOMKilled. If the pod's restartPolicy allows it, the kubelet restarts the container — but if it keeps getting OOMKilled, Kubernetes applies exponential back-off (CrashLoopBackOff).

Memory requests do not prevent OOMKill

A common misconception: setting a memory request does not reserve that memory exclusively for your container. Requests only affect scheduling decisions. If your container allocates more memory than its limit allows, it will be OOMKilled — regardless of its request value. If no limit is set, the container can grow until the node itself runs out of memory, at which point the kubelet's eviction manager starts killing pods based on QoS priority.

Quality of Service Classes

Kubernetes automatically assigns every pod a QoS class based on the resource requests and limits of its containers. The QoS class determines eviction priority when a node is under memory pressure — Kubernetes kills lower-priority pods first to free resources for higher-priority ones. You never set the QoS class directly; it is computed from your resource configuration.

| QoS Class | Condition | Eviction Priority | Use Case |
|---|---|---|---|
| Guaranteed | Every container sets requests equal to limits for both CPU and memory | Last to be evicted (highest priority) | Databases, payment services, control plane components |
| Burstable | At least one container has a request or limit set, but they are not all equal | Evicted after BestEffort pods | Web servers, API backends, most production workloads |
| BestEffort | No container sets any request or limit | First to be evicted (lowest priority) | Batch jobs, dev/test workloads, non-critical tasks |

The following diagram shows how Kubernetes determines the QoS class:

mermaid
flowchart TD
    A["Pod Created"] --> B{"All containers have CPU & memory requests AND limits set?"}
    B -- No --> C{"Any container has at least one request or limit set?"}
    B -- Yes --> D{"requests == limits for every container?"}
    C -- No --> E["BestEffort<br/>Lowest priority — evicted first"]
    C -- Yes --> F["Burstable<br/>Medium priority"]
    D -- Yes --> G["Guaranteed<br/>Highest priority — evicted last"]
    D -- No --> F

Guaranteed QoS Example

For Guaranteed QoS, every container must declare both CPU and memory with requests exactly equal to limits. This is the most predictable configuration — the container gets a fixed, reserved allocation.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-processor
spec:
  containers:
    - name: processor
      image: payments:3.2.1
      resources:
        requests:
          cpu: "1"
          memory: "512Mi"
        limits:
          cpu: "1"          # Same as request → Guaranteed
          memory: "512Mi"   # Same as request → Guaranteed

Burstable QoS Example

Most production workloads fall into Burstable. You set requests to size for typical load and limits to allow headroom for spikes.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
    - name: api
      image: myapp-api:2.1.0
      resources:
        requests:
          cpu: "250m"       # Typical steady-state usage
          memory: "256Mi"
        limits:
          cpu: "1"          # Allow 4x burst for traffic spikes
          memory: "512Mi"   # Double the request for safety

BestEffort QoS Example

A pod with no resource declarations at all gets BestEffort. It can use whatever is available but will be the first to be evicted under memory pressure.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  containers:
    - name: worker
      image: data-cruncher:latest
      # No resources block at all → BestEffort

You can verify a pod's QoS class at any time:

bash
kubectl get pod payment-processor -o jsonpath='{.status.qosClass}'
# Output: Guaranteed

LimitRanges — Default and Boundary Guardrails

Relying on every developer to remember to set requests and limits is fragile. A LimitRange is a namespace-scoped policy that defines default values, minimum values, and maximum values for resource requests and limits on containers and pods. When a pod is created without explicit resource declarations, the LimitRange admission controller injects the defaults automatically.

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:           # Applied as limits if not specified
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:    # Applied as requests if not specified
        cpu: "100m"
        memory: "128Mi"
      min:               # Reject pods requesting less than this
        cpu: "50m"
        memory: "64Mi"
      max:               # Reject pods requesting more than this
        cpu: "4"
        memory: "4Gi"
    - type: Pod
      max:               # Total across all containers in a pod
        cpu: "8"
        memory: "8Gi"

If a developer deploys a container with no resource configuration into the production namespace, the LimitRange controller automatically sets requests.cpu: 100m, requests.memory: 128Mi, limits.cpu: 500m, and limits.memory: 256Mi. If they try to request cpu: 10 (10 cores), the API server rejects the pod with a validation error because it exceeds the max constraint.

ResourceQuotas — Namespace-Level Budgets

While LimitRanges constrain individual containers and pods, ResourceQuotas constrain an entire namespace's aggregate consumption. They prevent a single team or application from monopolizing cluster resources. A ResourceQuota specifies the total amount of compute resources, storage, and even object counts that all pods in a namespace can collectively consume.

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    # Compute totals
    requests.cpu: "20"         # Max 20 CPU cores of requests
    requests.memory: "40Gi"    # Max 40 GiB of memory requests
    limits.cpu: "40"           # Max 40 CPU cores of limits
    limits.memory: "80Gi"      # Max 80 GiB of memory limits

    # Object counts
    pods: "50"                 # Max 50 pods in namespace
    services: "20"             # Max 20 services
    configmaps: "30"           # Max 30 ConfigMaps
    persistentvolumeclaims: "10"
    secrets: "30"

    # Storage
    requests.storage: "200Gi"  # Total PVC storage

Check current usage against the quota:

bash
kubectl describe resourcequota team-alpha-quota -n team-alpha

# Name:                  team-alpha-quota
# Resource               Used    Hard
# --------               ----    ----
# limits.cpu             12      40
# limits.memory          28Gi    80Gi
# pods                   18      50
# requests.cpu           6       20
# requests.memory        14Gi    40Gi

ResourceQuota forces resource declarations

When a ResourceQuota is active in a namespace and it specifies compute limits (e.g., requests.cpu), every new pod must declare matching resource requests — otherwise the API server rejects it. This is why you should always pair a ResourceQuota with a LimitRange that provides sensible defaults. The LimitRange injects defaults for pods that omit resources, and the ResourceQuota enforces the namespace-wide budget.

Ephemeral Storage Requests and Limits

Beyond CPU and memory, containers consume local disk space for logs, temporary files, and writable layers. Kubernetes tracks this as ephemeral-storage — space consumed on the node's root filesystem. You can set requests and limits for ephemeral storage the same way you do for CPU and memory. If a container exceeds its ephemeral-storage limit, the kubelet evicts the pod.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-processor
spec:
  containers:
    - name: processor
      image: log-processor:1.4.0
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
          ephemeral-storage: "1Gi"   # Need at least 1 GiB disk
        limits:
          cpu: "1"
          memory: "512Mi"
          ephemeral-storage: "2Gi"   # Evicted if exceeds 2 GiB

Ephemeral-storage accounting includes the container's writable layer, log files (/var/log), and emptyDir volumes (unless they are backed by tmpfs or a dedicated medium). This is particularly important for workloads that write large temporary files — image processing pipelines, build agents, or log-heavy applications — where unchecked disk usage could fill the node's filesystem and impact all pods on that node.

The CPU Limits Debate

There is an active and legitimate debate in the Kubernetes community about whether you should set CPU limits at all. Both sides have valid arguments, and the right answer depends on your workload characteristics and priorities.

The Case Against CPU Limits

The argument: if a node has idle CPU capacity, why prevent a container from using it? CPU limits enforce a hard ceiling via CFS quota, meaning a container gets throttled even when no other workload wants those CPU cycles. This leads to wasted capacity and unnecessary latency. Companies like Google (in Borg, the predecessor to Kubernetes) and several prominent community voices recommend setting only CPU requests and omitting CPU limits entirely.

The Case For CPU Limits

The counter-argument: without limits, a single misbehaving container (an infinite loop, a regex backtrack, a memory leak in a JIT compiler) can consume all available CPU on a node, degrading every other pod's performance. CPU requests only guarantee a minimum share — they do not prevent a container from consuming far more during contention. Limits provide predictability and isolation, which matters when you have mixed workloads from different teams.

| Factor | No CPU Limits | With CPU Limits |
|---|---|---|
| Resource efficiency | Higher — idle CPU is available to any container | Lower — CPU goes unused if limit not reached |
| Latency predictability | Less predictable — noisy neighbors possible | More predictable — CFS throttling is deterministic |
| Tail latency (p99) | Risk of spikes from bursts by other pods | Risk of spikes from CFS throttling within the pod |
| Noisy neighbor protection | Weak — relies on kernel CFS fair shares only | Strong — hard caps prevent resource hogging |
| Capacity planning | Harder — actual usage can vary wildly from requests | Easier — limits provide a bounded upper estimate |
| Multi-tenant clusters | Risky without strong trust boundaries | Recommended for isolation |

A practical middle ground

Always set memory limits — memory is incompressible, and an unbounded container can trigger node-wide OOM events. For CPU, start with limits set (especially in multi-tenant clusters), then remove them selectively for latency-sensitive services where throttling is a bigger problem than noisy neighbors. Monitor container_cpu_cfs_throttled_periods_total to detect when throttling is harming your workloads. If throttling is high and the node has spare capacity, removing the CPU limit for that workload is a reasonable decision.
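As a sketch of that monitoring advice (assuming the Prometheus Operator's PrometheusRule CRD and cAdvisor metrics are available), here is an alert that fires when over 25% of a container's CFS periods are throttled:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          expr: |
            rate(container_cpu_cfs_throttled_periods_total[5m])
              / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 15m
          labels:
            severity: warning
```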

Putting It All Together

In practice, you rarely configure resources in isolation. A well-managed namespace uses all three primitives together: LimitRange for sane defaults, ResourceQuota for budget enforcement, and explicit resource declarations in your pod specs for precision. Here is a complete namespace setup:

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: "4Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "16"
    requests.memory: "32Gi"
    limits.cpu: "32"
    limits.memory: "64Gi"
    pods: "40"

With this configuration, any pod deployed to team-backend without resource declarations automatically gets requests of 100m CPU / 128Mi memory and limits of 500m CPU / 256Mi memory. No single container can exceed 4 cores or 4Gi memory. The namespace as a whole cannot exceed 16 cores of CPU requests or 40 pods total. This layered approach gives you safety by default while letting individual workloads override values within the defined guardrails.

Namespaces, Labels, and Annotations — Organizing Resources

A running Kubernetes cluster quickly accumulates hundreds — sometimes thousands — of resources. Without organizational structure, finding the right Deployment or debugging a failing Service becomes a needle-in-a-haystack problem. Kubernetes provides three complementary mechanisms to keep things manageable: Namespaces partition the cluster logically, Labels tag resources for selection and grouping, and Annotations attach arbitrary metadata for tools and humans.

These three features operate at different levels. Namespaces are a hard boundary enforced by the API server — they affect resource visibility, access control, and resource quotas. Labels are soft, queryable tags that controllers and kubectl use to match resources together. Annotations are freeform metadata that have no effect on selection or scheduling but carry essential information for external tools, operators, and auditing.

Namespaces: Logical Cluster Partitions

A namespace is a virtual cluster inside your physical cluster. Resources in one namespace are invisible to kubectl commands scoped to another namespace (unless you explicitly ask). This isolation makes namespaces the primary tool for multi-tenancy, environment separation, and organizational boundaries.

The Four Built-in Namespaces

Every Kubernetes cluster ships with four namespaces out of the box, each serving a specific purpose:

| Namespace | Purpose | What Lives Here |
| --- | --- | --- |
| default | The catch-all for resources created without specifying a namespace | Your workloads, if you don't create custom namespaces |
| kube-system | Control plane and core cluster components | CoreDNS, kube-proxy, metrics-server, CNI pods |
| kube-public | Publicly readable resources (even by unauthenticated users) | The cluster-info ConfigMap used during bootstrapping |
| kube-node-lease | Lightweight heartbeats for node health | One Lease object per node, updated every few seconds |

Don't deploy into kube-system

It is tempting to drop cluster-wide tools (monitoring agents, log collectors) into kube-system. Resist this. That namespace is managed by the cluster itself, and upgrades or managed-Kubernetes providers may overwrite or garbage-collect unexpected resources. Create a dedicated namespace like monitoring or infra instead.

When to Create New Namespaces

The right namespace strategy depends on your organization. There is no single correct answer, but three patterns cover most real-world cases:

  • Per-team: team-backend, team-frontend, team-data. Good when teams own distinct services and you want to apply separate RBAC policies and resource quotas per team.
  • Per-environment: dev, staging, production. Simple, but only works well for small clusters. In larger organizations, environments typically live in separate clusters entirely.
  • Per-application: checkout-service, payment-service. Provides fine-grained isolation and makes it easy to tear down everything related to one application at once.

Creating a namespace is straightforward:

bash
# Imperative
kubectl create namespace team-backend

# Declarative (preferred for GitOps)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    team: backend
    environment: production
EOF

Namespace-Scoped vs. Cluster-Scoped Resources

Not every Kubernetes resource lives inside a namespace. Resources that apply to the entire cluster — like Nodes, PersistentVolumes, ClusterRoles, and Namespaces themselves — are cluster-scoped. Most workload resources (Pods, Deployments, Services, ConfigMaps, Secrets) are namespace-scoped.

bash
# List all namespace-scoped resource types
kubectl api-resources --namespaced=true

# List all cluster-scoped resource types
kubectl api-resources --namespaced=false

Cross-Namespace Communication

Namespaces are a logical boundary, not a network boundary. By default, a Pod in namespace team-backend can freely communicate with a Service in namespace team-frontend. The key is DNS: Kubernetes DNS gives every Service a fully qualified name following the pattern <service>.<namespace>.svc.cluster.local.

bash
# Within the same namespace, short names work
curl http://payment-api:8080/health

# From a different namespace, use the FQDN
curl http://payment-api.team-backend.svc.cluster.local:8080/health

Enforcing network boundaries

If you need actual network isolation between namespaces, you must use NetworkPolicies. Without them, namespaces provide organizational separation only — every Pod can still reach every other Pod across the cluster.
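A minimal sketch of such a policy (the policy name is illustrative): deny all ingress to Pods in team-backend except from Pods in the same namespace. This only takes effect on a CNI plugin that enforces NetworkPolicy, such as Calico or Cilium.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only   # illustrative name
  namespace: team-backend
spec:
  podSelector: {}            # applies to every Pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}    # only Pods in this same namespace may connect
```

An empty podSelector under from matches all Pods, but because no namespaceSelector is given, the match is restricted to the policy's own namespace.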

Labels: Identifying and Selecting Resources

Labels are key-value pairs attached to any Kubernetes object's metadata. Unlike namespaces, which provide hard partitions, labels are a flexible tagging system. Their real power comes from selectors — the mechanism that controllers, Services, and kubectl use to find matching resources.

Every major Kubernetes abstraction depends on labels. A Service routes traffic to Pods that match its selector. A Deployment manages ReplicaSets through label matching. A DaemonSet picks which nodes to run on using node labels. Understanding labels means understanding how Kubernetes wires everything together.

Label Syntax Rules

Labels follow strict formatting requirements. Keys can have an optional prefix (a DNS subdomain up to 253 characters) separated from the name by a /. The name portion must be 63 characters or fewer, must begin and end with an alphanumeric character ([a-z0-9A-Z]), and may contain -, _, and . in between. Values follow the same rules but may also be empty.

yaml
metadata:
  labels:
    app: payment-api           # simple key
    version: v2.3.1            # version tracking
    tier: backend              # architectural layer
    app.kubernetes.io/name: payment-api        # recommended label (prefixed)
    app.kubernetes.io/component: server
    app.kubernetes.io/managed-by: helm

Recommended Labels

Kubernetes defines a set of recommended labels under the app.kubernetes.io prefix. These labels are not enforced by the system, but they create a shared vocabulary that tools like Helm, ArgoCD, and various dashboards understand.

| Label | Example Value | Purpose |
| --- | --- | --- |
| app.kubernetes.io/name | payment-api | The name of the application |
| app.kubernetes.io/instance | payment-api-prod | A unique instance identifier |
| app.kubernetes.io/version | 3.1.0 | The current application version |
| app.kubernetes.io/component | database | The component within the architecture |
| app.kubernetes.io/part-of | e-commerce | The higher-level application this belongs to |
| app.kubernetes.io/managed-by | helm | The tool managing this resource |

Label Selectors

Selectors are how Kubernetes answers the question "which resources match?" There are two flavors: equality-based and set-based. Equality-based selectors use =, ==, and !=. Set-based selectors use in, notin, and exists. Both can be combined in a single query — all conditions must be satisfied (logical AND).

bash
# Equality-based: find all pods for the payment-api app
kubectl get pods -l app=payment-api

# Inequality: everything except the frontend tier
kubectl get pods -l tier!=frontend

# Set-based: pods in either staging or production
kubectl get pods -l 'environment in (staging, production)'

# Set-based: pods that have a "release" label (any value)
kubectl get pods -l 'release'

# Set-based: pods without a "canary" label
kubectl get pods -l '!canary'

# Combining selectors (AND logic)
kubectl get pods -l 'app=payment-api,environment in (production)'

How Controllers Use Selectors

The selector mechanism is the glue that holds Kubernetes' declarative model together. A Deployment does not directly manage Pods — it manages ReplicaSets, which in turn manage Pods. The connection at each level is made through label selectors. Here is a concrete example showing the full chain:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  selector:
    matchLabels:              # Deployment finds ReplicaSets with these labels
      app: payment-api
  template:
    metadata:
      labels:                 # Pods created with these labels — must match selector above
        app: payment-api
        version: v2.3.1
    spec:
      containers:
        - name: api
          image: myregistry/payment-api:2.3.1
---
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  selector:                   # Service routes to Pods matching these labels
    app: payment-api
  ports:
    - port: 80
      targetPort: 8080

The Deployment's spec.selector.matchLabels must be a subset of the Pod template's metadata.labels. If they don't match, the API server rejects the manifest. The Service's spec.selector independently matches Pods by label — it has no knowledge of the Deployment at all. This loose coupling is intentional: a Service can route to Pods managed by different Deployments, StatefulSets, or even bare Pods, as long as the labels match.

Annotations: Metadata for Tools and Humans

Annotations look like labels — they are key-value pairs in metadata — but they serve a fundamentally different purpose. You cannot select or filter resources by annotations. Instead, annotations carry non-identifying information that tools, controllers, and operators read at runtime. Think of them as a structured comment attached to a resource.

Annotations have relaxed constraints compared to labels. Values can be much larger (up to 256 KB total for all annotations on a resource) and can contain any UTF-8 characters, including JSON, URLs, and multi-line strings.
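For example, an annotation can hold a multi-line JSON document that would be illegal as a label value. The key and payload here are illustrative:

```yaml
metadata:
  annotations:
    build.example.com/metadata: |
      {
        "commit": "a1b2c3d4",
        "pipeline": "release-42",
        "artifactUrl": "https://ci.example.com/builds/1234"
      }
```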

Common Annotations in the Wild

| Annotation | Set By | Purpose |
| --- | --- | --- |
| kubectl.kubernetes.io/last-applied-configuration | kubectl apply | Stores the full JSON of the last applied manifest for three-way merge diffs |
| kubernetes.io/change-cause | User | Records why a rollout was triggered (shown in kubectl rollout history) |
| prometheus.io/scrape | User | Tells Prometheus to scrape metrics from this Pod |
| prometheus.io/port | User | Specifies the port Prometheus should scrape |
| nginx.ingress.kubernetes.io/rewrite-target | User | Configures URL rewriting in the NGINX Ingress Controller |
| checksum/config | Helm | A hash of the ConfigMap contents, used to trigger Pod restarts on config changes |

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  annotations:
    kubernetes.io/change-cause: "Upgraded to v2.3.1 — fixes CVE-2024-1234"
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: api
          image: myregistry/payment-api:2.3.1

Labels vs. Annotations: When to Use Which

The decision rule is simple: if Kubernetes or your tooling needs to select or group resources by this piece of metadata, use a label. If the data is informational, consumed by external tools, or too large for a label, use an annotation.

| Criterion | Labels | Annotations |
| --- | --- | --- |
| Primary purpose | Identification and selection | Non-identifying metadata |
| Used by selectors | Yes — Services, Deployments, kubectl -l | No — never queryable via selectors |
| Value constraints | ≤ 63 characters, alphanumeric + -_. | Up to 256 KB, any UTF-8 string |
| Examples | app: nginx, tier: frontend | Build SHA, config hash, Ingress hints |
| Indexable by API server | Yes — efficient lookups | No — stored but not indexed |
| Typical consumers | Kubernetes controllers, schedulers | CI/CD tools, monitoring agents, operators |

Rule of thumb

Start with a label. If you find yourself exceeding the 63-character value limit, needing to store structured data (JSON, URLs), or the value has no relevance to selection, move it to an annotation. A common mistake is putting build hashes or Git SHAs in labels — they are not useful for selection and are better suited as annotations.

Practical: Managing Labels with kubectl

Beyond filtering with -l, kubectl gives you commands to add, update, and remove labels on live resources — useful for quick operational tasks like marking a Pod for debugging or excluding it from a Service's traffic.

bash
# Add a label to a running Pod
kubectl label pod payment-api-7d6f8b4c5-x9kzq debug=true

# Overwrite an existing label (--overwrite is required)
kubectl label pod payment-api-7d6f8b4c5-x9kzq version=v2.4.0 --overwrite

# Remove a label (use the key followed by a minus sign)
kubectl label pod payment-api-7d6f8b4c5-x9kzq debug-

# Show labels as columns in output
kubectl get pods --show-labels

# Show the values of specific labels as extra columns
kubectl get pods -L app,version

# Label all pods in a namespace at once
kubectl label pods --all environment=staging -n team-backend

Similarly, you can manage annotations imperatively:

bash
# Add an annotation
kubectl annotate deployment payment-api kubernetes.io/change-cause="Rollback to v2.2.0"

# Remove an annotation
kubectl annotate deployment payment-api kubernetes.io/change-cause-

# View annotations on a resource
kubectl get deployment payment-api -o jsonpath='{.metadata.annotations}'

Putting It All Together

In practice, you use all three mechanisms in concert. Namespaces divide ownership and access. Labels connect resources and enable querying. Annotations carry the metadata that your CI/CD pipeline, monitoring stack, and Ingress controllers rely on.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: team-backend                        # Namespace: ownership boundary
  labels:                                        # Labels: identification & selection
    app.kubernetes.io/name: payment-api
    app.kubernetes.io/version: "2.3.1"
    app.kubernetes.io/component: server
    app.kubernetes.io/part-of: e-commerce
    app.kubernetes.io/managed-by: argocd
  annotations:                                   # Annotations: tool metadata
    kubernetes.io/change-cause: "Release 2.3.1"
    argocd.argoproj.io/sync-wave: "2"
    git.commit/sha: "a1b2c3d4e5f6"
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: payment-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payment-api
        app.kubernetes.io/version: "2.3.1"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      containers:
        - name: api
          image: myregistry/payment-api:2.3.1
          ports:
            - containerPort: 8080
            - containerPort: 9090
              name: metrics

With this foundation — namespaces for boundaries, labels for selection, and annotations for metadata — you have the organizational toolkit to keep a complex cluster understandable as it scales. The next section builds directly on namespaces by introducing RBAC and ServiceAccounts, which control who can access the resources inside each namespace.

RBAC and ServiceAccounts — Who Can Do What

Every request that hits the Kubernetes API server passes through three gates: authentication (who are you?), authorization (are you allowed to do this?), and admission control (should we modify or reject this request?). This section focuses on the first two gates — specifically, how Kubernetes identifies callers and how RBAC grants or denies their actions.

Getting RBAC wrong is one of the fastest paths to a cluster security incident. Overly permissive ClusterRoleBindings to cluster-admin are disturbingly common in production. Understanding the model — and knowing how to scope permissions tightly — is essential.

Authentication — Proving Who You Are

Kubernetes does not have a built-in user database. It does not store usernames and passwords. Instead, the API server delegates identity verification to external systems through pluggable authentication strategies. When a request arrives, the API server tries each configured authenticator in order until one succeeds.

| Strategy | How It Works | Typical Use Case |
| --- | --- | --- |
| X.509 Client Certificates | Client presents a TLS certificate signed by the cluster CA. The Common Name (CN) becomes the username; Organization (O) fields become groups. | Admin users, kubeadm-bootstrapped clusters |
| Bearer Tokens | A static token or ServiceAccount JWT is sent in the Authorization: Bearer <token> header. | ServiceAccounts, legacy static token files |
| OpenID Connect (OIDC) | API server validates a JWT issued by an external identity provider (Google, Azure AD, Keycloak, Dex). | Enterprise SSO for human users |
| Webhook Token Authentication | API server sends the token to an external webhook service that responds with the user's identity. | Custom auth integrations, cloud-specific identity |

Users vs. ServiceAccounts

Kubernetes has two categories of identity: normal users (humans, managed externally — no User API object exists) and ServiceAccounts (managed as Kubernetes objects in namespaces). You cannot create a "User" resource with kubectl. User identity comes entirely from the authentication layer.
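With X.509 authentication, for example, the identity lives entirely in the certificate subject. A sketch using openssl (the names are illustrative): the CN becomes the Kubernetes username, and each O field becomes a group membership.

```bash
# Generate a private key and a CSR whose subject encodes the identity:
# CN -> username, O -> group (the O field may be repeated)
openssl genrsa -out jane.key 2048
openssl req -new -key jane.key -out jane.csr \
  -subj "/CN=jane.doe/O=platform-team/O=developers"

# Inspect the subject exactly as the API server will interpret it
openssl req -in jane.csr -noout -subject
```

In a real cluster you would submit this CSR through the certificates.k8s.io API (or sign it with the cluster CA) to obtain a usable client certificate.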

Authorization — Deciding What You Can Do

Once the API server knows who you are, it determines what you can do. Kubernetes supports multiple authorization modes, configured via the --authorization-mode flag on the API server. The modes are evaluated in order, and the first one that makes a decision (allow or deny) wins.

| Mode | Description | Status |
| --- | --- | --- |
| RBAC | Role-based access control using Role/ClusterRole and Binding objects. The standard. | Default on virtually all clusters |
| ABAC | Attribute-based access control using a static policy file. Requires API server restart to change. | Legacy — not recommended |
| Webhook | Delegates authorization decisions to an external HTTP service. | Used alongside RBAC for custom policies |
| Node | Special-purpose authorizer that grants kubelets the minimum permissions they need. | Always enabled alongside RBAC |

In practice, you will work with RBAC on every cluster. The rest of this section is a deep dive into the RBAC model.

The RBAC Model

RBAC in Kubernetes is built from four resource types that connect in a clear pattern: Roles define what actions are allowed, and Bindings connect those roles to subjects (users, groups, or ServiceAccounts). The "namespace vs. cluster" axis gives you two levels of scope.

mermaid
erDiagram
    Role {
        string namespace
        string name
        list rules
    }
    ClusterRole {
        string name
        list rules
    }
    RoleBinding {
        string namespace
        string roleRef
        list subjects
    }
    ClusterRoleBinding {
        string roleRef
        list subjects
    }
    User {
        string name
    }
    Group {
        string name
    }
    ServiceAccount {
        string namespace
        string name
    }

    Role ||--o{ RoleBinding : "referenced by"
    ClusterRole ||--o{ RoleBinding : "referenced by"
    ClusterRole ||--o{ ClusterRoleBinding : "referenced by"
    RoleBinding }o--|| User : "grants to"
    RoleBinding }o--|| Group : "grants to"
    RoleBinding }o--|| ServiceAccount : "grants to"
    ClusterRoleBinding }o--|| User : "grants to"
    ClusterRoleBinding }o--|| Group : "grants to"
    ClusterRoleBinding }o--|| ServiceAccount : "grants to"
    

There are four key resources to understand:

| Resource | Scope | Purpose |
| --- | --- | --- |
| Role | Namespace | Defines permissions (verbs on resources) within a single namespace |
| ClusterRole | Cluster | Defines permissions cluster-wide, or for cluster-scoped resources (nodes, PVs, namespaces) |
| RoleBinding | Namespace | Binds a Role or ClusterRole to subjects within a namespace |
| ClusterRoleBinding | Cluster | Binds a ClusterRole to subjects across all namespaces |

A subtle but powerful pattern: a RoleBinding can reference a ClusterRole. This lets you define a reusable set of permissions once as a ClusterRole and then grant it in specific namespaces through RoleBindings. The subject only gets those permissions within the RoleBinding's namespace — not cluster-wide.
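A sketch of this pattern, with illustrative names: a secret-reader ClusterRole defined once, then granted only inside staging.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-reader              # defined once, reusable in any namespace
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding                  # namespace-scoped binding
metadata:
  name: read-secrets
  namespace: staging               # permissions apply here only
subjects:
  - kind: ServiceAccount
    name: backup-agent             # illustrative subject
    namespace: staging
roleRef:
  kind: ClusterRole                # references the cluster-wide definition
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io
```

The backup-agent ServiceAccount can read Secrets in staging and nowhere else, even though the role it references is cluster-scoped.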

RBAC Verbs

Each rule in a Role or ClusterRole specifies which verbs (actions) are allowed on which resources in which API groups. Kubernetes defines eight verbs that map directly to HTTP methods on the API server.

| Verb | HTTP Method | Description |
| --- | --- | --- |
| get | GET (single) | Read a single resource by name |
| list | GET (collection) | List all resources of a type |
| watch | GET (streaming) | Stream real-time changes to resources |
| create | POST | Create a new resource |
| update | PUT | Replace an entire resource |
| patch | PATCH | Modify specific fields of a resource |
| delete | DELETE (single) | Delete a single resource by name |
| deletecollection | DELETE (collection) | Delete all resources matching a selector |

Common groupings: read-only access typically means get, list, watch. Full management adds create, update, patch, delete. The wildcard * matches all verbs — use it sparingly.
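These groupings show up directly in a Role's rules. A sketch contrasting the three tiers (the role name and namespace are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: verb-tiers-demo            # illustrative
  namespace: staging
rules:
  # Read-only tier
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  # Full-management tier
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Wildcard tier: matches every verb, including future ones
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["*"]
```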

Role and ClusterRole

A Role grants permissions within a specific namespace. Each rule in the rules array specifies API groups, resources, and the verbs allowed on those resources.

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: staging
  name: deployment-manager
rules:
  # Full control over Deployments
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Read-only access to Pods and their logs
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]

The apiGroups field determines which API group the resources belong to. Core resources (Pods, Services, ConfigMaps) use "" (the empty string). Resources in named groups use the group name — "apps" for Deployments, "batch" for Jobs, "networking.k8s.io" for NetworkPolicies.

A ClusterRole looks identical but has no namespace field and can also grant access to cluster-scoped resources like Nodes, PersistentVolumes, and Namespaces themselves.

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list"]

You can also restrict access to specific resource names using the resourceNames field. This is useful for granting access to a particular ConfigMap or Secret without opening up all resources of that type.

yaml
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["app-config", "feature-flags"]
    verbs: ["get", "update"]

RoleBinding and ClusterRoleBinding

Roles and ClusterRoles are inert until you bind them to subjects. A RoleBinding grants the permissions defined in a Role (or ClusterRole) to one or more subjects — within a single namespace. A ClusterRoleBinding grants permissions cluster-wide.

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-manager-binding
  namespace: staging
subjects:
  # A specific user (from certificate CN or OIDC)
  - kind: User
    name: jane.doe
    apiGroup: rbac.authorization.k8s.io
  # A group (from certificate O field or OIDC groups claim)
  - kind: Group
    name: platform-team
    apiGroup: rbac.authorization.k8s.io
  # A ServiceAccount in the same namespace
  - kind: ServiceAccount
    name: ci-deployer
    namespace: staging
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io

The roleRef is immutable — once a Binding is created, you cannot change which Role it references. To point it at a different Role, delete and recreate the Binding. This prevents privilege escalation through in-place modification.

Here is a ClusterRoleBinding that grants node-reader to the monitoring group across the entire cluster:

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-node-reader
subjects:
  - kind: Group
    name: monitoring
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io

ServiceAccounts

ServiceAccounts are the identity mechanism for workloads running inside the cluster. Every namespace has a default ServiceAccount, and every Pod runs as some ServiceAccount. When a Pod makes API calls (e.g., a controller querying Pods, or a CI tool creating Deployments), it authenticates using its ServiceAccount token.

Automatic Token Mounting

By default, Kubernetes mounts a ServiceAccount token into every Pod at /var/run/secrets/kubernetes.io/serviceaccount/. Since Kubernetes 1.22, these are bound service account tokens — time-limited JWTs projected through the TokenRequest API, scoped to a specific audience and expiration (default: 1 hour, auto-refreshed by the kubelet).

bash
# Inspect the projected token volume inside a running Pod
kubectl exec my-pod -- ls /var/run/secrets/kubernetes.io/serviceaccount/
# Output: ca.crt  namespace  token

# Decode the JWT to see its claims (bound, time-limited)
kubectl exec my-pod -- cat /var/run/secrets/kubernetes.io/serviceaccount/token \
  | cut -d'.' -f2 | base64 -d 2>/dev/null | python3 -m json.tool

Disabling Automatic Token Mounting

Most application Pods never call the Kubernetes API. Mounting a token into these Pods is an unnecessary attack surface — if the Pod is compromised, the attacker gets a valid API token. Disable it at the ServiceAccount level, the Pod level, or both.

yaml
# Option 1: Disable at the ServiceAccount level (affects all Pods using it)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
automountServiceAccountToken: false
---
# Option 2: Disable at the Pod spec level (overrides ServiceAccount setting)
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  serviceAccountName: my-app
  automountServiceAccountToken: false
  containers:
    - name: app
      image: my-app:1.4.0

Security Best Practice

Set automountServiceAccountToken: false on the default ServiceAccount in every namespace. Create dedicated ServiceAccounts with explicit permissions only for Pods that actually need API access. The default ServiceAccount should never have RoleBindings attached to it.

Bound Service Account Token Volume Projection

Modern Kubernetes clusters (1.22+) use projected volumes to mount ServiceAccount tokens. Unlike the legacy approach of long-lived Secret-based tokens, projected tokens are issued on-demand through the TokenRequest API with three important properties:

  • Time-limited — tokens expire (default 1 hour) and are automatically rotated by the kubelet before expiration.
  • Audience-bound — tokens are scoped to a specific audience (typically the API server), preventing misuse with other services.
  • Object-bound — tokens are tied to the specific Pod; if the Pod is deleted, the token becomes invalid immediately.

You can also request tokens with custom audiences and expiration times using a projected volume explicitly:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: vault-consumer
spec:
  serviceAccountName: vault-auth
  containers:
    - name: app
      image: my-app:2.0.0
      volumeMounts:
        - name: vault-token
          mountPath: /var/run/secrets/vault
          readOnly: true
  volumes:
    - name: vault-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600
              audience: vault

Aggregated ClusterRoles

Aggregated ClusterRoles let you compose a ClusterRole from multiple smaller ClusterRoles using label selectors. Instead of maintaining one monolithic role, you define granular roles and aggregate them automatically. Kubernetes' built-in admin, edit, and view ClusterRoles use this mechanism — which is why installing a CRD can automatically make its resources visible in those roles.

yaml
# Parent: aggregates all ClusterRoles with the matching label
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-aggregate
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.example.com/aggregate-to-monitoring: "true"
rules: []  # Rules are auto-populated by the controller
---
# Child: contributes rules to the aggregate
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-metrics-reader
  labels:
    rbac.example.com/aggregate-to-monitoring: "true"
rules:
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list", "watch"]
---
# Another child: automatically merged into the aggregate
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-events-reader
  labels:
    rbac.example.com/aggregate-to-monitoring: "true"
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "list", "watch"]

The parent ClusterRole's rules field is automatically filled by the RBAC controller. Leave it as an empty array — any manually added rules will be overwritten. Adding or removing a child ClusterRole with the matching label dynamically updates the aggregate.

Auditing Permissions with kubectl auth

You do not need to read through dozens of Roles and Bindings to understand what a subject can do. The kubectl auth can-i command answers permission questions directly.

bash
# Can I create deployments in the staging namespace?
kubectl auth can-i create deployments --namespace staging
# yes

# Can the "ci-deployer" ServiceAccount list pods in staging?
kubectl auth can-i list pods \
  --namespace staging \
  --as system:serviceaccount:staging:ci-deployer
# yes

# List ALL permissions for a specific ServiceAccount
kubectl auth can-i --list \
  --namespace staging \
  --as system:serviceaccount:staging:ci-deployer

# Check cluster-scoped permissions: can user jane delete nodes?
kubectl auth can-i delete nodes --as jane.doe
# no

# Check who you are currently authenticated as (Kubernetes 1.27+)
kubectl auth whoami

The --as flag triggers impersonation, which lets cluster admins test permissions from another identity's perspective. ServiceAccounts use the format system:serviceaccount:<namespace>:<name>. The --list flag outputs a table of all allowed actions — invaluable for auditing.

Practical Example: CI/CD Pipeline ServiceAccount

Here is a complete, production-ready example: a ServiceAccount for a CI/CD pipeline (like ArgoCD or a GitHub Actions runner) that can manage Deployments and Services in the staging namespace — and nothing else.

yaml
# 1. Create a dedicated ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: staging
---
# 2. Define the minimum required permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-deploy-role
  namespace: staging
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments/status"]
    verbs: ["get"]
---
# 3. Bind the Role to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deploy-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: staging
roleRef:
  kind: Role
  name: ci-deploy-role
  apiGroup: rbac.authorization.k8s.io

After applying this, verify the permissions are correct:

bash
# Apply the manifests
kubectl apply -f ci-deployer-rbac.yaml

# Verify: can it create deployments in staging?
kubectl auth can-i create deployments \
  --namespace staging \
  --as system:serviceaccount:staging:ci-deployer
# yes

# Verify: can it delete secrets? (should be denied)
kubectl auth can-i delete secrets \
  --namespace staging \
  --as system:serviceaccount:staging:ci-deployer
# no

# Verify: can it create deployments in production? (should be denied)
kubectl auth can-i create deployments \
  --namespace production \
  --as system:serviceaccount:staging:ci-deployer
# no

# List all granted permissions
kubectl auth can-i --list \
  --namespace staging \
  --as system:serviceaccount:staging:ci-deployer

Built-in ClusterRoles Reference

Kubernetes ships with several default ClusterRoles. Understanding these helps you decide when to use a built-in role versus creating your own.

| ClusterRole | Permissions | When to Use |
| --- | --- | --- |
| cluster-admin | Full access to all resources in all namespaces. Equivalent to root. | Break-glass emergency access only. Never for day-to-day use. |
| admin | Full access within a namespace (Roles, RoleBindings, most resources). No access to ResourceQuotas or the namespace itself. | Namespace owners, team leads. |
| edit | Read/write access to most resources in a namespace. No access to Roles or RoleBindings. | Developers deploying applications. |
| view | Read-only access to most resources. No access to Secrets. | Observers, dashboard users, auditors. |

Reuse ClusterRoles with Namespace-scoped RoleBindings

Instead of creating per-namespace Roles, use a RoleBinding that references the built-in edit or view ClusterRole. For example, binding ClusterRole/edit via a RoleBinding in the dev namespace gives the dev team edit access in dev only — without needing a custom Role.
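In manifest form, that looks like the following sketch (the dev-team group name is illustrative):

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-edit
  namespace: dev
subjects:
  - kind: Group
    name: dev-team                  # illustrative group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole                 # references the built-in ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io

Because the RoleBinding itself is namespaced, the ClusterRole's permissions apply only inside dev, even though the role definition is cluster-scoped.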

Pod Security Standards and Pod Security Admission

For years, PodSecurityPolicy (PSP) was the built-in mechanism for controlling what pods could and couldn't do on a Kubernetes cluster. It was powerful but deeply flawed — difficult to reason about, impossible to dry-run, and full of surprising interactions with RBAC. In Kubernetes 1.21, PSP was officially deprecated. In Kubernetes 1.25, it was removed entirely.

Its replacement is a two-part system: Pod Security Standards (PSS), which define three tiered security profiles, and Pod Security Admission (PSA), a built-in admission controller that enforces those standards at the namespace level. Together, they give you a simpler, more predictable way to prevent dangerous pod configurations from running on your cluster.

Why PodSecurityPolicy Had to Go

PSP suffered from fundamental design problems that made it unreliable in practice. Understanding why it failed helps you appreciate what PSA does differently.

  • Confusing policy selection. When multiple PSPs existed, Kubernetes would silently pick one based on alphabetical ordering and mutation rules. Administrators couldn't easily predict which policy applied to a given pod.
  • Tight RBAC coupling. PSPs were authorized through RBAC bindings on the use verb. This meant the effective policy depended on the identity creating the pod — not on the namespace or workload. A Deployment created via a controller used the controller's ServiceAccount, not the user's, leading to subtle bypasses.
  • No dry-run or audit mode. You couldn't test a policy before enforcing it. Rolling out a new PSP to an existing cluster was a high-risk operation with no way to preview what would break.
  • Mutation side effects. PSPs could mutate pod specs (setting default seccomp profiles, dropping capabilities), which made debugging unpredictable and conflicted with GitOps workflows expecting immutable manifests.
Note

PSA intentionally does not mutate pods. It only validates. If a pod violates the policy, it's rejected, warned about, or logged — but never silently modified. This is a deliberate design choice that makes behavior predictable and audit trails trustworthy.

The Three Pod Security Standards

Pod Security Standards define three progressively restrictive security profiles. Each level's checks are a superset of the previous level's — Restricted includes every check Baseline makes, while Privileged makes no checks at all. Think of them as presets, not custom policies.

| Level | Intent | Typical Use Case |
|---|---|---|
| Privileged | Unrestricted. No checks applied. | System-level workloads like CNI plugins, storage drivers, and logging agents that require host access |
| Baseline | Prevents known privilege escalations while remaining broadly compatible | General application workloads — the sensible default for most namespaces |
| Restricted | Heavily locked-down, follows current pod hardening best practices | Security-sensitive workloads, multi-tenant environments, compliance-mandated clusters |

Privileged — The Escape Hatch

The Privileged level applies zero restrictions. Any pod configuration is allowed: privileged containers, host namespaces, host paths — everything. This exists because certain infrastructure components (CNI plugins like Calico, storage provisioners like Longhorn, monitoring agents like the node exporter) genuinely need elevated access to the host.

Apply this level only to dedicated infrastructure namespaces like kube-system, and keep those namespaces tightly controlled via RBAC. Never use Privileged as the default for application namespaces.
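As a sketch, explicitly pinning kube-system (or any similar infrastructure namespace) to the Privileged level looks like this:

bash
# Mark kube-system as Privileged — no PSA checks applied there
kubectl label namespace kube-system \
  pod-security.kubernetes.io/enforce=privileged \
  --overwrite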

Baseline — Sensible Defaults

Baseline blocks the most dangerous pod configurations while remaining compatible with the vast majority of application workloads. It prevents things like privileged containers, host networking, and dangerous capability additions, but still allows running as root and most volume types. Most off-the-shelf Helm charts and container images will pass Baseline without modification.

Restricted — Hardened Workloads

Restricted builds on Baseline and adds requirements that many existing images don't meet out of the box: pods must run as non-root, must drop ALL capabilities, must set a seccomp profile, and can only use a limited set of volume types. This is the level you want for multi-tenant clusters or environments with compliance requirements, but expect to update your Dockerfiles and pod specs to conform.

What Each Level Actually Checks

The specific checks at each level are defined in the Kubernetes documentation and are version-pinned. Here is a breakdown of the key controls and which level enforces them.

| Check | Privileged | Baseline | Restricted |
|---|---|---|---|
| HostProcess (Windows) | Allowed | Disallowed | Disallowed |
| Host namespaces (hostNetwork, hostPID, hostIPC) | Allowed | Disallowed | Disallowed |
| Privileged containers | Allowed | Disallowed | Disallowed |
| Capabilities | Any | Additions limited to a safe default set (CHOWN, NET_BIND_SERVICE, SETUID, and similar) | Must drop ALL; only NET_BIND_SERVICE may be added back |
| HostPath volumes | Allowed | Disallowed | Disallowed |
| Host ports | Allowed | Disallowed (or limited to a known list) | Disallowed |
| /proc mount type | Any | Must be Default | Must be Default |
| Seccomp profile | Any | Must not be Unconfined | Must be RuntimeDefault or Localhost |
| Sysctls | Any | Only safe set allowed | Only safe set allowed |
| Volume types | Any | Broad set (no hostPath) | Limited: configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret |
| runAsNonRoot | Not required | Not required | Must be true |
| Run as non-root user (UID) | Not required | Not required | runAsUser must not be 0 |
| allowPrivilegeEscalation | Allowed | Allowed | Must be false |

Enforcement Modes: Enforce, Audit, and Warn

PSA does not just have an on/off switch. It provides three modes that control what happens when a pod violates the configured security level. You can combine multiple modes on the same namespace, and this is in fact the recommended approach for gradual rollouts.

| Mode | Effect | Visibility |
|---|---|---|
| enforce | Rejects pods that violate the policy | API request fails with an error — the pod is never created |
| audit | Allows the pod but records the violation in the API server audit log | Visible only in audit logs — invisible to the user |
| warn | Allows the pod but sends a warning back to the API client | Displayed as a warning in kubectl output |

The key insight is that audit and warn let you preview what would break before you turn on enforcement. You can set enforce: baseline (to block the worst offenders now) while simultaneously setting warn: restricted and audit: restricted (to surface everything that is not yet hardened). Users see warnings in their terminal, and your security team can query the audit log for violations.

Applying PSA with Namespace Labels

PSA is configured entirely through labels on namespace objects. There are no separate policy resources to create — you just label the namespace. The label format is:

text
pod-security.kubernetes.io/<MODE>: <LEVEL>
pod-security.kubernetes.io/<MODE>-version: <VERSION>

Where MODE is enforce, audit, or warn; LEVEL is privileged, baseline, or restricted; and VERSION is a Kubernetes minor version like v1.30 or latest. Pinning a version ensures that policy checks do not change under your feet when you upgrade the cluster.

Here is a namespace configured to enforce Baseline and warn on Restricted:

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.30
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.30
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.30

You can also apply labels imperatively with kubectl:

bash
# Enforce baseline, warn and audit on restricted
kubectl label namespace my-app \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

What Happens When a Pod Violates the Policy

Let's see PSA in action. Suppose the my-app namespace enforces baseline. If someone tries to create a privileged container, the request is rejected:

yaml
# This pod will be REJECTED in a baseline-enforced namespace
apiVersion: v1
kind: Pod
metadata:
  name: dangerous-pod
  namespace: my-app
spec:
  hostNetwork: true
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        privileged: true
text
Error from server (Forbidden): pods "dangerous-pod" is forbidden:
  violates PodSecurity "baseline:v1.30":
    host namespaces (hostNetwork=true),
    privileged (container "app" must not set securityContext.privileged=true)

Notice the error message is specific — it tells you exactly which checks failed and which container caused the violation. With warn mode on the same namespace, the pod would be created but you would see a warning in the kubectl output:

text
Warning: would violate PodSecurity "restricted:v1.30":
  allowPrivilegeEscalation != false (container "app" ...),
  unrestricted capabilities (container "app" ...),
  runAsNonRoot != true (pod or container "app" ...),
  seccompProfile (pod or container "app" ...)

Writing Pods That Pass the Restricted Level

The Restricted level is where most teams hit friction. It requires explicit security context settings that many container images and Helm charts do not include by default. Here is a pod spec that satisfies every Restricted check:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
  namespace: my-app
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: my-registry/app:v2.1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}

Let's break down each requirement and why it matters:

  • runAsNonRoot: true — Ensures the container process does not run as UID 0. Your container image must define a USER directive in the Dockerfile, or the pod spec must set runAsUser to a non-zero UID.
  • allowPrivilegeEscalation: false — Prevents child processes from gaining more privileges than the parent via setuid binaries or filesystem capabilities.
  • capabilities.drop: ["ALL"] — Drops all Linux capabilities. You can selectively add back NET_BIND_SERVICE if the container needs to bind to ports below 1024.
  • seccompProfile.type: RuntimeDefault — Applies the container runtime's default seccomp filter, which blocks around 40-60 dangerous syscalls while allowing normal application behavior.
  • readOnlyRootFilesystem: true — Not strictly required by Restricted, but a strong best practice. Use emptyDir mounts for directories that need writes (like /tmp).
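The runAsNonRoot requirement usually means touching the image itself. A minimal Dockerfile sketch (the binary name and UID 10001 are illustrative):

dockerfile
FROM alpine:3.20
# -S creates a system user in BusyBox adduser; -u sets an explicit UID
RUN adduser -S -u 10001 appuser
COPY app /usr/local/bin/app
# Switching to a numeric UID lets runAsNonRoot be verified at admission time
USER 10001
ENTRYPOINT ["/usr/local/bin/app"]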
Tip

Set pod-level securityContext for settings that apply to all containers (runAsNonRoot, seccompProfile), and container-level securityContext for per-container settings (allowPrivilegeEscalation, capabilities). If both are set, the container-level value takes precedence.
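A minimal sketch of that layering — pod-level defaults with one container overriding them (names and UIDs are illustrative):

yaml
apiVersion: v1
kind: Pod
metadata:
  name: context-layering-demo
spec:
  securityContext:              # applies to every container below
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: my-registry/app:v2.1.0
      # inherits runAsUser: 10001 from the pod level
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
    - name: sidecar
      image: my-registry/sidecar:v1.0.0
      securityContext:
        runAsUser: 10002        # container-level value wins over pod-level 10001
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]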

Dry-Running PSA Checks with kubectl

Before changing enforcement labels on a live namespace, you can test what would fail. The --dry-run=server flag processes the request through admission but does not persist the object. Combined with warn mode, this lets you check any pod spec against a PSA level without risk.

bash
# Dry-run: will this pod be accepted in a restricted namespace?
kubectl apply -f secure-app.yaml --dry-run=server

# Check all existing pods in a namespace against restricted level
kubectl label --dry-run=server --overwrite namespace my-app \
  pod-security.kubernetes.io/enforce=restricted

The second command is especially powerful. When you label a namespace with --dry-run=server, Kubernetes evaluates all existing pods against the new level and returns warnings for any violations — without actually changing the namespace label. This is how you safely audit before flipping the switch.

Migrating from PodSecurityPolicy to PSA

If you are running a cluster that used PSPs, migrating to PSA requires a methodical approach. The two systems are fundamentally different — PSPs are cluster-scoped resources bound via RBAC, while PSA is namespace-scoped via labels. There is no automatic conversion.

Step 1: Audit Your Existing PSPs

Start by understanding what your current PSPs actually allow. Map each PSP to the closest PSS level:

bash
# List all PSPs and their key settings
kubectl get psp -o custom-columns=\
NAME:.metadata.name,\
PRIV:.spec.privileged,\
HOST_NET:.spec.hostNetwork,\
HOST_PID:.spec.hostPID,\
RUN_AS:.spec.runAsUser.rule,\
VOLUMES:.spec.volumes

Step 2: Label Namespaces with Warn and Audit First

Do not jump to enforcement. Start by adding warn and audit labels to every namespace at the level you intend to enforce. Let the warnings accumulate for a week or two.

bash
# Phase 1: warn and audit only — nothing is blocked
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl label namespace "$ns" \
    pod-security.kubernetes.io/warn=baseline \
    pod-security.kubernetes.io/audit=baseline \
    --overwrite
done

Step 3: Fix Violations in Workloads

Review the warnings in kubectl output and the violations in audit logs. Update your Deployments, StatefulSets, and DaemonSets to pass the target level. Common fixes include removing hostNetwork: true, dropping capabilities, adding seccompProfile, and ensuring containers do not run as root.
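Those fixes land in the pod template of each workload. A hardened-template sketch (the container name and image are illustrative):

yaml
# Excerpt of a Deployment's pod template after hardening for baseline/restricted
spec:
  template:
    spec:
      # removed: hostNetwork: true
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: worker                      # illustrative name
          image: registry.example.com/worker:v3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]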

Step 4: Enable Enforcement

Once violations are resolved, add the enforce label. If you are aiming for Restricted long-term, a phased approach works well:

yaml
# Phase 2: enforce baseline, warn+audit on restricted
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.30
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
---
# Phase 3 (later): enforce restricted
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.30

Step 5: Remove PSP Resources

After enforcement is active and stable, clean up the old PSP resources:

bash
# Delete all PodSecurityPolicies — do this BEFORE upgrading to 1.25,
# where the PSP API is removed entirely and these objects become inaccessible
kubectl delete psp --all

# Remove RBAC bindings that referenced PSPs
kubectl delete clusterrole psp:privileged psp:restricted --ignore-not-found
kubectl delete clusterrolebinding psp:privileged psp:restricted --ignore-not-found

Cluster-Wide Defaults with the Admission Configuration

Namespace labels give you per-namespace control, but you can also configure cluster-wide defaults and exemptions by passing an AdmissionConfiguration file to the API server. This lets you define a default level for all namespaces, exempt specific users or namespaces, and avoid the need to label every namespace individually.

yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: baseline
        enforce-version: latest
        warn: restricted
        warn-version: latest
        audit: restricted
        audit-version: latest
      exemptions:
        usernames: []
        runtimeClasses: []
        namespaces:
          - kube-system
          - kube-node-lease
          - cert-manager

This configuration is passed to the kube-apiserver via the --admission-control-config-file flag. Namespaces listed under exemptions.namespaces bypass PSA checks entirely, which is how you carve out space for infrastructure workloads that need elevated privileges.
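On kubeadm-managed clusters (an assumption — file locations vary by distribution), wiring that file in means editing the API server's static pod manifest so the flag is set and the config directory is mounted:

yaml
# Excerpt from /etc/kubernetes/manifests/kube-apiserver.yaml (sketch)
spec:
  containers:
    - command:
        - kube-apiserver
        - --admission-control-config-file=/etc/kubernetes/psa/admission.yaml
        # ...existing flags unchanged...
      volumeMounts:
        - name: psa-config
          mountPath: /etc/kubernetes/psa
          readOnly: true
  volumes:
    - name: psa-config
      hostPath:
        path: /etc/kubernetes/psa
        type: Directory

The kubelet restarts the static pod automatically when the manifest file changes.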

Warning

PSA only evaluates pods at creation time. It does not retroactively evict existing pods when you change a namespace label. If you tighten enforcement on a namespace, existing non-compliant pods continue running until they are restarted or rescheduled. Always check running workloads with kubectl label --dry-run=server before assuming compliance.
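Because of this, a rollout restart is the usual way to force re-evaluation after tightening a level (the namespace name is illustrative):

bash
# Recreate pods so the new enforcement level applies to them
kubectl rollout restart deployment -n my-app
kubectl rollout restart statefulset -n my-app

# Any pod that no longer passes the enforced level will fail to be
# recreated — watch for stuck rollouts
kubectl get pods -n my-app --watch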

PSA Limitations and When You Need More

Pod Security Admission is intentionally simple. It covers the most impactful pod-level security controls, but it does not handle everything. Understanding its boundaries helps you decide when to supplement it with additional tools.

  • No fine-grained policies. You cannot say "allow hostNetwork only for pods with label X" or "restrict image registries." It is all-or-nothing within a level.
  • No mutation. PSA will not auto-inject security contexts or seccomp profiles. If you want defaulting behavior, use a mutating webhook.
  • Namespace-scoped only. There is no way to target individual workloads within a namespace — the level applies to every pod in that namespace.
  • No image policy. PSA does not validate image signatures, enforce registry allowlists, or check for CVEs.

For more granular control, consider policy engines like Kyverno or OPA Gatekeeper, which can express arbitrary admission rules via custom policies. These complement PSA well — use PSA as the baseline floor, and layer custom policies on top for organization-specific rules. The next section on Admission Controllers and Dynamic Webhooks covers exactly how these tools integrate with the admission pipeline.
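As a taste of what such an engine adds, here is a sketch of a Kyverno ClusterPolicy enforcing a registry allowlist — a rule PSA cannot express (assumes Kyverno is installed; the registry name is illustrative):

yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: allowed-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.example.com"
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"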

Admission Controllers and Dynamic Webhooks

Every request to the Kubernetes API server passes through a pipeline before any object is persisted to etcd. After authentication confirms who you are and authorization confirms what you can do, the request enters a third stage: admission control. This is where the cluster enforces policies, injects defaults, and applies guardrails that RBAC alone cannot express.

Admission controllers are plugins compiled into the API server binary. They intercept requests after authN/authZ but before the object is written to storage. Some are mutating (they modify the request), some are validating (they approve or reject it), and some do both. Understanding this pipeline is essential for anyone operating production clusters or building platform engineering tooling.

The Admission Pipeline

The diagram below shows the full lifecycle of an API request. Notice that mutating webhooks run before validating webhooks — this ordering is intentional. Mutations happen first so that validators see the final version of the object. If any stage rejects the request, the entire operation fails and nothing is written to etcd.

sequenceDiagram
    participant Client as kubectl / Client
    participant Auth as Authentication
    participant Authz as Authorization
    participant MW as Mutating Admission
    participant SV as Schema Validation
    participant VW as Validating Admission
    participant etcd as etcd

    Client->>Auth: API Request
    Auth->>Auth: Verify identity (certs, tokens)
    alt Auth fails
        Auth-->>Client: 401 Unauthorized
    end
    Auth->>Authz: Authenticated request
    Authz->>Authz: Check RBAC policies
    alt Authz fails
        Authz-->>Client: 403 Forbidden
    end
    Authz->>MW: Authorized request
    MW->>MW: Mutate (inject defaults, sidecars, labels)
    alt Mutation webhook rejects
        MW-->>Client: 400/500 Rejected
    end
    MW->>SV: Mutated object
    SV->>SV: Validate against OpenAPI schema
    alt Schema validation fails
        SV-->>Client: 422 Unprocessable
    end
    SV->>VW: Schema-valid object
    VW->>VW: Validate policies (image registry, labels)
    alt Validation webhook rejects
        VW-->>Client: 403 Denied by policy
    end
    VW->>etcd: Persist object
    etcd-->>Client: 200 OK / 201 Created
        

Built-in Admission Controllers

Kubernetes ships with dozens of admission controllers, and the API server enables a recommended set by default. You can see the active controllers with kube-apiserver --help | grep enable-admission-plugins. Here are the most important ones you should understand:

| Controller | Type | Purpose |
|---|---|---|
| NamespaceLifecycle | Validating | Prevents creating objects in namespaces that are being terminated, and rejects requests to delete the default, kube-system, and kube-public namespaces. |
| LimitRanger | Mutating + Validating | Enforces LimitRange objects — injects default CPU/memory requests and limits into Pods that don't specify them, and rejects Pods that exceed range constraints. |
| ResourceQuota | Validating | Tracks and enforces resource consumption against ResourceQuota objects. Rejects requests that would cause a namespace to exceed its quota. |
| ServiceAccount | Mutating | Automatically assigns the default ServiceAccount to Pods that don't specify one, and mounts the corresponding API token. |
| DefaultStorageClass | Mutating | Assigns the default StorageClass to PersistentVolumeClaim objects that don't request a specific class. |
| PodSecurity | Validating | Enforces Pod Security Standards (Privileged, Baseline, Restricted) at the namespace level. Replaced the deprecated PodSecurityPolicy. |
| MutatingAdmissionWebhook | Mutating | Calls external webhook services to modify incoming objects. This is the gateway for dynamic mutation — sidecar injection, label addition, default overrides. |
| ValidatingAdmissionWebhook | Validating | Calls external webhook services to approve or deny requests. This is the gateway for dynamic policy enforcement without recompiling the API server. |
Note

The last two controllers in the table — MutatingAdmissionWebhook and ValidatingAdmissionWebhook — are the built-in controllers that dispatch to your custom webhook services. They are the bridge between the static admission pipeline and your dynamic, external logic.

Dynamic Admission Webhooks

Built-in controllers cover common defaults, but real-world clusters need custom policies: "all images must come from our private registry," "every Deployment must have an owner label," or "inject an Envoy sidecar into every Pod in the mesh namespace." Dynamic webhooks let you implement these rules as external HTTPS services and register them with the API server — no recompilation required.

MutatingWebhookConfiguration

A mutating webhook intercepts API requests and modifies the object before it continues through the pipeline. The webhook receives the object as JSON, computes the desired changes, and returns a base64-encoded JSON Patch in its response; the API server applies that patch to the original request. Classic use cases include sidecar injection (Istio, Linkerd), adding default labels or annotations, and setting security context defaults.

yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: sidecar.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: sidecar-injector
        namespace: webhook-system
        path: "/inject"
      caBundle: LS0tLS1C...  # Base64-encoded CA cert
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        sidecar-injection: enabled
    failurePolicy: Ignore
    timeoutSeconds: 5

The key fields to understand: rules determines which API requests trigger the webhook (here, only Pod CREATE operations). The namespaceSelector restricts the webhook to namespaces with a specific label, which prevents it from intercepting system Pods. The clientConfig tells the API server where to send the admission review — either a service reference (for in-cluster webhooks) or a url (for external endpoints).
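For reference, the response body a mutating webhook sends back looks like the sketch below — the patch field carries a base64-encoded JSON Patch (shown decoded in the placeholder rather than as literal base64):

text
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<uid copied from the request>",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "<base64 of: [{\"op\":\"add\",\"path\":\"/metadata/labels/injected\",\"value\":\"true\"}]>"
  }
}

The uid must echo the request's uid exactly, or the API server rejects the response.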

ValidatingWebhookConfiguration

A validating webhook cannot modify objects — it can only approve or deny them. It receives the final (post-mutation) version of the object and returns an allowed: true or allowed: false response with an optional message. This is the right tool for enforcing organizational policies: image source restrictions, required labels, prohibited configurations, and resource naming conventions.

yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy
webhooks:
  - name: image-policy.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: image-policy-webhook
        namespace: webhook-system
        path: "/validate"
      caBundle: LS0tLS1C...
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
    failurePolicy: Fail
    timeoutSeconds: 10
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "webhook-system"]

Webhook Configuration Details

Failure Policy: Fail vs. Ignore

The failurePolicy field controls what happens when the webhook is unreachable or returns an error (not a clean "deny," but an actual failure like a timeout or 500 response). You have two choices, and the right one depends on the webhook's purpose:

| Policy | Behavior on Failure | Best For |
|---|---|---|
| Fail | The API request is rejected. Nothing gets through if the webhook is down. | Security-critical validations (image policies, compliance checks). You would rather block all changes than risk a policy bypass. |
| Ignore | The API request is allowed through as if the webhook didn't exist. | Non-critical mutations (label injection, observability sidecars). Availability matters more than enforcement. |
Warning

A webhook with failurePolicy: Fail that becomes unavailable will block all matching API requests cluster-wide. If your webhook matches Pod creations in kube-system, a webhook outage can prevent critical system Pods from starting — cascading into a full cluster failure. Always exclude system namespaces with namespaceSelector, and keep webhook timeout values low (5–10 seconds).

Timeout and Matching Configuration

The timeoutSeconds field (default: 10, max: 30) controls how long the API server waits for a webhook response. Mutating webhooks run sequentially (each sees the output of the previous one), so their timeouts are additive. Validating webhooks run in parallel, so only the slowest one matters. Keep timeouts low — webhook latency directly adds to every API call that matches.

Use namespaceSelector and objectSelector to precisely target the resources your webhook cares about. The namespaceSelector matches labels on the namespace of the target object, while objectSelector matches labels on the object itself. Combine both to minimize unnecessary webhook invocations.

yaml
# Only match Pods that have the "validate: true" label
# in namespaces labeled "environment: production"
objectSelector:
  matchLabels:
    validate: "true"
namespaceSelector:
  matchLabels:
    environment: production
# Also supports matchExpressions for exclusions:
# namespaceSelector:
#   matchExpressions:
#     - key: kubernetes.io/metadata.name
#       operator: NotIn
#       values: ["kube-system", "kube-node-lease"]

Building a Validating Webhook — Practical Example

Let's build a validating webhook that enforces a simple policy: every Deployment must have a team label. This is a common organizational requirement for tracking ownership. The webhook is a small HTTPS server that receives AdmissionReview requests from the API server and responds with allow/deny decisions.

Step 1: The Webhook Server

The webhook server is a standard HTTP handler that parses the AdmissionReview request, inspects the object, and returns a response. Here is a minimal Go implementation:

go
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func validateDeployment(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "failed to read request body", http.StatusBadRequest)
        return
    }

    var review admissionv1.AdmissionReview
    if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
        http.Error(w, "invalid AdmissionReview request", http.StatusBadRequest)
        return
    }

    var deployment appsv1.Deployment
    if err := json.Unmarshal(review.Request.Object.Raw, &deployment); err != nil {
        http.Error(w, "invalid Deployment object", http.StatusBadRequest)
        return
    }

    allowed := true
    message := "Deployment is valid"

    if _, ok := deployment.Labels["team"]; !ok {
        allowed = false
        message = fmt.Sprintf(
            "Deployment %q denied: missing required label 'team'",
            deployment.Name,
        )
    }

    // The response must echo the request UID, or the API server rejects it
    review.Response = &admissionv1.AdmissionResponse{
        UID:     review.Request.UID,
        Allowed: allowed,
        Result:  &metav1.Status{Message: message},
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(review)
}

func main() {
    http.HandleFunc("/validate", validateDeployment)
    log.Println("Webhook server listening on :8443")
    log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}

Step 2: TLS Certificates

Webhooks must be served over HTTPS — the API server will not call HTTP endpoints. For production, use cert-manager to automatically provision and rotate certificates. For development, you can generate self-signed certs:

bash
# Generate a CA and signed certificate for the webhook service
openssl genrsa -out ca.key 2048
openssl req -x509 -new -key ca.key -days 365 -out ca.crt \
  -subj "/CN=webhook-ca"

openssl genrsa -out tls.key 2048
openssl req -new -key tls.key \
  -subj "/CN=label-validator.webhook-system.svc" | \
  openssl x509 -req -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -out tls.crt \
  -extfile <(echo "subjectAltName=DNS:label-validator.webhook-system.svc")

# Create the TLS secret in the webhook namespace
kubectl create namespace webhook-system
kubectl -n webhook-system create secret tls webhook-certs \
  --cert=tls.crt --key=tls.key

# Base64-encode the CA for the webhook configuration
export CA_BUNDLE=$(cat ca.crt | base64 | tr -d '\n')
echo "caBundle: $CA_BUNDLE"
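For production, the cert-manager equivalent is declarative — a sketch assuming cert-manager is installed (resource names are illustrative):

yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: webhook-selfsigned
  namespace: webhook-system
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webhook-certs
  namespace: webhook-system
spec:
  secretName: webhook-certs        # matches the Secret the webhook Deployment mounts
  dnsNames:
    - label-validator.webhook-system.svc
  issuerRef:
    name: webhook-selfsigned
    kind: Issuer

cert-manager's CA injector can then populate caBundle automatically via the cert-manager.io/inject-ca-from annotation on the webhook configuration, removing the manual base64 step entirely.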

Step 3: Deploy the Webhook and Register It

Package the Go server into a container image, deploy it as a Kubernetes Service, and register the ValidatingWebhookConfiguration with the API server.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: label-validator
  namespace: webhook-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: label-validator
  template:
    metadata:
      labels:
        app: label-validator
    spec:
      containers:
        - name: webhook
          image: registry.example.com/label-validator:v1
          ports:
            - containerPort: 8443
          volumeMounts:
            - name: certs
              mountPath: /certs
              readOnly: true
      volumes:
        - name: certs
          secret:
            secretName: webhook-certs
---
apiVersion: v1
kind: Service
metadata:
  name: label-validator
  namespace: webhook-system
spec:
  selector:
    app: label-validator
  ports:
    - port: 443
      targetPort: 8443
yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: label-validator
webhooks:
  - name: label-validator.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: label-validator
        namespace: webhook-system
        path: "/validate"
      caBundle: ${CA_BUNDLE}  # Replace with base64-encoded CA cert
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
    failurePolicy: Fail
    timeoutSeconds: 5
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "kube-node-lease", "webhook-system"]

Step 4: Test the Webhook

Create a Deployment without the team label and verify that the webhook rejects it:

bash
# This should be REJECTED — no "team" label
kubectl create deployment nginx-bad --image=nginx
# Error: admission webhook "label-validator.example.com" denied the request:
# Deployment "nginx-bad" denied: missing required label 'team'

# This should SUCCEED — "team" label present
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-good
  labels:
    team: platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
EOF
# deployment.apps/nginx-good created

ValidatingAdmissionPolicy — CEL-Based Validation Without Webhooks

Dynamic webhooks are powerful but operationally expensive. You need to build, deploy, and maintain an HTTPS service with TLS certificates, handle high availability, and worry about latency. Introduced as alpha in Kubernetes 1.26, promoted to beta in 1.28, and GA since 1.30, ValidatingAdmissionPolicy lets you write validation rules directly as CEL (Common Expression Language) expressions — no external webhook needed.

CEL expressions run inside the API server process itself. This eliminates network hops, TLS management, and the risk of webhook unavailability. For straightforward validation rules (required labels, image prefix checks, resource limit enforcement), ValidatingAdmissionPolicy is the better choice.

yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
      message: "All Deployments must have a 'team' label"
    - expression: "object.metadata.labels['team'].size() > 0"
      message: "The 'team' label must not be empty"

A policy on its own does nothing — you need a ValidatingAdmissionPolicyBinding to activate it and specify which resources or namespaces it applies to:

yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label-binding
spec:
  policyName: require-team-label
  validationActions:
    - Deny     # Reject non-compliant requests
  matchResources:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "kube-node-lease"]

You can also set validationActions to [Warn] during rollout to log violations without blocking requests — useful for gradually introducing a new policy. The Audit action records violations in the API server audit log without any user-facing warning.
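For a staged rollout of the policy above, a second binding can apply the non-blocking actions first; a sketch (the binding name is illustrative):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label-rollout
spec:
  policyName: require-team-label
  validationActions:
    - Warn     # Surface violations to clients without blocking
    - Audit    # Also record them in the API server audit log
```

Once the Warn output shows no remaining violations, swap the actions to Deny.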

CEL Expression Examples

CEL is concise but expressive. Here are common patterns you can use in validations[].expression:

Policy                                | CEL Expression
Image must come from private registry | object.spec.template.spec.containers.all(c, c.image.startsWith('registry.example.com/'))
Replicas must be at least 2           | object.spec.replicas >= 2
Must not use latest tag               | object.spec.template.spec.containers.all(c, !c.image.endsWith(':latest'))
Memory limit is required              | object.spec.template.spec.containers.all(c, has(c.resources.limits) && has(c.resources.limits.memory))
No hostNetwork                        | !has(object.spec.template.spec.hostNetwork) || object.spec.template.spec.hostNetwork == false
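As a sketch, the first two patterns drop into a policy's validations list like this (the registry prefix is illustrative):

```yaml
validations:
  - expression: >-
      object.spec.template.spec.containers.all(c,
      c.image.startsWith('registry.example.com/'))
    message: "All container images must come from registry.example.com"
  - expression: "object.spec.replicas >= 2"
    message: "Deployments must run at least 2 replicas"
```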

Webhooks vs. ValidatingAdmissionPolicy — When to Use Which

With two mechanisms for custom validation, how do you choose? The decision hinges on complexity and operational maturity:

Criteria           | ValidatingAdmissionPolicy (CEL)                                | Dynamic Webhooks
Complexity         | Simple field checks, label/annotation rules, value constraints | Complex logic: external lookups, cross-resource validation, stateful decisions
Operational cost   | Zero — runs in the API server                                  | High — deploy, maintain, and monitor an HTTPS service
Latency            | Microseconds (in-process)                                      | Milliseconds to seconds (network hop + TLS)
Availability risk  | None — always available with the API server                    | Webhook outage can block the cluster if failurePolicy: Fail
Mutation support   | No — validation only                                           | Yes — mutating webhooks can modify objects
Kubernetes version | 1.28+ (beta), 1.30+ (GA)                                       | 1.16+ (stable)

Tip

Start with ValidatingAdmissionPolicy for pure validation rules. Reserve webhooks for mutations (sidecar injection, defaulting) and complex validations that need external data or multi-resource checks. If you are running Kubernetes 1.30+, CEL-based policies should be your default choice for validation.

Policy Engines: OPA/Gatekeeper and Kyverno

Building individual webhooks per policy doesn't scale. Policy engines provide a framework for managing many policies declaratively — they handle the webhook infrastructure and let you focus on writing rules.

OPA Gatekeeper uses Open Policy Agent with the Rego language. You define ConstraintTemplate resources (parameterized policy templates) and Constraint resources (instances of those templates applied to specific resources). Gatekeeper runs as a validating webhook and caches replicated cluster data for cross-resource checks.
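A minimal required-label pairing, sketched against the Gatekeeper schema (template and constraint names are illustrative):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg}] {
          required := input.parameters.labels[_]
          not input.review.object.metadata.labels[required]
          msg := sprintf("missing required label: %v", [required])
        }
---
# A Constraint instantiates the template against specific resources
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-team
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["team"]
```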

Kyverno takes a Kubernetes-native approach — policies are YAML resources with no separate language to learn. It supports both validation and mutation, can generate resources, and integrates with the Kubernetes API directly. Kyverno policies use familiar patterns like match/exclude blocks and JSON patches for mutations.

Both engines are CNCF projects (Gatekeeper is part of OPA, a graduated project; Kyverno is an incubating project). For teams that need dozens of admission policies and want auditing, dry-run modes, and centralized reporting, a policy engine is far more practical than hand-rolling webhooks.

Cluster Hardening and Security Best Practices

Kubernetes clusters expose a large attack surface by default. The API server accepts requests, etcd stores every secret in the cluster, kubelets run arbitrary containers on nodes, and the flat pod network lets any workload talk to any other. Hardening is the process of systematically reducing that surface — disabling what you don't need, encrypting what you can't disable, and restricting everything else to least privilege.

The CIS Kubernetes Benchmark is the industry-standard checklist for cluster security. It covers the control plane, worker nodes, policies, and managed services with specific, auditable recommendations. Every hardening measure in this section maps to one or more CIS controls. Tools like kube-bench can automatically audit your cluster against the benchmark and flag gaps.

Defense in Depth

No single control protects a cluster. Security is layered: API authentication stops unauthorized users, RBAC limits what authorized users can do, network policies restrict pod communication, runtime security constrains what containers can execute, and audit logging records what actually happened. Each layer compensates for failures in the others.

API Server Hardening

The API server is the front door to your cluster. Every kubectl command, every controller reconciliation loop, and every kubelet heartbeat passes through it. If an attacker gains unrestricted API access, they effectively control the entire cluster. Hardening the API server means limiting who can reach it, how they authenticate, and what gets logged.

Start with three critical flags. Disable anonymous authentication so every request must present valid credentials. Enable audit logging so you have a forensic trail of every API call. And restrict which encryption ciphers the API server accepts — older TLS cipher suites have known vulnerabilities.

yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (static pod manifest)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Disable anonymous authentication (CIS 1.2.1)
    - --anonymous-auth=false
    # Enable audit logging (CIS 1.2.22–1.2.25)
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    - --audit-log-path=/var/log/kubernetes/audit.log
    - --audit-log-maxage=30
    - --audit-log-maxbackup=10
    - --audit-log-maxsize=100
    # Restrict TLS ciphers (CIS 1.2.31)
    - --tls-min-version=VersionTLS12
    - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    # The insecure port was removed in Kubernetes 1.24; on older
    # versions, verify it is disabled with --insecure-port=0
    # Enable RBAC and Node authorization
    - --authorization-mode=Node,RBAC
    # Set request timeout
    - --request-timeout=300s

Audit Policy

An audit policy tells the API server what to log and at what detail level. The four levels are None, Metadata, Request, and RequestResponse. Logging every request body is expensive, so a good policy logs metadata for most resources and full request/response only for sensitive operations like Secrets access and RBAC changes.

yaml
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Skip read-only requests to endpoints or events (high volume, low value)
  - level: None
    resources:
    - group: ""
      resources: ["endpoints", "events"]
    verbs: ["get", "list", "watch"]

  # Log full request+response for Secrets (sensitive data access)
  - level: RequestResponse
    resources:
    - group: ""
      resources: ["secrets"]

  # Log full request+response for RBAC changes
  - level: RequestResponse
    resources:
    - group: "rbac.authorization.k8s.io"
      resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]

  # Log request body for all write operations
  - level: Request
    verbs: ["create", "update", "patch", "delete"]

  # Log metadata for everything else
  - level: Metadata

Restricting API Access with Firewall Rules

Beyond authentication, limit network-level access to the API server. On cloud providers, use security groups or firewall rules to restrict port 6443 to known CIDR ranges — your office network, VPN endpoints, and CI/CD runner IPs. On bare metal, use iptables or firewall rules on the control plane nodes.

bash
# Restrict API server access to specific CIDR ranges (iptables example)
iptables -A INPUT -p tcp --dport 6443 -s 10.0.0.0/16 -j ACCEPT   # Pod/node network
iptables -A INPUT -p tcp --dport 6443 -s 192.168.1.0/24 -j ACCEPT # Admin VPN
iptables -A INPUT -p tcp --dport 6443 -j DROP                      # Drop all other

# On GKE, use master-authorized-networks
gcloud container clusters update my-cluster \
  --enable-master-authorized-networks \
  --master-authorized-networks 203.0.113.0/24,198.51.100.0/24

Securing etcd

etcd is the cluster's brain — it stores every object including Secrets, ConfigMaps, RBAC rules, and service account tokens. If an attacker reads etcd directly, they bypass all Kubernetes authorization. Two protections are essential: encrypt data at rest so raw etcd snapshots are useless, and enforce mTLS so only authenticated clients (the API server) can connect.

Encrypting Data at Rest

By default, Kubernetes stores Secrets in etcd as base64-encoded plaintext. Anyone with access to the etcd data directory or a backup can read every secret in the cluster. An EncryptionConfiguration file, passed to the API server at startup, tells it to encrypt specified resources before writing them to etcd.

yaml
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
    - secrets
    - configmaps
    providers:
    # aescbc encrypts with a local key — simple but you manage rotation
    - aescbc:
        keys:
        - name: key-2024
          # Must be 32 random bytes, base64-encoded.
          # Generate with: head -c 32 /dev/urandom | base64
          secret: <base64-encoded-32-byte-key>
    # identity is the fallback — allows reading unencrypted data
    - identity: {}

Pass this configuration to the API server with --encryption-provider-config=/etc/kubernetes/encryption-config.yaml. After enabling encryption, re-encrypt existing Secrets by reading and writing them back:

bash
# Re-encrypt all existing secrets with the new encryption key
kubectl get secrets --all-namespaces -o json | kubectl replace -f -

# Verify a secret is encrypted in etcd (run on control plane node)
ETCDCTL_API=3 etcdctl get /registry/secrets/default/my-secret \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key | hexdump -C | head

etcd mTLS and Access Restriction

etcd should never be exposed to anything except the API server. Configure mTLS so etcd requires client certificates for every connection, and bind etcd to a private interface — not 0.0.0.0. The CIS Benchmark (Section 2) explicitly requires that etcd's client and peer communication use TLS with valid certificates.

yaml
# etcd static pod manifest — TLS flags (CIS 2.1–2.6)
spec:
  containers:
  - name: etcd
    command:
    - etcd
    # Client-server TLS (CIS 2.1, 2.2)
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --client-cert-auth=true
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    # Peer TLS (CIS 2.4, 2.5)
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-client-cert-auth=true
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    # Bind to private interface only
    - --listen-client-urls=https://10.0.1.10:2379
    - --listen-peer-urls=https://10.0.1.10:2380

Kubelet Security

The kubelet runs on every node and has the power to create, destroy, and inspect containers. It exposes an HTTPS API (port 10250) that can return container logs, execute commands inside pods, and list running workloads. If anonymous access is enabled — which it is by default on many installations — anyone who can reach a node's IP can exploit this API.

yaml
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Disable anonymous access (CIS 4.2.1)
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true    # Delegate auth to the API server
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
# Use Webhook authorization, not AlwaysAllow (CIS 4.2.2)
authorization:
  mode: Webhook
# Disable read-only port (CIS 4.2.4)
readOnlyPort: 0
# Rotate kubelet certificates automatically
rotateCertificates: true
# Protect kernel defaults
protectKernelDefaults: true
# Limit the rate at which the kubelet creates events (per second)
eventRecordQPS: 5

The Webhook authorization mode is critical: it makes the kubelet ask the API server "is this caller allowed to do this?" for every request. Combined with the Node authorizer on the API server side, this ensures kubelets can only access the Secrets, ConfigMaps, and PersistentVolumes bound to pods scheduled on their node — not resources belonging to other nodes.

Don't Overlook the Read-Only Port

The kubelet's read-only port (10255) serves metrics and pod listings without any authentication. It is often left enabled for legacy monitoring setups. Set readOnlyPort: 0 and migrate monitoring to the authenticated port 10250, or use the /metrics endpoint with proper bearer token authentication.

Network-Level Security

By default, every pod in a Kubernetes cluster can communicate with every other pod — across namespaces, across nodes, no restrictions. This flat networking model is great for getting started, but in production it means a compromised pod in the frontend namespace can directly reach your database pods in backend. Network policies let you enforce segmentation at the cluster level.

Default Deny Policy

The most impactful single security measure for networking is a default-deny ingress policy in every namespace. This inverts the model: instead of "everything allowed unless denied," you get "nothing allowed unless explicitly permitted." Apply this first, then add targeted allow rules.

yaml
# Default deny all ingress traffic in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}    # Selects ALL pods in this namespace
  policyTypes:
  - Ingress          # No ingress rules = deny all inbound
---
# Allow frontend pods to receive traffic on port 8080 only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

Blocking Cloud Metadata Service Access

On AWS, GCP, and Azure, the instance metadata service at 169.254.169.254 is a high-value target. A compromised pod can query it to steal node IAM credentials, read instance identity tokens, or discover internal network topology. Block this with a NetworkPolicy that denies egress to the metadata IP.

yaml
# Block access to cloud metadata service from all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-service
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  # Allow DNS resolution (required for most workloads)
  - to:
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow all egress EXCEPT the metadata service
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32

Service Mesh mTLS

NetworkPolicies operate at L3/L4 — they filter by IP and port but cannot inspect or encrypt traffic. A service mesh like Istio, Linkerd, or Cilium adds L7 policy enforcement and automatic mTLS between pods. Every pod-to-pod connection is encrypted and authenticated by the mesh sidecar, so even if an attacker is on the pod network, they cannot eavesdrop on or impersonate legitimate services.

Mechanism             | Layer | Encryption  | Identity Verification      | Complexity
NetworkPolicy         | L3/L4 | None        | IP/label-based             | Low
Service mesh mTLS     | L4/L7 | TLS 1.2/1.3 | Certificate-based (SPIFFE) | Medium-High
Cilium with WireGuard | L3    | WireGuard   | Node-level identity        | Medium
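With Istio, for instance, mesh-wide strict mTLS is a single resource; a sketch using Istio's PeerAuthentication API:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # Applying to the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT            # Reject any plaintext pod-to-pod traffic
```

With STRICT mode, sidecars refuse connections that are not mutually authenticated, so a workload without a mesh identity cannot talk to meshed services at all.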

Image Security and Supply Chain

The container image is the package that runs in your cluster. If an attacker pushes a malicious image to your registry or you pull an image with a known CVE, no amount of runtime security will fully protect you. Image security is about controlling what enters the cluster: which registries you trust, which images are allowed, and whether you can verify they haven't been tampered with.

Use Image Digests, Not Tags

Tags are mutable pointers. nginx:1.25 can point to a different image tomorrow if someone pushes a new build with the same tag. Digests are immutable content hashes — they guarantee you get exactly the image you tested and approved. In production, always pin images by digest.

yaml
# BAD — tag is mutable, could be overwritten
containers:
- name: app
  image: myregistry.io/myapp:v2.1.0

# GOOD — digest is immutable, cryptographically verified
containers:
- name: app
  image: myregistry.io/myapp@sha256:a3ed95caeb02ffe68cdd9fd844066...
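The digest rule is easy to lint for in CI before admission control ever sees the manifest; a minimal Python sketch (the regex and image names are illustrative):

```python
import re

# A pinned reference ends in an immutable sha256 content hash
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image):
    """Return True if the image reference is pinned by digest."""
    return bool(DIGEST_RE.search(image))

# Hypothetical references for illustration
print(is_digest_pinned("myregistry.io/myapp:v2.1.0"))              # False
print(is_digest_pinned("myregistry.io/myapp@sha256:" + "a" * 64))  # True
```

A check like this in CI catches unpinned images at review time; the Kyverno policy below then enforces the same rule at admission.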

Enforcing Image Policies with Kyverno

You can't rely on developers to always use digests or approved registries. Policy engines like Kyverno and OPA Gatekeeper run as admission webhooks — they intercept every pod creation and reject those that violate your rules. Here is a Kyverno policy that requires images from trusted registries and blocks the latest tag:

yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: >-
        Images must come from approved registries.
      pattern:
        spec:
          containers:
          - image: "gcr.io/my-project/* | myregistry.io/*"
  - name: block-latest-tag
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "The 'latest' tag is not allowed. Use a specific version or digest."
      pattern:
        spec:
          containers:
          - image: "!*:latest"

Image Scanning

Integrate vulnerability scanning into your CI pipeline and your cluster admission flow. Tools like Trivy, Grype, and Snyk scan images for known CVEs in OS packages and application dependencies. Use them at two points: in CI (to catch vulnerabilities before merge) and as an admission webhook (to block deployment of images with critical findings).

bash
# Scan an image for vulnerabilities — fail CI if HIGH/CRITICAL found
trivy image --severity HIGH,CRITICAL --exit-code 1 myregistry.io/myapp:v2.1.0

# Scan and output results as a table for human review
trivy image --format table --severity HIGH,CRITICAL myregistry.io/myapp:v2.1.0

# Generate an SBOM (Software Bill of Materials) for audit trails
trivy image --format spdx-json --output sbom.json myregistry.io/myapp:v2.1.0
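If your CI runs on GitHub Actions, the scanning gate might look like this (the workflow layout and the aquasecurity/trivy-action inputs shown are assumptions to adapt):

```yaml
# .github/workflows/image-scan.yaml: sketch of a CI scanning gate
name: image-scan
on: [push]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: myregistry.io/myapp:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: "1"       # Non-zero exit fails the job on findings
```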

Runtime Security

Runtime security is the last line of defense. Even after you have locked down the API server, encrypted etcd, restricted the network, and verified your images, a container might still be compromised through an application-level vulnerability. Runtime controls restrict what a container process can do once it is running: which system calls it can make, which files it can access, and what kernel capabilities it holds.

Security Context: The Foundation

Every pod spec should include a securityContext that enforces the principle of least privilege. Run as a non-root user, drop all Linux capabilities, set the root filesystem to read-only, and prevent privilege escalation. These four settings alone block the majority of container breakout techniques.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  # Pod-level security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 10000
    runAsGroup: 10000
    fsGroup: 10000
    seccompProfile:
      type: RuntimeDefault    # Apply default Seccomp profile
  containers:
  - name: app
    image: myregistry.io/myapp@sha256:abc123...
    # Container-level security context
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL               # Drop every Linux capability
    # Use emptyDir for writable temp directories
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: cache
      mountPath: /var/cache
  volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 100Mi
  - name: cache
    emptyDir:
      sizeLimit: 200Mi

Seccomp Profiles

Seccomp (Secure Computing Mode) filters system calls at the kernel level. The RuntimeDefault profile blocks roughly 44 of the ~300+ Linux syscalls, including dangerous ones like ptrace, mount, and reboot. For higher security, create a custom profile that allows only the exact syscalls your application needs.

json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "clone", "close", "connect",
        "epoll_ctl", "epoll_wait", "execve", "exit_group",
        "fcntl", "fstat", "futex", "getpid", "getsockopt",
        "ioctl", "listen", "mmap", "mprotect", "nanosleep",
        "openat", "read", "recvfrom", "rt_sigaction",
        "sendto", "setsockopt", "socket", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Place custom Seccomp profiles in /var/lib/kubelet/seccomp/ on each node, then reference them in your pod spec:

yaml
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/my-strict-profile.json

AppArmor and SELinux

AppArmor (Ubuntu/Debian) and SELinux (RHEL/CentOS) are Linux Security Modules (LSMs) that confine processes to a defined set of file paths, network operations, and capabilities. They operate below Kubernetes and enforce policies even if a container escapes its cgroup or namespace. As of Kubernetes 1.30, AppArmor support is GA with the appArmorProfile field in the security context.

yaml
# AppArmor (Kubernetes 1.30+ GA field)
securityContext:
  appArmorProfile:
    type: RuntimeDefault      # Use container runtime's default profile
---
# SELinux — assign an MCS label to isolate containers
securityContext:
  seLinuxOptions:
    level: "s0:c123,c456"     # Multi-Category Security label
    type: "container_t"

Hardening Checklist

Use this table as a quick-reference audit sheet. Each item maps to a CIS Kubernetes Benchmark section. Run kube-bench to automate the audit.

Area       | Control                                           | CIS Section   | Priority
API Server | Disable anonymous auth (--anonymous-auth=false)   | 1.2.1         | Critical
API Server | Enable audit logging                              | 1.2.22–1.2.25 | Critical
API Server | Use Node,RBAC authorization mode                  | 1.2.8         | Critical
etcd       | Encrypt secrets at rest                           | 1.2.33        | Critical
etcd       | Enable client-cert auth (--client-cert-auth=true) | 2.2           | Critical
Kubelet    | Disable anonymous auth, use Webhook mode          | 4.2.1–4.2.2   | Critical
Kubelet    | Disable read-only port                            | 4.2.4         | High
Network    | Apply default-deny NetworkPolicies                | 5.3.2         | High
Network    | Block cloud metadata service (169.254.169.254)    | n/a           | High
Images     | Use image digests, not tags                       | n/a           | High
Images     | Restrict to trusted registries (Kyverno/OPA)      | 5.5.1         | High
Runtime    | Run as non-root, drop all capabilities            | 5.2.6–5.2.9   | Critical
Runtime    | Read-only root filesystem                         | 5.2.4         | High
Runtime    | Apply Seccomp RuntimeDefault profile              | 5.7.2         | High

Automate the Audit

Run kube-bench run --targets master,node,etcd,policies regularly — ideally as a CronJob in your cluster or as part of your CI pipeline. It checks your cluster against the CIS Benchmark and produces a pass/fail report for every control. Address CRITICAL and HIGH findings first; they represent the most exploitable gaps.
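One way to schedule the audit, sketched as a CronJob (the image tag, schedule, and abbreviated host mounts are assumptions; see the kube-bench project's own job manifests for the full set):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kube-bench
  namespace: kube-system
spec:
  schedule: "0 3 * * 0"        # Weekly, Sunday 03:00
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true        # kube-bench inspects host processes
          restartPolicy: Never
          containers:
            - name: kube-bench
              image: docker.io/aquasec/kube-bench:latest   # pin a version in production
              command: ["kube-bench", "run", "--targets", "node"]
              volumeMounts:
                - name: var-lib-kubelet
                  mountPath: /var/lib/kubelet
                  readOnly: true
                - name: etc-kubernetes
                  mountPath: /etc/kubernetes
                  readOnly: true
          volumes:
            - name: var-lib-kubelet
              hostPath:
                path: /var/lib/kubelet
            - name: etc-kubernetes
              hostPath:
                path: /etc/kubernetes
```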

Health Checks — Liveness, Readiness, and Startup Probes

Kubernetes can restart failed containers and reschedule pods onto healthy nodes — but only if it knows something is wrong. Without health checks, the kubelet has a single signal: whether the container process is running. A process can be alive yet completely stuck — deadlocked, out of file descriptors, wedged in an infinite loop. From the outside, the container appears healthy. Requests pile up, users see timeouts, and nothing self-heals.

Health check probes give the kubelet fine-grained insight into your application's actual state. They are the mechanism behind self-healing and zero-downtime deployments. Configure them wrong (or skip them entirely), and you undermine two of the most valuable things Kubernetes offers.

The Three Probe Types

Kubernetes provides three distinct probes. Each answers a different question about your container, and each triggers a different response from the system. Understanding the distinction is essential — mixing up their purposes is one of the most common sources of production outages.

Liveness Probe — "Is the process stuck?"

The liveness probe detects situations where your application is running but can no longer make progress. Deadlocks, infinite loops, and corrupted internal state are classic examples. When a liveness probe fails beyond its configured threshold, the kubelet kills the container and restarts it. This is the nuclear option — a full container restart — so the probe must only check the application's own internal health.

yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

In this example, the kubelet waits 15 seconds after the container starts, then hits /healthz every 10 seconds. If three consecutive checks fail (or time out after 3 seconds each), the container is killed and restarted according to the pod's restartPolicy.

Critical anti-pattern: checking dependencies in liveness probes

Never have your liveness probe check a database connection, an external API, or a downstream service. If the database goes down, your liveness probe fails, Kubernetes restarts every pod, the pods come back up, can't reach the database, fail again — and you've created a cascading restart storm that makes a partial outage into a total one. Liveness probes must only check the process itself.

Readiness Probe — "Can this pod serve traffic?"

The readiness probe controls whether a pod's IP address appears in a Service's Endpoints object. When the probe fails, the pod is removed from the load balancer — it stops receiving new requests but keeps running. When the probe passes again, traffic resumes. Unlike liveness, this is a gentle, reversible action.

This is the probe where you should check dependencies. If your app needs a database connection to serve requests, let the readiness probe verify it. A failing readiness probe during a rolling update prevents the new version from receiving traffic until it's truly ready, which is how Kubernetes achieves zero-downtime deployments.

yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 2

Notice the shorter initialDelaySeconds and periodSeconds compared to liveness. You want to detect unreadiness quickly so the Service routes traffic only to healthy pods. The successThreshold of 1 means a single passing check brings the pod back into rotation.
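The division of labor between the two probes can be sketched as two tiny handlers; Python is used here for brevity, and check_db is a hypothetical stub:

```python
def check_db():
    """Hypothetical dependency check (stand-in for e.g. a SELECT 1)."""
    return True  # pretend the database is reachable

def healthz():
    # Liveness: answering at all proves the process is not stuck.
    # Deliberately checks nothing external.
    return 200, "ok"

def ready():
    # Readiness: also requires dependencies, so a database outage pulls
    # the pod out of the Service instead of triggering restarts.
    if check_db():
        return 200, "ready"
    return 503, "dependency unavailable"

print(healthz())  # (200, 'ok')
print(ready())    # (200, 'ready')
```

Wiring /healthz to healthz and /ready to ready keeps the restart path and the traffic path cleanly separated, which is exactly the anti-pattern guard described above.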

Startup Probe — "Has the application finished starting?"

Some applications — legacy Java apps with heavy classloading, applications that run database migrations on boot, or ML services loading large models — can take minutes to start. Without a startup probe, you'd have to set a massive initialDelaySeconds on the liveness probe, which also delays detection of stuck containers after they've started.

The startup probe solves this cleanly. While it runs, both liveness and readiness probes are disabled. Once the startup probe succeeds, it never runs again and the other probes take over. If the startup probe exhausts its failure threshold, the container is killed — the application is considered broken, not just slow.

yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # 10s × 30 = up to 5 minutes to start

The math here is straightforward: periodSeconds × failureThreshold gives the maximum startup window. In this example, the application gets up to 5 minutes to start. Once /healthz returns a 200, the startup probe is done and the liveness/readiness probes begin their normal cycles.

Probe Mechanisms

Every probe type supports four mechanisms for checking container health. The mechanism you choose depends on what your application exposes and how much control you need.

Mechanism | How It Works                            | Success Criteria            | Best For
httpGet   | Sends an HTTP GET to a path and port    | Status code 200–399         | Web servers and APIs with a health endpoint
tcpSocket | Attempts a TCP connection to a port     | Port accepts the connection | Non-HTTP services (databases, caches, TCP servers)
exec      | Runs a command inside the container     | Command exits with code 0   | Custom checks, script-based validation, sidecar health
grpc      | Calls the gRPC Health Checking Protocol | Response status is SERVING  | gRPC services implementing the standard health protocol

Here's an example of each mechanism in a probe definition:

yaml
# HTTP — most common for web applications
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: X-Custom-Header
        value: probe-check

# TCP — useful when there's no HTTP endpoint
livenessProbe:
  tcpSocket:
    port: 3306

# exec — run arbitrary commands
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy

# gRPC — native support since Kubernetes 1.27 (stable)
livenessProbe:
  grpc:
    port: 50051
    service: my.package.MyService  # optional, defaults to ""

exec probes have overhead

Each exec probe forks a new process inside the container. For high-frequency probes across many pods, this can add non-trivial CPU and PID pressure. Prefer httpGet or tcpSocket when possible. If you must use exec, keep the command lightweight and avoid shell invocations like sh -c "...".

Configuration Parameters Explained

All three probe types share the same set of timing and threshold parameters. Getting these values right is the difference between a resilient deployment and one that either ignores failures or restarts too aggressively.

| Parameter | Default | Description |
| --- | --- | --- |
| initialDelaySeconds | 0 | Seconds to wait after the container starts before the first probe. Use this if you aren't using a startup probe. |
| periodSeconds | 10 | How often (in seconds) the probe is executed. Lower values detect failures faster but increase load. |
| timeoutSeconds | 1 | Seconds to wait for a probe response before counting it as a failure. Should be less than periodSeconds. |
| successThreshold | 1 | Consecutive successes required to mark the probe as passing. Must be 1 for liveness and startup probes. |
| failureThreshold | 3 | Consecutive failures before taking action (restart for liveness, removal from endpoints for readiness). |

The total time from container start until action is taken is approximately initialDelaySeconds + (periodSeconds × failureThreshold). For example, with initialDelaySeconds: 10, periodSeconds: 5, and failureThreshold: 3, a container that never responds is restarted roughly 25 seconds after it starts — and once a running container begins failing, action follows about 15 seconds (periodSeconds × failureThreshold) later.
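
The arithmetic can be sketched in shell, using the example values above:

```shell
# Worst-case liveness timing, using the example values above
initialDelaySeconds=10
periodSeconds=5
failureThreshold=3
from_start=$(( initialDelaySeconds + periodSeconds * failureThreshold ))
from_first_failure=$(( periodSeconds * failureThreshold ))
echo "restart at most ${from_start}s after start, ${from_first_failure}s after failures begin"
```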

Full Working Example

Here's a production-grade pod spec for a web application that uses all three probes together. The startup probe gives the app up to 3 minutes to initialize. Once startup succeeds, the liveness probe watches for hangs and the readiness probe controls traffic flow.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: myregistry/order-service:2.4.1
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 18   # 10s × 18 = 3 min max startup
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3    # restart after ~30s of failures
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2    # remove from Service after ~10s
            successThreshold: 1

Note that the liveness probe hits /healthz (a lightweight internal check) while the readiness probe hits /ready (which may verify database connectivity, cache availability, or other dependencies). These are deliberately different endpoints with different responsibilities.

Common Anti-Patterns to Avoid

Misconfigured probes cause more outages than missing probes. These are the mistakes that show up repeatedly in incident postmortems.

1. Using the same endpoint for liveness and readiness

If your /health endpoint checks the database and you use it for both probes, a database outage will trigger liveness failures. Kubernetes restarts all your pods — which still can't reach the database — creating a restart loop. Split your endpoints: /healthz for liveness (internal-only checks) and /ready for readiness (dependency checks).

2. Setting timeoutSeconds too low

The default timeoutSeconds is 1 second. If your health endpoint does any I/O — even a trivial database ping — it can occasionally exceed 1 second under load. This triggers spurious failures and restarts. Set timeoutSeconds to at least 2–3 seconds for endpoints that touch any external resource, and always keep your liveness endpoint free of I/O.

3. Missing readiness probes during rolling updates

Without a readiness probe, Kubernetes considers a pod ready the moment its container starts. During a rolling update, the old pod is terminated as soon as the new one's container is running — even if the new application hasn't opened its listening socket yet. Users hit connection refused errors. Always define a readiness probe on pods behind a Service.

4. Using initialDelaySeconds instead of a startup probe

A long initialDelaySeconds (e.g., 120 seconds) on the liveness probe means that if the application deadlocks within those first 120 seconds after startup, it won't be detected. The startup probe is strictly better: it protects slow starts while keeping liveness detection responsive once the application is running.

Debugging probe failures

When a probe fails, Kubernetes emits an event on the pod. Run kubectl describe pod <name> and check the Events section for messages like Liveness probe failed: HTTP probe failed with statuscode: 503. For deeper debugging, temporarily exec into the container and call the health endpoint manually: kubectl exec -it <pod> -- curl -v localhost:8080/healthz.

Logging Architecture — From Containers to Centralized Storage

Kubernetes does not ship with a built-in log aggregation system. It gives you the primitives — container stdout/stderr capture, node-level log files, and API access — but the responsibility of collecting, shipping, and storing those logs at scale falls squarely on you. Understanding how logs flow through the system is the first step to building a reliable observability stack.

Logging in Kubernetes happens at three distinct levels, each building on the one below it: container-level (what your application writes), node-level (how the kubelet and container runtime manage log files on disk), and cluster-level (how you centralize logs from every node into a queryable backend). We will cover all three, then walk through practical stack deployments.

Level 1: Container Logs — stdout and stderr

The simplest form of logging in Kubernetes: your application writes to stdout and stderr, and the container runtime (containerd, CRI-O) captures those streams and writes them to log files on the node's filesystem. This is the 12-Factor App approach to logging — treat logs as event streams, not files.

You access these logs with kubectl logs. Under the hood, the kubelet reads the log files written by the container runtime and streams them back through the Kubernetes API.

bash
# View logs from a running pod
kubectl logs my-app-pod-7f8b9c6d4-x2k9z

# Follow logs in real-time (like tail -f)
kubectl logs -f my-app-pod-7f8b9c6d4-x2k9z

# View logs from a specific container in a multi-container pod
kubectl logs my-app-pod-7f8b9c6d4-x2k9z -c sidecar-logger

# View logs from a previous container instance (after a crash restart)
kubectl logs my-app-pod-7f8b9c6d4-x2k9z --previous

# View last 100 lines from the past hour
kubectl logs my-app-pod-7f8b9c6d4-x2k9z --tail=100 --since=1h

# Aggregate recent logs from every pod matching a label selector
kubectl logs -l app=my-app --prefix --tail=20

kubectl logs is effective for interactive debugging, but it has hard limits. It queries one pod at a time (unless you use label selectors with --selector), it only shows logs still on the node (subject to rotation), and it cannot search across the cluster. For anything beyond "what is this one pod doing right now?", you need centralized logging.

Level 2: Node-Level Logging — Where Log Files Live

When a container writes to stdout/stderr, the container runtime doesn't just hold it in memory — it writes each log line to a file on the node. With containerd and CRI-O, these files use the CRI logging format: a timestamp, the stream name, a partial/full tag, and the message. The standard path is /var/log/containers/, which contains symlinks to the actual log files under /var/log/pods/. The filename encodes the pod name, namespace, container name, and container ID.

bash
# Symlink structure on a node
ls /var/log/containers/
# my-app-7f8b9c6d4-x2k9z_default_app-abc123def456.log -> /var/log/pods/default_my-app-.../app/0.log

# The actual log file is in the CRI logging format — one entry per line:
# <timestamp> <stream> <P|F tag> <message>
cat /var/log/pods/default_my-app-7f8b9c6d4-x2k9z_uid/app/0.log
# 2024-11-15T10:23:01.234567890Z stdout F Starting server on port 8080
# 2024-11-15T10:23:05.891234567Z stderr F Error: connection refused

The kubelet manages log rotation for container logs. Two kubelet flags control this behavior: --container-log-max-size (default 10Mi) sets the maximum size of each log file before rotation, and --container-log-max-files (default 5) sets how many rotated files to keep. When a log file hits the size limit, the runtime rotates it (e.g., 0.log becomes 0.log.20241115-102301.gz) and starts a fresh file.

Why this matters

With default settings, each container can use up to 50 MiB of disk for logs (5 files × 10 MiB). On a node running 50 pods, that is 2.5 GiB of log storage. High-throughput applications can burn through these limits fast, causing old logs to disappear before anyone reads them. This is the core reason you need cluster-level log shipping — node-level storage is ephemeral and bounded.
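
The sizing arithmetic above, as a quick shell sketch (assuming one container per pod):

```shell
# Per-container and per-node log disk usage with default kubelet settings
max_files=5        # --container-log-max-files default
max_size_mib=10    # --container-log-max-size default (10Mi)
pods_per_node=50
per_container_mib=$(( max_files * max_size_mib ))
per_node_mib=$(( per_container_mib * pods_per_node ))
echo "${per_container_mib} MiB per container, ${per_node_mib} MiB per node"
```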

Level 3: Cluster-Level Logging — Centralized Aggregation

Cluster-level logging is not a Kubernetes feature — it is an architecture pattern you implement yourself. The goal: ship logs from every node to a central backend where they can be searched, filtered, and retained independently of node lifecycle. Kubernetes documentation describes three architectures for achieving this.

mermaid
flowchart LR
    subgraph Node["Worker Node"]
        App["App Container<br/>stdout/stderr"]
        CR["Container Runtime"]
        LF["/var/log/containers/"]
        FB["Fluent Bit<br/>DaemonSet"]
    end

    App -->|writes| CR
    CR -->|"JSON log files"| LF
    FB -->|tails log files| LF
    FB -->|"enriches with<br/>K8s metadata"| FB

    subgraph Backend["Centralized Backend"]
        ES["Elasticsearch<br/>or Loki"]
        UI["Kibana<br/>or Grafana"]
    end

    FB -->|ships logs| ES
    ES -->|queries| UI
    

Architecture 1: Node-Level DaemonSet Agent (Most Common)

A logging agent runs as a DaemonSet — one pod per node — tailing the log files from /var/log/containers/. The agent parses the CRI JSON format, enriches each line with Kubernetes metadata (pod name, namespace, labels, annotations), and forwards the processed logs to a centralized backend. This is by far the most widely used pattern because it requires zero changes to your application code.

Popular DaemonSet agents include Fluent Bit (lightweight, C-based, low memory footprint), Fluentd (Ruby-based, plugin-rich, more flexible routing), Vector (Rust-based, high performance), and Promtail (purpose-built for Grafana Loki). The agent reads from the node filesystem, so the DaemonSet needs a volume mount to /var/log.

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config

Architecture 2: Sidecar Container Streaming to stdout

Some applications write logs to files inside the container (e.g., /var/log/app/access.log and /var/log/app/error.log) rather than stdout. A sidecar container can tail those files and re-emit them to its own stdout, making them available to the node-level DaemonSet agent via the standard /var/log/containers/ path.

This pattern is useful when you cannot modify an application to log to stdout, or when you need to split multiple log streams from a single container into separate streams. The downside is resource overhead — each sidecar consumes CPU and memory, and you double the disk I/O since logs are written twice.

yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
    - name: app
      image: legacy-app:2.1
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
    - name: log-streamer
      image: busybox:1.36
      args:
        - /bin/sh
        - -c
        - tail -n +1 -F /var/log/app/access.log   # a second sidecar could stream error.log the same way
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: log-volume
      emptyDir: {}

Architecture 3: Direct Push from Application

The application pushes logs directly to a logging backend — bypassing the node filesystem entirely. This is common with application-level logging libraries (e.g., a Java app using Logback with an Elasticsearch appender, or a Go service pushing to Loki via HTTP). You get maximum control over format and routing, but you tightly couple your application to a specific logging infrastructure and lose the ability to collect logs from application crashes or containers that fail before the logging library initializes.

| Architecture | Application Changes | Resource Cost | Crash Log Coverage | Best For |
| --- | --- | --- | --- | --- |
| DaemonSet Agent | None (log to stdout) | 1 agent per node | Full — logs persisted on disk | Most workloads |
| Sidecar Streaming | None | 1 sidecar per pod | Full | Legacy apps writing to files |
| Direct Push | Logging library config | In-app overhead | Partial — crash logs may be lost | High-cardinality custom routing |

Log Aggregation Stacks

Two stacks dominate the Kubernetes logging landscape. Your choice between them depends on scale, query patterns, and infrastructure budget.

EFK Stack: Elasticsearch + Fluent Bit/Fluentd + Kibana

The EFK stack is the traditional enterprise choice. Elasticsearch indexes every log line as a full-text searchable document, making it excellent for ad-hoc queries across high-cardinality fields. Fluent Bit (or Fluentd) ships the logs. Kibana provides dashboards, saved searches, and alerting. The cost: Elasticsearch is resource-hungry. A production cluster typically needs dedicated nodes with fast SSDs and significant memory for the JVM heap.

yaml
# Fluent Bit ConfigMap — collect, parse, enrich, and ship to Elasticsearch
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kube.*
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On

    [OUTPUT]
        Name            es
        Match           kube.*
        Host            elasticsearch.logging.svc
        Port            9200
        Logstash_Format On
        Logstash_Prefix k8s-logs
        Retry_Limit     False

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
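
Before shipping a parser change, it can help to sanity-check the pattern locally. Here is a rough equivalent of the cri parser's regex (grep -E has no named groups, so the capture names are dropped; the sample log line is made up):

```shell
# A made-up CRI-format log line checked against the parser's pattern shape
line='2024-11-15T10:23:05.891Z stderr F Error: connection refused'
regex='^[^ ]+ (stdout|stderr) [^ ]* .*$'
echo "$line" | grep -qE "$regex" && echo "matches"
```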

Loki + Promtail + Grafana Stack

Grafana Loki takes a fundamentally different approach. Instead of indexing the full text of every log line (like Elasticsearch), Loki indexes only the metadata labels (namespace, pod, container, custom labels) and stores the raw log content in compressed chunks on cheap object storage (S3, GCS, MinIO). This makes Loki dramatically cheaper to run — often 10x less infrastructure than Elasticsearch for the same log volume.

The trade-off: you cannot do full-text search across all logs. Queries in Loki always start with a label selector ({namespace="production", app="api-gateway"}) to narrow the stream, then optionally apply line filters or regex. If your debugging workflow is "show me all logs from this service in the last hour", Loki is ideal. If you need "find every log line containing this UUID across the entire cluster", Elasticsearch is faster.

yaml
# Promtail DaemonSet config snippet — ships logs to Loki
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki.logging.svc:3100/loki/api/v1/push

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        pipeline_stages:
          - cri: {}
          - json:
              expressions:
                level: level
          - labels:
              level:

| Characteristic | EFK (Elasticsearch) | Loki + Grafana |
| --- | --- | --- |
| Indexing | Full-text on every field | Labels only, raw log chunks |
| Query speed (full text) | Fast — inverted index | Slower — brute-force scan within stream |
| Query speed (by label) | Fast | Fast |
| Storage cost | High (SSD-backed indices) | Low (object storage like S3) |
| Resource footprint | Heavy (JVM heap, CPU) | Light (single binary or microservices) |
| Setup complexity | Moderate to high | Low (especially via Helm) |
| Best for | Security/compliance, high-cardinality search | Developer debugging, cost-sensitive teams |

Structured Logging and Log Levels

Unstructured log lines like User login failed for john@example.com are human-readable but machine-hostile. When you have thousands of pods producing millions of log lines, you need logs that can be parsed, filtered, and aggregated automatically. Structured logging means emitting each log entry as a JSON object with consistent, typed fields.

json
{
  "timestamp": "2024-11-15T10:23:05.891Z",
  "level": "error",
  "message": "User login failed",
  "service": "auth-api",
  "user_email": "john@example.com",
  "reason": "invalid_password",
  "request_id": "req-a4f8c2e1-9b3d",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "duration_ms": 42
}

With structured logs, Fluent Bit's Merge_Log option (or Promtail's json pipeline stage) parses the JSON and promotes fields like level, service, and request_id into indexed or queryable labels. You can then filter dashboards by level=error, trace a request across services with request_id, or correlate logs with distributed traces via trace_id.
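
The field-promotion idea can be mimicked crudely in shell — real agents use proper JSON parsers, but pulling level out of a structured entry looks roughly like this (sed-based sketch; the log line is illustrative):

```shell
# Extract the "level" field from a structured log line (crude sed sketch)
log='{"timestamp":"2024-11-15T10:23:05.891Z","level":"error","message":"User login failed"}'
level=$(printf '%s' "$log" | sed -n 's/.*"level":"\([^"]*\)".*/\1/p')
echo "level=${level}"
```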

Log Levels

Use standard severity levels consistently across all services. This lets you filter noise and alert on errors at the infrastructure level rather than parsing strings.

| Level | When to Use | Example |
| --- | --- | --- |
| debug | Verbose internals — disabled in production by default | Cache key lookup, SQL query parameters |
| info | Normal operational events | Server started, request handled, job completed |
| warn | Unexpected but recoverable situations | Retry attempt, deprecated API call, slow query |
| error | Failures that need attention | Database connection failed, upstream 5xx, unhandled exception |
| fatal | Unrecoverable — process will exit | Missing required config, binding port already in use |

Correlation IDs in Microservices

In a microservices architecture, a single user request might traverse five or more services. Without a shared identifier, debugging a failure means manually stitching logs together by timestamp — which is error-prone and slow. The solution: generate a unique request_id (or correlation_id) at the edge gateway and propagate it through every service via HTTP headers (X-Request-ID) or gRPC metadata.

Every service includes this ID in every log line it emits. To find the full story of a failed request, you query your logging backend with that single ID and see the complete chain — from ingress to database and back. If you are also using distributed tracing (OpenTelemetry), include the trace_id in your logs to link them directly to trace spans.
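
A minimal sketch of the pattern, with a hypothetical ID format and a hypothetical downstream service name in the comment:

```shell
# Mint a correlation ID at the edge; every downstream hop forwards it, e.g.:
#   curl -H "X-Request-ID: $request_id" http://orders-svc/api/orders
request_id="req-$(date +%s)-$$"
echo "X-Request-ID: ${request_id}"
```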

Practical: Deploying the Loki Stack with Helm

The fastest way to stand up centralized logging in a Kubernetes cluster is the Loki stack via Helm. This deploys Loki (log storage), Promtail (DaemonSet agent), and connects to an existing or new Grafana instance.

bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki (single-binary mode for dev/small clusters)
helm install loki grafana/loki \
  --namespace logging --create-namespace \
  --set loki.auth_enabled=false \
  --set singleBinary.replicas=1 \
  --set loki.storage.type=filesystem

# Install Promtail (DaemonSet log shipper)
helm install promtail grafana/promtail \
  --namespace logging \
  --set config.clients[0].url=http://loki:3100/loki/api/v1/push

# Install Grafana (if not already running)
helm install grafana grafana/grafana \
  --namespace logging \
  --set persistence.enabled=true \
  --set adminPassword='your-secure-password'

After installation, add Loki as a data source in Grafana (URL: http://loki:3100), open the Explore panel, and query with LogQL:

logql
# All logs from the "production" namespace
{namespace="production"}

# Error-level logs from a specific app
{namespace="production", app="api-gateway"} |= "error"

# Parse JSON logs and filter by status code
{namespace="production", app="api-gateway"} | json | status_code >= 500

# Count error rate per service over 5-minute windows
sum(rate({namespace="production"} |= "level=error" [5m])) by (app)

Log Retention Strategies

Logs are only useful if they exist when you need them — but storing everything forever is prohibitively expensive. A good retention strategy balances compliance requirements, debugging needs, and storage costs.

| Tier | Retention | Storage Type | Use Case |
| --- | --- | --- | --- |
| Hot | 3–7 days | SSD / local disk | Active debugging, real-time dashboards |
| Warm | 30–90 days | HDD / standard cloud storage | Incident investigation, trend analysis |
| Cold/Archive | 1–7 years | Object storage (S3 Glacier, GCS Coldline) | Compliance, audit, legal hold |

In Elasticsearch, use Index Lifecycle Management (ILM) policies to automatically roll over indices by age or size, transition them to cheaper storage tiers, and delete them on schedule. In Loki, configure the compactor component with retention_enabled: true and set per-tenant or global retention periods via limits_config.retention_period.

yaml
# Loki retention configuration snippet
limits_config:
  retention_period: 720h          # 30 days global default

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

Cost-saving tip

Drop debug-level logs at the agent level before they reach your backend. In Fluent Bit, use a grep filter to exclude lines matching debug. In Promtail, use a drop pipeline stage. This alone can reduce log volume — and storage costs — by 40–60% in verbose applications.
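
The effect of an agent-level drop filter can be simulated with plain grep (the log lines are illustrative):

```shell
# Keep everything except debug-level lines, as a drop filter would
printf 'level=debug cache lookup\nlevel=info request served\nlevel=error db down\n' \
  | grep -v 'level=debug'
```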

Key Takeaways

  • Always log to stdout/stderr. This is the Kubernetes-native convention. The container runtime, kubelet, and DaemonSet agents all expect it.
  • DaemonSet agents are the default choice. Deploy Fluent Bit or Promtail as a DaemonSet — it covers every pod on every node with zero application changes.
  • Choose Loki for cost efficiency, Elasticsearch for search power. Most teams that are not in regulated industries start with Loki and move to Elasticsearch only if query patterns demand full-text indexing.
  • Emit structured JSON logs. Consistent fields like level, message, service, and request_id transform logs from noise into a queryable observability signal.
  • Plan retention from day one. Tiered storage with automatic lifecycle policies prevents runaway costs and keeps you compliant.

Monitoring with Prometheus, Grafana, and Metrics Server

Running containers without monitoring is flying blind. Kubernetes orchestrates hundreds or thousands of Pods across a fleet of nodes, and without visibility into CPU usage, memory pressure, error rates, and API server health, you won’t know something is wrong until users start complaining. This section covers the entire Kubernetes monitoring stack — from the lightweight Metrics Server that powers kubectl top, to the Prometheus ecosystem that gives you deep, long-term observability.

The monitoring landscape in Kubernetes splits into two categories: core metrics (the minimal set required by Kubernetes itself for scheduling and autoscaling) and full monitoring pipelines (Prometheus, Grafana, Alertmanager) that give you complete observability. Understanding both — and how they complement each other — is essential for production operations.

The Monitoring Stack at a Glance

Before diving into individual components, here is how all the pieces fit together. Every metric in the Kubernetes ecosystem starts at a source (kubelet, kube-state-metrics, node-exporter, or your application), gets scraped by Prometheus, stored in its time-series database, queried by Grafana for dashboards, and evaluated by Alertmanager for alerts.

mermaid
flowchart LR
    subgraph Sources["Metric Sources"]
        KSM["kube-state-metrics<br/>(Deployment, Pod, Node state)"]
        NE["node-exporter<br/>(CPU, memory, disk, network)"]
        CA["kubelet / cAdvisor<br/>(container-level metrics)"]
        APP["Application Pods<br/>(/metrics endpoints)"]
    end

    subgraph Prom["Prometheus"]
        SM["ServiceMonitor /<br/>PodMonitor CRDs"]
        SCRAPE["Scrape Engine<br/>(pull-based)"]
        TSDB["Time-Series DB<br/>(local storage)"]
    end

    subgraph Viz["Visualization & Alerting"]
        GF["Grafana<br/>(dashboards)"]
        AM["Alertmanager<br/>(routing & notifications)"]
        SLACK["Slack / PagerDuty /<br/>Email / Webhook"]
    end

    MS["Metrics Server<br/>(in-memory, real-time)"]
    HPA["HPA / VPA /<br/>kubectl top"]

    KSM --> SCRAPE
    NE --> SCRAPE
    CA --> SCRAPE
    APP --> SCRAPE
    SM -.->|"defines targets"| SCRAPE
    SCRAPE --> TSDB
    TSDB --> GF
    TSDB -->|"alert rules"| AM
    AM --> SLACK

    CA --> MS
    MS --> HPA
    

Notice the two independent paths. Metrics Server feeds the Kubernetes control plane (HPA, VPA, kubectl top) with real-time, in-memory metrics. Prometheus scrapes the same sources (plus many more) and stores them for querying, dashboarding, and alerting. They serve different purposes and you need both.

Metrics Server — Lightweight, Real-Time, Ephemeral

Metrics Server is a cluster-wide aggregator of resource usage data. It collects CPU and memory metrics from every kubelet’s built-in cAdvisor, holds them in memory only, and exposes them through the Kubernetes Metrics API (metrics.k8s.io). It is the component that makes kubectl top nodes and kubectl top pods work.

| Characteristic | Metrics Server |
| --- | --- |
| Storage | In-memory only — no historical data, no persistence |
| Metrics scope | CPU and memory usage at node and pod level |
| Scrape interval | ~15 seconds (configurable) |
| Primary consumers | HPA, VPA, kubectl top, Kubernetes scheduler |
| Not designed for | Long-term storage, dashboards, alerting, custom metrics |

Most managed Kubernetes services (EKS, GKE, AKS) pre-install Metrics Server. For self-managed clusters, deploying it is straightforward:

bash
# Install Metrics Server (official manifest)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify it is running
kubectl get deployment metrics-server -n kube-system

# Now you can check resource usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu

Metrics Server Is NOT a Monitoring Solution

Metrics Server keeps only the latest data point — it has no history. It exists solely to feed Kubernetes internal components like the Horizontal Pod Autoscaler. For dashboards, alerting, and historical analysis, you need Prometheus. Think of Metrics Server as the speedometer in your car and Prometheus as the full telemetry system.

Prometheus — The Pillar of Kubernetes Monitoring

Prometheus is an open-source monitoring system purpose-built for dynamic, cloud-native environments. It is the de facto standard for Kubernetes monitoring. Unlike traditional push-based monitoring tools (where agents send data to a central server), Prometheus uses a pull-based model — it actively scrapes HTTP endpoints (/metrics) on a configurable interval.
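
What a scraped /metrics endpoint returns is plain text in the Prometheus exposition format. A hand-written example (the metric name and labels are illustrative):

```shell
# Emit a minimal Prometheus text-format payload, as a /metrics endpoint would
cat <<'EOF'
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="get",code="500"} 3
EOF
```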

This pull model is a deliberate design choice that works exceptionally well in Kubernetes. Pods come and go, IPs change, replicas scale up and down. Prometheus uses service discovery to dynamically find all scrape targets, so it automatically adapts to the changing cluster topology without reconfiguration.

Core Concepts

| Concept | Description |
| --- | --- |
| Metric | A named time series with labels. Example: container_cpu_usage_seconds_total{namespace="prod", pod="api-7b5f8"} |
| Scrape target | An HTTP endpoint that exposes metrics in Prometheus format. Every Kubernetes component exposes one. |
| Scrape interval | How often Prometheus fetches metrics (typically 15–30 seconds). |
| TSDB | Prometheus stores all data in a local time-series database, optimized for append-heavy writes and label-based queries. |
| PromQL | The query language for selecting, aggregating, and computing over time-series data. |
| Recording rules | Pre-computed PromQL expressions stored as new time series — reduces query-time computation for dashboards. |
| Alert rules | PromQL expressions that fire when a condition is true for a specified duration. |
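
To make PromQL concrete, here is a query built on the metric named in the table (the namespace label value is illustrative):

```promql
# CPU usage rate per pod over the last 5 minutes, prod namespace only
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])
)
```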

Installing with kube-prometheus-stack

The recommended way to deploy Prometheus on Kubernetes is through the kube-prometheus-stack Helm chart (formerly prometheus-operator). This single chart installs Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics, and a set of pre-configured alert rules and dashboards.

bash
# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full stack into a dedicated namespace
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=your-secure-password

# Verify all components are running
kubectl get pods -n monitoring

This gives you a fully operational monitoring stack out of the box: Prometheus scraping all Kubernetes components, Grafana loaded with dashboards, Alertmanager configured with sensible default alerts, and exporters collecting node and cluster state metrics.

ServiceMonitor and PodMonitor CRDs

The Prometheus Operator introduces Custom Resource Definitions that let you declaratively define scrape targets. Instead of editing Prometheus configuration files, you create a ServiceMonitor (targets a Kubernetes Service) or a PodMonitor (targets Pods directly). The Operator watches these CRDs and automatically updates the Prometheus scrape configuration.

yaml
# ServiceMonitor — scrape metrics from a Service's endpoints
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match Prometheus selector
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics       # named port on the Service
      interval: 30s
      path: /metrics

yaml
# PodMonitor — scrape metrics directly from Pod ports
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-sidecar-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      sidecar: envoy
  podMetricsEndpoints:
    - port: admin
      interval: 15s
      path: /stats/prometheus

The key distinction: use a ServiceMonitor when your Pods are fronted by a Service (the common case). Use a PodMonitor when you need to scrape individual Pods directly — for example, sidecar proxies that don’t have their own Service, or when you need per-Pod label resolution.
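For the ServiceMonitor above to discover anything, the Service it selects must expose a port named http-metrics and carry the matching labels. A minimal sketch of such a backing Service (names and port numbers are illustrative):

```yaml
# Service in the "production" namespace that the ServiceMonitor targets
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app            # selects the application Pods
  ports:
    - name: http-metrics   # matched by the ServiceMonitor's endpoint port
      port: 9090
      targetPort: 9090
```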

Metric Sources — kube-state-metrics and node-exporter

Prometheus scrapes metrics, but it needs something to expose them. In a Kubernetes cluster, three primary metric sources cover the full picture: the kubelet’s built-in cAdvisor, kube-state-metrics, and node-exporter. Each serves a distinct purpose.

| Source | What It Exposes | Example Metrics |
| --- | --- | --- |
| kubelet / cAdvisor | Container-level resource usage — CPU, memory, filesystem, and network I/O for every running container | container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_network_receive_bytes_total |
| kube-state-metrics | Cluster object state from the Kubernetes API — Deployments, Pods, Nodes, Jobs. Answers "what is the desired vs. actual state?" | kube_pod_status_phase, kube_deployment_spec_replicas, kube_node_status_condition, kube_job_status_failed |
| node-exporter | Host-level hardware and OS metrics — CPU load, memory, disk space, network interfaces, filesystem usage | node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_filesystem_avail_bytes, node_disk_io_time_seconds_total |

kube-state-metrics is particularly important because it bridges the gap between the Kubernetes API and Prometheus. cAdvisor tells you "this container is using 200m CPU." kube-state-metrics tells you "this Deployment wants 3 replicas but only 2 are available" or "this Pod has been in CrashLoopBackOff for 10 minutes." Without it, you cannot alert on most operational problems.

node-exporter runs as a DaemonSet (one Pod per node) and exposes hardware-level metrics that cAdvisor does not cover. While cAdvisor reports per-container metrics, node-exporter reports the overall health of the underlying machine — disk IOPS, network errors, CPU steal time, and available memory across the entire node.

Key Metrics to Monitor

With hundreds of metrics available, it helps to organize your monitoring strategy around two proven frameworks: the USE method (Utilization, Saturation, Errors) for infrastructure resources, and the RED method (Rate, Errors, Duration) for services.

USE Method — For Nodes and Infrastructure

| Signal | What to Measure | Key Metrics |
| --- | --- | --- |
| Utilization | How busy is the resource? | node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_filesystem_avail_bytes |
| Saturation | How much extra work is queued? | node_load1 / node_load15, node_disk_io_time_weighted_seconds_total |
| Errors | Are there failures? | node_network_receive_errs_total, node_disk_io_time_seconds_total (anomalies) |

RED Method — For Services and APIs

| Signal | What to Measure | Key Metrics |
| --- | --- | --- |
| Rate | Requests per second | apiserver_request_total, application-specific request counters |
| Errors | Failed requests per second | apiserver_request_total{code=~"5.."}, grpc_server_handled_total{grpc_code!="OK"} |
| Duration | Latency distribution | apiserver_request_duration_seconds_bucket, application histogram metrics |

PromQL — Querying Kubernetes Metrics

PromQL is Prometheus’s query language. It operates on time series identified by a metric name and a set of key-value labels. Learning a handful of patterns covers 90% of what you need for Kubernetes monitoring.

Essential Queries

promql
# CPU usage per container (cores) — averaged over 5 minutes
rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])

# CPU usage as a percentage of the Pod's request
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[5m])
)
/
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
) * 100

# Memory working set per Pod (the metric that triggers OOMKill)
container_memory_working_set_bytes{image!="", container!="POD"}

# Pods not in Running phase, grouped by namespace and phase
sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"})

# API server request rate by verb and response code
sum by (verb, code) (rate(apiserver_request_total[5m]))

# API server error rate (5xx responses)
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
/
sum(rate(apiserver_request_total[5m])) * 100

# Node CPU utilization percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available disk space per node (percentage)
(node_filesystem_avail_bytes{mountpoint="/"}
/ node_filesystem_size_bytes{mountpoint="/"}) * 100

# Pods in CrashLoopBackOff (restart count increasing)
sum by (namespace, pod) (
  increase(kube_pod_container_status_restarts_total[1h])
) > 5

A few key patterns to remember. rate() calculates the per-second rate of a counter over a time window — you will use this on every _total metric. sum by (label) aggregates across dimensions. increase() shows the total increase over a window, which is useful for low-frequency events like Pod restarts.

Always Filter Out Pause Containers

When querying cAdvisor metrics, include {image!="", container!="POD"} in your label selector. The container="POD" entries represent the pause container (the network namespace holder) — not your actual workload. Omitting this filter inflates your results with meaningless data.

Grafana — Dashboards for Kubernetes

Grafana is the visualization layer. It connects to Prometheus as a data source and lets you build dashboards with graphs, tables, heatmaps, and stat panels. The kube-prometheus-stack Helm chart installs Grafana pre-configured with a Prometheus data source and a comprehensive set of dashboards.

Accessing Grafana

bash
# Port-forward Grafana to your local machine
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

# Open http://localhost:3000
# Default credentials: admin / your-secure-password (set during helm install)

Essential Dashboards

The kube-prometheus-stack ships with dozens of dashboards. Focus on these four categories first — they cover the most critical operational views:

| Dashboard | What It Shows | When to Use |
| --- | --- | --- |
| Kubernetes / Cluster Overview | Total cluster CPU/memory usage, node count, Pod count, failed Pods, and overall resource allocation vs. capacity | Daily health check, capacity planning, initial incident triage |
| Node Exporter / Nodes | Per-node CPU, memory, disk I/O, network throughput, system load, and filesystem usage | Investigating node-level performance issues, disk pressure, or network saturation |
| Kubernetes / Pods | Per-pod CPU/memory usage vs. requests/limits, container restarts, network traffic, and OOMKill events | Debugging application performance issues, right-sizing resource requests |
| Kubernetes / API Server | Request rate, error rate, latency percentiles, inflight requests, and etcd request durations | Diagnosing control plane slowness or API server overload |

Creating a Custom Dashboard Panel

You can define Grafana dashboards as code using ConfigMaps. The Grafana sidecar in kube-prometheus-stack watches for ConfigMaps with a specific label and automatically imports them. Here is an example that creates a namespace resource usage dashboard:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-namespace-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # sidecar picks up ConfigMaps with this label
data:
  namespace-resources.json: |
    {
      "title": "Namespace Resource Usage",
      "uid": "ns-resource-usage",
      "panels": [
        {
          "title": "CPU Usage by Namespace",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [{
            "expr": "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))",
            "legendFormat": "{{ namespace }}"
          }],
          "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
        },
        {
          "title": "Memory Usage by Namespace",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [{
            "expr": "sum by (namespace) (container_memory_working_set_bytes)",
            "legendFormat": "{{ namespace }}"
          }],
          "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
        }
      ],
      "schemaVersion": 39,
      "version": 1
    }

Alerting with Alertmanager

Dashboards are useful when someone is looking at them. Alerts are what wake you up at 3 AM when something is actually broken. In the Prometheus ecosystem, alerting works in two stages: Prometheus evaluates alert rules (PromQL expressions with thresholds and durations) and fires alerts when conditions are met. Alertmanager receives those fired alerts, deduplicates them, groups related alerts, and routes them to the right notification channel.

Alert Rules — Defining What to Alert On

Alert rules are defined as PrometheusRule CRDs (another Prometheus Operator resource). Each rule specifies a PromQL expression, a for duration (how long the condition must be true before firing), and labels/annotations that control routing and provide context.

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-critical-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: kubernetes.pod.alerts
      rules:
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} restarted {{ $value }} times in the last hour."

        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{condition="true"} == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 15m"

    - name: kubernetes.node.alerts
      rules:
        - alert: NodeNotReady
          expr: |
            kube_node_status_condition{condition="Ready", status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} is NotReady"
            description: "Node has been NotReady for more than 5 minutes."

        - alert: HighNodeMemoryUsage
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} memory above 90%"

    - name: kubernetes.apiserver.alerts
      rules:
        - alert: APIServerHighErrorRate
          expr: |
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) * 100 > 3
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "API server error rate above 3%"

Alertmanager Configuration — Routing, Receivers, and Silences

Alertmanager handles the "what happens after an alert fires" logic. Its configuration defines receivers (where to send notifications), routes (which alerts go to which receivers), inhibition rules (suppress lower-priority alerts when a higher-priority one is firing), and silences (temporarily mute specific alerts during maintenance).

yaml
# Alertmanager config (set via Helm values under alertmanager.config)
alertmanager:
  config:
    global:
      resolve_timeout: 5m

    route:
      receiver: default-slack
      group_by: [alertname, namespace]
      group_wait: 30s         # wait before sending first notification
      group_interval: 5m      # wait between grouped notifications
      repeat_interval: 4h     # re-notify interval for unresolved alerts
      routes:
        - matchers:             # "matchers" replaces the deprecated "match" syntax
            - severity = "critical"
          receiver: pagerduty-critical
          continue: false
        - matchers:
            - severity = "warning"
          receiver: slack-warnings

    receivers:
      - name: default-slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
            channel: "#k8s-alerts"
            title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
            text: >-
              {{ range .Alerts }}*{{ .Annotations.summary }}*
              {{ .Annotations.description }}{{ end }}

      - name: pagerduty-critical
        pagerduty_configs:
          - service_key: your-pagerduty-service-key
            severity: critical

      - name: slack-warnings
        slack_configs:
          - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
            channel: "#k8s-warnings"

    inhibit_rules:
      - source_matchers:        # replaces the deprecated "source_match" syntax
          - severity = "critical"
        target_matchers:
          - severity = "warning"
        equal: [alertname, namespace]

The inhibition rule at the bottom is important: if a critical alert fires for a given alertname and namespace, it suppresses the corresponding warning-level alert. This prevents alert floods — you don’t want both a "node memory high" warning and a "node memory critical" alert hitting your Slack channel simultaneously.

Key Alertmanager Concepts

| Concept | Purpose | Example |
| --- | --- | --- |
| Grouping | Combines related alerts into a single notification | group_by: [alertname, namespace] — all PodCrashLooping alerts in the same namespace arrive as one message |
| Routing | Directs alerts to different receivers based on labels | Critical → PagerDuty, Warning → Slack |
| Inhibition | Suppresses lower-severity alerts when a related higher-severity alert is active | NodeNotReady (critical) suppresses HighNodeCPU (warning) for the same node |
| Silences | Temporarily mutes alerts matching specific labels (created via UI or API) | Silence all alerts for namespace=staging during a maintenance window |

Avoid Alert Fatigue

The most common monitoring anti-pattern is alerting on everything. If your team receives 50 notifications per day, they will start ignoring all of them. Only alert on conditions that require human intervention. Metrics that are "nice to know" belong on dashboards, not in alert rules. A good rule of thumb: every alert should have a clear runbook explaining what action to take.
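One way to enforce the runbook habit is to attach a runbook_url annotation to every alert rule — notification templates and most UIs can surface it next to the alert. A sketch for the PersistentVolumeFillingUp alert mentioned in the checklist below (the wiki URL is a placeholder; the kubelet_volume_stats metrics are exposed by the kubelet for Pods with mounted PVCs):

```yaml
- alert: PersistentVolumeFillingUp
  expr: |
    kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} is over 90% full"
    runbook_url: https://wiki.example.com/runbooks/pvc-filling-up   # placeholder
```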

Putting It All Together — A Monitoring Checklist

Here is a practical checklist for production Kubernetes monitoring. Deploy the kube-prometheus-stack with sensible defaults, customize these key areas, and you will have solid observability coverage:

  1. Deploy the full stack. Use helm install kube-prometheus-stack to get Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in one shot.
  2. Verify Metrics Server is running for HPA and kubectl top functionality.
  3. Create ServiceMonitors for every application that exposes a /metrics endpoint. This is how Prometheus discovers your custom metrics.
  4. Set up alert rules for the essentials: PodCrashLooping, NodeNotReady, HighMemoryUsage, APIServerErrors, PersistentVolumeFillingUp.
  5. Configure Alertmanager routing so critical alerts go to PagerDuty (or your on-call tool) and warnings go to Slack.
  6. Add inhibition rules to prevent alert storms when a single root cause triggers multiple symptoms.
  7. Review Grafana dashboards daily. Use the cluster overview for capacity planning and the Pod dashboard for right-sizing resource requests.
  8. Configure persistent storage for Prometheus (storageSpec in Helm values) so you retain metrics across Pod restarts. 15–30 days of retention is typical.

Troubleshooting — A Systematic Guide to Common Issues

Every Kubernetes operator eventually stares at a Pod stuck in Pending, a Service that refuses to route traffic, or a node that drops to NotReady at 2 AM. The difference between a 10-minute fix and a 3-hour ordeal is almost always methodology — knowing which commands to run, in which order, and what the output actually means.

This section gives you a systematic approach to diagnosing the most common Kubernetes failures. Each issue category follows the same pattern: understand why the failure happens, then follow concrete debugging steps with real command output to identify and resolve the root cause.

The Debugging Toolkit

Before diving into specific issues, you need to internalize the five core debugging commands. These cover 90% of Kubernetes troubleshooting. Everything else is built on top of them.

| Command | What It Tells You | When to Use |
| --- | --- | --- |
| kubectl describe <resource> <name> | Full resource spec + conditions + events (the most useful part). Shows scheduling decisions, image pulls, mount errors, and probe failures. | First command for any resource-level issue. Always start here. |
| kubectl logs <pod> [-c container] | Container stdout/stderr output. Add --previous to see logs from the last crashed container. | When a container is crashing or misbehaving. Use -f to stream. |
| kubectl get events --sort-by='.lastTimestamp' | Cluster-wide event stream — scheduling, pulling, mounting, killing, scaling events across all resources. | When you need the big picture. Great for correlating issues across resources. |
| kubectl exec -it <pod> -- /bin/sh | Interactive shell inside a running container. Test DNS, connectivity, filesystem, environment variables. | When you need to verify the container's runtime environment. |
| kubectl debug node/<name> -it --image=busybox | Launches a privileged debugging Pod on a specific node with access to the host filesystem and network. | When the issue is at the node level — disk, networking, kubelet, or container runtime. |

The Golden Rule: Events First, Logs Second

When a Pod is not running, kubectl logs often returns nothing — there's no container to produce output yet. Always run kubectl describe pod <name> first. The Events section at the bottom tells you what Kubernetes tried to do and why it failed — image pull errors, scheduling failures, volume mount issues, and probe failures all surface here before any container log exists.

The Troubleshooting Decision Tree

When something goes wrong, the fastest path to a fix is to identify the category of failure first. Pod status is your primary signal — it immediately narrows your search space to one of a few well-understood failure modes.

flowchart TD
    START["Pod not working as expected"] --> STATUS{"What is the Pod status?"}
    STATUS -->|"Pending"| PEND{"Check events with describe"}
    PEND -->|"No nodes match"| FIX_SCHED["Fix nodeSelector, tolerations, or affinity"]
    PEND -->|"Insufficient CPU/memory"| FIX_RES["Reduce requests or add cluster capacity"]
    PEND -->|"PVC not bound"| FIX_PVC["Fix StorageClass or provision PV"]
    STATUS -->|"ImagePullBackOff"| IMG{"Check image name & registry"}
    IMG -->|"Wrong tag/name"| FIX_IMG["Fix image reference in Pod spec"]
    IMG -->|"Auth required"| FIX_SEC["Create or fix imagePullSecrets"]
    STATUS -->|"CrashLoopBackOff"| CRASH{"Check logs --previous"}
    CRASH -->|"OOMKilled"| FIX_OOM["Increase memory limits"]
    CRASH -->|"App error"| FIX_APP["Fix application code or config"]
    CRASH -->|"Bad command/args"| FIX_CMD["Fix command or entrypoint"]
    STATUS -->|"Running but not working"| RUNNING{"Check Service & networking"}
    RUNNING -->|"No endpoints"| FIX_LBL["Fix label selector or targetPort"]
    RUNNING -->|"DNS failure"| FIX_DNS["Check CoreDNS and DNS policy"]
    RUNNING -->|"Probe failing"| FIX_PROBE["Fix readiness/liveness probe"]
    STATUS -->|"Evicted"| EVICT["Check node pressure conditions"]
    style START fill:#f8fafc,stroke:#334155,color:#0f172a
    style FIX_SCHED fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_RES fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_PVC fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_IMG fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_SEC fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_OOM fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_APP fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_CMD fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_LBL fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_DNS fill:#dcfce7,stroke:#16a34a,color:#14532d
    style FIX_PROBE fill:#dcfce7,stroke:#16a34a,color:#14532d
    style EVICT fill:#fef9c3,stroke:#ca8a04,color:#713f12

With this mental model, let's walk through each failure category in detail.

Pod Issues

Pods are the atomic unit of scheduling in Kubernetes, and they're where most failures surface. The Pod's status.phase and status.containerStatuses[].state fields are your first diagnostic signals. Run kubectl get pods to see the high-level status, then drill in with describe.

ImagePullBackOff

ImagePullBackOff means the kubelet tried to pull the container image and failed. After the initial failure (ErrImagePull), Kubernetes backs off exponentially — retrying at increasing intervals up to 5 minutes. The three most common causes are: a misspelled image name or tag, a private registry that requires authentication, and a missing imagePullSecret.

bash
# Step 1: Identify the failing Pod
kubectl get pods
# NAME                     READY   STATUS             RESTARTS   AGE
# api-server-7f8b9d6c4-x2k  0/1   ImagePullBackOff   0          3m

# Step 2: Get the exact error from events
kubectl describe pod api-server-7f8b9d6c4-x2k | tail -10
# Events:
#   Warning  Failed   pull image "myregistry.io/api-server:v2.1.0":
#            rpc error: code = Unknown desc = failed to pull and unpack image:
#            401 Unauthorized

# Step 3: Verify the image exists (from your local machine)
docker pull myregistry.io/api-server:v2.1.0

# Step 4: If auth is the issue, create the pull secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.io \
  --docker-username=deploy-bot \
  --docker-password="${REGISTRY_TOKEN}" \
  --docker-email=deploy@example.com

# Step 5: Patch the Deployment to use the secret
kubectl patch deployment api-server -p \
  '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'

Quick checklist for ImagePullBackOff:

  • Is the image name and tag spelled correctly? (Watch for typos like ngixn vs nginx.)
  • Does the tag actually exist in the registry? (latest may not exist on every image.)
  • Is the registry private? If so, does the namespace have the correct imagePullSecret?
  • Is the node able to reach the registry? (Network policies, firewall rules, proxy settings.)
  • Is the imagePullPolicy set to Always when it should be IfNotPresent (or vice versa)?
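Rather than patching live objects as in Step 5, it is cleaner to declare the pull secret in the Deployment manifest so the fix survives redeploys. A minimal sketch (names mirror the example above; trimmed to the relevant fields):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      imagePullSecrets:
        - name: regcred        # the secret created in Step 4
      containers:
        - name: api-server
          image: myregistry.io/api-server:v2.1.0
```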

CrashLoopBackOff

CrashLoopBackOff means the container starts, then exits, and Kubernetes keeps restarting it with increasing backoff delays (10s, 20s, 40s, up to 5 minutes). The image was pulled successfully — the problem is inside the container. This is where kubectl logs --previous becomes essential, because it captures stdout/stderr from the last terminated container instance.

bash
# Check the restart count and termination reason
kubectl get pod payment-svc-5d4f8b7a9-m3j \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' | jq .
# {
#   "exitCode": 137,
#   "reason": "OOMKilled",
#   "startedAt": "2024-11-15T08:23:01Z",
#   "finishedAt": "2024-11-15T08:23:44Z"
# }

# Grab logs from the previous (crashed) container
kubectl logs payment-svc-5d4f8b7a9-m3j --previous

# If the container exits too fast for logs, override the entrypoint
# to keep it alive for inspection:
kubectl debug payment-svc-5d4f8b7a9-m3j \
  -it --copy-to=debug-pod --container=payment \
  -- /bin/sh -c "sleep 3600"

The three dominant causes of CrashLoopBackOff:

| Cause | Indicator | Fix |
| --- | --- | --- |
| OOMKilled | Exit code 137, reason OOMKilled in lastState.terminated | Increase resources.limits.memory in the Pod spec, or fix the memory leak in the application. |
| Application error | Non-zero exit code (1, 2, etc.), error messages in kubectl logs --previous | Fix the application bug. Check environment variables, ConfigMap mounts, database connection strings, and missing dependencies. |
| Wrong command/args | Exit code 126 (permission denied) or 127 (command not found) | Verify the command and args fields in the container spec. Remember: command overrides the image's ENTRYPOINT, and args overrides CMD. |

OOMKilled vs. Node-Level Memory Pressure

Exit code 137 with reason OOMKilled means the container exceeded its memory limit and the kernel's OOM killer terminated it. This is different from node-level memory pressure, which causes Pod eviction (covered below). For OOMKilled, the fix is always at the container level — raise the limit or reduce memory consumption. Don't confuse the two.
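The container-level fix is a higher memory limit in the container spec. A sketch of the relevant fragment (the image name and sizes are illustrative — size the limit from the observed container_memory_working_set_bytes, not a guess):

```yaml
containers:
  - name: payment
    image: registry.example.com/payment-svc:v1.8.2   # illustrative
    resources:
      requests:
        memory: "256Mi"   # what the scheduler reserves on the node
      limits:
        memory: "512Mi"   # the OOMKill threshold — raise if the working set exceeds it
```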

Pending Pods

A Pod in Pending status has been accepted by the API server and stored in etcd, but the scheduler cannot place it on any node. The Pod will remain Pending indefinitely until the underlying constraint is resolved. The Events section in kubectl describe pod always tells you exactly why.

bash
# Diagnose why a Pod is Pending
kubectl describe pod ml-training-pod-8x4r2 | grep -A 5 "Events"
# Events:
#   Warning  FailedScheduling  0/3 nodes are available:
#     1 node(s) had untolerated taint {gpu=true},
#     2 node(s) didn't match Pod's node affinity/selector.

# Check resource availability across all nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# Compare Pod requests against node capacity
kubectl get pod ml-training-pod-8x4r2 \
  -o jsonpath='{.spec.containers[*].resources}' | jq .

# Check for unbound PVCs (common with StatefulSets)
kubectl get pvc
# NAME          STATUS    VOLUME   CAPACITY   STORAGECLASS   AGE
# data-redis-0  Pending                       fast-ssd       5m

Common causes and their fixes:

  • Insufficient resources: No node has enough allocatable CPU or memory for the Pod's requests. Either reduce the requests, add nodes, or evict lower-priority workloads.
  • Node selector mismatch: The Pod specifies nodeSelector: {disktype: ssd} but no node has that label. Add the label with kubectl label node <name> disktype=ssd.
  • Taints without tolerations: All available nodes have taints the Pod doesn't tolerate. Add the matching tolerations to the Pod spec.
  • Unbound PVC: The Pod references a PersistentVolumeClaim that hasn't been provisioned yet. See the Storage Issues subsection below.
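The scheduling fixes above translate directly into Pod spec fields. A sketch combining a node selector and a toleration (the label and taint keys are illustrative and must match what your nodes actually carry):

```yaml
spec:
  nodeSelector:
    disktype: ssd            # requires: kubectl label node <name> disktype=ssd
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule     # must match the effect of the node's taint
```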

Evicted Pods

Eviction happens when a node runs critically low on resources — disk space, memory, or process IDs. The kubelet monitors these thresholds and evicts Pods to protect node stability. Evicted Pods are not restarted on the same node; the owning controller (Deployment, ReplicaSet) creates replacements that get scheduled elsewhere.

bash
# Find evicted Pods (they linger in Failed status)
kubectl get pods --field-selector=status.phase=Failed \
  -o custom-columns=NAME:.metadata.name,REASON:.status.reason,NODE:.spec.nodeName
# NAME                        REASON     NODE
# logger-5c7f8d-xq9k2         Evicted    worker-03
# cache-warmup-b8d4f-r3j7n    Evicted    worker-03

# Check node conditions for pressure signals
kubectl describe node worker-03 | grep -A 3 "Conditions"
# Conditions:
#   MemoryPressure   True    KubeletHasInsufficientMemory
#   DiskPressure     False   KubeletHasNoDiskPressure
#   PIDPressure      False   KubeletHasSufficientPID
#   Ready            True    KubeletReady

# Clean up lingering evicted Pod objects
kubectl delete pods --field-selector=status.phase=Failed

Evictions are a symptom, not the root cause. If Pods keep getting evicted from the same node, investigate: is a Pod consuming unbounded memory (no limits set)? Are log files filling the disk? Is emptyDir storage growing unchecked? Set proper resource limits and configure log rotation to prevent recurrence.
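To stop repeat evictions, bound the resources that create node pressure — memory and ephemeral storage (the writable container layer, logs, and emptyDir volumes). A sketch of the relevant Pod spec fields (image and sizes are illustrative):

```yaml
spec:
  containers:
    - name: logger
      image: registry.example.com/logger:v1.0.0   # illustrative
      resources:
        limits:
          memory: "256Mi"
          ephemeral-storage: "1Gi"   # caps writable layer + logs + emptyDir usage
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 500Mi             # exceeding this evicts the Pod, protecting the node
```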

Service Issues

A Kubernetes Service provides a stable network identity (ClusterIP, DNS name) for a set of Pods. When a Service doesn't route traffic correctly, the problem almost always comes down to one thing: the Service can't find its backend Pods. Kubernetes uses label selectors to associate a Service with Pods, and the Endpoints (or EndpointSlice) object is the link between them.

Endpoints Not Populating

When kubectl get endpoints <service-name> shows <none>, the Service has no backend Pods. Traffic sent to the Service's ClusterIP goes nowhere. This is the single most common Service debugging scenario.

bash
# Step 1: Check if the Service has endpoints
kubectl get endpoints order-service
# NAME            ENDPOINTS   AGE
# order-service   <none>      12m    <-- problem!

# Step 2: Compare the Service selector with Pod labels
kubectl get svc order-service -o jsonpath='{.spec.selector}' | jq .
# { "app": "order-svc" }

kubectl get pods --show-labels | grep order
# order-svc-7f8b9c-x2k   1/1   Running   app=orders   <-- mismatch!

# The Service selects "app=order-svc" but Pods have "app=orders"
# Fix: update either the Service selector or Pod labels

# Step 3: Also verify the port mapping
kubectl get svc order-service -o jsonpath='{.spec.ports[*]}' | jq .
# { "port": 80, "targetPort": 3000, "protocol": "TCP" }

# Confirm the container actually listens on port 3000
kubectl exec order-svc-7f8b9c-x2k -- ss -tlnp

The three-point Service health check: (1) Do the label selectors match? (2) Does targetPort match the port the container is actually listening on? (3) Are the backend Pods in Ready state? A Pod that fails its readiness probe is automatically removed from the Endpoints list — it's running, but the Service won't send traffic to it.
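Those three checks map onto specific manifest fields. A minimal matched Service for the example above (names mirror the debugging session; the Deployment side is summarized in comments):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-svc      # (1) must equal the labels in the Deployment's Pod template
  ports:
    - port: 80          # the port clients connect to
      targetPort: 3000  # (2) must be the port the container actually listens on
# (3) The Pod template also needs a passing readinessProbe —
#     an unready Pod is removed from the Endpoints list.
```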

External Access Not Working

When a LoadBalancer or NodePort Service isn't reachable from outside the cluster, the issue is often at the infrastructure layer rather than Kubernetes itself.

bash
# Check if the LoadBalancer has an external IP assigned
kubectl get svc frontend -o wide
# NAME       TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)
# frontend   LoadBalancer   10.96.45.12   <pending>     80:31247/TCP

# <pending> means the cloud controller hasn't provisioned an LB yet.
# Possible causes:
#   - No cloud-controller-manager running (bare-metal cluster)
#   - Quota exhausted for load balancers in your cloud account
#   - Missing annotation required by your cloud provider

# For NodePort, verify the port is reachable on the node
kubectl get svc frontend -o jsonpath='{.spec.ports[0].nodePort}'
# 31247

# Test from outside: curl http://<node-external-ip>:31247
# If blocked, check security group / firewall rules
# for the node port range (default 30000-32767)

Networking Issues

Kubernetes networking relies on a flat network model — every Pod gets its own IP, and every Pod can reach every other Pod without NAT. In practice, this works through a CNI (Container Network Interface) plugin. When networking breaks, it's usually DNS resolution, CNI misconfiguration, or NetworkPolicy rules blocking traffic.

DNS Resolution Failures

Every Pod in a Kubernetes cluster relies on CoreDNS for service discovery. When DNS breaks, Pods can't resolve Service names, and cascading failures follow quickly. The telltale signs: applications log "name resolution failed" or "host not found" errors while IP-based connectivity works fine.

bash
# Step 1: Test DNS from inside a Pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
# If this fails, DNS is broken cluster-wide.

# Step 2: Check CoreDNS Pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# NAME                       READY   STATUS    RESTARTS
# coredns-5d78c9869d-4xk2m   1/1     Running   0
# coredns-5d78c9869d-r8j7n   1/1     Running   0

# Step 3: Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Step 4: Verify the kube-dns Service has endpoints
kubectl get endpoints -n kube-system kube-dns
# Should show CoreDNS Pod IPs on ports 53 (UDP+TCP)

# Step 5: Check the Pod's DNS config
kubectl exec <your-pod> -- cat /etc/resolv.conf
# nameserver 10.96.0.10  <-- should be the kube-dns ClusterIP
# search default.svc.cluster.local svc.cluster.local cluster.local

If CoreDNS Pods are healthy but DNS is still failing, check whether a NetworkPolicy is blocking UDP/TCP port 53 traffic to the kube-system namespace. Also verify that the Pod's dnsPolicy is set correctly — the default ClusterFirst routes queries to CoreDNS, while Default uses the node's /etc/resolv.conf instead.
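dnsPolicy and the optional dnsConfig block are set per Pod. A sketch of the relevant spec fields (the ndots value is an illustrative tuning choice, not a required setting):

```yaml
spec:
  dnsPolicy: ClusterFirst    # the default — route queries through CoreDNS
  dnsConfig:                 # optional entries merged into the Pod's resolv.conf
    options:
      - name: ndots
        value: "2"           # fewer search-domain expansions for external hostnames
```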

Cross-Namespace Communication

Pods can reach Services in other namespaces by using the fully qualified domain name: <service-name>.<namespace>.svc.cluster.local. If this fails while same-namespace resolution works, the cause is almost always a NetworkPolicy restricting ingress or egress between namespaces.

bash
# Test cross-namespace connectivity
kubectl exec -n frontend deploy/web-app -- \
  wget -qO- --timeout=3 http://api-gateway.backend.svc.cluster.local/health

# If it times out, check NetworkPolicies in the target namespace
kubectl get networkpolicy -n backend
# NAME           POD-SELECTOR    AGE
# restrict-all   app=api-gw      2d

# Inspect the policy — does it allow ingress from the frontend namespace?
kubectl describe networkpolicy restrict-all -n backend

yaml
# NetworkPolicy that allows traffic from the "frontend" namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-ingress
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-gw
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend
      ports:
        - protocol: TCP
          port: 8080

Storage Issues

Persistent storage in Kubernetes is mediated by two resources: PersistentVolumeClaims (PVCs) that Pods request, and PersistentVolumes (PVs) that represent the actual storage. A provisioner (usually via a StorageClass) dynamically creates PVs to satisfy PVC requests. When this chain breaks, Pods that depend on the volume get stuck in Pending.

PVC Stuck in Pending

bash
# Check PVC status
kubectl get pvc
# NAME           STATUS    STORAGECLASS   CAPACITY   AGE
# postgres-data  Pending   fast-ssd                  8m

# Get the reason from events
kubectl describe pvc postgres-data
# Events:
#   Warning  ProvisioningFailed  storageclass.storage.k8s.io "fast-ssd" not found

# List available StorageClasses
kubectl get storageclass
# NAME            PROVISIONER             RECLAIMPOLICY
# standard        kubernetes.io/gce-pd    Delete
# premium-rwo     pd.csi.storage.gke.io   Delete

# Fix: delete the PVC and recreate with a valid StorageClass
# (storageClassName is immutable after creation)
kubectl delete pvc postgres-data
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: premium-rwo
  resources:
    requests:
      storage: 20Gi
EOF

Other causes of Pending PVCs: the provisioner Pod itself is not running (check kube-system), the cloud provider hit a quota limit on disk volumes, or the PVC requests a specific accessMode (like ReadWriteMany) that the provisioner doesn't support. For WaitForFirstConsumer binding mode, the PVC intentionally stays Pending until a Pod that references it is scheduled — this is normal behavior, not an error.
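
For reference, here is a StorageClass sketch that opts into WaitForFirstConsumer binding. The provisioner and parameters shown are illustrative (GKE's PD CSI driver); substitute your cluster's storage backend.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: pd.csi.storage.gke.io       # illustrative; use your CSI driver
parameters:
  type: pd-ssd                           # provisioner-specific parameter
volumeBindingMode: WaitForFirstConsumer  # PVC binds only once a Pod is scheduled
reclaimPolicy: Delete
allowVolumeExpansion: true
```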

Volume Mount Failures

Even after a PVC is bound, the volume can fail to mount on the node. This typically causes the Pod to be stuck in ContainerCreating status with a FailedAttachVolume or FailedMount event.

bash
# Pod stuck in ContainerCreating
kubectl describe pod postgres-0 | grep -A 10 "Events"
# Events:
#   Warning  FailedAttachVolume  AttachVolume.Attach failed for volume "pvc-9a8b7c":
#     rpc error: code = Internal desc = Could not attach volume:
#     volume is already attached to node "worker-01"

# This happens when a ReadWriteOnce volume is still attached to another node.
# Common during node drains or Pod rescheduling.

# Check which node currently has the volume attached
kubectl get volumeattachment | grep pvc-9a8b7c

# If the old Pod is gone but the attachment lingers, force-detach:
kubectl delete volumeattachment <attachment-name>

Node Issues

Nodes are the physical (or virtual) machines that run your workloads. When a node has problems, every Pod on it is affected. The kubelet reports node health via conditions, and the node controller in the control plane takes action when conditions degrade — marking nodes as NotReady, evicting Pods, or preventing new scheduling.

NotReady Nodes

A NotReady node means the kubelet has stopped reporting healthy heartbeats to the API server. After the node-monitor-grace-period (default: 40 seconds), the node controller marks the node NotReady and taints it. Roughly 5 minutes later (the legacy pod-eviction-timeout, or the default 300-second NoExecute toleration under taint-based eviction), Pods are evicted and rescheduled to healthy nodes.
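
That 5-minute window is tunable per Pod. With taint-based eviction, the control plane injects a 300-second toleration for the unreachable and not-ready taints into every Pod; a spec fragment like this (values are illustrative) shortens it for workloads that should fail over faster:

```yaml
# Pod spec fragment: fail over after 60s instead of the 300s default
spec:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
```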

bash
# Identify NotReady nodes
kubectl get nodes
# NAME        STATUS     ROLES    AGE   VERSION
# worker-01   Ready      <none>   45d   v1.29.2
# worker-02   NotReady   <none>   45d   v1.29.2
# worker-03   Ready      <none>   45d   v1.29.2

# Get conditions and recent events for the bad node
kubectl describe node worker-02 | grep -A 20 "Conditions"

# If you can SSH into the node, check kubelet status there:
#   ssh worker-02
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" --no-pager | tail -30

# Common causes:
#   - kubelet process crashed or was OOM-killed
#   - Container runtime (containerd/CRI-O) unresponsive
#   - Certificate expired (kubelet can't authenticate to API server)
#   - Network partition between node and control plane

# Restart kubelet as a quick recovery step
sudo systemctl restart kubelet

Node Resource Pressure

Even when a node is Ready, it can be under resource pressure — triggering Pod evictions. The kubelet monitors three thresholds and sets corresponding conditions when they're breached.

Condition | Default Threshold | What Triggers It | Recovery Action
MemoryPressure | memory.available < 100Mi | Too many Pods without memory limits, memory leaks, or undersized nodes | Evict BestEffort Pods first, then Burstable. Set proper requests and limits.
DiskPressure | nodefs.available < 10% | Container images filling disk, large log files, or emptyDir volumes growing unbounded | Prune unused images (crictl rmi --prune), configure log rotation, set emptyDir.sizeLimit.
PIDPressure | pid.available < 1000 | Applications forking too many processes | Set pids-limit in container runtime config. Investigate the process-leaking container.

bash
# Check actual vs. allocatable resources on a specific node
kubectl describe node worker-03 | grep -A 8 "Allocated resources"
# Allocated resources:
#   (Total limits may be over 100 percent.)
#   Resource           Requests    Limits
#   --------           --------    ------
#   cpu                3800m (95%) 7200m (180%)
#   memory             6Gi (78%)   12Gi (150%)
#   ephemeral-storage  0 (0%)      0 (0%)

# This node is at 95% CPU requests — no new Pods can schedule here.
# Limits over 100% mean overcommit — safe until Pods actually burst.

Proactive Monitoring Beats Reactive Debugging

If you're troubleshooting resource pressure regularly, you need better observability — not faster debugging. Set up Prometheus alerts on kube_node_status_condition for pressure conditions, node_memory_MemAvailable_bytes for memory headroom, and node_filesystem_avail_bytes for disk. Catch these issues at 80% utilization, not at 100%. The previous section on Monitoring with Prometheus and Grafana covers this setup in detail.
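
As a starting point, here is a sketch of such alerts using the Prometheus Operator's PrometheusRule CRD. The thresholds, namespace, and severity labels are assumptions to tune for your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-pressure-alerts
  namespace: monitoring              # adjust to where Prometheus runs
spec:
  groups:
    - name: node-capacity
      rules:
        - alert: NodeDiskFillingUp
          expr: |
            node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"} < 0.20
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has less than 20% disk available"
        - alert: NodeMemoryPressure
          expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} reports MemoryPressure"
```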

Putting It All Together: A Real Debugging Session

Real incidents rarely involve a single, obvious failure. Here's a realistic end-to-end example: a new deployment rollout appears stuck, and you need to find out why.

bash
# 1. What's the Deployment status?
kubectl rollout status deployment/checkout-api --timeout=10s
# Waiting for deployment "checkout-api" rollout to finish:
#   1 out of 3 new replicas have been updated...

# 2. Which Pods are problematic?
kubectl get pods -l app=checkout-api
# NAME                            READY   STATUS             RESTARTS
# checkout-api-6b8f9d7c4-old1     1/1     Running            0
# checkout-api-6b8f9d7c4-old2     1/1     Running            0
# checkout-api-6b8f9d7c4-old3     1/1     Running            0
# checkout-api-85d4a3f1b-new1     0/1     CrashLoopBackOff   4

# 3. Why is the new Pod crashing?
kubectl logs checkout-api-85d4a3f1b-new1 --previous
# Error: FATAL: password authentication failed for user "checkout"
# Connection to database refused

# 4. Check what secret the new Pod references
kubectl get pod checkout-api-85d4a3f1b-new1 -o jsonpath=\
  '{.spec.containers[0].envFrom}' | jq .
# [{ "secretRef": { "name": "checkout-db-creds-v2" } }]

# 5. Does that secret exist?
kubectl get secret checkout-db-creds-v2
# Error from server (NotFound): secrets "checkout-db-creds-v2" not found

# Root cause: the new Deployment version references a secret that
# hasn't been created yet. Create it and the rollout will proceed.

This five-step flow — rollout status → identify failing Pods → check logs → inspect config → trace the dependency — works for the vast majority of deployment issues. The key habit is to let each command's output guide your next command, rather than guessing randomly. Systematic beats fast every time.

Horizontal and Vertical Pod Autoscaling (HPA & VPA)

Kubernetes offers two complementary autoscaling dimensions. Horizontal Pod Autoscaling (HPA) adjusts the number of Pod replicas — more traffic means more Pods. Vertical Pod Autoscaling (VPA) adjusts the resource requests and limits on each Pod — the same number of Pods, but each one gets more (or less) CPU and memory. Together they let your workloads respond to demand without manual intervention.

Understanding when to use each — and when not to combine them — is the key to a stable, cost-efficient cluster. This section walks through the algorithms, APIs, configuration knobs, and YAML manifests for both.

graph LR
    M["Metrics Server / Custom Metrics Adapter"] -->|current metrics| HPA["HPA Controller"]
    M -->|resource usage| VPA["VPA Recommender"]
    HPA -->|scale replicas| D["Deployment / ReplicaSet"]
    VPA -->|update requests & limits| D
    D --> P1["Pod"]
    D --> P2["Pod"]
    D --> P3["Pod +/-"]
    style HPA fill:#3b82f6,color:#fff,stroke:#2563eb
    style VPA fill:#8b5cf6,color:#fff,stroke:#7c3aed
    style M fill:#f59e0b,color:#fff,stroke:#d97706

Horizontal Pod Autoscaler (HPA)

The HPA is a built-in Kubernetes controller that periodically (every 15 seconds by default) fetches metrics, computes the desired replica count, and patches the target workload's scale subresource. It ships with the control plane — no extra installation is required. You only need a Metrics Server (for CPU/memory) or a custom metrics adapter (for application-level metrics) to supply the data it reads.

The HPA Algorithm

The core formula is deceptively simple:

text
desiredReplicas = ceil( currentReplicas x ( currentMetricValue / desiredMetricValue ) )

For example, if you have 3 replicas, current average CPU utilization is 80%, and the target is 50%, the calculation is ceil(3 x (80 / 50)) = ceil(4.8) = 5. The HPA scales you to 5 replicas. When multiple metrics are configured, the HPA computes a desired replica count for each metric and takes the maximum — the most aggressive scale-up wins.
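
The arithmetic is easy to sanity-check with a quick shell sketch (numbers taken from the example above):

```shell
# Hypothetical inputs: 3 replicas, 80% current CPU, 50% target
current=3; usage=80; target=50

# desiredReplicas = ceil(currentReplicas * currentMetric / desiredMetric)
desired=$(awk -v c="$current" -v u="$usage" -v t="$target" \
  'BEGIN { d = c * u / t; r = int(d); if (r < d) r++; print r }')

echo "$desired"   # ceil(4.8) = 5
```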

Pods that are not yet ready, or that have no metrics yet (just started), are handled conservatively. During a scale-down calculation, Pods with missing metrics are assumed to be consuming 100% of the target value; during a scale-up, they are assumed to be consuming 0%. Both assumptions dampen the magnitude of the change, so a few metric-less Pods never trigger a dramatic scale in either direction.

The 10% Tolerance Band

The HPA does not act on every tiny fluctuation. It has a default tolerance of 0.1 (10%). If the ratio currentMetric / desiredMetric is within [0.9, 1.1], no scaling action is taken. This prevents thrashing on noisy metrics.

HPA v2 API — Metric Types

The modern HPA API (autoscaling/v2, stable since Kubernetes 1.23) supports four distinct metric sources. This is a major upgrade over v1, which only supported CPU percentage.

Metric Type | Source | Use Case
Resource | Metrics Server (metrics.k8s.io) | Scale on CPU or memory utilization as a percentage of requests.
Pods | Custom Metrics Adapter (custom.metrics.k8s.io) | Scale on a per-pod metric like requests_per_second or queue_depth from Prometheus.
Object | Custom Metrics Adapter (custom.metrics.k8s.io) | Scale on a metric attached to another Kubernetes object, like an Ingress's requests-per-second.
External | External Metrics Adapter (external.metrics.k8s.io) | Scale on a metric from outside the cluster — an SQS queue length, a Pub/Sub subscription backlog, etc.

Here is a complete HPA manifest that uses three of these metric types simultaneously. The HPA evaluates all of them and picks the one that demands the most replicas:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    # 1. Resource metric — keep average CPU at 60%
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

    # 2. Pods metric — custom metric from Prometheus adapter
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

    # 3. External metric — SQS queue depth
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: order-processing
        target:
          type: Value
          value: "50"

The scaleTargetRef points at the workload to scale. minReplicas and maxReplicas set the guardrails — the HPA will never go below 3 or above 30 regardless of what the metrics say. Each item in the metrics array independently computes a desired replica count, and the largest value wins.

Scaling Behavior — Stabilization and Policies

Raw autoscaling with no throttling can be dangerous. A brief CPU spike could add 20 Pods, then immediately remove them when the spike subsides, causing cascading restarts. The behavior field gives you fine-grained control over how fast the HPA scales in each direction.

There are two key concepts: stabilization windows decide how long the HPA looks back to avoid reacting to transient spikes. Scaling policies limit how many replicas can be added or removed in a given time period, expressed as either an absolute number (Pods) or a percentage (Percent) of current replicas.

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0        # React immediately to scale-up need
      policies:
        - type: Percent
          value: 100                        # Allow doubling every 60s
          periodSeconds: 60
        - type: Pods
          value: 5                          # Or add at least 5 pods per 60s
          periodSeconds: 60
      selectPolicy: Max                     # Use whichever policy allows MORE pods
    scaleDown:
      stabilizationWindowSeconds: 300       # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10                         # Remove at most 10% per minute
          periodSeconds: 60
      selectPolicy: Min                     # Use the most conservative policy

This configuration follows a common production pattern: scale up aggressively, scale down conservatively. The scale-up has no stabilization window and allows doubling, so the workload reacts quickly to genuine traffic surges. The scale-down uses a 5-minute stabilization window and limits removal to 10% of replicas per minute, preventing capacity from dropping too fast after a brief traffic dip.

Behavior Field | Default (Scale Up) | Default (Scale Down) | Purpose
stabilizationWindowSeconds | 0 | 300 (5 min) | Looks back over this window and picks the highest (up) or lowest (down) recommended replica count.
policies[].type | - | - | Pods (absolute count) or Percent (of current replicas).
policies[].value | - | - | The max pods/percent that can change per periodSeconds.
policies[].periodSeconds | - | - | Time window for the policy (1–1800 seconds).
selectPolicy | Max | Max | Max picks the policy that allows the most change. Min picks the most restrictive. Disabled blocks scaling in that direction entirely.

Inspecting HPA Status

After creating an HPA, you can see what the controller is doing at any time:

bash
# Quick overview — shows targets, current values, and replica count
kubectl get hpa order-service -n production

# Detailed status — shows each metric, conditions, and events
kubectl describe hpa order-service -n production

# Watch scaling decisions in real time
kubectl get hpa order-service -n production -w

If the TARGETS column shows <unknown>/60%, it means the metrics pipeline is broken. Check that Metrics Server is running (kubectl get pods -n kube-system | grep metrics-server) and that your Pods have resources.requests set — the HPA cannot compute utilization percentage without a denominator.

Vertical Pod Autoscaler (VPA)

While HPA answers "how many Pods?", VPA answers "how big should each Pod be?" The Vertical Pod Autoscaler watches actual resource consumption over time and adjusts the requests and limits on containers. This is critical for workloads where developers have no idea what to request — and in practice, initial resource estimates are almost always wrong.

VPA is not built into the control plane. You install it separately (the project lives at kubernetes/autoscaler on GitHub). It consists of three components:

Component | Role
Recommender | Reads historical and real-time resource usage from the Metrics Server. Computes recommended requests for each container.
Updater | Checks running Pods against the recommendation. If a Pod's requests are significantly off, it evicts the Pod so it gets recreated with new values.
Admission Controller | Intercepts Pod creation and mutates the requests/limits fields to match the current recommendation — so new Pods start right-sized.

VPA Update Modes

The updatePolicy.updateMode field controls how aggressively the VPA acts. Choosing the right mode depends on your tolerance for Pod restarts.

Mode | Behavior | Pod Restarts? | Best For
Off | Produces recommendations only. Does not change Pod resources. | No | Observation and right-sizing analysis. Safe starting point.
Initial | Applies recommendations at Pod creation time only. Does not touch running Pods. | No (existing Pods) | Workloads where you control rollout timing (e.g., via CI/CD deploys).
Auto | Applies recommendations at creation and evicts running Pods to update them. This is the fully automatic mode. | Yes | Non-critical workloads that tolerate occasional restarts.

Here is a VPA resource in Off mode — the safest way to start. It will produce recommendations you can review without affecting any running Pods:

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: order-service
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
        controlledResources: ["cpu", "memory"]

The resourcePolicy sets guardrails. Without minAllowed and maxAllowed, the VPA could recommend absurdly small or large values. The controlledResources field lets you limit VPA to only CPU or only memory if needed.

Once the recommender has gathered enough data (give it at least a few hours, ideally 24 hours), inspect the recommendations:

bash
kubectl describe vpa order-service-vpa -n production

The output includes four recommendation tiers: lowerBound, target, uncappedTarget (ignores your min/max constraints), and upperBound. In most cases, you want to use the target value when manually adjusting manifests.

VPA in Auto Mode

When you are confident in the VPA's recommendations and your workload can handle Pod evictions, switch to Auto:

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: order-service
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi
        controlledResources: ["cpu", "memory"]

In Auto mode, the Updater component periodically compares each Pod's actual requests against the recommendation. If the difference exceeds a threshold, it evicts the Pod. The Deployment controller creates a replacement, and the VPA Admission Controller mutates the new Pod's resource requests to the recommended values. This means there will be brief disruptions — make sure you have a PodDisruptionBudget in place.
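
A minimal PodDisruptionBudget sketch for this workload (the selector label is assumed to match the Deployment's Pod template):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: production
spec:
  minAvailable: 2            # VPA evictions never drop below 2 ready Pods
  selector:
    matchLabels:
      app: order-service     # assumed label on the Deployment's Pods
```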

Why You Should Not Use HPA and VPA on CPU Together

This is one of the most common autoscaling mistakes. At first glance it seems logical: let HPA scale replica count based on CPU, and let VPA right-size each Pod's CPU requests. In practice, it creates a feedback loop that makes both controllers fight each other.

graph TD
    A["CPU usage rises"] --> B["HPA adds replicas"]
    B --> C["CPU usage per pod drops"]
    C --> D["VPA lowers CPU requests"]
    D --> E["Utilization % appears higher (same usage, lower request)"]
    E --> A
    style A fill:#ef4444,color:#fff,stroke:#dc2626
    style D fill:#8b5cf6,color:#fff,stroke:#7c3aed
    style B fill:#3b82f6,color:#fff,stroke:#2563eb

Here is the conflict step by step: HPA computes utilization as currentUsage / request. When VPA lowers the request value, the utilization percentage jumps — even though the actual CPU consumption has not changed. The HPA sees high utilization and adds more replicas. More replicas reduce actual per-pod usage, so VPA lowers requests further. This cycle continues until you hit maxReplicas with tiny per-pod resource requests.
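
The broken signal is plain arithmetic. A quick shell sketch with hypothetical numbers shows how lowering the request inflates the utilization percentage even though actual usage never changed:

```shell
# Per-pod CPU usage stays constant; only the request changes (hypothetical values)
usage_m=300          # actual usage: 300 millicores
request_before=500   # CPU request before VPA acts
request_after=250    # CPU request after VPA lowers it

# HPA utilization = usage / request
util_before=$(( usage_m * 100 / request_before ))   # 60: at a 60% target, steady
util_after=$(( usage_m * 100 / request_after ))     # 120: HPA now scales up again

echo "utilization: ${util_before}% -> ${util_after}%"
```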

Safe Combinations of HPA + VPA

If you need both, follow these rules: (1) Use HPA on a custom metric (like requests-per-second or queue depth) — not CPU or memory. (2) Let VPA manage CPU and memory requests. This way the two controllers operate on completely independent signals with no overlap. Alternatively, use VPA in Off mode purely for recommendations and manage requests manually.

HPA vs. VPA — When to Use Which

Criteria | HPA | VPA
Workload can be horizontally scaled (stateless, no sticky sessions) | ✅ Primary choice | Use alongside for right-sizing
Workload cannot add replicas (single-instance DB, singleton worker) | ❌ Not applicable | ✅ Primary choice
Traffic is spiky and unpredictable | ✅ Reacts in seconds | Slower — requires eviction
Resource requests are unknown or drifting | Not its job | ✅ Built for this
Cost optimization / right-sizing | Prevents over-provisioned replica counts | ✅ Prevents over-provisioned per-pod resources
Zero disruption requirement | ✅ No Pod eviction | ⚠️ Auto mode evicts Pods (use Off or Initial)

Putting It Together — A Production-Ready Example

A real-world setup often combines HPA on a custom metric with VPA in Off or Initial mode. Below is a Deployment with resource requests, an HPA that scales on HTTP request rate, and a VPA in recommendation-only mode to continuously right-size the requests over time.

yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
        - name: api-gateway
          image: myregistry/api-gateway:v2.4.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
---
# hpa.yaml — scale on custom metric, NOT cpu
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
---
# vpa.yaml — recommendation only, no conflicts with HPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: api-gateway
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi

With this setup, the HPA handles scaling replica count based on actual traffic (requests per second). The VPA in Off mode watches resource usage and continuously generates right-sizing recommendations. Periodically — during a maintenance window or as part of a deployment cycle — you read the VPA recommendations and update the Deployment's resources.requests accordingly. No feedback loop, no conflicts, the best of both worlds.

Automate the Feedback Loop

You can build a CI/CD step that queries VPA recommendations via kubectl get vpa api-gateway-vpa -o jsonpath='{.status.recommendation}' and opens a pull request to update the Deployment manifest. This gives you the benefits of VPA's analysis without any runtime Pod evictions — a pattern sometimes called "VPA as a recommender."

Cluster Autoscaler and KEDA — Infrastructure and Event-Driven Scaling

HPA and VPA scale your Pods — they add replicas or resize containers. But what happens when the cluster itself runs out of room? If there are no nodes with enough CPU or memory to schedule a new Pod, HPA's additional replicas sit in Pending forever. This is where infrastructure-level scaling steps in.

Two tools dominate this space. The Cluster Autoscaler watches for unschedulable Pods and provisions new nodes from your cloud provider. KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with external event sources — message queues, databases, cron schedules — and enables the powerful ability to scale to and from zero. Together they close the loop: KEDA and HPA scale workloads, and the Cluster Autoscaler scales the infrastructure beneath them.

How the Cluster Autoscaler Works

The Cluster Autoscaler runs as a Deployment inside your cluster (typically in kube-system). It performs two operations on a continuous loop: scale-up when Pods can't be scheduled, and scale-down when nodes are underutilized. It does not watch CPU or memory metrics directly — it watches for Pod scheduling failures.

Scale-up is triggered when the scheduler cannot place a Pod on any existing node because of insufficient resources, taints, affinity rules, or other constraints. The Cluster Autoscaler simulates adding a node from each configured node group and picks the one that would allow the pending Pod(s) to schedule. It then calls the cloud provider API to add that node.

Scale-down is triggered when a node's resource utilization (based on requests, not actual usage) falls below a configurable threshold for a sustained period. Before removing a node, the autoscaler checks that all Pods on it can be rescheduled elsewhere, that none are controlled by a controller that would block eviction (e.g., Pods with local storage, PodDisruptionBudgets that can't be satisfied), and that the node isn't annotated to prevent scale-down.
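
Two annotations control the "annotated to prevent scale-down" part: one on a Node excludes it from removal entirely, and one on a Pod keeps the autoscaler from draining whichever node runs it. Sketches follow; the names and image are illustrative.

```yaml
# Node fragment (usually applied with kubectl annotate):
# exclude this node from scale-down
apiVersion: v1
kind: Node
metadata:
  name: worker-05                      # illustrative node name
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
---
# Pod fragment: never remove the node this Pod is running on
apiVersion: v1
kind: Pod
metadata:
  name: etl-job-runner                 # illustrative Pod name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: etl
      image: registry.example.com/etl:latest   # illustrative image
```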

flowchart TB
    Start["Autoscaler Loop\n(every scan-interval)"] --> CheckPending{"Unschedulable\nPods exist?"}
    CheckPending -->|Yes| Simulate["Simulate scheduling\nagainst each node group"]
    Simulate --> Expand["Select best node group\n(expander strategy)"]
    Expand --> ScaleUp["Call cloud API:\nincrease node group size"]
    ScaleUp --> Start

    CheckPending -->|No| CheckUtil{"Any node below\nutilization threshold?"}
    CheckUtil -->|Yes| CheckSafe{"All Pods safely\nreschedulable?"}
    CheckSafe -->|Yes| Drain["Cordon & drain node"]
    Drain --> ScaleDown["Call cloud API:\ndecrease node group size"]
    ScaleDown --> Start
    CheckSafe -->|No| Start
    CheckUtil -->|No| Start
    

Cloud Provider Integration

The Cluster Autoscaler doesn't manage VMs directly. It talks to your cloud provider's node group abstraction — the mechanism that manages a pool of identically configured machines. Each provider uses a different primitive, but the concept is the same: a group of nodes that can be scaled by changing a "desired count" value.

Cloud Provider | Node Group Primitive | How Autoscaler Interacts
AWS (EKS) | Auto Scaling Groups (ASGs) | Modifies the ASG's DesiredCapacity. Each ASG maps to a node group with a specific instance type, AMI, and launch template. Supports mixed instance types via ASG mixed instance policies.
GCP (GKE) | Managed Instance Groups (MIGs) | Adjusts the MIG's target size. GKE's built-in autoscaler uses the same logic but is integrated natively — you enable it via gcloud or the console rather than deploying a separate controller.
Azure (AKS) | Virtual Machine Scale Sets (VMSS) | Changes the VMSS instance count. AKS integrates the Cluster Autoscaler natively — you configure it per node pool with az aks nodepool update --enable-cluster-autoscaler.

GKE and AKS Have Built-In Autoscalers

On GKE and AKS, the Cluster Autoscaler is a native feature you enable per node pool — there's no need to deploy the autoscaler yourself. On EKS (and self-managed clusters), you install it as a Helm chart or Deployment. In all cases, the underlying logic is the same open-source Cluster Autoscaler project.

Configuring the Cluster Autoscaler

The Cluster Autoscaler exposes several key parameters that control how aggressively it scales up and how cautiously it scales down. Getting these right is the difference between responsive scaling and either wasted spend or prolonged scheduling delays.

Parameter | Default | What It Controls
--scan-interval | 10s | How often the autoscaler checks for unschedulable Pods and underutilized nodes. Lower values react faster but increase API server load.
--scale-down-delay-after-add | 10m | Cooldown after a scale-up before scale-down is considered. Prevents thrashing when a newly added node is still stabilizing.
--scale-down-delay-after-delete | 0s (scan-interval) | Cooldown after a node is removed before another can be removed. Controls how quickly the cluster shrinks.
--scale-down-unneeded-time | 10m | How long a node must remain underutilized before it becomes eligible for removal. Guards against premature removal from temporary dips.
--scale-down-utilization-threshold | 0.5 | A node is considered underutilized if the sum of requested resources (CPU or memory) is below this fraction of its capacity. 0.5 means <50% utilized.
--max-graceful-termination-sec | 600 | Maximum time to wait for Pods to terminate gracefully during node drain before forceful eviction.
--skip-nodes-with-local-storage | true | When true, nodes with Pods using emptyDir volumes won't be removed. Set to false if your workloads can tolerate local data loss.

Here's a Helm values file that configures the Cluster Autoscaler for a production EKS cluster with tuned parameters:

yaml
# cluster-autoscaler-values.yaml
autoDiscovery:
  clusterName: my-production-cluster
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/my-production-cluster

extraArgs:
  scan-interval: 10s
  scale-down-delay-after-add: 10m
  scale-down-delay-after-delete: 0s
  scale-down-unneeded-time: 10m
  scale-down-utilization-threshold: "0.5"
  skip-nodes-with-local-storage: "false"
  expander: least-waste
  balance-similar-node-groups: "true"
  max-node-provision-time: 15m

rbac:
  create: true
  serviceAccount:
    create: true
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ClusterAutoscalerRole

bash
# Install with Helm
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f cluster-autoscaler-values.yaml

Node Group Auto-Discovery

Rather than hard-coding every ASG or MIG name, you can tell the Cluster Autoscaler to discover node groups automatically based on resource tags (AWS and Azure) or instance-group configuration (GCP). This is the recommended approach — when your infrastructure team creates a new node group with the right tags, the autoscaler picks it up automatically.

bash
# AWS: Tag your ASGs with these two tags
# Key: k8s.io/cluster-autoscaler/enabled          Value: true
# Key: k8s.io/cluster-autoscaler/<cluster-name>   Value: owned

# The autoscaler discovers them with:
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled=true,\
k8s.io/cluster-autoscaler/my-cluster=owned

# For scale-from-zero, node-template tags tell the autoscaler what resources
# a node in a currently empty ASG would provide:
# Key: k8s.io/cluster-autoscaler/node-template/resources/cpu
# Key: k8s.io/cluster-autoscaler/node-template/resources/memory

Expander Strategies

When multiple node groups could satisfy the pending Pod(s), the expander decides which one to grow. This is a critical choice that directly affects cost and bin-packing efficiency. You set it with --expander=<strategy>.

| Strategy | How It Chooses | Best For |
|---|---|---|
| random | Picks a node group at random from the candidates. Simple and fast. | Homogeneous clusters where all node groups have the same instance type. Spreads load evenly by chance. |
| most-pods | Picks the node group whose new node would schedule the most pending Pods. | Batch workloads with many identical small Pods. Maximizes the impact of each new node. |
| least-waste | Picks the node group whose new node would have the least idle resources after scheduling pending Pods. Calculates waste as unused CPU + unused memory fractions. | Mixed workloads with varying resource requests. Optimizes for cost by minimizing leftover capacity. Recommended for most production clusters. |
| priority | Uses a ConfigMap to define an ordered priority list of node groups. Falls back to lower-priority groups when higher-priority ones are at max capacity. | Clusters with spot/preemptible nodes alongside on-demand nodes. You prioritize cheap capacity and fall back to expensive capacity. |
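To make the least-waste strategy concrete, here is a hypothetical back-of-the-envelope comparison (not the autoscaler's actual code) of two candidate node groups for a batch of pending Pods. The instance sizes and request totals are invented for illustration:

```shell
# Hypothetical least-waste comparison: pending Pods request 3000m CPU, 6000Mi memory.
# Waste score = unused CPU fraction + unused memory fraction after scheduling them.
pending_cpu=3000; pending_mem=6000

# Candidate A: a 4-vCPU, 16Gi node (4000m CPU, 16384Mi)
waste_a=$(awk "BEGIN { printf \"%.2f\", (4000-$pending_cpu)/4000 + (16384-$pending_mem)/16384 }")

# Candidate B: a 4-vCPU, 8Gi node (4000m CPU, 8192Mi)
waste_b=$(awk "BEGIN { printf \"%.2f\", (4000-$pending_cpu)/4000 + (8192-$pending_mem)/8192 }")

# The group with the lower waste score wins -- here, candidate B
echo "A waste=$waste_a  B waste=$waste_b"
```

Candidate B leaves far less memory idle, so least-waste would grow that node group even though both candidates fit the Pods.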

Here's the ConfigMap used by the priority expander. Higher numbers mean higher priority:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*           # Prefer spot node groups (regex match on ASG name)
    30:
      - .*arm64-ondemand.* # Fall back to cheaper ARM on-demand
    10:
      - .*x86-ondemand.*   # Last resort: x86 on-demand

KEDA — Scaling Based on External Events

The standard HPA scales based on CPU, memory, or custom metrics exposed via the Kubernetes metrics API. But many real-world scaling decisions depend on signals that live outside the cluster: the depth of a Kafka topic, the length of a RabbitMQ queue, a Prometheus query result, or a cron schedule. KEDA (Kubernetes Event-Driven Autoscaling) bridges this gap.

KEDA is a lightweight component that acts as a Kubernetes metrics adapter. It reads external event sources (called scalers) and feeds those metrics into the standard HPA machinery. This means you get all of HPA's stabilization, scaling policies, and behavior — but driven by any event source KEDA supports. Crucially, KEDA adds one capability that HPA alone cannot provide: scaling to and from zero replicas.

flowchart LR
    subgraph External["External Event Sources"]
        Kafka["Kafka Topic"]
        RMQ["RabbitMQ Queue"]
        Prom["Prometheus"]
        SQS["AWS SQS"]
        Cron["Cron Schedule"]
    end

    subgraph KEDA_NS["KEDA Components"]
        Operator["KEDA Operator"]
        Adapter["Metrics Adapter"]
    end

    subgraph K8s["Kubernetes"]
        HPA["HPA"]
        Deploy["Deployment / Job"]
        Pods["Pods (0 to N)"]
    end

    Kafka --> Operator
    RMQ --> Operator
    Prom --> Operator
    SQS --> Operator
    Cron --> Operator

    Operator -->|"creates & manages"| HPA
    Operator -->|"scale 0 to 1"| Deploy
    Adapter -->|"serves metrics"| HPA
    HPA -->|"scale 1 to N"| Deploy
    Deploy --> Pods
    

KEDA splits the scaling responsibility. The KEDA operator handles the zero-to-one and one-to-zero transitions (since HPA requires at least 1 replica to calculate metrics). Once there's at least one replica, the HPA takes over for the one-to-N scaling, using metrics fed by KEDA's metrics adapter. This architecture means KEDA doesn't replace HPA — it enhances it.

Installing KEDA

KEDA installs cleanly via Helm and runs in its own namespace. It deploys three components: the operator, the metrics API server, and an admission webhook for validating ScaledObject configurations.

bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.16.0

# Verify the installation
kubectl get pods -n keda

ScaledObject — Scaling Deployments and StatefulSets

A ScaledObject is KEDA's primary CRD. It binds an external event source to a Kubernetes workload (Deployment, StatefulSet, or any resource with a /scale subresource). When you create a ScaledObject, KEDA automatically creates and manages an HPA behind the scenes.

Here's a ScaledObject that scales a consumer Deployment based on a Kafka topic's consumer lag:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor        # Deployment name
  pollingInterval: 15            # Check trigger every 15s
  cooldownPeriod: 120            # Wait 120s before scaling to zero
  idleReplicaCount: 0            # Scale to zero when there is no activity at all
  minReplicaCount: 1             # Floor once the trigger is active (must be > idleReplicaCount)
  maxReplicaCount: 50            # Maximum replicas
  fallback:
    failureThreshold: 3          # After 3 failed polls...
    replicas: 5                  # ...fall back to 5 replicas
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker.kafka:9092
        consumerGroup: order-processor-group
        topic: orders
        lagThreshold: "10"       # Scale up when lag > 10 per partition
        activationLagThreshold: "1" # Activate (0 to 1) when lag > 1

The idleReplicaCount and minReplicaCount pair is the key to scale-to-zero. Note that idleReplicaCount only accepts the value 0 and must be strictly less than minReplicaCount. When the Kafka lag drops below activationLagThreshold, KEDA waits for cooldownPeriod seconds, then scales the Deployment to idleReplicaCount. When new messages arrive, KEDA immediately scales to minReplicaCount and hands off to HPA for further scaling.

ScaledJob — Scaling Kubernetes Jobs

Not every workload is a long-running Deployment. For batch processing — where each item in a queue should trigger a discrete Job that runs to completion — KEDA offers the ScaledJob CRD. Instead of scaling replicas, it creates new Job instances proportional to the event source's backlog.

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: video-encoder
  namespace: media
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: encoder
            image: myregistry/video-encoder:3.2
            envFrom:
              - secretRef:
                  name: sqs-credentials
        restartPolicy: Never
    backoffLimit: 3
  pollingInterval: 10
  maxReplicaCount: 20            # Max 20 concurrent Jobs
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 5
  scalingStrategy:
    strategy: accurate           # Create exactly as many Jobs as queue items
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/video-jobs
        queueLength: "1"         # 1 Job per message
        awsRegion: us-east-1
      authenticationRef:
        name: aws-credentials

Common KEDA Triggers

KEDA ships with 60+ built-in scalers. Here are the most commonly used triggers and when you'd reach for each one:

| Trigger | Event Source | Typical Use Case |
|---|---|---|
| kafka | Apache Kafka consumer group lag | Scale consumers to keep up with message throughput. Scales per-partition. |
| rabbitmq | RabbitMQ queue depth | Scale workers processing task queues. Supports both AMQP and HTTP API protocols. |
| redis | Redis list length or stream pending count | Scale based on Redis-backed job queues (Sidekiq, Celery with Redis broker). |
| prometheus | Any Prometheus query result | Scale on custom business metrics — request latency, error rate, active users. Very flexible. |
| cron | Time-based schedule | Pre-scale before known traffic peaks (e.g., scale up at 8 AM, scale down at 8 PM). |
| aws-sqs-queue | AWS SQS queue depth | Scale processors for SQS-based job queues. Integrates with IRSA for auth. |
| azure-servicebus | Azure Service Bus queue/topic message count | Scale handlers for Azure messaging workloads. |
| postgresql | PostgreSQL query result | Scale based on pending row count in a work table. Useful for database-driven job patterns. |
| metrics-api | Any HTTP JSON endpoint | Scale on custom API responses — anything that returns a number. |

Prometheus Trigger Example

The prometheus trigger is the most versatile — if you can express it as a PromQL query that returns a scalar, you can scale on it. This example scales an API gateway based on the request rate and includes a cron trigger for predictive pre-scaling:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 2             # Always keep at least 2 replicas
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="api-gateway"}[2m]))
        threshold: "100"         # Scale up when RPS exceeds 100 per replica
        activationThreshold: "5" # Only activate from zero at 5 RPS
    - type: cron                 # Combine triggers for pre-scaling
      metadata:
        timezone: America/New_York
        start: 0 7 * * 1-5       # 7 AM weekdays
        end: 0 9 * * 1-5         # 9 AM weekdays
        desiredReplicas: "10"    # Pre-scale for morning traffic

Combine Multiple Triggers

A ScaledObject can have multiple triggers — KEDA uses the highest replica count recommended by any trigger. This is powerful for layering strategies: use a Prometheus trigger for reactive scaling and a cron trigger for predictive pre-scaling. The replica count will be the greater of the two at any given moment.
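A sketch of that moment-in-time arithmetic, using invented numbers (450 RPS against the 100-per-replica Prometheus threshold, while the cron window asks for 10):

```shell
# Two triggers, highest recommendation wins (simplified; real HPA adds tolerance)
rps=450; threshold=100
prom_replicas=$(( (rps + threshold - 1) / threshold ))   # ceil(450/100) = 5
cron_replicas=10                                          # cron window is active
final=$(( prom_replicas > cron_replicas ? prom_replicas : cron_replicas ))
echo "prometheus=$prom_replicas cron=$cron_replicas -> scale to $final"
```

Once traffic pushes the Prometheus recommendation above 10 replicas, it takes over from the cron floor automatically.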

KEDA Authentication

Many external event sources require credentials. KEDA handles this through TriggerAuthentication and ClusterTriggerAuthentication CRDs, which decouple credentials from the ScaledObject. This lets you reuse auth configurations and keep secrets out of your scaling manifests.

yaml
# TriggerAuthentication pulling from a Kubernetes Secret
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-auth
  namespace: production
spec:
  secretTargetRef:
    - parameter: sasl            # KEDA trigger parameter name
      name: kafka-credentials    # Kubernetes Secret name
      key: sasl-config           # Key inside the Secret
---
# Reference it in the ScaledObject trigger:
# triggers:
#   - type: kafka
#     authenticationRef:
#       name: kafka-auth
#     metadata:
#       bootstrapServers: kafka:9092

How Cluster Autoscaler and KEDA Work Together

The real power of these tools emerges when they operate as a pipeline. KEDA detects that your Kafka lag is climbing and tells HPA to increase replicas from 3 to 20. The scheduler tries to place 17 new Pods but only 5 fit on existing nodes — the remaining 12 stay Pending. The Cluster Autoscaler detects the unschedulable Pods, provisions 3 new nodes from the appropriate node group, and the scheduler places the Pods as nodes become ready.
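The replica math behind that handoff can be sketched in a few lines. This is a simplification: the real HPA also applies a tolerance band and stabilization windows before acting:

```shell
# For an AverageValue metric like Kafka lag:
# desiredReplicas = ceil(totalLag / lagThreshold), capped at maxReplicaCount
total_lag=200; lag_threshold=10; max_replicas=50
desired=$(( (total_lag + lag_threshold - 1) / lag_threshold ))
if [ "$desired" -gt "$max_replicas" ]; then desired=$max_replicas; fi
echo "desired=$desired"
```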

sequenceDiagram
    participant ES as Event Source (Kafka)
    participant KEDA as KEDA Operator
    participant HPA as HPA
    participant Sched as Scheduler
    participant CA as Cluster Autoscaler
    participant Cloud as Cloud Provider

    ES->>KEDA: Consumer lag = 200
    KEDA->>HPA: Target metric = 200, threshold = 10
    HPA->>HPA: Desired replicas = 20 (currently 3)
    HPA->>Sched: Create 17 new Pods
    Sched->>Sched: 5 scheduled, 12 unschedulable
    Note over Sched: Insufficient CPU/memory
    CA->>CA: Detects 12 Pending Pods
    CA->>Cloud: Add 3 nodes to node group
    Cloud-->>CA: Nodes provisioned
    Sched->>Sched: Place 12 Pods on new nodes
    

The total time from event spike to all Pods running is dominated by node provisioning — typically 2-5 minutes depending on the cloud provider and instance type. This latency is why it's important to keep a small buffer of headroom capacity, either via a higher minReplicaCount or by using pause Pods (low-priority Pods that reserve node capacity and get preempted when real workloads need the space).
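A minimal sketch of the pause-Pod pattern follows. All names here (the overprovisioning PriorityClass, the capacity-reservation Deployment, the request sizes) are illustrative choices, not standard resources:

```yaml
# Reserve spare capacity with low-priority placeholder Pods.
# Real workloads preempt these, and the evicted pause Pods go Pending,
# which triggers the Cluster Autoscaler to add a replacement node.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                 # Below the default (0), so any normal Pod preempts these
globalDefault: false
description: "Placeholder Pods that reserve headroom capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:          # Size each placeholder to your typical burst unit
              cpu: "1"
              memory: 2Gi
```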

Always Set Resource Requests on Your Workloads

The Cluster Autoscaler makes decisions based on resource requests, not actual usage. If your Pods don't have CPU and memory requests, the scheduler considers them zero-cost, packs unlimited Pods onto each node, and the Cluster Autoscaler never triggers because Pods are technically "schedulable." This leads to OOM kills and CPU starvation with no automatic remediation. Set realistic requests on every container.

Debugging Autoscaler Behavior

When scaling isn't happening as expected, use these commands to diagnose. The Cluster Autoscaler writes its decision logic into a ConfigMap, and KEDA creates standard HPA objects you can inspect directly.

bash
# ---- Cluster Autoscaler ----

# Check the autoscaler's status ConfigMap for its latest decisions
kubectl get cm cluster-autoscaler-status -n kube-system -o yaml

# View autoscaler logs for scale-up/down events
kubectl logs -n kube-system -l app.kubernetes.io/name=cluster-autoscaler \
  --tail=100 | grep -E "Scale|Expanding|Removing"

# See why a Pod is unschedulable
kubectl describe pod <pending-pod-name> | grep -A 5 "Events"

# ---- KEDA ----

# List all ScaledObjects and their status
kubectl get scaledobjects -A

# Inspect the HPA that KEDA created (named keda-hpa-<scaledobject-name>)
kubectl get hpa -A | grep keda

# Check KEDA operator logs for trigger errors
kubectl logs -n keda -l app=keda-operator --tail=50

# Describe a ScaledObject for detailed status
kubectl describe scaledobject order-processor-scaler -n production

Resource Optimization and Cost Management

Kubernetes makes it easy to run workloads — and just as easy to waste money doing it. Studies consistently show that the average Kubernetes cluster runs at 20–35% CPU utilization, meaning most organizations are paying for 2–3x more compute than they actually need. The root causes are predictable: over-provisioned resource requests, idle dev/staging environments running 24/7, and nodes sized without considering pod density tradeoffs.

This section gives you a concrete framework for cutting Kubernetes costs without sacrificing reliability. You will learn to right-size workloads, pack nodes efficiently, leverage cheaper compute for the right workloads, and get visibility into where every dollar goes.

Right-Sizing Workloads with VPA Recommendations

Over-provisioning is the single biggest source of Kubernetes waste. Developers set resource requests and limits during initial deployment — often by guessing — and never revisit them. A pod requesting 1 CPU and 1Gi of memory but actually using 50m CPU and 128Mi wastes over 90% of its reserved resources. Those ghost resources block scheduling and inflate your node count.

The Vertical Pod Autoscaler (VPA) solves this by observing actual resource consumption and recommending — or automatically applying — right-sized requests. Even if you don't enable auto-updating mode, running VPA in recommendation-only mode gives you a data-driven starting point.

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation-only — no live changes
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi

After running VPA for a few days, query the recommendations to see what your workloads actually need:

bash
# View VPA recommendations for a deployment
kubectl get vpa api-server-vpa -o jsonpath='{.status.recommendation}' | jq .

# Example output:
# {
#   "containerRecommendations": [{
#     "containerName": "api-server",
#     "lowerBound":  { "cpu": "80m",  "memory": "180Mi" },
#     "target":      { "cpu": "120m", "memory": "256Mi" },
#     "upperBound":  { "cpu": "350m", "memory": "512Mi" }
#   }]
# }

Use the "target" recommendation, not "lowerBound"

Set resource requests to the VPA target value and limits to near the upperBound. Using lowerBound as your request leaves zero headroom for traffic spikes. The target value already accounts for the p90 usage with a safety margin built in.
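Extracting those values for a manifest update is a one-liner with jq. The JSON below is inlined for illustration; in a real cluster you would pipe the kubectl jsonpath output shown earlier instead:

```shell
# Pull request/limit candidates out of a VPA recommendation payload
rec='{"containerRecommendations":[{"containerName":"api-server","target":{"cpu":"120m","memory":"256Mi"},"upperBound":{"cpu":"350m","memory":"512Mi"}}]}'

request_cpu=$(echo "$rec" | jq -r '.containerRecommendations[0].target.cpu')
limit_mem=$(echo "$rec" | jq -r '.containerRecommendations[0].upperBound.memory')
echo "set requests.cpu=$request_cpu and limits.memory=$limit_mem"
```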

ResourceQuotas and LimitRanges: Guardrails Against Waste

Right-sizing individual workloads is not enough if any team can deploy unlimited resources into a shared cluster. ResourceQuotas cap the total resources a namespace can consume, while LimitRanges set per-pod and per-container defaults and ceilings. Together, they prevent resource hoarding and ensure fair sharing across teams.

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "8"          # Total CPU requests across all pods
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "40"                 # Max 40 pods in this namespace
    persistentvolumeclaims: "10"

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
  - type: Container
    default:           # Applied if no limits are set
      cpu: 500m
      memory: 512Mi
    defaultRequest:    # Applied if no requests are set
      cpu: 100m
      memory: 128Mi
    max:
      cpu: 2
      memory: 4Gi
    min:
      cpu: 50m
      memory: 64Mi

When a ResourceQuota covering compute resources (requests.cpu, limits.memory, and so on) is active in a namespace, every pod must specify those requests and limits — otherwise the API server rejects it. The LimitRange fills in defaults for pods that don't specify them, so developers aren't blocked. This combination gives you a safety net: quotas prevent namespace-level runaway, and LimitRanges prevent individual container-level extremes.

Spot and Preemptible Instances

Cloud providers offer spare compute capacity at 60–90% discounts under names like Spot Instances (AWS), Preemptible VMs (GCP), and Spot VMs (Azure). The tradeoff: the provider can reclaim these nodes with as little as 30 seconds notice. This makes them a poor fit for stateful databases but excellent for fault-tolerant workloads that Kubernetes can reschedule automatically.

| Workload Type | Spot-Friendly? | Why |
|---|---|---|
| Stateless APIs behind an HPA | ✅ Yes | HPA replaces lost pods instantly; multiple replicas absorb the loss |
| Batch jobs / CronJobs | ✅ Yes | Jobs have built-in retry; partial progress can be checkpointed |
| CI/CD build runners | ✅ Yes | Builds are idempotent; a failed build simply re-queues |
| Dev/staging environments | ✅ Yes | Brief interruptions are acceptable; nobody carries a pager for staging |
| Stateful databases (Postgres, Redis) | ❌ No | Data loss risk; failover adds latency and complexity |
| Single-replica critical services | ❌ No | No redundancy to absorb the eviction |

The standard pattern is to run a mixed cluster with an on-demand node pool for critical workloads and a spot node pool for everything else. Use taints and tolerations to control which workloads land on spot nodes:

yaml
# Deployment tolerating spot node taints + preferring spot nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 6
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
      - key: "cloud.google.com/gke-spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 90
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-spot
                operator: In
                values: ["true"]
      terminationGracePeriodSeconds: 25  # Less than the 30s eviction notice
      containers:
      - name: processor
        image: myapp/batch-processor:v2.1
        resources:
          requests:
            cpu: 250m
            memory: 512Mi

Bin Packing: Maximizing Node Utilization

The default Kubernetes scheduler spreads pods across nodes to maximize availability. This is great for resilience but terrible for cost — you end up with many lightly loaded nodes instead of fewer well-packed ones. Bin packing is the opposite strategy: pack as many pods as possible onto existing nodes before adding new ones.

You can tune the scheduler's scoring plugin to prefer nodes that already have pods on them. The NodeResourcesFit plugin supports a MostAllocated strategy that scores partially-filled nodes higher:

yaml
# KubeSchedulerConfiguration — bin packing profile
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated      # Pack pods tightly onto existing nodes
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

Bin packing works especially well when paired with the Cluster Autoscaler. The autoscaler removes underutilized nodes when their pods can be rescheduled elsewhere. With bin packing pushing pods onto fewer nodes, the autoscaler has more opportunities to drain and terminate empty or near-empty nodes — directly reducing your bill.

Bin packing trades availability for cost savings

Packing many pods onto one node means losing that node takes out more workloads simultaneously. Only use MostAllocated for workloads that can tolerate brief disruptions (batch jobs, dev environments, stateless services with multiple replicas). For mission-critical production services, keep the default spread strategy and pair it with PodDisruptionBudgets.
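A minimal PodDisruptionBudget sketch for that pairing (the app label and replica floor are example values):

```yaml
# Caps voluntary disruptions: node drains (including Cluster Autoscaler
# scale-down) must always leave at least 2 api-server replicas running
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```

Note that the Cluster Autoscaler respects PDBs during scale-down: it will not drain a node if doing so would violate one, so overly strict budgets can also block node removal.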

Pod Density, Node Size, and Cluster Overhead

Every Kubernetes node reserves a chunk of resources for system daemons — the kubelet, container runtime, OS kernel, and kube-proxy. This is called system reservation, and it is not available to your workloads. The overhead is roughly fixed per node regardless of size, which creates a key tradeoff:

| Strategy | Many Small Nodes | Fewer Large Nodes |
|---|---|---|
| System overhead ratio | High — each node loses ~0.5–1 CPU and 0.5–1Gi to reservations | Low — same fixed cost amortized over more allocatable resources |
| Blast radius | Small — losing one node affects few pods | Large — losing one node affects many pods |
| Pod density | Limited by max-pods-per-node (default 110) and allocatable resources | Can fit many more pods per node |
| IP address usage | More nodes = more IPs consumed by node networking | Fewer IPs wasted on node overhead |
| Scaling granularity | Fine-grained — can add small increments of capacity | Coarse — each new node adds a large block of capacity |
| Best for | Varied workloads, strict isolation requirements | Homogeneous workloads, cost-optimized steady state |

Here is a concrete example. On a t3.medium (2 vCPU, 4Gi RAM) in AWS EKS, system reservation takes approximately 70m CPU and 574Mi memory. That leaves only 1.93 vCPU and 3.4Gi for pods — a 14% memory overhead. On a m5.4xlarge (16 vCPU, 64Gi RAM), the reservation is about 110m CPU and 1.7Gi memory — a 2.6% memory overhead. Running 8 small nodes instead of 1 large node wastes roughly 3Gi of memory to system reservation alone.
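That arithmetic can be sanity-checked in shell, using the approximate EKS reservation figures quoted above:

```shell
# Memory overhead: small node vs large node (values in Mi, approximate)
small_mem=4096;  small_reserved=574     # t3.medium
large_mem=65536; large_reserved=1741    # m5.4xlarge (~1.7Gi)

small_pct=$(awk "BEGIN { printf \"%.1f\", 100*$small_reserved/$small_mem }")
large_pct=$(awk "BEGIN { printf \"%.1f\", 100*$large_reserved/$large_mem }")

# Extra memory lost to reservations when running 8 small nodes instead of 1 large
fleet_waste=$(( 8*small_reserved - large_reserved ))
echo "small=${small_pct}% large=${large_pct}% extra_reserved=${fleet_waste}Mi"
```

The roughly 2850Mi difference is the "3Gi of memory wasted to system reservation" from the example.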

graph LR
    subgraph Small["8 x t3.medium (2 CPU, 4Gi each)"]
        S_Total["Total: 16 CPU, 32Gi"]
        S_Reserved["Reserved: 0.56 CPU, 4.6Gi"]
        S_Allocatable["Allocatable: 15.4 CPU, 27.4Gi"]
    end
    subgraph Large["1 x m5.4xlarge (16 CPU, 64Gi)"]
        L_Total["Total: 16 CPU, 64Gi"]
        L_Reserved["Reserved: 0.11 CPU, 1.7Gi"]
        L_Allocatable["Allocatable: 15.9 CPU, 62.3Gi"]
    end
    S_Total --> S_Reserved --> S_Allocatable
    L_Total --> L_Reserved --> L_Allocatable
    

Cost Visibility: Know Where the Money Goes

You cannot optimize what you cannot measure. Cloud provider bills show you total Kubernetes spend, but they cannot break costs down to the namespace, team, or workload level. Dedicated cost tools fill this gap by combining resource usage metrics with real pricing data.

| Tool | Type | Key Strength | Best For |
|---|---|---|---|
| Kubecost | Commercial (free tier available) | Real-time cost allocation per namespace/label/deployment with savings recommendations | Teams that want a turnkey dashboard with actionable alerts |
| OpenCost | Open source (CNCF sandbox) | Kubecost's cost-allocation engine, fully open. Exposes a cost API you can query programmatically | Teams that want to build custom dashboards or integrate cost into CI/CD |
| AWS Cost Explorer + CUR | Cloud-native | Tag-based cost allocation with EKS split-cost allocation for per-pod costs | AWS-only shops already using Cost Explorer |
| GKE Cost Estimation | Cloud-native | Built into GKE console; breaks down cost by namespace and workload | GCP-only teams wanting zero extra tooling |
| Prometheus + custom metrics | DIY | Full control; combine container_cpu_usage_seconds_total with pricing data | Teams with existing Prometheus and Grafana stacks |

The quickest way to start is deploying OpenCost alongside your existing Prometheus installation. It reads resource metrics, maps them to on-demand pricing, and exposes a cost allocation API:

bash
# Install OpenCost via Helm
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
  --namespace opencost --create-namespace \
  --set opencost.prometheus.internal.serviceName=prometheus-server \
  --set opencost.prometheus.internal.namespaceName=monitoring

# Query cost allocation per namespace for the last 24h
kubectl port-forward -n opencost svc/opencost 9090:9090 &
curl -s "http://localhost:9090/allocation/compute?window=24h&aggregate=namespace" | jq '
  .data[0] | to_entries[] | {
    namespace: .key,
    cpu_cost: .value.cpuCost,
    memory_cost: .value.ramCost,
    total: .value.totalCost
  }'

Dev/Staging Environment Strategies

Non-production environments are often the largest source of hidden waste. A staging cluster that mirrors production at full scale runs 24/7 but is only actively used 8–10 hours on weekdays. That is 70% idle time at production-grade cost. Here are three strategies, ordered from easiest to most aggressive:

1. Reduce Replica Counts

The simplest approach: run 1 replica instead of 3 in non-production. Create a Kustomize overlay or Helm values file per environment:

yaml
# kustomize/overlays/staging/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 1     # Production uses 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 1     # Production uses 5
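For the patch to take effect, the overlay needs a kustomization.yaml that layers it on top of a base. A sketch assuming a conventional base/ directory layout:

```yaml
# kustomize/overlays/staging/kustomization.yaml (assumed directory layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # Shared manifests for all environments
patches:
  - path: replica-patch.yaml   # Staging-only replica overrides
```

Running kubectl apply -k kustomize/overlays/staging then deploys the base manifests with the staging replica counts merged in.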

2. Namespace-Level Shutdown

Scale all deployments in a namespace to zero replicas during off-hours. A simple CronJob or CI pipeline can handle this:

bash
# Save current replica counts as annotations, then scale to zero
for deploy in $(kubectl get deploy -n staging -o name); do
  replicas=$(kubectl get "$deploy" -n staging -o jsonpath='{.spec.replicas}')
  kubectl annotate "$deploy" -n staging original-replicas="$replicas" --overwrite
  kubectl scale "$deploy" -n staging --replicas=0
done

# Restore original replica counts in the morning
for deploy in $(kubectl get deploy -n staging -o name); do
  replicas=$(kubectl get "$deploy" -n staging \
    -o jsonpath='{.metadata.annotations.original-replicas}')
  kubectl scale "$deploy" -n staging --replicas="${replicas:-1}"
done
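The scale-down loop can also run on a schedule from inside the cluster. A sketch using a CronJob, with assumed names (the env-scaler ServiceAccount and its RBAC, which needs get/list on deployments and patch on deployments/scale, are not shown; the kubectl image tag is an example):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: staging-shutdown
  namespace: staging
spec:
  schedule: "0 19 * * 1-5"        # 7 PM, weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: env-scaler
          restartPolicy: OnFailure
          containers:
            - name: scale-down
              image: bitnami/kubectl:1.31
              command: ["/bin/sh", "-c"]
              args:
                - |
                  for d in $(kubectl get deploy -n staging -o name); do
                    kubectl scale "$d" -n staging --replicas=0
                  done
```

A second CronJob with a morning schedule and the restore loop handles scale-up.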

3. Cluster Hibernation

For environments not needed overnight or on weekends, scale the entire cluster's node pool to zero. Most managed Kubernetes services support this — your control plane stays up (the managed control-plane fee is typically around $0.10/hr on both EKS and GKE) while worker nodes are fully removed. On resume, the Cluster Autoscaler brings nodes back as pods become schedulable.

bash
# GKE: Scale node pool to zero
gcloud container clusters resize staging-cluster \
  --node-pool default-pool --num-nodes 0 --zone us-central1-a --quiet

# EKS: Set desired capacity to zero via ASG
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-staging-node-group \
  --min-size 0 --desired-capacity 0

Node Type and Size Decision Framework

Choosing the right node type is one of the highest-leverage cost decisions you can make. The wrong choice locks in waste for months. Use this framework to narrow down the decision based on your workload characteristics.

flowchart TD
    A["What is your dominant workload pattern?"] --> B{"CPU-bound or\nMemory-bound?"}
    B -->|CPU-bound| C{"Burstable or\nsteady utilization?"}
    B -->|Memory-bound| D["Memory-optimized instances\n(r-series / e2-highmem)"]
    B -->|Balanced| E["General-purpose instances\n(m-series / e2-standard)"]
    C -->|Burstable| F["Burstable instances\n(t3/t3a on AWS, e2 on GCP)"]
    C -->|Steady| G["Compute-optimized instances\n(c-series / c2/c3)"]
    F --> H{"Many small pods\nor few large pods?"}
    G --> H
    D --> H
    E --> H
    H -->|"Many small pods\n(under 0.5 CPU each)"| I["Larger nodes for better density\n(8+ vCPU, 16+ Gi RAM)"]
    H -->|"Few large pods\n(2+ CPU each)"| J["Size nodes to fit 3-5 pods\nwith 10-15% headroom"]
    I --> K["Enable bin packing\n+ Cluster Autoscaler"]
    J --> K
    

Practical Sizing Rules of Thumb

| Rule | Details |
|---|---|
| Target 70–80% allocatable utilization | Below 60% means you are paying for idle capacity. Above 85% leaves no room for traffic spikes or rolling updates (which temporarily double pod count). |
| Size nodes for your largest pod x 3–5 | If your biggest pod requests 2 CPU and 4Gi, nodes should have at least 8 CPU and 16Gi allocatable so you can fit multiple pods and maintain flexibility. |
| Use 2+ node pools | One for general workloads (general-purpose instances) and one for specialized workloads (GPU, memory-heavy). This avoids paying GPU prices for a pod that only needs CPU. |
| Do not go below 2 vCPU / 4Gi per node | After system reservations, very small nodes have too little allocatable capacity. DaemonSets (logging, monitoring, CNI) eat a large percentage of tiny nodes. |
| Account for DaemonSet overhead | Logging agents, monitoring exporters, and CNI plugins run on every node. On a 4Gi node, DaemonSets consuming 300Mi is a 7.5% tax. On a 32Gi node, it is under 1%. |

The 30% Rule for Rolling Updates

During a rolling deployment with maxSurge: 25%, Kubernetes creates new pods before terminating old ones. A 10-replica deployment briefly runs 12–13 pods. If your nodes are at 95% allocation, those surge pods cannot schedule and the rollout stalls. Keep at least 20–30% headroom cluster-wide, or size maxSurge and maxUnavailable to work within your capacity.
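The surge arithmetic, sketched for the 10-replica example (Kubernetes rounds percentage-based maxSurge up):

```shell
# Peak pod count during a rolling update with maxSurge: 25%
replicas=10; surge_pct=25
surge=$(( (replicas * surge_pct + 99) / 100 ))   # ceil(10 * 0.25) = 3 extra pods
peak=$(( replicas + surge ))
echo "peak pods during rollout: $peak"
```

Your nodes need spare allocatable capacity for those 3 surge pods, or the rollout waits on the Cluster Autoscaler.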

Putting It All Together: A Cost Optimization Checklist

Cost optimization is not a one-time project — it is an ongoing practice. Use this as a recurring checklist to keep your cluster lean:

  1. Deploy VPA in recommendation mode on all namespaces. Review and apply right-sized requests quarterly.
  2. Set ResourceQuotas and LimitRanges on every namespace. No namespace should have unlimited access to cluster resources.
  3. Move fault-tolerant workloads to spot instances. Target at least 40–60% of non-critical workloads on spot for maximum savings.
  4. Enable bin packing if your workloads are predominantly stateless, and pair it with Cluster Autoscaler for automatic node removal.
  5. Deploy a cost visibility tool (OpenCost or Kubecost). Set alerts for namespaces exceeding their budget by more than 10%.
  6. Implement off-hours shutdown for dev/staging environments. A scheduled scale-down from 7pm to 8am on weekdays plus full weekends takes those environments offline for roughly two-thirds of all hours, with compute savings to match.
  7. Review node sizing annually. Workload profiles change over time. What was optimal six months ago may be wasteful now.
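
The off-hours arithmetic in item 6 is easy to sanity-check. The exact dollar savings depend on how much of your spend sits in dev/staging, but the schedule itself covers roughly two-thirds of the week:

```bash
# Share of the week covered by a 7pm-8am weekday shutdown plus full weekends.
weekday_off=$((13 * 5))   # 13 off-hours per weekday night
weekend_off=48            # Saturday and Sunday
total=$((7 * 24))         # 168 hours in a week
echo "off for $(( (weekday_off + weekend_off) * 100 / total ))% of all hours"
```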

Helm — Kubernetes Package Management

Deploying a real application to Kubernetes usually means managing a collection of interconnected manifests — Deployments, Services, ConfigMaps, Secrets, Ingresses, ServiceAccounts, RBAC rules. These files share values like the application name, image tag, and replica count. Copying and pasting those values across a dozen YAML files is fragile. When you need the same app deployed to staging and production with different configurations, plain YAML falls apart fast.

Helm solves this by introducing templating, packaging, and release management for Kubernetes manifests. Think of it as the apt or brew of Kubernetes — it bundles manifests into reusable, versioned, configurable packages called charts, and tracks every deployment as a release with full rollback history.

Helm v3 Architecture

Helm v2 required a server-side component called Tiller running inside the cluster. Tiller held broad permissions and was a well-known security concern — any user who could reach Tiller could deploy anything to any namespace. Helm v3, released in November 2019, removed Tiller entirely.

In Helm v3, the CLI talks directly to the Kubernetes API server using your existing kubeconfig credentials. Release state — the record of what was installed, which revision is active, and the rendered manifests — is stored as Kubernetes Secrets (by default) or ConfigMaps in the release's target namespace. This means Helm respects your existing RBAC rules with zero extra infrastructure.

graph LR
    USER["👤 Developer / CI"]
    HELM["Helm CLI"]
    API["kube-apiserver"]
    SEC["Release Secrets<br/>(namespace-scoped)"]
    RES["Deployed Resources<br/>(Pods, Services, etc.)"]
    REPO["Chart Repository<br/>(OCI / HTTP)"]

    USER --> HELM
    HELM -->|"helm install / upgrade"| API
    HELM -->|"helm pull / search"| REPO
    API --> SEC
    API --> RES
    SEC -.->|"stores release history<br/>revisions, values, manifests"| API
    
Release Secrets Live in the Target Namespace

Each Helm release stores its history as Secrets named sh.helm.release.v1.<release-name>.v<revision> in the namespace where the release is deployed. This makes namespace-scoped RBAC sufficient to control who can install or modify releases — no cluster-admin required for namespace-bound operations.
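
The release record inside each Secret is gzip-compressed and base64-encoded by Helm, and kubectl shows Secret data base64-encoded once more, so manual inspection means two decodes plus a decompression. Here is a self-contained sketch of that round trip using a stand-in payload (assumes GNU coreutils; the kubectl command in the comment is the live-cluster equivalent):

```bash
# Pack a stand-in payload the way a release record ends up in a Secret:
# gzip + base64 (Helm), then base64 again (Kubernetes Secret encoding).
payload='{"name":"myapp","version":1,"status":"deployed"}'
packed=$(printf '%s' "$payload" | gzip -c | base64 -w0 | base64 -w0)

# Against a live cluster, the equivalent unpacking is:
#   kubectl get secret sh.helm.release.v1.myapp.v1 \
#     -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
printf '%s' "$packed" | base64 -d | base64 -d | gunzip
```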

Core Concepts

Helm has four foundational concepts you need to internalize before running any commands: charts, releases, revisions, and repositories. Every Helm operation maps back to these.

| Concept | What It Is | Analogy |
| --- | --- | --- |
| Chart | A versioned package containing templated Kubernetes manifests, default values, metadata, and optional dependencies. A directory or .tgz archive. | A Debian .deb package or a Homebrew formula |
| Release | A specific instance of a chart deployed to a cluster. One chart can produce multiple releases (e.g., myapp-staging and myapp-prod) with different values. | An installed instance of a package on your system |
| Revision | A snapshot of a release at a point in time. Each helm install, helm upgrade, or helm rollback increments the revision number. Enables full rollback. | A Git commit for your deployment |
| Repository | An HTTP server or OCI registry that hosts packaged charts. Public charts are indexed on Artifact Hub; well-known repositories include Bitnami and official project repos. | A package registry like npm or PyPI |

Chart Directory Structure

A Helm chart is a directory with a specific layout. The structure is convention over configuration — Helm looks for files in exact locations. Here is what a well-formed chart looks like:

text
myapp/
├── Chart.yaml          # Chart metadata: name, version, appVersion, dependencies
├── Chart.lock          # Locked dependency versions (generated by helm dependency update)
├── values.yaml         # Default configuration values
├── values.schema.json  # Optional: JSON Schema to validate values
├── templates/          # Kubernetes manifest templates (Go templates)
│   ├── _helpers.tpl    # Named template definitions (partials)
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── hpa.yaml
│   ├── serviceaccount.yaml
│   ├── tests/          # Helm test Pod definitions
│   │   └── test-connection.yaml
│   └── NOTES.txt       # Post-install usage instructions (rendered and shown to user)
├── charts/             # Dependency subcharts (populated by helm dependency update)
└── .helmignore         # Files to exclude when packaging

Chart.yaml — The Manifest

Chart.yaml is the identity card of your chart. It declares metadata, the chart version (for the package itself), the app version (for the software being deployed), and any dependencies on other charts.

yaml
apiVersion: v2
name: myapp
description: A web application with Redis caching
type: application          # "application" or "library"
version: 1.2.0             # Chart version — follows SemVer
appVersion: "3.5.1"        # Version of the app being deployed
keywords:
  - web
  - api
maintainers:
  - name: Platform Team
    email: platform@example.com
dependencies:
  - name: redis
    version: "18.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled

values.yaml — Default Configuration

values.yaml is where you define every configurable parameter with sensible defaults. Users override these at install time with --set flags or custom value files. Keep it well-commented — this file is the primary interface for anyone consuming your chart.

yaml
# -- Number of application replicas
replicaCount: 2

image:
  # -- Container image repository
  repository: ghcr.io/myorg/myapp
  # -- Image pull policy
  pullPolicy: IfNotPresent
  # -- Overrides the image tag (default is the chart appVersion)
  tag: ""

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: false
  className: nginx
  hosts:
    - host: myapp.example.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi

redis:
  enabled: true

Go Template Syntax in Helm

Helm templates are standard Go templates augmented with the Sprig function library and Helm-specific objects. Templates receive a top-level context object (referred to as .) that contains several built-in objects you access with dot notation.

| Object | Contains | Example Usage |
| --- | --- | --- |
| .Values | Merged values from values.yaml and user overrides | {{ .Values.image.repository }} |
| .Release | Release metadata: .Name, .Namespace, .IsInstall, .IsUpgrade, .Revision | {{ .Release.Name }} |
| .Chart | Contents of Chart.yaml: .Name, .Version, .AppVersion | {{ .Chart.AppVersion }} |
| .Capabilities | Cluster info: .APIVersions, .KubeVersion | {{ .Capabilities.KubeVersion.Minor }} |
| .Template | Current template info: .Name, .BasePath | {{ .Template.Name }} |

Templates in Action

Here is a real templates/deployment.yaml that uses conditionals, ranges, includes, and value references. This single template can produce different Deployment manifests depending on the values passed in.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}   # matches service.port in values.yaml
              protocol: TCP
          {{- if .Values.resources }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          {{- end }}
          {{- if .Values.env }}
          env:
            {{- range $key, $value := .Values.env }}
            - name: {{ $key }}
              value: {{ $value | quote }}
            {{- end }}
          {{- end }}

A few details to note about the syntax. The {{- with a dash trims whitespace before the directive, and -}} trims whitespace after — this keeps your rendered YAML clean. The nindent function adds a newline and indentation, solving the most common frustration with Helm: getting YAML indentation right inside templates. The pipe operator | chains functions left to right, just like Unix pipes.
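
To make the whitespace mechanics concrete, here is a small before/after sketch using the resources block from the earlier values.yaml. One detail worth knowing: toYaml sorts map keys alphabetically, so limits renders before requests.

```yaml
# Template fragment:
#   resources:
#     {{- toYaml .Values.resources | nindent 2 }}
#
# The {{- trims the newline after "resources:", and nindent 2 re-adds
# a newline plus two-space indentation, producing:
resources:
  limits:
    cpu: 500m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi
```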

Named Templates with _helpers.tpl

The _helpers.tpl file (the leading underscore tells Helm not to render it as a standalone manifest) defines reusable template snippets using the define action. You call these with include from other templates. This eliminates duplication across your manifests.

yaml
{{/*
Expand the name of the chart.
*/}}
{{- define "myapp.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a fully qualified app name, truncated to 63 chars (K8s label limit).
*/}}
{{- define "myapp.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}

{{/*
Chart name and version, used in the helm.sh/chart label below.
*/}}
{{- define "myapp.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels applied to every resource.
*/}}
{{- define "myapp.labels" -}}
helm.sh/chart: {{ include "myapp.chart" . }}
{{ include "myapp.selectorLabels" . }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels — must be identical on Deployment and Service.
*/}}
{{- define "myapp.selectorLabels" -}}
app.kubernetes.io/name: {{ include "myapp.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

Essential Helm Commands

The Helm CLI covers the full lifecycle: searching for charts, installing releases, upgrading, rolling back, inspecting, and cleaning up. Here are the commands you will use daily, grouped by operation.

Install, Upgrade, and Rollback

bash
# Add a chart repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install a chart as a new release
helm install my-redis bitnami/redis \
  --namespace caching --create-namespace \
  --set auth.password=secretpass \
  --set replica.replicaCount=3

# Install from a local chart directory
helm install myapp ./myapp -f values-prod.yaml

# Upgrade an existing release (creates a new revision)
helm upgrade myapp ./myapp \
  --set image.tag=3.6.0 \
  --reuse-values          # keep previously supplied values

# Install OR upgrade in one command (idempotent — great for CI/CD)
helm upgrade --install myapp ./myapp -f values-prod.yaml

# Rollback to a previous revision
helm rollback myapp 2     # roll back to revision 2

# Uninstall a release and all its resources
helm uninstall myapp --namespace default

Inspect and Debug

bash
# Render templates locally WITHOUT deploying (essential for debugging)
helm template myapp ./myapp -f values-staging.yaml

# Render templates and validate against the cluster's API (catches schema errors)
helm template myapp ./myapp --validate

# Dry-run an install/upgrade against the live cluster
helm install myapp ./myapp --dry-run --debug

# Lint a chart for best practices and syntax errors
helm lint ./myapp --strict

# Show computed values for a deployed release
helm get values myapp --all

# Show the rendered manifests of a deployed release
helm get manifest myapp

# List all releases across namespaces
helm list --all-namespaces

# Show release history (all revisions)
helm history myapp

Dependencies

bash
# Download dependencies declared in Chart.yaml into charts/
helm dependency update ./myapp

# List dependency status
helm dependency list ./myapp

# Rebuild charts/ from the existing Chart.lock file
helm dependency build ./myapp

Use helm upgrade --install in CI/CD Pipelines

The --install flag makes helm upgrade idempotent: it installs the release if it doesn't exist, or upgrades it if it does. This eliminates the need for conditional logic in your pipeline scripts. Combine it with --atomic to auto-rollback on failure and --wait to block until all resources are ready.
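
As a sketch, a pipeline step wiring these flags together might look like this (GitHub Actions syntax here, but any runner works; the release name, chart path, and values file are placeholders):

```yaml
- name: Deploy with Helm
  run: |
    helm upgrade --install myapp ./charts/myapp \
      -f values-prod.yaml \
      --namespace prod --create-namespace \
      --atomic --wait --timeout 5m \
      --history-max 5
```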

Helm Hooks

Hooks let you run Kubernetes resources at specific points in the release lifecycle — before install, after upgrade, before deletion, and more. A hook is any template with a helm.sh/hook annotation. Common use cases include running database migrations before an upgrade, populating seed data after install, or performing cleanup before uninstall.

| Hook | When It Fires | Typical Use Case |
| --- | --- | --- |
| pre-install | After templates render, before any resources are created | Create a database schema, run preflight checks |
| post-install | After all resources are created | Register with a service mesh, send a Slack notification |
| pre-upgrade | After templates render, before any resources are updated | Run database migrations, take a backup |
| post-upgrade | After all resources are updated | Clear a CDN cache, warm application caches |
| pre-delete | Before any release resources are deleted | Drain connections, export data |
| post-delete | After all resources are deleted | Remove DNS records, clean up external resources |
| pre-rollback | Before rollback | Reverse a database migration |
| post-rollback | After rollback | Notify monitoring systems |
| test | When helm test is invoked | Smoke tests, connectivity checks |

Here is a concrete example — a Job that runs database migrations before each upgrade:

yaml
# templates/migrate-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "myapp.fullname" . }}-migrate
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"           # lower weight runs first
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command: ["./migrate", "--target", "latest"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "myapp.fullname" . }}-db
                  key: url

The hook-weight controls ordering when multiple hooks fire at the same point — lower numbers run first. The hook-delete-policy controls when the hook resource is cleaned up: before-hook-creation deletes any prior instance before creating the new one, and hook-succeeded deletes it after successful completion. Without a delete policy, hook resources accumulate across revisions.

Helm Tests

Helm tests are Pods defined in templates/tests/ with the "helm.sh/hook": test annotation. They run when you invoke helm test <release> and validate that a deployed release is actually working — not just that resources were created, but that the application responds correctly.

yaml
# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
  name: {{ include "myapp.fullname" . }}-test
  annotations:
    "helm.sh/hook": test
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  restartPolicy: Never
  containers:
    - name: curl-test
      image: curlimages/curl:8.5.0
      command:
        - sh
        - -c
        - |
          echo "Testing {{ include "myapp.fullname" . }} health endpoint..."
          curl --fail --silent --max-time 10 \
            http://{{ include "myapp.fullname" . }}:{{ .Values.service.port }}/healthz
          echo "Test passed!"

bash
# Run tests against a deployed release
helm test myapp --timeout 60s

# Run tests and view logs on failure
helm test myapp --logs

Subcharts and Dependencies

A chart can depend on other charts. This is how you compose complex stacks — your application chart might depend on a Redis chart and a PostgreSQL chart. Dependencies are declared in Chart.yaml and downloaded into the charts/ directory.

You pass configuration to subcharts by nesting values under the dependency name. The parent chart's values.yaml controls everything — the subchart doesn't need modification.

yaml
# Chart.yaml — declaring dependencies
apiVersion: v2
name: myapp
version: 1.2.0
dependencies:
  - name: postgresql
    version: "13.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled    # toggle with values
  - name: redis
    version: "18.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled

yaml
# values.yaml — configuring subcharts via nested keys
postgresql:
  enabled: true
  auth:
    postgresPassword: "changeme"
    database: "myapp_db"
  primary:
    resources:
      requests:
        cpu: 250m
        memory: 256Mi

redis:
  enabled: true
  auth:
    password: "redis-secret"
  replica:
    replicaCount: 2

bash
# Download/update dependencies
helm dependency update ./myapp

# Install the full stack — app + PostgreSQL + Redis
helm install myapp ./myapp -f values-prod.yaml

# Deploy without the Redis subchart
helm install myapp ./myapp --set redis.enabled=false

Building a Custom Chart from Scratch

The best way to understand Helm is to build a chart end to end. Below is a walkthrough that creates a chart for a Go API server, scaffolds it, customizes the templates, lints, renders, and deploys it.

bash
# 1. Scaffold a new chart
helm create order-api

# 2. Examine what was generated
tree order-api/

# 3. Edit Chart.yaml — set your app metadata
cat > order-api/Chart.yaml <<'EOF'
apiVersion: v2
name: order-api
description: Order processing API service
type: application
version: 0.1.0
appVersion: "1.0.0"
EOF

# 4. Define your values
cat > order-api/values.yaml <<'EOF'
replicaCount: 2

image:
  repository: ghcr.io/myorg/order-api
  pullPolicy: IfNotPresent
  tag: ""

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: orders.example.com
      paths:
        - path: /api/orders
          pathType: Prefix

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

probes:
  liveness:
    path: /healthz
    port: 8080
  readiness:
    path: /readyz
    port: 8080
EOF

# 5. Lint the chart
helm lint ./order-api --strict

# 6. Render templates locally to verify output
helm template order-api ./order-api -f order-api/values.yaml

# 7. Dry-run against the cluster (validates API compatibility)
helm install order-api ./order-api --dry-run --debug

# 8. Deploy for real
helm install order-api ./order-api --namespace orders --create-namespace

# 9. Verify the release
helm list -n orders
helm get values order-api -n orders
kubectl get all -n orders

# 10. Run tests
helm test order-api -n orders

The Release Lifecycle Flow

Understanding how install, upgrade, rollback, and uninstall interact with revisions is critical for production operations. Every mutation creates a new revision, and Helm keeps a configurable history depth (default: 10) for rollback.

stateDiagram-v2
    [*] --> Deployed : helm install (rev 1)
    Deployed --> Deployed : helm upgrade (rev N+1)
    Deployed --> Superseded : new revision deployed
    Deployed --> Uninstalled : helm uninstall
    Deployed --> PendingUpgrade : helm upgrade (in progress)
    PendingUpgrade --> Deployed : success
    PendingUpgrade --> Failed : error / timeout
    Failed --> Deployed : helm rollback (rev N+1)
    Failed --> Deployed : helm upgrade --force (rev N+1)
    Superseded --> Deployed : helm rollback to this rev
    Uninstalled --> [*]
    

Helm vs. Alternatives

Helm is the most widely adopted Kubernetes packaging tool, but it is not the only option. Each alternative makes different trade-offs between power, complexity, and approach. Your choice depends on team size, use case, and how much you value DRY templating versus straightforward patching.

| Tool | Approach | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- | --- |
| Helm | Go templating + packaging + release management | Huge ecosystem of prebuilt charts. Built-in versioning, rollback, and dependency management. OCI registry support. | Go templates can be hard to read. Debugging whitespace issues is frustrating. Complex charts become a maintenance burden. | Teams consuming third-party software (databases, monitoring stacks) and shipping reusable internal charts. |
| Kustomize | Overlay-based patching of plain YAML — no templating | Built into kubectl (kubectl apply -k). No new syntax to learn — just YAML patches. Easy-to-understand diffs. | No logic layer. No packaging, versioning, or release management. Repetitive for more than a handful of environments. No dependency concept. | Teams managing a small number of environments for their own services. Works well alongside Helm (helm template piped into kustomize). |
| Jsonnet / Tanka | A data-templating language that generates JSON/YAML programmatically | Full programming language with functions, imports, conditionals, comprehensions. Excellent for complex, highly parameterized configurations. | Steep learning curve. Small ecosystem. Not widely adopted outside of observability teams (Grafana, Prometheus). | Power users generating complex, deeply nested configurations — especially monitoring and alerting stacks. |
| cdk8s | Define Kubernetes resources using TypeScript, Python, Java, or Go — generates YAML | Full programming language. IDE autocompletion. Type safety. Reuse existing test frameworks. | Requires a build step. Overkill for simple deployments. Smaller community than Helm. | Development teams who prefer writing infrastructure in the same language as their application code. |

Helm and Kustomize Are Not Mutually Exclusive

A common production pattern is to render Helm charts with helm template, then apply Kustomize overlays on top for environment-specific patches. This gives you Helm's ecosystem for third-party charts and Kustomize's simplicity for last-mile customization. ArgoCD and Flux both support this workflow natively.

Production Best Practices

Helm is straightforward to start with but has sharp edges at scale. These practices come from production experience managing hundreds of releases across clusters.

  • Pin chart versions in CI/CD. Never use helm install bitnami/redis without a --version flag. A new upstream chart version can break your deployment with zero warning.
  • Use values.schema.json. Define a JSON Schema for your chart's values. Helm validates user-supplied values against the schema at install/upgrade time, catching typos and invalid configurations before they reach the cluster.
  • Always use --atomic and --timeout in CI/CD. The --atomic flag automatically rolls back if an upgrade fails, preventing half-deployed states. Set --timeout to a reasonable value so failures don't hang your pipeline.
  • Limit release history. Set --history-max 5 on upgrades. Each revision stores the full rendered manifest as a Secret, and clusters with hundreds of revisions accumulate significant etcd storage.
  • Template locally before deploying. Run helm template and helm lint --strict in CI before any helm upgrade. Catch errors in minutes, not in production.
  • Use library charts for shared templates. If multiple charts share the same label conventions, RBAC patterns, or monitoring annotations, extract them into a library chart (type: library in Chart.yaml) and import it as a dependency.
  • Store custom charts in an OCI registry. Helm v3.8+ supports OCI registries (e.g., ECR, GHCR, ACR, Harbor) as first-class chart repositories. Use helm push and helm pull oci:// for versioned, authenticated chart distribution.

Custom Resource Definitions — Extending the Kubernetes API

Kubernetes ships with a rich set of built-in resources — Pods, Deployments, Services, ConfigMaps — but real-world platforms inevitably need domain-specific abstractions. A Custom Resource Definition (CRD) lets you register an entirely new resource type with the API server, so it can be created, listed, watched, and deleted with kubectl just like any native object.

CRDs are the foundation of the Kubernetes extension model. When you install Prometheus via the Operator, it registers CRDs like Prometheus, ServiceMonitor, and AlertmanagerConfig. When you deploy Istio, it brings VirtualService and DestinationRule. Every major ecosystem tool extends Kubernetes this way. Understanding CRDs is the prerequisite for building operators and designing platform APIs.

How CRDs Extend the API Server

When you apply a CRD manifest, the API server dynamically creates a new RESTful endpoint. No recompilation, no restart — the new resource type is available within seconds. The API server handles storage (in etcd), RBAC authorization, admission control, and watch notifications for your custom resource exactly as it does for built-in types.

sequenceDiagram
    participant U as User / kubectl
    participant A as API Server
    participant E as etcd

    U->>A: Apply CRD manifest (kind: CustomResourceDefinition)
    A->>E: Store CRD definition
    A-->>A: Register new REST endpoint
    A-->>U: CRD created

    Note over A: New endpoint is now live

    U->>A: kubectl apply -f myapp.yaml (kind: MyApp)
    A-->>A: Validate against OpenAPI schema in CRD
    A->>E: Store MyApp instance
    A-->>U: myapp.apps.example.com/my-sample created

    U->>A: kubectl get myapps
    A->>E: List MyApp objects
    A-->>U: Return list with printer columns
    

The key insight is that a CRD is just data stored in etcd — it tells the API server the shape and rules for your custom resource. The actual behavior (reconciliation, automation) comes from a controller or operator watching those custom resources. CRDs without controllers are still useful for configuration storage, but the real power emerges when you pair them with custom controllers.

Anatomy of a CRD

Every CRD defines four fundamental properties that determine how the resource appears in the API. These map directly to the resource's API path and how users interact with it.

| Field | Purpose | Example |
| --- | --- | --- |
| Group | API group the resource belongs to. Use a domain you own to avoid collisions. | apps.example.com |
| Version | API version string. Follows Kubernetes conventions: v1alpha1 → v1beta1 → v1. | v1alpha1 |
| Kind | The PascalCase name of your resource type as it appears in manifests. | WebApplication |
| Scope | Namespaced or Cluster. Determines whether instances live inside a namespace or are cluster-wide. | Namespaced |

The combination of group, version, and kind (GVK) uniquely identifies a resource type across the entire cluster. The API path follows the pattern /apis/{group}/{version}/namespaces/{ns}/{plural} for namespaced resources, or /apis/{group}/{version}/{plural} for cluster-scoped ones.
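
A quick way to internalize the scheme is to assemble the paths from their parts, using the WebApplication example values from this chapter:

```bash
# Assemble the REST paths the API server serves for a custom resource.
group="apps.example.com"
version="v1alpha1"
plural="webapplications"
ns="production"

echo "namespaced: /apis/${group}/${version}/namespaces/${ns}/${plural}"
echo "cluster:    /apis/${group}/${version}/${plural}"
```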

Namespaced vs. Cluster Scope

Choose Namespaced for resources that belong to a team or application (most CRDs). Choose Cluster only for resources that are inherently global — like cluster-wide policies, storage classes, or infrastructure definitions that span namespaces. You cannot change the scope after creation without deleting and recreating the CRD.

Practical Example: A WebApplication CRD

Let's build a complete CRD for a WebApplication resource. This custom type will represent a web application deployment with its image, replicas, and ingress configuration — the kind of abstraction a platform team might offer to developers.

Step 1 — Define the CRD

This manifest registers the WebApplication type with the API server. Pay attention to the openAPIV3Schema section — it defines exactly what fields users can set and their types.

yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: webapplications.apps.example.com
spec:
  group: apps.example.com
  scope: Namespaced
  names:
    plural: webapplications
    singular: webapplication
    kind: WebApplication
    shortNames:
      - webapp
      - wa
    categories:
      - all                          # Show in `kubectl get all`

  versions:
    - name: v1alpha1
      served: true
      storage: true

      # --- Schema validation ---
      schema:
        openAPIV3Schema:
          type: object
          required: ["spec"]
          properties:
            spec:
              type: object
              required: ["image", "replicas"]
              properties:
                image:
                  type: string
                  description: "Container image in repository:tag format."
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                  default: 2
                port:
                  type: integer
                  minimum: 1
                  maximum: 65535
                  default: 8080
                ingress:
                  type: object
                  properties:
                    host:
                      type: string
                    tlsSecret:
                      type: string
                env:
                  type: array
                  items:
                    type: object
                    required: ["name", "value"]
                    properties:
                      name:
                        type: string
                      value:
                        type: string
            status:
              type: object
              properties:
                readyReplicas:
                  type: integer
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum: ["True", "False", "Unknown"]
                      lastTransitionTime:
                        type: string
                        format: date-time
                      message:
                        type: string

      # --- Printer columns for kubectl get ---
      additionalPrinterColumns:
        - name: Image
          type: string
          jsonPath: .spec.image
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Ready
          type: integer
          jsonPath: .status.readyReplicas
        - name: Host
          type: string
          jsonPath: .spec.ingress.host
          priority: 1                 # Only shown with -o wide
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp

      # --- Subresources ---
      subresources:
        status: {}                    # Enable /status subresource
        scale:                        # Enable /scale subresource
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.readyReplicas

Step 2 — Apply the CRD and Verify

Once applied, the API server immediately recognizes the new type. You can verify this by checking the API resources list.

bash
# Register the CRD
kubectl apply -f webapplication-crd.yaml

# Verify the CRD is established
kubectl get crd webapplications.apps.example.com

# Confirm it appears in API resources
kubectl api-resources | grep webapp
# webapplications   webapp,wa   apps.example.com/v1alpha1   true   WebApplication

# Inspect the CRD details
kubectl describe crd webapplications.apps.example.com

Step 3 — Create an Instance

With the CRD registered, you create instances of WebApplication using standard kubectl apply. The API server validates the manifest against the OpenAPI schema you defined — try submitting a value of the wrong type and the request will be rejected.

yaml
apiVersion: apps.example.com/v1alpha1
kind: WebApplication
metadata:
  name: frontend
  namespace: production
spec:
  image: my-company/frontend:2.4.1
  replicas: 3
  port: 3000
  ingress:
    host: app.example.com
    tlsSecret: app-tls-cert
  env:
    - name: NODE_ENV
      value: production
    - name: API_URL
      value: https://api.example.com

bash
# Create the custom resource
kubectl apply -f frontend-webapp.yaml

# List all WebApplications (using the short name)
kubectl get webapp -n production
# NAME       IMAGE                        REPLICAS   READY   AGE
# frontend   my-company/frontend:2.4.1    3          <none>   5s

# Detailed view with priority columns
kubectl get webapp -n production -o wide

# Full YAML output
kubectl get webapp frontend -n production -o yaml

Schema Validation with OpenAPI v3

Since Kubernetes 1.16, CRDs require structural schemas. A structural schema means every field must have a declared type, no untyped objects are allowed at any nesting level, and the schema must be self-contained (no external $ref pointers). This isn't just a formality — structural schemas are what enable server-side validation, pruning of unknown fields, and defaulting.
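
To see what a structural schema buys you, here is a toy sketch in plain Go of type, required, and minimum checks. FieldSchema and Validate are invented names for illustration, not the real apiextensions machinery:

```go
package main

import "fmt"

// FieldSchema is a toy stand-in for one property in an OpenAPI v3 schema.
type FieldSchema struct {
	Type    string // "string" or "integer"
	Minimum *int   // optional lower bound for integers
}

// Validate checks obj against declared properties and required fields,
// mimicking the server-side validation a structural schema enables.
func Validate(obj map[string]interface{}, props map[string]FieldSchema, required []string) []string {
	var errs []string
	for _, r := range required {
		if _, ok := obj[r]; !ok {
			errs = append(errs, fmt.Sprintf("missing required field %q", r))
		}
	}
	for name, val := range obj {
		schema, declared := props[name]
		if !declared {
			continue // undeclared fields are handled by pruning, not errors
		}
		switch schema.Type {
		case "string":
			if _, ok := val.(string); !ok {
				errs = append(errs, fmt.Sprintf("%s: expected string", name))
			}
		case "integer":
			n, ok := val.(int)
			if !ok {
				errs = append(errs, fmt.Sprintf("%s: expected integer", name))
			} else if schema.Minimum != nil && n < *schema.Minimum {
				errs = append(errs, fmt.Sprintf("%s: %d is below minimum %d", name, n, *schema.Minimum))
			}
		}
	}
	return errs
}

func main() {
	min := 1
	props := map[string]FieldSchema{
		"image":    {Type: "string"},
		"replicas": {Type: "integer", Minimum: &min},
	}
	// Missing required "image" and replicas below minimum: two errors.
	fmt.Println(Validate(map[string]interface{}{"replicas": 0}, props, []string{"image", "replicas"}))
}
```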

Key Schema Features

Feature | Schema Keyword | Example
Required fields | required | required: ["image", "replicas"]
Default values | default | default: 2
Range constraints | minimum, maximum | minimum: 1, maximum: 100
String patterns | pattern | pattern: "^[a-z0-9-]+$"
Enum values | enum | enum: ["True", "False", "Unknown"]
String formats | format | format: date-time
Preserve unknown fields | x-kubernetes-preserve-unknown-fields | Set to true to allow arbitrary JSON (disables pruning for that subtree)
Immutable fields | x-kubernetes-validations | CEL rule preventing changes after creation

When a user submits a custom resource, the API server validates it against the schema and prunes any fields not declared in the schema. This means typos like replcia: 3 are silently removed rather than stored — which can be confusing. You can catch these issues by using kubectl apply --dry-run=server -o yaml to see exactly what the API server will store.
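
Pruning itself can be sketched in a few lines of Go. This is a toy model (Schema and Prune are hypothetical names, not the API server's implementation): undeclared keys such as the replcia typo are dropped, declared ones survive.

```go
package main

import "fmt"

// Schema maps a declared field name to its nested declared fields
// (nil for leaf fields with no children).
type Schema map[string]Schema

// Prune removes keys not declared in the schema, the way the API
// server silently drops unknown fields before storing the object.
func Prune(obj map[string]interface{}, schema Schema) {
	for key, val := range obj {
		child, declared := schema[key]
		if !declared {
			delete(obj, key) // undeclared field: silently pruned
			continue
		}
		if nested, ok := val.(map[string]interface{}); ok && child != nil {
			Prune(nested, child) // recurse into declared objects
		}
	}
}

func main() {
	spec := map[string]interface{}{
		"image":   "frontend:2.4.1",
		"replcia": 3, // typo: not declared in the schema
	}
	Prune(spec, Schema{"image": nil, "replicas": nil})
	fmt.Println(spec) // map[image:frontend:2.4.1]
}
```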

CEL Validation Rules (Kubernetes 1.25+)

OpenAPI schemas handle type-level validation well, but they cannot express cross-field constraints like "if ingress is set, host must not be empty" or "replicas must be odd for quorum-based workloads." Starting in Kubernetes 1.25 (beta) and GA in 1.29, you can embed Common Expression Language (CEL) rules directly in the schema using x-kubernetes-validations.

yaml
# Add these to the spec-level schema in your CRD
spec:
  type: object
  x-kubernetes-validations:
    # Cross-field validation: ingress requires a host
    - rule: "!has(self.ingress) || has(self.ingress.host)"
      message: "ingress.host is required when ingress is specified"
    # Enforce image tag is not 'latest'
    - rule: "!self.image.endsWith(':latest')"
      message: "Using :latest tag is not allowed; pin to a specific version"
  properties:
    image:
      type: string
      x-kubernetes-validations:
        # Ensure image contains a tag
        - rule: "self.contains(':')"
          message: "Image must include a tag (e.g., myapp:v1.0)"
    replicas:
      type: integer
      x-kubernetes-validations:
        # Transition rule: prevent scaling down more than 50% at once
        - rule: "self >= oldSelf / 2"
          message: "Cannot scale down by more than 50% in a single update"

CEL rules using oldSelf are transition rules — they compare the new value against the previous one and only apply during updates, not creation. This is how you enforce constraints like "this field is immutable" (self == oldSelf) or "replicas can only increase" (self >= oldSelf) without a validating webhook.
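
To make the semantics concrete, here is a sketch that mirrors these CEL rules in plain Go. ValidateUpdate is a hypothetical helper, not real CEL evaluation; passing nil for the previous spec models creation, where transition rules are skipped:

```go
package main

import (
	"fmt"
	"strings"
)

// Spec is a simplified stand-in for the WebApplication spec.
type Spec struct {
	Image      string
	Replicas   int
	HasIngress bool
	Host       string
}

// ValidateUpdate mirrors the CEL rules above: the image must carry a
// non-latest tag, ingress requires a host, and replicas may not drop
// by more than half (the oldSelf transition rule).
func ValidateUpdate(cur Spec, prev *Spec) []string {
	var errs []string
	if !strings.Contains(cur.Image, ":") {
		errs = append(errs, "image must include a tag")
	} else if strings.HasSuffix(cur.Image, ":latest") {
		errs = append(errs, "using :latest tag is not allowed")
	}
	if cur.HasIngress && cur.Host == "" {
		errs = append(errs, "ingress.host is required when ingress is specified")
	}
	// Transition rule: only evaluated on updates (prev == nil means creation).
	if prev != nil && cur.Replicas < prev.Replicas/2 {
		errs = append(errs, "cannot scale down by more than 50% in a single update")
	}
	return errs
}

func main() {
	old := Spec{Image: "frontend:2.4.1", Replicas: 10}
	// Both the :latest rule and the scale-down rule fire here.
	fmt.Println(ValidateUpdate(Spec{Image: "frontend:latest", Replicas: 2}, &old))
}
```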

Additional Printer Columns

By default, kubectl get for custom resources shows only NAME and AGE. The additionalPrinterColumns field in the CRD lets you surface important fields in the table output — making your custom resources feel like first-class citizens.

yaml
additionalPrinterColumns:
  - name: Image
    type: string
    jsonPath: .spec.image
  - name: Replicas
    type: integer
    jsonPath: .spec.replicas
  - name: Ready
    type: integer
    jsonPath: .status.readyReplicas
  - name: Host
    type: string
    jsonPath: .spec.ingress.host
    priority: 1                       # Only visible with -o wide
  - name: Age
    type: date
    jsonPath: .metadata.creationTimestamp

Each column maps a jsonPath expression to a named column. The priority field controls visibility: columns with priority: 0 (default) appear in normal output, while higher priorities only show with -o wide. Use this to keep the default view clean while still exposing detailed information when requested.
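
Under the hood, each column is resolved much like a dotted-path lookup against the object. A simplified toy sketch (the real JSONPath grammar also supports filters and array indexing):

```go
package main

import (
	"fmt"
	"strings"
)

// Lookup resolves a simplified jsonPath like ".spec.image" against a
// nested map, returning the value a printer column would display.
func Lookup(obj map[string]interface{}, path string) (interface{}, bool) {
	var cur interface{} = obj
	for _, part := range strings.Split(strings.TrimPrefix(path, "."), ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur, ok = m[part]
		if !ok {
			return nil, false // missing field: column shows nothing
		}
	}
	return cur, true
}

func main() {
	webapp := map[string]interface{}{
		"spec": map[string]interface{}{
			"image":    "my-company/frontend:2.4.1",
			"replicas": 3,
		},
	}
	if v, ok := Lookup(webapp, ".spec.image"); ok {
		fmt.Println(v) // my-company/frontend:2.4.1
	}
}
```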

Subresources: Status and Scale

Subresources give your custom resource separate API endpoints for specific concerns. Kubernetes supports two CRD subresources: status and scale. Without them, the entire resource is a single blob — any controller or user can overwrite any field.

The Status Subresource

Enabling the status subresource splits the resource into two independently updatable halves. Updates to /status can only modify the .status field, and updates to the main resource ignore changes to .status. This separation is critical for the controller pattern: users own .spec (desired state), controllers own .status (observed state).

yaml
subresources:
  status: {}       # Enables PUT /apis/apps.example.com/v1alpha1/.../frontend/status
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.readyReplicas
    # labelSelectorPath: .status.selector   # Optional, needed for HPA

The Scale Subresource

The scale subresource makes your custom resource compatible with kubectl scale and the Horizontal Pod Autoscaler (HPA). It exposes a standard Scale object at /scale, mapping your spec and status fields to the canonical replica count fields.

bash
# Scale using kubectl (works because of the scale subresource)
kubectl scale webapp frontend --replicas=5 -n production

# The HPA can also target your custom resource
kubectl autoscale webapp frontend --min=2 --max=10 --cpu-percent=70 -n production

Versioning Strategies

APIs evolve. You will rename fields, add required properties, or restructure the schema entirely. CRDs support multiple versions in the versions array, each with its own schema, printer columns, and subresource configuration. But only one version can be the storage version — the one used to persist objects in etcd.

Single Version (Simple Case)

If your CRD is internal to your team or still in early development, a single version is fine. Use v1alpha1 to signal instability. Promote to v1beta1 and then v1 as the API stabilizes. When you have only one version, there is no conversion needed.

Multiple Versions with Conversion Webhooks

When you need to serve two versions simultaneously — for example, v1alpha1 for existing users and v1beta1 with a breaking schema change — you deploy a conversion webhook. The API server calls this webhook to translate objects between versions on the fly.

flowchart LR
    A["Client requests v1beta1"] --> B["API Server"]
    B --> C{"Object stored as<br/>v1alpha1 in etcd"}
    C --> D["Conversion Webhook"]
    D --> E["Returns v1beta1<br/>representation"]
    E --> B
    B --> A

    style D fill:#f9f0ff,stroke:#7c3aed,stroke-width:2px
    
yaml
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: webapp-conversion
          namespace: webapp-system
          path: /convert
          port: 443
        caBundle: <base64-encoded-CA-cert>

  versions:
    - name: v1alpha1
      served: true
      storage: true              # Currently the storage version
      schema:
        openAPIV3Schema:
          # ... v1alpha1 schema ...

    - name: v1beta1
      served: true
      storage: false             # Served but not stored
      schema:
        openAPIV3Schema:
          # ... v1beta1 schema (may have different field names) ...

The conversion webhook receives a ConversionReview object containing the objects to convert and the target version. It must be able to convert between any two served versions — not just adjacent ones. A common pattern is to use a "hub" version internally and convert to/from every other version through that hub.
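
A minimal sketch of the hub pattern in Go, with types and a field rename invented for illustration: every version converts to and from a single hub, so adding a version costs two functions instead of one per version pair.

```go
package main

import "fmt"

// Hub is the internal canonical form every served version converts through.
type Hub struct{ Image, Hostname string }

// V1Alpha1 uses the old field name "Host".
type V1Alpha1 struct{ Image, Host string }

// V1Beta1 renamed the field to "Hostname" (the breaking change).
type V1Beta1 struct{ Image, Hostname string }

func (a V1Alpha1) ToHub() Hub { return Hub{Image: a.Image, Hostname: a.Host} }
func (b V1Beta1) ToHub() Hub  { return Hub{Image: b.Image, Hostname: b.Hostname} }

func hubToAlpha(h Hub) V1Alpha1 { return V1Alpha1{Image: h.Image, Host: h.Hostname} }
func hubToBeta(h Hub) V1Beta1   { return V1Beta1{Image: h.Image, Hostname: h.Hostname} }

// AlphaToBeta converts v1alpha1 -> v1beta1 through the hub,
// with no direct pairwise conversion function needed.
func AlphaToBeta(a V1Alpha1) V1Beta1 { return hubToBeta(a.ToHub()) }

func main() {
	stored := V1Alpha1{Image: "frontend:2.4.1", Host: "app.example.com"}
	fmt.Printf("%+v\n", AlphaToBeta(stored)) // {Image:frontend:2.4.1 Hostname:app.example.com}
}
```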

Avoid Conversion Webhooks When Possible

Conversion webhooks add operational complexity — they must be highly available (if the webhook is down, all reads and writes to the custom resource fail). For additive, non-breaking changes (adding optional fields, adding a new version with the same schema), you don't need a webhook. Use strategy: None and let the API server round-trip the same data to both versions.

CRD Best Practices

Practice | Why It Matters
Always define a structural schema | Without it, you get no validation, no pruning, no defaulting. Raw x-kubernetes-preserve-unknown-fields at the root defeats the purpose.
Enable the status subresource | Without it, users can accidentally overwrite controller-managed status, and controllers can clobber user-specified spec fields.
Use shortNames and categories | Short names improve daily usability. Adding to the all category means kubectl get all includes your resources.
Pin your CRD naming to a domain you own | Prevents naming collisions when multiple tools install CRDs in the same cluster.
Set meaningful printer columns | Users should be able to assess resource health from kubectl get output without needing -o yaml.
Add CEL validations for business rules | Catch invalid configurations at admission time, not at reconciliation time when the error is harder to surface.
Start with v1alpha1 | Signals to users that the API may change. Promote versions deliberately following Kubernetes API conventions.

Deleting a CRD Deletes All Instances

Running kubectl delete crd webapplications.apps.example.com immediately removes the CRD and every WebApplication instance in every namespace. There is no confirmation prompt and no undo. In production, protect CRDs with RBAC and consider adding a finalizer that blocks deletion until instances are migrated.

Building Operators — Encoding Operational Knowledge

A Kubernetes Operator is a CRD paired with a custom controller that watches it — together, they encode human operational knowledge into software. Instead of an engineer running a runbook to install, upgrade, back up, or failover a complex stateful system like PostgreSQL or Kafka, the Operator does it automatically. The operator is the runbook, compiled into a reconciliation loop.

The concept was introduced by CoreOS in 2016 and has since become the standard pattern for managing stateful and complex workloads on Kubernetes. If you have already read the section on Custom Resource Definitions, you know how to extend the Kubernetes API with new resource types. Operators take the next step: they give those custom resources a brain.

The Core Pattern: CRD + Controller = Operator

Every Operator follows the same fundamental structure. A Custom Resource Definition declares a new API type (e.g., PostgresCluster), and a custom controller watches instances of that type and acts on them. The controller continuously compares the desired state (what the user declared in the CR) with the actual state (what is running in the cluster), and takes action to close the gap.

graph LR
    User["👤 User"] -->|"kubectl apply"| API["API Server"]
    API -->|"stores"| ETCD["etcd"]

    subgraph Operator["Operator Pod"]
        CTRL["Controller / Reconciler"]
    end

    API -->|"watch events"| CTRL
    CTRL -->|"read CR spec"| API
    CTRL -->|"create/update owned resources"| API
    CTRL -->|"write CR status"| API

    CTRL -->|"manages"| DEP["Deployment"]
    CTRL -->|"manages"| SVC["Service"]
    CTRL -->|"manages"| CM["ConfigMap"]
    CTRL -->|"manages"| PVC["PVC"]
    

The user interacts only with the custom resource. They declare replicas: 3 and version: "15.4" on a PostgresCluster CR — the Operator translates that into the dozens of Kubernetes primitives (StatefulSets, Services, ConfigMaps, PVCs, Jobs) required to make it real. This is the key value proposition: the Operator abstracts away operational complexity behind a clean, domain-specific API.

The Operator Maturity Model

Not all Operators are created equal. The Operator Capability Model, originally defined by the Operator Framework project, classifies operators into five levels based on the depth of operational knowledge they encode. Each level subsumes the previous one.

Level | Name | Capabilities | Example
1 | Basic Install | Automated provisioning via CR. Installs the application with sensible defaults. No lifecycle management beyond initial deployment. | Helm-based operator that templates and applies manifests
2 | Seamless Upgrades | Supports version upgrades and configuration changes with minimal disruption. Handles rollbacks on failure. | Operator that performs rolling upgrades of a database cluster
3 | Full Lifecycle | Backup, restore, and disaster recovery. The Operator can recreate the application's state from a snapshot. | CloudNativePG: continuous WAL archiving + point-in-time recovery
4 | Deep Insights | Exposes operational metrics, logs, and alerts. Integrates with Prometheus, dashboards, and alerting pipelines. | Prometheus Operator: auto-generates scrape configs and alert rules
5 | Auto Pilot | Automatic scaling, self-healing, tuning, and anomaly detection. Makes operational decisions without human input. | Operator that auto-scales read replicas based on query latency

Most Production Operators Sit at Level 3–4

Reaching Level 5 (Auto Pilot) requires encoding deep domain expertise — auto-tuning PostgreSQL's shared_buffers or rebalancing Kafka partitions based on broker load. Very few operators reach this level. If you are building an operator, aim for Level 3 as a solid production baseline: install, upgrade, backup, and restore.

The Reconciliation Loop — How Controllers Think

At the heart of every Operator is the reconciliation loop, powered by the controller-runtime library. This is the same pattern used by Kubernetes' built-in controllers (Deployment controller, ReplicaSet controller), but applied to your custom resources. The loop follows a precise sequence.

flowchart TD
    A["Informer watches API Server"] -->|"Event: Create/Update/Delete"| B["Work Queue"]
    B -->|"Dequeue item"| C["Reconcile(request)"]
    C --> D{"Desired == Actual?"}
    D -->|"Yes"| E["Return success — done"]
    D -->|"No"| F["Take corrective action"]
    F --> G["Create / Update / Delete owned resources"]
    G --> H["Update CR status"]
    H --> I{"Error?"}
    I -->|"Yes"| J["Requeue with backoff"]
    I -->|"No"| E
    J --> B
    

The controller does not receive a stream of events and react to each one. Instead, controller-runtime uses informers (cached watches) and a work queue to deduplicate and batch events. Your Reconcile function receives only a name and namespace — it must fetch the current state itself and decide what to do. This design makes reconciliation idempotent: calling Reconcile ten times in a row produces the same result as calling it once.
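
The dedup behavior can be sketched with a toy queue. WorkQueue here is illustrative, not the client-go workqueue API:

```go
package main

import "fmt"

// WorkQueue deduplicates keys: adding "default/frontend" twice before it
// is processed results in a single Reconcile call, mirroring how
// controller-runtime collapses bursts of events for the same object.
type WorkQueue struct {
	order []string
	set   map[string]bool
}

func NewWorkQueue() *WorkQueue { return &WorkQueue{set: map[string]bool{}} }

func (q *WorkQueue) Add(key string) {
	if q.set[key] {
		return // already queued; the duplicate event is collapsed
	}
	q.set[key] = true
	q.order = append(q.order, key)
}

func (q *WorkQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.set, key)
	return key, true
}

func main() {
	q := NewWorkQueue()
	q.Add("default/frontend")
	q.Add("default/frontend") // duplicate event, deduplicated
	q.Add("default/backend")
	for k, ok := q.Get(); ok; k, ok = q.Get() {
		fmt.Println("reconcile", k) // frontend reconciled once, then backend
	}
}
```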

Key Principles of Reconciliation

  • Level-triggered, not edge-triggered. Your reconciler reacts to the current state of the world, not to "what changed." It re-reads the resource on every invocation and computes the full diff.
  • Idempotent. If the reconciler creates a Service that already exists, it should update it or skip it — never crash or duplicate it.
  • Optimistic concurrency. Kubernetes uses resourceVersion on every object. If two reconcilers try to update the same resource, one gets a conflict error and requeues.
  • Requeue on failure. If any step fails, return an error. The controller-runtime will requeue the item with exponential backoff (default: 5ms to 16 minutes).

Comparing Operator Frameworks

You do not build an Operator from scratch. Several frameworks scaffold the boilerplate — project layout, RBAC manifests, CRD generation, leader election setup — so you focus on writing the Reconcile function. Here is how the major frameworks compare.

Framework | Language / Approach | Strengths | Best For
Kubebuilder | Go (controller-runtime) | The upstream standard. Generates CRD YAMLs from Go types via markers. Tight integration with controller-runtime and controller-tools. Used by most production operators. | Teams comfortable with Go who need full control over reconciliation logic.
Operator SDK | Go, Ansible, or Helm | Builds on Kubebuilder for Go operators, but adds first-class support for Ansible playbooks and Helm charts as operator backends. Includes OLM (Operator Lifecycle Manager) integration and scorecard testing. | Teams that want Ansible/Helm-based operators (Level 1–2), or Go operators that integrate with OLM for marketplace distribution.
KUDO | Declarative YAML (plans & steps) | Define operator behavior entirely in YAML — no code. Uses "plans" (install, upgrade, backup) composed of "steps" and "tasks." Good for encoding multi-step procedures. | Operations teams without Go expertise who need to encode multi-step Day-2 workflows.
Metacontroller | Any language (webhook-based) | You write a sync webhook in any language (Python, Node.js, etc.). Metacontroller handles the watch/queue/reconcile infrastructure and calls your webhook with the parent resource and its children. | Polyglot teams, rapid prototyping, or when Go is not an option.

Start with Kubebuilder Unless You Have a Reason Not To

Kubebuilder is the upstream project that Operator SDK's Go support is built on. If you are writing a Go-based operator, starting with Kubebuilder gives you the thinnest abstraction layer and the broadest community support. Use Operator SDK if you specifically need Ansible/Helm operator types or OLM integration. Use Metacontroller if your team does not write Go.

Common Operator Patterns

Regardless of which framework you choose, production operators share a set of recurring implementation patterns. These patterns solve real problems around ownership, cleanup, status reporting, and high availability.

Owned Resources and OwnerReferences

When your operator creates a Deployment, Service, or ConfigMap on behalf of a CR, it sets an ownerReference on the child resource pointing back to the CR. This gives you two things for free: garbage collection (when the CR is deleted, Kubernetes automatically deletes all owned resources) and watch filtering (controller-runtime can map events on owned resources back to the parent CR for re-reconciliation).

go
// Set the CR as the owner of the child Deployment
if err := ctrl.SetControllerReference(myCR, deployment, r.Scheme); err != nil {
    return ctrl.Result{}, err
}
// Now if myCR is deleted, this Deployment is garbage-collected automatically

Status Conditions

Operators report health and progress through status conditions — a standardized pattern borrowed from core Kubernetes resources (Pods have Ready, Initialized, etc.). Each condition has a type, status (True/False/Unknown), reason, and message. This lets users and monitoring tools query the CR's status programmatically.

yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: AllReplicasRunning
      message: "3/3 replicas are running and healthy"
      lastTransitionTime: "2024-11-15T10:30:00Z"
    - type: BackupComplete
      status: "True"
      reason: ScheduledBackupSucceeded
      message: "Last backup completed at 2024-11-15T06:00:00Z"
      lastTransitionTime: "2024-11-15T06:00:12Z"

Finalizers

Sometimes deleting a CR requires cleanup that goes beyond Kubernetes — removing external DNS records, deprovisioning cloud resources, or flushing data to object storage. Finalizers solve this. Your operator adds a finalizer string to the CR's metadata when it creates external resources. When a user deletes the CR, Kubernetes sets the deletionTimestamp but does not remove the object until all finalizers are cleared. Your reconciler detects the deletion, performs cleanup, then removes the finalizer to let the delete proceed.

go
const finalizerName = "myapp.example.com/cleanup"

// In Reconcile:
if myCR.ObjectMeta.DeletionTimestamp.IsZero() {
    // CR is NOT being deleted — ensure finalizer is present
    if !controllerutil.ContainsFinalizer(myCR, finalizerName) {
        controllerutil.AddFinalizer(myCR, finalizerName)
        return ctrl.Result{}, r.Update(ctx, myCR)
    }
} else {
    // CR IS being deleted — run cleanup logic
    if controllerutil.ContainsFinalizer(myCR, finalizerName) {
        if err := r.deleteExternalResources(ctx, myCR); err != nil {
            return ctrl.Result{}, err  // requeue until cleanup succeeds
        }
        controllerutil.RemoveFinalizer(myCR, finalizerName)
        return ctrl.Result{}, r.Update(ctx, myCR)
    }
}

Leader Election

Operators typically run as a Deployment with multiple replicas for availability. But you do not want two replicas simultaneously reconciling the same resource — that causes conflicts and race conditions. Leader election ensures only one replica actively reconciles at a time. The others remain on hot standby and take over if the leader fails. Controller-runtime provides built-in leader election using a Lease resource in the operator's namespace.

go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme:                 scheme,
    LeaderElection:         true,
    LeaderElectionID:       "myapp-operator-lock",
    // Only the leader processes reconcile events.
    // Standbys maintain informer caches for fast failover.
})

A Simplified Go Operator: Reconciling a WebApp CR

The following example shows a minimal but realistic operator reconciler built with Kubebuilder. It watches a custom WebApp resource and ensures a matching Deployment exists with the correct replica count and container image. This is a Level 1 operator — basic install and configuration.

go
package controllers

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    myappv1 "github.com/example/webapp-operator/api/v1"
)

type WebAppReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=myapp.example.com,resources=webapps,verbs=get;list;watch;create;update;patch
// +kubebuilder:rbac:groups=myapp.example.com,resources=webapps/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // 1. Fetch the WebApp CR
    var webapp myappv1.WebApp
    if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
        if errors.IsNotFound(err) {
            return ctrl.Result{}, nil // CR deleted, owned resources auto-cleaned
        }
        return ctrl.Result{}, err
    }

    // 2. Define the desired Deployment
    replicas := int32(webapp.Spec.Replicas)
    deploy := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      webapp.Name,
            Namespace: webapp.Namespace,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": webapp.Name},
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{"app": webapp.Name},
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "web",
                        Image: webapp.Spec.Image,
                        Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
                    }},
                },
            },
        },
    }

    // 3. Set owner reference for garbage collection
    if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
        return ctrl.Result{}, err
    }

    // 4. Create or update the Deployment
    var existing appsv1.Deployment
    err := r.Get(ctx, client.ObjectKeyFromObject(deploy), &existing)
    if errors.IsNotFound(err) {
        log.Info("Creating Deployment", "name", deploy.Name)
        return ctrl.Result{}, r.Create(ctx, deploy)
    } else if err != nil {
        return ctrl.Result{}, err
    }

    // Update if spec drifted
    existing.Spec.Replicas = &replicas
    existing.Spec.Template.Spec.Containers[0].Image = webapp.Spec.Image
    log.Info("Updating Deployment", "name", deploy.Name)
    return ctrl.Result{}, r.Update(ctx, &existing)
}

// SetupWithManager registers watches for WebApp and owned Deployments
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&myappv1.WebApp{}).          // watch WebApp CRs
        Owns(&appsv1.Deployment{}).       // watch owned Deployments
        Complete(r)
}

Notice the structure: fetch the CR, build the desired state, set owner references, then create-or-update. The SetupWithManager method at the bottom is where the magic happens — For() tells controller-runtime to watch WebApp resources, and Owns() means "also watch Deployments that have an ownerReference pointing back to a WebApp, and if they change, re-reconcile the parent WebApp." This is how drift detection works automatically.

Notable Production Operators

Before building your own operator, check if one already exists. The ecosystem has mature, battle-tested operators for the most common stateful workloads. Studying their source code is also one of the best ways to learn operator patterns.

Operator | Manages | Maturity Level | Key Capabilities
Prometheus Operator | Prometheus, Alertmanager, Thanos | Level 4 (Deep Insights) | CRDs for ServiceMonitor, PrometheusRule, AlertmanagerConfig. Auto-generates scrape configs and alerting rules from CRs. Powers the kube-prometheus-stack.
cert-manager | TLS Certificates | Level 3 (Full Lifecycle) | Automates certificate issuance and renewal via Let's Encrypt, Vault, or custom CAs. CRDs: Certificate, Issuer, ClusterIssuer. Handles ACME challenges automatically.
Strimzi | Apache Kafka | Level 4 (Deep Insights) | Manages Kafka brokers, ZooKeeper (or KRaft), MirrorMaker, Kafka Connect, and Schema Registry. Handles rolling upgrades, topic management, user authentication, and rack-aware replication.
CloudNativePG | PostgreSQL | Level 4–5 | Manages primary + replicas with streaming replication. Continuous WAL archiving to S3/GCS, point-in-time recovery, automated failover, connection pooling via PgBouncer, and declarative backup schedules.

Operators Add Operational Surface Area

An Operator is a program running in your cluster with elevated RBAC permissions — it can create, modify, and delete resources on your behalf. A buggy reconciler can cause cascading deletions or infinite update loops. Before deploying any operator, review its RBAC scope, test it in a staging cluster, and monitor its reconciliation error rate. Building your own operator is a serious commitment: you are writing infrastructure software that must handle edge cases, API version skew, and partial failures gracefully.

Putting It Together: When to Build vs. When to Use

Build a custom operator when you have domain-specific operational logic that cannot be expressed with standard Kubernetes primitives or Helm charts — multi-step upgrade procedures, custom health checks, cross-resource coordination, or integration with external systems. A Helm chart can install an application; an operator can operate it through its full lifecycle.

Do not build an operator when a simpler tool will do. If your application is a stateless web service that just needs a Deployment and a Service, a Helm chart or a Kustomize overlay is the right answer. Operators shine for stateful, complex systems where Day-2 operations (upgrades, backup, failover, scaling) are the hard part — and those operations follow well-defined, automatable procedures. The next section explores how to deliver both applications and operators to clusters using GitOps with ArgoCD and Flux.

GitOps with ArgoCD and Flux — Declarative Continuous Delivery

GitOps is an operational model that takes the declarative philosophy Kubernetes was built on and extends it all the way to your delivery pipeline. Instead of running kubectl apply from a CI job or an engineer's laptop, you store your desired cluster state in Git and let an in-cluster agent continuously reconcile reality to match. The result is an auditable, reversible, and fully automated delivery system.

This approach solves a class of problems that traditional CI/CD pipelines struggle with: configuration drift, lack of auditability, credential sprawl, and the gap between "what we deployed" and "what's actually running." Two tools dominate the Kubernetes GitOps landscape — ArgoCD and Flux — and this section covers both in depth.

The Four Principles of GitOps

The OpenGitOps project (a CNCF Sandbox project) formalized GitOps into four principles. These aren't aspirational guidelines — they are concrete architectural constraints that your tooling must enforce.

Principle | What It Means | In Practice
Declarative | The entire system's desired state is expressed declaratively | All Kubernetes manifests, Helm values, and Kustomize overlays live as files — no imperative scripts that "create if not exists"
Versioned & Immutable | The desired state is stored in a version-controlled source of truth | Git provides history, blame, branching, and the ability to revert any change to any prior commit
Pulled Automatically | Agents automatically pull the desired state and apply it | An in-cluster controller (ArgoCD or Flux) watches the Git repo and syncs changes without external triggers
Continuously Reconciled | Agents observe actual state and correct drift | If someone manually edits a Deployment via kubectl edit, the GitOps agent reverts the change to match Git

GitOps Is Not Just "Git + CI/CD"

Storing YAML in Git and having Jenkins kubectl apply it is not GitOps. True GitOps requires a pull-based reconciliation loop running inside the cluster. The distinction matters because the pull model eliminates the need for external systems to hold cluster credentials and enables continuous drift detection — not just deploy-time synchronization.

Push-Based CI/CD vs. Pull-Based GitOps

The fundamental architectural difference between traditional CI/CD and GitOps is who initiates the deployment and where the credentials live. In push-based delivery, your CI server (Jenkins, GitHub Actions, GitLab CI) holds a kubeconfig or service account token and pushes changes into the cluster. In pull-based GitOps, an agent running inside the cluster pulls changes from Git.

flowchart LR
    subgraph PUSH["Push-Based CI/CD"]
        direction LR
        DEV1["Developer"] -->|git push| REPO1["Git Repo"]
        REPO1 -->|webhook| CI["CI Server<br/>Jenkins / GH Actions"]
        CI -->|"kubectl apply<br/>holds cluster creds"| K8S1["Kubernetes Cluster"]
    end

    subgraph PULL["Pull-Based GitOps"]
        direction LR
        DEV2["Developer"] -->|git push| REPO2["Git Repo"]
        AGENT["GitOps Agent<br/>ArgoCD / Flux"] -->|poll / webhook| REPO2
        AGENT -->|"reconcile<br/>in-cluster access"| K8S2["Kubernetes Cluster"]
    end

    PUSH ~~~ PULL
    
| Aspect | Push-Based (CI/CD) | Pull-Based (GitOps) |
| --- | --- | --- |
| Credential location | CI server needs cluster credentials (kubeconfig, tokens) | Agent runs in-cluster — uses Kubernetes RBAC, no external credentials needed |
| Drift detection | None — cluster can diverge silently between pipeline runs | Continuous — agent detects and optionally corrects drift in real time |
| Deployment trigger | Pipeline run (event-driven, one-shot) | Reconciliation loop (continuous, polling or webhook-triggered) |
| Rollback | Re-run an older pipeline or write rollback logic | git revert — the agent syncs the previous state automatically |
| Audit trail | CI logs (may expire or be incomplete) | Git history — immutable, signed commits, PR approvals |
| Multi-cluster | CI needs credentials for every cluster | Each cluster runs its own agent pointing at the same (or different) repo paths |

In practice, most teams use a hybrid: CI handles build, test, and image push, then updates a Git repo (via automated PR or commit), which triggers the GitOps agent to deploy. This keeps the boundary clean — CI owns the artifact pipeline, GitOps owns the delivery pipeline.
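As a concrete sketch of that boundary, a CI job in the application repo can build and push the image, then commit the new tag to the manifests repo. Everything below — repo names, the `MANIFESTS_TOKEN` secret, the overlay path, and the short-SHA tag scheme — is an illustrative assumption, not a prescribed layout:

```yaml
# Hypothetical CI half of the hybrid model: build, push, update Git.
# The GitOps agent takes over once the manifests repo changes.
name: build-and-update-manifests
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push the image
        run: |
          TAG=${GITHUB_SHA::7}
          docker build -t ghcr.io/myorg/my-app:$TAG .
          docker push ghcr.io/myorg/my-app:$TAG
      - name: Commit the new tag to the manifests repo
        run: |
          TAG=${GITHUB_SHA::7}
          git clone "https://x-access-token:${{ secrets.MANIFESTS_TOKEN }}@github.com/myorg/k8s-manifests.git"
          cd k8s-manifests/apps/staging/my-app
          kustomize edit set image "ghcr.io/myorg/my-app:$TAG"
          git config user.name ci-bot && git config user.email ci@example.com
          git commit -am "ci: deploy my-app $TAG" && git push
```

Note that CI never touches the cluster: its only write access is to Git, and the in-cluster agent does the rest.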

ArgoCD Deep Dive

ArgoCD is the most widely adopted GitOps tool in the Kubernetes ecosystem. It provides a declarative, Kubernetes-native continuous delivery engine with a rich Web UI, RBAC, SSO, and multi-cluster support out of the box. ArgoCD is a CNCF Graduated project.

Architecture Overview

ArgoCD runs as a set of controllers in your cluster. The API Server exposes gRPC and REST APIs (and serves the Web UI). The Repository Server clones Git repos, renders Helm charts, runs Kustomize, and returns plain manifests. The Application Controller continuously compares the rendered manifests against the live cluster state and performs sync operations.

flowchart TB
    subgraph ARGO["ArgoCD (argocd namespace)"]
        API["API Server<br/>gRPC / REST / Web UI"]
        REPO["Repository Server<br/>clones Git, renders manifests"]
        CTRL["Application Controller<br/>reconciliation loop"]
        REDIS["Redis<br/>caching layer"]
        DEX["Dex (optional)<br/>SSO / OIDC"]
    end

    GIT["Git Repository"]
    K8S["Target Cluster(s)"]
    USER["User / CLI / Web UI"]

    USER --> API
    API --> REPO
    API --> CTRL
    CTRL --> REPO
    REPO -->|"clone & render"| GIT
    CTRL -->|"compare & sync"| K8S
    API --> REDIS
    DEX --> API
    

The Application CRD

Everything in ArgoCD revolves around the Application custom resource. An Application defines what to deploy (a path in a Git repo) and where to deploy it (a target cluster and namespace). ArgoCD watches these CRDs and reconciles accordingly.

yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: apps/my-app/overlays/production
  destination:
    server: https://kubernetes.default.svc   # in-cluster
    namespace: my-app
  syncPolicy:
    automated:
      prune: true        # Delete resources removed from Git
      selfHeal: true     # Revert manual changes in the cluster
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

The source can point to plain YAML, a Kustomize directory, a Helm chart in a Git repo, or a chart from a Helm repository. ArgoCD auto-detects the format based on the directory contents (kustomization.yaml, Chart.yaml, etc.).
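For instance, sourcing a chart directly from a Helm repository replaces path with chart. The chart, version, and values shown here are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://kubernetes.github.io/ingress-nginx   # Helm repo URL, not Git
    chart: ingress-nginx                                  # chart name instead of path
    targetRevision: 4.10.0                                # chart version
    helm:
      values: |
        controller:
          replicaCount: 2
  destination:
    server: https://kubernetes.default.svc
    namespace: ingress-nginx
```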

Sync Policies: Automated, Self-Heal, and Prune

Sync policies control how aggressively ArgoCD reconciles. The right combination depends on your risk tolerance and environment maturity. Here's what each option does and when to use it.

| Policy | Behavior | Recommendation |
| --- | --- | --- |
| automated | ArgoCD syncs automatically when it detects Git has diverged from the cluster | Enable for staging and production once you trust your review process |
| selfHeal | If someone kubectl edits a resource, ArgoCD reverts it to match Git | Always enable in production — prevents drift from manual interventions |
| prune | Resources deleted from Git are removed from the cluster | Enable with care — a bad merge can delete production resources |
| Manual sync | ArgoCD detects drift and marks the app "OutOfSync" but waits for a human to click Sync | Good starting point for teams new to GitOps |

You can also add sync waves and hooks via annotations. Sync waves control ordering (e.g., create the namespace before the deployment), and hooks run Jobs at specific phases (PreSync for schema migrations, PostSync for smoke tests, SyncFail for alerting).

yaml
# Run a database migration before syncing the app
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: myorg/db-migrate:v2.1.0
          command: ["./migrate", "--target", "latest"]
      restartPolicy: Never
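Sync waves use a similar annotation. Resources sync in ascending wave order (default wave is 0), so a sketch like the following — with arbitrary wave numbers — creates the namespace before the workload:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # synced before wave 0 resources
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    argocd.argoproj.io/sync-wave: "1"    # synced after waves -1 and 0 complete
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myorg/my-app:1.0.0
```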

App-of-Apps Pattern

Managing dozens of Application CRDs individually becomes unwieldy. The app-of-apps pattern solves this: you create a single "root" Application that points to a directory containing other Application manifests. When ArgoCD syncs the root app, it creates all the child Applications, which then sync their own targets.

yaml
# Root Application — manages all other Applications
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: argocd-apps/          # Directory of Application YAMLs
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
      prune: true

The argocd-apps/ directory contains individual Application YAMLs — one for cert-manager, one for ingress-nginx, one for each microservice, etc. Adding a new service to the platform is a single Git commit that drops a new Application YAML into this directory.

ApplicationSets for Multi-Cluster and Templating

ApplicationSet is a more powerful evolution of app-of-apps. It uses generators to produce Application resources dynamically from data sources like Git directory structure, cluster lists, pull requests, or external APIs. This eliminates boilerplate when you deploy the same application across many clusters or environments.

yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app-multi-cluster
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'my-app-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/k8s-manifests.git
        targetRevision: main
        path: 'apps/my-app/overlays/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: my-app

This single ApplicationSet generates one Application per production cluster registered with ArgoCD. The {{name}}, {{server}}, and {{metadata.labels.region}} template variables are populated from the cluster registration data. Other generators include git (one app per directory), list (explicit values), pullRequest (preview environments per PR), and matrix (combine generators).
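The git directory generator, for example, creates one Application per matching subdirectory. The repo URL and paths here are assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps-per-directory
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/myorg/k8s-manifests.git
        revision: main
        directories:
          - path: apps/staging/*     # one Application per subdirectory
  template:
    metadata:
      name: '{{path.basename}}'      # e.g., "frontend", "backend"
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/k8s-manifests.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
```

Adding a new app is then just a new directory in Git — no new Application manifest required.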

RBAC and SSO

ArgoCD ships with its own RBAC system that controls who can view, sync, or override applications. Policies are defined in a ConfigMap using a Casbin-style syntax. Combined with SSO via Dex or a direct OIDC provider, this lets you map your identity provider groups to ArgoCD roles.

yaml
# argocd-rbac-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    # Role: team-frontend can sync only their apps
    p, role:team-frontend, applications, get, default/frontend-*, allow
    p, role:team-frontend, applications, sync, default/frontend-*, allow

    # Role: platform-admin has full access
    p, role:platform-admin, applications, *, */*, allow
    p, role:platform-admin, clusters, *, *, allow

    # Map OIDC groups to roles
    g, oidc-group:frontend-devs, role:team-frontend
    g, oidc-group:platform-team, role:platform-admin
  policy.default: role:readonly

The pattern default/frontend-* scopes access to applications in the default ArgoCD project whose names start with frontend-. The policy.default key ensures that authenticated users who don't match any group get read-only access — a safe default for visibility without risk.

The Web UI

ArgoCD's Web UI is one of its biggest differentiators. It provides a real-time visualization of your application's resource tree — every Deployment, ReplicaSet, Pod, Service, and Ingress is shown with its sync status and health. You can see diffs between Git and the live state, trigger syncs, view logs, and even exec into Pods. For teams that need operational visibility without deep kubectl fluency, the UI dramatically lowers the barrier to understanding what's running in the cluster.

Flux Deep Dive

Flux takes a fundamentally different architectural approach from ArgoCD. Rather than a monolithic application, Flux is a set of composable, single-purpose controllers that each manage one aspect of the GitOps pipeline. You install only the controllers you need, and they coordinate through Kubernetes custom resources. Flux is also a CNCF Graduated project.

Core Controllers and CRDs

Flux's architecture follows the Unix philosophy — small tools that do one thing well. Each controller watches specific CRDs and produces outputs that other controllers consume.

flowchart LR
    subgraph SOURCES["Source Controllers"]
        GR["GitRepository"]
        HR["HelmRepository"]
        OCR["OCIRepository"]
        BUCKET["Bucket (S3)"]
    end

    subgraph DEPLOY["Deployment Controllers"]
        KS["Kustomization<br/>Controller"]
        HC["Helm<br/>Controller"]
    end

    subgraph AUTO["Automation"]
        IAC["Image Reflector<br/>Controller"]
        IAU["Image Automation<br/>Controller"]
    end

    subgraph NOTIFY["Notifications"]
        NP["Notification<br/>Provider"]
        NA["Alert"]
        NR["Receiver"]
    end

    GR -->|artifact| KS
    GR -->|artifact| HC
    HR -->|chart| HC
    OCR -->|artifact| KS
    KS -->|apply manifests| CLUSTER["Kubernetes<br/>Cluster"]
    HC -->|helm install/upgrade| CLUSTER
    IAC -->|latest image tag| IAU
    IAU -->|commit update| GIT["Git Repo"]
    NR -->|webhook trigger| GR
    KS --> NA
    HC --> NA
    NA --> NP
    
| Controller | CRDs | Responsibility |
| --- | --- | --- |
| Source Controller | GitRepository, HelmRepository, OCIRepository, Bucket | Fetches artifacts from external sources and produces versioned tarballs for other controllers to consume |
| Kustomize Controller | Kustomization | Applies Kustomize overlays or plain YAML from a source artifact to the cluster |
| Helm Controller | HelmRelease | Manages Helm chart lifecycle — install, upgrade, rollback, test, uninstall |
| Image Reflector | ImageRepository, ImagePolicy | Scans container registries for new image tags matching a policy |
| Image Automation | ImageUpdateAutomation | Commits image tag updates back to Git when new images match the policy |
| Notification Controller | Provider, Alert, Receiver | Sends alerts to Slack/Teams/PagerDuty and receives webhooks to trigger reconciliation |

GitRepository and Kustomization

The two most fundamental Flux CRDs are GitRepository (fetch the source) and Kustomization (apply it). Together, they form the minimum viable GitOps pipeline. The GitRepository polls your repo at a configurable interval, and the Kustomization controller applies the resulting manifests.

yaml
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/myorg/k8s-manifests.git
  ref:
    branch: main
  secretRef:
    name: git-credentials     # SSH key or token for private repos
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  retryInterval: 2m
  targetNamespace: my-app
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./apps/my-app/overlays/production
  prune: true                  # Remove resources deleted from Git
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app
      namespace: my-app
  timeout: 3m

Note that Flux's Kustomization CRD is not the same as a kustomization.yaml file. The Flux Kustomization is a controller configuration that tells Flux what to apply. If the target path contains a kustomization.yaml, Flux runs Kustomize on it. If it doesn't, Flux generates one automatically from all the YAML files in the directory. The healthChecks field is powerful — Flux will wait for the specified resources to become healthy before marking the reconciliation as successful.

HelmRelease

For teams using Helm, Flux provides a HelmRelease CRD that manages the full chart lifecycle. It supports values from ConfigMaps, Secrets, inline YAML, or values files in Git. Flux handles install, upgrade, rollback on failure, and uninstall — all declaratively.

yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: bitnami
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.bitnami.com/bitnami
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: redis
  namespace: flux-system
spec:
  interval: 30m
  chart:
    spec:
      chart: redis
      version: "18.x"          # Semver range
      sourceRef:
        kind: HelmRepository
        name: bitnami
  targetNamespace: cache
  install:
    createNamespace: true
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
      remediateLastFailure: true   # Rollback on failed upgrade
  values:
    architecture: replication
    replica:
      replicaCount: 3
    auth:
      existingSecret: redis-credentials

Image Automation

Flux's image automation controllers close the loop between CI and GitOps. When your CI pipeline pushes a new container image, Flux detects it, updates the image tag in Git, and then syncs the new manifests to the cluster. This eliminates the manual step of updating image tags in your manifests repo.

yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  image: ghcr.io/myorg/my-app
  interval: 5m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: my-app
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: my-app
  policy:
    semver:
      range: "1.x"             # Only pick tags matching semver 1.x
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 30m
  sourceRef:
    kind: GitRepository
    name: my-app
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: flux-bot
        email: flux@myorg.com
      messageTemplate: "chore: update {{.AutomationObject}} images"
    push:
      branch: main
  update:
    path: ./apps/my-app
    strategy: Setters

In your deployment manifest, you mark which image fields to update using a special comment marker:

yaml
containers:
  - name: my-app
    image: ghcr.io/myorg/my-app:1.4.2  # {"$imagepolicy": "flux-system:my-app"}

When a new tag like 1.5.0 appears in the registry and matches the 1.x semver policy, Flux automatically commits an update changing 1.4.2 to 1.5.0 in your Git repo. The Kustomization controller then picks up the commit and deploys it.

Multi-Tenancy with Flux

Flux has first-class multi-tenancy support. Each tenant (team or project) gets their own GitRepository and Kustomization scoped to specific namespaces. A platform team manages the "root" Kustomization that bootstraps tenant Kustomizations, and Kubernetes RBAC ensures tenants can only deploy to their own namespaces.

yaml
# Tenant Kustomization — scoped to team-alpha's namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-alpha-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: team-alpha-repo
  path: ./deploy
  prune: true
  targetNamespace: team-alpha
  serviceAccountName: team-alpha-sa   # RBAC-scoped SA

The serviceAccountName field is the key to multi-tenancy. The Kustomization controller impersonates this service account when applying resources, so it can only create or modify resources that the service account has RBAC access to. A tenant cannot accidentally (or maliciously) modify resources in another team's namespace.
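A minimal sketch of the RBAC behind that service account might look like this. One common arrangement — assumed here — places the service account in the Kustomization's own namespace (flux-system, since that is where the controller impersonates it) and uses a RoleBinding to scope its permissions to the tenant namespace; all names are illustrative:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: team-alpha-sa
  namespace: flux-system      # same namespace as the Kustomization that references it
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-reconciler
  namespace: team-alpha       # grants access only inside the tenant namespace
subjects:
  - kind: ServiceAccount
    name: team-alpha-sa
    namespace: flux-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                 # built-in aggregated role, namespace-scoped by the RoleBinding
```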

ArgoCD vs. Flux — Choosing Your Tool

Both ArgoCD and Flux are CNCF Graduated projects with large communities and production deployments at scale. The choice between them often comes down to team preferences, existing tooling, and operational philosophy rather than a clear technical winner.

| Dimension | ArgoCD | Flux |
| --- | --- | --- |
| Architecture | Monolithic application with API server, controller, repo server | Composable toolkit of independent controllers |
| Web UI | Rich built-in UI with resource visualization, diffs, logs | No built-in UI — use Weave GitOps UI or Capacitor as add-ons |
| CLI | argocd CLI for app management and admin tasks | flux CLI for bootstrapping and troubleshooting |
| Multi-cluster | Centralized — one ArgoCD manages many clusters | Decentralized — each cluster runs its own Flux, or use Flux + Cluster API |
| Helm support | Renders Helm charts to plain manifests, tracks via Application CRD | Native HelmRelease CRD with full lifecycle (install, upgrade, rollback, test) |
| Image automation | Via ArgoCD Image Updater (separate project, less mature) | Built-in image reflector and automation controllers |
| Multi-tenancy | AppProjects with RBAC policies and source/destination restrictions | Service account impersonation per Kustomization with native K8s RBAC |
| Notifications | Built-in notification engine with triggers and templates | Notification controller with providers and alerts |
| Learning curve | Lower — the UI helps visualize state and debug issues | Higher — requires comfort with CRDs and CLI-based debugging |
| Resource footprint | Heavier — runs Redis, Dex, repo server, API server | Lighter — only install the controllers you need |

When to Pick Which

Choose ArgoCD if your team values a visual dashboard, you manage multiple clusters from a central hub, or your developers are not deeply comfortable with kubectl. Choose Flux if you prefer a composable toolkit, want tighter Kubernetes-native RBAC integration, need built-in image automation, or run a platform where each team manages their own GitOps pipeline with strong tenant isolation.

Repository Structure Best Practices

Your Git repository layout determines how cleanly you can manage environments, teams, and promotion workflows. There's no universally correct structure, but two patterns dominate — and each serves different organizational needs.

Monorepo vs. Polyrepo

| Pattern | Structure | Best For | Watch Out For |
| --- | --- | --- | --- |
| Monorepo | One repo with all manifests, separated by directory | Small-to-medium teams, strong shared standards, easy cross-cutting changes | Merge conflicts at scale; RBAC requires path-level Git permissions (e.g., CODEOWNERS) |
| Polyrepo | Separate repos per team or per app | Large orgs with autonomous teams, strict access control | Harder to make platform-wide changes, more repos to manage |
| Hybrid | One repo for platform/infra, separate repos per team for apps | Platform engineering model — central team controls shared infra | Requires clear ownership boundaries |

Here's a recommended monorepo structure that works well with both ArgoCD and Flux. It separates concerns by layer (infrastructure vs. applications), uses Kustomize overlays for environment differentiation, and keeps the GitOps tool configuration in its own directory.

text
k8s-manifests/
├── infrastructure/                 # Shared cluster infrastructure
│   ├── base/
│   │   ├── cert-manager/
│   │   ├── ingress-nginx/
│   │   ├── monitoring/
│   │   └── sealed-secrets/
│   ├── staging/
│   │   └── kustomization.yaml     # Patches for staging
│   └── production/
│       └── kustomization.yaml     # Patches for production
├── apps/                           # Application workloads
│   ├── base/
│   │   ├── frontend/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── backend/
│   │       ├── deployment.yaml
│   │       ├── service.yaml
│   │       └── kustomization.yaml
│   ├── staging/
│   │   ├── frontend/
│   │   │   └── kustomization.yaml # image tag, replicas, env vars
│   │   └── backend/
│   │       └── kustomization.yaml
│   └── production/
│       ├── frontend/
│       │   └── kustomization.yaml
│       └── backend/
│           └── kustomization.yaml
└── clusters/                       # GitOps tool configuration
    ├── staging/
    │   ├── infrastructure.yaml    # ArgoCD App or Flux Kustomization
    │   └── apps.yaml
    └── production/
        ├── infrastructure.yaml
        └── apps.yaml

Environment Promotion Patterns

Promoting a change from staging to production should be a deliberate, reviewable action. Two promotion patterns are common, and they work differently with your Git workflow.

Pattern 1: Branch-per-environment. The main branch represents staging, and a production branch represents production. You promote by merging main into production. This is simple but fragile — merge conflicts accumulate, and the branches inevitably diverge in ways that are hard to reason about.

Pattern 2: Directory-per-environment (recommended). A single branch (main) contains directories for each environment, using Kustomize overlays to vary configuration. Promotion is a PR that updates the production overlay — typically changing an image tag or a Kustomize patch. This is easier to audit and less error-prone.

yaml
# apps/production/backend/kustomization.yaml
# Promoting v1.5.0 to production = changing this image tag
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: backend
resources:
  - ../../base/backend
images:
  - name: ghcr.io/myorg/backend
    newTag: "1.5.0"              # <-- promotion happens here
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: backend
      spec:
        replicas: 5              # Production runs more replicas

Automating promotion is possible too. After staging passes health checks, a CI job can open a PR that bumps the production image tag. The PR goes through code review, merges, and the GitOps agent deploys it — keeping the human-in-the-loop for production changes while automating the toil.
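One sketch of that automation is a manually or pipeline-triggered workflow that edits the production overlay and opens the PR. The create-pull-request action, the paths, and the version input are all assumptions here, not a fixed recipe:

```yaml
# Hypothetical promotion workflow: bump the production tag, open a PR for review
name: promote-to-production
on:
  workflow_dispatch:
    inputs:
      version:
        description: "Image version that passed staging checks"
        required: true
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Bump the production image tag
        run: |
          cd apps/production/backend
          kustomize edit set image "ghcr.io/myorg/backend:${{ inputs.version }}"
      - name: Open a promotion PR
        uses: peter-evans/create-pull-request@v6
        with:
          branch: promote/backend-${{ inputs.version }}
          commit-message: "promote: backend ${{ inputs.version }} to production"
          title: "Promote backend ${{ inputs.version }} to production"
```

The merge of that PR is the promotion; the GitOps agent handles the actual deploy.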

Practical Setup: Installing ArgoCD

Here's a working setup that gets ArgoCD running in your cluster and deploys an application from Git. This uses the non-HA manifest for simplicity — production clusters should use the HA manifest or the Helm chart.

bash
# 1. Install ArgoCD into its own namespace
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 2. Wait for all components to be ready
kubectl wait --for=condition=available deployment --all -n argocd --timeout=300s

# 3. Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d && echo

# 4. Port-forward the API server to access the UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

# 5. Login with the CLI (optional — you can also use the Web UI)
argocd login localhost:8080 --username admin --insecure

# 6. Change the default password immediately
argocd account update-password

Once ArgoCD is running, create your first Application — either through the Web UI, the CLI, or by applying an Application YAML like the one shown in the Application CRD section above. Within seconds, ArgoCD will clone your repo, render the manifests, and show you the sync status.

bash
# Create an Application via CLI
argocd app create guestbook \
  --repo https://github.com/argoproj/argocd-example-apps.git \
  --path guestbook \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default

# Check sync status
argocd app get guestbook

# Trigger a sync
argocd app sync guestbook

Practical Setup: Bootstrapping Flux

Flux bootstraps itself — the flux bootstrap command installs Flux into your cluster and commits its own configuration to your Git repo. From that point on, Flux manages itself through GitOps. This is one of Flux's most elegant design choices.

bash
# 1. Install the Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash

# 2. Check prerequisites (kubectl context, Git access, cluster version)
flux check --pre

# 3. Bootstrap Flux with GitHub (creates repo structure and installs controllers)
export GITHUB_TOKEN=<your-pat-token>

flux bootstrap github \
  --owner=myorg \
  --repository=k8s-manifests \
  --branch=main \
  --path=clusters/staging \
  --personal

# 4. Verify all controllers are running
flux check

# 5. Check the state of all Flux resources
flux get all

After bootstrap, Flux has committed its own controller manifests to clusters/staging/flux-system/ in your Git repo. To deploy an application, you add GitRepository and Kustomization YAMLs to the clusters/staging/ path and push. Flux picks them up automatically.

bash
# Create a source and kustomization via CLI (generates YAML and commits to Git)
flux create source git my-app \
  --url=https://github.com/myorg/my-app-manifests \
  --branch=main \
  --interval=5m \
  --export > ./clusters/staging/my-app-source.yaml

flux create kustomization my-app \
  --source=GitRepository/my-app \
  --path="./overlays/staging" \
  --prune=true \
  --interval=10m \
  --export > ./clusters/staging/my-app-kustomization.yaml

# Commit and push — Flux reconciles automatically
git add -A && git commit -m "feat: add my-app to staging" && git push

# Watch the reconciliation
flux get kustomizations --watch

Secrets in GitOps Repos

Never commit plain Kubernetes Secrets to your GitOps repo. Use Sealed Secrets (Bitnami), SOPS (Mozilla — natively supported by Flux), or an External Secrets Operator that syncs secrets from Vault, AWS Secrets Manager, or GCP Secret Manager. ArgoCD works with all three approaches; Flux has built-in SOPS decryption in its Kustomize controller.
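With Flux, enabling SOPS is a single decryption stanza on the Kustomization. The source name, path, and the name of the Secret holding the age private key are assumptions:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app-secrets
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./apps/my-app/secrets
  prune: true
  decryption:
    provider: sops            # decrypt SOPS-encrypted manifests before applying
    secretRef:
      name: sops-age          # Secret containing the age (or PGP) private key
```

Encrypted files stay encrypted in Git; only the controller, holding the key, ever sees plaintext.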

High Availability and Disaster Recovery

A Kubernetes cluster that runs only when everything is perfect is not production-ready. Real infrastructure experiences node failures, network partitions, data center outages, and human errors. High availability (HA) is about surviving component failures without downtime. Disaster recovery (DR) is about getting back to a working state when HA is not enough — when you lose an entire cluster, a region, or corrupt critical data.

This section walks through HA at every layer of the stack — control plane, application, and data — then covers the DR strategies that protect you when the worst happens. Each concept is paired with the practical configuration that implements it.

Control Plane High Availability

The control plane is the brain of your cluster. If it goes down, no new Pods can be scheduled, no Deployments can roll out, and no self-healing can occur. Existing workloads keep running (kubelets operate autonomously), but the cluster is effectively frozen. A production control plane must tolerate the loss of at least one node without interruption.

API Server: Stateless and Load-Balanced

The kube-apiserver is stateless — it reads and writes all data to etcd, holding nothing in memory between requests. This makes it the easiest control plane component to scale. You run multiple replicas (typically 3) behind a load balancer, and any instance can serve any request. If one crashes, the load balancer routes traffic to the surviving instances.

| Load Balancer Option | Best For | Notes |
| --- | --- | --- |
| Cloud LB (AWS NLB, GCP ILB) | Managed Kubernetes (EKS, GKE) | Handled automatically by the cloud provider. Zero config on your part. |
| HAProxy / Nginx | Self-managed clusters | Run on dedicated hosts, or as a keepalived VIP pair for the LB itself. |
| kube-vip | Bare-metal clusters | Runs as a static Pod on control plane nodes. Provides a virtual IP via ARP or BGP. |
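For the self-managed case, a minimal HAProxy sketch looks like this — TCP mode passes TLS straight through to the API servers, so HAProxy never terminates the connection. The IPs are placeholders:

```text
frontend kubernetes-api
    bind *:6443
    mode tcp
    option tcplog
    default_backend kube-apiservers

backend kube-apiservers
    mode tcp
    balance roundrobin
    option tcp-check
    server cp-1 10.0.0.11:6443 check
    server cp-2 10.0.0.12:6443 check
    server cp-3 10.0.0.13:6443 check
```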

etcd: The Quorum Problem

etcd is the single source of truth for all cluster state. Unlike the API server, etcd is stateful — it uses the Raft consensus protocol, which requires a majority quorum to accept writes. This means the number of nodes you run directly determines how many failures you can tolerate.

| etcd Nodes | Quorum Required | Tolerated Failures |
| --- | --- | --- |
| 1 | 1 | 0 — any failure loses the cluster |
| 3 | 2 | 1 node |
| 5 | 3 | 2 nodes |
| 7 | 4 | 3 nodes (rarely needed — latency increases) |

Three nodes is the minimum for production. Five nodes are appropriate when you need to survive two simultaneous failures — common in multi-AZ deployments where an entire availability zone might go down. Going beyond five is almost never justified because each additional member increases write latency (every write must be replicated to a majority).

Always Use an Odd Number of etcd Nodes

An even number (e.g., 4) gives you the same fault tolerance as one fewer node (3), but with higher write latency. Four nodes still require a quorum of 3, so you can only lose 1 — the same as a 3-node cluster. The extra node adds cost and latency with no resilience benefit.
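The quorum arithmetic behind the table is easy to check directly: a majority of n members is n/2 + 1 (integer division), and tolerated failures are whatever is left.

```shell
# Quorum math for an n-member Raft cluster (pure arithmetic, safe to run anywhere)
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))        # majority of n members
  tolerated=$(( n - quorum ))    # members you can lose and still have quorum
  echo "members=$n quorum=$quorum tolerated=$tolerated"
done
# Adding n=4 to the loop shows quorum=3, tolerated=1 — no better than 3 members
```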

Scheduler and Controller Manager: Leader Election

Unlike the API server, the kube-scheduler and kube-controller-manager cannot run in active-active mode. If two schedulers tried to bind the same Pod to different nodes simultaneously, the cluster would enter a conflicted state. Instead, these components use leader election.

All replicas start, but only the elected leader actively does work. The others are hot standbys that watch the leader lease. If the leader crashes or its lease expires (default: 15 seconds), a standby wins the election and takes over. The mechanism uses a Kubernetes Lease object stored in the API server, so it piggybacks on the same HA infrastructure you already have for etcd and the API server.

bash
# Check which node currently holds the scheduler and controller-manager leases
kubectl get lease -n kube-system kube-scheduler -o jsonpath='{.spec.holderIdentity}'
kubectl get lease -n kube-system kube-controller-manager -o jsonpath='{.spec.holderIdentity}'

HA Control Plane Architecture

The following diagram shows a production-grade 3-node control plane. Each node runs all control plane components. The API servers sit behind a shared load balancer, while etcd forms a Raft cluster across all three nodes. The scheduler and controller-manager elect a single leader.

graph TB
    LB["Load Balancer<br/>(VIP / Cloud LB)"]

    subgraph CP1["Control Plane Node 1"]
        API1["kube-apiserver"]
        ETCD1["etcd member-1"]
        S1["scheduler (leader)"]
        CM1["controller-manager<br/>(standby)"]
    end

    subgraph CP2["Control Plane Node 2"]
        API2["kube-apiserver"]
        ETCD2["etcd member-2"]
        S2["scheduler (standby)"]
        CM2["controller-manager<br/>(leader)"]
    end

    subgraph CP3["Control Plane Node 3"]
        API3["kube-apiserver"]
        ETCD3["etcd member-3"]
        S3["scheduler (standby)"]
        CM3["controller-manager<br/>(standby)"]
    end

    LB --> API1
    LB --> API2
    LB --> API3

    ETCD1 <-->|"Raft"| ETCD2
    ETCD2 <-->|"Raft"| ETCD3
    ETCD1 <-->|"Raft"| ETCD3

    API1 --> ETCD1
    API2 --> ETCD2
    API3 --> ETCD3

    Workers["Worker Nodes"] --> LB
    

Application-Level High Availability

A highly available control plane means nothing if your application runs as a single Pod on one node. Application HA requires three things: multiple replicas, intelligent placement across failure domains, and controlled disruption during maintenance. Kubernetes gives you specific primitives for each.

Pod Anti-Affinity: Don't Put All Eggs in One Basket

Pod anti-affinity rules tell the scheduler to avoid placing Pods that match a label selector on the same node (or in the same zone). This ensures that a single node failure does not take out all replicas of a service. The requiredDuringSchedulingIgnoredDuringExecution variant is a hard rule — the Pod will not schedule if it cannot be satisfied. The preferredDuringSchedulingIgnoredDuringExecution variant is a soft hint.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-frontend
              topologyKey: kubernetes.io/hostname
      containers:
        - name: frontend
          image: myapp/frontend:2.4.1
          resources:
            requests:
              cpu: 250m
              memory: 256Mi

This configuration guarantees that no two web-frontend Pods land on the same node. Change the topologyKey to topology.kubernetes.io/zone to spread across availability zones instead — though a hard zone anti-affinity requirement with 3 replicas requires at least 3 zones.
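For contrast, here is a sketch of the soft variant for the same Deployment — the scheduler scores nodes to spread replicas apart but still schedules when it cannot:

```yaml
# Drop-in replacement for the affinity block above (soft preference)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                # 1-100; higher weights dominate node scoring
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-frontend
          topologyKey: kubernetes.io/hostname
```

The soft form is often the better default for clusters with few nodes: it degrades gracefully instead of leaving replicas Pending when nodes are scarce.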

Topology Spread Constraints: Even Distribution

Anti-affinity is binary: same node or different node. Topology spread constraints give you finer control — you can specify the maximum allowed skew between zones or nodes. This is critical for large deployments where you want even distribution, not just non-colocation.

yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-frontend
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-frontend

This example combines two constraints. The first is a hard rule: Pods must be evenly distributed across zones (no zone can have more than 1 extra Pod compared to the least-populated zone). The second is a soft rule: try to spread across nodes within a zone, but do not block scheduling if it cannot be perfectly balanced.
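To make the maxSkew arithmetic concrete, here is the check the scheduler effectively performs, sketched in plain shell with an assumed distribution of 8 Pods across 3 zones:

```shell
# skew = (Pods in most-loaded zone) - (Pods in least-loaded zone);
# a placement is only allowed if it keeps skew <= maxSkew.
zone_a=3; zone_b=3; zone_c=2
max=$zone_a; min=$zone_a
for n in $zone_b $zone_c; do
  [ "$n" -gt "$max" ] && max=$n
  [ "$n" -lt "$min" ] && min=$n
done
skew=$((max - min))
echo "current skew: $skew"     # 3 - 2 = 1, within maxSkew: 1
# Placing the next Pod in zone A would make the skew 4 - 2 = 2, so under
# whenUnsatisfiable: DoNotSchedule the scheduler must pick zone C instead.
```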

Pod Disruption Budgets: Controlled Maintenance

When a node is drained for maintenance (kubectl drain), upgrades, or autoscaler scale-down, Kubernetes evicts Pods. Without guardrails, a drain operation could evict all replicas of a critical service simultaneously. A Pod Disruption Budget (PDB) tells Kubernetes how many Pods of a given set must remain available (or how many can be unavailable) during voluntary disruptions.

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 2          # at least 2 Pods must stay running
  selector:
    matchLabels:
      app: web-frontend

With 3 replicas and minAvailable: 2, Kubernetes will only allow 1 Pod to be evicted at a time. If a second drain would violate the budget, it blocks until the first evicted Pod is rescheduled and healthy. You can alternatively use maxUnavailable — for example, maxUnavailable: 1 achieves the same result and is often clearer. You can also use percentage values like maxUnavailable: "25%" for larger deployments.
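The eviction budget is simple arithmetic; this shell sketch mirrors the 3-replica example:

```shell
# Allowed voluntary disruptions = healthy Pods - minAvailable (floored at 0).
healthy=3
min_available=2
allowed=$((healthy - min_available))
[ "$allowed" -lt 0 ] && allowed=0
echo "allowed disruptions: $allowed"   # 3 - 2 = 1: one Pod may be evicted

# After one eviction only 2 Pods are healthy, so the budget drops to 0 and
# a second drain blocks until the evicted Pod is rescheduled and Ready.
healthy=2
allowed=$((healthy - min_available))
echo "after eviction: $allowed"        # 0: further evictions are refused
```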

bash
# Check PDB status — ALLOWED DISRUPTIONS shows how many more Pods can be evicted
kubectl get pdb web-frontend-pdb
# NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# web-frontend-pdb   2               N/A               1                     5m

PDBs Only Protect Against Voluntary Disruptions

A PDB guards against drains, evictions, and voluntary maintenance. It does not prevent a node from crashing, the OOM killer from terminating a container, or a Pod from failing its health checks. PDBs complement — but do not replace — replicas, anti-affinity, and proper resource limits.

etcd Backup and Restore

HA protects you from individual component failures, but it cannot protect against data corruption, accidental mass deletion (kubectl delete ns production), or a bug that writes bad data to etcd. For those scenarios, you need backups. etcd is the single store of all cluster state, so backing it up means backing up your entire cluster configuration.

Manual Snapshot with etcdctl

The etcdctl snapshot save command streams a point-in-time copy of the etcd keyspace from a live member into a single file. This is the foundation of every etcd backup strategy. You must provide the etcd TLS certificates because etcd requires client authentication.

bash
# Take an etcd snapshot (run on a control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20240115-030000.db --write-out=table
# +----------+----------+------------+------------+
# |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | 6d15a4e2 |  1284592 |       1438 |     5.2 MB |
# +----------+----------+------------+------------+

Automated Backup with a CronJob

Manual backups do not scale. The following CronJob runs every 6 hours, creates an etcd snapshot, and uploads it to an S3-compatible object store. It runs on a control plane node (via nodeSelector and toleration) and mounts the host etcd certificates.

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"            # Every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true            # reach the node-local etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              image: bitnami/etcd:3.5
              command:
                - /bin/sh
                - -c
                - |
                  SNAPSHOT="/tmp/etcd-$(date +%Y%m%d-%H%M%S).db"
                  etcdctl snapshot save "$SNAPSHOT" \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/etcd-certs/ca.crt \
                    --cert=/etc/etcd-certs/server.crt \
                    --key=/etc/etcd-certs/server.key
                  # Upload to S3 (aws-cli, or mc for MinIO) — note that the
                  # stock etcd image does not bundle these tools; bake them
                  # into a custom image or use a sidecar for the upload
                  aws s3 cp "$SNAPSHOT" s3://my-etcd-backups/
              envFrom:
                - secretRef:
                    name: aws-backup-credentials
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/etcd-certs
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
          restartPolicy: OnFailure

Restoring from a Snapshot

Restoring etcd is a disruptive operation — you stop the current etcd cluster and replace its data. This is a last-resort procedure, typically performed when the cluster has lost quorum or data has been corrupted. The restored cluster starts with a new cluster ID, so all etcd members must be restored from the same snapshot.

bash
# 1. Stop the kube-apiserver and etcd (move their static Pod manifests)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# 2. Move the old etcd data directory aside — "snapshot restore" refuses to
#    write into a directory that already exists
sudo mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240115-030000.db \
  --data-dir=/var/lib/etcd \
  --name=cp-1 \
  --initial-cluster=cp-1=https://10.0.1.10:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# 4. Restart etcd and the API server
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 5. Verify the cluster is healthy
kubectl get nodes
kubectl get pods -A

Disaster Recovery Strategies

HA handles individual component failures. Disaster recovery handles catastrophic scenarios: a full cluster loss, a region outage, a ransomware attack, or a cascading failure that corrupts data. Your DR strategy determines your Recovery Time Objective (RTO) — how long until you are back online — and your Recovery Point Objective (RPO) — how much data you can afford to lose.

graph LR
    subgraph Strategies["DR Strategy Spectrum"]
        direction LR
        A["etcd Snapshots<br/>+ GitOps Rebuild"]
        B["Velero<br/>App-Level Backup"]
        C["Active-Passive<br/>Multi-Cluster"]
        D["Active-Active<br/>Multi-Cluster"]
    end

    A -.->|"RTO: hours  RPO: hours"| B
    B -.->|"RTO: 30-60 min  RPO: minutes"| C
    C -.->|"RTO: minutes  RPO: seconds"| D

    style A fill:#fef3c7,stroke:#d97706
    style B fill:#fef3c7,stroke:#d97706
    style C fill:#dbeafe,stroke:#2563eb
    style D fill:#d1fae5,stroke:#059669
    

| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| etcd snapshot + GitOps rebuild | Hours | Last snapshot (hours) | Low | Low |
| Velero scheduled backups | 30-60 minutes | Last backup (minutes-hours) | Low-Medium | Medium |
| Active-Passive multi-cluster | Minutes | Replication lag (seconds-minutes) | High | High |
| Active-Active multi-cluster | Near-zero | Near-zero | Very High | Very High |

GitOps-Based Recovery: Rebuild from Git

If you follow GitOps (see the previous section on ArgoCD and Flux), your entire cluster desired state already lives in Git. Your DR process becomes: provision a new cluster, point your GitOps tool at the same repository, and let it reconcile. This rebuilds all namespaces, Deployments, Services, ConfigMaps, RBAC policies, and network policies automatically.
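As a sketch of that bootstrap step, assuming ArgoCD and a hypothetical config repository, recovery can be a single Application pointed at the cluster's desired-state repo:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/cluster-config   # hypothetical repo
    targetRevision: main
    path: clusters/prod            # manifests for the whole cluster
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                  # delete resources removed from Git
      selfHeal: true               # revert out-of-band changes
```

Apply this one manifest to a freshly provisioned cluster and the GitOps controller reconciles everything else.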

The gap in GitOps-only recovery is runtime state: PersistentVolume data, Secrets not stored in Git, Custom Resource instances created by operators, and any state held in databases running inside the cluster. GitOps gives you the skeleton; you need Velero or database-native replication to restore the flesh.

Velero: Application-Level Backup and Restore

Velero (formerly Heptio Ark) is the standard tool for backing up and restoring Kubernetes resources and persistent volumes. It works at the Kubernetes API level — it queries the API server for resources, serializes them to JSON, and stores them in object storage (S3, GCS, Azure Blob). For volumes, it can take CSI snapshots or use Restic/Kopia for file-level backups.

graph LR
    V["Velero Server<br/>(in-cluster)"]
    API["kube-apiserver"]
    S3["Object Storage<br/>(S3 / GCS / MinIO)"]
    SNAP["Volume Snapshots<br/>(CSI / Cloud)"]

    V -->|"1. List resources"| API
    V -->|"2. Store manifests"| S3
    V -->|"3. Snapshot PVs"| SNAP

    S3 -->|"4. Restore manifests"| V
    SNAP -->|"5. Restore volumes"| V
    V -->|"6. Create resources"| API
    

Installing and Configuring Velero

bash
# Install Velero with AWS S3 as the backup storage location
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --default-volumes-to-fs-backup

Velero: Backup, Restore, and Schedule

Velero operates through three core workflows: on-demand backups for immediate snapshots, scheduled backups for automated protection, and restores to recover from a backup. Each can target specific namespaces, label selectors, or resource types.

bash
# --- On-demand backup of the "production" namespace ---
velero backup create prod-backup-manual \
  --include-namespaces production \
  --wait

# Check backup status
velero backup describe prod-backup-manual --details

# --- Schedule automatic daily backups with 7-day retention ---
velero schedule create prod-daily \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 168h

# List scheduled backups
velero schedule get

# --- Restore from a backup ---
# Restore everything from the backup
velero restore create --from-backup prod-backup-manual --wait

# Restore only specific resources (e.g., just Deployments and Services)
velero restore create --from-backup prod-backup-manual \
  --include-resources deployments,services \
  --wait

# Restore to a different namespace (rename on restore)
velero restore create --from-backup prod-backup-manual \
  --namespace-mappings production:production-restored \
  --wait

Velero Backup as a Kubernetes Resource

Behind the CLI, Velero creates Custom Resources. Here is what a Schedule and BackupStorageLocation look like as YAML — useful when you manage Velero through GitOps rather than the CLI.

yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-backups
    prefix: cluster-prod
  config:
    region: us-east-1
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: full-cluster-daily
  namespace: velero
spec:
  schedule: "0 3 * * *"
  template:
    ttl: 720h                         # 30-day retention
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - velero
    snapshotVolumes: true
    defaultVolumesToFsBackup: false    # prefer CSI snapshots

Multi-Cluster Disaster Recovery

For organizations that cannot tolerate the recovery time of a single-cluster strategy, multi-cluster DR provides faster failover by running a second cluster that is ready (or already serving traffic) when the primary fails.

| Pattern | How It Works | Trade-offs |
|---|---|---|
| Active-Passive | A standby cluster in a second region receives replicated data (via Velero cross-cluster restore, database replication, or storage-level replication). DNS or a global load balancer switches traffic during failover. | The passive cluster consumes resources but serves no traffic until failover. Data replication lag determines RPO. Failover can be automated with health checks on the global LB. |
| Active-Active | Both clusters serve production traffic simultaneously. A global load balancer (e.g., AWS Global Accelerator, Cloudflare LB) routes users to the nearest healthy cluster. Data stores use multi-region replication (e.g., CockroachDB, Cassandra, or cloud-managed databases with multi-region writes). | Near-zero RTO/RPO, but significantly more complex. You must handle data consistency, conflict resolution, and split-brain scenarios. Stateless workloads are easy; stateful workloads are hard. |

Start Simple, Evolve as Needed

Most teams should start with GitOps + Velero scheduled backups. This combination covers 90% of disaster scenarios at low cost. Move to active-passive multi-cluster only when your SLAs demand RTO under 15 minutes. Move to active-active only when you need five-nines availability across regions — and be prepared for a significant jump in operational complexity.

Putting It All Together: An HA Checklist

High availability and disaster recovery are not features you bolt on later — they are architectural decisions made early. Here is a practical checklist to audit your cluster resilience:

| Layer | Requirement | How to Verify |
|---|---|---|
| Control Plane | 3+ API server replicas behind a load balancer | kubectl get pods -n kube-system -l component=kube-apiserver |
| etcd | 3 or 5 members with quorum health checks | etcdctl endpoint health --cluster |
| etcd Backups | Automated snapshots every 1-6 hours, stored offsite | Check CronJob history or Velero schedule status |
| Leader Election | Scheduler and controller-manager run on multiple nodes | kubectl get lease -n kube-system |
| App Replicas | Critical services have 3+ replicas | kubectl get deploy -o wide |
| Pod Spreading | Anti-affinity or topology spread constraints in place | Review Deployment specs for affinity or topologySpreadConstraints |
| PDBs | PDBs defined for every critical workload | kubectl get pdb --all-namespaces |
| DR Tested | Restore procedure tested in the last 30 days | Check your runbook and last restore drill date |
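The "3 or 5 members" guidance in the etcd row comes from Raft quorum arithmetic, sketched here in shell:

```shell
# Writes need a majority of members: quorum = floor(n/2) + 1,
# and the cluster survives n - quorum member failures.
for n in 1 3 5; do
  quorum=$((n / 2 + 1))
  tolerated=$((n - quorum))
  echo "members=$n quorum=$quorum tolerates=$tolerated failure(s)"
done
# members=3 tolerates 1 failure; members=5 tolerates 2. An even count
# (e.g. 4) raises quorum to 3 but still tolerates only 1 failure —
# extra cost, no extra resilience, which is why odd sizes are the rule.
```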

The last row is the most important. A backup that has never been tested is not a backup — it is a hope. Schedule quarterly DR drills where you restore to a fresh cluster and verify that applications are functional. The time you invest in testing is repaid the first time a real disaster strikes.

Multi-Tenancy Patterns — Sharing Clusters Safely

Running a separate Kubernetes cluster for every team, environment, or customer is simple to reason about — but expensive and operationally painful to maintain. Multi-tenancy lets multiple tenants share a single cluster while keeping their workloads isolated from each other. The challenge is finding the right balance between isolation strength, operational overhead, and cost efficiency.

There is no single "correct" multi-tenancy model. The right choice depends on your trust boundaries, compliance requirements, and how many clusters you are willing to manage. This section walks through the full spectrum — from lightweight namespace isolation to fully virtual clusters — and gives you the concrete Kubernetes primitives to implement each one.

The Multi-Tenancy Spectrum

Multi-tenancy in Kubernetes exists on a spectrum. On one end, namespaces provide logical separation within a shared cluster. On the other end, each tenant gets a dedicated cluster with complete physical isolation. In between sits a newer approach: virtual clusters that simulate a full cluster inside namespaces of a host cluster.

graph LR
    subgraph SOFT["Soft Multi-Tenancy"]
        NS["Namespace per Tenant<br/>RBAC + Quotas + NetworkPolicy"]
    end

    subgraph VIRTUAL["Virtual Clusters"]
        VC["vCluster per Tenant<br/>Dedicated API server<br/>Shared worker nodes"]
    end

    subgraph HARD["Hard Multi-Tenancy"]
        HC["Cluster per Tenant<br/>Full physical isolation"]
    end

    NS -->|"More isolation"| VC
    VC -->|"More isolation"| HC

    style SOFT fill:#e8f5e9,stroke:#388e3c,color:#1b5e20
    style VIRTUAL fill:#fff3e0,stroke:#f57c00,color:#e65100
    style HARD fill:#fce4ec,stroke:#c62828,color:#b71c1c
    

| Model | Isolation Level | Operational Cost | Tenant Self-Service | Use Case |
|---|---|---|---|---|
| Namespace per Tenant | Logical (kernel shared) | Low — single cluster to manage | Limited — tenants cannot create CRDs or cluster-scoped resources | Internal teams within the same org, dev/staging environments |
| Virtual Cluster (vCluster) | Strong logical (separate API server) | Medium — vClusters are lightweight but add a management layer | High — tenants get their own API server, can install CRDs | Platform teams offering Kubernetes-as-a-Service, CI/CD ephemeral clusters |
| Cluster per Tenant | Physical (separate nodes, etcd, control plane) | High — N clusters to upgrade, monitor, secure | Full — tenant has complete cluster admin | Regulated industries, untrusted tenants, strict compliance |

Namespace-Level Isolation (Soft Multi-Tenancy)

Namespace-based isolation is the most common multi-tenancy pattern. Each tenant gets one or more namespaces, and you enforce boundaries using five Kubernetes-native mechanisms: RBAC, ResourceQuotas, LimitRanges, NetworkPolicies, and Pod Security Standards. None of these is sufficient alone — you need all five working together to create a meaningful isolation boundary.

1. RBAC per Namespace

RBAC confines each tenant to their own namespace. You create a Role (namespace-scoped) with the permissions tenants need, then bind it to their identity with a RoleBinding. Tenants should never receive ClusterRole bindings unless you explicitly want them to access cluster-wide resources.

yaml
# Role: allow full control within the tenant namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-developer
  namespace: team-payments
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "services", "configmaps", "secrets", "jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/exec"]
    verbs: ["get", "create"]
---
# Bind the role to the team's group identity
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-developers
  namespace: team-payments
subjects:
  - kind: Group
    name: payments-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-developer
  apiGroup: rbac.authorization.k8s.io

Notice that this Role does not include access to nodes, namespaces, persistentvolumes, or any cluster-scoped resource. Tenants can work freely within team-payments but cannot see or affect anything outside it. You should also avoid granting escalate, bind, or impersonate verbs — these allow privilege escalation.

2. ResourceQuotas

RBAC controls what a tenant can do, but it does not limit how much. A single tenant could consume all the cluster's CPU and memory, starving everyone else. ResourceQuotas set hard limits on the total resources a namespace can consume and the number of objects it can create.

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "40"
    services: "10"
    persistentvolumeclaims: "5"
    secrets: "20"
    configmaps: "20"

Once a ResourceQuota is active, every Pod in the namespace must specify resource requests and limits — otherwise the API server rejects the creation. This is by design: Kubernetes cannot enforce quotas if it does not know how much a Pod plans to consume.

3. LimitRanges

ResourceQuotas cap the total for the namespace. LimitRanges cap individual Pods and containers — they set default, minimum, and maximum resource values. The defaults also keep the quota workable: Pods that omit requests receive the default instead of being rejected, and the max stops a single Pod from claiming the namespace's entire quota for itself.

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-limits
  namespace: team-payments
spec:
  limits:
    - type: Container
      default:            # Applied when no limits are specified
        cpu: "500m"
        memory: 256Mi
      defaultRequest:     # Applied when no requests are specified
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "2"
        memory: 2Gi
      min:
        cpu: "50m"
        memory: 64Mi
    - type: Pod
      max:
        cpu: "4"
        memory: 4Gi

4. NetworkPolicies

By default, every Pod in a Kubernetes cluster can communicate with every other Pod — across namespaces. In a multi-tenant setup, this is unacceptable. NetworkPolicies act as namespace-level firewalls, restricting which Pods can talk to each other. They require a CNI plugin that supports them — Calico, Cilium, and Antrea do; simpler plugins such as kubenet do not.

yaml
# Default deny all ingress and egress in the tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-payments
spec:
  podSelector: {}       # Applies to ALL pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Allow pods within the same namespace to talk to each other
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}
  egress:
    - to:
        - podSelector: {}
    - to:                 # Allow DNS resolution (kube-system)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Default-Deny First, Then Whitelist

Always start with a default-deny policy and then add specific allow rules. If you skip the default-deny, NetworkPolicies are purely additive — any Pod without a matching policy can still communicate freely with the entire cluster. The deny-all policy closes this gap.

5. Pod Security Standards

Pod Security Standards (PSS) replaced PodSecurityPolicy, which was deprecated in 1.21 and removed in Kubernetes 1.25. They are enforced at the namespace level using labels and prevent tenants from deploying privileged containers, mounting host paths, or escalating privileges. Three built-in profiles exist:

| Profile | What It Allows | Use Case |
|---|---|---|
| privileged | No restrictions — full access to host namespaces, capabilities, and volumes | System-level workloads (CNI plugins, monitoring agents in kube-system) |
| baseline | Prevents known privilege escalations — blocks hostNetwork, hostPID, privileged containers, and dangerous capabilities | General-purpose workloads that don't need special privileges |
| restricted | Heavily restricted — requires non-root, drops ALL capabilities, restricts volume types; a read-only root filesystem is encouraged | Multi-tenant namespaces with untrusted or third-party workloads |

bash
# Enforce the restricted profile on a tenant namespace
kubectl label namespace team-payments \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

With enforce=restricted, the admission controller rejects any Pod that violates the restricted profile. The warn and audit labels produce warnings and audit log entries respectively — useful for gradual rollout where you enable warn first, fix violations, and then switch to enforce.
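The same labels can live in the Namespace manifest itself, which keeps the policy in Git; a sketch (the pinned version is illustrative) that also pins the profile version to avoid surprises on cluster upgrades:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.28   # pin the profile version (illustrative)
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```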

Hierarchical Namespace Controller (HNC)

In large organizations, the platform team typically creates namespaces and RBAC bindings for each tenant. This becomes a bottleneck when you have hundreds of teams. The Hierarchical Namespace Controller (HNC) solves this by letting you define parent-child relationships between namespaces. A parent namespace can propagate Roles, RoleBindings, NetworkPolicies, ResourceQuotas, and other objects down to child (sub) namespaces automatically.

yaml
# Install HNC and define a hierarchy
# Parent namespace: team-payments
# Child namespaces: team-payments-staging, team-payments-dev
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-payments-staging
  namespace: team-payments       # The parent namespace
---
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-payments-dev
  namespace: team-payments

With this hierarchy, any Role, RoleBinding, or NetworkPolicy you create in team-payments is automatically propagated to team-payments-staging and team-payments-dev. You configure which resource types propagate using HNCConfiguration. The team lead for team-payments can create sub-namespaces without involving the platform team — true self-service, scoped safely to their subtree.
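As a sketch of that configuration — exact fields may vary by HNC release — an HNCConfiguration that propagates quotas and limit ranges in addition to the defaults:

```yaml
apiVersion: hnc.x-k8s.io/v1alpha2
kind: HNCConfiguration
metadata:
  name: config            # HNC expects a single cluster-wide object with this name
spec:
  resources:
    - resource: resourcequotas
      mode: Propagate     # copy quotas from parent namespaces to all descendants
    - resource: limitranges
      mode: Propagate
```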

bash
# Install HNC using kubectl plugin
kubectl apply -f https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/default.yaml

# View the hierarchy tree
kubectl hns tree team-payments
# Output:
# team-payments
# ├── team-payments-staging
# └── team-payments-dev

# Check which objects are propagated
kubectl hns describe team-payments

Virtual Clusters with vCluster

Namespace-level isolation has a fundamental limitation: tenants cannot create cluster-scoped resources like CRDs, ClusterRoles, or admission webhooks. If a tenant needs to install a Helm chart that includes CRDs, namespace isolation is not enough. This is where virtual clusters fill the gap.

vCluster (by Loft Labs) creates lightweight virtual Kubernetes clusters that run inside namespaces of a host cluster. Each vCluster has its own API server and a separate etcd (or SQLite/PostgreSQL) backing store. Tenants interact with their vCluster as if it were a real cluster — they can create CRDs, namespaces, ClusterRoles — but the actual workload Pods run on the host cluster's worker nodes.

graph TB
    subgraph HOST["Host Cluster"]
        HAPI["Host API Server"]

        subgraph NS1["Namespace: vc-tenant-a"]
            VCA_API["vCluster A<br/>API Server"]
            VCA_SYNC["Syncer"]
            VCA_STORE["etcd / SQLite"]
            VCA_API --> VCA_STORE
        end

        subgraph NS2["Namespace: vc-tenant-b"]
            VCB_API["vCluster B<br/>API Server"]
            VCB_SYNC["Syncer"]
            VCB_STORE["etcd / SQLite"]
            VCB_API --> VCB_STORE
        end

        W1["Worker Node 1<br/>Runs Pods from all vClusters"]
        W2["Worker Node 2<br/>Runs Pods from all vClusters"]
    end

    TENANT_A["Tenant A<br/>kubectl"] -->|"kubeconfig"| VCA_API
    TENANT_B["Tenant B<br/>kubectl"] -->|"kubeconfig"| VCB_API

    VCA_SYNC -->|"sync Pods,<br/>Services"| HAPI
    VCB_SYNC -->|"sync Pods,<br/>Services"| HAPI

    HAPI --> W1
    HAPI --> W2
    
bash
# Install vCluster CLI
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/latest/download/vcluster-linux-amd64"
chmod +x vcluster && sudo mv vcluster /usr/local/bin/

# Create a virtual cluster for tenant-a in the host cluster
vcluster create tenant-a --namespace vc-tenant-a

# Connect to the virtual cluster (switches kubeconfig context)
vcluster connect tenant-a --namespace vc-tenant-a

# Inside the vCluster — tenant sees a "clean" cluster
kubectl get namespaces      # Only default, kube-system, kube-public
kubectl create namespace my-app
kubectl apply -f my-crd.yaml   # CRDs work — this is a full API server

# Disconnect and return to host cluster context
vcluster disconnect

The syncer component is the bridge. It watches for Pods and Services created inside the vCluster and replicates them as real objects in the host namespace. The host scheduler places these Pods on real nodes, but the tenant only sees them through their virtual API server. This gives you strong API-level isolation with minimal infrastructure overhead — a vCluster typically consumes about 200Mi of memory for its control plane.

vCluster Does Not Provide Kernel-Level Isolation

Pods from different vClusters still share the same host kernel and worker nodes. A container escape in one vCluster can potentially affect another. If you need kernel-level isolation between tenants, combine vCluster with node isolation (dedicated node pools per tenant) or use a sandboxed runtime like gVisor or Kata Containers.

Node Isolation with Node Pools and Taints

For tenants who require stronger isolation than what namespaces or virtual clusters provide at the network and resource level, you can dedicate physical (or virtual) nodes to specific tenants. This ensures one tenant's workloads never share a kernel with another's — eliminating noisy-neighbor performance issues and reducing the blast radius of a container escape.

The mechanism is a combination of taints on nodes (to repel all Pods by default) and tolerations on tenant Pods (to explicitly allow scheduling). Pair this with nodeAffinity to guarantee Pods land only on designated nodes.

bash
# Label and taint nodes for a specific tenant
kubectl label nodes node-pool-a-1 node-pool-a-2 tenant=payments
kubectl taint nodes node-pool-a-1 node-pool-a-2 tenant=payments:NoSchedule
yaml
# Tenant Pod spec — tolerate the taint AND require the node label
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: team-payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      tolerations:
        - key: "tenant"
          operator: "Equal"
          value: "payments"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: tenant
                    operator: In
                    values: ["payments"]
      containers:
        - name: api
          image: payments/api:2.1.0
          resources:
            requests:
              cpu: "250m"
              memory: 256Mi

The taint prevents other tenants' Pods from landing on these nodes. The nodeAffinity prevents this tenant's Pods from landing on other tenants' nodes. You need both — a toleration alone only means a Pod can run on a tainted node; it does not prevent it from running elsewhere. On managed Kubernetes (EKS, GKE, AKS), you typically configure this through dedicated node pools with auto-applied taints and labels.

Cost Allocation per Tenant

Sharing a cluster saves money, but you need to know who is consuming how much. Kubernetes has no built-in cost tracking, but the combination of consistent labeling and a cost allocation tool like Kubecost (or OpenCost, the CNCF sandbox project that Kubecost contributed) gives you per-tenant cost visibility.

Step 1: Establish a Labeling Convention

The foundation of cost allocation is consistent labels. Every tenant workload must carry labels that identify the owning team, the environment, and the cost center. Enforce this with an admission webhook (like OPA/Gatekeeper or Kyverno) that rejects resources missing required labels.

yaml
# Standard labels for cost allocation
metadata:
  labels:
    app.kubernetes.io/name: payment-api
    app.kubernetes.io/part-of: payments-platform
    cost-center: cc-4200
    team: payments
    env: production

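One way to enforce that convention is a validating admission policy. A minimal Kyverno sketch — the policy name and required labels are illustrative, matching the convention above rather than any standard:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce    # reject non-compliant resources outright
  rules:
  - name: check-cost-labels
    match:
      any:
      - resources:
          kinds: ["Deployment", "StatefulSet", "CronJob"]
    validate:
      message: "team, cost-center, and env labels are required for cost allocation."
      pattern:
        metadata:
          labels:
            team: "?*"          # "?*" means any non-empty value
            cost-center: "?*"
            env: "?*"
```

With Enforce mode, a Deployment missing any of the three labels is rejected at admission time, so untagged spend never reaches the cluster.
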
Step 2: Deploy Kubecost / OpenCost

bash
# Install OpenCost (CNCF sandbox project, fully open-source)
helm install opencost opencost/opencost \
  --namespace opencost --create-namespace \
  --set opencost.prometheus.internal.enabled=true

# Expose the OpenCost API locally (it listens on port 9003)
kubectl port-forward -n opencost service/opencost 9003 &

# Query per-namespace cost allocation via the API
curl -s "http://localhost:9003/allocation/compute?window=7d&aggregate=namespace" | jq '.data'

# Query per-label cost breakdown (e.g., by team label)
curl -s "http://localhost:9003/allocation/compute?window=30d&aggregate=label:team" | jq '.data'

Kubecost and OpenCost work by correlating Prometheus metrics for CPU, memory, storage, and network usage with the actual pricing from your cloud provider. They allocate shared costs (like the control plane fee or idle resources) using configurable strategies — proportional to usage, evenly split, or custom weights. The output is a per-namespace or per-label cost report you can feed into chargeback dashboards or internal billing systems.

Multi-Tenancy Decision Matrix

Choosing the right model depends on three factors: how much you trust the tenant, how strong the isolation must be, and how much operational complexity you can absorb. Use this matrix as a starting point.

| Factor | Namespace Isolation | vCluster | Cluster per Tenant |
|---|---|---|---|
| Trust level required | High — tenants are internal teams within the same org | Medium — tenants need more autonomy but you still control infrastructure | Low — tenants are external, untrusted, or subject to strict regulatory boundaries |
| API isolation | Shared API server — tenants can see that other namespaces exist | Separate API server per tenant — full cluster illusion | Fully separate — no shared components |
| CRD support | No — CRDs are cluster-scoped, affect all tenants | Yes — each vCluster has its own CRD registry | Yes — full cluster admin |
| Kernel isolation | None — shared kernel on worker nodes | None by default — add node pools or sandboxed runtimes | Full — dedicated nodes and control plane |
| Cluster count | 1 | 1 host + N lightweight virtual | N real clusters |
| Upgrade burden | 1 cluster to upgrade | 1 host cluster + vCluster versions (independent) | N clusters to upgrade |
| Resource efficiency | Highest — full bin-packing across all tenants | High — small overhead per vCluster (~200Mi) | Lowest — each cluster has its own control plane overhead |
| Setup complexity | Low — native Kubernetes primitives | Medium — requires vCluster operator | High — requires fleet management (Fleet, Rancher, ArgoCD multi-cluster) |
| Blast radius | Namespace — but a kernel exploit affects all | vCluster — but shared kernel still a risk | Fully contained to one cluster |

Start with Namespaces, Graduate When You Must

For most organizations, namespace-level isolation with RBAC, quotas, network policies, and Pod Security Standards is sufficient and dramatically simpler to operate. Move to vCluster when tenants need CRDs or cluster-admin-like autonomy. Move to dedicated clusters only when compliance or zero-trust requirements demand it — the operational cost of managing many clusters is significant and often underestimated.

Putting It All Together: A Tenant Onboarding Checklist

When you onboard a new tenant using namespace-level isolation, apply all five isolation mechanisms as a cohesive unit. Missing any one of them leaves a gap that undermines the others.

bash
TENANT="team-orders"

# 1. Create namespace with Pod Security Standards
kubectl create namespace $TENANT
kubectl label namespace $TENANT \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted

# 2. Apply ResourceQuota
kubectl apply -n $TENANT -f quota.yaml

# 3. Apply LimitRange
kubectl apply -n $TENANT -f limitrange.yaml

# 4. Apply default-deny NetworkPolicy
kubectl apply -n $TENANT -f network-policy-deny-all.yaml
kubectl apply -n $TENANT -f network-policy-allow-intra-ns.yaml

# 5. Create RBAC Role and RoleBinding
kubectl apply -n $TENANT -f rbac-role.yaml
kubectl apply -n $TENANT -f rbac-binding.yaml

# Verify the setup
kubectl get quota,limitrange,networkpolicy,role,rolebinding -n $TENANT

In practice, you should codify this as a templated Helm chart, a Kustomize overlay, or a Crossplane Composition — so every new tenant gets identical isolation guarantees, and no step is accidentally skipped. The previous section on High Availability ensures the cluster itself is resilient; this section ensures tenants within it cannot interfere with each other. The next section ties both together with a production readiness checklist that covers the remaining operational concerns.

Production Readiness Checklist and Common Pitfalls

Running Kubernetes in development is forgiving. Running it in production is not. The gap between a cluster that works on a demo and one that survives real traffic, security audits, and 3 AM incidents is bridged by deliberate, systematic preparation. This section distills that preparation into a concrete checklist organized by category, followed by the ten most common mistakes teams make — and how to avoid every one of them.

Treat this as your pre-flight checklist. Before promoting any cluster or workload to production, walk through each category. A single missed item — an absent resource limit, a missing network policy, a skipped etcd backup test — can be the difference between a minor alert and a major outage.

mermaid
graph LR
    subgraph Checklist["Production Readiness Categories"]
        direction TB
        W["Workloads"]
        S["Security"]
        N["Networking"]
        ST["Storage"]
        O["Observability"]
        C["Cluster Ops"]
    end

    W --> READY["Production Ready"]
    S --> READY
    N --> READY
    ST --> READY
    O --> READY
    C --> READY
    

1. Workloads

Workload configuration is where most production incidents originate. A Pod without resource limits can starve its neighbors. A Deployment without probes can route traffic to containers that aren't ready. Anti-affinity rules and Pod Disruption Budgets are not optional — they are the mechanisms that keep your application available during node failures and cluster upgrades.

Set Resource Requests and Limits on Every Container

Resource requests tell the scheduler how much CPU and memory a Pod needs; limits cap what it can consume. Without requests, the scheduler makes blind placement decisions. Without limits, a single runaway container can trigger the out-of-memory (OOM) killer on everything else on the node. Always set both, and base them on observed usage, not guesses.

yaml
containers:
- name: api-server
  image: myapp/api:v2.4.1
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 512Mi

Use LimitRanges and ResourceQuotas as Safety Nets

Even with developer discipline, someone will forget to set resources. Apply a LimitRange per namespace to inject default requests/limits on any container that omits them, and a ResourceQuota to cap total namespace consumption. This prevents any single team from monopolizing cluster resources.
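A LimitRange that injects those defaults might look like this — the values are illustrative; tune them to your workloads:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:               # applied when a container omits limits
      cpu: 500m
      memory: 512Mi
```

Any container created in the namespace without explicit resources now gets these values stamped in at admission time, keeping the scheduler informed and the OOM blast radius bounded.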

Configure Liveness and Readiness Probes

Readiness probes gate traffic — a Pod won't receive Service traffic until its readiness probe passes. Liveness probes detect deadlocks — if a container is alive but stuck, the kubelet restarts it. A startup probe gives slow-starting containers (like Java apps) extra time before liveness checks begin. Use all three where appropriate.

yaml
containers:
- name: api-server
  image: myapp/api:v2.4.1
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30
    periodSeconds: 2          # up to 60s to start
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
    failureThreshold: 3
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3

Use Anti-Affinity and Topology Spread Constraints

Running all replicas of a Deployment on a single node defeats the purpose of high availability. Pod anti-affinity rules ensure replicas spread across nodes, while topologySpreadConstraints give finer control — you can spread across zones, racks, or any custom topology domain.

yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api-server
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: api-server
          topologyKey: kubernetes.io/hostname

Define Pod Disruption Budgets (PDBs)

During voluntary disruptions — node drains, cluster upgrades, autoscaler scale-downs — Kubernetes respects PDBs to ensure a minimum number of Pods remain available. Without a PDB, a node drain can evict all replicas simultaneously, causing downtime even when you have plenty of replicas.

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2          # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

2. Security

Kubernetes is secure-by-capability but permissive-by-default. A fresh cluster allows Pods to run as root, talk to any other Pod, and access the full Kubernetes API from within the cluster. Production hardening means flipping those defaults: deny everything, then allow only what's explicitly needed.

Enforce Pod Security Standards

Pod Security Standards (PSS), enforced through Pod Security Admission (PSA), replace PodSecurityPolicy (deprecated in v1.21 and removed in v1.25). There are three levels: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (hardened, follows security best practices). Apply at least baseline in enforce mode on every production namespace, and target restricted for workloads that don't need host access.

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Your Pod specs must match the enforced standard. Under restricted, every container needs an explicit security context:

yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  seccompProfile:
    type: RuntimeDefault
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  readOnlyRootFilesystem: true

RBAC, Image Scanning, and API Access

RBAC should follow the principle of least privilege — grant only the verbs and resources each service account actually needs. Never bind cluster-admin to application workloads. Scan every container image in your CI pipeline using tools like Trivy, Grype, or Snyk. Restrict API server access with firewall rules and disable anonymous authentication.

yaml
# Least-privilege Role: only read Pods and ConfigMaps
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: app-reader
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: api-server-binding
subjects:
- kind: ServiceAccount
  name: api-server
  namespace: production
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io

Rotate Secrets and Use External Secret Stores

Kubernetes Secrets are base64-encoded, not encrypted at rest by default. Enable etcd encryption at rest, and prefer external secret managers (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager) with the Secrets Store CSI Driver or External Secrets Operator. Automate rotation so credentials are never stale.
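With the External Secrets Operator, for example, a native Secret is synced from the external store and refreshed on a schedule. A sketch, assuming a ClusterSecretStore named aws-secrets-manager already exists and the store path prod/db is illustrative:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h              # re-sync from the external store hourly
  secretStoreRef:
    name: aws-secrets-manager      # assumed ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials           # the Kubernetes Secret the operator creates
  data:
  - secretKey: password
    remoteRef:
      key: prod/db                 # path in the external secret manager (illustrative)
      property: password
```

The application only ever mounts the generated Kubernetes Secret; rotation happens upstream in the secret manager and propagates on the next refresh.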

3. Networking

By default, every Pod can talk to every other Pod in the cluster — across namespaces, across teams, across trust boundaries. This flat network is convenient for development and catastrophic for production. Network policies are your firewall rules within the cluster.

Apply NetworkPolicies with Default-Deny

Start by denying all ingress and egress traffic in each namespace, then explicitly allow only the communication paths your application requires. This mirrors the approach used in traditional firewalls: deny by default, allow by exception.

yaml
# Default deny all traffic in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Allow specific traffic: api-server receives from ingress controller
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080

Configure Ingress TLS and Plan IP Ranges

Terminate TLS at the Ingress controller using cert-manager for automated certificate provisioning and renewal. Plan your Pod CIDR, Service CIDR, and node CIDR ranges before cluster creation — changing them later requires a full cluster rebuild. Use non-overlapping ranges with your corporate network if you need VPN or VPC peering.

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-cert
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-server
            port:
              number: 8080
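
On self-managed clusters, the CIDR plan is fixed at cluster creation time — for instance in the kubeadm ClusterConfiguration. The ranges below are illustrative defaults, not a recommendation:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16       # Pod CIDR — must not overlap corporate/VPN ranges
  serviceSubnet: 10.96.0.0/12    # Service CIDR
  dnsDomain: cluster.local
```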

4. Storage

Stateful workloads need careful planning. Losing a PersistentVolume means losing data — there is no controller loop to reconcile that back into existence. Use dynamic provisioning to avoid manual volume management, and validate your backup and disaster recovery procedures before you need them.

Use Dynamic Provisioning with Appropriate StorageClasses

Define StorageClasses that match your performance and durability requirements. Use reclaimPolicy: Retain for any data you can't afford to lose — the default Delete policy destroys the underlying volume when the PVC is deleted.

yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "4000"
  throughput: "250"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Backup PersistentVolumes and Test Restores

Use Velero or a CSI snapshot controller to back up PVs on a schedule. A backup you've never restored is a backup you don't have. Run quarterly disaster recovery drills: delete a PVC, restore from backup, and verify data integrity. Document the procedure so any team member can execute it at 3 AM.

bash
# Create a Velero backup of the entire production namespace
velero backup create prod-daily-$(date +%Y%m%d) \
  --include-namespaces production \
  --snapshot-volumes=true \
  --ttl 720h

# Verify backup completed successfully
velero backup describe prod-daily-$(date +%Y%m%d)

# Restore to a test namespace to validate integrity
velero restore create --from-backup prod-daily-$(date +%Y%m%d) \
  --namespace-mappings production:production-restore-test

5. Observability

You cannot manage what you cannot see. A production cluster without monitoring is flying blind. You need three pillars — metrics, logs, and traces — plus alerting that tells you about problems before your users do.

The Observability Stack

| Pillar | Tool Options | What to Monitor |
|---|---|---|
| Metrics | Prometheus + Grafana, Datadog, New Relic | CPU/memory usage, Pod restarts, request latency (p50/p95/p99), error rates, node disk pressure, PVC usage |
| Logs | Loki + Grafana, EFK (Elasticsearch + Fluentd + Kibana), CloudWatch | Application logs, kubelet logs, audit logs, ingress access logs |
| Traces | Jaeger, Tempo, Zipkin, OpenTelemetry Collector | Request flow across microservices, latency breakdown per hop, error propagation |
| Alerting | Alertmanager, PagerDuty, Opsgenie | Pod CrashLoopBackOff, node NotReady, certificate expiry, persistent volume >85% full, HPA at max replicas |

Key Alerts Every Cluster Needs

Don't wait for users to report issues. These Prometheus alerting rules cover the most critical failure modes. Deploy them with the kube-prometheus-stack Helm chart, which includes Prometheus, Alertmanager, Grafana, and a comprehensive set of recording and alerting rules out of the box.

yaml
groups:
- name: critical-cluster-alerts
  rules:
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"

  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"

  - alert: PVCAlmostFull
    expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "PVC {{ $labels.persistentvolumeclaim }} is >85% full"

6. Cluster Operations

A production cluster is a living system. Kubernetes releases a new minor version every four months, nodes need patching, and etcd — the single source of truth for all cluster state — needs regular backup and tested restore procedures. Automate as much as possible, and never skip the testing step.

Automate Cluster Upgrades

Kubernetes supports upgrading one minor version at a time (e.g., 1.29 to 1.30, not 1.28 to 1.30). On managed clusters, use the provider's upgrade mechanism. On self-managed clusters, follow the documented kubeadm upgrade workflow. Always upgrade the control plane first, then worker nodes.

bash
# 1. Check current version and available upgrades
kubeadm upgrade plan

# 2. Upgrade control plane (run on each control plane node)
sudo kubeadm upgrade apply v1.30.2

# 3. Drain a worker node before upgrading its kubelet
kubectl drain node-3 --ignore-daemonsets --delete-emptydir-data

# 4. Upgrade kubelet and kubectl on the worker node
sudo apt-get update && sudo apt-get install -y kubelet=1.30.2-* kubectl=1.30.2-*
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# 5. Uncordon the node to resume scheduling
kubectl uncordon node-3

Backup and Restore etcd

etcd holds every resource definition in the cluster. If etcd is lost without a backup, the entire cluster state is gone — every Deployment, Service, Secret, and ConfigMap. On self-managed clusters, back up etcd at least every hour to an off-cluster location.

bash
# Snapshot etcd to a file
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify snapshot integrity
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-*.db --write-out=table

# Restore from snapshot (stop kube-apiserver and etcd first)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20240115-0300.db \
  --data-dir=/var/lib/etcd-restored

Plan Capacity and Enable Autoscaling

Monitor resource utilization at the node level. If nodes consistently run above 70% CPU or memory, add capacity before a traffic spike pushes them over the edge. Use the Cluster Autoscaler (or Karpenter on AWS) to automatically add or remove nodes based on pending Pod demand. Pair it with Horizontal Pod Autoscaling (HPA) for end-to-end elastic scaling.
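The HPA side of that pairing can be as simple as a CPU-utilization target. A sketch against the api-server Deployment used throughout this section:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3                  # keep an HA floor even at low traffic
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # scale out before nodes approach saturation
```

When the HPA adds replicas that no longer fit, the pending Pods are exactly the signal the Cluster Autoscaler or Karpenter uses to add nodes — which is what makes the pairing end-to-end.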

The Full Checklist

Here is the complete checklist condensed into a single reference table. Pin it to your team wiki or add it to your deployment pipeline as a validation gate.

| Category | Item | Priority |
|---|---|---|
| Workloads | Resource requests and limits set on every container | 🔴 Critical |
| | Liveness, readiness, and startup probes configured | 🔴 Critical |
| | Anti-affinity or topology spread constraints for HA workloads | 🟠 High |
| | PodDisruptionBudgets defined for all critical services | 🟠 High |
| | Graceful shutdown handled (preStop hooks, terminationGracePeriodSeconds) | 🟠 High |
| Security | RBAC enforced with least-privilege Roles/ClusterRoles | 🔴 Critical |
| | Pod Security Standards enforced (baseline or restricted) | 🔴 Critical |
| | Container images scanned in CI pipeline | 🟠 High |
| | Secrets encrypted at rest; external secret store integrated | 🟠 High |
| | API server access restricted to trusted networks | 🟠 High |
| Networking | Default-deny NetworkPolicies in every namespace | 🔴 Critical |
| | TLS termination at Ingress with automated certificate renewal | 🔴 Critical |
| | Pod, Service, and Node CIDRs planned and non-overlapping | 🟠 High |
| Storage | Dynamic provisioning with appropriate StorageClasses | 🟠 High |
| | PV backups scheduled and tested with restore drills | 🔴 Critical |
| | reclaimPolicy: Retain for critical volumes | 🟠 High |
| Observability | Metrics collection (Prometheus or equivalent) deployed | 🔴 Critical |
| | Centralized logging with retention policy | 🟠 High |
| | Alerting configured for critical failure modes (CrashLoop, NotReady, PVC full) | 🔴 Critical |
| Cluster Ops | Cluster upgrade process documented and tested | 🟠 High |
| | etcd backup/restore tested quarterly | 🔴 Critical |
| | Cluster Autoscaler or Karpenter configured for elastic capacity | 🟡 Medium |

Top 10 Production Pitfalls (and How to Avoid Them)

These are the mistakes that show up in postmortems again and again. Every one of them is avoidable.

1. No Resource Limits — Noisy Neighbor Outages

A single Pod without memory limits can consume all available memory on a node, triggering the out-of-memory (OOM) killer on every other Pod running there. The fix: enforce LimitRange objects in every namespace so that even forgotten containers get default limits applied automatically.

2. Liveness Probes That Check Dependencies

A liveness probe that queries the database will restart your Pod every time the database has a blip — turning a transient issue into a cascading failure across every replica. Liveness probes should check only whether the process itself is alive and responsive. Use readiness probes for dependency checks.

3. No PodDisruptionBudget — Upgrades Cause Downtime

During a kubectl drain, every Pod on the node is evicted. If all your replicas happen to be on that node (because you also forgot anti-affinity), your service goes down. A PDB with minAvailable: 1 ensures at least one replica stays running throughout the disruption.

4. Using latest Image Tags

The latest tag is mutable — it can point to a different image after every push. This means two Pods in the same Deployment can run different code. Worse, rollbacks don't work because Kubernetes sees no spec change. Always use immutable tags: digests (myapp@sha256:abc...) or explicit version tags (myapp:v2.4.1).

5. Running as Root with No Security Context

Without an explicit securityContext, most container images run as root (UID 0). A container escape exploit running as root gives an attacker root access to the host node. Set runAsNonRoot: true, drop all capabilities, and make the filesystem read-only.

6. Flat Network with No NetworkPolicies

Without network policies, a compromised Pod in the dev namespace can reach the database in the production namespace. The fix: default-deny policies in every namespace. This takes 10 minutes to implement and closes one of the largest attack surface areas in Kubernetes.

7. Not Testing etcd / PV Backup Restores

Teams that back up but never restore are surprised to discover their backups are corrupted, incomplete, or that the restore procedure takes four hours instead of 20 minutes. Schedule quarterly restore drills. Make it a calendar event. Verify data integrity after every restore.

8. Ignoring Pod Topology and Affinity

The scheduler may place all three replicas of your critical service on the same node or in the same availability zone. When that node or zone fails, all replicas go down simultaneously. Use topologySpreadConstraints with topologyKey: topology.kubernetes.io/zone for zone-level spread.

9. Skipping Graceful Shutdown Configuration

When Kubernetes terminates a Pod, it sends SIGTERM and waits for terminationGracePeriodSeconds (default: 30s). If your app doesn't handle SIGTERM — or if it needs more than 30 seconds to drain connections — requests are dropped. Implement a signal handler, add a preStop lifecycle hook, and increase the grace period if needed.

yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: api-server
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]   # allow endpoints to de-register

10. No Observability Until the First Outage

Teams often delay deploying monitoring until after their first production incident, then scramble to debug blind. Deploy Prometheus, Grafana, and Alertmanager (or your stack of choice) as part of the initial cluster setup — before any workload goes live. The cost is minimal. The payoff is immediate.

Pitfalls Compound

These pitfalls rarely cause outages in isolation. The worst incidents combine multiple gaps: no resource limits plus no monitoring plus no PDB means a single memory leak cascades into a cluster-wide outage that nobody sees until customers call. Close every gap systematically — the checklist exists for exactly this reason.

Closing: From Knowledge to Practice

This page has taken you from the architecture of a Kubernetes cluster through workload design, networking, storage, security, and multi-tenancy. This final checklist ties every concept together into actionable preparation. Kubernetes rewards operators who are deliberate, systematic, and who automate their best practices into policy.

Don't try to implement every item at once. Prioritize the Critical items first — resource limits, probes, RBAC, network policies, TLS, and monitoring. Those alone will prevent the vast majority of production incidents. Then work through the High items as your operational maturity grows.

Automate the Checklist Into Your Pipeline

The best checklist is one that enforces itself. Use policy engines like Kyverno or OPA Gatekeeper to reject deployments that lack resource limits, probes, or security contexts. Use LimitRange and ResourceQuota as namespace-level guardrails. Shift compliance left so that developers get immediate feedback in CI, not a rejection in production.
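As one concrete example, a Kyverno policy that rejects Pods lacking resource limits might look like this — a minimal sketch; production policies usually also cover initContainers and ephemeral containers:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-container-limits
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "CPU and memory limits are required on every container."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                cpu: "?*"       # any non-empty value satisfies the pattern
                memory: "?*"
```

Because Kyverno matches Pods regardless of whether they come from a Deployment, Job, or kubectl run, the rule catches every path into the cluster with a single policy.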

Production Kubernetes is not a destination — it's an ongoing practice. Review this checklist before every major deployment, after every incident, and at the start of every quarter. The clusters that run reliably are the ones operated by teams that never stop asking: "What could go wrong, and have we prepared for it?"