Docker — From Fundamentals to Mastery

Prerequisites: Basic command-line/terminal familiarity, understanding of what a web server or application process is, and basic Linux concepts (files, processes, networking). No prior container experience required.

What Docker Is and Why It Matters

Docker is a platform for building, shipping, and running applications inside containers — lightweight, isolated environments that package your code together with every dependency it needs: libraries, system tools, runtime, and configuration files. Think of a container as a self-contained unit that runs the same way regardless of where you deploy it.

The pitch is simple: you describe your environment once in a Dockerfile, build an image, and that image runs identically on your laptop, your CI server, a staging VM, and a production Kubernetes cluster. No surprises, no drift, no "let me check which version of Python is installed."

The Problem: "Works on My Machine"

Every developer has heard — or said — the phrase "it works on my machine." The root cause is environment drift: subtle differences in OS versions, library versions, environment variables, file paths, and system configurations between development, staging, and production. These differences create bugs that are nearly impossible to reproduce and painful to debug.

Before containers, teams tried to solve this with provisioning scripts, configuration management tools like Chef or Puppet, and heavyweight virtual machines. These approaches helped, but they were slow, brittle, and expensive. Docker attacked the problem at its core by making the environment itself portable and versioned.

The Core Insight

Docker doesn't just ship your code — it ships the entire environment your code needs to run. The container is the deployment artifact, not just the application binary.

Here's what a minimal Dockerfile looks like — a plain text recipe that defines your container environment:

docker
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]

This file pins the Node.js version, installs exact dependencies, and defines how the app starts. Anyone with Docker installed can run docker build -t myapp . and get an identical image — on any machine, any OS.
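The build-and-run loop for that Dockerfile can be sketched as follows (a sketch: it assumes the file sits next to a server.js that listens on port 3000, and `myapp` is just an example tag):

```bash
# Build the image from the Dockerfile in the current directory
docker build -t myapp .

# Run it detached, mapping host port 3000 to the container's port 3000
docker run -d --name myapp -p 3000:3000 myapp

# The app now answers on localhost, regardless of which machine built the image
curl http://localhost:3000/
```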

Containers vs. Virtual Machines

Containers and virtual machines both provide isolation, but they achieve it in fundamentally different ways. A VM virtualizes the hardware through a hypervisor (like VMware, Hyper-V, or KVM), and each VM runs a complete guest operating system — its own kernel, init system, and full userspace. A container virtualizes at the OS level, sharing the host's kernel and isolating processes using Linux namespaces and cgroups.

This architectural difference has massive practical consequences:

| Characteristic | Virtual Machine | Container |
|---|---|---|
| Isolation level | Full hardware virtualization | Process-level (shared kernel) |
| Startup time | 30–90 seconds | Milliseconds to low seconds |
| Image size | Gigabytes (includes full OS) | Megabytes (just app + deps) |
| Resource overhead | High — each VM reserves CPU/RAM for its OS | Minimal — near bare-metal efficiency |
| Density | Tens of VMs per host | Hundreds to thousands of containers per host |
| Guest OS | Any OS (Linux, Windows, BSD) | Must match host kernel type |
| Security boundary | Strong — full kernel separation | Weaker — shared kernel attack surface |

In practice, containers and VMs are not an either/or choice. Production environments often run containers inside VMs — the VM provides a strong security boundary and the containers provide fast, dense application packaging on top.

A Brief History

Container-like isolation isn't new. Unix had chroot in 1979, FreeBSD introduced Jails in 2000, and Linux added cgroups (2006) and namespaces (2002–2013) over a long stretch. What Docker did in 2013 was make this technology accessible.

Docker started life inside a PaaS company called dotCloud. Solomon Hykes open-sourced the internal container engine at PyCon 2013, and the response was so overwhelming that dotCloud pivoted entirely and rebranded as Docker, Inc. Within a year, every major cloud provider had Docker support, and the container ecosystem exploded.

In 2015, Docker helped establish the Open Container Initiative (OCI) under the Linux Foundation. The OCI defines two critical standards:

  • Image Spec (image-spec) — the format for container images, ensuring any OCI-compliant tool can build and distribute images.
  • Runtime Spec (runtime-spec) — how a container is configured, created, and run, so different runtimes (runc, crun, gVisor) are interchangeable.

These standards mean you're not locked into Docker's tooling. Podman, Buildah, containerd, and others all speak the same language. Docker popularized containers; the OCI made them an industry standard.

Docker in the Ecosystem

mindmap
  root((Docker))
    Problems It Solves
      Environment consistency
      Dependency isolation
      Reproducible builds
      Dev/prod parity
    Key Concepts
      Images
      Containers
      Registries
      Dockerfile
    Alternatives
      Podman
      LXC / LXD
      Virtual Machines
      Firecracker microVMs
    Use Cases
      Microservices
      CI/CD pipelines
      Local dev environments
      Cloud deployment
    

When Docker Is the Right Tool

Docker excels in specific scenarios. It's the default choice for packaging microservices, running reproducible CI/CD pipelines, creating consistent local development environments, and deploying to any cloud platform. If your workload is a Linux-based server process — an API, a web app, a worker, a database — Docker is almost certainly a good fit.

When Docker Is Not the Right Tool

Not everything belongs in a container. Recognize these situations and reach for something else:

  • Different-kernel workloads — You can't run a Windows container on a Linux host (or vice versa) without a VM in between, because containers share the host kernel. If you need FreeBSD jails or a macOS environment, Docker won't help.
  • Heavy GUI applications — Desktop apps with GPU acceleration, complex display requirements, and hardware peripherals are a poor fit. The isolation model adds friction (X11 forwarding, GPU passthrough) that rarely justifies the benefit.
  • Bare-metal HPC — High-performance computing workloads that need every last CPU cycle, direct hardware access (InfiniBand, RDMA), or custom kernel modules pay a measurable overhead cost from the container abstraction layer, even if small.
  • Stateful data stores in production — While you can run databases in containers (and it's great for dev/test), production databases often benefit from running on bare metal or VMs where storage performance, durability, and operational tooling are mature.

Rule of Thumb

If your workload is a Linux server process that reads config from the environment and communicates over the network, Docker is almost certainly the right packaging choice. If it needs direct hardware access or a non-Linux kernel, look elsewhere.

Docker Architecture: Daemon, Containerd, and the OCI Stack

Docker is not a single monolithic binary — it is a stack of loosely coupled components, each with a well-defined responsibility. Understanding this layered architecture helps you debug container failures, reason about security boundaries, and appreciate why Docker containers keep running even when parts of the stack restart.

The architecture flows top-to-bottom: the Docker CLI talks to the Docker daemon, which delegates to containerd, which spawns containers via runc. Each layer exists for a reason, and each can be swapped independently thanks to the OCI (Open Container Initiative) standards.

The Full Picture

graph TD
    CLI["Docker CLI<br/>docker run, build, push"]
    DOCKERD["Docker Daemon - dockerd<br/>API · Image Builds · Networking · Volumes"]
    CTRD["containerd<br/>Container Supervision · Image Management · Snapshots"]
    SHIM["containerd-shim<br/>Keeps container alive independently"]
    RUNC["runc<br/>Creates namespaces + cgroups, then exits"]
    PROC["Container Process<br/>PID 1 inside container"]
    IMG["Image Store<br/>content-addressed"]
    SNAP["Snapshotter<br/>overlay2"]

    CLI -->|"REST API over<br/>Unix socket"| DOCKERD
    DOCKERD -->|"gRPC API"| CTRD
    CTRD --> SHIM
    SHIM --> RUNC
    RUNC -->|"fork/exec"| PROC
    CTRD --- IMG
    CTRD --- SNAP

    style CLI fill:#4a9eff,stroke:#2d7cd6,color:#fff
    style DOCKERD fill:#366ea3,stroke:#2a5580,color:#fff
    style CTRD fill:#7c3aed,stroke:#6025c9,color:#fff
    style SHIM fill:#9f67f5,stroke:#7c3aed,color:#fff
    style RUNC fill:#e05d44,stroke:#b8452f,color:#fff
    style PROC fill:#2ea043,stroke:#238636,color:#fff
    style IMG fill:#7c3aed,stroke:#6025c9,color:#fff
    style SNAP fill:#7c3aed,stroke:#6025c9,color:#fff

Docker CLI: The User-Facing Interface

The Docker CLI (docker) is a standalone binary that does almost nothing on its own. Every command you type — docker run, docker build, docker push — is translated into an HTTP request and sent to the Docker daemon over a Unix socket (/var/run/docker.sock) or a TCP endpoint.

This separation is what makes remote Docker possible. You can point your local CLI at a remote daemon by setting the DOCKER_HOST environment variable, and every command works as if the daemon were local.

bash
# The CLI just sends REST calls to the daemon
# These two commands are equivalent:
docker ps
curl --unix-socket /var/run/docker.sock http://localhost/v1.44/containers/json | jq
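Pointing the CLI at a remote daemon is a single environment variable. A sketch (the host name and SSH access are assumptions):

```bash
# Talk to a remote daemon over SSH — every subsequent command runs remotely
export DOCKER_HOST=ssh://user@build-server.example.com
docker ps        # lists containers on build-server, not on your laptop

# Unset the variable to return to the local daemon
unset DOCKER_HOST
```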

Docker Daemon (dockerd): The Orchestration Hub

The Docker daemon (dockerd) is the central API server. It exposes the full Docker API and is responsible for the higher-level features that Docker is known for: image builds (processing Dockerfiles), networking (bridge, overlay, host networks), volumes (named and bind mounts), and image management (pull, push, tag).

What dockerd does not do is actually run containers. When you execute docker run nginx, the daemon resolves the image, sets up networking and volumes, and then hands the actual container creation to containerd via a gRPC API. This decoupling means that dockerd can be upgraded or restarted without stopping your running containers — provided "live-restore": true is set in daemon.json; without it, a daemon restart stops the containers it manages.

bash
# You can see the daemon and containerd running as separate processes
ps aux | grep -E 'dockerd|containerd'
# root   1234  dockerd --group docker --host fd://
# root   1235  containerd
# root   1290  containerd-shim-runc-v2 -namespace moby -id <container-id>

containerd: The Container Supervisor

Containerd is a high-level container runtime — a CNCF graduated project that Docker donated to the community. It manages the complete container lifecycle: pulling and storing images, creating container snapshots using a snapshotter (typically overlay2), and supervising running containers.

Containerd exposes its own gRPC API and can be used entirely without Docker. In fact, Kubernetes clusters using the containerd CRI (Container Runtime Interface) talk directly to containerd, bypassing dockerd entirely. This is why Kubernetes deprecated the Docker shim — containerd was already doing the real work.

containerd vs. Docker

Containerd is not a Docker replacement — it is a Docker component. Docker adds image builds, Compose, Swarm, and a user-friendly CLI on top of containerd. When Kubernetes "dropped Docker support," it just stopped using the dockerd layer and talked to containerd directly.
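You can watch containerd work without dockerd at all using its bundled ctr client. A sketch (ctr ships with containerd but is a debugging tool, not a polished CLI; it requires root):

```bash
# Pull an image straight through containerd — note the fully qualified name
sudo ctr images pull docker.io/library/alpine:latest

# Run a container under containerd, with no dockerd involved
sudo ctr run --rm docker.io/library/alpine:latest demo echo "hello from containerd"

# Containers started by Docker live in containerd's "moby" namespace
sudo ctr --namespace moby containers list
```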

runc: The Low-Level Runtime

Runc is the low-level OCI-compliant runtime that creates the actual container process. When containerd needs to start a container, it prepares an OCI bundle (a config.json file describing the container configuration plus a root filesystem) and invokes runc.

Runc's job is focused and short-lived. It calls the Linux kernel primitives to create namespaces (pid, net, mnt, uts, ipc, user) and configure cgroups (CPU, memory, I/O limits), then performs a fork/exec to start the container's entrypoint process. Once the container process is running, runc exits. It is not a long-running daemon.

bash
# runc can create containers directly from an OCI bundle
# First, export a root filesystem and generate a config
mkdir my-container && cd my-container
docker export $(docker create alpine) | tar -xf -
runc spec                         # generates config.json (OCI runtime spec)
sudo runc run my-alpine-ctr       # creates namespaces, cgroups, and runs /bin/sh

containerd-shim: The Decoupling Layer

The shim process (containerd-shim-runc-v2) is the unsung hero of Docker's reliability. One shim process runs per container and acts as the parent process for the container's PID 1. This design achieves a critical goal: containers survive daemon restarts.

If containerd crashes or is upgraded, the shim keeps the container running. When containerd comes back, it reconnects to the existing shims. The shim also handles keeping STDIO streams open, reporting the container's exit status, and reaping zombie processes.

| Component | Lifecycle | Role |
|---|---|---|
| dockerd | Long-running daemon | API server, image builds, networking, volumes |
| containerd | Long-running daemon | Image management, container supervision, snapshots |
| containerd-shim | Per-container, long-running | Decouples container from containerd lifecycle |
| runc | Short-lived (exits after setup) | Creates namespaces and cgroups, fork/exec entrypoint |

The OCI Standards: Why This All Works

The Open Container Initiative (OCI) defines two specifications that make Docker's components interchangeable. The Image Spec defines how container images are formatted (layers, manifests, and configuration). The Runtime Spec defines the interface for low-level runtimes — the config.json format and the expected lifecycle commands (create, start, kill, delete).

Because these specs are standardized, you can swap components freely. Replace dockerd with Podman (which is daemonless and talks directly to a container runtime). Replace containerd with CRI-O (purpose-built for Kubernetes). Replace runc with kata-containers (which runs each container inside a lightweight VM for stronger isolation). The interfaces remain the same.

Tip

Any image you build with docker build is OCI-compliant by default. It will run on Podman, CRI-O, containerd standalone, or any other OCI-compatible runtime without modification. You are not locked into Docker.
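Registering an alternative low-level runtime is a daemon.json entry plus a flag at run time. A sketch, assuming kata-runtime is installed at the path shown (the path and runtime name are assumptions):

```json
{
  "runtimes": {
    "kata": {
      "path": "/usr/bin/kata-runtime"
    }
  }
}
```

After restarting the daemon, `docker run --runtime=kata alpine uname -a` would start the container under the alternative runtime instead of runc.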

Seeing the Stack in Action

You can observe the entire chain on a running Docker host. When you start a container, each layer leaves a footprint in the process tree:

bash
# Start a container, then inspect the process hierarchy
docker run -d --name demo nginx:alpine

# Show the process tree — notice runc is already gone
pstree -p $(pgrep containerd-shim | head -1)
# containerd-shim(4521)─┬─nginx(4550)─┬─nginx(4601)
#                        │             └─nginx(4602)
#                        └─{containerd-shi}(4522)

# The shim is the parent, not dockerd or containerd
# This is why the container survives daemon restarts

Common Misconception

Docker does not use a hypervisor or virtual machine. Containers are regular Linux processes isolated with kernel namespaces and constrained with cgroups. The runc runtime sets up these kernel primitives — it does not start a VM. This is why containers start in milliseconds, not seconds.

Under the Hood: Namespaces, Cgroups, and Union Filesystems

A container is not a virtual machine. It's a regular Linux process that the kernel has been told to isolate and restrict. Three kernel features make this possible: namespaces give the process its own view of the system, cgroups cap the resources it can consume, and union filesystems provide an efficient layered filesystem. Understanding these primitives is the difference between using Docker as a black box and truly mastering it.

graph TD
    subgraph NS["Namespaces"]
        PID["PID"]
        NET["NET"]
        MNT["MNT"]
        UTS["UTS"]
        IPC["IPC"]
        USER["USER"]
    end

    subgraph CG["Cgroups"]
        CPU["CPU"]
        MEM["Memory"]
        PIDS["PIDs"]
        IO["Block I/O"]
    end

    subgraph OFS["OverlayFS"]
        BASE["Base Image Layer"]
        APP["App Layer"]
        RW["Writable Layer"]
        BASE --> APP --> RW
    end

    NS -->|"Isolated Process View"| CONTAINER
    CG -->|"Resource Limits"| CONTAINER
    OFS -->|"Layered Filesystem"| CONTAINER
    CONTAINER["🐳 Container = Isolated Process"]

    style CONTAINER fill:#0db7ed,stroke:#384d54,color:#fff,font-weight:bold
    style NS fill:#2d3748,stroke:#4a5568,color:#e2e8f0
    style CG fill:#2d3748,stroke:#4a5568,color:#e2e8f0
    style OFS fill:#2d3748,stroke:#4a5568,color:#e2e8f0
    

Namespaces: Giving Each Container Its Own World

Linux namespaces partition kernel resources so that one set of processes sees one set of resources while another set sees a different set. When Docker starts a container, it creates a new instance of each namespace type, giving the container process an isolated view of the system. The process thinks it has the entire machine to itself.

| Namespace | Flag | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs — the container's entrypoint becomes PID 1 |
| NET | CLONE_NEWNET | Network stack — interfaces, routing tables, iptables rules, ports |
| MNT | CLONE_NEWNS | Mount points — the container sees its own root filesystem |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues, shared memory |
| USER | CLONE_NEWUSER | UID/GID mapping — root inside the container can map to an unprivileged user on the host |

Seeing Namespaces in Action

Every process on Linux has namespace references listed under /proc/self/ns/. You can compare these between the host and a container to see the isolation in effect. Each namespace gets a unique inode number — different numbers mean different namespaces.

bash
# On the host — list your namespace inode numbers
ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026531836]'
# ...

# Inside a container — the inode numbers will be DIFFERENT
docker run --rm alpine ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532257]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532260]'
# ...

The different inode numbers (e.g., 4026531836 vs 4026532260 for PID) prove that the container process lives in a separate PID namespace. Inside that namespace, it sees its own PID 1 and cannot see host processes at all.

bash
# The container's entrypoint IS PID 1 inside its PID namespace
docker run --rm alpine ps aux
# PID   USER     COMMAND
#   1   root     ps aux

# But on the host, it's just a regular process with a high PID
docker run -d --name demo alpine sleep 3600
docker top demo
# PID     USER    COMMAND
# 48291   root    sleep 3600

PID 1 matters

Inside a container, your entrypoint process becomes PID 1. This means it receives signals like SIGTERM directly. If your process doesn't handle signals properly, docker stop will wait the full timeout (default 10s) and then send SIGKILL. This is why many images use tini or dumb-init as a lightweight init system.
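A minimal way to get proper signal handling is to make tini PID 1. A sketch of the Dockerfile pattern (the base image and server.js entrypoint are placeholders):

```dockerfile
FROM node:20-alpine

# tini is a tiny init that forwards signals and reaps zombie processes
RUN apk add --no-cache tini

WORKDIR /app
COPY . .

# tini becomes PID 1; node runs as a child that tini supervises
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "server.js"]
```

With this in place, `docker stop` delivers SIGTERM to tini, which forwards it to node immediately instead of waiting out the 10-second timeout.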

Cgroups: Enforcing Resource Limits

Namespaces control what a process can see; cgroups (control groups) control what it can use. Cgroups are a kernel mechanism that organizes processes into hierarchical groups and applies resource limits to those groups. Without cgroups, a single container could consume all host memory or starve every other process of CPU time.

Cgroups v1 vs v2

Linux ships two cgroup architectures. Modern Docker installations (Docker Engine 20.10+ on kernels 5.2+) support both, and cgroups v2 is now the default on most current distributions including Ubuntu 22.04+, Fedora 31+, and Debian 11+.

| Feature | Cgroups v1 | Cgroups v2 |
|---|---|---|
| Hierarchy | Multiple trees — one per resource controller (cpu, memory, etc.) | Single unified tree at /sys/fs/cgroup/ |
| Controllers | Independently mountable | All managed through one hierarchy |
| Pressure info | Not available | PSI (Pressure Stall Information) for memory, CPU, I/O |
| Mount path | /sys/fs/cgroup/memory/, /sys/fs/cgroup/cpu/, ... | /sys/fs/cgroup/ (unified) |

How Docker Uses Cgroups

When you pass flags like --memory or --cpus to docker run, Docker translates them into cgroup settings. Here's what the common flags control:

bash
# Run a container with resource limits
docker run -d \
  --name constrained \
  --memory=256m \
  --memory-swap=512m \
  --cpus=1.5 \
  --pids-limit=100 \
  nginx:alpine

These flags map directly to cgroup knobs. You can inspect them by reading the cgroup filesystem. On a cgroups v2 system, Docker places each container's cgroup under /sys/fs/cgroup/system.slice/docker-<container-id>.scope/.

bash
# Find the container's cgroup path
CONTAINER_ID=$(docker inspect --format '{{.Id}}' constrained)

# Cgroups v2 — read memory limit (in bytes)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# 268435456  (= 256 * 1024 * 1024 = 256MB)

# Read CPU quota (in microseconds per period)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# 150000 100000  (= 150ms of every 100ms period = 1.5 CPUs)

# Read PIDs limit
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
# 100

# See current memory usage
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# 8388608  (current usage in bytes)

No limits by default

If you run docker run without --memory or --cpus, the container has no resource limits. It can consume all available host memory and CPU. In production, always set explicit limits — an unbounded container can trigger the Linux OOM killer and take down other containers (or the host itself).
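Limits can also be inspected and adjusted on a running container without restarting it. A sketch using the `constrained` container from the earlier example:

```bash
# Watch live usage against the configured limits
docker stats --no-stream constrained

# Raise the memory ceiling in place — Docker rewrites the cgroup files live
docker update --memory=512m --memory-swap=512m constrained

# Confirm the new limit took effect (value is in bytes)
docker inspect --format '{{.HostConfig.Memory}}' constrained
# 536870912  (= 512 * 1024 * 1024)
```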

Union Filesystems: Layers and Copy-on-Write

The third primitive solves a practical problem: if you run 50 containers from the same image, you don't want 50 copies of the filesystem. Union filesystems (specifically OverlayFS with the overlay2 storage driver) let Docker stack filesystem layers on top of each other. Multiple read-only layers from the image sit below a single thin writable layer for the running container.

How Layers Stack

An image is built from a series of layers, each representing a Dockerfile instruction. OverlayFS merges these layers into a single coherent view using four directories:

| OverlayFS Directory | Role | Description |
|---|---|---|
| lowerdir | Read-only image layers | All image layers stacked together. Shared across containers from the same image. |
| upperdir | Writable container layer | All file writes, modifications, and deletions go here. Unique per container. |
| merged | Unified view | The combined filesystem the container actually sees — a union of lower + upper. |
| workdir | Internal scratch space | Used by OverlayFS for atomic operations like rename(). |

bash
# Inspect the overlay mount for a running container
docker inspect constrained --format '{{.GraphDriver.Data.MergedDir}}'
# /var/lib/docker/overlay2/abc123.../merged

docker inspect constrained --format '{{.GraphDriver.Data.UpperDir}}'
# /var/lib/docker/overlay2/abc123.../diff

docker inspect constrained --format '{{.GraphDriver.Data.LowerDir}}'
# /var/lib/docker/overlay2/layer1/diff:/var/lib/docker/overlay2/layer2/diff:...

# Each ":" separates a layer — they stack bottom-to-top

Copy-on-Write in Practice

When a container reads a file, OverlayFS looks through the layers top-down and returns the first match. When a container writes to a file that exists in a lower (read-only) layer, OverlayFS performs a copy-up: it copies the entire file to the writable upperdir, then applies the modification there. The original file in the lower layer remains untouched.

bash
# Start a container and modify a file from the base image
docker run -d --name cow-demo alpine sleep 3600

# Write to a file — this triggers copy-on-write
docker exec cow-demo sh -c "echo 'modified' > /etc/hostname"

# The change lives ONLY in the container's writable layer
docker diff cow-demo
# C /etc
# C /etc/hostname

This is why image layers are so space-efficient: 100 containers running the same nginx:alpine image share one copy of the base layers on disk. Each container only consumes additional space for the files it modifies.
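Two commands make the sharing visible: docker history lists an image's layers, and docker system df -v breaks disk usage into shared versus unique bytes:

```bash
# Each line is one layer; every container from this image shares all of them
docker history nginx:alpine

# Per-image SHARED SIZE vs UNIQUE SIZE, plus each container's writable-layer size
docker system df -v
```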

Putting It All Together: A Container Is Just a Process

When you run docker run nginx, here's what actually happens at the kernel level:

  1. Prepare the filesystem

    Docker pulls image layers (if not cached), stacks them using OverlayFS, and creates a thin writable layer on top. The merged directory becomes the container's root filesystem.

  2. Create namespaces

    Docker calls clone() with namespace flags (CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...) to create a new process with its own isolated view of PIDs, networking, mounts, hostname, IPC, and user IDs.

  3. Configure cgroups

    Docker creates a new cgroup for the container and writes the resource limits you specified (memory, CPU, PIDs) to the corresponding cgroup files. The container process is placed into this cgroup.

  4. Pivot root and exec

    The process calls pivot_root() to switch its root filesystem to the OverlayFS merged directory, then exec()s the container's entrypoint (e.g., nginx -g 'daemon off;'). From this point on, it's a regular Linux process — just one that can only see and use what the kernel allows.
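The namespace step can be reproduced by hand with util-linux's unshare, without Docker in the picture at all. A sketch (requires root):

```bash
# Create new PID, mount, and UTS namespaces and run a shell inside them
sudo unshare --pid --fork --mount-proc --uts /bin/sh

# Inside the new namespaces, try:
#   hostname demo-box   # changes the UTS hostname; the host is unaffected
#   ps aux              # shows only the processes in this PID namespace
#   echo $$             # the shell's PID here is 1 (or close to it)
```

This is essentially what runc does, minus the OverlayFS root, cgroup limits, and security profiles.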

Prove it's just a process

Run docker run -d alpine sleep 3600, then find it on the host with ps aux | grep "sleep 3600". You'll see it listed as a normal process. You can even strace it or inspect /proc/<pid>/ from the host. There is no hypervisor, no guest kernel — just a process with guardrails.

Installing Docker and Essential Configuration

Docker runs on Linux, macOS, and Windows — but the installation path and underlying architecture differ significantly across platforms. Getting the installation right matters because a misconfigured Docker setup leads to confusing permission errors, poor performance, or missing features down the line.

Linux: Installing Docker Engine

On Linux, you want Docker Engine installed from Docker's official apt or yum repositories — not the version bundled with your distro's default package manager. Distribution-packaged versions (like docker.io on Ubuntu) are often several major versions behind and miss critical features like BuildKit improvements and compose v2 integration.

Ubuntu / Debian

bash
# Remove any old/distro-packaged versions
sudo apt-get remove docker docker-engine docker.io containerd runc

# Install prerequisites and add Docker's official GPG key
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the official Docker repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine, CLI, and plugins
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin

RHEL / Fedora

bash
# Install the repo and Docker Engine
sudo dnf -y install dnf-plugins-core
sudo dnf config-manager --add-repo \
  https://download.docker.com/linux/fedora/docker-ce.repo

sudo dnf install docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin

Post-Install: User Group and Systemd

By default, the Docker daemon binds to a Unix socket owned by root. To run docker commands without sudo, add your user to the docker group. Then enable and start the systemd service so Docker launches on boot.

bash
# Add your user to the docker group
sudo usermod -aG docker $USER

# Apply the new group membership (or log out and back in)
newgrp docker

# Enable and start the Docker service
sudo systemctl enable docker
sudo systemctl start docker

Security Implication

Adding a user to the docker group grants root-equivalent privileges on the host. Any user in this group can mount the host filesystem into a container and read or write any file. In shared or production environments, consider rootless Docker instead (covered below).

macOS: Docker Desktop

Docker doesn't run natively on macOS because macOS isn't Linux. Docker Desktop for Mac spins up a lightweight Linux virtual machine — on Apple Silicon it uses Apple's Virtualization framework, and on Intel Macs it uses the HyperKit hypervisor. You interact with Docker through the CLI exactly as you would on Linux; the VM layer is transparent.

Install Docker Desktop by downloading the .dmg from docker.com or via Homebrew:

bash
brew install --cask docker

After installation, open Docker Desktop and configure resource limits under Settings → Resources. The defaults are conservative — you'll want to adjust them based on your workload:

| Resource | Default | Recommended for Development |
|---|---|---|
| CPUs | Half of host cores | Half to three-quarters of host cores |
| Memory | 2 GB | 4–8 GB (increase if building large images) |
| Disk image size | 64 GB | 64–128 GB (images and volumes consume this) |
| Swap | 1 GB | 1–2 GB |

Windows: Docker Desktop with WSL 2

Docker Desktop on Windows offers two backend options: WSL 2 (Windows Subsystem for Linux 2) and Hyper-V. WSL 2 is the recommended backend and has been the default since Docker Desktop 3.x. Here's why it matters:

| Aspect | WSL 2 Backend | Hyper-V Backend |
|---|---|---|
| Architecture | Shared Linux kernel with Windows | Full VM with dedicated kernel |
| Startup time | ~2 seconds | ~10–15 seconds |
| Memory usage | Dynamic — reclaimed when idle | Fixed allocation upfront |
| File I/O (Linux FS) | Native speed | Slower (9p/CIFS sharing) |
| Windows version | Home, Pro, Enterprise | Pro and Enterprise only |

WSL 2 runs a real Linux kernel managed by Windows, so Docker containers execute with near-native performance. The key advantage is dynamic memory management — WSL 2 grows and shrinks its memory footprint based on actual usage, while Hyper-V reserves the full allocation upfront.

bash
# Enable WSL 2 (run in PowerShell as Administrator)
wsl --install

# Verify WSL 2 is the default version
wsl --set-default-version 2

# After installing Docker Desktop, verify from your WSL distro:
docker version
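WSL 2's dynamic memory can still balloon during heavy image builds. You can cap the whole WSL 2 VM with a .wslconfig file in your Windows user profile; a sketch (the values shown are assumptions to adapt to your machine):

```ini
# %UserProfile%\.wslconfig — limits apply to the entire WSL 2 VM
[wsl2]
memory=8GB        # cap the VM's RAM
processors=4      # cap the number of virtual CPUs
swap=2GB
```

Run `wsl --shutdown` afterward so the VM restarts with the new limits.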

Essential Post-Install Configuration: daemon.json

The Docker daemon reads its configuration from /etc/docker/daemon.json on Linux (or via Docker Desktop settings on macOS/Windows). This file controls everything from logging behavior to network defaults and storage drivers. Creating a well-tuned daemon.json from the start prevents operational headaches later.

Here's a configuration covering the most commonly needed settings (drop the insecure-registries and experimental entries on production hosts):

json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "dns": ["8.8.8.8", "8.8.4.4"],
  "default-address-pools": [
    {
      "base": "172.20.0.0/16",
      "size": 24
    }
  ],
  "storage-driver": "overlay2",
  "features": {
    "buildkit": true
  },
  "insecure-registries": ["myregistry.local:5000"],
  "experimental": true
}

Let's break down what each setting does and why you'd want it:

Log Driver and Size Limits

By default, Docker uses the json-file log driver with no size limit. This means a chatty container can fill your disk with logs. Setting max-size to 10m and max-file to 3 caps each container at 30 MB of log storage total, rotating automatically. For centralized logging setups, you might switch the driver to fluentd, syslog, or awslogs instead.
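The daemon.json values are only defaults; individual containers can override them at run time with --log-driver and --log-opt:

```bash
# Give one especially chatty container more generous rotation settings
docker run -d --name chatty \
  --log-driver json-file \
  --log-opt max-size=50m \
  --log-opt max-file=5 \
  nginx:alpine

# Verify which driver the container actually got
docker inspect --format '{{.HostConfig.LogConfig.Type}}' chatty
```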

DNS Configuration

Docker containers inherit DNS settings from the host by default. If your host's /etc/resolv.conf points to a local stub resolver like 127.0.0.53 (common on Ubuntu with systemd-resolved), containers can't reach it. Explicitly setting "dns" in daemon.json provides a reliable fallback — replace the Google DNS IPs with your organization's internal DNS servers if needed.
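A quick way to check what resolver configuration a container actually received:

```bash
# Show the resolv.conf Docker generated for the container
docker run --rm alpine cat /etc/resolv.conf

# On the default bridge this mirrors the host's resolvers (with loopback
# stub addresses filtered out, falling back to your daemon.json "dns" list);
# on user-defined networks it shows Docker's embedded DNS at 127.0.0.11
```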

Default Address Pools

Every Docker network allocates a subnet. By default, Docker picks from the 172.17.0.0/16 range, which can collide with your corporate VPN or on-prem network ranges. Configuring default-address-pools lets you carve out a specific CIDR range that won't conflict with your existing infrastructure.
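You can confirm which subnet a new network was carved from with a Go template on docker network inspect:

```bash
# Create a throwaway network and print the subnet Docker assigned it
docker network create demo-net
docker network inspect demo-net --format '{{(index .IPAM.Config 0).Subnet}}'
# With the default-address-pools setting above, this falls inside 172.20.0.0/16

docker network rm demo-net
```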

Storage Driver, Insecure Registries, and Experimental Features

overlay2 is the recommended storage driver on modern Linux kernels (4.0+) and is the default on most distributions. The insecure-registries array lets you pull from HTTP registries (no TLS) — useful for local development registries but never appropriate for production. Enabling experimental unlocks features like docker buildx experimental commands and docker manifest.

After editing daemon.json, restart the daemon to apply changes:

bash
sudo systemctl restart docker

Verifying Your Installation

Three commands tell you everything you need to know about your Docker installation. Run them immediately after setup to confirm everything is working:

bash
# Shows Client and Server (daemon) versions — confirms the daemon is running
docker version

# Detailed system info: storage driver, logging driver, kernel version,
# number of containers/images, cgroup driver, and more
docker info

# Shows disk usage breakdown: images, containers, volumes, build cache
docker system df

docker version outputs both the CLI (Client) and daemon (Server) versions. If you see "Cannot connect to the Docker daemon," the daemon isn't running — check systemctl status docker on Linux or ensure Docker Desktop is started on macOS/Windows. docker info is your best diagnostic tool: it reveals the active storage driver, log driver, whether experimental mode is on, and the cgroup version in use. docker system df shows exactly how much disk space images, containers, and volumes are consuming.

Rootless Docker

Standard Docker requires the daemon to run as root, which creates a wide attack surface — a container escape vulnerability means full root access to the host. Rootless mode runs the entire Docker daemon and containers under a regular user's UID, using user namespaces to map container root to an unprivileged user on the host.

bash
# Install prerequisites (Ubuntu/Debian)
sudo apt-get install uidmap dbus-user-session

# Run the rootless setup script (as your regular user, NOT root)
dockerd-rootless-setuptool.sh install

# Set the Docker socket for your user session
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

Rootless Docker has some trade-offs: you can't bind to privileged ports (below 1024) without extra configuration, overlay2 requires kernel 5.11+ in rootless mode, and some networking features (like --net=host) behave differently. For development workstations and CI runners, these limitations rarely matter, and the security improvement is significant.

Tip

Add the DOCKER_HOST export to your ~/.bashrc or ~/.zshrc so every new shell session automatically connects to the rootless daemon. You can run rootless and rootful Docker side by side — they use separate daemons and storage directories.

Docker Images: Layers, Caching, and Content-Addressable Storage

A Docker image is not a single binary blob. It is an ordered collection of read-only filesystem layers plus a configuration JSON document that records metadata — the command to run, environment variables, exposed ports, and the ordered list of layer digests. Every Dockerfile instruction that modifies the filesystem (RUN, COPY, ADD) creates a new layer. Instructions that only set metadata (ENV, EXPOSE, CMD) modify the config JSON without adding a filesystem layer.

When you run a container, Docker stacks all read-only image layers using a union filesystem (overlay2 on modern Linux) and places a thin writable container layer on top. Any file changes the container makes — new files, modifications, deletions — happen in this writable layer, leaving the image layers untouched.

graph TD
    subgraph runtime ["Runtime (container)"]
        W["✏️ Container Writable Layer
Temporary — lost when container is removed"]
    end
    subgraph image ["Image (read-only layers)"]
        L4["Layer 4 — RUN make build"]
        L3["Layer 3 — COPY . /app"]
        L2["Layer 2 — RUN apt-get install gcc"]
        L1["Layer 1 — debian:bookworm-slim base"]
    end
    W --> L4
    L4 --> L3
    L3 --> L2
    L2 --> L1
    style W fill:#fff3cd,stroke:#ffaa00,color:#333
    style L4 fill:#d1ecf1,stroke:#0c5460,color:#333
    style L3 fill:#d1ecf1,stroke:#0c5460,color:#333
    style L2 fill:#d1ecf1,stroke:#0c5460,color:#333
    style L1 fill:#cce5ff,stroke:#004085,color:#333

Anatomy of an Image

Use docker image inspect to see the internal structure of any image. The output reveals the layer digests, the configuration (entrypoint, cmd, env, labels), and the image's own content-addressable ID.

bash
# View the full image config and layer list
docker image inspect nginx:1.25 --format '{{json .RootFS}}' | jq .

# Output shows something like:
# {
#   "Type": "layers",
#   "Layers": [
#     "sha256:9853575bc4f9...",
#     "sha256:a691e3b3eb56...",
#     "sha256:d404e58e1a2f...",
#     ...
#   ]
# }

Each entry in the Layers array is a diff ID — the SHA256 hash of the uncompressed layer content. The image ID itself (e.g., sha256:a8758716bb6a...) is the hash of the config JSON. This means two images with identical configs and layers will always have the same ID, regardless of when or where they were built.
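You can see the content-addressing principle with nothing but sha256sum: hashing the same bytes twice always yields the same digest. The JSON below is a made-up stand-in for an image config, not a real one:

```shell
# Identical config bytes always hash to the identical "image ID".
CONFIG='{"Cmd":["nginx","-g","daemon off;"],"Env":["PATH=/usr/sbin"]}'
ID1=$(printf '%s' "$CONFIG" | sha256sum | cut -d' ' -f1)
ID2=$(printf '%s' "$CONFIG" | sha256sum | cut -d' ' -f1)
echo "sha256:$ID1"
[ "$ID1" = "$ID2" ] && echo "same content, same ID: rebuilds are reproducible"
```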

To see a human-readable history of which Dockerfile instructions created which layers, use docker image history:

bash
docker image history nginx:1.25 --no-trunc

# IMAGE          CREATED BY                                      SIZE
# a8758716bb6a   CMD ["nginx" "-g" "daemon off;"]                 0B
# <missing>      EXPOSE map[80/tcp:{}]                            0B
# <missing>      COPY 30-tune-worker... (buildkit.dockerfile.v0)  4.62kB
# <missing>      RUN /bin/sh -c set -x && apt-get update ...     59.1MB
# <missing>      ENV NGINX_VERSION=1.25.3                         0B

Notice how CMD, EXPOSE, and ENV show 0B — they change only the config JSON, not the filesystem. The RUN and COPY instructions produce actual filesystem layers with measurable size.

Content-Addressable Storage

Docker's storage model is content-addressable: every layer is identified by the SHA256 hash of its content. Two layers with identical bytes produce the same digest, period. This has profound implications for efficiency.

If you build ten images that all start with FROM debian:bookworm-slim, the base layer is stored on disk exactly once. When you push these images to a registry, the registry already has that layer — so it is transferred zero times after the first push. The same deduplication applies when pulling: Docker checks which layers you already have locally and only downloads the missing ones.

Layers on disk vs. in transit

On disk, layers are stored uncompressed (as directories in the overlay2 driver). In registries and during docker pull/push, layers are gzip-compressed. The diff ID is the hash of the uncompressed content; the distribution digest is the hash of the compressed blob. Docker maps between the two in its local metadata.
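The two-digest distinction is reproducible with any bytes. This toy sketch treats a short string as a "layer" and hashes it both raw and gzip-compressed, the way a registry stores it:

```shell
# Same bytes, two digests: the diff ID hashes the uncompressed content,
# the distribution digest hashes the compressed blob.
LAYER='pretend these bytes are a filesystem layer'
DIFF_ID=$(printf '%s' "$LAYER" | sha256sum | cut -d' ' -f1)
DIST_DIGEST=$(printf '%s' "$LAYER" | gzip -n | sha256sum | cut -d' ' -f1)
echo "diff ID:             sha256:$DIFF_ID"
echo "distribution digest: sha256:$DIST_DIGEST"
[ "$DIFF_ID" != "$DIST_DIGEST" ] && echo "digests differ, as expected"
```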

The Layer Cache Mechanism

When you run docker build, Docker evaluates each instruction top-to-bottom and checks whether a cached layer already exists for it. If the instruction and all its inputs are identical to a previous build, Docker reuses the cached layer and skips execution. The moment it encounters a changed instruction, the cache is busted for that instruction and everything below it.

This is why instruction ordering in your Dockerfile matters enormously. Consider this common anti-pattern:

docker
# ❌ BAD — copying all source first busts cache on every code change
FROM node:20-slim
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build

Every time you change any source file, the COPY . . layer changes, which invalidates the cache for npm install — even though your dependencies haven't changed. The fix is to separate dependency installation from source code:

docker
# ✅ GOOD — dependency layer is cached unless package.json changes
FROM node:20-slim
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

Now the npm ci layer is only rebuilt when package.json or package-lock.json changes. Source code changes only invalidate the final COPY and build layers, saving minutes on every build.

Cache rule of thumb

Order Dockerfile instructions from least frequently changing (base image, system packages) to most frequently changing (source code, build commands). This maximizes cache reuse across builds.
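A toy model of the cache behavior, in plain shell, under the simplifying assumption that a layer's cache key is just the hash of its copied files (the real key also covers the instruction text and the parent layer). A source-only edit changes the source layer's key while the lockfile layer's key is untouched:

```shell
workdir=$(mktemp -d)
echo '{"lockfileVersion": 3}' > "$workdir/package-lock.json"
echo 'console.log("v1")'      > "$workdir/app.js"

# Simplified cache key: hash of the copied file's contents.
key() { sha256sum "$1" | cut -d' ' -f1; }

DEPS_BEFORE=$(key "$workdir/package-lock.json")
SRC_BEFORE=$(key "$workdir/app.js")

echo 'console.log("v2")' > "$workdir/app.js"   # a source-only change

DEPS_AFTER=$(key "$workdir/package-lock.json")
SRC_AFTER=$(key "$workdir/app.js")

[ "$DEPS_BEFORE" = "$DEPS_AFTER" ] && echo "deps layer:   cache HIT (npm ci skipped)"
[ "$SRC_BEFORE" != "$SRC_AFTER" ]  && echo "source layer: cache MISS (rebuilt)"
rm -rf "$workdir"
```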

Image Tags vs. Digests

An image tag like nginx:1.25 or node:20-slim is a human-readable, mutable pointer. The image maintainer can push a completely different image under the same tag at any time. The notorious :latest tag simply means "whatever was pushed most recently without an explicit tag" — it carries no guarantee of stability.

An image digest is immutable. It is the SHA256 hash of the image's manifest (the JSON document that lists the layer digests and config). A digest reference looks like this:

bash
# Pull by digest — guaranteed to be the exact same image every time
docker pull nginx@sha256:6db391d1c0cfb30588ba0bf72ea999404f2764fabf023d4f0c7063b68260bf22

# Find the digest of a local image
docker image inspect nginx:1.25 --format '{{index .RepoDigests 0}}'

Manifest Lists (Multi-Architecture Images)

Modern images often support multiple CPU architectures. When you docker pull nginx, Docker doesn't download a single image — it fetches a manifest list (also called an OCI image index), which is a JSON document pointing to platform-specific manifests. Docker selects the correct one for your architecture automatically.

bash
# Inspect the manifest list to see supported platforms
docker manifest inspect nginx:1.25 | jq '.manifests[] | {platform, digest}'

# Output:
# { "platform": { "architecture": "amd64", "os": "linux" }, "digest": "sha256:6db3..." }
# { "platform": { "architecture": "arm64", "os": "linux" }, "digest": "sha256:b19c..." }
# { "platform": { "architecture": "arm",   "os": "linux" }, "digest": "sha256:e02f..." }

This is why the same docker pull command works on an Intel Mac, an M-series Mac, an AWS Graviton instance, and a Raspberry Pi — the manifest list resolves to a different platform-specific image in each case.

Practical Commands for Image Management

bash
# List all local images (with size)
docker image ls

# Show only dangling images (untagged, leftover from rebuilds)
docker image ls --filter "dangling=true"

# See what's actually consuming disk — images, containers, volumes, build cache
docker system df -v

# Remove dangling images
docker image prune

# Remove ALL unused images (not just dangling) — frees significant space
docker image prune -a

# Nuclear option: reclaim everything (stopped containers, unused networks, etc.)
docker system prune -a --volumes

docker system df lies about "reclaimable" space

The "RECLAIMABLE" column in docker system df only counts images not referenced by any container. If you have stopped containers, those images still count as "in use." Run docker container prune first if you want an accurate picture of what can be freed.

Base Image Selection

The base image you choose in your FROM instruction sets the floor for your image size, security attack surface, and debugging experience. There is no universally "best" choice — only trade-offs.

| Base Image | Compressed Size | Package Manager | Shell & Debugging Tools | Best For |
|---|---|---|---|---|
| scratch | 0 MB | None | None — completely empty | Statically compiled Go/Rust binaries |
| alpine:3.19 | ~3.5 MB | apk | BusyBox shell, minimal utils | Small general-purpose images |
| debian:bookworm-slim | ~28 MB | apt | Bash, coreutils | Apps needing glibc or Debian packages |
| gcr.io/distroless/static | ~2 MB | None | None — no shell at all | Production containers (minimal CVE surface) |
| ubuntu:22.04 | ~29 MB | apt | Bash, common GNU tools | Familiar dev environment, wide package support |

Key trade-offs to consider

Alpine uses musl libc instead of glibc. Most software works fine, but some C libraries (especially those with DNS resolution edge cases or precompiled native modules like Python wheels) can behave differently. If you hit mysterious segfaults or DNS issues, try switching to a Debian-based image before debugging further.

Distroless images have no package manager and no shell. This is a security advantage — an attacker who gains code execution inside the container cannot easily install tools or explore the filesystem. The downside: you can't docker exec -it <container> /bin/sh to debug. Google provides :debug variants of distroless images that include a BusyBox shell for troubleshooting.

scratch is the ultimate minimal base: it is literally an empty filesystem. Your binary must be fully statically linked and carry everything it needs (including CA certificates if it makes HTTPS calls). It is ideal for single-binary Go or Rust applications compiled with CGO_ENABLED=0.

Writing Production-Grade Dockerfiles

A Dockerfile is a recipe, but most recipes online produce bloated, insecure, and slow-to-build images. This section walks through every Dockerfile instruction, shows correct usage versus common pitfalls, and ends with a complete before/after refactoring of a real-world Dockerfile.

FROM — Choosing Your Base Image

Every Dockerfile starts with FROM. It sets the base image that all subsequent instructions build upon. You can also alias a stage with AS for multi-stage builds.

dockerfile
# Always pin to a specific version — never use :latest in production
FROM node:20.11-alpine3.19 AS builder

# Scratch is a special empty image — ideal for statically compiled binaries
FROM scratch

Prefer -alpine or -slim variants for smaller attack surface and image size. A full node:20 image is ~1 GB; node:20-alpine is ~130 MB. Pin to a specific tag (including the OS patch version) so your builds are reproducible across time.

RUN — Executing Build Commands

RUN executes commands during the image build and creates a new layer for each instruction. The single most impactful optimization you can make is combining related commands into one RUN statement and cleaning up in the same layer.

dockerfile
# Each RUN creates a layer — apt cache is baked into layer 1 forever
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*

The rm on line 3 creates a new layer but doesn't shrink layer 1. The apt cache stays in the image.

dockerfile
# Single layer — install, clean up, done
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

One layer. The apt cache is created and deleted within the same layer, so it never appears in the final image.

COPY vs ADD

COPY and ADD both copy files from the build context into the image, but they differ in important ways. ADD has two extra behaviors: it auto-extracts compressed archives (.tar.gz, .xz, etc.) and can fetch files from remote URLs.

| Feature | COPY | ADD |
|---|---|---|
| Copy local files | ✅ | ✅ |
| Auto-extract tar archives | ❌ | ✅ |
| Fetch remote URLs | ❌ | ✅ (but no caching control) |
| Predictable behavior | ✅ | ❌ — may surprise you |

dockerfile
# Use COPY for everything — it's explicit and predictable
COPY package.json package-lock.json ./
COPY src/ ./src/

# Only use ADD when you specifically need tar extraction
ADD rootfs.tar.gz /

Use COPY by default. Reach for ADD only when you need automatic tar extraction. For remote files, use RUN curl or RUN wget instead — it gives you control over caching, retries, and cleanup.

WORKDIR — Setting the Working Directory

WORKDIR sets the working directory for all subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions. If the directory doesn't exist, Docker creates it automatically.

dockerfile
# Good — uses WORKDIR
WORKDIR /app
COPY . .

# Bad — uses RUN cd (the cd has no effect on the next instruction)
RUN cd /app
COPY . .

Never use RUN cd /somewhere — each RUN starts a new shell, so the directory change is lost. Always use WORKDIR.
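The "lost cd" is ordinary shell behavior you can reproduce locally; each RUN instruction behaves like a parenthesized subshell:

```shell
cd /tmp || exit 1
# Like "RUN cd /usr": the subshell changes directory, then exits,
# and the directory change dies with it.
(cd /usr)
pwd   # still /tmp; the parent shell never moved
```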

ENV vs ARG — Build-Time and Runtime Variables

ARG defines a variable available only during the build. ENV defines a variable that persists into the running container. This distinction matters for security — never put secrets in ENV since they're baked into every layer and visible to anyone who inspects the image.

dockerfile
# ARG — only available during build, not in the final image
ARG NODE_VERSION=20
FROM node:${NODE_VERSION}-alpine

# ARG must be re-declared after FROM (each FROM starts a new build stage)
ARG APP_VERSION
RUN echo "Building version ${APP_VERSION}"

# ENV — baked into the image, available at runtime
ENV NODE_ENV=production
ENV PORT=3000

| Aspect | ARG | ENV |
|---|---|---|
| Available during build | ✅ | ✅ |
| Available at runtime | ❌ | ✅ |
| Overridable at build | `--build-arg` | Not directly |
| Overridable at run | N/A | `-e` or `--env` |
| Persists in image layers | ❌ (but cached in build history) | ✅ |

Never put secrets in ARG or ENV

ARG values are visible in docker history. ENV values are visible via docker inspect. For build-time secrets, use docker build --secret or BuildKit secret mounts (RUN --mount=type=secret).

EXPOSE — Documenting Ports

EXPOSE does not actually publish a port. It's documentation — a signal to the person running the container about which ports the application listens on. You still need -p at runtime.

dockerfile
EXPOSE 3000
EXPOSE 3000/udp

# At runtime, you still need -p to actually publish:
# docker run -p 8080:3000 myapp

ENTRYPOINT vs CMD — The PID 1 Problem

This is where most Dockerfiles go wrong. ENTRYPOINT sets the main executable. CMD provides default arguments to that executable (or acts as the full command if no ENTRYPOINT is set). But the critical detail is the form you use.

Exec Form vs Shell Form

| Form | Syntax | Runs as PID 1? | Shell variable expansion? |
|---|---|---|---|
| Exec form | `["node", "server.js"]` | ✅ Yes — directly | ❌ No |
| Shell form | `node server.js` | ❌ No — wraps in /bin/sh -c | ✅ Yes |

dockerfile
# ✅ Exec form — node IS PID 1, receives SIGTERM directly
ENTRYPOINT ["node", "server.js"]

# ❌ Shell form — /bin/sh is PID 1, node is a child process
# SIGTERM goes to sh, which does NOT forward it to node
ENTRYPOINT node server.js

When you use shell form, /bin/sh becomes PID 1 inside the container. The shell does not forward signals like SIGTERM to child processes. This means docker stop sends SIGTERM, your app never receives it, Docker waits 10 seconds, then sends SIGKILL. Your application never gets a chance to shut down gracefully — open connections are dropped, transactions are lost.

Combining ENTRYPOINT and CMD

dockerfile
# ENTRYPOINT = the executable, CMD = default arguments
ENTRYPOINT ["python", "manage.py"]
CMD ["runserver", "0.0.0.0:8000"]

# docker run myapp                     → python manage.py runserver 0.0.0.0:8000
# docker run myapp migrate             → python manage.py migrate
# docker run myapp shell               → python manage.py shell

This pattern makes the container behave like a binary — you pass subcommands at runtime, and CMD provides a sensible default.

When you need shell expansion with exec form

If you need environment variable expansion but also want exec form, use an entrypoint script: ENTRYPOINT ["/docker-entrypoint.sh"]. Inside that script, use exec "$@" at the end to replace the shell with your application as PID 1.
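A minimal sketch of that entrypoint-script pattern. The setup step is a placeholder, and the script is written to /tmp here so it can be run directly; in a real image you would COPY it in and reference it from ENTRYPOINT:

```shell
cat > /tmp/docker-entrypoint.sh <<'EOF'
#!/bin/sh
set -e
# ...setup work that needs a shell: expand vars, render config, etc...
# exec replaces this shell with the CMD, so the app becomes PID 1.
exec "$@"
EOF
chmod +x /tmp/docker-entrypoint.sh

# The CMD arguments arrive as "$@" and are exec'd directly:
/tmp/docker-entrypoint.sh echo "I would be PID 1 in a container"
```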

HEALTHCHECK — Container Health Monitoring

HEALTHCHECK tells Docker how to test whether your container is still working. Without it, Docker only knows if the process is running — not whether it's actually healthy and serving traffic. Orchestrators like Docker Swarm and Kubernetes use health status for automated restarts and rolling deployments.

dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1

# Disable a parent image's healthcheck
HEALTHCHECK NONE

The health check command runs inside the container. Exit code 0 means healthy, 1 means unhealthy. Keep the check lightweight — a heavy health check can itself degrade performance.

USER — Don't Run as Root

By default, containers run as root. If an attacker exploits a vulnerability in your application, they have root access inside the container — and potentially to the host via container escape exploits. The USER instruction switches to a non-root user.

dockerfile
# Create a non-root user and switch to it
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser
# Set ownership before switching user
COPY --chown=appuser:appuser . /app
USER appuser

# Alpine uses addgroup/adduser instead
RUN addgroup -S appuser && adduser -S -G appuser appuser

Place the USER instruction as late as possible — you typically need root for installing packages and setting up the filesystem. Switch to the non-root user just before ENTRYPOINT/CMD.

LABEL — Image Metadata

LABEL adds key-value metadata to your image. This is useful for image cataloging, CI traceability, and compliance. Use the OCI standard annotation keys for interoperability.

dockerfile
LABEL org.opencontainers.image.title="my-api" \
      org.opencontainers.image.version="1.4.2" \
      org.opencontainers.image.source="https://github.com/acme/my-api" \
      org.opencontainers.image.authors="team@acme.com"

STOPSIGNAL — Custom Stop Signals

By default, docker stop sends SIGTERM. Some applications (like Nginx) prefer a different signal. STOPSIGNAL lets you change it.

dockerfile
# Nginx uses SIGQUIT for graceful shutdown
STOPSIGNAL SIGQUIT

SHELL — Changing the Default Shell

The SHELL instruction overrides the default shell used for shell-form commands. On Linux the default is ["/bin/sh", "-c"]; on Windows it's ["cmd", "/S", "/C"]. You might change this to use bash for better scripting features.

dockerfile
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]

# Now RUN uses bash with strict error handling
# -e: exit on error, -u: error on undefined vars, -o pipefail: catch pipe errors
RUN echo "hello" | grep "world"  # This would fail properly now

Using pipefail is critical. Without it, a pipeline like curl ... | tar xz succeeds even if curl fails — the exit code of the last command (tar) is what counts.
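You can watch the difference in any terminal (bash is assumed to be installed). The pipeline below fails on the left side, yet the default shell reports success; with pipefail the failure surfaces:

```shell
# Default sh semantics: exit status comes from the LAST command only.
sh -c 'false | true' && echo "without pipefail: pipeline reported success"

# With pipefail, any failing stage fails the whole pipeline.
bash -o pipefail -c 'false | true' || echo "with pipefail: pipeline reported failure"
```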

ONBUILD — Deferred Instructions

ONBUILD registers an instruction that fires when another image uses this image as its FROM base. It's used in base images to enforce patterns on downstream images.

dockerfile
# In a base image (e.g., company-node-base)
FROM node:20-alpine
ONBUILD COPY package*.json ./
ONBUILD RUN npm ci --production
ONBUILD COPY . .

# Any team's Dockerfile that does "FROM company-node-base"
# automatically gets those three instructions injected after their FROM

Use ONBUILD sparingly. It creates "magic" behavior that surprises people who don't read the base image's Dockerfile. Prefer explicit multi-stage builds for most use cases.

.dockerignore — Controlling the Build Context

Before Docker builds anything, it sends the entire build context (the directory you pass to docker build) to the daemon. Without a .dockerignore file, you're sending node_modules, .git, local configs, and secrets over to the build — wasting time and risking leaking sensitive files into the image.

bash
# .dockerignore
.git
.gitignore
node_modules
npm-debug.log
Dockerfile
docker-compose*.yml
.env
.env.*
*.md
.vscode
.idea
coverage
dist
.DS_Store

The syntax mirrors .gitignore. The biggest wins come from excluding .git (which can be hundreds of MB) and node_modules (since you'll npm ci inside the image anyway).

Complete Before/After: Naive → Production-Grade

Let's take a typical Node.js API Dockerfile and apply every principle from this section. This is the kind of refactoring that takes an image from 1.2 GB and 45-second builds down to 150 MB and 5-second rebuilds.

❌ The Naive Dockerfile

dockerfile
FROM node:latest
COPY . /app
WORKDIR /app
RUN npm install
EXPOSE 3000
CMD node server.js

What's wrong with this? Almost everything:

  • node:latest — unpinned, non-reproducible, full Debian image (~1 GB)
  • COPY . /app before dependency install — every code change invalidates the npm cache layer
  • npm install — installs devDependencies, uses mutable package.json resolution
  • Shell form CMD — node is not PID 1, won't receive SIGTERM
  • Runs as root — security risk
  • No .dockerignore — sends node_modules, .git, and everything else to the build
  • No health check — orchestrator can't monitor health
  • Single stage — build tools ship in production

✅ The Production-Grade Dockerfile

dockerfile
# syntax=docker/dockerfile:1

# ── Stage 1: Install dependencies ─────────────────────────────
FROM node:20.11-alpine3.19 AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# ── Stage 2: Build (if you have a compile step) ───────────────
FROM node:20.11-alpine3.19 AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY src/ ./src/
COPY tsconfig.json ./
RUN npm run build

# ── Stage 3: Production image ─────────────────────────
FROM node:20.11-alpine3.19 AS production

LABEL org.opencontainers.image.title="my-api" \
      org.opencontainers.image.version="1.4.2"

# Use tini as PID 1 for proper signal handling
RUN apk add --no-cache tini

ENV NODE_ENV=production
WORKDIR /app

# Copy only production deps from stage 1
COPY --from=deps /app/node_modules ./node_modules
# Copy only compiled output from stage 2
COPY --from=build /app/dist ./dist
COPY package.json ./

# Non-root user
RUN addgroup -S appuser && adduser -S -G appuser appuser
RUN chown -R appuser:appuser /app
USER appuser

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD ["wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/server.js"]

Here's what changed and why:

| Change | Why It Matters |
|---|---|
| Pinned Alpine base image | ~130 MB vs ~1 GB. Reproducible builds. Smaller attack surface. |
| Multi-stage build (3 stages) | Build tools, devDependencies, and source code never reach the production image. |
| COPY package*.json first | Dependencies are cached until package.json or package-lock.json changes. Code changes don't trigger reinstall. |
| npm ci instead of npm install | Deterministic installs from lockfile. Faster. Fails if lockfile is out of sync. |
| tini as PID 1 | Forwards signals properly, reaps zombie processes. Essential for production containers. |
| Exec form CMD | No shell wrapping. Node receives signals directly through tini. |
| Non-root user | Principle of least privilege. Limits blast radius of container escape exploits. |
| HEALTHCHECK | Orchestrators can detect and replace unhealthy containers automatically. |
| LABEL with OCI keys | Image is traceable back to source repo and version. Helps with auditing. |

Use docker build --target for development

In the multi-stage Dockerfile above, run docker build --target build -t myapp:dev . to stop at the build stage. This gives you an image with devDependencies and source code — perfect for development and testing, while the same Dockerfile produces a lean production image by default.

Multi-Stage Builds, BuildKit, and Build Optimization

A single-stage Dockerfile drags every compiler, header file, and dev dependency into your final image. Multi-stage builds let you split the build process across multiple FROM instructions — each one starts a fresh filesystem — and then cherry-pick only the artifacts you need into a slim runtime image. The result is dramatically smaller, more secure images with a minimal attack surface.

flowchart LR
    subgraph stage1["Stage 1: Builder  (~800 MB)"]
        A["Source Code"] --> B["Install Deps & Compile"]
        B --> C["Binary / Bundle"]
    end
    subgraph stage2["Stage 2: Runtime  (~15 MB)"]
        D["Minimal Base Image"] --> E["Production Artifact"]
    end
    C -- "COPY --from=builder" --> E
    stage1 -. "discarded after build" .-> F(("🗑️"))
    style stage1 fill:#2d2d2d,stroke:#f97316,color:#f5f5f5
    style stage2 fill:#2d2d2d,stroke:#22c55e,color:#f5f5f5
    style F fill:#2d2d2d,stroke:#ef4444,color:#ef4444
    

The builder stage contains all the heavy tooling — Go compiler, Node.js with node_modules, Python build wheels — but none of it ships. Only the final FROM stage contributes to the image you push to a registry.

Multi-Stage Build Patterns by Language

The pattern is the same across languages: compile or bundle in one stage, copy the output into a minimal base. The specifics vary depending on how each runtime handles dependencies.

Go produces a statically-linked binary, making it ideal for scratch or distroless final images with virtually zero overhead.

dockerfile
# Stage 1: Build
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app ./cmd/server

# Stage 2: Runtime
FROM scratch
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=builder /app /app
ENTRYPOINT ["/app"]

Setting CGO_ENABLED=0 ensures a fully static binary. The -ldflags="-s -w" flags strip debug symbols and DWARF info, shrinking the binary further. The final image is often under 15 MB.

Node.js apps need a runtime, so you can't use scratch. Instead, separate the npm ci (with devDependencies) from the production node_modules.

dockerfile
# Stage 1: Install & Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Production deps only
FROM node:20-alpine AS production
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev && npm cache clean --force
COPY --from=builder /app/dist ./dist

USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]

The builder stage installs all dependencies (including TypeScript, Webpack, etc.) and builds. The production stage runs npm ci --omit=dev to get only runtime dependencies, then copies the compiled output.

Python benefits from building wheels in a builder stage and installing them into a clean runtime image without compilers or build headers.

dockerfile
# Stage 1: Build wheels
FROM python:3.12-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: Install pre-built wheels
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends libpq5 \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY . .

USER nobody
CMD ["python", "main.py"]

The builder stage has gcc and dev headers to compile C extensions (like psycopg2). The runtime stage only needs the shared library (libpq5) and the pre-built wheels — no compiler necessary.

BuildKit: The Modern Build Engine

BuildKit is Docker's next-generation build backend, enabled by default since Docker Engine 23.0 (and earlier in Docker Desktop). It replaces the legacy builder with parallel stage execution, better caching, and features like build secrets and SSH forwarding. On older Docker versions, enable it with DOCKER_BUILDKIT=1.

BuildKit Parallel Execution

BuildKit analyzes your Dockerfile as a DAG (directed acyclic graph). Independent stages run in parallel automatically. A three-stage build where two stages don't depend on each other will build both concurrently, significantly reducing total build time.

Build Secrets

Never put credentials in ENV or ARG — they persist in image layers and history. BuildKit's --mount=type=secret mounts a file into the build container that is never committed to a layer.

dockerfile
# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./

# Secret is mounted at /run/secrets/npmrc — never baked into a layer
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci

COPY . .
CMD ["node", "server.js"]

Build it by passing the secret file at build time:

bash
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .

SSH Forwarding

If your build needs to clone private Git repos, forward your host SSH agent instead of copying keys into the image:

dockerfile
# syntax=docker/dockerfile:1
FROM alpine/git AS source
RUN --mount=type=ssh git clone git@github.com:org/private-repo.git /src

FROM golang:1.22-alpine AS builder
COPY --from=source /src /src
WORKDIR /src
RUN go build -o /app .
bash
docker build --ssh default -t myapp .

Cache Mounts

Cache mounts persist directories across builds, avoiding redundant downloads. This is one of the highest-impact BuildKit optimizations — especially for package managers that maintain a local cache.

dockerfile
# Go module cache
RUN --mount=type=cache,target=/go/pkg/mod \
    go mod download

# apt cache — survives across builds
RUN --mount=type=cache,target=/var/cache/apt \
    --mount=type=cache,target=/var/lib/apt \
    apt-get update && apt-get install -y gcc

# pip cache
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

# npm cache
RUN --mount=type=cache,target=/root/.npm \
    npm ci

Heredocs in Dockerfiles

BuildKit supports heredoc syntax, letting you inline multi-line scripts and files without long chains of && \ continuations:

dockerfile
# syntax=docker/dockerfile:1
FROM nginx:alpine

# Inline a config file without COPY
COPY <<EOF /etc/nginx/conf.d/default.conf
server {
    listen 80;
    location / {
        proxy_pass http://backend:3000;
    }
}
EOF

# Multi-line RUN without awkward backslash chains
RUN <<EOF
    apk add --no-cache curl jq
    curl -sSL https://example.com/setup.sh | sh
    rm -rf /tmp/*
EOF

Cross-Platform Builds with docker buildx

docker buildx extends the build command to support multi-architecture images from a single machine. It uses QEMU emulation or remote builder nodes to compile for architectures your host doesn't natively support — critical for shipping ARM images from x86 CI runners (or vice versa).

bash
# Create a new builder with multi-arch support
docker buildx create --name multiarch --driver docker-container --use

# Build for amd64 and arm64, push a manifest list to registry
docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --tag registry.example.com/myapp:1.0 \
    --push .

When a user pulls this image, Docker automatically selects the correct architecture variant from the manifest list. This is how official images like nginx and postgres serve both Intel and Apple Silicon machines from a single tag.

Tip

QEMU emulation is convenient but slow — a Go cross-compile via GOARCH=arm64 inside an amd64 builder is significantly faster than emulating an entire arm64 build. Use --platform only for the final stage and cross-compile natively when your toolchain supports it.

Build Cache Strategies

Docker layer caching works well locally, but CI environments start with a cold cache on every run. BuildKit provides three cache export/import backends to solve this, each with different trade-offs.

Strategy | How It Works | Best For | Trade-off
--- | --- | --- | ---
Inline | Embeds cache metadata into the image itself | Simple setups, single-platform | Only caches the final stage; increases image size slightly
Registry | Pushes cache to a separate registry reference | CI pipelines, team sharing | Caches all stages; requires registry write access
Local | Exports cache to a local directory | Self-hosted runners with persistent storage | Fastest import; not shareable across machines

Inline Cache

The simplest option — embed cache metadata directly into the pushed image:

bash
# Export cache inline with the image
docker buildx build \
    --cache-to type=inline \
    --tag registry.example.com/myapp:latest \
    --push .

# Import cache from the previously pushed image
docker buildx build \
    --cache-from type=registry,ref=registry.example.com/myapp:latest \
    --tag registry.example.com/myapp:latest \
    --push .

Registry Cache

For multi-stage builds, registry cache is superior because it caches every stage, not just the final one:

bash
docker buildx build \
    --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
    --cache-to   type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
    --tag registry.example.com/myapp:latest \
    --push .

The mode=max flag tells BuildKit to cache all layers from all stages, not just the layers used in the final image. The cache is stored at a separate tag (:buildcache) so it doesn't pollute your image tags.

Local Cache

If your CI runner has persistent storage between runs, local directory cache avoids any registry round-trips:

bash
docker buildx build \
    --cache-from type=local,src=/tmp/buildcache \
    --cache-to   type=local,dest=/tmp/buildcache,mode=max \
    --tag myapp:latest \
    --load .
CI Cache Gotcha

GitHub Actions ephemeral runners lose local caches between jobs. Use actions/cache to persist the /tmp/buildcache directory, or prefer registry cache which survives regardless of runner lifecycle. For GitHub Actions specifically, use --cache-to type=gha and --cache-from type=gha which integrate natively with the Actions cache API.

Docker Registry: Hub, Private Registries, and Image Distribution

A Docker registry is the distribution layer that sits between image producers (build pipelines, developers) and image consumers (container runtimes, orchestrators). It stores image manifests and the individual layers that compose them. Every docker push and docker pull talks to a registry over the OCI Distribution Spec API — even when you think you're "just pulling from Docker."

Docker Hub — The Default Registry

When you run docker pull nginx, the Docker daemon contacts registry-1.docker.io behind the scenes. Docker Hub is the default registry baked into the Docker client. It hosts official images (curated by Docker and upstream maintainers) and community images (published by anyone with an account).

Official images use a single-name namespace like nginx, postgres, or node. Community and organization images use a two-part namespace: mycompany/api-server. Understanding this naming scheme is the foundation for every tagging strategy.

Image Naming Anatomy

A fully-qualified image reference has four parts. Most of the time you omit the registry and it defaults to Docker Hub, but for private registries every part matters.

text
[registry/][namespace/]repository[:tag|@digest]

# Examples:
nginx                                        # Docker Hub official, tag "latest" implied
mycompany/api-server:v2.3.1                  # Docker Hub, org namespace, semver tag
ghcr.io/myorg/worker:abc123f                 # GitHub Container Registry, git SHA tag
us-east1-docker.pkg.dev/proj/repo/app:main   # Google Artifact Registry
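
To make the anatomy concrete, here is a tiny POSIX-shell sketch that splits a reference into registry, repository, and tag. `parse_image_ref` is a hypothetical helper for illustration only; the real grammar is defined by the OCI Distribution Spec and also handles digests and edge cases this ignores:

```shell
# Hypothetical helper: split an image reference into registry / repo / tag.
# A leading component counts as a registry only if it contains "." or ":"
# or is exactly "localhost"; otherwise it is a Docker Hub namespace.
parse_image_ref() {
  ref="$1"; registry="docker.io"; rest="$ref"; tag="latest"
  first="${ref%%/*}"
  if [ "$first" != "$ref" ]; then
    case "$first" in
      *.*|*:*|localhost) registry="$first"; rest="${ref#*/}" ;;
    esac
  fi
  case "$rest" in
    *:*) tag="${rest##*:}"; rest="${rest%:*}" ;;
  esac
  printf '%s %s %s\n' "$registry" "$rest" "$tag"
}

parse_image_ref nginx                          # docker.io nginx latest
parse_image_ref mycompany/api-server:v2.3.1    # docker.io mycompany/api-server v2.3.1
parse_image_ref ghcr.io/myorg/worker:abc123f   # ghcr.io myorg/worker abc123f
```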

Image Tagging Strategies

Tags are mutable pointers — the same tag can be reassigned to a completely different image digest at any time. This is what makes tagging strategy so important: the tag you choose determines whether your deployments are reproducible or a game of roulette.

Semantic Versioning (semver)

The most widely-adopted strategy. You publish multiple tags per release so consumers can pin at their preferred specificity. A single build produces the image once but tags it three ways:

bash
# Tag the same image at three levels of specificity
docker tag api-server:build ghcr.io/myorg/api-server:3.2.1
docker tag api-server:build ghcr.io/myorg/api-server:3.2
docker tag api-server:build ghcr.io/myorg/api-server:3

# Push all three
docker push ghcr.io/myorg/api-server:3.2.1
docker push ghcr.io/myorg/api-server:3.2
docker push ghcr.io/myorg/api-server:3

Consumers pinning :3.2.1 get an immutable release. Those pinning :3.2 automatically receive patch updates. Those pinning :3 get minor updates too. This gives consumers explicit control over their upgrade risk.

Git SHA Tags

For internal services and CI/CD-driven deployments, tagging with the short Git commit SHA creates a 1:1 link between source code and the running artifact. Every build is unique and traceable:

bash
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t ghcr.io/myorg/api-server:${GIT_SHA} .
docker push ghcr.io/myorg/api-server:${GIT_SHA}

# In your Kubernetes manifest or docker-compose.yml:
# image: ghcr.io/myorg/api-server:a1b2c3d

Why :latest Is an Anti-Pattern

The :latest trap

The tag :latest does not mean "most recent." It is simply the default tag applied when you don't specify one. It's mutable, never auto-updated on hosts that already pulled it, and gives you zero traceability. Using :latest in production means you cannot reliably answer "what version is running right now?"
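
One way out, whatever your tag strategy, is pinning by digest: a digest reference is immutable by construction. You can read the digest of a locally pulled image like so:

bash
# Show the registry digest of a locally pulled image
docker inspect --format '{{index .RepoDigests 0}}' nginx
# prints something like nginx@sha256:<64-hex-digest>

# Pulling by that digest always yields byte-identical content:
# docker pull nginx@sha256:<64-hex-digest>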

Strategy | Immutable? | Traceable to Source? | Best For
--- | --- | --- | ---
Semver (:3.2.1) | By convention, yes | Via release notes/changelog | Libraries, public images, versioned APIs
Git SHA (:a1b2c3d) | Yes | Directly: git show a1b2c3d | Internal services, CD pipelines
:latest | No | No | Local development only
Branch name (:main) | No | Loosely: HEAD changes | Staging / preview environments

Push/Pull Flow and Layer Deduplication

Registries store images as a manifest (a JSON document listing layers) plus individual layer blobs (compressed tarballs). When you push, the client checks which layers the registry already has and only uploads new ones. Pulls work the same way in reverse — the daemon skips layers it already has locally. This is why sharing a common base image across services drastically reduces both storage and transfer time.

sequenceDiagram
    participant Dev as Developer / CI
    participant Daemon as Docker Daemon
    participant Reg as Registry

    Note over Dev,Reg: docker push myregistry/app:v2.0

    Dev->>Daemon: docker push myregistry/app:v2.0
    Daemon->>Reg: POST /v2/app/blobs/uploads/ (Layer A hash)
    Reg-->>Daemon: 200 Layer A already exists (skip)
    Daemon->>Reg: POST /v2/app/blobs/uploads/ (Layer B hash)
    Reg-->>Daemon: 202 Upload URL
    Daemon->>Reg: PUT Layer B blob data
    Reg-->>Daemon: 201 Created
    Daemon->>Reg: PUT /v2/app/manifests/v2.0
    Reg-->>Daemon: 201 Manifest stored

    Note over Dev,Reg: docker pull myregistry/app:v2.0 (different host)

    Dev->>Daemon: docker pull myregistry/app:v2.0
    Daemon->>Reg: GET /v2/app/manifests/v2.0
    Reg-->>Daemon: Manifest (lists Layer A + B)
    Daemon->>Daemon: Layer A exists locally (skip)
    Daemon->>Reg: GET /v2/app/blobs/sha256:layerB
    Reg-->>Daemon: Layer B blob data
    Daemon-->>Dev: Image ready
    
Layer deduplication in practice

If 10 microservices all use FROM node:20-slim, the shared base layers are stored once in the registry and once on each host. Only the application-specific layers on top are unique. This is why consistent base images across an organization are worth the governance effort.
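
You can see this sharing directly by comparing layer digests of the base image and an image built on top of it (my-service:1.0 below is a placeholder for one of your own images):

bash
# List the layer digests of the shared base and of your image
docker image inspect --format '{{json .RootFS.Layers}}' node:20-slim
docker image inspect --format '{{json .RootFS.Layers}}' my-service:1.0
# Matching leading digests are stored once in the registry
# and once per host, no matter how many images reference them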

Private Registry Options

Docker Hub enforces pull rate limits (100 pulls per 6 hours for anonymous users, 200 for free accounts), and private repositories are limited on the free tier. For proprietary code, compliance requirements, or simple pull-rate sanity, you need a private registry. The ecosystem offers options at every scale.

Self-Hosted: registry:2

The Docker-maintained registry:2 image is the simplest private registry you can run. It implements the full OCI Distribution API, supports S3/GCS/Azure backends for blob storage, and is production-ready when paired with a TLS reverse proxy.

bash
# Start a local registry on port 5000
docker run -d \
  --name registry \
  --restart always \
  -p 5000:5000 \
  -v registry-data:/var/lib/registry \
  registry:2

# Tag and push an image to it
docker tag my-app:latest localhost:5000/my-app:v1.0.0
docker push localhost:5000/my-app:v1.0.0

# Pull from any machine on the same network
# (without TLS, the pulling daemon must list 192.168.1.50:5000 under
#  "insecure-registries" in its daemon.json)
docker pull 192.168.1.50:5000/my-app:v1.0.0

Enterprise: Harbor

Harbor is a CNCF graduated project that wraps registry:2 and adds enterprise features: RBAC, vulnerability scanning (via Trivy), image signing, replication between registries, audit logs, and a web UI. If you need a self-hosted registry with governance, Harbor is the standard choice.

Cloud-Managed Registries

Registry | Provider | Key Advantage
--- | --- | ---
ECR (Elastic Container Registry) | AWS | Native IAM auth, lifecycle policies for automatic cleanup
GAR (Artifact Registry) | Google Cloud | Multi-format (Docker, npm, Maven), regional replication
ACR (Azure Container Registry) | Azure | ACR Tasks for in-registry builds, geo-replication
GHCR (GitHub Container Registry) | GitHub | Tight Actions integration, free for public repos

Cloud registries eliminate operational overhead — no TLS certs to manage, no storage backends to configure, no uptime to guarantee. They authenticate through the cloud provider's IAM, which simplifies CI/CD credentials. The tradeoff is vendor coupling and per-GB storage costs.

Authenticating with Cloud Registries

bash
# ECR login (token valid for 12 hours)
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS \
  --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
bash
# Google Artifact Registry login
gcloud auth configure-docker us-east1-docker.pkg.dev
bash
# GHCR login with a Personal Access Token
echo $CR_PAT | docker login ghcr.io \
  --username YOUR_GITHUB_USER --password-stdin

Image Signing and Verification

Pushing an image to a registry does not prove who built it or that it hasn't been tampered with. Image signing creates a cryptographic chain of trust: you verify that the image was produced by a known identity and hasn't been modified since signing. This is a supply-chain security essential.

Docker Content Trust (DCT)

DCT uses The Update Framework (TUF) via Notary to sign image tags. When enabled, the Docker client refuses to pull unsigned images. It's built into the Docker CLI but has seen limited adoption because Notary v1 is complex to operate.

bash
# Enable Docker Content Trust
export DOCKER_CONTENT_TRUST=1

# Now push signs automatically (prompts for passphrase on first use)
docker push ghcr.io/myorg/api-server:v2.0.0

# Pull will reject unsigned images
docker pull ghcr.io/myorg/api-server:v2.0.0

Cosign and Sigstore — The Modern Approach

Sigstore's cosign is rapidly becoming the industry standard for container signing. It signs the image digest (not the mutable tag), stores signatures in the same registry alongside the image, and supports keyless signing via OIDC identity providers (GitHub Actions, Google, Microsoft). No key management infrastructure required.

bash
# Sign with a key pair
cosign generate-key-pair
cosign sign --key cosign.key ghcr.io/myorg/api-server@sha256:abc123...

# Verify on the consumer side
cosign verify --key cosign.pub ghcr.io/myorg/api-server@sha256:abc123...

# Keyless signing in CI (e.g., GitHub Actions) — no keys to manage
cosign sign ghcr.io/myorg/api-server@sha256:abc123...
# Sigstore's Fulcio CA issues a short-lived cert tied to your OIDC identity
Enforce signing in Kubernetes

Use a policy engine like Kyverno or OPA Gatekeeper with cosign verification to reject unsigned images at admission time. This closes the loop — signing without enforcement is security theater.

Container Lifecycle: From Create to Remove

Every Docker container moves through a well-defined set of states — from the moment it's created to the moment it's removed from the system. Understanding this lifecycle is the foundation for debugging stuck containers, designing restart strategies, and writing clean orchestration scripts.

This section walks through each state transition, the commands that trigger them, and the flags and inspection tools you'll use daily.

The Container State Machine

A container can exist in one of five states. Each state transition is triggered by a specific Docker command or an event inside the container itself (like the main process exiting).

stateDiagram-v2
    [*] --> Created : docker create
    Created --> Running : docker start
    Running --> Running : docker restart
    Running --> Paused : docker pause
    Paused --> Running : docker unpause
    Running --> Exited : docker stop / process exits
    Running --> Exited : SIGKILL / OOM / crash
    Exited --> Running : docker start / docker restart
    Exited --> Removed : docker rm
    Created --> Removed : docker rm
    Removed --> [*]

    note right of Created : Container exists but\nmain process not started
    note right of Paused : Process frozen via\ncgroup freezer
    note right of Exited : Process terminated,\nfilesystem persists
    

The key insight: Exited containers still occupy disk space. Their writable layer, logs, and metadata remain until you explicitly docker rm them or use the --rm flag at creation time.
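
To see and reclaim that space:

bash
# List stopped containers, including the size of their writable layers
docker ps -a --filter status=exited --size

# Remove all stopped containers in one sweep (-f skips the confirmation)
docker container prune -f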

docker run Demystified

docker run is actually three commands combined: docker create + docker start + docker attach (in foreground mode). Understanding this decomposition helps you reason about what's happening when a run fails — did the image pull fail (create phase), did the process crash on startup (start phase), or is the output just not reaching your terminal (attach phase)?

bash
# These two sequences are equivalent:

# Sequence 1: docker run (all-in-one)
docker run -d --name my-nginx nginx:alpine

# Sequence 2: explicit create + start
docker create --name my-nginx nginx:alpine
docker start my-nginx

Essential docker run Flags

You'll use these flags constantly. Rather than memorize them in isolation, here's how they map to real operational concerns.

Detached vs Interactive Mode

The -d flag runs the container in the background and prints the container ID. Without it, your terminal attaches to the container's stdout/stderr. The -it combination allocates a pseudo-TTY (-t) and keeps stdin open (-i) — essential for interactive shells.

bash
# Background a web server
docker run -d --name web nginx:alpine

# Interactive shell into an Ubuntu container
docker run -it --name sandbox ubuntu:22.04 bash

# Combine: start detached, exec into it later
docker run -d --name app node:20-alpine sleep infinity
docker exec -it app sh

Naming, Cleanup, and Ports

bash
# --name:  give the container a human-readable name
# --rm:    auto-remove the container when it exits
# -p:      map host port to container port
docker run -d --name api --rm -p 3000:3000 my-api:latest

# Bind to a specific interface (security best practice)
docker run -d -p 127.0.0.1:5432:5432 postgres:16

# Map multiple ports
docker run -d -p 80:80 -p 443:443 --name proxy nginx:alpine

Environment Variables and Working Directory

The -e flag sets individual environment variables. For multiple variables, --env-file reads from a file — keeping secrets out of your shell history. The -w flag overrides the working directory inside the container.

bash
# Inline environment variables
docker run -d --name db \
  -e POSTGRES_USER=admin \
  -e POSTGRES_PASSWORD=secret \
  -e POSTGRES_DB=myapp \
  postgres:16

# Load from an env file (one VAR=value per line)
docker run -d --name app --env-file .env my-app:latest

# Override the working directory
docker run -it -w /opt/project node:20-alpine npm test
Warning

Environment variables passed with -e are visible via docker inspect. For production secrets, use Docker secrets (Swarm) or mount a secrets file. Never pass passwords as -e flags in CI/CD logs.

Complete Flags Reference

Flag | Purpose | Example
--- | --- | ---
-d | Run detached (background) | docker run -d nginx
-it | Interactive mode with TTY | docker run -it ubuntu bash
--name | Assign a container name | --name my-app
--rm | Auto-remove on exit | docker run --rm alpine echo hi
-p | Publish port (host:container) | -p 8080:80
-e | Set environment variable | -e NODE_ENV=production
--env-file | Load env vars from file | --env-file .env
-w | Set working directory | -w /app
--restart | Restart policy | --restart unless-stopped

Container Inspection Tools

Running containers are black boxes until you know how to peer inside them. Docker provides a suite of inspection commands that cover metadata, logs, resource usage, and filesystem changes.

docker inspect — The Swiss Army Knife

docker inspect returns a massive JSON blob with everything Docker knows about a container: its configuration, network settings, mounts, state, and more. The --format flag (Go templates) lets you extract exactly what you need.

bash
# Full JSON output
docker inspect my-app

# Get the container's IP address
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-app

# Check the current state
docker inspect --format '{{.State.Status}}' my-app

# Get the exit code of a stopped container
docker inspect --format '{{.State.ExitCode}}' my-app

# View all environment variables
docker inspect --format '{{.Config.Env}}' my-app

# See port mappings
docker port my-app

docker logs — Container Output

Every container captures its main process's stdout and stderr. The logs command is your first stop when something goes wrong. Combine -f (follow) with --since for real-time debugging.

bash
# View all logs
docker logs my-app

# Follow logs in real time (like tail -f)
docker logs -f my-app

# Last 50 lines with timestamps
docker logs --tail 50 -t my-app

# Logs from the last 5 minutes
docker logs --since 5m my-app

docker stats, top, and diff

bash
# Live resource usage (CPU, memory, network, disk I/O)
docker stats my-app

# One-shot stats (no streaming)
docker stats --no-stream my-app

# See processes running inside the container
docker top my-app

# See filesystem changes since container was created
# A = Added, C = Changed, D = Deleted
docker diff my-app

Executing Commands and Copying Files

docker exec runs a new process inside a running container — indispensable for debugging. docker cp copies files between your host and a container (running or stopped).

bash
# Open an interactive shell in a running container
docker exec -it my-app sh

# Run a one-off command
docker exec my-app cat /etc/os-release

# Run as a specific user
docker exec -u root my-app apt-get update

# Set environment variables for the exec session
docker exec -e DEBUG=true my-app node script.js

# Copy a file from container to host
docker cp my-app:/var/log/app.log ./app.log

# Copy a config file from host into the container
docker cp ./nginx.conf my-app:/etc/nginx/nginx.conf
Note

docker exec only works on running containers. If the container has exited, you must docker start it first or use docker cp (which works on stopped containers too) to extract files.

Restart Policies

Restart policies determine what Docker does when a container's main process exits. They are your first line of defense for keeping services up without external orchestration. Set them at creation time with --restart.

Policy | Behavior | Use Case
--- | --- | ---
no | Never restart (default) | One-off tasks, batch jobs
on-failure[:max] | Restart only on non-zero exit code, optional retry limit | Workers that might crash but shouldn't loop forever
always | Always restart, including after daemon restart | Critical services that must survive host reboots
unless-stopped | Like always, but not if manually stopped | Most production services; respects manual docker stop
bash
# Restart on crash, up to 5 attempts
docker run -d --restart on-failure:5 --name worker my-worker:latest

# Always restart (survives dockerd restart and host reboot)
docker run -d --restart always --name db postgres:16

# Recommended default for most services
docker run -d --restart unless-stopped --name api my-api:latest

# Update the restart policy on an existing container
docker update --restart unless-stopped my-app

Events, Waiting, and Exit Codes

docker events and docker wait

docker events streams real-time events from the Docker daemon — container starts, stops, health checks, image pulls, network connects, and more. It's invaluable for understanding what's happening across all containers. docker wait blocks until a container stops and then prints its exit code, making it perfect for scripting sequential workflows.

bash
# Stream all Docker events in real time
docker events

# Filter events for a specific container
docker events --filter container=my-app

# Only show container start/stop events since last hour
docker events --since 1h --filter event=start --filter event=stop

# Wait for a container to exit, then get its exit code
EXIT_CODE=$(docker wait my-batch-job)
echo "Job finished with exit code: $EXIT_CODE"

# Use in CI: run tests, fail pipeline on non-zero exit
docker run --name tests my-app:test npm test
docker wait tests
EXIT_CODE=$(docker inspect --format '{{.State.ExitCode}}' tests)
exit $EXIT_CODE

Understanding Exit Codes

The exit code tells you how a container stopped. A handful of codes cover the vast majority of cases you'll encounter.

Exit Code | Meaning | Cause
--- | --- | ---
0 | Success | Process completed normally
1 | Application error | Unhandled exception, misconfiguration, general failure
137 | Killed (SIGKILL, 128+9) | OOM killer, forced termination, or docker stop after grace period timeout
143 | Terminated (SIGTERM, 128+15) | Graceful docker stop; the process caught SIGTERM and exited
126 | Command not executable | Permission denied on the entrypoint binary
127 | Command not found | Entrypoint or CMD binary doesn't exist in the image
bash
# Check why a container stopped
docker inspect --format '{{.State.ExitCode}}' my-app
# 137 → likely OOM. Check with:
docker inspect --format '{{.State.OOMKilled}}' my-app

# docker stop sends SIGTERM, waits 10s (default), then SIGKILL
docker stop my-app              # graceful: exit 143 if handled
docker stop -t 30 my-app        # give 30 seconds grace period
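
The 137 and 143 values come from the shell convention that a process killed by signal N exits with code 128+N, which you can verify without Docker at all:

```shell
# Verify the 128+N convention: SIGTERM is signal 15, so a process
# killed by it reports exit code 128 + 15 = 143.
sleep 5 &
pid=$!
kill -TERM "$pid"       # send SIGTERM
code=0
wait "$pid" || code=$?  # wait reports the child's exit status
echo "$code"            # prints 143
```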
Tip

If your containers consistently exit with 137 and OOMKilled: true, increase the memory limit with --memory. If they exit 137 but OOMKilled: false, something external sent SIGKILL — check docker events to find the culprit.

Putting It All Together

Here's a realistic workflow that exercises the full container lifecycle — creating a container, inspecting it, debugging an issue, and cleaning up.

bash
# 1. Start a service
docker run -d \
  --name api \
  --restart unless-stopped \
  -p 3000:3000 \
  -e NODE_ENV=production \
  --env-file .env.production \
  my-api:2.1.0

# 2. Verify it's running and healthy
docker ps --filter name=api
docker logs --tail 20 api

# 3. Check resource usage
docker stats --no-stream api

# 4. Something looks wrong — debug it
docker exec -it api sh
# Inside: check files, env, connectivity, then exit

# 5. Pull a log file out for analysis
docker cp api:/app/logs/error.log ./error.log

# 6. See what files changed since image was built
docker diff api

# 7. Gracefully stop and remove
docker stop api
docker rm api

# Or force-remove a running container in one step
docker rm -f api

Docker Networking: Bridge, Host, Overlay, and DNS

Every container you run needs a way to communicate — with other containers, with the host, and with the outside world. Docker's networking subsystem is pluggable, built on Linux kernel primitives like network namespaces, veth pairs, iptables, and virtual bridges. Understanding how these pieces connect is the difference between "it works" and knowing why it works.

Docker ships with several built-in network drivers. Each one represents a different trade-off between isolation, performance, and simplicity. Let's walk through each driver, then cover DNS, port publishing, and troubleshooting.

graph TB
    subgraph Host["Docker Host (Linux Kernel)"]
        direction TB
        bridge0["docker0 bridge\nip: 172.17.0.1/16"]
        custom_br["my-net bridge\nip: 172.20.0.1/24"]
        dns["Embedded DNS\n127.0.0.11"]

        subgraph BridgeNet["Default Bridge Network"]
            direction LR
            c1["container-A\n172.17.0.2"]
            c2["container-B\n172.17.0.3"]
        end

        subgraph CustomNet["User-Defined Bridge my-net"]
            direction LR
            c3["api-server\n172.20.0.2"]
            c4["redis\n172.20.0.3"]
        end

        subgraph HostNet["Host Network"]
            c5["nginx\nshares host IP"]
        end

        c1 ---|"veth pair"| bridge0
        c2 ---|"veth pair"| bridge0
        c3 ---|"veth pair"| custom_br
        c4 ---|"veth pair"| custom_br

        bridge0 ---|"iptables NAT"| eth0["eth0 Host NIC"]
        custom_br ---|"iptables NAT"| eth0
        c5 ---|"bypasses namespace"| eth0

        c3 -.-|"DNS: redis to 172.20.0.3"| dns
        c4 -.-|"DNS: api-server to 172.20.0.2"| dns
    end

    eth0 --- internet["External Network"]
    

The Default Bridge Network

When you install Docker, it creates a Linux bridge called docker0 with a default subnet (typically 172.17.0.0/16). Every container that starts without a --network flag connects here. Docker creates a veth pair — one end goes inside the container's network namespace as eth0, the other end attaches to the docker0 bridge on the host.

Outbound traffic from containers is NATed via iptables MASQUERADE rules so it appears to originate from the host's IP. Containers on the default bridge can reach each other by IP address, but they do not get automatic DNS resolution by name.

bash
# Run two containers on the default bridge
docker run -d --name box1 alpine sleep 3600
docker run -d --name box2 alpine sleep 3600

# They can ping each other by IP, but NOT by name
docker exec box1 ping -c 2 172.17.0.3   # box2's IP, found via docker inspect
# Works

docker exec box1 ping -c 2 box2
# Fails — no DNS on the default bridge
Avoid the default bridge for real workloads

The default bridge network is a legacy artifact. It doesn't provide automatic DNS, you can't connect/disconnect containers at runtime, and all containers share a single unscoped network. Always create user-defined bridge networks instead.

User-Defined Bridge Networks

A user-defined bridge is Docker's recommended approach for single-host container networking. It works the same way under the hood — Linux bridge, veth pairs, iptables NAT — but adds critical features that the default bridge lacks. The most important one: automatic DNS resolution between containers by name and alias.

Containers on a user-defined bridge are also fully isolated from containers on other networks, and you can attach or detach running containers on the fly.

bash
# Create a user-defined bridge with a custom subnet
docker network create \
  --driver bridge \
  --subnet 172.20.0.0/24 \
  --gateway 172.20.0.1 \
  my-app-net

# Run containers on it — they resolve each other by name
docker run -d --name api --network my-app-net nginx:alpine
docker run -d --name cache --network my-app-net redis:alpine

# DNS just works
docker exec api ping -c 2 cache
# PING cache (172.20.0.3): 56 data bytes — resolves automatically

# Attach a running container to a second network
docker network connect my-app-net box1

Host Network Mode

With --network host, Docker skips creating a network namespace entirely. The container shares the host's network stack — same interfaces, same IP, same port space. There's no NAT overhead and no veth pair, so you get bare-metal network performance.

The trade-off is obvious: no port isolation. If your container binds to port 80, it occupies port 80 on the host. Two containers can't use the same port. This mode is useful for performance-sensitive applications (high-throughput proxies, monitoring agents) or when you need to access host network services directly.

bash
# Container uses the host's network directly — no -p flag needed
docker run -d --name web --network host nginx:alpine

# nginx is now reachable on host_ip:80
curl localhost:80

None and Macvlan

--network none gives a container a network namespace with only a loopback interface. No external connectivity at all. This is useful for batch jobs that process local files, or for security-hardened containers that should never touch the network.

Macvlan assigns a real MAC address to each container, making it appear as a physical device on your LAN. Containers get IPs from your physical network's DHCP server (or you assign them statically). This is essential when you need containers to be directly addressable on the LAN — for example, migrating legacy apps that expect to sit on a flat L2 network.

bash
# Create a macvlan network tied to the host's eth0
docker network create \
  --driver macvlan \
  --subnet 192.168.1.0/24 \
  --gateway 192.168.1.1 \
  -o parent=eth0 \
  lan-net

# Container gets a real LAN IP
docker run -d --name legacy-app --network lan-net \
  --ip 192.168.1.50 my-legacy-image

Overlay Networks (Swarm & Multi-Host)

When containers span multiple Docker hosts, you need overlay networking. Overlay networks use VXLAN tunneling to encapsulate Layer 2 frames inside UDP packets, creating a virtual network that spans across physical hosts. Docker Swarm manages overlay networks natively — no external tooling required.

Every Swarm service placed on an overlay network can reach every other service by name, regardless of which physical node the tasks run on. Under the hood, Docker maintains a distributed key-value store (via Raft consensus) that maps container IPs to host IPs for the VXLAN tunnel endpoints.

bash
# Create an overlay network (requires Swarm mode)
docker network create --driver overlay --attachable backend-net

# Deploy services — they resolve each other across nodes
docker service create --name api --network backend-net my-api:latest
docker service create --name db --network backend-net postgres:16

# From inside the api container, "db" resolves to the VIP
# regardless of which Swarm node it's running on

Network Driver Comparison

| Driver | Scope | DNS | Isolation | Performance | Use Case |
|---|---|---|---|---|---|
| bridge (default) | Single host | No auto DNS | Network namespace | Good (veth + NAT) | Quick tests, throwaway containers |
| bridge (user-defined) | Single host | Yes | Namespace + scoped | Good (veth + NAT) | Most single-host workloads |
| host | Single host | N/A | None — shares host | Best (no overhead) | High-throughput, monitoring |
| none | Single host | N/A | Complete | N/A | Offline / security-hardened |
| overlay | Multi-host | Yes | Namespace + VXLAN | Good (VXLAN encap cost) | Swarm services, cross-node |
| macvlan | Single host | Yes (user-defined) | MAC-level | Excellent | Legacy apps, flat L2 LAN |

Port Publishing

Containers on bridge networks are isolated behind NAT. To make a service reachable from outside the host, you publish ports with -p. Docker sets up iptables DNAT rules (and a docker-proxy userland process) to forward traffic from the host port to the container's IP and port.

bash
# Explicit mapping: host 8080 -> container 80
docker run -d -p 8080:80 nginx:alpine

# Bind to a specific host interface only
docker run -d -p 127.0.0.1:8080:80 nginx:alpine

# Publish all EXPOSEd ports to random host ports
docker run -d -P nginx:alpine

# UDP port publishing
docker run -d -p 5514:514/udp syslog-server

# Verify the iptables rules Docker created
sudo iptables -t nat -L DOCKER -n -v

Docker's Embedded DNS Server (127.0.0.11)

Every container on a user-defined network gets its /etc/resolv.conf pointed at 127.0.0.11 — Docker's built-in DNS server. This server resolves container names, service names, and network aliases to internal IPs. For any name it can't resolve internally, it forwards the query to the DNS servers configured on the host (or those you specify with --dns).

This is what makes service discovery work without any external tool. When you do ping redis from another container on the same user-defined network, the query hits 127.0.0.11, Docker's DNS looks up the container named redis on that network, and returns its IP.

bash
# Inspect the embedded DNS from inside a container
docker run --rm --network my-app-net alpine cat /etc/resolv.conf
# nameserver 127.0.0.11

# Use dig to query Docker's DNS
docker run --rm --network my-app-net alpine \
  sh -c "apk add --no-cache bind-tools && dig cache"
# ;; ANSWER SECTION:
# cache.   600   IN   A   172.20.0.3

# Add a network alias — one container, multiple DNS names
docker run -d --network my-app-net --network-alias db \
  --network-alias primary-db --name postgres postgres:16
DNS round-robin for scaling

Multiple containers can share the same --network-alias. Docker's DNS returns all matching IPs in round-robin order. This gives you basic client-side load balancing without a proxy — though it's limited because DNS results are often cached by the client.

IPv6 and Custom Subnets

Docker supports dual-stack (IPv4 + IPv6) networking but it's not enabled by default. You need to explicitly opt in at both the daemon level and the network level. Once enabled, containers receive both an IPv4 and IPv6 address and can communicate over either protocol.

json
{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:dead:beef::/48",
  "default-address-pools": [
    { "base": "10.10.0.0/16", "size": 24 },
    { "base": "fd00:db8::/104", "size": 112 }
  ]
}
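As a sanity check on the pool sizing above: a /16 base split into /24 networks yields 2^(24−16) = 256 possible networks before Docker runs out of address space for new bridges. The arithmetic, in plain shell:

```shell
# Number of /24 subnets that fit in a /16 base: 2^(24 - 16)
base_prefix=16
pool_size=24
echo $(( 1 << (pool_size - base_prefix) ))   # prints 256
```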
bash
# Create a dual-stack network
docker network create \
  --ipv6 \
  --subnet 172.28.0.0/24 \
  --subnet fd00:cafe::/64 \
  dual-stack-net

# Verify both addresses are assigned
docker run --rm --network dual-stack-net alpine ip addr show eth0
# inet 172.28.0.2/24 ...
# inet6 fd00:cafe::2/64 ...

Network Troubleshooting with Netshoot

When networking breaks, you need tools — but production container images are (rightly) minimal. The nicolaka/netshoot image packs every network diagnostic tool you'd want: tcpdump, dig, nslookup, curl, iperf, netstat, ss, ip, bridge, and more. You can either join a container's network namespace or attach to a Docker network to diagnose from the inside.

bash
# Join a running container's network namespace to debug it
docker run --rm -it --network container:api nicolaka/netshoot

# Now you share the exact same network stack as the "api" container
ss -tlnp                    # Check listening ports
dig cache                   # Test DNS resolution
nc -zv cache 6379           # Test TCP connectivity to the cache (Redis doesn't speak HTTP)
tcpdump -i eth0 port 80     # Capture traffic

# Or attach to a network to test from a fresh perspective
docker run --rm -it --network my-app-net nicolaka/netshoot
nslookup api 127.0.0.11    # Query Docker's embedded DNS directly
traceroute api              # Trace the route between networks
Quick inspection without netshoot

For a faster check, docker network inspect my-app-net shows all connected containers, their IPs, subnet configuration, and driver options — all without running a separate debug container.

Practical Mental Model

Here's how to think about Docker networking decisions in practice:

  • Single app with multiple containers? — Use one user-defined bridge. Containers talk by name.
  • Need true LAN presence? — Use macvlan. Containers get real MAC/IP on the physical network.
  • Multi-host cluster? — Use overlay with Swarm (or a CNI plugin with Kubernetes).
  • Maximum network performance? — Use host mode and accept port conflicts.
  • Security-critical batch job? — Use none to eliminate the network attack surface entirely.
  • Debugging connectivity? — Attach netshoot to the target container's namespace with --network container:<name>.

Docker Volumes and Storage: Bind Mounts, Named Volumes, and tmpfs

Every Docker container gets its own writable layer on top of the image's read-only layers. When the container is removed, that writable layer is gone — permanently. Any database rows, uploaded files, or application state written inside the container vanish with it.

This is Docker's ephemeral storage problem. It's by design — containers are meant to be disposable. But real applications need durable data. Docker solves this with mounts: mechanisms that let a container read and write to storage that lives outside the container's writable layer.

Mount Types at a Glance

Docker provides three distinct mount types. Each places data in a different location on the host and serves a different purpose. The diagram below shows how they relate to the container and the host filesystem.

mermaid
flowchart LR
    subgraph Host["Docker Host"]
        direction TB
        VolumeArea["/var/lib/docker/volumes/\n(Managed by Docker)"]
        HostDir["/home/user/project\n(Any host path)"]
        RAM["Host Memory - RAM"]
    end

    subgraph Container["Container"]
        direction TB
        Writable["Writable Layer\n(ephemeral - dies with container)"]
        MountNV["/var/lib/postgresql/data"]
        MountBM["/app/src"]
        MountTmp["/tmp/secrets"]
    end

    VolumeArea -- "Named Volume" --> MountNV
    HostDir -- "Bind Mount" --> MountBM
    RAM -- "tmpfs Mount" --> MountTmp

    style Writable fill:#ff6b6b,stroke:#c0392b,color:#fff
    style VolumeArea fill:#51cf66,stroke:#2f9e44,color:#fff
    style HostDir fill:#339af0,stroke:#1971c2,color:#fff
    style RAM fill:#fcc419,stroke:#e67700,color:#000
    
| Mount Type | Stored At | Managed By | Best For |
|---|---|---|---|
| Named Volume | /var/lib/docker/volumes/ | Docker Engine | Databases, persistent app data |
| Bind Mount | Any host path you choose | You (the host OS) | Source code in development, config files |
| tmpfs | Host memory (RAM) only | Kernel | Secrets, sensitive caches, scratch space |

Named Volumes

Named volumes are Docker's recommended storage mechanism for persistent data. Docker creates a directory under /var/lib/docker/volumes/<name>/_data and manages it entirely. You don't need to know or care about the exact host path — Docker handles creation, mounting, and cleanup.

When you mount a named volume into a container for the first time and the volume is empty, Docker copies the image's contents at that mount point into the volume. This "copy-on-first-use" behavior means images that ship default files at the mount path — configuration, seed data — work out of the box: the volume starts populated rather than empty.

bash
# Create a named volume explicitly
docker volume create pgdata

# Run Postgres with the named volume
docker run -d \
  --name db \
  -v pgdata:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=secret \
  postgres:16

# The data survives container removal
docker rm -f db
docker run -d --name db2 -v pgdata:/var/lib/postgresql/data -e POSTGRES_PASSWORD=secret postgres:16
# All databases and tables are still there

Bind Mounts

Bind mounts map a specific host directory or file directly into the container. Unlike named volumes, Docker does not manage the lifecycle — and if the host path doesn't exist, the -v flag silently creates it as an empty directory (which is often not what you want; --mount errors out instead). Bind mounts give full control but come with more responsibility.

The primary use case is development: you mount your source code into the container so changes on the host are immediately visible inside the container without rebuilding the image.

bash
# Mount current directory into /app/src in the container
docker run -d \
  --name devserver \
  -v "$(pwd)":/app/src \
  -w /app/src \
  node:20 \
  npm run dev
Bind mounts override image contents

Unlike named volumes, bind mounts do not copy image data into the mount. If your image has files at /app and you bind-mount an empty host directory there, the container sees an empty /app. This is the #1 cause of "file not found" errors in development setups.

tmpfs Mounts

A tmpfs mount stores data in the host's memory only — nothing is written to disk. When the container stops, the data is gone. This is ideal for sensitive data like API tokens or session keys that should never touch a filesystem, and for high-throughput scratch data where disk I/O would be a bottleneck.

bash
# tmpfs mount - 64MB in-memory filesystem at /tmp/secrets
docker run -d \
  --name app \
  --tmpfs /tmp/secrets:rw,size=67108864 \
  myapp:latest
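The size option takes a byte count; 67108864 is just 64 MiB. Rather than hard-coding magic numbers in scripts, compute them — a one-line sanity check:

```shell
# 64 MiB expressed in bytes - the value passed to --tmpfs size=
mib=64
echo $(( mib * 1024 * 1024 ))   # prints 67108864
```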

The --mount vs -v Syntax

Docker offers two syntaxes for attaching storage. The -v (or --volume) flag is older and more compact. The --mount flag is newer, more explicit, and the recommended choice for anything beyond simple cases. The key behavioral difference: -v auto-creates missing host directories for bind mounts, while --mount throws an error — which is usually what you want, because a silent auto-created empty directory is a debugging nightmare.

bash
# Named volume
docker run -d \
  --mount type=volume,source=pgdata,target=/var/lib/postgresql/data \
  postgres:16

# Bind mount
docker run -d \
  --mount type=bind,source="$(pwd)"/src,target=/app/src \
  node:20

# tmpfs
docker run -d \
  --mount type=tmpfs,target=/tmp/cache,tmpfs-size=104857600 \
  myapp:latest

# Read-only bind mount
docker run -d \
  --mount type=bind,source="$(pwd)"/config,target=/etc/app,readonly \
  myapp:latest
bash
# Named volume - name:container_path
docker run -d \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16

# Bind mount - /host/path:container_path
docker run -d \
  -v "$(pwd)"/src:/app/src \
  node:20

# Read-only bind mount - append :ro
docker run -d \
  -v "$(pwd)"/config:/etc/app:ro \
  myapp:latest

# Anonymous volume (no name - Docker assigns a random hash)
docker run -d \
  -v /var/lib/postgresql/data \
  postgres:16
Prefer --mount in scripts and Dockerfiles

The -v flag silently creates host directories that don't exist — which masks typos and misconfigurations. --mount fails fast with a clear error. Use -v for quick terminal one-liners; use --mount in anything checked into version control.

Volume Drivers

By default, Docker volumes use the local driver, which stores data on the host filesystem. Volume drivers extend this to support remote storage — NFS shares, cloud block storage, distributed filesystems, and more. You specify the driver when creating a volume.

bash
# Create a volume backed by an NFS share using the local driver
docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=192.168.1.100,rw,nfsvers=4 \
  --opt device=:/exports/data \
  nfs_data

# Use it like any other named volume
docker run -d --mount source=nfs_data,target=/data myapp:latest

In Docker Swarm and Kubernetes environments, volume drivers become essential for making data accessible across multiple nodes. Common third-party drivers include REX-Ray (AWS EBS, Azure Disk), NetApp Trident, and Portworx.

Practical Patterns

Database Volumes

Always use a named volume for database data directories. This decouples the data lifecycle from the container lifecycle — you can move to a patched image (say, postgres:16.2 to 16.3) by stopping the old container and starting a new one with the same volume. (Major-version upgrades, e.g. 16 to 17, still require pg_upgrade or a dump/restore, because the on-disk format changes between major releases.)

yaml
# docker-compose.yml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro

volumes:
  pgdata:  # Docker manages this volume

Sharing Data Between Containers

Multiple containers can mount the same named volume. A common pattern is having one container generate files (e.g., a static site builder) and another serve them (e.g., Nginx).

bash
# Builder writes HTML to the shared volume
docker run --rm -v site_content:/output builder:latest

# Nginx serves from the same volume (read-only)
docker run -d \
  --mount type=volume,source=site_content,target=/usr/share/nginx/html,readonly \
  -p 8080:80 \
  nginx:alpine

Read-Only Mounts

Appending :ro (with -v) or readonly (with --mount) prevents the container from writing to the mount. Use this for configuration files and secrets — it enforces the principle of least privilege and protects against accidental modification.

bash
# Mount config as read-only, but keep the data volume writable
docker run -d \
  --mount type=bind,source="$(pwd)"/nginx.conf,target=/etc/nginx/nginx.conf,readonly \
  --mount type=volume,source=app_logs,target=/var/log/nginx \
  -p 80:80 \
  nginx:alpine

UID/GID Permission Gotchas

This is the single most common Docker volume pain point. When a container process runs as a non-root user (say, UID 1000), it needs matching permissions on the mounted directory. But named volumes owned by root and bind mounts with the host user's UID may not align with the container's user.

dockerfile
# Fix permissions in the Dockerfile - the right approach
FROM node:20-slim

# Create app user with a known UID
RUN groupadd -g 1001 appgroup && \
    useradd -u 1001 -g appgroup -m appuser

# Create and own the data directory BEFORE switching user
RUN mkdir -p /app/data && chown appuser:appgroup /app/data

USER appuser
WORKDIR /app
bash
# For bind mounts, ensure the host directory matches the container's UID
mkdir -p ./data
chown 1001:1001 ./data
docker run -d -v "$(pwd)"/data:/app/data myapp:latest

# Quick debug: check what UID the container process is using
docker exec mycontainer id
# uid=1001(appuser) gid=1001(appgroup) groups=1001(appgroup)
Linux vs macOS volume permissions

On Linux, UID/GID mapping is direct — the container's UID must match file ownership on the host. On macOS (Docker Desktop), Docker runs in a Linux VM with automatic UID translation, so permission issues are less common in development. Don't let macOS lull you into ignoring permissions — they will bite in production on Linux.

Volume Lifecycle Commands

Docker provides a complete set of CLI commands for managing volumes independently of containers. Volumes are first-class objects with their own lifecycle.

bash
# Create a named volume
docker volume create mydata

# List all volumes
docker volume ls

# Inspect volume metadata (driver, mount point, labels)
docker volume inspect mydata

# Remove a specific volume (fails if a container is using it)
docker volume rm mydata

# Remove unused anonymous volumes (pre-Docker-23 engines remove all unused volumes)
docker volume prune

# Nuclear option: remove ALL unused volumes, named ones included
docker volume prune -a --force
| Command | What It Does | Safe in Production? |
|---|---|---|
| docker volume ls | Lists all volumes | ✅ Read-only |
| docker volume inspect <name> | Shows JSON metadata | ✅ Read-only |
| docker volume create <name> | Creates an empty volume | ✅ Non-destructive |
| docker volume rm <name> | Deletes a specific volume | ⚠️ Verify first |
| docker volume prune | Deletes all unused volumes | ❌ Dangerous |

Backup and Restore Strategies

Docker volumes don't have a built-in backup command. The standard approach is to mount the volume into a temporary container that runs a backup tool like tar and writes the archive to a bind mount on the host.

  1. Back up a named volume to a tar archive

    Spin up a throwaway container that mounts both the volume (as source) and a host directory (as destination), then tar the data.

    bash
    # Backup: volume -> tar file on host
    docker run --rm \
      -v pgdata:/source:ro \
      -v "$(pwd)"/backups:/backup \
      alpine \
      tar czf /backup/pgdata-$(date +%Y%m%d).tar.gz -C /source .
  2. Restore a tar archive into a volume

    Create a fresh volume (or use the existing one), then extract the archive into it.

    bash
    # Restore: tar file on host -> volume
    docker volume create pgdata_restored
    
    docker run --rm \
      -v pgdata_restored:/target \
      -v "$(pwd)"/backups:/backup:ro \
      alpine \
      tar xzf /backup/pgdata-20240115.tar.gz -C /target
  3. For databases, prefer logical backups

    File-level backups of a running database risk corruption. Use the database's own dump tool instead.

    bash
    # pg_dump while the container is running - consistent logical backup
    docker exec db pg_dumpall -U postgres > backups/full-dump.sql
    
    # Restore into a new container
    cat backups/full-dump.sql | docker exec -i db2 psql -U postgres
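The dated archives from step 1 accumulate over time. A minimal rotation sketch — assuming a GNU userland; the pgdata- prefix, keep-count, and the throwaway temp directory are illustrative:

```shell
# Keep only the N newest pgdata archives in a backup directory.
# Demonstrated against a temp dir seeded with fake archives.
backup_dir=$(mktemp -d)
keep=2
for d in 20240101 20240102 20240103 20240104; do
  touch "$backup_dir/pgdata-$d.tar.gz"
done

# Sort newest-first (YYYYMMDD dates sort lexically), skip the first $keep, delete the rest
ls -1 "$backup_dir"/pgdata-*.tar.gz | sort -r | tail -n +"$((keep + 1))" | xargs -r rm --

ls -1 "$backup_dir" | wc -l   # the two newest archives remain
```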

Resource Limits and Runtime Constraints

Every container shares the host kernel, which means an unconstrained container can monopolize CPU, exhaust memory, or fork-bomb the entire machine. Docker exposes Linux cgroups and security modules as simple CLI flags so you can enforce hard boundaries on what each container is allowed to consume and do.

This section covers the full spectrum: memory, CPU, PID, and I/O limits; live updates via docker update; monitoring with docker stats; and the security subsystem — Linux capabilities, seccomp, AppArmor, and related flags.

Memory Limits

Docker provides four memory-related flags that map directly to cgroup memory controllers. Understanding the difference between a hard limit and a soft reservation is critical — the former terminates your container via the OOM mechanism, the latter merely signals the kernel to reclaim memory under pressure.

| Flag | Effect | Default |
|---|---|---|
| --memory (-m) | Hard memory limit. Container is OOM-terminated if it exceeds this. | Unlimited |
| --memory-reservation | Soft limit. Kernel tries to reclaim memory when host is under pressure. | Unlimited |
| --memory-swap | Total memory + swap allowed. Set equal to --memory to disable swap. | 2× memory limit |
| --oom-kill-disable | Prevents the OOM mechanism from terminating this container. | false |
bash
# Hard limit of 512MB, no swap, soft reservation of 256MB
docker run -d \
  --memory=512m \
  --memory-swap=512m \
  --memory-reservation=256m \
  --name my-app nginx

# Verify the applied limits
docker inspect --format='{{.HostConfig.Memory}}' my-app
# Output: 536870912 (bytes = 512MB)
Warning

Using --oom-kill-disable without a --memory limit is dangerous. If the container consumes all host memory, the kernel OOM mechanism will target other processes on the host instead — potentially including system-critical ones.
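One subtlety from the table: the default --memory-swap is twice --memory, which means the container gets swap equal to its memory limit. Docker parses suffixes like 512m as binary units (m = 1024²), so the arithmetic in bytes works out as:

```shell
# --memory=512m with the default --memory-swap (2x the memory limit)
mem=$(( 512 * 1024 * 1024 ))     # 536870912 bytes, as docker inspect reports it
swap_total=$(( 2 * mem ))        # default memory+swap ceiling
echo $(( swap_total - mem ))     # actual swap allowed: prints 536870912
```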

CPU Limits

CPU constraints come in two flavors: hard caps (the container physically cannot use more) and relative weights (shares are only relevant when CPUs are contested). Choosing the right approach depends on whether you need guaranteed throughput or fair scheduling under contention.

| Flag | Mechanism | When to Use |
|---|---|---|
| --cpus | Hard cap on CPU cores (e.g., 1.5 = 1.5 cores max). | Predictable upper bound for any workload. |
| --cpu-shares | Relative weight (default 1024). Only enforced under contention. | Fair sharing across many containers. |
| --cpuset-cpus | Pins container to specific cores (e.g., "0,2" or "0-3"). | NUMA-aware or latency-sensitive apps. |
bash
# Cap at 2 CPU cores, pinned to cores 0 and 1
docker run -d \
  --cpus=2 \
  --cpuset-cpus="0,1" \
  --name worker my-worker-image

# Give a background job lower priority (half the default shares)
docker run -d \
  --cpu-shares=512 \
  --name background-job my-batch-image

The --cpus flag is syntactic sugar for the older --cpu-period / --cpu-quota pair. Setting --cpus=1.5 is equivalent to --cpu-period=100000 --cpu-quota=150000 — meaning 150ms of CPU time per 100ms scheduling period.
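The translation is simple arithmetic — quota = cpus × period. A quick check with awk, holding the period at Docker's default 100000 µs:

```shell
# CFS quota (microseconds per period) corresponding to --cpus=1.5
awk -v cpus=1.5 -v period=100000 'BEGIN { printf "%d\n", cpus * period }'   # prints 150000
```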

PID and I/O Limits

PID limits prevent a container from fork-bombing the host. The --pids-limit flag caps the total number of processes (including threads) inside the container’s PID namespace. A value of 100 is a reasonable default for most web services.

bash
# Limit to 200 processes
docker run -d --pids-limit=200 --name api my-api-image

# I/O throttling: limit read/write bandwidth and IOPS
docker run -d \
  --device-read-bps=/dev/sda:10mb \
  --device-write-bps=/dev/sda:10mb \
  --device-read-iops=/dev/sda:1000 \
  --device-write-iops=/dev/sda:1000 \
  --name io-limited my-app

I/O limits use the blkio cgroup controller. Note that these flags apply to direct device I/O only — buffered writes going through the page cache may not be throttled as expected. For cgroups v2, Docker uses the unified io.max controller, which provides more consistent behavior across filesystems.

Cgroups v2 Mapping

Modern Linux distributions (Ubuntu 22.04+, Fedora 31+, Debian 11+) use cgroups v2 by default. Docker Engine 20.10+ supports cgroups v2 natively. The CLI flags remain the same, but the underlying kernel controllers differ:

| Docker Flag | Cgroups v1 Controller | Cgroups v2 Controller |
|---|---|---|
| --memory | memory.limit_in_bytes | memory.max |
| --memory-reservation | memory.soft_limit_in_bytes | memory.low |
| --cpus | cpu.cfs_quota_us / cpu.cfs_period_us | cpu.max |
| --cpu-shares | cpu.shares | cpu.weight (1–10000 scale) |
| --pids-limit | pids.max | pids.max |
| --device-read-bps | blkio.throttle.read_bps_device | io.max (rbps field) |
bash
# Check if your host uses cgroups v2
stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" means v2, "tmpfs" means v1

# Inspect the actual cgroup files for a running container
CONTAINER_ID=$(docker inspect --format='{{.Id}}' my-app)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max

Live Updates with docker update

You don’t need to restart a container to change its resource limits. The docker update command modifies cgroup settings on the fly. This is invaluable during incident response — you can throttle a runaway container without downtime.

bash
# Double the memory limit on a running container
docker update --memory=1g --memory-swap=1g my-app

# Reduce CPU allocation during off-peak hours
docker update --cpus=0.5 my-app

# Update multiple containers at once
docker update --memory=256m --cpus=1 container1 container2 container3

Monitoring with docker stats

The docker stats command provides a live, top-like view of resource consumption per container. It reads directly from cgroup accounting files, so the numbers reflect exactly what the kernel is enforcing.

bash
# Live stats for all running containers
docker stats

# One-shot snapshot (useful for scripting)
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.PIDs}}"

# Monitor a specific container
docker stats my-app --format "{{.MemUsage}} / {{.MemPerc}}"

Linux Capabilities

Traditional Unix security has two states: unprivileged user or all-powerful root. Linux capabilities split root’s power into ~40 distinct privileges. Docker drops most of these by default, keeping only a minimal set needed for typical workloads. You can further tighten or selectively expand this set.

bash
# Drop ALL capabilities, then add back only what's needed
docker run -d \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --name secure-web nginx

# A container that needs to modify network interfaces
docker run -d \
  --cap-drop=ALL \
  --cap-add=NET_ADMIN \
  --cap-add=NET_RAW \
  --name net-tool nicolaka/netshoot

# Check which capabilities a running container has
docker inspect --format='{{.HostConfig.CapAdd}}' secure-web
docker inspect --format='{{.HostConfig.CapDrop}}' secure-web

Common capabilities you might need to add back:

  • NET_BIND_SERVICE — bind to ports below 1024
  • NET_ADMIN — modify routing tables, network interfaces
  • SYS_PTRACE — debugging tools like strace, gdb
  • SYS_ADMIN — mount filesystems, use bpf() (broad — avoid if possible)
  • DAC_OVERRIDE — bypass file read/write permission checks
Caution

--privileged gives the container all capabilities, access to all devices, and disables seccomp/AppArmor. It effectively removes the isolation boundary. Use it only for DinD (Docker-in-Docker) or system-level tooling during development — never in production.

Seccomp, AppArmor, and Security Options

Capabilities control which privileged operations are allowed. Seccomp goes deeper — it filters which system calls the container process can make at all. Docker applies a default seccomp profile that blocks approximately 44 of the 300+ Linux syscalls, including dangerous ones like reboot, kexec_load, and mount.

bash
# Run with a custom seccomp profile
docker run -d \
  --security-opt seccomp=./custom-seccomp.json \
  --name locked-down my-app

# Disable seccomp entirely (not recommended)
docker run -d \
  --security-opt seccomp=unconfined \
  --name debug-container my-app

# Use a specific AppArmor profile
docker run -d \
  --security-opt apparmor=my-custom-profile \
  --name armored-app my-app

Custom Seccomp Profile Structure

A seccomp profile is a JSON file that allows or denies specific syscalls. You start from Docker's default profile and modify it to fit your application.

json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "stat", "fstat",
                "mmap", "mprotect", "munmap", "brk", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 0, "value": 2114060288, "op": "SCMP_CMP_MASKED_EQ" }
      ]
    }
  ]
}
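The value in the clone rule is opaque in decimal; in hex it reads as a bitmask over clone(2)'s flag argument — in Docker's default profile this mask covers the CLONE_NEW* namespace flags, so unprivileged processes can't create new namespaces:

```shell
# The masked-equality value from the clone rule above, shown as hex
printf '0x%X\n' 2114060288   # prints 0x7E020000
```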

Read-Only Filesystem, Ulimits, and Other Flags

Beyond capabilities and seccomp, Docker offers several more hardening flags. A read-only root filesystem forces you to explicitly declare writable paths — an excellent defense against runtime tampering.

bash
# Read-only root filesystem with explicit tmpfs for writable paths
docker run -d \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --tmpfs /run:rw,noexec,nosuid \
  --name immutable-app my-app

# Set ulimits: max open files and max processes
docker run -d \
  --ulimit nofile=1024:2048 \
  --ulimit nproc=512:1024 \
  --name limited-app my-app

# Prevent the container from gaining new privileges
docker run -d \
  --security-opt no-new-privileges:true \
  --name hardened-app my-app

Putting It All Together: A Hardened Container

Here is what a production-grade container launch looks like when you combine resource limits with security constraints. Each flag serves a specific purpose — nothing is left to defaults.

bash
docker run -d \
  --name production-api \
  --memory=512m \
  --memory-swap=512m \
  --memory-reservation=256m \
  --cpus=2 \
  --pids-limit=200 \
  --ulimit nofile=1024:4096 \
  --ulimit nproc=256:512 \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt no-new-privileges:true \
  --security-opt seccomp=./seccomp-profile.json \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --restart=unless-stopped \
  --health-cmd="curl -f http://localhost:8080/health || exit 1" \
  --health-interval=30s \
  my-api:v2.1.0
Tip

Start restrictive and loosen as needed. Run your application, monitor it with docker stats, and only add capabilities or raise limits when you hit actual failures. The error messages (OOM events, EPERM on blocked syscalls, EAGAIN on fork when the PID limit is hit) will tell you exactly what to adjust.

Docker Compose: Defining Multi-Container Applications

Real applications are never a single container. A typical web project needs a reverse proxy, an API server, a database, and a cache — at minimum. Docker Compose lets you define all of these services, their networks, and their volumes in a single declarative YAML file, then manage the entire stack with one command.

Compose uses the Compose Specification, an open standard. If you've been writing version: "3.8" at the top of your files, you can stop — the version key is deprecated and ignored by modern Docker Compose. Just start with services:.

Compose V2 is the default

The standalone docker-compose binary (V1, Python-based) is deprecated. Docker CLI now ships Compose V2 as a plugin — use docker compose (with a space). All commands in this section use V2 syntax.

Service Topology Overview

Before diving into the YAML, here's the architecture we'll build throughout this section — an Nginx reverse proxy fronting an API server, backed by PostgreSQL and Redis, with proper network segmentation and persistent storage.

mermaid
graph LR
    subgraph frontend-net["frontend network"]
        NGINX["🌐 nginx\n(reverse proxy)\nport 80:80"]
        API["⚙️ api\n(Node.js app)\nport 3000"]
    end

    subgraph backend-net["backend network"]
        API2["⚙️ api"]
        PG["🐘 postgres\nport 5432"]
        REDIS["⚡ redis\nport 6379"]
    end

    NGINX -->|"proxy_pass"| API
    API2 -->|"queries"| PG
    API2 -->|"sessions/cache"| REDIS

    PG ---|"pg-data volume"| PGV[("pg-data")]
    REDIS ---|"redis-data volume"| RDV[("redis-data")]
    

Defining Services

Each service in a Compose file maps to a container. You specify what image to use (or how to build one), what environment it needs, which ports to expose, and how it relates to other services. Here's the core anatomy of a service definition.

Image vs Build

You can pull a pre-built image or build from a Dockerfile. Use image for off-the-shelf services (databases, caches) and build for your own application code.

yaml
services:
  # Using a pre-built image
  redis:
    image: redis:7-alpine

  # Building from a Dockerfile
  api:
    build:
      context: ./api
      dockerfile: Dockerfile
      args:
        NODE_ENV: production
    image: myapp/api:latest   # tags the built image

Command, Environment, and Env Files

Override the default container command with command or entrypoint. Pass configuration via environment for inline values or env_file to load from a file. The env_file approach keeps secrets out of your Compose file.

yaml
services:
  api:
    build: ./api
    command: ["node", "src/server.js"]
    environment:
      NODE_ENV: production
      REDIS_URL: redis://redis:6379
    env_file:
      - ./api/.env          # DB credentials, API keys, etc.

Ports, Volumes, and Restart Policies

Map host ports with ports, persist data with volumes, and keep services alive with restart. For volumes, you can use named volumes (managed by Docker) or bind mounts (host paths).

yaml
services:
  postgres:
    image: postgres:16-alpine
    ports:
      - "5432:5432"           # host:container
    volumes:
      - pg-data:/var/lib/postgresql/data   # named volume
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql  # bind mount
    restart: unless-stopped   # survives daemon restarts

volumes:
  pg-data:                    # declared at top level

Healthchecks and depends_on with Conditions

A running container isn't necessarily a ready container. Postgres might accept TCP connections before it's actually ready to serve queries. Healthchecks solve this. Combine them with depends_on conditions to enforce proper startup ordering.

yaml
services:
  postgres:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  api:
    build: ./api
    depends_on:
      postgres:
        condition: service_healthy   # waits for healthcheck
      redis:
        condition: service_healthy

Without the condition: service_healthy clause, depends_on only waits for the container to start, not to become ready. The three conditions are service_started (the default), service_healthy, and service_completed_successfully (useful for init/migration containers).
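As a sketch of the third condition, here is a hypothetical one-shot migration service (the npm run migrate script is an assumption) that the API waits on before starting:

```yaml
services:
  migrate:
    build: ./api
    command: ["npm", "run", "migrate"]   # hypothetical migration script
    depends_on:
      postgres:
        condition: service_healthy

  api:
    build: ./api
    depends_on:
      migrate:
        condition: service_completed_successfully   # migration ran and exited 0
```

Compose starts the api only after the migrate container has exited with status 0, which is exactly the ordering you want for schema changes.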

Networks and Volumes

By default, Compose creates a single network for all services. This is fine for simple projects but breaks the principle of least privilege. Defining custom networks lets you isolate traffic — your database shouldn't be reachable from the reverse proxy.

yaml
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge

volumes:
  pg-data:
  redis-data:

services:
  nginx:
    networks: [frontend]          # only frontend
  api:
    networks: [frontend, backend] # bridges both
  postgres:
    networks: [backend]           # only backend
  redis:
    networks: [backend]           # only backend

Services on the same network can reach each other by service name as the hostname. In this setup, nginx can resolve api but cannot reach postgres or redis directly. The api service, connected to both networks, acts as the bridge.
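The DNS name a service gets can also be extended with network-scoped aliases. A minimal fragment, assuming you want postgres reachable as db on the backend network:

```yaml
services:
  postgres:
    networks:
      backend:
        aliases:
          - db        # resolvable as both "postgres" and "db" on this network
```

Aliases are per-network, so the same container can present different names to different groups of services.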

The Complete Stack

Putting it all together — here's a production-ready Compose file for our Nginx + API + Postgres + Redis stack. This is the kind of file you'd actually commit to a project repository.

yaml
# compose.yaml — no "version" key needed
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      api:
        condition: service_healthy
    networks:
      - frontend
    restart: unless-stopped

  api:
    build:
      context: ./api
      dockerfile: Dockerfile
    command: ["node", "src/server.js"]
    env_file: ./api/.env
    environment:
      DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/myapp
      REDIS_URL: redis://redis:6379
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - frontend
      - backend
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    volumes:
      - pg-data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    networks:
      - backend
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: ["redis-server", "--appendonly", "yes"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    volumes:
      - redis-data:/data
    networks:
      - backend
    restart: unless-stopped

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge

volumes:
  pg-data:
  redis-data:

CLI Commands

Compose's CLI is how you interact with the stack defined in your YAML. These commands operate on the entire project (all services) by default, or you can target individual services by name.

| Command | What It Does | Common Flags |
| --- | --- | --- |
| docker compose up -d | Create and start all services in detached mode | --build to rebuild images first |
| docker compose down | Stop and remove containers, networks | -v also removes named volumes |
| docker compose logs -f api | Tail logs for a specific service | --since 5m, --tail 100 |
| docker compose exec api sh | Run a command in a running container | -it for interactive terminal |
| docker compose ps | List containers and their status | -a includes stopped containers |
| docker compose build | Build or rebuild service images | --no-cache, --parallel |
| docker compose pull | Pull latest images for services | --ignore-pull-failures |

bash
bash
# Start the stack (rebuild if Dockerfiles changed)
docker compose up -d --build

# Check health status
docker compose ps

# Follow API logs
docker compose logs -f api

# Open a shell inside the API container
docker compose exec api sh

# Tear down everything including volumes
docker compose down -v

Variable Substitution with .env

Compose automatically reads a .env file in the project directory and makes those variables available for ${VAR} interpolation inside the Compose file. This is the standard way to handle per-environment configuration without modifying the YAML.

bash
# .env (in the same directory as compose.yaml)
POSTGRES_PASSWORD=supersecret
COMPOSE_PROJECT_NAME=myapp
API_IMAGE_TAG=1.4.2
yaml
# compose.yaml — variables resolved from .env
services:
  api:
    image: myapp/api:${API_IMAGE_TAG}
    environment:
      DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/myapp
  postgres:
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?error: set POSTGRES_PASSWORD}

The ${VAR:?error message} syntax makes Compose exit with an error if the variable is unset — a safety net for required secrets. You can also use ${VAR:-default} to provide a fallback value.
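Compose's ${VAR:-default} and ${VAR:?message} forms mirror POSIX shell parameter expansion, so you can preview their behavior in any shell before wiring them into the YAML:

```shell
#!/bin/sh
unset API_IMAGE_TAG POSTGRES_PASSWORD

# ${VAR:-default}: fall back when the variable is unset or empty
echo "tag: ${API_IMAGE_TAG:-latest}"

# ${VAR:?message}: fail the expansion when unset or empty
( : "${POSTGRES_PASSWORD:?must be set}" ) 2>/dev/null \
  || echo "refused: POSTGRES_PASSWORD is unset"
```

This prints tag: latest and then the refusal message, which is the same substitute-or-fail behavior Compose applies when reading your file.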

Profiles

Not every service should start every time. Debug tools, migration runners, or monitoring stacks are only needed sometimes. Profiles let you tag services and only start them when explicitly requested.

yaml
services:
  api:
    build: ./api            # no profiles → always starts

  postgres:
    image: postgres:16-alpine  # no profiles → always starts

  pgadmin:
    image: dpage/pgadmin4
    profiles: ["debug"]     # only starts with --profile debug
    ports:
      - "5050:80"

  migrate:
    build: ./api
    command: ["npm", "run", "migrate"]
    profiles: ["tools"]     # only starts with --profile tools
    depends_on:
      postgres:
        condition: service_healthy
bash
# Normal startup — only api and postgres
docker compose up -d

# Include the debug tools
docker compose --profile debug up -d

# Run migrations then exit
docker compose --profile tools run --rm migrate

Override Files and extends

Compose merges multiple files together. The convention is a base compose.yaml and an override compose.override.yaml that Compose loads automatically. This lets you keep production settings in the base file and layer development-specific changes (bind mounts, debug ports, different commands) on top.

yaml
# compose.override.yaml — auto-loaded in development
services:
  api:
    build:
      target: development         # multi-stage: use dev stage
    command: ["npm", "run", "dev"]
    volumes:
      - ./api/src:/app/src:cached # bind mount for hot reload
    ports:
      - "9229:9229"               # Node.js debugger port
    environment:
      NODE_ENV: development

For explicit multi-file setups (e.g., staging vs production), use the -f flag:

bash
# Production: base only (skip auto-loaded override)
docker compose -f compose.yaml -f compose.prod.yaml up -d

# Staging: base + staging overrides
docker compose -f compose.yaml -f compose.staging.yaml up -d

The extends keyword lets a service inherit configuration from another service, either in the same file or a different file. This reduces duplication when multiple services share a common base configuration.

yaml
# common.yaml — shared service definitions
services:
  node-base:
    build:
      context: .
      args:
        NODE_ENV: production
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

# compose.yaml — extend the base
services:
  api:
    extends:
      file: common.yaml
      service: node-base
    command: ["node", "src/api.js"]
    ports:
      - "3000:3000"

  worker:
    extends:
      file: common.yaml
      service: node-base
    command: ["node", "src/worker.js"]

Compose Watch for Development

The watch feature (introduced in Compose 2.22) monitors your source files and automatically syncs changes to containers or triggers rebuilds. It's a Compose-native alternative to bind mounts + nodemon-style watchers, and it handles edge cases like dependency changes more cleanly.

yaml
services:
  api:
    build: ./api
    develop:
      watch:
        # Sync source files → fast, no restart needed
        - action: sync
          path: ./api/src
          target: /app/src

        # Rebuild when dependencies change
        - action: rebuild
          path: ./api/package.json

        # Sync + restart when config changes
        - action: sync+restart
          path: ./api/config
          target: /app/config
bash
# Start services with file watching
docker compose watch

# Or combine with up (Compose 2.22+)
docker compose up --watch

The three watch actions serve different purposes: sync copies changed files into the container without restarting (ideal for hot-reload runtimes), rebuild triggers a full image rebuild and container replacement (for dependency changes), and sync+restart copies files then restarts the container (for config files that are read once at startup).

Watch vs Bind Mounts

Unlike bind mounts, watch works consistently across macOS, Linux, and Windows — no filesystem notification issues. It also lets you define different actions per path pattern. Use bind mounts during quick prototyping; switch to watch for team-shared dev environments.

Putting It Into Practice

Here's a complete project structure that ties together everything from this section — override files, env files, profiles, and the full service stack:

bash
myapp/
├── compose.yaml              # base: all services, networks, volumes
├── compose.override.yaml     # dev: bind mounts, debug ports, watch
├── compose.prod.yaml         # prod: replicas, resource limits
├── .env                      # POSTGRES_PASSWORD, API_IMAGE_TAG
├── .env.example              # template committed to git
├── api/
│   ├── Dockerfile
│   ├── .env                  # service-level env (loaded via env_file)
│   └── src/
├── nginx/
│   └── default.conf
└── db/
    └── init.sql
Don't commit .env files

Add .env and api/.env to your .gitignore. Commit a .env.example with placeholder values instead. Secrets in version control are a security incident waiting to happen.
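A matching sketch of the two files (all values here are placeholders):

```bash
# .gitignore
.env
api/.env

# .env.example — committed template with placeholder values
POSTGRES_PASSWORD=changeme
API_IMAGE_TAG=latest
```

New contributors copy .env.example to .env and fill in real values locally; the real file never leaves their machine.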

Docker in CI/CD Pipelines

Docker transforms CI/CD from a "works on my machine" problem into a reproducible, portable process. But the way you use Docker in a pipeline — as a build tool vs. as a deliverable artifact — fundamentally shapes your workflow's architecture, speed, and security posture.

flowchart LR
    A["Git Push"] --> B["CI Trigger"]
    B --> C["Build Image"]
    C --> D["Run Tests"]
    D --> E["Security Scan"]
    E --> F{"Pass?"}
    F -->|Yes| G["Push to Registry"]
    F -->|No| H["Fail & Notify"]
    G --> I["Deploy Staging"]
    I --> J["Smoke Tests"]
    J --> K{"Pass?"}
    K -->|Yes| L["Deploy Production"]
    K -->|No| H
    

Docker as Build Tool vs. Docker as Artifact

There are two distinct roles Docker plays in CI/CD. As a build tool, Docker provides a consistent environment for compiling code, running linters, and executing tests — the image is thrown away after the job. As an artifact, Docker produces the deployable image itself — the pipeline's output is a tagged, versioned container image pushed to a registry.

Most mature pipelines use Docker in both roles simultaneously: a CI-specific image runs the build steps, and the build's output is a production-ready Docker image. Understanding this distinction helps you design cleaner pipelines.

| Role | Purpose | Image Lifecycle | Example |
| --- | --- | --- | --- |
| Build Tool | Consistent build/test environment | Ephemeral, discarded after job | Running npm test inside a Node container |
| Artifact | Deployable application package | Tagged, pushed to registry, deployed | Multi-stage build producing a production image |
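A hedged GitHub Actions sketch of both roles in one job. The test step uses a throwaway Node container as the build tool, while the final step produces the artifact (image name and npm scripts are assumptions):

```yaml
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Docker as build tool: ephemeral container, discarded after the step
      - name: Test in a clean Node environment
        run: docker run --rm -v "$PWD:/app" -w /app node:20-alpine npm test

      # Docker as artifact: the image itself is the pipeline's output
      - name: Build deployable image
        run: docker build -t myapp:${{ github.sha }} .
```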

Docker-in-Docker (DinD) vs. Socket Mounting

When your CI runner itself is a container, you need a way to build Docker images from inside it. The two approaches — Docker-in-Docker and socket mounting — have very different trade-offs in terms of isolation, performance, and security.

Docker-in-Docker (DinD) runs a full Docker daemon inside your CI container. Each job gets a completely isolated Docker environment with its own image cache. This provides strong isolation between jobs but sacrifices layer caching across runs, since the inner daemon's storage is ephemeral.

Socket mounting shares the host's Docker daemon by bind-mounting /var/run/docker.sock into the CI container. Jobs share the host's layer cache, making builds dramatically faster. The downside: any container can interact with the host daemon, which is a security risk in multi-tenant environments.

yaml
# GitLab CI with Docker-in-Docker
build:
  image: docker:24          # CLI client; the daemon runs in the dind service
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker build -t myapp:$CI_COMMIT_SHA .
yaml
# Docker Compose for self-hosted CI runner
services:
  ci-runner:
    image: myorg/ci-runner:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./workspace:/workspace
    working_dir: /workspace
Warning

Mounting the Docker socket gives the container root-equivalent access to the host. Never do this with untrusted code or in multi-tenant CI environments. For those cases, use DinD with TLS or rootless alternatives like Kaniko or Buildah.

GitHub Actions: Building and Pushing Images

GitHub Actions has first-class Docker support through the docker/build-push-action. The workflow below is a production-grade template: it builds the image, caches layers in GitHub's cache backend, and pushes to GitHub Container Registry on every push to main.

yaml
name: Build and Push

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - uses: docker/setup-buildx-action@v3

      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - uses: docker/metadata-action@v5
        id: meta
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}

      - uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

Layer Caching Strategies

Without caching, every CI run rebuilds every layer from scratch. The cache-from and cache-to options in BuildKit enable multiple caching backends. GitHub Actions' type=gha cache is the simplest, but for larger images, registry-based caching often performs better.

yaml
# Registry-based caching — survives cache eviction
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ${{ steps.meta.outputs.tags }}
    cache-from: type=registry,ref=ghcr.io/myorg/myapp:buildcache
    cache-to: type=registry,ref=ghcr.io/myorg/myapp:buildcache,mode=max

Image Tagging Strategy for CI

How you tag images determines your ability to trace deployments back to source code, roll back safely, and manage releases. A good tagging strategy uses multiple tags per image — each serving a different purpose.

| Tag Type | Format | Purpose | Example |
| --- | --- | --- | --- |
| Git SHA | abc1234 | Immutable, maps to exact commit | myapp:a1b2c3d |
| Semver | v1.2.3 | Release versioning | myapp:1.2.3, myapp:1.2 |
| Branch | main, develop | Latest from a branch (mutable) | myapp:main |
| PR | pr-42 | Preview environments | myapp:pr-42 |
| latest | latest | Convenience only — never in prod | myapp:latest |

Note
Note

Always deploy by immutable tags (git SHA or digest). Mutable tags like latest or main can change underneath a running deployment, making rollbacks unreliable and auditing impossible.
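One way to pin a deployment to a digest is to resolve it from the image after pushing. A sketch as a CI step, assuming the image was pulled or built locally earlier in the job and that a myapp deployment exists (both are assumptions):

```yaml
- name: Deploy by immutable digest
  run: |
    # RepoDigests is populated once the image has been pushed or pulled
    DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/myorg/myapp:${{ github.sha }})
    echo "Deploying $DIGEST"
    kubectl set image deployment/myapp app="$DIGEST"
```

A digest reference like myapp@sha256:... can never be re-pointed, so a rollback to yesterday's digest is guaranteed to be yesterday's bits.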

Running Tests in CI with Docker

Docker gives you two patterns for running tests in CI. You can run tests during the image build (in a test stage of a multi-stage Dockerfile), or run tests against a built image by starting the container and executing a test suite from outside. The first is simpler; the second allows integration testing with real dependencies.

docker
# Multi-stage Dockerfile with test stage
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci

FROM deps AS test
COPY . .
RUN npm run lint && npm run test:unit

FROM deps AS build
COPY . .
RUN npm run build

FROM node:20-alpine AS production
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]

In CI, you target the test stage explicitly. If tests fail, the build fails and no image gets pushed:

yaml
# GitHub Actions — run tests then build production image
steps:
  - name: Run tests inside Docker
    run: docker build --target test -t myapp:test .

  - name: Build production image
    run: docker build --target production -t myapp:${{ github.sha }} .

Automated Integration Testing with Compose

For integration tests that need databases, caches, or message brokers, Docker Compose spins up the full dependency graph. Define a dedicated docker-compose.test.yml that extends your base compose file with a test runner service.

yaml
# docker-compose.test.yml
services:
  app:
    build:
      context: .
      target: test
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    environment:
      DATABASE_URL: postgres://test:test@postgres:5432/testdb
      REDIS_URL: redis://redis:6379
    command: npm run test:integration

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
bash
# Run integration tests in CI
docker compose -f docker-compose.test.yml up \
  --build --abort-on-container-exit --exit-code-from app

Multi-Platform Builds in CI

If you deploy to both amd64 servers and arm64 instances (AWS Graviton, Apple Silicon dev machines), you need multi-platform images. BuildKit with QEMU emulation handles this natively in GitHub Actions.

yaml
steps:
  - uses: docker/setup-qemu-action@v3

  - uses: docker/setup-buildx-action@v3

  - uses: docker/build-push-action@v5
    with:
      context: .
      platforms: linux/amd64,linux/arm64
      push: true
      tags: ghcr.io/myorg/myapp:${{ github.sha }}
Tip

QEMU emulation makes cross-platform builds 3-10x slower. For faster builds, use native runners for each architecture and merge manifests with docker buildx imagetools create. GitHub's ubuntu-24.04-arm runners make this practical.
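A hedged sketch of the native-runner pattern: two per-architecture build jobs (names are assumptions) push single-arch tags, and a final job stitches them into one multi-platform tag with docker buildx imagetools create:

```yaml
jobs:
  merge-manifests:
    needs: [build-amd64, build-arm64]   # hypothetical per-arch build jobs
    runs-on: ubuntu-latest
    steps:
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Merge per-arch images into one multi-platform tag
        run: |
          docker buildx imagetools create \
            -t ghcr.io/myorg/myapp:${{ github.sha }} \
            ghcr.io/myorg/myapp:${{ github.sha }}-amd64 \
            ghcr.io/myorg/myapp:${{ github.sha }}-arm64
```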

Security Scanning

Every image you push to a registry should pass a vulnerability scan. The three major tools each have different strengths: Trivy is fast and open-source, Docker Scout integrates directly with Docker Hub, and Snyk offers deep dependency analysis with fix recommendations.

yaml
# GitHub Actions — Trivy scan
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: myapp:${{ github.sha }}
    format: table
    exit-code: 1              # Fail pipeline on vulnerabilities
    severity: CRITICAL,HIGH
    ignore-unfixed: true
yaml
# GitHub Actions — Docker Scout
- name: Docker Scout CVE scan
  uses: docker/scout-action@v1
  with:
    command: cves
    image: myapp:${{ github.sha }}
    only-severities: critical,high
    exit-code: true           # Fail on policy violation
yaml
# GitHub Actions — Snyk Container
- name: Snyk container scan
  uses: snyk/actions/docker@master
  env:
    SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
  with:
    image: myapp:${{ github.sha }}
    args: --severity-threshold=high

Deployment Strategies

Once your image is built, tested, and scanned, the deployment strategy determines how you roll it out without downtime. Each strategy trades off between speed, safety, and infrastructure complexity.

Blue-Green Deployment

You maintain two identical environments — blue (current) and green (new). Deploy the new image to the green environment, verify it, then switch the load balancer to point to green. Rollback is instant: switch back to blue.

bash
#!/bin/bash
# Blue-green deploy with Docker Compose
NEW_VERSION=$1
# NB: newer Compose prints one JSON object per line (NDJSON), not an array
CURRENT=$(docker compose ps --format json | jq -r '.Name' | grep -o 'blue\|green' | head -n1)
TARGET=$( [ "$CURRENT" = "blue" ] && echo "green" || echo "blue" )

# Deploy new version to inactive environment
IMAGE_TAG="$NEW_VERSION" docker compose up -d "app-${TARGET}"

# Health check the new environment
until curl -sf "http://app-${TARGET}:3000/health"; do sleep 2; done

# Switch traffic (update nginx upstream)
sed -i "s/app-${CURRENT}/app-${TARGET}/" /etc/nginx/conf.d/upstream.conf
nginx -s reload

echo "Switched traffic from $CURRENT to $TARGET (v$NEW_VERSION)"

Rolling Deployment

Replace instances one at a time. This is the default in Docker Swarm and Kubernetes. You trade deployment speed for zero-downtime guarantees — at any point during the rollout, both old and new versions serve traffic.

yaml
# Docker Swarm rolling update configuration
services:
  web:
    image: ghcr.io/myorg/myapp:${VERSION}
    deploy:
      replicas: 4
      update_config:
        parallelism: 1        # Update one container at a time
        delay: 30s             # Wait 30s between updates
        failure_action: rollback
        order: start-first     # Start new before stopping old
      rollback_config:
        parallelism: 0         # Roll back all at once
        order: start-first

Canary Deployment

Route a small percentage of traffic (e.g., 5%) to the new version. Monitor error rates and latency. If metrics look good, gradually increase traffic until the canary takes 100%. This catches issues that only surface under real production load.

yaml
# GitHub Actions — Canary deploy step
deploy-canary:
  needs: [build, test, scan]
  runs-on: ubuntu-latest
  steps:
    - name: Deploy canary (5% traffic)
      run: |
        kubectl set image deployment/myapp-canary \
          app=ghcr.io/myorg/myapp:${{ github.sha }}
        kubectl scale deployment/myapp-canary --replicas=1

    - name: Monitor canary (5 minutes)
      run: |
        for i in $(seq 1 30); do
          ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_errors_total[1m])" \
            | jq '.data.result[0].value[1]')
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "Error rate too high: $ERROR_RATE — rolling back"
            kubectl rollout undo deployment/myapp-canary
            exit 1
          fi
          sleep 10
        done

    - name: Promote canary to full rollout
      run: |
        kubectl set image deployment/myapp \
          app=ghcr.io/myorg/myapp:${{ github.sha }}
        kubectl rollout status deployment/myapp
| Strategy | Downtime | Rollback Speed | Resource Overhead | Best For |
| --- | --- | --- | --- | --- |
| Blue-Green | Zero | Instant (switch back) | 2x infrastructure | Critical apps, fast rollback needed |
| Rolling | Zero | Minutes (re-roll) | Minimal (+1 instance) | Stateless services, most common |
| Canary | Zero | Fast (scale down canary) | Minimal (+1 instance) | High-traffic apps, risk-sensitive releases |

Debugging and Troubleshooting Docker Containers

When a container misbehaves — crashing immediately, returning errors, or performing differently than your local environment — you need a systematic approach rather than guesswork. Docker exposes a rich set of introspection commands that let you trace the problem from the outside in: logs first, then state inspection, then interactive debugging.

This section builds your diagnostic toolkit from the essential commands you’ll use daily to advanced techniques for the toughest problems.

flowchart TD
    A["Container Problem"] --> B{"Is the container running?"}
    B -->|"No - Exited"| C["docker logs CONTAINER"]
    B -->|"Yes - Misbehaving"| H["docker exec -it CONTAINER sh"]
    C --> D{"Check exit code"}
    D -->|"Exit 0"| E["Process completed normally.\nCheck CMD / entrypoint logic."]
    D -->|"Exit 1"| F["Application error.\nRead logs for stack trace."]
    D -->|"Exit 137"| G["OOM or SIGKILL.\nCheck memory limits."]
    D -->|"Exit 127"| G2["Command not found.\nCheck PATH or binary exists."]
    D -->|"Exit 126"| G3["Permission denied.\nCheck file permissions."]
    H -->|"Shell works"| I["Inspect process, network,\nfiles from inside."]
    H -->|"exec fails"| J{"Minimal image?\n(distroless / scratch)"}
    J -->|"Yes"| K["docker debug CONTAINER\nor nsenter via PID"]
    J -->|"No"| L["Container may be paused\nor in a crash loop.\nCheck docker inspect."]
    

Essential Diagnostic Commands

These five commands form the core of your debugging workflow. You’ll reach for them in roughly this order: logs tell you what went wrong, inspect tells you why the container is configured that way, exec lets you poke around inside, events shows you when things happened, and diff reveals what changed on the filesystem.

docker logs — Your First Stop

Always start here. docker logs captures everything your container writes to stdout and stderr. The --tail and --since flags prevent you from drowning in output on long-running containers.

bash
# Last 50 lines
docker logs --tail 50 my-api

# Logs from the past 5 minutes
docker logs --since 5m my-api

# Follow logs in real time (like tail -f)
docker logs -f --tail 20 my-api

# Show timestamps alongside each log line
docker logs -t --since 2024-01-15T10:00:00 my-api

docker inspect — Deep State Examination

When logs aren’t enough, docker inspect dumps the full JSON configuration and runtime state of a container. Use Go template syntax with --format to extract exactly what you need.

bash
# Why did this container stop?
docker inspect --format='{{.State.ExitCode}}' my-api
docker inspect --format='{{.State.OOMKilled}}' my-api
docker inspect --format='{{.State.Error}}' my-api

# What IP address was assigned?
docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-api

# What mounts and volumes are attached?
docker inspect --format='{{json .Mounts}}' my-api | python3 -m json.tool

# Full health check status
docker inspect --format='{{json .State.Health}}' my-api | python3 -m json.tool

docker exec — Interactive Exploration

Once a container is running, exec lets you open a shell inside it. This is your primary tool for checking process state, network connectivity, filesystem contents, and environment variables from the container’s perspective.

bash
# Open an interactive shell
docker exec -it my-api /bin/sh

# Run a single command without interactive mode
docker exec my-api cat /etc/resolv.conf

# Exec as root (even if Dockerfile uses USER)
docker exec -u 0 my-api id

# Check what processes are running inside
docker exec my-api ps aux

# Test network connectivity from inside the container
docker exec my-api ping -c 3 database
docker exec my-api nslookup database

docker events & docker diff

docker events gives you a real-time stream of Docker daemon events — container starts, stops, OOM events, network connections, and volume mounts. docker diff shows filesystem changes (added, changed, or deleted files) relative to the image.

bash
# Watch all events in real time (run in a separate terminal)
docker events

# Filter to only OOM events and container deaths
docker events --filter event=oom --filter event=die

# See what files changed inside a container
docker diff my-api
# Output: C /tmp  (Changed), A /tmp/cache.db  (Added), D /app/old.log  (Deleted)

Exit Code Reference

The exit code is the single most informative piece of data from a crashed container. It immediately narrows down the category of failure. Run docker inspect --format='{{.State.ExitCode}}' CONTAINER to retrieve it.

| Exit Code | Meaning | Common Cause | What To Do |
| --- | --- | --- | --- |
| 0 | Success (normal exit) | Process finished its work and exited | Check if your CMD is a one-shot command instead of a long-running server |
| 1 | General application error | Unhandled exception, missing config, bad input | Read docker logs — the stack trace will be there |
| 126 | Command not executable | Binary exists but lacks execute permissions | Check chmod +x on your entrypoint script |
| 127 | Command not found | Typo in CMD, binary not in PATH, missing from image | Verify the binary exists: docker run --rm IMAGE which myapp |
| 137 | SIGKILL (128 + 9) | OOM, docker stop after timeout, external signal | Check OOMKilled with inspect; raise memory limits if needed |
| 139 | SIGSEGV (128 + 11) | Segmentation fault in the application | Debug the native binary; check for architecture mismatch (amd64 vs arm64) |
| 143 | SIGTERM (128 + 15) | Graceful shutdown via docker stop | Normal — but if unexpected, check if something is restarting your container |
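The 128 + N pattern means you can recover the signal name from a fatal exit code right in the shell, since kill -l N prints the name of signal N:

```shell
#!/bin/sh
EXIT_CODE=137

if [ "$EXIT_CODE" -gt 128 ]; then
  SIGNAL=$((EXIT_CODE - 128))
  # kill -l 9 prints KILL, kill -l 15 prints TERM, etc.
  echo "killed by SIG$(kill -l "$SIGNAL")"   # prints "killed by SIGKILL"
fi
```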

Common Failure Patterns

OOM Events (Exit 137)

When a container exceeds its memory limit, the Linux kernel’s OOM mechanism sends SIGKILL (signal 9), resulting in exit code 137. The container gets no chance to clean up. This is one of the most common production failures, especially for Java and Node.js apps with default heap settings.

bash
# Confirm OOM
docker inspect --format='{{.State.OOMKilled}}' my-api
# Output: true

# Check current memory usage vs limit
docker stats --no-stream my-api

# Check kernel OOM logs on the host (Linux)
dmesg | grep -i oom

# Run with a higher memory limit
docker run -d --memory=512m --memory-swap=512m my-api

PID 1 Signal Handling

In a container, your process runs as PID 1. The kernel treats PID 1 specially: signals for which no handler is explicitly installed are ignored rather than given their default disposition. If your app doesn’t explicitly handle SIGTERM, docker stop will wait for the 10-second grace period, then send SIGKILL — causing slow, ungraceful shutdowns every time.

docker
# BAD: shell form wraps your app in /bin/sh — signals go to sh, not your app
CMD node server.js

# GOOD: exec form — your app IS PID 1 and receives signals directly
CMD ["node", "server.js"]

# ALSO GOOD: use tini as an init process to handle signal forwarding
ENTRYPOINT ["/tini", "--"]
CMD ["node", "server.js"]
PID 1 and docker stop

If docker stop consistently takes exactly 10 seconds (the default grace period), your process isn’t handling SIGTERM. Switch to exec form in your CMD or add --init to your docker run command to inject tini automatically.
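Entrypoint shell scripts have the same pitfall: unless the script ends with exec, the shell stays PID 1 and swallows signals meant for your app. A minimal sketch of the exec handoff pattern (the setup lines are placeholders):

```shell
#!/bin/sh
# entrypoint.sh — run setup, then replace the shell with the real process
set -e
export APP_ENV="${APP_ENV:-production}"

# exec makes "$@" take over this PID, so it receives SIGTERM directly
exec "$@"
```

With ENTRYPOINT ["/entrypoint.sh"] and CMD ["node", "server.js"], the node process inherits PID 1 and handles docker stop promptly.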

Filesystem Permission Errors

A container that works with docker run but fails when you add -v /host/path:/app/data almost always has a permissions mismatch. The UID inside the container doesn’t match the owner of the host directory. This is especially common when a Dockerfile uses USER 1000 but the host directory is owned by root.

bash
# Check what user the container runs as
docker exec my-api id
# Output: uid=1000(appuser) gid=1000(appuser)

# Check ownership of the mounted volume from inside
docker exec my-api ls -la /app/data
# Output: drwxr-xr-x 2 root root 4096 ... /app/data  <-- mismatch!

# Fix: match the host directory ownership to the container UID
sudo chown -R 1000:1000 /host/path

# Or run the container with a matching user
docker run -v /host/path:/app/data --user "$(id -u):$(id -g)" my-api

DNS Failures and Network Issues

Containers on a user-defined bridge network can resolve each other by container name. But containers on the default bridge network cannot — they can only reach each other by IP address. If ping database fails inside your app container, check which network both containers are on; the usual fix is to create a user-defined network (docker network create) and connect both containers to it (docker network connect).

bash
# Check which network a container is on
docker inspect --format='{{json .NetworkSettings.Networks}}' my-api | python3 -m json.tool

# Test DNS resolution from inside the container
docker exec my-api nslookup database
docker exec my-api cat /etc/resolv.conf

# Check if both containers share a network
docker network inspect my-network --format='{{range .Containers}}{{.Name}} {{end}}'

# Quick connectivity test
docker exec my-api nc -zv database 5432

Port Conflicts

If docker run -p 8080:80 fails with “port is already allocated,” something on the host is already bound to that port. Find the offending process and either stop it or choose a different host port.

bash
# Find what is using port 8080 on the host
lsof -i :8080
# or on Linux:
ss -tlnp | grep 8080

# Use a different host port (host:container)
docker run -d -p 9090:80 my-web

# Let Docker assign a random available port
docker run -d -p 80 my-web
docker port my-web
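When lsof and ss aren't available, bash can probe a local TCP port through its /dev/tcp pseudo-device. A rough sketch (the helper name is made up; it detects listeners but not which process owns them, and /dev/tcp is a bash feature, not POSIX):

```shell
# Returns 0 if something accepts TCP connections on localhost:$1 (bash only)
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

if port_in_use 8080; then
  echo "8080 is busy, pick another host port"
else
  echo "8080 is free"
fi
```

This only tells you the port is taken; you still need lsof or ss to find the owning process.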

Advanced Debugging Techniques

When docker exec isn’t enough — either because the container has no shell (distroless/scratch images), has already crashed, or you need kernel-level visibility — these tools go deeper.

docker debug (Docker Desktop)

Docker Desktop includes docker debug, which attaches a fully-equipped debugging shell to any container, even ones built from scratch or distroless images. It injects debugging tools without modifying the container image.

bash
# Attach a debug shell to a running container (even distroless)
docker debug my-api

# The same command also works on a stopped/exited container
docker debug my-api

# You get vim, nano, curl, htop, strace, and more — all injected at runtime

nsenter — Enter Container Namespaces from the Host

nsenter lets you enter one or more Linux namespaces of a running container directly from the host. This is the escape hatch for production servers where Docker Desktop isn’t available and the container image has no shell.

bash
# Get the container's PID on the host
PID=$(docker inspect --format='{{.State.Pid}}' my-api)

# Enter all namespaces of the container with a host shell
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh

# Enter only the network namespace (useful for network debugging)
sudo nsenter -t $PID -n -- ip addr
sudo nsenter -t $PID -n -- ss -tlnp

Network Debugging with nicolaka/netshoot

The netshoot image is a Swiss Army knife for container network debugging. It includes curl, dig, nmap, tcpdump, iperf, and dozens more networking tools. Attach it to any container’s network namespace to diagnose connectivity problems without modifying the target container.

bash
# Share the network namespace of a running container
docker run -it --rm --network container:my-api nicolaka/netshoot

# Now you have the full toolkit inside the same network namespace:
# tcpdump -i eth0 port 80
# dig database
# curl -v http://localhost:3000/healthz
# iftop

# Attach to a Docker Compose network to debug service discovery
docker run -it --rm --network myapp_default nicolaka/netshoot
# dig my-api
# curl http://my-api:3000/healthz
Pro tip: strace for the hardest bugs

When logs show nothing and the app just silently hangs, strace reveals every system call the process makes. Run docker exec my-api strace -p 1 -f -e trace=network to trace all network-related syscalls of PID 1. If strace isn’t installed in the image, use nsenter or docker debug to get access.

Systematic Debugging Workflow

When you hit an unfamiliar container problem, resist the urge to randomly try things. Follow this checklist top-to-bottom — each step either solves the problem or gives you information for the next step.

  1. Check container status and exit code

    Get the big picture first. Is the container running, exited, or restarting? The exit code immediately categorizes the failure.

    bash
    docker ps -a --filter name=my-api
    docker inspect --format='Exit:{{.State.ExitCode}} OOM:{{.State.OOMKilled}} Error:{{.State.Error}}' my-api
  2. Read the logs

    For exited containers, the answer is almost always in the logs. Look for stack traces, "permission denied", "address already in use", or "connection refused" messages.

    bash
    docker logs --tail 100 my-api 2>&1 | head -50
  3. Inspect configuration

    Compare the running configuration against what you expected. Check environment variables, mounts, networks, and port bindings.

    bash
    docker inspect my-api | python3 -m json.tool | less
  4. Get a shell inside

    If the container is running, exec in and poke around. Check processes, files, network, and DNS. If the container has no shell, use docker debug or nsenter.

    bash
    docker exec -it my-api /bin/sh
    # Inside: ps aux, env, cat /etc/hosts, nslookup database, ls -la /app
  5. Reproduce with a fresh container

    If you’ve changed things inside the container while debugging, start fresh. Override the entrypoint with sh to keep the container alive while you test interactively.

    bash
    # Override entrypoint to keep container alive for debugging
    docker run -it --rm --entrypoint /bin/sh my-api-image
    # Now manually run the entrypoint commands one by one to find the failure
“Works on my machine” — the Docker edition

When code works locally but fails in a container, the culprit is almost always one of three things: a missing environment variable, a filesystem path that doesn’t exist in the image, or the container running as a different user (non-root) than your local dev environment. Always check env, pwd, and id inside the container first.

Docker Security: Attack Surface and Hardening

Docker containers share the host kernel. That single fact shapes the entire security model: every misconfiguration is a potential path from container to host. Unlike virtual machines, there is no hypervisor boundary — only Linux kernel features (namespaces, cgroups, capabilities, seccomp, LSMs) stand between a container process and full host access.

Effective Docker security is defense-in-depth. No single control is sufficient; you layer protections so that when one fails, others hold. The diagram below shows these layers from the outside in.

graph LR
    A["🖥️ Host Hardening\npatched kernel, minimal OS"] --> B["🐳 Docker Daemon\nsocket access, TLS, rootless"]
    B --> C["📦 Image Security\ntrusted base, pinned digests, scanned"]
    C --> D["🔨 Build Security\nno secrets baked in, .dockerignore"]
    D --> E["🔒 Runtime Hardening\nnon-root, dropped caps, read-only"]
    E --> F["🌐 Application\nleast privilege, no shell"]
    

The Docker Socket Is Root Access

The Docker daemon (/var/run/docker.sock) runs as root. Any process that can talk to this socket can spin up a privileged container, mount the host filesystem, and effectively become root on the host. This is not a bug — it is how Docker works.

Mounting the Docker socket into a container (a common pattern for CI runners, monitoring tools, and “Docker-in-Docker” setups) grants that container full root access to the host. Treat socket access as equivalent to adding someone to the sudoers file.

bash
# This gives the container FULL root access to the host
docker run -v /var/run/docker.sock:/var/run/docker.sock some-image

# From inside that container, an attacker can:
docker run -it --privileged --pid=host -v /:/hostfs alpine chroot /hostfs
Caution

Never mount the Docker socket into application containers. If you must (e.g., CI/CD), use a scoped proxy like docker-socket-proxy to limit which API endpoints are accessible.
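One way to scope that access, sketched with the community tecnativa/docker-socket-proxy image (the service names and CI image here are hypothetical): the proxy mounts the socket read-only and whitelists API sections via environment variables, and other containers talk to the proxy over TCP instead of mounting the socket themselves.

```yaml
services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      CONTAINERS: 1        # allow the read-only /containers endpoints
      POST: 0              # deny all mutating requests
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  ci-runner:
    image: my-ci-runner    # hypothetical
    environment:
      DOCKER_HOST: tcp://socket-proxy:2375
```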

Image Security

Your container is only as secure as the image it runs. A base image with known CVEs, an unpinned tag that gets overwritten, or a malicious image from a public registry can compromise your workload before a single line of your code executes.

Use Trusted, Minimal Base Images

Start from images that have a small attack surface. Fewer packages means fewer CVEs. The table below compares common options.

| Base Image | Size | Packages | Best For |
|---|---|---|---|
| scratch | 0 MB | None | Statically compiled Go/Rust binaries |
| gcr.io/distroless/static | ~2 MB | CA certs, tzdata | Static binaries needing TLS |
| alpine:3.20 | ~7 MB | musl, busybox | General-purpose minimal base |
| ubuntu:24.04 | ~78 MB | apt, glibc, many | When you need glibc or apt packages |
| node:22-slim | ~200 MB | Node.js + Debian minimal | Node.js apps needing native modules |

Pin Image Digests, Not Just Tags

Tags are mutable. The alpine:3.20 you pulled last week might not be the same alpine:3.20 you pull today — the tag can be re-pushed. Pin by SHA256 digest to guarantee reproducibility.

docker
# ❌ Mutable — can change underneath you
FROM alpine:3.20

# ✅ Immutable — this exact image, forever
FROM alpine:3.20@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c
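One way to find the digest to pin, sketched here against alpine:3.20 (requires registry access; the buildx variant assumes the buildx plugin is installed):

```shell
# Pull the tag, then read back the digest Docker recorded for it
docker pull -q alpine:3.20
docker inspect --format '{{index .RepoDigests 0}}' alpine:3.20

# Or query the registry without pulling (requires buildx)
docker buildx imagetools inspect alpine:3.20
```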

Scan Images for Vulnerabilities

Integrate vulnerability scanning into your CI pipeline. Three popular tools serve this purpose — pick the one that fits your workflow.

bash
# Scan an image — fail CI on HIGH or CRITICAL CVEs
trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:latest

# Scan a Dockerfile for misconfigurations
trivy config --severity HIGH,CRITICAL Dockerfile

Trivy (by Aqua Security) is open-source, fast, and scans images, filesystems, IaC configs, and SBOMs. It is the most widely adopted OSS scanner.

bash
# Built into Docker Desktop and the docker CLI
docker scout cves myapp:latest

# Get fix recommendations
docker scout recommendations myapp:latest

Docker Scout integrates directly into Docker Desktop and Docker Hub. It provides policy-based evaluation and remediation advice built into the Docker workflow.

bash
# Scan an image
grype myapp:latest

# Pair with Syft for SBOM generation + scanning
syft myapp:latest -o spdx-json | grype

Grype (by Anchore) pairs with Syft for SBOM generation. Together they give you a complete supply-chain view — generate an SBOM once, scan it repeatedly as new CVEs are published.

Build-Time Security

The image build process is where secrets most commonly leak. Every RUN, COPY, and ADD instruction creates a layer that is permanently stored in the image. Anyone who can docker pull your image can inspect every layer and extract anything you put in it — including credentials you thought you deleted in a later layer.

Never Bake Secrets Into Images

BuildKit’s --mount=type=secret lets you inject secrets at build time without them ever appearing in a layer. The secret is mounted into the build container’s filesystem and vanishes when the RUN instruction completes.

docker
# syntax=docker/dockerfile:1
FROM node:22-slim

WORKDIR /app
COPY package*.json ./

# ✅ Secret is mounted at /root/.npmrc for this RUN only — never stored in any layer
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci --production

COPY . .
USER node
CMD ["node", "server.js"]
bash
# Pass the secret at build time
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .

Use .dockerignore Religiously

Without a .dockerignore, COPY . . sends your entire build context to the daemon — including .env files, .git history, SSH keys, and anything else in the directory. A proper .dockerignore is your first line of defense against accidental secret leakage.

bash
# .dockerignore
.git
.env
.env.*
*.pem
*.key
node_modules
docker-compose*.yml
README.md

Runtime Hardening

Even a perfectly built image can be dangerous at runtime if you run it with Docker’s permissive defaults. The default Docker container runs as root (UID 0), with a set of Linux capabilities, a writable filesystem, and the ability to escalate privileges. Each of these defaults should be locked down.

Run as Non-Root

If your process runs as root inside the container and an attacker escapes the container’s namespace isolation, they are root on the host. Setting a non-root USER in your Dockerfile is the single most impactful hardening step.

docker
FROM python:3.12-slim

RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
RUN pip install --no-cache-dir -r requirements.txt

USER appuser
CMD ["gunicorn", "app:create_app()"]

Drop All Capabilities, Add Only What You Need

By default, Docker grants containers a subset of Linux capabilities (14 out of 41+). Most applications need none of them. Drop them all, then selectively add back only what your process requires.

bash
# Drop everything, add back only NET_BIND_SERVICE (to bind port 80/443)
docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid \
  myapp:latest

Here is what each flag does and why it matters:

| Flag | What It Does | Why It Matters |
|---|---|---|
| --cap-drop=ALL | Removes all Linux capabilities | Prevents mount, raw sockets, kernel module loading, etc. |
| --security-opt=no-new-privileges | Prevents child processes from gaining new privileges | Blocks setuid/setgid binaries from escalating |
| --read-only | Makes the root filesystem read-only | Prevents attackers from writing malware or modifying binaries |
| --tmpfs /tmp | Mounts a writable tmpfs at /tmp | Gives the app a scratch space without making / writable |
| --pids-limit=100 | Limits number of processes | Prevents fork bombs from bringing down the host |
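To see which capabilities a process actually holds, read its bitmask from /proc (Linux only) and decode it with capsh, which ships with libcap. The bitmask in the decode example is Docker's well-known default capability set; the docker exec line assumes a running container named my-api.

```shell
# Effective capability bitmask of the current process (Linux)
grep CapEff /proc/self/status

# Decode a bitmask into capability names (capsh is part of libcap)
capsh --decode=00000000a80425fb 2>/dev/null || true

# The same check against PID 1 of a running container:
#   docker exec my-api grep CapEff /proc/1/status
```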

Seccomp and AppArmor Profiles

Seccomp filters restrict which system calls a container can make. Docker ships with a default seccomp profile that blocks ~44 dangerous syscalls (like reboot, mount, kexec_load). You can create a custom profile for even tighter restrictions.

AppArmor (on Debian/Ubuntu) and SELinux (on RHEL/Fedora) provide mandatory access control. Docker applies a default AppArmor profile (docker-default) that restricts file access, mounting, and raw network operations.

bash
# Use a custom seccomp profile (stricter than default)
docker run --security-opt seccomp=custom-profile.json myapp:latest

# Generate a profile by tracing syscalls with OCI tools
# Step 1: Run with no seccomp to trace what the app actually calls
docker run --security-opt seccomp=unconfined myapp:latest
# Step 2: Use tools like oci-seccomp-bpf-hook to build a whitelist
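For reference, a profile like custom-profile.json is plain JSON: a default action plus an allowlist of syscall names. This skeleton shows the shape Docker expects; the syscall list here is illustrative and far too short for a real application.

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat",
                "mmap", "brk", "futex", "rt_sigreturn", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```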
Note

Running with --privileged disables seccomp and AppArmor, removes all capability restrictions, and gives the container nearly full access to the host. Never use --privileged in production. If you think you need it, you almost certainly need a specific --cap-add or --device instead.

Rootless Docker and User Namespace Remapping

Even with all the runtime flags above, the Docker daemon itself runs as root. Two approaches address this at the daemon level.

Rootless Docker

Rootless mode runs the entire Docker daemon and containers under a regular (non-root) user. If an attacker escapes the container, they land as an unprivileged user on the host \u2014 not root. This is the strongest isolation improvement you can make at the daemon level.

bash
# Install rootless Docker
dockerd-rootless-setuptool.sh install

# Verify it is running rootless
docker info --format '{{.SecurityOptions}}'
# Should include "rootless"

# Set the socket path for your user
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

User Namespace Remapping

If you cannot use rootless mode, user namespace remapping (userns-remap) maps UID 0 inside the container to a high-numbered unprivileged UID on the host. The process thinks it is root, but the kernel knows it is UID 100000. Enable it in the daemon configuration file (/etc/docker/daemon.json):

json
{
  "userns-remap": "default"
}

With "userns-remap": "default", Docker creates a dockremap user and configures subordinate UID/GID ranges in /etc/subuid and /etc/subgid. Container root (UID 0) maps to a host UID like 100000, so even a container escape yields no real privileges.

Container Escape CVEs: Why This All Matters

Container escapes are not theoretical. They have happened, and they will happen again. Understanding past escapes shows why defense-in-depth is essential — each vulnerability below would have been mitigated (or completely prevented) by one or more of the hardening techniques in this section.

| CVE | Year | What Happened | Mitigated By |
|---|---|---|---|
| CVE-2019-5736 | 2019 | Malicious container overwrites host runc binary via /proc/self/exe | User namespaces, read-only rootfs, patched runc |
| CVE-2020-15257 | 2020 | containerd-shim API accessible from containers sharing host network namespace | Never use --network=host in production |
| CVE-2022-0185 | 2022 | Linux kernel heap overflow in filesystem context, escapable from unprivileged user ns | Patched host kernel, seccomp (blocks unshare) |
| CVE-2024-21626 | 2024 | runc WORKDIR race condition leaks host file descriptors into container | Patched runc (>= 1.1.12), rootless mode |
Tip

The pattern is consistent: keep your host kernel, runc, and containerd patched, and layer runtime protections so that even a 0-day escape faces additional barriers (non-root user, dropped capabilities, user namespace remapping).

Production Hardening Checklist

Combine everything into a single hardened docker run invocation. This represents a production-grade baseline.

bash
docker run -d \
  --name myapp \
  --user 1000:1000 \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  --security-opt seccomp=myapp-seccomp.json \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --pids-limit=100 \
  --memory=512m \
  --cpus=1 \
  --restart=unless-stopped \
  --health-cmd="curl -f http://localhost:8080/healthz || exit 1" \
  --health-interval=30s \
  myapp:1.2.3@sha256:abc123...

Performance Optimization: Faster Builds, Smaller Images, Efficient Runtime

Docker performance isn't one thing — it's three. A fast build that produces a 1.2 GB image is still a problem. A tiny image that takes 15 minutes to build is still a problem. And even a fast, small image can choke at runtime if you ignore memory limits and I/O patterns. This section attacks all three dimensions systematically.

Dimension 1: Build Speed

Docker builds layers sequentially from top to bottom. Every time a layer's input changes, that layer and every layer below it are rebuilt from scratch. The single most impactful thing you can do for build speed is order your instructions from least-frequently-changed to most-frequently-changed.

Instruction Ordering

Here's a common anti-pattern — copying all source files before installing dependencies:

docker
# BAD: Any source file change invalidates the npm install cache
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build

Every time you touch a single .js file, npm install re-runs — downloading the exact same dependencies. Fix it by copying the lockfile first:

docker
# GOOD: npm install only re-runs when package files change
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production
COPY . .
RUN npm run build

BuildKit Cache Mounts

BuildKit (the default builder since Docker 23.0) supports cache mounts — persistent directories that survive between builds. This is transformative for package managers that maintain their own caches (apt, pip, go, npm).

docker
# syntax=docker/dockerfile:1
FROM python:3.12-slim

# Apt cache survives between builds
RUN --mount=type=cache,target=/var/cache/apt \
    --mount=type=cache,target=/var/lib/apt \
    apt-get update && apt-get install -y gcc libpq-dev

# Pip cache survives between builds
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

Without cache mounts, every apt-get install re-downloads every .deb package. With them, only new or changed packages are fetched. On a project with heavy C dependencies, this can cut build time from 4 minutes to 30 seconds.

The .dockerignore File

Before Docker starts building, it sends your entire build context (the directory you pass to docker build) to the daemon. If that directory contains node_modules, .git, build artifacts, or large data files, you're wasting time and bandwidth — and possibly leaking secrets.

bash
# .dockerignore
.git
node_modules
dist
*.md
.env
.vscode
coverage
__pycache__
*.pyc

Parallel Multi-Stage Builds

BuildKit automatically parallelizes independent stages. If your build has a frontend and a backend that don't depend on each other, declare them as separate stages and BuildKit will build them concurrently:

docker
# These two stages build IN PARALLEL with BuildKit
FROM node:20-alpine AS frontend
WORKDIR /app/frontend
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM golang:1.22-alpine AS backend
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o server .

# Final stage: pulls artifacts from both
FROM alpine:3.19
COPY --from=frontend /app/frontend/dist /srv/static
COPY --from=backend /app/server /usr/local/bin/server
ENTRYPOINT ["server"]

Dimension 2: Image Size

Large images mean slower pulls, slower deploys, larger attack surface, and higher storage costs. The goal is to ship only what you need to run the application — no compilers, no package managers, no documentation, no leftover tarballs.

Choosing the Right Base Image

Your base image choice is the single biggest lever for image size. Here's how common bases compare:

| Base Image | Size | Use Case |
|---|---|---|
| ubuntu:24.04 | ~78 MB | When you need full glibc + apt ecosystem |
| debian:bookworm-slim | ~75 MB | Slimmed Debian — fewer pre-installed packages |
| alpine:3.19 | ~7 MB | Minimal with musl libc — great for Go, Rust, static binaries |
| gcr.io/distroless/static | ~2 MB | No shell, no package manager — just your binary |
| scratch | 0 MB | Literally empty — for fully static binaries only |
Alpine and musl libc

Alpine uses musl instead of glibc. Most Go and Rust programs work flawlessly. Some Python/Node C extensions may have compatibility issues — test before committing. If you hit musl problems, debian:bookworm-slim is a solid fallback.

Delete Temp Files in the Same Layer

Docker images are additive — each layer stores a diff. If you download a 200 MB tarball in one RUN step and delete it in the next, the tarball is still in the first layer. You pay the full cost. Always clean up in the same RUN instruction:

docker
# BAD — 3 layers, tarball persists in layer 1
RUN curl -O https://example.com/big-archive.tar.gz
RUN tar xzf big-archive.tar.gz
RUN rm big-archive.tar.gz

# GOOD — 1 layer, tarball never persisted
RUN curl -O https://example.com/big-archive.tar.gz \
    && tar xzf big-archive.tar.gz \
    && rm big-archive.tar.gz

The same logic applies to apt-get. Always chain apt-get update, apt-get install, and cleanup into one RUN:

docker
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       ca-certificates \
       curl \
    && rm -rf /var/lib/apt/lists/*

The --no-install-recommends flag prevents apt from pulling in suggested packages you don't need. This alone can save 50-200 MB depending on the packages involved.

Multi-Stage Builds for Minimal Final Images

Multi-stage builds are the most powerful size-reduction technique. You build in a fat image with all the tools, then copy just the artifact into a minimal runtime image. The compiler, source code, and build dependencies never ship.

docker
# Stage 1: Build (900+ MB with Go toolchain)
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app

# Stage 2: Run (just the binary — ~7 MB total)
FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]

Inspecting Image Layers with dive

The dive tool gives you an interactive, layer-by-layer breakdown of your image. It shows exactly what each layer added, how much space it uses, and highlights wasted space (files added in one layer and deleted in a later one).

bash
# Install dive (macOS)
brew install dive

# Analyze an image
dive my-app:latest

# CI mode — fail if image efficiency is below threshold
dive my-app:latest --ci --lowestEfficiency=0.95

Dimension 3: Runtime Performance

A lean image is only half the story. How you run the container matters just as much. Docker's default settings are generous, which is fine for dev but dangerous in production. Here are the levers you should be tuning.

Memory and CPU Limits

Without limits, a single container can consume all host memory and trigger the OOM killer, taking down every container on the host. Always set explicit constraints:

bash
# Hard memory cap + 2 CPU cores
docker run -d \
  --memory=512m \
  --memory-swap=512m \
  --cpus=2 \
  --name api-server \
  my-api:latest

# Reserve memory (soft limit) — useful with orchestrators
docker run -d \
  --memory=512m \
  --memory-reservation=256m \
  my-api:latest

Setting --memory-swap equal to --memory disables swap entirely. This prevents a memory-starved container from thrashing swap and degrading performance for everything on the host.

Volumes for Write-Heavy Workloads

Docker's copy-on-write filesystem (overlay2) adds overhead for write-heavy operations. Databases, log files, and any workload that writes frequently should use volumes, which bypass the storage driver and write directly to the host filesystem:

bash
# Named volume for database data
docker run -d \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16-alpine

# tmpfs for ephemeral scratch data (RAM-backed, zero disk I/O)
docker run -d \
  --tmpfs /tmp:rw,noexec,size=256m \
  my-worker:latest

# Shared memory size (default is 64MB — too small for Postgres/Chrome)
docker run -d \
  --shm-size=256m \
  postgres:16-alpine

Network Performance

Docker's default bridge network adds NAT overhead. For latency-sensitive workloads where the container needs bare-metal network performance, you can bypass Docker networking entirely:

bash
# Host networking — no NAT, no port mapping, ~zero overhead
docker run -d --network=host my-app:latest
Warning

--network=host removes network isolation — the container shares the host's network namespace. Only use this when you've measured a real performance gap and accept the security trade-off. Note that on Docker Desktop for Mac/Windows (which runs containers in a Linux VM), host networking historically didn't work at all, and recent Docker Desktop releases offer it only as an opt-in feature.

Benchmarking and Monitoring

You can't optimize what you don't measure. Docker ships with built-in monitoring, and open-source tools take it further for production environments.

bash
# Live stats for all running containers
docker stats

# One-shot stats (useful for scripts)
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"

For production monitoring, cAdvisor exports per-container metrics (CPU, memory, filesystem, network) to Prometheus. A minimal setup looks like this:

yaml
# docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
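The compose file above mounts a ./prometheus.yml that is not shown. A minimal sketch (the job name and interval are arbitrary choices) that scrapes cAdvisor via its Compose service name:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 15s
    static_configs:
      - targets: ["cadvisor:8080"]
```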

Practical Exercise: From 1.2 GB to Under 50 MB

Let's apply every technique from this section. We'll start with a bloated Node.js image and systematically shrink it. The application is a simple Express API.

  1. Start with the bloated Dockerfile

    This is what a naive Dockerfile often looks like in the wild. Build it and note the image size:

    docker
    # Dockerfile.bloated — ~1.2 GB
    FROM node:20
    WORKDIR /app
    COPY . .
    RUN npm install
    RUN apt-get update && apt-get install -y python3 build-essential
    RUN npm run build
    EXPOSE 3000
    CMD ["node", "dist/server.js"]
    bash
    docker build -f Dockerfile.bloated -t myapp:bloated .
    docker images myapp:bloated
    # REPOSITORY   TAG      SIZE
    # myapp        bloated  1.21GB
  2. Switch to Alpine and add a .dockerignore

    Switch the base to node:20-alpine, split dependency install from source copy, and exclude unnecessary files from the build context:

    bash
    # Create .dockerignore
    echo -e "node_modules\ndist\n.git\n*.md\n.env" > .dockerignore
  3. Apply multi-stage build

    Separate the build stage (which needs dev dependencies and build tools) from the production stage (which needs only the compiled output and production dependencies):

    docker
    # Dockerfile.optimized — ~45 MB
    FROM node:20-alpine AS build
    WORKDIR /app
    COPY package.json package-lock.json ./
    RUN npm ci
    COPY tsconfig.json ./
    COPY src/ ./src/
    RUN npm run build
    
    FROM node:20-alpine AS production
    WORKDIR /app
    ENV NODE_ENV=production
    COPY package.json package-lock.json ./
    RUN npm ci --omit=dev && npm cache clean --force
    COPY --from=build /app/dist ./dist
    USER node
    EXPOSE 3000
    CMD ["node", "dist/server.js"]
  4. Verify the result

    Build the optimized image and compare sizes:

    bash
    docker build -f Dockerfile.optimized -t myapp:optimized .
    docker images myapp
    # REPOSITORY   TAG        SIZE
    # myapp        bloated    1.21GB
    # myapp        optimized  45MB
    
    # Inspect layers for further optimization
    dive myapp:optimized

    That's a 96% reduction — from 1.21 GB to ~45 MB — achieved by switching the base image, using multi-stage builds, installing only production dependencies, and cleaning the npm cache.

Going even smaller

For Go, Rust, or any language that produces static binaries, you can use FROM scratch or FROM gcr.io/distroless/static as the final stage — bringing the total image size to just the binary itself (often 5-15 MB). For Node.js, node:20-alpine is the practical floor since you need the Node runtime.

Beyond Single Host: Docker Swarm and Kubernetes Overview

A single Docker host gets you surprisingly far — but eventually you hit its ceiling. Maybe you need high availability, so one crashed server doesn't take everything down. Maybe traffic has outgrown a single machine. Maybe you need zero-downtime deploys. This is where container orchestration enters the picture.

Orchestrators manage containers across multiple machines, handling scheduling, networking, scaling, and self-healing. The two orchestrators most relevant to Docker users are Docker Swarm (built into Docker) and Kubernetes (the industry standard). Let's examine both, honestly.

graph LR
    subgraph swarm["Docker Swarm"]
        direction TB
        SM[Manager Nodes] --> SW[Worker Nodes]
        SW --> SS[Services]
        SS --> SC[Containers / Tasks]
    end

    subgraph shared["Shared Concepts"]
        direction TB
        S1[Declarative Config]
        S2[Service Discovery]
        S3[Load Balancing]
        S4[Rolling Updates]
        S5[Overlay Networking]
    end

    subgraph k8s["Kubernetes"]
        direction TB
        KCP[Control Plane] --> KW[Worker Nodes]
        KW --> KP[Pods]
        KP --> KC[Containers]
    end

    swarm --- shared
    shared --- k8s
    

Docker Swarm: The Built-In Orchestrator

Docker Swarm mode is baked directly into the Docker Engine. There's nothing extra to install — if you have Docker, you have Swarm. You initialize a cluster with a single command, and other nodes join with a token. The simplicity is genuine and significant.

bash
# Initialize Swarm on the first manager node
docker swarm init --advertise-addr 192.168.1.10

# On worker nodes, join with the token provided
docker swarm join --token SWMTKN-1-abc123... 192.168.1.10:2377

# Check cluster status
docker node ls

Swarm introduces the concept of services — a declarative way to say "I want 3 replicas of this image running at all times." Swarm handles placement, restarts, and load balancing automatically.

Services and Overlay Networking

A Swarm service wraps your container definition with orchestration metadata: replica count, update policy, resource constraints, and networking. Overlay networks span all nodes in the cluster, so containers on different physical machines can communicate as if they were on the same network.

bash
# Create an overlay network
docker network create --driver overlay backend-net

# Deploy a service with 3 replicas
docker service create \
  --name api \
  --replicas 3 \
  --network backend-net \
  --publish 8080:3000 \
  myapp/api:1.2.0

# Scale up
docker service scale api=5

# Rolling update to a new image
docker service update --image myapp/api:1.3.0 api

Stack Deploys: Compose for Swarm

If you already use Docker Compose, you're 80% of the way to Swarm. The docker stack deploy command reads a Compose file (with a deploy key) and creates services, networks, and volumes across the cluster. The deploy section specifies replicas, update policies, and resource limits — things that don't apply to single-host Compose.

yaml
# docker-compose.prod.yml
version: "3.8"
services:
  api:
    image: myapp/api:1.3.0
    networks:
      - backend
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      resources:
        limits:
          cpus: "0.50"
          memory: 256M
    ports:
      - "8080:3000"

  redis:
    image: redis:7-alpine
    networks:
      - backend
    deploy:
      placement:
        constraints:
          - node.role == manager

networks:
  backend:
    driver: overlay

bash
# Deploy the entire stack
docker stack deploy -c docker-compose.prod.yml myapp

# Check running services
docker stack services myapp

# Remove the stack
docker stack rm myapp

Routing Mesh

Swarm's routing mesh is an underrated feature. When you publish a port, every node in the cluster listens on that port — even nodes not running the service. Traffic hitting any node is automatically routed to a healthy container. This means you can point a simple load balancer (or DNS round-robin) at all your nodes without worrying about which one is running what.
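The mesh itself is internal to Swarm (IPVS under the hood), but the selection behavior can be sketched as a toy round-robin over task IPs in plain shell. No Docker is involved here and the addresses are made up; this only illustrates that every request, whichever node it enters on, is handed to the next healthy task.

```shell
# Toy simulation: pick the next backend task in round-robin order,
# the way the mesh balances requests regardless of entry node.
TASKS="10.0.0.2 10.0.0.3 10.0.0.4"   # made-up task IPs
I=0
pick_backend() {
  set -- $TASKS                      # word-split the task list
  shift $(( I % $# ))                # rotate by the request counter
  I=$(( I + 1 ))
  BACKEND=$1                         # result variable (avoids a subshell)
}
pick_backend; echo "$BACKEND"   # 10.0.0.2
pick_backend; echo "$BACKEND"   # 10.0.0.3
pick_backend; echo "$BACKEND"   # 10.0.0.4
pick_backend; echo "$BACKEND"   # 10.0.0.2 (wraps around)
```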

The Honest Take on Swarm

Docker Swarm is genuinely simpler than Kubernetes — less YAML, fewer concepts, zero external dependencies. For small teams running 5-20 services across a handful of nodes, it works well. However, Swarm lost the orchestration war. Docker Inc. pivoted away from it, the ecosystem stagnated, and most cloud providers don't offer managed Swarm. The community, tooling, and job market all center on Kubernetes. Learn Swarm if it fits your scale, but don't bet your career on it alone.

Kubernetes: The Industry Standard

Kubernetes (K8s) is the dominant container orchestration platform. It was originally designed by Google, based on over a decade of running containers at massive scale (their internal system, Borg). Unlike Swarm, Kubernetes is not built into Docker — it's a separate, complex system with its own API, CLI (kubectl), and ecosystem.

The learning curve is steep, but the payoff is a platform that can handle virtually any workload pattern. Here are the core building blocks you need to understand.

Core Kubernetes Objects

| Object | What It Does | Swarm Equivalent |
|---|---|---|
| Pod | Smallest deployable unit — one or more containers sharing network/storage | Task (single container) |
| Deployment | Manages Pod replicas, rolling updates, rollbacks | Service |
| Service | Stable network endpoint for a set of Pods (load balancing) | Service VIP + routing mesh |
| Namespace | Virtual cluster isolation within one physical cluster | Stack name (loosely) |
| ConfigMap | Inject configuration as env vars or files | Configs |
| Secret | Like ConfigMap but for sensitive data (base64-encoded, encryptable at rest) | Secrets |

Everything in Kubernetes is declared as YAML and applied via kubectl apply. Here's a minimal Deployment and Service — the Kubernetes equivalent of docker service create:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp/api:1.3.0
          ports:
            - containerPort: 3000
          resources:
            limits:
              cpu: "500m"
              memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 3000
  type: LoadBalancer

Notice the difference in verbosity. The Kubernetes manifest is roughly 3× longer than the equivalent Swarm stack file — and this is a simple example. That verbosity buys you precision and flexibility, but it's a real cost for small projects.

The Migration Path: Compose → Swarm → Kubernetes

You don't need to jump straight to Kubernetes. There's a natural progression that lets you add complexity only when your needs demand it.

Docker Compose handles local development and single-host deployments. When you need multi-node, add a deploy key and use Docker Swarm — your Compose files mostly carry over. When you outgrow Swarm (or need the Kubernetes ecosystem), Kompose can convert your Compose files into Kubernetes manifests as a starting point.

bash
# Install Kompose
curl -L https://github.com/kubernetes/kompose/releases/latest/download/kompose-linux-amd64 -o kompose
chmod +x kompose

# Convert a Compose file to Kubernetes manifests
./kompose convert -f docker-compose.yml

# This generates Deployment and Service YAML files
# Review and adjust them — Kompose output is a starting point, not production-ready

Kompose Is a Scaffold, Not a Solution

Kompose translates syntax, not architecture. It won't generate Ingress rules, Horizontal Pod Autoscalers, PersistentVolumeClaims, or health checks. Treat its output as a rough draft that needs manual refinement for production use.

Alternatives Worth Knowing

Kubernetes isn't the only option beyond Swarm. Depending on your team size, cloud provider, and operational appetite, these alternatives may be a better fit.

| Platform | Best For | Trade-off |
|---|---|---|
| HashiCorp Nomad | Teams that want orchestration without Kubernetes complexity. Supports containers, VMs, and bare binaries. | Smaller ecosystem than K8s. Fewer managed offerings. |
| AWS ECS | AWS-native teams. Tight integration with ALB, IAM, CloudWatch. | Vendor lock-in. Not portable to other clouds. |
| Google Cloud Run | Stateless HTTP services that need to scale to zero. No cluster management at all. | Limited to request-driven workloads. No persistent connections or background jobs. |
| Fly.io | Edge deployments. Runs containers close to users worldwide with minimal config. | Smaller company, niche platform. Less enterprise tooling. |

Choosing Your Orchestrator

If you're a solo dev or small team with fewer than 10 services, start with Docker Compose on a single host, or Docker Swarm for basic HA. If you're building a platform for multiple teams, or your cloud provider offers managed Kubernetes (EKS, GKE, AKS), that's the pragmatic choice. The "best" orchestrator is the one your team can actually operate.

Real-World Patterns and Production Playbook

Running Docker in development is one thing. Running it in production — where a container crash at 3 AM pages your on-call engineer — is another beast entirely. This section distills the battle-tested patterns that separate "it works on my machine" from "it works reliably, at scale, under pressure."

We will walk through the 12-factor methodology applied to containers, graceful shutdown mechanics, logging pipelines, image promotion strategies, monitoring stacks, and a concrete production-readiness checklist you can adopt today.

The 12-Factor App, Containerized

The 12-factor app methodology was written in 2011, but it reads like a Docker best-practices guide. Containers naturally enforce many of its principles — but not all of them automatically. Here is how the most critical factors map to Docker decisions.

| Factor | Principle | Docker Implementation |
|---|---|---|
| III. Config | Store config in the environment | Use -e, --env-file, or orchestrator secrets — never bake config into images |
| VI. Processes | Stateless processes | Keep containers ephemeral; persist state in volumes or external stores (Redis, PostgreSQL) |
| VII. Port Binding | Export services via port binding | EXPOSE in Dockerfile, -p at runtime — the app binds to 0.0.0.0 |
| VIII. Concurrency | Scale via the process model | Scale horizontally with docker compose up --scale web=4 or Kubernetes replicas |
| IX. Disposability | Fast startup, graceful shutdown | Minimal images + proper signal handling (see below) |
| XI. Logs | Treat logs as event streams | Write to stdout/stderr — let the logging driver handle the rest |

Factor III is the one teams violate most

If you are copying .env files into your image at build time, or hard-coding database URLs in application code, you have broken config separation. The image should be identical across all environments — only environment variables change.
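A toy plain-shell illustration of the principle: one artifact, behavior switched entirely by the environment. The script path and the DATABASE_URL values are hypothetical stand-ins for your application and its config.

```shell
# Stand-in for an application that reads its config from the environment
cat > /tmp/app.sh <<'EOF'
#!/bin/sh
echo "connecting to ${DATABASE_URL:-postgres://localhost:5432/dev}"
EOF
chmod +x /tmp/app.sh

/tmp/app.sh                                              # falls back to the dev default
DATABASE_URL="postgres://prod-db:5432/app" /tmp/app.sh   # same artifact, prod config
```

The "image" (here, the script) never changes between environments; only the variables injected at runtime do.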

Health Checks in Production

A running container is not necessarily a healthy container. Your Node.js process might be alive but stuck in an infinite loop. Your database connection pool might be exhausted. Docker's HEALTHCHECK instruction lets you define what "healthy" actually means for your application, and orchestrators use this signal to route traffic and restart failed containers.

dockerfile
# Note: curl must exist in the image; on busybox/alpine bases, `wget -q --spider` works instead
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:8080/healthz || exit 1

The four parameters matter:

  • --interval: How often to check (30s is a sensible default).
  • --timeout: Max time the check can take before it is considered failed.
  • --start-period: Grace period for slow-starting apps (the check runs but failures do not count).
  • --retries: How many consecutive failures before the container is marked unhealthy.
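The consecutive-failure rule is easy to get wrong, so here is a toy plain-shell model of it. No Docker is involved; the probe results are hard-coded to show that a single success resets the count.

```shell
# Toy model of --retries=3: only three *consecutive* failures flip the
# status to unhealthy; any success resets the counter.
RETRIES=3
FAILURES=0
STATUS=healthy
record_probe() {
  if [ "$1" = "fail" ]; then
    FAILURES=$(( FAILURES + 1 ))
    if [ "$FAILURES" -ge "$RETRIES" ]; then STATUS=unhealthy; fi
  else
    FAILURES=0
  fi
}
# fail, ok, fail, fail, fail: the lone success resets the count,
# then three consecutive failures mark the container unhealthy
for r in fail ok fail fail fail; do record_probe "$r"; done
echo "$STATUS"   # unhealthy
```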

For production health endpoints, go deeper than just "is the server responding." Check downstream dependencies — database connectivity, cache availability, disk space — and return structured JSON with per-component status.

json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "up", "latency_ms": 3 },
    "redis": { "status": "up", "latency_ms": 1 },
    "disk": { "status": "up", "free_gb": 12.4 }
  },
  "uptime_seconds": 86412
}

Graceful Shutdown: SIGTERM, PID 1, and Init Systems

When Docker stops a container, it sends SIGTERM to PID 1, waits for a grace period (default: 10 seconds), and then sends SIGKILL. If your app handles SIGTERM correctly, it can finish in-flight requests, close database connections, and flush buffers before exiting cleanly. If it does not, your users see dropped connections and your data might be corrupted.

The problem is that many containers run a shell as PID 1. Shells do not forward signals to child processes by default, so your app never receives SIGTERM — it just gets hard-terminated after the timeout. A related issue is zombie reaping: PID 1 is responsible for reaping orphaned child processes, a duty most applications (and shells) neglect, which leaves defunct processes accumulating in the container.

The Fix: Use tini or dumb-init

dockerfile
# tini is built into Docker — just use --init at runtime
# OR install it explicitly in your image:
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["tini", "--"]
CMD ["node", "server.js"]

The easiest option: run any container with docker run --init myapp and Docker uses its bundled tini. For Compose, add init: true to the service.
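In Compose syntax that is a one-line addition (service name and image are illustrative):

```yaml
services:
  api:
    image: myapp/api:1.3.0
    init: true   # Docker injects its bundled tini as PID 1
```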

dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends dumb-init \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["dumb-init", "--"]
CMD ["python", "app.py"]

dumb-init by Yelp serves the same purpose — it runs as PID 1, forwards signals, and reaps zombie processes. Choose whichever your team prefers; both are battle-tested.
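What these init shims do can be demonstrated in plain shell, with no Docker at all: a wrapper that traps SIGTERM and forwards it to its child, which is the essential job of a PID 1. A minimal sketch (file paths are arbitrary, and `sleep` stands in for a real server):

```shell
# A PID-1-style wrapper: forward SIGTERM to the child, wait for it to
# finish, then exit cleanly instead of being hard-killed.
cat > /tmp/entry.sh <<'EOF'
#!/bin/sh
term() { kill -TERM "$CHILD" 2>/dev/null; wait "$CHILD"; echo "clean shutdown"; exit 0; }
trap term TERM
sleep 30 &       # stand-in for the real server process
CHILD=$!
wait "$CHILD"    # interrupted when SIGTERM arrives, then the trap runs
EOF

sh /tmp/entry.sh > /tmp/entry.log &
WRAPPER=$!
sleep 1
kill -TERM "$WRAPPER"   # what `docker stop` sends to PID 1
wait "$WRAPPER"
cat /tmp/entry.log      # clean shutdown
```

Without the trap, the wrapper would ignore SIGTERM, the child would never hear about the shutdown, and Docker would SIGKILL everything after the grace period.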

On the application side, you still need to handle the signal. Here is a minimal Node.js example:

javascript
process.on('SIGTERM', () => {
  console.log('SIGTERM received. Shutting down gracefully...');
  server.close(() => {
    db.end();            // close database pool
    process.exit(0);     // exit cleanly
  });
  // Force exit after 8s if close hangs
  setTimeout(() => process.exit(1), 8000);
});

Logging Strategies

Docker captures everything your container writes to stdout and stderr. The logging driver determines where those streams go. Choosing the right driver is one of the highest-impact production decisions you will make — it affects debugging speed, storage costs, and system performance.

| Driver | Where Logs Go | Best For | Caveat |
|---|---|---|---|
| json-file | Local JSON files on the host | Development, small deployments | Fills disk without max-size/max-file limits |
| fluentd | Fluentd/Fluent Bit collector | Centralized logging at scale | Requires running a Fluentd sidecar or daemon |
| loki | Grafana Loki (via plugin) | Grafana-based stacks, cost-conscious teams | Needs the Loki Docker plugin installed |
| syslog | Remote syslog server | Traditional infrastructure | Less structured than JSON alternatives |

Always set log rotation on the default json-file driver. Without it, a verbose container can consume all available disk space and bring down the host.

json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5",
    "compress": "true"
  }
}

Put this in /etc/docker/daemon.json to set defaults for all containers on the host. For a production ELK (Elasticsearch + Logstash + Kibana) pipeline, ship logs from Fluent Bit to Elasticsearch. Fluent Bit is lighter than Logstash and handles Docker log format natively.
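A minimal Fluent Bit configuration for that pipeline might look like the sketch below. The host name `elasticsearch` and the tag are assumptions for a Compose-style setup; the `docker` parser ships with Fluent Bit's default parsers file.

```ini
[INPUT]
    Name    tail
    Path    /var/lib/docker/containers/*/*-json.log
    Parser  docker
    Tag     docker.*

[OUTPUT]
    Name    es
    Match   docker.*
    Host    elasticsearch
    Port    9200
```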

Structured logging from the start

Emit logs as JSON from your application. This makes them machine-parseable without grok patterns or regex extractors. A line like {"level":"error","msg":"connection refused","service":"api","ts":"2024-01-15T08:30:00Z"} is infinitely more useful in Kibana or Grafana than a plain text string.
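A toy demonstration of why this matters: with one JSON object per line, plain tools can filter logs without any parsing setup. The sample log lines are invented.

```shell
# JSON-per-line logs are filterable with ordinary text tools
cat > /tmp/app.log <<'EOF'
{"level":"info","msg":"listening on :3000","service":"api"}
{"level":"error","msg":"connection refused","service":"api"}
{"level":"info","msg":"GET /healthz 200","service":"api"}
EOF
grep -c '"level":"error"' /tmp/app.log   # counts error-level lines: 1
```

The same property is what lets Kibana or Grafana index each field directly instead of extracting them with regexes.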

Configuration Management

Configuration has a hierarchy in containerized environments. Getting it right means your image stays portable and your secrets stay safe.

The Configuration Pyramid

  1. Baked into the image — Only for defaults that are truly universal (e.g., the app listen port, timezone UTC). These go in the Dockerfile.
  2. Environment variables — For values that change per environment: database URLs, feature flags, log levels. Pass via -e, --env-file, or Compose environment: key.
  3. Mounted config files — For complex configuration (nginx.conf, application YAML). Bind-mount or use Docker configs in Swarm.
  4. Secrets — For passwords, API keys, TLS certs. Use Docker Secrets (Swarm), Kubernetes Secrets, or a vault like HashiCorp Vault. Never pass secrets as build args — they are stored in image layers.

yaml
# docker-compose.yml — clean config separation
services:
  api:
    image: mycompany/api:1.4.2
    environment:
      - NODE_ENV=production
      - LOG_LEVEL=info
    env_file:
      - .env.production
    configs:
      - source: nginx_conf
        target: /etc/nginx/nginx.conf
    secrets:
      - db_password

configs:
  nginx_conf:
    file: ./config/nginx.prod.conf

secrets:
  db_password:
    external: true

Image Promotion: Build Once, Deploy Everywhere

The golden rule of container deployments: build the image exactly once, then promote that identical artifact through dev, staging, and production. Rebuilding per environment introduces drift — different dependency versions, different build timestamps, different behavior. If it passed tests in staging, you want that exact image in production.

mermaid
graph LR
    A["Developer\ngit push"] --> B["CI Pipeline\ndocker build"]
    B --> C["Registry\n:git-sha tag"]
    C --> D["Dev\nAuto-deploy"]
    D --> E{"Tests\nPass?"}
    E -->|Yes| F["Staging\nSame image"]
    F --> G{"QA\nPass?"}
    G -->|Yes| H["Production\nSame image"]
    E -->|No| I["Fix &\nRebuild"]
    G -->|No| I

    style A fill:#4a9eff,color:#fff
    style B fill:#f5a623,color:#fff
    style C fill:#7b68ee,color:#fff
    style H fill:#2ecc71,color:#fff
    style I fill:#e74c3c,color:#fff
    

The key mechanism is tagging. Each build produces a single image tagged with both a semantic version and the git SHA. The same image gets additional tags as it is promoted.

bash
# Build once in CI
GIT_SHA=$(git rev-parse --short HEAD)
VERSION="1.4.2"

docker build -t mycompany/api:${VERSION}-${GIT_SHA} .

# Tag for promotion
docker tag mycompany/api:${VERSION}-${GIT_SHA} mycompany/api:${VERSION}
docker tag mycompany/api:${VERSION}-${GIT_SHA} mycompany/api:latest

# Push all tags
docker push mycompany/api:${VERSION}-${GIT_SHA}
docker push mycompany/api:${VERSION}
docker push mycompany/api:latest

Tagging Strategy: Semver + Git SHA

| Tag | Example | Purpose |
|---|---|---|
| version-sha | 1.4.2-a3f8c1d | Immutable, traceable to exact commit |
| version | 1.4.2 | Human-readable release identifier |
| latest | latest | Convenience only — never use in production manifests |

Never deploy :latest in production

The :latest tag is mutable — it points to whatever was pushed most recently. If two developers push in quick succession, the second overwrites the first with no audit trail. Always pin to an immutable tag (1.4.2-a3f8c1d) in production deployments so you can trace exactly what is running and roll back deterministically.

Monitoring: cAdvisor + Prometheus + Grafana

The standard open-source monitoring stack for containerized workloads is cAdvisor, Prometheus, and Grafana. Each layer has a specific job: cAdvisor exposes per-container resource metrics (CPU, memory, network, disk I/O), Prometheus scrapes and stores those metrics as time-series data, and Grafana visualizes them with dashboards and alerts.

yaml
# docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"

  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_pw

volumes:
  prometheus_data:
  grafana_data:

Point Prometheus at cAdvisor with a simple scrape config:

yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'app'
    static_configs:
      - targets: ['api:9100']  # your app's /metrics endpoint

Key metrics to alert on: container_memory_usage_bytes approaching the limit (OOM termination incoming), container_cpu_usage_seconds_total sustained spikes, container_network_receive_errors_total rising, and your application's own latency and error-rate metrics exposed via a /metrics endpoint.
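As a sketch, the memory alert might be expressed as a Prometheus alerting rule like the one below. The 90% threshold and the 5-minute window are illustrative, and note that cAdvisor only populates container_spec_memory_limit_bytes for containers that actually have a memory limit set.

```yaml
# alert-rules.yml (illustrative thresholds — tune for your workloads)
groups:
  - name: containers
    rules:
      - alert: ContainerNearMemoryLimit
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} is above 90% of its memory limit"
```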

Production-Readiness Checklist

Before any container touches production traffic, run through this checklist. Each item addresses a real failure mode that has caused outages in production systems.

| Category | Check | Why It Matters |
|---|---|---|
| Image | Using a minimal base (alpine, distroless, slim) | Fewer packages = smaller attack surface, faster pulls |
| Image | Pinned base image to a specific digest or version | Prevents silent breakage when upstream images change |
| Image | Multi-stage build — no build tools in final image | Reduces image size and eliminates compilers attackers could use |
| Image | Scanned for CVEs (Trivy, Snyk, Grype) | Known vulnerabilities caught before deployment |
| Runtime | Running as non-root user | Limits blast radius if the container is compromised |
| Runtime | HEALTHCHECK defined and tested | Orchestrator can detect and replace unhealthy instances |
| Runtime | Graceful shutdown handles SIGTERM | Zero dropped connections during deployments and scaling |
| Runtime | Memory and CPU limits set | Prevents a single container from starving the host |
| Ops | Logs go to stdout/stderr with rotation configured | Prevents disk exhaustion, enables centralized log aggregation |
| Ops | Secrets injected at runtime, not baked into image | Secrets in layers are recoverable by anyone with image access |
| Ops | Immutable tag deployed (semver + SHA, not :latest) | Reproducible deployments, reliable rollbacks |
| Ops | Monitoring and alerting configured | You find out about problems before your users do |

Automate the checklist

Do not rely on humans to remember 12 items. Encode these checks into your CI pipeline. Tools like Hadolint for Dockerfile linting, Trivy for vulnerability scanning, and Conftest for policy-as-code can gate deployments automatically.