Docker — From Fundamentals to Mastery
Prerequisites: Basic command-line/terminal familiarity, understanding of what a web server or application process is, and basic Linux concepts (files, processes, networking). No prior container experience required.
What Docker Is and Why It Matters
Docker is a platform for building, shipping, and running applications inside containers — lightweight, isolated environments that package your code together with every dependency it needs: libraries, system tools, runtime, and configuration files. Think of a container as a self-contained unit that runs the same way regardless of where you deploy it.
The pitch is simple: you describe your environment once in a Dockerfile, build an image, and that image runs identically on your laptop, your CI server, a staging VM, and a production Kubernetes cluster. No surprises, no drift, no "let me check which version of Python is installed."
The Problem: "Works on My Machine"
Every developer has heard — or said — the phrase "it works on my machine." The root cause is environment drift: subtle differences in OS versions, library versions, environment variables, file paths, and system configurations between development, staging, and production. These differences create bugs that are nearly impossible to reproduce and painful to debug.
Before containers, teams tried to solve this with provisioning scripts, configuration management tools like Chef or Puppet, and heavyweight virtual machines. These approaches helped, but they were slow, brittle, and expensive. Docker attacked the problem at its core by making the environment itself portable and versioned.
Docker doesn't just ship your code — it ships the entire environment your code needs to run. The container is the deployment artifact, not just the application binary.
Here's what a minimal Dockerfile looks like — a plain text recipe that defines your container environment:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
This file pins the Node.js version, installs exact dependencies, and defines how the app starts. Anyone with Docker installed can run docker build -t myapp . and get an identical image — on any machine, any OS.
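One companion file is worth creating alongside any Dockerfile that uses COPY . .: a .dockerignore, which keeps bulky or sensitive paths out of the build context. A typical starting point for the Node.js example above:

```text
# .dockerignore: these paths never reach the Docker build context
node_modules
npm-debug.log
.git
.env
Dockerfile
.dockerignore
```

Smaller contexts mean faster builds and no risk of baking local secrets or a stale node_modules into the image.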
Containers vs. Virtual Machines
Containers and virtual machines both provide isolation, but they achieve it in fundamentally different ways. A VM virtualizes the hardware through a hypervisor (like VMware, Hyper-V, or KVM), and each VM runs a complete guest operating system — its own kernel, init system, and full userspace. A container virtualizes at the OS level, sharing the host's kernel and isolating processes using Linux namespaces and cgroups.
This architectural difference has massive practical consequences:
| Characteristic | Virtual Machine | Container |
|---|---|---|
| Isolation level | Full hardware virtualization | Process-level (shared kernel) |
| Startup time | 30–90 seconds | Milliseconds to low seconds |
| Image size | Gigabytes (includes full OS) | Megabytes (just app + deps) |
| Resource overhead | High — each VM reserves CPU/RAM for its OS | Minimal — near bare-metal efficiency |
| Density | Tens of VMs per host | Hundreds to thousands of containers per host |
| Guest OS | Any OS (Linux, Windows, BSD) | Must match host kernel type |
| Security boundary | Strong — full kernel separation | Weaker — shared kernel attack surface |
In practice, containers and VMs are not an either/or choice. Production environments often run containers inside VMs — the VM provides a strong security boundary and the containers provide fast, dense application packaging on top.
A Brief History
Container-like isolation isn't new. Unix had chroot in 1979, FreeBSD introduced Jails in 2000, and Linux added cgroups (merged into the kernel in 2008) and namespaces (rolled out incrementally between 2002 and 2013). What Docker did in 2013 was make this technology accessible.
Docker started life inside a PaaS company called dotCloud. Solomon Hykes open-sourced the internal container engine at PyCon 2013, and the response was so overwhelming that dotCloud pivoted entirely and rebranded as Docker, Inc. Within a year, every major cloud provider had Docker support, and the container ecosystem exploded.
In 2015, Docker helped establish the Open Container Initiative (OCI) under the Linux Foundation. The OCI defines two critical standards:
- Image Spec (image-spec) — the format for container images, ensuring any OCI-compliant tool can build and distribute images.
- Runtime Spec (runtime-spec) — how a container is configured, created, and run, so different runtimes (runc, crun, gVisor) are interchangeable.
These standards mean you're not locked into Docker's tooling. Podman, Buildah, containerd, and others all speak the same language. Docker popularized containers; the OCI made them an industry standard.
Docker in the Ecosystem
mindmap
  root((Docker))
    Problems It Solves
      Environment consistency
      Dependency isolation
      Reproducible builds
      Dev/prod parity
    Key Concepts
      Images
      Containers
      Registries
      Dockerfile
    Alternatives
      Podman
      LXC / LXD
      Virtual Machines
      Firecracker microVMs
    Use Cases
      Microservices
      CI/CD pipelines
      Local dev environments
      Cloud deployment
When Docker Is the Right Tool
Docker excels in specific scenarios. It's the default choice for packaging microservices, running reproducible CI/CD pipelines, creating consistent local development environments, and deploying to any cloud platform. If your workload is a Linux-based server process — an API, a web app, a worker, a database — Docker is almost certainly a good fit.
When Docker Is Not the Right Tool
Not everything belongs in a container. Recognize these situations and reach for something else:
- Different-kernel workloads — You can't run a Windows container on a Linux host (or vice versa) without a VM in between, because containers share the host kernel. If you need FreeBSD jails or a macOS environment, Docker won't help.
- Heavy GUI applications — Desktop apps with GPU acceleration, complex display requirements, and hardware peripherals are a poor fit. The isolation model adds friction (X11 forwarding, GPU passthrough) that rarely justifies the benefit.
- Bare-metal HPC — High-performance computing workloads that need every last CPU cycle, direct hardware access (InfiniBand, RDMA), or custom kernel modules pay a measurable overhead cost from the container abstraction layer, even if small.
- Stateful data stores in production — While you can run databases in containers (and it's great for dev/test), production databases often benefit from running on bare metal or VMs where storage performance, durability, and operational tooling are mature.
The dividing line is simple: a Linux server process that reads its config from the environment and communicates over the network containerizes cleanly. A workload that needs direct hardware access or a non-Linux kernel belongs elsewhere.
Docker Architecture: Daemon, Containerd, and the OCI Stack
Docker is not a single monolithic binary — it is a stack of loosely coupled components, each with a well-defined responsibility. Understanding this layered architecture helps you debug container failures, reason about security boundaries, and appreciate why Docker containers keep running even when parts of the stack restart.
The architecture flows top-to-bottom: the Docker CLI talks to the Docker daemon, which delegates to containerd, which spawns containers via runc. Each layer exists for a reason, and each can be swapped independently thanks to the OCI (Open Container Initiative) standards.
The Full Picture
graph TD
CLI["Docker CLI
docker run, build, push"]
DOCKERD["Docker Daemon - dockerd
API · Image Builds · Networking · Volumes"]
CTRD["containerd
Container Supervision · Image Management · Snapshots"]
SHIM["containerd-shim
Keeps container alive independently"]
RUNC["runc
Creates namespaces + cgroups, then exits"]
PROC["Container Process
PID 1 inside container"]
IMG["Image Store
content-addressed"]
SNAP["Snapshotter
overlay2"]
CLI -->|"REST API over
Unix socket"| DOCKERD
DOCKERD -->|"gRPC API"| CTRD
CTRD --> SHIM
SHIM --> RUNC
RUNC -->|"fork/exec"| PROC
CTRD --- IMG
CTRD --- SNAP
style CLI fill:#4a9eff,stroke:#2d7cd6,color:#fff
style DOCKERD fill:#366ea3,stroke:#2a5580,color:#fff
style CTRD fill:#7c3aed,stroke:#6025c9,color:#fff
style SHIM fill:#9f67f5,stroke:#7c3aed,color:#fff
style RUNC fill:#e05d44,stroke:#b8452f,color:#fff
style PROC fill:#2ea043,stroke:#238636,color:#fff
style IMG fill:#7c3aed,stroke:#6025c9,color:#fff
style SNAP fill:#7c3aed,stroke:#6025c9,color:#fff
Docker CLI: The User-Facing Interface
The Docker CLI (docker) is a standalone binary that does almost nothing on its own. Every command you type — docker run, docker build, docker push — is translated into an HTTP request and sent to the Docker daemon over a Unix socket (/var/run/docker.sock) or a TCP endpoint.
This separation is what makes remote Docker possible. You can point your local CLI at a remote daemon by setting the DOCKER_HOST environment variable, and every command works as if the daemon were local.
# The CLI just sends REST calls to the daemon
# These two commands are equivalent:
docker ps
curl --unix-socket /var/run/docker.sock http://localhost/v1.44/containers/json | jq
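In practice the remote switch looks like this. The hostname build-server is a placeholder; the SSH transport has been supported since Docker 18.09:

```shell
# Point the local CLI at a remote daemon over SSH.
# "build-server" is a placeholder hostname for illustration.
export DOCKER_HOST=ssh://user@build-server
# From here on, every docker command runs against the remote daemon:
#   docker ps
#   docker build -t myapp .
# Unset the variable to return to the local daemon:
#   unset DOCKER_HOST
```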
Docker Daemon (dockerd): The Orchestration Hub
The Docker daemon (dockerd) is the central API server. It exposes the full Docker API and is responsible for the higher-level features that Docker is known for: image builds (processing Dockerfiles), networking (bridge, overlay, host networks), volumes (named and bind mounts), and image management (pull, push, tag).
What dockerd does not do is actually run containers. When you execute docker run nginx, the daemon resolves the image, sets up networking and volumes, and then hands the actual container creation to containerd via a gRPC API. This decoupling means that dockerd can be upgraded or restarted without stopping your running containers.
# You can see the daemon and containerd running as separate processes
ps aux | grep -E 'dockerd|containerd'
# root 1234 dockerd --group docker --host fd://
# root 1235 containerd
# root 1290 containerd-shim-runc-v2 -namespace moby -id <container-id>
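One caveat: with a default configuration, stopping dockerd (for example via systemctl stop docker) also stops your containers. To actually benefit from this decoupling during daemon upgrades, enable the documented live-restore option in /etc/docker/daemon.json:

```json
{
  "live-restore": true
}
```

With this set, containers keep running while dockerd is down and are re-attached when it comes back. It is an opt-in daemon setting, not a default.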
containerd: The Container Supervisor
Containerd is a high-level container runtime — a CNCF graduated project that Docker donated to the community. It manages the complete container lifecycle: pulling and storing images, creating container snapshots using a snapshotter (typically overlay2), and supervising running containers.
Containerd exposes its own gRPC API and can be used entirely without Docker. In fact, Kubernetes clusters using the containerd CRI (Container Runtime Interface) talk directly to containerd, bypassing dockerd entirely. This is why Kubernetes deprecated the Docker shim — containerd was already doing the real work.
Containerd is not a Docker replacement — it is a Docker component. Docker adds image builds, Compose, Swarm, and a user-friendly CLI on top of containerd. When Kubernetes "dropped Docker support," it just stopped using the dockerd layer and talked to containerd directly.
runc: The Low-Level Runtime
Runc is the low-level OCI-compliant runtime that creates the actual container process. When containerd needs to start a container, it prepares an OCI bundle (a config.json file describing the container configuration plus a root filesystem) and invokes runc.
Runc's job is focused and short-lived. It calls the Linux kernel primitives to create namespaces (pid, net, mnt, uts, ipc, user) and configure cgroups (CPU, memory, I/O limits), then performs a fork/exec to start the container's entrypoint process. Once the container process is running, runc exits. It is not a long-running daemon.
# runc can create containers directly from an OCI bundle
# First, export a root filesystem into rootfs/ and generate a config
mkdir -p my-container/rootfs && cd my-container
docker export $(docker create alpine) | tar -C rootfs -xf -
runc spec # generates config.json (OCI runtime spec, pointing at ./rootfs)
sudo runc run my-alpine-ctr # creates namespaces, cgroups, and runs /bin/sh
containerd-shim: The Decoupling Layer
The shim process (containerd-shim-runc-v2) is the unsung hero of Docker's reliability. One shim process runs per container and acts as the parent process for the container's PID 1. This design achieves a critical goal: containers survive daemon restarts.
If containerd crashes or is upgraded, the shim keeps the container running. When containerd comes back, it reconnects to the existing shims. The shim also handles keeping STDIO streams open, reporting the container's exit status, and reaping zombie processes.
| Component | Lifecycle | Role |
|---|---|---|
| dockerd | Long-running daemon | API server, image builds, networking, volumes |
| containerd | Long-running daemon | Image management, container supervision, snapshots |
| containerd-shim | Per-container, long-running | Decouples container from containerd lifecycle |
| runc | Short-lived (exits after setup) | Creates namespaces and cgroups, fork/exec entrypoint |
The OCI Standards: Why This All Works
The Open Container Initiative (OCI) defines two specifications that make Docker's components interchangeable. The Image Spec defines how container images are formatted (layers, manifests, and configuration). The Runtime Spec defines the interface for low-level runtimes — the config.json format and the expected lifecycle commands (create, start, kill, delete).
Because these specs are standardized, you can swap components freely. Replace dockerd with Podman (which is daemonless and talks directly to a container runtime). Replace containerd with CRI-O (purpose-built for Kubernetes). Replace runc with kata-containers (which runs each container inside a lightweight VM for stronger isolation). The interfaces remain the same.
Any image you build with docker build is OCI-compliant by default. It will run on Podman, CRI-O, containerd standalone, or any other OCI-compatible runtime without modification. You are not locked into Docker.
Seeing the Stack in Action
You can observe the entire chain on a running Docker host. When you start a container, each layer leaves a footprint in the process tree:
# Start a container, then inspect the process hierarchy
docker run -d --name demo nginx:alpine
# Show the process tree — notice runc is already gone
pstree -p $(pgrep containerd-shim | head -1)
# containerd-shim(4521)─┬─nginx(4550)─┬─nginx(4601)
# │ └─nginx(4602)
# └─{containerd-shi}(4522)
# The shim is the parent, not dockerd or containerd
# This is why the container survives daemon restarts
Docker does not use a hypervisor or virtual machine. Containers are regular Linux processes isolated with kernel namespaces and constrained with cgroups. The runc runtime sets up these kernel primitives — it does not start a VM. This is why containers start in milliseconds, not seconds.
Under the Hood: Namespaces, Cgroups, and Union Filesystems
A container is not a virtual machine. It's a regular Linux process that the kernel has been told to isolate and restrict. Three kernel features make this possible: namespaces give the process its own view of the system, cgroups cap the resources it can consume, and union filesystems provide an efficient layered filesystem. Understanding these primitives is the difference between using Docker as a black box and truly mastering it.
graph TD
subgraph NS["Namespaces"]
PID["PID"]
NET["NET"]
MNT["MNT"]
UTS["UTS"]
IPC["IPC"]
USER["USER"]
end
subgraph CG["Cgroups"]
CPU["CPU"]
MEM["Memory"]
PIDS["PIDs"]
IO["Block I/O"]
end
subgraph OFS["OverlayFS"]
BASE["Base Image Layer"]
APP["App Layer"]
RW["Writable Layer"]
BASE --> APP --> RW
end
NS -->|"Isolated Process View"| CONTAINER
CG -->|"Resource Limits"| CONTAINER
OFS -->|"Layered Filesystem"| CONTAINER
CONTAINER["🐳 Container = Isolated Process"]
style CONTAINER fill:#0db7ed,stroke:#384d54,color:#fff,font-weight:bold
style NS fill:#2d3748,stroke:#4a5568,color:#e2e8f0
style CG fill:#2d3748,stroke:#4a5568,color:#e2e8f0
style OFS fill:#2d3748,stroke:#4a5568,color:#e2e8f0
Namespaces: Giving Each Container Its Own World
Linux namespaces partition kernel resources so that one set of processes sees one set of resources while another set sees a different set. When Docker starts a container, it creates a new instance of each namespace type, giving the container process an isolated view of the system. The process thinks it has the entire machine to itself.
| Namespace | Flag | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs — the container's entrypoint becomes PID 1 |
| NET | CLONE_NEWNET | Network stack — interfaces, routing tables, iptables rules, ports |
| MNT | CLONE_NEWNS | Mount points — the container sees its own root filesystem |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues, shared memory |
| USER | CLONE_NEWUSER | UID/GID mapping — root inside the container can map to an unprivileged user on the host |
Seeing Namespaces in Action
Every process on Linux has namespace references listed under /proc/self/ns/. You can compare these between the host and a container to see the isolation in effect. Each namespace gets a unique inode number — different numbers mean different namespaces.
# On the host — list your namespace inode numbers
ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026531836]'
# ...
# Inside a container — the inode numbers will be DIFFERENT
docker run --rm alpine ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532257]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532260]'
# ...
The different inode numbers (e.g., 4026531836 vs 4026532260 for PID) prove that the container process lives in a separate PID namespace. Inside that namespace, it sees its own PID 1 and cannot see host processes at all.
# The container's entrypoint IS PID 1 inside its PID namespace
docker run --rm alpine ps aux
# PID USER COMMAND
# 1 root ps aux
# But on the host, it's just a regular process with a high PID
docker run -d --name demo alpine sleep 3600
docker top demo
# PID USER COMMAND
# 48291 root sleep 3600
Inside a container, your entrypoint process becomes PID 1. This means it receives signals like SIGTERM directly. If your process doesn't handle signals properly, docker stop will wait the full timeout (default 10s) and then send SIGKILL. This is why many images use tini or dumb-init as a lightweight init system.
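If your image wraps the app in a shell-script entrypoint, the script itself becomes PID 1 and will swallow those signals unless it hands control over with exec. A minimal sketch (the script path and setup comment are illustrative):

```shell
# Write a tiny entrypoint that execs the real command. 'exec' replaces
# the shell process instead of forking a child, so the app inherits
# PID 1 and receives SIGTERM from 'docker stop' directly.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/sh
set -e
# ...setup work (config templating, migrations) would go here...
exec "$@"
EOF
chmod +x /tmp/entrypoint.sh
/tmp/entrypoint.sh echo "now running as the main process"
```

Alternatively, docker run --init injects a minimal init process (tini) as PID 1 that forwards signals and reaps zombies for you.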
Cgroups: Enforcing Resource Limits
Namespaces control what a process can see; cgroups (control groups) control what it can use. Cgroups are a kernel mechanism that organizes processes into hierarchical groups and applies resource limits to those groups. Without cgroups, a single container could consume all host memory or starve every other process of CPU time.
Cgroups v1 vs v2
Linux ships two cgroup architectures. Modern Docker installations (Docker Engine 20.10+ on kernels 5.2+) support both, and cgroups v2 is now the default on most current distributions including Ubuntu 22.04+, Fedora 31+, and Debian 11+.
| Feature | Cgroups v1 | Cgroups v2 |
|---|---|---|
| Hierarchy | Multiple trees — one per resource controller (cpu, memory, etc.) | Single unified tree at /sys/fs/cgroup/ |
| Controllers | Independently mountable | All managed through one hierarchy |
| Pressure info | Not available | PSI (Pressure Stall Information) for memory, CPU, I/O |
| Mount path | /sys/fs/cgroup/memory/, /sys/fs/cgroup/cpu/, ... | /sys/fs/cgroup/ (unified) |
How Docker Uses Cgroups
When you pass flags like --memory or --cpus to docker run, Docker translates them into cgroup settings. Here's what the common flags control:
# Run a container with resource limits
docker run -d \
--name constrained \
--memory=256m \
--memory-swap=512m \
--cpus=1.5 \
--pids-limit=100 \
nginx:alpine
These flags map directly to cgroup knobs. You can inspect them by reading the cgroup filesystem. On a cgroups v2 system, Docker places each container's cgroup under /sys/fs/cgroup/system.slice/docker-<container-id>.scope/.
# Find the container's cgroup path
CONTAINER_ID=$(docker inspect --format '{{.Id}}' constrained)
# Cgroups v2 — read memory limit (in bytes)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# 268435456 (= 256 * 1024 * 1024 = 256MB)
# Read CPU quota (in microseconds per period)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# 150000 100000 (= 150ms of every 100ms period = 1.5 CPUs)
# Read PIDs limit
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/pids.max
# 100
# See current memory usage
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# 8388608 (current usage in bytes)
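The mapping is easy to verify with shell arithmetic, using the same numbers as the output above:

```shell
# --memory=256m becomes memory.max in bytes
echo $((256 * 1024 * 1024))   # -> 268435456
# cpu.max holds "<quota> <period>" in microseconds; quota / period = CPUs
echo "150000 100000" | awk '{printf "%g\n", $1 / $2}'   # -> 1.5
```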
If you run docker run without --memory or --cpus, the container has no resource limits. It can consume all available host memory and CPU. In production, always set explicit limits — an unbounded container can trigger the Linux OOM killer and take down other containers (or the host itself).
Union Filesystems: Layers and Copy-on-Write
The third primitive solves a practical problem: if you run 50 containers from the same image, you don't want 50 copies of the filesystem. Union filesystems (specifically OverlayFS with the overlay2 storage driver) let Docker stack filesystem layers on top of each other. Multiple read-only layers from the image sit below a single thin writable layer for the running container.
How Layers Stack
An image is built from a series of layers, each representing a Dockerfile instruction. OverlayFS merges these layers into a single coherent view using three directories:
| OverlayFS Directory | Role | Description |
|---|---|---|
| lowerdir | Read-only image layers | All image layers stacked together. Shared across containers from the same image. |
| upperdir | Writable container layer | All file writes, modifications, and deletions go here. Unique per container. |
| merged | Unified view | The combined filesystem the container actually sees — a union of lower + upper. |
| workdir | Internal scratch space | Used by OverlayFS for atomic operations like rename(). |
# Inspect the overlay mount for a running container
docker inspect constrained --format '{{.GraphDriver.Data.MergedDir}}'
# /var/lib/docker/overlay2/abc123.../merged
docker inspect constrained --format '{{.GraphDriver.Data.UpperDir}}'
# /var/lib/docker/overlay2/abc123.../diff
docker inspect constrained --format '{{.GraphDriver.Data.LowerDir}}'
# /var/lib/docker/overlay2/layer1/diff:/var/lib/docker/overlay2/layer2/diff:...
# Each ":" separates a layer — they stack bottom-to-top
Copy-on-Write in Practice
When a container reads a file, OverlayFS looks through the layers top-down and returns the first match. When a container writes to a file that exists in a lower (read-only) layer, OverlayFS performs a copy-up: it copies the entire file to the writable upperdir, then applies the modification there. The original file in the lower layer remains untouched.
# Start a container and modify a file from the base image
docker run -d --name cow-demo alpine sleep 3600
# Write to a file — this triggers copy-on-write
docker exec cow-demo sh -c "echo 'modified' > /etc/hostname"
# The change lives ONLY in the container's writable layer
docker diff cow-demo
# C /etc
# C /etc/hostname
This is why image layers are so space-efficient: 100 containers running the same nginx:alpine image share one copy of the base layers on disk. Each container only consumes additional space for the files it modifies.
Putting It All Together: A Container Is Just a Process
When you run docker run nginx, here's what actually happens at the kernel level:
1. Prepare the filesystem — Docker pulls image layers (if not cached), stacks them using OverlayFS, and creates a thin writable layer on top. The merged directory becomes the container's root filesystem.
2. Create namespaces — Docker calls clone() with namespace flags (CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | ...) to create a new process with its own isolated view of PIDs, networking, mounts, hostname, IPC, and user IDs.
3. Configure cgroups — Docker creates a new cgroup for the container and writes the resource limits you specified (memory, CPU, PIDs) to the corresponding cgroup files. The container process is placed into this cgroup.
4. Pivot root and exec — The process calls pivot_root() to switch its root filesystem to the OverlayFS merged directory, then exec()s the container's entrypoint (e.g., nginx -g 'daemon off;'). From this point on, it's a regular Linux process — just one that can only see and use what the kernel allows.
Run docker run -d alpine sleep 3600, then find it on the host with ps aux | grep "sleep 3600". You'll see it listed as a normal process. You can even strace it or inspect /proc/<pid>/ from the host. There is no hypervisor, no guest kernel — just a process with guardrails.
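You can push the hands-on one step further: util-linux's unshare assembles the same namespace isolation by hand, no Docker involved. A sketch, assuming a kernel that permits unprivileged user namespaces:

```shell
# Create new user, PID, and mount namespaces (--mount-proc implies a new
# mount namespace and remounts /proc inside it). The launched command
# becomes PID 1 of the new PID namespace, so ps lists only itself.
unshare --user --map-root-user --pid --fork --mount-proc ps aux
```

If this fails with a permissions error, your distribution has unprivileged user namespaces disabled; rerun it with sudo and drop --user --map-root-user.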
Installing Docker and Essential Configuration
Docker runs on Linux, macOS, and Windows — but the installation path and underlying architecture differ significantly across platforms. Getting the installation right matters because a misconfigured Docker setup leads to confusing permission errors, poor performance, or missing features down the line.
Linux: Installing Docker Engine
On Linux, you want Docker Engine installed from Docker's official apt or yum repositories — not the version bundled with your distro's default package manager. Distribution-packaged versions (like docker.io on Ubuntu) are often several major versions behind and miss critical features like BuildKit improvements and compose v2 integration.
Ubuntu / Debian
# Remove any old/distro-packaged versions
sudo apt-get remove docker docker-engine docker.io containerd runc
# Install prerequisites and add Docker's official GPG key
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add the official Docker repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine, CLI, and plugins
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io \
docker-buildx-plugin docker-compose-plugin
RHEL / Fedora
# Install the repo and Docker Engine
# (the URL below is the Fedora repo; on RHEL/CentOS use
# https://download.docker.com/linux/rhel/docker-ce.repo instead)
sudo dnf -y install dnf-plugins-core
sudo dnf config-manager --add-repo \
https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io \
docker-buildx-plugin docker-compose-plugin
Post-Install: User Group and Systemd
By default, the Docker daemon binds to a Unix socket owned by root. To run docker commands without sudo, add your user to the docker group. Then enable and start the systemd service so Docker launches on boot.
# Add your user to the docker group
sudo usermod -aG docker $USER
# Apply the new group membership (or log out and back in)
newgrp docker
# Enable and start the Docker service
sudo systemctl enable docker
sudo systemctl start docker
Adding a user to the docker group grants root-equivalent privileges on the host. Any user in this group can mount the host filesystem into a container and read or write any file. In shared or production environments, consider rootless Docker instead (covered below).
macOS: Docker Desktop
Docker doesn't run natively on macOS because macOS isn't Linux. Docker Desktop for Mac spins up a lightweight Linux virtual machine using Apple's Virtualization framework (older Intel-only releases used the HyperKit hypervisor). You interact with Docker through the CLI exactly as you would on Linux; the VM layer is transparent.
Install Docker Desktop by downloading the .dmg from docker.com or via Homebrew:
brew install --cask docker
After installation, open Docker Desktop and configure resource limits under Settings → Resources. The defaults are conservative — you'll want to adjust them based on your workload:
| Resource | Default | Recommended for Development |
|---|---|---|
| CPUs | Half of host cores | Half to three-quarters of host cores |
| Memory | 2 GB | 4–8 GB (increase if building large images) |
| Disk image size | 64 GB | 64–128 GB (images and volumes consume this) |
| Swap | 1 GB | 1–2 GB |
Windows: Docker Desktop with WSL 2
Docker Desktop on Windows offers two backend options: WSL 2 (Windows Subsystem for Linux 2) and Hyper-V. WSL 2 is the recommended backend and has been the default since Docker Desktop 3.x. Here's why it matters:
| Aspect | WSL 2 Backend | Hyper-V Backend |
|---|---|---|
| Architecture | Lightweight utility VM; one real Linux kernel shared by all WSL distros | Full VM with a dedicated kernel |
| Startup time | ~2 seconds | ~10–15 seconds |
| Memory usage | Dynamic — reclaimed when idle | Fixed allocation upfront |
| File I/O (Linux FS) | Native speed | Slower (9p/CIFS sharing) |
| Windows version | Home, Pro, Enterprise | Pro and Enterprise only |
WSL 2 runs a real Linux kernel managed by Windows, so Docker containers execute with near-native performance. The key advantage is dynamic memory management — WSL 2 grows and shrinks its memory footprint based on actual usage, while Hyper-V reserves the full allocation upfront.
# Enable WSL 2 (run in PowerShell as Administrator)
wsl --install
# Verify WSL 2 is the default version
wsl --set-default-version 2
# After installing Docker Desktop, verify from your WSL distro:
docker version
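WSL 2's dynamic memory can still balloon during large builds. A global cap lives in %UserProfile%\.wslconfig; the values below are examples to tune, and running wsl --shutdown applies changes:

```ini
# %UserProfile%\.wslconfig -- applies to all WSL 2 distros
[wsl2]
memory=8GB
processors=4
swap=2GB
```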
Essential Post-Install Configuration: daemon.json
The Docker daemon reads its configuration from /etc/docker/daemon.json on Linux (or via Docker Desktop settings on macOS/Windows). This file controls everything from logging behavior to network defaults and storage drivers. Creating a well-tuned daemon.json from the start prevents operational headaches later.
Here's a production-oriented configuration with the most commonly needed settings:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"dns": ["8.8.8.8", "8.8.4.4"],
"default-address-pools": [
{
"base": "172.20.0.0/16",
"size": 24
}
],
"storage-driver": "overlay2",
"features": {
"buildkit": true
},
"insecure-registries": ["myregistry.local:5000"],
"experimental": true
}
Let's break down what each setting does and why you'd want it:
Log Driver and Size Limits
By default, Docker uses the json-file log driver with no size limit. This means a chatty container can fill your disk with logs. Setting max-size to 10m and max-file to 3 caps each container at 30 MB of log storage total, rotating automatically. For centralized logging setups, you might switch the driver to fluentd, syslog, or awslogs instead.
DNS Configuration
Docker containers inherit DNS settings from the host by default. If your host's /etc/resolv.conf points to a local stub resolver like 127.0.0.53 (common on Ubuntu with systemd-resolved), containers can't reach it. Explicitly setting "dns" in daemon.json provides a reliable fallback — replace the Google DNS IPs with your organization's internal DNS servers if needed.
Default Address Pools
Every Docker network allocates a subnet. By default, Docker picks from the 172.17.0.0/16 range, which can collide with your corporate VPN or on-prem network ranges. Configuring default-address-pools lets you carve out a specific CIDR range that won't conflict with your existing infrastructure.
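The pool in the example config above is easy to reason about with shell arithmetic: the base is a /16 and each network gets a /24, leaving room for 2^(24-16) networks:

```shell
# How many /24 networks fit in a /16 base pool
echo $((2 ** (24 - 16)))   # -> 256
# Example subnets Docker could carve from 172.20.0.0/16
printf '172.20.%d.0/24\n' 0 1 2
```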
Storage Driver, Insecure Registries, and Experimental Features
overlay2 is the recommended storage driver on modern Linux kernels (4.0+) and is the default on most distributions. The insecure-registries array lets you pull from HTTP registries (no TLS) — useful for local development registries but never appropriate for production. Enabling experimental turns on in-progress daemon features ahead of their stable release; keep it off on production hosts unless a specific pre-release feature requires it.
After editing daemon.json, restart the daemon to apply changes:
sudo systemctl restart docker
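A typo in daemon.json will stop dockerd from starting at all, so it pays to syntax-check the file before restarting. A minimal sketch, assuming python3 is available (any JSON validator works; the scratch path is illustrative):

```shell
# Write a candidate config to a scratch path, then validate it before
# copying it into place. json.tool exits non-zero on a parse error.
cat > /tmp/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
if python3 -m json.tool /tmp/daemon.json > /dev/null; then
  echo "valid JSON: safe to install and restart the daemon"
else
  echo "parse error: fix the file before restarting" >&2
fi
```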
Verifying Your Installation
Three commands tell you everything you need to know about your Docker installation. Run them immediately after setup to confirm everything is working:
# Shows Client and Server (daemon) versions — confirms the daemon is running
docker version
# Detailed system info: storage driver, logging driver, kernel version,
# number of containers/images, cgroup driver, and more
docker info
# Shows disk usage breakdown: images, containers, volumes, build cache
docker system df
docker version outputs both the CLI (Client) and daemon (Server) versions. If you see "Cannot connect to the Docker daemon," the daemon isn't running — check systemctl status docker on Linux or ensure Docker Desktop is started on macOS/Windows. docker info is your best diagnostic tool: it reveals the active storage driver, log driver, whether experimental mode is on, and the cgroup version in use. docker system df shows exactly how much disk space images, containers, and volumes are consuming.
Rootless Docker
Standard Docker requires the daemon to run as root, which creates a wide attack surface — a container escape vulnerability means full root access to the host. Rootless mode runs the entire Docker daemon and containers under a regular user's UID, using user namespaces to map container root to an unprivileged user on the host.
# Install prerequisites (Ubuntu/Debian)
sudo apt-get install uidmap dbus-user-session
# Run the rootless setup script (as your regular user, NOT root)
dockerd-rootless-setuptool.sh install
# Set the Docker socket for your user session
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock
Rootless Docker has some trade-offs: you can't bind to privileged ports (below 1024) without extra configuration, overlay2 requires kernel 5.11+ in rootless mode, and some networking features (like --net=host) behave differently. For development workstations and CI runners, these limitations rarely matter, and the security improvement is significant.
Add the DOCKER_HOST export to your ~/.bashrc or ~/.zshrc so every new shell session automatically connects to the rootless daemon. You can run rootless and rootful Docker side by side — they use separate daemons and storage directories.
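A small idempotent-append sketch for persisting that variable (shown against a scratch file rather than your real ~/.bashrc; the grep guard prevents duplicate lines on repeated runs):

```shell
LINE='export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock'
RC=/tmp/demo_bashrc      # stand-in for ~/.bashrc in this sketch
: > "$RC"
# Append only if the exact line is not already present
grep -qxF "$LINE" "$RC" || echo "$LINE" >> "$RC"
grep -qxF "$LINE" "$RC" || echo "$LINE" >> "$RC"   # second run is a no-op
echo "occurrences: $(grep -cxF "$LINE" "$RC")"     # occurrences: 1
```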
Docker Images: Layers, Caching, and Content-Addressable Storage
A Docker image is not a single binary blob. It is an ordered collection of read-only filesystem layers plus a configuration JSON document that records metadata — the command to run, environment variables, exposed ports, and the ordered list of layer digests. Every Dockerfile instruction that modifies the filesystem (RUN, COPY, ADD) creates a new layer. Instructions that only set metadata (ENV, EXPOSE, CMD) modify the config JSON without adding a filesystem layer.
When you run a container, Docker stacks all read-only image layers using a union filesystem (overlay2 on modern Linux) and places a thin writable container layer on top. Any file changes the container makes — new files, modifications, deletions — happen in this writable layer, leaving the image layers untouched.
graph TD
subgraph runtime ["Runtime (container)"]
W["✏️ Container Writable Layer<br/>Temporary — lost when container is removed"]
end
subgraph image ["Image (read-only layers)"]
L4["Layer 4 — RUN make build"]
L3["Layer 3 — COPY . /app"]
L2["Layer 2 — RUN apt-get install gcc"]
L1["Layer 1 — debian:bookworm-slim base"]
end
W --> L4
L4 --> L3
L3 --> L2
L2 --> L1
style W fill:#fff3cd,stroke:#ffaa00,color:#333
style L4 fill:#d1ecf1,stroke:#0c5460,color:#333
style L3 fill:#d1ecf1,stroke:#0c5460,color:#333
style L2 fill:#d1ecf1,stroke:#0c5460,color:#333
style L1 fill:#cce5ff,stroke:#004085,color:#333
Anatomy of an Image
Use docker image inspect to see the internal structure of any image. The output reveals the layer digests, the configuration (entrypoint, cmd, env, labels), and the image's own content-addressable ID.
# View the full image config and layer list
docker image inspect nginx:1.25 --format '{{json .RootFS}}' | jq .
# Output shows something like:
# {
# "Type": "layers",
# "Layers": [
# "sha256:9853575bc4f9...",
# "sha256:a691e3b3eb56...",
# "sha256:d404e58e1a2f...",
# ...
# ]
# }
Each entry in the Layers array is a diff ID — the SHA256 hash of the uncompressed layer content. The image ID itself (e.g., sha256:a8758716bb6a...) is the hash of the config JSON. This means two images with identical configs and layers will always have the same ID, regardless of when or where they were built.
To see a human-readable history of which Dockerfile instructions created which layers, use docker image history:
docker image history nginx:1.25 --no-trunc
# IMAGE CREATED BY SIZE
# a8758716bb6a CMD ["nginx" "-g" "daemon off;"] 0B
# <missing> EXPOSE map[80/tcp:{}] 0B
# <missing> COPY 30-tune-worker... (buildkit.dockerfile.v0) 4.62kB
# <missing> RUN /bin/sh -c set -x && apt-get update ... 59.1MB
# <missing> ENV NGINX_VERSION=1.25.3 0B
Notice how CMD, EXPOSE, and ENV show 0B — they change only the config JSON, not the filesystem. The RUN and COPY instructions produce actual filesystem layers with measurable size.
Content-Addressable Storage
Docker's storage model is content-addressable: every layer is identified by the SHA256 hash of its content. Two layers with identical bytes produce the same digest, period. This has profound implications for efficiency.
If you build ten images that all start with FROM debian:bookworm-slim, the base layer is stored on disk exactly once. When you push these images to a registry, the registry already has that layer — so it is transferred zero times after the first push. The same deduplication applies when pulling: Docker checks which layers you already have locally and only downloads the missing ones.
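You can demonstrate the principle with nothing but sha256sum: identical bytes always produce identical digests, so identical layers collapse to a single stored copy. A toy illustration:

```shell
printf 'identical layer content' > /tmp/layer_a   # same bytes...
printf 'identical layer content' > /tmp/layer_b   # ...same digest
printf 'different content'       > /tmp/layer_c
A=$(sha256sum /tmp/layer_a | cut -d' ' -f1)
B=$(sha256sum /tmp/layer_b | cut -d' ' -f1)
C=$(sha256sum /tmp/layer_c | cut -d' ' -f1)
[ "$A" = "$B" ]  && echo "a == b: stored once, transferred once"
[ "$A" != "$C" ] && echo "a != c: stored separately"
```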
On disk, layers are stored uncompressed (as directories in the overlay2 driver). In registries and during docker pull/push, layers are gzip-compressed. The diff ID is the hash of the uncompressed content; the distribution digest is the hash of the compressed blob. Docker maps between the two in its local metadata.
The Layer Cache Mechanism
When you run docker build, Docker evaluates each instruction top-to-bottom and checks whether a cached layer already exists for it. If the instruction and all its inputs are identical to a previous build, Docker reuses the cached layer and skips execution. The moment it encounters a changed instruction, the cache is busted for that instruction and everything below it.
This is why instruction ordering in your Dockerfile matters enormously. Consider this common anti-pattern:
# ❌ BAD — copying all source first busts cache on every code change
FROM node:20-slim
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
Every time you change any source file, the COPY . . layer changes, which invalidates the cache for npm install — even though your dependencies haven't changed. The fix is to separate dependency installation from source code:
# ✅ GOOD — dependency layer is cached unless package.json changes
FROM node:20-slim
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
Now the npm ci layer is only rebuilt when package.json or package-lock.json changes. Source code changes only invalidate the final COPY and build layers, saving minutes on every build.
Order Dockerfile instructions from least frequently changing (base image, system packages) to most frequently changing (source code, build commands). This maximizes cache reuse across builds.
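A toy model of the cache key makes the ordering rule concrete. Treat each layer's key as a hash of the instruction text plus the content of its input files; a source edit then leaves the dependency layer's key untouched. This is a sketch, not Docker's exact algorithm (the real key also chains in the parent layer):

```shell
mkdir -p /tmp/build_ctx && cd /tmp/build_ctx
echo '{"dependencies":{"express":"^4"}}' > package.json
echo 'console.log("v1")' > server.js

# key = sha256(instruction text + input file content)
key() { { printf '%s' "$1"; cat "$2"; } | sha256sum | cut -d' ' -f1; }

BEFORE=$(key 'COPY package.json ./' package.json)
echo 'console.log("v2")' > server.js     # edit source code only
AFTER=$(key 'COPY package.json ./' package.json)

[ "$BEFORE" = "$AFTER" ] && echo "dependency layer: cache hit"
```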
Image Tags vs. Digests
An image tag like nginx:1.25 or node:20-slim is a human-readable, mutable pointer. The image maintainer can push a completely different image under the same tag at any time. The notorious :latest tag simply means "whatever was pushed most recently without an explicit tag" — it carries no guarantee of stability.
An image digest is immutable. It is the SHA256 hash of the image's manifest (the JSON document that lists the layer digests and config). A digest reference looks like this:
# Pull by digest — guaranteed to be the exact same image every time
docker pull nginx@sha256:6db391d1c0cfb30588ba0bf72ea999404f2764fabf023d4f0c7063b68260bf22
# Find the digest of a local image
docker image inspect nginx:1.25 --format '{{index .RepoDigests 0}}'
Manifest Lists (Multi-Architecture Images)
Modern images often support multiple CPU architectures. When you docker pull nginx, Docker doesn't download a single image — it fetches a manifest list (also called an OCI image index), which is a JSON document pointing to platform-specific manifests. Docker selects the correct one for your architecture automatically.
# Inspect the manifest list to see supported platforms
docker manifest inspect nginx:1.25 | jq '.manifests[] | {platform, digest}'
# Output:
# { "platform": { "architecture": "amd64", "os": "linux" }, "digest": "sha256:6db3..." }
# { "platform": { "architecture": "arm64", "os": "linux" }, "digest": "sha256:b19c..." }
# { "platform": { "architecture": "arm", "os": "linux" }, "digest": "sha256:e02f..." }
This is why the same docker pull command works on an Intel Mac, an M-series Mac, an AWS Graviton instance, and a Raspberry Pi — the manifest list resolves to a different platform-specific image in each case.
Practical Commands for Image Management
# List all local images (with size)
docker image ls
# Show only dangling images (untagged, leftover from rebuilds)
docker image ls --filter "dangling=true"
# See what's actually consuming disk — images, containers, volumes, build cache
docker system df -v
# Remove dangling images
docker image prune
# Remove ALL unused images (not just dangling) — frees significant space
docker image prune -a
# Nuclear option: reclaim everything (stopped containers, unused networks, etc.)
docker system prune -a --volumes
The "RECLAIMABLE" column in docker system df only counts images not referenced by any container. If you have stopped containers, those images still count as "in use." Run docker container prune first if you want an accurate picture of what can be freed.
Base Image Selection
The base image you choose in your FROM instruction sets the floor for your image size, security attack surface, and debugging experience. There is no universally "best" choice — only trade-offs.
| Base Image | Compressed Size | Package Manager | Shell & Debugging Tools | Best For |
|---|---|---|---|---|
| scratch | 0 MB | None | None — completely empty | Statically compiled Go/Rust binaries |
| alpine:3.19 | ~3.5 MB | apk | BusyBox shell, minimal utils | Small general-purpose images |
| debian:bookworm-slim | ~28 MB | apt | Bash, coreutils | Apps needing glibc or Debian packages |
| gcr.io/distroless/static | ~2 MB | None | None — no shell at all | Production containers (minimal CVE surface) |
| ubuntu:22.04 | ~29 MB | apt | Bash, common GNU tools | Familiar dev environment, wide package support |
Key trade-offs to consider
Alpine uses musl libc instead of glibc. Most software works fine, but some C libraries (especially those with DNS resolution edge cases or precompiled native modules like Python wheels) can behave differently. If you hit mysterious segfaults or DNS issues, try switching to a Debian-based image before debugging further.
Distroless images have no package manager and no shell. This is a security advantage — an attacker who gains code execution inside the container cannot easily install tools or explore the filesystem. The downside: you can't docker exec -it <container> /bin/sh to debug. Google provides :debug variants of distroless images that include a BusyBox shell for troubleshooting.
scratch is the ultimate minimal base: it is literally an empty filesystem. Your binary must be fully statically linked and carry everything it needs (including CA certificates if it makes HTTPS calls). It is ideal for single-binary Go or Rust applications compiled with CGO_ENABLED=0.
Writing Production-Grade Dockerfiles
A Dockerfile is a recipe, but most recipes online produce bloated, insecure, and slow-to-build images. This section walks through every Dockerfile instruction, shows correct usage versus common pitfalls, and ends with a complete before/after refactoring of a real-world Dockerfile.
FROM — Choosing Your Base Image
Every Dockerfile starts with FROM. It sets the base image that all subsequent instructions build upon. You can also alias a stage with AS for multi-stage builds.
# Always pin to a specific version — never use :latest in production
FROM node:20.11-alpine3.19 AS builder
# Scratch is a special empty image — ideal for statically compiled binaries
FROM scratch
Prefer -alpine or -slim variants for smaller attack surface and image size. A full node:20 image is ~1 GB; node:20-alpine is ~130 MB. Pin to a specific tag (including the OS patch version) so your builds are reproducible across time.
RUN — Executing Build Commands
RUN executes commands during the image build and creates a new layer for each instruction. The single most impactful optimization you can make is combining related commands into one RUN statement and cleaning up in the same layer.
# Each RUN creates a layer — apt cache is baked into layer 1 forever
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*
The final rm creates a new layer but doesn't shrink layer 1. The apt cache stays in the image.
# Single layer — install, clean up, done
RUN apt-get update \
&& apt-get install -y --no-install-recommends curl \
&& rm -rf /var/lib/apt/lists/*
One layer. The apt cache is created and deleted within the same layer, so it never appears in the final image.
COPY vs ADD
COPY and ADD both copy files from the build context into the image, but they differ in important ways. ADD has two extra behaviors: it auto-extracts compressed archives (.tar.gz, .xz, etc.) and can fetch files from remote URLs.
| Feature | COPY | ADD |
|---|---|---|
| Copy local files | ✅ | ✅ |
| Auto-extract tar archives | ❌ | ✅ |
| Fetch remote URLs | ❌ | ✅ (but no caching control) |
| Predictable behavior | ✅ | ❌ — may surprise you |
# Use COPY for everything — it's explicit and predictable
COPY package.json package-lock.json ./
COPY src/ ./src/
# Only use ADD when you specifically need tar extraction
ADD rootfs.tar.gz /
Use COPY by default. Reach for ADD only when you need automatic tar extraction. For remote files, use RUN curl or RUN wget instead — it gives you control over caching, retries, and cleanup.
WORKDIR — Setting the Working Directory
WORKDIR sets the working directory for all subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions. If the directory doesn't exist, Docker creates it automatically.
# Good — uses WORKDIR
WORKDIR /app
COPY . .
# Bad — uses RUN cd (the cd has no effect on the next instruction)
RUN cd /app
COPY . .
Never use RUN cd /somewhere — each RUN starts a new shell, so the directory change is lost. Always use WORKDIR.
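You can reproduce the underlying behavior outside Docker: each RUN is a fresh shell process, exactly like invoking sh -c twice in a row, so a cd in one invocation never reaches the next:

```shell
cd /
sh -c 'cd /tmp'                                   # analogous to: RUN cd /tmp
echo "next 'instruction' runs in: $(sh -c pwd)"   # still /, the cd was lost
```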
ENV vs ARG — Build-Time and Runtime Variables
ARG defines a variable available only during the build. ENV defines a variable that persists into the running container. This distinction matters for security — never put secrets in ENV since they're baked into every layer and visible to anyone who inspects the image.
# ARG — only available during build, not in the final image
ARG NODE_VERSION=20
FROM node:${NODE_VERSION}-alpine
# ARG must be re-declared after FROM (each FROM starts a new build stage)
ARG APP_VERSION
RUN echo "Building version ${APP_VERSION}"
# ENV — baked into the image, available at runtime
ENV NODE_ENV=production
ENV PORT=3000
| Aspect | ARG | ENV |
|---|---|---|
| Available during build | ✅ | ✅ |
| Available at runtime | ❌ | ✅ |
| Overridable at build | --build-arg | Not directly |
| Overridable at run | N/A | -e or --env |
| Persists in image layers | ❌ (but cached in build history) | ✅ |
ARG values are visible in docker history. ENV values are visible via docker inspect. For build-time secrets, use docker build --secret or BuildKit secret mounts (RUN --mount=type=secret).
EXPOSE — Documenting Ports
EXPOSE does not actually publish a port. It's documentation — a signal to the person running the container about which ports the application listens on. You still need -p at runtime.
EXPOSE 3000
EXPOSE 3000/udp
# At runtime, you still need -p to actually publish:
# docker run -p 8080:3000 myapp
ENTRYPOINT vs CMD — The PID 1 Problem
This is where most Dockerfiles go wrong. ENTRYPOINT sets the main executable. CMD provides default arguments to that executable (or acts as the full command if no ENTRYPOINT is set). But the critical detail is the form you use.
Exec Form vs Shell Form
| Form | Syntax | Runs as PID 1? | Shell variable expansion? |
|---|---|---|---|
| Exec form | ["node", "server.js"] | ✅ Yes — directly | ❌ No |
| Shell form | node server.js | ❌ No — wraps in /bin/sh -c | ✅ Yes |
# ✅ Exec form — node IS PID 1, receives SIGTERM directly
ENTRYPOINT ["node", "server.js"]
# ❌ Shell form — /bin/sh is PID 1, node is a child process
# SIGTERM goes to sh, which does NOT forward it to node
ENTRYPOINT node server.js
When you use shell form, /bin/sh becomes PID 1 inside the container. The shell does not forward signals like SIGTERM to child processes. This means docker stop sends SIGTERM, your app never receives it, Docker waits 10 seconds, then sends SIGKILL. Your application never gets a chance to shut down gracefully — open connections are dropped, transactions are lost.
Combining ENTRYPOINT and CMD
# ENTRYPOINT = the executable, CMD = default arguments
ENTRYPOINT ["python", "manage.py"]
CMD ["runserver", "0.0.0.0:8000"]
# docker run myapp → python manage.py runserver 0.0.0.0:8000
# docker run myapp migrate → python manage.py migrate
# docker run myapp shell → python manage.py shell
This pattern makes the container behave like a binary — you pass subcommands at runtime, and CMD provides a sensible default.
If you need environment variable expansion but also want exec form, use an entrypoint script: ENTRYPOINT ["/docker-entrypoint.sh"]. Inside that script, use exec "$@" at the end to replace the shell with your application as PID 1.
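A minimal sketch of such an entrypoint script (the PORT default is illustrative, not a required convention):

```shell
#!/bin/sh
# docker-entrypoint.sh: do setup that needs shell features, then hand off.
set -e
: "${PORT:=3000}"                # expand/default env vars here (illustrative)
echo "starting on port $PORT"
# exec replaces this shell with the real command, so the application becomes
# PID 1 and receives SIGTERM directly from docker stop.
exec "$@"
```

Paired with ENTRYPOINT ["/docker-entrypoint.sh"] and an exec-form CMD, docker stop again delivers SIGTERM straight to your process.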
HEALTHCHECK — Container Health Monitoring
HEALTHCHECK tells Docker how to test whether your container is still working. Without it, Docker only knows if the process is running — not whether it's actually healthy and serving traffic. Orchestrators like Docker Swarm and Kubernetes use health status for automated restarts and rolling deployments.
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
# Disable a parent image's healthcheck
HEALTHCHECK NONE
The health check command runs inside the container. Exit code 0 means healthy, 1 means unhealthy. Keep the check lightweight — a heavy health check can itself degrade performance.
USER — Don't Run as Root
By default, containers run as root. If an attacker exploits a vulnerability in your application, they have root access inside the container — and potentially to the host via container escape exploits. The USER instruction switches to a non-root user.
# Create a non-root user and switch to it
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser
# Set ownership before switching user
COPY --chown=appuser:appuser . /app
USER appuser
# Alpine uses addgroup/adduser instead
RUN addgroup -S appuser && adduser -S -G appuser appuser
Place the USER instruction as late as possible — you typically need root for installing packages and setting up the filesystem. Switch to the non-root user just before ENTRYPOINT/CMD.
LABEL — Image Metadata
LABEL adds key-value metadata to your image. This is useful for image cataloging, CI traceability, and compliance. Use the OCI standard annotation keys for interoperability.
LABEL org.opencontainers.image.title="my-api" \
org.opencontainers.image.version="1.4.2" \
org.opencontainers.image.source="https://github.com/acme/my-api" \
org.opencontainers.image.authors="team@acme.com"
STOPSIGNAL — Custom Stop Signals
By default, docker stop sends SIGTERM. Some applications (like Nginx) prefer a different signal. STOPSIGNAL lets you change it.
# Nginx uses SIGQUIT for graceful shutdown
STOPSIGNAL SIGQUIT
SHELL — Changing the Default Shell
The SHELL instruction overrides the default shell used for shell-form commands. On Linux the default is ["/bin/sh", "-c"]; on Windows it's ["cmd", "/S", "/C"]. You might change this to use bash for better scripting features.
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
# Now RUN uses bash with strict error handling
# -e: exit on error, -u: error on undefined vars, -o pipefail: catch pipe errors
RUN echo "hello" | grep "world" # This would fail properly now
Using pipefail is critical. Without it, a pipeline like curl ... | tar xz succeeds even if curl fails — the exit code of the last command (tar) is what counts.
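The difference is easy to observe without Docker (assuming bash is installed, since plain sh lacks pipefail):

```shell
# Default sh: the pipeline's exit code is the LAST command's, so the failure vanishes
sh -c 'false | cat; echo "without pipefail: exit $?"'
# bash -o pipefail: the first failing command's status wins
bash -o pipefail -c 'false | cat; echo "with pipefail: exit $?"'
```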
ONBUILD — Deferred Instructions
ONBUILD registers an instruction that fires when another image uses this image as its FROM base. It's used in base images to enforce patterns on downstream images.
# In a base image (e.g., company-node-base)
FROM node:20-alpine
ONBUILD COPY package*.json ./
ONBUILD RUN npm ci --production
ONBUILD COPY . .
# Any team's Dockerfile that does "FROM company-node-base"
# automatically gets those three instructions injected after their FROM
Use ONBUILD sparingly. It creates "magic" behavior that surprises people who don't read the base image's Dockerfile. Prefer explicit multi-stage builds for most use cases.
.dockerignore — Controlling the Build Context
Before Docker builds anything, it sends the entire build context (the directory you pass to docker build) to the daemon. Without a .dockerignore file, you're sending node_modules, .git, local configs, and secrets over to the build — wasting time and risking leaking sensitive files into the image.
# .dockerignore
.git
.gitignore
node_modules
npm-debug.log
Dockerfile
docker-compose*.yml
.env
.env.*
*.md
.vscode
.idea
coverage
dist
.DS_Store
The syntax mirrors .gitignore. The biggest wins come from excluding .git (which can be hundreds of MB) and node_modules (since you'll npm ci inside the image anyway).
Complete Before/After: Naive → Production-Grade
Let's take a typical Node.js API Dockerfile and apply every principle from this section. This is the kind of refactoring that takes an image from 1.2 GB and 45-second builds down to 150 MB and 5-second rebuilds.
❌ The Naive Dockerfile
FROM node:latest
COPY . /app
WORKDIR /app
RUN npm install
EXPOSE 3000
CMD node server.js
What's wrong with this? Almost everything:
- node:latest — unpinned, non-reproducible, full Debian image (~1 GB)
- COPY . /app before dependency install — every code change invalidates the npm cache layer
- npm install — installs devDependencies, uses mutable package.json resolution
- Shell form CMD — node is not PID 1, won't receive SIGTERM
- Runs as root — security risk
- No .dockerignore — sends node_modules, .git, and everything else to the build
- No health check — orchestrator can't monitor health
- Single stage — build tools ship in production
✅ The Production-Grade Dockerfile
# syntax=docker/dockerfile:1
# ── Stage 1: Install dependencies ─────────────────────────────
FROM node:20.11-alpine3.19 AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
# ── Stage 2: Build (if you have a compile step) ───────────────
FROM node:20.11-alpine3.19 AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY src/ ./src/
COPY tsconfig.json ./
RUN npm run build
# ── Stage 3: Production image ─────────────────────────
FROM node:20.11-alpine3.19 AS production
LABEL org.opencontainers.image.title="my-api" \
org.opencontainers.image.version="1.4.2"
# Use tini as PID 1 for proper signal handling
RUN apk add --no-cache tini
ENV NODE_ENV=production
WORKDIR /app
# Copy only production deps from stage 1
COPY --from=deps /app/node_modules ./node_modules
# Copy only compiled output from stage 2
COPY --from=build /app/dist ./dist
COPY package.json ./
# Non-root user
RUN addgroup -S appuser && adduser -S -G appuser appuser
RUN chown -R appuser:appuser /app
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD ["wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/server.js"]
Here's what changed and why:
| Change | Why It Matters |
|---|---|
| Pinned Alpine base image | ~130 MB vs ~1 GB. Reproducible builds. Smaller attack surface. |
| Multi-stage build (3 stages) | Build tools, devDependencies, and source code never reach the production image. |
| COPY package*.json first | Dependencies are cached until package.json or package-lock.json changes. Code changes don't trigger reinstall. |
| npm ci instead of npm install | Deterministic installs from lockfile. Faster. Fails if lockfile is out of sync. |
| tini as PID 1 | Forwards signals properly, reaps zombie processes. Essential for production containers. |
| Exec form CMD | No shell wrapping. Node receives signals directly through tini. |
| Non-root user | Principle of least privilege. Limits blast radius of container escape exploits. |
| HEALTHCHECK | Orchestrators can detect and replace unhealthy containers automatically. |
| LABEL with OCI keys | Image is traceable back to source repo and version. Helps with auditing. |
docker build --target for development: in the multi-stage Dockerfile above, run docker build --target build -t myapp:dev . to stop at the build stage. This gives you an image with devDependencies and source code — perfect for development and testing, while the same Dockerfile produces a lean production image by default.
Multi-Stage Builds, BuildKit, and Build Optimization
A single-stage Dockerfile drags every compiler, header file, and dev dependency into your final image. Multi-stage builds let you split the build process across multiple FROM instructions — each one starts a fresh filesystem — and then cherry-pick only the artifacts you need into a slim runtime image. The result is dramatically smaller, more secure images with a minimal attack surface.
flowchart LR
subgraph stage1["Stage 1: Builder (~800 MB)"]
A["Source Code"] --> B["Install Deps & Compile"]
B --> C["Binary / Bundle"]
end
subgraph stage2["Stage 2: Runtime (~15 MB)"]
D["Minimal Base Image"] --> E["Production Artifact"]
end
C -- "COPY --from=builder" --> E
stage1 -. "discarded after build" .-> F(("🗑️"))
style stage1 fill:#2d2d2d,stroke:#f97316,color:#f5f5f5
style stage2 fill:#2d2d2d,stroke:#22c55e,color:#f5f5f5
style F fill:#2d2d2d,stroke:#ef4444,color:#ef4444
The builder stage contains all the heavy tooling — Go compiler, Node.js with node_modules, Python build wheels — but none of it ships. Only the final FROM stage contributes to the image you push to a registry.
Multi-Stage Build Patterns by Language
The pattern is the same across languages: compile or bundle in one stage, copy the output into a minimal base. The specifics vary depending on how each runtime handles dependencies.
Go produces a statically-linked binary, making it ideal for scratch or distroless final images with virtually zero overhead.
# Stage 1: Build
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app ./cmd/server
# Stage 2: Runtime
FROM scratch
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=builder /app /app
ENTRYPOINT ["/app"]
Setting CGO_ENABLED=0 ensures a fully static binary. The -ldflags="-s -w" flags strip debug symbols and DWARF info, shrinking the binary further. The final image is often under 15 MB.
Node.js apps need a runtime, so you can't use scratch. Instead, separate the npm ci (with devDependencies) from the production node_modules.
# Stage 1: Install & Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
# Stage 2: Production deps only
FROM node:20-alpine AS production
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev && npm cache clean --force
COPY --from=builder /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]
The builder stage installs all dependencies (including TypeScript, Webpack, etc.) and builds. The production stage runs npm ci --omit=dev to get only runtime dependencies, then copies the compiled output.
Python benefits from building wheels in a builder stage and installing them into a clean runtime image without compilers or build headers.
# Stage 1: Build wheels
FROM python:3.12-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
# Stage 2: Install pre-built wheels
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends libpq5 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY . .
USER nobody
CMD ["python", "main.py"]
The builder stage has gcc and dev headers to compile C extensions (like psycopg2). The runtime stage only needs the shared library (libpq5) and the pre-built wheels — no compiler necessary.
BuildKit: The Modern Build Engine
BuildKit is Docker's next-generation build backend, enabled by default since Docker Engine 23.0. It replaces the legacy builder with parallel stage execution, better caching, and features like build secrets and SSH forwarding. If you're on an older Docker version, enable it with DOCKER_BUILDKIT=1.
BuildKit analyzes your Dockerfile as a DAG (directed acyclic graph). Independent stages run in parallel automatically. A three-stage build where two stages don't depend on each other will build both concurrently, significantly reducing total build time.
Build Secrets
Never put credentials in ENV or ARG — they persist in image layers and history. BuildKit's --mount=type=secret mounts a file into the build container that is never committed to a layer.
# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
# Secret is mounted at /run/secrets/npmrc — never baked into a layer
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
npm ci
COPY . .
CMD ["node", "server.js"]
Build it by passing the secret file at build time:
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .
SSH Forwarding
If your build needs to clone private Git repos, forward your host SSH agent instead of copying keys into the image:
# syntax=docker/dockerfile:1
FROM alpine/git AS source
RUN --mount=type=ssh git clone git@github.com:org/private-repo.git /src
FROM golang:1.22-alpine AS builder
COPY --from=source /src /src
WORKDIR /src
RUN go build -o /app .
docker build --ssh default -t myapp .
Cache Mounts
Cache mounts persist directories across builds, avoiding redundant downloads. This is one of the highest-impact BuildKit optimizations — especially for package managers that maintain a local cache.
# Go module cache
RUN --mount=type=cache,target=/go/pkg/mod \
go mod download
# apt cache — survives across builds
RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt \
apt-get update && apt-get install -y gcc
# pip cache
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
# npm cache
RUN --mount=type=cache,target=/root/.npm \
npm ci
Heredocs in Dockerfiles
BuildKit supports heredoc syntax, letting you inline multi-line scripts and files without long chains of && \ continuations:
# syntax=docker/dockerfile:1
FROM nginx:alpine
# Inline a config file without COPY
COPY <<EOF /etc/nginx/conf.d/default.conf
server {
listen 80;
location / {
proxy_pass http://backend:3000;
}
}
EOF
# Multi-line RUN without awkward backslash chains
RUN <<EOF
apk add --no-cache curl jq
curl -sSL https://example.com/setup.sh | sh
rm -rf /tmp/*
EOF
Cross-Platform Builds with docker buildx
docker buildx extends the build command to support multi-architecture images from a single machine. It uses QEMU emulation or remote builder nodes to compile for architectures your host doesn't natively support — critical for shipping ARM images from x86 CI runners (or vice versa).
# Create a new builder with multi-arch support
docker buildx create --name multiarch --driver docker-container --use
# Build for amd64 and arm64, push a manifest list to registry
docker buildx build \
--platform linux/amd64,linux/arm64 \
--tag registry.example.com/myapp:1.0 \
--push .
When a user pulls this image, Docker automatically selects the correct architecture variant from the manifest list. This is how official images like nginx and postgres serve both Intel and Apple Silicon machines from a single tag.
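You can see those per-architecture entries yourself with buildx's imagetools subcommand. A quick sketch — the first image reference is the hypothetical one built above:

```shell
# Show every platform variant behind a single tag
docker buildx imagetools inspect registry.example.com/myapp:1.0

# Works for any public image too — nginx lists amd64, arm64, arm/v7, and more
docker buildx imagetools inspect nginx:latest

# A platform string is "<os>/<arch>[/<variant>]"; split it in shell if needed
platform="linux/arm64"
echo "os=${platform%%/*} arch=${platform#*/}"   # prints "os=linux arch=arm64"
```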
QEMU emulation is convenient but slow — a Go cross-compile via GOARCH=arm64 inside an amd64 builder is significantly faster than emulating an entire arm64 build. Use --platform only for the final stage and cross-compile natively when your toolchain supports it.
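A sketch of that native cross-compile pattern for Go (module and binary names are hypothetical). The builder stage is pinned to the build host's own architecture via `$BUILDPLATFORM`, so the Go toolchain runs natively and only the compile target changes; just the small final stage is assembled per target platform:

```dockerfile
# syntax=docker/dockerfile:1
# Builder always runs natively on the build host's architecture
FROM --platform=$BUILDPLATFORM golang:1.22-alpine AS builder
# TARGETOS / TARGETARCH are populated automatically by buildx per --platform
ARG TARGETOS TARGETARCH
WORKDIR /src
COPY . .
# Cross-compile natively — no QEMU involved
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app .

# Final stage is emitted once per target platform
FROM alpine:3.19
COPY --from=builder /out/app /usr/local/bin/app
ENTRYPOINT ["app"]
```

The same `docker buildx build --platform linux/amd64,linux/arm64 --push .` invocation works unchanged, but the expensive compile step no longer runs under emulation.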
Build Cache Strategies
Docker layer caching works well locally, but CI environments start with a cold cache on every run. BuildKit solves this with cache export/import backends, each with different trade-offs; the three most common are compared below.
| Strategy | How It Works | Best For | Trade-off |
|---|---|---|---|
| Inline | Embeds cache metadata into the image itself | Simple setups, single-platform | Only caches the final stage; increases image size slightly |
| Registry | Pushes cache to a separate registry reference | CI pipelines, team sharing | Caches all stages; requires registry write access |
| Local | Exports cache to a local directory | Self-hosted runners with persistent storage | Fastest import; not shareable across machines |
Inline Cache
The simplest option — embed cache metadata directly into the pushed image:
# Export cache inline with the image
docker buildx build \
--cache-to type=inline \
--tag registry.example.com/myapp:latest \
--push .
# Import cache from the previously pushed image
docker buildx build \
--cache-from type=registry,ref=registry.example.com/myapp:latest \
--tag registry.example.com/myapp:latest \
--push .
Registry Cache
For multi-stage builds, registry cache is superior because it caches every stage, not just the final one:
docker buildx build \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
--tag registry.example.com/myapp:latest \
--push .
The mode=max flag tells BuildKit to cache all layers from all stages, not just the layers used in the final image. The cache is stored at a separate tag (:buildcache) so it doesn't pollute your image tags.
Local Cache
If your CI runner has persistent storage between runs, local directory cache avoids any registry round-trips:
docker buildx build \
--cache-from type=local,src=/tmp/buildcache \
--cache-to type=local,dest=/tmp/buildcache,mode=max \
--tag myapp:latest \
--load .
GitHub Actions ephemeral runners lose local caches between jobs. Use actions/cache to persist the /tmp/buildcache directory, or prefer registry cache which survives regardless of runner lifecycle. For GitHub Actions specifically, use --cache-to type=gha and --cache-from type=gha which integrate natively with the Actions cache API.
Docker Registry: Hub, Private Registries, and Image Distribution
A Docker registry is the distribution layer that sits between image producers (build pipelines, developers) and image consumers (container runtimes, orchestrators). It stores image manifests and the individual layers that compose them. Every docker push and docker pull talks to a registry over the OCI Distribution Spec API — even when you think you're "just pulling from Docker."
Docker Hub — The Default Registry
When you run docker pull nginx, the Docker daemon contacts registry-1.docker.io behind the scenes. Docker Hub is the default registry baked into the Docker client. It hosts official images (curated by Docker and upstream maintainers) and community images (published by anyone with an account).
Official images use a single-name namespace like nginx, postgres, or node. Community and organization images use a two-part namespace: mycompany/api-server. Understanding this naming scheme is the foundation for every tagging strategy.
Image Naming Anatomy
A fully-qualified image reference has four parts. Most of the time you omit the registry and it defaults to Docker Hub, but for private registries every part matters.
[registry/][namespace/]repository[:tag|@digest]
# Examples:
nginx # Docker Hub official, tag "latest" implied
mycompany/api-server:v2.3.1 # Docker Hub, org namespace, semver tag
ghcr.io/myorg/worker:abc123f # GitHub Container Registry, git SHA tag
us-east1-docker.pkg.dev/proj/repo/app:main # Google Artifact Registry
Image Tagging Strategies
Tags are mutable pointers — the same tag can be reassigned to a completely different image digest at any time. This is what makes tagging strategy so important: the tag you choose determines whether your deployments are reproducible or a game of roulette.
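When you need immutability regardless of tag movement, pin by digest. A sketch (image name and digest value are hypothetical) of resolving the digest a tag currently points at:

```shell
# After a pull, RepoDigests records the immutable digest the tag resolved to
docker pull mycompany/api-server:v2.3.1
docker inspect --format '{{index .RepoDigests 0}}' mycompany/api-server:v2.3.1
# prints something like: mycompany/api-server@sha256:9f8e... (value hypothetical)

# Split a digest reference with shell parameter expansion
ref="mycompany/api-server@sha256:9f8e"
echo "repo=${ref%%@*} digest=${ref##*@}"
# Pulling by digest always fetches exactly those bytes:
# docker pull mycompany/api-server@sha256:9f8e...
```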
Semantic Versioning (semver)
The most widely-adopted strategy. You publish multiple tags per release so consumers can pin at their preferred specificity. A single build produces the image once but tags it three ways:
# Tag the same image at three levels of specificity
docker tag api-server:build ghcr.io/myorg/api-server:3.2.1
docker tag api-server:build ghcr.io/myorg/api-server:3.2
docker tag api-server:build ghcr.io/myorg/api-server:3
# Push all three
docker push ghcr.io/myorg/api-server:3.2.1
docker push ghcr.io/myorg/api-server:3.2
docker push ghcr.io/myorg/api-server:3
Consumers pinning :3.2.1 get an immutable release. Those pinning :3.2 automatically receive patch updates. Those pinning :3 get minor updates too. This gives consumers explicit control over their upgrade risk.
Git SHA Tags
For internal services and CI/CD-driven deployments, tagging with the short Git commit SHA creates a 1:1 link between source code and the running artifact. Every build is unique and traceable:
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t ghcr.io/myorg/api-server:${GIT_SHA} .
docker push ghcr.io/myorg/api-server:${GIT_SHA}
# In your Kubernetes manifest or docker-compose.yml:
# image: ghcr.io/myorg/api-server:a1b2c3d
Why :latest Is an Anti-Pattern
The :latest trap: the tag :latest does not mean "most recent." It is simply the default tag applied when you don't specify one. It is mutable, it is never auto-updated on hosts that have already pulled it, and it gives you zero traceability. Using :latest in production means you cannot reliably answer "what version is running right now?"
| Strategy | Immutable? | Traceable to Source? | Best For |
|---|---|---|---|
| Semver (:3.2.1) | By convention, yes | Via release notes/changelog | Libraries, public images, versioned APIs |
| Git SHA (:a1b2c3d) | Yes | Directly — git show a1b2c3d | Internal services, CD pipelines |
| :latest | No | No | Local development only |
| Branch name (:main) | No | Loosely — HEAD changes | Staging / preview environments |
Push/Pull Flow and Layer Deduplication
Registries store images as a manifest (a JSON document listing layers) plus individual layer blobs (compressed tarballs). When you push, the client checks which layers the registry already has and only uploads new ones. Pulls work the same way in reverse — the daemon skips layers it already has locally. This is why sharing a common base image across services drastically reduces both storage and transfer time.
sequenceDiagram
participant Dev as Developer / CI
participant Daemon as Docker Daemon
participant Reg as Registry
Note over Dev,Reg: docker push myregistry/app:v2.0
Dev->>Daemon: docker push myregistry/app:v2.0
Daemon->>Reg: POST /v2/app/blobs/uploads/ (Layer A hash)
Reg-->>Daemon: 200 Layer A already exists (skip)
Daemon->>Reg: POST /v2/app/blobs/uploads/ (Layer B hash)
Reg-->>Daemon: 202 Upload URL
Daemon->>Reg: PUT Layer B blob data
Reg-->>Daemon: 201 Created
Daemon->>Reg: PUT /v2/app/manifests/v2.0
Reg-->>Daemon: 201 Manifest stored
Note over Dev,Reg: docker pull myregistry/app:v2.0 (different host)
Dev->>Daemon: docker pull myregistry/app:v2.0
Daemon->>Reg: GET /v2/app/manifests/v2.0
Reg-->>Daemon: Manifest (lists Layer A + B)
Daemon->>Daemon: Layer A exists locally (skip)
Daemon->>Reg: GET /v2/app/blobs/sha256:layerB
Reg-->>Daemon: Layer B blob data
Daemon-->>Dev: Image ready
If 10 microservices all use FROM node:20-slim, the shared base layers are stored once in the registry and once on each host. Only the application-specific layers on top are unique. This is why consistent base images across an organization are worth the governance effort.
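You can observe the sharing directly: layer digests are content-addressed, so two images built from the same base report identical digests for the shared layers. A sketch (the service image names are hypothetical):

```shell
# Dump each image's layer digests, sorted
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' \
  service-a:latest | sort > a.layers
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' \
  service-b:latest | sort > b.layers
# Digests present in both files are the shared node:20-slim base layers
comm -12 a.layers b.layers
```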
Private Registry Options
Docker Hub enforces pull rate limits (roughly 100 pulls per 6 hours for anonymous users and 200 for authenticated free accounts), and private repositories are limited on the free tier. For proprietary code, compliance requirements, or simple pull-rate sanity, you need a private registry. The ecosystem offers options at every scale.
Self-Hosted: registry:2
The Docker-maintained registry:2 image is the simplest private registry you can run. It implements the full OCI Distribution API, supports S3/GCS/Azure backends for blob storage, and is production-ready when paired with a TLS reverse proxy.
# Start a local registry on port 5000
docker run -d \
--name registry \
--restart always \
-p 5000:5000 \
-v registry-data:/var/lib/registry \
registry:2
# Tag and push an image to it
docker tag my-app:latest localhost:5000/my-app:v1.0.0
docker push localhost:5000/my-app:v1.0.0
# Pull from any machine on the same network
docker pull 192.168.1.50:5000/my-app:v1.0.0
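Because registry:2 implements the standard Distribution API, plain curl is enough to browse it — a quick sanity check after pushing. The repository and tag below match the hypothetical push above:

```shell
repo=my-app
tag=v1.0.0
# List repositories and tags the registry knows about
curl -s "localhost:5000/v2/_catalog"
curl -s "localhost:5000/v2/${repo}/tags/list"
# Fetch the manifest — the JSON document listing the image's layers
curl -s -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  "localhost:5000/v2/${repo}/manifests/${tag}"
```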
Enterprise: Harbor
Harbor is a CNCF graduated project that wraps registry:2 and adds enterprise features: RBAC, vulnerability scanning (via Trivy), image signing, replication between registries, audit logs, and a web UI. If you need a self-hosted registry with governance, Harbor is the standard choice.
Cloud-Managed Registries
| Registry | Provider | Key Advantage |
|---|---|---|
| ECR (Elastic Container Registry) | AWS | Native IAM auth, lifecycle policies for automatic cleanup |
| GAR (Artifact Registry) | Google Cloud | Multi-format (Docker, npm, Maven), regional replication |
| ACR (Azure Container Registry) | Azure | ACR Tasks for in-registry builds, geo-replication |
| GHCR (GitHub Container Registry) | GitHub | Tight Actions integration, free for public repos |
Cloud registries eliminate operational overhead — no TLS certs to manage, no storage backends to configure, no uptime to guarantee. They authenticate through the cloud provider's IAM, which simplifies CI/CD credentials. The tradeoff is vendor coupling and per-GB storage costs.
Authenticating with Cloud Registries
# ECR login (token valid for 12 hours)
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS \
--password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
# Google Artifact Registry login
gcloud auth configure-docker us-east1-docker.pkg.dev
# GHCR login with a Personal Access Token
echo $CR_PAT | docker login ghcr.io \
--username YOUR_GITHUB_USER --password-stdin
Image Signing and Verification
Pushing an image to a registry does not prove who built it or that it hasn't been tampered with. Image signing creates a cryptographic chain of trust: you verify that the image was produced by a known identity and hasn't been modified since signing. This is a supply-chain security essential.
Docker Content Trust (DCT)
DCT uses The Update Framework (TUF) via Notary to sign image tags. When enabled, the Docker client refuses to pull unsigned images. It's built into the Docker CLI but has seen limited adoption because Notary v1 is complex to operate.
# Enable Docker Content Trust
export DOCKER_CONTENT_TRUST=1
# Now push signs automatically (prompts for passphrase on first use)
docker push ghcr.io/myorg/api-server:v2.0.0
# Pull will reject unsigned images
docker pull ghcr.io/myorg/api-server:v2.0.0
Cosign and Sigstore — The Modern Approach
Sigstore's cosign is rapidly becoming the industry standard for container signing. It signs the image digest (not the mutable tag), stores signatures in the same registry alongside the image, and supports keyless signing via OIDC identity providers (GitHub Actions, Google, Microsoft). No key management infrastructure required.
# Sign with a key pair
cosign generate-key-pair
cosign sign --key cosign.key ghcr.io/myorg/api-server@sha256:abc123...
# Verify on the consumer side
cosign verify --key cosign.pub ghcr.io/myorg/api-server@sha256:abc123...
# Keyless signing in CI (e.g., GitHub Actions) — no keys to manage
cosign sign ghcr.io/myorg/api-server@sha256:abc123...
# Sigstore's Fulcio CA issues a short-lived cert tied to your OIDC identity
Use a policy engine like Kyverno or OPA Gatekeeper with cosign verification to reject unsigned images at admission time. This closes the loop — signing without enforcement is security theater.
Container Lifecycle: From Create to Remove
Every Docker container moves through a well-defined set of states — from the moment it's created to the moment it's removed from the system. Understanding this lifecycle is the foundation for debugging stuck containers, designing restart strategies, and writing clean orchestration scripts.
This section walks through each state transition, the commands that trigger them, and the flags and inspection tools you'll use daily.
The Container State Machine
A container can exist in one of five states. Each state transition is triggered by a specific Docker command or an event inside the container itself (like the main process exiting).
stateDiagram-v2
[*] --> Created : docker create
Created --> Running : docker start
Running --> Running : docker restart
Running --> Paused : docker pause
Paused --> Running : docker unpause
Running --> Exited : docker stop / process exits
Running --> Exited : SIGKILL / OOM / crash
Exited --> Running : docker start / docker restart
Exited --> Removed : docker rm
Created --> Removed : docker rm
Removed --> [*]
note right of Created : Container exists but\nmain process not started
note right of Paused : Process frozen via\ncgroup freezer
note right of Exited : Process terminated,\nfilesystem persists
The key insight: Exited containers still occupy disk space. Their writable layer, logs, and metadata remain until you explicitly docker rm them or use the --rm flag at creation time.
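A few commands to find and reclaim that space:

```shell
# List exited containers with their writable-layer size
docker ps -a --filter status=exited --size
# Count them — handy in cleanup scripts
count=$(docker ps -aq --filter status=exited | wc -l)
echo "$count exited containers on this host"
# Remove all stopped containers (asks for confirmation; add -f to skip)
docker container prune
# Overall disk usage: images, containers, volumes, build cache
docker system df
```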
docker run Demystified
docker run is actually three commands combined: docker create + docker start + docker attach (in foreground mode). Understanding this decomposition helps you reason about what's happening when a run fails — did the image pull fail (create phase), did the process crash on startup (start phase), or is the output just not reaching your terminal (attach phase)?
# These two sequences are equivalent:
# Sequence 1: docker run (all-in-one)
docker run -d --name my-nginx nginx:alpine
# Sequence 2: explicit create + start
docker create --name my-nginx nginx:alpine
docker start my-nginx
Essential docker run Flags
You'll use these flags constantly. Rather than memorize them in isolation, here's how they map to real operational concerns.
Detached vs Interactive Mode
The -d flag runs the container in the background and prints the container ID. Without it, your terminal attaches to the container's stdout/stderr. The -it combination allocates a pseudo-TTY (-t) and keeps stdin open (-i) — essential for interactive shells.
# Background a web server
docker run -d --name web nginx:alpine
# Interactive shell into an Ubuntu container
docker run -it --name sandbox ubuntu:22.04 bash
# Combine: start detached, exec into it later
docker run -d --name app node:20-alpine sleep infinity
docker exec -it app sh
Naming, Cleanup, and Ports
# --name: give the container a human-readable name
# --rm: auto-remove the container when it exits
# -p: map host port to container port
docker run -d --name api --rm -p 3000:3000 my-api:latest
# Bind to a specific interface (security best practice)
docker run -d -p 127.0.0.1:5432:5432 postgres:16
# Map multiple ports
docker run -d -p 80:80 -p 443:443 --name proxy nginx:alpine
Environment Variables and Working Directory
The -e flag sets individual environment variables. For multiple variables, --env-file reads from a file — keeping secrets out of your shell history. The -w flag overrides the working directory inside the container.
# Inline environment variables
docker run -d --name db \
-e POSTGRES_USER=admin \
-e POSTGRES_PASSWORD=secret \
-e POSTGRES_DB=myapp \
postgres:16
# Load from an env file (one VAR=value per line)
docker run -d --name app --env-file .env my-app:latest
# Override the working directory
docker run -it -w /opt/project node:20-alpine npm test
Environment variables passed with -e are visible via docker inspect. For production secrets, use Docker secrets (Swarm) or mount a secrets file. Never pass passwords as -e flags in CI/CD logs.
Complete Flags Reference
| Flag | Purpose | Example |
|---|---|---|
| -d | Run detached (background) | docker run -d nginx |
| -it | Interactive mode with TTY | docker run -it ubuntu bash |
| --name | Assign a container name | --name my-app |
| --rm | Auto-remove on exit | docker run --rm alpine echo hi |
| -p | Publish port (host:container) | -p 8080:80 |
| -e | Set environment variable | -e NODE_ENV=production |
| --env-file | Load env vars from file | --env-file .env |
| -w | Set working directory | -w /app |
| --restart | Restart policy | --restart unless-stopped |
Container Inspection Tools
Running containers are black boxes until you know how to peer inside them. Docker provides a suite of inspection commands that cover metadata, logs, resource usage, and filesystem changes.
docker inspect — The Swiss Army Knife
docker inspect returns a massive JSON blob with everything Docker knows about a container: its configuration, network settings, mounts, state, and more. The --format flag (Go templates) lets you extract exactly what you need.
# Full JSON output
docker inspect my-app
# Get the container's IP address
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-app
# Check the current state
docker inspect --format '{{.State.Status}}' my-app
# Get the exit code of a stopped container
docker inspect --format '{{.State.ExitCode}}' my-app
# View all environment variables
docker inspect --format '{{.Config.Env}}' my-app
# See port mappings
docker port my-app
docker logs — Container Output
Every container captures its main process's stdout and stderr. The logs command is your first stop when something goes wrong. Combine -f (follow) with --since for real-time debugging.
# View all logs
docker logs my-app
# Follow logs in real time (like tail -f)
docker logs -f my-app
# Last 50 lines with timestamps
docker logs --tail 50 -t my-app
# Logs from the last 5 minutes
docker logs --since 5m my-app
docker stats, top, and diff
# Live resource usage (CPU, memory, network, disk I/O)
docker stats my-app
# One-shot stats (no streaming)
docker stats --no-stream my-app
# See processes running inside the container
docker top my-app
# See filesystem changes since container was created
# A = Added, C = Changed, D = Deleted
docker diff my-app
Executing Commands and Copying Files
docker exec runs a new process inside a running container — indispensable for debugging. docker cp copies files between your host and a container (running or stopped).
# Open an interactive shell in a running container
docker exec -it my-app sh
# Run a one-off command
docker exec my-app cat /etc/os-release
# Run as a specific user
docker exec -u root my-app apt-get update
# Set environment variables for the exec session
docker exec -e DEBUG=true my-app node script.js
# Copy a file from container to host
docker cp my-app:/var/log/app.log ./app.log
# Copy a config file from host into the container
docker cp ./nginx.conf my-app:/etc/nginx/nginx.conf
docker exec only works on running containers. If the container has exited, you must docker start it first or use docker cp (which works on stopped containers too) to extract files.
Restart Policies
Restart policies determine what Docker does when a container's main process exits. They are your first line of defense for keeping services up without external orchestration. Set them at creation time with --restart.
| Policy | Behavior | Use Case |
|---|---|---|
| no | Never restart (default) | One-off tasks, batch jobs |
| on-failure[:max] | Restart only on non-zero exit code, optional retry limit | Workers that might crash but shouldn't loop forever |
| always | Always restart, including after daemon restart | Critical services that must survive host reboots |
| unless-stopped | Like always, but not if manually stopped | Most production services — respects manual docker stop |
# Restart on crash, up to 5 attempts
docker run -d --restart on-failure:5 --name worker my-worker:latest
# Always restart (survives dockerd restart and host reboot)
docker run -d --restart always --name db postgres:16
# Recommended default for most services
docker run -d --restart unless-stopped --name api my-api:latest
# Update the restart policy on an existing container
docker update --restart unless-stopped my-app
Events, Waiting, and Exit Codes
docker events and docker wait
docker events streams real-time events from the Docker daemon — container starts, stops, health checks, image pulls, network connects, and more. It's invaluable for understanding what's happening across all containers. docker wait blocks until a container stops and then prints its exit code, making it perfect for scripting sequential workflows.
# Stream all Docker events in real time
docker events
# Filter events for a specific container
docker events --filter container=my-app
# Only show container start/stop events since last hour
docker events --since 1h --filter event=start --filter event=stop
# Wait for a container to exit, then get its exit code
EXIT_CODE=$(docker wait my-batch-job)
echo "Job finished with exit code: $EXIT_CODE"
# Use in CI: run tests, fail pipeline on non-zero exit
docker run --name tests my-app:test npm test
docker wait tests
EXIT_CODE=$(docker inspect --format '{{.State.ExitCode}}' tests)
exit $EXIT_CODE
Understanding Exit Codes
The exit code tells you how a container stopped. A handful of codes cover the vast majority of cases you'll encounter.
| Exit Code | Meaning | Cause |
|---|---|---|
| 0 | Success | Process completed normally |
| 1 | Application error | Unhandled exception, misconfiguration, general failure |
| 137 | Killed (SIGKILL = 128+9) | OOM killer, forced termination, or docker stop after grace-period timeout |
| 143 | Terminated (SIGTERM = 128+15) | Graceful docker stop — the process caught SIGTERM and exited |
| 126 | Command not executable | Permission denied on the entrypoint binary |
| 127 | Command not found | Entrypoint or CMD binary doesn't exist in the image |
# Check why a container stopped
docker inspect --format '{{.State.ExitCode}}' my-app
# 137 → likely OOM. Check with:
docker inspect --format '{{.State.OOMKilled}}' my-app
# docker stop sends SIGTERM, waits 10s (default), then SIGKILL
docker stop my-app # graceful: exit 143 if handled
docker stop -t 30 my-app # give 30 seconds grace period
If your containers consistently exit with 137 and OOMKilled: true, increase the memory limit with --memory. If they exit 137 but OOMKilled: false, something external sent SIGKILL — check docker events to find the culprit.
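The 128+N convention makes exit codes scriptable. A small helper you might drop into a CI script (the function name is ours):

```shell
# Map a container exit code to a human-readable cause
explain_exit() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (code $code)"
  fi
}
explain_exit 0     # prints "clean exit"
explain_exit 137   # prints "killed by signal 9"  (SIGKILL = 128+9)
explain_exit 143   # prints "killed by signal 15" (SIGTERM = 128+15)
explain_exit 1     # prints "application error (code 1)"
```

Pair it with docker wait or docker inspect to annotate failed jobs in pipeline logs.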
Putting It All Together
Here's a realistic workflow that exercises the full container lifecycle — creating a container, inspecting it, debugging an issue, and cleaning up.
# 1. Start a service
docker run -d \
--name api \
--restart unless-stopped \
-p 3000:3000 \
-e NODE_ENV=production \
--env-file .env.production \
my-api:2.1.0
# 2. Verify it's running and healthy
docker ps --filter name=api
docker logs --tail 20 api
# 3. Check resource usage
docker stats --no-stream api
# 4. Something looks wrong — debug it
docker exec -it api sh
# Inside: check files, env, connectivity, then exit
# 5. Pull a log file out for analysis
docker cp api:/app/logs/error.log ./error.log
# 6. See what files changed since image was built
docker diff api
# 7. Gracefully stop and remove
docker stop api
docker rm api
# Or force-remove a running container in one step
docker rm -f api
Docker Networking: Bridge, Host, Overlay, and DNS
Every container you run needs a way to communicate — with other containers, with the host, and with the outside world. Docker's networking subsystem is pluggable, built on Linux kernel primitives like network namespaces, veth pairs, iptables, and virtual bridges. Understanding how these pieces connect is the difference between "it works" and knowing why it works.
Docker ships with several built-in network drivers. Each one represents a different trade-off between isolation, performance, and simplicity. Let's walk through each driver, then cover DNS, port publishing, and troubleshooting.
graph TB
subgraph Host["Docker Host (Linux Kernel)"]
direction TB
bridge0["docker0 bridge\nip: 172.17.0.1/16"]
custom_br["my-net bridge\nip: 172.20.0.1/24"]
dns["Embedded DNS\n127.0.0.11"]
subgraph BridgeNet["Default Bridge Network"]
direction LR
c1["container-A\n172.17.0.2"]
c2["container-B\n172.17.0.3"]
end
subgraph CustomNet["User-Defined Bridge my-net"]
direction LR
c3["api-server\n172.20.0.2"]
c4["redis\n172.20.0.3"]
end
subgraph HostNet["Host Network"]
c5["nginx\nshares host IP"]
end
c1 ---|"veth pair"| bridge0
c2 ---|"veth pair"| bridge0
c3 ---|"veth pair"| custom_br
c4 ---|"veth pair"| custom_br
bridge0 ---|"iptables NAT"| eth0["eth0 Host NIC"]
custom_br ---|"iptables NAT"| eth0
c5 ---|"bypasses namespace"| eth0
c3 -.-|"DNS: redis to 172.20.0.3"| dns
c4 -.-|"DNS: api-server to 172.20.0.2"| dns
end
eth0 --- internet["External Network"]
The Default Bridge Network
When you install Docker, it creates a Linux bridge called docker0 with a default subnet (typically 172.17.0.0/16). Every container that starts without a --network flag connects here. Docker creates a veth pair — one end goes inside the container's network namespace as eth0, the other end attaches to the docker0 bridge on the host.
Outbound traffic from containers is NATed via iptables MASQUERADE rules so it appears to originate from the host's IP. Containers on the default bridge can reach each other by IP address, but they do not get automatic DNS resolution by name.
# Run two containers on the default bridge
docker run -d --name box1 alpine sleep 3600
docker run -d --name box2 alpine sleep 3600
# They can ping each other by IP, but NOT by name
BOX2_IP=$(docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' box2)
docker exec box1 ping -c 2 "$BOX2_IP"   # works
docker exec box1 ping -c 2 box2         # fails — no DNS on the default bridge
The default bridge network is a legacy artifact. It doesn't provide automatic DNS, you can't connect/disconnect containers at runtime, and all containers share a single unscoped network. Always create user-defined bridge networks instead.
User-Defined Bridge Networks
A user-defined bridge is Docker's recommended approach for single-host container networking. It works the same way under the hood — Linux bridge, veth pairs, iptables NAT — but adds critical features that the default bridge lacks. The most important one: automatic DNS resolution between containers by name and alias.
Containers on a user-defined bridge are also fully isolated from containers on other networks, and you can attach or detach running containers on the fly.
# Create a user-defined bridge with a custom subnet
docker network create \
--driver bridge \
--subnet 172.20.0.0/24 \
--gateway 172.20.0.1 \
my-app-net
# Run containers on it — they resolve each other by name
docker run -d --name api --network my-app-net nginx:alpine
docker run -d --name cache --network my-app-net redis:alpine
# DNS just works
docker exec api ping -c 2 cache
# PING cache (172.20.0.3): 56 data bytes — resolves automatically
# Attach a running container to a second network
docker network connect my-app-net box1
Host Network Mode
With --network host, Docker skips creating a network namespace entirely. The container shares the host's network stack — same interfaces, same IP, same port space. There's no NAT overhead and no veth pair, so you get bare-metal network performance.
The trade-off is obvious: no port isolation. If your container binds to port 80, it occupies port 80 on the host. Two containers can't use the same port. This mode is useful for performance-sensitive applications (high-throughput proxies, monitoring agents) or when you need to access host network services directly.
# Container uses the host's network directly — no -p flag needed
docker run -d --name web --network host nginx:alpine
# nginx is now reachable on host_ip:80
curl localhost:80
None and Macvlan
--network none gives a container a network namespace with only a loopback interface. No external connectivity at all. This is useful for batch jobs that process local files, or for security-hardened containers that should never touch the network.
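You can confirm that isolation in one command — a none-network container sees only loopback:

```shell
# Only "lo" is present: no eth0, no default route
docker run --rm --network none alpine ip addr show
# Any outbound attempt fails fast
docker run --rm --network none alpine ping -c 1 -W 1 1.1.1.1 \
  || echo "no route, as expected"
```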
Macvlan assigns a real MAC address to each container, making it appear as a physical device on your LAN. Containers get IPs from your physical network's DHCP server (or you assign them statically). This is essential when you need containers to be directly addressable on the LAN — for example, migrating legacy apps that expect to sit on a flat L2 network.
# Create a macvlan network tied to the host's eth0
docker network create \
--driver macvlan \
--subnet 192.168.1.0/24 \
--gateway 192.168.1.1 \
-o parent=eth0 \
lan-net
# Container gets a real LAN IP
docker run -d --name legacy-app --network lan-net \
--ip 192.168.1.50 my-legacy-image
Overlay Networks (Swarm & Multi-Host)
When containers span multiple Docker hosts, you need overlay networking. Overlay networks use VXLAN tunneling to encapsulate Layer 2 frames inside UDP packets, creating a virtual network that spans across physical hosts. Docker Swarm manages overlay networks natively — no external tooling required.
Every Swarm service placed on an overlay network can reach every other service by name, regardless of which physical node its tasks run on. Under the hood, Swarm managers keep cluster state in a Raft-backed store, and the mapping of container IPs to host IPs (the VXLAN tunnel endpoints) is propagated between nodes over a gossip protocol.
# Create an overlay network (requires Swarm mode)
docker network create --driver overlay --attachable backend-net
# Deploy services — they resolve each other across nodes
docker service create --name api --network backend-net my-api:latest
docker service create --name db --network backend-net postgres:16
# From inside the api container, "db" resolves to the VIP
# regardless of which Swarm node it's running on
Network Driver Comparison
| Driver | Scope | DNS | Isolation | Performance | Use Case |
|---|---|---|---|---|---|
| bridge (default) | Single host | No auto DNS | Network namespace | Good (veth + NAT) | Quick tests, throwaway containers |
| bridge (user-defined) | Single host | Yes | Namespace + scoped | Good (veth + NAT) | Most single-host workloads |
| host | Single host | N/A | None — shares host | Best (no overhead) | High-throughput, monitoring |
| none | Single host | N/A | Complete | N/A | Offline / security-hardened |
| overlay | Multi-host | Yes | Namespace + VXLAN | Good (VXLAN encap cost) | Swarm services, cross-node |
| macvlan | Single host | No | MAC-level | Excellent | Legacy apps, flat L2 LAN |
Port Publishing
Containers on bridge networks are isolated behind NAT. To make a service reachable from outside the host, you publish ports with -p. Docker sets up iptables DNAT rules (and a docker-proxy userland process) to forward traffic from the host port to the container's IP and port.
# Explicit mapping: host 8080 -> container 80
docker run -d -p 8080:80 nginx:alpine
# Bind to a specific host interface only
docker run -d -p 127.0.0.1:8080:80 nginx:alpine
# Publish all EXPOSEd ports to random host ports
docker run -d -P nginx:alpine
# UDP port publishing
docker run -d -p 5514:514/udp syslog-server
# Verify the iptables rules Docker created
sudo iptables -t nat -L DOCKER -n -v
Docker's Embedded DNS Server (127.0.0.11)
Every container on a user-defined network gets its /etc/resolv.conf pointed at 127.0.0.11 — Docker's built-in DNS server. This server resolves container names, service names, and network aliases to internal IPs. For any name it can't resolve internally, it forwards the query to the DNS servers configured on the host (or those you specify with --dns).
This is what makes service discovery work without any external tool. When you do ping redis from another container on the same user-defined network, the query hits 127.0.0.11, Docker's DNS looks up the container named redis on that network, and returns its IP.
# Inspect the embedded DNS from inside a container
docker run --rm --network my-app-net alpine cat /etc/resolv.conf
# nameserver 127.0.0.11
# Use dig to query Docker's DNS
docker run --rm --network my-app-net alpine \
sh -c "apk add --no-cache bind-tools && dig cache"
# ;; ANSWER SECTION:
# cache. 600 IN A 172.20.0.3
# Add a network alias — one container, multiple DNS names
docker run -d --network my-app-net --network-alias db \
--network-alias primary-db --name postgres postgres:16
Multiple containers can share the same --network-alias. Docker's DNS returns all matching IPs in round-robin order. This gives you basic client-side load balancing without a proxy — though it's limited because DNS results are often cached by the client.
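To see what round-robin DNS hands a client, you can parse the A records out of dig's answer section. A sketch using captured sample output rather than a live daemon (the alias name and IPs are illustrative):

```shell
# Hypothetical output of `dig db` against Docker's embedded DNS when
# two containers share the network alias "db"
dig_output='db. 600 IN A 172.20.0.3
db. 600 IN A 172.20.0.4'

# Extract the IPs (field 5 of each A record)
ips=$(printf '%s\n' "$dig_output" | awk '$4 == "A" { print $5 }')
echo "$ips"

# A naive client that always takes the first answer defeats round-robin;
# iterating over (or shuffling) all returned IPs restores it
first=$(printf '%s\n' "$ips" | head -n1)
echo "first pick: $first"
```

This is also why DNS caching in the client undermines the load balancing: once one IP is cached, every request goes to that single container until the TTL expires.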
IPv6 and Custom Subnets
Docker supports dual-stack (IPv4 + IPv6) networking but it's not enabled by default. You need to explicitly opt in at both the daemon level and the network level. Once enabled, containers receive both an IPv4 and IPv6 address and can communicate over either protocol.
{
"ipv6": true,
"fixed-cidr-v6": "fd00:dead:beef::/48",
"default-address-pools": [
{ "base": "10.10.0.0/16", "size": 24 },
{ "base": "fd00:db8::/104", "size": 112 }
]
}
# Create a dual-stack network
docker network create \
--ipv6 \
--subnet 172.28.0.0/24 \
--subnet fd00:cafe::/64 \
dual-stack-net
# Verify both addresses are assigned
docker run --rm --network dual-stack-net alpine ip addr show eth0
# inet 172.28.0.2/24 ...
# inet6 fd00:cafe::2/64 ...
Network Troubleshooting with Netshoot
When networking breaks, you need tools — but production container images are (rightly) minimal. The nicolaka/netshoot image packs every network diagnostic tool you'd want: tcpdump, dig, nslookup, curl, iperf, netstat, ss, ip, bridge, and more. You can either join a container's network namespace or attach to a Docker network to diagnose from the inside.
# Join a running container's network namespace to debug it
docker run --rm -it --network container:api nicolaka/netshoot
# Now you share the exact same network stack as the "api" container
ss -tlnp # Check listening ports
dig cache # Test DNS resolution
nc -zv cache 6379 # Test TCP connectivity (Redis doesn't speak HTTP)
tcpdump -i eth0 port 80 # Capture traffic
# Or attach to a network to test from a fresh perspective
docker run --rm -it --network my-app-net nicolaka/netshoot
nslookup api 127.0.0.11 # Query Docker's embedded DNS directly
traceroute api # Trace the route between networks
For a faster check, docker network inspect my-app-net shows all connected containers, their IPs, subnet configuration, and driver options — all without running a separate debug container.
Practical Mental Model
Here's how to think about Docker networking decisions in practice:
- Single app with multiple containers? — Use one user-defined bridge. Containers talk by name.
- Need true LAN presence? — Use macvlan. Containers get real MAC/IP on the physical network.
- Multi-host cluster? — Use overlay with Swarm (or a CNI plugin with Kubernetes).
- Maximum network performance? — Use host mode and accept port conflicts.
- Security-critical batch job? — Use none to eliminate the network attack surface entirely.
- Debugging connectivity? — Attach netshoot to the target container's namespace with --network container:<name>.
Docker Volumes and Storage: Bind Mounts, Named Volumes, and tmpfs
Every Docker container gets its own writable layer on top of the image's read-only layers. When the container is removed, that writable layer is gone — permanently. Any database rows, uploaded files, or application state written inside the container vanish with it.
This is Docker's ephemeral storage problem. It's by design — containers are meant to be disposable. But real applications need durable data. Docker solves this with mounts: mechanisms that let a container read and write to storage that lives outside the container's writable layer.
Mount Types at a Glance
Docker provides three distinct mount types. Each places data in a different location on the host and serves a different purpose. The diagram below shows how they relate to the container and the host filesystem.
flowchart LR
subgraph Host["Docker Host"]
direction TB
VolumeArea["/var/lib/docker/volumes/\n(Managed by Docker)"]
HostDir["/home/user/project\n(Any host path)"]
RAM["Host Memory - RAM"]
end
subgraph Container["Container"]
direction TB
Writable["Writable Layer\n(ephemeral - dies with container)"]
MountNV["/var/lib/postgresql/data"]
MountBM["/app/src"]
MountTmp["/tmp/secrets"]
end
VolumeArea -- "Named Volume" --> MountNV
HostDir -- "Bind Mount" --> MountBM
RAM -- "tmpfs Mount" --> MountTmp
style Writable fill:#ff6b6b,stroke:#c0392b,color:#fff
style VolumeArea fill:#51cf66,stroke:#2f9e44,color:#fff
style HostDir fill:#339af0,stroke:#1971c2,color:#fff
style RAM fill:#fcc419,stroke:#e67700,color:#000
| Mount Type | Stored At | Managed By | Best For |
|---|---|---|---|
| Named Volume | /var/lib/docker/volumes/ | Docker Engine | Databases, persistent app data |
| Bind Mount | Any host path you choose | You (the host OS) | Source code in development, config files |
| tmpfs | Host memory (RAM) only | Kernel | Secrets, sensitive caches, scratch space |
Named Volumes
Named volumes are Docker's recommended storage mechanism for persistent data. Docker creates a directory under /var/lib/docker/volumes/<name>/_data and manages it entirely. You don't need to know or care about the exact host path — Docker handles creation, mounting, and cleanup.
When you mount a named volume into a container for the first time and the volume is empty, Docker copies the contents from the container's image at that mount point into the volume. This "copy-on-first-use" behavior means database images that ship with initial data in /var/lib/postgresql/data work out of the box.
# Create a named volume explicitly
docker volume create pgdata
# Run Postgres with the named volume
docker run -d \
--name db \
-v pgdata:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=secret \
postgres:16
# The data survives container removal
docker rm -f db
docker run -d --name db2 -v pgdata:/var/lib/postgresql/data -e POSTGRES_PASSWORD=secret postgres:16
# All databases and tables are still there
Bind Mounts
Bind mounts map a specific host directory or file directly into the container. Unlike named volumes, Docker does not manage the lifecycle — if the host path doesn't exist, Docker creates it as an empty directory (which is often not what you want). Bind mounts give full control but come with more responsibility.
The primary use case is development: you mount your source code into the container so changes on the host are immediately visible inside the container without rebuilding the image.
# Mount current directory into /app/src in the container
docker run -d \
--name devserver \
-v "$(pwd)":/app/src \
-w /app/src \
node:20 \
npm run dev
Unlike named volumes, bind mounts do not copy image data into the mount. If your image has files at /app and you bind-mount an empty host directory there, the container sees an empty /app. This is the #1 cause of "file not found" errors in development setups.
tmpfs Mounts
A tmpfs mount stores data in the host's memory only — nothing is written to disk. When the container stops, the data is gone. This is ideal for sensitive data like API tokens or session keys that should never touch a filesystem, and for high-throughput scratch data where disk I/O would be a bottleneck.
# tmpfs mount - 64MB in-memory filesystem at /tmp/secrets
docker run -d \
--name app \
--tmpfs /tmp/secrets:rw,size=67108864 \
myapp:latest
The --mount vs -v Syntax
Docker offers two syntaxes for attaching storage. The -v (or --volume) flag is older and more compact. The --mount flag is newer, more explicit, and the recommended choice for anything beyond simple cases. The key behavioral difference: -v auto-creates missing host directories for bind mounts, while --mount throws an error — which is usually what you want, because a silent auto-created empty directory is a debugging nightmare.
# Named volume
docker run -d \
--mount type=volume,source=pgdata,target=/var/lib/postgresql/data \
postgres:16
# Bind mount
docker run -d \
--mount type=bind,source="$(pwd)"/src,target=/app/src \
node:20
# tmpfs
docker run -d \
--mount type=tmpfs,target=/tmp/cache,tmpfs-size=104857600 \
myapp:latest
# Read-only bind mount
docker run -d \
--mount type=bind,source="$(pwd)"/config,target=/etc/app,readonly \
myapp:latest
# Named volume - name:container_path
docker run -d \
-v pgdata:/var/lib/postgresql/data \
postgres:16
# Bind mount - /host/path:container_path
docker run -d \
-v "$(pwd)"/src:/app/src \
node:20
# Read-only bind mount - append :ro
docker run -d \
-v "$(pwd)"/config:/etc/app:ro \
myapp:latest
# Anonymous volume (no name - Docker assigns a random hash)
docker run -d \
-v /var/lib/postgresql/data \
postgres:16
The -v flag silently creates host directories that don't exist — which masks typos and misconfigurations. --mount fails fast with a clear error. Use -v for quick terminal one-liners; use --mount in anything checked into version control.
Volume Drivers
By default, Docker volumes use the local driver, which stores data on the host filesystem. Volume drivers extend this to support remote storage — NFS shares, cloud block storage, distributed filesystems, and more. You specify the driver when creating a volume.
# Create a volume backed by an NFS share using the local driver
docker volume create \
--driver local \
--opt type=nfs \
--opt o=addr=192.168.1.100,rw,nfsvers=4 \
--opt device=:/exports/data \
nfs_data
# Use it like any other named volume
docker run -d --mount source=nfs_data,target=/data myapp:latest
In Docker Swarm and Kubernetes environments, volume drivers become essential for making data accessible across multiple nodes. Common third-party drivers include REX-Ray (AWS EBS, Azure Disk), NetApp Trident, and Portworx.
Practical Patterns
Database Volumes
Always use a named volume for database data directories. This decouples the data lifecycle from the container lifecycle — you can swap in a patched image (say, Postgres 16.3 to 16.4) by stopping the old container and starting a new one with the same volume. Note that major-version upgrades (16 to 17) change the on-disk format and still require pg_upgrade or a dump/restore.
# docker-compose.yml
services:
db:
image: postgres:16
environment:
POSTGRES_PASSWORD: secret
volumes:
- pgdata:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
volumes:
pgdata: # Docker manages this volume
Sharing Data Between Containers
Multiple containers can mount the same named volume. A common pattern is having one container generate files (e.g., a static site builder) and another serve them (e.g., Nginx).
# Builder writes HTML to the shared volume
docker run --rm -v site_content:/output builder:latest
# Nginx serves from the same volume (read-only)
docker run -d \
--mount type=volume,source=site_content,target=/usr/share/nginx/html,readonly \
-p 8080:80 \
nginx:alpine
Read-Only Mounts
Appending :ro (with -v) or readonly (with --mount) prevents the container from writing to the mount. Use this for configuration files and secrets — it enforces the principle of least privilege and protects against accidental modification.
# Mount config as read-only, but keep the data volume writable
docker run -d \
--mount type=bind,source="$(pwd)"/nginx.conf,target=/etc/nginx/nginx.conf,readonly \
--mount type=volume,source=app_logs,target=/var/log/nginx \
-p 80:80 \
nginx:alpine
UID/GID Permission Gotchas
This is the single most common Docker volume pain point. When a container process runs as a non-root user (say, UID 1000), it needs matching permissions on the mounted directory. But named volumes owned by root and bind mounts with the host user's UID may not align with the container's user.
# Fix permissions in the Dockerfile - the right approach
FROM node:20-slim
# Create app user with a known UID
RUN groupadd -g 1001 appgroup && \
useradd -u 1001 -g appgroup -m appuser
# Create and own the data directory BEFORE switching user
RUN mkdir -p /app/data && chown appuser:appgroup /app/data
USER appuser
WORKDIR /app
# For bind mounts, ensure the host directory matches the container's UID
mkdir -p ./data
chown 1001:1001 ./data
docker run -d -v "$(pwd)"/data:/app/data myapp:latest
# Quick debug: check what UID the container process is using
docker exec mycontainer id
# uid=1001(appuser) gid=1001(appgroup) groups=1001(appgroup)
On Linux, UID/GID mapping is direct — the container's UID must match file ownership on the host. On macOS (Docker Desktop), Docker runs in a Linux VM with automatic UID translation, so permission issues are less common in development. Don't let macOS lull you into ignoring permissions — they will bite in production on Linux.
Volume Lifecycle Commands
Docker provides a complete set of CLI commands for managing volumes independently of containers. Volumes are first-class objects with their own lifecycle.
# Create a named volume
docker volume create mydata
# List all volumes
docker volume ls
# Inspect volume metadata (driver, mount point, labels)
docker volume inspect mydata
# Remove a specific volume (fails if a container is using it)
docker volume rm mydata
# Remove unused anonymous volumes (since Docker 23, named volumes are spared by default)
docker volume prune
# Nuclear option: remove ALL unused volumes, named included, with no confirmation prompt
docker volume prune -a --force
| Command | What It Does | Safe in Production? |
|---|---|---|
| docker volume ls | Lists all volumes | ✅ Read-only |
| docker volume inspect <name> | Shows JSON metadata | ✅ Read-only |
| docker volume create <name> | Creates an empty volume | ✅ Non-destructive |
| docker volume rm <name> | Deletes a specific volume | ⚠️ Verify first |
| docker volume prune | Deletes all unused volumes | ❌ Dangerous |
Backup and Restore Strategies
Docker volumes don't have a built-in backup command. The standard approach is to mount the volume into a temporary container that runs a backup tool like tar and writes the archive to a bind mount on the host.
Back up a named volume to a tar archive
Spin up a throwaway container that mounts both the volume (as source) and a host directory (as destination), then tar the data.
# Backup: volume -> tar file on host
docker run --rm \
  -v pgdata:/source:ro \
  -v "$(pwd)"/backups:/backup \
  alpine \
  tar czf /backup/pgdata-$(date +%Y%m%d).tar.gz -C /source .
Restore a tar archive into a volume
Create a fresh volume (or use the existing one), then extract the archive into it.
# Restore: tar file on host -> volume
docker volume create pgdata_restored
docker run --rm \
  -v pgdata_restored:/target \
  -v "$(pwd)"/backups:/backup:ro \
  alpine \
  tar xzf /backup/pgdata-20240115.tar.gz -C /target
For databases, prefer logical backups
File-level backups of a running database risk corruption. Use the database's own dump tool instead.
# pg_dump while the container is running - consistent logical backup
docker exec db pg_dumpall -U postgres > backups/full-dump.sql
# Restore into a new container
cat backups/full-dump.sql | docker exec -i db2 psql -U postgres
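Dated archives accumulate fast. A small retention sketch, assuming archives named pgdata-YYYYMMDD.tar.gz in a backups directory — the date suffix makes plain lexical sort chronological, so pruning is a one-liner (GNU head's negative -n is used here):

```shell
# Keep only the newest $keep backups. A temp dir stands in for ./backups.
backup_dir=$(mktemp -d)
keep=3

# Create some dummy archives to prune (dates are illustrative)
for d in 20240110 20240111 20240112 20240113 20240114; do
  touch "$backup_dir/pgdata-$d.tar.gz"
done

# Sort oldest-first, delete everything except the last $keep
ls "$backup_dir"/pgdata-*.tar.gz | sort | head -n -"$keep" | xargs -r rm --

remaining=$(ls "$backup_dir"/pgdata-*.tar.gz | wc -l)
echo "kept $remaining backups"
```

Run from cron alongside the backup container, this keeps disk usage bounded without any extra tooling.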
Resource Limits and Runtime Constraints
Every container shares the host kernel, which means an unconstrained container can monopolize CPU, exhaust memory, or fork-bomb the entire machine. Docker exposes Linux cgroups and security modules as simple CLI flags so you can enforce hard boundaries on what each container is allowed to consume and do.
This section covers the full spectrum: memory, CPU, PID, and I/O limits; live updates via docker update; monitoring with docker stats; and the security subsystem — Linux capabilities, seccomp, AppArmor, and related flags.
Memory Limits
Docker provides four memory-related flags that map directly to cgroup memory controllers. Understanding the difference between a hard limit and a soft reservation is critical — the former terminates your container via the OOM mechanism, the latter merely signals the kernel to reclaim memory under pressure.
| Flag | Effect | Default |
|---|---|---|
| --memory (-m) | Hard memory limit. Container is OOM-terminated if it exceeds this. | Unlimited |
| --memory-reservation | Soft limit. Kernel tries to reclaim memory when host is under pressure. | Unlimited |
| --memory-swap | Total memory + swap allowed. Set equal to --memory to disable swap. | 2× memory limit |
| --oom-kill-disable | Prevents the OOM mechanism from terminating this container. | false |
# Hard limit of 512MB, no swap, soft reservation of 256MB
docker run -d \
--memory=512m \
--memory-swap=512m \
--memory-reservation=256m \
--name my-app nginx
# Verify the applied limits
docker inspect --format='{{.HostConfig.Memory}}' my-app
# Output: 536870912 (bytes = 512MB)
Using --oom-kill-disable without a --memory limit is dangerous. If the container consumes all host memory, the kernel OOM mechanism will target other processes on the host instead — potentially including system-critical ones.
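Since docker inspect reports limits in raw bytes, converting back to the human-readable value is plain shell arithmetic (536870912 is the figure inspect printed for --memory=512m above):

```shell
# Convert the byte count from 'docker inspect' back to MB
limit_bytes=536870912
limit_mb=$(( limit_bytes / 1024 / 1024 ))
echo "${limit_mb} MB"
# --memory-swap is reported the same way: if it equals --memory
# (memory + swap combined), swap is effectively disabled
```

The same division works for any of the byte-valued HostConfig fields (Memory, MemorySwap, MemoryReservation).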
CPU Limits
CPU constraints come in two flavors: hard caps (the container physically cannot use more) and relative weights (shares are only relevant when CPUs are contested). Choosing the right approach depends on whether you need guaranteed throughput or fair scheduling under contention.
| Flag | Mechanism | When to Use |
|---|---|---|
| --cpus | Hard cap on CPU cores (e.g., 1.5 = 1.5 cores max). | Predictable upper bound for any workload. |
| --cpu-shares | Relative weight (default 1024). Only enforced under contention. | Fair sharing across many containers. |
| --cpuset-cpus | Pins container to specific cores (e.g., "0,2" or "0-3"). | NUMA-aware or latency-sensitive apps. |
# Cap at 2 CPU cores, pinned to cores 0 and 1
docker run -d \
--cpus=2 \
--cpuset-cpus="0,1" \
--name worker my-worker-image
# Give a background job lower priority (half the default shares)
docker run -d \
--cpu-shares=512 \
--name background-job my-batch-image
The --cpus flag is syntactic sugar for the older --cpu-period / --cpu-quota pair. Setting --cpus=1.5 is equivalent to --cpu-period=100000 --cpu-quota=150000 — meaning 150ms of CPU time per 100ms scheduling period.
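The equivalence is easy to check numerically: quota is just cpus × period, with the period defaulting to 100,000 µs. A quick sketch (awk handles the fractional multiplication, since POSIX shell arithmetic is integer-only):

```shell
# --cpus=N  =>  --cpu-quota = N * --cpu-period  (period defaults to 100000 µs)
cpus="1.5"
period=100000
quota=$(awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d", c * p }')
echo "--cpu-period=$period --cpu-quota=$quota"
```

So --cpus=0.5 would yield a quota of 50000: half a core's worth of CPU time per scheduling period.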
PID and I/O Limits
PID limits prevent a container from fork-bombing the host. The --pids-limit flag caps the total number of processes (including threads) inside the container’s PID namespace. A limit in the low hundreds is a generous ceiling for most web services.
# Limit to 200 processes
docker run -d --pids-limit=200 --name api my-api-image
# I/O throttling: limit read/write bandwidth and IOPS
docker run -d \
--device-read-bps=/dev/sda:10mb \
--device-write-bps=/dev/sda:10mb \
--device-read-iops=/dev/sda:1000 \
--device-write-iops=/dev/sda:1000 \
--name io-limited my-app
I/O limits use the blkio cgroup controller. Note that these flags apply to direct device I/O only — buffered writes going through the page cache may not be throttled as expected. For cgroups v2, Docker uses the unified io.max controller, which provides more consistent behavior across filesystems.
Cgroups v2 Mapping
Modern Linux distributions (Ubuntu 22.04+, Fedora 31+, Debian 11+) use cgroups v2 by default. Docker Engine 20.10+ supports cgroups v2 natively. The CLI flags remain the same, but the underlying kernel controllers differ:
| Docker Flag | Cgroups v1 Controller | Cgroups v2 Controller |
|---|---|---|
| --memory | memory.limit_in_bytes | memory.max |
| --memory-reservation | memory.soft_limit_in_bytes | memory.low |
| --cpus | cpu.cfs_quota_us / cpu.cfs_period_us | cpu.max |
| --cpu-shares | cpu.shares | cpu.weight (1–10000 scale) |
| --pids-limit | pids.max | pids.max |
| --device-read-bps | blkio.throttle.read_bps_device | io.max (rbps field) |
# Check if your host uses cgroups v2
stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" means v2, "tmpfs" means v1
# Inspect the actual cgroup files for a running container
CONTAINER_ID=$(docker inspect --format='{{.Id}}' my-app)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
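The --cpu-shares rescaling from the table deserves a concrete look. runc maps a v1 shares value (2–262144) onto the v2 weight scale (1–10000) with weight = 1 + ((shares − 2) × 9999) / 262142, using integer math — a sketch of that conversion (verify against cpu.weight in your container's cgroup directory if you want to confirm your runtime matches):

```shell
# v1 cpu.shares (2..262144) -> v2 cpu.weight (1..10000), integer math,
# mirroring the conversion runc applies under cgroups v2
shares_to_weight() {
  echo $(( 1 + (($1 - 2) * 9999) / 262142 ))
}

default_weight=$(shares_to_weight 1024)   # the default shares value
half_weight=$(shares_to_weight 512)       # the "background job" example
echo "1024 shares -> weight $default_weight"
echo "512 shares  -> weight $half_weight"
```

Note the mapping is not 1:1 — the default 1024 shares becomes a weight of 39, not 100 — so tooling that reads raw cgroup files must know which hierarchy it is on.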
Live Updates with docker update
You don’t need to restart a container to change its resource limits. The docker update command modifies cgroup settings on the fly. This is invaluable during incident response — you can throttle a runaway container without downtime.
# Double the memory limit on a running container
docker update --memory=1g --memory-swap=1g my-app
# Reduce CPU allocation during off-peak hours
docker update --cpus=0.5 my-app
# Update multiple containers at once
docker update --memory=256m --cpus=1 container1 container2 container3
Monitoring with docker stats
The docker stats command provides a live, top-like view of resource consumption per container. It reads directly from cgroup accounting files, so the numbers reflect exactly what the kernel is enforcing.
# Live stats for all running containers
docker stats
# One-shot snapshot (useful for scripting)
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.PIDs}}"
# Monitor a specific container
docker stats my-app --format "{{.MemUsage}} / {{.MemPerc}}"
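Because --format emits plain tab-separated text, stats output pipes cleanly into standard tools for ad-hoc alerting. A sketch that flags containers above a memory threshold — run here against captured sample output rather than a live daemon (names and percentages are illustrative):

```shell
# Sample output of:
#   docker stats --no-stream --format "{{.Name}}\t{{.MemPerc}}"
stats=$(printf 'api\t91.25%%\nworker\t43.10%%\ncache\t78.00%%')

# Strip the % sign and flag anything over 80% memory usage
hot=$(printf '%s\n' "$stats" | awk -F'\t' '{ gsub(/%/, "", $2) } ($2 + 0) > 80 { print $1 }')
echo "over threshold: $hot"
```

Wired to a live `docker stats --no-stream` in a cron job, the same pipeline becomes a zero-dependency memory watchdog.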
Linux Capabilities
Traditional Unix security has two states: unprivileged user or all-powerful root. Linux capabilities split root’s power into ~40 distinct privileges. Docker drops most of these by default, keeping only a minimal set needed for typical workloads. You can further tighten or selectively expand this set.
# Drop ALL capabilities, then add back only what's needed
docker run -d \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--name secure-web nginx
# A container that needs to modify network interfaces
docker run -d \
--cap-drop=ALL \
--cap-add=NET_ADMIN \
--cap-add=NET_RAW \
--name net-tool nicolaka/netshoot
# Check which capabilities a running container has
docker inspect --format='{{.HostConfig.CapAdd}}' secure-web
docker inspect --format='{{.HostConfig.CapDrop}}' secure-web
Common capabilities you might need to add back:
- NET_BIND_SERVICE — bind to ports below 1024
- NET_ADMIN — modify routing tables, network interfaces
- SYS_PTRACE — debugging tools like strace, gdb
- SYS_ADMIN — mount filesystems, use bpf() (broad — avoid if possible)
- DAC_OVERRIDE — bypass file read/write permission checks
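You can also verify capabilities from inside a container by decoding the CapEff bitmask in /proc/self/status: each capability is one bit, indexed by its kernel number (NET_BIND_SERVICE is 10, SYS_ADMIN is 21). A sketch against a sample mask — 00000000a80425fb is the value commonly reported for Docker's default capability set:

```shell
# Inside a container you'd obtain the mask with:
#   grep CapEff /proc/self/status
cap_eff="00000000a80425fb"   # sample: Docker's default capability set

has_cap() {  # has_cap <hex-mask> <capability-number> -> 1 if set, 0 if not
  echo $(( (0x$1 >> $2) & 1 ))
}

net_bind=$(has_cap "$cap_eff" 10)   # CAP_NET_BIND_SERVICE
sys_admin=$(has_cap "$cap_eff" 21)  # CAP_SYS_ADMIN
echo "NET_BIND_SERVICE: $net_bind, SYS_ADMIN: $sys_admin"
```

Under the default profile the check shows NET_BIND_SERVICE present and SYS_ADMIN absent, which matches the tighter posture Docker ships with.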
--privileged gives the container all capabilities, access to all devices, and disables seccomp/AppArmor. It effectively removes the isolation boundary. Use it only for DinD (Docker-in-Docker) or system-level tooling during development — never in production.
Seccomp, AppArmor, and Security Options
Capabilities control which privileged operations are allowed. Seccomp goes deeper — it filters which system calls the container process can make at all. Docker applies a default seccomp profile that blocks approximately 44 of the 300+ Linux syscalls, including dangerous ones like reboot, kexec_load, and mount.
# Run with a custom seccomp profile
docker run -d \
--security-opt seccomp=./custom-seccomp.json \
--name locked-down my-app
# Disable seccomp entirely (not recommended)
docker run -d \
--security-opt seccomp=unconfined \
--name debug-container my-app
# Use a specific AppArmor profile
docker run -d \
--security-opt apparmor=my-custom-profile \
--name armored-app my-app
Custom Seccomp Profile Structure
A seccomp profile is a JSON file that whitelists or blacklists specific syscalls. You start from Docker’s default profile and modify it to fit your application.
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"syscalls": [
{
"names": ["read", "write", "open", "close", "stat", "fstat",
"mmap", "mprotect", "munmap", "brk", "exit_group"],
"action": "SCMP_ACT_ALLOW"
},
{
"names": ["clone"],
"action": "SCMP_ACT_ALLOW",
"args": [
{ "index": 0, "value": 2114060288, "op": "SCMP_CMP_MASKED_EQ" }
]
}
]
}
Read-Only Filesystem, Ulimits, and Other Flags
Beyond capabilities and seccomp, Docker offers several more hardening flags. A read-only root filesystem forces you to explicitly declare writable paths — an excellent defense against runtime tampering.
# Read-only root filesystem with explicit tmpfs for writable paths
docker run -d \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--tmpfs /run:rw,noexec,nosuid \
--name immutable-app my-app
# Set ulimits: max open files and max processes
docker run -d \
--ulimit nofile=1024:2048 \
--ulimit nproc=512:1024 \
--name limited-app my-app
# Prevent the container from gaining new privileges
docker run -d \
--security-opt no-new-privileges:true \
--name hardened-app my-app
Putting It All Together: A Hardened Container
Here is what a production-grade container launch looks like when you combine resource limits with security constraints. Each flag serves a specific purpose — nothing is left to defaults.
docker run -d \
--name production-api \
--memory=512m \
--memory-swap=512m \
--memory-reservation=256m \
--cpus=2 \
--pids-limit=200 \
--ulimit nofile=1024:4096 \
--ulimit nproc=256:512 \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--security-opt no-new-privileges:true \
--security-opt seccomp=./seccomp-profile.json \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--restart=unless-stopped \
--health-cmd="curl -f http://localhost:8080/health || exit 1" \
--health-interval=30s \
my-api:v2.1.0
Start restrictive and loosen as needed. Run your application, monitor it with docker stats, and only add capabilities or raise limits when you hit actual failures. The error messages (OOM events, EPERM on syscalls, ENOSPC on PID limits) will tell you exactly what to adjust.
Docker Compose: Defining Multi-Container Applications
Real applications are never a single container. A typical web project needs a reverse proxy, an API server, a database, and a cache — at minimum. Docker Compose lets you define all of these services, their networks, and their volumes in a single declarative YAML file, then manage the entire stack with one command.
Compose uses the Compose Specification, an open standard. If you've been writing version: "3.8" at the top of your files, you can stop — the version key is deprecated and ignored by modern Docker Compose. Just start with services:.
The standalone docker-compose binary (V1, Python-based) is deprecated. Docker CLI now ships Compose V2 as a plugin — use docker compose (with a space). All commands in this section use V2 syntax.
Service Topology Overview
Before diving into the YAML, here's the architecture we'll build throughout this section — an Nginx reverse proxy fronting an API server, backed by PostgreSQL and Redis, with proper network segmentation and persistent storage.
graph LR
subgraph frontend-net["frontend network"]
NGINX["🌐 nginx\n(reverse proxy)\nport 80:80"]
API["⚙️ api\n(Node.js app)\nport 3000"]
end
subgraph backend-net["backend network"]
API2["⚙️ api"]
PG["🐘 postgres\nport 5432"]
REDIS["⚡ redis\nport 6379"]
end
NGINX -->|"proxy_pass"| API
API2 -->|"queries"| PG
API2 -->|"sessions/cache"| REDIS
PG ---|"pg-data volume"| PGV[("pg-data")]
REDIS ---|"redis-data volume"| RDV[("redis-data")]
Defining Services
Each service in a Compose file maps to a container. You specify what image to use (or how to build one), what environment it needs, which ports to expose, and how it relates to other services. Here's the core anatomy of a service definition.
Image vs Build
You can pull a pre-built image or build from a Dockerfile. Use image for off-the-shelf services (databases, caches) and build for your own application code.
services:
# Using a pre-built image
redis:
image: redis:7-alpine
# Building from a Dockerfile
api:
build:
context: ./api
dockerfile: Dockerfile
args:
NODE_ENV: production
image: myapp/api:latest # tags the built image
Command, Environment, and Env Files
Override the default container command with command or entrypoint. Pass configuration via environment for inline values or env_file to load from a file. The env_file approach keeps secrets out of your Compose file.
services:
api:
build: ./api
command: ["node", "src/server.js"]
environment:
NODE_ENV: production
REDIS_URL: redis://redis:6379
env_file:
- ./api/.env # DB credentials, API keys, etc.
Ports, Volumes, and Restart Policies
Map host ports with ports, persist data with volumes, and keep services alive with restart. For volumes, you can use named volumes (managed by Docker) or bind mounts (host paths).
services:
postgres:
image: postgres:16-alpine
ports:
- "5432:5432" # host:container
volumes:
- pg-data:/var/lib/postgresql/data # named volume
- ./init.sql:/docker-entrypoint-initdb.d/init.sql # bind mount
restart: unless-stopped # survives daemon restarts
volumes:
pg-data: # declared at top level
Healthchecks and depends_on with Conditions
A running container isn't necessarily a ready container. Postgres might accept TCP connections before it's actually ready to serve queries. Healthchecks solve this. Combine them with depends_on conditions to enforce proper startup ordering.
services:
postgres:
image: postgres:16-alpine
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 3s
retries: 5
start_period: 10s
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
api:
build: ./api
depends_on:
postgres:
condition: service_healthy # waits for healthcheck
redis:
condition: service_healthy
Without the condition: service_healthy clause, depends_on only waits for the container to start, not become ready. The three conditions are service_started (default), service_healthy, and service_completed_successfully (useful for init/migration containers).
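The third condition shines for one-shot migration containers. A sketch — service names, the migrate script, and images are illustrative — where the API only starts after a migration container has run and exited cleanly:

```yaml
services:
  migrate:
    build: ./api
    command: ["node", "scripts/migrate.js"]   # runs once, then exits
    depends_on:
      postgres:
        condition: service_healthy
  api:
    build: ./api
    depends_on:
      migrate:
        condition: service_completed_successfully  # requires exit code 0
```

If the migration fails (non-zero exit), the API never starts — far better than an app booting against a half-migrated schema.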
Networks and Volumes
By default, Compose creates a single network for all services. This is fine for simple projects but breaks the principle of least privilege. Defining custom networks lets you isolate traffic — your database shouldn't be reachable from the reverse proxy.
networks:
frontend:
driver: bridge
backend:
driver: bridge
volumes:
pg-data:
redis-data:
services:
nginx:
networks: [frontend] # only frontend
api:
networks: [frontend, backend] # bridges both
postgres:
networks: [backend] # only backend
redis:
networks: [backend] # only backend
Services on the same network can reach each other by service name as the hostname. In this setup, nginx can resolve api but cannot reach postgres or redis directly. The api service, connected to both networks, acts as the bridge.
The Complete Stack
Putting it all together — here's a production-ready Compose file for our Nginx + API + Postgres + Redis stack. This is the kind of file you'd actually commit to a project repository.
# compose.yaml — no "version" key needed
services:
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro
depends_on:
api:
condition: service_healthy
networks:
- frontend
restart: unless-stopped
api:
build:
context: ./api
dockerfile: Dockerfile
command: ["node", "src/server.js"]
env_file: ./api/.env
environment:
DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/myapp
REDIS_URL: redis://redis:6379
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 10s
timeout: 5s
retries: 3
start_period: 15s
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
networks:
- frontend
- backend
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: myapp
POSTGRES_USER: postgres
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 3s
retries: 5
start_period: 10s
volumes:
- pg-data:/var/lib/postgresql/data
- ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
networks:
- backend
restart: unless-stopped
redis:
image: redis:7-alpine
command: ["redis-server", "--appendonly", "yes"]
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 5
volumes:
- redis-data:/data
networks:
- backend
restart: unless-stopped
networks:
frontend:
driver: bridge
backend:
driver: bridge
volumes:
pg-data:
redis-data:
CLI Commands
Compose's CLI is how you interact with the stack defined in your YAML. These commands operate on the entire project (all services) by default, or you can target individual services by name.
| Command | What It Does | Common Flags |
|---|---|---|
| docker compose up -d | Create and start all services in detached mode | --build to rebuild images first |
| docker compose down | Stop and remove containers, networks | -v also removes named volumes |
| docker compose logs -f api | Tail logs for a specific service | --since 5m, --tail 100 |
| docker compose exec api sh | Run a command in a running container | -it for interactive terminal |
| docker compose ps | List containers and their status | -a includes stopped containers |
| docker compose build | Build or rebuild service images | --no-cache, --parallel |
| docker compose pull | Pull latest images for services | --ignore-pull-failures |
# Start the stack (rebuild if Dockerfiles changed)
docker compose up -d --build
# Check health status
docker compose ps
# Follow API logs
docker compose logs -f api
# Open a shell inside the API container
docker compose exec api sh
# Tear down everything including volumes
docker compose down -v
Variable Substitution with .env
Compose automatically reads a .env file in the project directory and makes those variables available for ${VAR} interpolation inside the Compose file. This is the standard way to handle per-environment configuration without modifying the YAML.
# .env (in the same directory as compose.yaml)
POSTGRES_PASSWORD=supersecret
COMPOSE_PROJECT_NAME=myapp
API_IMAGE_TAG=1.4.2
# compose.yaml — variables resolved from .env
services:
api:
image: myapp/api:${API_IMAGE_TAG}
environment:
DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/myapp
postgres:
environment:
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?error: set POSTGRES_PASSWORD}
The ${VAR:?error message} syntax makes Compose exit with an error if the variable is unset — a safety net for required secrets. You can also use ${VAR:-default} to provide a fallback value.
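These expansions come straight from POSIX shell parameter expansion, so you can experiment with them outside Compose. A quick sketch (the myapp/api image name is illustrative):

```shell
# ${VAR:-default}: fall back when the variable is unset or empty
unset API_IMAGE_TAG
echo "with default: myapp/api:${API_IMAGE_TAG:-latest}"

API_IMAGE_TAG=1.4.2
echo "when set:     myapp/api:${API_IMAGE_TAG:-latest}"

# ${VAR:?message}: abort with an error when the variable is unset
unset POSTGRES_PASSWORD
( : "${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD}" ) 2>/dev/null \
  || echo "aborted as expected"
```

Compose interpolates with the same semantics before the YAML is parsed, which is why an unset required variable stops the whole stack from starting.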
Profiles
Not every service should start every time. Debug tools, migration runners, or monitoring stacks are only needed sometimes. Profiles let you tag services and only start them when explicitly requested.
services:
api:
build: ./api # no profiles → always starts
postgres:
image: postgres:16-alpine # no profiles → always starts
pgadmin:
image: dpage/pgadmin4
profiles: ["debug"] # only starts with --profile debug
ports:
- "5050:80"
migrate:
build: ./api
command: ["npm", "run", "migrate"]
profiles: ["tools"] # only starts with --profile tools
depends_on:
postgres:
condition: service_healthy
# Normal startup — only api and postgres
docker compose up -d
# Include the debug tools
docker compose --profile debug up -d
# Run migrations then exit
docker compose --profile tools run --rm migrate
Override Files and extends
Compose merges multiple files together. The convention is a base compose.yaml and an override compose.override.yaml that Compose loads automatically. This lets you keep production settings in the base file and layer development-specific changes (bind mounts, debug ports, different commands) on top.
# compose.override.yaml — auto-loaded in development
services:
api:
build:
target: development # multi-stage: use dev stage
command: ["npm", "run", "dev"]
volumes:
- ./api/src:/app/src:cached # bind mount for hot reload
ports:
- "9229:9229" # Node.js debugger port
environment:
NODE_ENV: development
For explicit multi-file setups (e.g., staging vs production), use the -f flag:
# Production: base only (skip auto-loaded override)
docker compose -f compose.yaml -f compose.prod.yaml up -d
# Staging: base + staging overrides
docker compose -f compose.yaml -f compose.staging.yaml up -d
The extends keyword lets a service inherit configuration from another service, either in the same file or a different file. This reduces duplication when multiple services share a common base configuration.
# common.yaml — shared service definitions
services:
node-base:
build:
context: .
args:
NODE_ENV: production
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
# compose.yaml — extend the base
services:
api:
extends:
file: common.yaml
service: node-base
command: ["node", "src/api.js"]
ports:
- "3000:3000"
worker:
extends:
file: common.yaml
service: node-base
command: ["node", "src/worker.js"]
Compose Watch for Development
The watch feature (introduced in Compose 2.22) monitors your source files and automatically syncs changes to containers or triggers rebuilds. It's a Compose-native alternative to bind mounts + nodemon-style watchers, and it handles edge cases like dependency changes more cleanly.
services:
api:
build: ./api
develop:
watch:
# Sync source files → fast, no restart needed
- action: sync
path: ./api/src
target: /app/src
# Rebuild when dependencies change
- action: rebuild
path: ./api/package.json
# Sync + restart when config changes
- action: sync+restart
path: ./api/config
target: /app/config
# Start services with file watching
docker compose watch
# Or pass the flag to up directly
docker compose up --watch
The three watch actions serve different purposes: sync copies changed files into the container without restarting (ideal for hot-reload runtimes), rebuild triggers a full image rebuild and container replacement (for dependency changes), and sync+restart copies files then restarts the container (for config files that are read once at startup).
Unlike bind mounts, watch works consistently across macOS, Linux, and Windows — no filesystem notification issues. It also lets you define different actions per path pattern. Use bind mounts during quick prototyping; switch to watch for team-shared dev environments.
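One refinement worth knowing: each watch entry accepts an ignore list, so noisy paths don't trigger syncs or rebuilds. A sketch (paths illustrative):

```yaml
# Sketch: sync source files but ignore ones that never need to reach the container
services:
  api:
    build: ./api
    develop:
      watch:
        - action: sync
          path: ./api/src
          target: /app/src
          ignore:
            - "**/*.test.js"   # tests run on the host, not in the container
            - "**/.DS_Store"
```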
Putting It Into Practice
Here's a complete project structure that ties together everything from this section — override files, env files, profiles, and the full service stack:
myapp/
├── compose.yaml # base: all services, networks, volumes
├── compose.override.yaml # dev: bind mounts, debug ports, watch
├── compose.prod.yaml # prod: replicas, resource limits
├── .env # POSTGRES_PASSWORD, API_IMAGE_TAG
├── .env.example # template committed to git
├── api/
│ ├── Dockerfile
│ ├── .env # service-level env (loaded via env_file)
│ └── src/
├── nginx/
│ └── default.conf
└── db/
└── init.sql
Add .env and api/.env to your .gitignore. Commit a .env.example with placeholder values instead. Secrets in version control are a security incident waiting to happen.
Docker in CI/CD Pipelines
Docker transforms CI/CD from a "works on my machine" problem into a reproducible, portable process. But the way you use Docker in a pipeline — as a build tool vs. as a deliverable artifact — fundamentally shapes your workflow's architecture, speed, and security posture.
flowchart LR
A["Git Push"] --> B["CI Trigger"]
B --> C["Build Image"]
C --> D["Run Tests"]
D --> E["Security Scan"]
E --> F{"Pass?"}
F -->|Yes| G["Push to Registry"]
F -->|No| H["Fail & Notify"]
G --> I["Deploy Staging"]
I --> J["Smoke Tests"]
J --> K{"Pass?"}
K -->|Yes| L["Deploy Production"]
K -->|No| H
Docker as Build Tool vs. Docker as Artifact
There are two distinct roles Docker plays in CI/CD. As a build tool, Docker provides a consistent environment for compiling code, running linters, and executing tests — the image is thrown away after the job. As an artifact, Docker produces the deployable image itself — the pipeline's output is a tagged, versioned container image pushed to a registry.
Most mature pipelines use Docker in both roles simultaneously: a CI-specific image runs the build steps, and the build's output is a production-ready Docker image. Understanding this distinction helps you design cleaner pipelines.
| Role | Purpose | Image Lifecycle | Example |
|---|---|---|---|
| Build Tool | Consistent build/test environment | Ephemeral, discarded after job | Running npm test inside a Node container |
| Artifact | Deployable application package | Tagged, pushed to registry, deployed | Multi-stage build producing a production image |
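A minimal sketch of one GitHub Actions job using Docker in both roles (the image and app names are illustrative):

```yaml
# Sketch: Docker as build tool (ephemeral) and as artifact (kept) in one job
steps:
  - uses: actions/checkout@v4
  # Build tool: a throwaway Node container runs the test suite, then vanishes
  - name: Test in ephemeral container
    run: docker run --rm -v "$PWD:/src" -w /src node:20-alpine npm test
  # Artifact: the pipeline's real output, an image tagged with the commit SHA
  - name: Build deployable image
    run: docker build -t myapp:${{ github.sha }} .
```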
Docker-in-Docker (DinD) vs. Socket Mounting
When your CI runner itself is a container, you need a way to build Docker images from inside it. The two approaches — Docker-in-Docker and socket mounting — have very different trade-offs in terms of isolation, performance, and security.
Docker-in-Docker (DinD) runs a full Docker daemon inside your CI container. Each job gets a completely isolated Docker environment with its own image cache. This provides strong isolation between jobs but sacrifices layer caching across runs, since the inner daemon's storage is ephemeral.
Socket mounting shares the host's Docker daemon by bind-mounting /var/run/docker.sock into the CI container. Jobs share the host's layer cache, making builds dramatically faster. The downside: any container can interact with the host daemon, which is a security risk in multi-tenant environments.
# GitLab CI with Docker-in-Docker
build:
image: docker:24
services:
- docker:24-dind
variables:
DOCKER_TLS_CERTDIR: "/certs"
script:
- docker build -t myapp:$CI_COMMIT_SHA .
# Docker Compose for self-hosted CI runner
services:
ci-runner:
image: myorg/ci-runner:latest
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./workspace:/workspace
working_dir: /workspace
Mounting the Docker socket gives the container root-equivalent access to the host. Never do this with untrusted code or in multi-tenant CI environments. For those cases, use DinD with TLS or rootless alternatives like Kaniko or Buildah.
GitHub Actions: Building and Pushing Images
GitHub Actions has first-class Docker support through the docker/build-push-action. The workflow below is a production-grade template: it builds the image, caches layers in GitHub's cache backend, and pushes to GitHub Container Registry on every push to main.
name: Build and Push
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
build:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/metadata-action@v5
id: meta
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=
type=ref,event=branch
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
- uses: docker/build-push-action@v5
with:
context: .
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
Layer Caching Strategies
Without caching, every CI run rebuilds every layer from scratch. The cache-from and cache-to options in BuildKit enable multiple caching backends. GitHub Actions' type=gha cache is the simplest, but for larger images, registry-based caching often performs better.
# Registry-based caching — survives cache eviction
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=registry,ref=ghcr.io/myorg/myapp:buildcache
cache-to: type=registry,ref=ghcr.io/myorg/myapp:buildcache,mode=max
Image Tagging Strategy for CI
How you tag images determines your ability to trace deployments back to source code, roll back safely, and manage releases. A good tagging strategy uses multiple tags per image — each serving a different purpose.
| Tag Type | Format | Purpose | Example |
|---|---|---|---|
| Git SHA | abc1234 | Immutable, maps to exact commit | myapp:a1b2c3d |
| Semver | v1.2.3 | Release versioning | myapp:1.2.3, myapp:1.2 |
| Branch | main, develop | Latest from a branch (mutable) | myapp:main |
| PR | pr-42 | Preview environments | myapp:pr-42 |
| latest | latest | Convenience only — never in prod | myapp:latest |
Always deploy by immutable tags (git SHA or digest). Mutable tags like latest or main can change underneath a running deployment, making rollbacks unreliable and auditing impossible.
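A sketch of how a CI step might derive that tag set from its metadata (the values are stand-ins for what your CI system provides via variables like GITHUB_SHA or CI_COMMIT_SHA):

```shell
# Illustrative CI metadata — real pipelines read these from the environment
GIT_SHA="a1b2c3d4e5f6"
VERSION="1.2.3"

SHORT_SHA=$(printf '%s' "$GIT_SHA" | cut -c1-7)   # immutable, commit-addressable
MAJOR_MINOR="${VERSION%.*}"                       # floating minor-release tag

echo "deploy by: myapp:${SHORT_SHA}"
echo "release:   myapp:${VERSION} and myapp:${MAJOR_MINOR}"
```

The short SHA is the tag you deploy and roll back by; the semver tags exist for humans and release tooling.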
Running Tests in CI with Docker
Docker gives you two patterns for running tests in CI. You can run tests during the image build (in a test stage of a multi-stage Dockerfile), or run tests against a built image by starting the container and executing a test suite from outside. The first is simpler; the second allows integration testing with real dependencies.
# Multi-stage Dockerfile with test stage
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci
FROM deps AS test
COPY . .
RUN npm run lint && npm run test:unit
FROM deps AS build
COPY . .
RUN npm run build
FROM node:20-alpine AS production
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
In CI, you target the test stage explicitly. If tests fail, the build fails and no image gets pushed:
# GitHub Actions — run tests then build production image
steps:
- name: Run tests inside Docker
run: docker build --target test -t myapp:test .
- name: Build production image
run: docker build --target production -t myapp:${{ github.sha }} .
Automated Integration Testing with Compose
For integration tests that need databases, caches, or message brokers, Docker Compose spins up the full dependency graph. Define a dedicated docker-compose.test.yml that pairs a test-runner service with the real dependencies it needs.
# docker-compose.test.yml
services:
app:
build:
context: .
target: test
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_started
environment:
DATABASE_URL: postgres://test:test@postgres:5432/testdb
REDIS_URL: redis://redis:6379
command: npm run test:integration
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: test
POSTGRES_PASSWORD: test
POSTGRES_DB: testdb
healthcheck:
test: ["CMD-SHELL", "pg_isready -U test"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7-alpine
# Run integration tests in CI
docker compose -f docker-compose.test.yml up \
--build --abort-on-container-exit --exit-code-from app
Multi-Platform Builds in CI
If you deploy to both amd64 servers and arm64 instances (AWS Graviton, Apple Silicon dev machines), you need multi-platform images. BuildKit with QEMU emulation handles this natively in GitHub Actions.
steps:
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: ghcr.io/myorg/myapp:${{ github.sha }}
QEMU emulation makes cross-platform builds 3-10x slower. For faster builds, use native runners for each architecture and merge manifests with docker buildx imagetools create. GitHub's ubuntu-24.04-arm runners make this practical.
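Here's a sketch of that native-runner pattern, using the ubuntu-24.04-arm label mentioned above (registry login steps omitted for brevity; image names are illustrative):

```yaml
# Sketch: one native build per architecture, then a manifest merge job
jobs:
  build-amd64:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: ghcr.io/myorg/myapp:${{ github.sha }}-amd64
  build-arm64:
    runs-on: ubuntu-24.04-arm   # native arm64 runner, no QEMU
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/arm64
          push: true
          tags: ghcr.io/myorg/myapp:${{ github.sha }}-arm64
  merge:
    needs: [build-amd64, build-arm64]
    runs-on: ubuntu-latest
    steps:
      - name: Stitch per-arch images into one multi-platform tag
        run: |
          docker buildx imagetools create \
            -t ghcr.io/myorg/myapp:${{ github.sha }} \
            ghcr.io/myorg/myapp:${{ github.sha }}-amd64 \
            ghcr.io/myorg/myapp:${{ github.sha }}-arm64
```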
Security Scanning
Every image you push to a registry should pass a vulnerability scan. The three major tools each have different strengths: Trivy is fast and open-source, Docker Scout integrates directly with Docker Hub, and Snyk offers deep dependency analysis with fix recommendations.
# GitHub Actions — Trivy scan
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: myapp:${{ github.sha }}
format: table
exit-code: 1 # Fail pipeline on vulnerabilities
severity: CRITICAL,HIGH
ignore-unfixed: true
# GitHub Actions — Docker Scout
- name: Docker Scout CVE scan
uses: docker/scout-action@v1
with:
command: cves
image: myapp:${{ github.sha }}
only-severities: critical,high
exit-code: true # Fail on policy violation
# GitHub Actions — Snyk Container
- name: Snyk container scan
uses: snyk/actions/docker@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
with:
image: myapp:${{ github.sha }}
args: --severity-threshold=high
Deployment Strategies
Once your image is built, tested, and scanned, the deployment strategy determines how you roll it out without downtime. Each strategy trades off between speed, safety, and infrastructure complexity.
Blue-Green Deployment
You maintain two identical environments — blue (current) and green (new). Deploy the new image to the green environment, verify it, then switch the load balancer to point to green. Rollback is instant: switch back to blue.
#!/bin/bash
# Blue-green deploy with Docker Compose
NEW_VERSION=$1
# Compose may print one JSON object per line (v2.21+) or a single array (older)
CURRENT=$(docker compose ps --format json \
  | jq -r 'if type == "array" then .[].Name else .Name end' \
  | grep -o -m1 'blue\|green')
TARGET=$( [ "$CURRENT" = "blue" ] && echo "green" || echo "blue" )
# Deploy new version to inactive environment
IMAGE_TAG="$NEW_VERSION" docker compose up -d "app-${TARGET}"
# Health check the new environment (assumes app-${TARGET} is resolvable from
# this script, e.g. it runs on the same Docker network as the app containers)
until curl -sf "http://app-${TARGET}:3000/health"; do sleep 2; done
# Switch traffic (update nginx upstream)
sed -i "s/app-${CURRENT}/app-${TARGET}/" /etc/nginx/conf.d/upstream.conf
nginx -s reload
echo "Switched traffic from $CURRENT to $TARGET (v$NEW_VERSION)"
Rolling Deployment
Replace instances one at a time. This is the default in Docker Swarm and Kubernetes. You trade deployment speed for zero-downtime guarantees — at any point during the rollout, both old and new versions serve traffic.
# Docker Swarm rolling update configuration
services:
web:
image: ghcr.io/myorg/myapp:${VERSION}
deploy:
replicas: 4
update_config:
parallelism: 1 # Update one container at a time
delay: 30s # Wait 30s between updates
failure_action: rollback
order: start-first # Start new before stopping old
rollback_config:
parallelism: 0 # Roll back all at once
order: start-first
Canary Deployment
Route a small percentage of traffic (e.g., 5%) to the new version. Monitor error rates and latency. If metrics look good, gradually increase traffic until the canary takes 100%. This catches issues that only surface under real production load.
# GitHub Actions — Canary deploy step
deploy-canary:
needs: [build, test, scan]
runs-on: ubuntu-latest
steps:
- name: Deploy canary (5% traffic)
run: |
kubectl set image deployment/myapp-canary \
app=ghcr.io/myorg/myapp:${{ github.sha }}
kubectl scale deployment/myapp-canary --replicas=1
- name: Monitor canary (5 minutes)
run: |
for i in $(seq 1 30); do
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_errors_total[1m])" \
| jq '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "Error rate too high: $ERROR_RATE — rolling back"
kubectl rollout undo deployment/myapp-canary
exit 1
fi
sleep 10
done
- name: Promote canary to full rollout
run: |
kubectl set image deployment/myapp \
app=ghcr.io/myorg/myapp:${{ github.sha }}
kubectl rollout status deployment/myapp
| Strategy | Downtime | Rollback Speed | Resource Overhead | Best For |
|---|---|---|---|---|
| Blue-Green | Zero | Instant (switch back) | 2x infrastructure | Critical apps, fast rollback needed |
| Rolling | Zero | Minutes (re-roll) | Minimal (+1 instance) | Stateless services, most common |
| Canary | Zero | Fast (scale down canary) | Minimal (+1 instance) | High-traffic apps, risk-sensitive releases |
Debugging and Troubleshooting Docker Containers
When a container misbehaves — crashing immediately, returning errors, or performing differently than your local environment — you need a systematic approach rather than guesswork. Docker exposes a rich set of introspection commands that let you trace the problem from the outside in: logs first, then state inspection, then interactive debugging.
This section builds your diagnostic toolkit from the essential commands you’ll use daily to advanced techniques for the toughest problems.
flowchart TD
A["Container Problem"] --> B{"Is the container running?"}
B -->|"No - Exited"| C["docker logs CONTAINER"]
B -->|"Yes - Misbehaving"| H["docker exec -it CONTAINER sh"]
C --> D{"Check exit code"}
D -->|"Exit 0"| E["Process completed normally.\nCheck CMD / entrypoint logic."]
D -->|"Exit 1"| F["Application error.\nRead logs for stack trace."]
D -->|"Exit 137"| G["OOM or SIGKILL.\nCheck memory limits."]
D -->|"Exit 127"| G2["Command not found.\nCheck PATH or binary exists."]
D -->|"Exit 126"| G3["Permission denied.\nCheck file permissions."]
H -->|"Shell works"| I["Inspect process, network,\nfiles from inside."]
H -->|"exec fails"| J{"Minimal image?\n(distroless / scratch)"}
J -->|"Yes"| K["docker debug CONTAINER\nor nsenter via PID"]
J -->|"No"| L["Container may be paused\nor in a crash loop.\nCheck docker inspect."]
Essential Diagnostic Commands
These five commands form the core of your debugging workflow. You’ll reach for them in roughly this order: logs tell you what went wrong, inspect tells you why the container is configured that way, exec lets you poke around inside, events shows you when things happened, and diff reveals what changed on the filesystem.
docker logs — Your First Stop
Always start here. docker logs captures everything your container writes to stdout and stderr. The --tail and --since flags prevent you from drowning in output on long-running containers.
# Last 50 lines
docker logs --tail 50 my-api
# Logs from the past 5 minutes
docker logs --since 5m my-api
# Follow logs in real time (like tail -f)
docker logs -f --tail 20 my-api
# Show timestamps alongside each log line
docker logs -t --since 2024-01-15T10:00:00 my-api
docker inspect — Deep State Examination
When logs aren’t enough, docker inspect dumps the full JSON configuration and runtime state of a container. Use Go template syntax with --format to extract exactly what you need.
# Why did this container stop?
docker inspect --format='{{.State.ExitCode}}' my-api
docker inspect --format='{{.State.OOMKilled}}' my-api
docker inspect --format='{{.State.Error}}' my-api
# What IP address was assigned?
docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-api
# What mounts and volumes are attached?
docker inspect --format='{{json .Mounts}}' my-api | python3 -m json.tool
# Full health check status
docker inspect --format='{{json .State.Health}}' my-api | python3 -m json.tool
docker exec — Interactive Exploration
Once a container is running, exec lets you open a shell inside it. This is your primary tool for checking process state, network connectivity, filesystem contents, and environment variables from the container’s perspective.
# Open an interactive shell
docker exec -it my-api /bin/sh
# Run a single command without interactive mode
docker exec my-api cat /etc/resolv.conf
# Exec as root (even if Dockerfile uses USER)
docker exec -u 0 my-api id
# Check what processes are running inside
docker exec my-api ps aux
# Test network connectivity from inside the container
docker exec my-api ping -c 3 database
docker exec my-api nslookup database
docker events & docker diff
docker events gives you a real-time stream of Docker daemon events — container starts, stops, OOM events, network connections, and volume mounts. docker diff shows filesystem changes (added, changed, or deleted files) relative to the image.
# Watch all events in real time (run in a separate terminal)
docker events
# Filter to only OOM events and container deaths
docker events --filter event=oom --filter event=die
# See what files changed inside a container
docker diff my-api
# Output: C /tmp (Changed), A /tmp/cache.db (Added), D /app/old.log (Deleted)
Exit Code Reference
The exit code is the single most informative piece of data from a crashed container. It immediately narrows down the category of failure. Run docker inspect --format='{{.State.ExitCode}}' CONTAINER to retrieve it.
| Exit Code | Meaning | Common Cause | What To Do |
|---|---|---|---|
| 0 | Success (normal exit) | Process finished its work and exited | Check if your CMD is a one-shot command instead of a long-running server |
| 1 | General application error | Unhandled exception, missing config, bad input | Read docker logs — the stack trace will be there |
| 126 | Command not executable | Binary exists but lacks execute permissions | Check chmod +x on your entrypoint script |
| 127 | Command not found | Typo in CMD, binary not in PATH, missing from image | Verify the binary exists: docker run --rm IMAGE which myapp |
| 137 | SIGKILL (128 + 9) | OOM, docker stop after timeout, external signal | Check OOMKilled with inspect; raise memory limits if needed |
| 139 | SIGSEGV (128 + 11) | Segmentation fault in the application | Debug the native binary; check for architecture mismatch (amd64 vs arm64) |
| 143 | SIGTERM (128 + 15) | Graceful shutdown via docker stop | Normal — but if unexpected, check if something is restarting your container |
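The 128 + signal-number convention is plain POSIX behavior, not something Docker invented; a shell that kills itself demonstrates the rule without any containers involved:

```shell
# Run a command line under sh and report its exit status.
# The if-guard keeps a failing command from aborting the script.
get_status() {
  if sh -c "$1" 2>/dev/null; then echo 0; else echo $?; fi
}

echo "SIGKILL -> $(get_status 'kill -9 $$')"                    # 128 + 9  = 137
echo "SIGTERM -> $(get_status 'kill -15 $$')"                   # 128 + 15 = 143
echo "missing -> $(get_status 'definitely-not-a-real-binary')"  # command not found = 127
```

When a container exits with one of these codes, you're seeing the same mechanism: the kernel (or Docker) delivered a fatal signal to PID 1.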
Common Failure Patterns
OOM Events (Exit 137)
When a container exceeds its memory limit, the Linux kernel’s OOM mechanism sends SIGKILL (signal 9), resulting in exit code 137. The container gets no chance to clean up. This is one of the most common production failures, especially for Java and Node.js apps with default heap settings.
# Confirm OOM
docker inspect --format='{{.State.OOMKilled}}' my-api
# Output: true
# Check current memory usage vs limit
docker stats --no-stream my-api
# Check kernel OOM logs on the host (Linux)
dmesg | grep -i oom
# Run with a higher memory limit
docker run -d --memory=512m --memory-swap=512m my-api
PID 1 Signal Handling
In a container, your process runs as PID 1. Unlike a normal process, PID 1 does not get default signal handlers from the kernel. If your app doesn’t explicitly handle SIGTERM, docker stop will wait for the 10-second grace period, then send SIGKILL — causing slow, ungraceful shutdowns every time.
# BAD: shell form wraps your app in /bin/sh — signals go to sh, not your app
CMD node server.js
# GOOD: exec form — your app IS PID 1 and receives signals directly
CMD ["node", "server.js"]
# ALSO GOOD: use tini as an init process to handle signal forwarding
ENTRYPOINT ["/tini", "--"]
CMD ["node", "server.js"]
If docker stop consistently takes exactly 10 seconds (the default grace period), your process isn’t handling SIGTERM. Switch to exec form in your CMD or add --init to your docker run command to inject tini automatically.
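What "handling SIGTERM" actually means can be sketched in plain shell, runnable without Docker. The trap stands in for your server's shutdown hook (closing sockets, flushing logs):

```shell
# A process that traps SIGTERM exits 0 cleanly instead of dying with 143
cat > /tmp/graceful.sh <<'EOF'
#!/bin/sh
trap 'echo "caught SIGTERM, cleaning up"; exit 0' TERM
echo "ready"
while :; do sleep 1; done
EOF
chmod +x /tmp/graceful.sh

/tmp/graceful.sh &
PID=$!
sleep 1                       # give the script time to install its trap
kill -TERM "$PID"             # what "docker stop" sends first
if wait "$PID"; then STATUS=0; else STATUS=$?; fi
echo "exit status: $STATUS"   # 0 = clean shutdown, not the 143 of an unhandled SIGTERM
```

Without the trap line, the same script would ignore nothing and simply be killed, which inside a container means docker stop waits out the full grace period before resorting to SIGKILL.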
Filesystem Permission Errors
A container that works with docker run but fails when you add -v /host/path:/app/data almost always has a permissions mismatch. The UID inside the container doesn’t match the owner of the host directory. This is especially common when a Dockerfile uses USER 1000 but the host directory is owned by root.
# Check what user the container runs as
docker exec my-api id
# Output: uid=1000(appuser) gid=1000(appuser)
# Check ownership of the mounted volume from inside
docker exec my-api ls -la /app/data
# Output: drwxr-xr-x 2 root root 4096 ... /app/data <-- mismatch!
# Fix: match the host directory ownership to the container UID
sudo chown -R 1000:1000 /host/path
# Or run the container with a matching user
docker run -v /host/path:/app/data --user "$(id -u):$(id -g)" my-api
DNS Failures and Network Issues
Containers on a user-defined bridge network can resolve each other by container name. But containers on the default bridge network cannot — they only get access by IP. If ping database fails inside your app container, check which network both containers are on.
# Check which network a container is on
docker inspect --format='{{json .NetworkSettings.Networks}}' my-api | python3 -m json.tool
# Test DNS resolution from inside the container
docker exec my-api nslookup database
docker exec my-api cat /etc/resolv.conf
# Check if both containers share a network
docker network inspect my-network --format='{{range .Containers}}{{.Name}} {{end}}'
# Quick connectivity test
docker exec my-api nc -zv database 5432
Port Conflicts
If docker run -p 8080:80 fails with “port is already allocated,” something on the host is already bound to that port. Find the offending process and either stop it or choose a different host port.
# Find what is using port 8080 on the host
lsof -i :8080
# or on Linux:
ss -tlnp | grep 8080
# Use a different host port (host:container)
docker run -d -p 9090:80 my-web
# Let Docker assign a random available port
docker run -d -p 80 my-web
docker port my-web
Advanced Debugging Techniques
When docker exec isn’t enough — either because the container has no shell (distroless/scratch images), has already crashed, or you need kernel-level visibility — these tools go deeper.
docker debug (Docker Desktop)
Docker Desktop includes docker debug, which attaches a fully-equipped debugging shell to any container, even ones built from scratch or distroless images. It injects debugging tools without modifying the container image.
# Attach a debug shell to a running container (even distroless)
docker debug my-api
# The same command also works against a stopped or exited container
# You get vim, nano, curl, htop, strace, and more — all injected at runtime
nsenter — Enter Container Namespaces from the Host
nsenter lets you enter one or more Linux namespaces of a running container directly from the host. This is the escape hatch for production servers where Docker Desktop isn’t available and the container image has no shell.
# Get the container's PID on the host
PID=$(docker inspect --format='{{.State.Pid}}' my-api)
# Enter all namespaces of the container with a host shell
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh
# Enter only the network namespace (useful for network debugging)
sudo nsenter -t $PID -n -- ip addr
sudo nsenter -t $PID -n -- ss -tlnp
Network Debugging with nicolaka/netshoot
The netshoot image is a Swiss Army knife for container network debugging. It includes curl, dig, nmap, tcpdump, iperf, and dozens more networking tools. Attach it to any container’s network namespace to diagnose connectivity problems without modifying the target container.
# Share the network namespace of a running container
docker run -it --rm --network container:my-api nicolaka/netshoot
# Now you have the full toolkit inside the same network namespace:
# tcpdump -i eth0 port 80
# dig database
# curl -v http://localhost:3000/healthz
# iftop
# Attach to a Docker Compose network to debug service discovery
docker run -it --rm --network myapp_default nicolaka/netshoot
# dig my-api
# curl http://my-api:3000/healthz
When logs show nothing and the app just silently hangs, strace reveals every system call the process makes. Run docker exec my-api strace -p 1 -f -e trace=network to trace all network-related syscalls of PID 1 (tracing inside a container requires the SYS_PTRACE capability, e.g. docker run --cap-add SYS_PTRACE). If strace isn't installed in the image, use nsenter or docker debug to get access.
Systematic Debugging Workflow
When you hit an unfamiliar container problem, resist the urge to randomly try things. Follow this checklist top-to-bottom — each step either solves the problem or gives you information for the next step.
1. **Check container status and exit code**
   Get the big picture first. Is the container running, exited, or restarting? The exit code immediately categorizes the failure.
   ```bash
   docker ps -a --filter name=my-api
   docker inspect --format='Exit:{{.State.ExitCode}} OOM:{{.State.OOMKilled}} Error:{{.State.Error}}' my-api
   ```
2. **Read the logs**
   For exited containers, the answer is almost always in the logs. Look for stack traces, "permission denied", "address already in use", or "connection refused" messages.
   ```bash
   docker logs --tail 100 my-api 2>&1 | head -50
   ```
3. **Inspect configuration**
   Compare the running configuration against what you expected. Check environment variables, mounts, networks, and port bindings.
   ```bash
   docker inspect my-api | python3 -m json.tool | less
   ```
4. **Get a shell inside**
   If the container is running, exec in and poke around. Check processes, files, network, and DNS. If the container has no shell, use docker debug or nsenter.
   ```bash
   docker exec -it my-api /bin/sh
   # Inside: ps aux, env, cat /etc/hosts, nslookup database, ls -la /app
   ```
5. **Reproduce with a fresh container**
   If you've changed things inside the container while debugging, start fresh. Override the entrypoint with sh to keep the container alive while you test interactively.
   ```bash
   # Override entrypoint to keep container alive for debugging
   docker run -it --rm --entrypoint /bin/sh my-api-image
   # Now manually run the entrypoint commands one by one to find the failure
   ```
When code works locally but fails in a container, the culprit is almost always one of three things: a missing environment variable, a filesystem path that doesn’t exist in the image, or the container running as a different user (non-root) than your local dev environment. Always check env, pwd, and id inside the container first.
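Exit codes above 128 encode a fatal signal (128 + signal number), which makes exit-code triage scriptable. A minimal sketch; the helper name `explain_exit` is illustrative, not a Docker command:

```shell
# Map a container exit code (from docker inspect .State.ExitCode)
# to a likely cause. Codes above 128 mean "killed by signal N-128".
explain_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error - check the logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found - check ENTRYPOINT/CMD paths" ;;
    137) echo "SIGKILL (128+9) - OOM killer or docker kill" ;;
    139) echo "SIGSEGV (128+11) - the process crashed" ;;
    143) echo "SIGTERM (128+15) - normal docker stop" ;;
    *)   echo "unrecognized - run docker inspect for details" ;;
  esac
}

explain_exit 137   # SIGKILL (128+9) - OOM killer or docker kill
```

Pair it with `docker inspect --format '{{.State.ExitCode}}' my-api` to turn step 1 of the workflow into a one-liner.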
Docker Security: Attack Surface and Hardening
Docker containers share the host kernel. That single fact shapes the entire security model: every misconfiguration is a potential path from container to host. Unlike virtual machines, there is no hypervisor boundary — only Linux kernel features (namespaces, cgroups, capabilities, seccomp, LSMs) stand between a container process and full host access.
Effective Docker security is defense-in-depth. No single control is sufficient; you layer protections so that when one fails, others hold. The diagram below shows these layers from the outside in.
graph LR
A["🖥️ Host Hardening\npatched kernel, minimal OS"] --> B["🐳 Docker Daemon\nsocket access, TLS, rootless"]
B --> C["📦 Image Security\ntrusted base, pinned digests, scanned"]
C --> D["🔨 Build Security\nno secrets baked in, .dockerignore"]
D --> E["🔒 Runtime Hardening\nnon-root, dropped caps, read-only"]
E --> F["🌐 Application\nleast privilege, no shell"]
The Docker Socket Is Root Access
The Docker daemon, reachable through /var/run/docker.sock, runs as root. Any process that can talk to this socket can spin up a privileged container, mount the host filesystem, and effectively become root on the host. This is not a bug — it is how Docker works.
Mounting the Docker socket into a container (a common pattern for CI runners, monitoring tools, and “Docker-in-Docker” setups) grants that container full root access to the host. Treat socket access as equivalent to adding someone to the sudoers file.
# This gives the container FULL root access to the host
docker run -v /var/run/docker.sock:/var/run/docker.sock some-image
# From inside that container, an attacker can:
docker run -it --privileged --pid=host -v /:/hostfs alpine chroot /hostfs
Never mount the Docker socket into application containers. If you must (e.g., CI/CD), use a scoped proxy like docker-socket-proxy to limit which API endpoints are accessible.
Image Security
Your container is only as secure as the image it runs. A base image with known CVEs, an unpinned tag that gets overwritten, or a malicious image from a public registry can compromise your workload before a single line of your code executes.
Use Trusted, Minimal Base Images
Start from images that have a small attack surface. Fewer packages means fewer CVEs. The table below compares common options.
| Base Image | Size | Packages | Best For |
|---|---|---|---|
| scratch | 0 MB | None | Statically compiled Go/Rust binaries |
| gcr.io/distroless/static | ~2 MB | CA certs, tzdata | Static binaries needing TLS |
| alpine:3.20 | ~7 MB | musl, busybox | General-purpose minimal base |
| ubuntu:24.04 | ~78 MB | apt, glibc, many | When you need glibc or apt packages |
| node:22-slim | ~200 MB | Node.js + Debian minimal | Node.js apps needing native modules |
Pin Image Digests, Not Just Tags
Tags are mutable. The alpine:3.20 you pulled last week might not be the same alpine:3.20 you pull today — the tag can be re-pushed. Pin by SHA256 digest to guarantee reproducibility.
# ❌ Mutable — can change underneath you
FROM alpine:3.20
# ✅ Immutable — this exact image, forever
FROM alpine:3.20@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c
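One way to enforce pinning in CI is a check that every image reference carries a digest. A sketch; the function name `is_digest_pinned` is made up for illustration:

```shell
# Reject image references that are not pinned to an immutable sha256 digest
# (the digest is always 64 lowercase hex characters).
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

is_digest_pinned "alpine:3.20@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c" \
  && echo "pinned"
is_digest_pinned "alpine:3.20" || echo "not pinned - mutable tag"
```

To find the digest for a tag you already trust, `docker buildx imagetools inspect alpine:3.20` prints it without pulling the image.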
Scan Images for Vulnerabilities
Integrate vulnerability scanning into your CI pipeline. Three popular tools serve this purpose — pick the one that fits your workflow.
# Scan an image — fail CI on HIGH or CRITICAL CVEs
trivy image --severity HIGH,CRITICAL --exit-code 1 myapp:latest
# Scan a Dockerfile for misconfigurations
trivy config --severity HIGH,CRITICAL Dockerfile
Trivy (by Aqua Security) is open-source, fast, and scans images, filesystems, IaC configs, and SBOMs. It is the most widely adopted OSS scanner.
# Built into Docker Desktop and the docker CLI
docker scout cves myapp:latest
# Get fix recommendations
docker scout recommendations myapp:latest
Docker Scout integrates directly into Docker Desktop and Docker Hub. It provides policy-based evaluation and remediation advice built into the Docker workflow.
# Scan an image
grype myapp:latest
# Pair with Syft for SBOM generation + scanning
syft myapp:latest -o spdx-json | grype
Grype (by Anchore) pairs with Syft for SBOM generation. Together they give you a complete supply-chain view — generate an SBOM once, scan it repeatedly as new CVEs are published.
Build-Time Security
The image build process is where secrets most commonly leak. Every RUN, COPY, and ADD instruction creates a layer that is permanently stored in the image. Anyone who can docker pull your image can inspect every layer and extract anything you put in it — including credentials you thought you deleted in a later layer.
Never Bake Secrets Into Images
BuildKit’s --mount=type=secret lets you inject secrets at build time without them ever appearing in a layer. The secret is mounted into the build container’s filesystem and vanishes when the RUN instruction completes.
# syntax=docker/dockerfile:1
FROM node:22-slim
WORKDIR /app
COPY package*.json ./
# ✅ Secret is mounted at /root/.npmrc for this RUN instruction only — never stored in any layer
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
npm ci --production
COPY . .
USER node
CMD ["node", "server.js"]
# Pass the secret at build time
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .
Use .dockerignore Religiously
Without a .dockerignore, COPY . . sends your entire build context to the daemon — including .env files, .git history, SSH keys, and anything else in the directory. A proper .dockerignore is your first line of defense against accidental secret leakage.
# .dockerignore
.git
.env
.env.*
*.pem
*.key
node_modules
docker-compose*.yml
README.md
Runtime Hardening
Even a perfectly built image can be dangerous at runtime if you run it with Docker’s permissive defaults. The default Docker container runs as root (UID 0), with a set of Linux capabilities, a writable filesystem, and the ability to escalate privileges. Each of these defaults should be locked down.
Run as Non-Root
If your process runs as root inside the container and an attacker escapes the container’s namespace isolation, they are root on the host. Setting a non-root USER in your Dockerfile is the single most impactful hardening step.
FROM python:3.12-slim
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /usr/sbin/nologin appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
RUN pip install --no-cache-dir -r requirements.txt
USER appuser
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:create_app()"]
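You can gate CI on this as well: an image built without a USER instruction reports an empty user. A sketch; the helper name is hypothetical, and its input is the output of `docker inspect --format '{{.Config.User}}' <image>`:

```shell
# Fail a build when an image's configured user is root or unset.
# $1 is the value of .Config.User from docker inspect (may be empty).
check_image_user() {
  case "$1" in
    ""|root|0|0:*) echo "refusing: image runs as root"; return 1 ;;
    *)             echo "ok: image runs as $1" ;;
  esac
}

check_image_user "appuser"   # ok: image runs as appuser
check_image_user "" || echo "blocked"
```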
Drop All Capabilities, Add Only What You Need
By default, Docker grants containers a subset of Linux capabilities (14 out of 41+). Most applications need none of them. Drop them all, then selectively add back only what your process requires.
# Drop everything, add back only NET_BIND_SERVICE (to bind port 80/443)
docker run \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--security-opt=no-new-privileges \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid \
myapp:latest
Here is what each flag does and why it matters:
| Flag | What It Does | Why It Matters |
|---|---|---|
| --cap-drop=ALL | Removes all Linux capabilities | Prevents mount, raw sockets, kernel module loading, etc. |
| --security-opt=no-new-privileges | Prevents child processes from gaining new privileges | Blocks setuid/setgid binaries from escalating |
| --read-only | Makes the root filesystem read-only | Prevents attackers from writing malware, modifying binaries |
| --tmpfs /tmp | Mounts a writable tmpfs at /tmp | Gives the app a scratch space without making / writable |
| --pids-limit=100 | Limits number of processes | Prevents fork bombs from bringing down the host |
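You can confirm the drop took effect from inside the container: `grep CapEff /proc/self/status` prints the effective capability set as a hex bitmask, and counting its set bits gives the number of capabilities held. A small sketch of the decoding:

```shell
# Count the capabilities granted in a CapEff bitmask from /proc/<pid>/status.
count_caps() {
  mask=$(printf '%d' "0x$1")   # hex mask -> decimal
  n=0
  while [ "$mask" -ne 0 ]; do
    n=$(( n + (mask & 1) ))    # count set bits
    mask=$(( mask >> 1 ))
  done
  echo "$n"
}

count_caps 00000000a80425fb   # Docker's default set: 14 capabilities
count_caps 0000000000000000   # after --cap-drop=ALL: 0
```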
Seccomp and AppArmor Profiles
Seccomp filters restrict which system calls a container can make. Docker ships with a default seccomp profile that blocks ~44 dangerous syscalls (like reboot, mount, kexec_load). You can create a custom profile for even tighter restrictions.
AppArmor (on Debian/Ubuntu) and SELinux (on RHEL/Fedora) provide mandatory access control. Docker applies a default AppArmor profile (docker-default) that restricts file access, mounting, and raw network operations.
# Use a custom seccomp profile (stricter than default)
docker run --security-opt seccomp=custom-profile.json myapp:latest
# Generate a profile by tracing syscalls with OCI tools
# Step 1: Run with no seccomp to trace what the app actually calls
docker run --security-opt seccomp=unconfined myapp:latest
# Step 2: Use tools like oci-seccomp-bpf-hook to build a whitelist
Running with --privileged disables seccomp, AppArmor, drops all capability restrictions, and gives the container nearly full access to the host. Never use --privileged in production. If you think you need it, you almost certainly need a specific --cap-add or --device instead.
Rootless Docker and User Namespace Remapping
Even with all the runtime flags above, the Docker daemon itself runs as root. Two approaches address this at the daemon level.
Rootless Docker
Rootless mode runs the entire Docker daemon and containers under a regular (non-root) user. If an attacker escapes the container, they land as an unprivileged user on the host — not root. This is the strongest isolation improvement you can make at the daemon level.
# Install rootless Docker
dockerd-rootless-setuptool.sh install
# Verify it is running rootless
docker info --format '{{.SecurityOptions}}'
# Should include "rootless"
# Set the socket path for your user
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
User Namespace Remapping
If you cannot use rootless mode, user namespace remapping (userns-remap) maps UID 0 inside the container to a high-numbered unprivileged UID on the host. The process thinks it is root, but the kernel knows it is uid 100000.
{
"userns-remap": "default"
}
With "userns-remap": "default", Docker creates a dockremap user and configures subordinate UID/GID ranges in /etc/subuid and /etc/subgid. Container root (UID 0) maps to a host UID like 100000, so even a container escape yields no real privileges.
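The remapping itself is plain offset arithmetic: with a subordinate range starting at 100000 (a typical /etc/subuid entry looks like dockremap:100000:65536), container UID u appears on the host as 100000 + u. A toy illustration; the base value varies per host:

```shell
# Toy model of userns-remap: host_uid = subuid_base + container_uid.
SUBUID_BASE=100000   # first field of the dockremap range in /etc/subuid

map_uid() { echo $(( SUBUID_BASE + $1 )); }

map_uid 0      # container root  -> host UID 100000 (unprivileged)
map_uid 1000   # container user  -> host UID 101000
```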
Container Escape CVEs: Why This All Matters
Container escapes are not theoretical. They have happened, and they will happen again. Understanding past escapes shows why defense-in-depth is essential — each vulnerability below would have been mitigated (or completely prevented) by one or more of the hardening techniques in this section.
| CVE | Year | What Happened | Mitigated By |
|---|---|---|---|
| CVE-2019-5736 | 2019 | Malicious container overwrites host runc binary via /proc/self/exe | User namespaces, read-only rootfs, patched runc |
| CVE-2020-15257 | 2020 | containerd-shim API accessible from containers sharing host network namespace | Never use --network=host in production |
| CVE-2022-0185 | 2022 | Linux kernel heap overflow in filesystem context, escapable from unprivileged user ns | Patched host kernel, seccomp (blocks unshare) |
| CVE-2024-21626 | 2024 | runc WORKDIR race condition leaks host file descriptors into container | Patched runc (>= 1.1.12), rootless mode |
The pattern is consistent: keep your host kernel, runc, and containerd patched, and layer runtime protections so that even a 0-day escape faces additional barriers (non-root user, dropped capabilities, user namespace remapping).
Production Hardening Checklist
Combine everything into a single hardened docker run invocation. This represents a production-grade baseline.
docker run -d \
--name myapp \
--user 1000:1000 \
--cap-drop=ALL \
--security-opt=no-new-privileges \
--security-opt seccomp=myapp-seccomp.json \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--pids-limit=100 \
--memory=512m \
--cpus=1 \
--restart=unless-stopped \
--health-cmd="curl -f http://localhost:8080/healthz || exit 1" \
--health-interval=30s \
myapp:1.2.3@sha256:abc123...
Performance Optimization: Faster Builds, Smaller Images, Efficient Runtime
Docker performance isn't one thing — it's three. A fast build that produces a 1.2 GB image is still a problem. A tiny image that takes 15 minutes to build is still a problem. And even a fast, small image can choke at runtime if you ignore memory limits and I/O patterns. This section attacks all three dimensions systematically.
Dimension 1: Build Speed
Docker builds layers sequentially from top to bottom. Every time a layer's input changes, that layer and every layer below it are rebuilt from scratch. The single most impactful thing you can do for build speed is order your instructions from least-frequently-changed to most-frequently-changed.
Instruction Ordering
Here's a common anti-pattern — copying all source files before installing dependencies:
# BAD: Any source file change invalidates the npm install cache
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
Every time you touch a single .js file, npm install re-runs — downloading the exact same dependencies. Fix it by copying the lockfile first:
# GOOD: npm install only re-runs when package files change
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production
COPY . .
RUN npm run build
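The cache rule behind this ordering is content-based: a COPY layer is reused only while the checksum of its inputs is unchanged. This toy simulation shows why touching a source file leaves the dependency layer's cache key intact:

```shell
# Simulate the cache key for the "COPY package*.json" layer: it is derived
# from the copied files' contents, so editing app code does not change it.
ctx=$(mktemp -d)
echo '{"name":"demo","version":"1.0.0"}' > "$ctx/package.json"
echo 'console.log("v1")' > "$ctx/app.js"

key_before=$(sha256sum "$ctx/package.json" | cut -d' ' -f1)
echo 'console.log("v2")' > "$ctx/app.js"   # change only source code
key_after=$(sha256sum "$ctx/package.json" | cut -d' ' -f1)

# Dependency layer inputs unchanged, so "RUN npm ci" would be served from cache
[ "$key_before" = "$key_after" ] && echo "npm ci layer: cache hit"
```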
BuildKit Cache Mounts
BuildKit (the default builder since Docker 23.0) supports cache mounts — persistent directories that survive between builds. This is transformative for package managers that maintain their own caches (apt, pip, go, npm).
# syntax=docker/dockerfile:1
FROM python:3.12-slim
# Apt cache survives between builds
RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt \
apt-get update && apt-get install -y gcc libpq-dev
# Pip cache survives between builds
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
Without cache mounts, every apt-get install re-downloads every .deb package. With them, only new or changed packages are fetched. On a project with heavy C dependencies, this can cut build time from 4 minutes to 30 seconds.
The .dockerignore File
Before Docker starts building, it sends your entire build context (the directory you pass to docker build) to the daemon. If that directory contains node_modules, .git, build artifacts, or large data files, you're wasting time and bandwidth — and possibly leaking secrets.
# .dockerignore
.git
node_modules
dist
*.md
.env
.vscode
coverage
__pycache__
*.pyc
Parallel Multi-Stage Builds
BuildKit automatically parallelizes independent stages. If your build has a frontend and a backend that don't depend on each other, declare them as separate stages and BuildKit will build them concurrently:
# These two stages build IN PARALLEL with BuildKit
FROM node:20-alpine AS frontend
WORKDIR /app/frontend
COPY frontend/package*.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build
FROM golang:1.22-alpine AS backend
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o server .
# Final stage: pulls artifacts from both
FROM alpine:3.19
COPY --from=frontend /app/frontend/dist /srv/static
COPY --from=backend /app/server /usr/local/bin/server
ENTRYPOINT ["server"]
Dimension 2: Image Size
Large images mean slower pulls, slower deploys, larger attack surface, and higher storage costs. The goal is to ship only what you need to run the application — no compilers, no package managers, no documentation, no leftover tarballs.
Choosing the Right Base Image
Your base image choice is the single biggest lever for image size. Here's how common bases compare:
| Base Image | Size | Use Case |
|---|---|---|
| ubuntu:24.04 | ~78 MB | When you need full glibc + apt ecosystem |
| debian:bookworm-slim | ~75 MB | Slimmed Debian — fewer pre-installed packages |
| alpine:3.19 | ~7 MB | Minimal with musl libc — great for Go, Rust, static binaries |
| gcr.io/distroless/static | ~2 MB | No shell, no package manager — just your binary |
| scratch | 0 MB | Literally empty — for fully static binaries only |
Alpine uses musl instead of glibc. Most Go and Rust programs work flawlessly. Some Python/Node C extensions may have compatibility issues — test before committing. If you hit musl problems, debian:bookworm-slim is a solid fallback.
Delete Temp Files in the Same Layer
Docker images are additive — each layer stores a diff. If you download a 200 MB tarball in one RUN step and delete it in the next, the tarball is still in the first layer. You pay the full cost. Always clean up in the same RUN instruction:
# BAD — 3 layers, tarball persists in layer 1
RUN curl -O https://example.com/big-archive.tar.gz
RUN tar xzf big-archive.tar.gz
RUN rm big-archive.tar.gz
# GOOD — 1 layer, tarball never persisted
RUN curl -O https://example.com/big-archive.tar.gz \
&& tar xzf big-archive.tar.gz \
&& rm big-archive.tar.gz
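The reason the "BAD" version stays large is easiest to see if you model layers the way overlay2 stores them, as stacked directories: a delete in a later layer only adds a whiteout marker, while the earlier layer keeps every byte. A toy model (the directory names are illustrative):

```shell
# Toy model of overlay layers: the "rm" layer adds a whiteout file; the
# 1 MiB archive still occupies space in the earlier layer.
img=$(mktemp -d)
mkdir "$img/layer1" "$img/layer2"
head -c 1048576 /dev/zero > "$img/layer1/big-archive.tar.gz"  # RUN curl -O ...
touch "$img/layer2/.wh.big-archive.tar.gz"                    # RUN rm ... (whiteout)

total_kb=$(du -sk "$img" | cut -f1)
[ "$total_kb" -ge 1024 ] && echo "archive still stored: ${total_kb} KiB on disk"
```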
The same logic applies to apt-get. Always chain apt-get update, apt-get install, and cleanup into one RUN:
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
ca-certificates \
curl \
&& rm -rf /var/lib/apt/lists/*
The --no-install-recommends flag prevents apt from pulling in suggested packages you don't need. This alone can save 50-200 MB depending on the packages involved.
Multi-Stage Builds for Minimal Final Images
Multi-stage builds are the most powerful size-reduction technique. You build in a fat image with all the tools, then copy just the artifact into a minimal runtime image. The compiler, source code, and build dependencies never ship.
# Stage 1: Build (900+ MB with Go toolchain)
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app
# Stage 2: Run (just the binary — ~7 MB total)
FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]
Inspecting Image Layers with dive
The dive tool gives you an interactive, layer-by-layer breakdown of your image. It shows exactly what each layer added, how much space it uses, and highlights wasted space (files added in one layer and deleted in a later one).
# Install dive (macOS)
brew install dive
# Analyze an image
dive my-app:latest
# CI mode — fail if image efficiency is below threshold
dive my-app:latest --ci --lowestEfficiency=0.95
Dimension 3: Runtime Performance
A lean image is only half the story. How you run the container matters just as much. Docker's default settings are generous, which is fine for dev but dangerous in production. Here are the levers you should be tuning.
Memory and CPU Limits
Without limits, a single container can consume all host memory and trigger the OOM killer, taking down every container on the host. Always set explicit constraints:
# Hard memory cap + 2 CPU cores
docker run -d \
--memory=512m \
--memory-swap=512m \
--cpus=2 \
--name api-server \
my-api:latest
# Reserve memory (soft limit) — useful with orchestrators
docker run -d \
--memory=512m \
--memory-reservation=256m \
my-api:latest
Setting --memory-swap equal to --memory disables swap entirely. This prevents a memory-starved container from thrashing swap and degrading performance for everything on the host.
Volumes for Write-Heavy Workloads
Docker's copy-on-write filesystem (overlay2) adds overhead for write-heavy operations. Databases, log files, and any workload that writes frequently should use volumes, which bypass the storage driver and write directly to the host filesystem:
# Named volume for database data
docker run -d \
-v pgdata:/var/lib/postgresql/data \
postgres:16-alpine
# tmpfs for ephemeral scratch data (RAM-backed, zero disk I/O)
docker run -d \
--tmpfs /tmp:rw,noexec,size=256m \
my-worker:latest
# Shared memory size (default is 64MB — too small for Postgres/Chrome)
docker run -d \
--shm-size=256m \
postgres:16-alpine
Network Performance
Docker's default bridge network adds NAT overhead. For latency-sensitive workloads where the container needs bare-metal network performance, you can bypass Docker networking entirely:
# Host networking — no NAT, no port mapping, ~zero overhead
docker run -d --network=host my-app:latest
--network=host removes network isolation — the container shares the host's network namespace. Only use this when you've measured a real performance gap and accept the security trade-off. It also doesn't work on Docker Desktop for Mac/Windows (which runs in a Linux VM).
Benchmarking and Monitoring
You can't optimize what you don't measure. Docker ships with built-in monitoring, and open-source tools take it further for production environments.
# Live stats for all running containers
docker stats
# One-shot stats (useful for scripts)
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"
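The one-shot format is easy to post-process in scripts, for example to alert on memory pressure. A sketch; the captured line below is assumed sample output of the formatted command above, and the 75% threshold is arbitrary:

```shell
# Flag containers above a memory threshold from captured `docker stats` output.
# Assumed column layout for this sample: NAME CPU% MEMUSAGE / LIMIT MEM%
sample="api-server  3.42%  412MiB / 512MiB  80.47%"

mem_pct=$(printf '%s\n' "$sample" | awk '{gsub(/%/, "", $6); print $6}')
awk -v p="$mem_pct" 'BEGIN { exit !(p > 75) }' \
  && echo "api-server above 75% memory (${mem_pct}%)"
```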
For production monitoring, cAdvisor exports per-container metrics (CPU, memory, filesystem, network) to Prometheus. A minimal setup looks like this:
# docker-compose.monitoring.yml
services:
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.1
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
prometheus:
image: prom/prometheus:v2.51.0
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
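The Compose file above mounts a ./prometheus.yml that isn't shown. A minimal scrape config could be generated like this; the 15s interval and job name are arbitrary choices, and "cadvisor:8080" relies on Compose service-name DNS:

```shell
# Write a minimal Prometheus config that scrapes cAdvisor over the
# Compose network, where the service name "cadvisor" resolves via DNS.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
EOF

grep -q 'cadvisor:8080' prometheus.yml && echo "config written"
```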
Practical Exercise: From 1.2 GB to Under 50 MB
Let's apply every technique from this section. We'll start with a bloated Node.js image and systematically shrink it. The application is a simple Express API.
1. Start with the bloated Dockerfile
This is what a naive Dockerfile often looks like in the wild. Build it and note the image size:
# Dockerfile.bloated — ~1.2 GB
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN apt-get update && apt-get install -y python3 build-essential
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/server.js"]
docker build -f Dockerfile.bloated -t myapp:bloated .
docker images myapp:bloated
# REPOSITORY   TAG       SIZE
# myapp        bloated   1.21GB
2. Switch to Alpine and add a .dockerignore
Switch the base to node:20-alpine, split dependency install from source copy, and exclude unnecessary files from the build context:
# Create .dockerignore
echo -e "node_modules\ndist\n.git\n*.md\n.env" > .dockerignore
3. Apply multi-stage build
Separate the build stage (which needs dev dependencies and build tools) from the production stage (which needs only the compiled output and production dependencies):
# Dockerfile.optimized — ~45 MB
FROM node:20-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY tsconfig.json ./
COPY src/ ./src/
RUN npm run build
FROM node:20-alpine AS production
WORKDIR /app
ENV NODE_ENV=production
COPY package.json package-lock.json ./
RUN npm ci --omit=dev && npm cache clean --force
COPY --from=build /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]
4. Verify the result
Build the optimized image and compare sizes:
docker build -f Dockerfile.optimized -t myapp:optimized .
docker images myapp
# REPOSITORY   TAG         SIZE
# myapp        bloated     1.21GB
# myapp        optimized   45MB
# Inspect layers for further optimization
dive myapp:optimized
That's a 96% reduction — from 1.21 GB to ~45 MB — achieved by switching the base image, using multi-stage builds, installing only production dependencies, and cleaning the npm cache.
For Go, Rust, or any language that produces static binaries, you can use FROM scratch or FROM gcr.io/distroless/static as the final stage — bringing the total image size to just the binary itself (often 5-15 MB). For Node.js, node:20-alpine is the practical floor since you need the Node runtime.
Beyond Single Host: Docker Swarm and Kubernetes Overview
A single Docker host gets you surprisingly far — but eventually you hit its ceiling. Maybe you need high availability, so one crashed server doesn't take everything down. Maybe traffic has outgrown a single machine. Maybe you need zero-downtime deploys. This is where container orchestration enters the picture.
Orchestrators manage containers across multiple machines, handling scheduling, networking, scaling, and self-healing. The two orchestrators most relevant to Docker users are Docker Swarm (built into Docker) and Kubernetes (the industry standard). Let's examine both, honestly.
graph LR
subgraph swarm["Docker Swarm"]
direction TB
SM[Manager Nodes] --> SW[Worker Nodes]
SW --> SS[Services]
SS --> SC[Containers / Tasks]
end
subgraph shared["Shared Concepts"]
direction TB
S1[Declarative Config]
S2[Service Discovery]
S3[Load Balancing]
S4[Rolling Updates]
S5[Overlay Networking]
end
subgraph k8s["Kubernetes"]
direction TB
KCP[Control Plane] --> KW[Worker Nodes]
KW --> KP[Pods]
KP --> KC[Containers]
end
swarm --- shared
shared --- k8s
Docker Swarm: The Built-In Orchestrator
Docker Swarm mode is baked directly into the Docker Engine. There's nothing extra to install — if you have Docker, you have Swarm. You initialize a cluster with a single command, and other nodes join with a token. The simplicity is genuine and significant.
# Initialize Swarm on the first manager node
docker swarm init --advertise-addr 192.168.1.10
# On worker nodes, join with the token provided
docker swarm join --token SWMTKN-1-abc123... 192.168.1.10:2377
# Check cluster status
docker node ls
Swarm introduces the concept of services — a declarative way to say "I want 3 replicas of this image running at all times." Swarm handles placement, restarts, and load balancing automatically.
Services and Overlay Networking
A Swarm service wraps your container definition with orchestration metadata: replica count, update policy, resource constraints, and networking. Overlay networks span all nodes in the cluster, so containers on different physical machines can communicate as if they were on the same network.
# Create an overlay network
docker network create --driver overlay backend-net
# Deploy a service with 3 replicas
docker service create \
--name api \
--replicas 3 \
--network backend-net \
--publish 8080:3000 \
myapp/api:1.2.0
# Scale up
docker service scale api=5
# Rolling update to a new image
docker service update --image myapp/api:1.3.0 api
Stack Deploys: Compose for Swarm
If you already use Docker Compose, you're 80% of the way to Swarm. A docker stack deploy reads a Compose file (with a deploy key) and creates services, networks, and volumes across the cluster. The deploy section specifies replicas, update policies, and resource limits — things that don't apply to single-host Compose.
# docker-compose.prod.yml
version: "3.8"
services:
api:
image: myapp/api:1.3.0
networks:
- backend
deploy:
replicas: 3
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
resources:
limits:
cpus: "0.50"
memory: 256M
ports:
- "8080:3000"
redis:
image: redis:7-alpine
networks:
- backend
deploy:
placement:
constraints:
- node.role == manager
networks:
backend:
driver: overlay
# Deploy the entire stack
docker stack deploy -c docker-compose.prod.yml myapp
# Check running services
docker stack services myapp
# Remove the stack
docker stack rm myapp
Routing Mesh
Swarm's routing mesh is an underrated feature. When you publish a port, every node in the cluster listens on that port — even nodes not running the service. Traffic hitting any node is automatically routed to a healthy container. This means you can point a simple load balancer (or DNS round-robin) at all your nodes without worrying about which one is running what.
Docker Swarm is genuinely simpler than Kubernetes — less YAML, fewer concepts, zero external dependencies. For small teams running 5-20 services across a handful of nodes, it works well. However, Swarm lost the orchestration war. Docker Inc. pivoted away from it, the ecosystem stagnated, and most cloud providers don't offer managed Swarm. The community, tooling, and job market all center on Kubernetes. Learn Swarm if it fits your scale, but don't bet your career on it alone.
Kubernetes: The Industry Standard
Kubernetes (K8s) is the dominant container orchestration platform. It was originally designed by Google, based on over a decade of running containers at massive scale (their internal system, Borg). Unlike Swarm, Kubernetes is not built into Docker — it's a separate, complex system with its own API, CLI (kubectl), and ecosystem.
The learning curve is steep, but the payoff is a platform that can handle virtually any workload pattern. Here are the core building blocks you need to understand.
Core Kubernetes Objects
| Object | What It Does | Swarm Equivalent |
|---|---|---|
| Pod | Smallest deployable unit — one or more containers sharing network/storage | Task (single container) |
| Deployment | Manages Pod replicas, rolling updates, rollbacks | Service |
| Service | Stable network endpoint for a set of Pods (load balancing) | Service VIP + routing mesh |
| Namespace | Virtual cluster isolation within one physical cluster | Stack name (loosely) |
| ConfigMap | Inject configuration as env vars or files | Configs |
| Secret | Like ConfigMap but for sensitive data (base64-encoded, encryptable at rest) | Secrets |
Everything in Kubernetes is declared as YAML and applied via kubectl apply. Here's a minimal Deployment and Service — the Kubernetes equivalent of docker service create:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
replicas: 3
selector:
matchLabels:
app: api
template:
metadata:
labels:
app: api
spec:
containers:
- name: api
image: myapp/api:1.3.0
ports:
- containerPort: 3000
resources:
limits:
cpu: "500m"
memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
name: api
spec:
selector:
app: api
ports:
- port: 80
targetPort: 3000
type: LoadBalancer
Notice the difference in verbosity. The Kubernetes manifest is roughly 3× longer than the equivalent Swarm stack file — and this is a simple example. That verbosity buys you precision and flexibility, but it's a real cost for small projects.
The Migration Path: Compose → Swarm → Kubernetes
You don't need to jump straight to Kubernetes. There's a natural progression that lets you add complexity only when your needs demand it.
Docker Compose handles local development and single-host deployments. When you need multi-node, add a deploy key and use Docker Swarm — your Compose files mostly carry over. When you outgrow Swarm (or need the Kubernetes ecosystem), Kompose can convert your Compose files into Kubernetes manifests as a starting point.
# Install Kompose
curl -L https://github.com/kubernetes/kompose/releases/latest/download/kompose-linux-amd64 -o kompose
chmod +x kompose
# Convert a Compose file to Kubernetes manifests
./kompose convert -f docker-compose.yml
# This generates Deployment and Service YAML files
# Review and adjust them — Kompose output is a starting point, not production-ready
Kompose translates syntax, not architecture. It won't generate Ingress rules, Horizontal Pod Autoscalers, PersistentVolumeClaims, or health checks. Treat its output as a rough draft that needs manual refinement for production use.
Alternatives Worth Knowing
Kubernetes isn't the only option beyond Swarm. Depending on your team size, cloud provider, and operational appetite, these alternatives may be a better fit.
| Platform | Best For | Trade-off |
|---|---|---|
| HashiCorp Nomad | Teams that want orchestration without Kubernetes complexity. Supports containers, VMs, and bare binaries. | Smaller ecosystem than K8s. Fewer managed offerings. |
| AWS ECS | AWS-native teams. Tight integration with ALB, IAM, CloudWatch. | Vendor lock-in. Not portable to other clouds. |
| Google Cloud Run | Stateless HTTP services that need to scale to zero. No cluster management at all. | Optimized for request-driven workloads; background work and long-lived connections are constrained. |
| Fly.io | Edge deployments. Runs containers close to users worldwide with minimal config. | Smaller company, niche platform. Less enterprise tooling. |
If you're a solo dev or small team with fewer than 10 services, start with Docker Compose on a single host, or Docker Swarm for basic HA. If you're building a platform for multiple teams, or your cloud provider offers managed Kubernetes (EKS, GKE, AKS), that's the pragmatic choice. The "best" orchestrator is the one your team can actually operate.
Real-World Patterns and Production Playbook
Running Docker in development is one thing. Running it in production — where a container crash at 3 AM pages your on-call engineer — is another beast entirely. This section distills the battle-tested patterns that separate "it works on my machine" from "it works reliably, at scale, under pressure."
We will walk through the 12-factor methodology applied to containers, graceful shutdown mechanics, logging pipelines, image promotion strategies, monitoring stacks, and a concrete production-readiness checklist you can adopt today.
The 12-Factor App, Containerized
The 12-factor app methodology was written in 2011, but it reads like a Docker best-practices guide. Containers naturally enforce many of its principles — but not all of them automatically. Here is how the most critical factors map to Docker decisions.
| Factor | Principle | Docker Implementation |
|---|---|---|
| III. Config | Store config in the environment | Use -e, --env-file, or orchestrator secrets — never bake config into images |
| VI. Processes | Stateless processes | Keep containers ephemeral; persist state in volumes or external stores (Redis, PostgreSQL) |
| VII. Port Binding | Export services via port binding | EXPOSE in Dockerfile, -p at runtime — the app binds to 0.0.0.0 |
| VIII. Concurrency | Scale via the process model | Scale horizontally with docker compose up --scale web=4 or Kubernetes replicas |
| IX. Disposability | Fast startup, graceful shutdown | Minimal images + proper signal handling (see below) |
| XI. Logs | Treat logs as event streams | Write to stdout/stderr — let the logging driver handle the rest |
If you are copying .env files into your image at build time, or hard-coding database URLs in application code, you have broken config separation. The image should be identical across all environments — only environment variables change.
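To make this concrete, here is a minimal sketch of reading all runtime config from the environment in Node.js. The variable names (`DATABASE_URL`, `LOG_LEVEL`, `PORT`) are illustrative conventions, not anything Docker mandates:

```javascript
// Sketch: all runtime config comes from the environment (12-factor III),
// so the same image runs unchanged in dev, staging, and production.
function loadConfig(env = process.env) {
  const config = {
    port: parseInt(env.PORT || '3000', 10),
    dbUrl: env.DATABASE_URL, // injected via -e, --env-file, or a secret store
    logLevel: env.LOG_LEVEL || 'info',
  };
  // Fail fast at startup instead of at the first database query
  if (!config.dbUrl) throw new Error('DATABASE_URL is required');
  return config;
}
```

Passing a plain object into `loadConfig` keeps it unit-testable; in the app itself you simply call it with no arguments.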
Health Checks in Production
A running container is not necessarily a healthy container. Your Node.js process might be alive but stuck in an infinite loop. Your database connection pool might be exhausted. Docker's HEALTHCHECK instruction lets you define what "healthy" actually means for your application, and orchestrators use this signal to route traffic and restart failed containers.
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD curl -f http://localhost:8080/healthz || exit 1
The four parameters matter:
- `--interval`: How often to check (30s is a sensible default).
- `--timeout`: Max time the check can take before it is considered failed.
- `--start-period`: Grace period for slow-starting apps (the check runs but failures do not count).
- `--retries`: How many consecutive failures before the container is marked `unhealthy`.
For production health endpoints, go deeper than just "is the server responding." Check downstream dependencies — database connectivity, cache availability, disk space — and return structured JSON with per-component status.
{
"status": "healthy",
"checks": {
"database": { "status": "up", "latency_ms": 3 },
"redis": { "status": "up", "latency_ms": 1 },
"disk": { "status": "up", "free_gb": 12.4 }
},
"uptime_seconds": 86412
}
Graceful Shutdown: SIGTERM, PID 1, and Init Systems
When Docker stops a container, it sends SIGTERM to PID 1, waits for a grace period (default: 10 seconds), and then sends SIGKILL. If your app handles SIGTERM correctly, it can finish in-flight requests, close database connections, and flush buffers before exiting cleanly. If it does not, your users see dropped connections and your data might corrupt.
The problem is that many containers run a shell as PID 1. Shells do not forward signals to child processes, so your app never receives SIGTERM — it just gets hard-terminated after the timeout. A related issue is that PID 1 is responsible for reaping orphaned child processes; a naive PID 1 leaves zombie processes accumulating. Together these make up the "PID 1 problem" that init helpers solve.
The Fix: Use tini or dumb-init
# tini is built into Docker — just use --init at runtime
# OR install it explicitly in your image:
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["node", "server.js"]
The easiest option: run any container with docker run --init myapp and Docker uses its bundled tini. For Compose, add init: true to the service.
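In a Compose file, that is a one-line key per service (image name reused from the earlier examples):

```yaml
services:
  api:
    image: myapp/api:1.3.0
    init: true   # Docker runs its bundled tini as PID 1
```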
RUN apt-get update && apt-get install -y dumb-init
ENTRYPOINT ["dumb-init", "--"]
CMD ["python", "app.py"]
dumb-init by Yelp serves the same purpose — it runs as PID 1, forwards signals, and reaps zombie processes. Choose whichever your team prefers; both are battle-tested.
On the application side, you still need to handle the signal. Here is a minimal Node.js example:
process.on('SIGTERM', () => {
console.log('SIGTERM received. Shutting down gracefully...');
server.close(() => {
db.end(); // close database pool
process.exit(0); // exit cleanly
});
// Force exit after 8s if close hangs
setTimeout(() => process.exit(1), 8000);
});
Logging Strategies
Docker captures everything your container writes to stdout and stderr. The logging driver determines where those streams go. Choosing the right driver is one of the highest-impact production decisions you will make — it affects debugging speed, storage costs, and system performance.
| Driver | Where Logs Go | Best For | Caveat |
|---|---|---|---|
json-file | Local JSON files on the host | Development, small deployments | Fills disk without max-size/max-file limits |
fluentd | Fluentd/Fluent Bit collector | Centralized logging at scale | Requires running a Fluentd sidecar or daemon |
loki | Grafana Loki (via plugin) | Grafana-based stacks, cost-conscious teams | Needs the Loki Docker plugin installed |
syslog | Remote syslog server | Traditional infrastructure | Less structured than JSON alternatives |
Always set log rotation on the default json-file driver. Without it, a verbose container can consume all available disk space and bring down the host.
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5",
"compress": "true"
}
}
Put this in /etc/docker/daemon.json to set defaults for all containers on the host. For a production ELK (Elasticsearch + Logstash + Kibana) pipeline, ship logs from Fluent Bit to Elasticsearch. Fluent Bit is lighter than Logstash and handles Docker log format natively.
Emit logs as JSON from your application. This makes them machine-parseable without grok patterns or regex extractors. A line like {"level":"error","msg":"connection refused","service":"api","ts":"2024-01-15T08:30:00Z"} is infinitely more useful in Kibana or Grafana than a plain text string.
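As a minimal sketch of that pattern, here is a tiny structured logger writing one JSON object per line to stdout; a production app would typically reach for pino or winston instead:

```javascript
// One JSON object per line on stdout: trivially parseable by
// Fluent Bit, Loki, or any Docker logging driver downstream.
function log(level, msg, fields = {}) {
  const entry = {
    level,
    msg,
    service: process.env.SERVICE_NAME || 'api', // assumed env var
    ts: new Date().toISOString(),
    ...fields,
  };
  process.stdout.write(JSON.stringify(entry) + '\n');
  return entry; // returned for convenience in tests
}

log('error', 'connection refused', { upstream: 'postgres:5432' });
```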
Configuration Management
Configuration has a hierarchy in containerized environments. Getting it right means your image stays portable and your secrets stay safe.
The Configuration Pyramid
- Baked into the image — only for defaults that are truly universal (e.g., the app listen port, timezone UTC). These go in the Dockerfile.
- Environment variables — for values that change per environment: database URLs, feature flags, log levels. Pass via `-e`, `--env-file`, or the Compose `environment:` key.
- Mounted config files — for complex configuration (nginx.conf, application YAML). Bind-mount or use Docker configs in Swarm.
- Secrets — for passwords, API keys, TLS certs. Use Docker Secrets (Swarm), Kubernetes Secrets, or a vault like HashiCorp Vault. Never pass secrets as build args — they are stored in image layers.
# docker-compose.yml — clean config separation
services:
api:
image: mycompany/api:1.4.2
environment:
- NODE_ENV=production
- LOG_LEVEL=info
env_file:
- .env.production
configs:
- source: nginx_conf
target: /etc/nginx/nginx.conf
secrets:
- db_password
configs:
nginx_conf:
file: ./config/nginx.prod.conf
secrets:
db_password:
external: true
Image Promotion: Build Once, Deploy Everywhere
The golden rule of container deployments: build the image exactly once, then promote that identical artifact through dev, staging, and production. Rebuilding per environment introduces drift — different dependency versions, different build timestamps, different behavior. If it passed tests in staging, you want that exact image in production.
graph LR
A["Developer\ngit push"] --> B["CI Pipeline\ndocker build"]
B --> C["Registry\n:git-sha tag"]
C --> D["Dev\nAuto-deploy"]
D --> E{"Tests\nPass?"}
E -->|Yes| F["Staging\nSame image"]
F --> G{"QA\nPass?"}
G -->|Yes| H["Production\nSame image"]
E -->|No| I["Fix &\nRebuild"]
G -->|No| I
style A fill:#4a9eff,color:#fff
style B fill:#f5a623,color:#fff
style C fill:#7b68ee,color:#fff
style H fill:#2ecc71,color:#fff
style I fill:#e74c3c,color:#fff
The key mechanism is tagging. Each build produces a single image tagged with both a semantic version and the git SHA. The same image gets additional tags as it is promoted.
# Build once in CI
GIT_SHA=$(git rev-parse --short HEAD)
VERSION="1.4.2"
docker build -t mycompany/api:${VERSION}-${GIT_SHA} .
# Tag for promotion
docker tag mycompany/api:${VERSION}-${GIT_SHA} mycompany/api:${VERSION}
docker tag mycompany/api:${VERSION}-${GIT_SHA} mycompany/api:latest
# Push all tags
docker push mycompany/api:${VERSION}-${GIT_SHA}
docker push mycompany/api:${VERSION}
docker push mycompany/api:latest
Tagging Strategy: Semver + Git SHA
| Tag | Example | Purpose |
|---|---|---|
version-sha | 1.4.2-a3f8c1d | Immutable, traceable to exact commit |
version | 1.4.2 | Human-readable release identifier |
latest | latest | Convenience only — never use in production manifests |
The :latest tag is mutable — it points to whatever was pushed most recently. If two developers push in quick succession, the second overrides the first with no audit trail. Always pin to an immutable tag (1.4.2-a3f8c1d) in production deployments so you can trace exactly what is running and roll back deterministically.
Monitoring: cAdvisor + Prometheus + Grafana
The standard open-source monitoring stack for containerized workloads is cAdvisor, Prometheus, and Grafana. Each layer has a specific job: cAdvisor exposes per-container resource metrics (CPU, memory, network, disk I/O), Prometheus scrapes and stores those metrics as time-series data, and Grafana visualizes them with dashboards and alerts.
# docker-compose.monitoring.yml
services:
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.1
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
ports:
- "8080:8080"
prometheus:
image: prom/prometheus:v2.51.0
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
grafana:
image: grafana/grafana:10.4.0
volumes:
- grafana_data:/var/lib/grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_pw
volumes:
prometheus_data:
grafana_data:
Point Prometheus at cAdvisor with a simple scrape config:
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
- job_name: 'app'
static_configs:
- targets: ['api:9100'] # your app's /metrics endpoint
Key metrics to alert on: container_memory_usage_bytes approaching the limit (OOM termination incoming), container_cpu_usage_seconds_total sustained spikes, container_network_receive_errors_total rising, and your application's own latency and error-rate metrics exposed via a /metrics endpoint.
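As a sketch, the memory alert might look like this as a Prometheus alerting rule (the 90% threshold and timings are illustrative; tune them for your workloads):

```yaml
# alert-rules.yml (illustrative; load via rule_files in prometheus.yml)
groups:
  - name: container-alerts
    rules:
      - alert: ContainerMemoryNearLimit
        expr: |
          container_memory_usage_bytes{name!=""}
            / container_spec_memory_limit_bytes{name!=""} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} is above 90% of its memory limit"
```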
Production-Readiness Checklist
Before any container touches production traffic, run through this checklist. Each item addresses a real failure mode that has caused outages in production systems.
| Category | Check | Why It Matters |
|---|---|---|
| Image | Using a minimal base (alpine, distroless, slim) | Fewer packages = smaller attack surface, faster pulls |
| Image | Pinned base image to a specific digest or version | Prevents silent breakage when upstream images change |
| Image | Multi-stage build — no build tools in final image | Reduces image size and eliminates compilers attackers could use |
| Image | Scanned for CVEs (Trivy, Snyk, Grype) | Known vulnerabilities caught before deployment |
| Runtime | Running as non-root user | Limits blast radius if the container is compromised |
| Runtime | HEALTHCHECK defined and tested | Orchestrator can detect and replace unhealthy instances |
| Runtime | Graceful shutdown handles SIGTERM | Zero dropped connections during deployments and scaling |
| Runtime | Memory and CPU limits set | Prevents a single container from starving the host |
| Ops | Logs go to stdout/stderr with rotation configured | Prevents disk exhaustion, enables centralized log aggregation |
| Ops | Secrets injected at runtime, not baked into image | Secrets in layers are recoverable by anyone with image access |
| Ops | Immutable tag deployed (semver + SHA, not :latest) | Reproducible deployments, reliable rollbacks |
| Ops | Monitoring and alerting configured | You find out about problems before your users do |
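Several of the image and runtime checks can be satisfied in the Dockerfile itself. Here is a sketch for a Node.js app; the build script, `dist/server.js` path, and port 3000 are assumptions about your project layout:

```dockerfile
# Build stage: compilers and dev dependencies never reach production
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: minimal, non-root, health-checked
# (pin to a digest, e.g. node:20-alpine@sha256:..., in real builds)
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node                       # non-root: limits blast radius
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget -qO- http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```

Memory/CPU limits, log rotation, and secret injection belong in the runtime configuration (Compose, Swarm, or Kubernetes manifests) rather than the image.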