Docker, Docker Compose, and Docker Swarm
Docker & Docker Compose for Data Engineering
A comprehensive guide to containerizing data pipelines, databases, and analytics workflows.
Official Resources: https://docs.docker.com
What is Docker?
Docker packages applications with their complete runtime environments into lightweight, portable containers. Unlike virtual machines that virtualize hardware, containers virtualize at the operating system level—sharing the host kernel while maintaining strong isolation through namespaces.
Why Docker for Data Engineering?
Reproducible environments across development, testing, and production
Version-controlled infrastructure alongside your code
Easy orchestration of complex data stacks (databases, ETL tools, streaming platforms)
Simplified dependency management for Python/R/Java data tools
Rapid environment teardown and rebuild during development
Core Concepts
Images vs Containers
Image: A read-only template defining what a container should be. Think of it as a class in OOP.
Container: A running instance of an image with its own filesystem, network, and process space. Think of it as an object instantiated from a class.
Key Insight: Deleting and recreating a container resets it to the original image state. Restarting a stopped container preserves changes made during its lifetime.
Container vs Virtual Machine
A VM virtualizes hardware and ships a full guest OS, so it weighs gigabytes and takes minutes to boot; a container shares the host kernel and packages only the application and its dependencies, so it weighs megabytes and starts in milliseconds.
Docker Architecture
Docker Daemon: The background service that manages containers. Must be running for Docker to function.
Docker Desktop: A GUI wrapper (optional on Linux) that manages the daemon, provides visualization, and on Mac/Windows, handles the Linux VM that actually runs Docker.
Docker CLI: Your primary interface. Communicates directly with the daemon via socket/API.
Start here: docker --help or docker <command> --help for specific guidance.
Essential Commands
Container Lifecycle
Run a container:
Common flags:
-d: Detached mode (runs in background)
-p host:container: Port mapping. It's two-way communication: with -p 8080:80, incoming requests to localhost:8080 on your machine are forwarded to port 80 in the container, and responses from the container's port 80 travel back out through port 8080. You send HTTP requests to port 8080, they enter the container on port 80, the application processes them, and the responses return along the same path.
-v volume:path: Mount a volume for persistence
-e KEY=VALUE: Set environment variables
--name: Assign a custom name
--network: Attach to a specific network
--rm: Auto-remove the container when it stops
-m, --memory: Limit memory allocation; a positive integer with a unit suffix of b, k, m, or g
--memory-swap: Cap the total amount of memory (RAM) plus swap space (virtual memory) a container can use
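Putting several of these flags together; a sketch in which the image, container name, and password are illustrative:
```bash
# Run PostgreSQL in the background with a port mapping, named volume, and a 1 GB memory cap
docker run -d \
  --name pg \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  -m 1g \
  postgres:16
```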
List containers:
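```bash
docker ps       # running containers
docker ps -a    # all containers, including stopped ones
```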
Stop/Start/Restart:
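For example (container name is a placeholder):
```bash
docker stop my-container
docker start my-container
docker restart my-container
```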
Remove containers:
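```bash
docker rm my-container      # remove a stopped container
docker rm -f my-container   # force-remove a running container
docker container prune      # remove all stopped containers
```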
Working with Images
List images:
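```bash
docker images
```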
Pull/Push images:
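For example (image names are placeholders):
```bash
docker pull postgres:16
docker push myuser/my-etl:latest
```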
Remove images:
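```bash
docker rmi postgres:16
docker image prune    # remove dangling images
```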
Search Docker Hub:
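```bash
docker search postgres
```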
Building Custom Images
Dockerfile Basics
A Dockerfile is Infrastructure as Code for your container image. Place it in your project root.
Data Engineering Example - Python ETL Container:
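A minimal sketch of such a Dockerfile (the base image tag, file names, and the etl.py entry point are assumptions):
```dockerfile
FROM python:3.12-slim
WORKDIR /app
# Copy the dependency manifest first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code last
COPY . .
ENV PYTHONUNBUFFERED=1
CMD ["python", "etl.py"]
```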
Build the image:
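From the directory containing the Dockerfile (the tag is illustrative):
```bash
docker build -t my-etl:latest .
```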
Key Dockerfile Instructions:
FROM: Base image to build upon
WORKDIR: Set the working directory
COPY/ADD: Copy files from host to image
RUN: Execute commands during build (install packages, etc.)
ENV: Set environment variables
EXPOSE: Document which ports the container listens on
CMD: Default command when the container starts
ENTRYPOINT: Configure the container as an executable
Best Practices:
Use official base images when possible
Copy dependency files before application code (layer caching)
Minimize layers by combining RUN commands
Use .dockerignore to exclude unnecessary files
Don't store secrets in images (use environment variables or secrets management)
Executing Commands in Containers
One-off Commands
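With docker exec you can run a single command in a running container without opening a shell (container name and path are placeholders):
```bash
docker exec my-container ls /data
```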
Interactive Shell Sessions
Flags:
-i: Interactive mode (keep STDIN open)
-t: Allocate a pseudo-TTY (keyboard interface)
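For example:
```bash
docker exec -it my-container /bin/bash   # use /bin/sh on slim or alpine images
```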
Monitoring & Debugging
Viewing Logs
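```bash
docker logs my-container
docker logs -f --tail 100 my-container   # follow output, starting from the last 100 lines
```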
Resource Monitoring
Check resource usage:
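```bash
docker stats   # live CPU, memory, network, and I/O usage for all containers
```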
View processes in container:
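```bash
docker top my-container
```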
Resource Limits
Control resource allocation to prevent containers from consuming excessive resources:
Memory suffixes: b, k, m, g (e.g., 512m, 2g)
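For example (image name is a placeholder):
```bash
docker run -d --memory=512m --memory-swap=1g --cpus=2 my-image
```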
Resource Constraints Documentation
docker cp
Copy files between your host machine and a container:
Example scenario:
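Say you want to feed a CSV into a container and later pull a log file back out (paths are illustrative):
```bash
docker cp ./data.csv my-container:/data/          # host → container
docker cp my-container:/var/log/app.log ./logs/   # container → host
```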
docker diff
Shows what files have been modified in a container since it started. Each line of output carries a status prefix:
A = Added file
C = Changed file or directory
D = Deleted file
Example scenario: checking what a long-running container has written to its filesystem. The output might show:
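A hedged sketch; the container name and the paths shown are illustrative:
```bash
docker diff my-container
# C /etc/nginx
# A /etc/nginx/conf.d/custom.conf
# C /var/log
# A /var/log/app.log
```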
This is useful for debugging or understanding what a container modified before committing it to an image.
Persistent Storage: Volumes
Containers are ephemeral by design. Volumes provide persistent storage that survives container deletion.
Volume Types
Named Volumes (Recommended): Managed by Docker, stored in Docker's storage directory.
Anonymous Volumes: Created automatically, harder to reference later.
Bind Mounts: Direct mapping to host filesystem paths.
Working with Volumes
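The core management commands (volume name is a placeholder):
```bash
docker volume create pgdata
docker volume ls
docker volume inspect pgdata
docker volume rm pgdata
docker volume prune    # remove all unused volumes
```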
Using Volumes
Named volume example:
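A sketch (volume name, image, and password are illustrative):
```bash
docker run -d -v pgdata:/var/lib/postgresql/data -e POSTGRES_PASSWORD=secret postgres:16
```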
Bind mount example (development):
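A sketch that mounts the current directory's notebooks into a Jupyter container (image and paths are assumptions):
```bash
docker run -d -p 8888:8888 -v "$(pwd)/notebooks:/home/jovyan/work" jupyter/base-notebook
```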
Multiple volumes:
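Simply repeat the -v flag (paths and credentials are illustrative):
```bash
docker run -d \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  -v "$(pwd)/init:/docker-entrypoint-initdb.d" \
  postgres:16
```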
Data Engineering Use Cases:
Database storage (PostgreSQL, MongoDB, Elasticsearch)
Persistent logs (Airflow, Spark)
Shared datasets between containers
ML model artifacts and checkpoints
Configuration files
Networking
Containers need networking for inter-service communication and external access.
Network Modes
Bridge (default): Private network for container communication.
Host: Container shares host's network stack.
None: No networking (isolated mode).
Custom networks: User-defined bridges with DNS resolution.
Working with Networks
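The core management commands (network and container names are placeholders):
```bash
docker network create data-pipeline
docker network ls
docker network inspect data-pipeline
docker network connect data-pipeline my-container
docker network rm data-pipeline
```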
Connecting Containers
Example: Connect application to database
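A sketch using the container names referenced below (images and password are assumptions):
```bash
docker network create data-pipeline
docker run -d --name postgres --network data-pipeline -e POSTGRES_PASSWORD=secret postgres:16
docker run -d --name etl-job --network data-pipeline my-etl:latest
# etl-job can now reach the database at the hostname "postgres"
```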
Network Communication Diagram:
Key Feature: Docker provides automatic DNS resolution. Container etl-job can connect to postgres by name, not IP address.
Network Isolation
Offline mode (for security or testing):
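For example (image name is a placeholder):
```bash
docker run -d --network none my-batch-job
```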
Test connectivity:
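A quick check from one container to another (assumes ping is installed in the image):
```bash
docker exec etl-job ping -c 3 postgres
```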
Docker Compose (V2) Cheatsheet
Notes on Compose V2 vs V1:
Commands: docker-compose (V1) has been replaced by the docker compose (V2) plugin.
File name: compose.yaml is the preferred standard for V2, though docker-compose.yml still works.
Version key: the top-level version: '3.8' line is obsolete; the Compose Specification defaults to the latest schema automatically.
Docker Compose orchestrates multi-container applications using a declarative YAML file. Essential for data engineering where you typically run databases, message queues, ETL tools, and monitoring together.
Why Use Compose?
Define entire data stack in one file
Single command to start/stop all services
Automatic network creation and DNS resolution (service names = hostnames)
Service dependencies and health checks
Basic Compose File
File: compose.yaml (Preferred) or docker-compose.yml
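A minimal sketch (service names, images, and credentials are illustrative):
```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  etl:
    build: .
    depends_on:
      - db
    environment:
      DATABASE_URL: postgresql://postgres:secret@db:5432/postgres

volumes:
  pgdata:
```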
Essential Compose Commands
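The day-to-day basics:
```bash
docker compose up -d     # create and start everything in the background
docker compose down      # stop and remove containers and networks
docker compose ps        # list the project's containers
docker compose logs -f   # follow aggregated logs
```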
Common Docker Compose Commands
Here are some of the most frequently used commands and their functions:
docker compose up: Creates and starts all services defined in the Compose file. Use the -d flag for detached mode (running in the background).
docker compose down: Stops and removes the containers and networks created by up (add -v to also remove volumes).
docker compose ps: Lists the running containers for the current project.
docker compose logs: Views the aggregated log output from all containers.
docker compose build: Builds or rebuilds service images.
docker compose pull: Pulls service images from a registry (e.g., Docker Hub).
docker compose start: Starts existing, stopped service containers.
docker compose stop: Stops running service containers without removing them.
docker compose restart: Restarts service containers.
docker compose exec: Executes a command in a running container.
docker compose run: Runs a one-off command on a service (e.g., running tests or migrations).
For a full list of commands and options, refer to the Docker Compose CLI reference.
Data Engineering Stack Example
Complete pipeline with database, message queue, and processing.
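A hedged sketch of such a stack (images, ports, and wiring are assumptions; a production Kafka setup needs more configuration than shown here):
```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data

  kafka:
    image: apache/kafka:latest
    ports:
      - "9092:9092"

  processor:
    build: .
    depends_on:
      - postgres
      - kafka

volumes:
  pgdata:
```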
Environment Variables
1. Using .env file (Default)
Create .env in the same directory as compose.yaml. Compose automatically loads these.
Reference in YAML: ${VAR_NAME} or ${VAR_NAME:-default_value}.
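A sketch (variable names are illustrative):
```
# .env
POSTGRES_PASSWORD=secret
PG_PORT=5432
```
```yaml
# compose.yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${PG_PORT:-5432}:5432"
```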
2. Override for different environments
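A common pattern is a base file plus an override file, merged with repeated -f flags (the override file name is conventional, not mandatory):
```bash
docker compose -f compose.yaml -f compose.prod.yaml up -d
```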
Advanced Compose Features
Service Dependencies with Health Checks
Ensures services don't start until their dependencies are actually ready (not just running).
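A sketch using a PostgreSQL health check (service names and credentials are illustrative):
```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  etl:
    build: .
    depends_on:
      db:
        condition: service_healthy
```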
Resource Limits
Limit CPU and RAM to prevent crashes on your local machine.
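A sketch (service and image names are illustrative; Compose V2 honors these limits outside Swarm too):
```yaml
services:
  spark:
    image: bitnami/spark:3.5
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 4g
```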
Profiles (Selective Startup)
Great for running only parts of the stack (e.g., only dev tools, or skipping heavy monitoring).
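A sketch assigning services to profiles (service and profile names are illustrative), started with the command below:
```yaml
services:
  pgadmin:
    image: dpage/pgadmin4
    profiles: ["dev"]

  grafana:
    image: grafana/grafana
    profiles: ["monitoring"]
```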
Command: docker compose --profile dev up -d
Authentication & Security
Docker Hub Authentication
Login to Docker Hub:
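```bash
docker login -u <username>   # prompts for a password or access token
```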
Tag and push images:
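Images must be tagged with your namespace before pushing (names are placeholders):
```bash
docker tag my-etl:latest myuser/my-etl:latest
docker push myuser/my-etl:latest
```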
Alternative Container Registries
Amazon ECR (Elastic Container Registry):
Google Artifact Registry:
GitHub Container Registry (GHCR):
Azure Container Registry:
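Typical login commands for each of the registries above (account IDs, regions, and registry names are placeholders):
```bash
# Amazon ECR
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Google Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# GitHub Container Registry
echo "$GITHUB_TOKEN" | docker login ghcr.io -u <username> --password-stdin

# Azure Container Registry
az acr login --name myregistry
```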
Security Best Practices
1. Use secrets, not environment variables for sensitive data:
Docker Compose with secrets (Swarm mode):
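A sketch (names are illustrative; the official postgres image reads *_FILE variables natively):
```yaml
services:
  db:
    image: postgres:16
    secrets:
      - db_password
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    file: ./db_password.txt
```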
2. Run containers as non-root:
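A sketch of the Dockerfile pattern (user name is illustrative):
```dockerfile
FROM python:3.12-slim
RUN useradd --create-home appuser
USER appuser
```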
3. Use official images from trusted sources
4. Scan images for vulnerabilities:
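Two common options (image name is a placeholder):
```bash
docker scout cves my-etl:latest   # Docker's built-in scanner
trivy image my-etl:latest         # popular open-source alternative
```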
5. Don't store secrets in images:
Use environment variables
Use Docker secrets
Use external secret management (Vault, AWS Secrets Manager)
6. Use .dockerignore:
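An illustrative .dockerignore for a Python project:
```
.git
.env
__pycache__/
*.pyc
data/
notebooks/.ipynb_checkpoints/
```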
Data Engineering Workflow Examples
Example 1: PostgreSQL + dbt + Jupyter
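A hedged sketch of the stack (image tags, paths, and credentials are assumptions):
```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data

  dbt:
    image: ghcr.io/dbt-labs/dbt-postgres:1.8.0
    volumes:
      - ./dbt_project:/usr/app
    depends_on:
      - postgres

  jupyter:
    image: jupyter/minimal-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work

volumes:
  pgdata:
```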
Example 2: Airflow Data Pipeline
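Airflow's official compose file defines many more services (scheduler, webserver, triggerer); this trimmed sketch uses standalone mode for local work (tags and paths are assumptions):
```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow:
    image: apache/airflow:2.9.2
    command: standalone
    ports:
      - "8080:8080"
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    depends_on:
      - postgres
```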
Example 3: Spark Cluster
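A hedged sketch using Bitnami's images (tags and environment variables follow their conventions):
```yaml
services:
  spark-master:
    image: bitnami/spark:3.5
    environment:
      SPARK_MODE: master
    ports:
      - "8080:8080"   # web UI
      - "7077:7077"   # master RPC

  spark-worker:
    image: bitnami/spark:3.5
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
    depends_on:
      - spark-master
```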
CI/CD Integration
Modern deployment workflow with Docker:
Developer pushes code to Git repository
CI/CD pipeline triggers (GitHub Actions, GitLab CI, Jenkins)
Pipeline builds Docker image
Pipeline runs tests in container
Pipeline tags image with version/commit SHA
Pipeline pushes image to registry (Docker Hub, ECR, GHCR)
Deployment system pulls latest image
Old containers are replaced with new ones
Example GitHub Action:
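A minimal sketch of such a workflow (the repository secrets and image name are assumptions; the referenced actions are the official Docker ones):
```yaml
name: build-and-push
on:
  push:
    branches: [main]

jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: myorg/my-etl:${{ github.sha }}
```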
Docker Swarm
Docker Swarm is Docker's native clustering and orchestration solution. It turns multiple Docker hosts into a single virtual host, enabling container deployment across a cluster of machines with built-in load balancing, service discovery, and high availability.
Swarm vs Compose vs Kubernetes
In short: Compose targets a single host, Swarm clusters multiple Docker hosts with modest operational overhead, and Kubernetes offers the richest feature set at the highest complexity (see "When to Choose What" below).
When to Use Docker Swarm
Good for:
Small to medium production deployments
Teams already familiar with Docker
Simple orchestration needs
When you want native Docker CLI integration
Quick setup without complex configuration
Not ideal for:
Very large scale (1000+ containers)
Complex networking requirements
Advanced scheduling needs
When industry-standard tooling is required (Kubernetes dominates)
Swarm Architecture
Key Swarm Concepts
Nodes: Individual Docker hosts in the cluster
Manager nodes: Control the cluster, schedule tasks, maintain state
Worker nodes: Execute containers (tasks)
Services: The definition of tasks to run (like Compose services)
Replicated services: Run specified number of replicas across the cluster
Global services: Run one task on every node
Tasks: Individual container instances running as part of a service
Stack: Group of services deployed together (like docker-compose.yml for Swarm)
Basic Swarm Commands
Initialize a Swarm:
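On the first manager node:
```bash
docker swarm init --advertise-addr <manager-ip>
```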
Join Swarm:
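On each additional node, using the token printed by init (or retrieved later from a manager):
```bash
docker swarm join-token worker   # run on a manager to print the full join command
docker swarm join --token <token> <manager-ip>:2377
```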
Deploy a stack:
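```bash
docker stack deploy -c docker-compose.yml my-stack
```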
Manage services:
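Common service operations (service name is a placeholder):
```bash
docker service ls
docker service ps my-stack_web
docker service scale my-stack_web=5
docker service logs my-stack_web
```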
Node management:
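```bash
docker node ls
docker node inspect <node>
docker node update --availability drain <node>   # stop scheduling tasks on a node
```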
Swarm Compose File Example
docker-compose.yml for Swarm:
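A sketch exercising the Swarm-specific keys listed below (image and values are illustrative):
```yaml
services:
  web:
    image: myorg/my-app:latest
    networks:
      - backend
    deploy:
      replicas: 3
      placement:
        constraints:
          - node.role == worker
      update_config:
        parallelism: 1
        delay: 10s
      resources:
        limits:
          cpus: "0.5"
          memory: 512m

networks:
  backend:
    driver: overlay
```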
Key Swarm-specific features:
deploy section defines the deployment strategy
replicas sets the number of container instances
placement.constraints controls node placement
update_config defines the rolling update strategy
resources sets CPU/memory limits
overlay network enables cross-host communication
Deploy the stack:
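```bash
docker stack deploy -c docker-compose.yml pipeline
```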
Swarm Secrets Management
Swarm has built-in secrets management (unlike standalone Docker):
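Create secrets from stdin or a file (name and value are illustrative):
```bash
echo "supersecret" | docker secret create db_password -
docker secret ls
```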
In docker-compose.yml:
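A sketch referencing the secret created above:
```yaml
services:
  db:
    image: postgres:16
    secrets:
      - db_password
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true
```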
Swarm Networking
Overlay networks enable container communication across hosts:
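For example (network name is a placeholder; --attachable lets standalone containers join too):
```bash
docker network create --driver overlay --attachable data-net
```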
Ingress networking: Built-in load balancing
Any node in the cluster can receive traffic on published ports
Requests are automatically routed to healthy containers
Works even if the node doesn't run the service
Monitoring Swarm
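A few built-in views (service name is a placeholder):
```bash
docker node ls                              # cluster health at a glance
docker service ls                           # replica counts per service
docker service ps --no-trunc my-stack_etl   # task history, including failures
docker events                               # live cluster event stream
```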
When to Choose What
Use Docker Compose when:
Local development
Single-host deployments
Simple testing environments
Use Docker Swarm when:
Small to medium production deployments (< 100 nodes)
Need high availability without complexity
Team is comfortable with Docker but not Kubernetes
Want built-in secrets management
Need quick setup
Use Kubernetes when:
Large-scale production (100+ nodes)
Complex microservices architecture
Need advanced features (operators, custom resources)
Industry-standard tooling required
Have ops team with K8s expertise
Troubleshooting
Container won't start:
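First steps:
```bash
docker logs <container>               # read the error output
docker inspect <container>            # check config, exit code, restart policy
docker run --rm -it <image> /bin/sh   # try the image interactively
```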
Port already in use:
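Find what holds the port, or map a different host port (8080 is illustrative):
```bash
lsof -i :8080                       # on Linux/macOS; identify the conflicting process
docker run -d -p 8081:80 my-image   # or simply pick another host port
```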
Permission denied (volumes):
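Usually a UID mismatch between the container user and the host directory; one hedged fix (UID and path are illustrative):
```bash
docker exec my-container id          # find the UID the container runs as
sudo chown -R 1000:1000 ./host-dir   # align host ownership with that UID
```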
Out of disk space:
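Reclaim space with:
```bash
docker system df         # see what's using space
docker system prune -a   # remove unused containers, networks, and images
docker volume prune      # remove unused volumes (destructive; check first)
```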
Container networking issues:
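Check attachment and name resolution (names are placeholders):
```bash
docker network inspect <network>       # verify both containers are attached
docker exec <container> ping <other>   # test DNS and reachability, if ping exists
```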
Quick Reference
Most Used Commands:
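A compact sketch (placeholders in angle brackets):
```bash
docker ps -a
docker logs -f <container>
docker exec -it <container> bash
docker compose up -d
docker compose down
docker system prune
```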
Data Engineering Essentials:
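For example (image and credentials are illustrative):
```bash
# Throwaway Postgres with persistent data
docker run -d --name pg -p 5432:5432 \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16

# Bring up the project stack and watch resource usage
docker compose up -d && docker stats
```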
Additional Resources
Official Documentation: https://docs.docker.com
Container Registries: Docker Hub, Amazon ECR, Google Artifact Registry, GitHub Container Registry, Azure Container Registry, Harbor (self-hosted)