Docker, Docker Compose, and Docker Swarm


Docker & Docker Compose for Data Engineering

A comprehensive guide to containerizing data pipelines, databases, and analytics workflows.



What is Docker?

Docker packages applications with their complete runtime environments into lightweight, portable containers. Unlike virtual machines that virtualize hardware, containers virtualize at the operating system level—sharing the host kernel while maintaining strong isolation through namespaces and control groups (cgroups).

Why Docker for Data Engineering?

  • Reproducible environments across development, testing, and production

  • Version-controlled infrastructure alongside your code

  • Easy orchestration of complex data stacks (databases, ETL tools, streaming platforms)

  • Simplified dependency management for Python/R/Java data tools

  • Rapid environment teardown and rebuild during development


Core Concepts

Images vs Containers

Image: A read-only template defining what a container should be. Think of it as a class in OOP.

Container: A running instance of an image with its own filesystem, network, and process space. Think of it as an object instantiated from a class.

Key Insight: Deleting and recreating a container resets it to the original image state. Restarting a stopped container preserves changes made during its lifetime.

Container vs Virtual Machine

Docker Architecture

Docker Daemon: The background service that manages containers. Must be running for Docker to function.

Docker Desktop: A GUI wrapper (optional on Linux) that manages the daemon, provides visualization, and on Mac/Windows, handles the Linux VM that actually runs Docker.

Docker CLI: Your primary interface. Communicates directly with the daemon via socket/API.

Start here: docker --help or docker <command> --help for specific guidance.


Essential Commands

Container Lifecycle

Run a container:

Common flags:

  • -d: Detached mode (runs in background)

  • -p host:container: Port mapping. With -p 8080:80, requests to localhost:8080 on the host are forwarded to port 80 in the container, and responses travel back along the same path, so the mapping is a complete bidirectional channel.

  • -v volume:path: Mount volume for persistence

  • -e KEY=VALUE: Set environment variables

  • --name: Assign a custom name

  • --network: Attach to a specific network

  • --rm: Auto-remove container when stopped

  • -m, --memory: Limit memory allocation. A positive integer plus a unit: b, k, m, or g (e.g., 512m)

  • --memory-swap: Total memory (RAM) plus swap space (virtual memory) the container may use
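A hedged sketch combining several of these flags (container name, image, and values are illustrative):

```shell
# Run PostgreSQL detached with a name, port map, named volume,
# an environment variable, and memory caps
docker run -d \
  --name my-postgres \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  --memory=1g --memory-swap=2g \
  postgres:16
```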

List containers:

Stop/Start/Restart:

Remove containers:
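The lifecycle commands above, sketched with an illustrative container name:

```shell
docker ps                      # list running containers
docker ps -a                   # include stopped containers
docker stop my-postgres        # graceful stop (SIGTERM, then SIGKILL)
docker start my-postgres       # start a stopped container (state preserved)
docker restart my-postgres
docker rm my-postgres          # remove a stopped container
docker rm -f my-postgres       # force-remove a running one
```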

Working with Images

List images:

Pull/Push images:

Remove images:

Search Docker Hub:
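The image commands above as a hedged sketch (image and repository names are illustrative):

```shell
docker images                     # list local images
docker pull postgres:16           # download from Docker Hub
docker push myuser/etl-job:1.0    # upload (requires docker login first)
docker rmi postgres:16            # remove a local image
docker search postgres            # search Docker Hub
```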


Building Custom Images

Dockerfile Basics

A Dockerfile is Infrastructure as Code for your container image. Place it in your project root.

Data Engineering Example - Python ETL Container:

Build the image:
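A minimal sketch of a Python ETL image, assuming a requirements.txt and an etl.py entry script (both names illustrative):

```dockerfile
FROM python:3.12-slim
WORKDIR /app
# Copy the dependency file first so this layer stays cached
# until requirements.txt actually changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
CMD ["python", "etl.py"]
```

Build it with:

```shell
docker build -t etl-job:latest .
```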

Key Dockerfile Instructions:

  • FROM: Base image to build upon

  • WORKDIR: Set working directory

  • COPY/ADD: Copy files from host to image

  • RUN: Execute commands during build (install packages, etc.)

  • ENV: Set environment variables

  • EXPOSE: Document which ports the container listens on

  • CMD: Default command when container starts

  • ENTRYPOINT: Configure container as executable

CMD vs ENTRYPOINT: see the Dockerfile reference in the official Docker docs.

Best Practices:

  • Use official base images when possible

  • Copy dependency files before application code (layer caching)

  • Minimize layers by combining RUN commands

  • Use .dockerignore to exclude unnecessary files

  • Don't store secrets in images (use environment variables or secrets management)


Executing Commands in Containers

One-off Commands

Interactive Shell Sessions

Flags:

  • -i: Interactive mode (keep STDIN open)

  • -t: Allocate pseudo-TTY (keyboard interface)
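Both styles with docker exec, using an illustrative container name:

```shell
docker exec my-postgres psql -U postgres -c "SELECT 1;"   # one-off command
docker exec -it my-postgres bash                          # interactive shell
```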


Monitoring & Debugging

Viewing Logs

Resource Monitoring

Check resource usage:

View processes in container:
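The monitoring commands above, sketched with an illustrative container name:

```shell
docker logs my-postgres                 # dump all logs
docker logs -f --tail 100 my-postgres   # follow the last 100 lines
docker stats                            # live CPU/memory/network usage
docker top my-postgres                  # processes running in the container
```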

Resource Limits

Control resource allocation to prevent containers from consuming excessive resources:

Memory suffixes: b, k, m, g (e.g., 512m, 2g)
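A hedged sketch of setting limits at run time and adjusting them on a running container (values illustrative):

```shell
docker run -d --memory=512m --memory-swap=1g --cpus=1.5 postgres:16
docker update --memory=1g --memory-swap=2g my-postgres   # adjust in place
```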

Resource Constraints Documentation (official Docker docs)


docker cp

Copy files between your host machine and a container:

Example scenario:
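Both directions, with illustrative paths and container name:

```shell
docker cp ./config.yaml my-postgres:/etc/app/config.yaml   # host to container
docker cp my-postgres:/var/log/app.log ./app.log           # container to host
```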


docker diff

Shows what files have been modified in a container since it started:

Example output:

  • A = Added file

  • C = Changed file/directory

  • D = Deleted file

Example scenario:

Output might show:
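An illustrative invocation and output (the paths are made up for the example):

```shell
docker diff my-postgres
# C /var/lib/postgresql/data
# A /var/lib/postgresql/data/pg_wal/000000010000000000000001
# D /tmp/startup.lock
```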

This is useful for debugging or understanding what a container modified before committing it to an image.


Persistent Storage: Volumes

Containers are ephemeral by design. Volumes provide persistent storage that survives container deletion.

Volume Types

Named Volumes (Recommended): Managed by Docker, stored in Docker's storage directory.

Anonymous Volumes: Created automatically, harder to reference later.

Bind Mounts: Direct mapping to host filesystem paths.

Working with Volumes

Using Volumes

Named volume example:

Bind mount example (development):

Multiple volumes:
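The three patterns above as a hedged sketch (volume and image names illustrative):

```shell
docker volume create pgdata
docker volume ls
docker volume inspect pgdata

# Named volume (recommended for databases)
docker run -d -v pgdata:/var/lib/postgresql/data postgres:16

# Bind mount (development: edit code on the host, see changes in the container)
docker run -d -v "$(pwd)/src:/app/src" etl-job:latest

# Multiple volumes on one container
docker run -d \
  -v pgdata:/var/lib/postgresql/data \
  -v "$(pwd)/init:/docker-entrypoint-initdb.d" \
  postgres:16
```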

Data Engineering Use Cases:

  • Database storage (PostgreSQL, MongoDB, Elasticsearch)

  • Persistent logs (Airflow, Spark)

  • Shared datasets between containers

  • ML model artifacts and checkpoints

  • Configuration files


Networking

Containers need networking for inter-service communication and external access.

Network Modes

Bridge (default): Private network for container communication.

Host: Container shares host's network stack.

None: No networking (isolated mode).

Custom networks: User-defined bridges with DNS resolution.

Working with Networks

Connecting Containers

Example: Connect application to database

Network Communication Diagram:

Key Feature: Docker provides automatic DNS resolution. Container etl-job can connect to postgres by name, not IP address.

Network Isolation

Offline mode (for security or testing):

Test connectivity:
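Putting the section together, a hedged sketch (network, container, and image names illustrative):

```shell
docker network create data-net                  # user-defined bridge with DNS
docker network ls
docker run -d --name postgres --network data-net postgres:16
docker run -d --name etl-job --network data-net etl-job:latest
# etl-job can now reach the database at the hostname "postgres"

docker run -d --network none etl-job:latest     # fully isolated (offline mode)
docker exec etl-job ping -c 1 postgres          # test connectivity by name
```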


Note on Docker Compose V2:

  1. Commands: docker compose (V2) replaces the older docker-compose (V1) binary.

  2. File name: compose.yaml is the preferred name under the Compose Specification (docker-compose.yml still works).

  3. Version key: the top-level version: '3.8' line is obsolete; Compose V2 defaults to the latest schema automatically.


Docker Compose (V2) Cheatsheet

Docker Compose orchestrates multi-container applications using a declarative YAML file. Essential for data engineering where you typically run databases, message queues, ETL tools, and monitoring together.

Why Use Compose?

  • Define entire data stack in one file

  • Single command to start/stop all services

  • Automatic network creation and DNS resolution (service names = hostnames)

  • Service dependencies and health checks

Basic Compose File

File: compose.yaml (Preferred) or docker-compose.yml
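A minimal sketch of a data-engineering Compose file (service names, credentials, and the etl build context are illustrative):

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-secret}
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  etl:
    build: .
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgresql://postgres:secret@postgres:5432/postgres

volumes:
  pgdata:
```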

Essential Compose Commands


Common Docker Compose Commands

Here are some of the most frequently used commands and their functions:

  • docker compose up: Creates and starts all services defined in the Compose file. Add -d for detached mode (run in the background).

  • docker compose down: Stops and removes the containers and networks created by up (add -v to also remove volumes).

  • docker compose ps: Lists the running containers for the current project.

  • docker compose logs: Views the aggregated log output from all containers.

  • docker compose build: Builds or rebuilds service images.

  • docker compose pull: Pulls service images from a registry (e.g., Docker Hub).

  • docker compose start: Starts existing, stopped service containers.

  • docker compose stop: Stops running service containers without removing them.

  • docker compose restart: Restarts service containers.

  • docker compose exec: Executes a command in a running container.

  • docker compose run: Runs a one-off command on a service (e.g., running tests or migrations).

For a full list of commands and options, refer to the Docker Compose CLI reference.

Data Engineering Stack Example

Complete pipeline with database, message queue, and processing.

Environment Variables

1. Using .env file (Default)

Create .env in the same directory as compose.yaml. Compose automatically loads these.

Reference in YAML: ${VAR_NAME} or ${VAR_NAME:-default_value}.
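A hedged sketch, with illustrative variable names:

```shell
# .env (same directory as compose.yaml; loaded automatically)
POSTGRES_PASSWORD=secret
POSTGRES_PORT=5432
```

Referenced in compose.yaml:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${POSTGRES_PORT:-5432}:5432"
```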

2. Override for different environments

Advanced Compose Features

Service Dependencies with Health Checks

Ensures services don't start until their dependencies are actually ready (not just running).
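A sketch using a PostgreSQL health check (service names illustrative):

```yaml
services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  etl:
    build: .
    depends_on:
      postgres:
        condition: service_healthy
```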

Resource Limits

Limit CPU and RAM to prevent crashes on your local machine.
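A sketch (values illustrative; Compose V2 honors deploy.resources.limits outside Swarm as well):

```yaml
services:
  spark-worker:
    image: bitnami/spark:latest
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2g
```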

Profiles (Selective Startup)

Great for running only parts of the stack (e.g., only dev tools, or skipping heavy monitoring).
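A sketch, assuming a pgAdmin dev tool (service names illustrative):

```yaml
services:
  postgres:
    image: postgres:16          # no profile: always starts

  pgadmin:
    image: dpage/pgadmin4
    profiles: ["dev"]           # starts only with --profile dev
```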

Command: docker compose --profile dev up -d


Authentication & Security

Docker Hub Authentication

Login to Docker Hub:

Tag and push images:
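A hedged sketch (the myuser account and image names are illustrative):

```shell
docker login                                    # prompts for Docker Hub credentials
docker tag etl-job:latest myuser/etl-job:1.0    # prefix with your Docker Hub username
docker push myuser/etl-job:1.0
```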

Alternative Container Registries

Amazon ECR (Elastic Container Registry):

Google Artifact Registry:

GitHub Container Registry (GHCR):

Azure Container Registry:

Security Best Practices

1. Use secrets, not environment variables for sensitive data:

Docker Compose with secrets (Swarm mode):

2. Run containers as non-root:

3. Use official images from trusted sources

4. Scan images for vulnerabilities:
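Two common options, assuming the scanners are installed (image name illustrative):

```shell
docker scout cves etl-job:latest   # Docker Scout, bundled with recent Docker Desktop
trivy image etl-job:latest         # Trivy, a popular open-source scanner
```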

5. Don't store secrets in images:

  • Use environment variables

  • Use Docker secrets

  • Use external secret management (Vault, AWS Secrets Manager)

6. Use .dockerignore:


Data Engineering Workflow Examples

Example 1: PostgreSQL + dbt + Jupyter

Example 2: Airflow Data Pipeline

Example 3: Spark Cluster


CI/CD Integration

Modern deployment workflow with Docker:

  1. Developer pushes code to Git repository

  2. CI/CD pipeline triggers (GitHub Actions, GitLab CI, Jenkins)

  3. Pipeline builds Docker image

  4. Pipeline runs tests in container

  5. Pipeline tags image with version/commit SHA

  6. Pipeline pushes image to registry (Docker Hub, ECR, GHCR)

  7. Deployment system pulls latest image

  8. Old containers are replaced with new ones

Example GitHub Action:
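A minimal sketch of such a workflow, assuming Docker Hub credentials stored as repository secrets (secret, user, and image names are illustrative):

```yaml
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: myuser/etl-job:${{ github.sha }}
```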


Docker Swarm

Docs: Swarm mode (official Docker documentation)

Docker Swarm is Docker's native clustering and orchestration solution. It turns multiple Docker hosts into a single virtual host, enabling container deployment across a cluster of machines with built-in load balancing, service discovery, and high availability.

Swarm vs Compose vs Kubernetes

When to Use Docker Swarm

Good for:

  • Small to medium production deployments

  • Teams already familiar with Docker

  • Simple orchestration needs

  • When you want native Docker CLI integration

  • Quick setup without complex configuration

Not ideal for:

  • Very large scale (1000+ containers)

  • Complex networking requirements

  • Advanced scheduling needs

  • When industry-standard tooling is required (Kubernetes dominates)

Swarm Architecture

Key Swarm Concepts

Nodes: Individual Docker hosts in the cluster

  • Manager nodes: Control the cluster, schedule tasks, maintain state

  • Worker nodes: Execute containers (tasks)

Services: The definition of tasks to run (like Compose services)

  • Replicated services: Run specified number of replicas across the cluster

  • Global services: Run one task on every node

Tasks: Individual container instances running as part of a service

Stack: Group of services deployed together (like docker-compose.yml for Swarm)

Basic Swarm Commands

Initialize a Swarm:

Join Swarm:

Deploy a stack:

Manage services:

Node management:
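The commands above in sequence (addresses, tokens, and names are illustrative):

```shell
docker swarm init --advertise-addr 192.168.1.10             # on the first manager
docker swarm join --token <worker-token> 192.168.1.10:2377  # on each worker
docker stack deploy -c docker-compose.yml mystack
docker service ls
docker service scale mystack_etl=3
docker service logs mystack_etl
docker node ls
docker node update --availability drain node-2              # stop scheduling here
```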

Swarm Compose File Example

docker-compose.yml for Swarm:
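A sketch illustrating the Swarm-specific keys listed below (image and network names illustrative):

```yaml
services:
  etl:
    image: myuser/etl-job:1.0
    deploy:
      replicas: 3
      placement:
        constraints:
          - node.role == worker
      update_config:
        parallelism: 1
        delay: 10s
      resources:
        limits:
          cpus: "1.0"
          memory: 512m
    networks:
      - data-overlay

networks:
  data-overlay:
    driver: overlay
```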

Key Swarm-specific features:

  • deploy section defines deployment strategy

  • replicas sets number of container instances

  • placement.constraints controls node placement

  • update_config defines rolling update strategy

  • resources sets CPU/memory limits

  • overlay network enables cross-host communication

Deploy the stack:

Swarm Secrets Management

Swarm has built-in secrets management (unlike standalone Docker):

In docker-compose.yml:
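A hedged sketch: create a secret on a manager node, then reference it from a service (the db_password name is illustrative; the official postgres image reads POSTGRES_PASSWORD_FILE):

```shell
echo "s3cr3t" | docker secret create db_password -
docker secret ls
```

```yaml
services:
  postgres:
    image: postgres:16
    secrets:
      - db_password
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true
```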

Swarm Networking

Overlay networks enable container communication across hosts:
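Creating one (name illustrative; --attachable lets standalone containers join as well):

```shell
docker network create --driver overlay --attachable data-overlay
```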

Ingress networking: Built-in load balancing

  • Any node in the cluster can receive traffic on published ports

  • Requests are automatically routed to healthy containers

  • Works even if the node doesn't run the service

Monitoring Swarm

When to Choose What

Use Docker Compose when:

  • Local development

  • Single-host deployments

  • Simple testing environments

Use Docker Swarm when:

  • Small to medium production deployments (< 100 nodes)

  • Need high availability without complexity

  • Team is comfortable with Docker but not Kubernetes

  • Want built-in secrets management

  • Need quick setup

Use Kubernetes when:

  • Large-scale production (100+ nodes)

  • Complex microservices architecture

  • Need advanced features (operators, custom resources)

  • Industry-standard tooling required

  • Have ops team with K8s expertise


Troubleshooting

Container won't start:

Port already in use:

Permission denied (volumes):

Out of disk space:

Container networking issues:
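The checks above as a hedged command sketch (container names, ports, and the UID are illustrative):

```shell
docker logs <container>                 # container won't start: read the logs
docker inspect <container>              # full config, mounts, and exit code
lsof -i :5432                           # port in use: find the conflicting process
sudo chown -R 1000:1000 ./data          # volume permission errors: match the container's UID
docker system df                        # disk usage by images/containers/volumes
docker system prune -a --volumes        # reclaim space (deletes unused data!)
docker exec <container> ping -c 1 other-service   # test DNS and connectivity
```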


Quick Reference

Most Used Commands:

Data Engineering Essentials:


Additional Resources

Official Documentation:

Data Engineering Tools with Docker:

Container Registries:
