What is IaC?
Infrastructure as Code (IaC): A Deep but Practical Overview
Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure using machine-readable configuration files rather than manual processes or GUI consoles. For modern data engineering and cloud workflows, IaC is the backbone that makes cloud resources predictable, repeatable, and automatable.
Why IaC Exists: Historical Context
IaC stems from configuration management ideas that go back to the 1970s–1980s, with the tooling arriving later (think CFEngine in the 1990s → Puppet/Chef in the 2000s → Terraform/CloudFormation). The evolution is:
Early Configuration Management (CM)
Tools like CFEngine, and later Puppet and Chef, automated software installs and OS configuration; revision-control tools like RCS tracked the configuration files themselves.
Someone still had to provision the VMs/servers manually.
Cloud Era
Clouds (AWS/GCP/Azure) introduced highly programmable APIs.
This shifted the focus from configuring machines → creating infrastructure itself programmatically.
Modern IaC
Tools like Terraform, CloudFormation, Pulumi, Crossplane, and CDK allow teams to create entire systems (VPCs, clusters, IAM, pipelines) via code.
What IaC Actually Does
IaC formalizes infrastructure the same way source code formalizes application logic.
IaC = Desired State of Infrastructure + Automated Engine to Reach That State
It delivers:
Automation: Everything (VPCs, IAM, databases, compute clusters, data pipelines) is created and updated automatically.
Reproducibility: Same config file → same environment, so production and staging are far less likely to drift apart.
Version Control: IaC files go into Git. You get:
history
diffs
pull requests
code reviews
rollbacks
Idempotence: Applying the same IaC code multiple times yields the same result.
Scalability for Data Engineering: Especially helpful for:
spinning up EMR/Dataproc clusters
creating S3/GCS buckets
creating IAM roles/policies
infrastructure for Kafka, Kinesis, Redshift, Snowflake loaders
serverless data pipelines
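The "desired state plus engine" formula and the idempotence property above can be illustrated with a small Python sketch. This is a toy model, not how Terraform or CloudFormation is implemented: the `reconcile` function, the dictionaries standing in for cloud resources, and the resource names are all hypothetical.

```python
from typing import Dict, List

Resources = Dict[str, dict]  # resource name -> its configuration


def reconcile(desired: Resources, actual: Resources) -> List[str]:
    """Mutate `actual` until it matches `desired`; return the actions taken."""
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actual[name] = dict(config)
            actions.append(f"create {name}")
        elif actual[name] != config:
            actual[name] = dict(config)
            actions.append(f"update {name}")
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(f"delete {name}")
    return actions


# Hypothetical desired state for a small data platform.
desired = {
    "bucket/raw-data": {"versioning": True},
    "role/etl-job": {"policy": "read-raw-data"},
}
cloud: Resources = {}  # stands in for what currently exists in the account

first_run = reconcile(desired, cloud)
second_run = reconcile(desired, cloud)  # same config applied a second time

print(first_run)   # two create actions
print(second_run)  # nothing left to do
```

The second run returns an empty list because the engine compares against what already exists instead of replaying commands, which is the property the Idempotence point describes.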
Tooling Landscape (with quick mental models)
1) Terraform
Cloud-agnostic (“multi-cloud, multi-provider”)
Uses HCL (declarative)
Manages lifecycle using state files (local or remote)
Has a rich provider ecosystem (AWS, dbt Cloud, GitHub, Kafka, Snowflake, Datadog…)
Best choice for multi-cloud or data engineering platforms needing broad integration
2) AWS CloudFormation
AWS-native equivalent to Terraform
Templates written in YAML/JSON
Tight integration with AWS; new features are sometimes supported here before the Terraform AWS provider catches up, though not always
No state file to manage (AWS tracks stack state internally)
Generally:
Use CloudFormation if you're fully AWS
Use Terraform if you're multi-cloud or want stronger ergonomics
3) Pulumi
IaC using general-purpose languages (Python/TS/Go/C#) → imperative-style code that still declares a desired state for the engine to reconcile
More dynamic than HCL/YAML
Good for teams that prefer full programming power
4) AWS CDK / CDKTF
“IaC as real code” — similar to Pulumi
CDK for AWS, CDKTF for Terraform providers
5) Crossplane
Kubernetes-native IaC
Infrastructure resources are Kubernetes CRDs
Great for platform engineering
Declarative vs Imperative IaC
Declarative
You define the desired end state; engine figures out the steps
Terraform, CloudFormation, Kubernetes manifests
Idempotent, concise, predictable
Harder to express loops, conditionals, programmatic logic
Imperative
You write commands telling HOW to get to that state
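Bash scripts, Ansible playbooks (procedural, though many modules are idempotent), CDK (hybrid: imperative code that generates a declarative template)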
Very flexible, good for complex logic
Higher chance of divergent state; less repeatable
Why does IaC favor declarative?
Cloud APIs can be flaky (timeouts, throttling, partial failures)
Imperative scripts risk double-creating, overwriting, or skipping resources
Declarative engines maintain execution plans and state
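Terraform plan example: running terraform plan lists every resource that would be added, changed, or destroyed, without applying anything.
This makes infrastructure changes auditable.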
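To show why a plan step makes changes auditable, here is a minimal Python sketch of a plan computed as a diff between recorded state and desired configuration. It is a simplification under stated assumptions: the `plan` and `summarize` functions and all resource names are hypothetical, and the output only loosely mirrors Terraform's convention of marking additions with `+`, in-place changes with `~`, and deletions with `-`.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]
Change = Tuple[str, str]  # (symbol, resource name)


def plan(desired: Resources, state: Resources) -> List[Change]:
    """Compare desired config with recorded state; modify nothing."""
    changes = []
    for name in sorted(set(desired) | set(state)):
        if name not in state:
            changes.append(("+", name))  # would be created
        elif name not in desired:
            changes.append(("-", name))  # would be destroyed
        elif desired[name] != state[name]:
            changes.append(("~", name))  # would be updated in place
    return changes


def summarize(changes: List[Change]) -> str:
    """Count the proposed changes by kind for a reviewer-friendly summary."""
    counts = {"+": 0, "~": 0, "-": 0}
    for symbol, _ in changes:
        counts[symbol] += 1
    return f"{counts['+']} to add, {counts['~']} to change, {counts['-']} to destroy"


# Hypothetical recorded state and new desired configuration.
state = {
    "bucket/raw-data": {"versioning": False},
    "stream/legacy-events": {"shards": 1},
}
desired = {
    "bucket/raw-data": {"versioning": True},
    "role/etl-job": {"policy": "read-raw-data"},
}

changes = plan(desired, state)
for symbol, name in changes:
    print(symbol, name)
print(summarize(changes))
```

Because the diff is produced before anything is modified, it can be attached to a pull request and reviewed like any other code change.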
IaC Workflow in Real Teams
Here’s how IaC is used day-to-day:
Developer writes or updates Terraform code
Push to Git → pull request
CI runs:
terraform fmt
terraform validate
terraform plan
post plan diff as comment in PR
Reviewer approves
CI runs:
terraform apply
Infrastructure updates automatically
Results stored in Terraform state
For CloudFormation, you use Change Sets and CI pipelines similarly.
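The Terraform workflow above can be sketched as a small Python driver that runs the CLI steps in order and stops at the first failure. This is a hedged illustration, not a recommended CI setup: the `run_steps` helper, the `tfplan` file name, and the split into `PR_STEPS` and `MERGE_STEPS` are assumptions, and the sketch assumes the working directory has already been initialized with `terraform init` and that the saved plan file is carried over to the apply stage.

```python
import subprocess
from typing import Callable, List, Sequence

# Steps run on a pull request; "tfplan" is an arbitrary plan file name.
PR_STEPS = [
    ["terraform", "fmt", "-check"],
    ["terraform", "validate"],
    ["terraform", "plan", "-input=false", "-out=tfplan"],
]
# Step run after the reviewer approves and the change is merged.
MERGE_STEPS = [
    ["terraform", "apply", "-input=false", "tfplan"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_steps(
    steps: List[List[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[str]:
    """Run steps in order; raise on the first non-zero exit code."""
    completed = []
    for cmd in steps:
        if runner(cmd) != 0:
            raise RuntimeError(f"step failed: {' '.join(cmd)}")
        completed.append(" ".join(cmd))
    return completed


def dry_run(cmd: Sequence[str]) -> int:
    """Stand-in runner that prints the command instead of executing it."""
    print("would run:", " ".join(cmd))
    return 0


# Dry run so the sketch works without Terraform installed.
run_steps(PR_STEPS, runner=dry_run)
run_steps(MERGE_STEPS, runner=dry_run)
```

Applying the saved plan file, rather than re-planning at apply time, means the change that gets applied is the one the reviewer saw.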
IaC for Data Engineering Infrastructure
IaC becomes critical when building data platforms because pipelines always depend on infrastructure:
Storage
S3 buckets, lifecycle rules, encryption
Compute
EMR, Dataproc, Databricks clusters
Serverless compute like Lambda/Cloud Functions
Networking
VPCs, subnets optimized for clusters
Private connectivity for databases/pipelines
Data Streaming
Kafka topics, schemas, ACLs
Kinesis streams
Data Warehouses
Redshift clusters
BigQuery datasets
Snowflake warehouses
Automation
EventBridge rules
Step Functions/Workflows
Secrets / IAM
Principals for ETL jobs
KMS keys
Vault
Many large teams use Terraform to orchestrate entire data platforms end-to-end.
IaC Anti-Patterns and Pitfalls
Important things teams learn the hard way:
Storing state locally → use remote backends (S3+DynamoDB, GCS, Terraform Cloud)
Mixing environments in the same state file → separate states for dev, prod, etc.
Letting Terraform manage rarely-changing core infra + rapidly changing pipelines → split project into modules and workspaces
Inline IAM policies → use reusable, tested policies or modules
Terraform destroys important resources unintentionally → use prevent_destroy and resource lifecycle policies
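The idea behind prevent_destroy, failing the run when a plan would delete a protected resource, can be illustrated with a short Python sketch. This is a conceptual analog only, not Terraform's implementation: `guard_plan`, `ProtectedResourceError`, and the resource names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Change = Tuple[str, str]  # (action, resource name)


class ProtectedResourceError(Exception):
    """Raised when a plan would delete a resource marked as protected."""


def guard_plan(changes: Iterable[Change], protected: Set[str]) -> List[Change]:
    """Return the plan unchanged, or raise if it deletes a protected resource."""
    changes = list(changes)
    blocked = sorted(
        name for action, name in changes
        if action == "delete" and name in protected
    )
    if blocked:
        raise ProtectedResourceError(
            "refusing to delete protected resources: " + ", ".join(blocked)
        )
    return changes


# Hypothetical plans: one harmless, one that would drop a warehouse.
protected = {"warehouse/analytics"}
safe_plan = [("update", "bucket/raw-data"), ("delete", "stream/legacy-events")]
risky_plan = [("delete", "warehouse/analytics")]

print(guard_plan(safe_plan, protected))
try:
    guard_plan(risky_plan, protected)
except ProtectedResourceError as err:
    print(err)
```

The check runs on the plan, before any change is made, so an accidental deletion stops the pipeline instead of the database.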
Final notes
Think of IaC as GitOps for your cloud infrastructure:
Declare your infrastructure like you declare code
Store it in Git
Apply via automated pipelines
Treat infrastructure changes like code changes
Achieve consistency and repeatability across environments
IaC is not just about automation — it’s about bringing software engineering principles to your cloud environment.
When NOT to Use IaC
Extremely temporary testing resources
Manual one-off debugging
Interactive work in early prototyping
When platform teams already abstract IaC (e.g., a platform portal that creates clusters for you)