# What is IaC?

***

## **Infrastructure as Code (IaC): A Deep but Practical Overview**

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure using machine-readable configuration files rather than manual processes or GUI consoles. For modern data engineering and cloud workflows, IaC is the backbone that makes cloud resources predictable, repeatable, and automatable.

***

### **Why IaC Exists: Historical Context**

IaC builds on configuration management ideas that go back decades, but the tooling lineage that matters here starts in the 1990s (think **CFEngine → Puppet/Chef → Terraform/CloudFormation**). The evolution is:

1. **Early Configuration Management (CM)**
   * Tools like **CFEngine**, and later **Puppet** and **Chef**, automated software installs and OS configuration.
   * Someone still had to provision the VMs/servers manually.
2. **Cloud Era**
   * Clouds (AWS/GCP/Azure) introduced highly programmable APIs.
   * This shifted the focus from configuring machines → **creating infrastructure itself programmatically**.
3. **Modern IaC**
   * Tools like **Terraform**, **CloudFormation**, **Pulumi**, **Crossplane**, and **CDK** allow teams to create entire systems (VPCs, clusters, IAM, pipelines) via code.

***

### **What IaC Actually Does**

IaC formalizes infrastructure the same way source code formalizes application logic.

**IaC = Desired State of Infrastructure + Automated Engine to Reach That State**
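
The formula above can be made concrete with a toy model. This is a minimal sketch under stated assumptions, not how Terraform or CloudFormation is implemented: `desired` and `actual` are hypothetical dictionaries mapping a resource name to its settings, and `reconcile` is an illustrative helper that works out the required actions and applies them to the in-memory `actual` state.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Make `actual` match `desired`; return the actions taken."""
    actions = []
    for name, settings in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != settings:
            actions.append(f"update {name}")
        else:
            continue  # already in the desired state
        actual[name] = dict(settings)
    # sorted() builds a new list, so deleting from `actual` in the loop is safe
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"delete {name}")
        del actual[name]
    return actions


# Hypothetical resource names and settings, for illustration only
desired = {
    "bucket.raw": {"versioning": True},
    "role.etl": {"policy": "read-only"},
}
actual = {
    "role.etl": {"policy": "admin"},
    "lambda.old_loader": {"memory_mb": 128},
}

first_run = reconcile(desired, actual)   # one create, one update, one delete
second_run = reconcile(desired, actual)  # nothing left to change
```

After the first call `actual` equals `desired`, so the second call returns an empty list. That "converge, then do nothing" behavior is what the real engines aim for at much larger scale.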

#### It delivers:

1. **Automation**\
   Everything — VPCs, IAM, databases, compute clusters, data pipelines — is created and updated automatically.
2. **Reproducibility**\
   Same config file → same environment.\
   Production and staging are far less likely to drift apart.
3. **Version Control**\
   IaC files go into Git.\
   You get:
   * history
   * diffs
   * pull requests
   * code reviews
   * rollbacks
4. **Idempotence**\
   Applying the same IaC code multiple times yields the same result.
5. **Scalability for Data Engineering**\
   Especially helpful for:
   * spinning up EMR/Dataproc clusters
   * creating S3/GCS buckets
   * creating IAM roles/policies
   * infrastructure for Kafka, Kinesis, Redshift, Snowflake loaders
   * serverless data pipelines
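
To illustrate "same config file → same environment", here is a hedged sketch that renders a bucket definition per environment in Terraform's JSON syntax. The bucket naming scheme, the tag keys, and the `render_bucket_config` helper are assumptions made for this example; a real setup would also need a provider configuration, which is omitted here.

```python
import json


def render_bucket_config(env: str) -> str:
    """Render Terraform JSON syntax for one environment's raw-data bucket."""
    config = {
        "resource": {
            "aws_s3_bucket": {
                "raw": {
                    "bucket": f"example-data-raw-{env}",  # hypothetical name
                    "tags": {"environment": env, "managed_by": "terraform"},
                }
            }
        }
    }
    # sort_keys keeps the output byte-for-byte stable across runs
    return json.dumps(config, indent=2, sort_keys=True)


staging = render_bucket_config("staging")
prod = render_bucket_config("prod")
```

The two outputs differ only in the environment name, which is the point: one template, many near-identical environments. Written to a file such as `main.tf.json`, Terraform would load the definition alongside regular `.tf` files.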

***

### **Tooling Landscape (with quick mental models)**

#### **1) Terraform**

* **Cloud-agnostic** (“multi-cloud, multi-provider”)
* Uses **HCL (declarative)**
* Manages lifecycle using **state files** (local or remote)
* Has a rich provider ecosystem (AWS, dbt Cloud, GitHub, Kafka, Snowflake, Datadog…)
* Best choice for multi-cloud or data engineering platforms needing broad integration

#### **2) AWS CloudFormation**

* **AWS-native** counterpart to Terraform
* Templates written in **YAML/JSON**
* Tight integration with AWS; new services and features often, though not always, appear here before the Terraform AWS provider supports them
* No state file for you to manage (AWS tracks stack state internally)

Generally:

* Use CloudFormation if you're **fully AWS**
* Use Terraform if you're **multi-cloud** or want stronger ergonomics

#### **3) Pulumi**

* IaC written in general-purpose languages (Python/TS/Go/C#): the program is **imperative code**, but the engine still converges on a declared desired state
* More dynamic than HCL/YAML
* Good for teams that prefer full programming power

#### **4) AWS CDK / CDKTF**

* “IaC as real code” — similar to Pulumi
* CDK for AWS, CDKTF for Terraform providers

#### **5) Crossplane**

* Kubernetes-native IaC
* Infrastructure resources are Kubernetes CRDs
* Great for platform engineering

***

### **Declarative vs Imperative IaC**

| Approach        | Description                                                    | Typical Tools                                   | Pros                                  | Cons                                                      |
| --------------- | -------------------------------------------------------------- | ----------------------------------------------- | ------------------------------------- | --------------------------------------------------------- |
| **Declarative** | You define the desired end state; engine figures out the steps | Terraform, CloudFormation, Kubernetes manifests | Idempotent, concise, predictable      | Harder to express loops, conditionals, programmatic logic |
| **Imperative**  | You write commands telling HOW to get to that state            | Bash, Ansible, CDK (hybrid)                     | Very flexible, good for complex logic | Higher chance of divergent state; less repeatable         |

Why does IaC favor **declarative**?

* Cloud APIs can be flaky
* Imperative scripts risk double-creating, overwriting, or skipping resources
* Declarative engines maintain **execution plans** and **state**

Simplified Terraform plan summary (real `terraform plan` output is more verbose):

```
+ aws_s3_bucket.my_bucket (create)
~ aws_iam_role.pipeline_role (update)
- aws_lambda_function.function (delete)
```

This makes infrastructure changes auditable.
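
The double-creation risk mentioned above is easiest to see with a flaky API and a retry. The sketch below is a toy model: `FlakyCloud`, `with_retry`, and both step functions are hypothetical stand-ins, not a real SDK. The first API call creates the cluster but the response is lost, so the caller retries.

```python
class FlakyCloud:
    """In-memory stand-in for a cloud API whose first response gets lost."""

    def __init__(self):
        self.clusters = []
        self.calls = 0

    def create_cluster(self, name):
        self.calls += 1
        self.clusters.append(name)  # the resource IS created...
        if self.calls == 1:
            raise TimeoutError("response lost")  # ...but the caller never hears back


def with_retry(step, attempts=3):
    """Call `step` until it succeeds or the attempts run out."""
    for _ in range(attempts):
        try:
            return step()
        except TimeoutError:
            continue
    raise RuntimeError("gave up")


def imperative_step(cloud, name):
    cloud.create_cluster(name)  # "do this", with no look at the current state


def declarative_step(cloud, name):
    if name not in cloud.clusters:  # compare desired state with actual state
        cloud.create_cluster(name)


imperative_cloud = FlakyCloud()
with_retry(lambda: imperative_step(imperative_cloud, "nightly-etl"))

declarative_cloud = FlakyCloud()
with_retry(lambda: declarative_step(declarative_cloud, "nightly-etl"))
```

The imperative run ends with two clusters named `nightly-etl`; the declarative run ends with one, because the retry first checks what already exists.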

***

### **IaC Workflow in Real Teams**

Here’s how IaC is used day-to-day:

1. Developer writes or updates Terraform code
2. Push to Git → pull request
3. CI runs:
   * `terraform fmt`
   * `terraform validate`
   * `terraform plan`
   * post plan diff as comment in PR
4. Reviewer approves
5. CI runs `terraform apply`
6. Infrastructure updates automatically
7. Results stored in Terraform state

For CloudFormation, you use Change Sets and CI pipelines similarly.
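
The Terraform steps in the list above could be wired into CI roughly as follows. This is a sketch, not a recommended pipeline: it assumes `terraform init` has already run in the working directory, the plan file name `tfplan` is arbitrary, and `run_steps` and `dry_run` are illustrative helpers. The `run` parameter exists so the commands can be exercised without Terraform installed.

```python
import subprocess

# Checks run on every pull request, in order.
PLAN_STEPS = [
    ["terraform", "fmt", "-check"],
    ["terraform", "validate"],
    ["terraform", "plan", "-input=false", "-out=tfplan"],
]
# Runs only after the pull request has been approved.
APPLY_STEPS = [
    ["terraform", "apply", "-input=false", "tfplan"],
]


def run_steps(steps, run=subprocess.run):
    """Run each command in order; stop at the first non-zero exit code."""
    completed = []
    for cmd in steps:
        result = run(cmd)
        if result.returncode != 0:
            return False, completed
        completed.append(" ".join(cmd))
    return True, completed


def dry_run(cmd):
    """Stand-in runner for demonstration: executes nothing, always succeeds."""
    return subprocess.CompletedProcess(cmd, returncode=0)


ok, completed = run_steps(PLAN_STEPS, run=dry_run)
```

Calling `run_steps(PLAN_STEPS)` without the `run` argument would execute the real commands. Applying the saved `tfplan` file, rather than planning again at apply time, means the change that was reviewed is the change that gets applied.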

***

### **IaC for Data Engineering Infrastructure**

IaC becomes critical when building data platforms because pipelines always depend on infrastructure:

* **Storage**
  * S3 buckets, lifecycle rules, encryption
* **Compute**
  * EMR, Dataproc, Databricks clusters
  * Serverless compute like Lambda/Cloud Functions
* **Networking**
  * VPCs, subnets optimized for clusters
  * Private connectivity for databases/pipelines
* **Data Streaming**
  * Kafka topics, schemas, ACLs
  * Kinesis streams
* **Data Warehouses**
  * Redshift clusters
  * BigQuery datasets
  * Snowflake warehouses
* **Automation**
  * EventBridge rules
  * Step Functions/Workflows
* **Secrets / IAM**
  * Principals for ETL jobs
  * KMS keys
  * Vault

Many teams use Terraform to manage **entire data platforms** end-to-end.

***

### **IaC Anti-Patterns and Pitfalls**

Important things teams learn the hard way:

1. **Storing state locally**\
   → use remote backends with state locking (S3+DynamoDB, GCS, Terraform Cloud)
2. **Mixing environments in the same state file**\
   → separate states for `dev`, `prod`, etc.
3. **Managing rarely-changing core infra and rapidly changing pipelines in the same state**\
   → split the project into modules with separate states (or workspaces)
4. **Inline IAM policies**\
   → use reusable, tested policies or modules
5. **Leaving critical resources unprotected against accidental destroys**\
   → use `prevent_destroy` in the resource's `lifecycle` block
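
As a complement to the last pitfall, a CI job can scan the machine-readable plan for deletions before anything is applied. The sketch below assumes the plan was exported with `terraform show -json tfplan` and reads only the `resource_changes`, `address`, and `change.actions` fields; the `destroyed_resources` helper, the protected prefixes, and the sample plan are made up for illustration.

```python
import json


def destroyed_resources(plan_json: str, protected_prefixes: tuple) -> list:
    """Return addresses of protected resources that the plan would delete."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # A replacement lists both "delete" and "create", so it is caught too.
        if "delete" in actions and rc["address"].startswith(protected_prefixes):
            flagged.append(rc["address"])
    return flagged


# Hand-written sample in the shape of `terraform show -json` output
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.raw", "change": {"actions": ["delete"]}},
        {"address": "aws_iam_role.etl", "change": {"actions": ["update"]}},
        {"address": "aws_redshift_cluster.dw", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_lambda_function.tmp", "change": {"actions": ["delete"]}},
    ]
})

flagged = destroyed_resources(sample_plan, ("aws_s3_bucket.", "aws_redshift_cluster."))
```

A pipeline could fail the build whenever `flagged` is non-empty. This does not replace `prevent_destroy`; it only surfaces the problem earlier, at review time.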

***

### Final notes

Think of IaC as **GitOps for your cloud infrastructure**:

* Declare your infrastructure like you declare code
* Store it in Git
* Apply via automated pipelines
* Treat infrastructure changes like code changes
* Achieve consistency and repeatability across environments

IaC is not just about automation — it’s about bringing **software engineering principles** to your cloud environment.

***

### **When NOT to Use IaC**

* Extremely temporary testing resources
* Manual one-off debugging
* Interactive work in early prototyping
* When platform teams already abstract IaC (e.g., a platform portal that creates clusters for you)

***

