Data Lineage

Open-source data lineage tools: https://www.montecarlodata.com/blog-open-source-data-lineage-tools/#apache-atlasarrow-up-right


Data lineage is the visual lifecycle of your data. It tracks the data's journey from its origin, through every transformation it undergoes, to its final destination.

If Data Observability is the "heart monitor" checking the health of the system, Data Lineage is the GPS map showing exactly where the data traveled.

The 3 Core Components

Lineage answers three fundamental questions about any piece of data (a table, a column, or a metric):

  1. Who created this? (Source/Origin)

    • Example: This data came from the Salesforce API ingestion job.

  2. What happened to it? (Transformation)

    • Example: It was filtered to remove "inactive" users, joined with the Region table, and the price was multiplied by the tax_rate.

  3. Where is it going? (Destination/Consumption)

    • Example: It is currently feeding the CEO_Weekly_Report dashboard and the Marketing_Email_Tool.

Why Data Engineers Need It

Lineage is arguably the most useful tool for day-to-day engineering work because it solves two massive problems:

1. Root Cause Analysis (Looking Backwards)

When a chart on a dashboard looks wrong (e.g., "Why is revenue zero today?"), you don't have to guess. You look at the lineage graph and trace the line upstream (backwards).

  • Scenario: You trace the "Revenue" column back through 3 SQL views and a Python script, finally finding that the source raw file from the payment processor was empty.

2. Impact Analysis (Looking Forwards)

Before you make a code change, you need to know who you might break. You trace the lineage downstream (forwards).

  • Scenario: You want to rename the column user_id to customer_id. Lineage warns you: "If you change this, you will break 3 Executive Dashboards and 1 Machine Learning model that depend on user_id."

Levels of Granularity

Not all lineage is created equal. It typically comes in two depths:

  • Table-Level Lineage: Shows that Table A feeds Table B. It is high-level and good for general overviews.

  • Column-Level Lineage: This is the gold standard. It shows that Column X in Table A was used to calculate Column Y in Table B. This is much harder to implement but necessary for debugging complex calculations.

Technical Context: OpenLineage

Since you are learning Apache Spark, you will likely encounter OpenLineage.

Spark jobs are notoriously difficult to track because they happen in memory across distributed clusters. OpenLineage is an open standard that "watches" your Spark jobs as they run and automatically draws these lineage maps for you, so you don't have to document them manually.


Last updated