Data Lineage
Open-source data lineage tools: https://www.montecarlodata.com/blog-open-source-data-lineage-tools/#apache-atlas
Data lineage is a record of your data's lifecycle, usually shown as a graph. It tracks the data's journey from its origin, through every transformation it undergoes, to its final destination.
If Data Observability is the "heart monitor" checking the health of the system, Data Lineage is the GPS map showing exactly where the data traveled.
The 3 Core Components
Lineage answers three fundamental questions about any piece of data (a table, a column, or a metric):
Who created this? (Source/Origin)
Example: This data came from the Salesforce API ingestion job.
What happened to it? (Transformation)
Example: It was filtered to remove "inactive" users, joined with the Region table, and the price was multiplied by the tax_rate.
Where is it going? (Destination/Consumption)
Example: It is currently feeding the CEO_Weekly_Report dashboard and the Marketing_Email_Tool.
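The three questions above can be captured in a small record per dataset. This is only an illustrative sketch: the LineageRecord class and the dataset name users_enriched are hypothetical, and real lineage tools store far richer metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageRecord:
    """Answers the three lineage questions for one dataset (hypothetical model)."""
    dataset: str
    source: str                                               # Who created this?
    transformations: List[str] = field(default_factory=list)  # What happened to it?
    consumers: List[str] = field(default_factory=list)        # Where is it going?


record = LineageRecord(
    dataset="users_enriched",  # hypothetical dataset name
    source="Salesforce API ingestion job",
    transformations=[
        'filter out "inactive" users',
        "join with Region",
        "multiply price by tax_rate",
    ],
    consumers=["CEO_Weekly_Report", "Marketing_Email_Tool"],
)

print(f"{record.dataset} comes from: {record.source}")
for step in record.transformations:
    print(f"  transformed by: {step}")
for consumer in record.consumers:
    print(f"  consumed by: {consumer}")
```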
Why Data Engineers Need It
Lineage is arguably the most useful tool for day-to-day engineering work because it solves two massive problems:
1. Root Cause Analysis (Looking Backwards)
When a chart on a dashboard looks wrong (e.g., "Why is revenue zero today?"), you don't have to guess. You look at the lineage graph and trace the line upstream (backwards).
Scenario: You trace the "Revenue" column back through 3 SQL views and a Python script, finally finding that the source raw file from the payment processor was empty.
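A minimal sketch of that upstream trace, assuming the lineage graph is available as a plain dictionary. All node names and row counts here are made up for illustration; a real lineage tool would supply the graph and the health checks.

```python
from collections import deque

# Each node maps to the nodes it reads from (its direct upstream parents).
# Hypothetical graph: a dashboard fed by 3 SQL views, a Python script, and a raw file.
UPSTREAM = {
    "revenue_dashboard": ["revenue_view_3"],
    "revenue_view_3": ["revenue_view_2"],
    "revenue_view_2": ["revenue_view_1"],
    "revenue_view_1": ["clean_payments.py"],
    "clean_payments.py": ["raw_payments_file"],
    "raw_payments_file": [],
}

# Hypothetical row counts observed today for the raw sources.
ROW_COUNTS = {"raw_payments_file": 0}


def trace_upstream(node, upstream):
    """Return every ancestor of `node`, nearest first (breadth-first search)."""
    seen, order, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def find_empty_sources(node, upstream, row_counts):
    """Return ancestors that have no parents of their own and contain zero rows."""
    return [
        ancestor
        for ancestor in trace_upstream(node, upstream)
        if not upstream.get(ancestor) and row_counts.get(ancestor, 0) == 0
    ]


print(trace_upstream("revenue_dashboard", UPSTREAM))
print(find_empty_sources("revenue_dashboard", UPSTREAM, ROW_COUNTS))
```

Breadth-first order lists the nearest dependencies first, which roughly matches how you would inspect them by hand.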
2. Impact Analysis (Looking Forwards)
Before you make a code change, you need to know who you might break. You trace the lineage downstream (forwards).
Scenario: You want to rename the column user_id to customer_id. Lineage warns you: "If you change this, you will break 3 Executive Dashboards and 1 Machine Learning model that depend on user_id."
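The downstream check can be sketched the same way, this time following edges forwards. The asset names below are hypothetical, and the graph is assumed to be a dictionary mapping each asset to its direct consumers.

```python
# Each node maps to the assets that read from it (its direct downstream consumers).
# Hypothetical graph: user_id feeds an intermediate view and an ML model.
DOWNSTREAM = {
    "users.user_id": ["user_metrics_view", "churn_model"],
    "user_metrics_view": [
        "exec_dashboard_sales",
        "exec_dashboard_growth",
        "exec_dashboard_retention",
    ],
}


def impacted_by(node, downstream):
    """Return every asset that directly or indirectly depends on `node`."""
    impacted, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in downstream.get(current, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return sorted(impacted)


for asset in impacted_by("users.user_id", DOWNSTREAM):
    print(f"Renaming users.user_id would affect: {asset}")
```

Note that the result includes indirect dependents (the dashboards sit behind the intermediate view), which is exactly what a manual search of direct references tends to miss.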
Levels of Granularity
Not all lineage is created equal. It typically comes in two depths:
Table-Level Lineage: Shows that Table A feeds Table B. It is high-level and good for general overviews.
Column-Level Lineage: This is the gold standard. It shows that Column X in Table A was used to calculate Column Y in Table B. This is much harder to implement but necessary for debugging complex calculations.
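To make the difference concrete, here is a small sketch in which column-level edges are stored as (source, target) pairs and the table-level view is derived by dropping the column names. The tables and columns are hypothetical.

```python
# Column-level lineage as (source_column, target_column) pairs.
# Hypothetical tables: orders and regions feed order_totals.
COLUMN_EDGES = [
    ("orders.price", "order_totals.total_price"),
    ("orders.tax_rate", "order_totals.total_price"),
    ("orders.user_id", "order_totals.user_id"),
    ("regions.region_name", "order_totals.region_name"),
]


def to_table_edges(column_edges):
    """Collapse column-level edges into unique table-level edges."""
    return sorted(
        {(src.split(".")[0], dst.split(".")[0]) for src, dst in column_edges}
    )


def column_inputs(target_column, column_edges):
    """Return the source columns used to compute `target_column`."""
    return sorted(src for src, dst in column_edges if dst == target_column)


print(to_table_edges(COLUMN_EDGES))
print(column_inputs("order_totals.total_price", COLUMN_EDGES))
```

The table-level view only says that orders and regions feed order_totals. The column-level view also says that total_price depends on price and tax_rate, which is the detail you need when a calculation looks wrong.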
Technical Context: OpenLineage
Since you are learning Apache Spark, you will likely encounter OpenLineage.
Spark jobs are notoriously difficult to track because they run in memory across distributed clusters. OpenLineage is an open standard for collecting lineage metadata. Its Spark integration "watches" your jobs as they run and emits lineage events that a compatible backend can turn into these lineage maps, so you don't have to document them manually.
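To give a rough idea of what such a lineage event carries, the sketch below builds a simplified run event by hand using only the standard library. The field names follow the OpenLineage run event structure as I understand it; real events include further fields (such as a schema URL and facets), so check the specification before relying on this shape. The job name, dataset names, namespace, and producer URI are hypothetical. In practice the Spark integration would emit these events for you.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="demo"):
    """Build a simplified, OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        "producer": "https://example.com/hypothetical-producer",  # placeholder URI
    }


event = build_run_event(
    "COMPLETE",
    job_name="daily_revenue_job",   # hypothetical Spark job
    inputs=["raw_payments"],        # hypothetical input dataset
    outputs=["revenue_summary"],    # hypothetical output dataset
)
print(json.dumps(event, indent=2))
```

Each event links one job run to the datasets it read and wrote. Stitching many such events together is what produces the lineage graph.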