Data Contracts
Data Contract template: https://github.com/paypal/data-contract-template
Lessons for implementing data contracts: https://www.montecarlodata.com/blog-data-contracts/
What is a Data Contract?
A Data Contract is a formal, version-controlled agreement between a Data Producer (the team generating the data) and a Data Consumer (the team using the data for analytics, ML, or reporting).
In the past, data pipelines often broke silently because upstream software engineers would change a database schema without realizing it would crash a downstream dashboard. Data Contracts solve this by treating data as a product with a guaranteed interface.
The Core Purpose: Why do we need them?
In traditional data engineering, there is often a "wall" between the services generating data and the data warehouse.
The Problem: Software engineers change a column name in a microservice to fix a bug. Two hours later, the CEO’s dashboard breaks. The Data Engineer has to scramble to fix the pipeline.
The Solution: A Data Contract forces the software engineers to agree on the data structure before they push code. If they make a breaking change that violates the contract, their deployment fails (via CI/CD checks) before it ever reaches production.
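As a rough illustration of the CI/CD idea above, the sketch below compares a producer's proposed schema against the contracted one and returns a non-zero exit code on a breaking change. The column names, the type labels, and the helper names `find_breaking_changes` and `ci_gate` are all hypothetical; a real setup would more likely load both schemas from files in the repository.

```python
from typing import Dict, List

# Hypothetical contracted schema: column name -> declared type.
CONTRACT_SCHEMA: Dict[str, str] = {
    "order_id": "string",
    "customer_id": "string",
    "total_amount": "decimal",
}


def find_breaking_changes(contract: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """Return human-readable contract violations (empty list if compatible)."""
    problems = []
    for column, expected_type in contract.items():
        if column not in proposed:
            # From the consumer's side, a rename is indistinguishable from a removal.
            problems.append(f"column '{column}' was removed or renamed")
        elif proposed[column] != expected_type:
            problems.append(
                f"column '{column}' changed type from {expected_type} to {proposed[column]}"
            )
    # Columns that exist only in `proposed` are additive and treated as non-breaking here.
    return problems


def ci_gate(contract: Dict[str, str], proposed: Dict[str, str]) -> int:
    """Exit code for a CI step: 0 lets the build pass, 1 fails it."""
    problems = find_breaking_changes(contract, proposed)
    for problem in problems:
        print(f"CONTRACT VIOLATION: {problem}")
    return 1 if problems else 0
```

Treating added columns as non-breaking is a design choice made for this sketch; some teams may prefer to require a contract update for any schema change.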
What is inside a Data Contract?
A data contract is typically a clear specification file (often written in YAML or JSON) that lives in the code repository. It usually contains four key sections:
| Component | Description | Example |
| --- | --- | --- |
| Schema | The technical structure of the data. | Column names, data types (String, Int), and nullability. |
| Semantics | The business meaning and logic of the data. | "Status" can only be 'Active' or 'Churned'; "Price" must be > 0. |
| SLAs (Service Level Agreements) | Operational guarantees regarding the data's reliability. | Data will arrive by 8:00 AM daily; Data will be fresh within 15 mins. |
| Metadata | Governance and ownership details. | Owner: checkout-team, PII: true (contains sensitive info). |
SLAs are only one component: a data contract is a broader agreement that also covers schema definitions, data quality rules, ownership, and governance policies.
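One way to picture the four sections is as a single typed object. The sketch below is only an illustration: the class and field names (`DataContract`, `Column`, `semantics`, `sla`, `metadata`) are assumptions rather than part of any standard, and the values are the examples given above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    name: str
    dtype: str
    nullable: bool = False


@dataclass
class DataContract:
    # Schema: the technical structure of the data.
    columns: List[Column]
    # Semantics: column name -> business rule, kept as plain text in this sketch.
    semantics: Dict[str, str] = field(default_factory=dict)
    # SLAs: operational guarantees.
    sla: Dict[str, str] = field(default_factory=dict)
    # Metadata: governance and ownership details.
    metadata: Dict[str, object] = field(default_factory=dict)

    def column_names(self) -> List[str]:
        return [column.name for column in self.columns]


contract = DataContract(
    columns=[Column("status", "string"), Column("price", "int")],
    semantics={"status": "one of 'Active', 'Churned'", "price": "must be > 0"},
    sla={"delivery": "by 8:00 AM daily", "freshness": "within 15 mins"},
    metadata={"owner": "checkout-team", "pii": True},
)
```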
How it Works (The Workflow)
Definition: The Data Consumer (e.g., a Data Analyst) requests specific data fields they need.
Agreement: The Data Producer (e.g., a Backend Engineer) reviews the request. They agree, "Okay, we promise to deliver these 5 columns in this format every hour."
Codification: This agreement is written into a contract.yaml file.
Enforcement:
CI/CD Checks: If the Producer changes their code (e.g., renames a column), the build system checks the contract.yaml. If the change violates the contract, the build fails.
Runtime Checks: As data flows through the pipeline, it is validated against the contract. Bad data is quarantined before it pollutes the data warehouse.
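The runtime check described above can be sketched as a small routing step that splits each batch into valid and quarantined records. This is a minimal sketch under stated assumptions: `REQUIRED_COLUMNS`, `conforms`, and `route` are hypothetical names, and a production pipeline would more likely derive the expectations from the contract file and write quarantined rows to a separate table.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical expectations derived from a contract: required column -> Python type.
REQUIRED_COLUMNS: Dict[str, type] = {"order_id": str, "total_amount": float}


def conforms(record: Dict[str, Any]) -> bool:
    """True when every required column is present with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in REQUIRED_COLUMNS.items()
    )


def route(records: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split a batch into (valid, quarantined) records."""
    valid: List[Dict[str, Any]] = []
    quarantined: List[Dict[str, Any]] = []
    for record in records:
        (valid if conforms(record) else quarantined).append(record)
    return valid, quarantined
```

Only the `valid` list would be loaded into the warehouse; the `quarantined` list can be inspected or replayed once the producer fixes the issue.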
Key Benefits for a Data Engineer
Decoupling: You stop being the "janitor" fixing broken pipelines. Producers become responsible for the quality of their output.
Shift-Left Quality: Data quality issues are caught in the development stage (upstream), rather than in the production warehouse (downstream).
Documentation: The contract serves as excellent, up-to-date documentation for what data is available and who owns it.
Think of a Data Contract like an API specification (like Swagger/OpenAPI) but for data pipelines. It prevents "schema drift" and ensures that the data landing in your warehouse is trustworthy and reliable.
Example of what a Data Contract looks like in YAML code
Here is a realistic example of what a Data Contract might look like for an E-commerce Orders dataset.
While there is no single universal standard yet, many organizations use a format similar to this one, often based on a specification such as the Open Data Contract Standard.
Example: orders_contract.yaml
Breakdown of the File
Version (1.2.0): This is crucial. If the checkout team needs to make a massive change, they don't just overwrite this file. They create version 2.0.0. The Data Engineering team can continue reading 1.2.0 until they are ready to migrate, preventing immediate crashes.
Ownership: It explicitly names the checkout-engineering team. If the data arrives late, the Data Engineer knows exactly who to page.
Quality Checks:
Hard Rules (error): The total_amount can never be negative. If the application sends a -50.00 order, the contract rejects it immediately so it doesn't mess up financial reporting.
Soft Rules (warning): If a new status like RETURNED appears but isn't in the list, we might just want a warning rather than stopping the whole pipeline.
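The difference between hard and soft rules can be sketched as follows. This is an illustration under stated assumptions, not the contract file itself: the rule names, the `status` column name, the `evaluate` helper, and the contents of `ALLOWED_STATUSES` are hypothetical, since the actual list of allowed statuses is not shown here.

```python
from typing import Any, Callable, Dict, List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str
    severity: str  # "error" (hard rule) or "warning" (soft rule)
    check: Callable[[Dict[str, Any]], bool]  # returns True when the record passes


# Hypothetical allowed values; the real list would come from the contract.
ALLOWED_STATUSES = {"PLACED", "SHIPPED", "DELIVERED"}

RULES = [
    Rule("total_amount_non_negative", "error", lambda r: r["total_amount"] >= 0),
    Rule("status_is_known", "warning", lambda r: r["status"] in ALLOWED_STATUSES),
]


def evaluate(record: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Return (accepted, warnings) for one record."""
    warnings: List[str] = []
    for rule in RULES:
        if rule.check(record):
            continue
        if rule.severity == "error":
            # Hard rule: reject the record immediately.
            return False, warnings
        # Soft rule: note the problem but keep processing.
        warnings.append(rule.name)
    return True, warnings
```

With these assumed rules, a record with `total_amount` of -50.00 is rejected, while a record whose status is RETURNED is accepted with a `status_is_known` warning.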
