Data Observability
Observability and Monitoring
In addition to the principles of Automation that data engineers brought over and built on from DevOps, the principles and practices of Observability and Monitoring form another key pillar of DataOps, one that also has its origins in software development.

Over the past decade or so, with the advent of the cloud and the move toward distributed systems, software engineers have developed observability tools that help their teams gain visibility into the health of their systems. With these tools, teams can monitor metrics such as CPU and RAM usage and response time, which helps them quickly detect anomalies, identify problems, prevent downtime, and ensure reliable software products.
When it comes to data observability and monitoring the health of your data systems, some of the same tools that software teams rely on can also be helpful to you as a data engineer.
That said, in addition to monitoring things like CPU usage or system response times, as a data engineer you also need visibility into the health of your data itself, in other words, its quality.
If Data Quality is like a yearly health checkup for your data at rest, Data Observability is like a continuous heart monitor for your data in motion. It doesn't just tell you that something is wrong; it helps you figure out why it broke, where it broke, and what else is affected.
It is heavily inspired by software observability (logs, metrics, traces) but adapted for data engineering.
The 5 Pillars of Data Observability
To achieve observability, data engineers track five key "pillars" (popularized by companies like Monte Carlo); a minimal check sketch follows the list.
Freshness
Question: Is the data up to date?
Check: Did the job run at 8:00 AM as scheduled? If the table hasn't been updated in 24 hours, the data is stale.
Distribution
Question: Is the data within expected ranges?
Check: If the credit_score column usually averages 700, but suddenly drops to 10, the distribution has shifted (this is often called "Data Drift").
Volume
Question: Is the amount of data complete?
Check: We usually ingest 1 million rows per hour. If we suddenly ingest only 5,000 rows, there is likely a broken connection upstream.
Schema
Question: Did the structure change?
Check: Did a developer delete the email column or change user_id from a number to a string? Schema changes are the #1 cause of broken pipelines.
Lineage
Question: Where did this data come from, and where does it go?
Check: If Table A breaks, Lineage tells you immediately that Dashboard B and Report C are also broken.
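In practice, several of these pillar checks can be expressed as simple assertions against a loaded table. The sketch below assumes a pandas DataFrame with hypothetical column names (loaded_at, credit_score, user_id, email) and illustrative thresholds; it is a minimal illustration, not a production monitor.

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical table loaded from the warehouse; file path and column names are illustrative.
df = pd.read_parquet("warehouse/credit_events.parquet")

alerts = []

# Freshness: has the table been updated in the last 24 hours?
last_loaded = pd.to_datetime(df["loaded_at"]).max()
if datetime.utcnow() - last_loaded > timedelta(hours=24):
    alerts.append(f"Freshness: table is stale, last load at {last_loaded}")

# Volume: did we receive roughly the expected number of rows?
expected_rows = 1_000_000  # assumed hourly baseline from the example above
if len(df) < 0.1 * expected_rows:
    alerts.append(f"Volume: only {len(df)} rows ingested, expected ~{expected_rows}")

# Distribution: is credit_score within its usual range?
mean_score = df["credit_score"].mean()
if not 600 <= mean_score <= 800:  # assumed historical band around ~700
    alerts.append(f"Distribution: credit_score mean drifted to {mean_score:.1f}")

# Schema: did any expected column disappear or change type?
expected_schema = {"user_id": "int64", "email": "object", "credit_score": "float64"}
for col, dtype in expected_schema.items():
    if col not in df.columns:
        alerts.append(f"Schema: column {col} is missing")
    elif str(df[col].dtype) != dtype:
        alerts.append(f"Schema: column {col} changed type to {df[col].dtype}")

for alert in alerts:
    print(alert)
```

Lineage is the one pillar that doesn't fit a single table check; it requires metadata about how tables feed dashboards and downstream jobs, which is why it is usually handled by a dedicated catalog or observability platform.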
Observability vs. Monitoring
It is common to confuse the two, but there is a distinct difference in depth:
Monitoring tells you "What happened?"
Alert: "The pipeline failed."
Alert: "There are null values in column X."
Nature: Reactive. You set a threshold, and it notifies you when it is crossed.
Observability tells you "Why did it happen?"
Insight: "The pipeline failed because the upstream schema changed, causing a volume drop of 90%. This impacts the Executive Dashboard."
Nature: Proactive and investigative. It connects the dots between the error and the root cause.
Why It Matters
For a Data Engineer, observability reduces MTTR (Mean Time To Resolution). Instead of spending 4 hours hunting through logs to find why a report is empty, observability tools can visually show you: "The error started at Step 2 of the Spark job because the source file was missing."
Data Incidents
In the world of Data Engineering, Data Incidents are the equivalent of "site outages" in software engineering.
What is a Data Incident?
A Data Incident is any unplanned event where data fails to meet the expectations set by your Data Contracts or SLAs, resulting in "Data Downtime."
Unlike software glitches (where the app crashes or throws a visible error), data incidents are often silent. The pipeline might run successfully, but the data inside it could be wrong.
Common Examples (a small anomaly-check sketch follows this list):
Freshness Delay: The "Daily Sales" report is supposed to update at 8:00 AM, but the job is stuck, and executives are looking at yesterday's numbers without realizing it.
Volume Anomaly: You usually process 1 million rows/day. Today, you processed 50 rows because of an API change upstream.
Schema Drift: A software engineer renamed user_id to customer_id in the production database, breaking your downstream ingestion script.
Distribution Drift: The data looks technically correct (no nulls), but the "Average Order Value" suddenly dropped from $100 to $5.
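A minimal sketch of how such incidents might be caught automatically by comparing today's batch against a rolling baseline rather than fixed thresholds; the metrics file and column names (row_count, avg_order_value) are hypothetical.

```python
import pandas as pd

# Hypothetical daily load metrics recorded by the pipeline (one row per run day).
history = pd.read_csv("metrics/daily_load_stats.csv", parse_dates=["run_date"])
history = history.sort_values("run_date")

today = history.iloc[-1]          # most recent run
baseline = history.iloc[-31:-1]   # previous 30 runs as the baseline window

incidents = []

# Volume anomaly: today's row count far below the 30-day average.
if today["row_count"] < 0.5 * baseline["row_count"].mean():
    incidents.append(
        f"Volume anomaly: {today['row_count']:,} rows vs "
        f"~{baseline['row_count'].mean():,.0f} expected"
    )

# Distribution drift: average order value outside 3 standard deviations of the baseline.
mean_aov = baseline["avg_order_value"].mean()
std_aov = baseline["avg_order_value"].std()
if abs(today["avg_order_value"] - mean_aov) > 3 * std_aov:
    incidents.append(
        f"Distribution drift: avg_order_value {today['avg_order_value']:.2f} "
        f"vs baseline {mean_aov:.2f}"
    )

for incident in incidents:
    print("DATA INCIDENT:", incident)
```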
Incident Response
Incident Response is the "firefighting" protocol you use to put data incidents out.
What is Incident Response?
Incident Response (IR) is the structured standard operating procedure (SOP) a data team follows when an incident is detected. Its goal is to minimize Mean Time to Resolution (MTTR) and preserve trust with stakeholders.
It typically follows a 4-step lifecycle:
Step 1: Detection & Triage (The "Fire Alarm")
What happens: An observability tool (like Monte Carlo) or a test (like Great Expectations) fires an alert to Slack/PagerDuty.
Action: You acknowledge the alert and determine the Severity Level (SEV); a minimal alerting sketch follows this list.
SEV-1 (Critical): The CEO's dashboard is wrong; legal reporting is blocked. Drop everything and fix.
SEV-3 (Minor): A non-critical marketing table is late. Fix it during business hours.
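A minimal sketch of the alerting step, assuming alerts are posted to a Slack incoming webhook whose URL lives in an environment variable; the webhook, table name, and severity wording are illustrative rather than any specific tool's API.

```python
import json
import os
import urllib.request

# Hypothetical incoming-webhook URL, e.g. exported as SLACK_WEBHOOK_URL.
WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def send_alert(table: str, issue: str, severity: str) -> None:
    """Post a data-incident alert to the on-call Slack channel."""
    message = {
        "text": (
            f":rotating_light: [{severity}] Data incident on `{table}`\n"
            f"{issue}\n"
            "Acknowledge in the thread and start triage."
        )
    }
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


# Example: a freshness failure on a critical table is treated as SEV-1.
send_alert(
    table="daily_sales",
    issue="Table has not been updated since yesterday's 8:00 AM run.",
    severity="SEV-1",
)
```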
Step 2: Communication (The "Status Page")
What happens: Before you start fixing code, you tell the consumers.
Action: You post in the #analytics-updates channel: "We are investigating a freshness issue with the Sales Table. Please do not trust the dashboard until further notice."
Why: This prevents the CEO from making business decisions based on bad data.
Step 3: Investigation & Fix (The "Firefighting")
What happens: You use Data Lineage to trace the break upstream.
Is it a Spark job failure? Check the executor logs.
Is it bad source data? Query the raw landing zone.
Action: You deploy a hotfix or revert the bad code. You often have to Backfill (re-run historical jobs) to correct the data gaps.
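A minimal sketch of a backfill loop that re-runs one daily partition at a time once the fix is deployed; run_pipeline is a hypothetical stand-in for your real job entry point (a Spark submit, a dbt run, etc.), and the dates are illustrative.

```python
from datetime import date, timedelta


def run_pipeline(partition_date: date) -> None:
    """Placeholder for the real job entry point (Spark submit, dbt run, etc.)."""
    print(f"Re-processing partition {partition_date.isoformat()}")


# Re-run every daily partition between when the incident started and when it was fixed.
incident_start = date(2024, 3, 1)  # illustrative dates
incident_end = date(2024, 3, 4)

current = incident_start
while current <= incident_end:
    run_pipeline(current)
    current += timedelta(days=1)
```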
Step 4: Post-Mortem (The "Cleanup")
What happens: After the dust settles, the team meets to ensure this specific error never happens again.
Action: You write an Incident Report.
Root Cause: Why did this happen? (e.g., "The API token expired").
Action Item: "Add an automated check for token expiration 3 days in advance."
Key Metrics: MTTD vs. MTTR
To measure how good a data team is at Incident Response, we track two key numbers:
MTTD (Mean Time To Detection): How long did the data stay broken before you knew about it?
Bad: A user emails you 3 days later saying "This looks wrong."
Good: An automated alert pings you 5 minutes after the job fails.
MTTR (Mean Time To Resolution): Once you knew about it, how long did it take to fix?
Bad: It took 2 days to debug the Spark logs.
Good: Lineage tools showed you the error immediately, and you fixed it in 30 mins.
Measuring Success: DataOps Reliability Metrics
To quantify the health of your DataOps practice, you must track reliability and deployment metrics.
Reliability Metrics
Mean Time to Detection (MTTD): How long was the data broken before you knew?
Mean Time to Resolution (MTTR): How long did it take to fix once detected?
Mean Time Between Failures (MTBF): Measures the stability of your products.
$$MTBF = \frac{\text{Total Time} - \text{Total Downtime}}{\text{Number of Incidents}}$$
Deployment Metrics
Deployment Frequency (DF): How often you release changes. High frequency (1–2 times per week) indicates an agile, automated environment.
Change Failure Rate (CFR): The percentage of deployments that cause a failure.
$$CFR = \left( \frac{\text{Failed Deployments}}{\text{Total Deployments}} \right) \times 100\%$$
Note on "False Reliability": A high MTBF (long time between failures) might look good, but if your Deployment Frequency is very low (e.g., once every 6 months), it just means you aren't changing anything. True reliability is maintaining a high MTBF while also maintaining a high Deployment Frequency.
The Holistic View: Three Pillars of IT Observability
Data issues rarely happen in a vacuum. A comprehensive strategy requires monitoring three distinct layers:
| Pillar | Focus | Tracking Examples |
| --- | --- | --- |
| Data Observability | The data itself | Schema drift, volume anomalies, distribution drift |
| Application Observability | The transformation code | Code execution logs, runtime errors, memory leaks |
| Infrastructure Observability | Hardware & resources | CPU/RAM usage, network latency, disk space |
DataOps Reliability Metrics
Mean Time Between Failures (MTBF)
Definition: The average operational time between system failures. It measures the inherent stability of your data products.
Total Uptime Period = Total Period - Total Downtime.
Example: In a 720-hour month, with 12 hours of downtime across 4 incidents:
(720 − 12) / 4 = 177 hours.
Interpretation: The system is stable for roughly 7.4 days before a bug or outage occurs.
Mean Time to Recovery (MTTR)
Definition: The average time it takes to restore service after a failure. In DataOps, this includes detecting the data issue, fixing the pipeline, and backfilling the data.
Example: 12 hours of total downtime / 4 incidents = 3 hours per incident.
Significance: While MTBF measures prevention, MTTR measures agility and the effectiveness of your observability tools.
Change Failure Rate (CFR)
Definition: The percentage of deployments or changes (e.g., new dbt models, updated Spark jobs) that cause a failure in production.
Interpretation: A high CFR suggests that your CI/CD tests or staging environments are not catching bugs effectively before they reach production.
Deployment Frequency (DF)
Definition: How often the team successfully releases code to production. This is a proxy for team velocity and automation maturity.
Example: 10 releases / 30 days = 0.33 deployments per day (or roughly 2–3 per week).
Elite Target: Elite data teams deploy on-demand (multiple times per day).
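A minimal sketch that reproduces the worked numbers above (a 720-hour month, 12 hours of downtime, 4 incidents, 10 releases in 30 days); the failed-deployment count used for CFR is illustrative.

```python
def mtbf(total_hours: float, downtime_hours: float, incidents: int) -> float:
    """Mean Time Between Failures: total uptime divided by number of incidents."""
    return (total_hours - downtime_hours) / incidents


def mttr(downtime_hours: float, incidents: int) -> float:
    """Mean Time to Recovery: average downtime per incident."""
    return downtime_hours / incidents


def cfr(failed_deployments: int, total_deployments: int) -> float:
    """Change Failure Rate as a percentage of all deployments."""
    return failed_deployments / total_deployments * 100


def deployment_frequency(releases: int, days: int) -> float:
    """Deployments per day over the measurement window."""
    return releases / days


print(mtbf(720, 12, 4))              # 177.0 hours (~7.4 days between failures)
print(mttr(12, 4))                   # 3.0 hours per incident
print(cfr(2, 10))                    # 20.0% (2 failed releases is an assumed figure)
print(deployment_frequency(10, 30))  # ~0.33 deployments per day (2-3 per week)
```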
Interrelationship Analysis: The "Balanced Scorecard"
These metrics cannot be viewed in isolation. Looking at one without the others creates "blind spots."
The False Reliability Trap
Scenario: High MTBF (No failures) but Very Low Deployment Frequency (Once a month).
Reality: The system is "stable" only because it is static. As soon as you increase velocity, the lack of robust testing will likely cause the MTBF to crash.
The DataOps "North Star" (The J-Curve)
A healthy DataOps team aims for the High-Velocity/High-Stability quadrant:
High Deployment Frequency: Rapidly delivering new features to stakeholders.
Low Change Failure Rate: Changes are safe and well-tested.
High MTBF: The system remains stable despite frequent changes.
Low MTTR: When things do break, they are fixed instantly.
Practical Application & Targets
| Metric | Target (Elite) | Red Flag 🚩 |
| --- | --- | --- |
| MTBF | > 336 hours (2 weeks) | High MTBF + low DF (stagnation) |
| MTTR | < 1 hour | > 24 hours (indicates poor observability/lineage) |
| CFR | < 15% | > 30% (manual testing or "broken" CI/CD) |
| DF | Multiple times per day | Once per month (high-risk, "big bang" releases) |
The "Deployment-Adjusted Reliability" Formula
Your proposed formula is a great internal KPI for measuring Risk-Adjusted Stability:
This effectively penalizes the MTBF if the team isn't meeting a minimum "velocity" baseline, ensuring that stability isn't bought at the cost of stagnation.
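One possible formulation consistent with that description scales MTBF by how close Deployment Frequency (DF) comes to an agreed baseline; the baseline term is an assumed parameter, not a standard industry definition:

$$\text{Adjusted MTBF} = \text{MTBF} \times \min\left(1, \frac{\text{DF}}{\text{DF}_{\text{baseline}}}\right)$$

With this shape, a team that deploys at or above the baseline keeps its full MTBF, while a team that rarely deploys sees its "reliability" score discounted accordingly.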