Data Quality
Data quality describes the condition of a dataset and whether it is "fit for use."
In simple terms, high-quality data is reliable and trustworthy enough to make decisions with. If data is accurate, complete, and available when you need it, it has high quality. If it is full of errors, missing values, or duplicates, it has low quality.
Why It Matters
Data quality is the foundation of any data initiative. The industry standard phrase is "Garbage In, Garbage Out" (GIGO):
Reliable Analytics: Dashboards and AI models built on bad data produce misleading results.
Operational Efficiency: Bad data causes manual rework (e.g., calling the wrong customer number).
Compliance: Regulations like GDPR require accurate handling of user data.
The Dimensions of Data Quality
1. Completeness 📝
This dimension checks if all required data fields are populated. A high completeness score means there are no missing values in critical columns.
| Field | Before (Incomplete Data) | After (Complete Data) |
| --- | --- | --- |
| User_ID | U101 | U101 |
| User_Name | Alex Smith | Alex Smith |
| User_Email | (Null) | alex.s@email.com |
| Registration_Date | 2024-03-15 | 2024-03-15 |
Commentary: The original record is incomplete because the User_Email field is empty, which is a key piece of information for marketing campaigns.
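As a rough sketch, a completeness check can be expressed as the share of non-null values in each critical column. The example below uses pandas on a hypothetical `users` DataFrame mirroring the table above; the column list and the 100% requirement are illustrative assumptions, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical user records matching the example table above.
users = pd.DataFrame({
    "User_ID": ["U101"],
    "User_Name": ["Alex Smith"],
    "User_Email": [None],          # the missing email violates completeness
    "Registration_Date": ["2024-03-15"],
})

critical_columns = ["User_ID", "User_Name", "User_Email"]

# Completeness = share of non-null values in each critical column.
completeness = users[critical_columns].notna().mean()
print(completeness)

# Flag any column that falls below an (illustrative) 100% requirement.
incomplete = completeness[completeness < 1.0]
if not incomplete.empty:
    print("Completeness violations:", list(incomplete.index))
```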
2. Accuracy 🎯
Accuracy measures how correct the data is—how well it reflects the true state of the real world.
| Field | Before (Inaccurate Data) | After (Accurate Data) |
| --- | --- | --- |
| Order_ID | O500 | O500 |
| Product_Name | Laptop Model A | Laptop Model A |
| Price | $1200 | $1250 |
| Order_Date | 2024-05-20 | 2024-05-20 |
Commentary: The original data is inaccurate because the Price was recorded as $1200 when the actual price was $1250.
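Accuracy is hard to measure without a trusted reference. One hedged sketch: compare recorded values against a source of truth, here a hypothetical `catalog` DataFrame holding list prices (the names and columns are illustrative, not taken from any specific system).

```python
import pandas as pd

# Recorded orders (possibly inaccurate) and a trusted price catalog.
orders = pd.DataFrame({
    "Order_ID": ["O500"],
    "Product_Name": ["Laptop Model A"],
    "Price": [1200],
})
catalog = pd.DataFrame({
    "Product_Name": ["Laptop Model A"],
    "List_Price": [1250],
})

# Compare each recorded price against the source of truth.
checked = orders.merge(catalog, on="Product_Name", how="left")
inaccurate = checked[checked["Price"] != checked["List_Price"]]
print(inaccurate[["Order_ID", "Price", "List_Price"]])
```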
3. Consistency 🔄
Consistency ensures that data values are uniform and do not conflict across different systems or datasets.
| Field | System A (CRM) | System B (Billing) | After (Consistent) |
| --- | --- | --- | --- |
| Customer_ID | C205 | C205 | C205 |
| Customer_Name | Jane Miller | J. Miller | Jane Miller |
| Subscription_Status | Premium | Gold | Premium |
Commentary: The customer's name and subscription status are inconsistent between the two systems, which can lead to billing errors or poor customer service.
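A consistency check typically joins the same entity across systems and flags fields whose values disagree. The sketch below assumes two hypothetical pandas DataFrames, `crm` and `billing`, shaped like the table above.

```python
import pandas as pd

crm = pd.DataFrame({
    "Customer_ID": ["C205"],
    "Customer_Name": ["Jane Miller"],
    "Subscription_Status": ["Premium"],
})
billing = pd.DataFrame({
    "Customer_ID": ["C205"],
    "Customer_Name": ["J. Miller"],
    "Subscription_Status": ["Gold"],
})

# Join both systems on the shared key and flag conflicting fields.
merged = crm.merge(billing, on="Customer_ID", suffixes=("_crm", "_billing"))
for field in ["Customer_Name", "Subscription_Status"]:
    conflicts = merged[merged[f"{field}_crm"] != merged[f"{field}_billing"]]
    if not conflicts.empty:
        print(f"Inconsistent {field}:")
        print(conflicts[["Customer_ID", f"{field}_crm", f"{field}_billing"]])
```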
4. Timeliness ⏳
Timeliness assesses whether data is available when it is needed, reflecting how up-to-date it is.
| Field | Before (Untimely Data) | After (Timely Data) |
| --- | --- | --- |
| Date_of_Report | 2024-06-05 | 2024-06-05 |
| Data_Date | 2024-06-01 | 2024-06-05 |
| Total_Sales | $55,000 | $62,500 |
Commentary: A daily report run on June 5th should reflect data from June 5th. The original data is untimely because it is stale, showing data from four days prior due to a pipeline delay.
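A timeliness check compares the date of the freshest available data against the date it is consumed and enforces a maximum allowed lag. The one-day lag rule below is an illustrative assumption.

```python
from datetime import date

# Dates taken from the example report above.
report_date = date(2024, 6, 5)   # when the report is run
data_date = date(2024, 6, 1)     # most recent data actually available

# Illustrative freshness rule: daily data may lag at most one day.
max_lag_days = 1
lag = (report_date - data_date).days
if lag > max_lag_days:
    print(f"Timeliness violation: data is {lag} days old (allowed: {max_lag_days}).")
```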
5. Uniqueness 👯
This dimension ensures that each record is distinct and that there are no duplicate entries in a dataset.
| Record ID | User Name | User Email | Before (Duplicate Data) | After (Unique Data) |
| --- | --- | --- | --- | --- |
| 1 | Chris Brown | chrisb@mail.com | (Original Record) | (Kept) |
| 2 | Chris Brown | chrisb@mail.com | (Duplicate Record) | (Deleted) |
| 3 | Sarah Jones | sarah.j@mail.com | (Original Record) | (Kept) |
Commentary: The data originally contained two identical records for the same person, which would skew analytics. The Uniqueness dimension is violated by the duplicate record.
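Uniqueness checks commonly count and then drop duplicates on the fields that should identify a distinct record. A minimal pandas sketch using the example records above:

```python
import pandas as pd

users = pd.DataFrame({
    "User_Name": ["Chris Brown", "Chris Brown", "Sarah Jones"],
    "User_Email": ["chrisb@mail.com", "chrisb@mail.com", "sarah.j@mail.com"],
})

# Count duplicates on the fields that should identify a unique person.
dup_mask = users.duplicated(subset=["User_Name", "User_Email"])
print(f"Duplicate records found: {dup_mask.sum()}")

# Keep the first occurrence and drop the rest.
deduped = users.drop_duplicates(subset=["User_Name", "User_Email"], keep="first")
print(deduped)
```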
6. Validity ✅
Validity checks whether data conforms to a predefined set of rules, formats, or business constraints.
| Field | Before (Invalid Data) | After (Valid Data) |
| --- | --- | --- |
| Product_SKU | SKU123 | SKU123 |
| Customer_Zip_Code | 1234 | 12345 |
| Rating_Score | 11 | 10 |
Commentary: The original data is invalid because the Customer_Zip_Code is too short to be a valid five-digit US ZIP code and the Rating_Score falls outside the allowed range of 1-10.
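Validity rules can be encoded as format and range checks. The sketch below applies two illustrative rules (a five-digit US ZIP code pattern and a 1-10 rating range) to the example record:

```python
import re

record = {"Product_SKU": "SKU123", "Customer_Zip_Code": "1234", "Rating_Score": 11}

errors = []

# Rule 1: US ZIP codes must be exactly five digits.
if not re.fullmatch(r"\d{5}", record["Customer_Zip_Code"]):
    errors.append("Customer_Zip_Code must be a five-digit ZIP code")

# Rule 2: ratings must fall within the allowed 1-10 range.
if not 1 <= record["Rating_Score"] <= 10:
    errors.append("Rating_Score must be between 1 and 10")

print(errors or "Record is valid")
```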
Data quality checks produce metrics that address both quality and integrity. Common checks include the following (a combined sketch appears after this list):
Identifying duplicate or overlapping records (uniqueness).
Checking mandatory fields for nulls and missing values (completeness).
Applying formatting checks (consistency).
Applying business rules such as allowed value ranges or default values (validity).
Checking how recent the data is or when it was last updated (recency/freshness).
Validating row counts, column structure, conformity, and values (integrity).
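As one way to operationalize these checks, the hypothetical `quality_report` helper below bundles a few of them (uniqueness, completeness, volume, freshness) into a single pass over a pandas DataFrame; the column names are illustrative assumptions.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str, mandatory: list[str]) -> dict:
    """Bundle a few common data quality checks into one report."""
    return {
        "row_count": len(df),                                      # volume
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),  # uniqueness
        "null_counts": df[mandatory].isna().sum().to_dict(),       # completeness
        "last_updated": str(df["updated_at"].max()),               # freshness
    }

# Hypothetical orders table with one duplicate key and one missing amount.
orders = pd.DataFrame({
    "order_id": ["O1", "O2", "O2"],
    "amount": [100, None, 250],
    "updated_at": pd.to_datetime(["2024-06-04", "2024-06-05", "2024-06-05"]),
})
print(quality_report(orders, key="order_id", mandatory=["order_id", "amount"]))
```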
Indicators of Data Quality
Data Source Metadata
Definition: Descriptive information about datasets including file locations, formats, and ownership details.
Why It Matters: Metadata mismatches can break entire pipelines. When an upstream system switches from CSV to JSON format but downstream applications expect CSV, processing failures cascade through dependent systems.
Schema
Definition: Expected structure defining field names, data types, and column arrangements for datasets.
Critical Impact: Schema violations cause immediate application failures. More subtly, when columns get reordered in a feature matrix, machine learning models apply coefficients to wrong variables, producing misleading predictions without obvious errors.
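A schema check can be as simple as comparing a dataset's column names, order, and dtypes against an expected definition. The sketch below is a minimal pandas version; `EXPECTED_SCHEMA` and the column names are assumptions for illustration.

```python
import pandas as pd

# Expected schema: column names in order, with expected dtypes (illustrative).
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "signup_date": "datetime64[ns]",
    "plan": "object",
}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of schema problems; empty means the schema matches."""
    problems = []
    expected_cols = list(EXPECTED_SCHEMA)
    if list(df.columns) != expected_cols:
        problems.append(f"Column mismatch: expected {expected_cols}, got {list(df.columns)}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

df = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "plan": ["basic", "premium"],
})
print(check_schema(df) or "Schema OK")
```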
Data Lineage
Definition: Documentation of data transformations and flow paths through processing pipelines, creating visibility into how information moves and changes across systems.
Key Benefits:
Change Impact Management: Understanding which applications and users are affected before modifying pipelines
Usage Tracking: Creating catalogs of who accessed data, how it was transformed, and what outputs were generated
Compliance Validation: Ensuring prohibited data (like gender information in loan decisions) doesn't inadvertently influence automated processes through complex feature engineering
Application Context
Definition: Technical details about processing tools including code versions, execution timestamps, and runtime environments.
Hidden Quality Impact: Version mismatches create subtle issues often invisible to business teams. Applications running outdated code may process stale data, creating timeliness problems that only surface during technical audits, not business reviews.
Statistics and KPIs
Predefined statistics:
Standard metrics computed automatically on datasets:
Distribution Metrics: Min/max values, means, medians, and quantiles for numerical data; frequency distributions for categorical data
Freshness Indicators:
Frequency: Whether data updates arrive on schedule (daily noon updates arriving late)
Timeliness: How current the data actually is when consumed by downstream processes
Completeness Measures:
Volume: Row count variations that signal upstream collection issues (expecting 10M customer records but receiving only 5K)
Missing Values: Null percentages that may break machine learning models requiring complete datasets
Custom KPIs:
Tailored metrics addressing specific business or technical requirements (see the combined sketch after this list):
Business-Focused: Boolean indicators ensuring domain logic (all sales quantities are positive values)
Technical-Focused: Row count comparisons between input and output datasets to validate join operations and detect data loss during transformations
Cross-System Consistency: Format validation across distributed data partitions (ensuring decimal precision matches between Parquet files)
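To make these indicators concrete, the sketch below computes a few predefined statistics (distribution, volume, missing values, freshness) and one custom business KPI over a hypothetical `sales` DataFrame; all names and rules are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["NA", "EU", "NA", "APAC"],
    "quantity": [3, 1, -2, 5],               # the negative value violates a business rule
    "loaded_at": pd.to_datetime(["2024-06-05"] * 4),
})

# Predefined statistics: distributions, volume, missing values, freshness.
profile = {
    "quantity_stats": sales["quantity"].describe().to_dict(),      # min/max/mean/quantiles
    "region_frequencies": sales["region"].value_counts().to_dict(),
    "row_count": len(sales),
    "null_pct": sales.isna().mean().to_dict(),
    "freshness": str(sales["loaded_at"].max()),
}

# Custom KPI: business rule that all sales quantities are positive.
profile["all_quantities_positive"] = bool((sales["quantity"] > 0).all())

print(profile)
```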
These indicators work together to provide comprehensive data quality monitoring, from technical infrastructure issues to business rule violations.