Data Quality


Data quality describes the condition of a dataset and whether it is "fit for use."

In simple terms, high-quality data is reliable and trustworthy enough to make decisions with. If data is accurate, complete, and available when you need it, it has high quality. If it is full of errors, missing values, or duplicates, it has low quality.

Why It Matters

Data quality is the foundation of any data initiative. The industry-standard phrase is "Garbage In, Garbage Out" (GIGO):

  • Reliable Analytics: A dashboard or AI model built on bad data produces misleading results.

  • Operational Efficiency: Bad data causes manual rework (e.g., calling the wrong customer number).

  • Compliance: Regulations like GDPR require accurate handling of user data.


The Dimensions of Data Quality

1. Completeness 📝

This dimension checks if all required data fields are populated. A high completeness score means there are no missing values in critical columns.

| Field | Before (Incomplete Data) | After (Complete Data) |
| --- | --- | --- |
| User_ID | U101 | U101 |
| User_Name | Alex Smith | Alex Smith |
| User_Email | (Null) | alex.s@email.com |
| Registration_Date | 2024-03-15 | 2024-03-15 |

Commentary: The original record is incomplete because the User_Email field, a key piece of information for marketing campaigns, is empty.
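A minimal sketch of a completeness check using pandas (an illustrative choice; the column names and the list of critical columns are assumptions based on the table above):

```python
import pandas as pd

# Illustrative user records; User_Email is missing for U101.
users = pd.DataFrame({
    "User_ID": ["U101", "U102"],
    "User_Name": ["Alex Smith", "Dana Lee"],
    "User_Email": [None, "dana.l@email.com"],
    "Registration_Date": ["2024-03-15", "2024-03-18"],
})

critical_columns = ["User_ID", "User_Email"]  # fields that must always be populated

# Completeness score per critical column: share of non-null values.
print(users[critical_columns].notna().mean())

# Flag incomplete records so they can be fixed at the source.
print(users[users[critical_columns].isna().any(axis=1)])
```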


2. Accuracy 🎯

Accuracy measures how correct the data is—how well it reflects the true state of the real world.

| Field | Before (Inaccurate Data) | After (Accurate Data) |
| --- | --- | --- |
| Order_ID | O500 | O500 |
| Product_Name | Laptop Model A | Laptop Model A |
| Price | $1200 | $1250 |
| Order_Date | 2024-05-20 | 2024-05-20 |

Commentary: The original data is inaccurate because the Price was recorded as $1200, while the actual price was $1250.
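Accuracy can only be verified against a trusted reference. A minimal sketch, assuming a hypothetical product catalog DataFrame serves as the source of truth for prices:

```python
import pandas as pd

orders = pd.DataFrame({"Order_ID": ["O500"], "Product_Name": ["Laptop Model A"], "Price": [1200]})
# Hypothetical trusted reference for prices.
catalog = pd.DataFrame({"Product_Name": ["Laptop Model A"], "List_Price": [1250]})

# Join each order to the reference and flag price mismatches.
checked = orders.merge(catalog, on="Product_Name", how="left")
mismatches = checked[checked["Price"] != checked["List_Price"]]
print(mismatches[["Order_ID", "Price", "List_Price"]])
```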


3. Consistency 🔄

Consistency ensures that data values are uniform and do not conflict across different systems or datasets.

| Field | System A (CRM) | System B (Billing) | After (Consistent) |
| --- | --- | --- | --- |
| Customer_ID | C205 | C205 | C205 |
| Customer_Name | Jane Miller | J. Miller | Jane Miller |
| Subscription_Status | Premium | Gold | Premium |

Commentary: The customer's name and subscription status are inconsistent between the two systems, which can lead to billing errors or poor customer service.
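A minimal sketch of a cross-system consistency check, assuming records from the CRM and billing systems have been loaded into two DataFrames keyed by Customer_ID:

```python
import pandas as pd

crm = pd.DataFrame({"Customer_ID": ["C205"], "Customer_Name": ["Jane Miller"], "Subscription_Status": ["Premium"]})
billing = pd.DataFrame({"Customer_ID": ["C205"], "Customer_Name": ["J. Miller"], "Subscription_Status": ["Gold"]})

# Align the two systems on the shared key and compare field by field.
merged = crm.merge(billing, on="Customer_ID", suffixes=("_crm", "_billing"))
for field in ["Customer_Name", "Subscription_Status"]:
    conflicts = merged[merged[f"{field}_crm"] != merged[f"{field}_billing"]]
    if not conflicts.empty:
        print(f"Inconsistent {field}:")
        print(conflicts[["Customer_ID", f"{field}_crm", f"{field}_billing"]])
```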


4. Timeliness ⏳

Timeliness assesses whether data is available when it is needed, reflecting how up-to-date it is.

| Field | Before (Untimely Data) | After (Timely Data) |
| --- | --- | --- |
| Date_of_Report | 2024-06-05 | 2024-06-05 |
| Data_Date | 2024-06-01 | 2024-06-05 |
| Total_Sales | $55,000 | $62,500 |

Commentary: A daily report run on June 5th should reflect data from June 5th. The original data is untimely because it is stale, showing data from four days prior due to a pipeline delay.
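A minimal sketch of a freshness check, assuming the dataset carries a Data_Date column and that anything older than one day (an illustrative threshold) counts as stale:

```python
from datetime import date, timedelta

import pandas as pd

report = pd.DataFrame({"Data_Date": ["2024-06-01"], "Total_Sales": [55000]})
report["Data_Date"] = pd.to_datetime(report["Data_Date"]).dt.date

run_date = date(2024, 6, 5)          # the day the report is produced
max_allowed_lag = timedelta(days=1)  # illustrative freshness threshold

lag = run_date - report["Data_Date"].max()
if lag > max_allowed_lag:
    print(f"Stale data: newest records are {lag.days} day(s) old")
```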


5. Uniqueness 👯

This dimension ensures that each record is distinct and that there are no duplicate entries in a dataset.

| Record ID | User Name | User Email | Before (Duplicate Data) | After (Unique Data) |
| --- | --- | --- | --- | --- |
| 1 | Chris Brown | chrisb@mail.com | (Original Record) | (Kept) |
| 2 | Chris Brown | chrisb@mail.com | (Duplicate Record) | (Deleted) |
| 3 | Sarah Jones | sarah.j@mail.com | (Original Record) | (Kept) |

Commentary: The data originally contained two identical records for the same person, which would skew analytics. The Uniqueness dimension is violated by the duplicate record.
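A minimal sketch of a duplicate check and deduplication with pandas; treating the (User_Name, User_Email) pair as the identity of a record is an illustrative assumption:

```python
import pandas as pd

users = pd.DataFrame({
    "Record_ID": [1, 2, 3],
    "User_Name": ["Chris Brown", "Chris Brown", "Sarah Jones"],
    "User_Email": ["chrisb@mail.com", "chrisb@mail.com", "sarah.j@mail.com"],
})

key_columns = ["User_Name", "User_Email"]  # assumed identity of a record

# Count duplicates before deciding what to delete.
duplicates = users[users.duplicated(subset=key_columns, keep="first")]
print(f"{len(duplicates)} duplicate record(s) found")

# Keep the original record, drop the later copies.
deduplicated = users.drop_duplicates(subset=key_columns, keep="first")
```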


6. Validity ✅

Validity checks whether data conforms to a predefined set of rules, formats, or business constraints.

| Field | Before (Invalid Data) | After (Valid Data) |
| --- | --- | --- |
| Product_SKU | SKU123 | SKU123 |
| Customer_Zip_Code | 1234 | 12345 |
| Rating_Score | 11 | 10 |

Commentary: The original data is invalid because the Customer_Zip_Code is the wrong length for a valid US ZIP code and the Rating_Score falls outside the allowed range of 1-10.
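A minimal sketch of validity rules expressed as a format check and a range check, following the ZIP code and rating rules above:

```python
import pandas as pd

records = pd.DataFrame({
    "Product_SKU": ["SKU123"],
    "Customer_Zip_Code": ["1234"],
    "Rating_Score": [11],
})

# Format rule: a US ZIP code must be exactly five digits.
valid_zip = records["Customer_Zip_Code"].str.fullmatch(r"\d{5}", na=False)

# Range rule: ratings must fall between 1 and 10 inclusive.
valid_rating = records["Rating_Score"].between(1, 10)

# Rows that break at least one rule.
print(records[~(valid_zip & valid_rating)])
```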

Data quality checks produce metrics that address both data quality and data integrity.

The common data quality checks include the following (a consolidated sketch follows the list):

  • Identifying duplicate or overlapping records to measure uniqueness.

  • Checking mandatory fields for nulls and missing values to measure and fix completeness.

  • Applying formatting checks to enforce consistency.

  • Applying business rules, such as allowed value ranges or default values, to enforce validity.

  • Checking how recent the data is, or when it was last updated, to measure recency or freshness.

  • Validating rows, columns, conformity, and values to verify integrity.
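A minimal sketch that rolls a few of these checks into a single summary; the function name, columns, and metrics are illustrative, not a standard API:

```python
import pandas as pd

def quality_summary(df, key_columns, required_columns):
    """Compute a handful of illustrative quality metrics for a DataFrame."""
    return {
        "row_count": len(df),
        "duplicate_rows": int(df.duplicated(subset=key_columns).sum()),
        "null_counts": df[required_columns].isna().sum().to_dict(),
        "completeness": float(df[required_columns].notna().mean().mean()),
    }

users = pd.DataFrame({
    "User_ID": ["U101", "U101", "U102"],
    "User_Email": [None, None, "dana.l@email.com"],
})
print(quality_summary(users, key_columns=["User_ID"], required_columns=["User_ID", "User_Email"]))
```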


Indicators of Data Quality

Data Source Metadata

Definition: Descriptive information about datasets including file locations, formats, and ownership details.

Why It Matters: Metadata mismatches can break entire pipelines. When an upstream system switches from CSV to JSON format but downstream applications expect CSV, processing failures cascade through dependent systems.

Schema

Definition: Expected structure defining field names, data types, and column arrangements for datasets.

Critical Impact: Schema violations cause immediate application failures. More subtly, when columns get reordered in a feature matrix, machine learning models apply coefficients to wrong variables, producing misleading predictions without obvious errors.
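A minimal sketch of a schema check run before data is handed to downstream consumers; the expected column names, order, and dtypes are illustrative assumptions:

```python
import pandas as pd

# Illustrative expected schema: column order and pandas dtypes.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "signup_date": "datetime64[ns]",
    "monthly_spend": "float64",
}

def check_schema(df):
    """Return a list of schema violations; an empty list means the DataFrame conforms."""
    problems = []
    if list(df.columns) != list(EXPECTED_SCHEMA):
        problems.append(f"column names/order mismatch: {list(df.columns)}")
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column in df.columns and str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return problems
```

Catching a renamed or reordered column at this point is far cheaper than debugging a model that silently applied coefficients to the wrong variables.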

Data Lineage

Definition: Documentation of data transformations and flow paths through processing pipelines, creating visibility into how information moves and changes across systems.

Key Benefits:

  • Change Impact Management: Understanding which applications and users are affected before modifying pipelines

  • Usage Tracking: Creating catalogs of who accessed data, how it was transformed, and what outputs were generated

  • Compliance Validation: Ensuring prohibited data (like gender information in loan decisions) doesn't inadvertently influence automated processes through complex feature engineering

Application Context

Definition: Technical details about processing tools including code versions, execution timestamps, and runtime environments.

Hidden Quality Impact: Version mismatches create subtle issues often invisible to business teams. Applications running outdated code may process stale data, creating timeliness problems that only surface during technical audits, not business reviews.


Statistics and KPIs

Predefined statistics:

Standard metrics computed automatically on datasets (a short sketch follows the list below):

Distribution Metrics: Min/max values, means, medians, and quantiles for numerical data; frequency distributions for categorical data

Freshness Indicators:

  • Frequency: Whether data updates arrive on schedule (daily noon updates arriving late)

  • Timeliness: How current the data actually is when consumed by downstream processes

Completeness Measures:

  • Volume: Row count variations that signal upstream collection issues (expecting 10M customer records but receiving only 5K)

  • Missing Values: Null percentages that may break machine learning models requiring complete datasets
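A minimal sketch of how these predefined statistics might be computed with pandas; the column names and the expected row count are illustrative assumptions:

```python
import pandas as pd

sales = pd.DataFrame({
    "amount": [19.9, 42.0, 7.5, None],
    "region": ["EU", "US", "EU", "US"],
    "updated_at": pd.to_datetime(["2024-06-05 11:58", "2024-06-05 11:59",
                                  "2024-06-05 12:01", "2024-06-05 12:02"]),
})

# Distribution metrics: min/max, mean, quantiles for numerical data,
# frequency distributions for categorical data.
print(sales["amount"].describe())
print(sales["region"].value_counts())

# Freshness: when did the newest record arrive?
print("latest update:", sales["updated_at"].max())

# Completeness: volume and missing values.
expected_rows = 4  # illustrative expectation
print("row count as expected:", len(sales) == expected_rows)
print("null share in amount:", sales["amount"].isna().mean())
```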

Custom KPIs:

Tailored metrics addressing specific business or technical requirements (a short sketch follows the list below):

  • Business-Focused: Boolean indicators ensuring domain logic (all sales quantities are positive values)

  • Technical-Focused: Row count comparisons between input and output datasets to validate join operations and detect data loss during transformations

  • Cross-System Consistency: Format validation across distributed data partitions (ensuring decimal precision matches between Parquet files)
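A minimal sketch of two such custom KPIs, assuming illustrative input and output DataFrames around a join-based transformation:

```python
import pandas as pd

sales_in = pd.DataFrame({"order_id": [1, 2, 3], "quantity": [2, 1, -4]})
# Output of some transformation, e.g. an inner join that may drop rows.
sales_out = sales_in.merge(pd.DataFrame({"order_id": [1, 2]}), on="order_id")

# Business-focused KPI: every sales quantity must be positive.
all_quantities_positive = bool((sales_in["quantity"] > 0).all())

# Technical-focused KPI: the transformation must not silently lose rows.
no_rows_lost = len(sales_out) == len(sales_in)

# Both indicators evaluate to False here, signalling quality issues.
print({"all_quantities_positive": all_quantities_positive, "no_rows_lost": no_rows_lost})
```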

These indicators work together to provide comprehensive data quality monitoring, from technical infrastructure issues to business rule violations.

