Data Profiling
Data profiling is the process of examining, analyzing, and reviewing a dataset to understand its structure, content, and quality before you start using it.
Think of it as a health checkup for your data. Just as a doctor takes your blood pressure and temperature to catch issues early, a data engineer profiles data to catch errors before they break a pipeline or ruin a report.
What Does Data Profiling Actually Do?
Profiling typically involves running statistical analysis on the data to answer three key types of questions:
1. Structure Discovery (Format)
This checks if the data looks the way it is supposed to.
Pattern Matching: Do all email addresses have an "@" symbol? Do phone numbers have the right number of digits?
Data Types: Is the "Price" column actually a number, or did someone accidentally type "$10" (making it a string)?
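The structure checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production validator: the sample rows, the column names, the loose email pattern, and the 10-digit phone assumption are all hypothetical and would need adjusting for real data.

```python
import re

# Hypothetical sample rows; column names are illustrative only.
rows = [
    {"email": "ana@example.com", "phone": "555-010-1234", "price": "$10"},
    {"email": "bob.example.com", "phone": "555-010-5678", "price": "19.99"},
    {"email": "cy@example.org", "phone": "555-0101", "price": "7"},
]

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
EXPECTED_PHONE_DIGITS = 10  # assumption: 10-digit phone numbers


def is_valid_email(value):
    """Pattern check: does the value look like an email address?"""
    return bool(EMAIL_RE.match(value))


def has_expected_digits(value, expected=EXPECTED_PHONE_DIGITS):
    """Pattern check: strip non-digits, then compare the digit count."""
    digits = re.sub(r"\D", "", value)
    return len(digits) == expected


def is_numeric(value):
    """Type check: can the raw string be parsed as a number?"""
    try:
        float(value)
        return True
    except ValueError:
        return False


def failing_rows(rows, column, check):
    """Return the indexes of rows whose value in `column` fails `check`."""
    return [i for i, row in enumerate(rows) if not check(row[column])]


bad_emails = failing_rows(rows, "email", is_valid_email)
bad_phones = failing_rows(rows, "phone", has_expected_digits)
bad_prices = failing_rows(rows, "price", is_numeric)

print("rows with malformed email:", bad_emails)
print("rows with wrong phone length:", bad_phones)
print("rows with non-numeric price:", bad_prices)
```

Returning row indexes rather than a simple pass/fail makes it easier to go back and inspect the offending records.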
2. Content Discovery (Statistics)
This looks at individual records to find errors or specific data characteristics.
Null Counts: What percentage of the data is missing? (e.g., "20% of users have no last_name").
Distinct Counts: How many unique values are there? (e.g., "The Status column should only have 3 values: Active, Inactive, Pending. Why are there 5?").
Min/Max/Average: Does the data make sense physically? (e.g., "The Age column has a max value of 250", which is clearly an error).
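As a rough sketch of these three content checks, the snippet below computes a null rate, distinct value counts, and a min/max/mean summary using only the standard library. The `users` records and the `ALLOWED_STATUSES` set are made-up sample data chosen to mirror the examples above; they are assumptions, not real output.

```python
from collections import Counter

# Hypothetical sample data mirroring the examples in the text.
users = [
    {"last_name": "Ng", "status": "Active", "age": 34},
    {"last_name": None, "status": "Inactive", "age": 51},
    {"last_name": "Diaz", "status": "Pending", "age": 250},
    {"last_name": "Okafor", "status": "active", "age": 28},
    {"last_name": "Roy", "status": "Actve", "age": 45},
]
ALLOWED_STATUSES = {"Active", "Inactive", "Pending"}


def null_rate(rows, column):
    """Fraction of rows where `column` is missing (None)."""
    missing = sum(1 for row in rows if row[column] is None)
    return missing / len(rows)


def distinct_counts(rows, column):
    """Count how often each non-null value appears in `column`."""
    return Counter(row[column] for row in rows if row[column] is not None)


def numeric_summary(rows, column):
    """Min, max, and mean of the non-null values in `column`."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


last_name_null_rate = null_rate(users, "last_name")
status_counts = distinct_counts(users, "status")
unexpected_statuses = sorted(set(status_counts) - ALLOWED_STATUSES)
age_summary = numeric_summary(users, "age")

print(f"last_name null rate: {last_name_null_rate:.0%}")
print("distinct statuses:", len(status_counts))
print("unexpected statuses:", unexpected_statuses)
print("age summary:", age_summary)
```

With this sample, the Status column shows five distinct values instead of three because of a casing variant and a typo, and the Age maximum of 250 stands out immediately in the summary.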
3. Relationship Discovery (Links)
This checks how data relates across tables.
Key Verification: Is the CustomerID in the Orders table actually present in the Customers table? (Referential Integrity).
Dependency: Does City always correspond to the correct Zip Code?
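Both relationship checks can be approximated with set lookups. The sketch below uses hypothetical `customers`, `orders`, and `addresses` records. For the dependency check it assumes each zip code should map to exactly one city, which is a simplification; it can only flag inconsistencies within the data, not confirm that a city is the correct one for a zip code.

```python
from collections import defaultdict

# Hypothetical tables represented as lists of dicts.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 4},  # no matching customer
    {"order_id": 102, "customer_id": 2},
]
addresses = [
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "10001", "city": "Newark"},  # conflicts with the row above
    {"zip_code": "60601", "city": "Chicago"},
]


def orphaned_keys(child_rows, parent_rows, key):
    """Key values used in the child table but absent from the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


def dependency_violations(rows, determinant, dependent):
    """Determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


orphans = orphaned_keys(orders, customers, "customer_id")
zip_conflicts = dependency_violations(addresses, "zip_code", "city")

print("customer IDs missing from Customers:", orphans)
print("zip codes mapped to multiple cities:", zip_conflicts)
```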
Common Tools for Data Profiling
Engineers rarely do this manually, line by line. They use tools to generate "Profile Reports."
| Tool Type | Examples | Best For |
| --- | --- | --- |
| SQL | SELECT COUNT(*), MIN(), MAX() | Quick, ad-hoc checks on databases. |
| Python | pandas_profiling, ydata-profiling | Generating comprehensive HTML reports from dataframes. |
| Data Ops | Great Expectations, dbt | Automating these checks in production pipelines. |
| Cloud | AWS Glue DataBrew, Azure Data Catalog | Enterprise-scale profiling without coding. |
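To make the SQL approach concrete, here is a small sketch that runs a one-query column profile against an in-memory SQLite database through Python's built-in sqlite3 module. The `orders` table, its `revenue` column, and the sample values are assumptions for illustration. The aggregate functions are standard SQL, but exact syntax can vary slightly between database engines.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (2, None), (3, 75.5), (4, 300.0), (5, None)],
)

# COUNT(column) skips NULLs, so COUNT(*) - COUNT(column) is the null count.
# MIN, MAX, and AVG also ignore NULL values.
profile_sql = """
SELECT
    COUNT(*)                  AS row_count,
    COUNT(*) - COUNT(revenue) AS null_count,
    COUNT(DISTINCT revenue)   AS distinct_count,
    MIN(revenue)              AS min_value,
    MAX(revenue)              AS max_value,
    AVG(revenue)              AS avg_value
FROM orders
"""

row = conn.execute(profile_sql).fetchone()
columns = [
    "row_count", "null_count", "distinct_count",
    "min_value", "max_value", "avg_value",
]
profile = dict(zip(columns, row))
conn.close()

for name, value in profile.items():
    print(f"{name}: {value}")
```

Because the aggregates silently skip NULLs, reporting the null count alongside the average matters: an average computed over three of five rows can look plausible while hiding the missing data.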
Why it Saves Time
Without profiling, you might build an entire dashboard, show it to a stakeholder, and only then realize the "Revenue" numbers are wrong because 10% of the rows were null. Profiling catches that on day one, so you can clean the data before building anything.