MDM Entity resolution patterns
Golden Record
The Scenario
The Goal
Step 1: The "Deduplication" SQL Pattern
WITH raw_data AS (
-- Combine data from all sources into one staging area
SELECT
customer_id AS src_id,
LOWER(TRIM(email)) AS clean_email, -- Standardize the key!
first_name,
phone_number,
updated_at,
'website' AS source_system
FROM raw_shopify_customers
UNION ALL
SELECT
id AS src_id,
LOWER(TRIM(email_address)) AS clean_email,
full_name AS first_name, -- aligning columns
phone AS phone_number,
last_transaction AS updated_at,
'pos_system' AS source_system
FROM raw_pos_customers
),
ranked_data AS (
SELECT
*,
-- The "Golden Rule": How do we decide who wins?
-- Here we prioritize the most recent update.
ROW_NUMBER() OVER (
PARTITION BY clean_email
ORDER BY updated_at DESC
) as rank_id
FROM raw_data
WHERE clean_email IS NOT NULL -- Can't master data without a key
)
SELECT *
FROM ranked_data
WHERE rank_id = 1; -- Keep only the winner (The Golden Record)Step 2: The Advanced "Merge" (True Golden Record)
Why this matters for Data Engineering
Fuzzy Matching
1. Phonetic Matching (SOUNDEX)
SOUNDEX)2. Edit Distance (LEVENSHTEIN)
LEVENSHTEIN)The "Real World" Implementation
Connecting the dots
1. The Consolidation Pattern (What we just built)
2. The Registry Pattern
3. The Coexistence Pattern
4. The Centralized (Transaction) Pattern
Summary
Last updated