MDM Entity Resolution Patterns


Golden Record

Here is a practical example of how to build a "Golden Record" using standard SQL (compatible with BigQuery, Snowflake, or Postgres/dbt).

This approach replaces expensive MDM software with "Analytical MDM"—using logic to determine the "truth" inside your warehouse.

The Scenario

Your retail company, ShoesForLess, has customer data coming from two sources:

  1. Website (Shopify): High quality, has emails.

  2. In-Store POS (Legacy): Older data, sometimes missing emails, but has phone numbers.

You have duplicates: The same customer (john.doe@gmail.com) exists in both systems with slightly different information.

The Goal

Create a single dim_customer table where each customer appears exactly once, merging the best data from both sources.

Step 1: The "Deduplication" SQL Pattern

The most common technique is to use the window function ROW_NUMBER(): rank the records within each group of duplicates and keep only the "winner."

WITH raw_data AS (
    -- Combine data from all sources into one staging area
    SELECT
        customer_id AS src_id,
        LOWER(TRIM(email)) AS clean_email, -- Standardize the key!
        first_name,
        phone_number,
        updated_at,
        'website' AS source_system
    FROM raw_shopify_customers
    UNION ALL
    SELECT
        id AS src_id,
        LOWER(TRIM(email_address)) AS clean_email,
        full_name AS first_name, -- aligning columns
        phone AS phone_number,
        last_transaction AS updated_at,
        'pos_system' AS source_system
    FROM raw_pos_customers
),

ranked_data AS (
    SELECT
        *,
        -- The "Golden Rule": How do we decide who wins?
        -- Here we prioritize the most recent update.
        ROW_NUMBER() OVER (
            PARTITION BY clean_email
            ORDER BY updated_at DESC, source_system -- tiebreaker keeps ties deterministic
        ) AS rank_id
    FROM raw_data
    WHERE clean_email IS NOT NULL -- Can't master data without a key
)

SELECT *
FROM ranked_data
WHERE rank_id = 1; -- Keep only the winner (The Golden Record)

Step 2: The Advanced "Merge" (True Golden Record)

The snippet above picks the newest row and throws away the old one. But what if the old row had a phone number and the new one doesn't?

True MDM uses COALESCE to fill in the blanks from secondary records.
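One way to sketch that merge, reusing the raw_data and ranked_data CTEs from Step 1: the newest row wins by default, and a simple per-email aggregate fills in anything the winner is missing (MAX() here is just an arbitrary-but-deterministic survivorship rule; swap in whatever rule fits your data).

-- Appended to the WITH chain from Step 1 (replacing its final SELECT)
, fallback AS (
    -- One row per customer: salvage any non-null value across all sources
    SELECT
        clean_email,
        MAX(first_name)   AS first_name,
        MAX(phone_number) AS phone_number
    FROM raw_data
    GROUP BY clean_email
)

SELECT
    w.clean_email,
    COALESCE(w.first_name, f.first_name)     AS first_name,
    COALESCE(w.phone_number, f.phone_number) AS phone_number, -- the old phone survives
    w.source_system AS winning_source
FROM ranked_data w
JOIN fallback f ON w.clean_email = f.clean_email
WHERE w.rank_id = 1;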

Why this matters for Data Engineering

  1. No Black Box: You understand exactly why a record was chosen (unlike some MDM tools).

  2. Auditability: You can keep the raw_data forever. If you change your logic (e.g., "Actually, trust the POS system more"), you just rerun the dbt model/query, and your whole history updates.

  3. Cost: This costs $0 extra if you already have a warehouse.


Fuzzy Matching

Fuzzy matching is where "Analytical MDM" gets interesting. This is how you solve the problem of human error—typos, nicknames, and inconsistencies—without buying expensive software.

Here are the two most common SQL techniques for finding "Jon Doe" when you search for "John Doe."

1. Phonetic Matching (SOUNDEX)

This is the "classic" approach. It converts words into a code based on how they sound in English. If two names sound the same (e.g., "Smyth" and "Smith"), they get the same code.

How it works:

  • Keeps the first letter.

  • Drops the vowels (a, e, i, o, u) along with h, w, and y.

  • Maps the remaining consonants to digits (e.g., B, F, P, V -> 1), then pads or truncates to one letter plus three digits (e.g., J500).

SQL Example (SOUNDEX() is built into BigQuery and Snowflake; on Postgres it requires the fuzzystrmatch extension):
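A minimal illustration of the codes it produces (no tables needed, just literals):

SELECT
    SOUNDEX('John')  AS john,   -- 'J500'
    SOUNDEX('Jon')   AS jon,    -- 'J500' -> same code as John
    SOUNDEX('Jane')  AS jane,   -- 'J500' -> also matches (false positive!)
    SOUNDEX('Smith') AS smith,  -- 'S530'
    SOUNDEX('Smyth') AS smyth;  -- 'S530' -> same code as Smith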

The Catch: SOUNDEX is a blunt instrument. As you can see with "Jane" vs. "John," it often groups distinct names together because they share similar consonants. It is best used for blocking (narrowing down the search space) rather than final decision-making.

2. Edit Distance (LEVENSHTEIN)

This is the "precise" approach. It calculates the minimum number of edits (insertions, deletions, or substitutions) required to turn String A into String B.

  • "John" to "Jon" = 1 edit (delete 'h'). Distance = 1.

  • "John" to "Jane" = 3 edits (sub 'o'->'a', sub 'h'->'n', sub 'n'->'e'). Distance = 3.

SQL Pattern for Finding Duplicates:
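A sketch of a self-join over a hypothetical customers staging table. The function name varies: LEVENSHTEIN() in Postgres (fuzzystrmatch extension), EDITDISTANCE() in Snowflake, EDIT_DISTANCE() in BigQuery.

SELECT
    a.src_id AS id_a,
    b.src_id AS id_b,
    a.first_name AS name_a,
    b.first_name AS name_b,
    LEVENSHTEIN(a.first_name, b.first_name) AS distance
FROM customers a
JOIN customers b
    ON a.src_id < b.src_id  -- each pair once, never a row against itself
WHERE LEVENSHTEIN(a.first_name, b.first_name) <= 2; -- tune this threshold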

The "Real World" Implementation

In a production Data Engineering pipeline, you rarely run LEVENSHTEIN on every possible pair of 1 million customers (that’s roughly half a trillion pairwise comparisons!).

Instead, you use a Hybrid Approach:

  1. Block: Use SOUNDEX (or matching zip codes/birth dates) to create small "buckets" of potential matches.

  2. Score: Use LEVENSHTEIN (or Jaro-Winkler, which is better for names) only on pairs inside those buckets to verify the match.
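The hybrid approach can be sketched like this (same hypothetical customers table as before; swap LEVENSHTEIN for your warehouse's edit-distance or Jaro-Winkler function):

WITH candidate_pairs AS (
    -- Step 1: Block. Only pair up people whose names *sound* alike.
    SELECT
        a.src_id     AS id_a,
        b.src_id     AS id_b,
        a.first_name AS name_a,
        b.first_name AS name_b
    FROM customers a
    JOIN customers b
        ON SOUNDEX(a.first_name) = SOUNDEX(b.first_name) -- cheap blocking key
       AND a.src_id < b.src_id
)

-- Step 2: Score. Run the expensive comparison only inside each bucket.
SELECT *
FROM candidate_pairs
WHERE LEVENSHTEIN(name_a, name_b) <= 2;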


Connecting the dots

What we just discussed (Soundex, Levenshtein, Window Functions) are technically Matching Algorithms or Survivorship Logic. They are the tools you use to build MDM.

When data architects talk about "MDM Patterns," they are usually referring to the Architectural Styles—basically, where the MDM hub sits relative to your other systems and how data flows in and out of it.

There are 4 Standard MDM Patterns. It is helpful to know these because the SQL approach we just built corresponds to exactly one of them.

1. The Consolidation Pattern (What we just built)

  • How it works: You pull data from sources (Salesforce, POS) into a hub (or Data Warehouse), clean it there, and use the "Golden Record" only for reporting/analytics. You do not push the clean data back to the source systems.

  • Analogy: You take photos of a messy room, Photoshop them to look clean, and show the photos to your boss. The actual room stays messy.

  • Pros: Cheapest, fastest, easiest for Data Engineers to implement (using dbt/SQL).

  • Cons: The operational apps (Salesforce) still have duplicates.

2. The Registry Pattern

  • How it works: The MDM hub doesn't store the full customer record. It only stores the ID cross-reference. It tells you, "Customer ID 101 in Sales is the same as Customer ID 505 in Support."

  • Analogy: A phone book index. It tells you where to find the person, but it doesn't describe them.

  • Pros: Very lightweight, handles privacy/compliance well (no data duplication).

  • Cons: Queries are slow because you have to fetch data from all sources in real-time.
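At its core, a registry hub can be as simple as one cross-reference table (a hypothetical sketch):

CREATE TABLE customer_xref (
    master_id     INT,          -- the enterprise-wide identity
    source_system VARCHAR(50),  -- 'sales', 'support', ...
    source_id     VARCHAR(50)   -- the customer's ID inside that system
);

-- "Customer 101 in Sales is the same person as Customer 505 in Support"
INSERT INTO customer_xref VALUES
    (1, 'sales',   '101'),
    (1, 'support', '505');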

3. The Coexistence Pattern

  • How it works: Similar to Consolidation, but with a feedback loop. You create the Golden Record in the hub, and then you update the source systems with the clean data (Reverse ETL).

  • Analogy: You clean the room, and if someone tries to mess it up again, you gently correct them.

  • Pros: Operational systems get cleaner over time.

  • Cons: Expensive and complex. You need permission to write back to the source databases.

4. The Centralized (Transaction) Pattern

  • How it works: The MDM Hub becomes the system of entry. You don't create a customer in Salesforce anymore; you create them in the MDM tool, and it pushes the data to Salesforce.

  • Analogy: The MDM Hub is the "Vault." You cannot touch the money (data) unless you go through the teller (MDM).

  • Pros: Absolute data consistency. 100% clean data always.

  • Cons: Extremely expensive, creates a bottleneck, and business users often hate it because it changes their workflow.

Summary

  • Matching Algorithms (Soundex, Levenshtein) = The Mechanics (How you find duplicates).

  • MDM Patterns (Consolidation, Registry, etc.) = The Architecture (Where the data flows).

