# IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller

## Prerequisites

For this tutorial, you will need to ingest the following dataset:

{% file src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FGvj1ubHMabEgaP5ATYUM%2Fexample_dataset.csv?alt=media&token=e1c3d790-238d-4450-afcc-c4ff539dca50" %}

using the steps outlined in:

{% content-ref url="../prep-500-ingesting-csv-data-into-adobe-experience-platform" %}
[prep-500-ingesting-csv-data-into-adobe-experience-platform](https://data-distilller.gitbook.io/adobe-data-distiller-guide/prep-500-ingesting-csv-data-into-adobe-experience-platform)
{% endcontent-ref %}

We will also be using DBVisualizer to extract large volumes of data directly onto our machine from the Data Distiller backend:

{% content-ref url="../unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller" %}
[prep-400-dbvisualizer-sql-editor-setup-for-data-distiller](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller)
{% endcontent-ref %}

## Overview

Fuzzy matching is a technique used to identify and link records that are similar but not exact matches, such as CRM\_IDs or name fields with slight variations. By applying fuzzy matching, you can address discrepancies arising from data entry errors or inconsistent formatting, which are common causes of profile collapse in customer databases. When records are mistakenly treated as separate due to small variations, fuzzy matching can help consolidate them, leading to more accurate identity resolution.

For instance, two CRM\_IDs that differ by only one or two characters may represent the same individual, yet that small discrepancy prevents the system from recognizing them as such. By detecting and merging these near-duplicates, fuzzy matching improves the quality and continuity of customer profiles.

The implementation uses Data Distiller to preprocess the data with fuzzy matching techniques. This initial phase standardizes values and identifies close matches within CRM\_IDs or name fields. The cleaned, standardized values are then applied in SQL to create a consolidated dataset, producing more accurate profile records and reducing incorrect merges and data fragmentation. This multi-step approach combines fuzzy matching capabilities with SQL's power to handle large datasets efficiently, enhancing overall data accuracy and reliability.

## Levenshtein Distance

The **Levenshtein distance** is a way of measuring how different two words (or strings) are from each other. Imagine you have two words, and you want to see how many small changes it would take to turn one word into the other.

Each change could be:

1. **Inserting** a new letter (e.g., turning "cat" into "cart" by adding "r").
2. **Deleting** a letter (e.g., turning "cart" into "cat" by removing "r").
3. **Replacing** one letter with another (e.g., turning "cat" into "bat" by changing "c" to "b").

The **Levenshtein distance** counts the **minimum number of these small changes** needed to make the two words exactly the same. So, if two words are very similar, like "cat" and "bat," the distance is small (only one change). If they're quite different, like "cat" and "elephant," the distance is much larger because you'd need many changes.

In essence, Levenshtein distance gives us a simple way to measure how "close" or "far" two words are from each other based on the number of changes required. It's often used in spell-checkers or in finding similar records in databases to help match up entries that might be slightly different due to typos or inconsistencies.
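You can try this directly in SQL. The sketch below assumes the Data Distiller backend exposes the `LEVENSHTEIN` function used in the next section (as in Spark SQL); the column aliases are illustrative:

{% code overflow="wrap" %}

```sql
-- Edit distances for the examples above
SELECT
    LEVENSHTEIN('cat', 'bat')      AS cat_bat,      -- 1: replace "c" with "b"
    LEVENSHTEIN('cat', 'cart')     AS cat_cart,     -- 1: insert "r"
    LEVENSHTEIN('cat', 'elephant') AS cat_elephant; -- 6: one substitution plus five insertions
```

{% endcode %}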

## Identify Fuzzy Matches in `CRM_ID` Using `LEVENSHTEIN` Distance

In this step, fuzzy matching is applied to identify records with `CRM_ID`s that are nearly identical but differ slightly due to minor data entry errors or formatting inconsistencies. The query leverages the **Levenshtein distance** function to calculate the edit distance between pairs of **`CRM_ID`**s, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to make one **`CRM_ID`** identical to another. By setting a threshold of 2, the query identifies `CRM_ID` pairs with minor variations, indicating that they may refer to the same entity but were inconsistently recorded. This approach is particularly useful where exact matching fails to capture duplicates because of slight discrepancies. Storing these potential matches in a temporary table, **`fuzzy_matches`**, allows a detailed review or an automated cleanup to merge or consolidate profiles, ultimately improving the accuracy and integrity of the dataset.
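The Step 1 query that populates **`fuzzy_matches`** can be sketched as follows. This is a minimal sketch, not the exact query behind the screenshot: it assumes the ingested table is named `example_dataset` (as in the later steps) and that the backend exposes a `LEVENSHTEIN` function; the column names match those referenced in Steps 2 and 3 (`ECID1`, `ECID2`, `CRM_ID1`, `CRM_ID2`, `crm_id_similarity_score`):

{% code overflow="wrap" %}

```sql
-- Step 1 (sketch): find CRM_ID pairs within an edit distance of 2
CREATE TEMP TABLE fuzzy_matches AS
SELECT
    a.ECID   AS ECID1,
    b.ECID   AS ECID2,
    a.CRM_ID AS CRM_ID1,
    b.CRM_ID AS CRM_ID2,
    LEVENSHTEIN(a.CRM_ID, b.CRM_ID) AS crm_id_similarity_score
FROM example_dataset a
JOIN example_dataset b
  ON a.ECID <> b.ECID                                     -- never pair a record with itself
WHERE LEVENSHTEIN(a.CRM_ID, b.CRM_ID) BETWEEN 1 AND 2;    -- near-identical, but not exact, matches

SELECT * FROM fuzzy_matches;
```

{% endcode %}

Pairs are kept in both directions (`a`/`b` and `b`/`a`) so that every `ECID` needing consolidation can appear on the `ECID2` side of the later join.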

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F0UDjngMe4oYGFG5VIrfs%2FScreen%20Shot%202024-10-30%20at%2010.34.23%20AM.png?alt=media&#x26;token=11459188-459c-4926-863b-0d99a2de6fa8" alt=""><figcaption><p>Results of the fuzzy match using the Levenshtein distance. </p></figcaption></figure>

## Select Best Match Based on the Highest Similarity Score

In the query below, the goal is to identify the **best match** for each **`ECID`** based on the **highest similarity score** between **`CRM_ID`**s, using our fuzzy matching results. The query operates on the principle of **minimizing the Levenshtein distance** (also called the edit distance) between **`CRM_ID`** pairs within each **`ECID`** group. By finding the smallest possible **`crm_id_similarity_score`**, we capture the closest match: the pair needing the fewest character edits to make the **`CRM_ID`**s identical.

The subquery **`(SELECT MIN(crm_id_similarity_score)...)`** determines this closest match by selecting the smallest **`crm_id_similarity_score`** for each **`ECID1`**, representing the record with the highest similarity. The primary query then filters **`fuzzy_matches`** to include only those pairs whose similarity score is equal to this minimum value, effectively creating **`best_matches`**. This temporary table stores each **`ECID`** and its closest matching record, allowing for precise consolidation based on the closest possible **`CRM_ID`** values. By focusing on the minimum edit distance, the query ensures that only the best match is selected for each **`ECID`**, thus refining identity resolution and reducing the chance of incorrect profile merges.

{% code overflow="wrap" %}

```sql
-- Step 2: Select the best match for each ECID (lowest edit distance = highest similarity)
CREATE TEMP TABLE best_matches AS
SELECT ECID1, ECID2, CRM_ID1, CRM_ID2
FROM fuzzy_matches
WHERE crm_id_similarity_score = (
    SELECT MIN(crm_id_similarity_score)
    FROM fuzzy_matches f
    WHERE f.ECID1 = fuzzy_matches.ECID1
);

SELECT * FROM best_matches;
```

{% endcode %}

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2Fl7MobJz1VALLEQiSdYiD%2FScreen%20Shot%202024-10-30%20at%2010.41.38%20AM.png?alt=media&#x26;token=0a749bb1-6da0-4bb1-b014-8f52dee0734d" alt=""><figcaption><p>Highest similarity scoring matches.</p></figcaption></figure>

## Create a Cleaned Dataset

To create the cleaned dataset, use the following:

{% code overflow="wrap" %}

```sql
-- Step 3: Create a new cleaned dataset with preferred CRM_ID for each ECID
CREATE TEMP TABLE cleaned_dataset AS
SELECT 
    a.ECID,
    COALESCE(b.CRM_ID1, a.CRM_ID) AS CRM_ID,  -- Use preferred CRM_ID from best_matches if available
    a.Device_Type,
    a.Login_Timestamp
FROM example_dataset a
LEFT JOIN best_matches b ON a.ECID = b.ECID2;

SELECT * FROM cleaned_dataset;
```

{% endcode %}

In the query above, the goal is to create a **cleaned dataset** where each **`ECID`** is associated with a preferred **`CRM_ID`**, selected based on the closest match identified in the **`best_matches`** table.

The query works on the principle of **data standardization and preference selection**: it uses the fuzzy matching results to replace potentially inconsistent or duplicate **`CRM_ID`**s with the most representative one for each **`ECID`**. Here’s how it achieves this:

1. **COALESCE Selection**: The query applies the **`COALESCE(b.CRM_ID1, a.CRM_ID)`** function, which takes the **`CRM_ID`** from **`best_matches`** (if available) as the preferred identifier. **`COALESCE`** ensures that if there is no match in **`best_matches`**, the original **`CRM_ID`** from **`example_dataset`** (**`a.CRM_ID`**) is retained. This means that for each **`ECID`**, the system first looks for a refined `CRM_ID` and defaults to the original one if no match exists.
2. **LEFT JOIN**: By performing a **`LEFT JOIN`** between **`example_dataset` (`a`)** and **`best_matches` (`b`)** on **`ECID`**, the query ensures that all records in **`example_dataset`** are preserved. Only records with a corresponding **`ECID`** in **`best_matches`** will have the **`CRM_ID`** replaced, making the cleaned dataset comprehensive while preserving unmatched entries.
3. **Resulting Cleaned Dataset**: The **`cleaned_dataset`** now contains records where each **`ECID`** is linked to the best possible **`CRM_ID`**, improving data consistency by standardizing identifiers based on the closest match.

The results are:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FFlqWFGDj1tDMsptiMD4i%2FScreen%20Shot%202024-10-30%20at%2010.48.18%20AM.png?alt=media&#x26;token=68a72290-66e0-4d75-8a22-8136e910dd9c" alt=""><figcaption><p>Cleaned dataset shows duplicate records.</p></figcaption></figure>

To remove duplicate records in the **`cleaned_dataset`** based on **`ECID`** while retaining only one record per **`ECID`**, you can use a **`ROW_NUMBER()`** function to rank the records within each **`ECID`** group, then select only the top-ranked record. This method ensures that duplicates are filtered out, leaving only one preferred **`CRM_ID`** for each **`ECID`**.

{% code overflow="wrap" %}

```sql
-- Step 3a: Create a dataset with preferred CRM_IDs and rank duplicates for removal
CREATE TEMP TABLE ranked_cleaned_dataset AS
SELECT 
    a.ECID,
    COALESCE(b.CRM_ID1, a.CRM_ID) AS CRM_ID,  -- Use preferred CRM_ID from best_matches if available
    a.Device_Type,
    a.Login_Timestamp,
    ROW_NUMBER() OVER (PARTITION BY a.ECID ORDER BY a.Login_Timestamp DESC) AS rn  -- Rank records within each ECID by Login_Timestamp
FROM example_dataset a
LEFT JOIN best_matches b ON a.ECID = b.ECID2;

-- Step 3b: Select only the top-ranked record for each ECID to remove duplicates
DROP TABLE IF EXISTS cleaned_dataset;  -- discard the earlier version so the temp table can be recreated
CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM ranked_cleaned_dataset
WHERE rn = 1;
```

{% endcode %}

## Verify the Cleaned Dataset

You can preview the results, which is practical for a small dataset like this one:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F8sVweR4mo0ZmVosqPUEh%2FScreen%20Shot%202024-10-30%20at%2011.03.52%20AM.png?alt=media&#x26;token=db7b34c5-d401-4f45-9e3c-a8e874ab04e3" alt=""><figcaption><p>If you scroll down the resultset, you will only see 9458 rows.</p></figcaption></figure>

The reduced row count in the result set (9,458 instead of the original 10,000) indicates that some records were filtered out as duplicates when we applied the **`ROW_NUMBER()`** function. Many **`ECID`**s in the original **`example_dataset`** had multiple records, whether with different **`CRM_ID`**s or duplicate entries based on **`Login_Timestamp`**. Keeping only **`rn = 1`** retains exactly one record per **`ECID`**, the one with the most recent **`Login_Timestamp`**, so the 10,000 original rows collapse to the 9,458 unique **`ECID`**s in **`cleaned_dataset`**.
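You can confirm where that figure comes from with a quick sanity check (a sketch using the table names from this tutorial): because the filter keeps one row per **`ECID`**, the deduplicated row count should equal the number of distinct **`ECID`**s in the source.

{% code overflow="wrap" %}

```sql
-- Sanity check: deduplicated row count should equal the distinct ECID count
SELECT
    (SELECT COUNT(*) FROM example_dataset)             AS original_rows,
    (SELECT COUNT(DISTINCT ECID) FROM example_dataset) AS distinct_ecids,
    (SELECT COUNT(*) FROM cleaned_dataset)             AS deduplicated_rows;
```

{% endcode %}

With this dataset, `original_rows` is 10,000 while `distinct_ecids` and `deduplicated_rows` both come out to 9,458.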

{% code overflow="wrap" %}

```sql
-- Step 4: Verify the result by checking for profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
```

{% endcode %}

An empty result confirms that no **`ECID`** is linked to more than one **`CRM_ID`**:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FPQfUIeMjuKLTBrnuTuQ6%2FScreen%20Shot%202024-10-30%20at%2011.45.40%20AM.png?alt=media&#x26;token=71367d1d-f634-42da-bd90-741a3fd9754a" alt=""><figcaption><p>There are no duplicates in the 9458 identity matches. </p></figcaption></figure>
