# IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller ## **Prerequisites** You need to get familiar with Profile attribute snapshot dataset explorations. Please complete or browse the following sections before proceeding: {% content-ref url="../unit-6-data-distiller-audiences/dda-300-audience-overlaps-with-data-distiller" %} [dda-300-audience-overlaps-with-data-distiller](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-6-data-distiller-audiences/dda-300-audience-overlaps-with-data-distiller) {% endcontent-ref %} ## Scenario The profile attribute snapshot dataset is a daily export from the Real-Time Customer Profile. This dataset also contains the identities for the profile data. The identities are encapsulated in a map data structure with identity types (called namespaces in AEP parlance) and identity values. There can be multiple identity types (email, cookie, cross-device such as CRM, etc.) and there can be multiple values within each. You are tasked with transforming this data in the map structure to a relational structure that can function as a lookup table. This lookup table will serve as the foundation of all analytics you will do on Real-Time Customer Profile data and beyond. Whether it is Customer 360, SQL Traits, or ML workflows, this lookup table will be invaluable for those use cases.

## **Exploring the Map Structure using Data Distiller: The Identity Map** Before you extract the identity information, you need to understand the map structure. Use the to\_json feature in Data Distiller to get in place information about the schema structure without having to go to the Schemas UI screen. This is an extremely important feature with deeply nested data structures and maps. {% code overflow="wrap" %} ```sql SELECT to_json(identityMap) FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903; ``` {% endcode %} If you execute the following, you will see the following:

What you see above is the identity map structure. The map has the identity namespace (`email, crmid`) as the unique key to index the values. Note that this is an array of values. If you need to extract the identities, you will need to use the following code: {% code overflow="wrap" %} ```sql SELECT identityMap.email.id FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903; ``` {% endcode %} The results look like the following:

Typically we would explode this array {% code overflow="wrap" %} ```sql SELECT explode(identityMap.email.id) FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903; ``` {% endcode %}

Exploding the array helps you extract the identities.

{% hint style="info" %} `identityMap.email.id` gives you the array of email identity values and not just an email identity value and these arrays are great for explode functions. {% endhint %} But there is a problem - plain vanilla "explode" will get rid of rows that do not have any identity values. That is a problem because the absence of an identity is a signal and is possibly associated with other identities that the person may have. Let us fix the issue by using *explode\_outer* from Data Distiller: {% code overflow="wrap" %} ```sql SELECT explode_outer(identityMap.email.id) AS email FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903 ``` {% endcode %}

explode_outer retains the nulls for the identity namespaces

## **Extract Identities from the Map Structure to Create an Identity Lookup Table** But just doing an `explode_outer` on a single identity namespace is of no use. We have destroyed the association between identities within that same namespace. Let us generate a synthetic UUID for each profile row to keep track of our identities as we explode them. If we do this right, we will generate a table of UUIDs and related identities as a columnar table that gives us leverage to use it for a variety of operations. If we take the first element of the email namespace and concatenate that with the first element of the `crmid` namespace, we are guaranteed a unique identifier. If not, it would mean that two rows in the profile export dataset have that identity in common. The identity stitching in Real-Time Customer Profile should have taken care of it and merged it into a single row. Let us now generate the unique UUIDs along with the schema-friendly identity map structure {% code overflow="wrap" %} ```sql SELECT concat(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,to_json(identityMap) from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903 ``` {% endcode %} The results will look like:

Generate the unique ID without llosing identity associations.

But `concat` string function does a poor job of concatenating NULL values. We have to remedy that with `COALESCE`: {% code overflow="wrap" %} ```sql SELECT COALESCE(identityMap.email.id[0],identityMap.crmid.id[0]) as unique_id,to_json(identityMap) from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903 ``` {% endcode %} This works and you will get:

COALESCE function retrieves the first non-zero value in a list and is a good choice for a unique ID for the profile in our example.

If there are two identity namespaces, then the `explode_outer` operator works on each, one at a time. Make sure you remove the `to_json` as we do not need it anymore: {% code overflow="wrap" %} ```sql SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM (SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) ``` {% endcode %} You will get:

*email* identities have been separated into separate rows without breaking profile association.

We need to explode this one more time for `crmid` and we should get: {% code overflow="wrap" %} ```sql SELECT unique_id, email, explode_outer(crmid) as crmid FROM ( SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM (SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) ) ``` {% endcode %} The results would be:

Identity lookup table in relational form

Using this table, I can do some very interesting analysis. For example, we can look at the histogram of `email` IDs in the system: {% code overflow="wrap" %} ```sql SELECT bucket, count(unique_id) FROM( SELECT unique_id, count(CASE WHEN email IS NULL THEN 0 ELSE email END) as bucket FROM( SELECT unique_id, email, explode_outer(crmid) as crmid FROM ( SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM (SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) )) GROUP BY unique_id ORDER BY bucket ASC) GROUP BY bucket ``` {% endcode %} The answer looks like this:

Why are we seeing 976 as a bucket and 0 counts for unique\_ids? The sum of 8225 and 976 adds up to the number of profiles. Why did we even get this result in the first place? Is there something wrong with the way the SQL is written or is there something more profound happening in the identity graph? Still cannot figure it out? What if the *email* and *crmid i*dentities were non-existent? You have null rows in the table that you should clean before you do any analytics. For the identity lookup use case which is the goal of this section, it would not matter,