# EXPLORE 201: Exploring Web Analytics Data with Data Distiller

## Prerequisites

You need to make sure you complete this module and its prerequisites:

{% content-ref url="explore-200-exploring-behavioral-data-with-data-distiller-a-case-study-with-adobe-analytics-data" %}
[explore-200-exploring-behavioral-data-with-data-distiller-a-case-study-with-adobe-analytics-data](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-2-data-distiller-data-exploration/explore-200-exploring-behavioral-data-with-data-distiller-a-case-study-with-adobe-analytics-data)
{% endcontent-ref %}

## **Scenario**

We are going to ingest LUMA data into our test environment. This is a [fictitious online store ](https://luma.enablementadobe.com/content/luma/us/en.html)created by Adobe

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F6truaKqY41U0A2QO8pze%2FScreen%20Shot%202023-08-28%20at%208.50.53%20AM.png?alt=media&#x26;token=7ab53270-4e4a-4dab-856a-119b6edd1f35" alt=""><figcaption><p>Luma website</p></figcaption></figure>

The fastest way to understand what is happening on the website is to check the Products tab. There are 3 categories of products for different (and all) personas. You can browse them. You authenticate yourself and also can add items to a cart. The data that we are ingesting into the Platform is the test website traffic data that conforms to the Adobe Analytics schema.&#x20;

We need to run some analytical queries on this dataset.&#x20;

## Count the Number of Events  in the AA Dataset

```sql
SELECT count(event_id) FROM  Adobe_Analytics_View 
```

The answer should be **733,265**. This is also the web traffic volume.

## Count of Visitors and Authenticated Visitors

{% code overflow="wrap" %}

```sql
SELECT COUNT(DISTINCT mcid_id) AS Cookie_Visitors,  COUNT(DISTINCT email_id) AS authenticated_Vistors FROM  Adobe_Analytics_View 
```

{% endcode %}

The answer you should get for both should be 30,000. This means that every cookie is associated with an email which at first instance should come across as strange. But this is demo data and we can assume that someone has done the ID resolution for us for **ALL mcids**.

## Time Range of the Data

```sql
SELECT min(TimeStamp), max(TimeStamp) FROM  Adobe_Analytics_View 
```

The time range should come as 2020-06-30 22:04:47 to 2021-01-29 23:47:04

## Most Popular Web Pages

```sql
SELECT WebPageName, count(WebPageName) AS WebPageCounts FROM  Adobe_Analytics_View 
GROUP BY WebPageName
ORDER BY WebPageCounts DESC
```

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FkTIO1tY5vexILn1tc6aJ%2FScreen%20Shot%202023-08-30%20at%205.11.19%20PM.png?alt=media&#x26;token=cdb0a2df-9562-4458-9021-5cb5423ac08a" alt=""><figcaption><p>The top web pages by counts fro June 30, 2020 to Jan 29, 2021.</p></figcaption></figure>

## Count the Number of Visits/Sessions

One of the foundational concepts of web analytics is the idea of a session or a visit. When you visit a website, a timer starts ticking and all the pages that you visited, say in the next 30 minutes are part of that session. Sessions are great because they are the atomic unit of a journey. Customers interact with a channel or a medium as part of a session. What they do in the session has some intent or goal - if we can study what happens in these sessions, then we can get a solid understanding of the users.

{% code overflow="wrap" %}

```sql
SELECT mcid_id, Timestamp`, 
to_json(SESS_TIMEOUT(Timestamp, 60 * 30)
    OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
    AS session
FROM Adobe_Analytics_View 
ORDER BY 'Timestamp' ASC
```

{% endcode %}

Let us understand the code first:

1. `to_json(SESS_TIMEOUT(Timestamp, 60 * 30)`: Here, the `SESS_TIMEOUT` function is used with the `Timestamp` column. This function calculates the session timeout by adding 30 minutes (60 \* 30 seconds) to the given `Timestamp`. The result is then converted to a JSON format using the `to_json` function.
2. `OVER (PARTITION BY mcid_id ORDER BY Timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`: This is a window function that operates on partitions of data defined by the `mcid_id` column. It orders the rows within each partition based on the `Timestamp` column in ascending order. The `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` clause specifies that the window includes all rows from the beginning of the partition up to the current row.

The result is the following:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FU0Wfn6CjXKDXQojyN3TX%2FScreen%20Shot%202023-08-31%20at%204.31.44%20PM.png?alt=media&#x26;token=fe57bd10-4f61-4653-b0e3-f2d4b1ddb86c" alt=""><figcaption><p>Sessionization on the event data</p></figcaption></figure>

Let us now parse the results in the session object:

1. If you look at the mcid\_id column, all of those ids are sorted by the same person. The sessionization always operates on a single mcid\_id
2. `timestamp_diff:` The difference in time, in seconds, between the current record and the prior record. It starts with "0" for the first record and increases for the other records within the same session as indicated by `depth.`
3. `num:` A unique session number, starting at 1 for each mcid\_id. `isnew` is just a flag as to whether the record is the start of a new session or not.

I can now extract the session number at a visitor level and also assign it a unique session number across all visitors by doing the following:

{% code overflow="wrap" %}

```sql
SELECT mcid_id, `Timestamp`, concat(mcid_id, '-',`session`.num) AS unique_session_number, `session`.num AS session_number_per_mcid
FROM
(
SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(Timestamp, 60 * 30)
    OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS session
FROM Adobe_Analytics_View 
ORDER BY 'Timestamp' ASC
)

```

{% endcode %}

{% hint style="warning" %}
**Warning:** I have removed `to_json` in the code here as I need to access the fields within the session object. If I use `to_json`, it will create a string and the fields cannot be extracted.&#x20;
{% endhint %}

The results are the following:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FI25lBuB1uG9TLlmthfJb%2FScreen%20Shot%202023-08-31%20at%204.55.38%20PM.png?alt=media&#x26;token=79ea3824-628a-4df0-a554-bc1d23d6ce76" alt=""><figcaption><p>Session number assigned at the visitor level and across all visitors.</p></figcaption></figure>

Let us compute the number of visits overall:

{% code overflow="wrap" %}

```sql
SELECT COUNT(DISTINCT unique_session_number) FROM (
SELECT mcid_id, `Timestamp`, concat(mcid_id, '-',`session`.num) AS unique_session_number, `session`.num AS session_number_per_mcid
FROM
(
SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(Timestamp, 60 * 30)
    OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS session
FROM Adobe_Analytics_View 
ORDER BY 'Timestamp' ASC))
```

{% endcode %}

The result should be **104,721**.

The average number of pages visited per visit is **733,265/104,721=7.** This does agree with what we see when we inspect the results.
