# STATSML 603: Predicting Customer Conversion Scores Using Random Forest in Data Distiller

## Prerequisites

Download the following datasets:

{% file src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FufERCaQ2Y1JsnTLlgG59%2Fwebevents_train.csv?alt=media&token=126a2d21-ab6d-4266-a2e1-7992dcdbbd21" %}

{% file src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FwpjVc0X28u3mOIibFPoL%2Fwebevents_test.csv?alt=media&token=e7c3e8c9-213b-4f43-bcbd-441f38812f43" %}

{% file src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F0ou7GAGg7HQ6Uvhi1EsQ%2Fwebevents_inference.csv?alt=media&token=c815ae44-5d38-4a43-b1c2-af4c2214b66e" %}

Ingest the above datasets using:

{% content-ref url="../prep-500-ingesting-csv-data-into-adobe-experience-platform" %}
[prep-500-ingesting-csv-data-into-adobe-experience-platform](https://data-distilller.gitbook.io/adobe-data-distiller-guide/prep-500-ingesting-csv-data-into-adobe-experience-platform)
{% endcontent-ref %}

Make sure you have read:

{% content-ref url="statsml-600-data-distiller-advanced-statistics-and-machine-learning-models" %}
[statsml-600-data-distiller-advanced-statistics-and-machine-learning-models](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-advanced-statistics-and-machine-learning-models)
{% endcontent-ref %}

## Overview

Businesses aim to optimize marketing efforts by identifying customer behaviors that lead to conversions (e.g., purchases). Using SQL-based feature engineering and a Random Forest model, we can analyze user interactions, extract actionable insights, and predict the likelihood of conversions.

A retail company tracks website activity, including page views, purchases, and campaign interactions. They want to:

1. **Understand Customer Behavior:** Analyze aggregated session data such as visit frequency, page views, and campaign participation.
2. **Predict Conversions:** Use historical data to predict whether a specific user interaction will result in a purchase.
3. **Optimize Engagement:** Focus marketing campaigns and resources on high-conversion-probability customers to maximize ROI.

## Random Forest Regression Model

A **Random Forest** is an ensemble machine learning algorithm that uses multiple decision trees to make predictions. It is a type of supervised learning algorithm widely employed for both classification and regression tasks. By combining the predictions of several decision trees, Random Forest enhances accuracy and reduces the risk of overfitting, making it a robust and reliable choice for a variety of machine learning problems.

The algorithm works by constructing multiple decision trees during training. Each tree is trained on a random subset of the data and features, a technique known as bagging. For classification problems, Random Forest aggregates the predictions of individual trees using majority voting. In regression problems, it averages the predictions across trees to determine the final output. By selecting random subsets of features for training, Random Forest reduces the correlation between individual trees, leading to improved overall prediction accuracy.
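The averaging step for regression can be verified directly. The sketch below uses scikit-learn rather than Data Distiller (an assumption made purely for illustration); it trains a forest with the same tree count and depth used later in this lesson and confirms that the forest's prediction is the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data standing in for aggregated web-event features
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X[:, 0] * 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Same hyperparameters as the Data Distiller model below: 20 trees, depth 5
forest = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
forest.fit(X, y)

# For regression, the ensemble output is the average of the per-tree outputs
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
assert np.allclose(tree_preds.mean(axis=0), forest.predict(X))
```

Each tree in `forest.estimators_` was fit on a bootstrap sample with feature subsampling (bagging), which is why their individual predictions differ while their average is stable.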

In this use case, the goal is to predict the score of user conversion based on web event data. Random Forest is particularly well-suited to this scenario for several reasons. First, it handles mixed data types seamlessly. The dataset contains both categorical features, such as browser and campaign IDs, and numerical features, like page views and purchases. Random Forest accommodates these variations without requiring extensive preprocessing.

Additionally, Random Forest is robust against noise and overfitting. Web activity data often contains irrelevant features or noisy observations. By averaging predictions across trees, the algorithm reduces the influence of noisy data and avoids overfitting, ensuring more reliable predictions. Furthermore, Random Forest provides valuable insights into feature importance, helping to identify which factors, such as page views or campaign IDs, contribute most significantly to user conversions.

Another advantage of Random Forest is its ability to model non-linear relationships. User conversion likelihood is often influenced by complex interactions between features. Random Forest captures these relationships effectively without requiring explicit feature engineering. The algorithm is also scalable, capable of handling large datasets with millions of user sessions, thanks to its parallel computation capabilities.

Random Forest is flexible for regression tasks, which is crucial for this use case where the target variable is a conversion score between 0 and 1. Its inherent design makes it ideal for predicting continuous outcomes. *In contrast, a single decision tree, while simpler, is prone to overfitting, especially in datasets with many features and potential noise.* Random Forest mitigates this limitation by averaging the predictions of multiple trees, delivering more generalizable and robust results.

## Rule-Based Labeling for Conversion Scoring: Automating Data Annotation with Data Distiller

Use a SQL transformation to aggregate events per visit and label conversions:

```sql
-- Create a transformed dataset
CREATE TABLE transformed_webevents AS
SELECT
    visit_id,
    UPPER(country_cd) AS country_encode,
    campaign_id,
    browser_id,
    operating_system_id,
    COUNT(*) AS visits,
    SUM(pageviews) AS total_pageviews,
    SUM(purchases) AS total_purchases,
    CASE 
        WHEN SUM(purchases) > 0 THEN 1
        ELSE 0
    END AS converted
FROM webevents_train
GROUP BY visit_id, country_cd, campaign_id, browser_id, operating_system_id;
```
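For local experimentation outside Data Distiller, the same aggregation and labeling logic can be sketched in pandas. The column names are assumed to match the CSV schema used above; the toy rows are illustrative only.

```python
import pandas as pd

# Toy events mirroring the assumed webevents_train schema
events = pd.DataFrame({
    "visit_id": [1, 1, 2],
    "country_cd": ["us", "us", "de"],
    "campaign_id": [10, 10, 11],
    "browser_id": [1, 1, 2],
    "operating_system_id": [1, 1, 2],
    "pageviews": [3, 2, 1],
    "purchases": [0, 1, 0],
})

keys = ["visit_id", "country_cd", "campaign_id", "browser_id", "operating_system_id"]
agg = (events.groupby(keys)
       .agg(visits=("pageviews", "size"),          # COUNT(*)
            total_pageviews=("pageviews", "sum"),  # SUM(pageviews)
            total_purchases=("purchases", "sum"))  # SUM(purchases)
       .reset_index())
agg["country_encode"] = agg["country_cd"].str.upper()  # UPPER(country_cd)
agg["converted"] = (agg["total_purchases"] > 0).astype(int)  # the CASE expression
```

The `(total_purchases > 0)` comparison plays the role of the SQL `CASE` statement: any visit with at least one purchase is labeled `1`.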

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FUWv0XsKEsqNTUzKV6mrF%2FScreen%20Shot%202024-11-14%20at%2010.42.15%20PM.png?alt=media&#x26;token=a1c83386-2200-43d7-9e55-d1ab965699c5" alt=""><figcaption><p>Creating the feature set.</p></figcaption></figure>

In the model's `TRANSFORM` clause below, note that:

* **`string_indexer`** encodes the categorical features (**`visit_id`, `country_encode`, `campaign_id`, `browser_id`, `operating_system_id`**).
* **`vector_assembler`** combines the encoded categorical features and the numerical features (**`visits`, `total_pageviews`, `total_purchases`**) into a single feature vector.
* **`standard_scaler`** scales this feature vector to normalize values for training and enhance model performance.
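These three transforms correspond roughly to the following scikit-learn steps (a sketch of the concept, not the Data Distiller implementation; note that centering/scaling defaults may differ from `standard_scaler`):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical rows: three categorical columns, then three numeric counts
categorical = np.array([["US", "c1", "chrome"],
                        ["DE", "c2", "firefox"],
                        ["US", "c1", "safari"]])
numeric = np.array([[2, 5, 1],
                    [1, 2, 0],
                    [3, 9, 2]], dtype=float)

indexed = OrdinalEncoder().fit_transform(categorical)  # string_indexer analogue
features = np.hstack([indexed, numeric])               # vector_assembler analogue
scaled = StandardScaler().fit_transform(features)      # standard_scaler analogue
```

After scaling, every feature column has zero mean and unit variance, so no single feature dominates the distance-based splits purely because of its raw scale.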

{% hint style="warning" %}
Note that we are using a simple CASE statement to assign a binary label in our data. This approach has trade-offs:

**Loss of Nuance**: By converting the target variable to a binary 0 or 1, we may lose information about the magnitude of purchases. For instance, a user with one purchase is treated the same as a user with multiple purchases. In cases where we want to predict the extent of engagement or the volume of purchases, this binary target may not capture the full range of user behavior.

**Suitability for Regression**: Since we are using random forest regression, which is typically better suited for continuous targets, applying it to a binary target might not be ideal. Random forest regression will still function, but it may not fully leverage the model’s strengths in predicting continuous outcomes. If our primary goal is to predict conversion likelihood (0 or 1), a classifier like random forest classification might be more appropriate.

**Alternatives**: If we have access to more granular data on the number of purchases, we could consider using a different target variable that reflects this information, such as the count of purchases or the monetary value of purchases. Using a continuous target with random forest regression could enable the model to capture the full range of behaviors, giving us insights into not just who is likely to convert but also to what extent they engage in purchases. Alternatively, if our primary objective is binary conversion prediction, we could use a random forest classifier to better align with the binary nature of our target.
{% endhint %}

## **Build the Random Forest Model**

{% code overflow="wrap" %}

```sql
CREATE MODEL random_forest_model
TRANSFORM (
    string_indexer(visit_id) AS si_id,
    string_indexer(country_encode) AS country_code,
    string_indexer(campaign_id) AS campaign_encode,
    string_indexer(browser_id) AS browser_encode,
    string_indexer(operating_system_id) AS os_encode,
    vector_assembler(array(si_id, country_code, campaign_encode, browser_encode, os_encode, visits, total_pageviews, total_purchases)) AS features,
    standard_scaler(features) AS scaled_features
)
OPTIONS (
    MODEL_TYPE = 'random_forest_regression',
    NUM_TREES = 20,
    MAX_DEPTH = 5,
    LABEL = 'converted'
)
AS
SELECT *
FROM transformed_webevents;

```

{% endcode %}

The result will be:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2Fa5zw1L7fphrH4a9Qf1U2%2FScreen%20Shot%202024-11-14%20at%2010.59.24%20PM.png?alt=media&#x26;token=742d2097-178a-4f71-abf0-b242ff66eda7" alt=""><figcaption><p>ML model is created.</p></figcaption></figure>

## **Model Evaluation**

Evaluate the model using test data:

```sql
SELECT * 
FROM model_evaluate(
    random_forest_model,
    1, -- Model version number
    SELECT
        visit_id,
        country_cd AS country_encode,
        campaign_id,
        browser_id,
        operating_system_id,
        COUNT(*) AS visits,
        SUM(pageviews) AS total_pageviews,
        SUM(purchases) AS total_purchases,
        CASE 
            WHEN SUM(purchases) > 0 THEN 1
            ELSE 0
        END AS converted
    FROM webevents_test
    GROUP BY 
        visit_id, 
        country_cd, 
        campaign_id, 
        browser_id, 
        operating_system_id
);
```

The results are:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FEPTZXDm8eFwEGOsX8tkR%2FScreen%20Shot%202024-11-14%20at%2011.12.57%20PM.png?alt=media&#x26;token=ed2a388b-e169-468d-b3a7-b3e91896ac2a" alt=""><figcaption><p>Results of the evaluation.</p></figcaption></figure>

Here's what each metric means in the context of this Random Forest model evaluation:

**Root Mean Squared Error (RMSE):** RMSE is a metric that measures the average magnitude of the errors between the predicted values and the actual values in your test dataset. It is the square root of the average squared differences between predictions and actuals. In this case, an RMSE of `0.048` indicates that the model's predictions are, on average, about `0.048` away from the actual conversion likelihood values. Since RMSE is on the same scale as the target variable (in this case, a probability score between 0 and 1 for conversion likelihood), a lower RMSE suggests that the model's predictions are relatively accurate.

**R-squared (R²):** R², or the coefficient of determination, measures the proportion of variance in the dependent variable (conversion likelihood) that is predictable from the independent variables (features). An R² value of `0.9907` indicates that the model explains approximately 99.07% of the variance in the conversion likelihoods. This is a high R² value, which suggests that the model fits the data very well and that the features used in the model account for almost all of the variability in conversion outcomes.
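To make the two metrics concrete, here is a sketch computing both from their definitions and cross-checking against `sklearn.metrics`. The values are illustrative toy numbers, not the actual outputs of this tutorial.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual labels and predicted conversion scores
y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.1, 0.9, 0.8, 0.2, 0.95])

# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: 1 minus the ratio of residual variance to total variance
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Because RMSE is on the same 0-to-1 scale as the conversion score, it can be read directly as an average prediction error, while R² is scale-free and compares the model against always predicting the mean.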

#### **Overall Evaluation**

* **Model Accuracy:** The combination of a low RMSE and a high R² value suggests that your Random Forest model is performing exceptionally well in predicting conversion likelihood.
* **Suitability for Use:** These results indicate that the model is reliable for predicting conversions based on the test dataset, and it is likely capturing meaningful patterns in the data.

{% hint style="info" %}
If this performance holds across additional data (e.g., an inference dataset or real-world data), the model can be a valuable tool for predicting user conversions and guiding targeted marketing efforts. However, it’s essential to validate the model with real-world data periodically, as models trained on historical data may degrade in accuracy over time.
{% endhint %}

## **Predictions**

Use the model for prediction on new data:

```sql
SELECT * 
FROM model_predict(
    random_forest_model,
    1, -- Model version number
    SELECT
        visit_id,
        country_cd AS country_encode,
        campaign_id,
        browser_id,
        operating_system_id,
        COUNT(*) AS visits,
        SUM(pageviews) AS total_pageviews,
        SUM(purchases) AS total_purchases,
        CASE 
            WHEN SUM(purchases) > 0 THEN 1
            ELSE 0
        END AS converted
    FROM webevents_inference
    GROUP BY 
        visit_id, 
        country_cd, 
        campaign_id, 
        browser_id, 
        operating_system_id
);
```

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F87ibXDHXCa2qa6enHLLC%2FScreen%20Shot%202024-11-14%20at%2011.18.10%20PM.png?alt=media&#x26;token=a6bd8af9-321c-415a-800e-e4f1565c08a8" alt=""><figcaption><p>Predictions</p></figcaption></figure>
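Downstream of `model_predict`, the continuous scores can feed the targeting use case from the overview. The sketch below is a hypothetical post-processing step (not part of Data Distiller): rank visits by predicted score and keep the top slice for a campaign.

```python
import numpy as np

# Hypothetical predicted conversion scores keyed by visit_id
visit_ids = np.array([101, 102, 103, 104])
scores = np.array([0.92, 0.15, 0.48, 0.71])

# Rank visits by predicted score, highest first, and target the top half
order = np.argsort(scores)[::-1]
top_k = visit_ids[order[:2]]
```

Ranking by score (rather than thresholding at a fixed cutoff) lets the campaign budget, not an arbitrary probability, determine how many customers are targeted.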
