# STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models

## Prerequisites

{% content-ref url="statsml-400-data-distiller-basic-statistics-functions" %}
[statsml-400-data-distiller-basic-statistics-functions](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-400-data-distiller-basic-statistics-functions)
{% endcontent-ref %}

Use case tutorials are here:

{% content-ref url="../unit-7-data-distiller-business-intelligence/bi-500-optimizing-omnichannel-marketing-spend-using-marginal-return-analysis" %}
[bi-500-optimizing-omnichannel-marketing-spend-using-marginal-return-analysis](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-7-data-distiller-business-intelligence/bi-500-optimizing-omnichannel-marketing-spend-using-marginal-return-analysis)
{% endcontent-ref %}

{% content-ref url="statsml-601-building-a-period-to-period-customer-retention-model-using-logistics-regression" %}
[statsml-601-building-a-period-to-period-customer-retention-model-using-logistics-regression](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-601-building-a-period-to-period-customer-retention-model-using-logistics-regression)
{% endcontent-ref %}

{% content-ref url="statsml-602-techniques-for-bot-detection-in-data-distiller" %}
[statsml-602-techniques-for-bot-detection-in-data-distiller](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-602-techniques-for-bot-detection-in-data-distiller)
{% endcontent-ref %}

{% content-ref url="statsml-603-predicting-customer-conversion-scores-using-random-forest-in-data-distiller" %}
[statsml-603-predicting-customer-conversion-scores-using-random-forest-in-data-distiller](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-603-predicting-customer-conversion-scores-using-random-forest-in-data-distiller)
{% endcontent-ref %}

## Overview

Data Distiller users need a convenient way to generate data insights that predict the best strategies for targeting users across various use cases. They want the ability to predict a user's likelihood of buying a specific product, estimate the quantity they may purchase, and identify which products are most likely to be bought. Without built-in support, there is no way to leverage machine learning algorithms directly through SQL to produce predictive insights from the data.

With the introduction of statistical functions such as `CREATE MODEL`, `MODEL_EVALUATE`, and `MODEL_PREDICT`, Data Distiller users gain the capability to create predictive insights from data stored in the lake. This three-step querying process makes it easy to generate actionable insights from their data.

## Augmenting Fully Featured Machine Learning Platform Use Cases

Data Distiller's statistics and ML capabilities can play a crucial role in augmenting full-scale ML platforms such as Databricks, Google Cloud AI Platform, Azure Machine Learning, and Amazon SageMaker, providing valuable support for the end-to-end machine learning workflow. Here's how these features can be leveraged:

#### **Prototyping and Rapid Experimentation**

* **Quick Prototyping**: The ability to use SQL-based ML models and transformations allows data scientists and engineers to quickly prototype models and test different features without setting up complex ML pipelines. This rapid iteration is particularly valuable in the early stages of feature engineering and model development.
* **Feature Validation**: By experimenting with various feature transformations and basic models within Data Distiller, users can validate the quality and impact of different features. This ensures that only the most relevant features are sent for training in full-scale ML platforms, thereby optimizing model performance.

#### **Preprocessing and Feature Engineering**

* **Efficient Feature Processing**: Data Distiller's built-in transformers (e.g., vector assemblers, scalers, and encoders) can be used for feature engineering and data preprocessing steps. This enables seamless integration with platforms by preparing the data in a format that is ready for advanced model training.
* **Automated Feature Selection**: With basic statistical and machine learning capabilities, Data Distiller can help automate feature selection by running simple models to identify the most predictive features before moving to a full-scale ML environment.

#### **Reducing Development Time and Cost**

* **Cost-Effective Experimentation**: By using Data Distiller to conduct initial model experiments and transformations, teams can avoid the high costs associated with running large-scale ML jobs on platforms. This is particularly useful when working with large datasets or conducting frequent iterations.
* **Integrated Workflow**: Once features and models are validated in Data Distiller, the results can be easily transferred to the machine learning platform for full-scale training. This integrated approach streamlines the development process, reducing the time needed for data preparation and experimentation.

#### **Use Case Scenarios**

* **Feature Prototyping**: Data Distiller can serve as a testing ground for new features and transformations. For example, users can build basic predictive models or clustering algorithms to understand the potential of different features before moving to more complex models on Databricks or SageMaker.
* **Model Evaluation and Validation**: Basic model evaluation (e.g., classification accuracy, regression metrics) within Data Distiller can help identify promising feature sets. These insights can guide further tuning and training in full-scale ML environments, reducing the need for costly experiments.

#### **Best Practices for Integration**

* **Modular Approach**: Design Data Distiller processes to produce well-defined outputs that can be easily integrated into downstream ML workflows. For instance, transformed features and initial model insights can be exported as data artifacts for further training.
* **Continuous Learning Loop**: Use the insights from Data Distiller to inform feature engineering strategies. This iterative loop ensures that the models trained on full-scale platforms are built on well-curated and optimized data.

## Advanced Statistics & Machine Learning Functions in Data Distiller

Data Distiller supports various advanced statistics and machine learning operations through SQL commands, enabling users to:

* Create models
* Evaluate models
* Make predictions

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F3gdone6h3WUSdkSXGM2f%2FScreen%20Shot%202024-10-22%20at%2011.18.56%20AM.png?alt=media&#x26;token=c938ed62-a216-48a6-99fe-d83c9554fc79" alt=""><figcaption><p>Steps in a predictive flow.</p></figcaption></figure>

The figure above illustrates the following steps:

* **Source Data**: The process begins with the available source data, which serves as the input for training the machine learning model.
* **`CREATE MODEL` Using Training Data**: A predictive model is created using the training data. This step involves selecting the appropriate machine learning algorithm and training it to learn patterns from the data.
* **`MODEL_EVALUATE` to Check the Accuracy of the Model**: The trained model is then evaluated to measure its accuracy and ensure it performs well on unseen data. This step helps validate the model's effectiveness.
* **`MODEL_PREDICT` to Make Predictions on New Data**: Once the model's accuracy is verified, it is used to make predictions on new, unseen data, generating predictive insights.
* **Output Prediction Data**: Finally, the predictions are written out, providing actionable insights based on the processed data.
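
The flow above can be sketched end to end in SQL. This is a minimal illustration, assuming a hypothetical `churn_model` and illustrative tables (`training_data`, `holdout_data`, `new_data`) and columns:

```sql
-- Step 1: train a model on the source data
CREATE MODEL churn_model
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='churned')
AS
SELECT visits, recency, churned
FROM training_data;

-- Step 2: check accuracy on held-out rows (the label column is included
-- so predictions can be compared against the ground truth)
SELECT * FROM MODEL_EVALUATE(churn_model, SELECT visits, recency, churned FROM holdout_data);

-- Step 3: score new, unseen rows to produce the output prediction data
SELECT * FROM MODEL_PREDICT(churn_model, SELECT visits, recency FROM new_data);
```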

## **Supported Advanced Statistics & Machine Learning Algorithms**

### Regression (Supervised)

* **Linear Regression**: Fits a linear relationship between features and a target variable.
* **Decision Tree Regression**: Uses a tree structure to model and predict continuous values.
* **Random Forest Regression**: An ensemble of decision trees that predicts the average output.
* **Gradient Boosted Tree Regression**: Uses an ensemble of trees to minimize prediction error iteratively.
* **Generalized Linear Regression**: Extends linear regression to model non-normal target distributions.
* **Isotonic Regression**: Fits a non-decreasing or non-increasing sequence to the data.
* **Survival Regression**: Models time-to-event data based on the Weibull distribution.
* **Factorization Machines Regression**: Models interactions between features, making it suitable for sparse datasets and high-dimensional data.

### Classification (Supervised)

* **Logistic Regression**: Predicts probabilities for binary or multiclass classification problems.
* **Decision Tree Classifier**: Uses a tree structure to classify data into distinct categories.
* **Random Forest Classifier**: An ensemble of decision trees that classifies data based on majority voting.
* **Naive Bayes Classifier**: Uses Bayes' theorem with strong independence assumptions between features.
* **Factorization Machines Classifier**: Models interactions between features for classification, making it suitable for sparse and high-dimensional data.
* **Linear Support Vector Classifier (LinearSVC)**: Constructs a hyperplane for binary classification tasks, maximizing the margin between classes.
* **Multilayer Perceptron Classifier**: A neural network classifier with multiple layers for mapping inputs to outputs using an activation function.

### Unsupervised

* **K-Means**: Partitions data into k clusters based on distance to cluster centroids.
* **Bisecting K-Means**: Uses a hierarchical divisive approach for clustering.
* **Gaussian Mixture**: Models data as a mixture of multiple Gaussian distributions.
* **Latent Dirichlet Allocation (LDA)**: Identifies topics in a collection of text documents.

## Summary Table of Available Algorithms

<table data-full-width="true"><thead><tr><th>Category</th><th>Algorithm</th><th>Description</th></tr></thead><tbody>
<tr><td><strong>Regression (Supervised)</strong></td><td><strong>Linear Regression</strong></td><td>Fits a linear relationship between features and a target variable.</td></tr>
<tr><td></td><td><strong>Decision Tree Regression</strong></td><td>Uses a tree structure to model and predict continuous values.</td></tr>
<tr><td></td><td><strong>Random Forest Regression</strong></td><td>An ensemble of decision trees that predicts the average output.</td></tr>
<tr><td></td><td><strong>Gradient Boosted Tree Regression</strong></td><td>Uses an ensemble of trees to minimize prediction error iteratively.</td></tr>
<tr><td></td><td><strong>Generalized Linear Regression</strong></td><td>Extends linear regression to model non-normal target distributions.</td></tr>
<tr><td></td><td><strong>Isotonic Regression</strong></td><td>Fits a non-decreasing or non-increasing sequence to the data.</td></tr>
<tr><td></td><td><strong>Survival Regression</strong></td><td>Models time-to-event data based on the Weibull distribution.</td></tr>
<tr><td></td><td><strong>Factorization Machines Regression</strong></td><td>Models interactions between features, making it suitable for sparse datasets and high-dimensional data.</td></tr>
<tr><td><strong>Classification (Supervised)</strong></td><td><strong>Logistic Regression</strong></td><td>Predicts probabilities for binary or multiclass classification problems.</td></tr>
<tr><td></td><td><strong>Decision Tree Classifier</strong></td><td>Uses a tree structure to classify data into distinct categories.</td></tr>
<tr><td></td><td><strong>Random Forest Classifier</strong></td><td>An ensemble of decision trees that classifies data based on majority voting.</td></tr>
<tr><td></td><td><strong>Naive Bayes Classifier</strong></td><td>Uses Bayes' theorem with strong independence assumptions between features.</td></tr>
<tr><td></td><td><strong>Factorization Machines Classifier</strong></td><td>Models interactions between features for classification, suitable for sparse and high-dimensional data.</td></tr>
<tr><td></td><td><strong>Linear Support Vector Classifier (LinearSVC)</strong></td><td>Constructs a hyperplane for binary classification tasks, maximizing the margin between classes.</td></tr>
<tr><td></td><td><strong>Multilayer Perceptron Classifier</strong></td><td>A neural network classifier with multiple layers for mapping inputs to outputs using an activation function.</td></tr>
<tr><td><strong>Unsupervised</strong></td><td><strong>K-Means</strong></td><td>Partitions data into k clusters based on distance to cluster centroids.</td></tr>
<tr><td></td><td><strong>Bisecting K-Means</strong></td><td>Uses a hierarchical divisive approach for clustering.</td></tr>
<tr><td></td><td><strong>Gaussian Mixture</strong></td><td>Models data as a mixture of multiple Gaussian distributions.</td></tr>
<tr><td></td><td><strong>Latent Dirichlet Allocation (LDA)</strong></td><td>Identifies topics in a collection of text documents.</td></tr></tbody></table>

## **SQL Syntax for Advanced Statistics & Machine Learning Functions**

### **Creating Models**

Use the **`CREATE MODEL`** command to define a new machine learning model.

```sql
CREATE MODEL IF NOT EXISTS my_linear_model
OPTIONS (MODEL_TYPE='linear_reg', MAX_ITER=100, REG_PARAM=0.1)
AS
SELECT feature1, feature2, target_variable
FROM training_dataset;
```

In this example:

* `MODEL_TYPE` specifies the algorithm.
* `MAX_ITER` sets the number of iterations.
* `REG_PARAM` is the regularization parameter.

{% hint style="info" %}
Note that the syntax does not support reading from a **`TEMP`** table and does not allow wrapping the `SELECT` in parentheses, such as:

<pre class="language-sql"><code class="lang-sql">(SELECT feature1, feature2, target_variable
FROM training_dataset);
</code></pre>

{% endhint %}

### **Creating a Model with Preprocessing**

The **`TRANSFORM`** clause allows you to preprocess features before training.

```sql
CREATE MODEL my_classification_model
TRANSFORM (
    binarizer(numeric_feature, 50) as binarized_feature,
    string_indexer(categorical_feature) as indexed_feature,
    vector_assembler(array(binarized_feature, indexed_feature)) as features
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='label_column')
AS
SELECT numeric_feature, categorical_feature, label_column
FROM training_data;
```

This example demonstrates:

* Binarizing a numeric feature.
* Indexing a categorical feature.
* Assembling multiple features into a vector.

## **Feature Transformation Functions**

Feature transformation is the process of extracting meaningful features from raw data to enhance the accuracy of downstream statistical models. The Data Distiller feature engineering SQL extension provides a comprehensive suite of techniques that streamline and automate data preprocessing. These functions allow for seamless, efficient data preparation and enable easy experimentation with various feature engineering methods. Designed for distributed computing, the SQL extension supports feature engineering on large datasets in a parallel and scalable manner, significantly reducing the time needed for preprocessing.

Feature transformation is broadly used for the following purposes:

1. **Extraction**: Extracts important information from data columns, helping models to identify key signals. For example, in textual data, long sentences may contain irrelevant words that need to be removed to improve model performance.
2. **Transformation**: Converts raw data into a format that machine learning models can consume. Since models understand numbers but not text, transformers are used to convert non-numerical data into numerical features.
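
Both purposes can be combined in a single `TRANSFORM` clause. The sketch below is illustrative (table and column names are assumptions): stop-word removal extracts the signal from raw review text, while string indexing transforms a categorical column into numeric form:

```sql
CREATE MODEL extraction_demo_model
TRANSFORM (
    tokenizer(review_text) as words,               -- split raw text into words
    stop_words_remover(words) as cleaned_words,    -- extraction: drop irrelevant words
    string_indexer(membership_tier) as tier_index  -- transformation: text to number
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='converted')
AS
SELECT review_text, membership_tier, converted
FROM review_dataset;
```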

### **Manual Preprocessing**

Define custom preprocessing steps using the `TRANSFORM` clause.

{% code overflow="wrap" %}

```sql
CREATE MODEL custom_transform_model
TRANSFORM (
    numeric_imputer(missing_numeric_column, 'mean') as imputed_column,
    binarizer(imputed_column, 0.0) as binarized_column
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='outcome')
AS
SELECT missing_numeric_column, outcome
FROM raw_dataset;
```

{% endcode %}

### **Automatic Preprocessing**

If the `TRANSFORM` clause is omitted, Data Distiller performs basic preprocessing.
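
For example, the following model omits `TRANSFORM` entirely and relies on that default preprocessing. The exact defaults applied depend on the column types, and the names below are illustrative:

```sql
-- No TRANSFORM clause: Data Distiller applies its basic preprocessing
CREATE MODEL auto_prep_model
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='outcome')
AS
SELECT numeric_feature, categorical_feature, outcome
FROM raw_dataset;
```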

## **Data Distiller Transformers**

Several transformers can be used for feature engineering:

### **Numeric Imputer**

* **Description**: Fills missing numeric values using a specified strategy such as "mean," "median," or "mode."
* **Example**:

  ```sql
  TRANSFORM (numeric_imputer(age, 'median') as age_imputed)
  ```

### **String Imputer**

* **Description**: Replaces missing string values with a specified string.
* **Example**:

  ```sql
  TRANSFORM (string_imputer(city, 'unknown') as city_imputed)
  ```

### **Boolean Imputer**

* **Description**: Completes missing values in a boolean column using a specified boolean value.
* **Example**:

  ```sql
  TRANSFORM (boolean_imputer(has_account, true) as account_imputed)
  ```

### **Vector Assembler**

* **Description**: Combines multiple columns into a single vector column. Useful for creating feature vectors from multiple features.
* **Example**:

  ```sql
  TRANSFORM (vector_assembler(array(col1, col2)) as feature_vector)
  ```

### **Binarizer**

* **Description**: Converts a numeric column to a binary value (0 or 1) based on a specified threshold.
* **Example**:

  ```sql
  TRANSFORM (binarizer(rating, 10.0) as binarized_rating)
  ```

### **Bucketizer**

* **Description**: Splits a continuous numeric column into discrete bins based on specified thresholds.
* **Example**:

  ```sql
  TRANSFORM (bucketizer(age, array(18, 30, 50)) as age_group)
  ```

### **String Indexer**

* **Description**: Converts a column of strings into a column of indexed numerical values, typically used for categorical features.
* **Example**:

  ```sql
  TRANSFORM (string_indexer(category) as indexed_category)
  ```

### **One-Hot Encoder**

* **Description**: Converts categorical features represented as indices into a one-hot encoded vector.
* **Example**:

  ```sql
  TRANSFORM (one_hot_encoder(indexed_category) as encoded_category)
  ```

### **Standard Scaler**

* **Description**: Standardizes a numeric column by removing the mean and scaling to unit variance.
* **Example**:

  ```sql
  TRANSFORM (standard_scaler(income) as scaled_income)
  ```

### **Min-Max Scaler**

* **Description**: Scales a numeric column to a specified range, typically \[0, 1].
* **Example**:

  ```sql
  TRANSFORM (min_max_scaler(income, 0, 1) as scaled_income)
  ```

### **Max-Abs Scaler**

* **Description**: Scales a numeric column by dividing each value by the maximum absolute value in that column.
* **Example**:

  ```sql
  TRANSFORM (max_abs_scaler(weight) as scaled_weight)
  ```

### **Normalizer**

* **Description**: Normalizes a vector to have unit norm, typically used for scaling individual samples.
* **Example**:

  ```sql
  TRANSFORM (normalizer(feature_vector) as normalized_features)
  ```

### **Polynomial Expansion**

* **Description**: Expands a vector of features into a polynomial feature space.
* **Example**:

  ```sql
  TRANSFORM (polynomial_expansion(features, 2) as poly_features)
  ```

### **Chi-Square Selector**

* **Description**: Selects the top features based on the Chi-Square test of independence.
* **Example**:

  ```sql
  TRANSFORM (chi_square_selector(features, 3) as selected_features)
  ```

### **PCA (Principal Component Analysis)**

* **Description**: Reduces the dimensionality of the data by projecting it onto a lower-dimensional subspace.
* **Example**:

  ```sql
  TRANSFORM (pca(features, 5) as pca_features)
  ```

### **Feature Hasher**

* **Description**: Converts categorical features into numerical features using the hashing trick, resulting in a fixed-length feature vector.
* **Example**:

  ```sql
  TRANSFORM (feature_hasher(array(col1, col2), 100) as hashed_features)
  ```

### **Stop Words Remover**

* **Description**: Removes common stop words from a column of text data.
* **Example**:

  ```sql
  TRANSFORM (stop_words_remover(text_column) as cleaned_text)
  ```

### **NGram**

* **Description**: Converts a column of text data into a sequence of n-grams.
* **Example**:

  ```sql
  TRANSFORM (ngram(words, 2) as bigrams)
  ```

### **Tokenization**

* **Description**: Splits a string column into a list of words.
* **Example**:

  ```sql
  TRANSFORM (tokenizer(sentence) as words)
  ```

### TF-IDF (Term Frequency-Inverse Document Frequency)

* **Description**: TF-IDF is a statistic that reflects how important a word is to a document within a collection or corpus. It is widely used in text mining and natural language processing to transform text data into numerical features. Given a term *t*, a document *d*, and a corpus *D*:

  * **Term Frequency (TF)** is the number of times that term *t* appears in document *d*.
  * **Document Frequency (DF)** is the number of documents in the corpus *D* that contain term *t*.

  Using only term frequency can overemphasize terms that appear frequently but carry little meaningful information (e.g., "a," "the," "of"). TF-IDF addresses this by weighting terms inversely proportional to their frequency across the corpus, thus highlighting terms that are more informative for a particular document.
* **Example**:

  ```sql
  TRANSFORM (tf_idf(tokenized_text) as tfidf_features)
  ```

TF-IDF helps in converting a collection of text documents into a matrix of numerical features that can be used as input for machine learning models. It is particularly useful for feature extraction in text classification tasks, sentiment analysis, and information retrieval.
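
As a point of reference, Spark-style implementations commonly compute TF-IDF with a smoothed inverse document frequency. The formula below is stated as an assumption about the underlying implementation, not documented Data Distiller behavior:

$$
IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}, \qquad TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)
$$

where \(|D|\) is the total number of documents in the corpus; the added constants avoid division by zero for unseen terms.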

### Word2Vec

* **Description**: **`Word2Vec`** is an estimator that takes sequences of words representing documents and trains a **`Word2VecModel`**. The model maps each word to a unique fixed-size vector in a continuous vector space. The Word2VecModel then transforms each document into a vector by averaging the vectors of all the words in the document. This technique is widely used in natural language processing (NLP) tasks to capture the semantic meaning of words and represent them in a numerical format suitable for machine learning models.
* **Example**:

  ```sql
  TRANSFORM (
    tokenizer(review) as tokenized,
    word2vec(tokenized, 10, 1) as word2vec_features
  )
  ```

In this example:

* The `tokenizer` transformer splits the input text into individual words.
* The `word2vec` transformer generates a fixed-size vector (with a specified size of 10) for each word in the sequence and computes the average vector for all words in the document.

**`Word2Vec`** is commonly used to convert text data into numerical features, allowing machine learning algorithms to process textual information while capturing semantic relationships between words.

### CountVectorizer

* **Description**: The CountVectorizer is used to convert a collection of text documents into vectors of token counts. It generates sparse representations for the documents based on the vocabulary, allowing further processing by algorithms such as Latent Dirichlet Allocation (LDA) and other text analysis techniques. The output is a sparse vector where the value of each element represents the count of a term in the document.
* **Input Data Type**: `array[string]`
* **Output Data Type**: Sparse vector
* **Parameters**:
  * **`VOCAB_SIZE`**: The maximum size of the vocabulary. The CountVectorizer will build a vocabulary that considers only the top `vocabSize` terms, ordered by term frequency across the corpus.
  * **`MIN_DOC_FREQ`**: Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If set as an integer, it indicates the number of documents; if a double in `[0,1)`, it indicates a fraction of documents.
  * **`MAX_DOC_FREQ`**: Specifies the maximum number of different documents a term may appear in and still be included in the vocabulary. Terms appearing in more documents than the threshold are ignored. If set as an integer, it indicates the maximum number of documents; if a double in `[0,1)`, it indicates the maximum fraction of documents.
  * **`MIN_TERM_FREQ`**: Filters out rare words in a document. Terms with a frequency lower than the threshold in a document are ignored. If an integer, it specifies the count; if a double in `[0,1)`, it specifies a fraction.
* **Example**:

  ```sql
  TRANSFORM (
    count_vectorizer(texts) as cv_output
  )
  ```

## Summary Table of Transformers

<table data-full-width="true"><thead><tr><th>Transformer</th><th>Description</th><th>Example</th></tr></thead><tbody>
<tr><td><strong>Numeric Imputer</strong></td><td>Fills missing numeric values using "mean," "median," or "mode."</td><td><code>TRANSFORM (numeric_imputer(age, 'median') as age_imputed)</code></td></tr>
<tr><td><strong>String Imputer</strong></td><td>Replaces missing string values with a specified string.</td><td><code>TRANSFORM (string_imputer(city, 'unknown') as city_imputed)</code></td></tr>
<tr><td><strong>Boolean Imputer</strong></td><td>Completes missing values in a boolean column using a specified boolean value.</td><td><code>TRANSFORM (boolean_imputer(has_account, true) as account_imputed)</code></td></tr>
<tr><td><strong>Vector Assembler</strong></td><td>Combines multiple columns into a single vector column.</td><td><code>TRANSFORM (vector_assembler(array(col1, col2)) as feature_vector)</code></td></tr>
<tr><td><strong>Binarizer</strong></td><td>Converts a numeric column to a binary value (0 or 1) based on a specified threshold.</td><td><code>TRANSFORM (binarizer(rating, 10.0) as binarized_rating)</code></td></tr>
<tr><td><strong>Bucketizer</strong></td><td>Splits a continuous numeric column into discrete bins based on specified thresholds.</td><td><code>TRANSFORM (bucketizer(age, array(18, 30, 50)) as age_group)</code></td></tr>
<tr><td><strong>String Indexer</strong></td><td>Converts a column of strings into indexed numerical values.</td><td><code>TRANSFORM (string_indexer(category) as indexed_category)</code></td></tr>
<tr><td><strong>One-Hot Encoder</strong></td><td>Converts categorical features represented as indices into a one-hot encoded vector.</td><td><code>TRANSFORM (one_hot_encoder(indexed_category) as encoded_category)</code></td></tr>
<tr><td><strong>Standard Scaler</strong></td><td>Standardizes a numeric column by removing the mean and scaling to unit variance.</td><td><code>TRANSFORM (standard_scaler(income) as scaled_income)</code></td></tr>
<tr><td><strong>Min-Max Scaler</strong></td><td>Scales a numeric column to a specified range, typically [0, 1].</td><td><code>TRANSFORM (min_max_scaler(income, 0, 1) as scaled_income)</code></td></tr>
<tr><td><strong>Max-Abs Scaler</strong></td><td>Scales a numeric column by dividing each value by the maximum absolute value in the column.</td><td><code>TRANSFORM (max_abs_scaler(weight) as scaled_weight)</code></td></tr>
<tr><td><strong>Normalizer</strong></td><td>Normalizes a vector to have unit norm.</td><td><code>TRANSFORM (normalizer(feature_vector) as normalized_features)</code></td></tr>
<tr><td><strong>Polynomial Expansion</strong></td><td>Expands a vector of features into a polynomial feature space.</td><td><code>TRANSFORM (polynomial_expansion(features, 2) as poly_features)</code></td></tr>
<tr><td><strong>Chi-Square Selector</strong></td><td>Selects top features based on the Chi-Square test of independence.</td><td><code>TRANSFORM (chi_square_selector(features, 3) as selected_features)</code></td></tr>
<tr><td><strong>PCA (Principal Component Analysis)</strong></td><td>Reduces data dimensionality by projecting onto a lower-dimensional subspace.</td><td><code>TRANSFORM (pca(features, 5) as pca_features)</code></td></tr>
<tr><td><strong>Feature Hasher</strong></td><td>Converts categorical features into numerical features using the hashing trick.</td><td><code>TRANSFORM (feature_hasher(array(col1, col2), 100) as hashed_features)</code></td></tr>
<tr><td><strong>Stop Words Remover</strong></td><td>Removes common stop words from a text data column.</td><td><code>TRANSFORM (stop_words_remover(text_column) as cleaned_text)</code></td></tr>
<tr><td><strong>NGram</strong></td><td>Converts text data into a sequence of n-grams.</td><td><code>TRANSFORM (ngram(words, 2) as bigrams)</code></td></tr>
<tr><td><strong>Tokenization</strong></td><td>Splits a string column into a list of words.</td><td><code>TRANSFORM (tokenizer(sentence) as words)</code></td></tr>
<tr><td><strong>TF-IDF</strong></td><td>Converts a collection of text documents to a matrix of numerical features.</td><td><code>TRANSFORM (tf_idf(tokenized_text) as tfidf_features)</code></td></tr>
<tr><td><strong>Word2Vec</strong></td><td>Maps words to a vector space and averages vectors for each document.</td><td><code>TRANSFORM (word2vec(tokenized, 10, 1) as word2vec_features)</code></td></tr>
<tr><td><strong>CountVectorizer</strong></td><td>Converts text documents to vectors of token counts.</td><td><code>TRANSFORM (count_vectorizer(texts) as cv_output)</code></td></tr></tbody></table>

## **Hyper-parameter Tuning and Model Configuration**

Set hyper-parameters using the `OPTIONS` clause to optimize model performance.

Example:

```sql
CREATE MODEL tuned_random_forest
OPTIONS (
    MODEL_TYPE='random_forest_regression',
    NUM_TREES=50,
    MAX_DEPTH=10
)
AS
SELECT feature1, feature2, target
FROM training_data;
```

## **Example Workflows**

### **Predicting Customer Churn Using Logistic Regression**

```sql
CREATE MODEL customer_churn_model
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='churn')
AS
SELECT age, income, num_purchases, churn
FROM customer_data;
```

### **Clustering Customers Based on Purchase Behavior**

```sql
CREATE MODEL customer_clusters
OPTIONS (MODEL_TYPE='kmeans', NUM_CLUSTERS=5, MAX_ITER=20)
AS
SELECT purchase_amount, num_items_bought
FROM transaction_data;
```

### **Topic Modeling Using LDA**

```sql
CREATE MODEL topic_model
OPTIONS (MODEL_TYPE='lda', NUM_CLUSTERS=10)
AS
SELECT document_text
FROM text_data;
```

## **Model Evaluation and Prediction**

### **Evaluate a Model**

{% code overflow="wrap" %}

```sql
SELECT * FROM MODEL_EVALUATE(customer_churn_model, SELECT age, income, num_purchases, churn FROM new_data);
```

{% endcode %}

### **Make Predictions**

{% code overflow="wrap" %}

```sql
SELECT * FROM MODEL_PREDICT(customer_churn_model, SELECT age, income, num_purchases FROM new_data);
```

{% endcode %}

## **Best Practices and Recommendations**

* Use vector assemblers to combine related features.
* Perform feature scaling (e.g., normalization) where applicable.
* Choose models based on the problem type (e.g., classification vs. regression).
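
These recommendations can be combined in one statement. A minimal sketch, assuming illustrative table and column names: the numeric inputs are standardized first, then assembled into a single feature vector for a classification model:

```sql
CREATE MODEL churn_scaled_model
TRANSFORM (
    standard_scaler(age) as age_scaled,       -- feature scaling
    standard_scaler(income) as income_scaled,
    vector_assembler(array(age_scaled, income_scaled)) as features  -- combine related features
)
OPTIONS (MODEL_TYPE='logistic_reg', LABEL='churn')  -- classification problem, so a classifier
AS
SELECT age, income, churn
FROM customer_data;
```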

## Most Common Regression Algorithm Parameters

The detailed list is available [**here**](https://experienceleague.adobe.com/en/docs/experience-platform/query/advanced-statistics/implement-models/regression).

<table data-full-width="true"><thead><tr><th width="221">Algorithm</th><th>Parameter</th><th>Description</th><th>Default Value</th><th>Possible Values</th></tr></thead><tbody><tr><td><strong>Linear Regression</strong><br><strong>'</strong><code>linear_reg'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for optimization.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>REG_PARAM</code></td><td>Regularization parameter for controlling model complexity.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>ELASTIC_NET_PARAM</code></td><td>Mixing parameter for ElasticNet regularization (L1 vs. L2 penalty).</td><td>0</td><td>[0, 1]</td></tr><tr><td><strong>Decision Tree Regression '</strong><code>decision_tree_regression'</code></td><td><code>MAX_BINS</code></td><td>Maximum number of bins for discretizing continuous features.</td><td>32</td><td>>=2</td></tr><tr><td></td><td><code>CACHE_NODE_IDS</code></td><td>Whether to cache node IDs for training deeper trees.</td><td>FALSE</td><td>true, false</td></tr><tr><td></td><td><code>CHECKPOINT_INTERVAL</code></td><td>How often to checkpoint cached node IDs during training.</td><td>10</td><td>>=1</td></tr><tr><td></td><td><code>IMPURITY</code></td><td>Criterion for information gain calculation ("variance" used for regression).</td><td>"variance"</td><td>"variance"</td></tr><tr><td></td><td><code>MAX_DEPTH</code></td><td>Maximum depth of the tree.</td><td>5</td><td>[0, 30]</td></tr><tr><td><p><strong>Random Forest Regression</strong></p><p><code>'random_forest_regression'</code></p></td><td><code>NUM_TREES</code></td><td>Number of trees in the forest.</td><td>20</td><td>>=1</td></tr><tr><td></td><td><code>MAX_DEPTH</code></td><td>Maximum depth of each tree in the forest.</td><td>5</td><td>>=0</td></tr><tr><td></td><td><code>SUBSAMPLING_RATE</code></td><td>Fraction of data used to train each tree.</td><td>1</td><td>(0, 
1]</td></tr><tr><td></td><td><code>FEATURE_SUBSET_STRATEGY</code></td><td>Strategy for selecting features for each split.</td><td>"auto"</td><td>"auto", "all", "sqrt", "log2"</td></tr><tr><td></td><td><code>IMPURITY</code></td><td>Criterion for information gain calculation ("variance" used for regression).</td><td>"variance"</td><td>"variance"</td></tr><tr><td><strong>Gradient Boosted Tree Regression</strong><br><strong>'</strong><code>gradient_boosted_tree_regression'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations (equivalent to the number of trees).</td><td>20</td><td>>=0</td></tr><tr><td></td><td><code>STEP_SIZE</code></td><td>Step size (learning rate) for scaling the contribution of each tree.</td><td>0.1</td><td>(0, 1]</td></tr><tr><td></td><td><code>LOSS_TYPE</code></td><td>Loss function to be minimized during training.</td><td>"squared"</td><td>"squared", "absolute"</td></tr><tr><td><strong>Generalized Linear Regression '</strong><code>generalized_linear_reg'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for optimization.</td><td>25</td><td>>=0</td></tr><tr><td></td><td><code>REG_PARAM</code></td><td>Regularization parameter for controlling model complexity.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>FAMILY</code></td><td>Family of distributions for the response variable (e.g., Gaussian, Poisson).</td><td>"gaussian"</td><td>"gaussian", "binomial", "poisson", "gamma", "tweedie"</td></tr><tr><td><strong>Isotonic Regression '</strong><code>isotonic_regression'</code></td><td><code>ISOTONIC</code></td><td>Whether the output sequence should be isotonic (increasing) or antitonic (decreasing).</td><td>TRUE</td><td>true, false</td></tr><tr><td><strong>Survival Regression '</strong><code>survival_regression'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for optimization.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>TOL</code></td><td>Convergence tolerance for 
optimization.</td><td>1.00E-06</td><td>>=0</td></tr><tr><td><strong>Factorization Machines Regression '</strong><code>factorization_machines_regression'</code></td><td><code>TOL</code></td><td>Convergence tolerance for optimization.</td><td>1.00E-06</td><td>>=0</td></tr><tr><td></td><td><code>FACTOR_SIZE</code></td><td>Dimensionality of the factors.</td><td>8</td><td>>=0</td></tr><tr><td></td><td><code>FIT_INTERCEPT</code></td><td>Whether to fit an intercept term.</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>FIT_LINEAR</code></td><td>Whether to fit linear terms (1-way interactions).</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>INIT_STD</code></td><td>Standard deviation of initial coefficients.</td><td>0.01</td><td>>=0</td></tr><tr><td></td><td><code>MAX_ITER</code></td><td>Number of iterations for the algorithm.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>MINI_BATCH_FRACTION</code></td><td>Fraction of data used in each mini-batch.</td><td>1</td><td>(0, 1]</td></tr><tr><td></td><td><code>REG_PARAM</code></td><td>Regularization parameter.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>SOLVER</code></td><td>Solver algorithm used for optimization.</td><td>"adamW"</td><td>"gd", "adamW"</td></tr><tr><td></td><td><code>STEP_SIZE</code></td><td>Initial step size for the first step.</td><td>1</td><td>>0</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Name of the column for prediction output.</td><td>"prediction"</td><td>Any string</td></tr></tbody></table>
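These parameters are passed through the `OPTIONS` clause of a `CREATE MODEL` statement. As a sketch only, here is how a linear regression model might override the defaults above; the dataset, column names, and model name are illustrative, not part of any shipped schema:

```sql
-- Hypothetical table and columns; parameter names come from the table above.
CREATE MODEL weekly_spend_model
TRANSFORM (vector_assembler(array(ad_spend, email_sends, site_visits)) AS features)
OPTIONS (
  MODEL_TYPE = 'linear_reg',
  LABEL = 'weekly_revenue',
  MAX_ITER = 200,           -- raise the iteration cap from the default of 100
  REG_PARAM = 0.1,          -- add regularization to control model complexity
  ELASTIC_NET_PARAM = 0.5   -- blend L1 and L2 penalties equally
)
AS SELECT ad_spend, email_sends, site_visits, weekly_revenue
FROM marketing_weekly_summary;
```

Any parameter omitted from `OPTIONS` falls back to the default listed in the table.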

## Most Common Classification Algorithm Parameters

The detailed list is available [here](https://experienceleague.adobe.com/en/docs/experience-platform/query/advanced-statistics/implement-models/classification).

<table data-full-width="true"><thead><tr><th>Algorithm</th><th>Parameter</th><th>Description</th><th>Default Value</th><th>Possible Values</th></tr></thead><tbody><tr><td><strong>Logistic Regression '</strong><code>logistic_reg'</code><br></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for optimization.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>REG_PARAM</code></td><td>Regularization parameter for controlling model complexity.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>ELASTIC_NET_PARAM</code></td><td>Mixing parameter for ElasticNet regularization (L1 vs. L2 penalty).</td><td>0</td><td>[0, 1]</td></tr><tr><td></td><td><code>FIT_INTERCEPT</code></td><td>Whether to fit an intercept term in the model.</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>TOL</code></td><td>Convergence tolerance for optimization.</td><td>1.00E-06</td><td>>=0</td></tr><tr><td></td><td><code>PROBABILITY_COL</code></td><td>Column name for predicted class probabilities.</td><td>"probability"</td><td>Any column name</td></tr><tr><td></td><td><code>RAW_PREDICTION_COL</code></td><td>Column name for raw prediction output (confidence scores).</td><td>"rawPrediction"</td><td>Any column name</td></tr><tr><td></td><td><code>THRESHOLDS</code></td><td>Thresholds for binary or multiclass classification.</td><td>Not set</td><td>Array of doubles</td></tr><tr><td><strong>Decision Tree Classifier '</strong><code>decision_tree_classifier'</code></td><td><code>MAX_BINS</code></td><td>Maximum number of bins for discretizing continuous features.</td><td>32</td><td>>=2</td></tr><tr><td></td><td><code>CACHE_NODE_IDS</code></td><td>Whether to cache node IDs for training deeper trees.</td><td>FALSE</td><td>true, false</td></tr><tr><td></td><td><code>CHECKPOINT_INTERVAL</code></td><td>How often to checkpoint cached node IDs during training.</td><td>10</td><td>>=1</td></tr><tr><td></td><td><code>IMPURITY</code></td><td>Criterion for information gain 
calculation.</td><td>"gini"</td><td>"gini", "entropy"</td></tr><tr><td></td><td><code>MAX_DEPTH</code></td><td>Maximum depth of the tree.</td><td>5</td><td>[0, 30]</td></tr><tr><td></td><td><code>MIN_INFO_GAIN</code></td><td>Minimum information gain required for a split at a node.</td><td>0</td><td>>=0.0</td></tr><tr><td></td><td><code>MIN_INSTANCES_PER_NODE</code></td><td>Minimum number of instances required in each child after a split.</td><td>1</td><td>>=1</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>WEIGHT_COL</code></td><td>Column name for sample weights.</td><td>Not set</td><td>Any column name</td></tr><tr><td><strong>Random Forest Classifier '</strong><code>random_forest_classifier'</code></td><td><code>NUM_TREES</code></td><td>Number of trees in the forest.</td><td>20</td><td>>=1</td></tr><tr><td></td><td><code>MAX_BINS</code></td><td>Maximum number of bins for discretizing continuous features.</td><td>32</td><td>>=2</td></tr><tr><td></td><td><code>MAX_DEPTH</code></td><td>Maximum depth of each tree in the forest.</td><td>5</td><td>>=0</td></tr><tr><td></td><td><code>IMPURITY</code></td><td>Criterion for information gain calculation.</td><td>"gini"</td><td>"gini", "entropy"</td></tr><tr><td></td><td><code>SUBSAMPLING_RATE</code></td><td>Fraction of data used to train each tree.</td><td>1</td><td>(0, 1]</td></tr><tr><td></td><td><code>FEATURE_SUBSET_STRATEGY</code></td><td>Strategy for selecting features for each split.</td><td>"auto"</td><td>"auto", "all", "sqrt", "log2"</td></tr><tr><td></td><td><code>BOOTSTRAP</code></td><td>Whether to use bootstrap sampling when building trees.</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>WEIGHT_COL</code></td><td>Column name for sample weights.</td><td>Not
set</td><td>Any column name</td></tr><tr><td></td><td><code>PROBABILITY_COL</code></td><td>Column name for predicted class probabilities.</td><td>"probability"</td><td>Any column name</td></tr><tr><td></td><td><code>RAW_PREDICTION_COL</code></td><td>Column name for raw prediction output (confidence scores).</td><td>"rawPrediction"</td><td>Any column name</td></tr><tr><td><strong>Naive Bayes Classifier '</strong><code>naive_bayes_classifier'</code></td><td><code>MODEL_TYPE</code></td><td>Type of Naive Bayes model used (e.g., multinomial, bernoulli).</td><td>"multinomial"</td><td>"multinomial", "bernoulli", "gaussian"</td></tr><tr><td></td><td><code>SMOOTHING</code></td><td>Smoothing parameter to prevent zero probabilities.</td><td>1</td><td>>=0.0</td></tr><tr><td></td><td><code>PROBABILITY_COL</code></td><td>Column name for predicted class probabilities.</td><td>"probability"</td><td>Any column name</td></tr><tr><td></td><td><code>RAW_PREDICTION_COL</code></td><td>Column name for raw prediction output (confidence scores).</td><td>"rawPrediction"</td><td>Any column name</td></tr><tr><td></td><td><code>WEIGHT_COL</code></td><td>Column name for sample weights.</td><td>Not set</td><td>Any column name</td></tr><tr><td><strong>Factorization Machines Classifier '</strong><code>factorization_machines_classifier'</code></td><td><code>TOL</code></td><td>Convergence tolerance for optimization.</td><td>1.00E-06</td><td>>=0</td></tr><tr><td></td><td><code>FACTOR_SIZE</code></td><td>Dimensionality of the factors.</td><td>8</td><td>>=0</td></tr><tr><td></td><td><code>FIT_INTERCEPT</code></td><td>Whether to fit an intercept term.</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>FIT_LINEAR</code></td><td>Whether to fit linear terms (1-way interactions).</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>INIT_STD</code></td><td>Standard deviation of initial 
coefficients.</td><td>0.01</td><td>>=0</td></tr><tr><td></td><td><code>MAX_ITER</code></td><td>Number of iterations for the algorithm.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>MINI_BATCH_FRACTION</code></td><td>Fraction of data used in each mini-batch.</td><td>1</td><td>(0, 1]</td></tr><tr><td></td><td><code>REG_PARAM</code></td><td>Regularization parameter.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>SOLVER</code></td><td>Solver algorithm used for optimization.</td><td>"adamW"</td><td>"gd", "adamW"</td></tr><tr><td></td><td><code>STEP_SIZE</code></td><td>Initial step size for the first step.</td><td>1</td><td>>0</td></tr><tr><td></td><td><code>PROBABILITY_COL</code></td><td>Column name for predicted class conditional probabilities.</td><td>"probability"</td><td>Any column name</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Name of the column for prediction output.</td><td>"prediction"</td><td>Any string</td></tr><tr><td></td><td><code>RAW_PREDICTION_COL</code></td><td>Column name for raw prediction (confidence scores).</td><td>"rawPrediction"</td><td>Any column name</td></tr><tr><td></td><td><code>ONE_VS_REST</code></td><td>Whether to enable one-vs-rest classification.</td><td>FALSE</td><td>true, false</td></tr><tr><td><strong>Linear Support Vector Classifier '</strong><code>linear_svc_classifier'</code></td><td><code>MAX_ITER</code></td><td>Number of iterations for optimization.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>AGGREGATION_DEPTH</code></td><td>Suggested depth for tree aggregation.</td><td>2</td><td>>=2</td></tr><tr><td></td><td><code>FIT_INTERCEPT</code></td><td>Whether to fit an intercept term.</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>TOL</code></td><td>Convergence tolerance for 
optimization.</td><td>1.00E-06</td><td>>=0</td></tr><tr><td></td><td><code>MAX_BLOCK_SIZE_IN_MB</code></td><td>Maximum memory in MB for stacking input data into blocks.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>REG_PARAM</code></td><td>Regularization parameter.</td><td>0</td><td>>=0</td></tr><tr><td></td><td><code>STANDARDIZATION</code></td><td>Whether to standardize the training features.</td><td>TRUE</td><td>true, false</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Name of the column for prediction output.</td><td>"prediction"</td><td>Any string</td></tr><tr><td></td><td><code>RAW_PREDICTION_COL</code></td><td>Column name for raw prediction (confidence scores).</td><td>"rawPrediction"</td><td>Any column name</td></tr><tr><td></td><td><code>ONE_VS_REST</code></td><td>Whether to enable one-vs-rest classification.</td><td>FALSE</td><td>true, false</td></tr><tr><td><strong>Multilayer Perceptron Classifier '</strong><code>multilayer_perceptron_classifier'</code></td><td><code>MAX_ITER</code></td><td>Number of iterations for the algorithm.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>BLOCK_SIZE</code></td><td>Block size for stacking input data in matrices.</td><td>128</td><td>>=1</td></tr><tr><td></td><td><code>STEP_SIZE</code></td><td>Step size for each iteration of optimization.</td><td>0.03</td><td>>0</td></tr><tr><td></td><td><code>TOL</code></td><td>Convergence tolerance for optimization.</td><td>1.00E-06</td><td>>=0</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Name of the column for prediction output.</td><td>"prediction"</td><td>Any string</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>PROBABILITY_COL</code></td><td>Column name for predicted class conditional probabilities.</td><td>"probability"</td><td>Any column name</td></tr><tr><td></td><td><code>RAW_PREDICTION_COL</code></td><td>Column name 
for raw prediction (confidence scores).</td><td>"rawPrediction"</td><td>Any column name</td></tr><tr><td></td><td><code>ONE_VS_REST</code></td><td>Whether to enable one-vs-rest classification.</td><td>FALSE</td><td>true, false</td></tr><tr><td><strong>Gradient Boosted Tree Classifier</strong> '<code>gradient_boosted_tree_classifier'</code></td><td><code>MAX_BINS</code></td><td>Maximum number of bins used for discretizing continuous features and choosing how to split on features at each node. More bins give higher granularity.</td><td>32</td><td>Must be >= 2 and >= number of categories in any categorical feature</td></tr><tr><td></td><td><code>CACHE_NODE_IDS</code></td><td>If false, the algorithm passes trees to executors to match instances with nodes. If true, node IDs for each instance are cached to speed up training of deeper trees.</td><td>false</td><td>true, false</td></tr><tr><td></td><td><code>CHECKPOINT_INTERVAL</code></td><td>Specifies how often to checkpoint the cached node IDs (e.g., 10 means checkpoint every 10 iterations). This is used only if <code>cacheNodeIds</code> is true and the checkpoint directory is set in SparkContext.</td><td>10</td><td><p></p><p>>= 1</p></td></tr><tr><td></td><td><code>MAX_DEPTH</code></td><td>Maximum depth of the tree. For example, depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.</td><td>5</td><td>>=0</td></tr></tbody></table>
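As with regression, classification parameters are set in the `OPTIONS` clause, and a trained model can be applied to new rows for batch scoring. The sketch below assumes a hypothetical RFM-style feature table; the model, table, and column names are illustrative:

```sql
-- Illustrative logistic regression with parameters from the table above.
CREATE MODEL churn_classifier
TRANSFORM (vector_assembler(array(recency_days, frequency, monetary)) AS features)
OPTIONS (
  MODEL_TYPE = 'logistic_reg',
  LABEL = 'churned',
  MAX_ITER = 150,
  ELASTIC_NET_PARAM = 1.0,  -- pure L1 (lasso) penalty
  REG_PARAM = 0.01
)
AS SELECT recency_days, frequency, monetary, churned
FROM customer_rfm_features;

-- Batch-score new rows with version 1 of the trained model.
SELECT *
FROM model_predict(churn_classifier, 1,
  SELECT recency_days, frequency, monetary FROM customer_rfm_scoring);
```

The output includes the `prediction`, `probability`, and `rawPrediction` columns, whose names can be overridden with `PREDICTION_COL`, `PROBABILITY_COL`, and `RAW_PREDICTION_COL`.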

## Most Common Unsupervised Algorithm Parameters

The detailed list is available [**here**](https://experienceleague.adobe.com/en/docs/experience-platform/query/advanced-statistics/implement-models/clustering).

<table data-full-width="true"><thead><tr><th>Algorithm</th><th>Parameter</th><th>Description</th><th>Default Value</th><th>Possible Values</th></tr></thead><tbody><tr><td><p><strong>K-Means</strong> </p><p><code>'kmeans'</code></p></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for the clustering algorithm.</td><td>20</td><td>>=0</td></tr><tr><td></td><td><code>TOL</code></td><td>Convergence tolerance for the iterative algorithm.</td><td>0.0001</td><td>>=0</td></tr><tr><td></td><td><code>NUM_CLUSTERS</code></td><td>Number of clusters to form.</td><td>2</td><td>>1</td></tr><tr><td></td><td><code>DISTANCE_TYPE</code></td><td>Distance measure used for clustering.</td><td>"euclidean"</td><td>"euclidean", "cosine"</td></tr><tr><td></td><td><code>KMEANS_INIT_METHOD</code></td><td>Initialization algorithm for cluster centers.</td><td>"k-means||"</td><td>"random", "k-means||"</td></tr><tr><td></td><td><code>INIT_STEPS</code></td><td>Number of steps for the k-means|| initialization mode.</td><td>2</td><td>>0</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Column name for the predicted cluster.</td><td>"prediction"</td><td>Any column name</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>WEIGHT_COL</code></td><td>Column name for sample weights.</td><td>Not set</td><td>Any column name</td></tr><tr><td><strong>Bisecting K-Means</strong><br><strong>'</strong><code>bisecting_kmeans'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for the clustering algorithm.</td><td>20</td><td>>=0</td></tr><tr><td></td><td><code>NUM_CLUSTERS</code></td><td>Number of leaf clusters to form.</td><td>4</td><td>>1</td></tr><tr><td></td><td><code>DISTANCE_MEASURE</code></td><td>Distance measure used for clustering.</td><td>"euclidean"</td><td>"euclidean", "cosine"</td></tr><tr><td></td><td><code>MIN_DIVISIBLE_CLUSTER_SIZE</code></td><td>Minimum number of points
for a divisible cluster.</td><td>1</td><td>>0</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Column name for the predicted cluster.</td><td>"prediction"</td><td>Any column name</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>WEIGHT_COL</code></td><td>Column name for sample weights.</td><td>Not set</td><td>Any column name</td></tr><tr><td><strong>Gaussian Mixture '</strong><code>gaussian_mixture'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for the EM algorithm.</td><td>100</td><td>>=0</td></tr><tr><td></td><td><code>NUM_CLUSTERS</code></td><td>Number of Gaussian distributions in the mixture model.</td><td>2</td><td>>1</td></tr><tr><td></td><td><code>TOL</code></td><td>Convergence tolerance for iterative algorithms.</td><td>0.01</td><td>>=0</td></tr><tr><td></td><td><code>AGGREGATION_DEPTH</code></td><td>Depth for tree aggregation during the EM algorithm.</td><td>2</td><td>>=2</td></tr><tr><td></td><td><code>PROBABILITY_COL</code></td><td>Column name for predicted class conditional probabilities.</td><td>"probability"</td><td>Any column name</td></tr><tr><td></td><td><code>PREDICTION_COL</code></td><td>Column name for the predicted cluster.</td><td>"prediction"</td><td>Any column name</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>WEIGHT_COL</code></td><td>Column name for sample weights.</td><td>Not set</td><td>Any column name</td></tr><tr><td><strong>Latent Dirichlet Allocation (LDA)</strong><br><strong>'</strong><code>lda'</code></td><td><code>MAX_ITER</code></td><td>Maximum number of iterations for the algorithm.</td><td>20</td><td>>=0</td></tr><tr><td></td><td><code>OPTIMIZER</code></td><td>Optimizer used to estimate the LDA
model.</td><td>"online"</td><td>"online", "em"</td></tr><tr><td></td><td><code>NUM_CLUSTERS</code></td><td>Number of topics to identify.</td><td>10</td><td>>1</td></tr><tr><td></td><td><code>DOC_CONCENTRATION</code></td><td>Concentration parameter for the prior placed on documents' distributions over topics.</td><td>Automatic</td><td>>0</td></tr><tr><td></td><td><code>TOPIC_CONCENTRATION</code></td><td>Concentration parameter for the prior placed on topics' distributions over terms.</td><td>Automatic</td><td>>0</td></tr><tr><td></td><td><code>LEARNING_DECAY</code></td><td>Learning rate for the online optimizer.</td><td>0.51</td><td>(0.5, 1.0]</td></tr><tr><td></td><td><code>LEARNING_OFFSET</code></td><td>Learning parameter that downweights early iterations for the online optimizer.</td><td>1024</td><td>>0</td></tr><tr><td></td><td><code>SUBSAMPLING_RATE</code></td><td>Fraction of the corpus used for each iteration of mini-batch gradient descent.</td><td>0.05</td><td>(0, 1]</td></tr><tr><td></td><td><code>OPTIMIZE_DOC_CONCENTRATION</code></td><td>Whether to optimize the doc concentration during training.</td><td>FALSE</td><td>true, false</td></tr><tr><td></td><td><code>CHECKPOINT_INTERVAL</code></td><td>Frequency of checkpointing the cached node IDs.</td><td>10</td><td>>=1</td></tr><tr><td></td><td><code>SEED</code></td><td>Random seed for reproducibility.</td><td>Not set</td><td>Any 64-bit integer</td></tr><tr><td></td><td><code>TOPIC_DISTRIBUTION_COL</code></td><td>Output column with estimates of the topic mixture distribution for each document.</td><td>Not set</td><td>Any column name</td></tr></tbody></table>
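Clustering models follow the same `CREATE MODEL` pattern, with no `LABEL` since the algorithms are unsupervised. A sketch of a k-means segmentation using the parameters above; the table, columns, and cluster count are illustrative choices, not recommendations:

```sql
-- Illustrative k-means segmentation; parameter names from the table above.
CREATE MODEL customer_segments
TRANSFORM (vector_assembler(array(total_orders, avg_order_value, days_since_last_order)) AS features)
OPTIONS (
  MODEL_TYPE = 'kmeans',
  NUM_CLUSTERS = 5,            -- default is 2; choose via elbow or silhouette analysis
  MAX_ITER = 50,
  DISTANCE_TYPE = 'euclidean',
  SEED = 42                    -- fix the seed for reproducible cluster assignments
)
AS SELECT total_orders, avg_order_value, days_since_last_order
FROM customer_value_features;
```

Because k-means is distance-based, consider scaling features to comparable ranges before assembling the feature vector.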

### **FAQs**

#### **Can external models be imported and exported from the platform?**

No, we currently do not support direct import/export of models in common formats. Our ML capabilities are **SQL-based**, making it easier for data engineers or teams with a statistics background to create models directly within the platform.

#### **What types of models and algorithms are available?**

A list of available models is documented [here](https://data-distiller.all-stuff-data.com/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-advanced-statistics-and-machine-learning-models#summary-table-of-available-algorithms).\
We support models built using SQL-based statistical methods, with a focus on ease of deployment and interpretation.

#### **How does the platform handle missing or incomplete data?**

We provide **feature engineering functions** such as **imputation** (e.g., mean, median substitution) to handle missing values.\
Additionally, users can create their own algorithms using SQL to manage missing data.
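For instance, mean imputation for a numeric column and constant substitution for a categorical one can be expressed in plain SQL. The dataset and column names below are hypothetical:

```sql
-- Hypothetical dataset; mean imputation for a numeric column and a
-- constant fallback for a categorical one, using standard SQL.
SELECT
  customer_id,
  COALESCE(age, AVG(age) OVER ()) AS age_imputed,     -- mean substitution
  COALESCE(channel, 'unknown')    AS channel_imputed  -- constant substitution
FROM customer_profile_raw;
```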

#### **Is real-time inferencing supported?**

No, **only batch-based inferencing** is supported.\
You must precompute model outputs offline and then batch load them into the platform for decision-making.

#### **Are explainability reports available for models?**

Yes, explainability reports are available, but you must manually create schedules and aggregation commands to generate them.\
Reports can also be surfaced within BI engines and dashboards, although this requires some SQL-based configuration.

#### **How does the solution work with offline model outputs?**

Offline model outputs are supported through batch inferencing.\
Users are expected to precompute model scores externally and then load them into the platform.\
We **do not support real-time scoring**.

#### **What conversion goals can a model be trained towards?**

You can define custom conversion goal models in Data Distiller ML models. However, if using out-of-the-box "black box" models like Customer AI, Lookalike Models, or Adobe Mix Modeler, predefined conversion goals are enforced and **not customizable**.

#### **What variables can be considered in models?**

You have **full control** over the feature engineering and variable creation in SQL.\
However, variables within black-box models cannot be modified or accessed.

#### **Can users create new attributes for predictive models?**

Yes, the platform supports the creation of new attributes for predictive models using SQL and all available feature engineering functions.
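As one sketch of this, derived attributes can be computed in the `SELECT` that feeds model training. The table and columns below are hypothetical, and function availability (e.g., `DATEDIFF`) may vary with the SQL dialect:

```sql
-- Illustrative derived attributes built with SQL before model training.
SELECT
  customer_id,
  DATEDIFF(CURRENT_DATE, last_purchase_date)   AS recency_days,
  total_spend / NULLIF(order_count, 0)         AS avg_order_value,  -- avoid divide-by-zero
  CASE WHEN email_opens > 10 THEN 1 ELSE 0 END AS engaged_flag
FROM customer_summary;
```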
