> For the complete documentation index, see [llms.txt](https://data-distilller.gitbook.io/adobe-data-distiller-guide/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-604-car-loan-propensity-prediction-using-logistic-regression.md).

# STATSML 604: Car Loan Propensity Prediction using Logistic Regression

## Overview

Predicting whether a customer is likely to take a car loan can significantly improve a bank’s ability to design targeted campaigns, manage credit risk, and optimize resource allocation.&#x20;

#### **Stage 1: Awareness**

* The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.).
* Signals:
  * Search behavior (Google/search logs)
  * Browsing car loan info on bank websites
  * Interactions with car dealerships or vehicle-related services

#### **Stage 2: Research**

* They begin comparing options, calculating EMIs, and checking eligibility.
* Signals:
  * Clicks on car loan calculators
  * App logins increase
  * Engaging with bank agents or chatbot loan FAQs
  * Increasing balance inquiries

#### **Stage 3: Financial Readiness**

* Evaluating “Can I afford it?”
* Signals:
  * Growth in monthly net income
  * Stable or rising credit score
  * High cash reserves (in savings + checking)
  * Existing auto loans closed or paid off

#### **Stage 4: Application Intent**

* They are ready to apply.
* Signals:
  * Clicking on “Apply Now” or loan inquiry forms
  * Uploading documents to the portal

In this tutorial, we build a simple logistic regression model to classify a customer's car loan propensity focusing on **profile** and **bank transaction** behavior.

### Use Cases

* **Targeted Campaigns**: Focus offers on high-propensity segments
* **Loan Eligibility Filtering**: Pre-qualify candidates automatically
* **Customer Risk Profiling**: Understand financial behavior deep.

## Prerequisites

{% content-ref url="/pages/B0nz0qFAt19NA6ofV5EA" %}
[PREP 400: DBVisualizer SQL Editor Setup for Data Distiller](/adobe-data-distiller-guide/unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller.md)
{% endcontent-ref %}

{% content-ref url="/pages/qAD2VzPL5alOeVLBqATI" %}
[STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models](/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-advanced-statistics-and-machine-learning-models.md)
{% endcontent-ref %}

## Goals

Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on:

* Demographic and behavioral data
* Financial account balances
* Derived income and loan features

## Sample Dataset

### Fields & Description

| Field Name              | Data Type      | Description                               |
| ----------------------- | -------------- | ----------------------------------------- |
| customer\_id            | STRING         | Unique identifier                         |
| age                     | INT            | Customer age                              |
| gender                  | STRING         | Male / Female / Other                     |
| marital\_status         | STRING         | Married / Single / Divorced               |
| employment\_status      | STRING         | Employed / Self-employed / Retired etc.   |
| annual\_income          | DECIMAL        | Yearly income                             |
| credit\_score           | INT            | Credit bureau score                       |
| checking\_balance       | DECIMAL        | Balance in checking account               |
| savings\_balance        | DECIMAL        | Balance in savings account                |
| monthly\_debit\_volume  | DECIMAL        | Monthly average spending                  |
| monthly\_credit\_volume | DECIMAL        | Monthly average income                    |
| loan\_history           | ARRAY\<STRUCT> | Past loans (type, amount, status)         |
| existing\_auto\_loan    | BOOLEAN        | Existing car loan                         |
| owns\_vehicle           | BOOLEAN        | Whether customer owns a car               |
| propensity\_car\_loan   | FLOAT          | Target: Likelihood (0-1) of taking a loan |

If you were to go and explore the sample data, it should look like this:

<figure><img src="/files/2GV5uv9F8oKPKPBgFbFx" alt=""><figcaption><p>A simple data exploration query with DB Visualizer</p></figcaption></figure>

### Derived Features

These features are engineered to improve model performance:

| Feature                   | Formula / Logic                                |
| ------------------------- | ---------------------------------------------- |
| `savings_to_income_ratio` | `savings_balance / annual_income`              |
| `debt_to_income_ratio`    | `(monthly_debit_volume * 12) / annual_income`  |
| `avg_monthly_net_income`  | `monthly_credit_volume - monthly_debit_volume` |
| `loan_count`              | `COUNT(loan_history)`                          |
| `previous_auto_loans`     | `COUNT WHERE loan_type = 'Auto'`               |
| `good_credit_flag`        | `credit_score >= 700`                          |
| `high_cash_reserve_flag`  | `checking_balance + savings_balance > 10000`   |

{% code overflow="wrap" %}

```sql
Select
customer_id,
cast(age as int) age,
gender,
marital_status,
employment_status,
cast(annual_income as decimal(18,2)) annual_income,
cast(credit_score as Int) credit_score,
cast(checking_balance as decimal(18,2))checking_balance,
cast(savings_balance as decimal(18,2)) savings_balance,
cast(monthly_debit_volume as decimal(18,2)) monthly_debit_volume,
cast(monthly_credit_volume as decimal(18,2)) monthly_credit_volume,
--Derived Features
ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio,
ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio,
(monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income,
CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag,
CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag
from customer_bank_data;
```

{% endcode %}

This should look like the following:

<figure><img src="/files/2he1ysB69akwT3KE5l4n" alt=""><figcaption><p>Derived features along with the base features</p></figcaption></figure>

### Auto-Labeling of Training Data

<table data-full-width="false"><thead><tr><th>Feature</th><th width="215.63226318359375">Weight</th><th>Explanation</th></tr></thead><tbody><tr><td>Not having an auto loan</td><td>0.30</td><td>More likely to consider buying</td></tr><tr><td>Doesn’t own a vehicle</td><td>0.20</td><td>May need a car, hence loan</td></tr><tr><td>Good credit score</td><td>0.20</td><td>More eligible for credit</td></tr><tr><td>Income > $60K</td><td>0.10</td><td>Likely to get approved</td></tr><tr><td>Checking balance > $2K</td><td>0.10</td><td>Has funds for down payment</td></tr><tr><td>Net income > $2K</td><td>0.10</td><td>Better repayment capacity</td></tr></tbody></table>

<figure><img src="/files/a4NjlKjDLLURisY9Ged5" alt=""><figcaption><p>Auto-labeling of training data</p></figcaption></figure>

### Model Definition

```sql
DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression;

CREATE MODEL customer_propensity_using_LogisticRegression
TRANSFORM (
  vector_assembler(array(
     age,
     annual_income,
     credit_score,
     checking_balance,
     savings_balance,
     monthly_debit_volume,
     monthly_credit_volume,
     existing_auto_loan,
     owns_vehicle,
     savings_to_income_ratio,
     debt_to_income_ratio,
     avg_monthly_net_income,
     good_credit_flag,
     high_cash_reserve_flag
  )) features
)
OPTIONS (
  MODEL_TYPE = 'logistic_reg',
  LABEL = 'propensity'
)
AS
SELECT
  *,
  CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
ORDER BY RANDOM()
LIMIT 50000;

```

## Model Evaluation

```sql
SELECT * FROM model_evaluate(
  customer_propensity_using_LogisticRegression, 
  1, 
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_train
  )
);

```

### Results

The query should return the folllowing:

<figure><img src="/files/YNtxu7zInyq48ZUnRKGO" alt=""><figcaption><p>Query results on the evaluation</p></figcaption></figure>

| Metric    | Value  |
| --------- | ------ |
| AUC ROC   | 0.9362 |
| Accuracy  | 0.9361 |
| Precision | 0.9367 |
| Recall    | 0.9372 |

### Predict on New Customers

```sql
SELECT * FROM model_predict(
  customer_propensity_using_LogisticRegression, 
  1,
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_predict
  )
);

```

The query looks like the foilowing:

<figure><img src="/files/7ekDDLvbsK9T2Toedvs9" alt=""><figcaption><p>Predictions on the same dataset</p></figcaption></figure>