# STATSML 604: Car Loan Propensity Prediction using Logistic Regression

## Overview

Predicting whether a customer is likely to take a car loan can significantly improve a bank’s ability to design targeted campaigns, manage credit risk, and optimize resource allocation.&#x20;

#### **Stage 1: Awareness**

* The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.).
* Signals:
  * Search behavior (Google/search logs)
  * Browsing car loan info on bank websites
  * Interactions with car dealerships or vehicle-related services

#### **Stage 2: Research**

* They begin comparing options, calculating EMIs, and checking eligibility.
* Signals:
  * Clicks on car loan calculators
  * App logins increase
  * Engaging with bank agents or chatbot loan FAQs
  * Increasing balance inquiries

#### **Stage 3: Financial Readiness**

* Evaluating “Can I afford it?”
* Signals:
  * Growth in monthly net income
  * Stable or rising credit score
  * High cash reserves (in savings + checking)
  * Existing auto loans closed or paid off

#### **Stage 4: Application Intent**

* They are ready to apply.
* Signals:
  * Clicking on “Apply Now” or loan inquiry forms
  * Uploading documents to the portal

In this tutorial, we build a simple logistic regression model to classify a customer's car loan propensity focusing on **profile** and **bank transaction** behavior.

### Use Cases

* **Targeted Campaigns**: Focus offers on high-propensity segments
* **Loan Eligibility Filtering**: Pre-qualify candidates automatically
* **Customer Risk Profiling**: Understand financial behavior deep.

## Prerequisites

{% content-ref url="../unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller" %}
[prep-400-dbvisualizer-sql-editor-setup-for-data-distiller](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller)
{% endcontent-ref %}

{% content-ref url="statsml-600-data-distiller-advanced-statistics-and-machine-learning-models" %}
[statsml-600-data-distiller-advanced-statistics-and-machine-learning-models](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-advanced-statistics-and-machine-learning-models)
{% endcontent-ref %}

## Goals

Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on:

* Demographic and behavioral data
* Financial account balances
* Derived income and loan features

## Sample Dataset

### Fields & Description

| Field Name              | Data Type      | Description                               |
| ----------------------- | -------------- | ----------------------------------------- |
| customer\_id            | STRING         | Unique identifier                         |
| age                     | INT            | Customer age                              |
| gender                  | STRING         | Male / Female / Other                     |
| marital\_status         | STRING         | Married / Single / Divorced               |
| employment\_status      | STRING         | Employed / Self-employed / Retired etc.   |
| annual\_income          | DECIMAL        | Yearly income                             |
| credit\_score           | INT            | Credit bureau score                       |
| checking\_balance       | DECIMAL        | Balance in checking account               |
| savings\_balance        | DECIMAL        | Balance in savings account                |
| monthly\_debit\_volume  | DECIMAL        | Monthly average spending                  |
| monthly\_credit\_volume | DECIMAL        | Monthly average income                    |
| loan\_history           | ARRAY\<STRUCT> | Past loans (type, amount, status)         |
| existing\_auto\_loan    | BOOLEAN        | Existing car loan                         |
| owns\_vehicle           | BOOLEAN        | Whether customer owns a car               |
| propensity\_car\_loan   | FLOAT          | Target: Likelihood (0-1) of taking a loan |

If you were to go and explore the sample data, it should look like this:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FxY00wO1x5Mkq8WJm2D4A%2F%7BB281514A-BE5C-4188-9052-C35B490F2FDF%7D.png?alt=media&#x26;token=85e722c1-3e2a-4e7e-accd-7994024bc441" alt=""><figcaption><p>A simple data exploration query with DB Visualizer</p></figcaption></figure>

### Derived Features

These features are engineered to improve model performance:

| Feature                   | Formula / Logic                                |
| ------------------------- | ---------------------------------------------- |
| `savings_to_income_ratio` | `savings_balance / annual_income`              |
| `debt_to_income_ratio`    | `(monthly_debit_volume * 12) / annual_income`  |
| `avg_monthly_net_income`  | `monthly_credit_volume - monthly_debit_volume` |
| `loan_count`              | `COUNT(loan_history)`                          |
| `previous_auto_loans`     | `COUNT WHERE loan_type = 'Auto'`               |
| `good_credit_flag`        | `credit_score >= 700`                          |
| `high_cash_reserve_flag`  | `checking_balance + savings_balance > 10000`   |

{% code overflow="wrap" %}

```sql
Select
customer_id,
cast(age as int) age,
gender,
marital_status,
employment_status,
cast(annual_income as decimal(18,2)) annual_income,
cast(credit_score as Int) credit_score,
cast(checking_balance as decimal(18,2))checking_balance,
cast(savings_balance as decimal(18,2)) savings_balance,
cast(monthly_debit_volume as decimal(18,2)) monthly_debit_volume,
cast(monthly_credit_volume as decimal(18,2)) monthly_credit_volume,
--Derived Features
ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio,
ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio,
(monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income,
CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag,
CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag
from customer_bank_data;
```

{% endcode %}

This should look like the following:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F2J8a2HgpHX0xPRyMd7d5%2F%7BAF5E5ABD-9C76-4AFA-82C3-CEB1B1B8255D%7D.png?alt=media&#x26;token=fc09948e-2ba6-40d7-a334-a81297f43b47" alt=""><figcaption><p>Derived features along with the base features</p></figcaption></figure>

### Auto-Labeling of Training Data

<table data-full-width="false"><thead><tr><th>Feature</th><th width="215.63226318359375">Weight</th><th>Explanation</th></tr></thead><tbody><tr><td>Not having an auto loan</td><td>0.30</td><td>More likely to consider buying</td></tr><tr><td>Doesn’t own a vehicle</td><td>0.20</td><td>May need a car, hence loan</td></tr><tr><td>Good credit score</td><td>0.20</td><td>More eligible for credit</td></tr><tr><td>Income > $60K</td><td>0.10</td><td>Likely to get approved</td></tr><tr><td>Checking balance > $2K</td><td>0.10</td><td>Has funds for down payment</td></tr><tr><td>Net income > $2K</td><td>0.10</td><td>Better repayment capacity</td></tr></tbody></table>

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2Few8KMopRq2BywmEgbEad%2F%7BEE08FD64-5F7B-42B7-A5DA-820F9468BC3D%7D.png?alt=media&#x26;token=17d2eb6f-efd2-4190-ba84-e91255921abb" alt=""><figcaption><p>Auto-labeling of training data</p></figcaption></figure>

### Model Definition

```sql
DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression;

CREATE MODEL customer_propensity_using_LogisticRegression
TRANSFORM (
  vector_assembler(array(
     age,
     annual_income,
     credit_score,
     checking_balance,
     savings_balance,
     monthly_debit_volume,
     monthly_credit_volume,
     existing_auto_loan,
     owns_vehicle,
     savings_to_income_ratio,
     debt_to_income_ratio,
     avg_monthly_net_income,
     good_credit_flag,
     high_cash_reserve_flag
  )) features
)
OPTIONS (
  MODEL_TYPE = 'logistic_reg',
  LABEL = 'propensity'
)
AS
SELECT
  *,
  CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
ORDER BY RANDOM()
LIMIT 50000;

```

## Model Evaluation

```sql
SELECT * FROM model_evaluate(
  customer_propensity_using_LogisticRegression, 
  1, 
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_train
  )
);

```

### Results

The query should return the folllowing:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2F4gzHDaj5c51vfGi1B1JU%2F%7BDFC50A68-6239-4AA6-9B93-3DC670B2126E%7D.png?alt=media&#x26;token=f86c82c8-d120-40eb-a4f0-aa798380abd4" alt=""><figcaption><p>Query results on the evaluation</p></figcaption></figure>

| Metric    | Value  |
| --------- | ------ |
| AUC ROC   | 0.9362 |
| Accuracy  | 0.9361 |
| Precision | 0.9367 |
| Recall    | 0.9372 |

### Predict on New Customers

```sql
SELECT * FROM model_predict(
  customer_propensity_using_LogisticRegression, 
  1,
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_predict
  )
);

```

The query looks like the foilowing:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FtF3uUjwAjrhv3UKEhfna%2F%7B9E2236E8-1E36-4353-9249-C9C6F8EFCDBE%7D.png?alt=media&#x26;token=760c1826-5cdd-4d9e-89d4-50872174da0b" alt=""><figcaption><p>Predictions on the same dataset</p></figcaption></figure>
