# STATSML 604: Car Loan Propensity Prediction using Logistic Regression

## Overview

Predicting whether a customer is likely to take a car loan can significantly improve a bank’s ability to design targeted campaigns, manage credit risk, and optimize resource allocation.&#x20;

#### **Stage 1: Awareness**

* The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.).
* Signals:
  * Search behavior (Google/search logs)
  * Browsing car loan info on bank websites
  * Interactions with car dealerships or vehicle-related services

#### **Stage 2: Research**

* They begin comparing options, calculating EMIs, and checking eligibility.
* Signals:
  * Clicks on car loan calculators
  * App logins increase
  * Engaging with bank agents or chatbot loan FAQs
  * Increasing balance inquiries

#### **Stage 3: Financial Readiness**

* Evaluating “Can I afford it?”
* Signals:
  * Growth in monthly net income
  * Stable or rising credit score
  * High cash reserves (in savings + checking)
  * Existing auto loans closed or paid off

#### **Stage 4: Application Intent**

* They are ready to apply.
* Signals:
  * Clicking on “Apply Now” or loan inquiry forms
  * Uploading documents to the portal

In this tutorial, we build a simple logistic regression model to classify a customer's car loan propensity focusing on **profile** and **bank transaction** behavior.

### Use Cases

* **Targeted Campaigns**: Focus offers on high-propensity segments
* **Loan Eligibility Filtering**: Pre-qualify candidates automatically
* **Customer Risk Profiling**: Understand financial behavior deep.

## Prerequisites

{% content-ref url="/pages/B0nz0qFAt19NA6ofV5EA" %}
[PREP 400: DBVisualizer SQL Editor Setup for Data Distiller](/adobe-data-distiller-guide/unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller.md)
{% endcontent-ref %}

{% content-ref url="/pages/qAD2VzPL5alOeVLBqATI" %}
[STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models](/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-advanced-statistics-and-machine-learning-models.md)
{% endcontent-ref %}

## Goals

Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on:

* Demographic and behavioral data
* Financial account balances
* Derived income and loan features

## Sample Dataset

### Fields & Description

| Field Name              | Data Type      | Description                               |
| ----------------------- | -------------- | ----------------------------------------- |
| customer\_id            | STRING         | Unique identifier                         |
| age                     | INT            | Customer age                              |
| gender                  | STRING         | Male / Female / Other                     |
| marital\_status         | STRING         | Married / Single / Divorced               |
| employment\_status      | STRING         | Employed / Self-employed / Retired etc.   |
| annual\_income          | DECIMAL        | Yearly income                             |
| credit\_score           | INT            | Credit bureau score                       |
| checking\_balance       | DECIMAL        | Balance in checking account               |
| savings\_balance        | DECIMAL        | Balance in savings account                |
| monthly\_debit\_volume  | DECIMAL        | Monthly average spending                  |
| monthly\_credit\_volume | DECIMAL        | Monthly average income                    |
| loan\_history           | ARRAY\<STRUCT> | Past loans (type, amount, status)         |
| existing\_auto\_loan    | BOOLEAN        | Existing car loan                         |
| owns\_vehicle           | BOOLEAN        | Whether customer owns a car               |
| propensity\_car\_loan   | FLOAT          | Target: Likelihood (0-1) of taking a loan |

If you were to go and explore the sample data, it should look like this:

<figure><img src="/files/2GV5uv9F8oKPKPBgFbFx" alt=""><figcaption><p>A simple data exploration query with DB Visualizer</p></figcaption></figure>

### Derived Features

These features are engineered to improve model performance:

| Feature                   | Formula / Logic                                |
| ------------------------- | ---------------------------------------------- |
| `savings_to_income_ratio` | `savings_balance / annual_income`              |
| `debt_to_income_ratio`    | `(monthly_debit_volume * 12) / annual_income`  |
| `avg_monthly_net_income`  | `monthly_credit_volume - monthly_debit_volume` |
| `loan_count`              | `COUNT(loan_history)`                          |
| `previous_auto_loans`     | `COUNT WHERE loan_type = 'Auto'`               |
| `good_credit_flag`        | `credit_score >= 700`                          |
| `high_cash_reserve_flag`  | `checking_balance + savings_balance > 10000`   |

{% code overflow="wrap" %}

```sql
Select
customer_id,
cast(age as int) age,
gender,
marital_status,
employment_status,
cast(annual_income as decimal(18,2)) annual_income,
cast(credit_score as Int) credit_score,
cast(checking_balance as decimal(18,2))checking_balance,
cast(savings_balance as decimal(18,2)) savings_balance,
cast(monthly_debit_volume as decimal(18,2)) monthly_debit_volume,
cast(monthly_credit_volume as decimal(18,2)) monthly_credit_volume,
--Derived Features
ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio,
ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio,
(monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income,
CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag,
CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag
from customer_bank_data;
```

{% endcode %}

This should look like the following:

<figure><img src="/files/2he1ysB69akwT3KE5l4n" alt=""><figcaption><p>Derived features along with the base features</p></figcaption></figure>

### Auto-Labeling of Training Data

<table data-full-width="false"><thead><tr><th>Feature</th><th width="215.63226318359375">Weight</th><th>Explanation</th></tr></thead><tbody><tr><td>Not having an auto loan</td><td>0.30</td><td>More likely to consider buying</td></tr><tr><td>Doesn’t own a vehicle</td><td>0.20</td><td>May need a car, hence loan</td></tr><tr><td>Good credit score</td><td>0.20</td><td>More eligible for credit</td></tr><tr><td>Income > $60K</td><td>0.10</td><td>Likely to get approved</td></tr><tr><td>Checking balance > $2K</td><td>0.10</td><td>Has funds for down payment</td></tr><tr><td>Net income > $2K</td><td>0.10</td><td>Better repayment capacity</td></tr></tbody></table>

<figure><img src="/files/a4NjlKjDLLURisY9Ged5" alt=""><figcaption><p>Auto-labeling of training data</p></figcaption></figure>

### Model Definition

```sql
DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression;

CREATE MODEL customer_propensity_using_LogisticRegression
TRANSFORM (
  vector_assembler(array(
     age,
     annual_income,
     credit_score,
     checking_balance,
     savings_balance,
     monthly_debit_volume,
     monthly_credit_volume,
     existing_auto_loan,
     owns_vehicle,
     savings_to_income_ratio,
     debt_to_income_ratio,
     avg_monthly_net_income,
     good_credit_flag,
     high_cash_reserve_flag
  )) features
)
OPTIONS (
  MODEL_TYPE = 'logistic_reg',
  LABEL = 'propensity'
)
AS
SELECT
  *,
  CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
ORDER BY RANDOM()
LIMIT 50000;

```

## Model Evaluation

```sql
SELECT * FROM model_evaluate(
  customer_propensity_using_LogisticRegression, 
  1, 
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_train
  )
);

```

### Results

The query should return the folllowing:

<figure><img src="/files/YNtxu7zInyq48ZUnRKGO" alt=""><figcaption><p>Query results on the evaluation</p></figcaption></figure>

| Metric    | Value  |
| --------- | ------ |
| AUC ROC   | 0.9362 |
| Accuracy  | 0.9361 |
| Precision | 0.9367 |
| Recall    | 0.9372 |

### Predict on New Customers

```sql
SELECT * FROM model_predict(
  customer_propensity_using_LogisticRegression, 
  1,
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_predict
  )
);

```

The query looks like the foilowing:

<figure><img src="/files/7ekDDLvbsK9T2Toedvs9" alt=""><figcaption><p>Predictions on the same dataset</p></figcaption></figure>


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-604-car-loan-propensity-prediction-using-logistic-regression.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
