# STATSML 604: Car Loan Propensity Prediction using Logistic Regression ## Overview Predicting whether a customer is likely to take a car loan can significantly improve a bank’s ability to design targeted campaigns, manage credit risk, and optimize resource allocation. #### **Stage 1: Awareness** * The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.). * Signals: * Search behavior (Google/search logs) * Browsing car loan info on bank websites * Interactions with car dealerships or vehicle-related services #### **Stage 2: Research** * They begin comparing options, calculating EMIs, and checking eligibility. * Signals: * Clicks on car loan calculators * App logins increase * Engaging with bank agents or chatbot loan FAQs * Increasing balance inquiries #### **Stage 3: Financial Readiness** * Evaluating “Can I afford it?” * Signals: * Growth in monthly net income * Stable or rising credit score * High cash reserves (in savings + checking) * Existing auto loans closed or paid off #### **Stage 4: Application Intent** * They are ready to apply. * Signals: * Clicking on “Apply Now” or loan inquiry forms * Uploading documents to the portal In this tutorial, we build a simple logistic regression model to classify a customer's car loan propensity focusing on **profile** and **bank transaction** behavior. ### Use Cases * **Targeted Campaigns**: Focus offers on high-propensity segments * **Loan Eligibility Filtering**: Pre-qualify candidates automatically * **Customer Risk Profiling**: Understand financial behavior deep. ## Prerequisites {% content-ref url="../unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller" %} [prep-400-dbvisualizer-sql-editor-setup-for-data-distiller](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-1-getting-started/prep-400-dbvisualizer-sql-editor-setup-for-data-distiller) {% endcontent-ref %} {% content-ref url="statsml-600-data-distiller-advanced-statistics-and-machine-learning-models" %} [statsml-600-data-distiller-advanced-statistics-and-machine-learning-models](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/statsml-600-data-distiller-advanced-statistics-and-machine-learning-models) {% endcontent-ref %} ## Goals Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on: * Demographic and behavioral data * Financial account balances * Derived income and loan features ## Sample Dataset ### Fields & Description | Field Name | Data Type | Description | | ----------------------- | -------------- | ----------------------------------------- | | customer\_id | STRING | Unique identifier | | age | INT | Customer age | | gender | STRING | Male / Female / Other | | marital\_status | STRING | Married / Single / Divorced | | employment\_status | STRING | Employed / Self-employed / Retired etc. | | annual\_income | DECIMAL | Yearly income | | credit\_score | INT | Credit bureau score | | checking\_balance | DECIMAL | Balance in checking account | | savings\_balance | DECIMAL | Balance in savings account | | monthly\_debit\_volume | DECIMAL | Monthly average spending | | monthly\_credit\_volume | DECIMAL | Monthly average income | | loan\_history | ARRAY\ | Past loans (type, amount, status) | | existing\_auto\_loan | BOOLEAN | Existing car loan | | owns\_vehicle | BOOLEAN | Whether customer owns a car | | propensity\_car\_loan | FLOAT | Target: Likelihood (0-1) of taking a loan | If you were to go and explore the sample data, it should look like this:

A simple data exploration query with DB Visualizer

### Derived Features These features are engineered to improve model performance: | Feature | Formula / Logic | | ------------------------- | ---------------------------------------------- | | `savings_to_income_ratio` | `savings_balance / annual_income` | | `debt_to_income_ratio` | `(monthly_debit_volume * 12) / annual_income` | | `avg_monthly_net_income` | `monthly_credit_volume - monthly_debit_volume` | | `loan_count` | `COUNT(loan_history)` | | `previous_auto_loans` | `COUNT WHERE loan_type = 'Auto'` | | `good_credit_flag` | `credit_score >= 700` | | `high_cash_reserve_flag` | `checking_balance + savings_balance > 10000` | {% code overflow="wrap" %} ```sql Select customer_id, cast(age as int) age, gender, marital_status, employment_status, cast(annual_income as decimal(18,2)) annual_income, cast(credit_score as Int) credit_score, cast(checking_balance as decimal(18,2))checking_balance, cast(savings_balance as decimal(18,2)) savings_balance, cast(monthly_debit_volume as decimal(18,2)) monthly_debit_volume, cast(monthly_credit_volume as decimal(18,2)) monthly_credit_volume, --Derived Features ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio, ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio, (monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income, CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag, CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag from customer_bank_data; ``` {% endcode %} This should look like the following:

Derived features along with the base features

### Auto-Labeling of Training Data

Feature	Weight	Explanation
Not having an auto loan	0.30	More likely to consider buying
Doesn’t own a vehicle	0.20	May need a car, hence loan
Good credit score	0.20	More eligible for credit
Income > $60K	0.10	Likely to get approved
Checking balance > $2K	0.10	Has funds for down payment
Net income > $2K	0.10	Better repayment capacity

### Model Definition ```sql DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression; CREATE MODEL customer_propensity_using_LogisticRegression TRANSFORM ( vector_assembler(array( age, annual_income, credit_score, checking_balance, savings_balance, monthly_debit_volume, monthly_credit_volume, existing_auto_loan, owns_vehicle, savings_to_income_ratio, debt_to_income_ratio, avg_monthly_net_income, good_credit_flag, high_cash_reserve_flag )) features ) OPTIONS ( MODEL_TYPE = 'logistic_reg', LABEL = 'propensity' ) AS SELECT *, CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity FROM vw_customer_bank_profile_data_train ORDER BY RANDOM() LIMIT 50000; ``` ## Model Evaluation ```sql SELECT * FROM model_evaluate( customer_propensity_using_LogisticRegression, 1, ( SELECT *, CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity FROM vw_customer_bank_profile_data_train ) ); ``` ### Results The query should return the folllowing:

| Metric | Value | | --------- | ------ | | AUC ROC | 0.9362 | | Accuracy | 0.9361 | | Precision | 0.9367 | | Recall | 0.9372 | ### Predict on New Customers ```sql SELECT * FROM model_predict( customer_propensity_using_LogisticRegression, 1, ( SELECT *, CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity FROM vw_customer_bank_profile_data_predict ) ); ``` The query looks like the foilowing: