hacktivist211/Credit-Risk-Management-Model

Credit Risk Management Model

This repository contains a machine learning pipeline for a Credit Risk Management Model that predicts the probability of default (PoD) for borrowers using alternative data sources. The model draws on demographic, financial, behavioral, and geolocation-based features to assess credit risk and track potential defaulters. The pipeline covers data generation, model training, evaluation, and stacking, which combines predictions from multiple models for improved performance.


Table of Contents

  1. Problem Statement
  2. Approach to the Problem
  3. Quick Start Guide
  4. Project Structure
  5. Setup Instructions
  6. Key Features and Data Generation
  7. Machine Learning Pipeline
  8. Model Visualization
  9. Key Code Snippets
  10. Dependencies
  11. Usage
  12. Future Improvements
  13. License

Problem Statement

The goal of this project is to build a robust machine learning model to predict the probability of default (PoD) for borrowers, enabling effective credit risk management. The model uses alternative data sources (e.g., demographic, financial, social, and geolocation data) to assess the likelihood of a borrower defaulting on a loan. Additionally, the model supports tracking potential defaulters by analyzing behavioral and network-based features. The pipeline generates synthetic data, trains multiple machine learning models, and combines their predictions using a stacking approach to achieve high recall and AUC metrics, which are critical for identifying high-risk borrowers.


Approach to the Problem

The approach to solving the credit risk management problem involves the following steps:

  1. Data Generation: Synthetic data is generated to simulate real-world credit risk scenarios, including demographic, financial, and alternative data features (e.g., geolocation, social network, and device usage patterns).
  2. Feature Engineering: Over 100 features are created, covering traditional credit metrics (e.g., debt-to-income ratio, credit utilization) and alternative data (e.g., geolocation entropy, social graph density).
  3. Model Training: Three base models are trained:
    • CatBoost: A gradient boosting model that handles categorical features natively.
    • HistGradientBoosting (HGB): A histogram-based gradient boosting model optimized for large datasets.
    • XGBoost: A scalable gradient boosting model with optimized hyperparameters.
  4. Stacking Ensemble: Predictions from the base models are combined using a Logistic Regression meta-model to improve overall performance.
  5. Evaluation: Models are evaluated using recall, AUC, and cost-based metrics (e.g., penalizing false negatives more heavily due to their higher financial impact).
  6. Stability Checks: Cross-validation and segment-based analysis ensure model robustness across different data subsets.
  7. Tracking Defaulters: Features like peer_defaulter_count, social_network_risk_score, and geolocation_change_frequency help identify and track potential defaulters.

Novelty and Technical Innovation

The project introduces a novel approach to credit risk management through model ensembling and stacking, which significantly enhances predictive performance and robustness. By combining three diverse gradient boosting models—CatBoost, HistGradientBoosting, and XGBoost—the pipeline leverages complementary strengths:

  • CatBoost excels in handling categorical features without extensive preprocessing, capturing complex interactions in features like income_category and device_type.
  • HistGradientBoosting offers computational efficiency for large datasets through histogram-based techniques, making it ideal for the 2,000,000-record training set.
  • XGBoost provides scalability and precision with GPU-accelerated training and an optimized threshold for imbalanced classes.

The stacking ensemble is a key innovation, where predictions from these base models are fed into a Logistic Regression meta-model. This meta-model learns to optimally weight the base model predictions, mitigating individual model biases and improving overall accuracy, particularly in recall for identifying defaulters. The stacking approach is implemented in the_pipeline.py, generating a stacked_train.csv file with base model predictions (cat_pred, hgb_pred, xgb_pred) as features. This technique ensures the model captures diverse patterns in alternative data, such as geolocation entropy and social network risk, providing a more holistic risk assessment compared to traditional credit scoring methods.


Quick Start Guide

New to machine learning or GitHub? Follow these steps to get started quickly:

  1. Clone the Repository: Download the project files to your computer.
    git clone https://github.com/hacktivist211/Credit-Risk-Management-Model.git
    cd Credit-Risk-Management-Model
  2. Set Up the Environment: Create a virtual environment using Conda and install dependencies.
    conda create -n credit_risk_env python=3.8
    conda activate credit_risk_env
    pip install -r requirements.txt
  3. Generate Data: Create synthetic datasets for training, validation, and testing.
    python data.py
  4. Run the Pipeline: Train models and generate predictions.
    python the_pipeline.py
  5. View Results: Check the Combined_Training_Output directory for model artifacts and evaluation results.

For detailed instructions, see the Setup Instructions and Usage sections.


Project Structure

The repository is organized as follows:

Credit-Risk-Management-Model/
│
├── data.py                   # Script to generate synthetic credit risk data
├── model_eval.py             # Script to evaluate base models and meta-model
├── the_pipeline.py           # Main pipeline for training, stacking, and evaluation
├── Combined_Training_Output/ # Directory to store model artifacts and outputs
├── requirements.txt          # List of dependencies
├── README.md                 # This file

Setup Instructions

Prerequisites

  • Python: Version 3.8 or higher
  • Conda: For creating and managing the virtual environment
  • Git: For cloning the repository

Steps to Set Up the Environment

  1. Clone the Repository:

    git clone https://github.com/hacktivist211/Credit-Risk-Management-Model.git
    cd Credit-Risk-Management-Model
  2. Create a Conda Environment:

    conda create -n credit_risk_env python=3.8
    conda activate credit_risk_env
  3. Install Dependencies: Install the required packages listed in requirements.txt:

    pip install -r requirements.txt
  4. Directory Setup: Ensure you have a directory at E:\ML\Projects\IITH\New folder for data storage, or modify the BASE_PATH variable in data.py, model_eval.py, and the_pipeline.py to point to a valid directory on your system.

  5. Generate Synthetic Data: Run the data generation script to create train, validation, and test datasets:

    python data.py
  6. Run the Pipeline: Execute the main pipeline to train models, generate stacked predictions, and train the meta-model:

    python the_pipeline.py
  7. Evaluate Models: Evaluate the trained models using:

    python model_eval.py

Key Features and Data Generation

The data.py script generates synthetic datasets with 115 features relevant to credit risk assessment. Key features include:

  • Demographic Features: age, income, education, marital_status, profession
  • Financial Features: debt_ratio, credit_limit, revolving_utilization_rate, loan_amount
  • Behavioral Features: transaction_frequency, payment_regularity, missed_payment_count
  • Alternative Data:
    • Geolocation: geolocation_entropy, avg_daily_travel_distance, night_time_movement_ratio
    • Social Network: social_graph_density, peer_defaulter_count, unique_contacts_count
    • Device Usage: device_type, device_security_score, login_frequency
  • Target Variable: risk_flag (binary: 0 = non-default, 1 = default), derived from probability_of_default

The data is split into:

  • Training Set: 2,000,000 records
  • Validation Set: 15,000 records
  • Test Set: 15,000 records

The probability_of_default is calculated using a logistic function based on key risk factors (e.g., debt_ratio, num_delinq_90_plus_days, payment_regularity), ensuring realistic risk profiles.
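As an illustration of the logistic derivation described above, the sketch below maps a few of the named risk factors to a probability and a binary risk_flag. The coefficients and the 0.5 cut-off are made-up values for this example, not the repository's actual parameters.

```python
import numpy as np

# Illustrative only: coefficients are assumptions, not the values in data.py.
def probability_of_default(debt_ratio, num_delinq_90_plus_days, payment_regularity):
    # Higher debt and delinquencies raise risk; regular payments lower it.
    z = 2.0 * debt_ratio + 0.8 * num_delinq_90_plus_days - 1.5 * payment_regularity
    return 1.0 / (1.0 + np.exp(-z))  # logistic squashing to [0, 1]

pd_score = probability_of_default(debt_ratio=0.6,
                                  num_delinq_90_plus_days=2,
                                  payment_regularity=0.3)
risk_flag = int(pd_score >= 0.5)  # binarize into the target variable
```

A borrower with high debt and two 90+ day delinquencies lands well above the cut-off, yielding risk_flag = 1.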


Machine Learning Pipeline

Data Preprocessing

  • CatBoost: Categorical features (income_category, education, etc.) are kept as strings, leveraging CatBoost's native handling of categorical data.
  • HGB: Categorical features are label-encoded, and missing values are imputed with zeros.
  • XGBoost: Categorical features are label-encoded, numerical features are scaled using RobustScaler, and missing values are imputed with medians.
  • Stacking: Predictions from base models are used as features for the meta-model.
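The XGBoost-style preprocessing above (label-encode categoricals, median-impute, robust-scale) can be sketched as follows. The column names and values are illustrative stand-ins for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Toy frame standing in for the real training data.
df = pd.DataFrame({
    'income_category': ['low', 'high', 'mid', 'high'],
    'debt_ratio': [0.3, np.nan, 0.7, 0.5],
})

# Label-encode the categorical column.
df['income_category'] = LabelEncoder().fit_transform(df['income_category'])
# Median-impute missing numericals, then scale with RobustScaler
# (robust to the heavy-tailed income/debt distributions).
df['debt_ratio'] = df['debt_ratio'].fillna(df['debt_ratio'].median())
df[['debt_ratio']] = RobustScaler().fit_transform(df[['debt_ratio']])
```

For the HGB variant, the same label encoding applies but missing values are filled with zeros instead of medians; CatBoost skips this step entirely and consumes the raw string categories.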

Base Models

  1. CatBoost (CatBoostClassifier):

    • Handles categorical features directly.
    • Parameters: iterations=1000, depth=8, learning_rate=0.05, auto_class_weights='Balanced'.
    • Optimized for AUC with early stopping.
  2. HistGradientBoosting (HistGradientBoostingClassifier):

    • Uses histogram-based gradient boosting for efficiency.
    • Parameters: max_iter=1000, max_leaf_nodes=128, learning_rate=0.05, early_stopping=True.
    • Categorical features are label-encoded, and missing values are imputed with zeros.
  3. XGBoost (xgb.Booster):

    • Optimized for large datasets with GPU support (if available).
    • Parameters: max_depth=8, eta=0.05, scale_pos_weight adjusted for class imbalance.
    • Uses an optimal threshold based on the maximum F1-score from precision-recall curves.
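The F1-optimal threshold selection mentioned for XGBoost can be sketched like this, with placeholder validation labels and probabilities in place of the real model outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholders for validation labels and the model's predicted probabilities.
y_val = np.array([0, 0, 1, 1, 0, 1, 0, 1])
val_probs = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9])

precision, recall, thresholds = precision_recall_curve(y_val, val_probs)
# Elementwise F1 for each candidate threshold; epsilon avoids 0/0.
f1 = 2 * precision * recall / (precision + recall + 1e-12)
# The final (precision, recall) point has no threshold, so drop it.
best_threshold = thresholds[np.argmax(f1[:-1])]
y_pred = (val_probs >= best_threshold).astype(int)
```

Tuning the cut-off this way trades a little precision for recall, which matches the project's emphasis on catching defaulters.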

Stacking and Meta-Model

  • Stacking: Predictions from CatBoost, HGB, and XGBoost are combined into a new dataset (stacked_train.csv) with columns cat_pred, hgb_pred, and xgb_pred.
  • Meta-Model: A LogisticRegression model is trained on the stacked predictions to produce the final probability of default.
  • Rationale: Stacking leverages the strengths of individual models, improving overall performance by learning to weight their predictions optimally.
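The stacking step above can be sketched end to end: base-model probabilities become the meta-model's features. The values below are illustrative stand-ins for the real cat_pred / hgb_pred / xgb_pred columns of stacked_train.csv:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for stacked_train.csv: one row per borrower, one column
# per base model's predicted probability, plus the true label.
stacked = pd.DataFrame({
    'cat_pred': [0.10, 0.80, 0.30, 0.90],
    'hgb_pred': [0.20, 0.70, 0.40, 0.85],
    'xgb_pred': [0.15, 0.90, 0.25, 0.80],
    'risk_flag': [0, 1, 0, 1],
})

X_meta = stacked[['cat_pred', 'hgb_pred', 'xgb_pred']]
y_meta = stacked['risk_flag']

# The meta-model learns how much to trust each base model.
meta_model = LogisticRegression(random_state=42).fit(X_meta, y_meta)
final_pd = meta_model.predict_proba(X_meta)[:, 1]
```

The learned coefficients effectively weight each base model, so a model that is systematically over-confident gets discounted rather than propagated into the final PoD.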

Model Evaluation

  • Metrics:
    • Recall: Prioritized to minimize false negatives (missed defaulters).
    • AUC: Measures the model's ability to distinguish between defaulters and non-defaulters.
    • Confusion Matrix: Provides insight into true positives, false positives, true negatives, and false negatives.
    • Cost-Based Evaluation: Assigns higher cost to false negatives (fn_cost=10) than false positives (fp_cost=1) to reflect the financial impact of missing defaulters.
  • Stability Checks:
    • Cross-Validation: 5-fold stratified cross-validation to assess model stability.
    • Segment Analysis: Evaluates performance on different data segments to ensure robustness.
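The 5-fold stratified stability check can be sketched on a toy imbalanced dataset, with a simple classifier standing in for the trained base models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy imbalanced dataset (~10% positives), mimicking rare defaults.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Stratification keeps the default rate consistent across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='roc_auc')
print(f"AUC per fold: {scores.round(3)}, std: {scores.std():.3f}")
```

A low standard deviation across folds is the stability signal: large fold-to-fold swings would suggest the model is overfitting to particular data subsets.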

Model Visualization

Model performance metrics (Recall, AUC, and confusion matrix) for CatBoost, HistGradientBoosting, XGBoost, and the Meta-Model are saved in evaluation_results.json files within the Combined_Training_Output directory after running model_eval.py. To visualize these metrics, you can extract the results and create plots using tools like Matplotlib or Seaborn in Python, or use the provided JSON files for custom visualizations.
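A minimal Matplotlib sketch of such a visualization is shown below. The metric values are illustrative placeholders for what the evaluation_results.json files would contain; in practice you would load them with json.load first.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder metrics standing in for the contents of evaluation_results.json.
results = {
    'CatBoost':   {'recall': 0.82, 'auc': 0.91},
    'HGB':        {'recall': 0.79, 'auc': 0.90},
    'XGBoost':    {'recall': 0.84, 'auc': 0.92},
    'Meta-Model': {'recall': 0.87, 'auc': 0.94},
}

models = list(results)
recalls = [results[m]['recall'] for m in models]
aucs = [results[m]['auc'] for m in models]

fig, ax = plt.subplots()
ax.bar(models, recalls, label='Recall')
ax.plot(models, aucs, 'o-', color='black', label='AUC')
ax.set_ylabel('Score')
ax.legend()
fig.savefig('model_metrics.png')
```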


Key Code Snippets

1. Data Generation (data.py)

This snippet generates synthetic data for the age and income features:

value = np.random.randint(21, 70, num_samples)  # Age between 21 and 70
update_progress('age', value)

value = np.random.lognormal(mean=11.5, sigma=0.7, size=num_samples)  # Lognormal distribution for income
update_progress('income', value)

2. CatBoost Training (the_pipeline.py)

Training the CatBoost model with progress tracking:

params = self.get_model_parameters()
self.model = CatBoostClassifier(**params)
self.model.fit(self.train_pool, eval_set=self.val_pool, plot=False, verbose=False,
               callbacks=[self._get_progress_callback(progress, task)])

3. Stacking Predictions (the_pipeline.py)

Generating predictions for stacking:

cat_model = CatBoostClassifier()
cat_model.load_model(os.path.join(cat_run_path, 'catboost_model.cbm'))
cat_pool = Pool(X_test_orig, cat_features=['income_category', 'education', ...])
predictions['cat_pred'] = cat_model.predict_proba(cat_pool)[:, 1]

4. Meta-Model Training (the_pipeline.py)

Training the Logistic Regression meta-model:

meta_model = LogisticRegression(random_state=42)
meta_model.fit(X_meta, y_meta)
joblib.dump(meta_model, meta_model_path)

5. Model Evaluation (model_eval.py)

Evaluating model performance with recall, AUC, and cost-based metrics:

recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, model.predict_proba(X)[:, 1])
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # unpack counts before computing the cost
total_cost = fn * 10 + fp * 1  # false negatives are 10x costlier than false positives

Dependencies

The project requires the following Python packages (listed in requirements.txt):

pandas
numpy
tqdm
catboost
scikit-learn
xgboost
joblib
tabulate
rich
psutil

Install them using:

pip install -r requirements.txt

Usage

  1. Generate Data:

    python data.py

    This creates train_data.csv, validation_data.csv, and test_data.csv in the specified directory.

  2. Run the Full Pipeline:

    python the_pipeline.py

    This trains the CatBoost, HGB, and XGBoost models, generates stacked predictions, and trains the meta-model.

  3. Evaluate Models:

    python model_eval.py

    This evaluates the performance of all models and the meta-model, displaying metrics like recall, AUC, and cost.

  4. Output:

    • Model artifacts are saved in Combined_Training_Output/.
    • Stacked predictions are saved in stacked_train.csv.
    • Evaluation results are saved as JSON files in the respective model directories.

Future Improvements

  • Feature Selection: Implement feature importance analysis to reduce the number of features and improve model efficiency.
  • Hyperparameter Tuning: Use grid search or Bayesian optimization to fine-tune model parameters.
  • Additional Models: Incorporate other algorithms (e.g., LightGBM, Neural Networks) into the stacking ensemble.
  • Real-Time Tracking: Develop a real-time system for monitoring defaulter behavior using streaming data.
  • Alternative Data Expansion: Include more alternative data sources, such as social media sentiment or real-time transaction data.
  • Explainability: Integrate SHAP or LIME to provide interpretable insights into model predictions.

License

This project is licensed under the MIT License. See the LICENSE file for details.
