# Credit Risk Management Model

This repository contains a comprehensive machine learning pipeline for a Credit Risk Management Model designed to predict the probability of default (PoD) for borrowers using alternative data sources. The model leverages a variety of features, including demographic, financial, behavioral, and geolocation-based data, to assess credit risk and track potential defaulters. The pipeline includes data generation, model training, evaluation, and stacking to combine predictions from multiple models for improved performance.
## Table of Contents

- Problem Statement
- Approach to the Problem
- Quick Start Guide
- Project Structure
- Setup Instructions
- Key Features and Data Generation
- Machine Learning Pipeline
- Model Visualization
- Key Code Snippets
- Dependencies
- Usage
- Future Improvements
- License
## Problem Statement

The goal of this project is to build a robust machine learning model to predict the probability of default (PoD) for borrowers, enabling effective credit risk management. The model uses alternative data sources (e.g., demographic, financial, social, and geolocation data) to assess the likelihood of a borrower defaulting on a loan. Additionally, the model supports tracking potential defaulters by analyzing behavioral and network-based features. The pipeline generates synthetic data, trains multiple machine learning models, and combines their predictions using a stacking approach to achieve high recall and AUC metrics, which are critical for identifying high-risk borrowers.
## Approach to the Problem

The approach to solving the credit risk management problem involves the following steps:
- Data Generation: Synthetic data is generated to simulate real-world credit risk scenarios, including demographic, financial, and alternative data features (e.g., geolocation, social network, and device usage patterns).
- Feature Engineering: Over 100 features are created, covering traditional credit metrics (e.g., debt-to-income ratio, credit utilization) and alternative data (e.g., geolocation entropy, social graph density).
- Model Training: Three base models are trained:
  - CatBoost: A gradient boosting model that handles categorical features natively.
  - HistGradientBoosting (HGB): A histogram-based gradient boosting model optimized for large datasets.
  - XGBoost: A scalable gradient boosting model with optimized hyperparameters.
- Stacking Ensemble: Predictions from the base models are combined using a Logistic Regression meta-model to improve overall performance.
- Evaluation: Models are evaluated using recall, AUC, and cost-based metrics (e.g., penalizing false negatives more heavily due to their higher financial impact).
- Stability Checks: Cross-validation and segment-based analysis ensure model robustness across different data subsets.
- Tracking Defaulters: Features such as `peer_defaulter_count`, `social_network_risk_score`, and `geolocation_change_frequency` help identify and track potential defaulters.
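The exact tracking logic lives in the pipeline scripts; as a rough, self-contained illustration of how such network and behavioral features could be combined into a single flag (the weights and clipping bounds below are hypothetical, not taken from the repository):

```python
import pandas as pd

def flag_potential_defaulters(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Toy composite risk signal from network/behavioral features.

    Weights and normalization bounds are illustrative only.
    """
    # Normalize each feature to roughly [0, 1] before weighting
    peer = df['peer_defaulter_count'].clip(upper=10) / 10
    social = df['social_network_risk_score']           # assumed already in [0, 1]
    geo = df['geolocation_change_frequency'].clip(upper=30) / 30
    score = 0.4 * peer + 0.4 * social + 0.2 * geo
    return score > threshold

df = pd.DataFrame({
    'peer_defaulter_count': [0, 8],
    'social_network_risk_score': [0.1, 0.9],
    'geolocation_change_frequency': [2, 25],
})
print(flag_potential_defaulters(df).tolist())  # → [False, True]
```

The second borrower is flagged because several weak signals (defaulting peers, risky network, frequent relocation) reinforce each other, which is the intuition behind tracking defaulters with network-based features.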
The project introduces a novel approach to credit risk management through model ensembling and stacking, which significantly enhances predictive performance and robustness. By combining three diverse gradient boosting models—CatBoost, HistGradientBoosting, and XGBoost—the pipeline leverages complementary strengths:
- CatBoost excels at handling categorical features without extensive preprocessing, capturing complex interactions in features like `income_category` and `device_type`.
- HistGradientBoosting offers computational efficiency on large datasets through histogram-based techniques, making it well suited to the 2,000,000-record training set.
- XGBoost provides scalability and precision with GPU-accelerated training and an optimized threshold for imbalanced classes.
The stacking ensemble is a key innovation: predictions from these base models are fed into a Logistic Regression meta-model. This meta-model learns to optimally weight the base model predictions, mitigating individual model biases and improving overall performance, particularly recall for identifying defaulters. The stacking approach is implemented in `the_pipeline.py`, which generates a `stacked_train.csv` file with the base model predictions (`cat_pred`, `hgb_pred`, `xgb_pred`) as features. This technique ensures the model captures diverse patterns in alternative data, such as geolocation entropy and social network risk, providing a more holistic risk assessment than traditional credit scoring methods.
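In outline, the stacking step can be sketched like this, with synthetic numbers standing in for the real out-of-fold base-model predictions (the actual implementation lives in `the_pipeline.py`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=1000)  # synthetic default labels

def noisy_pred(strength: float = 0.25) -> np.ndarray:
    # Stand-in for one base model's out-of-fold probabilities:
    # correlated with the label, plus independent noise.
    return np.clip(y + rng.normal(0.0, strength, size=y.size), 0.0, 1.0)

# Columns play the role of cat_pred, hgb_pred, xgb_pred in stacked_train.csv
X_meta = np.column_stack([noisy_pred(), noisy_pred(), noisy_pred()])

meta = LogisticRegression().fit(X_meta, y)    # the meta-model
final_pod = meta.predict_proba(X_meta)[:, 1]  # final probability of default
print(final_pod.shape)
```

Because each base model's errors are partly independent, the meta-model can down-weight whichever prediction is least reliable in a given region of the feature space.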
## Quick Start Guide

New to machine learning or GitHub? Follow these steps to get started quickly:
- Clone the Repository: Download the project files to your computer.

  ```bash
  git clone https://github.com/hacktivist211/credit-risk-management.git
  cd credit-risk-management
  ```

- Set Up the Environment: Create a virtual environment with Conda and install the dependencies.

  ```bash
  conda create -n credit_risk_env python=3.8
  conda activate credit_risk_env
  pip install -r requirements.txt
  ```

- Generate Data: Create synthetic datasets for training, validation, and testing.

  ```bash
  python data.py
  ```

- Run the Pipeline: Train the models and generate predictions.

  ```bash
  python the_pipeline.py
  ```

- View Results: Check the `Combined_Training_Output` directory for model artifacts and evaluation results.
For detailed instructions, see the Setup Instructions and Usage sections.
## Project Structure

The repository is organized as follows:
```
credit-risk-management/
│
├── data.py                   # Script to generate synthetic credit risk data
├── model_eval.py             # Script to evaluate base models and meta-model
├── the_pipeline.py           # Main pipeline for training, stacking, and evaluation
├── Combined_Training_Output/ # Directory to store model artifacts and outputs
├── requirements.txt          # List of dependencies
├── README.md                 # This file
```
## Setup Instructions

Prerequisites:

- Python: Version 3.8 or higher
- Conda: For creating and managing the virtual environment
- Git: For cloning the repository

1. Clone the Repository:

   ```bash
   git clone https://github.com/hacktivist211/credit-risk-management.git
   cd credit-risk-management
   ```

2. Create a Conda Environment:

   ```bash
   conda create -n credit_risk_env python=3.8
   conda activate credit_risk_env
   ```

3. Install Dependencies: Install the required packages listed in `requirements.txt`:

   ```bash
   pip install -r requirements.txt
   ```

4. Directory Setup: Ensure you have a directory at `E:\ML\Projects\IITH\New folder` for data storage, or modify the `BASE_PATH` variable in `data.py`, `model_eval.py`, and `the_pipeline.py` to point to a valid directory on your system.

5. Generate Synthetic Data: Run the data generation script to create the train, validation, and test datasets:

   ```bash
   python data.py
   ```

6. Run the Pipeline: Execute the main pipeline to train the models, generate stacked predictions, and train the meta-model:

   ```bash
   python the_pipeline.py
   ```

7. Evaluate Models: Evaluate the trained models using:

   ```bash
   python model_eval.py
   ```
## Key Features and Data Generation

The `data.py` script generates synthetic datasets with 115 features relevant to credit risk assessment. Key features include:

- Demographic Features: `age`, `income`, `education`, `marital_status`, `profession`
- Financial Features: `debt_ratio`, `credit_limit`, `revolving_utilization_rate`, `loan_amount`
- Behavioral Features: `transaction_frequency`, `payment_regularity`, `missed_payment_count`
- Alternative Data:
  - Geolocation: `geolocation_entropy`, `avg_daily_travel_distance`, `night_time_movement_ratio`
  - Social Network: `social_graph_density`, `peer_defaulter_count`, `unique_contacts_count`
  - Device Usage: `device_type`, `device_security_score`, `login_frequency`
- Target Variable: `risk_flag` (binary: 0 = non-default, 1 = default), derived from `probability_of_default`

The data is split into:

- Training Set: 2,000,000 records
- Validation Set: 15,000 records
- Test Set: 15,000 records

The `probability_of_default` is calculated using a logistic function of key risk factors (e.g., `debt_ratio`, `num_delinq_90_plus_days`, `payment_regularity`), ensuring realistic risk profiles.
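As an illustration of that derivation, here is a minimal logistic transform of a linear risk score; the coefficients are made up for the example and are not the ones used in `data.py`:

```python
import numpy as np

def probability_of_default(debt_ratio, num_delinq_90_plus_days, payment_regularity):
    """Logistic transform of a linear risk score (coefficients are illustrative)."""
    z = -3.0 + 4.0 * debt_ratio + 0.8 * num_delinq_90_plus_days - 2.0 * payment_regularity
    return 1.0 / (1.0 + np.exp(-z))  # squashes the score into (0, 1)

low = probability_of_default(debt_ratio=0.1, num_delinq_90_plus_days=0, payment_regularity=0.95)
high = probability_of_default(debt_ratio=0.9, num_delinq_90_plus_days=3, payment_regularity=0.2)
print(round(low, 3), round(high, 3))
# risk_flag would then be (probability_of_default > threshold) as a binary target
```

High debt, repeated 90+ day delinquencies, and irregular payments push the score up; regular payments pull it down, which is what produces realistic risk profiles.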
## Machine Learning Pipeline

Preprocessing is model-specific:

- CatBoost: Categorical features (`income_category`, `education`, etc.) are kept as strings, leveraging CatBoost's native handling of categorical data.
- HGB: Categorical features are label-encoded, and missing values are imputed with zeros.
- XGBoost: Categorical features are label-encoded, numerical features are scaled using `RobustScaler`, and missing values are imputed with medians.
- Stacking: Predictions from the base models are used as features for the meta-model.
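A minimal sketch of the XGBoost-style preprocessing (label encoding, median imputation, `RobustScaler`), run on toy data rather than the project's datasets:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

df = pd.DataFrame({
    'income_category': ['low', 'high', 'medium', 'high'],
    'debt_ratio': [0.2, np.nan, 0.8, 0.5],
})

# Label-encode categorical columns (gradient boosting on numeric input)
df['income_category'] = LabelEncoder().fit_transform(df['income_category'])

# Median-impute missing numerics, then scale robustly (median/IQR based,
# so outliers do not dominate the scaling)
df['debt_ratio'] = df['debt_ratio'].fillna(df['debt_ratio'].median())
df[['debt_ratio']] = RobustScaler().fit_transform(df[['debt_ratio']])
print(df)
```

For the HGB path the same encoding applies, but missing values would be filled with zeros instead of the median, per the list above.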
Base models:

- CatBoost (`CatBoostClassifier`):
  - Handles categorical features directly.
  - Parameters: `iterations=1000`, `depth=8`, `learning_rate=0.05`, `auto_class_weights='Balanced'`.
  - Optimized for AUC with early stopping.
- HistGradientBoosting (`HistGradientBoostingClassifier`):
  - Uses histogram-based gradient boosting for efficiency.
  - Parameters: `max_iter=1000`, `max_leaf_nodes=128`, `learning_rate=0.05`, `early_stopping=True`.
  - Categorical features are label-encoded, and missing values are imputed with zeros.
- XGBoost (`xgb.Booster`):
  - Optimized for large datasets with GPU support (if available).
  - Parameters: `max_depth=8`, `eta=0.05`, `scale_pos_weight` adjusted for class imbalance.
  - Uses an optimal threshold based on the maximum F1-score from the precision-recall curve.

Stacking and meta-model:

- Stacking: Predictions from CatBoost, HGB, and XGBoost are combined into a new dataset (`stacked_train.csv`) with columns `cat_pred`, `hgb_pred`, and `xgb_pred`.
- Meta-Model: A `LogisticRegression` model is trained on the stacked predictions to produce the final probability of default.
- Rationale: Stacking leverages the strengths of the individual models, improving overall performance by learning to weight their predictions optimally.
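The F1-optimal threshold mentioned for XGBoost can be recovered from the precision-recall curve like this (toy labels and scores; the pipeline applies the same idea to its own validation predictions):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# F1 for each candidate threshold (drop the final P/R point, which has no threshold)
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)
best_threshold = thresholds[np.argmax(f1)]
print(best_threshold)  # → 0.4
```

Choosing the threshold this way trades precision against recall explicitly, which matters for imbalanced default data where the default 0.5 cutoff is rarely optimal.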
Evaluation:

- Metrics:
  - Recall: Prioritized to minimize false negatives (missed defaulters).
  - AUC: Measures the model's ability to distinguish defaulters from non-defaulters.
  - Confusion Matrix: Provides insight into true positives, false positives, true negatives, and false negatives.
  - Cost-Based Evaluation: Assigns a higher cost to false negatives (`fn_cost=10`) than to false positives (`fp_cost=1`) to reflect the financial impact of missing defaulters.
- Stability Checks:
  - Cross-Validation: 5-fold stratified cross-validation to assess model stability.
  - Segment Analysis: Evaluates performance on different data segments to ensure robustness.
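The cost-based metric reduces to a weighted sum over the confusion matrix; a small self-contained sketch with toy predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fn_cost, fp_cost = 10, 1  # a missed defaulter costs 10x a false alarm
total_cost = fn * fn_cost + fp * fp_cost
recall = recall_score(y_true, y_pred)
print(total_cost, recall)
```

With this weighting, a model that trades a few extra false positives for fewer missed defaulters scores better, which matches the stated priority on recall.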
## Model Visualization

Model performance metrics (recall, AUC, and the confusion matrix) for CatBoost, HistGradientBoosting, XGBoost, and the meta-model are saved in `evaluation_results.json` files within the `Combined_Training_Output` directory after running `model_eval.py`. To visualize these metrics, extract the results and create plots with Matplotlib or Seaborn, or use the JSON files for custom visualizations.
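Assuming `evaluation_results.json` stores one `{recall, auc}` entry per model (check the actual schema produced by `model_eval.py`; the numbers below are placeholders, not real results), extracting the metrics for plotting might look like:

```python
import json

# Placeholder JSON standing in for an evaluation_results.json file;
# in practice you would use json.load(open(path)) on the real file.
raw = '''{
  "catboost": {"recall": 0.82, "auc": 0.91},
  "hgb":      {"recall": 0.79, "auc": 0.90},
  "xgboost":  {"recall": 0.81, "auc": 0.92},
  "meta":     {"recall": 0.85, "auc": 0.93}
}'''
results = json.loads(raw)

# Print a simple comparison table; these rows could feed a Matplotlib bar chart
for name, metrics in results.items():
    print(f"{name:8s} recall={metrics['recall']:.2f} auc={metrics['auc']:.2f}")
```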
## Key Code Snippets

This snippet generates synthetic data for the `age` and `income` features:

```python
value = np.random.randint(21, 70, num_samples)  # Ages 21-69 (upper bound is exclusive)
update_progress('age', value)
value = np.random.lognormal(mean=11.5, sigma=0.7, size=num_samples)  # Lognormal distribution for income
update_progress('income', value)
```

Training the CatBoost model with progress tracking:
```python
params = self.get_model_parameters()
self.model = CatBoostClassifier(**params)
self.model.fit(self.train_pool, eval_set=self.val_pool, plot=False, verbose=False,
               callbacks=[self._get_progress_callback(progress, task)])
```

Generating predictions for stacking:
```python
cat_model = CatBoostClassifier()
cat_model.load_model(os.path.join(cat_run_path, 'catboost_model.cbm'))
cat_pool = Pool(X_test_orig, cat_features=['income_category', 'education', ...])
predictions['cat_pred'] = cat_model.predict_proba(cat_pool)[:, 1]
```

Training the Logistic Regression meta-model:
```python
meta_model = LogisticRegression(random_state=42)
meta_model.fit(X_meta, y_meta)
joblib.dump(meta_model, meta_model_path)
```

Evaluating model performance with recall, AUC, and cost-based metrics:
```python
recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, model.predict_proba(X)[:, 1])
cm = confusion_matrix(y_true, y_pred)
total_cost = fn * 10 + fp * 1
```

## Dependencies

The project requires the following Python packages (listed in `requirements.txt`):
```
pandas
numpy
tqdm
catboost
scikit-learn
xgboost
joblib
tabulate
rich
psutil
```
Install them using:

```bash
pip install -r requirements.txt
```

## Usage

1. Generate Data:

   ```bash
   python data.py
   ```

   This creates `train_data.csv`, `validation_data.csv`, and `test_data.csv` in the specified directory.

2. Run the Full Pipeline:

   ```bash
   python the_pipeline.py
   ```

   This trains the CatBoost, HGB, and XGBoost models, generates stacked predictions, and trains the meta-model.

3. Evaluate Models:

   ```bash
   python model_eval.py
   ```

   This evaluates all base models and the meta-model, reporting metrics such as recall, AUC, and cost.

4. Output:
   - Model artifacts are saved in `Combined_Training_Output/`.
   - Stacked predictions are saved in `stacked_train.csv`.
   - Evaluation results are saved as JSON files in the respective model directories.
## Future Improvements

- Feature Selection: Implement feature importance analysis to reduce the number of features and improve model efficiency.
- Hyperparameter Tuning: Use grid search or Bayesian optimization to fine-tune model parameters.
- Additional Models: Incorporate other algorithms (e.g., LightGBM, Neural Networks) into the stacking ensemble.
- Real-Time Tracking: Develop a real-time system for monitoring defaulter behavior using streaming data.
- Alternative Data Expansion: Include more alternative data sources, such as social media sentiment or real-time transaction data.
- Explainability: Integrate SHAP or LIME to provide interpretable insights into model predictions.
## License

This project is licensed under the MIT License. See the LICENSE file for details.