🫀 CHD Prediction using Machine Learning

Predicting the 10-year risk of Coronary Heart Disease (CHD) using machine learning models trained on patient clinical data.

Built with: Python · scikit-learn · XGBoost


📌 Table of Contents

  • 🔬 Overview
  • 🫀 Clinical Context
  • 📂 Dataset
  • ⚙️ Workflow
  • 🤖 Models Used
  • 📊 Results
  • 💡 Key Learnings
  • 📂 Project Structure
  • 🚀 Installation & Usage

🔬 Overview

Coronary Heart Disease (CHD) is one of the leading causes of death worldwide. Early prediction of CHD risk can enable preventive interventions and better patient outcomes.

This project builds and compares 8 machine learning models to predict whether a patient is at risk of developing CHD within the next 10 years, based on clinical and lifestyle features. The dataset is imbalanced (far more negative cases than positive), making this a challenging real-world classification problem.


🫀 Clinical Context

CHD occurs when plaque builds up in coronary arteries, reducing blood flow to the heart. Key risk factors include:

| Category      | Risk Factors                                           |
|---------------|--------------------------------------------------------|
| Lifestyle     | Smoking habits, physical inactivity                    |
| Physiological | High blood pressure, high cholesterol, glucose levels  |
| Demographic   | Age, sex                                               |

👉 Identifying high-risk patients early through predictive modeling can support clinical decision-making and reduce mortality.


📂 Dataset

The dataset contains patient health information sourced from an online clinical dataset. Each record represents a patient with the following types of features:

| Feature Category      | Examples                                                          |
|-----------------------|-------------------------------------------------------------------|
| Demographic           | Age, sex                                                          |
| Behavioral            | Cigarettes per day, current smoker status                         |
| Medical history       | Prevalent stroke, prevalent hypertension, diabetes                |
| Clinical measurements | Total cholesterol, systolic/diastolic BP, BMI, heart rate, glucose |
| Target variable       | TenYearCHD — 10-year risk of coronary heart disease (0 or 1)      |

Class Distribution

The dataset is imbalanced: the majority of patients do not develop CHD within 10 years. This imbalance is a central challenge addressed throughout the pipeline.
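A quick way to quantify this imbalance is to inspect the target column's value counts. The sketch below uses a small synthetic stand-in for the real dataset (the roughly 85/15 split is an assumption for illustration; only the `TenYearCHD` column name comes from the table above):

```python
import pandas as pd

# Synthetic stand-in for the real data: mostly CHD-negative patients
df = pd.DataFrame({"TenYearCHD": [0] * 85 + [1] * 15})

# Relative class frequencies reveal the imbalance before any resampling
counts = df["TenYearCHD"].value_counts(normalize=True)
print(counts)  # 0 → 0.85, 1 → 0.15
```

On the real data, the same two lines applied to the loaded DataFrame show how skewed the classes are and motivate the resampling step in the workflow.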


⚙️ Workflow

```
Raw Patient Data
       │
       ▼
┌──────────────────────┐
│ 1. Data Exploration  │──→ Distribution analysis, correlations
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 2. Data Cleaning     │──→ Missing values, outlier treatment
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 3. Feature           │──→ Feature selection, transformations
│    Engineering       │    Encoding, scaling
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 4. Class Imbalance   │──→ Resampling techniques
│    Handling          │    (oversampling / undersampling)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 5. Model Training    │──→ 8 models trained & compared
│    & Evaluation      │    Cross-validation, metrics
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 6. Prediction        │──→ Best model on test set
│    on Test Data      │    Final performance evaluation
└──────────────────────┘
           │
           ▼
   ✅ CHD Risk Prediction
```

Step Details

  1. Data Exploration — Analyzed distributions, correlations, and class balance to understand the data landscape.
  2. Data Cleaning — Handled missing values and identified/treated outliers to ensure data quality.
  3. Feature Engineering — Selected relevant features, applied transformations, encoding, and scaling to optimize model input.
  4. Class Imbalance Handling — Applied resampling techniques to address the strong imbalance between CHD-positive and CHD-negative cases.
  5. Model Training & Evaluation — Trained 8 different models and compared them using balanced accuracy and other metrics.
  6. Prediction on Test Data — Applied the best-performing model (AdaBoost) to the held-out test set.
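Step 4 (class imbalance handling) can be sketched with simple random oversampling of the minority class. This is only one of several possible resampling techniques, shown here on synthetic arrays; the feature values and sample sizes are placeholders, not the project's actual data:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(42)
X = rng.randn(100, 3)                     # 100 patients, 3 placeholder features
y = np.array([0] * 85 + [1] * 15)         # imbalanced target, like TenYearCHD

# Duplicate minority (CHD-positive) rows until both classes are the same size
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=int((y == 0).sum()), random_state=42
)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])

print(np.bincount(y_bal))  # [85 85] — classes are now balanced
```

In practice, oversampling is applied only to the training split (never the test set) so that evaluation still reflects the real-world class distribution.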

🤖 Models Used

Eight models were trained and compared, ranging from simple baselines to ensemble methods:

| Model                  | Type               | Notes                               |
|------------------------|--------------------|-------------------------------------|
| Logistic Regression    | Baseline           | Simple linear classifier            |
| Decision Tree          | Tree-based         | Interpretable, prone to overfitting |
| K-Nearest Neighbors    | Instance-based     | Distance-based classification       |
| Random Forest          | Ensemble (bagging) | Multiple decision trees             |
| Support Vector Machine | Kernel-based       | Margin maximization                 |
| Neural Network (MLP)   | Deep learning      | Multi-layer perceptron              |
| XGBoost                | Ensemble (boosting)| Gradient boosted trees              |
| AdaBoost               | Ensemble (boosting)| Best model                          |
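The comparison loop behind this table can be sketched as follows. This is a minimal illustration using synthetic imbalanced data and only two of the eight models; the real project trains all eight on the clinical features, but the pattern (cross-validated balanced accuracy per model) is the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced dataset standing in for the patient features
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=0)

results = {}
for model in (AdaBoostClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    # Balanced accuracy is the comparison metric used throughout this project
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    results[type(model).__name__] = scores.mean()
    print(f"{type(model).__name__}: {scores.mean():.3f}")
```

Swapping in the remaining classifiers (KNN, SVM, MLP, XGBoost, and so on) extends the loop to the full eight-model comparison.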

📊 Results

| Metric            | Value    |
|-------------------|----------|
| Best Model        | AdaBoost |
| Balanced Accuracy | ~0.697   |

👉 AdaBoost outperformed all other models on this imbalanced classification task, achieving the best trade-off between sensitivity and specificity.

Why Balanced Accuracy?

Standard accuracy is misleading on imbalanced datasets — a model predicting "no CHD" for everyone would score >80%. Balanced accuracy accounts for performance on both classes equally, making it the appropriate metric for this problem.


💡 Key Learnings

  • Handling imbalanced datasets is critical in medical ML problems — without it, models ignore the minority (positive CHD) class
  • Simpler models can outperform complex ones if data is well processed — AdaBoost beat the neural network
  • Feature engineering has a strong impact on performance — proper selection and transformation of clinical features matters more than model complexity
  • AdaBoost performed best for this classification task, likely due to its ability to focus iteratively on misclassified samples

📂 Project Structure

```
chd-prediction/
│
├── notebooks/                  # Jupyter notebooks
│   └── Hackaton.ipynb          # Main analysis notebook
│
├── src/                        # Source code
│   ├── preprocessing.py        # Data cleaning & feature engineering
│   ├── models.py               # Model training & evaluation
│   └── utils.py                # Helper functions
│
├── data/                       # Dataset (online source)
│
├── requirements.txt            # Python dependencies
└── README.md                   # Project documentation
```

🚀 Installation & Usage

```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/chd-prediction.git
cd chd-prediction

# Install dependencies
pip install -r requirements.txt

# Run the notebook
jupyter notebook notebooks/Hackaton.ipynb
```
