# CHD Prediction

Predicting the 10-year risk of Coronary Heart Disease (CHD) using machine learning models trained on patient clinical data.
## Table of Contents

- Overview
- Clinical Context
- Dataset
- Workflow
- Models Used
- Results
- Key Learnings
- Project Structure
- Installation & Usage
## Overview

Coronary Heart Disease (CHD) is one of the leading causes of death worldwide. Early prediction of CHD risk can enable preventive interventions and better patient outcomes.
This project builds and compares 8 machine learning models to predict whether a patient is at risk of developing CHD within the next 10 years, based on clinical and lifestyle features. The dataset is imbalanced (far more negative cases than positive), making this a challenging real-world classification problem.
## Clinical Context

CHD occurs when plaque builds up in the coronary arteries, reducing blood flow to the heart. Key risk factors include:
| Category | Risk Factors |
|---|---|
| Lifestyle | Smoking habits, physical inactivity |
| Physiological | High blood pressure, high cholesterol, glucose levels |
| Demographic | Age, sex |
👉 Identifying high-risk patients early through predictive modeling can support clinical decision-making and reduce mortality.
## Dataset

The dataset contains patient health information sourced from an online clinical dataset. Each record represents a patient with the following types of features:
| Feature Category | Examples |
|---|---|
| Demographic | Age, sex |
| Behavioral | Cigarettes per day, current smoker status |
| Medical history | Prevalent stroke, prevalent hypertension, diabetes |
| Clinical measurements | Total cholesterol, systolic/diastolic BP, BMI, heart rate, glucose |
| Target variable | TenYearCHD — 10-year risk of coronary heart disease (0 or 1) |
The dataset is imbalanced: the majority of patients do not develop CHD within 10 years. This imbalance is a central challenge addressed throughout the pipeline.
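A quick way to quantify that imbalance before modeling is to look at the target's class fractions. A minimal stdlib sketch; the example target vector below is illustrative, not the actual dataset:

```python
from collections import Counter

def class_balance(y):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(y)
    total = len(y)
    return {label: n / total for label, n in counts.items()}

# Illustrative TenYearCHD vector with ~15% positives
y = [0] * 85 + [1] * 15
print(class_balance(y))  # {0: 0.85, 1: 0.15}
```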
## Workflow

```text
Raw Patient Data
        │
        ▼
┌──────────────────────┐
│ 1. Data Exploration  │──→ Distribution analysis, correlations
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 2. Data Cleaning     │──→ Missing values, outlier treatment
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 3. Feature           │──→ Feature selection, transformations
│    Engineering       │    Encoding, scaling
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 4. Class Imbalance   │──→ Resampling techniques
│    Handling          │    (oversampling / undersampling)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 5. Model Training    │──→ 8 models trained & compared
│    & Evaluation      │    Cross-validation, metrics
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 6. Prediction        │──→ Best model on test set
│    on Test Data      │    Final performance evaluation
└──────────────────────┘
           │
           ▼
  ✅ CHD Risk Prediction
```
1. Data Exploration — Analyzed distributions, correlations, and class balance to understand the data landscape.
2. Data Cleaning — Handled missing values and identified/treated outliers to ensure data quality.
3. Feature Engineering — Selected relevant features, applied transformations, encoding, and scaling to optimize model input.
4. Class Imbalance Handling — Applied resampling techniques to address the strong imbalance between CHD-positive and CHD-negative cases.
5. Model Training & Evaluation — Trained 8 different models and compared them using balanced accuracy and other metrics.
6. Prediction on Test Data — Applied the best-performing model (AdaBoost) to the held-out test set.
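The imbalance-handling step can be sketched with simple random oversampling of the minority class. This is a stdlib illustration of the idea; the project may instead rely on a library implementation such as those in imbalanced-learn:

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class samples until both
    classes contain the same number of examples."""
    rng = random.Random(seed)
    pos = [(x, t) for x, t in zip(X, y) if t == 1]
    neg = [(x, t) for x, t in zip(X, y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample (with replacement) enough minority examples to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = majority + minority + extra
    rng.shuffle(data)
    X_res = [x for x, _ in data]
    y_res = [t for _, t in data]
    return X_res, y_res
```

Oversampling is applied only to the training split, so the test set keeps its natural class distribution.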
## Models Used

Eight models were trained and compared, ranging from simple baselines to ensemble methods:
| Model | Type | Notes |
|---|---|---|
| Logistic Regression | Baseline | Simple linear classifier |
| Decision Tree | Tree-based | Interpretable, prone to overfitting |
| K-Nearest Neighbors | Instance-based | Distance-based classification |
| Random Forest | Ensemble (bagging) | Multiple decision trees |
| Support Vector Machine | Kernel-based | Margin maximization |
| Neural Network (MLP) | Deep learning | Multi-layer perceptron |
| XGBoost | Ensemble (boosting) | Gradient boosted trees |
| AdaBoost | Ensemble (boosting) | ⭐ Best model |
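A comparison loop along these lines could be written with scikit-learn. In this sketch the synthetic data, the subset of models, and the default hyperparameters are all illustrative assumptions, not the notebook's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data, with ~15% positive cases
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = balanced_accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Scoring every model with `balanced_accuracy_score` keeps the comparison fair on the imbalanced target.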
## Results

| Metric | Value |
|---|---|
| Best Model | AdaBoost |
| Balanced Accuracy | ~0.697 |
👉 AdaBoost outperformed all other models on this imbalanced classification task, achieving the best trade-off between sensitivity and specificity.
Standard accuracy is misleading on imbalanced datasets — a model predicting "no CHD" for everyone would score >80%. Balanced accuracy accounts for performance on both classes equally, making it the appropriate metric for this problem.
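To make this concrete, here is the comparison in plain Python for an always-negative predictor on an 85/15 split:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity
    (recall on negatives), so each class counts equally."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(t == p for t, p in pos) / len(pos)
    specificity = sum(t == p for t, p in neg) / len(neg)
    return (sensitivity + specificity) / 2

y_true = [0] * 85 + [1] * 15
y_pred = [0] * 100                        # always predicts "no CHD"
print(accuracy(y_true, y_pred))           # 0.85 -- looks deceptively good
print(balanced_accuracy(y_true, y_pred))  # 0.5  -- no better than chance
```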
## Key Learnings

- Handling imbalanced datasets is critical in medical ML problems — without it, models ignore the minority (positive CHD) class
- Simpler models can outperform complex ones if data is well processed — AdaBoost beat the neural network
- Feature engineering has a strong impact on performance — proper selection and transformation of clinical features matters more than model complexity
- AdaBoost performed best for this classification task, likely due to its ability to focus iteratively on misclassified samples
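The last point follows from AdaBoost's reweighting rule: after each round, misclassified samples receive larger weights, so subsequent weak learners concentrate on them. A minimal sketch of one binary (discrete AdaBoost) round, with an illustrative `correct` mask:

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step: compute the weak learner's
    weighted error, then up-weight the misclassified samples."""
    error = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - error) / error)   # learner's vote weight
    updated = [w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct)]
    z = sum(updated)                              # renormalize to sum to 1
    return [w / z for w in updated], alpha

# Four equally weighted samples; the last one is misclassified
weights, alpha = adaboost_round([0.25] * 4, [True, True, True, False])
print(weights)  # the misclassified sample now carries half the total weight
```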
## Project Structure

```text
chd-prediction/
│
├── notebooks/              # Jupyter notebooks
│   └── Hackaton.ipynb      # Main analysis notebook
│
├── src/                    # Source code
│   ├── preprocessing.py    # Data cleaning & feature engineering
│   ├── models.py           # Model training & evaluation
│   └── utils.py            # Helper functions
│
├── data/                   # Dataset (online source)
│
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```
## Installation & Usage

```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/chd-prediction.git
cd chd-prediction

# Install dependencies
pip install -r requirements.txt

# Run the notebook
jupyter notebook notebooks/Hackaton.ipynb
```