# CHD Prediction

Predicting the 10-year risk of Coronary Heart Disease (CHD) using machine learning models trained on patient clinical data.
## Table of Contents

- Overview
- Clinical Context
- Dataset
- Workflow
- Models Used
- Results
- Key Learnings
- Project Structure
- Installation & Usage
## Overview

Coronary Heart Disease (CHD) is one of the leading causes of death worldwide. Early prediction of CHD risk can enable preventive interventions and better patient outcomes.
This project builds and compares 8 machine learning models to predict whether a patient is at risk of developing CHD within the next 10 years, based on clinical and lifestyle features. The dataset is imbalanced (far more negative cases than positive), making this a challenging real-world classification problem.
## Clinical Context

CHD occurs when plaque builds up in the coronary arteries, reducing blood flow to the heart. Key risk factors include:
| Category | Risk Factors |
|---|---|
| Lifestyle | Smoking habits, physical inactivity |
| Physiological | High blood pressure, high cholesterol, glucose levels |
| Demographic | Age, sex |
👉 Identifying high-risk patients early through predictive modeling can support clinical decision-making and reduce mortality.
## Dataset

The dataset contains patient health information sourced from an online clinical dataset. Each record represents a patient with the following types of features:
| Feature Category | Examples |
|---|---|
| Demographic | Age, sex |
| Behavioral | Cigarettes per day, current smoker status |
| Medical history | Prevalent stroke, prevalent hypertension, diabetes |
| Clinical measurements | Total cholesterol, systolic/diastolic BP, BMI, heart rate, glucose |
| Target variable | TenYearCHD — 10-year risk of coronary heart disease (0 or 1) |
The dataset is imbalanced: the majority of patients do not develop CHD within 10 years. This imbalance is a central challenge addressed throughout the pipeline.
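A quick way to quantify that imbalance before modeling is to look at the target's class fractions. A minimal stdlib sketch; the example target vector below is illustrative, not the actual dataset:

```python
from collections import Counter

def class_balance(y):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(y)
    total = len(y)
    return {label: n / total for label, n in counts.items()}

# Illustrative TenYearCHD vector with ~15% positives
y = [0] * 85 + [1] * 15
print(class_balance(y))  # {0: 0.85, 1: 0.15}
```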
## Workflow

```text
Raw Patient Data
        │
        ▼
┌──────────────────────┐
│ 1. Data Exploration  │──→ Distribution analysis, correlations
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 2. Data Cleaning     │──→ Missing values, outlier treatment
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 3. Feature           │──→ Feature selection, transformations
│    Engineering       │    Encoding, scaling
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 4. Class Imbalance   │──→ Resampling techniques
│    Handling          │    (oversampling / undersampling)
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 5. Model Training    │──→ 8 models trained & compared
│    & Evaluation      │    Cross-validation, metrics
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│ 6. Prediction        │──→ Best model on test set
│    on Test Data      │    Final performance evaluation
└──────────────────────┘
           │
           ▼
  ✅ CHD Risk Prediction
```
1. Data Exploration — Analyzed distributions, correlations, and class balance to understand the data landscape.
2. Data Cleaning — Handled missing values and identified/treated outliers to ensure data quality.
3. Feature Engineering — Selected relevant features, applied transformations, encoding, and scaling to optimize model input.
4. Class Imbalance Handling — Applied resampling techniques to address the strong imbalance between CHD-positive and CHD-negative cases.
5. Model Training & Evaluation — Trained 8 different models and compared them using balanced accuracy and other metrics.
6. Prediction on Test Data — Applied the best-performing model (AdaBoost) to the held-out test set.
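The imbalance-handling step can be sketched with simple random oversampling of the minority class. This is a stdlib illustration of the idea; the project may instead rely on a library implementation such as those in imbalanced-learn:

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class samples until both
    classes contain the same number of examples."""
    rng = random.Random(seed)
    pos = [(x, t) for x, t in zip(X, y) if t == 1]
    neg = [(x, t) for x, t in zip(X, y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample (with replacement) enough minority examples to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = majority + minority + extra
    rng.shuffle(data)
    X_res = [x for x, _ in data]
    y_res = [t for _, t in data]
    return X_res, y_res
```

Oversampling is applied only to the training split, so the test set keeps its natural class distribution.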
## Models Used

Eight models were trained and compared, ranging from simple baselines to ensemble methods:
| Model | Type | Notes |
|---|---|---|
| Logistic Regression | Baseline | Simple linear classifier |
| Decision Tree | Tree-based | Interpretable, prone to overfitting |
| K-Nearest Neighbors | Instance-based | Distance-based classification |
| Random Forest | Ensemble (bagging) | Multiple decision trees |
| Support Vector Machine | Kernel-based | Margin maximization |
| Neural Network (MLP) | Deep learning | Multi-layer perceptron |
| XGBoost | Ensemble (boosting) | Gradient boosted trees |
| AdaBoost | Ensemble (boosting) | ⭐ Best model |
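A comparison loop along these lines could be written with scikit-learn. In this sketch the synthetic data, the subset of models, and the default hyperparameters are all illustrative assumptions, not the notebook's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data, with ~15% positive cases
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = balanced_accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Scoring every model with `balanced_accuracy_score` keeps the comparison fair on the imbalanced target.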
## Results

| Metric | Value |
|---|---|
| Best Model | AdaBoost |
| Balanced Accuracy | ~0.697 |
👉 AdaBoost outperformed all other models on this imbalanced classification task, achieving the best trade-off between sensitivity and specificity.
Standard accuracy is misleading on imbalanced datasets — a model predicting "no CHD" for everyone would score >80%. Balanced accuracy accounts for performance on both classes equally, making it the appropriate metric for this problem.
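To make this concrete, here is the comparison in plain Python for an always-negative predictor on an 85/15 split:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity
    (recall on negatives), so each class counts equally."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(t == p for t, p in pos) / len(pos)
    specificity = sum(t == p for t, p in neg) / len(neg)
    return (sensitivity + specificity) / 2

y_true = [0] * 85 + [1] * 15
y_pred = [0] * 100                        # always predicts "no CHD"
print(accuracy(y_true, y_pred))           # 0.85 -- looks deceptively good
print(balanced_accuracy(y_true, y_pred))  # 0.5  -- no better than chance
```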
## Key Learnings

- Handling imbalanced datasets is critical in medical ML problems — without it, models ignore the minority (positive CHD) class
- Simpler models can outperform complex ones if data is well processed — AdaBoost beat the neural network
- Feature engineering has a strong impact on performance — proper selection and transformation of clinical features matters more than model complexity
- AdaBoost performed best for this classification task, likely due to its ability to focus iteratively on misclassified samples
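The last point follows from AdaBoost's reweighting rule: after each round, misclassified samples receive larger weights, so subsequent weak learners concentrate on them. A minimal sketch of one binary (discrete AdaBoost) round, with an illustrative `correct` mask:

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step: compute the weak learner's
    weighted error, then up-weight the misclassified samples."""
    error = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - error) / error)   # learner's vote weight
    updated = [w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct)]
    z = sum(updated)                              # renormalize to sum to 1
    return [w / z for w in updated], alpha

# Four equally weighted samples; the last one is misclassified
weights, alpha = adaboost_round([0.25] * 4, [True, True, True, False])
print(weights)  # the misclassified sample now carries half the total weight
```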
## Project Structure

```text
chd-prediction/
│
├── notebooks/              # Jupyter notebooks
│   └── Hackaton.ipynb      # Main analysis notebook
│
├── src/                    # Source code
│   ├── preprocessing.py    # Data cleaning & feature engineering
│   ├── models.py           # Model training & evaluation
│   └── utils.py            # Helper functions
│
├── data/                   # Dataset (online source)
│
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```
## Installation & Usage

```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/chd-prediction.git
cd chd-prediction

# Install dependencies
pip install -r requirements.txt

# Run the notebook
jupyter notebook notebooks/Hackaton.ipynb
```