Module: Business Analytics — Group Project
Institution: Trinity College Dublin
Deadline: 16 April 2025
Dataset: E-Commerce Customer Insights and Churn Dataset (Kaggle, nabihazahid)
1 in 4 customers on this e-commerce platform stops purchasing. This project builds an interpretable machine learning pipeline to identify which customers are most at risk of churning, quantify the revenue impact, and recommend targeted retention interventions with measurable ROI.
Primary metric: ROC-AUC on held-out test set
Secondary metric: Top three churn-driving features identified and explained
Business impact metric: Estimated revenue saved by targeting the top 20% highest-risk customers
├── data/
│ └── E_Commerce_Customer_Insights_and_Churn_Dataset.csv # Dataset subset
├── notebooks/
│ ├── ML1_EDA_LogisticRegression_v2.py # EDA + Logistic Regression baseline
│ ├── ML2_churn_explainable_pipeline.ipynb # Random Forest + XGBoost models
├── charts/
│ ├── chart1_churn_distribution.png
│ ├── chart2_churn_by_country.png
│ ├── chart3_churn_by_category.png
│ ├── chart4_feature_distributions.png
│ ├── chart5_correlation_heatmap.png
│ ├── chart6_confusion_matrix_lr.png
│ ├── chart7_roc_curve_lr.png
│ └── chart8_feature_importance_lr.png
├── presentation/
│ └── Group_Churn_Presentation.pptx # Final 10-slide deck
├── LLM PRompts/
│ └── LLM_Prompts_Log.md # All prompts, inputs, and outputs
└── README.md
pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn xgboost shapNote: SHAP 0.51.0 is required for XGBoost 3.x compatibility.
Runpip install shap==0.51.0if you encounter version errors.
-
Clone the repository:
git clone https://github.com/<(https://github.com/sundusafreen/Analytics_in_Practice_Group_7)> cd <repo-name>
-
Place the dataset in the
data/folder (or update the file path in the notebook). -
Run the EDA and Logistic Regression baseline:
python notebooks/ML1_EDA_LogisticRegression_v2.py
-
Run the Random Forest and XGBoost models:
python notebooks/ML2_churn_explainable_pipeline.py
All charts will be saved automatically to the charts/ folder.
Explore the full interactive churn analysis dashboard built by Bin Yu and Ummea Salma Hossain:
Link: https://public.tableau.com/app/profile/bin.yu4679/viz/4_17760011003540/ChurnIntelligenceDashboard
- Overall churn rate: 24.6% (493 out of 2,000 customers)
- Strongest churn drivers: Country (India 29.3%, Pakistan 28.0%) and preferred product category (Home 28.6%, Electronics 25.4%)
- Loyal customer signal: Customers with higher purchase quantity and higher unit price are less likely to churn
- Model performance: All three models (Logistic Regression, Random Forest, XGBoost) converge to a ROC-AUC of ~0.53, consistent with near-zero feature-churn correlations in this synthetic dataset
- Best model: XGBoost (ROC-AUC 0.5382); recommended for per-customer SHAP waterfall explanations
| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression (Baseline) | 0.38 | 0.25 | 0.78 | 0.38 | 0.536 |
| Random Forest | 0.7050 | 0.3208 | 0.1717 | 0.2237 | 0.5303 |
| XGBoost | 0.6950 | 0.3380 | 0.2424 | 0.2824 | 0.5382 |
The Logistic Regression baseline was configured with
class_weight={0:1, 1:3}to prioritise recall — catching true churners matters more than minimising false alarms in a retention use case.
| Name | Role |
|---|---|
| Duygu Barutcu | Data Engineer |
| Lin Htet Aung | Data Engineer |
| Weiyi Yan | Modeler / Analyst |
| Cesc Gomez | Modeler / Analyst |
| Lavanya Batra | LLM / Prompt Specialist |
| Konstantinos Fragkos | LLM / Prompt Specialist |
| Bin Yu | Visualization Expert |
| Ummea Salma Hossain | Visualization Expert |
| Sundus Afreen | Project Manager / Storyteller |
| Naija Budhiraja | Project Manager / Storyteller |
All prompts, model inputs/outputs, and critical reflections are logged in notebooks/LLM_Prompts_Log.md.
Three core prompt categories were used:
- Prompt 1 — Translating technical churn drivers into plain language for a non-technical CEO audience
- Prompt 2 — Generating a reproducible Python preprocessing pipeline for the churn dataset
- Prompt 3 — Converting model feature importance scores into concrete, business-facing action steps
GenAI Acknowledgement: Generative AI tools (including Claude by Anthropic) were used in this project for code generation, insight synthesis, and document drafting. All AI-generated content was reviewed and edited by team members. AI-generated citations were manually verified. Use is acknowledged at slide level in the presentation, in accordance with TCD GenAI referencing guidelines.
All references were manually verified by team members. AI-generated citations were checked for accuracy before inclusion, per TCD guidelines.
- [1] Reichheld, F. F., & Schefter, P. (2000). E-loyalty: Your secret weapon on the web. Harvard Business Review, 78(4), 105–113.
- [2] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), San Francisco, CA, pp. 785–794.
- [3] Alshamsi, A. et al. (2024). E-commerce churn prediction for analyzing customer behaviour based on machine learning. In Lecture Notes in Business Information Processing. Springer.
- [4] Hu, W. et al. (2019). A hybrid prediction model for e-commerce customer churn based on logistic regression and extreme gradient boosting algorithm. ISI Journal, 24(5).
- [5] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems (NeurIPS), 30, 4766–4777.
- [6] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
This project follows TCD referencing guidelines: https://libguides.tcd.ie/c.php?g=667784&p=4736271
GenAI referencing guidelines: https://libguides.tcd.ie/genai/acknowledging-referencing