This repository provides a single R script for analysing NHS Emergency Department (ED) data to examine crude mortality across organisations.
The script is saved as `NHSD AE dataWrangleML pipe.R` but is documented internally as `ed_mortality_analysis.R`. It performs extensive data cleaning, modelling and visualisation using the tidymodels ecosystem.
The following notes provide a structured, scientific-style summary of `ed_mortality_analysis.R`, along with suggestions for validating intermediate and final results.
This script performs a comprehensive analysis of emergency department (ED) mortality across NHS trusts for the 2023/24 year. Its goals are to:
- Import and clean two data sources
- Workforce staffing levels (Excel)
- ED performance (ECDS CSV)
- Merge these to compute a crude mortality rate per 100 000 attendances
- Exclude outliers and very low-rate trusts
- Define peer groups for benchmarking
- Conduct exploratory plots (bar charts, scatterplots)
- Apply principal component analysis (PCA) and an ordinary least squares (OLS) model on PCs to explore drivers of raw death counts
- Build and tune a random forest regression to predict death counts
- Use variable-importance (VIP) and LIME explanations for model interpretability
- Produce predicted versus observed death plots, funnel plots, and interactive dashboards
This pipeline combines tidy-data principles (Wickham et al., 2019) with modern machine-learning workflows (Kuhn & Johnson, 2020).
```r
suppressPackageStartupMessages({
  library(readr); library(dplyr); library(tidyr); library(janitor)
  library(ggplot2); library(plotly)
  library(rsample); library(recipes); library(parsnip)
  library(workflows); library(tune); library(yardstick)
  library(ranger); library(lime); library(vip)
  library(tidymodels); library(dials); library(readxl); library(rlang)
})
```

- Loads data-wrangling (tidyverse) and modelling (tidymodels) packages.
- Suppresses startup messages for readability.
- Reference: Wickham & Grolemund (2019), *R for Data Science*.
- Download the NHS workforce Excel file via `download.file()`.
- Inspect with `excel_sheets()`, `head()`, `str()`, and row/column counts.
- Read four sheets (`Table_1a`, `1b`, `1c`, `2`), skipping header rows.
- Rename key columns (e.g. `Doctors_FTE`, `Stability_Index`).
- Filter out summary rows; convert the stability index to numeric.
- Merge into one "wide" table, `dfW_final`.

Validation:

- Check `file.exists(destfileW)` and dimensions.
- After cleaning, `stopifnot(ncol(dfW_final) == 5)` and no NAs in code columns.
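The summary-row filter and numeric coercion can be illustrated on a toy sheet. This is a sketch only: the `"England"` summary row and `"*"` suppression marker are assumptions about the workbook, not confirmed details of the script.

```r
# Toy stand-in for one workforce sheet read with readxl::read_excel();
# the "England" row and "*" marker are hypothetical examples.
sheet <- data.frame(
  org_code        = c("RXW", "RJD", NA, "England"),
  Doctors_FTE     = c(412.5, 388.0, NA, 51234.0),
  Stability_Index = c("0.87", "0.91", "", "*"),
  stringsAsFactors = FALSE
)

# Drop summary/blank rows, then coerce the stability index to numeric
# (non-numeric markers such as "*" become NA).
cleaned <- subset(sheet, !is.na(org_code) & org_code != "England")
cleaned$Stability_Index <- suppressWarnings(as.numeric(cleaned$Stability_Index))
```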
- Download the ECDS CSV and read it via `read_csv()`, treating `""`, `"NA"`, and `"*"` as missing.
- Clean names (`clean_names()`) and drop rows missing `org_description`, `measure`, or `measure_type`.
- Aggregate any duplicates by mean (`group_by()` + `summarise()`).
- Create a composite header `measure_full = "type — measure"` and `pivot_wider()`.
- Compute crude mortality per 100 000:

$$ \text{mortality}_{100k} = 10^5 \times \frac{\text{Died}}{\text{Total Attendances}} $$

- Filter out infinite/non-finite rates.

Validation:

- Print a summary: `cat(sprintf("%d rows, %d missing values\n", nrow(df), sum(is.na(df$measure_value))))`
- Ensure `all(df_wide$crude_mortality_per_100k > 0)`.
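The mortality calculation and finite-rate filter can be sketched in base R on toy counts (column names are simplified here, not the script's composite headers):

```r
# Toy wide table standing in for the output of pivot_wider()
df_wide <- data.frame(
  org_description = c("TRUST A", "TRUST B", "TRUST C"),
  died            = c(120, 80, 0),
  attendances     = c(150000, 90000, 0)
)

# Crude mortality per 100 000 attendances
df_wide$crude_mortality_per_100k <- 1e5 * df_wide$died / df_wide$attendances

# 0/0 yields NaN (and x/0 yields Inf), so keep only finite rates
df_wide <- df_wide[is.finite(df_wide$crude_mortality_per_100k), ]
```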
- Normalise organisation names (uppercase; strip "THE" and punctuation).
- Join on `org_match` and confirm the match rate: `sum(!is.na(df_merged$Doctors_FTE)) / nrow(df_merged)`
- Clean up duplicate columns and restore `org_description`.

Validation:

- Expect ≥ 90% match; investigate any unjoined trusts manually.
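One possible implementation of the normalisation helper, as a sketch; the script's exact regex rules may differ slightly:

```r
# Uppercase, strip the word "THE", drop punctuation, collapse whitespace
normalise_org <- function(x) {
  x <- toupper(x)
  x <- gsub("\\bTHE\\b", "", x)       # strip the word "THE"
  x <- gsub("[[:punct:]]", " ", x)    # replace punctuation with spaces
  trimws(gsub("\\s+", " ", x))        # collapse repeated whitespace
}

normalise_org("The Royal Wolverhampton NHS Trust")

# After joining both tables on org_match, the match rate is simply:
# sum(!is.na(df_merged$Doctors_FTE)) / nrow(df_merged)
```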
- Compute IQR, define fences at 1.5 × IQR.
- Flag & remove outliers, drop trusts with mortality < 10 per 100 000.
- Print pre-/post-filter statistics.
Validation:
- Compare `nrow(df_merged)` vs. `nrow(df_no_outliers)` and ensure removals match the printed counts.
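The fence logic on toy rates, using `quantile()`'s default (type 7) interpolation; whether the script's fences are inclusive or exclusive is a detail to confirm against the code:

```r
df <- data.frame(crude_mortality_per_100k = c(30, 45, 50, 55, 60, 300))

# Tukey fences at 1.5 x IQR, plus the < 10 per 100 000 floor
q     <- quantile(df$crude_mortality_per_100k, c(0.25, 0.75))
iqr   <- unname(q[2] - q[1])
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr

df_no_outliers <- subset(
  df,
  crude_mortality_per_100k >= lower &
  crude_mortality_per_100k <= upper &
  crude_mortality_per_100k >= 10
)

cat(sprintf("Removed %d of %d trusts\n",
            nrow(df) - nrow(df_no_outliers), nrow(df)))
```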
- Define two peer-group lists (`CHKS`, `NHSE`) plus a special "SaTH" catch for Shrewsbury & Telford; all others → `Other`.
- Add a factor column `peer_group`.

Validation:

- `table(df_no_outliers$peer_group)` yields counts for each group.
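A base-R sketch of the peer-group assignment; the peer lists and the SaTH match pattern below are placeholders, not the script's actual membership:

```r
# Nested ifelse() assignment; dplyr::case_when() would be equivalent
assign_peer_group <- function(org, chks, nhse) {
  ifelse(grepl("SHREWSBURY AND TELFORD", org), "SaTH",
  ifelse(org %in% chks, "CHKS",
  ifelse(org %in% nhse, "NHSE", "Other")))
}

orgs <- c("SHREWSBURY AND TELFORD HOSPITAL NHS TRUST",
          "TRUST A", "TRUST B", "TRUST C")
chks_peers <- c("TRUST A")   # placeholder peer list
nhse_peers <- c("TRUST B")   # placeholder peer list

peer_group <- factor(assign_peer_group(orgs, chks_peers, nhse_peers))
table(peer_group)            # counts per group, as in the validation check
```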
- Bar chart of crude mortality ± 95% CI by trust (coloured by peer group).
- Facetted bar chart by peer group.
- Top/Bottom 10 trusts by rate (error bars).
- Scatter: mortality vs. `Doctors_FTE` (with linear fit).
- Scatter: mortality vs. `Consultants_FTE`, highlighting SaTH.
Validation:
- Visually inspect that plots run without errors and axes are sensible.
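A minimal version of the peer-group-coloured bar chart on toy data; the script additionally adds 95% CI error bars, which are omitted here:

```r
library(ggplot2)

# Toy stand-in for df_no_outliers
df_no_outliers <- data.frame(
  org_description = paste("TRUST", LETTERS[1:5]),
  crude_mortality_per_100k = c(35, 42, 55, 61, 78),
  peer_group = factor(c("CHKS", "NHSE", "Other", "SaTH", "Other"))
)

# Trusts ordered by rate, coloured by peer group
p <- ggplot(df_no_outliers,
            aes(x = reorder(org_description, crude_mortality_per_100k),
                y = crude_mortality_per_100k, fill = peer_group)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Crude mortality per 100,000 attendances",
       fill = "Peer group")
```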
- Filter to non-SaTH trusts and select numeric predictors (dropping zero-variance columns).
- Build a recipe: median-impute → normalize → PCA (10 components).
- Prep and bake the recipe to extract the PCs.
- Fit `lm(deaths ~ PC1 + … + PC10)`.
- Summarise and plot diagnostic panels (residuals, QQ).
- Compute absolute t-values for each PC → “component importance.”
- Extract and display top-contributing variables for selected PCs.
- Plot cumulative variance explained.
References:
- Jolliffe & Cadima (2016), tutorial on PCA.
- James et al. (2013), Introduction to Statistical Learning.
Validation:
- Check that cumulative variance curve reaches ≥ 80% by PC10.
- Ensure residual plots show no gross violations of assumptions.
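A base-R sketch of the PCA-then-OLS step on simulated data. The script uses a recipes pipeline with 10 components; here `prcomp()` with centring and scaling stands in for the normalize + PCA steps, and six toy predictors stand in for the larger predictor set:

```r
set.seed(2025)
X <- as.data.frame(matrix(rnorm(50 * 6), nrow = 50))
names(X) <- paste0("x", 1:6)
deaths <- rpois(50, lambda = 100)

# Median-impute (a no-op on this toy data), then centred/scaled PCA
X[] <- lapply(X, function(v) { v[is.na(v)] <- median(v, na.rm = TRUE); v })
pca <- prcomp(X, center = TRUE, scale. = TRUE)
pcs <- as.data.frame(pca$x)

# OLS on the principal components; |t| per PC as "component importance"
fit <- lm(deaths ~ ., data = pcs)
comp_importance <- abs(coef(summary(fit))[-1, "t value"])

# Cumulative variance explained, for the scree/validation check
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
```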
- Split data 80/20 (stratified by death count).
- Recipe: remove zero/near-zero variance, median-impute, normalize.
- Specify `rand_forest(mtry, trees = 500, min_n)` using the `ranger` engine.
- Build a workflow with 5-fold cross-validation.
- Grid-tune `mtry` and `min_n` via a Latin hypercube (10 points).
- Select the best by RMSE, finalize the workflow, and fit on the full training set.
- Evaluate on test set: RMSE, R².
Reference:
- Liaw & Wiener (2002), `randomForest` methodology.
Validation:
- Compare train vs. test RMSE to check for overfitting.
- Confirm `best_params` are sensible (e.g. `mtry` around √p).
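A hedged sketch of the tuning workflow on simulated data. The column names (`doctors`, `nurses`, `attendances`) are placeholders; the split, fold, and grid settings follow the description above:

```r
library(tidymodels)

set.seed(2025)
# Simulated stand-in for the merged trust-level data
df_model <- tibble(
  deaths      = rpois(120, 100),
  doctors     = rnorm(120, 400, 50),
  nurses      = rnorm(120, 900, 100),
  attendances = rnorm(120, 1.2e5, 2e4)
)

split <- initial_split(df_model, prop = 0.8, strata = deaths)

rec <- recipe(deaths ~ ., data = training(split)) %>%
  step_zv(all_predictors()) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

wf    <- workflow() %>% add_recipe(rec) %>% add_model(spec)
folds <- vfold_cv(training(split), v = 5)
grid  <- grid_latin_hypercube(
  finalize(mtry(), select(training(split), -deaths)),
  min_n(),
  size = 10
)

tune_res    <- tune_grid(wf, resamples = folds, grid = grid)
best_params <- select_best(tune_res, metric = "rmse")
final_res   <- last_fit(finalize_workflow(wf, best_params), split)
collect_metrics(final_res)   # test-set RMSE and R-squared
```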
- VIP: extract impurity-based importance and plot top features.
- LIME: generate local explanations for the first three cases and plot feature contributions.
Reference:
- Molnar (2020), Interpretable Machine Learning.
Validation:
- Ensure VIP plot runs and LIME explanations show non-zero contributions.
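The importance step can be reproduced with `ranger` and `vip` directly on simulated data; the LIME step would follow analogously with `lime::lime()` on the training data and `lime::explain()` on the first three rows:

```r
library(ranger)
library(vip)

set.seed(2025)
# Simulated stand-in for the trust-level training data
dat <- data.frame(
  deaths  = rpois(100, 100),
  doctors = rnorm(100),
  nurses  = rnorm(100),
  waits   = rnorm(100)
)

# Impurity-based importance, as the script requests from ranger
rf <- ranger(deaths ~ ., data = dat, num.trees = 500, importance = "impurity")
sort(rf$variable.importance, decreasing = TRUE)

# Bar chart of the top features
p <- vip(rf, num_features = 3)
```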
- Predict death counts for all trusts and combine with observed values.
- Plot observed vs. predicted with peer-group colours and a 1:1 line.
- Compute the O/E ratio and its standard error under Poisson assumptions:

$$ \text{SE}(\mathrm{O/E}) = \frac{1}{\sqrt{\text{expected}}} $$

- Funnel plot: O/E vs. expected with ± 1.96 SE limits.
- Interactive Plotly versions of the funnel plot and a > 12-hour-wait vs. O/E scatter.
Validation:
- Confirm that most trusts lie within the funnel limits as expected under Poisson variation.
- Hover tooltips should display correct labels.
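The O/E and control-limit arithmetic can be checked on toy counts, using the Poisson standard-error formula above:

```r
# Base-R sketch of the funnel-plot calculation on toy counts
df <- data.frame(
  trust    = c("A", "B", "C"),
  observed = c(120, 95, 140),
  expected = c(110, 100, 100)
)

df$oe    <- df$observed / df$expected
df$se    <- 1 / sqrt(df$expected)   # Poisson SE of O/E
df$upper <- 1 + 1.96 * df$se        # funnel control limits around 1
df$lower <- 1 - 1.96 * df$se
df$outside <- df$oe > df$upper | df$oe < df$lower
```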
- Set a seed at the start (e.g. `set.seed(2025)`) to fix cross-validation splits.
- Dimension checks after each major join/transform:

```r
stopifnot(nrow(df_wide) == length(unique(df$org_description)))
stopifnot(ncol(dfW_final) == 5)
```

- Missingness reports for each key column: `summarise_all(df_merged, ~ sum(is.na(.)))`
- Summary statistics: `summary(df_merged$crude_mortality_per_100k)`
- Cross-validation diagnostics: examine the `tune_res` object for any failed folds.
- Model performance: ensure test-set RMSE is within 10–20% of training RMSE.
- Visual inspections: all ggplot and plotly commands run without error.
By following this document and running the suggested checks, you can both understand the purpose of each section and verify that `ed_mortality_analysis.R` produces correct, reproducible results.
- R 4.0 or higher
- R packages: `readr`, `dplyr`, `tidyr`, `janitor`, `ggplot2`, `plotly`, `rsample`, `recipes`, `parsnip`, `workflows`, `tune`, `yardstick`, `ranger`, `lime`, `vip`, and the `tidymodels` suite.

Install missing packages with:

```r
install.packages(c("tidymodels", "plotly", "janitor", "ranger", "lime", "vip"))
```

- Edit the `file_path` variable near the top of `NHSD AE dataWrangleML pipe.R` so that it points to your local CSV file.
- Run the script within R or from the command line:

```sh
Rscript "NHSD AE dataWrangleML pipe.R"
```

The script will print summaries and open plots in your R session.
Distributed under the terms of the GNU General Public License v3.