This repository provides a single R script for analysing NHS Emergency Department (ED) data to examine crude mortality across organisations.
The script is saved as `NHSD AE dataWrangleML pipe.R` but is documented internally as `ed_mortality_analysis.R`. It performs extensive data cleaning, modelling and visualisation using the tidymodels ecosystem.
The following notes provide a structured, scientific-style summary of `ed_mortality_analysis.R`, along with suggestions for validating intermediate and final results.
This script performs a comprehensive analysis of emergency department (ED) mortality across NHS trusts for the 2023/24 year. Its goals are to:
- Import and clean two data sources
- Workforce staffing levels (Excel)
- ED performance (ECDS CSV)
- Merge these to compute a crude mortality rate per 100 000 attendances
- Exclude outliers and very low-rate trusts
- Define peer groups for benchmarking
- Conduct exploratory plots (bar charts, scatterplots)
- Apply principal component analysis (PCA) and an ordinary least squares (OLS) model on PCs to explore drivers of raw death counts
- Build and tune a random forest regression to predict death counts
- Use variable-importance (VIP) and LIME explanations for model interpretability
- Produce predicted versus observed death plots, funnel plots, and interactive dashboards
This pipeline combines tidy-data principles (Wickham et al., 2019) with modern machine-learning workflows (Kuhn & Johnson, 2020).
```r
suppressPackageStartupMessages({
  library(readr); library(dplyr); library(tidyr); library(janitor)
  library(ggplot2); library(plotly)
  library(rsample); library(recipes); library(parsnip)
  library(workflows); library(tune); library(yardstick)
  library(ranger); library(lime); library(vip)
  library(tidymodels); library(dials); library(readxl); library(rlang)
})
```

- Loads data-wrangling (tidyverse) and modelling (tidymodels) packages.
- Suppresses startup messages for readability.
- Reference: Wickham & Grolemund (2019), *R for Data Science*.
- Download the NHS workforce Excel file via `download.file()`.
- Inspect with `excel_sheets()`, `head()`, `str()`, and row/column counts.
- Read four sheets (`Table_1a`, `1b`, `1c`, `2`), skipping header rows.
- Rename key columns (e.g. `Doctors_FTE`, `Stability_Index`).
- Filter out summary rows; convert the stability index to numeric.
- Merge into one "wide" table, `dfW_final`.

Validation:

- Check `file.exists(destfileW)` and dimensions.
- After cleaning, `stopifnot(ncol(dfW_final) == 5)` and no NAs in code columns.
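The summary-row filter and numeric coercion can be illustrated on a toy sheet. This is a sketch only: the `"England"` summary row and `"*"` suppression marker are assumptions about the workbook, not confirmed details of the script.

```r
# Toy stand-in for one workforce sheet read with readxl::read_excel();
# the "England" row and "*" marker are hypothetical examples.
sheet <- data.frame(
  org_code        = c("RXW", "RJD", NA, "England"),
  Doctors_FTE     = c(412.5, 388.0, NA, 51234.0),
  Stability_Index = c("0.87", "0.91", "", "*"),
  stringsAsFactors = FALSE
)

# Drop summary/blank rows, then coerce the stability index to numeric
# (non-numeric markers such as "*" become NA).
cleaned <- subset(sheet, !is.na(org_code) & org_code != "England")
cleaned$Stability_Index <- suppressWarnings(as.numeric(cleaned$Stability_Index))
```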
- Download the ECDS CSV and read it via `read_csv()`, treating `""`, `"NA"`, and `"*"` as missing.
- Clean names (`clean_names()`) and drop rows missing `org_description`, `measure`, or `measure_type`.
- Aggregate any duplicates by mean (`group_by()` + `summarise()`).
- Create a composite header `measure_full = "type — measure"` and `pivot_wider()`.
- Compute crude mortality per 100 000:

$$ \text{mortality}_{100k} = 10^5 \times \frac{\text{Died}}{\text{Total Attendances}} $$

- Filter out infinite/non-finite rates.

Validation:

- Print a summary: `cat(sprintf("%d rows, %d missing values\n", nrow(df), sum(is.na(df$measure_value))))`
- Ensure `all(df_wide$crude_mortality_per_100k > 0)`.
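The mortality calculation and finite-rate filter can be sketched in base R on toy counts (column names are simplified here, not the script's composite headers):

```r
# Toy wide table standing in for the output of pivot_wider()
df_wide <- data.frame(
  org_description = c("TRUST A", "TRUST B", "TRUST C"),
  died            = c(120, 80, 0),
  attendances     = c(150000, 90000, 0)
)

# Crude mortality per 100 000 attendances
df_wide$crude_mortality_per_100k <- 1e5 * df_wide$died / df_wide$attendances

# 0/0 yields NaN (and x/0 yields Inf), so keep only finite rates
df_wide <- df_wide[is.finite(df_wide$crude_mortality_per_100k), ]
```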
- Normalise organisation names (uppercase; strip "THE" and punctuation).
- Join on `org_match` and confirm the match rate: `sum(!is.na(df_merged$Doctors_FTE)) / nrow(df_merged)`
- Clean up duplicate columns and restore `org_description`.

Validation:

- Expect ≥ 90% match; investigate any unjoined trusts manually.
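One possible implementation of the normalisation helper, as a sketch; the script's exact regex rules may differ slightly:

```r
# Uppercase, strip the word "THE", drop punctuation, collapse whitespace
normalise_org <- function(x) {
  x <- toupper(x)
  x <- gsub("\\bTHE\\b", "", x)       # strip the word "THE"
  x <- gsub("[[:punct:]]", " ", x)    # replace punctuation with spaces
  trimws(gsub("\\s+", " ", x))        # collapse repeated whitespace
}

normalise_org("The Royal Wolverhampton NHS Trust")

# After joining both tables on org_match, the match rate is simply:
# sum(!is.na(df_merged$Doctors_FTE)) / nrow(df_merged)
```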
- Compute IQR, define fences at 1.5 × IQR.
- Flag & remove outliers, drop trusts with mortality < 10 per 100 000.
- Print pre-/post-filter statistics.
Validation:
- Compare `nrow(df_merged)` vs. `nrow(df_no_outliers)` and ensure removals match the printed counts.
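The fence logic on toy rates, using `quantile()`'s default (type 7) interpolation; whether the script's fences are inclusive or exclusive is a detail to confirm against the code:

```r
df <- data.frame(crude_mortality_per_100k = c(30, 45, 50, 55, 60, 300))

# Tukey fences at 1.5 x IQR, plus the < 10 per 100 000 floor
q     <- quantile(df$crude_mortality_per_100k, c(0.25, 0.75))
iqr   <- unname(q[2] - q[1])
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr

df_no_outliers <- subset(
  df,
  crude_mortality_per_100k >= lower &
  crude_mortality_per_100k <= upper &
  crude_mortality_per_100k >= 10
)

cat(sprintf("Removed %d of %d trusts\n",
            nrow(df) - nrow(df_no_outliers), nrow(df)))
```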
- Define two peer-group lists (`CHKS`, `NHSE`) plus a special "SaTH" catch for Shrewsbury & Telford; all others → `Other`.
- Add a factor column `peer_group`.

Validation:

- `table(df_no_outliers$peer_group)` yields counts for each group.
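A base-R sketch of the peer-group assignment; the peer lists and the SaTH match pattern below are placeholders, not the script's actual membership:

```r
# Nested ifelse() assignment; dplyr::case_when() would be equivalent
assign_peer_group <- function(org, chks, nhse) {
  ifelse(grepl("SHREWSBURY AND TELFORD", org), "SaTH",
  ifelse(org %in% chks, "CHKS",
  ifelse(org %in% nhse, "NHSE", "Other")))
}

orgs <- c("SHREWSBURY AND TELFORD HOSPITAL NHS TRUST",
          "TRUST A", "TRUST B", "TRUST C")
chks_peers <- c("TRUST A")   # placeholder peer list
nhse_peers <- c("TRUST B")   # placeholder peer list

peer_group <- factor(assign_peer_group(orgs, chks_peers, nhse_peers))
table(peer_group)            # counts per group, as in the validation check
```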
- Bar chart of crude mortality ± 95% CI by trust (coloured by peer group).
- Facetted bar chart by peer group.
- Top/Bottom 10 trusts by rate (error bars).
- Scatter: mortality vs. `Doctors_FTE` (with linear fit).
- Scatter: mortality vs. `Consultants_FTE`, highlighting SaTH.
Validation:
- Visually inspect that plots run without errors and axes are sensible.
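A minimal version of the peer-group-coloured bar chart on toy data; the script additionally adds 95% CI error bars, which are omitted here:

```r
library(ggplot2)

# Toy stand-in for df_no_outliers
df_no_outliers <- data.frame(
  org_description = paste("TRUST", LETTERS[1:5]),
  crude_mortality_per_100k = c(35, 42, 55, 61, 78),
  peer_group = factor(c("CHKS", "NHSE", "Other", "SaTH", "Other"))
)

# Trusts ordered by rate, coloured by peer group
p <- ggplot(df_no_outliers,
            aes(x = reorder(org_description, crude_mortality_per_100k),
                y = crude_mortality_per_100k, fill = peer_group)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Crude mortality per 100,000 attendances",
       fill = "Peer group")
```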
- Filter to non-SaTH trusts and select numeric predictors (dropping zero-variance columns).
- Build a recipe: median-impute → normalize → PCA (10 components).
- Prep and bake the recipe to extract the PCs.
- Fit `lm(deaths ~ PC1 + … + PC10)`.
- Summarise and plot diagnostic panels (residuals, QQ).
- Compute absolute t-values for each PC → “component importance.”
- Extract and display top-contributing variables for selected PCs.
- Plot cumulative variance explained.
References:
- Jolliffe & Cadima (2016), tutorial on PCA.
- James et al. (2013), Introduction to Statistical Learning.
Validation:
- Check that cumulative variance curve reaches ≥ 80% by PC10.
- Ensure residual plots show no gross violations of assumptions.
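A base-R sketch of the PCA-then-OLS step on simulated data. The script uses a recipes pipeline with 10 components; here `prcomp()` with centring and scaling stands in for the normalize + PCA steps, and six toy predictors stand in for the larger predictor set:

```r
set.seed(2025)
X <- as.data.frame(matrix(rnorm(50 * 6), nrow = 50))
names(X) <- paste0("x", 1:6)
deaths <- rpois(50, lambda = 100)

# Median-impute (a no-op on this toy data), then centred/scaled PCA
X[] <- lapply(X, function(v) { v[is.na(v)] <- median(v, na.rm = TRUE); v })
pca <- prcomp(X, center = TRUE, scale. = TRUE)
pcs <- as.data.frame(pca$x)

# OLS on the principal components; |t| per PC as "component importance"
fit <- lm(deaths ~ ., data = pcs)
comp_importance <- abs(coef(summary(fit))[-1, "t value"])

# Cumulative variance explained, for the scree/validation check
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
```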
- Split data 80/20 (stratified by death count).
- Recipe: remove zero/near-zero variance, median-impute, normalize.
- Specify `rand_forest(mtry, trees = 500, min_n)` using the `ranger` engine.
- Build a workflow with 5-fold cross-validation.
- Grid-tune `mtry` and `min_n` via a Latin hypercube (10 points).
- Select the best by RMSE, finalize the workflow, and fit on the full training set.
- Evaluate on test set: RMSE, R².
Reference:
- Liaw & Wiener (2002), `randomForest` methodology.
Validation:
- Compare train vs. test RMSE to check for overfitting.
- Confirm `best_params` are sensible (e.g. `mtry` around √p).
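A hedged sketch of the tuning workflow on simulated data. The column names (`doctors`, `nurses`, `attendances`) are placeholders; the split, fold, and grid settings follow the description above:

```r
library(tidymodels)

set.seed(2025)
# Simulated stand-in for the merged trust-level data
df_model <- tibble(
  deaths      = rpois(120, 100),
  doctors     = rnorm(120, 400, 50),
  nurses      = rnorm(120, 900, 100),
  attendances = rnorm(120, 1.2e5, 2e4)
)

split <- initial_split(df_model, prop = 0.8, strata = deaths)

rec <- recipe(deaths ~ ., data = training(split)) %>%
  step_zv(all_predictors()) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

wf    <- workflow() %>% add_recipe(rec) %>% add_model(spec)
folds <- vfold_cv(training(split), v = 5)
grid  <- grid_latin_hypercube(
  finalize(mtry(), select(training(split), -deaths)),
  min_n(),
  size = 10
)

tune_res    <- tune_grid(wf, resamples = folds, grid = grid)
best_params <- select_best(tune_res, metric = "rmse")
final_res   <- last_fit(finalize_workflow(wf, best_params), split)
collect_metrics(final_res)   # test-set RMSE and R-squared
```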
- VIP: extract impurity-based importance and plot top features.
- LIME: generate local explanations for the first three cases and plot feature contributions.
Reference:
- Molnar (2020), Interpretable Machine Learning.
Validation:
- Ensure VIP plot runs and LIME explanations show non-zero contributions.
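The importance step can be reproduced with `ranger` and `vip` directly on simulated data; the LIME step would follow analogously with `lime::lime()` on the training data and `lime::explain()` on the first three rows:

```r
library(ranger)
library(vip)

set.seed(2025)
# Simulated stand-in for the trust-level training data
dat <- data.frame(
  deaths  = rpois(100, 100),
  doctors = rnorm(100),
  nurses  = rnorm(100),
  waits   = rnorm(100)
)

# Impurity-based importance, as the script requests from ranger
rf <- ranger(deaths ~ ., data = dat, num.trees = 500, importance = "impurity")
sort(rf$variable.importance, decreasing = TRUE)

# Bar chart of the top features
p <- vip(rf, num_features = 3)
```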
- Predict death counts for all trusts and combine with observed values.
- Plot observed vs. predicted with peer-group colours and a 1:1 line.
- Compute the O/E ratio and its standard error under Poisson assumptions:

$$ \text{SE}(\mathrm{O/E}) = \frac{1}{\sqrt{\text{expected}}} $$

- Funnel plot: O/E vs. expected with ± 1.96 SE limits.
- Interactive Plotly versions of the funnel plot and a > 12-hour-wait vs. O/E scatter.
Validation:
- Confirm that most trusts lie within the funnel limits as expected under Poisson variation.
- Hover tooltips should display correct labels.
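The O/E and control-limit arithmetic can be checked on toy counts, using the Poisson standard-error formula above:

```r
# Base-R sketch of the funnel-plot calculation on toy counts
df <- data.frame(
  trust    = c("A", "B", "C"),
  observed = c(120, 95, 140),
  expected = c(110, 100, 100)
)

df$oe    <- df$observed / df$expected
df$se    <- 1 / sqrt(df$expected)   # Poisson SE of O/E
df$upper <- 1 + 1.96 * df$se        # funnel control limits around 1
df$lower <- 1 - 1.96 * df$se
df$outside <- df$oe > df$upper | df$oe < df$lower
```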
- Set a seed at the start (e.g. `set.seed(2025)`) to fix cross-validation splits.
- Dimension checks after each major join/transform:

```r
stopifnot(nrow(df_wide) == length(unique(df$org_description)))
stopifnot(ncol(dfW_final) == 5)
```

- Missingness reports for each key column: `summarise_all(df_merged, ~ sum(is.na(.)))`
- Summary statistics: `summary(df_merged$crude_mortality_per_100k)`
- Cross-validation diagnostics: examine the `tune_res` object for any failed folds.
- Model performance: ensure test-set RMSE is within 10–20% of training RMSE.
- Visual inspections: all ggplot and plotly commands run without error.
By following this document and running the suggested checks, you can both understand the purpose of each section and verify that `ed_mortality_analysis.R` produces correct, reproducible results.
- R 4.0 or higher
- R packages: `readr`, `dplyr`, `tidyr`, `janitor`, `ggplot2`, `plotly`, `rsample`, `recipes`, `parsnip`, `workflows`, `tune`, `yardstick`, `ranger`, `lime`, `vip`, and the `tidymodels` suite.

Install missing packages with:

```r
install.packages(c("tidymodels", "plotly", "janitor", "ranger", "lime", "vip"))
```

- Edit the `file_path` variable near the top of `NHSD AE dataWrangleML pipe.R` so that it points to your local CSV file.
- Run the script within R or from the command line:

```sh
Rscript "NHSD AE dataWrangleML pipe.R"
```

The script will print summaries and open plots in your R session.
Distributed under the terms of the GNU General Public License v3.