This repository implements a multi-horizon probabilistic forecasting system for Combined Sewer Overflow (CSO) risk at flow monitoring stations.
The system:
- Ingests hydrological and infrastructure data
- Engineers time-series features
- Trains ensemble machine learning models
- Applies conformal prediction for calibrated confidence
- Selects the best model per station and horizon
- Generates scalable, structured predictions
For each station and forecast horizon, the system predicts:
    P(CSO in next H hours | current state)

where:

- `H ∈ {1, 6, 24, 168, 720}` hours
- a CSO event is defined as `fill_ratio_target > threshold` (default threshold = 0.95)
Predicting CSO exceedance probability is preferred operationally because:
- It aligns directly with operational decisions: water companies act on the likelihood of overflow events rather than raw depth values
- It focuses on the physically relevant extreme by targeting peak behaviour, which is what actually drives CSO occurrence
- It is more robust to timing errors and rainfall forecast uncertainty, avoiding sensitivity to small misalignments in when peaks occur
- It enables calibrated probabilistic outputs, allowing risk to be expressed in a consistent and interpretable way
- It scales efficiently across networks and multiple forecast horizons, making it suitable for large operational deployments
Although full depth-trajectory forecasting (i.e. predicting the complete time series) provides richer detail, it is generally less stable, harder to interpret, and less directly actionable for real-world CSO management.
Input data sources:

- GEFS precipitation forecasts
- Rainfall (HDE)
- Groundwater (HDE)
- Flow data (Southern Water)
- Station metadata
Preprocessing steps:

- Filter to station_id and alt_id
- Remove duplicates and invalid data
- Create a complete 2-minute time grid
- Merge rainfall, groundwater, and forecast data
- Interpolate short missing gaps (≤ 7 days)
- Compute:

      fill_ratio = depth_mm / pipe_diameter_mm
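The gridding, gap interpolation, and fill_ratio steps above can be sketched in pandas. This is a minimal illustration; the `preprocess` helper and its column names are assumptions, not the repository's actual schema:

```python
import pandas as pd

def preprocess(depth: pd.DataFrame, pipe_diameter_mm: float) -> pd.DataFrame:
    """Regularise a raw depth series onto a 2-minute grid and derive fill_ratio.

    Assumes `depth` has a DatetimeIndex and a `depth_mm` column (illustrative
    names, not the repository's exact schema).
    """
    # Remove duplicate timestamps and sort chronologically
    depth = depth[~depth.index.duplicated(keep="first")].sort_index()
    # Complete 2-minute time grid from first to last observation
    grid = pd.date_range(depth.index.min(), depth.index.max(), freq="2min")
    depth = depth.reindex(grid)
    # Interpolate short internal gaps only (7 days = 5040 two-minute steps)
    depth["depth_mm"] = depth["depth_mm"].interpolate(limit=5040, limit_area="inside")
    depth["fill_ratio"] = depth["depth_mm"] / pipe_diameter_mm
    return depth
```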
Transforms raw time series into predictive features:

- State variables: fill_ratio, depth, flow, velocity
- Rolling statistics: max, mean, std over windows (10 min → 24 h)
- Lag features: past values at multiple horizons
- Rates of change: slopes, derivatives
- Rainfall aggregates: cumulative rainfall windows
- Groundwater indices: smoothed and lagged signals
- Interactions: e.g. rainfall × fill_ratio
- Temporal encoding: hour of day, day of year (sin/cos)
- Binary regime indicators: is_raining, is_surcharged
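A minimal sketch of a few of these feature families, assuming a 2-minute grid with a DatetimeIndex; the column names (`fill_ratio`, `rainfall_mm`) and window sizes are illustrative, not the repository's configuration:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature engineering on a 2-minute grid (DatetimeIndex)."""
    out = df.copy()
    # Rolling statistics: 30 two-minute steps = 1 hour
    out["fill_ratio_max_1h"] = df["fill_ratio"].rolling(30, min_periods=1).max()
    out["fill_ratio_mean_1h"] = df["fill_ratio"].rolling(30, min_periods=1).mean()
    # Lag feature and rate of change
    out["fill_ratio_lag_1h"] = df["fill_ratio"].shift(30)
    out["fill_ratio_slope"] = df["fill_ratio"].diff()
    # Cumulative rainfall over the last 24 hours (720 steps)
    out["rain_24h"] = df["rainfall_mm"].rolling(720, min_periods=1).sum()
    # Interaction and binary regime indicator
    out["rain_x_fill"] = out["rain_24h"] * df["fill_ratio"]
    out["is_raining"] = (df["rainfall_mm"] > 0).astype(int)
    # Cyclical temporal encoding of hour of day
    hour = df.index.hour + df.index.minute / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    return out
```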
fill_ratio_target = max(fill_ratio over next H hours)
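On a regular 2-minute grid (30 steps per hour), this forward-looking maximum can be computed with a reversed rolling max; the `make_target` helper is an illustrative sketch, not the repository's exact implementation:

```python
import pandas as pd

def make_target(fill_ratio: pd.Series, horizon_hours: int, threshold: float = 0.95) -> pd.Series:
    """Binary CSO label: does fill_ratio exceed `threshold` in the next H hours?

    Assumes a regular 2-minute grid (30 steps per hour).
    """
    steps = horizon_hours * 30
    # Max of fill_ratio over (t, t + steps]: reverse, roll, reverse, shift
    future_max = fill_ratio[::-1].rolling(steps, min_periods=1).max()[::-1].shift(-1)
    return (future_max > threshold).astype(int)
```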
- Train LightGBM regressor
- Evaluate on temporal validation fold
Features with importance below

    importance_threshold = importance_frac * max_importance

are removed. Correlation pruning then follows:

- Features grouped if correlation ≥ 0.9
- One representative kept per group
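The correlation grouping above might be sketched as a greedy pass in descending importance order; `prune_correlated` and the greedy strategy are illustrative assumptions, not the repository's exact algorithm:

```python
import pandas as pd

def prune_correlated(X: pd.DataFrame, importance: pd.Series, corr_threshold: float = 0.9) -> list:
    """Keep the most important feature from each group of correlated features.

    Greedy sketch: walk features from most to least important and keep a
    feature only if its |correlation| with every kept feature is below the
    threshold.
    """
    order = importance.sort_values(ascending=False).index
    corr = X[order].corr().abs()
    kept = []
    for feat in order:
        if all(corr.loc[feat, k] < corr_threshold for k in kept):
            kept.append(feat)
    return kept
```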
    kept_features = [...]

Classifier training:

- LightGBM binary classifier on `y = (fill_ratio_target > threshold)`
- Cross-validation via `TimeSeriesSplit(n_splits=5)`
- No leakage (strict forward chaining)
Each fold:
- train on past
- validate on future
    p = mean(p_fold_1, ..., p_fold_n)

This gives:

    P(CSO | features)
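The forward-chained folds and probability averaging could look like the sketch below. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM to keep the example dependency-light; the fold logic is the same:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM

def fit_forward_chained_ensemble(X, y, n_splits=5):
    """Train one classifier per forward-chaining fold (train on past only)."""
    models = []
    for train_idx, _val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        m = GradientBoostingClassifier()
        m.fit(X[train_idx], y[train_idx])  # strictly past data, no leakage
        models.append(m)
    return models

def predict_proba_mean(models, X):
    # p = mean over fold models of P(CSO | features)
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```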
Uses out-of-fold predictions (oof_prob):
    if y == 1:
        score = 1 - p
    else:
        score = p

    q_hat = quantile(nonconformity, 1 - alpha)
    conf_threshold = 1 - q_hat

Interpretation: predictions above `conf_threshold` are statistically reliable.
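The calibration step follows directly from the nonconformity definition above; this sketch uses the plain quantile as written (names are illustrative):

```python
import numpy as np

def conformal_threshold(oof_prob: np.ndarray, y: np.ndarray, alpha: float = 0.1) -> float:
    """Confidence threshold from out-of-fold probabilities.

    Nonconformity is 1 - p for positives and p for negatives, as defined
    in the text above.
    """
    scores = np.where(y == 1, 1.0 - oof_prob, oof_prob)
    q_hat = np.quantile(scores, 1 - alpha)
    return 1.0 - q_hat
```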
If:
- no CSO events in training folds
- insufficient calibration data
then the model falls back to:

- persistence (fill_ratio)
- no conformal confidence
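A hedged sketch of what a persistence fallback could look like: the current fill_ratio is assumed to persist over the horizon, so the predicted CSO indicator is simply whether it already exceeds the threshold. The repository's actual fallback may differ:

```python
import pandas as pd

def persistence_forecast(fill_ratio: pd.Series, threshold: float = 0.95) -> pd.Series:
    """Persistence fallback: hard 0/1 prediction from the current fill_ratio.

    No conformal confidence is attached in this regime.
    """
    return (fill_ratio > threshold).astype(float)
```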
For each station × horizon, best model is selected based on:
- performance metrics vs persistence
- ranking across metrics
Possible models:

- Persistence
- EnsembleClassifier
- ConformalEnsemble
Each model is saved to:
model/model_weights/{station_id}/{horizon}hr/
Contents:
- model_bundle.joblib
- metrics.csv
- feature list
- conformal calibration arrays
For each station and horizon:
- Engineer features
- Select best model from best_model_table.csv
- Predict in vectorized batches
For each timestamp:
| Field | Meaning |
|---|---|
| risk_percent | P(CSO) |
| risk_category | low / moderate / high |
| confidence_percent | conformal reliability |
| confidence_category | low / moderate / high |
| confidence_direction | CSO / no CSO |
| trend | rising / falling / stable |
| p_cso | conformal CSO evidence |
| performance_sentence | model performance summary |
One file per station:
predictions/{station_id}.parquet
| station_id | time | horizon_hours | best_model | risk_percent | risk_category | confidence_percent | confidence_category | confidence_direction | trend | p_cso | performance_sentence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8399 | 2024-01-01 00:00 | 6 | EnsembleClassifier | 18.7 | low | 68.4 | moderate | no CSO | stable | 6.8 | Similar predictions… |
| 8399 | 2024-01-01 00:02 | 6 | EnsembleClassifier | 20.3 | low | 65.7 | moderate | no CSO | rising | 7.9 | Similar predictions… |
Example queries:

    df = pd.read_parquet("predictions/8399.parquet")
    df[df["horizon_hours"] == 6]
    df.sort_values("time").groupby("horizon_hours").tail(1)

Why Parquet:

- Efficient for large datasets (millions of rows)
- Columnar storage → fast reads
- Preserves types (no parsing needed)
- Ideal for analytics + ML pipelines
Run the pipeline:

    python model_pipeline.py --config config.yaml

Dashboard:

- View flow monitors in a user-friendly dashboard
- Watch risk change in real time (updates every two minutes)
- Click on a flow monitor for station-specific CSO risk up to 30 days ahead