diff --git a/tools-appendix/modules/python/images/Autocorrelation-monthly.png b/tools-appendix/modules/python/images/Autocorrelation-monthly.png
new file mode 100644
index 000000000..915c46063
Binary files /dev/null and b/tools-appendix/modules/python/images/Autocorrelation-monthly.png differ
diff --git a/tools-appendix/modules/python/images/Back-to-the-future.jpg b/tools-appendix/modules/python/images/Back-to-the-future.jpg
new file mode 100644
index 000000000..b00e6d3cc
Binary files /dev/null and b/tools-appendix/modules/python/images/Back-to-the-future.jpg differ
diff --git a/tools-appendix/modules/python/images/CalibrationPlot.png b/tools-appendix/modules/python/images/CalibrationPlot.png
new file mode 100644
index 000000000..332d318ba
Binary files /dev/null and b/tools-appendix/modules/python/images/CalibrationPlot.png differ
diff --git a/tools-appendix/modules/python/images/TimeSeriesSection2-1e.png b/tools-appendix/modules/python/images/TimeSeriesSection2-1e.png
new file mode 100644
index 000000000..d1c342b4f
Binary files /dev/null and b/tools-appendix/modules/python/images/TimeSeriesSection2-1e.png differ
diff --git a/tools-appendix/modules/python/images/TimeSeriesSection2-2c.png b/tools-appendix/modules/python/images/TimeSeriesSection2-2c.png
new file mode 100644
index 000000000..5a19ba131
Binary files /dev/null and b/tools-appendix/modules/python/images/TimeSeriesSection2-2c.png differ
diff --git a/tools-appendix/modules/python/images/TimeSeriesSection2-3c.png b/tools-appendix/modules/python/images/TimeSeriesSection2-3c.png
new file mode 100644
index 000000000..46cdad7a5
Binary files /dev/null and b/tools-appendix/modules/python/images/TimeSeriesSection2-3c.png differ
diff --git a/tools-appendix/modules/python/images/TimeSeriesSection2-4c.png b/tools-appendix/modules/python/images/TimeSeriesSection2-4c.png
new file mode 100644
index 000000000..d6e9b9d02
Binary files /dev/null and b/tools-appendix/modules/python/images/TimeSeriesSection2-4c.png differ
diff --git a/tools-appendix/modules/python/images/TimeSeriesSection2-5c.png b/tools-appendix/modules/python/images/TimeSeriesSection2-5c.png
new file mode 100644
index 000000000..05b29dd1a
Binary files /dev/null and b/tools-appendix/modules/python/images/TimeSeriesSection2-5c.png differ
diff --git a/tools-appendix/modules/python/images/TimeSeriesSection2-6b.png b/tools-appendix/modules/python/images/TimeSeriesSection2-6b.png
new file mode 100644
index 000000000..1b989179a
Binary files /dev/null and b/tools-appendix/modules/python/images/TimeSeriesSection2-6b.png differ
diff --git a/tools-appendix/modules/python/images/Train-test-split.png b/tools-appendix/modules/python/images/Train-test-split.png
new file mode 100644
index 000000000..482958363
Binary files /dev/null and b/tools-appendix/modules/python/images/Train-test-split.png differ
diff --git a/tools-appendix/modules/python/images/append-data-vis.png b/tools-appendix/modules/python/images/append-data-vis.png
new file mode 100644
index 000000000..9dc5da735
Binary files /dev/null and b/tools-appendix/modules/python/images/append-data-vis.png differ
diff --git a/tools-appendix/modules/python/images/climate-change.png b/tools-appendix/modules/python/images/climate-change.png
new file mode 100644
index 000000000..2994cfd13
Binary files /dev/null and b/tools-appendix/modules/python/images/climate-change.png differ
diff --git a/tools-appendix/modules/python/images/time-series-ex.png b/tools-appendix/modules/python/images/time-series-ex.png
new file mode 100644
index 000000000..57a7e5dcd
Binary files /dev/null and b/tools-appendix/modules/python/images/time-series-ex.png differ
diff --git a/tools-appendix/modules/python/pages/DTW_project.adoc b/tools-appendix/modules/python/pages/DTW_project.adoc
new file mode 100644
index 000000000..e14e39d20
--- /dev/null
+++ b/tools-appendix/modules/python/pages/DTW_project.adoc
@@ -0,0 +1,784 @@
+= TDM 40200: Dynamic Time Warping Project
+:page-mathjax: true
+
+== Project Objectives
+
+In this project, you will explore Dynamic Time Warping (DTW), a method for measuring similarity between time series. DTW focuses on the shape of each series, allowing sequences with similar behavior to be compared even when their timing differs. This makes DTW well suited for clustering large collections of time series, helping uncover shared temporal patterns that can inform later forecasting models.
+
+.Learning Objectives
+****
+- You will learn how to clean and structure time series data so it is suitable for both traditional time series analysis and Dynamic Time Warping.
+- You will build intuition for how DTW works behind the scenes, then apply it to cluster time series with similar temporal and seasonal patterns.
+- You will visualize the resulting DTW-based clusters to better understand how different time series are grouped together.
+- You will use insights from DTW clustering to guide time series modeling decisions. Rather than fitting a separate forecasting model for every individual time series, you will explore how grouping similar series allows you to build and evaluate models developed for similar time series.
+****
+
+== Dataset
+- `/anvil/projects/tdm/data/noaa_dtw/*.csv`
+
+== Introduction to Dynamic Time Warping
+
+image::DTWImage1.png[width=600, height=450, title="Dynamic Time Warping matching. Source: Wiki Commons"]
+
+
+Time series data appears in many settings, including weather measurements, user behavior over time, sales trends, medical signals, and financial indicators. Dynamic Time Warping (DTW) offers a different way to think about similarity in time series. Instead of asking whether two sequences match at the same time index, DTW asks whether they follow a similar pattern or shape.
+
+
+DTW was originally developed in the 1970s in the context of speech recognition. Researchers were interested in comparing spoken words that were clearly the same but spoken faster or slower by different speakers. When sound signals like these are compared point by point, they often appear dissimilar because peaks and pauses do not align in time. However, when attention is given to the overall structure of the signal, the similarity becomes much more apparent. DTW was designed to capture this idea by allowing flexibility in how time is aligned.
+
+At a high level, Dynamic Time Warping is an algorithm that aligns two time series by allowing the time axis to stretch or compress where necessary. Rather than enforcing a one to one comparison between observations, DTW finds an alignment that minimizes the total distance between the two sequences after accounting for differences in timing. As a result, a single observation in one series may align with multiple observations in the other if doing so improves the overall match.
+
+:imagesdir: ../../images
+image::DTWImage3.png[width=600, height=450, title="Dynamic Time Warping alignment between two time series, showing the optimal warping path and aligned observations. Source: Mark Stent (Medium, 2019)"]
+
+Internally, DTW compares every point in one time series with every point in another and organizes these comparisons into a grid. It then searches for a path through this grid that respects the ordering of time and produces the smallest cumulative distance. The alignment path itself is just as important as the final distance value. It reveals where one series had to be stretched, where it was compressed, and where both series progressed at a similar pace.
+
+DTW distance can be defined as:
+
+image::DTWImage4.png[width=600, height=450, title="Dynamic Time Warping Formula. Source: Mark Stent (Medium, 2019)"]
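+
+In words, DTW fills a grid of pairwise costs between the two series and finds the cheapest path through that grid by dynamic programming. Below is a minimal, purely illustrative implementation of this recurrence (the function name and the absolute-difference cost are our own choices for the sketch, not part of the project code):
+
+[source,python]
+----
+import numpy as np
+
+def dtw_distance(x, y):
+    """Minimal DTW distance via dynamic programming (absolute-difference cost)."""
+    n, m = len(x), len(y)
+    D = np.full((n + 1, m + 1), np.inf)
+    D[0, 0] = 0.0
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            cost = abs(x[i - 1] - y[j - 1])
+            # extend the cheapest of: diagonal match, stretch x, stretch y
+            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
+    return D[n, m]
+----
+
+For example, `dtw_distance([0, 1, 2], [0, 0, 1, 2])` returns `0.0`: the repeated value is absorbed by stretching, whereas a point-by-point comparison is not even defined for series of different lengths.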
+
+Because of this flexibility, DTW is especially useful in settings where timing differences are meaningful rather than problematic. It has been applied in speech and audio analysis, healthcare signals such as heart rate data, financial and economic time series, marketing and user behavior patterns, and environmental and climate data. In each of these cases, the goal is to find similarity in structure.
+
+As for **the data** used in this project, we will use daily weather data from NOAA covering January 2024 through December 2024. The dataset contains one full year of temperature observations for the capital city of each U.S. state, giving us 50 separate temperature time series, one per capital. Let's look at one of these series, the weekly average temperature in Des Moines, Iowa:
+
+
+image::DTWImage2.png[width=600, height=450, title="Using DTW to Cluster Time Series in Our Temperature Data"]
+
+If our goal is to predict next week's average temperature, the most straightforward approach would seem to be building 50 separate time series models, one for each capital. While feasible, this can become difficult to maintain. Instead, we ask a more strategic question: what if some capitals follow similar temperature patterns over time and we could group them? This is where Dynamic Time Warping (DTW) comes in. DTW allows us to measure similarity between time series based on their overall shape rather than exact timing. By grouping capitals with similar temperature dynamics, we can cluster time series together and use those clusters to inform model building, reducing complexity.
+
+**Columns in the dataset:**
+
+- Station: NOAA weather station identifier
+
+- Date: Timestamp of the observation
+
+- DailyAverageDryBulbTemperature: Daily average air temperature (°C)
+
+- State: U.S. state name
+
+- Capital: Capital city of the state
+
+Each row represents a temperature observation at a given time for a specific capital city. Some timestamps may contain missing temperature values, which we will address during data cleaning and aggregation before applying DTW and time series modeling.
+
+In this project, DTW will be used as a tool for understanding and comparing time series patterns rather than as a standalone distance metric. The focus will be on how DTW aligns sequences, what those alignments reveal about the data, and how this information can support clustering and downstream modeling decisions.
+
+== Questions
+
+=== Question 1 - Understanding Our Data (2 points)
+
+.Deliverables
+====
+**1a. You are given multiple CSV files containing daily weather observations, where each file corresponds to one of the 50 U.S. state capitals' weather data for 2024. Fill in the code below to read all files from the provided directory, extract the state and capital names from each filename, and append the data into a single DataFrame. Retain only the `STATION`, `DATE`, and `DailyAverageDryBulbTemperature` columns, then add `State` and `Capital` columns.**
+
+[source,python]
+----
+import pandas as pd
+import glob
+import os
+
+data_path = "/anvil/projects/tdm/data/noaa_dtw/*.csv"
+
+files = glob.glob(data_path)
+
+print(f"Number of files found: {len(files)}")
+
+dfs = []
+
+for file in files:
+ filename = os.path.basename(file).replace(".csv", "")
+ state, capital = filename.split("_")
+
+ df = pd.read_csv(
+ file,
+ usecols=["____", "_____", "_____"], # For YOU to fill in
+ engine="python",
+ on_bad_lines="skip"
+ )
+
+ df["State"] = # For YOU to fill in
+ df["Capital"] = # For YOU to fill in
+
+ dfs.append(df)
+
+combined_df = pd.concat(dfs, ignore_index=True)
+combined_df.head()
+----
+
+**1b. Complete the code below to convert the `DATE` column to a datetime object and ensure that `DailyAverageDryBulbTemperature` is numeric. Then, sort the dataset by `State`, `Capital`, and `DATE`. Display the first few rows of the cleaned DataFrame to verify that the transformations were applied correctly. In 2–3 sentences, explain why these steps are necessary when preparing data for time series analysis.**
+
+[source,python]
+----
+combined_df["DATE"] = pd.to_datetime(combined_df["DATE"])
+combined_df["______"] = pd.to_numeric(combined_df["________"],errors="coerce")
+
+combined_df = (
+ combined_df
+ .sort_values(["____", "_____", "_____"])
+ .reset_index(drop=True))
+----
+
+**1c. As part of this preparation, you’ll notice that we’ll need to remove rows that are missing the `DailyAverageDryBulbTemperature` measurements. Carefully observe the pattern of missingness in the data before removing these rows. Report the number of missing values remaining in each column, and in 1–2 sentences explain why rows with no `DailyAverageDryBulbTemperature` information should be excluded prior to time series modeling.**
+
+[source,python]
+----
+cols_to_check = [
+ col for col in combined_df.columns
+ if col not in ["______", "______", "_____", "______"]] # For YOU to Fill in
+
+combined_df = combined_df.dropna(
+ subset=_______, # For YOU to Fill in
+ how="all")
+
+combined_df.isna().sum()
+----
+
+**1d. Convert the cleaned daily temperature data into weekly data by computing the mean weekly temperature for each (State, Capital) pair, using weeks that end on Sunday. Check for missing weekly values and use forward-fill and backward-fill within each group to address them. Confirm that the final weekly temperature series contains no missing values.**
+
+[source,python]
+----
+combined_df = (
+ combined_df
+ .set_index("DATE")
+ .groupby(["State", "Capital"])["DailyAverageDryBulbTemperature"]
+ .resample("W-SUN")
+ .mean()
+ .reset_index(name="WeeklyAvgTemp")
+ .sort_values(["State", "Capital", "DATE"])
+ .reset_index(drop=True)
+)
+
+combined_df["WeeklyAvgTemp"] = (
+ combined_df
+ .groupby(["____", "______"])["WeeklyAvgTemp"] # For YOU to fill in
+ .ffill()
+ .bfill()
+)
+----
+====
+
+=== Question 2 - Preparing Our Data For Time Series (2 points)
+
+In this section, we focus on getting to know our data at a deeper level. We explore how temperature patterns vary across locations and how multiple time series are represented within a single dataset. By visually examining weekly temperature trends for different capital cities, we begin to build intuition around seasonality, variability, and timing differences across series.
+
+Randomly sampling states allows us to explore a manageable subset of the data without relying on arbitrary or biased choices. Using a fixed random seed ensures that this exploration is reproducible, which is important as the analysis builds and evolves.
+
+Plotting each capital’s weekly temperature as its own time series reinforces the idea that each location represents a complete sequence rather than a collection of independent observations. This step encourages thinking in terms of overall patterns and structure before introducing formal similarity measures.
+
+Taken together, this exploratory work helps us understand what similarity might look like in the data. These observations motivate the use of methods that compare entire sequences based on their overall structure rather than exact alignment in time. This foundation is essential for clustering approaches that rely on sequence level similarity, such as Dynamic Time Warping.
+
+.Deliverables
+====
+**2a. Rather than selecting states manually, use the function below to randomly select a fixed number of unique states from the dataset. The function should allow the number of states and the random seed to be specified so that the selection is reproducible. Use this function to select five random states.**
+
+[source,python]
+----
+import numpy as np
+
+def sample_states(df, n_states=5, seed=42):
+ np.random.seed(seed)
+
+ states = (
+ df["State"]
+ .drop_duplicates()
+ .sample(n_states)
+ .tolist()
+ )
+
+ return states
+
+random_states = sample_states(_____, n_states=__, seed=__) # For YOU to fill in
+random_states
+----
+
+**2b. Using the selected states, weekly average dry bulb temperature time series are plotted for each corresponding capital city. These plots allow you to visually compare seasonal patterns and temperature variability across locations. Briefly describe any similarities or differences you observe in the trends.**
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+plot_df = combined_df[combined_df["State"].isin(______)] # For YOU to fill in
+
+for (state, capital), group in plot_df.groupby(["State", "Capital"]):
+
+ plt.figure(figsize=(10, 4))
+
+ plt.plot(
+ group["DATE"],
+ group["WeeklyAvgTemp"]
+ )
+
+ plt.xlabel("_____") # For YOU to fill in
+ plt.ylabel("______") # For YOU to fill in
+ plt.title(f"{capital}, {state}: Weekly Average Dry Bulb Temperature")
+
+ plt.tight_layout()
+ plt.show()
+----
+
+**2c. Each (State, Capital) pair represents a distinct weekly temperature time series in the dataset. The total number of unique series is computed by identifying these pairs. Report the total number and explain in 1-2 sentences why this count is important for later time-series analysis.**
+
+
+[source,python]
+----
+n_series = (
+ combined_df[[_____, _____]] # For YOU to fill in
+ .drop_duplicates()
+ .shape[0]
+)
+----
+
+
+====
+
+=== Question 3 - Using DTW to Cluster Our Time Series Weather Data of State Capitals (2 points)
+
+image::DTWImage5.png[width=600, height=450, title="Dynamic Time Warping Formula. Source: Mark Stent (Medium, 2019)"]
+
+In this section, we will review how time series data must be represented and prepared before it can be compared using Dynamic Time Warping. The focus is on reshaping, standardizing, and organizing the data so that each sequence can be treated as a single analytical object.
+
+We will begin by restructuring the data from a long format to a wide format. By pivoting the dataset so that each State and Capital pair occupies one row with dates running across columns, each time series becomes explicit. This reinforces the idea that a time series is not a collection of rows but a single ordered sequence of values. This step is essential because DTW compares entire sequences rather than individual timestamps.
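+
+As a tiny illustration of what this reshaping does (the mini DataFrame and its column names below are made up for the sketch, not the project data):
+
+[source,python]
+----
+import pandas as pd
+
+# Toy long-format data: one row per (city, week) observation
+long_df = pd.DataFrame({
+    "city": ["A", "A", "B", "B"],
+    "week": [1, 2, 1, 2],
+    "temp": [10.0, 12.0, 20.0, 22.0],
+})
+
+# Wide format: one row per city, with weeks running across the columns
+wide = long_df.pivot_table(index="city", columns="week", values="temp")
+----
+
+After the pivot, each row of `wide` is a complete ordered sequence for one city, which is exactly the shape a sequence-level comparison like DTW needs.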
+
+Next, we will convert the wide table into a numerical array and apply z score normalization to each time series independently. This introduces an important modeling concept: scale matters. Without normalization, DTW would primarily capture differences in absolute temperature levels rather than similarities in seasonal patterns. By centering and scaling each series, the comparison focuses on shape and timing instead of magnitude.
+
+We will then apply DTW within a clustering framework using the `TimeSeriesKMeans` class from the `tslearn` package. In this step, k-means clustering is performed with `dtw` specified as the distance metric, which means similarity between time series is determined by Dynamic Time Warping rather than by point-by-point differences. This allows the model to group capital cities based on the overall shape and seasonal structure of their temperature patterns, even when peaks and troughs occur at slightly different times across locations.
+
+Finally, we will visualize the clustering results using the original temperature values. Plotting non normalized series by cluster keeps the seasonal patterns interpretable and helps us connect the cluster assignments back to real world temperature dynamics. This step reinforces the idea that normalization is a modeling tool, not a visualization tool.
+
+Together, these steps will help us understand how DTW clustering works, why careful data preparation is critical, and how to interpret the resulting clusters in a meaningful way.
+
+.Deliverables
+====
+**3a. To apply DTW, the weekly temperature data must be reshaped so that each (State, Capital) pair corresponds to a single row, with dates running across columns. This wide format allows each row to be treated as an individual time series. Fill in the code below to create a wide DataFrame containing weekly average temperatures indexed by state and capital.**
+
+[source,python]
+----
+wide_df = combined_df.pivot_table(
+ index=["____", "_____"], # For YOU to fill in
+ columns="DATE",
+ values="WeeklyAvgTemp"
+)
+----
+
+**3b. DTW operates on numerical arrays rather than pandas DataFrames. Use the code below to convert the wide table into a NumPy array and apply z-score normalization so that each time series has mean zero and unit variance. This ensures that DTW compares temperature patterns instead of absolute temperature levels.**
+
+[source,python]
+----
+import numpy as np
+
+X = wide_df.values
+
+X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
+----
+
+**3c. Cluster the normalized time series using Dynamic Time Warping (DTW) to group capital cities with similar seasonal temperature patterns. Using the code below, set the number of clusters to 4 and fit a k-means model with `dtw` as the distance metric and store the resulting cluster assignment for each state–capital pair in a new DataFrame. Then, write 1–2 sentences explaining how DTW is used to group capitals into clusters.**
+
+[source,python]
+----
+from tslearn.clustering import TimeSeriesKMeans
+
+n_clusters = _ # For YOU to fill in
+
+dtw_km = TimeSeriesKMeans(
+ n_clusters=n_clusters,
+ metric="____",
+ random_state=42
+)
+
+labels = dtw_km.fit_predict(X)
+
+cluster_df = wide_df.reset_index()[["State", "Capital"]].copy()
+cluster_df["Cluster"] = labels
+
+cluster_df.head()
+----
+
+**3d. To interpret the DTW clustering results, the weekly temperature time series for each capital are visualized by cluster. For each cluster, plot all member time series using the original (non-normalized) temperature values so the seasonal patterns remain interpretable. Examine how temperature dynamics differ across clusters. Write 2–3 sentences about your observations of each cluster. What patterns do you notice?**
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+dates = wide_df.columns # calendar dates
+wide_values = wide_df.values # original (non-normalized) temperatures
+
+for cluster in range(n_clusters):
+ plt.figure(figsize=(10, 4))
+
+ for i, label in enumerate(labels):
+ if label == cluster:
+ plt.plot(
+ dates,
+ wide_values[i],
+ alpha=0.4
+ )
+
+ plt.title(f"DTW Cluster {cluster}")
+ plt.xlabel("___") # For YOU to fill in
+ plt.ylabel("_____") # For YOU to fill in
+ plt.tight_layout()
+ plt.show()
+----
+====
+
+=== Question 4 - Comparing DTW Clusters to Regional Clusters (2 points)
+
+image::DTWImage6.png[width=600, height=450, title="Dynamic Time Warping Formula. Source: Mark Stent (Medium, 2019)"]
+
+Now we will focus on interpreting and contextualizing the DTW clustering results. After grouping capital cities based on seasonal temperature patterns, it is important to understand what those clusters represent geographically and how they relate to existing regional definitions.
+
+We will begin by creating a clean table that links each State and Capital pair to its assigned DTW cluster. This step formalizes the clustering output and produces a structure that can be merged with other datasets. Organizing cluster assignments in a standalone table reinforces the idea that clusters are properties of entire time series, not individual observations.
+
+Next, we will address a common but critical data preparation issue: inconsistent naming conventions. To successfully merge cluster information with geographic boundary files, state names must match exactly across datasets. Applying a cleaning function allows us to standardize state names and avoid silent merge failures. This highlights the importance of careful string handling and data validation when working across multiple data sources.
+
+We will then introduce geographic context by assigning each state to a U.S. Census region. These regions provide a familiar baseline that reflects traditional geographic groupings. Comparing DTW based clusters to Census regions helps us evaluate whether temperature driven patterns align with established regional boundaries or reveal different groupings driven by climate rather than geography.
+
+Using `GeoPandas` and publicly available Census boundary data, we will visualize both DTW clusters and Census regions side by side on a map. Mapping the results allows us to assess spatial coherence, identify regional transitions, and detect mismatches between climate based clusters and political or geographic regions. This step emphasizes that clustering results should be interpreted visually and contextually, not just numerically.
+
+Finally, we will merge the DTW cluster labels back onto the weekly temperature dataset. This integration step is essential for downstream analysis. By attaching cluster assignments to each observation, we enable cluster level modeling, forecasting, and evaluation in later questions. This reinforces the idea that clustering is not an endpoint, but a tool that informs subsequent time series analysis.
+
+
+Together, these steps show how DTW clustering can support modeling and decision making.
+
+
+.Deliverables
+====
+**4a. To visualize DTW clusters geographically, each state must be associated with a single cluster label. Create a DataFrame that contains one row per `State` and `Capital` along with its assigned DTW cluster. This table will later be merged with geographic boundary data.**
+
+[source,python]
+----
+cluster_df = wide_df.reset_index()[["____", "_____"]].copy() # For YOU to fill in
+cluster_df["Cluster"] = labels
+
+cluster_df.head()
+----
+
+**4b. To successfully merge cluster data with geographic boundary files and produce a map, state names must follow a consistent naming convention. Some state names contain formatting differences or errors introduced during file ingestion. Use the code provided below to apply a cleaning function that standardizes state names.**
+
+[source,python]
+----
+state_name_fix = {
+ "NewYork": "New York",
+ "NewJersey": "New Jersey",
+ "NewMexico": "New Mexico",
+ "NewHampshire": "New Hampshire",
+ "NorthCarolina": "North Carolina",
+ "NorthDakota": "North Dakota",
+ "SouthCarolina": "South Carolina",
+ "SouthDakota": "South Dakota",
+ "WestVirginia": "West Virginia",
+ "RhodeIsland": "Rhode Island",
+ "Atlanta": "Georgia",
+ "Tallahassee": "Florida",
+ "OklahomaCity": "Oklahoma"
+}
+
+def clean_state(s):
+ return s.replace(_____).str.strip().str.title() # For YOU to fill in
+
+cluster_df["State"] = clean_state(cluster_df["State"])
+combined_df["State"] = clean_state(combined_df["State"])
+----
+
+**4c. To add geographic context, assign each state to its corresponding U.S. Census region. Use the code below to merge region information into a state-level summary table. Then write 1–2 sentences explaining why it may be informative to compare DTW-based temperature clusters with traditional Census region groupings.**
+
+[source,python]
+----
+import pandas as pd
+
+region_df = pd.DataFrame({
+ "State": [
+ # Northeast
+ "Connecticut","Maine","Massachusetts","New Hampshire","Rhode Island",
+ "Vermont","New Jersey","New York","Pennsylvania",
+ # Midwest
+ "Illinois","Indiana","Michigan","Ohio","Wisconsin","Iowa","Kansas",
+ "Minnesota","Missouri","Nebraska","North Dakota","South Dakota",
+ # South
+ "Delaware","Florida","Georgia","Maryland","North Carolina","South Carolina",
+ "Virginia","West Virginia","Alabama","Kentucky","Mississippi","Tennessee",
+ "Arkansas","Louisiana","Oklahoma","Texas",
+ # West
+ "Arizona","Colorado","Idaho","Montana","Nevada","New Mexico","Utah",
+ "Wyoming","Alaska","California","Hawaii","Oregon","Washington"
+ ],
+ "Region": (
+ ["Northeast"] * 9 +
+ ["Midwest"] * 12 +
+ ["South"] * 16 +
+ ["West"] * 13
+ )
+})
+
+state_to_region = dict(zip(region_df["State"], region_df["Region"]))
+
+combined_df["Region"] = combined_df["State"].map(_____) # For YOU to fill in
+----
+
+**4d. Using the code below and geographic boundary data for U.S. states, visualize the DTW cluster assignments and U.S. Census regions side by side on a map. These maps allow you to compare temperature-based clusters with established geographic regions. Briefly describe any alignments or mismatches you observe between the DTW clusters and Census regions in 2-3 sentences.**
+
+[source,python]
+----
+import geopandas as gpd
+import matplotlib.pyplot as plt
+
+state_summary = cluster_df[["State", "Cluster"]].merge(
+ combined_df[["State", "Region"]].drop_duplicates(),
+ on="State",
+ how="left"
+)
+
+us_states = (
+ gpd.read_file(
+ "https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_state_20m.zip"
+ )
+ .query("STUSPS not in ['PR','VI','GU','MP','AS','DC']")
+ .assign(State=lambda df: df["NAME"].str.strip().str.title())
+)
+
+map_df = us_states.merge(state_summary, on="State", how="left")
+
+fig, axes = plt.subplots(1, 2, figsize=(22, 10))
+
+map_df.plot(column="Cluster", categorical=True, cmap="Set2",
+ legend=True, edgecolor="white", ax=axes[0])
+axes[0].set_title("DTW Clusters")
+axes[0].axis("off")
+
+map_df.plot(column="Region", categorical=True, cmap="tab10",
+ legend=True, edgecolor="white", ax=axes[1])
+axes[1].set_title("Census Regions")
+axes[1].axis("off")
+
+plt.tight_layout()
+plt.show()
+----
+
+**4e. Each weekly temperature observation must be associated with its DTW cluster in order to build cluster-level models. Use the code below to merge the cluster assignments onto the weekly dataset using `State` and `Capital` identifiers. Verify that the resulting dataset contains both temperature and cluster information.**
+
+[source,python]
+----
+df_labeled = combined_df.merge(
+ cluster_df[["State", "Capital", "Cluster"]],
+ on=["____", "____"], # For YOU to fill in
+ how="inner")
+----
+====
+
+=== Question 5 - Transition to Forecast Models (2 points)
+
+In this section, we shift from discovering structure to building forecasts. Up to this point, Dynamic Time Warping helped us identify groups of capital cities that share similar temperature patterns over time. We now use that structure to guide forecasting, allowing us to model behavior at the cluster level rather than fitting a separate model for every capital.
+
+**Establishing a time-aware train–test split**
+
+We'll split each capital’s time series into training and test sets. The data is first sorted by `State`, `Capital`, and `DATE` to ensure the observations are in correct chronological order. We then reserve the final four weekly observations as a test set using positional indexing with `iloc`.
+This choice reflects a core principle of time series analysis: when forecasting the future, models should be trained only on past data. Holding out the most recent observations creates a realistic out-of-sample evaluation.
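+
+A minimal sketch of such a split on a toy weekly series (the helper name and the made-up values are ours for illustration):
+
+[source,python]
+----
+import pandas as pd
+
+def split_last_n(group, n_test=4):
+    """Hold out the final n_test rows of a chronologically sorted series."""
+    return group.iloc[:-n_test], group.iloc[-n_test:]
+
+# Toy example: eight consecutive weeks for one capital (made-up temperatures)
+g = pd.DataFrame({
+    "DATE": pd.date_range("2024-01-07", periods=8, freq="W-SUN"),
+    "WeeklyAvgTemp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
+}).sort_values("DATE")
+
+train, test = split_last_n(g, n_test=4)
+# train holds the first four weeks; test holds the most recent four
+----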
+
+**Constructing a cluster-level temperature pattern**
+
+With training data defined, we can focus on capturing the shared behavior within each DTW cluster. Each capital’s training series is standardized using a z-score transformation, removing differences in average temperature and variability across locations. These standardized series are then averaged within each cluster. The resulting cluster-level pattern represents a smooth “average shape” that emphasizes common seasonal dynamics while reducing city-specific noise.
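+
+With the training series stored as rows of a NumPy array, this step can be sketched as follows (the toy array and variable names are ours):
+
+[source,python]
+----
+import numpy as np
+
+# Toy training data: three series (rows) of six weeks each (made-up values)
+X_train = np.array([
+    [10.0, 12.0, 14.0, 12.0, 10.0,  8.0],
+    [20.0, 24.0, 28.0, 24.0, 20.0, 16.0],
+    [ 0.0,  1.0,  2.0,  1.0,  0.0, -1.0],
+])
+labels = np.array([0, 0, 1])  # cluster assignment for each series
+
+# z-score each series independently so only its shape remains
+Z = (X_train - X_train.mean(axis=1, keepdims=True)) / X_train.std(axis=1, keepdims=True)
+
+# average the standardized series within each cluster
+cluster_patterns = {c: Z[labels == c].mean(axis=0) for c in np.unique(labels)}
+----
+
+Note that the first two toy series differ only in level and spread, so after standardization they are identical and their cluster average equals either one; this is precisely how the average shape suppresses city-specific scale.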
+
+**Modeling with SARIMAX**
+
+The cluster-level patterns now form the basis for forecasting. Rather than modeling individual capitals directly, we fit SARIMAX models to each cluster-level series using the `statsmodels` package. The non-seasonal order $(p, d, q)$ controls how the series evolves over time: $p$ determines how many past values influence current predictions, $d$ governs differencing to address long-term trends, and $q$ accounts for short-term dependence in forecast errors.
+
+**Selecting model parameters using validation performance**
+
+To determine appropriate values for $(p, d, q)$, we evaluate a small grid of candidate models using a short validation window and RMSE as the evaluation metric. This approach reinforces the idea that model selection should be guided by empirical performance rather than arbitrary choices. For each cluster, the parameter configuration that minimizes validation error is selected.
+
+**Accounting for yearly seasonality**
+
+Because weekly temperature data exhibits strong annual seasonality, we incorporate a fixed seasonal structure of $(0, 1, 0, 52)$. Seasonal differencing with $D = 1$ removes repeating yearly patterns, while $s = 52$ reflects the length of one seasonal cycle in weekly data. Keeping the seasonal autoregressive and moving average terms at zero maintains model simplicity while allowing the dominant seasonal signal to be captured effectively.
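+
+A sketch of fitting one such model with `statsmodels` on a synthetic standardized series (the order shown is just one candidate from the grid, and the synthetic data are ours for illustration):
+
+[source,python]
+----
+import numpy as np
+from statsmodels.tsa.statespace.sarimax import SARIMAX
+
+# Synthetic standardized weekly series: three years of a yearly cycle plus noise
+rng = np.random.default_rng(42)
+t = np.arange(156)  # 156 weeks = 3 years
+y = np.sin(2 * np.pi * t / 52) + 0.1 * rng.standard_normal(156)
+
+model = SARIMAX(
+    y,
+    order=(1, 0, 0),               # one candidate (p, d, q) from the grid
+    seasonal_order=(0, 1, 0, 52),  # fixed yearly seasonal structure
+)
+res = model.fit(disp=False)
+
+forecast = res.forecast(steps=4)   # next four weeks, on the standardized scale
+----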
+
+**Applying cluster models to individual capitals**
+
+Once cluster-level models are selected, we can apply them back to individual capitals. Forecasts are generated on the standardized scale and then transformed back to the original temperature units using each capital’s mean and standard deviation. This step connects the shared cluster-level structure to city-specific predictions.
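The back-transformation is simply the inverse of the z-score. A minimal sketch with hypothetical values for one capital's mean and standard deviation:

```python
import numpy as np

# Hypothetical capital statistics and a cluster-level forecast in z units
mu, sigma = 55.0, 12.0
z_pred = np.array([-1.0, 0.0, 1.0])

# Undo the standardization: scale by the capital's std, then shift by its mean
y_pred = z_pred * sigma + mu

# Round trip: re-standardizing the forecast recovers the z-scale values
assert np.allclose((y_pred - mu) / sigma, z_pred)
```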
+
+Here is a table of the parameters in the SARIMAX model:
+
+[cols="1,1,3", options="header"]
+|===
+| Parameter | Name | What it Controls
+| $p$
+| Autoregressive order
+| How many past values of the time series are used to predict the current value. Larger values allow the model to use longer memory but may increase complexity.
+| $d$
+| Differencing order
+| How many times the series is differenced to remove long-term trends and make the series more stable over time. A value of 1 is common for trending data.
+| $q$
+| Moving average order
+| How many past forecast errors are used to correct current predictions. This helps capture short-term noise and shocks in the data.
+| $P$
+| Seasonal autoregressive order
+| The autoregressive component applied at the seasonal scale. This models relationships between values separated by one full seasonal cycle.
+| $D$
+| Seasonal differencing order
+| How many times seasonal differencing is applied to remove repeating seasonal patterns. A value of 1 removes yearly seasonality when using weekly data.
+| $Q$
+| Seasonal moving average order
+| The moving average component applied at the seasonal scale, modeling seasonal error patterns.
+| $s$
+| Seasonal period
+| The number of observations in one seasonal cycle. For weekly temperature data, $s = 52$ represents one year.
+|===
+
+This step bridges the gap between standardized modeling and interpretable temperature values.
+Finally, we will evaluate and visualize forecast performance. RMSE provides a numerical summary of accuracy, while forecast plots allow us to assess whether the model captures seasonal timing, smoothness, and overall trend. Comparing these visual cues with RMSE values helps us understand both the strengths and limitations of cluster-based forecasting.
+
+Together, these steps show how DTW clustering, thoughtful model selection, and SARIMAX forecasting combine to produce scalable and interpretable time series predictions.
+
+
+.Deliverables
+====
+**5a. To evaluate forecasting performance, split each capital’s time series into training and test sets. Use the last four weekly observations as the test set and all prior observations for training. Pay close attention to the use of iloc, which selects rows based on their position within each grouped time series rather than on date values. This ensures that the split is strictly time-ordered and reflects true out-of-sample forecasting.**
+
+[source,python]
+----
+# Ensure consistent ordering for time series operations
+combined_df = (
+ combined_df
+ .sort_values(["State", "Capital", "DATE"])
+ .reset_index(drop=True))
+
+train_df = (
+ combined_df
+ .groupby(["State", "Capital"], group_keys=False)
+ .apply(lambda g: g.iloc[:__], include_groups=False)) # For YOU to fill in
+
+test_df = (
+ combined_df
+ .groupby(["State", "Capital"], group_keys=False)
+ .apply(lambda g: g.iloc[__:], include_groups=False)) # For YOU to fill in
+----
+
+**5b. Rather than fitting a separate model for each capital, use the code below to estimate a standardized “average shape” for each DTW cluster. Weekly temperature series are standardized within each capital and then averaged across capitals in the same cluster. These cluster-level patterns form the basis for forecasting. In 1–2 sentences, describe how creating a standardized “average shape” at the cluster level helps capture common temperature dynamics across capitals. Why might this approach be preferable to modeling each capital independently?**
+
+[source,python]
+----
+import numpy as np
+import pandas as pd  # needed below for pd.concat
+
+capital_scale = {}
+cluster_shapes = {}
+
+for cluster, grp in train_df.groupby("Cluster"):
+ z_series = []
+
+ for (state, capital), g in grp.groupby(["State", "Capital"]):
+ y = g.set_index("DATE")["WeeklyAvgTemp"].asfreq("W-SUN").interpolate()
+
+ mu, sigma = y.mean(), y.std(ddof=0)
+ if sigma > 0:
+ capital_scale[(state, capital)] = (mu, sigma)
+ z_series.append((y - mu) / sigma)
+
+ cluster_shapes[cluster] = pd.concat(z_series, axis=1).mean(axis=1)
+----
+
+**5c. Rather than fitting a separate model for each capital, select a SARIMAX model for each DTW cluster using the cluster-level standardized temperature pattern. In the SARIMAX model below, use seasonal_order = $(0, 1, 0, 52)$. Fill in these values directly in the code. Then, in 1–2 sentences, explain what each part of this seasonal order is doing and why a 52-week seasonal period makes sense for weekly temperature data.**
+
+Read more about the SARIMAX model in python https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html[here].
+
+[source,python]
+----
+from statsmodels.tsa.statespace.sarimax import SARIMAX
+import itertools
+import warnings
+from statsmodels.tools.sm_exceptions import ConvergenceWarning
+
+warnings.filterwarnings("ignore", category=UserWarning)
+warnings.filterwarnings("ignore", category=ConvergenceWarning)
+warnings.filterwarnings("ignore", category=RuntimeWarning)
+
+def rmse(y, yhat):
+ return np.sqrt(np.mean((y - yhat) ** 2))
+
+# Grid over non-seasonal ARIMA parameters
+param_grid = list(itertools.product([0, 1, 2], [0, 1], [0, 1, 2]))
+best_cluster_params = {}
+
+for cluster, z in cluster_shapes.items():
+ z = z.asfreq("W-SUN").interpolate()
+ z_train, z_val = z.iloc[:-4], z.iloc[-4:]
+
+ best_score, best_cfg = np.inf, None
+
+ for order in param_grid:
+ try:
+ res = SARIMAX(
+ z_train,
+ order=order,
+ # Annual seasonality fixed using domain knowledge
+ seasonal_order=(0, 1, 0, 52),
+ enforce_stationarity=False,
+ enforce_invertibility=False
+ ).fit(disp=False)
+
+ preds = res.get_forecast(4).predicted_mean
+ score = rmse(z_val, preds)
+
+ if score < best_score:
+ best_score, best_cfg = score, order
+
+ except Exception:
+ continue
+
+ best_cluster_params[cluster] = best_cfg
+ print(f"Cluster {cluster}: RMSE={best_score:.3f}, order={best_cfg}")
+----
+
+**5d. Using the selected cluster-level SARIMAX models, generate forecasts for each capital’s held-out test period using the code below. Predictions are transformed back to the original temperature scale and compared against observed values. Compute the RMSE for each capital and visualize forecasts for the best-performing locations. In 2-3 sentences, interpret the RMSE values and describe whether it indicates strong or weak model performance.**
+
+[source,python]
+----
+predictions = {}
+
+for (state, capital), g in train_df.groupby(["State", "Capital"]):
+ cluster = int(g["Cluster"].iloc[0])
+
+ if (state, capital) not in capital_scale:
+ continue
+
+ mu, sigma = capital_scale[(state, capital)]
+ z = cluster_shapes[cluster]
+
+ model = SARIMAX(
+ z,
+ order=best_cluster_params[cluster],
+ seasonal_order=(0, 1, 0, 52),
+ enforce_stationarity=False,
+ enforce_invertibility=False).fit(disp=False)
+
+ y_test = (
+ test_df
+ .query("State == @state and Capital == @capital")
+ .set_index("DATE")["WeeklyAvgTemp"]
+ .asfreq("W-SUN")
+ .interpolate())
+
+ z_pred = model.get_forecast(len(y_test)).predicted_mean
+ predictions[(state, capital)] = (y_test, z_pred * sigma + mu)
+
+rmse_df = pd.DataFrame([
+ {
+ "State": s,
+ "Capital": c,
+ "Cluster": train_df.query(
+ "State == @s and Capital == @c"
+ )["Cluster"].iloc[0],
+ "RMSE": rmse(y.values, yhat.values)
+ }
+ for (s, c), (y, yhat) in predictions.items()])
+
+top3 = rmse_df.nsmallest(3, "RMSE")
+----
+
+**5e. Use the code below to plot the test-set forecasts for selected capitals from different DTW clusters. Examine how well the predicted temperatures track the observed values across the seasonal cycle. In 1–2 sentences, explain what visual cues (such as timing of peaks and troughs, smoothness, or systematic bias) indicate good or poor model performance and how these observations relate to differences in RMSE across capitals.**
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+for _, row in top3.iterrows():
+
+ state = row["State"]
+ capital = row["Capital"]
+ cluster = int(row["Cluster"])
+ score = row["RMSE"]
+
+ train_obs = (
+ train_df
+ .query("State == @state and Capital == @capital")
+ .set_index("DATE")["WeeklyAvgTemp"]
+ .asfreq("W-SUN")
+ .interpolate())
+
+ test_obs = (
+ test_df
+ .query("State == @state and Capital == @capital")
+ .set_index("DATE")["WeeklyAvgTemp"]
+ .asfreq("W-SUN")
+ .interpolate())
+
+ _, y_pred = predictions[(state, capital)]
+
+ plt.figure(figsize=(11, 4))
+
+ # Training observations
+ plt.plot(train_obs, label="Train (Observed)", linewidth=2)
+
+ # Explicit connection from train → test
+ plt.plot(
+ [train_obs.index[-1], test_obs.index[0]],
+ [train_obs.iloc[-1], test_obs.iloc[0]],
+ color="black",
+ linewidth=2)
+
+ # Test observations
+ plt.plot(test_obs, label="Test (Observed)", color="black", linewidth=2)
+ # Test predictions
+ plt.plot(y_pred, "--", linewidth=2, label="Test (Predicted)")
+
+ # Train/test split marker
+ plt.axvline(
+ test_obs.index.min(),
+ color="gray",
+ linestyle=":",
+ label="Train/Test Split")
+ plt.title(f"{capital}, {state} — Cluster {cluster}\nRMSE = {score:.2f}")
+ plt.xlabel("Date")
+ plt.ylabel("Weekly Avg Temperature")
+ plt.legend()
+ plt.tight_layout()
+ plt.show()
+----
+
+====
+
+=== Question 6 - Evaluating the Model on Each Cluster (2 points)
+
+
+.Deliverables
+====
+**6a. Using the rmse_df created in Question 5, compute the average RMSE for each DTW cluster. This summary provides a high-level view of how well the cluster-level SARIMAX models perform across different temperature pattern groups.**
+
+[source,python]
+----
+# Students fill in
+# cluster_rmse_summary = (
+# rmse_df
+# .groupby("_____") # For YOU to fill in
+# ["_____"] # For YOU to fill in
+# .mean()
+# .reset_index()
+# )
+
+# cluster_rmse_summary
+----
+
+**6b. Identify the Best and Worst Performing Clusters by sorting the clusters by their average RMSE to identify which clusters are easiest and hardest to forecast. Which DTW clusters exhibit the lowest and highest average RMSE?**
+
+Hint: Use `sort_values`
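As a generic illustration of the groupby-then-sort pattern (using made-up column names and numbers, not the project's `rmse_df`):

```python
import pandas as pd

# Hypothetical per-location errors labeled by group
toy = pd.DataFrame({
    "Group": ["A", "A", "B", "B"],
    "Error": [2.0, 4.0, 1.0, 1.0],
})

# Average the errors within each group, then sort from best to worst
summary = (
    toy.groupby("Group")["Error"]
    .mean()
    .reset_index()
    .sort_values("Error")
)
```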
+
+**6c. In 2–3 sentences, reflect on whether modeling at the DTW cluster level appears to be a reasonable alternative to fitting separate models for each capital. Then based on the RMSE results, discuss one or two ways you might further improve forecasting accuracy.**
+====
+
+== References
+Stent, M. (2024, April 9). Dynamic Time Warping: An introduction. Medium. https://medium.com/@markstent/dynamic-time-warping-a8c5027defb6
+
+
+== Submitting your Work
+
+Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
+
+.Items to submit
+====
+- firstname_lastname_project1.ipynb
+====
+
+[WARNING]
+====
+You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
+
+You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
+====
\ No newline at end of file
diff --git a/tools-appendix/modules/python/pages/linear_regression.adoc b/tools-appendix/modules/python/pages/linear_regression.adoc
index ebd6a59d6..26db94339 100644
--- a/tools-appendix/modules/python/pages/linear_regression.adoc
+++ b/tools-appendix/modules/python/pages/linear_regression.adoc
@@ -221,14 +221,13 @@ Parts of these explanations have been adapted from *Applied Statistics with R* (
- We can expand the model by adding more features (`loudness`, `danceability`, `energy`, `valence`, etc.) for better predictions -> this is called **Multiple Linear Regression** which we will explain in question 2.
+
+== Reading and Preparing the Data
+
-== Questions
+In this section, we will load the Spotify popularity dataset and prepare it for modeling. This includes reading the data into Python, removing unnecessary columns, and separating the prediction target from the feature variables.
-=== Question 1 Reading and Preparing the Data (2 points)
+
+=== Step 1: Read the Data
+
-.Deliverables
-====
-**1a. Read in the data and print the first five rows of the dataset. Save the dataframe as `spotify_popularity_data`.**
+We begin by reading the dataset into Python using pandas. After loading the data, we print the first five rows to get an initial sense of the structure and variables included in the dataset. We will store the dataset in a DataFrame called `spotify_popularity_data`.
[source,python]
----
@@ -239,9 +238,12 @@ spotify_popularity_data = pd.read_csv("/anvil/projects/tdm/data/spotify/linear_r
spotify_popularity_data.head()
----
-**1b. Use the code provided to drop the columns listed from `spotify_popularity_data`. After dropping them, print the columns still in the data.**
+=== Step 2: Remove Unnecessary Columns
+
+The original dataset contains many columns that are not needed for our analysis, such as identifiers, text-based fields, and metadata related to albums and artists. To simplify the dataset and focus on numeric features relevant for modeling, we drop these columns.
-_Note: For more information on the drop function in pandas you can go here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html[here]._
+For more information on the `drop` function, you can refer to the pandas documentation:
+https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
[source,python]
----
@@ -249,16 +251,18 @@ drop_cols = [
"Unnamed: 0", "Unnamed: 0.1", "track_id", "track_name", "available_markets", "href",
"album_id", "album_name", "album_release_date", "album_type",
"artists_names", "artists_ids", "principal_artist_id",
- "principal_artist_name", "artist_genres", "analysis_url", "duration_min"]
+ "principal_artist_name", "artist_genres", "analysis_url", "duration_min"
+]
# Drop unnecessary columns
spotify_popularity_data = spotify_popularity_data.drop(columns=drop_cols)
-for col in spotify_popularity_data:
+for col in spotify_popularity_data:
print(col)
-
----
+After dropping these columns, the dataset now includes the following variables:
+
[source,python]
----
popularity
@@ -283,10 +287,11 @@ valence
year
----
+=== Step 3: Define the Target and Features
-**1c. Use the code provided to set up your prediction target and features. Then, print the shape of `X` and `y` using `.shape()`. **
+Next, we separate the dataset into a prediction target and a set of feature variables.
+
-_Note: We are using the “popularity” column as y, and use all the other columns as X._
+In this project, we are interested in predicting song popularity. Therefore, we use the `popularity` column as our target variable `y`, and all remaining columns as the feature matrix `X`.
[source,python]
----
@@ -299,13 +304,14 @@ print(X.shape)
print(y.shape)
----
+This confirms that we have 1,500 observations and 19 feature variables, with one corresponding popularity value for each observation.
+
[source,python]
----
(1500, 19)
(1500,)
----
-====
=== Question 2 Splitting the Data and Understanding the Data (2 points)
@@ -367,24 +373,30 @@ In supervised learning, our dataset is split into *predictors (`X`)* and a *targ
====
In practice, it's recommended to use **cross-validation**, which provides a more reliable estimate of a model’s performance by repeatedly splitting the data into training and validation sets and averaging the results. This helps reduce the variability that can come from a single random split. However, for this project, we will only perform a single random train/test split using a fixed random seed.
====
+== Splitting the Data and Exploring the Target Variable
-.Deliverables
-====
-**2a. Use the code provided to create an 80/20 train/test split (use random_state=42). Then, print the shapes of X_train, X_test, y_train, and y_test using `.shape()`.**
+Now that the data has been cleaned and structured, we move on to splitting the dataset into training and testing sets and exploring the distribution of the response variable. These steps are essential for building and evaluating predictive models.
+
+=== Step 1: Create a Train/Test Split
+To evaluate model performance fairly, we split the dataset into training and testing subsets. We use an 80/20 split, where 80% of the data is used for training and 20% is reserved for testing. A fixed `random_state` is set to ensure reproducibility.
[source,python]
----
from sklearn.model_selection import train_test_split
-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+X_train, X_test, y_train, y_test = train_test_split(
+ X, y, test_size=0.2, random_state=42
+)
-print("X_train shape:", X_train.shape)
-print("X_test shape:", X_test.shape)
-print("y_train shape:", y_train.shape)
-print("y_test shape:", y_test.shape)
+print("X_train shape:", X_train.shape)
+print("X_test shape:", X_test.shape)
+print("y_train shape:", y_train.shape)
+print("y_test shape:", y_test.shape)
----
+This confirms that the training set contains 1,200 observations and the test set contains 300 observations.
+
[source,python]
----
X_train shape: (1200, 19)
@@ -393,9 +405,9 @@ y_train shape: (1200,)
y_test shape: (300,)
----
-**2b. Generate a histogram of y_train (popularity) using the code provided. Be sure to include clear axis labels and a title for the plot.**
+=== Step 2: Examine the Distribution of Popularity
-Note: See documentation on using `.histplot` in seaborn library https://seaborn.pydata.org/generated/seaborn.histplot.html[here].
+Before fitting any models, it is helpful to understand the distribution of the target variable. We begin by visualizing the popularity values in the training set using a histogram.
[source,python]
----
@@ -403,45 +415,40 @@ import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8,5))
-sns.histplot(y, bins=30, kde=True, color="skyblue")
-plt.xlabel("Popularity")
-plt.ylabel("Count")
-plt.title("Distribution of Track Popularity")
+sns.histplot(y_train, bins=30, kde=True)
+plt.xlabel("Popularity")
+plt.ylabel("Count")
+plt.title("Distribution of Track Popularity (Training Set)")
plt.show()
----
image::distribution_track_popularity.png[Distribution of Track Popularity, width=600, height=450, loading=lazy, title="Track Popularity"]
-**2c. Examine the plot above and determine whether the distribution appears roughly symmetric. In 2–3 sentences, note your observations of it's skewness and distribution (mean, min, max).**
+After examining the histogram, we observe that the popularity distribution appears roughly symmetric and approximately normal. There is a slight right tail near the upper bound (around 100), but no strong outliers. Most popularity values fall between roughly 45 and 100, with the mean near the upper-middle of this range.
+
+=== Step 3: Explore the Relationship Between Popularity and Duration
+
-Students should note that the y_train looks roughly normal and symmetric. They may note additional observations such as slight right tail near the cap (~99) but no strong outliers. Mean is arount 70 and min is 45 while max is around 100.
-
-
-**2d. Using the provided code, generate a scatterplot of popularity versus duration (in minutes) and include a fitted regression line. In 2–3 sentences, describe (1) the relationship you observe between the two variables, and (2) how the regression line is constructed to represent the overall trend in the data using residuals. Make sure to include labels for the plot.**
+Next, we explore the relationship between song duration and popularity. Since duration is measured in milliseconds, we first convert it to minutes for easier interpretation. We then create a scatterplot using the training data and overlay a fitted linear regression line.
[source,python]
----
-import matplotlib.pyplot as plt
-import seaborn as sns
-
# Convert duration to minutes for training data
duration_min_train = X_train["duration_ms"] / 60000
plt.figure(figsize=(8,5))
sns.scatterplot(x=duration_min_train, y=y_train, alpha=0.6)
-sns.regplot(x=duration_min_train, y=y_train, scatter=False, color="red", ci=None)
+sns.regplot(x=duration_min_train, y=y_train, scatter=False, ci=None)
-plt.xlabel("Duration (minutes)")
-plt.ylabel("Popularity") )
-plt.title("Popularity vs. Duration (train set)") )
+plt.xlabel("Duration (minutes)")
+plt.ylabel("Popularity")
+plt.title("Popularity vs. Duration (Training Set)")
plt.show()
-
----
image::RegressionLineSpotify.png[Regression line for Spotify data, width=600, height=450, loading=lazy, title="Scatter Plot with a Simple Linear Regression Line - Spotify Data"]
-====
+From the scatterplot, we observe a weak relationship between track duration and popularity, with popularity varying widely across different song lengths. The regression line represents the best-fitting linear trend by minimizing the sum of squared residuals, providing a simple summary of the overall direction of the relationship despite substantial variability in the data.
+
=== Question 3 Checking for Multicollinearity and Influential Points (2 points)
@@ -511,13 +518,15 @@ How we use it:
- `n` = number of rows in your training data.
- As `n` gets larger, `4/n` gets smaller, the bar for “unusually influential” gets stricter.
+
+== Addressing Multicollinearity and Influential Observations
+
+Before fitting a linear regression model, it is important to ensure that the feature set does not suffer from severe multicollinearity and that influential outliers are appropriately handled. In this section, we use Variance Inflation Factor (VIF) to diagnose multicollinearity and Cook’s Distance to identify influential observations in the training data.
-.Deliverables
-====
-**3a. Using the provided code, keep only the numeric columns and compute the Variance Inflation Factor (VIF) values. Be sure to specify the threshold 10 in the function.**
+
+=== Step 1: Evaluate Multicollinearity Using VIF
+
+We begin by assessing multicollinearity among the predictor variables using the Variance Inflation Factor (VIF). VIF measures how strongly each feature is linearly related to the other features in the model. Large VIF values indicate high multicollinearity, which can inflate standard errors and destabilize coefficient estimates.
-_Note: The function is provided and operates iteratively by removing the variable with the highest VIF at each step until all remaining variables have VIF values less than or equal to the chosen threshold (commonly set at 10). Your task is to run the function and fill in the appropriate threshold._
+Before computing VIF, we ensure that all variables are numeric by converting boolean columns to integers. We then use a provided iterative function that repeatedly removes the variable with the highest VIF until all remaining variables have VIF values below a specified threshold. In this project, we use a commonly accepted threshold of 10.
[source,python]
----
@@ -531,12 +540,15 @@ if len(bool_cols):
X_train[bool_cols] = X_train[bool_cols].astype(int)
-def calculate_vif_iterative(X, thresh=10.0):
+def calculate_vif_iterative(X, thresh=10.0):
X_ = X.astype(float).copy()
while True:
vif_df = pd.DataFrame({
"variable": X_.columns,
- "VIF": [variance_inflation_factor(X_.values, i) for i in range(X_.shape[1])]
+ "VIF": [
+ variance_inflation_factor(X_.values, i)
+ for i in range(X_.shape[1])
+ ]
}).sort_values("VIF", ascending=False).reset_index(drop=True)
max_vif = vif_df["VIF"].iloc[0]
@@ -547,13 +559,11 @@ def calculate_vif_iterative(X, thresh=10.0):
print(f"Dropping '{worst}' with VIF={max_vif:.2f}")
X_ = X_.drop(columns=[worst])
-
-
----
-**3b. Using the provided code, keep only the columns with VIF ≤ 10 and update the X_train dataset. Then, print the kept columns along with their VIF using `vif_summary`.**
+=== Step 2: Filter Features Based on VIF
-_Note: Your task is to print the VIF summary table._
+Next, we run the iterative VIF function using a threshold of 10. The function returns two objects: a filtered version of `X_train` containing only variables with acceptable VIF values, and a summary table listing the remaining variables and their corresponding VIFs.
[source,python]
----
@@ -565,6 +575,8 @@ X_train = result_vif[0]
vif_summary = result_vif[1]
----
+During this process, several highly collinear variables are removed:
+
[source,python]
----
Dropping 'year' with VIF=320.29
@@ -575,10 +587,7 @@ Dropping 'danceability' with VIF=19.10
Dropping 'tempo' with VIF=12.17
----
-[source,python]
-----
-vif_summary
-----
+The remaining variables and their VIF values are shown below:
[cols="1,1", options="header"]
|===
@@ -598,43 +607,44 @@ vif_summary
| duration_ms | 9.724675
|===
+All remaining features now have VIF values at or below the threshold, indicating acceptable levels of multicollinearity.
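The behavior of VIF can be demonstrated on synthetic data (not the Spotify features): a nearly duplicated column receives a very large VIF, while an independent column stays close to 1:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)                  # unrelated to the others

X = np.column_stack([x1, x2, x3])
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
# The collinear pair gets large VIFs; the independent column does not
```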
+
+=== Step 3: Identify Influential Observations Using Cook’s Distance
+
+After addressing multicollinearity, we examine the training data for influential observations using Cook’s Distance. Cook’s Distance measures how much each observation influences the fitted regression coefficients. Observations with unusually large Cook’s Distance values can disproportionately affect the model’s results.
-**3c. Use the provided code to calculate Cook’s Distance and identify potential outliers. Use the `.drop(index=____)` function on both X_train and y_train to remove `cooks_outliers`.**
-
-_Note: This code identifies influential outliers in the training data using Cook’s Distance. It begins by aligning and cleaning X_train and y_train, then fits a regression model. Cook’s Distance is computed for each observation, and any values exceeding the threshold 4/n are flagged as influential points. Your task is to ensure these flagged observations are removed from both the X_train and y_train dataframes._
+We fit a linear regression model on the cleaned training data and compute Cook’s Distance for each observation. Any observation with Cook’s Distance greater than $4/n$, where $n$ is the number of observations, is flagged as potentially influential.
[source,python]
----
-import numpy as np
-import pandas as pd
import statsmodels.api as sm
# Align and clean training data
X_train_cook = X_train.loc[X_train.index.intersection(y_train.index)]
y_train_cook = y_train.loc[X_train_cook.index]
-# Keep only rows without missing/infinite values
+# Keep only rows without missing or infinite values
mask = np.isfinite(X_train_cook).all(1) & np.isfinite(y_train_cook.to_numpy())
X_train_cook, y_train_cook = X_train_cook.loc[mask], y_train_cook.loc[mask]
# Fit model on the cleaned data
-ols_model_cook = sm.OLS(y_train_cook, sm.add_constant(X_train_cook, has_constant="add")).fit()
+ols_model_cook = sm.OLS(
+ y_train_cook,
+ sm.add_constant(X_train_cook, has_constant="add")
+).fit()
-# Cook's Distance values for each observation
+# Cook's Distance values
cooks_distance = ols_model_cook.get_influence().cooks_distance[0]
cooks_threshold = 4 / len(X_train_cook)
-# Identify outlier indices
+# Identify influential observations
cooks_outliers = X_train_cook.index[cooks_distance > cooks_threshold]
print(f"Flagged {len(cooks_outliers)} outliers (Cook's D > 4/n).")
-
-X_train = X_train.drop(index=cooks_outliers)
-y_train = y_train.drop(index=cooks_outliers)
-
+# Remove influential observations
+X_train = X_train.drop(index=cooks_outliers)
+y_train = y_train.drop(index=cooks_outliers)
----
[source,python]
@@ -642,11 +652,10 @@ y_train = y_train.drop(index=cooks_outliers)
Flagged 61 outliers (Cook's D > 4/n).
----
+=== Step 4: Reflection on Model Stability
-**3d. In 2–3 sentences, explain why it is important to (1) remove features with high multicollinearity and (2) remove outliers identified by Cook’s Distance before building a linear regression model.**
+Removing features with high multicollinearity is important because strongly correlated predictors can inflate standard errors and make it difficult to interpret which variables are truly associated with the response. Similarly, outliers identified by Cook’s Distance can exert an outsized influence on regression coefficients and distort the model’s fit. By addressing both multicollinearity and influential observations, we build a linear regression model that is more stable, reliable, and interpretable.
-Students should get full credit if they write something along the lines like: Removing features with high multicollinearity is important because correlated predictors can inflate standard errors and make it difficult to interpret which variables are truly significant. Outliers identified by Cook’s Distance can have an outsized influence on regression coefficients, which may distort the model’s fit. By addressing both, we build a regression model that is more stable, reliable, and interpretable.
-====
=== Question 4 Feature Selection and Model Summary (2 points)
@@ -689,12 +698,15 @@ While we are using AIC in this project, you could also use:
Each criterion has trade-offs. AIC is popular because it balances model fit and complexity, making it a solid choice when comparing linear regression models. For consistency, we'll use AIC throughout this project.
====
+== Feature Selection and Model Interpretation
-.Deliverables
-====
-**4a. Use the code provided below to perform feature selection using stepwise selection with the AIC criterion. Then write 1–2 sentences explaining how stepwise selection with AIC works and why feature selection is useful in model building.**
+With a cleaned feature set and influential observations removed, we are now ready to build a linear regression model. Before fitting a final model, we perform feature selection to identify a subset of predictors that best explain variation in song popularity while balancing model complexity and goodness of fit.
-_Note: The function has been provided. Your task is to run it successfully and then write 1-2 sentences about feature selection._
+
+=== Step 1: Perform Feature Selection Using Stepwise AIC
+
+To select an appropriate set of predictors, we use stepwise selection with the Akaike Information Criterion (AIC). AIC rewards models that fit the data well but penalizes unnecessary complexity, helping to prevent overfitting.
+
+The provided function implements a hybrid stepwise approach. At each iteration, it considers adding variables (forward selection) while also checking whether any previously added variables should be removed (backward pruning) if doing so further reduces AIC.
[source,python]
----
@@ -715,17 +727,24 @@ def stepwise_aic(X_train, y_train, max_vars=None, verbose=True):
best_aic, best_var = min(aics, key=lambda t: t[0])
if best_aic + 1e-6 < current_aic:
- selected.append(best_var); remaining.remove(best_var); current_aic = best_aic
- if verbose: print(f"+ {best_var} (AIC {best_aic:.2f})")
- # backward
+ selected.append(best_var)
+ remaining.remove(best_var)
+ current_aic = best_aic
+ if verbose:
+ print(f"+ {best_var} (AIC {best_aic:.2f})")
+
+ # Backward step
improved = True
while improved and len(selected) > 1:
backs = [(sm.OLS(y, sm.add_constant(X[[v for v in selected if v != d]], has_constant='add')).fit().aic, d)
for d in selected]
back_aic, drop_var = min(backs, key=lambda t: t[0])
if back_aic + 1e-6 < current_aic:
- selected.remove(drop_var); remaining.add(drop_var); current_aic = back_aic
- if verbose: print(f"- {drop_var} (AIC {back_aic:.2f})")
+ selected.remove(drop_var)
+ remaining.add(drop_var)
+ current_aic = back_aic
+ if verbose:
+ print(f"- {drop_var} (AIC {back_aic:.2f})")
else:
improved = False
else:
@@ -735,20 +754,24 @@ def stepwise_aic(X_train, y_train, max_vars=None, verbose=True):
return selected, model
----
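
For intuition about the criterion itself: statsmodels computes AIC as `2k - 2*log-likelihood`, where `k` is the number of estimated parameters, so a predictor must improve the fit by enough to offset its complexity penalty. The following sketch is purely illustrative (synthetic data, made-up variable names, not the project dataset) and shows the two AIC values being compared; with most random seeds, the extra noise column slightly raises AIC.

[source,python]
----
# Illustrative only: how AIC trades goodness of fit against complexity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)            # unrelated to y
y = 3.0 * x_useful + rng.normal(size=n)

X1 = sm.add_constant(x_useful)
X2 = sm.add_constant(np.column_stack([x_useful, x_noise]))

m1 = sm.OLS(y, X1).fit()
m2 = sm.OLS(y, X2).fit()

# statsmodels computes AIC as 2k - 2*log-likelihood
print(f"AIC, useful predictor only:   {m1.aic:.2f}")
print(f"AIC, with extra noise column: {m2.aic:.2f}")
----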
-Students should get full credit if they run the code succesfully and also explain feature selection. They should also write something along the lines: Stepwise selection is a hybrid approach: it adds variables like forward selection but also considers removing variables at each step (backward pruning) if doing so reduces AIC further.
+Stepwise selection is useful because it helps identify a parsimonious model that balances interpretability and predictive performance. By reducing redundancy and excluding weak predictors, feature selection often leads to more stable coefficient estimates and better generalization.
+
+=== Step 2: Run Stepwise Selection on the Training Data
-**4b. Use the provided code to run the stepwise_aic function on your training data. Then print the `selected_cols` and interpret the results of the feature selection method in 1-2 sentences.**
+Next, we apply the stepwise AIC function to the training data. The function returns both the selected feature names and a fitted OLS model using those features.
[source,python]
----
results_feature_selection = stepwise_aic(X_train, y_train, max_vars=None, verbose=True)
-selected_cols = results_feature_selection[0] # list of features
+selected_cols = results_feature_selection[0] # list of selected features
model = results_feature_selection[1] # fitted OLS model
print(selected_cols)
----
+During the selection process, features are added sequentially as long as they improve the AIC:
+
[source,python]
----
+ principal_artist_followers (AIC 8054.83)
@@ -764,77 +787,22 @@ print(selected_cols)
+ speechiness (AIC 7857.16)
----
+The final selected model includes variables related to artist popularity, audio characteristics, and track structure. This suggests that both musical features and artist-level signals contribute to explaining song popularity.
-**4c. Print the `model` summary and write 2-3 sentences interpreting the results about the variables in the model and their relationship to our target variable `popularity`.**
+=== Step 3: Interpret the Final Regression Model
-_Hint: for printing the model summary, use the `.summary()` function to see the summary of the model._
+We now examine the summary of the fitted regression model to interpret coefficient estimates, statistical significance, and overall model performance.
[source,python]
----
print(model.summary())
----
-[cols="2,2", options="header"]
-|===
-| **OLS Regression Results** |
-
-| **Dependent Variable:** | popularity
-| **Model:** | OLS
-| **Method:** | Least Squares
-| **Date:** | Wed, 15 Oct 2025
-| **Time:** | 08:43:29
-| **No. Observations:** | 1139
-| **Df Residuals:** | 1127
-| **Df Model:** | 11
-| **R-squared:** | 0.289
-| **Adj. R-squared:** | 0.282
-| **F-statistic:** | 41.60
-| **Prob (F-statistic):** | 7.07e-76
-| **Log-Likelihood:** | -3916.6
-| **AIC:** | 7857.0
-| **BIC:** | 7918.0
-| **Covariance Type:** | nonrobust
-|===
-
-.Coefficients
-[cols="1,1,1,1,1,1,1", options="header"]
-|===
-| Variable | Coef | Std Err | t | P>|t| | [0.025 | 0.975]
-| const | 82.1306 | 1.483 | 55.366 | 0.000 | 79.220 | 85.041
-| principal_artist_followers | 1.706e-07 | 1.32e-08 | 12.938 | 0.000 | 1.45e-07 | 1.97e-07
-| duration_ms | -2.986e-05 | 4.1e-06 | -7.287 | 0.000 | -3.79e-05 | -2.18e-05
-| loudness | 0.6164 | 0.074 | 8.274 | 0.000 | 0.470 | 0.763
-| album_total_tracks | -0.1788 | 0.037 | -4.819 | 0.000 | -0.252 | -0.106
-| acousticness | 5.4814 | 0.972 | 5.642 | 0.000 | 3.575 | 7.388
-| explicit | 2.7251 | 0.629 | 4.336 | 0.000 | 1.492 | 3.958
-| mode | -1.9647 | 0.497 | -3.956 | 0.000 | -2.939 | -0.990
-| valence | -3.5808 | 0.969 | -3.696 | 0.000 | -5.482 | -1.680
-| track_number | -0.1283 | 0.059 | -2.173 | 0.030 | -0.244 | -0.012
-| key | 0.1134 | 0.064 | 1.766 | 0.078 | -0.013 | 0.239
-| speechiness | -4.6218 | 2.966 | -1.558 | 0.119 | -10.442 | 1.198
-|===
-
-.Diagnostic Tests
-[cols="2,2", options="header"]
-|===
-| **Omnibus:** | 43.762
-| **Prob(Omnibus):** | 0.000
-| **Jarque-Bera (JB):** | 20.356
-| **Prob(JB):** | 3.80e-05
-| **Skew:** | 0.077
-| **Kurtosis:** | 2.363
-| **Durbin-Watson:** | 1.960
-| **Cond. No.:** | 2.71e+08
-|===
-
+The model explains approximately 29% of the variability in song popularity ($R^2 = 0.289$), which is reasonable given the complexity of popularity as an outcome. Several predictors, such as `principal_artist_followers`, `loudness`, `acousticness`, and `explicit`, have statistically significant positive relationships with popularity. Other variables, including `album_total_tracks`, `mode`, and `track_number`, show significant negative associations.
+Not all variables are statistically significant at conventional levels; for example, `key` and `speechiness` have larger p-values, suggesting weaker evidence of association with popularity. Overall, the results indicate that while the model captures meaningful trends, there remains substantial unexplained variability, leaving room for improvement through additional features or alternative modeling approaches.
-Students should get full credit if they write 2-3 sentences interpreting the summary results of the model. For example they might say that some variables, like key, explicit, and acousticness, have a positive effect on popularity. Others, like album_total_tracks, mode, and track_number, are negative predictors. Not all variables are significant though, for example, key and speechiness do not have a singificant p-value. They may note that R^2 isn't really that high and can be improved as well.
-
-
-====
-
=== Question 5 Checking Assumptions of Linear Regression Model (2 points)
@@ -898,10 +866,13 @@ What to look for:
image::fittedvsresspotify.png[width=600, height=450, caption="Fitted vs Residuals Plot Spotify Data"]
+== Model Diagnostics and Assumption Checking
-.Deliverables
-====
-**5a. Use the provided code to test for normality assumption. Make sure to label the plot and write 2-3 sentences on whether or not you think the model passes the normality assumption.**
+After fitting the linear regression model, it is important to evaluate whether the key assumptions of linear regression are reasonably satisfied. In this section, we examine the normality, independence, and homoscedasticity assumptions by analyzing the model residuals.
+
+=== Step 1: Check Normality Using a Histogram of Residuals
+
+We begin by examining the normality of the residuals using a histogram of standardized residuals overlaid with a standard normal density curve. Standardizing the residuals allows us to directly compare their distribution to a normal distribution with mean 0 and variance 1.
[source,python]
----
@@ -915,18 +886,22 @@ resid = model.resid
z = (resid - np.mean(resid)) / np.std(resid, ddof=1)
# Histogram with normal curve
-plt.hist(z, bins=30, density=True, alpha=0.6, color="skyblue", edgecolor="black")
+plt.hist(z, bins=30, density=True, alpha=0.6, edgecolor="black")
x = np.linspace(z.min(), z.max(), 100)
plt.plot(x, stats.norm.pdf(x, 0, 1), "r", linewidth=2)
plt.title("Histogram of Standardized Residuals with Normal Curve")
+plt.xlabel("Standardized Residuals")
+plt.ylabel("Density")
plt.show()
----
-image::histogramstandarizedresiduals.png[width=600, height=450, caption="Histogram of Standarized Residuals"]
+image::histogramstandarizedresiduals.png[width=600, height=450, caption="Histogram of Standardized Residuals"]
-Students should get full credit if they also write 2-3 sentences on whether the model passes the normality assumption. The histogram looks normal visually, so it looks like it does pass.
+The histogram appears approximately symmetric and closely follows the overlaid normal curve, suggesting that the residuals are reasonably normally distributed. While minor deviations may exist in the tails, there is no strong evidence of severe non-normality. Overall, the model appears to satisfy the normality assumption.
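
Visual checks like this can be complemented with a formal test. As an illustrative aside (not part of the original assignment), D'Agostino's K-squared test from `scipy.stats.normaltest` returns a p-value, where small values argue against normality. On simulated residuals:

[source,python]
----
# Illustrative only: a formal normality test on simulated residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_resid = rng.normal(size=500)        # behaves like well-specified residuals
skewed_resid = rng.exponential(size=500)   # clearly non-normal

stat_n, p_n = stats.normaltest(normal_resid)
stat_s, p_s = stats.normaltest(skewed_resid)

# Large p-value: no evidence against normality; tiny p-value: reject normality
print(f"normal sample p-value: {p_n:.3f}")
print(f"skewed sample p-value: {p_s:.2e}")
----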
-**5b. Use the provided code to test for normality using the q-q plot. Make sure to label your plot and interpret the results in 2-3 sentences.**
+=== Step 2: Check Normality Using a Q–Q Plot
+
+Next, we assess residual normality using a quantile–quantile (Q–Q) plot. In a Q–Q plot, normally distributed residuals should fall approximately along the 45-degree reference line.
[source,python]
----
@@ -936,18 +911,16 @@ plt.title("Q–Q Plot of Residuals")
plt.show()
----
-image::qqplotresiduals.png[width=600, height=450, caption="Q-Q Plot of Residuals"]
-
+image::qqplotresiduals.png[width=600, height=450, caption="Q–Q Plot of Residuals"]
+The residuals closely follow the reference line, with only slight deviations at the extremes. This visual evidence further supports the conclusion that the normality assumption is reasonably met for this model.
+
-Students should get full credit if they also write 1-2 sentences on whether the model passes the normality assumption using q-q plot. It does visually, so it looks like it does pass.
+=== Step 3: Test Independence Using the Durbin–Watson Statistic
-
-**5c. Run the code below to test for independence assumption in regression using the Durbin Watson test. Write 2-3 sentences interpreting the results and why testing for independence is important when building a linear regression model.**
+We now test the independence assumption using the Durbin–Watson statistic, which detects autocorrelation in the residuals. Values close to 2 indicate little to no autocorrelation, while values far from 2 suggest potential dependence among residuals.
[source,python]
----
-# Durbin–Watson independence test
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(resid)
@@ -970,17 +943,14 @@ H0: Residuals are independent (no autocorrelation).
-> Pass (approx. independent)
----
-Students should get full credit if they also write 2-3 sentences interpreting the results and why independence is important.
-
-For example, they may write the Durbin-Watson test checks for autocorrelation in the residuals, with values close to 2 indicating independence. Independence is important because if residuals are correlated, it can lead to invalid linear regression results. The model passes the test.
+The Durbin–Watson statistic is very close to 2, indicating that the residuals are approximately independent. Independence is important because correlated residuals can lead to biased standard errors and invalid inference. Based on this test, the model satisfies the independence assumption.
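
To build intuition for the statistic itself, note that for AR(1)-style residuals the Durbin–Watson value is approximately 2(1 − ρ), where ρ is the lag-1 autocorrelation. A small simulation (illustrative, not project data) makes this visible:

[source,python]
----
# Illustrative only: Durbin-Watson on independent vs. autocorrelated residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 2000

iid = rng.normal(size=n)        # independent residuals, DW should be near 2

rho = 0.8                       # strong positive autocorrelation, DW near 2*(1-0.8)
ar1 = np.empty(n)
ar1[0] = rng.normal()
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + rng.normal()

print(f"DW, independent residuals:    {durbin_watson(iid):.2f}")
print(f"DW, autocorrelated residuals: {durbin_watson(ar1):.2f}")
----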
+=== Step 4: Check Linearity and Homoscedasticity Using Residuals vs. Fitted Values
-**5d. Run the code provided below to plot a residuals vs fitted values plot. Make sure to label your plot and interpret the results in 2-3 sentences.**
+Finally, we examine a residuals versus fitted values plot to assess the assumptions of linearity and constant variance (homoscedasticity). Ideally, residuals should be randomly scattered around zero with no clear patterns or changes in spread.
[source,python]
----
-import matplotlib.pyplot as plt
-
plt.scatter(model.fittedvalues, resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residuals vs. Fitted Values")
@@ -989,46 +959,47 @@ plt.ylabel("Residuals")
plt.show()
----
-Students should get full credit if they label the plot and interpret the results in 2-3 sentences. The residuals vs. fitted values plot shows that the residuals are scattered fairly evenly around the horizontal red line at zero. This suggests that the model’s assumptions of linearity and homoscedasticity (equal variance of residuals) are reasonably satisfied.
+image::residualsfittedvals.png[width=600, height=450, caption="Residuals vs. Fitted Values"]
-image::residualsfittedvals.png[width=600, height=450, caption="Residuals Fitted Values"]
+The residuals appear to be evenly scattered around the horizontal reference line with no obvious patterns or funnel shapes. This suggests that the assumptions of linearity and homoscedasticity are reasonably satisfied, supporting the overall validity of the fitted linear regression model.
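
The visual check can also be paired with a formal test for heteroscedasticity. As an illustrative sketch (not part of the original assignment), the Breusch–Pagan test in `statsmodels` regresses squared residuals on the predictors; a small p-value signals non-constant variance. On simulated data whose error spread grows with the predictor:

[source,python]
----
# Illustrative only: Breusch-Pagan test on data with non-constant variance.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, size=n)
# Error spread grows with x, so the homoscedasticity assumption is violated
y = 2.0 + 0.5 * x + rng.normal(scale=x, size=n)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.2e}")
----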
-====
+== Model Evaluation and Predicting Song Popularity
-=== Question 6 Calculating R-squared and Predicting Popularity (2 points)
+In the final stage of the analysis, we evaluate the model’s performance on unseen data and use the fitted regression model to interpret feature effects and make predictions for a new song. These steps help assess how well the model generalizes and how it can be used in practice.
-.Deliverables
-====
-**6a. Use the provided code below to get the $R^2$ on the test set. Write 2-3 sentences interpreting the results of the $R^2$ on the test set.**
+=== Step 1: Evaluate Model Performance on the Test Set
+
+To assess how well the model generalizes to new data, we compute the coefficient of determination ($R^2$) on the test set. This measures the proportion of variability in song popularity that is explained by the model when applied to unseen observations.
[source,python]
----
from sklearn.metrics import r2_score
import statsmodels.api as sm
-# Making sure test data has the same selected columns
+# Ensure test data uses the selected features
X_test = X_test[selected_cols].copy()
mask = np.isfinite(X_test).all(1)
X_test = X_test.loc[mask]
y_test = y_test.loc[X_test.index]
-# Predict with statsmodels OLS
+# Predict using the trained OLS model
y_pred = model.predict(sm.add_constant(X_test, has_constant='add'))
# R² on the test set
-r2 = r2_score(y_test, y_pred)
+r2 = r2_score(y_test, y_pred)
print(f"Test R²: {r2:.3f}")
----
-Students may write that the R^2 can still improved but it is close to the R^2 of the training dataset which means there are not many signs of overfitting.
[source,python]
----
Test R²: 0.250
----
-**6b. Run the code provided to generate a table displaying each variable along with its coefficient, direction of effect (positive or negative), p-value, and significance. Then, write 2–3 sentences interpreting the results. Highlight any findings that stand out or seem surprising.**
+The test set $R^2$ of approximately 0.25 indicates that the model explains about 25% of the variability in song popularity on unseen data. While there is room for improvement, the test $R^2$ is reasonably close to the training $R^2$, suggesting that the model does not suffer from severe overfitting and generalizes moderately well.
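
As a quick sanity check on what `r2_score` reports, R-squared can also be computed by hand as 1 − SS~res~/SS~tot~. A self-contained sketch with made-up numbers (not the project data):

[source,python]
----
# Illustrative only: r2_score equals 1 - SS_res / SS_tot.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([55.0, 60.0, 70.0, 80.0, 65.0])
y_hat = np.array([58.0, 62.0, 68.0, 75.0, 66.0])

ss_res = np.sum((y_true - y_hat) ** 2)              # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(f"manual R2:  {manual_r2:.3f}")
print(f"sklearn R2: {r2_score(y_true, y_hat):.3f}")
----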
+
+=== Step 2: Interpret Feature Effects and Statistical Significance
-_Note: The code has been provided, your task is to interpret the table's results._
+Next, we examine a summary table that displays each selected variable’s coefficient, direction of effect, p-value, and statistical significance. This allows us to interpret which features are most strongly associated with popularity and whether their effects are statistically meaningful.
[source,python]
----
@@ -1036,24 +1007,23 @@ import pandas as pd
alpha = 0.05 # significance threshold
-# Collect pieces from the statsmodels OLS result
+# Extract model components
coef = model.params.rename("coef")
pval = model.pvalues.rename("p_value")
ci = model.conf_int()
-# Assemble table
+# Build summary table
coef_tbl = (
pd.concat([coef, pval, ci], axis=1)
- .drop(index="const", errors="ignore") # drop intercept row
+ .drop(index="const", errors="ignore")
.assign(
effect=lambda d: d["coef"].apply(lambda x: "Positive" if x > 0 else "Negative"),
significant=lambda d: d["p_value"] < alpha
)
[["coef", "effect", "p_value", "significant"]]
- .sort_values("p_value") # most significant first
+ .sort_values("p_value")
)
-# rounding
coef_tbl_rounded = coef_tbl.round({"coef": 4, "p_value": 4})
print(coef_tbl_rounded)
----
@@ -1074,64 +1044,60 @@ key 0.1134 Positive 0.0777 False
speechiness -4.6218 Negative 0.1195 False
----
+Several variables, such as `principal_artist_followers`, `loudness`, `acousticness`, and `explicit`, show strong and statistically significant positive relationships with popularity. Other features, including `album_total_tracks`, `valence`, and `track_number`, are associated with lower popularity. Not all variables are statistically significant; for example, `key` and `speechiness` have p-values above 0.05, suggesting weaker evidence of association.
-**6c. Using the code below, predict the popularity of a new song with the following features using the trained model:**
-
-Principal artist followers: 5,000,000
-
-Duration: 210,000 ms (~3.5 minutes)
+=== Step 3: Predict Popularity for a New Song
-Loudness: -6.0
+Finally, we use the trained regression model to predict the popularity of a new song based on its known characteristics. This demonstrates how the model can be applied to real-world scenarios.
-Album total tracks: 10
+The new song has the following features:
-Acousticness: 0.20
-
-Explicit: 0 (not explicit)
-
-Mode: 1 (major key)
-
-Valence: 0.40
-
-Track number: 3
-
-Key: 5 (F major)
-
-Speechiness: 0.05
+* Principal artist followers: 5,000,000
+* Duration: 210,000 ms (approximately 3.5 minutes)
+* Loudness: −6.0
+* Album total tracks: 10
+* Acousticness: 0.20
+* Explicit: 0 (not explicit)
+* Mode: 1 (major key)
+* Valence: 0.40
+* Track number: 3
+* Key: 5 (F major)
+* Speechiness: 0.05
[source,python]
----
import pandas as pd
import statsmodels.api as sm
-# Students will fill these in with the new song's known features
new_song = pd.DataFrame([{
"principal_artist_followers": 5_000_000,
- "duration_ms": 210_000,
- "loudness": -6.0,
+ "duration_ms": 210_000,
+ "loudness": -6.0,
"album_total_tracks": 10,
"acousticness": 0.20,
- "explicit": 0,
- "mode": 1,
+ "explicit": 0,
+ "mode": 1,
"valence": 0.40,
"track_number": 3,
- "key": 5,
+ "key": 5,
"speechiness": 0.05
}])
-# Ensure the model sees the same columns it was trained on
+# Align columns with the trained model
new_song = new_song.reindex(columns=selected_cols)
new_song = new_song.fillna(X_train[selected_cols].median(numeric_only=True))
+
pred_pop = model.predict(sm.add_constant(new_song, has_constant="add"))[0]
print(f"Predicted popularity: {pred_pop:.1f}")
----
[source,python]
----
-Predicted popularity: 68.9
+Predicted popularity: 68.9
----
-====
+Based on the fitted model, the predicted popularity score for this new song is approximately 69. This value reflects the combined influence of artist popularity, audio features, and track characteristics captured by the regression model.
+
== References
diff --git a/tools-appendix/modules/python/pages/logistic_regression.adoc b/tools-appendix/modules/python/pages/logistic_regression.adoc
index 66662aa37..18899d6f5 100644
--- a/tools-appendix/modules/python/pages/logistic_regression.adoc
+++ b/tools-appendix/modules/python/pages/logistic_regression.adoc
@@ -897,40 +897,39 @@ final_model = sm.Logit(y_train, X_train_final).fit()
print(final_model.summary())
print(f"AIC: {final_model.aic}")
----
-
-.Logit Regression Results
-[cols="1,1,1,1,1,1", options="header"]
-|===
-| Term | coef | std err | z | P>|z| | [0.025, 0.975]
-
-| const | -2.3807 | 0.038 | -63.081 | 0.000 | [-2.455, -2.307]
-| Cost_per_claim | 2.4191 | 0.036 | 66.890 | 0.000 | [2.348, 2.490]
-| Prscrbr_Type_Grouped_Primary Care | 1.2536 | 0.045 | 28.164 | 0.000 | [1.166, 1.341]
-| Prscrbr_Type_Grouped_GI/Renal/Rheum | -4.8014 | 0.357 | -13.449 | 0.000 | [-5.501, -4.102]
-| Tot_Day_Suply | -0.7831 | 0.044 | -17.731 | 0.000 | [-0.870, -0.697]
-| Prscrbr_Type_Grouped_Endocrinology | 1.8776 | 0.148 | 12.667 | 0.000 | [1.587, 2.168]
-| Prscrbr_Type_Grouped_Neuro/Psych | -6.7406 | 0.959 | -7.027 | 0.000 | [-8.621, -4.861]
-| Prscrbr_Type_Grouped_Missing | -1.4901 | 0.136 | -10.951 | 0.000 | [-1.757, -1.223]
-| Prscrbr_Type_Grouped_Surgery | -2.5203 | 0.343 | -7.348 | 0.000 | [-3.193, -1.848]
-| Prscrbr_Type_Grouped_Other | -1.6574 | 0.268 | -6.196 | 0.000 | [-2.182, -1.133]
-| Prscrbr_Type_Grouped_Women's Health | -2.0414 | 0.451 | -4.524 | 0.000 | [-2.926, -1.157]
-| Prscrbr_State_Region_South | 0.2295 | 0.043 | 5.301 | 0.000 | [0.145, 0.314]
-|===
-
-[NOTE]
-====
-*Optimization terminated successfully.*
-**Model Information:**
-- Dependent Variable: `Semaglutide_drug`
-- Observations: 29,091
-- Pseudo R²: 0.4312
-- Log-Likelihood: -7640.2
-- LL-Null: -13431
-- AIC: 15304.48
-- Converged: True
-- Covariance Type: nonrobust
-- LLR p-value: 0.000
-====
+[source,text]
+----
+Optimization terminated successfully.
+ Current function value: 0.262632
+ Iterations 10
+ Logit Regression Results
+==============================================================================
+Dep. Variable: Semaglutide_drug No. Observations: 29091
+Model: Logit Df Residuals: 29079
+Method: MLE Df Model: 11
+Date: Tue, 30 Dec 2025 Pseudo R-squ.: 0.4312
+Time: 16:35:04 Log-Likelihood: -7640.2
+converged: True LL-Null: -13431.
+Covariance Type: nonrobust LLR p-value: 0.000
+=======================================================================================================
+ coef std err z P>|z| [0.025 0.975]
+-------------------------------------------------------------------------------------------------------
+const -2.3807 0.038 -63.081 0.000 -2.455 -2.307
+Cost_per_claim 2.4191 0.036 66.890 0.000 2.348 2.490
+Prscrbr_Type_Grouped_Primary Care 1.2536 0.045 28.164 0.000 1.166 1.341
+Prscrbr_Type_Grouped_GI/Renal/Rheum -4.8014 0.357 -13.449 0.000 -5.501 -4.102
+Tot_Day_Suply -0.7831 0.044 -17.731 0.000 -0.870 -0.697
+Prscrbr_Type_Grouped_Endocrinology 1.8776 0.148 12.667 0.000 1.587 2.168
+Prscrbr_Type_Grouped_Neuro/Psych -6.7406 0.959 -7.027 0.000 -8.621 -4.861
+Prscrbr_Type_Grouped_Missing -1.4901 0.136 -10.951 0.000 -1.757 -1.223
+Prscrbr_Type_Grouped_Surgery -2.5203 0.343 -7.348 0.000 -3.193 -1.848
+Prscrbr_Type_Grouped_Other -1.6574 0.268 -6.196 0.000 -2.182 -1.133
+Prscrbr_Type_Grouped_Women's Health -2.0414 0.451 -4.524 0.000 -2.926 -1.157
+Prscrbr_State_Region_South 0.2295 0.043 5.301 0.000 0.145 0.314
+=======================================================================================================
+
+Final AIC: 15304.476049756959
+----
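
One way to read these coefficients, offered here as an illustrative aside: logit coefficients are on the log-odds scale, so exponentiating gives an odds ratio. For example, the `Cost_per_claim` coefficient from the summary above implies that a one-unit increase in the (scaled) predictor multiplies the odds of prescribing by roughly 11:

[source,python]
----
# Illustrative only: converting a logit coefficient to an odds ratio.
import numpy as np

coef_cost_per_claim = 2.4191          # taken from the summary above
odds_ratio = np.exp(coef_cost_per_claim)
print(f"odds ratio for Cost_per_claim: {odds_ratio:.2f}")
----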
@@ -1041,6 +1040,54 @@ print("Test AUC:", roc_auc_score(y_test, final_model.predict(X_test_final)))
image::roccurvepres.png[width=600, height=450, caption="Figure: ROC Curve Comparison"]
+=== Calibration Plot
+
+*What Is a Calibration Plot?*
+
+A calibration plot is a tool used to evaluate how well a classification model’s predicted probabilities align with actual observed outcomes. Rather than measuring how well the model ranks observations, a calibration plot assesses whether the probabilities produced by the model are trustworthy.
+
+To create a calibration plot, predicted probabilities are grouped into bins (for example, 0–0.1, 0.1–0.2, and so on). For each bin, the average predicted probability is plotted against the observed proportion of positive outcomes. The x-axis represents the mean predicted probability, while the y-axis represents the observed fraction of positives.
+
+A perfectly calibrated model will produce points that lie close to the diagonal line $y = x$. Points below this line indicate that the model is overconfident, meaning it predicts probabilities that are too high. Points above the line indicate that the model is underconfident, meaning it predicts probabilities that are too low.
+
+Calibration is especially important in probability-based decision settings, such as healthcare analytics, where predicted probabilities are often used to inform risk assessment and decision-making rather than simple yes/no classifications.
+
+[source,python]
+----
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.calibration import calibration_curve
+
+# Predicted probabilities on the test set
+test_probs = final_model.predict(X_test_final)
+
+# Compute calibration curve
+# n_bins controls how many probability bins are used
+prob_true, prob_pred = calibration_curve(
+ y_test,
+ test_probs,
+ n_bins=10,
+ strategy="uniform"
+)
+
+# Plot calibration curve
+plt.figure(figsize=(7, 7))
+plt.plot(prob_pred, prob_true, marker="o", label="Model")
+plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfectly Calibrated")
+
+plt.xlabel("Mean Predicted Probability")
+plt.ylabel("Observed Proportion of Positives")
+plt.title("Calibration Plot (Test Set)")
+plt.legend()
+plt.grid(True)
+plt.tight_layout()
+plt.show()
+----
+image::CalibrationPlot.png[width=600, height=450, caption="Calibration Plot on the Test Set."]
+
+This calibration plot shows how well the model’s predicted probabilities match the actual outcomes. The dashed diagonal line represents perfect calibration, where predicted probabilities equal the observed proportion of prescribers. A point on the diagonal means the model’s predicted probabilities closely match what actually occurs in the data. A point below the diagonal indicates overconfidence: for instance, predicting an 80% chance when only about 60% of prescribers actually prescribe the drug. Conversely, a point above the diagonal indicates underconfidence, such as predicting a 30% chance when about 50% of prescribers end up prescribing it.
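
If a single-number summary is wanted alongside the plot, one common option (not shown in the original material) is the Brier score: the mean squared difference between predicted probabilities and actual outcomes, where lower is better and 0 is perfect:

[source,python]
----
# Illustrative only: Brier score as a numeric companion to a calibration plot.
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 0])
probs = np.array([0.9, 0.1, 0.8, 0.3])

# Mean of (predicted probability - outcome)^2 over all observations
brier = brier_score_loss(y_true, probs)
print(f"Brier score: {brier:.4f}")  # (0.01 + 0.01 + 0.04 + 0.09) / 4 = 0.0375
----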
+
+
== 6. Making Predictions on New Prescribers
diff --git a/tools-appendix/modules/python/pages/timeseriespart1.adoc b/tools-appendix/modules/python/pages/timeseriespart1.adoc
new file mode 100644
index 000000000..95f45066e
--- /dev/null
+++ b/tools-appendix/modules/python/pages/timeseriespart1.adoc
@@ -0,0 +1,926 @@
+= TDM 40100: Project 8 - Introduction to Time Series
+
+== Project Objectives
+
+This project will introduce you to the essential steps of preparing time series data for analysis and forecasting. Time series data requires special attention during the cleaning and transformation process before applying any machine learning method. You will learn how to handle common challenges such as missing values and repeated timestamps to prepare data for modeling.
+
+.Learning Objectives
+****
+- Understand the unique structure of time series data and why certain cleaning methods are required before analysis.
+- Identify and address common data quality issues such as missing values and duplicate timestamps.
+- Convert and format datetime variables using pandas.
+- Apply time based interpolation and feature transformation techniques to prepare data for time series analysis.
+- Gain experience in structuring raw weather data into a usable format for modeling or forecasting.
+****
+
+[NOTE]
+====
+This guide was originally created as a seminar project, so you may notice it follows a question-and-answer format.
+Even so, it should still serve as a helpful resource to guide you through the process of building your own model. Use 4 cores if you want to replicate this project on Anvil.
+====
+
+== Dataset
+
+The dataset is located at:
+
+- /anvil/projects/tdm/data/noaa_timeseries
+
+
+== Introduction
+
+“Climate is what we expect, weather is what we get.” – often attributed to Mark Twain
+
+image::climate-change.png[width=600, height=450, title="Global 2-meter air temperature anomalies on February 19, 2021, visualized by Climate Reanalyzer, Climate Change Institute, University of Maine."]
+
+Everyone is talking about climate change—but what does the data actually say? In this project, you will explore over 18 years of daily weather from Indianapolis and investigate how climate patterns are shifting over time. Of course, the data will not be a clean, pre-packaged dataset. Real data is messy. Some days are missing. Some values use symbols like `"T"` for trace amounts (small amounts close to 0) or `"M"` for missing entries. Before we can analyze trends or train a forecasting model, we will need to clean and prepare the data.
+
+This dataset comes from link:https://www.ncei.noaa.gov/access/search/data-search/local-climatological-data?pageNum=1[National Oceanic and Atmospheric Administration (NOAA) Local Climatological Data] and includes variables such as daily temperature, precipitation, humidity, wind speed, and snowfall for Indianapolis.
+
+These are the key environmental indicators we will work with:
+
+- `DailyAverageDryBulbTemperature`: Average air temperature
+- `DailyMaximumDryBulbTemperature` and `DailyMinimumDryBulbTemperature`: Max and Min Air Temperature for the day
+- `DailyPrecipitation`: Rainfall or melted snow (inches)
+- `DailyAverageRelativeHumidity`: Moisture in the air
+- `DailyAverageWindSpeed`: Wind patterns (mph)
+- `DailySnowfall`: Snow accumulation (inches)
+
+**Why do we need to use time series analysis for this data?**
+
+Time series methods are specifically designed to work with data where observations are collected at consistent intervals, such as daily, weekly, or monthly, so we can understand how things change over time and make predictions about the future. This dataset is a good fit for time series analysis because it tracks temperature and other weather variables on a daily basis over more than 18 years.
+
+In this case, we’re interested in uncovering patterns in climate data, such as:
+
+- Seasonal trends (e.g., warmer summers, colder winters)
+- Long-term trends (e.g., a gradual increase in temperature due to climate change)
+- Short-term fluctuations (e.g., sudden cold snaps or heatwaves)
+- Forecasting future conditions based on past observations
+
+However, we cannot jump straight into modeling. Before building a time series model, we must ensure:
+
+- Dates are properly formatted and consistently spaced
+- All variables are numeric and in usable units
+- Missing or trace values (close to 0) are handled appropriately
+- The dataset contains no duplicated or skipped dates
+
+This project will guide you through these foundational steps. By the end, you will not only have a cleaned and structured dataset, but you will also understand what makes time series data unique and how to prepare real-world data for powerful forecasting tools like ARIMA or LSTM neural networks.
+
+Time series analysis is a crucial skill in data science, especially for applications in weather forecasting, finance, agriculture, and public health. Mastering the preparation process is your first step toward building models that can anticipate the future.
+
+
+[IMPORTANT]
+====
+We will ask a series of questions to help you explore the dataset before the deliverables section. These are meant to guide your thinking. The **deliverables** listed under each question describe what you will need to submit.
+====
+
+== Guided Exploration: Understanding One Year of Weather Data
+
+Before cleaning or modeling time series data, it is essential to understand how the data is structured. Time series methods assume observations occur at regular intervals and that timestamps and measurements are stored in appropriate formats. In this section, you will explore one year of Indianapolis weather data to determine whether it is ready for time series analysis and to identify any structural issues that must be addressed first.
+
+Our focus here is not on fixing the data yet, but on *observing* how it is recorded: how often measurements occur, whether multiple observations exist per day, and whether variables are stored in usable formats.
+
+=== Inspecting Data Types and Structure
+
+A critical first step in any time series workflow is verifying data types. Time series analysis relies heavily on correctly formatted datetime variables and numeric measurements. If dates are stored as strings or numeric values are stored as text due to special symbols (such as `"T"` for trace precipitation or `"M"` for missing values), many operations—including plotting, resampling, and interpolation—will fail or produce misleading results.
+
+In this dataset, the `DATE` column must be converted to a datetime object so Python can recognize the temporal ordering of observations. Likewise, temperature, precipitation, and wind variables should be numeric wherever possible.
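
As a preview of the kind of conversion involved, here is a hedged sketch on a tiny made-up frame (not the NOAA file itself): `"T"` can be mapped to 0 and `"M"` to missing before coercing the columns to numeric.

[source,python]
----
# Illustrative only: converting NOAA-style "T"/"M" codes and string dates.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "DATE": ["2006-01-01T00:00:00", "2006-01-02T00:00:00"],
    "DailyPrecipitation": ["0.12", "T"],   # "T" = trace amount, close to 0
    "DailySnowfall": ["M", "1.5"],         # "M" = missing value
})

# Parse the timestamp strings into real datetime objects
raw["DATE"] = pd.to_datetime(raw["DATE"])

# Map the special codes, then coerce the remaining strings to floats
for col in ["DailyPrecipitation", "DailySnowfall"]:
    cleaned = raw[col].replace({"T": "0.0", "M": np.nan})
    raw[col] = pd.to_numeric(cleaned, errors="coerce")

print(raw.dtypes)
----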
+
+For reference, Python’s standard built-in data types include:
+
+- *Numeric* – `int`, `float`, `complex`
+- *Sequence Types* – `str`, `list`, `tuple`
+- *Mapping Type* – `dict`
+- *Boolean* – `bool`
+- *Set Types* – `set`, `frozenset`
+- *Binary Types* – `bytes`, `bytearray`, `memoryview`
+
+[small]#Source: https://www.geeksforgeeks.org/python-data-types/#
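+
+A quick check with `type()` shows how the same reading can land in different types depending on how it was recorded:
+
+[source,python]
+----
+print(type("32.5"))  # a temperature stored as text: str
+print(type(32.5))    # stored as a number: float
+print(type(True))    # a flag such as "snow observed": bool
+----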
+
+As you explore the dataset, keep the following guiding questions in mind:
+
+- What time resolution does the data appear to use?
+- Are observations recorded once per day, or multiple times per day?
+- Do duplicate timestamps exist?
+- Which variables are most relevant for daily weather analysis?
+- Are the current data types appropriate for time series modeling?
+
+=== Loading and Previewing the Data
+
+To begin, load the 2006 Indianapolis weather dataset and preview the first few rows. This initial glance can reveal missing values, repeated timestamps, or unexpected reporting formats.
+
+[source,python]
+----
+import pandas as pd
+indy_climatedata_2006 = pd.read_csv(
+ "/anvil/projects/tdm/data/noaa_timeseries/indyclimatedata_2006.csv",
+ low_memory=False
+)
+----
+
+A preview of selected columns is shown below to illustrate the structure of the raw data:
+
+[cols="1,2,1,1,1,3,1,1,1,1", options="header"]
+|===
+| STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | REPORT_TYPE | SOURCE | HourlyAltimeterSetting | HourlyDewPointTemperature
+
+| USW00093819
+| 2006-01-01T00:00:00
+| 39.72517
+| -86.28168
+| 241.1
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+| FM-15
+| 335.0
+| 1013.5
+| -2.0
+
+| USW00093819
+| 2006-01-01T00:00:00
+| 39.72517
+| -86.28168
+| 241.1
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+| SOD
+| NaN
+| NaN
+| NaN
+
+| USW00093819
+| 2006-01-01T00:00:00
+| 39.72517
+| -86.28168
+| 241.1
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+| SOM
+| NaN
+| NaN
+| NaN
+
+| USW00093819
+| 2006-01-01T00:02:00
+| 39.72517
+| -86.28168
+| 241.1
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+| FM-16
+| 343.0
+| 1013.2
+| -2.0
+
+| USW00093819
+| 2006-01-01T00:55:00
+| 39.72517
+| -86.28168
+| 241.1
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+| FM-15
+| 343.0
+| 1012.9
+| -2.2
+|===
+
+=== Examining Dataset Size and Column Types
+
+Next, examine the size of the dataset and inspect the column names and data types. This information helps determine which variables are likely useful for daily weather analysis and whether any columns will require type conversion or cleaning.
+
+[source,python]
+----
+pd.set_option('display.max_rows', None)
+print(indy_climatedata_2006.shape)
+print(indy_climatedata_2006.dtypes)
+----
+
+Pay particular attention to daily weather variables such as:
+
+- `DailyAverageDryBulbTemperature`
+- `DailyMaximumDryBulbTemperature`
+- `DailyMinimumDryBulbTemperature`
+- `DailyPrecipitation`
+- `DailySnowfall`
+- `DailyAverageRelativeHumidity`
+- `DailyAverageWindSpeed`
+
+Many of these variables are numeric and well-suited for analysis, while others, such as precipitation and snowfall, may be stored as objects due to special symbols or missing values.
+
+=== Converting and Exploring the Date Variable
+
+Time series analysis requires a properly formatted datetime variable. Convert the `DATE` column using `pd.to_datetime()` and inspect the result.
+
+[source,python]
+----
+indy_climatedata_2006['DATE'] = pd.to_datetime(indy_climatedata_2006['DATE'])
+print(indy_climatedata_2006['DATE'].head())
+----
+
+[source,python]
+----
+0 2006-01-01 00:00:00
+1 2006-01-01 00:00:00
+2 2006-01-01 00:00:00
+3 2006-01-01 00:02:00
+4 2006-01-01 00:55:00
+Name: DATE, dtype: datetime64[ns]
+----
+
+Notice that multiple observations can occur within the same calendar day, indicating that the dataset contains sub-daily (hourly or event-based) records rather than exactly one row per day.
+
+=== Understanding the Time Series Granularity
+
+Finally, compare the number of unique calendar dates to the total number of rows. This step confirms whether the dataset represents a true daily time series or a higher-frequency dataset that will need to be aggregated.
+
+[source,python]
+----
+indy_climatedata_2006['DATE'].dt.date.nunique()
+----
+
+[source,python]
+----
+365
+----
+
+The dataset covers all 365 calendar days of 2006 but contains far more rows, confirming that multiple observations exist per day. This observation will directly inform how the data must be transformed before applying daily time series models.
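+
+The daily aggregation this implies can be sketched on toy sub-daily data (hypothetical values); the real dataset would be collapsed the same way in a later step:
+
+[source,python]
+----
+import pandas as pd
+
+# Three sub-daily readings spanning two calendar days
+ts = pd.DataFrame({
+    "DATE": pd.to_datetime([
+        "2006-01-01 00:00", "2006-01-01 00:55", "2006-01-02 00:00"
+    ]),
+    "Temp": [30.0, 32.0, 28.0],
+})
+
+# Resample to daily frequency, averaging every reading within each day
+daily = ts.set_index("DATE")["Temp"].resample("D").mean()
+print(daily)  # two rows: 31.0 for Jan 1, 28.0 for Jan 2
+----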
+
+== Combining Weather Data Across Multiple Years
+
+In many real-world data science projects, data does not arrive in a single clean file. Instead, it is often spread across multiple years, files, or sources. Building a usable dataset requires carefully combining these pieces into one coherent whole.
+
+In this section, we will assemble nearly two decades of Indianapolis weather data by stacking yearly files into a single DataFrame. Because each file follows the same structure and uses the same column names, they can be safely appended row by row to form one continuous time series.
+
+Combining the data across years allows us to:
+
+- Examine long-term climate trends in temperature, precipitation, snowfall, and wind,
+- Compare seasonal patterns across different years,
+- Identify gaps, inconsistencies, or incomplete years,
+- Prepare the data for meaningful time series modeling and forecasting.
+
+=== Understanding the File Structure
+
+Each year of weather data is stored in its own CSV file, following a consistent naming pattern:
+
+- `indyclimatedata_2006.csv`
+- `indyclimatedata_2007.csv`
+- `indyclimatedata_2008.csv`
+- ...
+- `indyclimatedata_2024.csv`
+
+You can view the available files on ANVIL using the following command:
+
+[source,bash]
+----
+%%bash
+ls /anvil/projects/tdm/data/noaa_timeseries/
+----
+
+Each file contains weather observations for a single year. Our goal is to stack these files together so that the resulting dataset represents a continuous timeline spanning multiple years.
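+
+On a toy scale, stacking two frames that share a schema looks like this (made-up values):
+
+[source,python]
+----
+import pandas as pd
+
+df_2006 = pd.DataFrame({"DATE": ["2006-12-31"], "Temp": [30.0]})
+df_2007 = pd.DataFrame({"DATE": ["2007-01-01"], "Temp": [28.0]})
+
+# ignore_index=True renumbers the rows 0..n-1 in the combined frame
+stacked = pd.concat([df_2006, df_2007], ignore_index=True)
+print(stacked)
+----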
+
+Conceptually, this process is known as *appending* or *stacking* data: placing one dataset directly on top of another when they share the same columns. This idea is illustrated below:
+
+image::append-data-vis.png[width=600, height=450, title="Figure Source: Combine or Append Data – Main Concepts, The Power User (2019)"]
+
+As you work through this section, consider the following:
+
+- Are some years more complete than others?
+- Do all years appear to contain the same number of observations?
+- How might missing or uneven data affect time series analysis?
+
+=== Loading and Stacking the Yearly Files
+
+To streamline the stacking process, use the function provided below. This function reads each yearly file, adds a `year` column for tracking, and concatenates all files into a single DataFrame.
+
+[source,python]
+----
+import pandas as pd
+
+def load_and_stack_climate_data(start_year=2006, end_year=2024,
+ base_path="/anvil/projects/tdm/data/noaa_timeseries/"):
+ dfs = []
+ for year in range(start_year, end_year + 1):
+ file_path = f"{base_path}indyclimatedata_{year}.csv"
+ try:
+ df = pd.read_csv(file_path, low_memory=False)
+ df['year'] = year
+ dfs.append(df)
+ except FileNotFoundError:
+ print(f"File not found for year {year}: {file_path}")
+ continue
+ combined_df = pd.concat(dfs, ignore_index=True)
+ return combined_df
+----
+
+Use this function to load and combine all available years:
+
+[source,python]
+----
+all_years_df_indy_climate = load_and_stack_climate_data()
+----
+
+=== Examining the Combined Dataset
+
+Once all years have been stacked together, examine the size of the combined dataset. This helps verify that the stacking process worked as expected.
+
+[source,python]
+----
+all_years_df_indy_climate.shape
+----
+
+Next, break the dataset down by year to see how many rows are recorded for each year:
+
+[source,python]
+----
+(
+    all_years_df_indy_climate["year"]
+    .value_counts()
+    .sort_index()
+)
+----
+
+This breakdown reveals whether some years contain more observations than others and provides early insight into data completeness across time.
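+
+One simple way to flag suspiciously incomplete years, sketched here on hypothetical counts rather than the real breakdown:
+
+[source,python]
+----
+import pandas as pd
+
+# Hypothetical row counts per year; 2008 stands in for an incomplete year
+counts = pd.Series({2006: 9000, 2007: 8900, 2008: 4200})
+
+# Flag years with far fewer rows than the median year
+suspect = counts[counts < 0.8 * counts.median()]
+print(suspect)  # only 2008 is flagged
+----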
+
+=== Focusing on Daily Weather Variables
+
+The full dataset contains many columns, including hourly, daily, and monthly measurements. For the next stages of this project, we will focus on a smaller set of daily weather variables that are most relevant for time series analysis.
+
+Use the code below to filter the dataset to a core set of columns and overwrite the existing DataFrame with this simplified version:
+
+[source,python]
+----
+columns_to_keep = [
+ "DATE",
+ "DailyAverageDryBulbTemperature",
+ "DailyMaximumDryBulbTemperature",
+ "DailyMinimumDryBulbTemperature",
+ "DailyPrecipitation",
+ "DailyAverageRelativeHumidity",
+ "DailyAverageWindSpeed",
+ "DailySnowfall",
+ "NAME"
+]
+
+all_years_df_indy_climate = all_years_df_indy_climate[columns_to_keep]
+----
+
+=== Cleaning Daily Weather Data for Time Series Analysis
+
+Before working with time series models, the dataset must be cleaned so that each row represents meaningful daily weather information. This step focuses on selecting relevant variables, removing unusable rows, and ensuring the time variable is stored in a format suitable for time-based analysis.
+
+The variables of interest represent key daily weather conditions, including average, minimum, and maximum temperatures, precipitation, humidity, wind speed, and snowfall. These measurements form the foundation for identifying trends, seasonal patterns, and long-term changes in climate.
+
+Because this project focuses on daily time series analysis, the dataset must satisfy two conditions:
+
+- Each row must contain usable weather measurements
+- The time variable must follow a consistent daily structure
+
+==== Removing Rows with No Weather Information
+
+In real-world climate data, some rows contain timestamps and location information but no actual weather measurements. These rows do not contribute anything meaningful to analysis or visualization and can introduce misleading patterns if left in the dataset.
+
+When removing empty rows, it is important to focus only on weather-related columns. Columns such as `DATE` and `NAME` should be excluded from this check, since their presence alone does not indicate that useful data exists for a given day.
+
+The goal is to remove rows where *all* weather measurements are missing, while retaining rows that contain at least one recorded weather value.
+
+[source,python]
+----
+cols_to_check = [
+ col for col in all_years_df_indy_climate.columns
+ if col not in ["DATE", "NAME"]
+]
+
+all_years_df_indy_climate = (
+ all_years_df_indy_climate
+ .dropna(subset=cols_to_check, how="all")
+)
+
+all_years_df_indy_climate.head()
+----
+
+A preview of the cleaned data is shown below. Notice that each remaining row contains at least one recorded weather measurement, even if some individual values are still missing or marked as trace amounts.
+
+[cols="1,2,2,2,2,2,2,2,3", options="header"]
+|===
+| DATE | DailyAverageDryBulbTemperature | DailyMaximumDryBulbTemperature | DailyMinimumDryBulbTemperature | DailyPrecipitation | DailyAverageRelativeHumidity | DailyAverageWindSpeed | DailySnowfall | NAME
+
+| 2006-01-01T00:00:00
+| 4.5
+| 10.6
+| -1.7
+| 0.0
+| 74.0
+| 5.1
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-02T00:00:00
+| 13.3
+| 18.3
+| 8.3
+| 6.6
+| 89.0
+| 4.9
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-03T00:00:00
+| 7.0
+| 8.3
+| 5.6
+| 0.5
+| 96.0
+| NaN
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-04T00:00:00
+| 6.4
+| 10.0
+| 2.8
+| T
+| 87.0
+| 7.7
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-05T00:00:00
+| 1.7
+| 3.3
+| 0.0
+| T
+| 82.0
+| 5.8
+| T
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+|===
+
+==== Converting DATE to a Datetime Format
+
+Time series analysis requires that the time variable be stored in a datetime format. Converting the `DATE` column allows Python to correctly interpret ordering, spacing, and calendar-based operations.
+
+[source,python]
+----
+all_years_df_indy_climate["DATE"] = pd.to_datetime(
+ all_years_df_indy_climate["DATE"]
+)
+
+all_years_df_indy_climate.head()
+----
+
+After conversion, the `DATE` column is stored as a datetime type and displays in a clean `YYYY-MM-DD` format, which is appropriate for daily time series analysis.
+
+[cols="2,2,2,2,2,2,2,2,3", options="header"]
+|===
+| DATE | DailyAverageDryBulbTemperature | DailyMaximumDryBulbTemperature | DailyMinimumDryBulbTemperature | DailyPrecipitation | DailyAverageRelativeHumidity | DailyAverageWindSpeed | DailySnowfall | NAME
+
+| 2006-01-01
+| 4.5
+| 10.6
+| -1.7
+| 0.0
+| 74.0
+| 5.1
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-02
+| 13.3
+| 18.3
+| 8.3
+| 6.6
+| 89.0
+| 4.9
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-03
+| 7.0
+| 8.3
+| 5.6
+| 0.5
+| 96.0
+| NaN
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-04
+| 6.4
+| 10.0
+| 2.8
+| T
+| 87.0
+| 7.7
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-05
+| 1.7
+| 3.3
+| 0.0
+| T
+| 82.0
+| 5.8
+| T
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+|===
+
+==== Verifying Dataset Size and Date Coverage
+
+As a final validation step, it is helpful to examine the size of the cleaned dataset and confirm the range of dates it spans. This ensures that no large portions of the timeline were unintentionally removed during cleaning.
+
+[source,python]
+----
+print(all_years_df_indy_climate.shape)
+print(all_years_df_indy_climate["DATE"].min().date())
+print(all_years_df_indy_climate["DATE"].max().date())
+----
+
+----
+(6940, 9)
+2006-01-01
+2024-12-31
+----
+
+
+== Preparing Climate Data for Time Series Analysis
+
+Time series analysis requires data to be numeric, time-ordered, and indexed in a way that reflects how observations occur over time. Before building forecasting models or creating time-based visualizations, the dataset must be structured so Python can correctly interpret temporal relationships.
+
+This section focuses on two critical preparation steps:
+setting the date as the time index and ensuring all weather variables are numeric and usable.
+
+=== Why the Date Index Matters
+
+Many time series operations in Python rely on the dataframe index rather than column values. One such operation is time-based interpolation, which fills in missing values using the spacing between dates.
+
+If interpolation is attempted before setting the date as the index, Python cannot determine which column represents time. As a result, time-based methods will fail.
+
+For example, attempting to interpolate temperature values without a datetime index produces an error:
+
+[source,python]
+----
+DF["DailyAverageDryBulbTemperature"].interpolate(method="time")
+----
+
+The error typically indicates that time-weighted interpolation only works when the index is a `DatetimeIndex`. Without this structure, Python has no information about how observations are ordered in time.
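+
+The failure is easy to reproduce on a toy series whose index is the default integer range:
+
+[source,python]
+----
+import pandas as pd
+
+s = pd.Series([1.0, None, 3.0])  # default RangeIndex carries no time information
+
+try:
+    s.interpolate(method="time")
+except (ValueError, TypeError) as err:
+    # pandas rejects time-weighted interpolation without a DatetimeIndex
+    print("Interpolation failed:", err)
+----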
+
+=== Converting DATE and Setting the Index
+
+To enable time-aware operations, the DATE column must first be converted to a datetime format and then set as the dataframe index.
+
+[source,python]
+----
+DF["DATE"] = pd.to_datetime(DF["DATE"])
+DF = DF.set_index("DATE")
+----
+
+Once DATE is set as the index, Python recognizes the dataset as time-ordered. Time-based interpolation can now be applied correctly, filling gaps using nearby dates along the timeline.
+
+[source,python]
+----
+DF["DailyAverageDryBulbTemperature"].interpolate(
+ method="time",
+ limit_direction="both"
+)
+----
+
+This approach preserves the structure of the time series and avoids introducing artificial jumps or distortions.
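+
+The difference between time-weighted and positional interpolation shows up clearly on a small series with an irregular datetime index:
+
+[source,python]
+----
+import pandas as pd
+
+# A one-day gap, then a two-day gap; in elapsed time, the missing value
+# sits one third of the way through the overall interval
+idx = pd.to_datetime(["2006-01-01", "2006-01-02", "2006-01-04"])
+s = pd.Series([10.0, None, 16.0], index=idx)
+
+print(s.interpolate(method="time"))    # fills with about 12.0 (weighted by elapsed time)
+print(s.interpolate(method="linear"))  # fills with about 13.0 (midpoint by position)
+----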
+
+=== Identifying Numeric Columns
+
+Before applying interpolation across the dataset, it is important to identify which columns are numeric. Time-based interpolation should only be applied to numeric variables such as temperature, humidity, and wind speed.
+
+[source,python]
+----
+all_years_df_indy_climate.set_index("DATE", inplace=True)
+
+numeric_cols = (
+ all_years_df_indy_climate
+ .select_dtypes(include="number")
+ .columns
+)
+
+numeric_cols
+----
+
+----
+Index(['DailyAverageDryBulbTemperature', 'DailyMaximumDryBulbTemperature',
+ 'DailyMinimumDryBulbTemperature', 'DailyAverageRelativeHumidity',
+ 'DailyAverageWindSpeed'],
+ dtype='object')
+----
+
+These columns represent continuous measurements that vary smoothly over time and are appropriate for interpolation.
+
+=== Interpolating Missing Values Over Time
+
+With DATE set as the index and numeric columns identified, time-based interpolation can be applied across all numeric variables simultaneously.
+
+[source,python]
+----
+all_years_df_indy_climate[numeric_cols] = (
+ all_years_df_indy_climate[numeric_cols]
+ .interpolate(method="time", limit_direction="both")
+)
+----
+
+This ensures that missing values are filled in a way that respects the temporal spacing between observations.
+
+=== Resetting the Index
+
+After time-based operations are complete, it is often useful to restore DATE as a regular column. This makes the dataframe easier to inspect, merge with other datasets, or visualize using plotting libraries.
+
+[source,python]
+----
+all_years_df_indy_climate.reset_index(inplace=True)
+all_years_df_indy_climate.head()
+----
+
+[cols="1,2,2,2,2,2,2,2,3", options="header"]
+|===
+| DATE | DailyAverageDryBulbTemperature | DailyMaximumDryBulbTemperature | DailyMinimumDryBulbTemperature | DailyPrecipitation | DailyAverageRelativeHumidity | DailyAverageWindSpeed | DailySnowfall | NAME
+
+| 2006-01-01
+| 4.5
+| 10.6
+| -1.7
+| 0.0
+| 74.0
+| 5.1
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-02
+| 13.3
+| 18.3
+| 8.3
+| 6.6
+| 89.0
+| 4.9
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-03
+| 7.0
+| 8.3
+| 5.6
+| 0.5
+| 96.0
+| 6.3
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-04
+| 6.4
+| 10.0
+| 2.8
+| T
+| 87.0
+| 7.7
+| 0.0
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+
+| 2006-01-05
+| 1.7
+| 3.3
+| 0.0
+| T
+| 82.0
+| 5.8
+| T
+| INDIANAPOLIS INTERNATIONAL AIRPORT, IN US
+|===
+
+=== Handling Trace Values in Weather Data
+
+Weather datasets often include trace values recorded as `"T"`, representing very small amounts of precipitation or snowfall. These values are not numeric and must be addressed before modeling or statistical analysis.
+
+Attempting to convert columns containing `"T"` directly to floats will result in errors. To resolve this, trace values are replaced with `0` and the columns are then converted to numeric types.
+
+[source,python]
+----
+cols_to_fix = [
+ "DailyAverageDryBulbTemperature",
+ "DailyMaximumDryBulbTemperature",
+ "DailyMinimumDryBulbTemperature",
+ "DailyPrecipitation",
+ "DailyAverageRelativeHumidity",
+ "DailyAverageWindSpeed",
+ "DailySnowfall"
+]
+
+for col in cols_to_fix:
+ all_years_df_indy_climate[col] = (
+ all_years_df_indy_climate[col]
+ .replace("T", 0)
+ )
+
+all_years_df_indy_climate[cols_to_fix] = (
+ all_years_df_indy_climate[cols_to_fix]
+ .astype(float)
+)
+----
+
+A final check confirms that no trace values remain in the dataset.
+
+[source,python]
+----
+(all_years_df_indy_climate[cols_to_fix] == "T").any()
+----
+
+----
+DailyAverageDryBulbTemperature False
+DailyMaximumDryBulbTemperature False
+DailyMinimumDryBulbTemperature False
+DailyPrecipitation False
+DailyAverageRelativeHumidity False
+DailyAverageWindSpeed False
+DailySnowfall False
+dtype: bool
+----
+
+== Exploring Climate Trends Over Time
+
+With the data cleaned and properly structured, the next step is to visualize how temperature changes over time. Visualization plays a central role in time series analysis because it allows trends, seasonal patterns, and unusual behavior to emerge before any formal modeling begins.
+
+This section explores daily average temperature across the full time span of the dataset and then zooms in on a single year to examine seasonal dynamics in greater detail.
+
+
+=== Converting Daily Temperature to Fahrenheit
+
+The daily average temperature variable is stored in Celsius. For interpretability and consistency with common U.S. weather reporting, the temperature is converted to Fahrenheit and stored as a new column. Keeping the original Celsius values ensures that no information is lost during the transformation.
+
+[source,python]
+----
+all_years_df_indy_climate[
+ "DailyAverageDryBulbTemperature_Farenheit"
+] = (
+ all_years_df_indy_climate["DailyAverageDryBulbTemperature"] * 9/5 + 32
+)
+----
+
+=== Visualizing Temperature Across Multiple Years
+
+Plotting daily average temperature from 2006 to 2024 provides a long-term view of climate behavior in Indianapolis. At this scale, individual daily fluctuations blend together, allowing broader patterns to stand out more clearly.
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+import pandas as pd
+
+all_years_df_indy_climate["DATE"] = pd.to_datetime(
+ all_years_df_indy_climate["DATE"]
+)
+
+plt.figure(figsize=(12, 6))
+plt.plot(
+ all_years_df_indy_climate["DATE"],
+ all_years_df_indy_climate["DailyAverageDryBulbTemperature_Farenheit"],
+ linewidth=1
+)
+
+plt.title("Average Daily Temperature in Indianapolis (2006–2024)")
+plt.xlabel("Date")
+plt.ylabel("Average Temperature (°F)")
+plt.xticks(rotation=45)
+plt.tight_layout()
+plt.show()
+----
+
+image::time-series-ex.png[width=600, height=450, title="Example of a Long-Term Climate Time Series"]
+
+When viewed over many years, several features typically emerge:
+
+- A strong and repeating seasonal cycle
+- Warmer summers and colder winters
+- Year-to-year variability layered on top of the seasonal pattern
+- Occasional sharp spikes or drops associated with extreme weather events
+
+This visualization helps distinguish long-term structure from short-term noise.
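+
+One optional way to make the long-term structure even more visible, not part of the project workflow itself, is a centered rolling mean. The sketch below uses synthetic data so it runs on its own:
+
+[source,python]
+----
+import numpy as np
+import pandas as pd
+
+# Two years of synthetic daily temperatures: a seasonal cycle plus noise
+rng = pd.date_range("2006-01-01", periods=730, freq="D")
+noise = np.random.default_rng(0).normal(0, 3, len(rng))
+temp = pd.Series(
+    55 + 20 * np.sin(2 * np.pi * rng.dayofyear / 365) + noise,
+    index=rng,
+)
+
+# A 30-day centered rolling mean smooths daily noise, leaving the seasonal curve
+smoothed = temp.rolling(window=30, center=True).mean()
+----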
+
+=== Zooming In on a Single Year
+
+To better understand seasonal behavior, it is useful to focus on a single year. Examining daily average temperature for 2024 allows individual seasonal transitions to be observed more clearly.
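+
+A single year can be isolated with the `.dt.year` accessor; here is the idea on a toy frame (hypothetical values):
+
+[source,python]
+----
+import pandas as pd
+
+df = pd.DataFrame({
+    "DATE": pd.to_datetime(["2023-12-31", "2024-01-01", "2024-06-15"]),
+    "TempF": [30.0, 28.0, 85.0],
+})
+
+# Keep only rows whose calendar year is 2024
+one_year = df[df["DATE"].dt.year == 2024]
+print(one_year)  # two rows remain
+----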
+
+At this resolution, patterns such as the following are easier to identify:
+
+- Temperatures gradually rising from winter into summer
+- Peak temperatures occurring in mid-summer
+- A steady decline from late summer into winter
+- Short-term fluctuations caused by cold snaps, heatwaves, or transitional seasons
+
+Anomalies also become more visible when zoomed in. Brief temperature drops near or below freezing in winter, sudden spikes during transitional months, and irregular spring or fall behavior stand out more clearly at the single-year level.
+
+=== Why Visual Exploration Matters
+
+Time series visualization provides critical insight into how data behaves over time. Observing strong seasonality suggests the need for models that account for periodic patterns. Long-term trends motivate forecasting approaches that allow gradual change, while anomalies highlight periods that may require special attention.
+
+By examining climate data at both long-term and short-term scales, the dataset begins to tell a story about how temperature evolves over time. This understanding lays the groundwork for the forecasting and modeling techniques introduced next.
+
+
+== Creating Time-Based Features for Seasonal Analysis
+
+With the dataset cleaned, indexed, and visualized, the next step is to extract information directly from the time variable itself. In time series analysis, features derived from the date—such as month, day of year, or day of week—help models recognize recurring patterns and seasonal structure that are not always obvious from raw timestamps alone. This section focuses on creating a month-based feature and using it to summarize long-term seasonal temperature behavior.
+
+=== Why Time-Based Features Matter
+
+Time-based features allow temporal patterns to be expressed explicitly in the data. Rather than expecting a model to infer seasonality solely from the order of observations, features such as month provide a clear signal about where each observation falls within the yearly cycle.
+
+For climate data, month-based features are especially useful because they align closely with natural seasonal changes:
+
+- Winter months tend to be colder
+- Summer months tend to be warmer
+- Transitional months often show greater variability
+
+Encoding this information directly makes seasonal structure easier to visualize and model.
+
+=== Extracting the Month from the DATE Column
+
+To capture seasonal information, the month is extracted from the `DATE` column and stored as a new variable. The month is represented as a string so it can be treated as a categorical feature in later analyses or models.
+
+[source,python]
+----
+all_years_df_indy_climate["MONTH"] = (
+ all_years_df_indy_climate["DATE"]
+ .dt.month
+ .astype(str)
+)
+----
+
+This new column assigns each observation a value from `"1"` through `"12"`, corresponding to January through December.
+
+=== Calculating Average Temperature by Month
+
+Once the month feature is created, temperatures can be aggregated across all years to examine typical seasonal behavior. Averaging daily temperatures by month reveals long-term patterns that persist year after year.
+
+[source,python]
+----
+monthly_avg_temp = (
+    all_years_df_indy_climate
+    .groupby("MONTH")["DailyAverageDryBulbTemperature_Farenheit"]
+    .mean()
+    .reset_index()
+)
+
+monthly_avg_temp.columns = ["MONTH", "AvgTemp"]
+
+# MONTH is stored as a string, so sort numerically; a plain sort would
+# place "10", "11", and "12" before "2"
+monthly_avg_temp = monthly_avg_temp.sort_values(
+    by="MONTH", key=lambda s: s.astype(int)
+).reset_index(drop=True)
+----
+
+This table summarizes the average temperature associated with each month, smoothing out daily and yearly fluctuations to highlight the overall seasonal cycle.
+
+=== Visualizing Average Monthly Temperatures
+
+A line plot of average monthly temperatures provides a clear visual representation of seasonal trends. Unlike daily time series plots, this view emphasizes recurring patterns rather than short-term variability.
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+all_years_df_indy_climate["Month"] = (
+ all_years_df_indy_climate["DATE"].dt.month
+)
+
+monthly_avg = (
+ all_years_df_indy_climate
+ .groupby("Month")["DailyAverageDryBulbTemperature_Farenheit"]
+ .mean()
+ .reset_index()
+)
+
+monthly_avg.rename(
+ columns={"DailyAverageDryBulbTemperature_Farenheit": "monthly_avg_temp"},
+ inplace=True
+)
+
+plt.figure(figsize=(10, 5))
+plt.plot(
+ monthly_avg["Month"],
+ monthly_avg["monthly_avg_temp"],
+ marker="o"
+)
+
+plt.title("Average Monthly Temperature in Indianapolis (Across All Years)")
+plt.xlabel("Month")
+plt.ylabel("Average Temperature (°F)")
+plt.xticks(
+ range(1, 13),
+ ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
+ "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
+)
+plt.grid(True)
+plt.tight_layout()
+plt.show()
+----
+
+This visualization typically reveals a smooth seasonal curve:
+
+- Temperatures increase steadily from winter into summer
+- Peak temperatures occur in mid-summer
+- Temperatures decline gradually from late summer into winter
+- Coldest conditions appear in January, with the warmest in July
+
+=== Connecting Features to Future Modeling
+
+Month-based features play an important role in time series forecasting. Strong seasonal structure suggests that future models should account for periodic behavior rather than treating observations as independent over time.
+
+By explicitly encoding seasonal information and summarizing long-term monthly trends, the dataset becomes better suited for forecasting methods that rely on seasonality, such as SARIMAX models or machine learning approaches that incorporate time-based features.
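+
+One optional extension, not used in this project's pipeline, is a cyclical (sine/cosine) encoding of the month, which lets a model treat December (12) and January (1) as neighbors rather than as far-apart categories:
+
+[source,python]
+----
+import math
+
+def month_to_cyclic(month):
+    """Map a month number 1-12 onto a point on the unit circle."""
+    angle = 2 * math.pi * (month - 1) / 12
+    return (math.sin(angle), math.cos(angle))
+
+print(month_to_cyclic(1))   # January: (0.0, 1.0)
+print(month_to_cyclic(7))   # July lands on the opposite side of the circle
+----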
+
+This step bridges exploratory visualization and formal modeling, ensuring that temporal patterns observed in the data are captured directly in the features used for prediction.
+
+
+
diff --git a/tools-appendix/modules/python/pages/timeseriespart2.adoc b/tools-appendix/modules/python/pages/timeseriespart2.adoc
new file mode 100644
index 000000000..67bdc2b0f
--- /dev/null
+++ b/tools-appendix/modules/python/pages/timeseriespart2.adoc
@@ -0,0 +1,811 @@
+= Introduction to Time Series - Part 2 (NOAA Monthly Climate Data)
+:page-mathjax: true
+
+== Project Objectives
+
+In this project, you'll explore and model monthly climate trends in Indianapolis using an autoregressive (AR) model and SARIMAX, a type of ARIMA model that can handle seasonal patterns.
+
+.Learning Objectives
+****
+- Understand the structure and purpose of time series data in climate forecasting.
+- Visualize long-term patterns and identify trends and seasonality in temperature data.
+- Test for stationarity using the Augmented Dickey-Fuller test and interpret results.
+- Apply differencing to transform non-stationary series into stationary ones.
+- Build a time series model using ARIMA and SARIMAX.
+- Evaluate model performance using Mean Absolute Error (MAE) on train and test sets.
+
+****
+
+== Dataset
+- /anvil/projects/tdm/data/noaa_timeseries2/monthly_avg_df_NOAA.csv
+
+
+[NOTE]
+====
+This guide was originally created as a seminar project, so you may notice it follows a question-and-answer format.
+Even so, it should still serve as a helpful resource as you build your own model. Use 4 cores if you want to replicate this project on Anvil.
+====
+
+
+
+== Introduction
+
+image::Back-to-the-future.jpg[width=600, height=450, title="Figure 1: Back to the Future image © Universal Pictures (1985)"]
+
+[quote, Dr. Emmett Brown, Back to the Future]
+____
+“Roads? Where we're going, we don't need roads.”
+____
+
+Time series forecasting is not just number crunching; it is the closest thing we have to a crystal ball. By recognizing patterns across time, we can uncover hidden structures in the data, anticipate what's coming, and even influence decisions before events unfold. It's not magic, just cool math behind the scenes.
+
+Climate is not random; it follows patterns, some obvious and others subtle. When we zoom out from daily weather reports and look at broader climate trends, we start to see the underlying patterns of our weather. In this project, we will work with nearly two decades of monthly-aggregated climate data from Indianapolis (2006–2024), sourced from the National Oceanic and Atmospheric Administration (NOAA).
+
+This time, the focus shifts from day-to-day fluctuations to monthly averages, a scale better suited for modeling long-term behavior and seasonal trends. The dataset includes:
+
+* *Monthly_AverageDryBulbTemperature_Farenheit*: Our main target variable—how warm or cold it was on average each month.
+* *Monthly_Precipitation*: Total rainfall and snowmelt.
+* *Monthly_AverageRelativeHumidity*: Average monthly moisture in the air.
+* *Monthly_AverageWindSpeed*: Wind patterns that may influence temperature or storm formation.
+* *Monthly_Snowfall*: A key seasonal indicator in the Midwest.
+
+The power of time series forecasting lies in its ability to take these past patterns and use them to make educated predictions about the future. In this project, you'll start by building an AR (autoregressive) model, and then you'll learn how to build a type of ARIMA (autoregressive integrated moving average) model: specifically, a SARIMAX (seasonal autoregressive integrated moving average with exogenous regressors) model that incorporates not just past temperature values but also external predictors like humidity and precipitation.
+
+You will clean and transform the data, test for stationarity, evaluate model performance, and ultimately forecast future climate trends for Indianapolis. The goal? To understand not only what the temperature was, but what might be next. We are going to try to predict the future!
+
+== Questions
+
+
+== Getting to Know the Data
+
+Before building any time series model, it is essential to understand the structure, completeness, and behavior of the data you are working with. In this section, you will load the NOAA monthly climate dataset, explore its basic properties, and visualize temperature trends over time.
+
+
+=== Loading and Inspecting the Dataset
+
+Begin by loading the monthly NOAA climate dataset and previewing its contents. This will help you understand the variables available and how the data is organized. Save the DataFrame as `monthly_df` and display the first five rows.
+
+[source,python]
+----
+import pandas as pd
+
+monthly_df = pd.read_csv("/anvil/projects/tdm/data/noaa_timeseries2/monthly_avg_df_NOAA.csv")
+monthly_df.head()
+----
+
+[cols="2,2,2,2,3,2", options="header"]
+|===
+| Monthly_Precipitation | Monthly_AverageRelativeHumidity | Monthly_AverageWindSpeed | Monthly_Snowfall | Monthly_AverageDryBulbTemperature_Farenheit | DATE
+| 2.706452 | 77.870968 | 5.532258 | 2.225806 | 39.780645 | 2006-01-31
+| 1.717857 | 68.892857 | 4.875000 | 3.571429 | 32.347143 | 2006-02-28
+| 5.567742 | 70.838710 | 4.951613 | 4.516129 | 41.725806 | 2006-03-31
+| 3.073333 | 61.366667 | 4.806667 | 0.000000 | 57.146000 | 2006-04-30
+| 3.558065 | 68.387097 | 3.903226 | 0.000000 | 61.310968 | 2006-05-31
+|===
+
+Next, examine the overall size of the dataset and confirm whether any values are missing. This step ensures the data is suitable for time series modeling without requiring additional cleaning.
+
+[source,python]
+----
+print(monthly_df.shape)
+print(monthly_df.isnull().sum())
+----
+
+----
+(228, 6)
+Monthly_Precipitation 0
+Monthly_AverageRelativeHumidity 0
+Monthly_AverageWindSpeed 0
+Monthly_Snowfall 0
+Monthly_AverageDryBulbTemperature_Farenheit 0
+DATE 0
+dtype: int64
+----
+
+=== Visualizing Monthly Temperature Trends
+
+To better understand the temporal behavior of the data, create a time series line plot of monthly average temperature. This visualization will allow you to observe long-term trends, recurring seasonal patterns, and potential anomalies.
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+monthly_df['DATE'] = pd.to_datetime(monthly_df['DATE'])
+
+plt.plot(
+ monthly_df['DATE'],
+ monthly_df['Monthly_AverageDryBulbTemperature_Farenheit']
+)
+plt.title("Monthly Average Temperature in Indianapolis")
+plt.xlabel("Date")
+plt.ylabel("Temperature (F)")
+plt.grid()
+plt.show()
+----
+
+image::TimeSeriesSection2-1e.png[width=600, height=450, title="Monthly average dry bulb temperature in Indianapolis over time"]
+
+After viewing the plot, focus on identifying trends and seasonality, such as repeating yearly patterns or long-term changes in temperature. For example, note the strong annual seasonality with recurring peaks and troughs.
+
+
+== Understanding Lag Through Autoregressive Models
+
+++++
+
+++++
+
+Time series models are different from other models. People attempt to predict the future constantly, from forecasting stock prices to anticipating weather patterns. But when we narrow our focus to short-term forecasting (predicting the near future based on recent historical data), the task becomes more manageable.
+
+Take, for example, your plot of average monthly temperature. One thing you will notice right away is that observations from month to month are not independent. Instead, they are correlated with one another. This is known as *autocorrelation*: values close together in time tend to be similar.
+
+This feature distinguishes time series data from other datasets you have likely seen, where each row can typically be treated as an independent observation. In time series, the order of the data matters. Patterns, cycles, and trends can all emerge over time—and understanding those structures is the key to effective forecasting.
+
+=== Why Autoregressive (AR) Models?
+
+Autoregressive (AR) models are a natural starting point for time series forecasting. At their core, they use past values to predict the future. An AR model assumes that recent values carry useful information about what comes next.
+
+These models are simple, interpretable, and often surprisingly effective, especially when patterns persist over time. In this project, we will start with AR models to help introduce foundational ideas like *lags*, *autocorrelation*, and *stationarity*—concepts that carry through to more advanced models.
+
+=== Lag in Time Series
+
+In time series analysis, we assume that the past influences the future. This makes time-based data different from other datasets—observations are not independent, and patterns often persist over time.
+
+A *lag* is simply a previous value of the same variable:
+
+* Lag 1 → the value one time step ago
+* Lag 2 → the value two time steps ago
+* Lag _n_ → the value _n_ time steps ago
+
+By including lagged values in a model, we give it memory. This lets the model "remember" past behavior and use that memory to explain current outcomes.
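+
+In pandas, lagged columns can be created with `shift()`. Here is a minimal sketch on a toy series (the values and column names are illustrative, not the NOAA data):
+
+[source,python]
+----
+import pandas as pd
+
+# Toy monthly temperature series to illustrate lags
+s = pd.Series([39.8, 32.3, 41.7, 57.1, 61.3], name="temp")
+
+lags = pd.DataFrame({
+    "temp": s,
+    "lag_1": s.shift(1),  # value one time step ago
+    "lag_2": s.shift(2),  # value two time steps ago
+})
+print(lags)
+----
+
+The `NaN` values at the top of each lag column appear because the earliest observations have nothing earlier to look back on.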
+
+=== The AR(1) Model: A First Look at Autoregression
+
+One of the simplest models that uses lags is the autoregressive model of order 1, or AR(1). It assumes the current value depends on the previous value, plus some random noise.
+
+Yₜ = ϕ × Yₜ₋₁ + εₜ
+
+where:
+
+* Yₜ is the current value
+* Yₜ₋₁ is the value one step before
+* ϕ is the autoregressive coefficient
+* εₜ is random noise
+
+This equation may look intimidating at first, but conceptually it states that today’s value is largely a continuation of yesterday’s value, with some variability added in. In the context of this dataset, it suggests that this month’s temperature is related to last month’s temperature, plus some noise.
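+
+To build intuition, an AR(1) process can be simulated directly from this equation. The sketch below uses ϕ = 0.8 and standard normal noise (arbitrary choices for illustration):
+
+[source,python]
+----
+import numpy as np
+
+rng = np.random.default_rng(0)
+phi = 0.8          # autoregressive coefficient
+n = 200
+y = np.zeros(n)
+
+for t in range(1, n):
+    y[t] = phi * y[t - 1] + rng.normal()  # Y_t = phi * Y_{t-1} + noise
+
+# Neighboring values inherit information from each other,
+# so the lag-1 correlation should land close to phi
+r = np.corrcoef(y[:-1], y[1:])[0, 1]
+print(f"lag-1 correlation: {r:.2f}")
+----
+
+Larger values of ϕ produce series with longer memory; ϕ near zero produces something close to pure noise.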
+
+Let’s look at how autocorrelation appears in the monthly temperature data.
+
+image::Autocorrelation-monthly.png[width=600, height=450, title="Autocorrelation function for monthly average temperature"]
+
+The figure above shows the autocorrelation for `Monthly_AverageDryBulbTemperature_Farenheit`, where each lag corresponds to one month. Strong positive correlations appear at lags of 12, 24, and 36 months, indicating a clear yearly seasonal pattern in monthly temperatures. The autocorrelation at lag 1 is also high, suggesting that consecutive months are strongly related.
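+
+The same seasonal signature can be checked numerically with the `acf` function from statsmodels. The sketch below uses a synthetic series with a built-in 12-month cycle as a stand-in for the temperature column:
+
+[source,python]
+----
+import numpy as np
+from statsmodels.tsa.stattools import acf
+
+# Synthetic monthly series: a 12-month cycle plus noise
+rng = np.random.default_rng(0)
+months = np.arange(228)
+series = 52 + 20 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 3, 228)
+
+acf_vals = acf(series, nlags=36)
+print(f"lag 1:  {acf_vals[1]:.2f}")   # consecutive months: strongly positive
+print(f"lag 6:  {acf_vals[6]:.2f}")   # opposite season: strongly negative
+print(f"lag 12: {acf_vals[12]:.2f}")  # same month next year: strongly positive
+----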
+
+Understanding the concept of *lag* is foundational before moving on to more complex models like **SARIMAX**. To see this idea in action, you will begin by building an AR-style comparison between consecutive months.
+
+=== Working with Lagged Monthly Data
+
+Begin by ensuring the dataset is properly ordered in time. Convert the `DATE` column to a datetime object and sort the DataFrame in ascending chronological order.
+
+[source,python]
+----
+monthly_df['DATE'] = pd.to_datetime(monthly_df['DATE'])
+monthly_df = monthly_df.sort_values('DATE').reset_index(drop=True)
+monthly_df.head()
+----
+
+[cols="2,2,2,2,3,2", options="header"]
+|===
+| Monthly_Precipitation | Monthly_AverageRelativeHumidity | Monthly_AverageWindSpeed | Monthly_Snowfall | Monthly_AverageDryBulbTemperature_Farenheit | DATE
+| 2.706452 | 77.870968 | 5.532258 | 2.225806 | 39.780645 | 2006-01-31
+| 1.717857 | 68.892857 | 4.875000 | 3.571429 | 32.347143 | 2006-02-28
+| 5.567742 | 70.838710 | 4.951613 | 4.516129 | 41.725806 | 2006-03-31
+| 3.073333 | 61.366667 | 4.806667 | 0.000000 | 57.146000 | 2006-04-30
+| 3.558065 | 68.387097 | 3.903226 | 0.000000 | 61.310968 | 2006-05-31
+|===
+
+Next, construct a new DataFrame that explicitly compares each month’s average temperature to the previous month’s temperature. This will allow you to examine how strongly consecutive observations are related.
+
+[source,python]
+----
+monthly_comparisons = []
+
+for i in range(1, len(monthly_df)):
+ date = monthly_df.loc[i, 'DATE']
+ current_temp = monthly_df.loc[i, 'Monthly_AverageDryBulbTemperature_Farenheit']
+ previous_temp = monthly_df.loc[i - 1, 'Monthly_AverageDryBulbTemperature_Farenheit']
+
+ row = {
+ 'Date': date,
+ 'Current': current_temp,
+ 'Previous': previous_temp
+ }
+ monthly_comparisons.append(row)
+
+comparison_df = pd.DataFrame(monthly_comparisons)
+comparison_df.head()
+----
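+
+The loop above makes the lag construction explicit, which is useful the first time through. The same table can also be built in one step with `shift()`; here is a sketch on a small stand-in DataFrame (not the full NOAA data) with the same column names:
+
+[source,python]
+----
+import pandas as pd
+
+# Small stand-in with the same column names as the NOAA data
+mini_df = pd.DataFrame({
+    'DATE': pd.to_datetime(['2006-01-31', '2006-02-28', '2006-03-31']),
+    'Monthly_AverageDryBulbTemperature_Farenheit': [39.78, 32.35, 41.73],
+})
+
+temps = mini_df['Monthly_AverageDryBulbTemperature_Farenheit']
+
+# shift(1) pairs each month with the previous month's value
+shift_df = pd.DataFrame({
+    'Date': mini_df['DATE'],
+    'Current': temps,
+    'Previous': temps.shift(1),
+}).dropna().reset_index(drop=True)
+
+print(shift_df)
+----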
+
+=== Visualizing Lag Relationships
+
+To visualize the relationship between consecutive months, create a scatterplot with the previous month’s temperature on the x-axis and the current month’s temperature on the y-axis.
+
+[source,python]
+----
+import matplotlib.pyplot as plt
+
+plt.figure(figsize=(8, 6))
+plt.scatter(comparison_df['Previous'], comparison_df['Current'], alpha=0.6)
+plt.title('Current vs Previous Month Temperature')
+plt.xlabel('Previous Month Temperature (°F)')
+plt.ylabel('Current Month Temperature (°F)')
+plt.grid(True)
+plt.tight_layout()
+plt.show()
+----
+
+image::TimeSeriesSection2-2c.png[width=600, height=450, title="Current vs previous month temperature"]
+
+Use this plot to interpret how strongly the current month’s temperature depends on the previous month. Consider whether the relationship appears linear, tightly clustered, or highly variable, and how this supports the use of autoregressive models.
+
+
+== ARIMA and Stationarity
+
+++++
+
+++++
+
+=== Why Are We Using ARIMA Now?
+
+By now, you have seen that temperature data is not random. Monthly values are correlated with one another, some months are warmer than others, and these shifts often repeat each year. But how can we predict the future based on what we have seen?
+
+One of the most widely used tools for time series forecasting is the *ARIMA* model. ARIMA stands for:
+
+* *AR – AutoRegressive:* Uses past values to predict the future
+* *I – Integrated:* Removes trends by differencing the data
+* *MA – Moving Average:* Uses past errors to improve predictions
+
+So why are we using ARIMA here?
+
+* We’re working with monthly climate data, which often shows both trend and seasonal behavior
+* The data is recorded at regular time intervals, which ARIMA is well-suited for
+* ARIMA provides an interpretable framework that allows us to understand what is driving predictions
+
+Before jumping into a full ARIMA model, the earlier questions focused on autocorrelation. This was intentional. Autocorrelation lays the foundation for how time series models “remember” the past. It helped build intuition around lagged values and showed how previous observations influence current ones.
+
+ARIMA models are flexible and interpretable, and they work best when future values depend linearly on past values. However, ARIMA makes one critical assumption that must be checked before modeling: *stationarity*.
+
+=== Why Stationarity Matters
+
+In time series modeling, stationarity means the statistical properties of the data—such as its mean, variance, and autocorrelation—remain consistent over time. This consistency allows ARIMA models to reliably detect patterns.
+
+If a series has a strong trend or changing variance, ARIMA may struggle to learn anything meaningful. The model can mistake trends for patterns, leading to misleading forecasts. That is why stationarity must be assessed before fitting an ARIMA model. If the data is not stationary, transformations such as differencing can be applied to address the issue.
+
+=== How Do We Know If a Series Is Stationary?
+
+To test for stationarity, we use the *Augmented Dickey-Fuller (ADF) test*. This test evaluates the presence of a unit root in the series by comparing two hypotheses:
+
+* *Null hypothesis (H₀):* The series is non-stationary
+* *Alternative hypothesis (H₁):* The series is stationary
+
+The test produces a p-value. If the p-value is below a chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that the series appears stationary.
+
+The ADF test can be thought of as a screening step. If the test indicates non-stationarity, that signals the need for transformation before modeling.
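+
+In code, this screening step reduces to comparing the p-value against the significance level. Below is a small sketch (the helper name `looks_stationary` is ours, not a statsmodels API), applied to two series whose behavior is known by construction:
+
+[source,python]
+----
+import numpy as np
+from statsmodels.tsa.stattools import adfuller
+
+def looks_stationary(series, alpha=0.05):
+    """Run the ADF test and apply the p-value decision rule."""
+    p_value = adfuller(series)[1]
+    return p_value < alpha
+
+rng = np.random.default_rng(0)
+noise = rng.normal(size=200)   # white noise: stationary by construction
+walk = np.cumsum(noise)        # random walk: non-stationary by construction
+
+print(looks_stationary(noise))
+print(looks_stationary(walk))
+----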
+
+=== How Do We Make a Series Stationary?
+
+One of the most common methods for achieving stationarity is *differencing*. Differencing subtracts each value from the previous value in the series. This shifts the focus from absolute levels to changes over time.
+
+Conceptually:
+
+* The original series represents the actual temperature each month
+* The differenced series represents how much the temperature changes from one month to the next
+
+By removing long-term trends, differencing stabilizes the series and allows ARIMA to focus on repeatable patterns. This step is essential for producing meaningful and interpretable forecasts.
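+
+The arithmetic of differencing is simple enough to verify by hand. A tiny example with made-up temperatures:
+
+[source,python]
+----
+import pandas as pd
+
+temps = pd.Series([40.0, 45.0, 57.0, 61.0, 70.0])
+
+# Each value minus the value before it
+diffed = temps.diff()
+print(diffed.tolist())  # [nan, 5.0, 12.0, 4.0, 9.0]
+----
+
+The first entry is `NaN` because the first observation has nothing earlier to subtract, which is why differenced series are typically passed through `dropna()` before testing or modeling.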
+
+=== Train–Test Splitting in Time Series
+
+image::Train-test-split.png[width=600, height=450, title="The split for our training and test dataset"]
+
+When building forecasting models, the order of observations must always be respected. Unlike traditional machine learning problems, time series data cannot be randomly shuffled before splitting.
+
+Instead, data must be split chronologically:
+
+* *Training set:* Earlier observations used to learn historical patterns
+* *Testing set:* Later observations used to evaluate future forecasting performance
+
+For example, if monthly temperature data spans January 2012 through December 2024, a realistic split would be:
+
+* Training set: January 2012 to December 2022
+* Testing set: January 2023 to December 2024
+
+This approach simulates real-world forecasting, where models are trained on past data and evaluated on unseen future values. It also prevents data leakage, which occurs when information from the future is inadvertently used during training.
+
+Time-aware train/test splitting is fundamental to reliable time series forecasting and applies to all models, including ARIMA and LSTM.
+
+=== Preparing the Data for ARIMA
+
+Begin by splitting the dataset into training and testing sets using the specified date ranges. The stationarity tests and model fitting will be performed using the training data only.
+
+[source,python]
+----
+import pandas as pd
+
+monthly_df['DATE'] = pd.to_datetime(monthly_df['DATE'])
+
+train = monthly_df[
+ (monthly_df['DATE'] >= '2012-01-01') &
+ (monthly_df['DATE'] <= '2022-12-31')
+].copy()
+
+test = monthly_df[
+ (monthly_df['DATE'] >= '2023-01-01') &
+ (monthly_df['DATE'] <= '2024-12-31')
+].copy()
+
+print(train.head())
+print(test.head())
+----
+
+=== Testing for Stationarity with the ADF Test
+
+With the training data prepared, run the Augmented Dickey-Fuller test on the `Monthly_AverageDryBulbTemperature_Farenheit` series. The test returns both a test statistic and a p-value, which can be used to assess stationarity.
+
+[source,python]
+----
+from statsmodels.tsa.stattools import adfuller
+
+result = adfuller(train['Monthly_AverageDryBulbTemperature_Farenheit'])
+
+print("ADF Statistic:", result[0])
+print("p-value:", result[1])
+----
+
+----
+ADF Statistic: -2.006080928580352
+p-value: 0.2839132054964615
+----
+
+Use the p-value to interpret whether the series appears stationary or non-stationary and how this impacts readiness for ARIMA modeling.
+
+=== Applying First-Order Differencing
+
+To address non-stationarity, apply first-order differencing to the training series and visualize the result. This transformation highlights month-to-month changes rather than absolute temperature levels.
+
+[source,python]
+----
+train['Temp_diff'] = train['Monthly_AverageDryBulbTemperature_Farenheit'].diff()
+
+import matplotlib.pyplot as plt
+
+plt.plot(train['DATE'], train['Temp_diff'])
+plt.title("First-Order Differenced Series (Training Data)")
+plt.xlabel("Date")
+plt.ylabel("Change in Temperature (°F)")
+plt.grid()
+plt.show()
+----
+
+image::TimeSeriesSection2-3c.png[width=600, height=450, title="First-order differenced temperature series"]
+
+=== Re-testing Stationarity After Differencing
+
+After differencing, run the ADF test again on the transformed series. Compare the new test results to the original ADF output to assess whether stationarity has been achieved.
+
+[source,python]
+----
+temp_diff_clean = train['Temp_diff'].dropna()
+
+result_diff = adfuller(temp_diff_clean)
+
+print("ADF Statistic (differenced):", result_diff[0])
+print("p-value (differenced):", result_diff[1])
+----
+
+----
+ADF Statistic (differenced): -9.345081128600434
+p-value (differenced): 8.589567709641302e-16
+----
+
+Use this comparison to reflect on how differencing changes the statistical properties of the series and why this step is essential before fitting an ARIMA model.
+
+
+== Fit a Baseline ARIMA Model and Evaluate Performance
+
+++++
+
+++++
+
+You have done the groundwork: explored the data, visualized trends, and confirmed stationarity through differencing. Now it is time to fit a baseline ARIMA model using only the temperature data—without seasonality or external predictors.
+
+This baseline model will serve as a point of comparison. Later, when you introduce seasonality and additional variables using SARIMAX, you will be able to evaluate whether those added components meaningfully improve performance.
+
+=== Why Start with a Baseline ARIMA Model?
+
+Fitting a basic ARIMA model first provides a simple benchmark. It answers an important question: *How well can we forecast temperature using only its past values?*
+
+If a more complex model does not improve upon this baseline, then the added complexity may not be justified.
+
+=== What Is ARIMA?
+
+ARIMA is a classic and widely used model for time series forecasting. It stands for:
+
+* *AutoRegressive (AR):*
+ The model uses the relationship between a variable and its own past values.
+ _Example: If last month was hot, this month might also be hot (though not always)._
+
+* *Integrated (I):*
+ Differencing is used to remove trends and make the series stationary, which is a key assumption of ARIMA.
+ _Example: If temperatures gradually increase over time, differencing shifts focus to month-to-month changes._
+
+* *Moving Average (MA):*
+ The model incorporates past forecast errors to improve predictions.
+ _Example: If last month’s prediction was too low, the model may adjust future predictions upward._
+
+Although ARIMA does not directly model seasonality or external variables, it remains a powerful and interpretable approach when working with a single time series.
+
+Additional documentation for the ARIMA implementation in Python can be found here:
+https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html
+
+=== Defining the Target Variable
+
+Before fitting the model, clearly define what you are trying to predict. In this project, the target variable is the monthly average dry bulb temperature.
+
+[source,python]
+----
+# Define the target variable
+target_col = 'Monthly_AverageDryBulbTemperature_Farenheit'
+----
+
+=== Preparing the Training Data
+
+Reset the index of the training DataFrame and extract the target variable as a separate series. This prepares the data in the format expected by the ARIMA model.
+
+[source,python]
+----
+train = train.reset_index(drop=True)
+y_train = train[target_col]
+----
+
+=== Fitting a Baseline ARIMA(1,1,1) Model
+
+Fit an ARIMA model with order (1, 1, 1), then visualize how well the fitted values align with the observed temperature series.
+
+[source,python]
+----
+from statsmodels.tsa.arima.model import ARIMA
+import matplotlib.pyplot as plt
+
+# Fit the ARIMA model
+arima_model = ARIMA(y_train, order=(1, 1, 1))
+arima_fit = arima_model.fit()
+
+# Extract fitted values
+fitted_values = arima_fit.fittedvalues
+
+# Align actual values and dates
+y_aligned = y_train.loc[fitted_values.index]
+date_aligned = train['DATE'].loc[fitted_values.index]
+
+# Plot actual vs fitted values
+plt.figure(figsize=(12, 5))
+plt.plot(date_aligned, y_aligned, label='Actual', color='blue')
+plt.plot(date_aligned, fitted_values, label='Fitted', color='orange', linestyle='--')
+
+plt.title("ARIMA(1,1,1): Actual vs Fitted Monthly Temperature")
+plt.xlabel("Date")
+plt.ylabel("Temperature (°F)")
+plt.legend()
+plt.grid(True)
+plt.tight_layout()
+plt.xticks(rotation=45)
+plt.show()
+----
+
+image::TimeSeriesSection2-4c.png[width=600, height=450, title="Actual vs Fitted Temperature"]
+
+Use this plot to assess how well the ARIMA model captures the overall behavior of the training data, including trends and short-term fluctuations.
+
+=== Evaluating Model Performance with MAE
+
+To quantify model performance, compute the Mean Absolute Error (MAE). MAE measures the average magnitude of prediction errors in the same units as the target variable, making it easy to interpret.
+
+[source,python]
+----
+from sklearn.metrics import mean_absolute_error
+
+actual = y_train
+predicted = fitted_values
+
+mae = mean_absolute_error(actual, predicted)
+print(f"Mean Absolute Error: {mae:.2f}°F — on average, the model's predictions are off by about this many degrees.")
+----
+
+----
+Mean Absolute Error: 7.13°F — on average, the model's predictions are off by about this many degrees.
+----
+
+Use this value as a baseline benchmark. In later sections, you will compare this error to models that incorporate seasonality and additional predictors.
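+
+Another useful yardstick for an MAE is a naive forecast. For monthly data with strong seasonality, a common choice is the seasonal naive forecast: predict each month with the same month from the previous year. The sketch below demonstrates the idea on a synthetic seasonal series (a stand-in for `y_train`, not the NOAA data):
+
+[source,python]
+----
+import numpy as np
+import pandas as pd
+from sklearn.metrics import mean_absolute_error
+
+# Synthetic monthly temperatures: a yearly cycle plus noise
+rng = np.random.default_rng(0)
+t = np.arange(132)  # 11 years of months
+y = pd.Series(52 + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, 132))
+
+# Seasonal naive forecast: this month's prediction = same month last year
+seasonal_naive = y.shift(12).dropna()
+mae_naive = mean_absolute_error(y.loc[seasonal_naive.index], seasonal_naive)
+print(f"Seasonal naive MAE: {mae_naive:.2f}°F")
+----
+
+If a fitted model cannot beat a benchmark this simple on real data, its added complexity is hard to justify.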
+
+=== Reflecting on Model Limitations
+
+Although ARIMA provides a useful starting point, it has limitations in the context of climate data. Reflect on one limitation of using a plain ARIMA model for this problem—for example, its inability to directly capture strong seasonal patterns or account for other weather-related variables such as precipitation or wind.
+
+
+== Build and Fit the SARIMAX Model and Evaluate Performance
+
+Before fitting a SARIMAX model, it is important to understand *why* we are using it and how it extends what we have already learned from ARIMA.
+
+SARIMAX stands for *Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors*. It builds directly on ARIMA by adding two key capabilities: seasonality and external predictors.
+
+=== What Is SARIMAX?
+
+Let’s break the model down:
+
+* *AutoRegressive (AR):*
+ Uses past values of the series to predict future values.
+ By now, you have seen how temperature depends on recent history.
+
+* *Integrated (I):*
+ Applies differencing to remove trends and make the series stationary.
+
+* *Moving Average (MA):*
+ Uses past forecast errors to refine predictions.
+
+* *Seasonal:*
+ Adds AR, I, and MA components that repeat at a fixed seasonal interval (such as yearly cycles in monthly data).
+
+* *Exogenous variables (X):*
+ Allows the model to incorporate additional predictors—such as precipitation, humidity, wind speed, and snowfall—that may help explain temperature variation.
+
+In simpler terms, SARIMAX is ARIMA with upgrades. It is designed to handle repeating seasonal patterns *and* external influences, making it a strong choice for modeling climate data.
+
+Why not rely on ARIMA alone? ARIMA models temperature using only its own past behavior. SARIMAX allows us to include other weather-related variables that may help explain temperature changes more accurately.
+
+In this section, you will define the target variable, select relevant exogenous variables, configure the SARIMAX model, and evaluate how well it performs.
+
+=== What Are We Asking SARIMAX to Do?
+
+The SARIMAX model in this project is designed to:
+
+* Learn how temperature evolves over time
+* Capture repeating seasonal patterns (for example, colder winters and warmer summers)
+* Use other climate variables that help explain temperature fluctuations
+
+=== Model Configuration
+
+We will begin with the following configuration:
+
+[source,python]
+----
+order = (1, 1, 1)
+seasonal_order = (1, 1, 1, 12)
+----
+
+==== Non-Seasonal Component: `order = (1, 1, 1)`
+
+* `1` (AR): Uses the previous observation
+* `1` (I): Applies first-order differencing
+* `1` (MA): Uses the previous forecast error
+
+==== Seasonal Component: `seasonal_order = (1, 1, 1, 12)`
+
+* `1` (Seasonal AR): Looks at the same month in the previous year
+* `1` (Seasonal I): Applies seasonal differencing
+* `1` (Seasonal MA): Uses past seasonal forecast errors
+* `12`: Indicates a yearly seasonal cycle in monthly data
+
+This configuration allows the model to capture short-term dynamics, long-term trends, and yearly seasonality—while also accounting for outside weather conditions.
+
+=== Loading Required Libraries
+
+Begin by importing the libraries needed for modeling, evaluation, and visualization. A warning filter is included to keep the output readable.
+
+[source,python]
+----
+import warnings
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+
+from statsmodels.tsa.statespace.sarimax import SARIMAX
+from sklearn.metrics import mean_absolute_error
+
+warnings.filterwarnings("ignore")
+----
+
+=== Fitting the SARIMAX Model
+
+Define the exogenous variables that may influence temperature, then fit the SARIMAX model using the specified configuration.
+
+[source,python]
+----
+exog_cols = [
+ 'Monthly_Precipitation',
+ 'Monthly_AverageRelativeHumidity',
+ 'Monthly_AverageWindSpeed',
+ 'Monthly_Snowfall'
+]
+
+X_train = train[exog_cols]
+
+model = SARIMAX(
+ y_train,
+ exog=X_train,
+ order=(1, 1, 1),
+ seasonal_order=(1, 1, 1, 12)
+)
+
+model_fit = model.fit(disp=False)
+----
+
+Including a seasonal order of `(1, 1, 1, 12)` allows the model to explicitly learn repeating yearly temperature patterns, which are common in monthly climate data.
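+
+After fitting, `model_fit.summary()` reports each estimated coefficient, including one per exogenous regressor, along with standard errors. The self-contained sketch below fits the same configuration on synthetic data with one made-up exogenous driver, just to show where those coefficients appear:
+
+[source,python]
+----
+import numpy as np
+import pandas as pd
+from statsmodels.tsa.statespace.sarimax import SARIMAX
+
+# Synthetic seasonal series influenced by one exogenous variable
+rng = np.random.default_rng(0)
+t = np.arange(120)
+exog = pd.DataFrame({'humidity': rng.uniform(50, 80, 120)})
+y = pd.Series(52 + 20 * np.sin(2 * np.pi * t / 12)
+              - 0.1 * exog['humidity'] + rng.normal(0, 2, 120))
+
+fit = SARIMAX(y, exog=exog, order=(1, 1, 1),
+              seasonal_order=(1, 1, 1, 12)).fit(disp=False)
+
+# The 'humidity' row of the summary shows its estimated coefficient
+print(fit.summary())
+----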
+
+=== Evaluating Model Fit on the Training Data
+
+Once the model is fitted, compare the fitted values to the actual training data to assess how well the model captures overall trends and seasonality.
+
+[source,python]
+----
+fitted_values = model_fit.fittedvalues
+
+plt.figure(figsize=(14, 6))
+plt.plot(train['DATE'], y_train, label='Actual (Train)', color='blue')
+plt.plot(train['DATE'], fitted_values, label='Fitted (Train)', color='orange', linestyle='--')
+
+plt.title('Training Set: Actual vs Fitted (SARIMAX)')
+plt.xlabel('Date')
+plt.ylabel('Temperature (°F)')
+plt.xticks(rotation=45)
+plt.legend()
+plt.grid(True)
+plt.tight_layout()
+plt.show()
+----
+
+image::TimeSeriesSection2-5c.png[width=600, height=450, title="Actual vs Fitted Temperature"]
+
+This visualization provides a qualitative check on whether the model is capturing both the overall structure and the repeating seasonal patterns in the data.
+
+=== Evaluating Performance on the Test Set
+
+To assess how well the SARIMAX model generalizes to unseen data, generate forecasts for the test period and compute the Mean Absolute Error (MAE).
+
+[source,python]
+----
+y_test = test[target_col]
+
+forecast = model_fit.forecast(
+ steps=len(test),
+ exog=test[exog_cols]
+)
+
+mae_test = mean_absolute_error(y_test, forecast)
+
+print(
+    f"Mean Absolute Error (Test Set): {mae_test:.2f}°F — this means the model's "
+    f"predicted temperatures are, on average, about {mae_test:.2f}°F away from the actual values."
+)
+----
+
+----
+Mean Absolute Error (Test Set): 3.08°F — this means the model's predicted temperatures are, on average, about 3.08°F away from the actual values.
+----
+
+Evaluating performance on new data is essential for understanding whether the model has learned meaningful patterns or is simply fitting noise in the training set.
+
+
+== Forecast Into the Future
+
+You have evaluated your SARIMAX model on the test set (January 2023–December 2024). The final step is to push the model beyond the observed data and use it to forecast future temperatures.
+
+This is where time series modeling becomes especially powerful. By learning patterns from historical data—including trends, seasonality, and relationships with other variables—the model can generate informed predictions about what may happen next.
+
+In this section, you will use your fitted SARIMAX model to forecast **12 additional months into the future (January–December 2025)** and visualize how those predictions compare to historical trends.
+
+=== Forecasting Beyond the Observed Data
+
+To generate forecasts for future dates, the SARIMAX model still requires values for the exogenous variables. Since true future values are unknown, a common and reasonable placeholder strategy is to reuse recent observed values. Here, the final 12 months of exogenous data from the test set will serve as a stand-in for 2025.
+
+=== Creating the Future Forecast
+
+Begin by constructing a DataFrame that contains future dates and forecasted temperatures.
+
+[source,python]
+----
+# Use the last 12 rows of exogenous variables as a placeholder for 2025
+future_exog = test[exog_cols].tail(12).copy()
+
+# Forecast 12 months into the future
+future_forecast = model_fit.forecast(
+ steps=12,
+ exog=future_exog
+)
+
+# Create a date range for the forecast period
+future_dates = pd.date_range(start='2025-01-01', periods=12, freq='MS')
+
+# Combine dates and forecasts into a DataFrame
+future_df = pd.DataFrame({
+ 'DATE': future_dates,
+ 'Forecasted_Temp': future_forecast
+})
+
+future_df.head()
+----
+
+----
+ DATE Forecasted_Temp
+132 2025-01-01 29.074106
+133 2025-02-01 33.112588
+134 2025-03-01 44.510898
+135 2025-04-01 52.317802
+136 2025-05-01 63.753183
+----
+
+This DataFrame represents the model’s best estimate of monthly average temperatures for 2025, based on patterns learned from historical data.
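+
+Point forecasts alone hide the model's uncertainty. statsmodels can also produce prediction intervals through `get_forecast`, whose result exposes `predicted_mean` and `conf_int()`. Here is a self-contained sketch on synthetic data (in the real workflow you would call `model_fit.get_forecast` with the future exogenous values instead):
+
+[source,python]
+----
+import numpy as np
+import pandas as pd
+from statsmodels.tsa.arima.model import ARIMA
+
+# Synthetic monthly series standing in for the temperature data
+rng = np.random.default_rng(0)
+t = np.arange(120)
+y = pd.Series(52 + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, 120))
+
+fit = ARIMA(y, order=(1, 1, 1)).fit()
+
+# get_forecast returns point forecasts plus prediction intervals
+res = fit.get_forecast(steps=12)
+intervals = res.conf_int(alpha=0.05)  # 95% intervals, one row per step
+
+print(res.predicted_mean.head(3))
+print(intervals.head(3))
+----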
+
+=== Visualizing Historical Data, Test Forecasts, and Future Predictions
+
+To better understand how the forecast behaves, create a final plot that overlays:
+
+* Actual temperatures from the full dataset
+* Model forecasts for the test period (2023–2024)
+* Future forecasts for 2025
+
+[source,python]
+----
+plt.figure(figsize=(14, 6))
+
+# Plot actual data
+plt.plot(
+ monthly_df['DATE'],
+ monthly_df[target_col],
+ label='Actual Temperature',
+ color='blue'
+)
+
+# Plot test set forecast
+plt.plot(
+ test['DATE'],
+ forecast,
+ label='Forecast (2023–2024)',
+ color='orange',
+ linestyle='--'
+)
+
+# Plot future forecast
+plt.plot(
+ future_df['DATE'],
+ future_df['Forecasted_Temp'],
+ label='Future Forecast (2025)',
+ color='green',
+ linestyle='--'
+)
+
+plt.title("Actual Temperatures, Test Forecasts, and Future Predictions")
+plt.xlabel("Date")
+plt.ylabel("Temperature (°F)")
+plt.xticks(rotation=45)
+plt.legend()
+plt.grid(True)
+plt.tight_layout()
+plt.show()
+----
+
+image::TimeSeriesSection2-6b.png[width=600, height=450, title="Actual Temperatures, Test Forecasts, and Future Predictions"]
+
+This visualization provides a holistic view of the model’s behavior across time—past, present, and projected future.
+
+=== Reflecting on the Forecast
+
+Use the final plot and forecasted values to reflect on how the SARIMAX model behaves when projecting into the future. Consider whether the forecast:
+
+* Follows expected seasonal temperature patterns
+* Aligns with trends observed in recent years
+* Appears reasonable given the historical range of values
+
+This reflection helps assess not only the numerical output of the model, but also whether the predictions make sense in the context of real-world climate behavior.
+
+