118 changes: 118 additions & 0 deletions projects-appendix/modules/spring2025/pages/30200/project_xgb.adoc
@@ -0,0 +1,118 @@
= TDM XGBoost Seminar Project

== Project Objectives

In this project, you will build an XGBoost model to accurately predict fertility rates and discover which features in the data are most important for predicting fertility rates around the globe.

.Learning Objectives
****
- Preprocess the data by interpolating missing values.
- Compute the correlation of numeric features with the target variable `fertility_rate`.
- Develop a fertility rate map to capture fertility rate trends and patterns around the globe in 2024.
- Build an XGBoost model using `XGBRegressor` and then measure its accuracy on the test set using RMSE and R².
- Extract feature importance from the model and interpret the results.
****

== Dataset
- '/anvil/projects/tdm/data/icecream/hd/images/0_hd.png'


== Questions

=== Question 1 (2 points)

.Deliverables
====
**1a. Load the dataset using the file path provided and display the first 5 rows.**

**1b. Interpolate missing numeric values within each country using linear interpolation with forward and backward fill.**

**1c. Confirm that all numeric columns no longer have missing values after interpolation.**

_Hint:_ You can use `.isna().sum()`.


====
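
The per-country interpolation in 1b and the check in 1c can be sketched as follows. This is a minimal illustration on made-up data, not the project dataset: the column names `country`, `year`, and `fertility_rate` and all values are assumptions, and your real columns may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset's columns and values will differ.
df = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    "fertility_rate": [np.nan, 2.0, np.nan, 3.0, 1.5, np.nan, np.nan, np.nan],
})

# Sort so that interpolation follows time order within each country.
df = df.sort_values(["country", "year"]).reset_index(drop=True)

numeric_cols = df.select_dtypes(include="number").columns


def fill_country(series):
    """Linearly interpolate one column of one country, then fill the edges."""
    return series.interpolate(method="linear").ffill().bfill()


# transform() applies fill_country to each numeric column within each country.
df[numeric_cols] = df.groupby("country")[numeric_cols].transform(fill_country)

# Count of remaining missing values per numeric column (all zeros expected).
missing = df[numeric_cols].isna().sum()
print(missing)
```

Grouping by country before interpolating keeps values from one country from leaking into another. The forward and backward fills handle gaps at the start or end of a country's series, where linear interpolation has no neighbor on one side.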

=== Question 2 (2 points)

.Deliverables
====
**2a. Identify the target variable for this prediction task and its mean.**

**2b. Compute the correlation matrix using all numeric features. Report the 5 features most positively and most negatively correlated with fertility rate.**


_Hint:_ You can use `.corr()` to compute the correlation matrix and `.sort_values()` to sort the correlations with the target.


**2c. Provide an interpretation for one strong positive and one strong negative correlation in 1-2 sentences.**
====
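
One way to rank features by their correlation with the target, as asked in 2b, is sketched below. The feature names `infant_mortality` and `life_expectancy` and all values are hypothetical placeholders chosen only so the example runs.

```python
import pandas as pd

# Hypothetical toy data; feature names and values are placeholders.
df = pd.DataFrame({
    "country": list("ABCDE"),
    "fertility_rate": [6.0, 5.0, 4.0, 3.0, 2.0],
    "infant_mortality": [60.0, 52.0, 41.0, 28.0, 20.0],
    "life_expectancy": [50.0, 58.0, 63.0, 72.0, 80.0],
})

# Correlation of every numeric column with the target, excluding the target itself.
target_corr = (
    df.corr(numeric_only=True)["fertility_rate"]
    .drop("fertility_rate")
    .sort_values(ascending=False)
)

top_positive = target_corr.head(5)  # most positively correlated
top_negative = target_corr.tail(5)  # most negatively correlated
print(top_positive)
print(top_negative)
```

With only two features in this toy frame, the two lists overlap. On a dataset with many numeric columns, `head(5)` and `tail(5)` would pick out distinct features.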

=== Question 3 (2 points)

.Deliverables
====
**3a. Using the most recent year for each country, extract a DataFrame with country and fertility rate.**

**3b. Merge this with a world GeoJSON file and create a choropleth map of fertility rate by country.**

**3c. What geographic patterns do you observe? Where are fertility rates highest and lowest?**
====
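
The extraction in 3a can be sketched with pandas alone. This assumes the data has `country`, `year`, and `fertility_rate` columns, which is an assumption about the dataset; the values are made up. The merge and map in 3b depend on which GeoJSON file and plotting library you choose, so they are not shown here.

```python
import pandas as pd

# Hypothetical toy data; the latest available year differs by country.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "year": [2022, 2023, 2021, 2022, 2024],
    "fertility_rate": [2.1, 2.0, 4.5, 4.4, 4.2],
})

# Keep the row with the most recent year for each country.
latest = (
    df.sort_values("year")
    .groupby("country")
    .tail(1)[["country", "fertility_rate"]]
    .sort_values("country")
    .reset_index(drop=True)
)
print(latest)
```

Before merging with a GeoJSON file, it is worth checking that country names match between the two sources, since spelling differences silently drop countries from the map.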

=== Question 4 (2 points)

.Deliverables
====
**4a. Create a feature matrix using only the 5 variables most positively correlated with `fertility_rate`. Then split the data into training and test sets (80/20).**

**4b. Use XGBRegressor to fit an XGBoost model on the training dataset.**

**4c. Evaluate the XGBoost model on the test dataset using RMSE and R².**

**4d. In 2–3 sentences, explain what XGBRegressor does. How does it differ from DecisionTreeRegressor? Why might it perform better?**
====

=== Question 5 (2 points)


.Deliverables
====
**5a. Use all numeric columns as features, excluding `country` and `fertility_rate`. Define your X and y, and split into train and test sets (80/20).**

**5b. Train an XGBRegressor using all numeric features. Report the test set RMSE and R².**

**5c. Use `xgboost.plot_importance` to visualize the top 15 most important features in the model.**

**5d. Which features were most important in predicting fertility rate? Are they consistent with your correlation analysis?**
====


=== Question 6 (2 points)


.Deliverables
====
**6a. Which features are most negatively associated with fertility rate? Use correlation values to identify them. What could these relationships imply?**

**6b. Plot how fertility rate has changed over time for the United States of America and Turkey. What patterns do you notice?**

**6c. What are two limitations of your XGBoost model for real-world decision-making?**
====
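
The time-series comparison in 6b can be sketched as below. The values are placeholders and not real fertility data, and the exact country spellings in the dataset are an assumption, so check how the two countries are named before filtering.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values only; these are NOT real fertility rates.
df = pd.DataFrame({
    "country": ["United States of America"] * 3 + ["Turkey"] * 3,
    "year": [2000, 2010, 2020] * 2,
    "fertility_rate": [3.0, 2.5, 2.0, 4.0, 3.0, 2.5],
})

# Assumed spellings; the dataset may use different country names.
countries = ["United States of America", "Turkey"]

fig, ax = plt.subplots(figsize=(8, 4))
for name in countries:
    # One line per country, ordered by year.
    subset = df[df["country"] == name].sort_values("year")
    ax.plot(subset["year"], subset["fertility_rate"], marker="o", label=name)

ax.set_xlabel("Year")
ax.set_ylabel("Fertility rate")
ax.set_title("Fertility rate over time")
ax.legend()
```

Plotting both countries on the same axes makes differences in level and slope easy to compare directly.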


== Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

.Items to submit
====
- firstname_lastname_project1.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.

You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
====