TheDataMine · arroyo38 · Aug 5, 2025 · Aug 7, 2025
diff --git a/projects-appendix/modules/spring2025/pages/30200/project_xgb.adoc b/projects-appendix/modules/spring2025/pages/30200/project_xgb.adoc
@@ -0,0 +1,118 @@
+= TDM XGBoost Seminar Project 
+
+== Project Objectives
+
+In this project, you will build an XGBoost model to predict fertility rates around the globe and identify which features in the data matter most for that prediction.
+
+.Learning Objectives
+****
+- Preprocess the data by interpolating missing values.
+- Compute the correlation of numeric features with the target variable `fertility_rate`.
+- Develop a fertility rate map to capture fertility rate trends and patterns around the globe in 2024.
+- Build an XGBoost model using `XGBRegressor`, then measure its accuracy on the test set using RMSE and R².
+- Extract feature importance from the model and interpret the results. 
+****
+
+== Dataset
+- '/anvil/projects/tdm/data/icecream/hd/images/0_hd.png'
+
+
+== Questions
+
+=== Question 1 (2 points)
+
+.Deliverables
+====
+**1a. Load the dataset using the file path provided and display the first 5 rows.**
+
+**1b. Interpolate missing numeric values within each country using linear interpolation with forward and backward fill.**
+
+**1c. Confirm that all numeric columns no longer have missing values after interpolation.**
+
+_Hint:_ You can use `.isna().sum()`.
+
+
+====
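One possible reading of 1b and 1c is sketched below on a toy frame. The `year` and `gdp` column names and all values are hypothetical stand-ins, not taken from the project dataset; only `country` and `fertility_rate` come from the handout. The sketch assumes that rows should be in time order within each country before interpolating.

```python
import numpy as np
import pandas as pd

# Toy data; `year` and `gdp` are hypothetical column names.
df = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    "fertility_rate": [np.nan, 2.0, np.nan, 3.0, 1.5, np.nan, np.nan, np.nan],
    "gdp": [1.0, np.nan, 3.0, 4.0, np.nan, 10.0, 12.0, np.nan],
})

# Sort so interpolation runs in time order within each country.
df = df.sort_values(["country", "year"]).reset_index(drop=True)

numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Linear interpolation fills interior gaps; ffill/bfill cover the edges.
# Grouping by country keeps values from leaking across countries.
df[numeric_cols] = df.groupby("country")[numeric_cols].transform(
    lambda s: s.interpolate(method="linear").ffill().bfill()
)

# 1c: every numeric column should now report zero missing values.
missing = df[numeric_cols].isna().sum()
print(missing)
```

A column that is entirely missing for a country would stay missing under this approach, so the check in 1c is worth running on the real data rather than assuming it passes.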
+
+=== Question 2 (2 points)
+
+.Deliverables
+====
+**2a. Identify the target variable for this prediction task and report its mean.**
+
+**2b. Compute the correlation matrix using all numeric features. Report the 5 features most positively correlated and the 5 features most negatively correlated with fertility rate.**
+
+
+_Hint:_ You can use `.corr()` for the correlation matrix and `.sort_values()` to rank the correlations with the target.
+
+
+**2c. Provide an interpretation for one strong positive and one strong negative correlation in 1-2 sentences.**
+====
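The ranking step in 2b could look roughly like the sketch below. The data is synthetic and the `feat_*` column names are hypothetical; with the real dataset the same pattern would apply to whatever numeric columns it contains.

```python
import numpy as np
import pandas as pd

# Synthetic data; the feat_* names are hypothetical.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame({
    "country": ["A"] * n,
    "fertility_rate": x,
    "feat_pos": x + 0.1 * rng.normal(size=n),
    "feat_neg": -x + 0.1 * rng.normal(size=n),
    "feat_noise": rng.normal(size=n),
})

# Keep numeric columns only, then take the target's column of the matrix.
corr_matrix = df.select_dtypes(include="number").corr()
corr = corr_matrix["fertility_rate"].drop("fertility_rate")

# Sort once; the head holds the most positive, the tail the most negative.
corr_sorted = corr.sort_values(ascending=False)
top_positive = corr_sorted.head(5)
top_negative = corr_sorted.tail(5).sort_values()

print(top_positive)
print(top_negative)
```

Dropping the target's own entry matters here, since its correlation with itself is always 1 and would otherwise top the positive list.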
+
+=== Question 3 (2 points)
+
+.Deliverables
+====
+**3a. Using the most recent year for each country, extract a DataFrame with country and fertility rate.**
+
+**3b. Merge this with a world GeoJSON file and create a choropleth map of fertility rate by country.**
+
+**3c. What geographic patterns do you observe? Where are fertility rates highest and lowest?**
+====
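For 3a, one way to pick the most recent row per country is `groupby(...).idxmax()`, sketched here on toy data. The `year` column name is an assumption about the dataset, and the values are made up. The choropleth in 3b is left out because it depends on the GeoJSON file and plotting library chosen.

```python
import pandas as pd

# Toy data; `year` is an assumed column name.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "year": [2022, 2023, 2021, 2024, 2023],
    "fertility_rate": [2.1, 2.0, 1.6, 1.4, 1.5],
})

# Index label of the latest year within each country.
latest_idx = df.groupby("country")["year"].idxmax()

# One row per country, keeping only the columns the map needs.
latest = df.loc[latest_idx, ["country", "fertility_rate"]].reset_index(drop=True)
print(latest)
```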
+
+=== Question 4 (2 points)
+
+.Deliverables
+====
+**4a. Create a feature matrix using only the 5 variables most positively correlated with `fertility_rate`. Then split the data into training and test sets (80/20).**
+
+**4b. Use `XGBRegressor` to fit an XGBoost model on the training dataset.**
+
+**4c. Evaluate the XGBoost model on the test dataset using RMSE and R².**
+
+**4d. In 2–3 sentences, explain what `XGBRegressor` does. How does it differ from `DecisionTreeRegressor`? Why might it perform better?**
+====
+
+=== Question 5 (2 points)
+
+
+.Deliverables
+====
+**5a. Use all numeric columns as features, excluding `country` and `fertility_rate`. Define your `X` and `y`, and split them into training and test sets (80/20).**
+
+**5b. Train an XGBRegressor using all numeric features. Report the test set RMSE and R².**
+
+**5c. Use `xgboost.plot_importance` to visualize the top 15 most important features in the model.**
+
+**5d. Which features were most important in predicting fertility rate? Are they consistent with your correlation analysis?**
+====
+
+
+=== Question 6 (2 points)
+
+
+.Deliverables
+====
+**6a. Which features are most negatively associated with fertility rate? Use correlation values to identify them. What could these relationships imply?**
+
+**6b. Plot how fertility rate has changed over time for the United States of America and Turkey. What patterns do you notice?**
+
+**6c. What are two limitations of your XGBoost model for real-world decision-making?**
+====
+
+
+== Submitting your Work
+
+Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
+
+.Items to submit
+====
+- firstname_lastname_project1.ipynb
+====
+
+[WARNING]
+====
+You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
+
+You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
+====