diff --git a/projects-appendix/modules/spring2025/pages/30200/project_xgb.adoc b/projects-appendix/modules/spring2025/pages/30200/project_xgb.adoc
new file mode 100644
index 0000000000..6bcaa2ea57
--- /dev/null
+++ b/projects-appendix/modules/spring2025/pages/30200/project_xgb.adoc
@@ -0,0 +1,118 @@
+= TDM XGBoost Seminar Project
+
+== Project Objectives
+
+In this project, you will build an XGBoost model to predict fertility rates and discover which features in the data are most important for predicting fertility rates around the globe.
+
+.Learning Objectives
+****
+- Preprocess the data by interpolating missing values.
+- Compute the correlation of numeric features with the target variable `fertility_rate`.
+- Develop a fertility rate map to capture fertility rate trends and patterns around the globe in 2024.
+- Build an XGBoost model using `XGBRegressor`, then measure its accuracy on the test set using RMSE and R².
+- Extract feature importance from the model and interpret the results.
+****
+
+== Dataset
+- '/anvil/projects/tdm/data/icecream/hd/images/0_hd.png'
+
+
+== Questions
+
+=== Question 1 (2 points)
+
+.Deliverables
+====
+**1a. Load the dataset using the file path provided and display the first 5 rows.**
+
+**1b. Interpolate missing numeric values within each country using linear interpolation with forward and backward fill.**
+
+**1c. Confirm that all numeric columns no longer have missing values after interpolation.**
+
+_Hint:_ You can use `.isna().sum()`.
+
+
+====
+
+=== Question 2 (2 points)
+
+.Deliverables
+====
+**2a. Identify the target variable for this prediction task and report its mean.**
+
+**2b. Compute the correlation matrix using all numeric features. Report the 5 features most positively correlated and the 5 features most negatively correlated with fertility rate.**
+
+
+_Hint:_ You can use `.corr()` to compute the correlation matrix and `.sort_values()` to sort the correlations.
+
+
+**2c. In 1-2 sentences, interpret one strong positive and one strong negative correlation.**
+====
+
+=== Question 3 (2 points)
+
+.Deliverables
+====
+**3a. Using the most recent year for each country, extract a DataFrame with country and fertility rate.**
+
+**3b. Merge this with a world GeoJSON file and create a choropleth map of fertility rate by country.**
+
+**3c. What geographic patterns do you observe? Where are fertility rates highest and lowest?**
+====
+
+=== Question 4 (2 points)
+
+.Deliverables
+====
+**4a. Create a feature matrix using only the 5 variables most positively correlated with `fertility_rate`. Then split the data into training and test sets (80/20).**
+
+**4b. Use `XGBRegressor` to fit an XGBoost model on the training set.**
+
+**4c. Evaluate the XGBoost model on the test set using RMSE and R².**
+
+**4d. In 2–3 sentences, explain what `XGBRegressor` does. How does it differ from `DecisionTreeRegressor`? Why might it perform better?**
+====
+
+=== Question 5 (2 points)
+
+
+.Deliverables
+====
+**5a. Use all numeric columns as features, excluding `country` and `fertility_rate`. Define your X and y, and split into training and test sets (80/20).**
+
+**5b. Train an `XGBRegressor` using all numeric features. Report the test set RMSE and R².**
+
+**5c. Use `xgboost.plot_importance` to visualize the 15 most important features in the model.**
+
+**5d. Which features were most important in predicting fertility rate? Are they consistent with your correlation analysis?**
+====
+
+
+=== Question 6 (2 points)
+
+
+.Deliverables
+====
+**6a. Which features are most negatively associated with fertility rate? Use correlation values to identify them. What could these relationships imply?**
+
+**6b. Plot how fertility rate has changed over time for the United States of America and Turkey. What patterns do you notice?**
+
+**6c. What are two limitations of your XGBoost model for real-world decision-making?**
+====
+
+
+== Submitting your Work
+
+Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
+
+.Items to submit
+====
+- firstname_lastname_project1.ipynb
+====
+
+[WARNING]
+====
+You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this.
+
+You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
+====
\ No newline at end of file