By: Lynn Do, Linh Ngoc Le
From the perspective of a health insurance company, the medical expenses of its beneficiaries have a direct impact on whether the cost of an insurance plan should be increased or decreased (in order to maximize profit). Insurers are therefore especially interested in predicting medical expenses and in identifying the predictors that are associated with changes in them.
The data set we look at is synthetic, but it was built using actual demographic statistics from the US Census Bureau, so it can approximate real-world conditions.
There are 1338 observations, each representing an individual enrolled in the insurance plan. There are 7 variables: six demographic characteristics, namely age, sex, BMI (a measure of a person's weight relative to height), number of children, smoking habits, and region of residence in the US, plus the medical charges. For this analysis, we treat charges as the response variable and the other 6 variables as potential explanatory variables.
1. Examining the numerical variables
Just from visual inspection, focusing on the last row, we observe a few things:
- There seems to be a positive correlation between `age` and `charges`. In the plot, there are 3 clouds of points, one for each tier of charges. The positive correlation can be seen within each cloud.
- The same holds for `bmi` and `charges`. There seem to be 2 tiers of charges, each with a positive correlation with `bmi`.
- On the other hand, there is little to no relationship between `charges` and number of children.
2. Examining the categorical variables
Smoking habit seems to be the factor most strongly related to the change in charges. Specifically, the yes group (the smokers) tends to be associated with higher charges.
Using BIC, we found that the best model (model 4) included these variables: age, bmi, children, smoker. However, as observed above, we may run into a problem with the linearity condition if we include children in the model: on its own, children has no visible correlation with charges.
Our simple linear regression of charges on children alone, even after trying several transformations, supported this point. That leads us to select just 3 variables: age, bmi, smoker.
Recall that there are clearly 2 distinct clouds in the bmi vs. charges plot. When we look closely at the interactions between the selected variables, we also observe 2 distinct slopes, one for each smoker group: an additional unit of bmi is associated with a smaller increase in charges for people who don't smoke than for people who smoke.
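The BIC comparison described above can be sketched in base R as follows. This is only a minimal illustration, not the code behind the reported results: the data frame `sim` is simulated with made-up coefficients so that the block runs on its own, and only the column names follow the data set described in this report. The report loads the `leaps` package, which suggests a subset search may have been used; the sketch instead compares two hand-picked candidates.

```r
# Simulated stand-in for the insurance data (made-up coefficients, illustration only)
set.seed(1)
n <- 300
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = runif(n, 18, 45),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
sim$charges <- 2000 + 250 * sim$age + 20 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000) + rnorm(n, sd = 4000)

# Candidate models: with and without children
candidates <- list(
  with_children    = charges ~ age + bmi + children + smoker,
  without_children = charges ~ age + bmi + smoker
)

# BIC for each candidate; a lower value is preferred
bic_values <- sapply(candidates, function(f) BIC(lm(f, data = sim)))
print(sort(bic_values))
```

Because the data here are simulated without any effect of `children`, the ranking printed by this sketch says nothing about the ranking on the real data.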
This guides us to add an interaction for bmi and smoker in our final model to account for the pattern we observed:
Estimated model:
The predictors
- $X_1$: age
- $X_2$: bmi
- $X_3$: indicator for the smoker group (1 if the individual smokes, 0 otherwise)
- $X_4$: $X_3 \times bmi$
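A model with these four predictors can be fitted in R with the formula `charges ~ age + bmi * smoker`, where `bmi * smoker` expands to both main effects plus their interaction. The sketch below is a hedged illustration on simulated data (the data frame `sim` and its coefficients are made up); it shows how the two bmi slopes, one per smoker group, are read off the fitted coefficients.

```r
# Simulated stand-in for the insurance data (made-up coefficients, illustration only)
set.seed(1)
n <- 300
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = runif(n, 18, 45),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
sim$charges <- 2000 + 250 * sim$age + 20 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000) + rnorm(n, sd = 4000)

# bmi * smoker = bmi + smoker + bmi:smoker
fit <- lm(charges ~ age + bmi * smoker, data = sim)
b <- coef(fit)

# Slope of charges on bmi within each smoker group
slope_nonsmoker <- unname(b["bmi"])
slope_smoker    <- unname(b["bmi"] + b["bmi:smokeryes"])
print(c(nonsmoker = slope_nonsmoker, smoker = slope_smoker))
```

The coefficient `smokeryes` corresponds to $X_3$ and `bmi:smokeryes` to $X_4$ in the notation above.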
- Independence: There is not enough information to tell whether the individuals in this data set are related (we do not know how individuals with these demographic characteristics were selected). In this case, we have to assume independence.
- Linearity:
  - Residuals vs BMI: we color the points by `smokeryes` and `smokerno` and notice the patterns in the plot above, which makes us unsure whether this condition is met. As shown, there are 2 distinct slopes, one for each smoker group, which we have accounted for using the interaction term. We will revisit this when we transform the variables.
  - Residuals vs age: linearity is met (no observable pattern).
- Equal variance of residuals: The standard deviations of the residuals in the two smoker groups are roughly similar (ratio = 1.22), so the equal variance condition appears to be satisfied.
- Outliers: Across all metrics for identifying outliers, we observe the same pattern: there are many points with exceptionally high `y`.
  For leverage, many points lie above the cutoff line at 0.0075 in the leverage plot and are flagged as high-leverage points.
  Likewise, many studentized residuals have absolute values greater than 2, which indicates potential outliers.
  The Cook's distance plot gives more insight into which points would have the most effect on the model if they were omitted.
  This is definitely an issue we need to examine.
- Normal distribution of residuals: The distribution of the residuals looks right-skewed, so we will attempt a transformation to meet this condition.
- Multicollinearity: As observed in part 1, and confirmed by the VIF results, there is no concerning correlation between the explanatory variables.
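The outlier and equal-variance checks listed above can be sketched with base R diagnostics. This is an illustration on simulated data (`sim` and its coefficients are made up), and the leverage cutoff $2(k+1)/n$ used here is a common rule of thumb that is assumed rather than confirmed by the report; with $n = 1338$ and 5 coefficients it gives $10/1338 \approx 0.0075$, which matches the cutoff quoted above.

```r
# Simulated stand-in for the insurance data (made-up coefficients, illustration only)
set.seed(1)
n <- 300
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = runif(n, 18, 45),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
sim$charges <- 2000 + 250 * sim$age + 20 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000) + rnorm(n, sd = 4000)

fit <- lm(charges ~ age + bmi * smoker, data = sim)

# Leverage: flag points above 2 * (number of coefficients) / n (assumed rule of thumb)
leverage_cutoff <- 2 * length(coef(fit)) / nobs(fit)
high_leverage   <- which(hatvalues(fit) > leverage_cutoff)

# Studentized residuals: flag absolute values greater than 2
large_resid <- which(abs(rstudent(fit)) > 2)

# Cook's distance: the five most influential observations
top_cook <- head(sort(cooks.distance(fit), decreasing = TRUE), 5)

# Equal variance: ratio of residual SDs between the two smoker groups
group_sd <- tapply(resid(fit), sim$smoker, sd)
sd_ratio <- max(group_sd) / min(group_sd)

print(list(n_high_leverage = length(high_leverage),
           n_large_resid   = length(large_resid),
           sd_ratio        = sd_ratio))
```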
So there are 3 conditions we need to examine: outliers, normal distribution of residuals, and linearity for bmi. We will then show that leaving the variables untransformed actually works better than transforming them.
We determine a set of transformations for charges, age and bmi according to our findings in the simple linear regression analysis as follows:
First, we go down the ladder of powers for charges, since the distribution of the residuals is right-skewed. We choose log. The other variables are left as is.
Based on our observation:
- Linearity for `age` vs. `charges` goes from "met" to "unmet" with this transformation. There is a curve in this plot once we take the log of charges, which is visible even with an additional term $age^3$.
  Going further down the ladder for the response makes the curved pattern even more obvious.
- The outlier issue seems to persist, even after we went further down the ladder. There just seem to be a lot of data points with extremely high `charges` values. We will examine this issue closely in a minute.
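As a rough sketch of the log step, the block below fits the same model with `charges` and with `log(charges)` as the response and compares the skewness of the residuals. It runs on simulated data with made-up coefficients and multiplicative noise (an assumption chosen only so that charges stay positive), and the `skewness` helper is defined here for illustration, so it does not reproduce the findings reported above.

```r
# Simulated stand-in for the insurance data (made-up coefficients, illustration only)
set.seed(2)
n <- 300
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = runif(n, 18, 45),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
mean_charges <- 2000 + 250 * sim$age + 20 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000)
# Multiplicative noise keeps charges positive so that log() is defined
sim$charges <- mean_charges * exp(rnorm(n, sd = 0.4))

# Sample skewness (hypothetical helper): positive values indicate a right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

fit_raw <- lm(charges ~ age + bmi * smoker, data = sim)
fit_log <- lm(log(charges) ~ age + bmi * smoker, data = sim)

skew_raw <- skewness(resid(fit_raw))
skew_log <- skewness(resid(fit_log))
print(c(raw = skew_raw, log = skew_log))
```

Plotting `resid(fit_log)` against `sim$age` would be the analogue of the residuals vs age check discussed above.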
If we use the threshold
Just to make sure we handle the worst case scenario, we performed a quick analysis of the model with and without the suspicious leverage points:
- The p-values for the parameters do not change significantly.
- The estimates: we consider a new estimate that is more than 3 SDs away from the original estimate to be "weird".
  - `age` and `bmi` coefficients do not seem to change significantly. The new estimates are < 3 SDs away from the old ones.
  - $I(smoker=yes)$ and interaction term coefficients: most significant change.
What does this mean? We took a closer look at the suspicious data points and got an interesting finding: 88% of the high leverage points (100 points out of 113) come from the smoker group! This supports our earlier observation about the distinction between these 2 groups.
Long story short: we do not see a need to transform charges here, as doing so neither clearly improves the residuals nor improves the linearity condition. The interaction term, as we saw, already accounts for the difference in the slope of charges against bmi between the 2 smoker groups.
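The with/without comparison described above can be sketched as follows. This is a hedged illustration on simulated data (`sim` and its coefficients are made up). It assumes that the "SDs" in the text are the standard errors of the original estimates and that high-leverage points are those above $2(k+1)/n$; neither detail is confirmed by the report.

```r
# Simulated stand-in for the insurance data (made-up coefficients, illustration only)
set.seed(3)
n <- 400
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = runif(n, 18, 45),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
sim$charges <- 2000 + 250 * sim$age + 20 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000) + rnorm(n, sd = 4000)

full_fit <- lm(charges ~ age + bmi * smoker, data = sim)

# Drop points above the assumed leverage cutoff and refit
cutoff      <- 2 * length(coef(full_fit)) / nobs(full_fit)
keep        <- hatvalues(full_fit) <= cutoff
trimmed_fit <- lm(charges ~ age + bmi * smoker, data = sim[keep, ])

# Shift of each estimate, in units of the original standard error
se_full     <- summary(full_fit)$coefficients[, "Std. Error"]
shift_in_se <- (coef(trimmed_fit) - coef(full_fit)) / se_full
print(round(shift_in_se, 2))

# Which smoker group do the removed points come from?
print(table(sim$smoker[!keep]))
```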
Going back to our BIC results, there are 2 other models that perform as well as or even better than our final model: models 4 and 5.
But both of these models include the children variable as a predictor, which we already found has little to no correlation with the response. Since the assumptions may not be met for these 2 models, we leave the model equations here, to be applied with caution:
Model 4: age, bmi, children, smoker
Model 5: age, bmi, children, smoker, regionsoutheast
For this model, we need to create a new variable region_new that groups all regions other than southeast into 1 group.
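One way to build such a variable is sketched below. The short `region` vector is made-up example data, and the label `"other"` for the merged group is an assumption; only the name `region_new` and the `southeast` level come from the text.

```r
# Made-up example values for the region column
region <- c("southeast", "southwest", "northeast", "northwest", "southeast")

# Collapse every region except southeast into a single "other" group;
# putting "other" first makes it the reference level in a regression
region_new <- factor(
  ifelse(region == "southeast", "southeast", "other"),
  levels = c("other", "southeast")
)
print(table(region_new))
```

With `"other"` as the reference level, a model that includes `region_new` gets a single indicator coefficient for the southeast group.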
We compare the performance of the 3 models with and without the outliers, using BIC as the metric:
As we can see, model 4 has the lowest BIC of all 3, both for the fit on all data and for the fit on the data without the outliers. We do not know whether the differences in BIC are large enough to conclude that children is actually important. This might be an area for future work.
In our final model, from the summary, we can see that age, smoker, and the interaction term are statistically significant in predicting mean charges. On its own, bmi is actually not a significant term in the model.
- $Y$: charges

The predictors
- $X_1$: age
- $X_2$: bmi
- $X_3$: indicator for the smoker group (1 if the individual smokes, 0 otherwise)
- $X_4$: $X_3 \times bmi$

Final model describing the relationship between charges (response variable) and bmi, smoker, and age:
We are 95% confident that the true
Looking at the p-value for the F-test in the final model's summary, we see that it is much smaller than 0.0001. These data provide very strong evidence against the null hypothesis that none of the predictors in the final model (age, bmi, smoking status, and the interaction between bmi and smoking status) is associated with a change in average medical expenses.
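The overall F-test p-value and the 95% confidence intervals for the coefficients can be pulled out of a fitted `lm` object as sketched below. The data frame `sim` is simulated with made-up coefficients, so the numbers it produces are not the ones reported for the final model.

```r
# Simulated stand-in for the insurance data (made-up coefficients, illustration only)
set.seed(4)
n <- 300
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = runif(n, 18, 45),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
sim$charges <- 2000 + 250 * sim$age + 20 * sim$bmi +
  (sim$smoker == "yes") * (1400 * sim$bmi - 20000) + rnorm(n, sd = 4000)

fit <- lm(charges ~ age + bmi * smoker, data = sim)

# Overall F-test: H0 is that all slope coefficients are zero
fstat   <- summary(fit)$fstatistic
p_value <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                     lower.tail = FALSE))

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)
print(p_value)
print(ci)
```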
- Even though the characteristics of the observations are said to be taken from real-world US Census Bureau data, the expenses are simulated, so there are limits to the population we can extend the results to;
- Data on medical expenses may be kept private (for example, in medical records), so access to real data may be limited;
- Many other economic factors that are not covered in this study may be determinants of medical expenses and insurance purchases;
- The model can be used to predict medical expenses for reference purposes, but only for populations with characteristics similar to those in this study. The study, if possible, can be expanded to larger demographics for better generalization;
- We hope to get access to an updated data set that better reflects current market conditions and consumer tendencies in the insurance and medical markets;
- We will try other transformations to see if we can find a better model;
- More variables, such as income, occupation, and health conditions, could be included in future data sets to create a more comprehensive model.
- Datta, A. (n.d.). US Health Insurance Dataset. [online] www.kaggle.com. Available at: https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset/discussion/156033 [Accessed 26 Mar. 2024].
- www.scikit-yb.org. (n.d.). Cook’s Distance — Yellowbrick v1.5 Documentation. [online] Available at: https://www.scikit-yb.org/en/latest/api/regressor/influence.html#:~:text=Because%20of%20this%2C%20Cook [Accessed 25 Apr. 2024].
- Class notes: Multiple Regression Outliers, Multiple Regression Model Selection, Multiple Regression Multicollinearity, HW8 Key
R libraries
library(dplyr) # functions like summarize
library(ggplot2) # for making plots
library(readr)
library(tidyverse)
library(GGally)
library(grid)
library(gridExtra)
library(leaps)
library(car)



















