Hi everyone,
Please see the code below. It is simulated for the sake of illustration. My actual data has an "age" variable with 10 distinct values. I also have about 7 categorical variables. I want to know if it's appropriate to plot age along the x-axis and my predicted response on the y-axis with upper and lower CI bounds. I'm wondering if my plot looks weird because my age variable is not necessarily continuous. Setting the weird plot aside, does it make sense to visualize CI's like this without accounting for the "gender" variable? In other words, can I say these are the lower and upper bounds for the 3 education groups across both genders and all ages?
Please see the code below. It is simulated for the sake of illustration. My actual data has an "age" variable with 10 distinct values. I also have about 7 categorical variables. I want to know if it's appropriate to plot age along the x-axis and my predicted response on the y-axis with upper and lower CI bounds. I'm wondering if my plot looks weird because my age variable is not necessarily continuous. Setting the weird plot aside, does it make sense to visualize CI's like this without accounting for the "gender" variable? In other words, can I say these are the lower and upper bounds for the 3 education groups across both genders and all ages?
Code:
library(purrr)
library(dplyr)
library(ggplot2)
# simulate data
set.seed(2722)
age<-rdunif(n = 1000,a = 25,b = 35)
gender<-sample(x = c("M","F"),size = 1000,replace = TRUE,prob = c(.45,.55))
education<-sample(x = c("HS","C","PG"),size = 1000,replace = TRUE,prob = c(.60,.30,.10))
dat<-data.frame(age,gender,education)
# simulate normal errors
normal_errors<-rnorm(n = 1000,mean = 0,sd = 5)
dat<-dat %>%
mutate(female_ind=case_when(gender=="F"~1,
TRUE~0),
college_ind=case_when(education=="C"~1,
TRUE~0),
postgrad_ind=case_when(education=="PG"~1,
TRUE~0),
# create response as a linear combo of X
Y=(-2+1.2*age+3*female_ind+2.1*college_ind+2.6*postgrad_ind))
# add errors to the response
dat$Y<-dat$Y + normal_errors
# relevel the factors to match the indicators from above
dat$gender<-factor(dat$gender,levels = c("M","F"))
dat$education<-factor(dat$education,levels = c("HS","C","PG"))
# create model
model<-lm(Y~age+gender+education,data = dat)
summary(model)
# create "new data" to plot CI's
newdata<-dat %>%
select(age,gender,education) %>%
distinct()
# add confidence intervals for each mean response
newdata <- cbind(newdata, predict(model, newdata, interval = "confidence",level = .95))
# plot weird graph
ggplot(newdata, aes(age, fit)) +
geom_line(aes(col=education), size = 1) +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = education), alpha = .25)+
labs(x = "Age", y = "Predicted Y")
Attachments
-
9.2 KB Views: 5