Plotting Confidence Intervals

Buckeye

Active Member
#1
Hi everyone,

Please see the code below. It is simulated for the sake of illustration. My actual data has an "age" variable with 10 distinct values. I also have about 7 categorical variables. I want to know if it's appropriate to plot age along the x-axis and my predicted response on the y-axis with upper and lower CI bounds. I'm wondering if my plot looks weird because my age variable is not necessarily continuous. Setting the weird plot aside, does it make sense to visualize CI's like this without accounting for the "gender" variable? In other words, can I say these are the lower and upper bounds for the 3 education groups across both genders and all ages?

Code:
library(purrr)
library(dplyr)
library(ggplot2)

# simulate data
set.seed(2722)
age<-rdunif(n = 1000,a = 25,b = 35)
gender<-sample(x = c("M","F"),size = 1000,replace = TRUE,prob = c(.45,.55))
education<-sample(x = c("HS","C","PG"),size = 1000,replace = TRUE,prob = c(.60,.30,.10))

dat<-data.frame(age,gender,education)
# simulate normal errors
normal_errors<-rnorm(n = 1000,mean = 0,sd = 5)

dat<-dat %>%
  mutate(female_ind=case_when(gender=="F"~1,
                              TRUE~0),
         college_ind=case_when(education=="C"~1,
                               TRUE~0),
         postgrad_ind=case_when(education=="PG"~1,
                                TRUE~0),
         # create response as a linear combo of X
         Y=(-2+1.2*age+3*female_ind+2.1*college_ind+2.6*postgrad_ind))

# add errors to the response
dat$Y<-dat$Y + normal_errors

# relevel the factors to match the indicators from above
dat$gender<-factor(dat$gender,levels = c("M","F"))
dat$education<-factor(dat$education,levels = c("HS","C","PG"))

# create model
model<-lm(Y~age+gender+education,data = dat)
summary(model)

# create "new data" to plot CI's
newdata<-dat %>%
  select(age,gender,education) %>%
  distinct()

# add confidence intervals for each mean response
newdata <- cbind(newdata, predict(model, newdata, interval = "confidence",level = .95))

# plot weird graph
ggplot(newdata, aes(age, fit)) +
  geom_line(aes(col=education), size = 1) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, fill = education), alpha = .25)+
  labs(x = "Age", y = "Predicted Y")
 

Attachments

hlsmith

Less is more. Stay pure. Stay poor.
#2
I haven't thought too hard about this, but would this be better illustrated using dots with whiskers for each age? Yeah if gender was in the model, these estimates are conditional on gender and what you used as the reference group, right?

Good to see you @Buckeye
 

Buckeye

Active Member
#3
Thanks hlsmith, yes I was thinking the same thing about the dot and whisker plots. I made that visual with the coefficients of my "state" variable.
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
Yeah, I think the connecting the categorical age groups is very distracting, but you know this. I make me wonder if their is something else I am supposed to be looking at. But overall there is a linear trend. Is gender significant and not an interaction term with any of these. If it was really adding much it could be dropped. If you needed it and not an interaction, you could fit the model again with the other gender as a reference and then see what it looks like plotting dot with whisker like before but fit the male and then females on the same of adjacent plots. You would have to do a little scoring so intercept = one group and intercept + gender term = the other gender.