Difference between simulating the dependent variable and simulating the error terms and adding them to the fitted values, assuming normality?

KNR

New Member
#1
What's the statistical difference between simulating the dependent variable directly and simulating the error terms and adding them to the fitted values, assuming normality (Gaussian GLM)?

Say I'm doing a simple multiple regression on the following data (R):
Code:
n <- 40
x1 <- rnorm(n, mean=3, sd=1)
x2 <- rnorm(n, mean=4, sd=1.25)
y <- 2*x1 + 3*x2 + rnorm(n, mean=2, sd=1)
mydata <- data.frame(x1, x2, y)
mod <- lm(y ~ x1 + x2, data=mydata)
I don't get the statistical difference between:
  • tmp <- predict(mod) + rnorm(length(predict(mod)), 0, summary(mod)$sigma) (what R's simulate function does);
  • tmp <- rnorm(length(predict(mod)), mean(y), sd(y));
What is the proper way to resample the dependent variable assuming a Gaussian GLM?
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Almost, but not completely following. Errors represent epistemological uncertainty, right? Without them it would be a deterministic model.

You could simulate a truckload of data and sample without replacement. What is the purpose of this endeavor?
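A rough sketch of that idea, assuming the model and data from post #1 (the size N and the predictor distributions here are illustrative choices, not anything specified in the thread):
Code:
# simulate a large synthetic dataset from the fitted model,
# then draw a subsample without replacement
N <- 10000
sim_x1 <- rnorm(N, mean=3, sd=1)
sim_x2 <- rnorm(N, mean=4, sd=1.25)
sim_y <- predict(mod, newdata=data.frame(x1=sim_x1, x2=sim_x2)) +
  rnorm(N, 0, summary(mod)$sigma)
subsample <- data.frame(x1=sim_x1, x2=sim_x2, y=sim_y)[sample(N, size=40, replace=FALSE), ]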
 

Buckeye

Active Member
#4
Just to be clear, you can also bootstrap the entire vector of independent variables along with the dependent variable. I'm assuming you have data that you want to fit with a GLM (not that you are trying to simulate the data yourself). There are probably more elegant ways to do this in R.
Code:
library(dplyr)

data("mtcars")

# add row number
mtcars <- mtcars %>%
  mutate(row_nbr = row_number())

# sample rows with replacement
resampled_rows <- sample(x = mtcars$row_nbr, size = nrow(mtcars), replace = TRUE)

# get bootstrapped data
bootstrapped_data <- mtcars[resampled_rows, ]
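To make this concrete, one way to use the resampled rows is to refit the model on each bootstrap sample and collect the coefficients. A minimal sketch (mpg ~ wt + hp is just an illustrative formula for mtcars, not something from the thread):
Code:
# refit the model on B bootstrap samples and collect coefficients
B <- 1000
boot_coefs <- replicate(B, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[rows, ]))
})
apply(boot_coefs, 1, sd)  # nonparametric bootstrap standard errors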
If you plan to simulate the data, maybe this link will help: https://stats.stackexchange.com/questions/59062/multiple-linear-regression-simulation
 

Dason

Ambassador to the humans
#5
Code:
tmp <- predict(mod) + rnorm(length(predict(mod)), 0, summary(mod)$sigma)
This takes the fitted values from your model and then simulates new error terms for them based on the estimated error variance. This is typically what is known as a parametric bootstrap.
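For instance, a full parametric bootstrap repeats that simulation step and refits the model each time. A minimal sketch using the objects from post #1:
Code:
# parametric bootstrap: simulate new responses from the fitted
# model, refit, and collect the coefficients
B <- 1000
boot_coefs <- replicate(B, {
  y_star <- predict(mod) + rnorm(n, 0, summary(mod)$sigma)
  coef(lm(y_star ~ x1 + x2, data = mydata))
})
apply(boot_coefs, 1, sd)  # parametric bootstrap standard errors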

Code:
tmp <- rnorm(length(predict(mod)), mean(y), sd(y))
This just simulates draws from a univariate normal distribution with the same mean and standard deviation as the dependent variable. It doesn't use the fitted model at all.

For doing something like a bootstrap, I don't think the second option is what you would want at all. With that said, just resampling the data directly (which is what Buckeye suggests) is more akin to what I would assume somebody is doing if they just said they bootstrapped.
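Also worth noting: base R's simulate() does the first option for you, so you don't have to write the rnorm() call by hand. A quick sketch:
Code:
# simulate() draws new responses from the fitted Gaussian model,
# i.e. fitted values plus N(0, sigma^2) noise
y_star <- simulate(mod, nsim = 1)$sim_1
coef(lm(y_star ~ x1 + x2, data = mydata))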
 

KNR

New Member
#6
Code:
tmp <- predict(mod) + rnorm(length(predict(mod)), 0, summary(mod)$sigma)
This takes the fitted values from your model and then simulates new error terms for them based on the estimated error variance. This is typically what is known as a parametric bootstrap.

Code:
tmp <- rnorm(length(predict(mod)), mean(y), sd(y))
This just simulates a univariate normal distribution with mean that has the same mean and standard deviation as the dependent variable. It doesn't do anything related to the fitted model at all.

For doing something like bootstrap I don't think the second thing is what you would want at all. With that said just resampling the data directly (which is what Buckeye suggests) might be more akin to what I would assume somebody is doing if they just said they bootstrapped.
Ok, thank you, this is very clear. Plotting X vs. fitted.values, X vs. option 1, and X vs. option 2 helped me a lot to understand the difference.
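For anyone reading along, a minimal sketch of those comparison plots, assuming x1 from post #1 as the horizontal axis:
Code:
# compare fitted values against the two simulation options
par(mfrow = c(1, 3))
plot(x1, fitted(mod), main = "Fitted values")
plot(x1, predict(mod) + rnorm(n, 0, summary(mod)$sigma), main = "Option 1")
plot(x1, rnorm(n, mean(y), sd(y)), main = "Option 2")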