Simulation for logistic regression in R

#1
Dear All,

I would like to simulate data for a logistic regression with the following variables:
x1 = numeric (mean = 25, sd = 5)
x2 = numeric (mean = 50, sd = 10)
x3 = factor variable with 5 levels
x4 = factor variable with 3 levels
x5 = factor variable with 2 levels

How can I do that in R?
Thank you
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
I think the big question now is how these variables relate to the Y variable (e.g., y = B0 + B1*X1 + ... + random error)?
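
To make that question concrete, here is a minimal sketch in R that builds the five predictors from post #1 and ties them to y through a logistic model. The sample size and every coefficient value (`b0`, `b_x1`, `b_x3`, and so on) are made up for illustration; they are assumptions, not something stated in the thread.

```r
set.seed(1)
n <- 5000  # assumed sample size

# Predictors as described in post #1
x1 <- rnorm(n, mean = 25, sd = 5)
x2 <- rnorm(n, mean = 50, sd = 10)
x3 <- factor(sample(1:5, n, replace = TRUE))
x4 <- factor(sample(1:3, n, replace = TRUE))
x5 <- factor(sample(1:2, n, replace = TRUE))

# Hypothetical true coefficients; the first level of each factor is the reference
b0   <- -2
b_x1 <- 0.08
b_x2 <- -0.02
b_x3 <- c(0, 0.3, -0.3, 0.6, -0.6)
b_x4 <- c(0, 0.5, -0.5)
b_x5 <- c(0, 0.4)

# Linear predictor: each factor contributes the effect of its observed level
eta <- b0 + b_x1 * x1 + b_x2 * x2 +
  b_x3[as.integer(x3)] + b_x4[as.integer(x4)] + b_x5[as.integer(x5)]

p <- plogis(eta)                    # inverse logit gives P(y = 1)
y <- rbinom(n, size = 1, prob = p)  # Bernoulli response

dat <- data.frame(y, x1, x2, x3, x4, x5)
fit <- glm(y ~ x1 + x2 + x3 + x4 + x5, data = dat, family = binomial)
round(coef(summary(fit)), 3)
```

Because the true coefficients are written down before y is drawn, the fitted estimates can be compared against them directly.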
 
#3
Hlsmith, can you tell me how to do it with random error? Below is what I have done so far. It seems okay, but x1 does not come out significant.
Thank you.

Code:
set.seed(666)
n  = 100
x1 = rnorm(n)                              # continuous predictors
x2 = rnorm(n)
x3 = sample(x = c(1, 2, 3), size = n, prob = rep(1/3, 3), replace = TRUE)
z  = 0.01 + 0.5*x1 + 1.2*x2 + 0.75*x3      # linear predictor with an intercept
pr = 1 / (1 + exp(-z))                     # inverse logit gives P(y = 1)

y  = rbinom(n, 1, pr)                      # Bernoulli response variable
head(data.frame(pr, y))                    # inspect the first few rows
df = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
glm(y ~ x1 + x2 + x3, data = df, family = "binomial")   # x3 numeric, as simulated
summary(glm(y ~ x1 + x2 + as.factor(x3), data = df, family = "binomial"))
 

Dason

Ambassador to the humans
#5
You don't want " + random error" in your model. You are already simulating y according to the model specified by logistic regression.
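
To illustrate this point: the randomness in a logistic model comes from the Bernoulli draw itself (`rbinom`), so nothing extra is added to the linear predictor. The sketch below reuses the coefficients from post #3 and compares n = 100 with a larger, arbitrarily chosen n = 10000. The helper name `simulate_fit` is hypothetical. With only 100 observations, the standard error of a coefficient such as 0.5 can easily be large enough that it fails to reach significance, which is one plausible reason x1 looked non-significant.

```r
# Hypothetical helper: simulate from the model in post #3 and fit it
simulate_fit <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- sample(1:3, n, replace = TRUE)
  z  <- 0.01 + 0.5 * x1 + 1.2 * x2 + 0.75 * x3
  # The only randomness in y is this Bernoulli draw; no error term is added to z
  y  <- rbinom(n, size = 1, prob = plogis(z))
  glm(y ~ x1 + x2 + x3, family = binomial)
}

set.seed(666)
fit_small <- simulate_fit(100)
fit_large <- simulate_fit(10000)

# Compare standard errors at the two sample sizes
se_small <- coef(summary(fit_small))[, "Std. Error"]
se_large <- coef(summary(fit_large))[, "Std. Error"]
round(cbind(se_small, se_large), 3)
```

Standard errors shrink roughly in proportion to 1/sqrt(n), so the larger sample should pin down all four coefficients much more tightly.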
 

hlsmith

#6
Correct! I was thinking of the random error you can add when generating the predictor variables themselves. Dason is right that there is no individual, stand-alone error term.
 
#7
Hello guys,
I found another way to do this simulation. With the code above I could not get good estimates. In the simulation below I have one numeric and one categorical variable. If you have anything to add, please let me know.
Thanks.

Code:
x1_a = rnorm(100000, mean = 290, sd = 15)
x1_b = rnorm(100000, mean = 300, sd = 15)
x1 = c(x1_a, x1_b)   # numeric variable
x2_a = sample(1:4, size = 100000, prob = c(.3, .5, .1, .1), replace = TRUE)
x2_b = sample(1:4, size = 100000, prob = c(.1, .1, .3, .5), replace = TRUE)
x2 = c(x2_a, x2_b)   # categorical variable with 4 levels
y1 = sample(0:1, size = 100000, prob = c(.8, .2), replace = TRUE)
# These weights do not sum to 1; sample() rescales them to P(0) = 2/3, P(1) = 1/3
y2 = sample(0:1, size = 100000, prob = c(.6, .3), replace = TRUE)
table(y2)
y = c(y1, y2)        # response variable
table(y)
dat = data.frame(x1 = x1, x2 = x2, y = as.factor(y))
mylogit = glm(y ~ x1 + as.factor(x2), data = dat, family = binomial())
summary(mylogit)
 

JesperHP

TS Contributor
#8
You may get what you want, but what you are doing is - in a manner of speaking - simply wrong.

The reason is:

Code:
y1 = sample(0:1, size = 100000, prob = c(.8, .2), replace = TRUE)
y2 = sample(0:1, size = 100000, prob = c(.6, .3), replace = TRUE)
table(y2)
y = c(y1, y2)
Here the dependent variable is not simulated from a logistic model, in which the dependence of y on x is explicit and the parameters to be estimated are known. If you do not know the true parameters, how do you know your estimator is not simply inconsistent?

And if you can't tell this from the simulation, what can you tell from it? What is the purpose of the simulation? (I get that it is fun to make random draws and throw dice and stuff like that, but a higher purpose than simply celebrating randomness is usually wanted.)
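
One way to act on this is to draw y from a logistic model with known coefficients and then check whether `glm` recovers them. A minimal sketch along the lines of post #7, with one numeric and one four-level categorical predictor. The values in `beta_true`, the sample size, and the level probabilities are assumptions chosen for illustration, and x1 is centered at 295 only to keep the intercept on a sensible scale.

```r
set.seed(42)
n <- 20000  # assumed sample size

x1 <- rnorm(n, mean = 295, sd = 15)
x2 <- factor(sample(1:4, n, replace = TRUE, prob = c(.2, .3, .2, .3)))

# Design matrix with treatment contrasts, the coding glm() uses by default
X <- model.matrix(~ I(x1 - 295) + x2)

# Hypothetical true parameters: intercept, slope for x1, effects of x2 levels 2-4
beta_true <- c(-1, 0.03, 0.5, -0.5, 1)
names(beta_true) <- colnames(X)

p <- plogis(drop(X %*% beta_true))  # P(y = 1) under the logistic model
y <- rbinom(n, 1, p)                # y now depends on x1 and x2 by construction

fit <- glm(y ~ I(x1 - 295) + x2, family = binomial)
round(cbind(true = beta_true, estimate = coef(fit)), 3)
```

With the truth known, repeating this over many seeds or increasing n gives a direct check of bias and consistency, which the independent draws of y in post #7 cannot provide.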