Hi,
I tried to work out the necessary sample size for a logistic regression by simulations and got some scary results. If anyone could check the code below, it would be a great help.
I simulate a logistic regression with two normalized variables, one having a fixed odds ratio of 1.4 the other a variable odds ratio from 1 to 2.5.
What makes this scary is that even at relatively high sample sizes (190) the probability of not identifying the real situation (both xs significant) is very high. At lower sample sizes the situation is more or less hopeless.
I tried to work out the necessary sample size for a logistic regression by simulations and got some scary results. If anyone could check the code below, it would be a great help.
I simulate a logistic regression with two normalized variables, one having a fixed odds ratio of 1.4 the other a variable odds ratio from 1 to 2.5.
What makes this scary is that even at relatively high sample sizes (190) the probability of not identifying the real situation (both xs significant) is very high. At lower sample sizes the situation is more or less hopeless.
Code:
# logistic regression sample size simulation
# for several variables (2)
#Define beta0 (start with 0)
B0=0
# Define the odds ratio
#First trial was from 1 to 5 where 1 is giving us a rate of the false alarms
#It dies not make sense to go higher then 2 because at that point
#the power will be almost always high
OR=seq(from=1, to = 2.5, by=0.1)
#Define the sample size
SSize=seq(from=50, to=200, by=20)
#Define table to store the power
T=expand.grid(OR,SSize)
T$PP1=rep(0, length(T$Var1))
T$PP2=rep(0, length(T$Var1))
T$PP3=rep(0, length(T$Var1))
T$PP4=rep(0, length(T$Var1))
names(T)=c("OddsRatio", "SampleSize", "Only.X1", "X1.and.X2", "Only.X2", "Neither")
#Define the X variable (from -3 to 3) assuming standardized X
#Normally distributed with mean 0 and stddev 1
#Define the number of trials per setting (OR and sample size)
NTrials=200
#FOR each OR DO
for(or in OR){
# FOR each Sample Size DO
for(ss in SSize){
# Define holding vector for pvalues
pvals1=numeric(length=NTrials)
pvals2=numeric(length=NTrials)
# For NTrials times DO
for(count in 1:NTrials){
# Generate SSize X values
XVals1=rnorm(ss)
XVals2=rnorm(ss)
# Calculate the probability of an event for the generated X
B1=log(or)
Fact=B0+B1*XVals1+log(1.4)*XVals2
YProb=exp(-Fact)/(1+exp(-Fact))
# Generate the Ys for the X values
Y=rbinom(length(YProb),1,YProb)
# Generate a dataframe (not really necessary here)
testdata=data.frame(Y=as.factor(Y), X1=XVals1, X2=XVals2)
# Run a logistic regression Y~X
mod=glm(Y~X1+X2, data=testdata, family="binomial")
# Save the p-value
pvals1[count]=coef(summary(mod))[11]
pvals2[count]=coef(summary(mod))[12]
# ENDFOR Ntrials
}
# Calculate the percent p-values less then 0.05
px1=sum(pvals1<=0.05 & pvals2>0.05)/NTrials
px1x2=sum(pvals1<=0.05 & pvals2<=0.05)/NTrials
px2=sum(pvals1>0.05 & pvals2<=0.05)/NTrials
pnone=sum(pvals1>0.05 & pvals2>0.05)/NTrials
# Save them in the power table
T[T$OddsRatio==or &T$SampleSize==ss,]$Only.X1=px1
T[T$OddsRatio==or &T$SampleSize==ss,]$X1.and.X2=px1x2
T[T$OddsRatio==or &T$SampleSize==ss,]$Only.X2=px2
T[T$OddsRatio==or &T$SampleSize==ss,]$Neither=pnone
# ENDFOR Sample Size
}
#ENDFOR OR
}