Calculating prediction intervals (error estimates of specific predicted points) in a multivariate logistic regression

#1
Hi all, this is my first post here so please let me know if I need to provide more details.

I want to generate a multivariate logistic regression model and use that to predict future outcomes. The logistic model gives me point estimates, but I also want prediction intervals. I have only been able to find explanations of how to calculate prediction intervals from univariate models, but I have a multivariate model and do not know how to generalize. (I know that there are functions in, say, R and other packages to automatically compute prediction intervals, but I need to be able to do it -- and understand it -- without relying on these.)

The specific model doesn't matter, but for sake of discussion we can just say that the model coefficients are:
Intercept = -30
Time since company was founded, in days = 0.008
Daily revenue, in thousands of dollars = 0.001
Size of market, in billions of dollars = 0.01

The point estimate is trivial to calculate from this:

Code:
odds = exp(-30 + 0.008*time + 0.001*revenue + 0.01*market)
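(For the record, if I need the predicted probability rather than the odds, I believe it follows directly from the odds:)
Code:
prob = odds / (1 + odds)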
The prediction intervals are tripping me up, though. For one variable, I understand the formula to be something along the lines of the following R code:
Code:
x <- c(1, 4, 5, 9, 13, 11, 23, 23, 28)
y <- c(64, 71, 54, 81, 93, 76, 77, 95, 109)
df <- data.frame(x, y)
model <- lm(y ~ x, data = df)
df$pred <- predict(model, df)  # the predicted values are stored here
sigma <- sqrt((1/(length(df$x) - 2)) * sum((df$y - df$pred)^2))  # residual standard error
df$sePredX <- sigma * sqrt(1 + 1/length(df$x) + (df$x - mean(df$x))^2/sum((df$x - mean(df$x))^2))
tcrit <- qt(0.975, df = length(df$x) - 2)  # t quantile; with n = 9 this is about 2.36, not a fixed 1.97
df$lwr <- df$pred - tcrit * df$sePredX  # lower bound of the 95% prediction interval
df$upr <- df$pred + tcrit * df$sePredX  # upper bound of the 95% prediction interval
I'm writing in R because I'm hoping I can convey myself more clearly this way, but let me know if another notation is preferred. Also, I'm not even sure that my example is correct for the one-variable situation (for one, I'm not sure I've handled the uncertainty in the intercept term correctly!).

Most importantly, I have no idea how to generalize this to multiple independent variables... it seems that this approach only works with one independent variable. I'd greatly appreciate any help, suggestions, pointers, anything. Thanks.
 

Tart
#2
For multivariate regression, it should be more or less the same as in the univariate case, but with matrices now. You estimate the variance of the errors (the error term in the regression model carries the distributional assumption). Then you estimate the predictive variance, which is usually the estimated error variance properly adjusted, and that lets you construct prediction intervals. Usually Yh +/- z_alpha/2 * sqrt(var(e_h)), where var(e_h) is the estimate of your predictive variance.
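In R notation, the adjustment looks something like this. This is a rough sketch from memory, so double-check it: X is the design matrix (n rows, p columns, with a leading column of ones), y is the response vector, and xh is the new observation (also with a leading 1).
Code:
n <- nrow(X); p <- ncol(X)
beta <- solve(t(X) %*% X, t(X) %*% y)     # least-squares coefficients
s2 <- sum((y - X %*% beta)^2) / (n - p)   # estimated error variance
# predictive variance = error variance + variance from estimating beta at xh
var_eh <- s2 * (1 + t(xh) %*% solve(t(X) %*% X) %*% xh)
yh <- sum(xh * beta)                      # point prediction at xh
tcrit <- qt(1 - 0.05/2, df = n - p)
c(yh - tcrit * sqrt(var_eh), yh + tcrit * sqrt(var_eh))   # 95% prediction interval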

With logistic regression I'm a little bit rusty; prediction intervals do not translate well under transformations, so I'm not quite sure. I have my books on regression at work, and I'm too lazy to search the Internet for a solution :) so if you can wait until Monday or Tuesday, I might be able to help. Unless someone else provides you with the answer you are looking for first.

Meanwhile, you can read a very interesting paper by Chris Chatfield on prediction intervals. Computing prediction intervals is a very challenging problem!
 
#3
Tart - thanks for your insightful reply. I don't know what the odds are of this happening, but I was literally reading through Chatfield's paper when I saw your reply!

I would be very appreciative of references to textbooks that address this, especially in the multivariate case. Up to now, I have been relying upon Google searches, and particularly on one resource that I identified (PDF: http://www.stat.wisc.edu/courses/st572-larget/Spring2007/handouts03-1.pdf ). I am now a bit concerned that the equations on pages 3-4 may be incorrect, though.

I appreciate your help and encouragement, and will return with more questions after I try to digest the Chatfield paper. In the meantime, if anything comes to you, I'd be delighted to learn more from you.

Thanks!
 

Tart
#4
No, they look correct. The extra 1 in the equation corresponds to the increase in uncertainty when you forecast an observation rather than the mean.
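You can see the effect of that extra 1 directly in R: at the same new point, the prediction interval comes out wider than the confidence interval for the mean. A quick illustration, reusing the toy data df from your first post (x = 15 is just an arbitrary new point):
Code:
model <- lm(y ~ x, data = df)
new <- data.frame(x = 15)
predict(model, new, interval = "confidence")  # interval for the mean response
predict(model, new, interval = "prediction")  # wider: includes the extra error variance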

The odds are probably pretty good, considering that you are solving a problem dealing with prediction intervals; it is natural that you would google for them, and Chatfield's paper is right at the top :). By the way, your case is simpler than the one he talks about: you have a specific model and distributional assumptions. He talks about general approaches. Imagine a situation where there is no model, a judgmental forecast for example, but you still need to create a PI; doing that correctly is a very difficult problem.

Check this out. I looked at computing.pdf; they show how to do it in R (I just scanned it, didn't read it carefully). The rest of the handouts might provide more details.
http://gbi.agrsci.dk/statistics/courses/phd06/material/Day5-Logistic-Regression/


http://zoonek2.free.fr/UNIX/48_R/12.html - check this out too. He shows how to do logistic regression, along with a lot of other useful information.
 
#5
Thanks! You're right that getting R to spit out the prediction intervals is easy, at least for a linear model fit with lm (predict for glm objects doesn't take an interval argument). In fact, it's even easier than the handouts make it look. Let's say you have a model 'M0' and some new data 'd1' in the same format as the data used to create M0. Then to get the prediction interval, you just run:
Code:
prediction <- predict(M0, d1, interval = "prediction")
Unfortunately, although I do trust its output to be correct, I am not able to duplicate it by hand. To me, this means that I am likely miscalculating it and don't understand the calculations sufficiently, which gives me a strong motivation to find resources that will help me understand what to do.

I had initially hoped that for the multivariate model, I could sum up the contribution from each independent variable like so (using the same data as in my code above):
Code:
Univariate (as above):
sigma <- sqrt( (1/(length(df$x)-2)) * sum((df$y-df$pred)^2) )
df$sePredX <- sigma * sqrt( 1 + 1/length(df$x) + (df$x - mean(df$x))^2/sum((df$x - mean(df$x))^2) )

Multivariate(?):
(Imagine now that there is another predictor, W, stored in the data frame as df$w, that was also used to generate the model and its predictions. The terms I added for W are marked with comments.)

sigma <- sqrt( (1/(length(df$x)-2)) * sum((df$y-df$pred)^2) +
               (1/(length(df$w)-2)) * sum((df$y-df$pred)^2) )    # added term for W
df$sePredX <- sigma * sqrt( 1 + 1/length(df$x) + (df$x - mean(df$x))^2/sum((df$x - mean(df$x))^2) +
               1/length(df$w) + (df$w - mean(df$w))^2/sum((df$w - mean(df$w))^2) )    # added terms for W
I am pretty sure I'm not correctly updating the formula for the multivariate model, but I can't find examples that would give me a hint as to the correct approach! I know that you mentioned the matrix of independent variables, but I'm not sure how to put that into use programmatically.
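
Here is as far as I've gotten translating the matrix idea into code, using my toy data from above plus a made-up second predictor w (the values of w are arbitrary, just for illustration). It runs, but I can't tell whether it's the right generalization:
Code:
df$w <- c(2, 3, 7, 6, 10, 9, 15, 18, 20)      # made-up second predictor
model <- lm(y ~ x + w, data = df)
X <- model.matrix(model)                      # n x p design matrix, column of 1s included
n <- nrow(X); p <- ncol(X)
s2 <- sum(residuals(model)^2) / (n - p)       # estimated error variance
xh <- c(1, 12, 8)                             # new point: intercept, x = 12, w = 8
var_eh <- s2 * (1 + t(xh) %*% solve(t(X) %*% X) %*% xh)
yh <- sum(xh * coef(model))                   # point prediction at xh
tcrit <- qt(0.975, df = n - p)
c(yh - tcrit * sqrt(var_eh), yh + tcrit * sqrt(var_eh))   # 95% prediction interval
# compare with: predict(model, data.frame(x = 12, w = 8), interval = "prediction")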
 

Tart
#6
Hey Bernoilli, I'm not sure I want to type all this info - too much, and it would probably be confusing. Two ways out:
a) Get the book "Applied Linear Statistical Models" by Neter, Kutner, Nachtsheim, and Wasserman from the library, and check sections 14.4-14.9 (I'm using the 4th edition).
b) Send me a PM with your e-mail address, and I'll scan those sections and send them to you.
 
#7
Awesome - thank you so much! I'll try to find the book before I ask you to go through that hassle. I'm away from campus for this week, so I'll look in the library when I return on Saturday. I'll report back whether or not I find it. Again, thank you!