Using the Multiple Linear Regression Model to predict values.

#1
I've used R to create two linear regression models - one level-level and one log-level, from 125 lines of data given to me.

I've been told to fit both models using the first 100 rows of the data, and then use the fitted models to predict y from the remaining 25 rows.

I used:
petrol.lm4 <- lm(hydrcarb~disptemp+tankpres+tankpres2+disppres2, data=petrol[1:100, ])
and
logpetrol.lm5 <- lm(log(hydrcarb)~disptemp+disppres+tankpres2, data=petrol[1:100, ])
to fit the model using the first 100 rows (is this correct?)

However, I'm now stuck on how to use these models to predict hydrcarb.

After that I'm supposed to calculate the root mean square prediction error, any ideas on how to do that? I've been given the formula so I'm not sure if I'm supposed to calculate it manually.

Thanks :)
 

JesperHP

TS Contributor
#4
Code:
# Create artificial data 125 observations
x=rnorm(125) # Independent variable
e=rnorm(125) # Noise 
a=2
b=3
y=a+b*x+e
mydata=data.frame(y=y[1:100],x=x[1:100])

# estimate model on 1 to 100 observation
mymodel=lm(y ~ x,data=mydata)

# predict using function predict()
newdata=data.frame(y=y,x=x)
predicted=predict(mymodel,newdata=newdata)
predicted


# Now manual prediction
coefficients=mymodel$coef
X=matrix(c(rep(1,length(x)),x),nrow=length(x),ncol=2)
yhat=X%*%cbind(coefficients)
colnames(yhat)="yhat"

# Compare manual with non-manual prediction
print(cbind(yhat,predicted))


Data frames are used to make sure that the name of the independent variable in the fitted model and the name of the variable supplied when predicting are the same. Otherwise you may run into this error: http://stackoverflow.com/questions/27464893/getting-warning-newdata-had-1-row-but-variables-found-have-32-rows-on-pred
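For the root mean square prediction error on the 25 held-out rows, something along these lines should work. It is only a sketch on the same kind of artificial data as above: set.seed and the names train, test, rmspe and ypos are illustrative choices, and it assumes the formula you were given is the usual sqrt(mean((observed - predicted)^2)). For a log-level model the prediction is on the log scale, so the sketch back-transforms with exp() before comparing; whether your course expects any correction to that simple back-transform is something to check.

```r
# Artificial data, 125 observations (not the petrol data)
set.seed(1)                      # assumption: only for reproducibility
x <- rnorm(125)
e <- rnorm(125)
y <- 2 + 3 * x + e
alldata <- data.frame(y = y, x = x)

train <- alldata[1:100, ]        # rows used to fit the model
test  <- alldata[101:125, ]      # rows held back for prediction

mymodel <- lm(y ~ x, data = train)

# Predict only the 25 held-out rows
pred <- predict(mymodel, newdata = test)

# Root mean square prediction error (assumed to be the usual formula)
rmspe <- sqrt(mean((test$y - pred)^2))
rmspe

# Log-level version: the model predicts log(ypos), so back-transform
# with exp() before comparing on the original scale (ypos must be > 0)
ypos <- exp(0.5 + 0.3 * x + 0.1 * e)
alldata2 <- data.frame(ypos = ypos, x = x)
logmodel <- lm(log(ypos) ~ x, data = alldata2[1:100, ])
logpred <- exp(predict(logmodel, newdata = alldata2[101:125, ]))
rmspe_log <- sqrt(mean((alldata2$ypos[101:125] - logpred)^2))
rmspe_log
```

With the real data you would replace alldata by petrol and y by hydrcarb, keeping the 1:100 and 101:125 row splits.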
 
#5
Thank you - I have 4 variables so I've changed it to the below code:

Code:
# using model to predict hydrcarb
w=rnorm(125) #Independent variable
x=rnorm(125) #Ind var
v=rnorm(125) #Ind var
z=rnorm(125) #Ind var
e=rnorm(125) #Noise
a=0.14370 #coeff for disptemp
b=4.09870 #coeff for tankpres
c=-1.06894 #coeff for tankpres2
d=1.21589 #coeff for disppres2
f=2.65430 #intercept
y=a*v+b*w+c*x+d*z+f #linear model
mydata=data.frame(y=y[1:100],v=v[1:100],w=w[1:100],x=x[1:100],z=z[1:100]) #just looks at 100 observations
petrol.lm5=lm(y~v,w,x,z,data=mydata) #estimates model on 1 to 100 observations
newdata=data.frame(y=y,v=v,w=w,x=x,z=z) #sets up to predict new values
predicted=predict(petrol.lm5,newdata=newdata)#creates new predictions
predicted #views new predicted values

#now using manual predictions
coefficients=petrol.lm5$coef
X=matrix(x(rep(1,length(x)),x),nrow=length(x),ncol=2)
yhat=%*%cbind(coefficients)
colnames(yhat)="yhat"

#compare manual with non-manual prediction
print(cbind(yhat,predicted)
I think the matrix part is wrong, though, because I'm not sure how to change it to handle more than one variable. However, before I can get that far I get this error:
Code:
Error in xj[i] : only 0's may be mixed with negative subscripts
when I put
petrol.lm5=lm(y~v,w,x,z,data=mydata) #estimates model on 1 to 100 observations
Do you know what I'm doing wrong?

Thanks :)
 

JesperHP

TS Contributor
#6
Compare your code
lm(y~v,w,x,z,data=mydata)
with
lm(y~v+w+x+z,data=mydata)
That is the first error: the independent variables have to be joined with + inside the formula, not separated by commas.
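As far as I can tell, the reason for the error message is that lm() takes its arguments in the order formula, data, subset, weights, and so on. With commas, only v is part of the formula, while w is taken as subset and x as weights rather than as regressors. A small sketch on artificial data (the coefficients and the names fit and bad are made up for illustration):

```r
set.seed(1)                       # assumption: only for reproducibility
n <- 100
v <- rnorm(n); w <- rnorm(n); x <- rnorm(n); z <- rnorm(n)
y <- 2 + 0.5 * v + 1.5 * w - x + 0.8 * z + rnorm(n)
mydata <- data.frame(y = y, v = v, w = w, x = x, z = z)

# Regressors are joined with "+" inside the formula
fit <- lm(y ~ v + w + x + z, data = mydata)
names(coef(fit))                  # intercept plus one name per regressor

# With commas, w, x and z are matched to other arguments of lm(),
# so the call fails instead of fitting a four-variable model
bad <- try(lm(y ~ v, w, x, z, data = mydata), silent = TRUE)
inherits(bad, "try-error")
```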



If you do not know, and are not supposed to learn, the matrix algebra of regression, it may be best to forget about the manual prediction. If you are supposed to learn it, you should be able to figure out how the design matrix X should be set up. In any case, there are two errors in the manual prediction part:
X=matrix(x(rep(1,length(x)),x),nrow=length(x),ncol=2)
yhat=%*%cbind(coefficients)
The matrix X is constructed incorrectly: you have changed a "c" to an "x", so the matrix will not be constructed at all.
Secondly, in calculating the predicted values you have removed the matrix X from the equation.
Correcting these two mistakes gives you:
X=matrix(c(rep(1,length(x)),x),nrow=length(x),ncol=2) # changing an x to a c for concatenation
yhat=X%*%cbind(coefficients) # Multiplying X with coefficients
But the matrix X is still wrong. If you are supposed to learn the matrix algebra of regression, I suggest you have a go at correcting the setup of the design matrix. You need a column of ones for the constant of your regression model, hence rep(1,length(x)), and then for each IV you need a column containing the values of that IV. Hence ncol = number of IVs + 1 for the constant.
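If you want something to check your own attempt against, here is one possible setup for four IVs. It is a sketch on artificial data, not the petrol data, and the coefficients used to generate y are made up. The columns of X have to be in the same order as the coefficients of the fitted model.

```r
set.seed(1)                        # assumption: only for reproducibility
n <- 125
v <- rnorm(n); w <- rnorm(n); x <- rnorm(n); z <- rnorm(n)
y <- 2 + 0.5 * v + 1.5 * w - x + 0.8 * z + rnorm(n)
alldata <- data.frame(y, v, w, x, z)

# Estimate on observations 1 to 100
fit <- lm(y ~ v + w + x + z, data = alldata[1:100, ])

# Design matrix: a column of ones for the constant, then one column
# per IV, in the same order as the coefficients
X <- cbind(1, v, w, x, z)          # 125 rows, 4 + 1 columns
yhat <- X %*% coef(fit)

# Compare manual with non-manual prediction
predicted <- predict(fit, newdata = alldata)
all.equal(as.numeric(yhat), as.numeric(predicted))
```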






Also remember that the initial lines of the script create artificial data. You would need to use your actual petrol data for estimation and prediction. Of course, I had to construct artificial data for the example because I do not have your petrol data.