Biglm predict Error message "Error in x %*% coef(object) : non-conformable arguments"

#1
Dear all,

I am using the biglm package to fit linear regressions and logistic regressions (for probabilities) on big data.

In both cases I am able to fit the model and get the regression coefficients. I am also able to use the predict function on the same data frame that the model was fitted on.

Error:
When I use any other data frame to predict values, I get the following error:

Error in x %*% coef(object) : non-conformable arguments

I have made sure that the new data frame does not contain any factor levels which are unknown to the model (I tried it with the "regular" lm and predict function and it worked). As far as I understand, the error is thrown when the dimensions of the new data matrix are not the same as the dimensions of the data matrix which was used to derive the regression coefficients. As you see, my understanding is very basic.

Please point out what I am doing wrong here so I can try to solve this problem.
Reproducible Example:
Code:
## bigglm and predictfunction - Example #
require(biglm)
set.seed(42)

# create random df.
df.1<-data.frame(sell=seq(0,1,1),price=rnorm(100, mean=10000,sd=4000),blabla=rnorm(100,mean=12000,sd=5000))
df.1["fact_1"]<-as.factor(c("a","b","c","e"))

# Do the probability regression
prob.fit<-bigglm(sell~price+blabla+fact_1,family=binomial("logit"), data=df.1) # ignore warning
summary(prob.fit)

# Do a prediction on df.1
predict(prob.fit,df.1,type="response") #works

# Do a prediction on new data which contains only known factor levels
# (stringsAsFactors=TRUE so that fact_1 is a factor also in R >= 4.0.0)
df.2<-data.frame(sell=c(1,0),price=c(1000,1800),blabla=c(2140,2110),fact_1=c("a", "b"),stringsAsFactors=TRUE)
predict(prob.fit,df.2,type="response") #Error in x %*% coef(object) : non-conformable arguments????


## Predict without bigglm (glm instead of lm, since lm disregards the family argument)
fit<-glm(sell~price+blabla+fact_1,family=binomial("logit"), data=df.1)
summary(fit)

predict(fit,df.2,type="response") # works..why not bigglm
Thank you all so very much!
 
JesperHP

TS Contributor
#2
Thank you for making such a nice reproducible example.

1) Run your example
2) Then run this

Code:
## are the factor levels identical ##
identical(levels(df.1$fact_1),levels(df.2$fact_1))
## No.... make them identical
levels(df.2$fact_1)=levels(df.1$fact_1)
## Now do predict
predict(prob.fit,df.2,type="response")

Probably an observation of the four-level factor is coded as an indicator vector such as (0,1,0,0), with the 1 indicating the level, so a two-level factor results in, for example, (0,1). You have three coefficients for the factor, one for each level except the reference level, and that does not match a factor of two levels (OK, this could be explained better, but you can ask follow-up questions).
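The dimension mismatch can be reproduced without biglm. This is only a minimal sketch using base R's model.matrix(); the names train, new_bad and new_ok and the coefficient values in beta are made up for illustration, and it does not claim to show what biglm does internally:

```r
# Hypothetical training factor with four levels, as in df.1
train <- data.frame(fact_1 = factor(c("a", "b", "c", "e")))
# New data built without the training levels: only two levels
new_bad <- data.frame(fact_1 = factor(c("a", "b")))
# New data built with the training levels imposed
new_ok <- data.frame(fact_1 = factor(c("a", "b"), levels = levels(train$fact_1)))

X_train <- model.matrix(~ fact_1, train)   # intercept + 3 dummy columns
X_bad   <- model.matrix(~ fact_1, new_bad) # intercept + 1 dummy column
X_ok    <- model.matrix(~ fact_1, new_ok)  # intercept + 3 dummy columns

# Made-up coefficient vector, one entry per column of X_train
beta <- c(0.5, 1, 2, 3)

ncol(X_train)
ncol(X_bad)
ncol(X_ok)

# 2 columns against 4 coefficients cannot be multiplied
res_bad <- try(X_bad %*% beta, silent = TRUE)
inherits(res_bad, "try-error")

# 4 columns against 4 coefficients works
X_ok %*% beta
```

The unused levels "c" and "e" show up in X_ok as columns of zeros, which is what keeps the matrix product defined.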
 
#3
First of all, mange tak (many thanks) JesperHP!

Your code works very well for the example. I will try it out on my production data set.

Let me try to follow up on your comment. You mean that in the "original" data df.1 my factor variable fact_1 has 4 levels (a,b,c,e), and minus the one which is the reference level and absorbed into the intercept, there are three regression coefficients for it in the final model. When I present new data to the model which was fitted by biglm and use the predict function on it, it throws an error when not all of the factor levels behind those regression coefficients are present in the new data (mind that we also have to have the same column names and so on).

You worked around this by adding the missing levels to the new data frame. What I find funny is that you just "nominated" these levels. By nominated I mean that you didn't add rows with those values but just added them to the factor variable's set of possible levels.

So the same could also have been achieved by adding pseudo data row-wise which has these missing factor levels. For example, I could have added a row like 0,1000,1000,c and another row like 0,1000,1000,e to the new data frame. Then I would have all the needed factor levels and the biglm predict would work. My approach is not good of course, as it bloats the data, and in production I have a factor which has >600 levels. So your approach is much better and leaner.

Have I understood this right? And also, your approach was very simple yet perfect: is this a known workaround when using biglm and predict, or has nobody ever had this problem? I searched the internet and found only basic, unanswered requests for help.
 

JesperHP

TS Contributor
#4
WARNING: THIS POST GOT VERY LONG (LONGER THAN INTENDED). IT DOES CONTAIN SOME IMPORTANT INFO, SO PLEASE DO READ...


"Have I understood this right?"
I think so.

"And also, your approach was very simple yet perfect: is this a known workaround when using biglm and predict, or has nobody ever had this problem?"
I have no idea. I have never used biglm. I DO NOT THINK THIS IS THE WAY TO FIX YOUR PROBLEM (I only used it to diagnose the problem).

My rule for working with factors in R is to think of them as indicators because in all contexts that I know of R performs the transformation from factor to indicator - where a factor of K levels implies K-1 indicators. To see the transformation R performs use
Code:
model.matrix()
but let's make the statistical context clear first:
1) Assume we take a random sample with, among other variables, an independent variable taking on the values a, b, c, d.
2a) Assume we observe some a's, some b's and some c's, but no d's.
Let's say we deal with the missing d observations simply by removing this level from the factor. Then our fitted linear model would have 2 coefficients for the factor, since it has three levels. The drawback would be that we are not able to make predictions for the group of people actually taking on the value d in the population.

2b) Assume we observe a's, b's, c's and d's.
The fitted object would have three coefficients, since the factor has 4 levels.
Now you want to make predictions, but only for people belonging to the a-group and the b-group. Even so, these people could have taken on the values c and d according to the fitted model, hence you cannot use a factor of two levels instead of a factor of four levels, because R reacts differently to the two cases. How R reacts is, I assume, determined by model.matrix() (but I do not know the code of biglm, so I cannot be sure), and here is an example of model.matrix():
Code:
y=c(1,2,3,1,2,3,1,2,3)
y=factor(y)
y
## Only three possible values that is three levels... 
model.matrix(~y)
## Now add one level and R will >>know<< to create an extra dummy
## even though there are no positive observations of this level
levels(y)=c(1,2,3,4)
model.matrix(~y)
# This is the matrix that is needed for prediction from model with three coefficients for factor of four levels
However, be all that as it may, I think:
1) The error message is somewhat uninformative
2) It is unfortunate that predict() reacts differently for biglm and lm in this instance
So in my opinion a discussion about this on some R mailing list would not be out of place. But of course biglm is not part of base R, so the one who should decide whether to make any changes would probably be Thomas Lumley, the author of biglm.


As a final point: all I did was make sure the factor had the correct number of levels, to ensure that the model matrix would have the correct dimension. I did not check whether the levels are being transformed in the same way in the fitting procedure and in the prediction procedure: YOU need to make sure that the levels and coefficients are correctly matched. Here is a small example that you probably should study:

Code:
x=c("a","b","c","a","c")
x.fac=factor(x)
z=c("c","a")
z.fac=factor(z)
z.fac
levels(z.fac)=levels(x.fac)
z.fac
Pay attention to the way z.fac is >>recoded<< when the levels are changed. This is probably not what you want, hence my example IS NOT a fix or workaround, since it only ensures that the dimensions are conformable. Please study this more elaborate example:

Code:
## Assume x is the factor on which the model is fitted
## observations on three levels
x=c("a","b","c","a","c")
x.fac=factor(x)
x.fac

## Assume z is the factor from which we want to make predictions
z=c("a","c")
z.fac1=factor(z)
z.fac1
## We only want to predict for "a" and "c" instances
## But we need the right number of levels to get dimension conformability
z.fac2=z.fac1
levels(z.fac2)=levels(x.fac)
z.fac2
z.fac1
## Notice that z.fac2 is now recoded to contain an observation of "b"
## THIS IS NOT WHAT WE WANTED
## One solution is to impose the levels when the factor is created
intendedfactor=factor(c("a","c"),levels=c("a","b","c"))
intendedfactor
# and here the important thing is that levels=c("a","b","c") are the levels of the variable/factor
# on which the model was fitted that is levels(x.fac)

model.matrix(~x.fac) #remember x.fac is factor model i assumed fitted on
model.matrix(~intendedfactor)

If you are wondering why z.fac gets recoded when the levels are changed, now might be a good time to read the R documentation on factors :) I think it is something like this: levels are represented internally by numbers, and "a" and "c" are thus represented as level 1 and level 2. When you give z.fac the levels c("a","b","c"), then "b" is second in that vector and is therefore level 2, whereas "c" is level 3, so an observation in the original z.fac with level 2 represented by "c" becomes a level 2 observation represented by "b". I'm not entirely sure about this, so read the documentation. I never really work with factors, partly because of these kinds of problems.
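The recoding can be made visible by looking at the integer codes behind the factor with as.integer(). A small sketch in base R; the names relabelled and rebuilt are only for illustration:

```r
z.fac <- factor(c("c", "a"))
levels(z.fac)        # the two levels in sorted order
as.integer(z.fac)    # the integer codes stored for each observation

# Replacing the levels keeps the codes and only swaps the labels
relabelled <- z.fac
levels(relabelled) <- c("a", "b", "c")
as.integer(relabelled)    # same codes as before
as.character(relabelled)  # the observation that was "c" is now "b"

# Rebuilding the factor from its values keeps the labels and changes the codes
rebuilt <- factor(as.character(z.fac), levels = c("a", "b", "c"))
as.character(rebuilt)
as.integer(rebuilt)
```

So assigning to levels() renames what the stored codes mean, while factor(..., levels = ...) looks each value up again in the new level set.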
 
#5
Dear JesperHP,

Wow, that is truly a detailed and most informative answer you gave here.

Your step-by-step decomposition of the factor levels "problem" points me in the right direction. I understand that your first fix was not a fix but rather a demonstration of what might have been the problem with regard to factor levels.

Furthermore, I understand that I have to make sure that the factor levels present in my fitted model are "in the same order" (for lack of better words) in my new data, so that they point to the right regression coefficients. If I do not do this, the predict function might run but not give me the correct values.
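One way to check that the levels point to the right coefficients is to compare the column names of the new model matrix with the coefficient names. The sketch below uses plain lm as a stand-in, because I cannot say how biglm builds its matrix; the data and the names old, new, X_new and manual are made up:

```r
set.seed(1)
# Made-up training data with a four-level factor
old <- data.frame(y = rnorm(8), f = factor(rep(c("a", "b", "c", "e"), 2)))
fit <- lm(y ~ f, data = old)

# New data with the training levels imposed when the factor is created
new <- data.frame(f = factor(c("c", "a"), levels = levels(old$f)))
X_new <- model.matrix(~ f, new)

# Same columns, in the same order, as the fitted coefficients
identical(colnames(X_new), names(coef(fit)))

# The manual matrix product agrees with predict()
manual <- as.numeric(X_new %*% coef(fit))
all.equal(manual, as.numeric(predict(fit, new)))
```

If the column names differ from the coefficient names, the product may still run when the dimensions happen to match, but the values would be matched to the wrong coefficients.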

I will come back to you on this, as I have to try out the code in detail and reread what was presented to me.

Please have patience.
 
#6
Dear JesperHP,
Dear all,

As JesperHP suggested, I tried out the following code fragment on my production data. Mind that by production data I mean a new data frame which might not have all the factor levels in it when I am making the prediction. This is normal: I want to use the original model to estimate prices for certain car makes, which are coming in one by one and not as a full list which matches the original data.frame with "all" makes and such.

As JesperHP pointed out, in the model matrix which biglm presumably builds when doing predictions, some factor levels will naturally not be there, as they are not represented in the "new" data frame for which I want to predict.

The solution is therefore as follows:

For the new data to be predicted, make sure all of the needed factor levels are there. But also make sure you are not recoding the true values of your new data! The following code is an adaptation of JesperHP's code, applied to my "new" data.frame for which values are to be predicted:
Code:
## First get the names of all factor variables in the data.frame the regression model was fitted on
## (olddata == big thing with 500k rows and over 40 columns)
factorlist<-names(olddata)[sapply(olddata,is.factor)]

## For every factor variable, rebuild the column in the new, to-be-predicted data with the
## levels of the old data. Missing levels are added without changing the values per row.
## By this we achieve dimension conformability.
for (i in factorlist){
  newdata[[i]]<-factor(newdata[[i]],levels=levels(olddata[[i]]))
}

The loop above is a generalization of JesperHP's code:
intendedfactor=factor(c("a","c"),levels=c("a","b","c")) ## original one
intendedfactor=factor(newdata$column,levels=levels(olddata$column)) # a more generalized version


Now a prediction via biglm should work :-).
I tried it on data which I predicted with both "regular" lm and the big data biglm regression. Both gave the same result, which shows that dimension conformability can be achieved by equalizing the factor levels without changing the true factor values per row.
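One caveat with this approach: factor() turns every value that is not among the given levels into NA without any warning, so a car make the model has never seen cannot be scored. A small sketch of a check, with made-up names (old_levels, incoming, aligned, unseen) and made-up makes:

```r
# Hypothetical levels of the factor the model was fitted on
old_levels <- c("audi", "bmw", "vw")
# Hypothetical incoming values, one of them unknown to the model
incoming <- c("bmw", "tesla")

# Aligning the levels keeps known values and turns unknown ones into NA
aligned <- factor(incoming, levels = old_levels)
is.na(aligned)

# List the unknown values before predicting
unseen <- setdiff(unique(incoming), old_levels)
unseen
```

Running such a check before predict() shows which rows would need separate handling.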

Have a nice weekend you all and thx for the help JesperHP.