Linear regression not full rank.

noetsi

Fortran must die
#1
I have only encountered this once before, and that time it was a mistake on my part. Here a set of variables describing what job you have is being reported as a linear combination of other variables in the model. The set of variables they are a combination of is very long; it looks like most or all of the other variables in the model. The variables in question are the entire set of jobs (9 variables) in the model (which is not every job someone could have, but each person has a zero or one for each of them). I thought this might be an issue of putting all k indicators in the model rather than k-1, but removing one of the predictors generated the same error. This set of variables is the one used by the Department of Labor, so I am really confused.

Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Note: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
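To make the "solutions are not unique" part of the note concrete, here is a minimal sketch on made-up data (not the data from this model). It assumes a toy design with an intercept and two dummies that together cover every case, so the intercept column equals the sum of the dummies. The names `rows`, `fitted`, `beta_1` and `beta_2` are hypothetical.

```python
# Hypothetical toy design: columns are (intercept, job_a, job_b).
# Every case is in exactly one job, so intercept == job_a + job_b in each row.
rows = [
    (1, 1, 0),
    (1, 0, 1),
    (1, 1, 0),
    (1, 0, 1),
]


def fitted(beta, rows):
    """Return the fitted value (row dotted with beta) for each case."""
    return [sum(b * x for b, x in zip(beta, row)) for row in rows]


beta_1 = (1.0, 2.0, 3.0)

# Move an arbitrary amount c from the dummy coefficients onto the intercept.
# Because intercept == job_a + job_b, the fitted values cannot change.
c = 5.0
beta_2 = (beta_1[0] + c, beta_1[1] - c, beta_1[2] - c)

fit_1 = fitted(beta_1, rows)
fit_2 = fitted(beta_2, rows)
print(fit_1)  # [3.0, 4.0, 3.0, 4.0]
print(fit_2)  # [3.0, 4.0, 3.0, 4.0]
```

Two different coefficient vectors give identical fitted values, so least squares has no single best answer. Setting some parameters to 0, as the note describes, is one way of picking one solution from the many.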
 

Dason

Ambassador to the humans
#2
That's a lot of text without anything that could help.

So if it's saying that then it sounds like at least one of your predictors is a linear combination of the others. But that's all we can say with the info provided.
 

Dason

Ambassador to the humans
#3
I mean it even tells you that it set some parameters to 0 because they're a linear combination of the others. So look at which are 0 and figure out if it makes sense that they are a linear combination of the others.
 

noetsi

Fortran must die
#4
I am sure it does mean that it's a linear combination. But there is no logic I can imagine by which the variables in question could be a linear combination.

This is what the regression is generating. I have to wait for a coworker to get the formal definition of the variables in question, but I can't think of any relationship (let alone a perfect relationship) between these variables.
 

Attachments

hlsmith

Not a robit
#5
Pretty awesome that SAS can find that. How big is your dataset? So you think it is an artifact? You could score your dataset using the above formula and see if it always equals that variable. How is "Education, Health Rela" formatted?

I have had SAS do this to me before, but it was always after a bonehead use of redundant terms that slipped into the model by accident.
 

Dason

Ambassador to the humans
#6
The math doesn't care about your logic. And showing the regression equation isn't nearly as useful as other things you could have shown to help identify where the linear combinations might arise from.

"A reported DF of 0 or B means that the estimate is biased."
Maybe take a look at the reported DFs.
 

noetsi

Fortran must die
#7
I have about 5000 cases. I figure it was a bonehead move too on my part, but I can not find one.

Dason, what could I report that would be more helpful in dealing with the problem?

The DF for the variables in question are all zero; I think SAS does this automatically. They do have real cases associated with them.

One thing I notice is that while they all should have 4 distinct levels (one for each quarter) some only have 3 reported values. I will try to see if that is the issue.
 

noetsi

Fortran must die
#8
What is really strange is that I have 10 job-related variables (one was the reference level). All have similar types of data and are coded the same way. For three of these it was able to calculate a slope, and for the other six it was not.

OK, I now know that, with only the work variables in the model, knowing 3 of them perfectly predicts the other six. That makes no sense, given that there is no link at all between the coding of these 9 variables.
 

noetsi

Fortran must die
#9
Strange. The regression says that if you multiply the three variables by the given values and add the intercept times its value, you get the 4th variable. But I did this and did not get the level of the other variable at all.
 

hlsmith

Not a robit
#10
What are the variables? List them out. Does it make any sense they should equal the other variable? Post your model code too, please.
 

noetsi

Fortran must die
#11
Here is an example
Other Service Employment = -.08578 * intercept + 3.07973 * Manufacturing Related Employment -.25581 * Construction Employment + .33223 *Natural Resource Employment

I did the calculations and these do not result in the value of Other Service Employment for a given case. Can the algorithm have a flaw in it? I have not run into this issue before.
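The hand check described here can be sketched as below. The coefficients are the ones printed in this post; everything else is an assumption for illustration: the field names, the helper names `score` and `mismatches`, and the rows themselves, which are made up rather than taken from the real data. Because the printed coefficients are shown to five decimals, the comparison uses a tolerance instead of exact equality.

```python
# Coefficients as printed in the post (shown to five decimals).
COEFS = {
    "intercept": -0.08578,
    "manufacturing": 3.07973,
    "construction": -0.25581,
    "natural_resource": 0.33223,
}


def score(row):
    """Right-hand side of the reported linear combination for one case."""
    return (
        COEFS["intercept"]
        + COEFS["manufacturing"] * row["manufacturing"]
        + COEFS["construction"] * row["construction"]
        + COEFS["natural_resource"] * row["natural_resource"]
    )


def mismatches(rows, tol=1e-3):
    """Indexes of cases where the score is not within tol of other_service."""
    return [
        i for i, row in enumerate(rows)
        if abs(score(row) - row["other_service"]) > tol
    ]


# Hypothetical rows, NOT the real data.
rows = [
    {"manufacturing": 0.10, "construction": 0.20, "natural_resource": 0.05},
    {"manufacturing": 0.12, "construction": 0.18, "natural_resource": 0.06},
    {"manufacturing": 0.11, "construction": 0.22, "natural_resource": 0.04},
]
for row in rows:
    row["other_service"] = score(row)  # built so the relation holds
rows[2]["other_service"] += 0.5        # deliberately break the last case

print(mismatches(rows))  # [2]
```

If the real data gave an empty list, the reported combination would hold for every case checked. A long list would suggest the check is being run on different rows or differently coded values than the ones the procedure actually used.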
 

noetsi

Fortran must die
#13
It is the SAS output when I run the regression model. SAS says (just above this)
Note the following parameters have been set to zero (Other is among them) since the variables are a linear combination of the other variables as shown.

The calculation I posted follows. I looked at the raw data and there seems to be no relationship between these variables in the data (nor is there any reason they should predict each other at all). I did leave out one of the ten dummy variables for jobs, so it's not a case of me including all k levels of a categorical variable. Moreover, not everyone in the sample would have one of these ten jobs anyway.
 
#14
In administrative registers it often happens that a group of variables are summed. If you delete one-variable-at-a-time then maybe you can find where the problem is?
 

noetsi

Fortran must die
#15
I will try that. I have already essentially done that by removing all the variables in my model except the job one. Actually given that 3 of my variables predict the other 6 I think I will add one at a time.
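The add-one-at-a-time idea can be sketched as follows. This is an illustration on made-up columns, not what SAS does internally: the function `split_columns` and all column names are hypothetical, and it uses exact fractions, so it only flags exact dependencies (real data with rounding would need a tolerance).

```python
from fractions import Fraction


def split_columns(columns):
    """Add columns one at a time and return (kept, dropped) names.

    A column is dropped when it is an exact linear combination of the
    columns already kept.
    """
    basis = []  # (pivot index, reduced vector) for each kept column
    kept, dropped = [], []
    for name, values in columns:
        v = [Fraction(x) for x in values]
        # Remove the part of v explained by the columns kept so far.
        for pivot, b in basis:
            if v[pivot] != 0:
                factor = v[pivot] / b[pivot]
                v = [vi - factor * bi for vi, bi in zip(v, b)]
        pivot = next((i for i, vi in enumerate(v) if vi != 0), None)
        if pivot is None:
            dropped.append(name)  # nothing left: fully explained
        else:
            basis.append((pivot, v))
            kept.append(name)
    return kept, dropped


# Hypothetical data in which total = a + b for every case.
columns = [
    ("intercept", [1, 1, 1, 1, 1]),
    ("a", [1, 0, 2, 1, 3]),
    ("b", [0, 1, 1, 2, 0]),
    ("total", [1, 1, 3, 3, 3]),
    ("c", [4, 1, 0, 2, 5]),
]
kept, dropped = split_columns(columns)
print(kept)     # ['intercept', 'a', 'b', 'c']
print(dropped)  # ['total']

# Same columns, but with 'total' entered before 'b'.
reordered = [columns[0], columns[1], columns[3], columns[2], columns[4]]
kept_2, dropped_2 = split_columns(reordered)
print(dropped_2)  # ['b']
```

Which column gets flagged depends on the order of entry: the dependency belongs to the group of columns, not to one of them in particular.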
 

noetsi

Fortran must die
#16
I started with the three predictors that seemed OK and started to add one other variable at a time. What I noticed is that each time the model estimated three of the variables and not the other. But which three it estimated changed each time. I attached the output for the 4-variable model in case someone can see a red flag. I have never encountered this issue before, and going over the data I don't see any reason it is occurring. I was wondering if for some strange reason the model is underidentified.

[Attached image: 1559758714321.png]
 

noetsi

Fortran must die
#17
One thing that might influence this, but I can't see why, is that each of these predictors only has 4 possible levels. They show what percent are in that job by quarter. But they don't really qualify as dummy variables (I can't think of any logical way to make them dummies). I don't think this would cause the issue. I have run Likert scale data with 5 levels as predictors before and it did not lead to this.
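One possible reading of the four-levels observation, offered as an assumption rather than a diagnosis: if every case in the same quarter carries the same value for each job variable, the predictor rows can show at most four distinct patterns. A matrix's rank cannot exceed its number of distinct rows, which would leave room for an intercept plus three slopes and nothing more. The sketch below only counts patterns; the quarterly values, the case list and all names are made up.

```python
# Hypothetical percent-in-job values, one value per job per quarter.
JOBS = ["manufacturing", "construction", "natural_resource",
        "other_service", "education_health"]
QUARTERLY = {
    1: {"manufacturing": 10, "construction": 5, "natural_resource": 3,
        "other_service": 8, "education_health": 20},
    2: {"manufacturing": 12, "construction": 7, "natural_resource": 2,
        "other_service": 9, "education_health": 21},
    3: {"manufacturing": 11, "construction": 6, "natural_resource": 4,
        "other_service": 7, "education_health": 19},
    4: {"manufacturing": 13, "construction": 9, "natural_resource": 1,
        "other_service": 6, "education_health": 22},
}

# Hypothetical cases: each one just inherits the values of its quarter.
case_quarters = [1, 1, 2, 3, 3, 3, 4, 2, 1, 4]


def design_row(quarter):
    """Intercept followed by the job values for that quarter."""
    return (1,) + tuple(QUARTERLY[quarter][job] for job in JOBS)


rows = [design_row(q) for q in case_quarters]

n_columns = len(rows[0])               # intercept + 5 job variables
n_distinct = len(set(rows))            # distinct predictor patterns
max_rank = min(n_columns, n_distinct)  # rank cannot exceed either count
min_redundant = n_columns - max_rank   # columns that must be set aside

print(n_columns, n_distinct, max_rank, min_redundant)  # 6 4 4 2
```

Running the same count on the real predictors (distinct combinations of the job variables across all cases) would show quickly whether this applies. Likert items would not behave this way, since each respondent can combine the levels freely.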
 

hlsmith

Not a robit
#18
"Other Service Employment = -.08578 * intercept + 3.07973 * Manufacturing Related Employment -.25581 * Construction Employment + .33223 *Natural Resource Employment"

__________________________________________________________________________

For example, can you provide the SAS output for these terms and tell us exactly how they are formatted. So perhaps post a slice of your data just for these variables, a la proc print.

Is there anything going on in the Log about this? What happens if you fit a logistic model for:

Other Service Employment = Manufacturing Related Employment + Construction Employment + Natural Resource Employment.

Does it fail to converge, since they perfectly explain the DV?
 

hlsmith

Not a robit
#19
If the variable is defined by the other 3 variables and they are categorical, I wonder if you could recode them as, say, var1: 1, 2, 3, 4; var2: 10, 20, 30, 40; and var3: 100, 200, 300, 400. Then create a new variable that is the sum of these three variables, run a contingency table of that sum against the variable they are supposedly explaining, and see if there is a pattern.
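A rough sketch of that recode-and-crosstab idea, on made-up cases. The helper names `to_levels` and `pattern_key`, the variable names and the values are all hypothetical; the levels are assigned by ranking each variable's distinct values.

```python
from collections import Counter, defaultdict


def to_levels(values):
    """Map each distinct value to a level code 1, 2, 3, ... in sorted order."""
    codes = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def pattern_key(level1, level2, level3):
    """Combine three level codes into one number, e.g. (2, 3, 1) -> 132."""
    return level1 + 10 * level2 + 100 * level3


# Hypothetical cases, NOT the real data: three predictors and the target.
var1 = [0.10, 0.12, 0.11, 0.13, 0.10, 0.12]
var2 = [0.05, 0.07, 0.06, 0.09, 0.05, 0.07]
var3 = [0.03, 0.02, 0.04, 0.01, 0.03, 0.02]
target = [0.08, 0.09, 0.07, 0.06, 0.08, 0.09]

keys = [
    pattern_key(a, b, c)
    for a, b, c in zip(to_levels(var1), to_levels(var2), to_levels(var3))
]

# Contingency table: how often each (key, target value) pair occurs.
crosstab = Counter(zip(keys, target))
for (key, value), count in sorted(crosstab.items()):
    print(key, value, count)

# If every key goes with exactly one target value, the three predictors
# fully determine the target in this data.
targets_per_key = defaultdict(set)
for key, value in zip(keys, target):
    targets_per_key[key].add(value)
determined = all(len(vals) == 1 for vals in targets_per_key.values())
print(determined)
```

A table in which each key lines up with a single target value is the pattern to look for. Keys spread across several target values would argue against the three variables defining the fourth.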