# Comparing to excluded group

#### noetsi

##### Fortran must die
For strange reasons I won't go into, we have a data set where it is non-trivial to identify the customers who form the excluded (reference) group for a set of categorical variables. They are in the data set, but it will take time to identify them, so we cannot easily change the reference group.

In some cases this excluded group is very small, maybe 12 out of 16 thousand cases. Other times it is not. The point is that I want to compare the slopes of the dummy variables to each other, not to the excluded group (when the excluded group has only 12 cases, that comparison is a bit pointless), and do it very quickly.

All the predictors are dummy variables; the response variable is income (interval). If one dummy variable's slope for a given categorical variable is, say, 2000 and another dummy's slope for the same categorical variable is, say, 3000, can we say the level with 3000 has more impact?

I guess this also raises the issue of whether you can even talk of relative impact in regression, with a new wrinkle. I came to this board 10 years ago to address relative impact, and after 10 years of reading I am still not sure how to do this.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
So you want to drop a group from pairwise comparisons if it is below a certain size (threshold)?

#### noetsi

> So you want to drop a group from pairwise comparisons if it is below a certain size (threshold)?

No, I want to compare two levels of a categorical variable that has multiple dummies (essentially, compare the dummies to each other) in a situation where I cannot change which level is the reference group.

Say a variable has 5 levels and we build 4 dummies from it. We do not have information that lets us change the reference group easily; we cannot easily access the excluded 5th level, or I would just change it. We want to compare the 4 non-excluded levels to each other on which is higher on the DV, assuming that a greater magnitude means a greater impact. I have the whole population, so t tests do not matter here. I just want to say that this level of the original categorical variable will generate a larger change in the DV than that level, and comment on the direction for each level.

I am not sure you can do this; usually you compare levels to the excluded level. If the data were easier to manipulate, that is what I would do. But I am trying to do this fast because I have limited time. The problem is that in some cases comparing to the reference level makes no substantive sense, and for the present I cannot change the reference levels. So can you compare the dummies of a categorical variable to each other rather than to the reference group?

#### noetsi

I think I have been thinking about this wrong. The slopes are not the differences between the reference level and anything. They show the difference between, say, male and female on the DV, so it really does not matter what the reference level is. Is this true even when there are very few cases in the reference level (say the reference level has 12 cases out of 16,000)? Does this distort the results of the regression?

#### hlsmith

Yes, it would. If your reference group was the smallest group you could fail to find a difference beyond chance due to that group being small. However, you said you have the full population so this may be moot.

Last week I actually had this scenario, where the reference group ended up being employees under 20 years of age, who represented 2% of the workforce I was sampling. Statistically, none of the comparisons could rule out chance because the reference group was so small.
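A rough way to see why, assuming the simplest case of a single categorical predictor with a common residual variance (the symbols $\sigma^2$, $n_j$, $n_k$ and $n_{\text{ref}}$ are notation introduced here for illustration, not quantities from either model in this thread):

$$
\operatorname{Var}(\hat\beta_j) = \sigma^2\left(\frac{1}{n_j} + \frac{1}{n_{\text{ref}}}\right),
\qquad
\operatorname{Var}(\hat\beta_j - \hat\beta_k) = \sigma^2\left(\frac{1}{n_j} + \frac{1}{n_k}\right).
$$

With $n_{\text{ref}} = 12$, the first variance is dominated by the $1/12$ term no matter how large group $j$ is, while the comparison between two large non-reference groups does not involve $n_{\text{ref}}$ at all.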

#### hlsmith

Also, per your inquiry in the Chatbox, here is a piece of code that kicks out residuals. Let me know if you find another way to automate graphing them.

Code:
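ods graphics on;
proc genmod data=Sero_2 descending plots=all;
  /* The model options that followed the slash were not shown in the original post; */
  /* add the ones you need (for example dist= and link=) after a "/".               */
  model L_result2 = time_since;
  /* Write predicted values and each residual type to the Residuals data set. */
  output out       = Residuals
         pred      = Pred
         resraw    = Resraw
         reschi    = Reschi
         resdev    = Resdev
         stdreschi = Stdreschi
         stdresdev = Stdresdev
         reslik    = Reslik;
run;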

#### noetsi

I will run that code, hlsmith, but here is my real concern (even given that I have the population).

There are only about 12 people not in one of these dummies. I am using reference coding. I thought that meant each estimate was the mean difference between level 0 and level 1, controlling for all other variables. But in every single case the 0 group is much lower than the 1 group. I don't understand how that is possible. It is like everyone being below the average.

| Parameter | Level | Estimate |
|---|---|---|
| Age 16 to 18 | 0 | -2404.590541 |
| Age 19 to 24 | 0 | -2375.881472 |
| Age 25 to 44 | 0 | -3065.549946 |
| Age 45 to 54 | 0 | -3164.930383 |
| Age 55 to 59 | 0 | -2779.786832 |
| Age 60+ | 0 | -1970.861470 |

I am doing something wrong, but I am not sure what.

This is the code I am running:

Code:
proc hpreg data=DORA.INCOME;
CLASS 'Limited English-language profici'n 'Migrant and seasonal farmworker'n 'Race: Hawaiian/Pacific Islander'n 'Race: White'n 'Race: Black'n 'Psychosocial and psychological d'n 'Race: Asian'n 'Physical disability'n 'Postsecondary education no degre'n 'Low-income'n 'Long-term unemployed'n 'Age 16 to 18'n 'Intellectual and learning disabi'n 'Age 19 to 24'n 'Displaced homemaker'n 'High school diploma or equivalen'n 'Individuals has a significant di'n 'Individuals is most significant'n 'TANF recipient'n 'Special education certicate/comp'n 'Received public support at appli'n 'Single parent'n 'Received training services'n 'Received other services'n 'Received career services'n 'Homeless individual, runaway you'n 'Age 25 to 44'n 'Race: More than one'n 'Foster care youth'n Female 'Ethnicity-Hispanic Ethnicity'n 'Employed at application'n 'Age 45 to 54'n 'Age 55 to 59'n 'Age 60+'n 'Associate’s degree'n 'Auditory and communicative disab'n Veteran 'Bachelor’s degree'n 'Beyond a bachelor’s degree'n / PARAM=REFERENCE REF=Last;
MODEL Qtr2_Wage =  'Age 16 to 18'n  'Age 19 to 24'n  'Age 25 to 44'n  'Age 45 to 54'n  'Age 55 to 59'n  'Age 60+'n  'Associate’s degree'n  'Auditory and communicative disab'n  'Bachelor’s degree'n  'Beyond a bachelor’s degree'n  'Construction Employment'n  'Displaced homemaker'n  'Educational, or Health Care Rela'n  'Employed at application'n  'Ethnicity-Hispanic Ethnicity'n  Female  'Financial Services Employment'n  'Foster care youth'n  'High school diploma or equivalen'n  'Homeless individual, runaway you'n  'Individuals has a significant di'n  'Individuals is most significant'n  'Information Services Employment'n  'Intellectual and learning disabi'n  'Leisure, Hospitality, or Enterta'n  'Limited English-language profici'n  'Long-term unemployed'n  'Low-income'n  'Manufacturing Related Employment'n  'Migrant and seasonal farmworker'n  'Natural Resources Employment'n  'Other Services Employment'n  'Physical disability'n  'Postsecondary education no degre'n  'Psychosocial and psychological d'n  'Race: Asian'n  'Race: Black'n  'Race: Hawaiian/Pacific Islander'n  'Race: More than one'n  'Race: White'n  'Received career services'n  'Received other services'n  'Received public support at appli'n  'Received training services'n  'Single parent'n  'Special education certicate/comp'n  'TANF recipient'n  'Trade and Transportation Employm'n  'Unemployment Rate Not Seasonally'n  Veteran  /  TOL ;
Selection Method = NONE;
OUTPUT OUT=WORK.HPREG_OUTPUT (LABEL="Linear regression predictions and statistics for DORA.INCOME")
PREDICTED=PREDICTED_Qtr2_Wage
COOKD=COOKD_Qtr2_Wage
COVRATIO=COVRATIO_Qtr2_Wage
DFFIT=DFFIT_Qtr2_Wage
H=H_Qtr2_Wage
PRESS=PRESS_Qtr2_Wage
STDI=STDI_Qtr2_Wage
STDP=STDP_Qtr2_Wage
STDR=STDR_Qtr2_Wage
RESIDUAL=RESIDUAL_Qtr2_Wage
STUDENT=STUDENT_Qtr2_Wage
RSTUDENT=RSTUDENT_Qtr2_Wage
;
ID 'Age 16 to 18'n 'Age 19 to 24'n 'Age 25 to 44'n 'Age 45 to 54'n 'Age 55 to 59'n 'Age 60+'n 'Associate’s degree'n 'Auditory and communicative disab'n 'Bachelor’s degree'n 'Beyond a bachelor’s degree'n 'Construction Employment'n 'Displaced homemaker'n 'Educational, or Health Care Rela'n 'Employed at application'n 'Ethnicity-Hispanic Ethnicity'n 'Ex-offender'n Female 'Financial Services Employment'n 'Foster care youth'n 'High school diploma or equivalen'n 'Homeless individual, runaway you'n 'Individuals has a significant di'n 'Individuals is most significant'n 'Information Services Employment'n 'Intellectual and learning disabi'n 'Leisure, Hospitality, or Enterta'n 'Limited English-language profici'n 'Long-term unemployed'n 'Low-income'n 'Manufacturing Related Employment'n 'Migrant and seasonal farmworker'n 'Natural Resources Employment'n 'Other Services Employment'n 'Physical disability'n 'Postsecondary education no degre'n 'Professional and Business Servic'n 'Psychosocial and psychological d'n 'Race: Asian'n 'Race: Black'n 'Race: Hawaiian/Pacific Islander'n 'Race: More than one'n 'Race: White'n 'Received career services'n 'Received other services'n 'Received public support at appli'n 'Received training services'n 'Single parent'n 'Special education certicate/comp'n 'TANF recipient'n 'Trade and Transportation Employm'n 'Unemployment Rate Not Seasonally'n Veteran;
run;

#### Dason

It literally doesn't matter what the reference group is. It literally didn't change a single thing. Do an estimate statement to compare the coefficients that you care about directly.
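To make that concrete outside SAS, here is a minimal Python sketch on made-up data (the group sizes, means, and the helper `fit_dummies` are all hypothetical, chosen only to mimic a 12-case reference level). It fits the same reference-coded regression twice with different reference levels and shows that the difference between two non-reference coefficients does not move.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one 5-level categorical variable; level 0 has only 12 cases.
levels = np.repeat(np.arange(5), [12, 500, 500, 500, 488])
true_means = np.array([1000.0, 3400.0, 3300.0, 4100.0, 2900.0])
income = true_means[levels] + rng.normal(0, 500, size=levels.size)


def fit_dummies(levels, y, ref, n_levels=5):
    """OLS with reference coding; returns {level: coefficient} for non-reference levels."""
    kept = [k for k in range(n_levels) if k != ref]
    X = np.column_stack(
        [np.ones(len(y))] + [(levels == k).astype(float) for k in kept]
    )
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(kept, beta[1:]))


coef_ref0 = fit_dummies(levels, income, ref=0)  # tiny group as reference
coef_ref4 = fit_dummies(levels, income, ref=4)  # large group as reference

# Individual coefficients change with the reference, but their differences do not.
diff_ref0 = coef_ref0[2] - coef_ref0[1]
diff_ref4 = coef_ref4[2] - coef_ref4[1]
print(diff_ref0, diff_ref4)
```

The individual coefficients differ between the two fits, but the level 2 versus level 1 difference is the same in both, and it equals the difference in the two group means. In a model with other covariates the same invariance holds for the adjusted difference.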

#### noetsi

I agree the reference level does not matter. This is showing the mean difference on the DV between those in an age range and those not in it. And in every case, not being in the range leads to worse results. That is not possible, bad as I am at math.

Or does dummy coding not show the mean difference between being in a level and not being in it, controlling for other variables? The effect cannot be negative at every possible age group; in some cases it has to be higher.

#### Dason

All it is saying is that compared to the reference group all of the other groups are lower. Why do you think that is mathematically impossible?

#### noetsi

Because I thought it was comparing them not to the reference group but to each other. So it was saying that if you are 60 or older, you earned less money than those who are younger than 60 (whether they are in the reference group or not), and if you were 55-59 you were earning less money than those who were not 55-59, and so on for every single group. Obviously some groups have to earn more if others earn less.

I thought (and this is an amazing miss on my part) that dummy coding showed the mean difference between everyone who was a 1 and everyone who was a 0 in the entire population you have. I never realized that they were being compared to the reference group.

This gets at what really occurs. The interpretation of the coefficients is much like that for the binary variables. Group 1 is the omitted group, so Intercept is the mean for group 1. The coefficient for mealcat2 is the mean for group 2 minus the mean of the omitted group (group 1). And the coefficient for mealcat3 is the mean of group 3 minus the mean of group 1. You can verify this by comparing the coefficients with the means of the groups.
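Written out, assuming one categorical variable and no other predictors (here $\bar y_j$ denotes the mean of the DV in group $j$ and $\bar y_{\text{ref}}$ the mean in the omitted group; this notation is introduced only for illustration):

$$
\hat\beta_0 = \bar y_{\text{ref}}, \qquad \hat\beta_j = \bar y_j - \bar y_{\text{ref}}
\quad\Longrightarrow\quad
\hat\beta_j - \hat\beta_k = \bar y_j - \bar y_k .
$$

So each coefficient on its own is a comparison with the omitted group, but the difference between two coefficients compares those two groups directly, and the omitted group drops out. With other covariates in the model the same cancellation applies to the adjusted differences.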

Yow, what a miss. Reading about this over the years, I always thought they were comparing the 1s and 0s to each other, so that if men were 1 and women 0 and the slope was 500, men were higher than women by 500. That is true, but it does not carry over when a variable has more than 2 levels. Then you are comparing to the reference group.

Which is unfortunate in our case, because the federal government chose reference groups with tiny membership (say 12 out of 16 thousand). Presenting the results to my audience, who are not going to understand why every age group shows negative, is going to be fun.

#### Dason

If you're interested in the pairwise difference why aren't you showing them? The raw coefficients shouldn't matter much. Just calculate the comparisons you are interested in.

You know I'm not a SAS fanatic, but it will do this for you very easily.

#### noetsi

> If you're interested in the pairwise difference why aren't you showing them?

I already built descriptives that show this, if that is what you mean. I want to show the difference between 60+ and not 60+, for example, when you take into account all other variables (and likewise for each of the six age variables). In theory you could do this with cross tabs, but in practice that is not realistic with so many variables (we have 51). And ultimately we need to do regression anyway because the federal government requires it. If there is a simple way to get this information in SAS, I do not know what it is. Honestly, I don't understand why they made age into a categorical variable to start with.

The problem, as has been true since I first came here, is that what we really want to know is "if I do x, what will happen?" and "does this variable have more impact than that one?" And, as you and Jake and many others have told me repeatedly over the last decade or so, regression is not designed for relative impact.

#### hlsmith

Making ages into categories helps de-identify people.

#### noetsi

That could be it, although I can see the customers in the raw data, so it does not work if that is the intent. My guess is they just had preexisting categories and added them to the regression.

#### hlsmith

60 vs all others is something like:

Lsmeans '60 comparison' agegroup 1/60 1/60 ... -1;
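As a sanity check on what a contrast of that shape computes, here is a small Python sketch on made-up data (the variable names `age_group` and `wage`, the group sizes, and the means are hypothetical, and this is not the SAS output). It puts a weight of +1 on the "60+" level and an equal negative weight on each of the other levels, so the weights sum to zero.

```python
import numpy as np

# Hypothetical data: 6 age groups coded 0..5, with 5 standing in for "60+".
rng = np.random.default_rng(1)
age_group = np.repeat(np.arange(6), [300, 400, 900, 500, 250, 150])
true_means = np.array([2400.0, 2400.0, 3100.0, 3200.0, 2800.0, 2000.0])
wage = true_means[age_group] + rng.normal(0, 400, size=age_group.size)

# Reference coding with group 0 as the reference: intercept plus 5 dummies.
X = np.column_stack(
    [np.ones(wage.size)] + [(age_group == k).astype(float) for k in range(1, 6)]
)
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)

# Effect of each group relative to the reference; the reference itself is 0.
effects = np.concatenate([[0.0], beta[1:]])

# Contrast weights: +1 on "60+", -1/5 on each of the other five groups.
weights = np.full(6, -1.0 / 5)
weights[5] = 1.0
contrast = weights @ effects

# Same quantity from the raw group means: "60+" minus the average of the others.
group_means = np.array([wage[age_group == k].mean() for k in range(6)])
check = group_means[5] - group_means[:5].mean()
print(contrast, check)
```

Because the weights sum to zero, the reference group's mean cancels and the result does not depend on which level was omitted. Note that this compares "60+" with the unweighted average of the other group means, which is not the same as the pooled mean of everyone under 60 when the groups differ in size.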