Interpreting dummy variables.

noetsi

Fortran must die
#1
After all these years reading regression this should be simple to do...


1626577967674.png

"Impact" is the regression slope for each dummy variable.

I should say that the excluded reference group here does not seem like a good choice to me: it is those younger than 16, of whom we have extremely few and who most likely earn very little. I cannot change it; it was decided by the federal government.

That said, I don't see how every dummy variable can be positive. Some have to earn less than others. Is there a way to say (I have not seen this addressed) that one level did better relative to another level? Formally you are comparing customers in the category to those not in it. But my audience will want us to discuss how one level did relative to another.

What I did say [not certain this is true formally for regression]

The impact of age is most positive in the 25-44 and 45-54 brackets and least positive in the 60 or older category. For example, those in the 45-54 age group earned $3,220 more than those who were not in this age group.

Relative impact is why I came to the board in large part 11 years ago and it continues to elude me after all these years. :p
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Each one is the estimated average difference in salary between the listed group and the reference group. Yes, they can all be positive if every listed group on average makes more than the kids in the reference group.

You can get other comparisons by using the ESTIMATE or CONTRAST statement in SAS or emmeans in R. But those values should be apparent from staring at the estimates.

If I were you, I would just calculate each group's estimated average salary with 95% confidence intervals using the model output and plot those data for your audience!
 

Dason

Ambassador to the humans
#3
Noetsi. You understand that one group must have the lowest average right? And that the parameter estimates are the estimated differences between the averages? So if the lowest group is chosen as the reference group then all of the differences will be positive right?
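The point about the reference group can be checked with a small sketch in Python (chosen here only for illustration; the thread itself uses SAS). It assumes a model with age bracket as the only predictor, in which case each OLS dummy slope is exactly that group's mean minus the reference group's mean. The salary numbers and the helper `dummy_coefficients` are made up for the example and are not the data from the thread; with 40-plus other variables in the model the slopes would be adjusted differences rather than raw ones.

```python
from statistics import mean

# Hypothetical salaries by age bracket (made-up numbers, not the thread's data).
salaries = {
    "<16":   [1000, 1200, 1100],
    "25-44": [4000, 4200, 4400],
    "45-54": [4300, 4500, 4700],
    "60+":   [3000, 3100, 2900],
}

def dummy_coefficients(data, reference):
    """With one categorical predictor, each OLS dummy slope equals
    that group's mean minus the reference group's mean."""
    ref_mean = mean(data[reference])
    return {g: mean(v) - ref_mean for g, v in data.items() if g != reference}

# Lowest-earning group as reference: every slope is positive.
low_ref = dummy_coefficients(salaries, "<16")
# A middle group as reference: the signs change.
mid_ref = dummy_coefficients(salaries, "25-44")

print("reference <16:  ", low_ref)
print("reference 25-44:", mid_ref)

# The gap between two non-reference groups does not depend on the reference.
print(low_ref["45-54"] - low_ref["60+"], mid_ref["45-54"] - mid_ref["60+"])
```

With these made-up numbers, the 45-54 versus 60+ gap is 1500 under either choice of reference, even though the individual slopes change sign.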
 

noetsi

Fortran must die
#4
I thought that the dummy was showing the difference between being in the group and not being in the group. So, for example, when you look at the dummy for 60+ you were comparing those 60+ to everyone who was not 60+ (most of whom earn more than those 60+ and most of whom are not in the excluded group).

But it looks like you are saying instead that they are comparing those 60+ to the reference group, not to everyone who was not sixty plus. Amazing all the years I have read descriptions of dummy variables I missed this. I think it is because most of the examples use variables that only have two levels like gender.

But it is correct to say that if group 1's estimate is higher than group 2's (both measured relative to the reference group), then group 1 is higher than group 2.
 

noetsi

Fortran must die
#5
hlsmith said: If I were you, I would just calculate each group's estimated average salary with 95% confidence intervals using the model output and plot those data for your audience!

The reason I don't want to generate descriptives is that they fail to control for other variables in the model. And we have 40-plus variables in the model. I built descriptives like average salary and then gave up because of this concern.
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
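I mean descriptives based on your multiple linear regression. The intercept is the estimated mean for the <16 reference group (with the other variables at their reference levels). The intercept plus the coefficient for the next age group equals the estimated mean for that group, and so on, up to the intercept plus the coefficient for the oldest group, which equals the estimated mean for the oldest group. Get it?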

Yes, each coefficient is the mean increase in value from the reference, so you can compare them to each other: the difference between the group 1 and group 2 coefficients estimates the difference between group 1 and group 2.
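To make the intercept-plus-coefficient arithmetic concrete, here is a minimal Python sketch under stated assumptions: the data are made up and exactly additive, there are only three age levels plus one veteran dummy, and the helpers `solve` and `ols` are hypothetical stand-ins written for this example (in practice the SAS or R model output would supply the coefficients). It shows model-based group means that still control for the other variable, which is different from raw descriptive averages.

```python
# Made-up, exactly additive data (NOT the thread's data):
# salary = 1000 + 3000*[age 25-44] + 2000*[age 60+] + 500*[veteran]
# Reference levels: age <16 and non-veteran.

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

rows = [(age, vet) for age in ("<16", "25-44", "60+") for vet in (0, 1)]
# Columns: intercept, dummy for 25-44, dummy for 60+, veteran dummy.
X = [[1, int(age == "25-44"), int(age == "60+"), vet] for age, vet in rows]
y = [1000 + 3000 * r[1] + 2000 * r[2] + 500 * r[3] for r in X]

intercept, b_2544, b_60, b_vet = ols(X, y)

# Model-based mean per age group, holding veteran at its reference level (0).
adjusted = {
    "<16": intercept,
    "25-44": intercept + b_2544,
    "60+": intercept + b_60,
}
for group, value in adjusted.items():
    print(group, round(value))

# Comparing two non-reference levels: subtract their coefficients.
print("25-44 vs 60+:", round(b_2544 - b_60))
```

Because the veteran dummy is in the model, the age coefficients here are differences at a fixed veteran status. A confidence interval for a difference between two coefficients also needs their covariance, which this sketch does not compute.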
 

noetsi

Fortran must die
#7
Ok but say you have this situation, and this is real data.

In age you are comparing each level to the same reference level. So you can compare levels. But for one other group of variables there is no reference level in common. I call these "other". For example, Veteran's reference level is not being a veteran. Gender's is not being female. I did not think in that case one could directly compare the impact of one variable to the impact of another variable... (but I have been wrong all night so who knows). :p

hlsmith said: Descriptive based on your multiple linear regression. Intercept is <16, intercept plus 16 group equals 16 group, intercept plus next group equals next group,..., intercept plus 61 group equals mean value for 61 group. Get it?

I honestly don't get what you mean here. I have never seen an example of this in a journal.

I am doing something simpler. The tables you saw above and this summary of them in the executive summary.

  • The impact of age is most positive in the 45-54 level ($3,320 more than the reference level) and least positive in the 60 or older level ($1,920 more than the reference level).
  • For education, having a college degree is helpful. Those with a BA on average earn $2,070 more than the reference level. At the other extreme, those with a special education certificate earn $1,080 less than the reference level. Those few who have an advanced degree beyond a bachelor's degree actually earn less on average than those with a bachelor's, which is surprising.
  • Which disability group individuals are in matters. Those in the auditory and communicative disorder category earn $490 more than the reference level. At the other extreme, those in the psychological and psychosocial category earn $420 less than the reference level.
  • The severity of a disability makes a great deal of difference. Those who are most significantly disabled earn $2,700 less than the reference level. Those with a significant disability earn $2,300 less than the reference level.
  • Race has a lesser impact than most variables, but it does have some impact. Those who chose more than one race earned $360 more than the reference level. Native Americans (which includes Alaskan natives) earned approximately $210 less than the reference level.
  • Among other factors considered, being a veteran was the most positive. Being a veteran was associated with earning $500 more than the reference level. At the other extreme, receiving public support at IPE was associated with earning about $1,230 less, being classified as low income with $610 less, and being long-term unemployed with $410 less (all in comparison to their reference levels).
 

hlsmith

Less is more. Stay pure. Stay poor.
#8
Yeah, what you have seems fine. And yes, you are correct that estimates taken from the multiple linear regression are for the base case, with the other variables at their reference levels, so your presentation above seems fine.
 

noetsi

Fortran must die
#9
I am running PROC GENMOD as essentially OLS (the distribution is Normal and the link function is Identity). My dependent variable has two levels (0 and 1). As I understand it, the slope is thus the increase in the chance of being in one of these levels (I believe level 1, but I cannot confirm this from the documentation), or the decrease, of course, if the slope is negative.

With dummy variables it is the mean difference as always, but it still reflects the increased (or decreased) chance of being at one of the levels (again I assume this is level 1).

I ask this because in PROC LOGISTIC, unlike most software, SAS by default models the chance of being at level 0, not 1.
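A quick way to see which level the slope refers to is to fit the same straight line twice, once with the outcome coded as is and once with the 0/1 coding reversed. The Python sketch below is only an illustration under assumptions: the data are made up, the variable names (`veteran`, `employed`) are hypothetical, and the helper `slope_and_intercept` is a plain least-squares line, not a reproduction of what PROC GENMOD does internally. When a 0/1 outcome is treated as a numeric response, the slope on a dummy is the difference in the proportion of 1s between the two groups.

```python
def slope_and_intercept(x, y):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: a veteran dummy and a 0/1 outcome.
veteran  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
employed = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Outcome as coded: the slope is P(y=1 | veteran) - P(y=1 | non-veteran).
slope_1, intercept_1 = slope_and_intercept(veteran, employed)

# Outcome with the coding reversed: same magnitude, opposite sign.
flipped = [1 - v for v in employed]
slope_0, intercept_0 = slope_and_intercept(veteran, flipped)

print("modeling level 1:", round(slope_1, 3), round(intercept_1, 3))
print("modeling level 0:", round(slope_0, 3), round(intercept_0, 3))
```

In this made-up sample 20% of non-veterans and 60% of veterans have outcome 1, so the slope is 0.4 when level 1 is modeled and -0.4 when level 0 is modeled. If a fitted slope has the opposite sign from the raw difference in proportions of 1s, that suggests the software is modeling level 0.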
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
The output or log likely tells you which levels are the reference groups for the DV and the IVs.

Also, as mentioned before, this model would be producing probability values and is fit using maximum likelihood estimation.