Wage project

I am starting a new project and I want to get it right. It is the first we have run in many years and it covers a very important issue. I work for a public agency that helps people with disabilities get jobs.

Essentially we are looking at wages two quarters after people leave our agency as the dependent variable. We have never analyzed (nor, apparently, has the academic community, based on the research reviews I have done) what leads to higher income for the Vocational Rehabilitation (VR) community. Historically, this data was not available, so there is no real theory and there are no articles using regression to address this. There is a set of about 34 statistical controls (age, education, and so on) that the federal government requires for VR agencies. I am adding the median wage in the county where the customer gets a job, which remarkably the government does not control for.

I decided to start by looking at how counts of services influence wages two quarters after a customer leaves. Any suggestions about how this should be conducted, or pitfalls in conducting it, would be appreciated. I have decided to abandon, in the short run, determining which service has the most impact relative to other services, since there does not seem to be an agreed way to do this. I am thinking about eventually doing dominance analysis, if I can do this in SAS, to look at relative impact. If anyone has a better way I would be interested in hearing about it.
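
For readers unfamiliar with dominance analysis, here is a minimal sketch of the "general dominance" version: each predictor's weight is its average gain in R-squared over all subsets of the other predictors. It is written in Python rather than SAS, the data are simulated, and the three columns stand in for hypothetical service counts. None of this comes from the agency's data. Because every subset is refit, this only scales to a handful of predictors, not to the full set of controls.

```python
from itertools import combinations
import numpy as np

def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the chosen columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def general_dominance(X, y):
    """Average incremental R^2 per predictor: mean within subset size, then across sizes."""
    p = X.shape[1]
    weights = np.zeros(p)
    for j in range(p):
        others = [c for c in range(p) if c != j]
        size_means = []
        for k in range(p):  # k = number of other predictors already in the model
            gains = [r_squared(X, y, list(s) + [j]) - r_squared(X, y, list(s))
                     for s in combinations(others, k)]
            size_means.append(np.mean(gains))
        weights[j] = np.mean(size_means)
    return weights

# Simulated example: three hypothetical service counts, the third has no effect.
rng = np.random.default_rng(0)
X = rng.poisson(2, size=(500, 3)).astype(float)
y = 300 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 300, 500)

weights = general_dominance(X, y)
full_r2 = r_squared(X, y, [0, 1, 2])
print(weights, full_r2)  # the weights add up to the full-model R^2
```

The weights split the full-model R-squared among the predictors, which is what makes them usable as a relative-impact ranking.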

What is the best way to determine if services really matter? And what problems should I look out for?


Less is more. Stay pure. Stay poor.
What is the time frame for the sample? Will you need to control for COVID? There are plenty of ways to find the association of covariates, you just don't like them. Given that, you are gonna switch over to an approach you just heard about a week ago? What happens if you switch DV to using LPM?
It's not a sample. It is everyone we are rated on for the federal government, the entire population we deal with. Our data right now runs only into 2020, so it is not entirely clear whether COVID affects it or not. That is something I can look for, maybe with segmented regression with the intervention in March 2020?
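
A minimal sketch of what a segmented regression with a March 2020 break could look like, written in Python with simulated monthly mean wages. The monthly aggregation, the coefficients, and the variable names are all assumptions for illustration. The model has a pre-existing trend, a level shift at the break, and a change in slope after it.

```python
import numpy as np

rng = np.random.default_rng(1)

months = np.arange(45)   # hypothetical index: 0 = Jan 2017 ... 44 = Sep 2020
break_at = 38            # March 2020 under that indexing

post = (months >= break_at).astype(float)             # 1 from the break onward
since = np.where(post == 1, months - break_at, 0.0)   # months elapsed since the break

# Simulated outcome: upward trend, then a drop of 400 and a flatter slope.
mean_wage = 3000 + 10 * months - 400 * post - 5 * since + rng.normal(0, 20, months.size)

# Columns: intercept, trend, level shift at break, slope change after break.
design = np.column_stack([np.ones(months.size), months, post, since])
beta, *_ = np.linalg.lstsq(design, mean_wage, rcond=None)

level_shift, slope_change = beta[2], beta[3]
print(level_shift, slope_change)
```

With only a few post-break months in the data, the slope-change term in particular would be estimated very imprecisely, so the level shift is probably the more useful thing to look at.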

I am open to any way to rate this. I am not certain which method it is that I do not like.

I am open to linear probability models, but the federal government is not going to use those. I am not sure what the benefit of doing a linear probability model is when the dependent variable is wages. It is a continuous dependent variable.

To be clear, since I was not before, we are looking at customers who closed with us from 2017 through 2020 (about three quarters of the way through that year). The dependent variable is wages: how much each individual who earned some wages earned. For this measure, those who are unemployed are ignored, which is strange to me, but it is the way the federal government has chosen to do it.

We want to know what increases wages. I am starting with services since we have some control of that and we have data on that.


Yeah, if the DV is continuous, the LPM would be irrelevant. Can you provide a data slice or summary table to help us understand the data frame?
I will when I have some data run. This project keeps getting pushed back.

Essentially I have about 40 control variables (gender, education, race and so on) that are required. Then I have the variable I think matters: what services we provide a customer (counts of services). The dependent variable is how much a customer earns six months after they leave the agency.

The model is basically specific service count -> higher income, controlling for all the control variables. Having spent years reading the literature, I can say there is essentially no theory of what drives wage increases in (public) vocational rehabilitation. Until very recently the government did not make this data available. I only have it because I work for the agency in question.

I think what disability you have matters for this, so I am specifying an interaction effect between disability and services, which is a bit confusing to me. I usually specify (the way I was taught) a k-level categorical variable as a series of k-1 dummies. But it seems like with an interaction you probably want to leave the variable in its original form. That is, don't break it down into dummies.
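
On the dummy question, a small sketch may help. Leaving the categorical variable in its original form and letting the software expand it usually describes the same model as building the k-1 dummies by hand and multiplying each by the service count. The Python example below does the hand-built version on simulated data. The three disability codes, the slopes, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 600

disability = rng.integers(0, 3, n)          # hypothetical codes 0, 1, 2; 0 is the reference
services = rng.poisson(3, n).astype(float)  # hypothetical count of services received

# Simulated wages: the payoff per service differs by disability group.
true_slope = np.array([100.0, 250.0, 50.0])
wages = (2000 + 300 * (disability == 1) - 200 * (disability == 2)
         + true_slope[disability] * services + rng.normal(0, 100, n))

# k-1 dummies for a k = 3 level variable.
d1 = (disability == 1).astype(float)
d2 = (disability == 2).astype(float)

# Main effects plus one interaction column per dummy.
design = np.column_stack([np.ones(n), d1, d2, services, d1 * services, d2 * services])
beta, *_ = np.linalg.lstsq(design, wages, rcond=None)

# Service slope per group: reference slope plus that group's interaction coefficient.
slopes = np.array([beta[3], beta[3] + beta[4], beta[3] + beta[5]])
print(slopes)
```

Each interaction coefficient is the difference between that group's service slope and the reference group's, which is the same thing a categorical-by-count interaction term reports.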
If we consider the problem as having many variables that contribute to one dependent variable, then it could be framed as a search for the principal components among the variables that influence the customer's wage, a process called Principal Component Analysis (PCA).

It's a method of reducing the number of dimensions to a short list of components, ordered by how much of the variation in the input variables each one captures. PCA does not use the outcome itself, so the components would still need to be related to wages (for example in a regression) to see which ones matter.
Further proposed measurements can then be compared against that short list to see if they are more or less important to the outcome.
Such new measurements can be placed in the ordered list of dimensions measured thus far according to their impact on the result.
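
One caveat worth checking before relying on PCA here: the components are ordered by variance in the predictors, which need not match their relation to wages. Below is a minimal sketch in Python with simulated data, computing PCA via the singular value decomposition of the standardized predictors. All variables are hypothetical, and the outcome is deliberately built to depend only on the predictor that is not part of the dominant component.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Two predictors share a common factor; the third is independent of them.
latent = rng.normal(size=n)
X = np.column_stack([
    latent + rng.normal(0, 0.3, n),
    latent + rng.normal(0, 0.3, n),
    rng.normal(size=n),
])
y = 500 * X[:, 2] + rng.normal(0, 100, n)   # outcome driven by the third predictor only

# PCA on standardized predictors via SVD; the outcome is never used here.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)             # share of predictor variance per component
scores = Z @ Vt.T                           # component scores for each observation

# Relating components to the outcome is a separate step.
corr = np.array([abs(np.corrcoef(scores[:, i], y)[0, 1]) for i in range(3)])
print(explained, corr)
```

In this setup the first component carries the most predictor variance but is not the one most correlated with the outcome, so ranking by component order alone would be misleading.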