[self-selection - nonresponse bias] 65% of population answered survey

#1
Dear TalkStatstters,

I want to investigate possible wage gaps between different groups of people with a doctoral degree in my region. I have in mind a regression where the dependent variable is the log of the ratio between the average wages of two groups.
I have administrative data from the whole population of new doctorates, in a given year, from my region.
I also have survey data from a sample of this same population: the whole population was invited to participate in the survey, but only around 65% of the doctorate holders answered. This is the sample I have to work with.
If the nonresponse is not random there will be self-selection bias, so my estimates won't have external validity.
Any input on how to tackle this issue? Any literature on it?
 

noetsi

Fortran must die
#2
Well, to start with you have to determine whether the non-response was indeed non-random. That means looking for differences between those who did and did not respond and, assuming there are differences, deciding whether they suggest the non-response was non-random. I don't think there is any absolute or statistical way to know for sure whether the response is non-random; it is a judgment call on what is reasonable.
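
As a rough illustration of that first check, here is a minimal sketch that compares response rates across one administrative variable and computes a Pearson chi-square statistic. The variable name `field`, the group labels, and all counts are made up for illustration (chosen only so that the overall response rate is 65%); nothing here comes from the actual dataset.

```python
from collections import Counter

def response_rate_table(records, key):
    """Return {group: (n_population, n_respondents, response_rate)}."""
    totals = Counter(r[key] for r in records)
    responded = Counter(r[key] for r in records if r["responded"])
    return {g: (n, responded[g], responded[g] / n) for g, n in totals.items()}

def chi_square_statistic(table):
    """Pearson chi-square for a groups x (responded, not responded) table."""
    n_total = sum(n for n, _, _ in table.values())
    r_total = sum(r for _, r, _ in table.values())
    overall = r_total / n_total  # overall response rate
    stat = 0.0
    for n, r, _ in table.values():
        expected_resp = n * overall
        expected_nonresp = n * (1 - overall)
        stat += (r - expected_resp) ** 2 / expected_resp
        stat += ((n - r) - expected_nonresp) ** 2 / expected_nonresp
    return stat, len(table) - 1  # statistic and degrees of freedom

# Hypothetical population: 200 doctorates, 130 respondents (65%).
records = (
    [{"field": "STEM", "responded": True}] * 80
    + [{"field": "STEM", "responded": False}] * 20
    + [{"field": "Humanities", "responded": True}] * 50
    + [{"field": "Humanities", "responded": False}] * 50
)

table = response_rate_table(records, "field")
stat, df = chi_square_statistic(table)
for group, (n, r, rate) in table.items():
    print(f"{group}: {r}/{n} responded ({rate:.0%})")
print(f"chi-square = {stat:.2f} on {df} df")
```

With these toy counts the statistic is about 19.8 on 1 degree of freedom, far above the 5% critical value of 3.84, so response would clearly differ by field. A difference like this only shows that response is related to an observed variable; whether it matters for the wage estimates remains the judgment call described above.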

What the literature will show you on this issue is that it is common for 50 plus percent of those contacted not to respond. The common "solution" is essentially to claim that your results are no worse than those of others who have conducted such surveys and thus they are valid (in my area it is common to have 15 percent response rates). That really is not a solution, but there is no real solution. All you can hope, and it is only a hope, is that those who did respond don't differ significantly in their answers from those who did not, and that is essentially what researchers do.

To me, although I have not seen this done in analysis, one possibility is to report known demographic differences between those who did not respond and those who did, giving the reader a chance to decide for himself what is reasonable. There are methods, such as multiple imputation, that deal with data that are not missing completely at random (the standard versions assume missingness depends only on observed variables), but I do not know if this really addresses this issue or not. I have never seen it used this way in the literature.

It should be noted that the literature I have seen is public administration/polling/ political science so it might be different in other areas (and I have not read the literature extensively since the late nineties so things might have changed there - but I doubt it very much given the issues and solutions).
 
#3
Thank you Noetsi,
I have not received the dataset yet; I have just been reading about its features, the survey, and everything else related to it.
The dataset should also contain basic information about the whole population (province, university, course of studies, etc.), so I guess, as you suggest, an analysis of basic descriptive statistics is something I should go through as a first stage.

I will start to read about multiple imputation, thanks for the tip.

There is also the possibility of using the propensity score, but it is not clear to me how to use it in a context where you are not comparing treated to non-treated subjects.
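
One way the propensity idea carries over to nonresponse is to treat "responded" as the outcome being modelled: estimate each person's probability of responding from the administrative variables, then weight each respondent by the inverse of that probability. The sketch below uses the simplest version, a weighting-class adjustment where the response propensity is just the response rate within each cell of one variable. The variable `field`, the wages, and the counts are all hypothetical, and the adjustment only helps if nonresponse is unrelated to wages within each cell, which is an assumption, not something the data can confirm.

```python
from collections import defaultdict

def cell_response_rates(population, key):
    """Estimate the response propensity as the response rate in each cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_population, n_respondents]
    for person in population:
        counts[person[key]][0] += 1
        if person["responded"]:
            counts[person[key]][1] += 1
    return {cell: resp / n for cell, (n, resp) in counts.items()}

def weighted_mean_wage(population, key):
    """Mean respondent wage, weighting by 1 / estimated response propensity."""
    rates = cell_response_rates(population, key)
    num = den = 0.0
    for person in population:
        if person["responded"]:
            weight = 1.0 / rates[person[key]]
            num += weight * person["wage"]
            den += weight
    return num / den

def unweighted_mean_wage(population):
    """Plain mean wage among respondents, ignoring nonresponse."""
    wages = [p["wage"] for p in population if p["responded"]]
    return sum(wages) / len(wages)

# Hypothetical population: wages are observed only for respondents.
population = (
    [{"field": "STEM", "responded": True, "wage": 3000.0}] * 80
    + [{"field": "STEM", "responded": False, "wage": None}] * 20
    + [{"field": "Humanities", "responded": True, "wage": 2000.0}] * 50
    + [{"field": "Humanities", "responded": False, "wage": None}] * 50
)

naive = unweighted_mean_wage(population)
adjusted = weighted_mean_wage(population, "field")
print(f"unweighted respondent mean: {naive:.2f}")
print(f"nonresponse-weighted mean:  {adjusted:.2f}")
```

In this toy example the group with the higher wage also responds more often, so the unweighted mean overstates the population mean and the weights pull it back down. With several administrative variables, the cell rates would typically be replaced by fitted probabilities from a model such as a logistic regression of response on those variables.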
 

noetsi

Fortran must die
#4
To me this is one of the most difficult areas in survey research (and I am a practitioner). Response rates are low and getting lower as fewer people respond each year. As I said, in my area 15 percent response rates are common; we survey extra people to get larger samples knowing this. Commonly you have the difficult choice of whether to use what you have, knowing that most of the population did not respond, or throw out the research. Most, I suspect, go with the former, which raises the question of how valid such (very common) analysis is.

I need to look into multiple imputation myself :p Good luck with your results.