Does a reduced sample size after joining two datasets count as missing data?

#1
Hi there. I have two datasets that both contain names, which I use to join them into the sample on which I am running a logistic regression. However, after performing a fuzzy join, only about 55% of the data matches (5.5k out of 10k observations). Does this count as missing data? I ask because it concerns the sample rather than a particular variable. If so, is it missing completely at random (MCAR)?

Thanks!
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Almost all biases can be considered missing data. Model misspecification = missing knowledge, confounding = missing variable, selection bias = missing individuals, information bias = missing accuracy/classification metrics, missing data = missing data.

Why can't you deterministically join these data?
Your scenario could be just a subsampling concern, or it could actually represent selection bias if there is a systematic reason why some records fail to match.

Tell us more, please!
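One rough way to probe for a systematic underpinning is to compare something observed for every row (for example, the win rate in the voting data) between rows that matched and rows that did not. The sketch below uses a pooled two-proportion z statistic. The counts are invented for illustration and are not from this thread. A large |z| would argue against MCAR; a small one does not prove MCAR, since the matching could still depend on variables you have not checked.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (difference in proportions, z statistic) using a pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_a - p_b, (p_a - p_b) / se

# Hypothetical counts: winners among matched vs. unmatched candidates.
matched_wins, matched_n = 2200, 5500
unmatched_wins, unmatched_n = 1350, 4500

diff, z = two_proportion_z(matched_wins, matched_n, unmatched_wins, unmatched_n)
print(f"win-rate difference = {diff:.3f}, z = {z:.2f}")
```

The same comparison can be repeated for any covariate available in the full (pre-join) dataset, such as party, district, or total votes.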
 
#3
Thank you for replying. Here are the details:

I am working with two datasets pertaining to elections. Both datasets contain the candidates' names. One dataset has transaction (campaign finance) data, the other has electoral (voting) data. Both have roughly 10k rows, which makes sense because they pertain to the same candidates running in the same election. However, when I merge the two datasets on candidate name, only about 5.5k entries are left because the names do not match perfectly (abbreviations, titles, etc.). I have already increased the number of matches significantly by using fuzzy logic to find suitable matches. The best I can get with this method is 5.5k entries.
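For reference, here is a minimal sketch of the kind of matching described above: normalize each name first (lowercase, drop titles, reorder "Last, First"), then fuzzy-match on the normalized form. It uses only Python's standard library (`re` and `difflib`). The `TITLES` set, the `0.85` cutoff, and the sample names are assumptions chosen for illustration, not the actual procedure or data.

```python
import difflib
import re

# Hypothetical list of tokens to drop; extend it to suit the real data.
TITLES = {"dr", "mr", "mrs", "ms", "jr", "sr", "hon", "sen", "rep"}

def normalize(name):
    """Lowercase, reorder 'Last, First' to 'first last', drop punctuation and titles."""
    name = name.lower()
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    tokens = re.findall(r"[a-z]+", name)
    return " ".join(t for t in tokens if t not in TITLES)

def best_match(name, candidates, cutoff=0.85):
    """Return the candidate whose normalized form is closest to `name`, or None."""
    by_norm = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(name), list(by_norm), n=1, cutoff=cutoff)
    return by_norm[hits[0]] if hits else None

voting_names = ["DOE, JANE", "John Smith"]
print(best_match("Hon. Jane Doe", voting_names))
print(best_match("Zed Quux", voting_names))
```

Normalizing before matching usually recovers the title and word-order mismatches cheaply, so the fuzzy step only has to deal with genuine spelling differences. Any match accepted below a ratio of 1.0 is worth spot-checking by hand.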
 

noetsi

Fortran must die
#4
Then some data will be missing. 45 percent is quite a bit, although sampling has had more and more problems getting responses. So you can try things like multiple imputation, which is no fun. Is your dependent variable ordinal or interval?
 
#5
The dependent variable is dichotomous - I'm running a logistic regression on the probability of a candidate winning, based on independent variables from both datasets.
 

noetsi

Fortran must die
#7
Multiple imputation is not fun with binary data, although it can be done in theory. I would try hlsmith's suggestion first.
 
#8
Based on your description, I presume there are two candidates. My first question is: why is fuzzy matching needed for two names (even with abbreviations, this is hard to wrap my mind around)? It may be helpful to share a snippet of the data. You say both datasets have about 10k rows, but it's important to know exactly what each row represents. When you merge the two datasets into one, you expect the information across a row to represent one "observation" in your data. The most important question, imo, is: what defines an observation? Try to merge one row by hand and describe what it tells you. I'm not totally convinced that name is the unique identifier per row.

Example: Let's say I'm building a logistic regression model to predict the outcome of a baseball game for a particular pitcher (dependent variable: Win/Loss). If I have some independent variables from the games in one dataset and more independent variables from the games in another dataset, and I join on win/loss, the games will most likely be all scrambled. I could be wrong, but this may be an issue in your case (setting aside the fuzzy matching issue).
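A quick way to test whether name really is a unique identifier is to count how often each key value appears in each dataset before joining. The sketch below uses `collections.Counter`. The rows and column names are hypothetical, standing in for a finance table with one row per transaction.

```python
from collections import Counter

def duplicated_keys(rows, key):
    """Return {key value: count} for every key value that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical finance rows: one row per transaction, not one per candidate.
finance_rows = [
    {"name": "jane doe", "amount": 1000},
    {"name": "jane doe", "amount": 250},
    {"name": "john smith", "amount": 500},
]

print(duplicated_keys(finance_rows, "name"))
```

If this returns anything other than an empty dict, the join key identifies more than one row, and the data would need to be aggregated to one row per observation (or joined on a richer key) before merging.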
 

noetsi

Fortran must die
#9
You can try LIKE commands to get the information. I do this with text, although doing it in a join would be... painful, if it is possible at all (it appears I like the word painful).

Maybe that is what you mean by fuzzy logic.
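Assuming "like commands" refers to SQL's LIKE operator, a pattern-based join is possible. Here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE finance (name TEXT, amount REAL);
    CREATE TABLE votes (last_name TEXT, votes INTEGER);
    INSERT INTO finance VALUES ('Hon. Jane Doe', 1000), ('John Q. Smith Jr.', 500);
    INSERT INTO votes VALUES ('Doe', 5200), ('Smith', 4100), ('Nguyen', 3900);
""")

# LEFT JOIN keeps every row of votes; unmatched rows get NULLs from finance.
rows = conn.execute("""
    SELECT v.last_name, f.name, f.amount
    FROM votes AS v
    LEFT JOIN finance AS f
      ON f.name LIKE '%' || v.last_name || '%'
    ORDER BY v.last_name
""").fetchall()

for row in rows:
    print(row)
```

Two caveats: a substring pattern over-matches (a last name of "Doe" would also match "Doerr"), and comparing every row against every row gets slow as the tables grow. The LEFT JOIN at least makes the unmatched rows visible rather than silently dropping them.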