# Random sampling from a database with missing values

#### hapap

Hello everyone!

I have a real-life problem that I would like some help with. I'm somewhat familiar with statistics, but for some reason I haven't come up with a good solution. My problem is the following:

I have a database of patients, each with a varying number of variables (medical test results) available. For example, let's say I have variables A to K. Patient 1 may have variables A, C, D and K measured, with the other variables missing. Patient 2 may have only variables C and E measured, and so on. There are also systematic "gaps" in the data: patients with variable F measured always have variable G missing (due to competing protocols in different hospitals).
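To make the data layout concrete, here is a minimal sketch of how such records might be stored and how the systematic gaps could be listed. The toy records and the helper name `never_observed_together` are made up for illustration; a missing test is simply represented by an absent key.

```python
from itertools import combinations

# Hypothetical toy records: a variable that was not measured is simply absent.
patients = [
    {"A": 3, "C": 1, "D": 4, "K": 2},
    {"C": 2, "E": 5},
    {"A": 1, "F": 7},
    {"C": 3, "E": 4, "G": 2},
]

def never_observed_together(records, variables):
    """Return the variable pairs that no single patient has both measured."""
    gaps = []
    for u, v in combinations(variables, 2):
        if not any(u in r and v in r for r in records):
            gaps.append((u, v))
    return gaps

variables = ["A", "C", "D", "E", "F", "G", "K"]
gaps = never_observed_together(patients, variables)
```

For any pair in `gaps` (such as F and G here), the data contain no direct information about how the two variables depend on each other, so whatever sampling method is used has to get that dependence from somewhere else, for example indirectly through other variables.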

Now, I want to create another database consisting of "simulated patients" so that every simulated patient has a random, realistic value for every variable (i.e. no missing data). In other words, the simulated patients should come from the same unknown multivariate distribution as the real patients. How can I do this? We can assume that all the variables are integer-valued with a limited range.

Typically I have several hundred patients, so I can estimate the distribution of a single variable quite well. But I cannot simply build the database of simulated patients by randomly sampling each variable separately, because that would ignore all the dependencies between variables, right? There are plenty of dependencies, because most medical conditions show up in multiple test results simultaneously.
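A small sketch of why separate sampling fails, using made-up data in which a hypothetical variable B closely tracks a hypothetical variable E. Sampling each variable from its own marginal keeps the individual distributions but should wipe out the correlation between them. The data-generating rule is an assumption chosen only to make the effect visible.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)

# Hypothetical complete data: B equals E plus a small integer disturbance.
real = []
for _ in range(300):
    e = rng.randint(0, 10)
    b = e + rng.randint(-1, 1)
    real.append((e, b))

e_vals = [e for e, _ in real]
b_vals = [b for _, b in real]

# "Simulated patients" built by sampling each variable separately.
independent = [(rng.choice(e_vals), rng.choice(b_vals)) for _ in range(300)]

r_real = pearson(e_vals, b_vals)
r_independent = pearson(*zip(*independent))
```

In this toy setup `r_real` is close to 1 while `r_independent` is close to 0, even though both data sets have essentially the same distribution for E alone and for B alone.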

I have tried the following. First I draw a random value for a single (randomly selected) variable; let's say I get the value 'e' for variable E. Then I estimate the conditional distribution of another variable, say B, by collecting the values of B from the real patients that have E=e. I can then randomly choose a value for B from this conditional distribution and thus take its dependence on E into account. This can be continued iteratively and is, to my understanding, a valid method. Or is it? In any case, the problem is that even though I have hundreds of patients, after two or three conditioning steps only a few patients remain in that specific conditional distribution (e.g. only 10 of 300 patients have both E=e and B=b). So in practice this method quickly becomes unreliable and then impossible.
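The chained procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data, the function names, and especially the fallback to the marginal distribution when no real patient matches are not part of the original description. The sketch also reports how many real patients matched at each step, which shows the shrinking-sample problem directly.

```python
import random

def conditional_pool(records, target, drawn):
    """Values of `target` among patients who match every already-drawn value."""
    return [
        r[target]
        for r in records
        if target in r and all(r.get(v) == x for v, x in drawn.items())
    ]

def simulate_patient(records, variables, rng):
    """Draw one complete simulated patient by chained conditional sampling.

    Returns the simulated values and, per variable, the number of real
    patients that matched all conditions at that step.
    """
    order = list(variables)
    rng.shuffle(order)  # random conditioning order
    drawn, matched = {}, {}
    for var in order:
        pool = conditional_pool(records, var, drawn)
        matched[var] = len(pool)
        if not pool:
            # Assumption, not from the post: fall back to the marginal
            # distribution when no real patient matches every condition.
            # (Fails if `var` was never measured in any patient.)
            pool = [r[var] for r in records if var in r]
        drawn[var] = rng.choice(pool)
    return drawn, matched

# Hypothetical toy data: five correlated tests, each missing 40% of the time.
rng = random.Random(1)
variables = ["A", "B", "C", "D", "E"]
records = []
for _ in range(300):
    base = rng.randint(0, 4)
    full = {v: min(4, max(0, base + rng.randint(-1, 1))) for v in variables}
    records.append({v: x for v, x in full.items() if rng.random() < 0.6})

patient, matched = simulate_patient(records, variables, rng)
```

Printing `matched` for a few simulated patients typically shows the counts dropping sharply after the first two or three variables, which is the unreliability described above. Note also that a pair like F and G, never measured together, would always hit the fallback branch.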

So what would be the correct way to construct this database of simulated patients? And if there's a working and practically usable "incorrect" way to do this, I would be glad to hear it as well.

The variables don't follow any standard parametric form, and fitting e.g. a Poisson distribution would produce poor results. I guess there are more flexible parametric distributions that could describe the data decently, but I am trying to keep this simple and perform the random sampling numerically without any fixed parametric form. Besides, it's very dangerous to claim that a medical test follows a certain parametric distribution - I would never get the doctors' approval for it.

I hope I have made my problem understandable. I have most probably forgotten to mention something essential, but I'll answer any further questions. Let's see if we can at least get the discussion going.

Any help would be highly appreciated. And please keep it simple, because the solution should be more or less understandable to medical doctors as well. Thanks!