Discriminant analysis with mostly incomplete cases

Hello. I have a bit of a biostats problem.

I’ve got data on 5 groups and 15 species. There are 9 variables and 1000 cases in total. The problem, however, is that fewer than 20% of the cases have full complements of data. In fact, 61% of measurements are missing because they were unobtainable (I’m dealing with fragile fossils).

On top of that, because I didn’t personally collect most of the data, I’m sure that many cases are actually composites, where the recorder took 2 measurements from one specimen and 2 from another of the same species and collection, and recorded them together as 1 case.

My goal is to perform a discriminant function analysis demonstrating that the 5 groups are indeed separate, but that most species in each group are not distinguishable from other species in the group.

If I use only the few complete cases, this is pretty much impossible. The complete cases are also biased, as they do not represent the distribution for each species or group. Univariate analysis is likewise a dead end for distinguishing the groups.

It’s been suggested to me that I combine cases from the same species in order to create as many complete cases as possible. As I mentioned, I’m sure that this has already been done for many of the cases by the data recorders. It’s also been suggested that, after normalizing the data, I use the distributions, means, and standard deviations of the variables in each species to generate many cases, and use those for the discriminant analysis.

Any thoughts on what to do and how to proceed?


TS Contributor
If the var-cov matrices are not the same across groups, then you will have to try quadratic discriminant analysis instead of linear. I have used it and it gave better results.
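For what it's worth, here is a rough sketch of what the quadratic rule does differently: each group keeps its own covariance matrix instead of sharing a pooled one. It is limited to two variables so the 2x2 inverse can be written out by hand, and the toy data, group names and function names (mean_and_cov, qda_score, classify) are made up for illustration. It also assumes complete cases, so it does not address the missing data.

```python
import math

def mean_and_cov(rows):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of a list of (x, y) pairs."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def qda_score(point, mean, cov, prior):
    """Quadratic discriminant score; the group with the largest score wins."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Squared Mahalanobis distance using this group's own covariance matrix
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)

def classify(point, groups):
    """Assign `point` to the group with the highest quadratic score."""
    total = sum(len(rows) for rows in groups.values())
    scores = {}
    for name, rows in groups.items():
        mean, cov = mean_and_cov(rows)
        scores[name] = qda_score(point, mean, cov, len(rows) / total)
    return max(scores, key=scores.get)

# Hypothetical toy data: group A is tightly clustered, group B is spread out.
groups = {
    "A": [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.8)],
    "B": [(3.0, 3.0), (5.0, 2.0), (1.0, 4.0), (4.0, 5.0), (2.0, 1.0)],
}

near_a = classify((1.0, 1.05), groups)
between = classify((1.9, 1.9), groups)
```

The point (1.9, 1.9) is closer to A's mean in plain Euclidean terms, but it lies far outside A's tight scatter and well inside B's wide one, so the quadratic rule assigns it to B.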
Just The ANSWER to your question

Hi guys.

Let me give a method that preserves the conditional joint distribution of X, assuming that the
underlying missing-data mechanism is ignorable.
This means that the probability p_j that the value of the jth variable is missing doesn't depend on that value.
In this case the following algorithm seems relevant to your problem.
For example, suppose you have
case1 = (y_1, ..., y_k, y_{k+1} = missing, ..., y_n = missing)
and want to fill in variable k+1.

Look at all the other cases that have variables 1, ..., k filled
and also variable k+1 filled. Denote this set of cases by M. Find the observations in M nearest to case1 according to Euclidean distance, dividing each difference y_j - y'_j by \sigma_j to account for the variables having different ranges.

Denote the set of nearest observations by M(case1,k+1).
Choose a point x = (x_1, ..., x_{k+1}, ...) from M(case1,k+1) at random and fill the k+1th variable of case1 with its value x_{k+1}.

Do this for each group separately, and for all missing values and cases.

To clarify: find the nearest cases in M, where the Euclidean distance is defined in R^k, the space of the first k variables.
M(case1,k+1) is a subset of M.
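
A minimal sketch of the procedure above, for one group at a time. Several things here are assumptions rather than part of the description: \sigma_j is taken to be the sample standard deviation of variable j over the observed values in the group, the "nearest" set has a fixed size n_nearest, matching uses only the originally observed variables of the case (not values filled in earlier), and a value stays missing when no donor exists. The name impute_group and the toy numbers are made up for illustration.

```python
import math
import random
import statistics

def impute_group(cases, n_nearest=3, rng=None):
    """Fill missing values (None) in one group from randomly chosen near neighbours.

    `cases` is a list of equal-length lists. Returns a new list; the input is not modified.
    """
    rng = rng or random.Random()
    n_vars = len(cases[0])
    # Assumed meaning of sigma_j: sample SD of variable j over observed values in this group.
    sigma = []
    for j in range(n_vars):
        seen = [c[j] for c in cases if c[j] is not None]
        sd = statistics.stdev(seen) if len(seen) > 1 else 0.0
        sigma.append(sd if sd > 0 else 1.0)  # avoid dividing by zero

    filled = [list(c) for c in cases]
    for i, case in enumerate(cases):
        observed = [v for v in range(n_vars) if case[v] is not None]
        for j in range(n_vars):
            if case[j] is not None:
                continue
            # M: other cases with variable j and every matching variable observed
            donors = [d for k, d in enumerate(cases)
                      if k != i and d[j] is not None
                      and all(d[v] is not None for v in observed)]
            if not donors:
                continue  # no usable donor, leave the value missing

            def dist(d):
                # Scaled Euclidean distance over the observed variables of `case`
                return math.sqrt(sum(((case[v] - d[v]) / sigma[v]) ** 2
                                     for v in observed))

            # M(case, j): the n_nearest closest donors; pick one at random
            nearest = sorted(donors, key=dist)[:n_nearest]
            filled[i][j] = rng.choice(nearest)[j]
    return filled

# Hypothetical toy group: the last case is missing its third variable.
example = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [5.0, 6.0, 9.0],
    [1.08, 2.08, None],
]
result = impute_group(example, n_nearest=1, rng=random.Random(0))
```

With n_nearest=1 the last case simply takes the third value of its closest donor, which here is the second case. With a larger n_nearest the donor is drawn at random from the closest few, which keeps some of the natural spread instead of always copying the single best match. To follow the "each group separately" step, call the function once per group (or per species, if there are enough cases).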
What is sigma_j?

When choosing a substitute value for a missing value, does it matter whether it came from within the same group/species? I would think so, but maybe not. If it does make a difference, then 1000 cases divided into 5 groups and then into 15 species, with 9 variables, makes me think there is little substitute data to use.