I have a dataset with 10 independent descriptors and one dependent statistic, b. Only a few of the descriptors are quantitative; most are qualitative.

For an arbitrary set of descriptors there is no guarantee that any record in the dataset matches all of them. A record might match only, say, 6 descriptors, or there may be only a few records that match all descriptors but hundreds that match a set of, say, 8.

I want to identify the 'best estimate' for b given an arbitrary set of descriptors. I define B as the average of all b values in a given subset of the dataset.
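To make the definition of B concrete, here is a minimal sketch in Python. The records, the descriptor names D1 to D3 and the helper subset_mean are all made up for illustration; they are not from my real data.

```python
from statistics import mean

# Hypothetical records: qualitative descriptors D1..D3 and the statistic b.
records = [
    {"D1": "red", "D2": "small", "D3": "x", "b": 10.0},
    {"D1": "red", "D2": "small", "D3": "y", "b": 12.0},
    {"D1": "red", "D2": "large", "D3": "x", "b": 20.0},
    {"D1": "blue", "D2": "small", "D3": "x", "b": 30.0},
]


def subset_mean(records, query):
    """Return (B, n): the mean of b over records matching every
    descriptor in `query`, and the number of matching records.
    B is None when nothing matches."""
    matched = [
        r["b"] for r in records
        if all(r.get(key) == value for key, value in query.items())
    ]
    if not matched:
        return None, 0
    return mean(matched), len(matched)


# An empty query matches every record, so this is the whole-dataset mean B0.
B0, n0 = subset_mean(records, {})
# Matching one descriptor, then two, narrows the subset.
B1, n1 = subset_mean(records, {"D1": "red"})
B2, n2 = subset_mean(records, {"D1": "red", "D2": "small"})
```

Returning the subset size alongside B matters, because the same B means very different things when it comes from 3 records or from 300.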

My initial analysis was to filter the data for qualitative descriptor values and then run a multiple regression. However, the correlation coefficients were so poor that I gave up on this approach. [I recognise that this could be the key problem, but I am still required to find a better estimate of B.]

My next idea is to take as the initial estimate for B the average of all b values in the whole dataset, B0, with corresponding standard deviation S0. I then match one descriptor (giving B1 and S1) and pose the null hypothesis H0 that B0=B1 and S0=S1.

If H0 cannot be rejected, then the descriptor does not appear to be important in the analysis and I have not improved my estimate by filtering the data.

If, however, H0 is rejected, then I have, presumably, improved my estimate, since I have filtered the data by a significant descriptor.
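One way such a test could be sketched is shown below. It swaps in a permutation test, which is my assumption rather than a settled choice, and it only tests the means, not S0=S1. Because the matched subset is part of the whole dataset, the sketch compares matched records against unmatched ones instead of against the whole. The function name and the example values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(b_values, in_subset, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference between the
    mean of matched records and the mean of unmatched records."""
    matched = [b for b, flag in zip(b_values, in_subset) if flag]
    rest = [b for b, flag in zip(b_values, in_subset) if not flag]
    if not matched or not rest:
        raise ValueError("both groups must be non-empty")
    observed = abs(mean(matched) - mean(rest))

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pooled = list(b_values)
    k = len(matched)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical case 1: the descriptor clearly separates the b values.
b_sep = [1, 2, 3, 4, 5, 6, 7, 8, 21, 22, 23, 24, 25, 26, 27, 28]
flags_sep = [True] * 8 + [False] * 8
p_sep = permutation_p_value(b_sep, flags_sep)

# Hypothetical case 2: the descriptor is unrelated to b.
b_null = [1, 2, 3, 4, 5, 6, 7, 8]
flags_null = [True, False] * 4
p_null = permutation_p_value(b_null, flags_null)
```

A small p-value suggests the descriptor is associated with b. It does not by itself say that the filtered mean is a better estimate.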

The problem is that the order in which I filter may influence the result. (Denote descriptor 1 by D1, and so on.) If I filter by D2 and then by D1, I may get a different result from filtering by D1 and then by D2. Also, although the descriptor may be significant, I do not know whether I have improved the estimate.

Intuitively, if S1<S0 I have narrowed the data and therefore presumably improved the estimate of B. However, I may simply have filtered out important data that showed the spread. I could end up reducing the dataset to one record, with an undefined standard deviation, which gives me a 'perfect' estimate of B that is actually not an improvement. This is clearly unacceptable.
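The concern can be illustrated with a small sketch that reports the standard error of B (S divided by the square root of n) next to S, so that the subset size is taken into account. The helper summarize, the threshold min_n=5 and the numbers are all hypothetical choices for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(b_values, min_n=5):
    """Return n, B, S and the standard error of B for a subset.
    S and the standard error are None when n < 2 (undefined), and
    'usable' is False when the subset is smaller than min_n."""
    n = len(b_values)
    if n == 0:
        return {"n": 0, "B": None, "S": None, "se": None, "usable": False}
    B = mean(b_values)
    S = stdev(b_values) if n >= 2 else None  # sample standard deviation
    se = S / sqrt(n) if S is not None else None
    return {"n": n, "B": B, "S": S, "se": se, "usable": n >= min_n}


whole = summarize([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0])
narrow = summarize([10.0, 16.0])  # smaller S, but only two records
single = summarize([12.0])        # S is undefined for one record
```

Here the narrow subset has a smaller S than the whole dataset but a larger standard error, so on this measure the estimate of B has become less certain, not more.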

So how can I get a 'better' estimate of B?

My hunch is that I will have to improve the multiple regression analysis....

Thanks in advance,

Alan.