Looking for feedback on application of chi-square test

I am a cost engineer working in the building construction field. I am seeking some advice/feedback on the use of a chi-square test for assessing how well a cost estimate follows an expected distribution.

Our organization has developed an empirical collection of the distribution of building costs among 14 different building elements for 23 different types of facilities. The 14 elements are various building systems such as foundations, roofing, plumbing, HVAC (heating, ventilating, and air conditioning), electrical, etc. For each facility type, the empirical data give the typical percentage of total building construction cost associated with each of the 14 elements. The 23 facility types include warehouses, schools, administrative buildings, health clinics, etc. There is typically a great deal of dispersion in these percentages. For example, an administrative facility has expected percentages ranging from 0.18% for equipment to 19.57% for HVAC.

When we develop a cost estimate, we compare the distribution of costs in the estimate, on a percentage basis, to that of the corresponding type of facility in our collection. That comparison is currently done by plotting the estimate percentages and the expected (empirical) percentages as two series on a bar chart, with one pair of bars for each element. From this graphic information, we make a subjective assessment of how well the distribution of the estimate matches the expected distribution. We carry this out in a spreadsheet.

I have been working on a modified spreadsheet to calculate a chi-square goodness of fit test in the hope of providing a more objective means of assessing the match between observed and expected (e.g. a p-value). I am currently using the percentage values directly, pooling percentages smaller than five. This approach behaves nicely, and the p-value tracks consistently with the graphical information.

I have read some articles online that suggest that it is not appropriate to use percentages directly. When I use the building construction values (typically millions of dollars) instead of the percentages, the resulting chi-square test statistic is so large that it drives all probability values so deep into the right tail of the distribution as to be useless. I suspect that this is due to the dispersion of the underlying percentages.
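
A minimal Python sketch of the effect described above, using made-up shares for five pooled elements and a made-up project total (none of these numbers come from the actual collection). It suggests the blow-up has less to do with dispersion than with units: for a fixed set of proportions, the Pearson statistic grows in direct proportion to the scale of the numbers, so percentages implicitly behave like a sample of 100 counts and dollars like a sample of millions.

```python
def chi_square_stat(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical shares (percent of total cost) for five pooled elements.
expected_pct = [20.0, 15.0, 30.0, 25.0, 10.0]
observed_pct = [22.0, 13.0, 33.0, 24.0, 8.0]

total_cost = 5_000_000  # hypothetical project total in dollars

# The same proportions, expressed in dollars instead of percent.
expected_usd = [p / 100 * total_cost for p in expected_pct]
observed_usd = [p / 100 * total_cost for p in observed_pct]

stat_pct = chi_square_stat(observed_pct, expected_pct)
stat_usd = chi_square_stat(observed_usd, expected_usd)

print(f"statistic on percentages: {stat_pct:.3f}")
print(f"statistic on dollars:     {stat_usd:.1f}")
print(f"ratio dollars/percent:    {stat_usd / stat_pct:.0f}")
```

The ratio equals total_cost / 100, so the p-value depends on an arbitrary choice of unit rather than on how well the shares match.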

I would appreciate any insight into how appropriate it is to use the percentages. I am wondering if I am on solid ground from a statistical validity viewpoint. I am also going to try to see how well a Spearman rank correlation test will work.
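
For reference, a small Python sketch of what a Spearman rank correlation would measure here, again with made-up shares. It uses the textbook no-ties shortcut; the helper names are illustrative only. One thing it shows is that rank correlation looks only at the ordering of the elements, so an estimate can score a perfect 1 while its percentages still differ from the expected ones.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via the no-ties shortcut 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical shares (percent of total cost) for five elements.
expected_pct = [20.0, 15.0, 30.0, 25.0, 10.0]
same_order   = [22.0, 13.0, 33.0, 24.0, 8.0]   # different values, same ordering
one_swap     = [22.0, 13.0, 24.0, 33.0, 8.0]   # two largest elements trade places

rho_same = spearman_rho(same_order, expected_pct)
rho_swap = spearman_rho(one_swap, expected_pct)
print(rho_same, rho_swap)
```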

Many thanks!


Are you using Monte Carlo simulations at all? If you did, that would create a distribution of possible realizations of your data, which is normally distributed.

You can use percentages per se, I believe, in lieu of counts, though it may be odd to interpret the probability values, which are heavily influenced by sample size. I will see if Miner, one of the regulars here, has some input; he seems savvy in these areas.

Is it possible for you to upload some images of the distributions, etc.?


The problem with using chi-square is threefold. First, it is really designed for count data. Second, with a table as large as yours, the likelihood is high that you will get at least one cell with a count less than 5. Finally, with large counts, everything ends up being significant, which doesn't tell you much.

I like hlsmith's recommendation for using a Monte Carlo approach. You can establish a cost distribution for each cell of your table using historical data then perform the analysis. This will give you a probability distribution for the total cost for the building. You could then see what percentile the actual building cost came in at.
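
A rough Python sketch of that idea, under loud assumptions: the element list, the lognormal shape, the medians and spreads, and the estimate total are all invented for illustration, and the elements are drawn independently, which real building costs may not be. In practice each distribution would be fitted from the historical data.

```python
import bisect
import math
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical (median cost in dollars, log-scale spread) per element.
elements = {
    "foundations": (400_000, 0.25),
    "roofing": (250_000, 0.30),
    "plumbing": (300_000, 0.20),
    "hvac": (900_000, 0.35),
    "electrical": (600_000, 0.25),
}

def simulate_total():
    """Draw one cost per element (independently) and return the building total."""
    return sum(
        random.lognormvariate(math.log(median), sigma)
        for median, sigma in elements.values()
    )

n_sims = 10_000
totals = sorted(simulate_total() for _ in range(n_sims))

def percentile_of(value, sorted_sample):
    """Percent of simulated totals at or below `value`."""
    return 100 * bisect.bisect_right(sorted_sample, value) / len(sorted_sample)

estimate_total = 2_700_000  # hypothetical estimate under review
pct = percentile_of(estimate_total, totals)
print(f"estimate sits at the {pct:.1f}th percentile of simulated totals")
```
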
I appreciate the suggestion about Monte Carlo. I will think about that. It would seem that I would have to come up with some understanding of the distribution for each of the elements (each category of the multinomial). That is something that I will need to research.

I tried to upload a few images, but I have not figured out the mechanics of that process yet. In the course of doing that, I may have generated an inadvertent splinter post. If so, I apologize for the confusion.

Thanks for your prompt response!
To me it seems more natural to try to model it with the Dirichlet distribution. Also have a look at Lukacs's proportion-sum independence theorem. And then search for estimation methods.

Later you can look at the generalized Dirichlet distribution. Maybe there is covariance between the elements, and different variances for the different parameters.

Your data is not multinomial. A multinomial is what you get when you throw a ball and it falls into exactly one of k cells (so each cell's outcome is 0 or 1) with a specific probability. Your individual values are always proportions.

You can use the chi-squared test to check whether your data fit a Dirichlet distribution.

You can also easily simulate the distribution in R.
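
The same kind of simulation can be sketched in Python with only the standard library (R is not used here). The expected shares, the concentration value, and the estimate below are invented for illustration; a Dirichlet draw is built the usual way, by normalising independent gamma draws.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def dirichlet_draw(alpha):
    """One Dirichlet sample: independent Gamma(alpha_i, 1) draws, normalised to sum to 1."""
    gammas = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

# Hypothetical expected shares for five elements, and a hypothetical
# concentration: larger values mean less project-to-project scatter.
expected_share = [0.20, 0.15, 0.30, 0.25, 0.10]
concentration = 50.0
alpha = [concentration * s for s in expected_share]

draws = [dirichlet_draw(alpha) for _ in range(5_000)]

def band(k, lo=0.05, hi=0.95):
    """Central 90% band of the simulated shares for element k."""
    col = sorted(d[k] for d in draws)
    return col[int(lo * len(col))], col[int(hi * len(col))]

estimate_share = [0.22, 0.13, 0.33, 0.24, 0.08]  # hypothetical estimate
for k, share in enumerate(estimate_share):
    low, high = band(k)
    flag = "ok" if low <= share <= high else "outside band"
    print(f"element {k}: {share:.2f} vs [{low:.3f}, {high:.3f}] {flag}")
```

The concentration would have to be estimated from the historical projects; with only the average percentages there is no way to know how much scatter is normal.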