Our organization has compiled empirical data on how building costs are distributed among 14 building elements for 23 types of facilities. The 14 elements are building systems such as foundations, roofing, plumbing, HVAC (heating, ventilating, and air conditioning), electrical, etc. For each facility type, the data give the percentage of total building construction cost that is typically associated with each of the 14 elements. The 23 facility types include warehouses, schools, administrative buildings, health clinics, etc. There is typically a great deal of dispersion in these percentages. For example, an administrative facility has expected percentages ranging from 0.18% for equipment to 19.57% for HVAC.

When we develop a cost estimate, we compare its distribution of costs, on a percentage basis, to that of the corresponding type of facility in our collection. That comparison is currently done by plotting the estimate percentages and the expected (empirical) percentages as two series on a bar chart, with each series showing the percentage in each building element. From this chart, we make a subjective assessment of how well the distribution of the estimate matches the expected distribution. We carry this out in a spreadsheet.

I have been working on a modified spreadsheet that calculates a chi-square goodness-of-fit test, in the hope of providing a more objective measure (e.g. a p-value) of the match between observed and expected percentages. I am currently using the percentage values directly, pooling the elements whose percentages are smaller than five. This approach behaves nicely, and the p-value tracks consistently with the graphical information.
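To make the calculation concrete, here is a minimal sketch of it in Python rather than in the spreadsheet. The percentages are made-up placeholders, not values from our collection, and the function name `pooled_chi_square` is hypothetical. The sketch also assumes that pooling is decided on the expected percentage (all elements below the threshold are merged into one category), which is only one possible reading of "pooling".

```python
def pooled_chi_square(observed_pct, expected_pct, threshold=5.0):
    """Chi-square statistic on percentages.

    Elements whose EXPECTED percentage is below `threshold` are merged
    into a single pooled category (an assumption of this sketch).
    Returns the statistic and the degrees of freedom (categories - 1).
    """
    obs_kept, exp_kept = [], []
    obs_pool, exp_pool = 0.0, 0.0
    for o, e in zip(observed_pct, expected_pct):
        if e < threshold:
            obs_pool += o
            exp_pool += e
        else:
            obs_kept.append(o)
            exp_kept.append(e)
    if exp_pool > 0:
        obs_kept.append(obs_pool)
        exp_kept.append(exp_pool)
    stat = sum((o - e) ** 2 / e for o, e in zip(obs_kept, exp_kept))
    return stat, len(exp_kept) - 1


# Hypothetical percentages for seven elements (each list sums to 100).
expected_pct = [20.0, 15.0, 30.0, 25.0, 6.0, 3.0, 1.0]
observed_pct = [22.0, 13.0, 28.0, 27.0, 5.0, 4.0, 1.0]

stat, df = pooled_chi_square(observed_pct, expected_pct)
print(f"chi-square = {stat:.4f} with {df} degrees of freedom")
```

The p-value is then the right-tail probability of a chi-square distribution with `df` degrees of freedom evaluated at `stat`.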

I have read some articles online suggesting that it is not appropriate to use percentages directly. When I use the building construction values (typically millions of dollars) instead of the percentages, the resulting chi-square test statistic is so large that every p-value falls so far into the right tail of the distribution as to be useless. I suspect that this is due to the dispersion of the underlying percentages.
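A small sketch of what I am seeing, again in Python with made-up numbers (the five percentages and the 5,000,000 total are hypothetical, and `chi_square_stat` is just an illustrative helper). It computes the same statistic twice for one set of proportions: once on percentages (total of 100) and once on dollars. Because each term is (observed - expected)^2 / expected, multiplying every value by a constant multiplies the statistic by that same constant, so the dollar version is larger by a factor of total/100. This shows only the effect of the units; it does not tell me whether the dispersion of the percentages also plays a role.

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical percentages for five elements (each list sums to 100).
expected_pct = [20.0, 15.0, 30.0, 25.0, 10.0]
observed_pct = [22.0, 13.0, 28.0, 27.0, 10.0]

# Hypothetical total construction cost in dollars.
total_dollars = 5_000_000.0
scale = total_dollars / 100.0  # dollars per percentage point

stat_pct = chi_square_stat(observed_pct, expected_pct)
stat_dollars = chi_square_stat(
    [o * scale for o in observed_pct],
    [e * scale for e in expected_pct],
)
ratio = stat_dollars / stat_pct

print(f"statistic on percentages: {stat_pct:.4f}")
print(f"statistic on dollars:     {stat_dollars:.1f}")
print(f"ratio (equals total/100): {ratio:.1f}")
```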

I would appreciate any insight into how appropriate it is to use the percentages, and into whether I am on solid ground from a statistical validity viewpoint. I also plan to try a Spearman rank correlation test to see how well it works.

Many thanks!