I am documenting the phenomenon in which an aggregate statistic is misleading when the groups it aggregates are very different. For example, suppose I use the average height on an island to design all of its doors, but half the population is over 6 feet tall and the other half is under 5 feet tall. In other words, the distribution is bimodal and the mean masks that. I also want to examine how the interaction between the bimodal distribution and the aggregate statistic has changed over time.
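A minimal sketch of the door-height example, using made-up heights in inches: the overall mean describes almost nobody in a bimodal population.

```python
import statistics

# Illustrative heights in inches; 5 ft = 60 in, 6 ft = 72 in.
short_group = [58, 59, 60, 59, 58]   # islanders under 5 ft
tall_group = [73, 74, 75, 74, 73]    # islanders over 6 ft
population = short_group + tall_group

overall_mean = statistics.mean(population)
print(overall_mean)                   # 66.3 -- about 5' 6"
print(statistics.mean(short_group))   # 58.8
print(statistics.mean(tall_group))    # 73.8
# A door sized to the 66-inch mean is too low for everyone in the tall
# group and needlessly high for everyone in the short group.
```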

So I have settled on a simple linear regression. I pool the average height of the whole island, the average height of the people over 6 ft, and the average height of the people under 5 ft into one large dependent variable.
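The pooling step amounts to stacking the three yearly series into one "long" dataset, one row per year per series. A sketch, with all numbers illustrative:

```python
# Made-up yearly averages (inches) for each series.
years = [2021, 2022, 2023]
island_avg = [66.0, 66.2, 66.1]
tall_avg = [73.5, 73.8, 73.6]
short_avg = [58.7, 58.9, 58.8]

# Stack the three series into one long-format dependent variable,
# keeping a group label alongside each height.
rows = []
for series, label in [(island_avg, "average"),
                      (tall_avg, "tall"),
                      (short_avg, "short")]:
    for year, height in zip(years, series):
        rows.append({"year": year, "group": label, "height": height})

print(len(rows))  # 9 rows: three series of three years each
```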

Then I fit a time series model whose main independent variable is a categorical variable assigned to each height average. So there are three levels: average, tall, and short, with average as the baseline.
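With only a categorical predictor and "average" as the baseline, OLS on the dummy variables reduces to group means: the intercept is the baseline group's mean and each dummy coefficient is that group's mean minus the baseline mean. A sketch verifying that with group means directly (numbers are illustrative, not from any real data):

```python
import statistics

# Made-up yearly averages (inches) for each series.
avg_series = [66.0, 66.2, 66.1]
tall_series = [73.5, 73.8, 73.6]
short_series = [58.7, 58.9, 58.8]

# Equivalent to the fitted dummy-variable model:
intercept = statistics.mean(avg_series)                 # baseline mean
beta_tall = statistics.mean(tall_series) - intercept    # "tall" coefficient
beta_short = statistics.mean(short_series) - intercept  # "short" coefficient

# beta_tall / beta_short measure how far each subgroup sits from the
# aggregate -- exactly the "masking" the coefficients are meant to show.
print(round(intercept, 1), round(beta_tall, 1), round(beta_short, 1))
```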

In this case, the coefficients for tall and short in the output tell me how much using the aggregate has masked the true heights. In other words, the doors on the island don't all need to be 5' 10"; some need to be 7 ft and others 5 ft. (Not a perfect example, but it works.)

The issue is that this setup is dependent by definition, since the average is built from the tall and short groups. But a bit of simple algebra shows the "average" is just a midpoint satisfying A + B = C - B, where A is the mean height of the short people, C is the mean height of the tall people, and B = (C - A)/2 is just a constant put into the regression to give context.
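A quick arithmetic check of that identity. B is half the gap between the two group means, so adding it to the short mean or subtracting it from the tall mean lands on the same midpoint (which equals the pooled average only when the two groups are the same size, as in the half-and-half island). Integer inches here are illustrative.

```python
A = 59               # short-group mean height (inches), illustrative
C = 73               # tall-group mean height (inches), illustrative
B = (C - A) // 2     # 7: half the tall-short gap

print(A + B == C - B)  # True: both sides give the same midpoint
print(A + B)           # 66
```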

So, my question is: does this violate the independence assumption? I feel it does not. A and C are independent of each other, and B is just a constant used to "center" A and C around the average and give a point of reference.