Can I ignore issues of dependence among independent variables?

A quick rundown of what I am doing:

I am documenting the phenomenon of an aggregate statistic being inappropriate when the groups it aggregates are very different. For example, say I am using the average height on an island to design all the doors. The issue is that half the population is over 6 feet tall and the other half is under 5 feet tall. In other words, there is a bimodal distribution that the mean masks. I also want to examine how the interaction between the bimodal distribution and the aggregate statistic has changed over time.
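A minimal sketch of the masking described above, using simulated heights. The group centers (58 and 74 inches), the spread, and the sample sizes are invented for illustration and do not come from any real island data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical island: half the people are short, half are tall (inches).
short = [random.gauss(58, 1.5) for _ in range(500)]
tall = [random.gauss(74, 1.5) for _ in range(500)]
island = short + tall

pooled_mean = statistics.mean(island)

# Count how many people are actually within 2 inches of the pooled mean.
near_mean = sum(1 for h in island if abs(h - pooled_mean) <= 2)

print(f"pooled mean: {pooled_mean:.1f} in")
print(f"short mean:  {statistics.mean(short):.1f} in")
print(f"tall mean:   {statistics.mean(tall):.1f} in")
print(f"within 2 in of the pooled mean: {near_mean} of {len(island)}")
```

The pooled mean lands in the gap between the two groups, so a door sized to it would fit almost nobody well.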

So I have decided upon a simple linear regression. I pool the average height of the island, the average height of the people over 6ft, and the average height of the people under 5 ft into one large dependent variable.

Now I make a time series model where my main independent variable is a categorical variable that labels each height average. So there are three levels: average, tall, and short. Average is used as the baseline.

In this case the coefficients for tall and short in the output will tell me how much using an aggregate has masked the height. In other words, the doors on the island don't all need to be 5' 10" tall; some need to be 7 ft and others 5 ft (not a perfect example, but it works).
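A rough sketch of what those coefficients would be, assuming invented yearly averages and ignoring any time trend. It relies on a standard fact rather than a fitted model: when the only predictor is a dummy-coded categorical variable, the OLS intercept equals the baseline level's mean and each coefficient equals that level's mean minus the baseline mean. The helper `level_mean` is hypothetical.

```python
import statistics

# Hypothetical yearly averages (inches) for one island; all numbers invented.
average = [66.0, 66.2, 66.1, 66.3]
tall = [74.0, 74.3, 74.1, 74.4]
short = [58.0, 58.1, 58.1, 58.2]

# Pool everything into one response with a three-level label.
rows = ([("average", h) for h in average]
        + [("tall", h) for h in tall]
        + [("short", h) for h in short])

def level_mean(level):
    """Mean of the pooled response for one level of the label."""
    return statistics.mean(h for lab, h in rows if lab == level)

# Dummy coding with "average" as the baseline:
# intercept = baseline mean, coefficient = level mean - baseline mean.
intercept = level_mean("average")
coef_tall = level_mean("tall") - intercept
coef_short = level_mean("short") - intercept

print(f"intercept (average): {intercept:.2f}")
print(f"coefficient, tall:   {coef_tall:+.2f}")
print(f"coefficient, short:  {coef_short:+.2f}")
```

Because the two invented groups are the same size, the tall and short coefficients come out equal in magnitude with opposite signs.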

The issue is that this is dependent by definition, since the average is built from tall and short. But a bit of simple algebra tells us that, with equal-sized groups, the "average" is just a reference point such that B - A = C - B, where A is the height of the short people, C is the height of the tall people, and B is the average, put into the regression to give context.

So, my question is: does this violate the independence assumption? I feel it does not, since A and C are independent of each other. B (the average) is just an arbitrary constant used to "center" A and C and give a point of reference.
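The relationship between the three quantities can be written out as a short derivation. This is a sketch under one assumption not guaranteed by the post: the two groups are exactly equal in size. Here A is the mean height of the short group, C the mean height of the tall group, and B the island-wide average (the block needs the amsmath package for `align*`).

```latex
% Assumption: both groups have weight 1/2 in the island-wide average.
\begin{align*}
B &= \tfrac{1}{2}A + \tfrac{1}{2}C \\
C - B &= \tfrac{1}{2}(C - A) \\
A - B &= -\tfrac{1}{2}(C - A) = -(C - B)
\end{align*}
```

With unequal group shares p and 1 - p, B = pA + (1 - p)C and the two offsets are no longer mirror images. Either way, B is a function of A and C rather than a free constant; whether that matters for the regression's independence assumption is a separate question that this algebra alone does not settle.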


Not a robit
Not sure what you are saying about the dependence, but I believe they are independent. The issue is related to the data-generating process. Height is a mixture, and you lose information when pooling, which is usually the case. It would be like pooling heights and ignoring gender, since there are two underlying distributions based on biological differences between genders. Your data seem to be linearly separable, so separate them, and yes, you can run the model both ways (e.g., controlling for height or source of height, y/n) and look at the changes in AUC, MSE, or R^2. Do you know the source of these phenotypic differences?
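One way to read "run the model both ways" is sketched below with simulated data: predict every height from the pooled mean, then from each person's group mean, and compare MSE and R^2. The group centers, spread, sample sizes, and the helper `mse` are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: (group, height in inches); all numbers invented.
data = ([("short", random.gauss(58, 1.5)) for _ in range(200)]
        + [("tall", random.gauss(74, 1.5)) for _ in range(200)])
heights = [h for _, h in data]

def mse(predictions):
    """Mean squared error of the predictions against the observed heights."""
    return statistics.mean((h - p) ** 2 for h, p in zip(heights, predictions))

# Model 1: ignore the group and predict the pooled mean for everyone.
grand_mean = statistics.mean(heights)
mse_pooled = mse([grand_mean] * len(heights))

# Model 2: control for the group and predict each person's group mean.
group_means = {g: statistics.mean(h for lab, h in data if lab == g)
               for g in ("short", "tall")}
mse_grouped = mse([group_means[g] for g, _ in data])

# R^2 of the grouped model relative to the pooled-mean baseline.
r_squared = 1 - mse_grouped / mse_pooled

print(f"MSE, pooled mean only: {mse_pooled:.2f}")
print(f"MSE, with group:       {mse_grouped:.2f}")
print(f"R^2 gained by group:   {r_squared:.3f}")
```

The drop in MSE between the two fits is one empirical measure of how much information the pooled statistic throws away.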

This also makes me think of Simpson's paradox: the results do not flip, but bias is introduced by not controlling for a variable.
That is exactly what I thought. It's a case of Simpson's paradox or omitted variable bias.

Now, let's say I want to look at the average effect of this paradox across all the islands. Is there any bias?


Not a robit
Are there differences across islands as well? So you are blending effects within islands and then across islands?

What is the purpose: to report the bias for fun, to control for it, etc.?
The purpose is just to document the bias in order to justify controlling for it in later models. The core issue is that people are aware of the bias, just not of its empirical extent (i.e., I know the wind is blowing outside, but a hurricane is different from a breeze, and knowing that difference is useful).

There are differences on the islands, but they are not blended together.

On some islands the average height of the tall people might be 2 meters, and on another it might be 2.1 meters.

Each group is compared only to people on the same island, but I don't think that matters.
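A small sketch of one way to summarize the masking across islands while keeping every comparison within an island: compute each group's gap from its own island's average, then average those gaps. The island names and heights are invented, and weighting every island equally is an assumption. The sketch only describes the gaps; it does not test whether the cross-island summary is itself biased.

```python
import statistics

# Hypothetical island summaries in meters; every number here is invented.
islands = {
    "island_1": {"average": 1.75, "tall": 2.00, "short": 1.50},
    "island_2": {"average": 1.80, "tall": 2.10, "short": 1.50},
    "island_3": {"average": 1.70, "tall": 1.95, "short": 1.45},
}

# Masking is measured within each island:
# group mean minus that same island's average.
gaps = {
    name: {"tall": s["tall"] - s["average"],
           "short": s["short"] - s["average"]}
    for name, s in islands.items()
}

# Average the within-island gaps across islands (equal weight per island).
mean_tall_gap = statistics.mean(g["tall"] for g in gaps.values())
mean_short_gap = statistics.mean(g["short"] for g in gaps.values())

for name, g in gaps.items():
    print(f"{name}: tall {g['tall']:+.2f} m, short {g['short']:+.2f} m")
print(f"mean gap across islands: tall {mean_tall_gap:+.3f} m, "
      f"short {mean_short_gap:+.3f} m")
```

Weighting islands by population instead of equally would change the cross-island averages, so that choice is worth stating explicitly when reporting the result.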