# Model building question: Is the researcher right?

#### leavesof3

##### New Member
I'm thinking about a researcher who made the following statement about a model (in response to criticism of his interpretation of the data).

"There is a number of these univariate correlations in the data that do not fit the model (out of the thousands, there would be)"-researcher

As I understand it, the critic analyzed the correlations among a few variables in his data and found that the connection he claimed between X and Y, Z was not what he said it was. His response was that you need to run the regression with thousands of variables to get the true picture. My initial thought was that he was wrong: if a relationship doesn't show up with a limited number of variables, why would you 'find' a meaningful one later?

So I ran a little experiment on a popular, unrelated data set: the baseball hitting statistics from 2000-2008 contained in the nutshell package for R. This is essentially the Moneyball model.

The full linear regression model with all the statistics included looks like this.

formula = runs ~ singles + doubles + triples + homeruns +
walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -507.16020 32.34834 -15.678 < 2e-16 ***
singles 0.56705 0.02601 21.801 < 2e-16 ***
doubles 0.69110 0.05922 11.670 < 2e-16 ***
triples 1.15836 0.17309 6.692 1.34e-10 ***
homeruns 1.47439 0.05081 29.015 < 2e-16 ***
walks 0.30118 0.02309 13.041 < 2e-16 ***
hitbypitch 0.37750 0.11006 3.430 0.000702 ***
sacrificeflies 0.87218 0.19179 4.548 8.33e-06 ***
stolenbases 0.04369 0.05951 0.734 0.463487
caughtstealing -0.01533 0.15550 -0.099 0.921530
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.21 on 260 degrees of freedom
Multiple R-squared: 0.9144, Adjusted R-squared: 0.9114
F-statistic: 308.6 on 9 and 260 DF, p-value: < 2.2e-16

I then rebuilt the model as if I had collected only two predictors. Using just singles and doubles to predict runs, both are still highly significant, but R-squared drops to a little over one third of its full-model value.

lm(formula = runs ~ singles + doubles, data = team.batting.00to08)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.88514 70.91583 -0.323 0.747
singles 0.39419 0.06201 6.357 8.86e-10 ***
doubles 1.38342 0.14626 9.458 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.13 on 267 degrees of freedom
Multiple R-squared: 0.3497, Adjusted R-squared: 0.3448
F-statistic: 71.8 on 2 and 267 DF, p-value: < 2.2e-16

Next, I added triples, expecting it to be as significant as in the full model. Instead, its coefficient comes out negative and not significant.

lm(formula = runs ~ singles + doubles + triples, data = team.batting.00to08)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.10399 71.11347 -0.311 0.756
singles 0.39602 0.06257 6.329 1.04e-09 ***
doubles 1.38603 0.14691 9.434 < 2e-16 ***
triples -0.10873 0.44650 -0.244 0.808
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.25 on 266 degrees of freedom
Multiple R-squared: 0.3499, Adjusted R-squared: 0.3425
F-statistic: 47.72 on 3 and 266 DF, p-value: < 2.2e-16

To summarize: model 1 is the full model, model 2 uses singles and doubles, and model 3 adds triples to model 2. I expected R-squared to build up gradually and the coefficients to stay roughly the same as variables were added. Instead, in model 3 the triples coefficient is negative and not significant, while in the full model it is positive and highly significant. Is there an explanation for this? And is it the same kind of problem as criticizing the researcher's data on the basis of a few univariate correlations?

#### noetsi

##### No cake for spunky
No one runs a regression with thousands of variables, either all at once or in total. His response makes no sense.

#### Dason

Yeah, 1000s seems odd, but... maybe? You can take a handful of variables and look at all possible squared terms and all possible interaction terms. Then suddenly the 20 variables you started with turn into 420.
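How fast the term count grows depends on how you count. The sketch below is only an illustration on made-up data (the data frame `toy` and its columns `V1`..`V20` are placeholders). Counting each square and each unordered two-way interaction once gives 230 columns for 20 variables, while counting every ordered product (20 × 20) on top of the 20 originals gives 420.

```r
# Count the columns produced by expanding p predictors into main effects,
# squared terms, and two-way interactions.
p <- 20
n_main    <- p
n_squares <- p
n_pairs   <- choose(p, 2)               # unordered pairs x_i * x_j with i < j
n_unique  <- n_main + n_squares + n_pairs
n_ordered <- p + p * p                  # originals plus every ordered product

# Cross-check the main-effect and interaction count with model.matrix()
# on simulated data (placeholder columns V1..V20, not a real data set).
set.seed(1)
toy <- as.data.frame(matrix(rnorm(50 * p), ncol = p))
X <- model.matrix(~ .^2, data = toy)    # intercept + main effects + pairs
n_design <- ncol(X) - 1                 # drop the intercept column

print(c(unique = n_unique, ordered = n_ordered, design = n_design))
```

Either way, a modest number of raw variables can expand into hundreds of regression terms, and higher-order interactions push the count up much faster.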

But I don't care about that too much. It's quite possible for an effect to "change direction", or to become significant when it wasn't before, once you include different variables. So that part at least makes sense to me. I would want to read more about the data the researcher is talking about before offering an opinion, but at the very least the general idea of what they're saying is possible.
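A small simulation can show how a coefficient changes sign when a correlated predictor is left out. This is a sketch on invented data, not the baseball data: `speed` and `power` are hypothetical stand-ins, and whether triples and home runs are actually negatively correlated in the real data set has not been checked here. Both predictors truly raise `runs`, but because they are negatively correlated, the model that omits `power` gives `speed` a slope of roughly the true slope plus (effect of `power`) × cov(speed, power) / var(speed), which is negative with these settings.

```r
# Omitted-variable sign flip on simulated data (all names are hypothetical).
set.seed(42)
n <- 2000
power <- rnorm(n)                            # stand-in for a power-hitting measure
speed <- -0.8 * power + rnorm(n, sd = 0.6)   # negatively correlated with power
runs  <- 1 * speed + 2 * power + rnorm(n)    # both truly help scoring

short <- lm(runs ~ speed)                    # power omitted
full  <- lm(runs ~ speed + power)            # power included

b_short <- unname(coef(short)["speed"])      # expected near 1 + 2 * (-0.8) = -0.6
b_full  <- unname(coef(full)["speed"])       # expected near the true value, 1
print(c(short = b_short, full = b_full))
```

In the short model `speed` partly acts as a proxy for "low power", so its coefficient absorbs some of the omitted variable's effect.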

#### noetsi

##### No cake for spunky
Normally (or at least in what I have seen) you would address the specific set of variables in your model, not talk about other variables that are not in it. I am not certain of that logic, unless you are trying to argue that a specific method or approach to statistics would work with different sets of variables. It seems an unusual approach to me.

#### leavesof3

##### New Member
Dason and Noetsi,

The study in question is the 'China Study' done by Dr. T. Colin Campbell. Campbell's big argument is that animal protein 'turns on cancer'. The raw data of the China Study was posted on the web. It contains a record of yearly deaths from different types of cancer and other diseases, along with daily food intake from animals versus plants, and social affluence.
A criticism of the study (and book) was written by health blogger Denise Minger. I started running the numbers in R, and so far they have backed up the criticism (part of the output is below).

lm(formula = COLORECTAL.CANCER..PER.1000..per.year ~ Yearly.Animal.Protein.Volume +
Yearly.Plant.Protein.Volume + TOTAL.CHOLESTEROL..mg.dL. + X.China.Rank,
data = ChinaStuRank)

Residuals:
Min 1Q Median 3Q Max
-3.6641 -1.4465 -0.4500 0.7313 16.0316

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.91833 4.58210 -1.510 0.13633
Yearly.Animal.Protein.Volume -8.40003 5.44398 -1.543 0.12809
Yearly.Plant.Protein.Volume 4.30401 5.41607 0.795 0.42994
TOTAL.CHOLESTEROL..mg.dL. 0.08207 0.02578 3.184 0.00231 **
X.China.Rank -0.04153 0.02016 -2.061 0.04369 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.865 on 60 degrees of freedom
Multiple R-squared: 0.2108, Adjusted R-squared: 0.1582
F-statistic: 4.008 on 4 and 60 DF, p-value: 0.006021

So only cholesterol and your social status show up as significant.

Campbell in his reply to the criticism also wrote, "A more appropriate method is to search for aggregate groups of data, as in the ‘affluent’ vs. ‘poverty’ disease groups, then examine whether there is any consistency within groups of biomarkers, as in considering various cholesterol fractions. This is rather like using metanalysis to obtain a better overview of possible associations"

I also started running regressions within groups of socioeconomic status, and none of them supports his claims. But I am still wondering: is there anything to any of his points about the analysis? On the jacket of his book, his study is called the 'Grand Prix' of epidemiology. Still, I thought it was weird to run thousands of correlations; I did my undergrad in stats and didn't recognize most of what he was talking about. Any additional thoughts, knowing more about the data?
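For reference, regressions within socioeconomic groups can be sketched as below. This uses simulated data, not the China Study file: the columns `rank`, `protein`, and `disease` are invented, the grouping into thirds is an assumption, and no protein effect is built into the simulation.

```r
# Stratified regressions on simulated data (all column names hypothetical).
set.seed(99)
n <- 900
toy <- data.frame(rank = runif(n))                  # made-up affluence rank
toy$protein <- toy$rank + rnorm(n, sd = 0.3)        # intake rises with rank
toy$disease <- 2 * toy$rank + rnorm(n, sd = 0.5)    # driven by rank only

# Split into three equal-sized affluence groups.
toy$group <- cut(toy$rank,
                 breaks = quantile(toy$rank, c(0, 1/3, 2/3, 1)),
                 labels = c("low", "mid", "high"),
                 include.lowest = TRUE)

# Fit the same regression within each group and collect the protein slopes.
fits   <- lapply(split(toy, toy$group),
                 function(d) lm(disease ~ protein, data = d))
slopes <- sapply(fits, function(f) unname(coef(f)["protein"]))
pooled <- unname(coef(lm(disease ~ protein, data = toy))["protein"])

print(round(c(pooled = pooled, slopes), 2))
```

In this toy setup the within-group slopes come out well below the pooled slope, because grouping removes most of the variation in `rank` that was producing the pooled association. Some residual confounding remains within each group, since `rank` still varies inside a third.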

#### noetsi

##### No cake for spunky
Well, first, the fact that affluence showed up as critical suggests a basic problem with the original analysis (if I understand it): he failed to control for a key factor, income, that might explain both cancer survival rates and food intake. If you have more money, your diet and things like your health care are likely to be dramatically different from those of lower-income individuals. So health care (driven by income) could explain both diet and cancer survival rates, rather than food intake explaining survival. You need to control for such alternatives when making an argument.

I have to say that I don't have a clue what he means by his defense. That may reflect a lack of understanding of this field or of the language used in it. He seems, to me, to be suggesting that there are groups of people (or perhaps variables) that explain the results and that analysis can only take place within that understanding. But I don't see why his argument that survival rates are tied to food would be supported by that. In fact, it suggests to me that a range of factors would explain the results.
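The point about controlling for alternatives can be illustrated with a minimal simulation. It is a sketch on invented data with hypothetical variable names, and it says nothing about what the China Study data actually show. Here `affluence` drives both `protein` and `disease`, and `protein` has no effect at all by construction.

```r
# Confounding on simulated data: protein has NO causal effect here.
set.seed(7)
n <- 5000
affluence <- rnorm(n)
protein   <- affluence + rnorm(n)   # richer areas eat more animal protein
disease   <- affluence + rnorm(n)   # driven by affluence (or what it proxies)

naive    <- lm(disease ~ protein)               # confounder left out
adjusted <- lm(disease ~ protein + affluence)   # confounder controlled for

b_naive <- unname(coef(naive)["protein"])       # expected near 0.5
b_adj   <- unname(coef(adjusted)["protein"])    # expected near 0
print(round(c(naive = b_naive, adjusted = b_adj), 3))
```

The naive model reports a clear protein "effect" that disappears once the common cause is in the model.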

Is he arguing that there are moderating agents such as income (which really is a marker I am sure for factors like access to health care since income alone obviously has no impact on survival) within which different diets impact survival rates?

#### leavesof3

##### New Member
Colin Campbell's main argument is that affluent families eat more animal protein, which contributes to them dying more often of cancer, heart attacks, and diabetes. So I think he means you have to analyze them by status. I was trying to figure out whether he did, as you say, “control for key factors”. He claims that the thousands of associations show this effect on survival rates, but I find nothing of the sort in the data.

The problem with analyzing by income (if he did do it that way) is that there is solid evidence from the work of Michael Marmot (who studied the British Civil Service) and Robert Sapolsky (who studied African baboons) that income (or status) is in and of itself a factor in cancer survival rates, even when you control for diet and health care. The great thing about these researchers' work is that in both cases access to health care was equal: the British civil servants had the NHS, and the baboons had the "do nothing" case. This is the abstract of Marmot's paper.

Health and longevity are intimately related to position in the social hierarchy. The lower the status, the higher risk of illness and death, and consequently the shorter the life expectancy. In his book of the same name, Michael Marmot calls this social gradient in health the “Status Syndrome”.

Basically, I think Campbell did some misguided statistics that somehow picked up stress effects, and I want to test the China Study data for such a bias. As far as I know Marmot controlled for diet, and I'm in the process of getting a more thorough breakdown of Marmot's research. (There's also a National Geographic special about it called “Stress: Portrait of a Killer”.)

#### noetsi

##### No cake for spunky
It is almost meaningless to say that you need to test thousands of associations to capture something. How would you know what mattered that way? I have never seen that done, ever. What he is really arguing, I think, is that the share of meat in the diet affects cancer rates. In some areas (nitrites, for example) that is well accepted, but different types of meat in different quantities likely affect the results.

Income in and of itself almost certainly does not affect cancer rates: how could how much you earn matter? (You could, for example, not spend any of it at all.) In the context of health it is just a variable that permits different types of behavior. What you need to analyze is not wealth but the things wealth influences (like diet) that affect cancer.
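One way to see the difficulty with screening thousands of associations is to run the screen on pure noise. The sketch below uses simulated data only. The 65 rows match the 60 residual degrees of freedom plus 5 coefficients in the colorectal cancer output earlier in the thread, and the 1000 predictors are an arbitrary stand-in for "thousands".

```r
# Screen 1000 pure-noise predictors against a pure-noise outcome.
set.seed(123)
n_obs  <- 65      # rows, matching the regression output in the thread
n_vars <- 1000    # arbitrary stand-in for "thousands" of variables
y <- rnorm(n_obs)
X <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)

# Univariate correlation test of each predictor against y.
p_values <- apply(X, 2, function(x) cor.test(x, y)$p.value)
n_sig <- sum(p_values < 0.05)

cat("Nominally significant at 0.05:", n_sig, "of", n_vars, "\n")
```

With nothing real in the data, around 5% of the tests (about 50 of 1000) are expected to clear the 0.05 threshold by chance. A handful of significant or non-significant univariate correlations out of thousands is therefore weak evidence in either direction.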