*ever* assume that a model is completely correct?

- Thread starter janhen2

Because you **won't** get published (and **will** annoy your boss, if you have one) if you stress that your model is wrong. Once you get your PhD, Dason, you have to start being practical.

I model gene expression data using a negative binomial distribution. Why do I use a negative binomial? Because the Poisson has too rigid of a mean/variance structure and the negative binomial provides an adequate fit. I don't think the expression levels are perfectly distributed as a negative binomial but hell it works just fine so I deal with that. I do some research to see if I can improve the model to something I find more believable but that doesn't mean I don't use models that I see as flawed. I still find them useful after all.
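A quick way to see why the Poisson is too rigid is to check the mean/variance relationship directly. Below is a minimal pure-Python sketch (the parameters mu = 10 and k = 2 are made up for illustration) that simulates negative binomial counts as a Poisson-Gamma mixture; a Poisson would force variance roughly equal to the mean, while here the variance is far larger:

```python
import math
import random
import statistics

random.seed(42)

def neg_binomial(mu, k):
    """One NB(mu, k) draw via the Poisson-Gamma mixture:
    lam ~ Gamma(shape=k, scale=mu/k), then count ~ Poisson(lam)."""
    lam = random.gammavariate(k, mu / k)
    # Knuth's Poisson sampler (fine for moderate lam)
    limit, prod, count = math.exp(-lam), 1.0, -1
    while prod > limit:
        count += 1
        prod *= random.random()
    return count

counts = [neg_binomial(mu=10, k=2) for _ in range(20000)]
m = statistics.mean(counts)
v = statistics.pvariance(counts)
# For a Poisson, variance ~= mean; for this NB, variance ~= mu + mu**2/k = 60,
# far above the mean of 10 -- overdispersion the Poisson simply cannot absorb.
print(f"mean={m:.1f}  variance={v:.1f}")
```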

Trinker essentially summarized what I was saying.

And I don't agree that you won't get published - it's a strong point to be able to point out any flaws in the model but still provide an argument for why it's ok.

I always get the feeling that you think I'm in some ivory tower trying to make the most esoteric models imaginable. That's not it at all. I fit models all the time. They help with some very important research. But if you ask me if I actually think the models are completely correct then I will tell you no.

“all models are wrong, but some are useful”

And another version:

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful”

(Box is, among other things, known for the Box-Cox transformation, the Box-Jenkins models in time series, the experimental design book by Box, Hunter and Hunter, and a book on Bayesian inference.)

The ideas above are not only Dason's; they are held by many well-known statisticians.

Someone else said:

“The most practical thing is a good theory”.

….

Englund wrote:

“My point is that R^2 is, per definition, a measure of covariation between the IVs and the DV”

R^2 is a ratio based on the fact that you can decompose the total sum of squares (SST) into the explained sum of squares (SSR) and the residual sum of squares (SSE), SST = SSR + SSE, so that R^2 = SSR/SST.
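As a sketch of that decomposition with some made-up toy data (plain-Python least squares, nothing beyond the standard library):

```python
# Toy data with a roughly linear trend (invented for illustration).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Ordinary least squares slope and intercept.
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)                  # total SS
ssr = sum((fi - ybar) ** 2 for fi in fitted)             # explained SS
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual SS

r_squared = ssr / sst
print(f"SST={sst:.3f}  SSR+SSE={ssr + sse:.3f}  R^2={r_squared:.4f}")
```

The identity SST = SSR + SSE holds exactly for a least-squares fit with an intercept, which is why R^2 stays between 0 and 1 there.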

In my view the R^2 is greatly overemphasized. It is not a “quality index”. It depends more on how spread out the x-values are and on the residual variance.


Statistical analysis is inherently inferential. If you want absolutes, you deal in theory or design of experiments. Most will conclude, barring obvious theoretical reasons not to, that strong covariance means something is occurring, which is really what R squared gets at. It is imperfect insofar as the relationship is non-linear, but there are no perfect tools, least of all tools that can be understood by the 99.9999 percent who are not statisticians by trade.

Either or. Even if the "true" R^2 is 0 (which is the case for X ~ Unif(-a, a), Y = X^2) there can be dependence.

I agree on this: if R^2 is estimated to be 0, then we cannot conclude that there is no dependence. But if the true unknown R^2 is 0, then there is no dependence. I would say that R^2 is 1 in the example you refer to, because there is a perfect relationship between the variables described. But if you try to fit a linear model, then the observed R^2 will be far from 1.

| x | y  | Prob(Y = y \| X = x) |
|---|----|----------------------|
| 0 | 0  | 1                    |
| 1 | 1  | .5                   |
| 1 | -1 | .5                   |

In this case the best "regression" we could come up with is predicting y = 0 regardless of x. But there is still dependence here. For if we know x = 0 then we KNOW y = 0. If we know x = 1 then y could be either -1 or 1. So the distribution of Y depends on the value of X. So R^2 is 0 but there is still dependence.
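The table is easy to check by simulation. A minimal sketch, assuming for concreteness that X takes the values 0 and 1 with equal probability (the table does not fix the marginal distribution of X):

```python
import random

random.seed(1)

n = 50_000
xs, ys = [], []
for _ in range(n):
    x = random.randint(0, 1)            # assumed marginal: P(X=0) = P(X=1) = .5
    y = 0 if x == 0 else random.choice([-1, 1])
    xs.append(x)
    ys.append(y)

xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
sxx = sum((a - xbar) ** 2 for a in xs)
syy = sum((b - ybar) ** 2 for b in ys)
r_squared = sxy ** 2 / (sxx * syy)      # squared sample correlation, essentially 0

# Yet Y clearly depends on X: it is fixed at 0 when X = 0
# and random over {-1, 1} when X = 1.
y_given_x0 = {b for a, b in zip(xs, ys) if a == 0}
y_given_x1 = {b for a, b in zip(xs, ys) if a == 1}
print(r_squared, y_given_x0, y_given_x1)
```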

“You can choose to get as large R^2 as you want.“

“What?!” I said.

“If there is a linear relationship between x and y and you can design where to put the x-values, then just by stretching out the x-values far enough you will get a large enough R^2 value“, he said.

Another aspect is that if you have an observational study and there has not been much variation in the x-values – the x-values have been roughly constant (as often happens in observational studies) – then the R^2 will be low. That does not mean that the model is bad. It can be a good description of reality. A good model is a model that fits the data, not one with a high or low R^2. Lack-of-fit measures are far more important than R^2.

The residual variance has an influence on R^2 (by increasing the residual sums of squares). So you can make a two-by-two “table” or graph with high and low variation in the x-values and with high and low residual variation. I think that is more important to think of than the R^2.

I would be primarily concerned with the parameter estimates and whether they are significant, the standard deviation of the residuals, and lack-of-fit measures.

“but there are no perfect tools, least of all tools that can be understood by the 99.9999 percent who are not statisticians by trade”

Besides, I think it inspires confidence when someone talks about both a model's strengths AND its weaknesses. This holds for statistical investigations and used-car sellers alike.

But that isn't true. The R^2 is per definition not zero in that case. It can be estimated to be zero, but in that case the model used to predict Y is seriously flawed.

But does R^2 have a population value? I have never heard of that.

Think of a linear regression model with a nonzero slope (beta).

Imagine that a first experiment is having the x-values in a narrow range. That will give one R^2 value.

Imagine a second experiment with exactly the same parameters but with the x-values in a wider range. That will give a higher R^2 value for exactly the same parameter values, that is, for the same population beta and sigma values.
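A small simulation makes this concrete. The sketch below uses made-up values beta0 = 1, beta1 = 2, sigma = 3, identical in both runs; only the spread of the x-values differs. Roughly, the sample R^2 chases beta1^2 Var(x) / (beta1^2 Var(x) + sigma^2), which moves with Var(x):

```python
import random

random.seed(7)

beta0, beta1, sigma = 1.0, 2.0, 3.0    # same "population" in both experiments

def sample_r_squared(xs):
    """Simulate y = beta0 + beta1*x + noise, return the squared sample correlation."""
    ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    syy = sum((b - ybar) ** 2 for b in ys)
    return sxy ** 2 / (sxx * syy)

narrow = [random.uniform(-1, 1) for _ in range(2000)]    # x in a narrow range
wide = [random.uniform(-10, 10) for _ in range(2000)]    # same model, stretched x
r2_narrow = sample_r_squared(narrow)
r2_wide = sample_r_squared(wide)
print(f"narrow x: R^2 ~ {r2_narrow:.2f}   wide x: R^2 ~ {r2_wide:.2f}")
```

Nothing about the "population" changed between the two runs; only the design did, yet the second R^2 is far larger.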

No, I don’t think it is meaningful to think of R^2 as a population parameter.

I think of R^2 as a simple descriptive of the data at hand.