# How to Determine if 2 variables are dependent?

#### Dason

I won't argue with you there. I don't like R^2 at all. I was just pointing out that you can't use the fact that R^2 = 0 to imply that there is no dependence. And people use flawed models all the time. I'm of the opinion that all models are flawed. So why should we ever assume that a model is completely correct?

#### noetsi

##### Fortran must die
> I won't argue with you there. I don't like R^2 at all. I was just pointing out that you can't use the fact that R^2 = 0 to imply that there is no dependence. And people use flawed models all the time. I'm of the opinion that all models are flawed. So why should we ever assume that a model is completely correct?

Because you won't get published (and will annoy your boss if you have one) if you stress that your model is wrong. Once you get your PhD, Dason, you have to start being practical.

#### trinker

##### ggplot2orBust
I think Dason's point isn't that a model is 100% wrong, more that it's not 100% right. Then again I may be putting words in the old chap's mouth. If this is his point it's important to always keep this in your mind when analyzing data.

Man I missed these Dason vs. Noetsi philosophical battles

#### Dason

> Because you won't get published (and will annoy your boss if you have one) if you stress that your model is wrong. Once you get your PhD, Dason, you have to start being practical.

Of course we have to be practical. I'm not saying that models can't be useful. But to sit around and assume that they're 100% correct is just silly (unless you're working with simulated data). And I don't agree that you won't get published - it's a strong point to be able to point out any flaws in the model but still provide an argument for why it's ok. I always get the feeling that you think I'm in some ivory tower trying to make the most esoteric models imaginable. That's not it at all. I fit models all the time. They help with some very important research. But if you ask me if I actually think the models are completely correct then I will tell you no.

I model gene expression data using a negative binomial distribution. Why do I use a negative binomial? Because the Poisson has too rigid a mean/variance structure and the negative binomial provides an adequate fit. I don't think the expression levels are perfectly distributed as a negative binomial, but hell, it works just fine so I deal with that. I do some research to see if I can improve the model to something I find more believable, but that doesn't mean I don't use models that I see as flawed. I still find them useful after all.

Trinker essentially summarized what I was saying.

#### noetsi

##### Fortran must die
It's more like practice (me) versus philosophy (Dason). Statistics in theory (the way perhaps it should be done) and the way it is used in the vast majority of real-world organizations are very different.

#### noetsi

##### Fortran must die
> And I don't agree that you won't get published - it's a strong point to be able to point out any flaws in the model but still provide an argument for why it's ok.

If you say, other than briefly at the end in the conclusion section, that your model has a number of significant problems (which all models do), the response of many journals will be: "get back to us when you work them out." Everyone knows that models have significant problems. They simplify the real world, there are always violations of the assumptions, and since we don't know reality we cannot correctly specify our model. There is little advantage in pointing out what is already understood. Statistics, which has more technical considerations than most fields, does talk about limitations more, but even so it doesn't emphasize them. Commonly, authors point out limitations in other models or methods to stress why theirs works better.

> I always get the feeling that you think I'm in some ivory tower trying to make the most esoteric models imaginable. That's not it at all. I fit models all the time. They help with some very important research. But if you ask me if I actually think the models are completely correct then I will tell you no.

And so would any researcher. The issue is 1) whether they improve on having no model, not whether they are perfect, and 2) whether they will get my research published. They may in statistical journals, but not in most journals generally. And they don't impress the people who run organizations. They want to know what you can do, not what you can't. Mine get annoyed when I point out the limitations, and I suspect this is generally true, most of all for the great majority who are not statistically trained.

#### GretaGarbo

##### Human
George Box, a famous statistician, once said:
“all models are wrong, but some are useful”

And another version:
“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful”

(Box is known for, among other things, Box-Cox transformations, the Box-Jenkins models in time series, the experimental design book by Box, Hunter and Hunter, and a book on Bayesian inference.)

The ideas above are not only Dason's but are held by many well-known statisticians.

Someone else said:
“The most practical thing is a good theory”.

….

Englund wrote:

> “My point is that R^2 is, by definition, a measure of covariation between the IVs and the DV”

No, R^2 is not by definition a measure of covariation.

R^2 is a ratio based on the fact that you can split the total sum of squares (SST) into the explained sum of squares (SSR) and the residual sum of squares (SSE), that is, SST = SSR + SSE, so that R^2 = SSR/SST.
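As a minimal sketch of that decomposition, assuming an ordinary least-squares line with an intercept and some made-up data (the numbers and the function name `sums_of_squares` are purely illustrative, not from this thread), the sums of squares can be computed by hand and checked against each other:

```python
def sums_of_squares(xs, ys):
    """Fit y = a + b*x by least squares and return (SST, SSR, SSE)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)              # total
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # explained
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual
    return sst, ssr, sse


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst
print(sst, ssr + sse, r_squared)
```

The identity SST = SSR + SSE relies on the fitted line including an intercept; without one, the split generally does not hold.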

In my view the R^2 is greatly overemphasized. It is not a “quality index”. It is more dependent on how stretched out the x-values are and the residual variance.

#### Outlier

##### TS Contributor
Today I have learned something.

#### noetsi

##### Fortran must die
R squared is used a lot because 1) it's simple and 2) it makes intuitive sense. When you run a regression model you want to know how one variable is influencing another. So the percent of explained variation in the dependent variable caused by the independent ones is exactly what individuals want to know.

Statistical analysis is inherently inferential. If you want absolutes you deal in theory or design of experiments. Most will conclude, barring obvious theoretical reasons not to, that strong covariance means something is occurring, which is really what R squared gets at. It is imperfect inasmuch as the relationship is non-linear, but there are no perfect tools, least of all ones that can be understood by the 99.9999 percent who are not statisticians by trade.

#### Englund

##### TS Contributor
> No, R^2 is not by definition a measure of covariation.

Sure, bad choice of words. My point still stands even if it was possible to misunderstand.

#### Englund

##### TS Contributor
> I was just pointing out that you can't use the fact that R^2 = 0 to imply that there is no dependence.

I think we've discussed different things. Correct me if I'm wrong, but you've discussed the observed R^2 while I've had the true unknown R^2 in mind.

#### Dason

Either or. Even if the "true" R^2 is 0 (which is the case for X ~ Unif(-a, a), Y = X^2) there can be dependence.
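A quick way to see this numerically is a small simulation. The sketch below assumes a = 1, a fixed seed and a hand-written `pearson` helper (all illustrative choices, not from the thread). The linear correlation between X and Y comes out close to zero even though Y is an exact function of X, while regressing Y on X^2 recovers the relationship completely:

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is reproducible
a = 1.0                 # assumed half-width of the uniform range
xs = [rng.uniform(-a, a) for _ in range(100_000)]
ys = [x ** 2 for x in xs]  # Y is completely determined by X

r_linear = pearson(xs, ys)                        # straight-line fit of Y on X
r_on_square = pearson([x ** 2 for x in xs], ys)   # fit of Y on X^2

# For simple regression with an intercept, R^2 is the squared correlation.
print(r_linear ** 2, r_on_square ** 2)
```

The population version follows from symmetry: Cov(X, X^2) = E[X^3] - E[X]E[X^2] = 0 for any distribution symmetric about zero with finite third moment.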

#### Englund

##### TS Contributor
> Either or. Even if the "true" R^2 is 0 (which is the case for X ~ Unif(-a, a), Y = X^2) there can be dependence.

But that isn't true. The R^2 is by definition not zero in that case. It can be estimated to be zero, but in that case the model used to predict Y is seriously flawed.

I agree on this: if R^2 is estimated to be 0, then we cannot conclude that there is no dependence. But if the true unknown R^2 is 0, then there is no dependence. I would say that R^2 is 1 in the example you refer to, because there is a perfect relationship between the variables described. But if you try to fit a linear model, then the observed R^2 will be far from 1.

#### Dason

Consider the following distribution:

$$\begin{array}{|c|c|c|} \hline x & y & \text{Prob}(Y = y \mid X = x) \\ \hline 0 & 0 & 1 \\ 1 & 1 & .5 \\ 1 & -1 & .5 \\ \hline \end{array}$$

In this case the best "regression" we could come up with is predicting y = 0 regardless of x. But there is still dependence here. For if we know x = 0 then we KNOW y = 0. If we know x = 1 then y could be either -1 or 1. So the distribution of Y depends on the value of X. So R^2 is 0 but there is still dependence.
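This can be checked exactly. The sketch below needs a marginal distribution for X, which the table does not give, so it assumes P(X = 0) = P(X = 1) = 1/2 (any marginal that puts positive probability on both values leads to the same conclusion). It computes the covariance, tests whether the joint distribution factorizes, and reports the conditional variance of Y for each x:

```python
from fractions import Fraction

# ASSUMPTION: the marginal of X is not stated in the post; 1/2 each is used here.
p_x = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Conditional distribution of Y given X, taken from the table above.
p_y_given_x = {
    0: {0: Fraction(1)},
    1: {1: Fraction(1, 2), -1: Fraction(1, 2)},
}

# Joint distribution P(X = x, Y = y).
joint = {
    (x, y): p_x[x] * p
    for x, cond in p_y_given_x.items()
    for y, p in cond.items()
}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov_xy = e_xy - e_x * e_y

# Marginal distribution of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, Fraction(0)) + p

# Independence requires P(x, y) = P(x) * P(y) for every pair (x, y).
independent = all(
    joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
    for x in p_x
    for y in p_y
)

# Var(Y | X = x): the spread of Y changes with x even though its mean does not.
cond_var = {
    x: sum(y ** 2 * p for y, p in cond.items())
    - sum(y * p for y, p in cond.items()) ** 2
    for x, cond in p_y_given_x.items()
}

print(cov_xy, independent, cond_var)
```

The covariance is exactly zero, so no straight line in x explains any variance in y, yet the factorization check fails and the conditional variance differs between x = 0 and x = 1.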

#### GretaGarbo

##### Human
A friend of mine surprised me when he said:
“You can choose to get as large an R^2 as you want.”

“What?!” I said.

“If there is a linear relationship between x and y and you can design where to put the x-values, then just by stretching out the x-values far enough you will get a large enough R^2 value”, he said.

Another aspect is that if you have an observational study in which there has not been much variation in the x-values (they have been roughly constant, as often happens in observational studies), then the R^2 will be low. That does not mean that the model is bad. It can be a good description of reality. A good model is one that fits the data, regardless of whether R^2 is high or low. Lack-of-fit measures are far more important than R^2.

The residual variance has an influence on R^2 (by increasing the residual sum of squares). So you can make a two-by-two “table” or graph with high and low variation in the x-values and with high and low residual variation. I think that is more important to think about than the R^2.
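One way to fill in such a two-by-two table is with the large-sample approximation R^2 ≈ β²·Var(x) / (β²·Var(x) + σ²), which holds for a simple linear model with slope β, residual variance σ² and errors independent of x. The slope and the "low" and "high" values below are hypothetical, chosen only to make the pattern visible:

```python
def approx_r_squared(beta, var_x, sigma2):
    """Large-sample R^2 for y = alpha + beta*x + error with error variance sigma2."""
    explained = beta ** 2 * var_x
    return explained / (explained + sigma2)


beta = 1.0  # hypothetical slope, the same in every cell

# Hypothetical "low" and "high" settings for Var(x) and the residual variance.
x_settings = [("low x spread", 0.1), ("high x spread", 10.0)]
noise_settings = [("low noise", 0.1), ("high noise", 10.0)]

grid = {
    (x_label, e_label): approx_r_squared(beta, var_x, sigma2)
    for x_label, var_x in x_settings
    for e_label, sigma2 in noise_settings
}

for key, value in grid.items():
    print(key, round(value, 3))
```

Every cell uses the same slope, so the differences in R^2 come entirely from the spread of x relative to the residual variance.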

I would be primarily concerned with the parameter estimates and whether they are significant, the standard deviation of the residuals, and lack-of-fit measures.

> but there are no perfect tools least of all those that can be understood by the 99.9999 percent that are not statisticians by trade

I don’t think that R^2 is understood by 99 percent. I think it is overemphasized and misused.

Besides, I think it gives increased confidence if someone talks about both a model's strengths AND its weaknesses. This holds for statistical investigations and used car sellers alike.

#### Dason

Greta! Go get one more post and then you'll have a surprise on the TalkStats homepage for you.

#### GretaGarbo

##### Human
> But that isn't true. The R^2 is by definition not zero in that case. It can be estimated to be zero, but in that case the model used to predict Y is seriously flawed.

The usual approach is to think of regression parameters like “beta” and “sigma” as having population values that can be estimated from a sample.

But does R^2 have a population value? I have never heard of that.

Think of a linear regression model with a nonzero slope (beta).

Imagine that a first experiment is having the x-values in a narrow range. That will give one R^2 value.

Imagine a second experiment with exactly the same parameters but with the x-values in a wider range. That will give a higher R^2 value for exactly the same parameter values, that is, for the same population beta and sigma values.
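The two experiments can be sketched as a small simulation. Everything specific here is an assumption made for illustration: beta = 1, sigma = 1, 500 observations, x drawn uniformly from [0, 1] in the first experiment and from [0, 10] in the second, and a fixed seed:

```python
import random


def r_squared(xs, ys):
    """R^2 of a simple linear regression with intercept (squared correlation)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def simulate(x_low, x_high, beta=1.0, sigma=1.0, n=500, seed=0):
    """Draw x uniformly on [x_low, x_high], generate y = beta*x + noise, return R^2."""
    rng = random.Random(seed)
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
    return r_squared(xs, ys)


# Same beta and sigma in both experiments; only the range of x differs.
r2_narrow = simulate(0.0, 1.0)
r2_wide = simulate(0.0, 10.0)
print(r2_narrow, r2_wide)
```

With these settings the wide-range experiment gives a much larger R^2 than the narrow-range one, although the model generating the data is identical.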

No, I don’t think it is meaningful to think of R^2 as a population parameter.

I think of R^2 as a simple descriptive statistic for the data at hand.