
1) Log(Y) is more nearly normal than Y. This helps speed convergence of means to a normal distribution.

2) There is an interaction between two covariates when using Y; in some cases it can be removed by using Log(Y).

3) They just happen to prefer Log( Y )!


If we're truly trying to figure something out and conceptualize the problem statistically, then I don't understand the idea of transforming the data to get rid of your problems: those 'problems' are part of the data and tell you something interesting.

Hi,

I'm wondering what kind of data is appropriate for transforming the response y into log(y)?

Thanks!

More specifically, if we consider a simple linear regression model : Log(Y) = b0 + b1*X, the slope coefficient (b1) measures the constant proportional or relative change in Y for a given absolute change in X. As such, multiplying the relative change in Y by 100 will give the percentage change in Y for an absolute change in X.

This particular model is useful in situations where the X variable is a time trend since in that case the model describes the constant relative (b1) or constant percentage (b1*100) rate of growth (b1>0) or decay (b1<0) in the variable Y, where Y may be a variable such as gross domestic product (GDP), population, money supply, unemployment, profit, sales, etc. In short, the model Log(Y) = b0 + b1*X can be called a growth model.
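To illustrate, here is a minimal sketch (all data and parameter values are hypothetical) of fitting such a growth model by regressing Log(Y) on a time trend and reading the slope as a growth rate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(50.0)                  # time trend (periods)
true_b0, true_b1 = 2.0, 0.03         # hypothetical: 3% growth per period
y = np.exp(true_b0 + true_b1 * x + rng.normal(0, 0.05, x.size))

# Fit Log(Y) = b0 + b1*X by ordinary least squares
b1, b0 = np.polyfit(x, np.log(y), 1)
print(f"estimated growth rate per period: {b1 * 100:.1f}%")
```

The estimated slope times 100 recovers the percentage growth rate, which is exactly the "constant percentage rate of growth" interpretation described above.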

\(\log(y_i) = \beta_0 + \beta_1x_i + \epsilon_i\) where \(\epsilon_i \sim N(0,\sigma^2)\) which implies

\( \widehat{\log(y_i)} = E[\log(y_i)] = \beta_0 + \beta_1x_i \).

Alright, we're ok up to here. So if all we ever wanted to do was talk about the log transformed data then I'd be fine with this. But people don't always want to talk about log transformed data. They collected their data on the scale they did most likely because it's the scale that makes sense for them. So what if we want to consider our data on the original scale? If we're interested in \(\hat{y_i} = E[y_i]\) most people would just backtransform their predictions from the log model and say

\(\hat{y_i} = E[y_i] = \exp(\beta_0 + \beta_1x_i) \)

but this is wrong! What is actually true is:

\(\hat{y_i} = E[y_i] = \exp(\beta_0 + \beta_1x_i + \sigma^2/2) \).

How many people that just transform their data to make it nice do you think know this? What happens if we use a different transformation and want to backtransform? Can we get a nice form for the expected value then? Who knows...
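This back-transformation bias is easy to verify by simulation. A minimal sketch (all parameter values hypothetical), comparing the naive back-transform with the lognormal-mean correction:

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1, sigma = 1.0, 0.5, 0.8        # hypothetical model parameters
x = 2.0

# log(y) = b0 + b1*x + eps, eps ~ N(0, sigma^2)
eps = rng.normal(0, sigma, 1_000_000)
y = np.exp(b0 + b1 * x + eps)

naive = np.exp(b0 + b1 * x)                    # back-transform, no correction
corrected = np.exp(b0 + b1 * x + sigma**2 / 2) # lognormal mean

print(np.mean(y))        # close to `corrected`, noticeably above `naive`
print(naive, corrected)
```

The Monte Carlo mean of y lands on the corrected value, not the naive one, and the gap grows with sigma.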

Sorry for the rant but as you can probably tell I'm not a huge fan of transformations.

I hope you're still watching this old post. :)

I came across an econometrics textbook which disagrees with you (I think it is wrong).

It essentially says:

Log(wage)=beta_0+beta_1*education+u

means

wage=exp(beta_0+beta_1*education+u)

Not really any point to this post, just a thought as I was reading...

You have

Log(wage)=beta_0+beta_1*education+u

means

wage=exp(beta_0+beta_1*education+u)

If instead of wage=exp(beta_0+beta_1*education+u) we had wage=exp(beta_0+beta_1*education) + u

then it would be ok to make that jump. But what we have is

wage=exp(beta_0+beta_1*education+u)

So E[wage] = E[exp(beta_0+beta_1*education+u)] = exp(beta_0+beta_1*education)*E[exp(u)]

And notice that the expected value of exp(u) is not 1: if u ~ N(0, sigma^2), then E[exp(u)] = exp(sigma^2/2).
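A quick numeric sanity check of that last claim (sigma chosen arbitrarily for illustration): for u ~ N(0, sigma^2), the mean of exp(u) comes out at exp(sigma^2/2), above 1.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.6                          # hypothetical error standard deviation
u = rng.normal(0, sigma, 1_000_000)

print(np.mean(np.exp(u)))            # close to exp(sigma^2/2), not 1
print(np.exp(sigma**2 / 2))
```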

It's very easy to get confused and mixed up when dealing with expected values and transformations.