Transforming Variables (squaring/logging variables)

#1
Hello,
Let's say I am running the regression y = B1 x1 + B2 x2.
After examining a histogram of y, I decided it needed to be transformed to ln(y) to bring its distribution closer to normal. To deal with a number of observations where y=0, I added 1 to y before logging it, so that lny = ln(y+1).
After this I also added 1 to both x1 and x2.
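
For reference, the shift-and-log step described above can be sketched in a few lines of Python. The values in y are made up for illustration and are not data from this thread; math.log1p(v) computes ln(1 + v) directly.

```python
import math

# Hypothetical outcome values, including zeros (not data from the thread).
y = [0, 0, 1, 3, 10, 50]

# log1p(v) computes ln(1 + v), so y = 0 maps to 0 instead of being undefined.
lny = [math.log1p(v) for v in y]

# expm1 inverts the transform, which helps when reporting on the original scale.
y_back = [math.expm1(v) for v in lny]

print([round(v, 3) for v in lny])
```

The shift is only there to make the logarithm defined at y=0; nothing in this step touches x1 or x2.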

Now I am trying to decide whether to log or square x1 and x2 for the regression, and I am examining scatterplots. Should I be looking at (scatter y x1) or (scatter lny x1) to determine whether a transformation is necessary?

When I do (scatter y x1), it looks like there is a linear relationship and x1 should be left as is, but when I do (scatter lny x1), it looks like x1 should be logged.

Any idea what I should do?
 

noetsi

Fortran must die
#2
Although I am not an expert, I think you might consider a Box-Cox transformation, which in theory tells you which transformation to use in cases like this. I don't think you are supposed to add 1 to the predictors just because you added it to Y.
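
A rough sketch of what Box-Cox does, in plain Python: it tries a grid of powers lambda for the response and keeps the one with the highest profile log-likelihood from the regression. The data here are simulated (an assumption, not the poster's data) so that ln(y) is linear in x, and the helper names boxcox, ols_rss and profile_loglik are made up for this example. Box-Cox needs strictly positive y, so with zeros it would have to be applied to a shifted variable such as y+1.

```python
import math
import random

def boxcox(y, lam):
    """Box-Cox transform of strictly positive values; lam = 0 is the log."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def ols_rss(x, z):
    """Residual sum of squares from regressing z on x with an intercept."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    slope = sxz / sxx
    intercept = mz - slope * mx
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, z))

def profile_loglik(x, y, lam):
    """Profile log-likelihood of lam, up to an additive constant."""
    n = len(y)
    rss = ols_rss(x, boxcox(y, lam))
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(rss / n) + jacobian

# Simulated data in which ln(y) is linear in x, so lambda near 0 should win.
random.seed(0)
x = [random.uniform(0, 5) for _ in range(200)]
y = [math.exp(0.5 + 0.6 * a + random.gauss(0, 0.3)) for a in x]

grid = [i / 10 for i in range(-20, 21)]  # lambda from -2.0 to 2.0
best_lam = max(grid, key=lambda lam: profile_loglik(x, y, lam))
print("lambda with highest profile log-likelihood:", best_lam)
```

A lambda near 0 points to logging y, near 0.5 to a square root, and near 1 to leaving y alone. This only addresses the response; it says nothing about how to transform the predictors.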

Are you looking at scatter plots of the residuals from the regression? If so, I think they should be the residuals from the model you actually run (log Y versus X, if you in fact logged Y).
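
To make that concrete, here is a minimal sketch in plain Python of checking residuals from the model that is actually fitted. It uses simulated data (an assumption, not the poster's data) in which the logged response is linear in ln(x), and compares the residuals from regressing it on x with those from regressing it on ln(x). The helpers fit_residuals and binned_means are hypothetical names for this example, and averaging residuals over thirds of x is just a crude text substitute for looking at a residual plot.

```python
import math
import random

def fit_residuals(x, z):
    """Residuals from a simple OLS regression of z on x with an intercept."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    slope = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
             / sum((a - mx) ** 2 for a in x))
    intercept = mz - slope * mx
    return [b - intercept - slope * a for a, b in zip(x, z)]

def binned_means(x, resid, bins=3):
    """Mean residual within equal-sized groups of observations sorted by x."""
    pairs = sorted(zip(x, resid))
    size = len(pairs) // bins
    return [sum(r for _, r in pairs[i * size:(i + 1) * size]) / size
            for i in range(bins)]

# Simulated data: the logged response is linear in ln(x), not in x.
random.seed(1)
x = [random.uniform(1, 50) for _ in range(300)]
lny = [1.0 + 0.8 * math.log(a) + random.gauss(0, 0.2) for a in x]

# Residuals from the two candidate models for the logged response.
resid_raw = fit_residuals(x, lny)
resid_log = fit_residuals([math.log(a) for a in x], lny)

means_raw = binned_means(x, resid_raw)
means_log = binned_means(x, resid_log)
print("lny on x:    ", [round(m, 3) for m in means_raw])
print("lny on ln(x):", [round(m, 3) for m in means_log])
```

If the binned means swing away from zero in a systematic pattern across the range of x, the functional form is probably off; if they hover near zero, the predictor is likely fine as entered.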

Note that in some cases you will log Y, sometimes X, and sometimes both (these are all different models), and supposedly the choice should be based on useful theory, not on the raw distribution. Is there anything in the literature in your field that suggests which to do?
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
Transformations should be based on the model's residuals. Is there a suspected problem with the error terms?