Transforming Variables (squaring/logging variables)

Let's say I am running a regression y = B1 x1 + B2 x2.
After examining a histogram of y, I decided it needed to be transformed to ln(y) to bring its distribution closer to normal. To deal with a number of observations where y=0, I added 1 to y before logging it: y=y+1, then lny=ln(y).
After this I also added 1 to both x1 and x2.

Now I am trying to decide whether to log or square x1 and x2 for the regression, and I am examining scatterplots. Should I be looking at (scatter y x1) or (scatter lny x1) to determine whether a transformation is necessary?

When I do (scatter y x1), the relationship looks linear, which suggests x1 should be left as is, but when I do (scatter lny x1), it looks like x1 should be logged.

Any idea what I should do?


Fortran must die
Although I am not an expert, I think you might consider Box-Cox, which in theory tells you which transformation to use in cases like this (it estimates a power transformation of Y from the data). I don't think you are supposed to add 1 to the predictors just because you added it to Y.
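
To make the Box-Cox idea concrete, here is a minimal sketch in plain Python. It is not the poster's data or any package's routine: the y values are made up, the helper names (boxcox, boxcox_loglik) are hypothetical, and it uses a mean-only model with no predictors, which is a simplification of what you would do in a regression. It searches a grid of lambda values for the one with the highest profile log-likelihood; lambda = 0 corresponds to the log. Note that Box-Cox needs strictly positive values, which is why zeros in y have to be shifted first.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]          # lambda = 0 is the log
    return [(v ** lam - 1.0) / lam for v in y]

def boxcox_loglik(y, lam):
    """Profile log-likelihood of lambda under a mean-only normal model."""
    n = len(y)
    z = boxcox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n   # ML variance (divide by n)
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)

# Made-up, strictly positive data: exponentials of evenly spaced values,
# so the log of y is symmetric and the log transform is a sensible choice.
y = [math.exp(0.1 * i) for i in range(1, 41)]

grid = [k / 10 for k in range(-20, 21)]             # lambda from -2.0 to 2.0
best_lam = max(grid, key=lambda lam: boxcox_loglik(y, lam))
print("lambda with the highest log-likelihood:", best_lam)
```

A lambda near 1 would suggest leaving y alone, near 0.5 a square root, and near 0 a log. In a real analysis the likelihood would be computed from the regression residuals rather than from y alone.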

Are you looking at scatter plots of the residuals from the regression? If so, I think they should be the residuals from the model you actually run (log Y versus X, if you did in fact log Y).
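
As a rough illustration of checking residuals from the model you actually run, the sketch below fits lny on x1 and then lny on ln(x1) and compares the leftover error. Everything here is an assumption for illustration: the data are synthetic (built so that lny is linear in ln(x1)), the helper ols_residuals is hypothetical, and only one predictor is used to keep it short.

```python
import math

def ols_residuals(x, y):
    """Fit y = a + b*x by least squares and return the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Synthetic data: lny is linear in ln(x1), plus a small deterministic wiggle.
x1 = [float(i) for i in range(1, 51)]
lny = [0.5 + 2.0 * math.log(v) + 0.05 * math.sin(i) for i, v in enumerate(x1)]

resid_raw = ols_residuals(x1, lny)                          # lny on x1
resid_log = ols_residuals([math.log(v) for v in x1], lny)   # lny on ln(x1)

ssr_raw = sum(r ** 2 for r in resid_raw)
ssr_log = sum(r ** 2 for r in resid_log)
print(f"SSR with x1 as is: {ssr_raw:.3f}")
print(f"SSR with ln(x1):   {ssr_log:.3f}")
```

Plotting resid_raw against x1 would show a clear curve, while resid_log would look patternless. That curvature in the residuals, rather than the raw scatter of y against x1, is the signal that x1 needs transforming once Y has been logged.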

Note that in some cases you will log Y, sometimes X, and sometimes both (these are all different models), and supposedly the choice should rest on theory rather than on the raw distribution. Is there anything in the literature in your field that suggests which to do?


Less is more. Stay pure. Stay poor.
Transformations should be based on the model's residuals. Is there a suspected problem with the error terms?