Convert data by LN or not?

Hello dear statistics community members!
I am about to run a single-species, single-season occupancy model on habitat variables for my tropical mammal data in R. One of my statistics mentors told me I might need to transform my distance-to-road and distance-to-river data using LN or Log10. He did not seem too sure either: he said that whether to transform depends on my research question, and he stayed rather vague about my data set. I do not really understand how the type of research question affects whether I can feed my untransformed or transformed data into the model, so I was hoping the community here could offer advice. If it helps: distances range from 0 to 1.94 km.

Edit: In general, is there a rule of thumb for when to transform data using a log?

Thank you to everyone who takes the time to help answer my question!



Less is more. Stay pure. Stay poor.
The most common reasons to transform data are that the relationship in the model is not linear or that the residuals show heterogeneity (non-constant variance, also called heteroscedasticity). Some fields, like economics, are also more likely to transform in order to get estimates on the percent-change scale.
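
To illustrate what "percent-change scale" means, here is a minimal Python sketch (not R, and not tied to any package). The distances are made-up values within a 0-2 km range, and `log_difference` is a hypothetical helper. It shows that after a natural log, two doublings give the same step even though the raw steps differ a lot.

```python
import math

def log_difference(start_km, end_km):
    """Difference between two distances after taking natural logs."""
    return math.log(end_km) - math.log(start_km)

# Hypothetical distances in km: two doublings at different ends of the range.
near = log_difference(0.1, 0.2)  # +0.1 km on the raw scale
far = log_difference(0.8, 1.6)   # +0.8 km on the raw scale

# Both steps equal ln(2), because the log scale measures ratios, not differences.
print(f"near: {near:.4f}, far: {far:.4f}, ln(2): {math.log(2):.4f}")
```

So a model on the logged variable asks "what happens when distance doubles?", while a model on the raw variable asks "what happens with each extra km?". Which of those you care about is one way the research question can matter.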

Your description of the model is not familiar to me. What package would you use in R to do this?
Hello hlsmith and thanks for the reply. The package I am using is called 'wiqid'. It contains functions for both maximum likelihood and Bayesian estimation, if that helps in any way.
I am sorry, I am not very well versed in statistics, but how can I find out about the residuals in my model? And what is meant by heterogeneity in this context (I always think of heterogeneity in a biological context)?


Less is more. Stay pure. Stay poor.
Do you know what type of model you are using (linear regression, logistic regression, etc.)? How are your variables formatted: categorical, continuous, integers, etc.?
Sorry for the late reply, I was a bit busy, and thank you for your continued interest in helping me. The model should be a logistic regression model, since my response variable is binary (detection/non-detection of a species) and I am analyzing occupancy, which is the probability of site X/Y/Z/... being occupied by a certain species. My variables are continuous, I believe (elevation and distances).
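
As a rough sketch of how a distance covariate enters such a model through a logit link, the Python snippet below compares a raw and a log(d + 1) version of the same covariate. The intercept `b0` and slope `b1` are hypothetical numbers chosen only for illustration, not estimates from any data. The same values are reused in both versions just to show the change in curve shape, whereas a fitted model would estimate different coefficients for each. A full occupancy model also estimates detection probability, which is left out here.

```python
import math

def inv_logit(eta):
    """Map a linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, for illustration only.
b0, b1 = 1.0, -1.5

def psi_raw(distance_km):
    """Occupancy probability with distance entered untransformed."""
    return inv_logit(b0 + b1 * distance_km)

def psi_logged(distance_km):
    """Occupancy probability with distance entered as log(d + 1)."""
    return inv_logit(b0 + b1 * math.log1p(distance_km))

for d in (0.0, 0.5, 1.0, 1.94):
    print(f"d = {d:.2f} km  raw: {psi_raw(d):.3f}  logged: {psi_logged(d):.3f}")
```

With the logged covariate, the curve changes fastest close to the road and flattens out further away. With the raw covariate, each extra km shifts the linear predictor by the same amount.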


Well-Known Member
A common situation where logging helps is when the data is squashed at the low end and stretched way out at the high end.
Hello katxt, to make it easier to understand, let me try to apply your suggestion to my elevation data: do you mean, for example, many data points at very low elevation (say 0-100 m) and then a few points stretched out at elevations exceeding a kilometer?
"It stops the extreme values having undue leverage (influence)." - Yeah, I see the point, but wouldn't standardizing the data have the same effect?
"However, there are problems with logging 0." - Yes, log(0) is undefined, right? What do you usually do in this case? Just leave 0 as a data value?


Well-Known Member
Standardizing doesn't alter the shape. It will still be stretched out. Commonly, people just add 1 to everything before logging if there are zeros. A bit artificial, but there you go. pH is an example of data that has been logged as a matter of course to make it better behaved.
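
A small Python sketch of both points, using only the standard library. The distances are made-up values (including a zero) in the 0-1.94 km range, and `skewness` is a hypothetical helper that measures how stretched the right tail is. It checks that standardizing leaves the skewness unchanged, while log(d + 1) reduces it and is still defined at zero.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

# Hypothetical distances to road in km: many small values, a few large ones.
distances_km = [0.0, 0.02, 0.03, 0.05, 0.08, 0.10,
                0.15, 0.22, 0.40, 0.75, 1.30, 1.94]

# Standardizing (z-scores) only shifts and rescales the values.
mean = statistics.fmean(distances_km)
sd = statistics.pstdev(distances_km)
standardized = [(d - mean) / sd for d in distances_km]

# log1p(d) computes log(d + 1), so a distance of 0 maps to 0 instead of failing.
logged = [math.log1p(d) for d in distances_km]

print(f"skewness raw:          {skewness(distances_km):.3f}")
print(f"skewness standardized: {skewness(standardized):.3f}")
print(f"skewness log(d + 1):   {skewness(logged):.3f}")
```

How much log(d + 1) changes the shape depends on the units. For distances under 2 km the compression is mild, while the same data in metres would be compressed much more, which is part of why adding 1 is a bit artificial.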