# Why use logged variables in analysis?

#### reveller

##### New Member
This is probably a very basic question, but I can't seem to find a solid answer for it. I hope I can find one here.

I'm currently reading papers in preparation for my own master's thesis. One of them studies the relationship between tweets and stock market features.

In one of their hypotheses, they propose that "increased tweet volume is associated with an increase in trading volume".

I would expect them, in the pairwise correlations, to correlate tweetVolume with tradingVolume, but instead they report using the logged versions: LN(tweetVolume) and LN(tradingVolume).

For my thesis, I have replicated this part of their paper. I have collected tweets about 100 companies for over 6 months (tweetVolume) and stock trading volume for the same timeframe. If I correlate the raw variables, I find r=.282, p=.000, but when I use the logged versions, I find r=.488, p=.000.

I don't understand why researchers sometimes use logged versions of their variables and why correlation seems so much higher if you do so. What is the reasoning here, and why is it OK to use logged variables?

Your help is greatly appreciated.

#### staticeland

##### New Member

A log transformation is a variance-stabilizing transformation. Log-transforming your data can help satisfy some model assumptions. E.g. in regression you assume that the residuals have constant variance. For strictly positive, right-skewed variables such as counts or volumes, where the spread tends to grow with the level, this assumption is often violated when the variable is not log-transformed.
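A small simulation can illustrate the variance-stabilizing effect. This is only a sketch, not something taken from the paper: the data are synthetic, and the multiplicative-error model, the constants, and the helper name `spread` are all assumptions made for illustration. To keep it short, it measures deviations from the known trend rather than residuals from a fitted regression.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

# Assumed (hypothetical) model: y = 3 * x * exp(noise), i.e. multiplicative errors.
xs = [rng.uniform(1.0, 100.0) for _ in range(4000)]
ys = [3.0 * x * math.exp(rng.gauss(0.0, 0.3)) for x in xs]

def spread(pairs, logged):
    """Standard deviation of the deviations from the known trend y = 3x."""
    if logged:
        devs = [math.log(y) - math.log(3.0 * x) for x, y in pairs]
    else:
        devs = [y - 3.0 * x for x, y in pairs]
    return statistics.stdev(devs)

pairs = list(zip(xs, ys))
low = [(x, y) for x, y in pairs if x < 20.0]   # observations with small x
high = [(x, y) for x, y in pairs if x > 80.0]  # observations with large x

# Ratio of the spread at large x to the spread at small x, on each scale.
raw_ratio = spread(high, logged=False) / spread(low, logged=False)
log_ratio = spread(high, logged=True) / spread(low, logged=True)

print(f"raw scale: spread(high x) / spread(low x) = {raw_ratio:.2f}")
print(f"log scale: spread(high x) / spread(low x) = {log_ratio:.2f}")
```

Under these assumptions the raw-scale ratio should come out well above 1 (the spread grows with x), while the log-scale ratio should sit close to 1, which is what constant variance looks like.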

Why is it OK to use it?
Logarithms are also often used for maximum-likelihood estimation. A maximum of the likelihood function occurs at the same parameter value as a maximum of the logarithm of the likelihood, because the logarithm is an increasing function. The log-likelihood is easier to maximize, especially for the multiplied likelihoods of independent random variables (Wikipedia).
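The claim that the likelihood and the log-likelihood peak at the same parameter value can be checked numerically. The following is a minimal sketch with made-up data (7 successes in 10 Bernoulli trials) and a simple grid search; the names `likelihood`, `log_likelihood`, and `grid` are illustrative only.

```python
import math

successes, trials = 7, 10  # hypothetical data: 7 successes in 10 Bernoulli trials

def likelihood(p):
    """Bernoulli likelihood: a product of probabilities."""
    return p ** successes * (1.0 - p) ** (trials - successes)

def log_likelihood(p):
    """Logarithm of the same likelihood: the product becomes a sum."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

grid = [i / 100 for i in range(1, 100)]  # candidate values 0.01 .. 0.99

p_lik = max(grid, key=likelihood)          # maximizer of the likelihood
p_loglik = max(grid, key=log_likelihood)   # maximizer of the log-likelihood

print(p_lik, p_loglik)
```

Both searches should land on the same grid point, the sample proportion 7/10, because taking the log preserves the ordering of the likelihood values.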

#### Englund

##### TS Contributor
Besides what staticeland says, building models with heteroscedastic variance can give flawed estimates of beta values as well as of correlation coefficients and R-squared.

#### noetsi

##### Fortran must die
It is often done with skewed data, to make the distribution closer to normal.
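As a rough illustration of this point, the sketch below draws right-skewed synthetic data and compares the sample skewness before and after taking logs. The lognormal distribution, its parameters, and the helper `skewness` are assumptions for the example, not the thesis data.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(2)  # fixed seed so the run is reproducible

# Hypothetical right-skewed data (lognormal), standing in for something like volumes.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness of raw values:    {skew_raw:.2f}")
print(f"skewness of logged values: {skew_log:.2f}")
```

For data like these, the raw skewness should be strongly positive and the logged skewness close to zero. Real data will rarely be exactly lognormal, so the log usually reduces skew rather than removing it.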

#### Mean Joe

##### TS Contributor
"why correlation seems so much higher if you do so."
This is not always true (as I'm sure you're aware). Basically, correlation measures how closely the pairs of data points fall on a straight line. If the relationship between the variables is non-linear (e.g. exponential), then the transformed data can fall closer to a line, and the correlation would then be higher.
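A quick simulation shows the effect. This is only a sketch on synthetic data: the exponential relationship, the noise level, and the helper `pearson` are assumptions for illustration and say nothing about the actual tweet and trading volumes.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical data: y grows exponentially in x, with multiplicative noise.
xs = [rng.uniform(0.0, 5.0) for _ in range(2000)]
ys = [math.exp(x + rng.gauss(0.0, 0.5)) for x in xs]

r_raw = pearson(xs, ys)                         # x against y: curved relationship
r_log = pearson(xs, [math.log(y) for y in ys])  # x against log(y): roughly linear

print(f"r (raw y)    = {r_raw:.3f}")
print(f"r (logged y) = {r_log:.3f}")
```

Here the logged version should give the clearly higher correlation, because log(y) is linear in x plus noise. If the raw relationship were already linear, logging could just as well lower the correlation.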