Should I transform my response data (count data)? Impact on subsequent analyses?

#1
Hello,
My response variables are derived from count data. I have the biennial winter population counts of a hibernating mammalian species (sites / n =226). My predictor variables are from two time steps (i.e. 1992 and 2001).

I created my response variables from the following to correspond with the predictor variables' time steps and lag:
Code:
- For the predictor variables from 1992 I used the counts for years 1988, 1990, 1992, 1994 and 1996
- For the predictor variables from 2001 I used the counts for years 1998, 2000, 2002, 2004 and 2006
Using those sets of years I found the mean and trend (SLOPE via Excel 2007) of the count data for each location. If a site had more than 2 missing count values for a given response variable, it was removed from the analysis. (Years were notated as 1, 3, 5, ..., 19 for the trend calculations.)
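(For reference, the same per-site summaries could be computed in R roughly as below; this is only a sketch, and the long-format data frame `counts` with columns `site`, `year` and `count` is hypothetical.)

Code:
# hypothetical long-format data: one row per site and survey year
yrs92 <- c(1988, 1990, 1992, 1994, 1996)
x     <- c(1, 3, 5, 7, 9)                  # coded years, separated by 2

sub92 <- subset(counts, year %in% yrs92)

per_site <- lapply(split(sub92, sub92$site), function(d) {
  y <- d$count[match(yrs92, d$year)]       # counts in year order, NA where missing
  if (sum(is.na(y)) > 2) return(NULL)      # drop sites with more than 2 missing counts
  data.frame(site    = d$site[1],
             mean92  = mean(y, na.rm = TRUE),
             slope92 = coef(lm(y ~ x))[2]) # least-squares slope, like Excel's SLOPE()
})
resp92 <- do.call(rbind, per_site)

Summary statistics of the derived (untransformed) response variables: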

Code:
            range      skew   kurtosis        se      n
slope92    7533.05     -2.71      12.71    124.25     64
mean92    76525         4.67      25.03   1093.86     93
slope01    9046.3       1.64      12.52    105.09     86
mean01    58727.2       4.26      19.4     801.71    118
I then added 1 to all values (to account for any counts of zero) and natural log transformed the count data. I then proceeded to find the mean and trend the same way as above.

Code:
              range    skew   kurtosis      se      n
lnslope92     1        -1.02       1.76    0.02     64
lnmean92      7.91      0.46      -0.69    0.24     71
lnslope01     0.78     -0.22       0.99    0.01     86
lnmean01     10.37      0.12      -0.48    0.22    116
I natural log transformed the data because that is what is often done with count data, and I thought it would improve normality.
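(In R this transform is a one-liner; the data frame `counts` is the same hypothetical one as above.)

Code:
# add 1 before taking the natural log so that zero counts are defined
counts$lncount <- log(counts$count + 1)    # equivalently: log1p(counts$count)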

I plan on performing linear regression (lm() in R) between the response and predictor variables to look for any relationships. Does that sound good? Is it the right regression for the task?

I've read this article: Do not log-transform count data by O'Hara and Kotze (2010) (summary below). I was already wondering whether my approach was sound, and this article brings up some valid points.

How might you proceed with my dataset?

Thank you kindly,
Mike

Summary

1. Ecological count data (e.g. number of individuals or species) are often log-transformed to satisfy parametric test assumptions.

2. Apart from the fact that generalized linear models are better suited in dealing with count data, a log-transformation of counts has the additional quandary in how to deal with zero observations. With just one zero observation (if this observation represents a sampling unit), the whole data set needs to be fudged by adding a value (usually 1) before transformation.

3. Simulating data from a negative binomial distribution, we compared the outcome of fitting models that were transformed in various ways (log, square root) with results from fitting models using quasi-Poisson and negative binomial models to untransformed count data.

4. We found that the transformations performed poorly, except when the dispersion was small and the mean counts were large. The quasi-Poisson and negative binomial models consistently performed well, with little bias.

5. We recommend that count data should not be analysed by log-transforming it, but instead models based on Poisson and negative binomial distributions should be used.

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00021.x/full
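(For reference, the kind of model the article recommends is fit to the raw, untransformed counts; a minimal sketch, with the data frame and predictor names hypothetical.)

Code:
library(MASS)   # for glm.nb()

# quasi-Poisson GLM: log link, overdispersion handled via a dispersion parameter
fit_qp <- glm(count ~ forest, family = quasipoisson(link = "log"), data = counts)

# negative binomial GLM on the same raw counts
fit_nb <- glm.nb(count ~ forest, data = counts)

summary(fit_qp)
summary(fit_nb)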
 
#2
Here I am saying it again, differently:
I collected data on the same variables twice: once in 1992 and once in 2001.

For example, in 1992 I have some independent variables (e.g. forest amount, urban amount, agriculture amount). My dependent variables are mean 1992 and trend 1992 (as described above). My end goal is to model mean 1992 with the independent variables from 1992 and then to model trend 1992 with the independent variables from 1992. I was planning on using a linear model (i.e. lm in R) to see if there was a relationship between the independent variables and dependent variables. Examining each dependent variable on its own.

I will do the same using the 2001 independent variables and 2001 dependent variables.
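(In R the four planned models might look something like this; the variable names are hypothetical.)

Code:
# 1992: each dependent variable regressed on the 1992 landscape variables
fit_mean92  <- lm(mean92  ~ forest92 + urban92 + ag92, data = mydata)
fit_slope92 <- lm(slope92 ~ forest92 + urban92 + ag92, data = mydata)

# 2001: same structure with the 2001 predictors and responses
fit_mean01  <- lm(mean01  ~ forest01 + urban01 + ag01, data = mydata)
fit_slope01 <- lm(slope01 ~ forest01 + urban01 + ag01, data = mydata)

summary(fit_mean92)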

I was wondering if I should log transform my dependent variable since it comes from count data and this is often done with count data.

Thanks,
Cheers,
Mike
 
#3
Ultimately I want to see if there is a relationship between the count (mean or trend) and the forest for a given year. I'd like to see which independent variables can describe what's going on with the dependent variable, and this is why I am thinking linear regression.

As of now I am not trying to link the 1992 and 2001 data points. Just seeing if there is a relationship between the counts and the forest.

I suppose it may have been easier if I didn't discuss the time steps.

How about this (using only forest for simplicity)?
- Is there a relationship between the amount of forest and the mean count of the hibernating mammals?
- Is there a relationship between the amount of forest and the trend (i.e. increase or decrease) of the count of the hibernating mammals?

Should I transform my count data?
Is a linear regression a good way to model this relationship?

Thanks again,
Mike
 
#4
I derived the arithmetic mean from the count values for 1988, 1990, 1992, 1994 and 1996:
((1988.x+1990.x+1992.x+1994.x+1996.x)/5) = mean

I derived the 'trend' of the count data as follows: I found the slope of the count values for 1988, 1990, 1992, 1994 and 1996. In Excel 2007 I used the SLOPE function, with the count data as my "known y's" and 1, 3, 5, 7, 9 as my "known x's" (treating 1988 as 1, 1990 as 3, etc.). The slope value changes depending on what x-values you use; no particular logic was used except that 1 seemed like a good starting point, and I wanted a separation of 2 since there are 2 years between counts.
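(Excel's SLOPE can be reproduced in R with lm(); the count values below are made up just to show the call.)

Code:
y <- c(120, 95, 88, 102, 76)   # hypothetical counts for 1988, 1990, 1992, 1994, 1996
x <- c(1, 3, 5, 7, 9)          # coded years, separated by 2

coef(lm(y ~ x))[["x"]]         # same value as Excel's SLOPE(known_y's, known_x's)

# doubling the spacing of x halves the slope, so the coding convention only changes
# the scale of the trend, not its sign or significance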

Based on expert opinion and empirical evidence I do not expect the population count to be immediately affected by the value of the forest. This range of years is to account for that.

I hope this is beginning to become clearer.

Cheers,
Mike
 

bryangoodrich

Probably A Mammal
#6
I'm not familiar with the area of research, so I don't know why log-transformations would typically be preferred. I would think that should be determined by the data, and a Box-Cox analysis should help determine which transformation to consider. As the article indicates, transformations are chosen to correct problems in an OLS model that fails to meet certain assumptions. In particular, when the error terms are nonconstant, a Box-Cox transformation will often help determine the optimal way of correcting that. Unfortunately, such a correction may alter the fundamental relationship the regression is trying to capture between the predictor and response variables. As the article points out, generalized linear models (GLMs) offer a lot more in this area, if you have some basis to describe the relationship (say, that you already know it should follow a Poisson distribution). Needless to say, GLMs are far more complicated than OLS regression, so I cannot help you there. An even more fundamental question you might ask is, what is this model to be used for? Are you trying to predict what sort of counts you should expect in another decade given the two "cohorts" you currently have? Or are you trying to explore a fundamental relationship between the independent variables and the observed response? The two tasks are quite different (i.e., prediction vs explanation, broadly speaking), and you are unlikely to make the best model you can for both purposes.
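(A Box-Cox check on an existing lm fit is quick in R; a minimal sketch, with the fit and column names hypothetical. Note that boxcox() needs a strictly positive response, so it would apply to the means rather than to slopes, which can be negative.)

Code:
library(MASS)   # for boxcox()

fit <- lm(mean92 ~ forest, data = mydata)      # hypothetical fit
bc  <- boxcox(fit, lambda = seq(-2, 2, 0.1))   # profile log-likelihood over lambda

bc$x[which.max(bc$y)]   # best lambda; a value near 0 suggests a log transform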

I would begin by looking at a scatterplot of all the variables to see if there is (1) an observable linear trend between Y and all the Xs, and (2) if there are any observable trends amongst the Xs. I would also look at a correlation matrix to see how they are all correlated. You should mostly consider variables that are correlated with the response variable, but also be concerned about high levels of correlation amongst the X variables. If any are very highly correlated (say, > 90%), I might exclude one on the principle that another predictor carries nearly the same information (I would keep whichever is more highly correlated with Y). However, that may not be the 'statistical' thing to do. Don't quote me! haha
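(Both checks are one-liners in R; the column names below are hypothetical.)

Code:
vars <- mydata[, c("mean92", "forest92", "urban92", "ag92")]   # response plus predictors

pairs(vars)                                # scatterplot matrix
cor(vars, use = "pairwise.complete.obs")   # correlation matrix, tolerating missing values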

An initial fit of all the variables is a good place to start because you can always reduce your model. You can also build up your model, but I think a backward search is less likely to drop something important, whereas a forward search might never include it. A backward search may not exclude enough, but I think we are better equipped to decide whether something should be excluded than whether it should be included. Those are my thoughts, but I'm not that experienced yet.
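(If you do try a backward search, base R's step() does this by AIC from the full fit; just a sketch with hypothetical names.)

Code:
full <- lm(mean92 ~ forest92 + urban92 + ag92, data = mydata)  # "kitchen sink" model

reduced <- step(full, direction = "backward")   # backward elimination by AIC
summary(reduced)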

From the initial fit, you will want to look at the OLS regression assumptions. Check for multicollinearity, heteroskedasticity, and normality of error terms. There is quite a bit involved in that, and I would recommend Kutner et al., "Applied Linear Statistical Models." It is not a small book, but it teaches the basics of dealing with models from an applied perspective (i.e., you don't need to be a mathematical statistician to make use of this thing). How you deal with those deviations from the assumptions, which sub-models look more appropriate than the overloaded one (i.e., the kitchen-sink omelet), and which transformations appear most appropriate will determine your course of action. If ecological studies have a typical approach, it is only because there is a broad similarity in the data and appropriate models that warrant such an approach. However, I would never recommend taking such a cookie-cutter method; every dataset is going to be different, and you should at least take the typical approaches to analyzing your initial models to find the warrant for a given alternative approach. It is an exploratory process.
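(As a first pass, R's built-in diagnostic plots on the fitted model cover most of those checks visually; a sketch with hypothetical names.)

Code:
full <- lm(mean92 ~ forest92 + urban92 + ag92, data = mydata)  # hypothetical full model

par(mfrow = c(2, 2))
plot(full)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))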
 
#7
bryangoodrich:
Thanks for this thoughtful response!

An even more fundamental question you might ask is, what is this model to be used for? Are you trying to predict what sort of counts you should expect in another decade given the two "cohorts" you currently have? Or are you trying to explore a fundamental relationship between the independent variables and the observed response? The two tasks are quite different (i.e., prediction vs explanation, broadly speaking), and you are unlikely to make the best model you can for both purposes.
I am more interested in explanation given the 'issues' I have with my dataset. I hadn't really thought about the ways the models would be different based on the objective and I am not sure how they would differ ... yet.

For example, if I were to run a linear regression in R, I'd use something like this:
Code:
> mydata.lm <- lm(slope1992~forest, data=mydata)
> summary(mydata.lm)
....

Call:
lm(formula = slope1992 ~ forest, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.485277 -0.077622  0.004429  0.075563  0.336313 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09791    0.05243   1.868   0.0653 .
forest      -0.09839    0.07499  -1.312   0.1931  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1365 on 84 degrees of freedom
Multiple R-squared: 0.02008,    Adjusted R-squared: 0.008416 
F-statistic: 1.721 on 1 and 84 DF,  p-value: 0.1931
Is there some "ready to go" R code I could use to check for multicollinearity, heteroskedasticity, and normality of error terms?

However, I would never recommend taking such a cookie-cutter method; every dataset is going to be different, and you should at least take the typical approaches to analyzing your initial models to find the warrant for a given alternative approach. It is an exploratory process.
I agree.

Thanks again,
Cheers,
Mike
 

bryangoodrich

Probably A Mammal
#8
For the R code, check out my website for ALSM. Chapters 9-11 deal with building a multiple regression model. I don't recall offhand which chapters or sections deal with each of the topics, but they are all included. Usually for normality of error terms you just do a QQ plot (see qqnorm(resid(fit)) along with qqline(resid(fit))). If there are serious deviations from the center line, then the residuals appear to be non-normal or skewed to one side or the other (I think if it bows downward it is right-skewed, but I can't recall at the moment). As for multicollinearity, see 'vif' (in the car package). Heteroskedasticity can be checked in a number of ways, such as the Brown-Forsythe test, among others. There are also analytic tests for normality (shapiro.test is the Shapiro-Wilk test; ks.test is the Kolmogorov-Smirnov one).
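(Putting those suggestions together, a minimal sketch of the checks in R; it assumes the car and lmtest packages, and uses a Breusch-Pagan test in place of Brown-Forsythe for the heteroskedasticity check. The fit and column names are hypothetical.)

Code:
library(car)     # vif() for multicollinearity
library(lmtest)  # bptest() for heteroskedasticity

fit <- lm(mean92 ~ forest92 + urban92 + ag92, data = mydata)   # hypothetical fit

# normality of error terms
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))   # Shapiro-Wilk normality test

# multicollinearity (rule of thumb: VIF > 10 is a concern)
vif(fit)

# heteroskedasticity
bptest(fit)                # Breusch-Pagan test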