# Help needed with choice of test

#### BatmanFlight

##### New Member
Hi,
I am an ecologist and not a stats expert by any stretch of the imagination. I have some citizen science research where I need to run some stats tests, but I am getting myself confused over which test would be the most appropriate, and wondered if someone here could give me the benefit of their expert opinion.

I have data of bat activity from various locations and I would like to establish if there is any correlation between bat activity and proximity to various habitat features.
I therefore end up with tables like this:

So I just need to establish whether there is a correlation between distance from the stream and mean nightly passes, or whether the distribution is random.

What test do you think I should be using?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Do you have actual distance from stream as a continuous variable?

#### BatmanFlight

##### New Member
> Do you have actual distance from stream as a continuous variable?
No, I took the data in categories, as seen in the table. I could probably go back and work it out, but it would take some time, so I am not super keen to do that really.

#### katxt

##### Active Member
Your data are typical of bat surveys - zeros and erratic spikes, as you are no doubt aware. I know of no easy answer. No matter what you do, someone is likely to disagree.
You presumably would like a correlation figure and a p value. I would start by making the x values single numbers, such as the bin midpoints.
The spikes will invalidate the p values from Pearson's correlation, but you could try a transformation. Quinn and Keough (2002), *Experimental Design and Data Analysis for Biologists*, suggest a fourth-root transformation for this sort of data to bring the spikes under control and leave the zeros at zero.
Alternatively you could try Spearman's correlation.
Another simple solution, if you have the software, would be a Monte Carlo test.
One final thing to keep in mind is that if you do multiple tests it is advisable to use a more stringent limit than p < 0.05 as a cutoff point for significance, to protect yourself against false positive claims.
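In Python, the suggestions above might be sketched roughly as follows. The distances and pass counts here are invented purely for illustration (bin midpoints as x values, spiky counts as y values), not BatmanFlight's actual data.

```python
# Sketch of the suggestions above: fourth-root transform before Pearson,
# with Spearman as a rank-based alternative. All numbers are hypothetical.
import numpy as np
from scipy import stats

# Hypothetical bin midpoints (metres from stream) and mean nightly passes.
distance = np.array([25.0, 75.0, 125.0, 175.0, 225.0, 275.0, 325.0])
passes = np.array([41.2, 18.5, 0.0, 7.3, 0.0, 2.1, 0.0])

# Fourth-root transform tames the spikes and leaves the zeros at zero.
transformed = passes ** 0.25

r, p = stats.pearsonr(distance, transformed)
print(f"Pearson on fourth-root data: r = {r:.3f}, p = {p:.3f}")

# Spearman works on ranks, so no transform is needed.
rho, p_s = stats.spearmanr(distance, passes)
print(f"Spearman: rho = {rho:.3f}, p = {p_s:.3f}")
```

Note that any monotonic transform (including the fourth root) leaves Spearman's rho unchanged, since it only depends on the ranks.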

#### BatmanFlight

##### New Member
Thanks katxt. As I often find, I thought I had loads of data, but when I compile it I realise there is less than I thought, which means that outliers end up skewing the data somewhat.
Some of what you mention I have not heard of (Monte Carlo), so I will look into it. Thanks.

#### noetsi

##### Fortran must die
The biggest problem, I think, is your dependent variable, mean nightly passes (if that is your dependent variable). How many distinct levels do you have for it? I only see 7, but since some of them are fractions, that makes little sense; I take it the number is an average of passes. If you have the original values used to calculate that average, you could probably do linear regression with a number of dummy variables.

Or you could do Spearman's rho if you don't want to do regression.

#### katxt

##### Active Member
> The biggest problem, I think, is your dependent variable.

Yes, I agree that it's the big problem. However, in my experience, spreading out bat data to individual nights/detectors generally seems to result in even more spikes and more zeros. Spearman's is OK, but it ignores much of the information in the data that a randomization test would use.

#### noetsi

##### Fortran must die
I am not familiar with a randomization test. Spearman's is just a simple option. Someone who does not know statistics might have problems with more advanced methods. I realized recently that I often do, and I have been doing this for more than a decade.

Why would regression not work?

#### katxt

##### Active Member
You could certainly do a regression from the original raw data (or from the binned data using the bin middles if it comes to that). The p value you would get for the slope would be exactly the same as the p value for Pearson's correlation. The problem is that the p value can't be trusted, because it would be based on normality assumptions that just aren't true for this data. Moving away from the stream you are likely to record pass rates over a few days of something like 25, 0, 0, 0, 15, 0, 0, 0, 127, 5, 0, 0, 158, 0, 0, ... The errors are not normal (nor Poisson, even though they are counts). Binning helps a bit, but not all that much.
The regression p value is derived from the sampling distribution of the slope or correlation under the assumption that the errors are random and normal, which of course they aren't this time. What the randomization test does is estimate the sampling distribution of the correlation from the data you have collected, and use that instead of the t distribution to get a p value.
You can also use the same setup to get confidence intervals for the slope and intercept if you want using the raw data or the binned data.
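A randomization (permutation) test of the kind described above could be sketched like this: shuffle the y values many times, recompute the correlation each time, and see how extreme the observed correlation is in that shuffled distribution. The data below are invented for illustration.

```python
# Sketch of a randomization test for correlation, under the null that
# passes are unrelated to distance. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

distance = np.array([25.0, 75.0, 125.0, 175.0, 225.0, 275.0, 325.0])
passes = np.array([41.2, 18.5, 0.0, 7.3, 0.0, 2.1, 0.0])

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

observed = corr(distance, passes)

# Shuffle y repeatedly to build the null distribution of the correlation.
n_shuffles = 10_000
shuffled = np.array([corr(distance, rng.permutation(passes))
                     for _ in range(n_shuffles)])

# Two-sided p value: fraction of shuffles at least as extreme as observed
# (with a +1 correction so the estimate is never exactly zero).
p = (np.sum(np.abs(shuffled) >= abs(observed)) + 1) / (n_shuffles + 1)
print(f"observed r = {observed:.3f}, permutation p = {p:.3f}")
```

Recent versions of scipy also provide `scipy.stats.permutation_test`, which wraps this idea.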
However, as noetsi says, Spearman's is "doing it simple" and more people will recognize it.
A further idea is to use presence/absence at each detector and use logistic regression to get the probability of presence at any distance.

#### noetsi

##### Fortran must die
I think it's increasingly agreed that normality is not that important for regression if you have enough cases; 30 or more is considered enough (although more is better). The data not being random might be an issue. I am not sure what that means, since I normally work with populations. Do you mean you cannot generalize from it to larger populations?

Paul Allison's take on that:

"Non-normality is a trivial problem with moderate to large size samples."
Whatever a moderate to large sample is.

My concern, for someone who does not have a lot of background in statistics, is that any complicated method has assumptions they might not know of or be able to test (normality is one, of course; linearity is another).


#### katxt

##### Active Member
The data is probably not just non-normal; it is probably wildly non-normal.
Anyway, I think we've probably covered the options. Maybe BatmanFlight's best bet now is to show the data to a statistician and take their advice.

#### katxt

##### Active Member
Q What do bats do in the winter?
A They crack if you don't oil them.
(Very old English schoolboy cricket joke.)

#### fed2

##### Active Member
Mine was Calvin and Hobbes. Just always stuck with me.