GW shark study- You give advice on the stats and I give advice on swim spots ;)

Clem

New Member
#1
Disclaimer: I am fairly new to the stats side of ecology. I have carried out a bit of field work in my time but have only come over to the analysis side of things recently, so please forgive me if my questions seem a little basic. I should say that before posting this thread I spent a few weeks looking at this site and others, trying to get my head around a few issues I have. I have taught myself the basics of JMP and tried out the tests myself, and I think I am getting the hang of the program. However, I would like advice on whether I am on the right track or have headed off on a tangent (which I often do). Thanks for your time, and feel free to point out any gaps in my explanation of the study.

Shark study: This was a quick pilot study to see if the methodology used worked. The aim was to see if there is a difference in shark abundance or frequency between different depths (e.g. shallow versus deep). I am interested in doing some statistical analyses on the results, as they are not going to be used for anything otherwise.

Experimental design:
- Two depths: 5 m and 30 m
- At each depth, three different locations
- At each location, 5 replicates

What was measured:
Relative Abundance

We had a camera and a bait at each replicate (replicates were separated by 1 km). Each camera filmed for 1 hr, and a measure of relative abundance called MaxN was taken. MaxN is the maximum number of one species seen in any one frame during the hour. We do this because otherwise you can't be sure you aren't counting the same shark twice when it swims in and out of the field of view.

These are the results:
Depth / Location / MaxN
5 Boo 4
5 Boo 2
5 Boo 6
5 Boo 0
5 Boo 8
5 Cal 3
5 Cal 0
5 Cal 0
5 Cal 1
5 Cal 3
5 Har 4
5 Har 0
5 Har 1
5 Har 0
5 Har 36
30 Bow 0
30 Bow 0
30 Bow 0
30 Bow 0
30 Bow 0
30 Grop 14
30 Grop 15
30 Grop 0
30 Grop 0
30 Grop 3
30 Hy 0
30 Hy 0
30 Hy 0
30 Hy 12
30 Hy 5
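
For anyone who wants to follow along in something other than JMP, here is the same table as a small Python/pandas data frame (just a sketch; the column names are my own labels):

```python
# The results above as a pandas DataFrame (column names are my own labels)
import pandas as pd

df = pd.DataFrame({
    "Depth":    [5] * 15 + [30] * 15,
    "Location": ["Boo"] * 5 + ["Cal"] * 5 + ["Har"] * 5
              + ["Bow"] * 5 + ["Grop"] * 5 + ["Hy"] * 5,
    "MaxN":     [4, 2, 6, 0, 8,   3, 0, 0, 1, 3,   4, 0, 1, 0, 36,
                 0, 0, 0, 0, 0,   14, 15, 0, 0, 3,  0, 0, 0, 12, 5],
})
```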
 

Clem

New Member
#2
So after getting the results they sat around for quite a while, until I decided to have another look at them. The original team leader isn't around any more to give advice, but I was told that ANOVA was the way to go, and after doing quite a bit more reading I think a nested ANOVA is right, with Location nested in Depth: Depth and Location[Depth].
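
For readers without JMP, here is a minimal sketch of the same nested ANOVA in Python with statsmodels, re-using the data frame df from post #1. One caveat: if Location is treated as a random factor, the F-test for Depth should use the Location[Depth] mean square as its denominator, which the default ANOVA table does not do, so that ratio is formed by hand below:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# C(Depth)/C(Location) expands to C(Depth) + C(Depth):C(Location),
# i.e. Location nested within Depth
model = smf.ols("MaxN ~ C(Depth) / C(Location)", data=df).fit()
tab = sm.stats.anova_lm(model, typ=1)
print(tab)

# With Location random, test Depth against the nested-location mean square
ms_depth = tab.loc["C(Depth)", "sum_sq"] / tab.loc["C(Depth)", "df"]
ms_loc = tab.loc["C(Depth):C(Location)", "sum_sq"] / tab.loc["C(Depth):C(Location)", "df"]
F_depth = ms_depth / ms_loc
p_depth = stats.f.sf(F_depth, tab.loc["C(Depth)", "df"],
                     tab.loc["C(Depth):C(Location)", "df"])
```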

I have access to JMP and did a course at uni a long time ago using it, so I thought I would use this program (turns out I didn't remember a thing about it). JMP doesn't seem that intuitive, but after watching many tutorials and reading lots of posts I have got the basics (I think).

Below is my output (and yes it took me many tries to get it to look like this haha).
[image: JMP nested ANOVA output]

From this I would conclude: no significant difference for either depth or location.

What I would like to know is: are my conclusions correct, have I made any basic errors that I haven't noticed, and is a nested ANOVA the best test, or would you recommend something else?
 

Clem

New Member
#3
So, no suggestions yet; I have carried on anyway :). After a little more reading it appears that I got a little ahead of myself.

"For Analysis of Variance (ANOVA) there are three assumptions:

Observations are independent.
The sample data have a normal distribution.
Scores in different groups have homogeneous variances."

So I now know that my data are not normally distributed. However, I am not exactly sure how close to normal they have to be; do you ever really get a perfect bell-shaped curve? I read that if the sample size is large enough then ANOVA is robust to deviations from normality. But again, how large is large enough?

I also think I need to check for homogeneity of variance in the residuals, so I used GMAV to carry out a Cochran's C test. The C test is significant, which I think means I haven't met the assumption of homogeneity of variance of the residuals in the analysis of variance.
So tomorrow I will try transforming the data in an attempt to remedy this and then re-run the test (does this really deal with the issue?). I don't think it will make a difference to the result, though; again, I could be wrong. I guess I'll see tomorrow. :confused:
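
In case it helps anyone following along in Python, here is a sketch of the same checks on the fit from the earlier post (SciPy has no built-in Cochran's C, so Levene's test stands in as the homogeneity check, and the C statistic itself is computed by hand without its critical value):

```python
import numpy as np
from scipy import stats

# Normality should be checked on the residuals, not on the raw MaxN values
print(stats.shapiro(model.resid))

# Homogeneity of variance across the six depth-location cells
cells = [g["MaxN"].to_numpy() for _, g in df.groupby(["Depth", "Location"])]
print(stats.levene(*cells))

# Cochran's C statistic: largest cell variance over the sum of cell variances
# (compare against published tables, e.g. the ones GMAV uses)
variances = np.array([c.var(ddof=1) for c in cells])
C = variances.max() / variances.sum()
```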
 
#4
Clem has prepared well and investigated a lot, so I guess Clem deserves an answer. (And Clem has not insulted anybody, so maybe I dare to suggest something.)

I think it is reasonable to think of the explanatory variables depth and location in a nested model.

But what about the response variable? Since it is a count variable, I guess, or rather I am sure, that many statisticians would suggest a Poisson model: that the response variable is Poisson distributed, with "means" given by the explanatory variables depth and location. Or a negative binomial distribution as a second alternative, which gives a little more flexibility than the Poisson.

Such models can be estimated within the framework of generalized linear models (GLMs) with most standard software. ANOVA is a special case of the GLM, so it is not that strange.
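
To make that concrete, here is a minimal sketch of the Poisson version in Python with statsmodels, keeping the same nested structure (and the negative binomial family as the more flexible alternative):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Poisson GLM with Location nested within Depth (log link by default)
pois = smf.glm("MaxN ~ C(Depth) / C(Location)", data=df,
               family=sm.families.Poisson()).fit()
print(pois.summary())

# Negative binomial as the second alternative; alpha=1 here is a fixed
# dispersion guess (smf.negativebinomial can estimate it instead)
nb = smf.glm("MaxN ~ C(Depth) / C(Location)", data=df,
             family=sm.families.NegativeBinomial(alpha=1.0)).fit()
```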

Clem has taken a number of photos within one hour and picked out the photo with the maximum number of sharks.

What does that mean? I don't know, so I throw out this question to TalkStats:

Given that the Poisson intensity is fixed within an hour, what distribution will the random variable Y = max(X1, X2, ..., Xm) have, where the Xi are counts of the number of sharks (Xi ~ Po(mu)) taken sufficiently far apart in time to be considered independent random variables?

If that distribution is known then maybe it is possible to continue in the analysis.
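
For what it's worth, if the frames really can be treated as independent there is a closed form: P(Y <= y) = P(X <= y)^m, i.e. the Poisson CDF raised to the m-th power. A quick numerical sketch (mu and m are made-up values):

```python
import numpy as np
from scipy.stats import poisson

mu, m = 3.0, 60   # hypothetical: mean of 3 sharks per frame, 60 "independent" frames
y = np.arange(25)
cdf_max = poisson.cdf(y, mu) ** m                    # P(max(X1..Xm) <= y) = F(y)^m
pmf_max = np.diff(np.concatenate(([0.0], cdf_max)))  # P(max = y)
```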

The above was trying to answer the problem as it was given to us by Clem. But why should we accept the problem as it has been formulated? We can reformulate it.

In analysis of variance (ANOVA) one is trying to estimate the mean, given the explanatory variables. Clem works with the maximum above. Isn't it more natural to estimate the mean (which is also the Poisson parameter)?


Instead of having:
5 Boo 4
I suggest having many rows per hour, like:
5 Boo 1
5 Boo 0
5 Boo 3
for, say, one photo taken every minute (or every 30 seconds) within the hour.


This would give a three-level model: depth, then location within that depth, then one specific hour at that location, and within it the separate measurements taken every minute.

If the data had been normally distributed it would have been easy to just average the minute measurements up to the hour level and run an ANOVA on that. Now it is more complicated because of the discrete distribution (the Poisson, to start with).
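
If the data were re-extracted per minute like that, one option (a sketch of my suggestion, not something Clem has run) would be a Poisson GEE that treats each one-hour camera deployment as a cluster of correlated minute counts:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 'minutes' is a hypothetical long-format frame: one row per minute, with
# columns Depth, Location, Camera (deployment id) and Count
gee = smf.gee("Count ~ C(Depth) / C(Location)", groups="Camera", data=minutes,
              family=sm.families.Poisson(),
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee.summary())
```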

Others can continue and explain this better. (Good luck!)

- - -

And now to Clem's part of this: Is it safe to swim in the North Sea? What about the Mediterranean?

:)
 
#5
So, no suggestions yet; I have carried on anyway :). After a little more reading it appears that I got a little ahead of myself.

"For Analysis of Variance (ANOVA) there are three assumptions:

Observations are independent.
The sample data have a normal distribution.
Scores in different groups have homogeneous variances."
Let me add that the second assumption, although accepted and used broadly, is not correct. The correct alternative is normality of residuals.

So I now know that my data are not normally distributed. However, I am not exactly sure how close to normal they have to be; do you ever really get a perfect bell-shaped curve? I read that if the sample size is large enough then ANOVA is robust to deviations from normality. But again, how large is large enough?
You should care about the normality of your residuals, not of the data itself. For that purpose you can draw a Q-Q plot and analyze it subjectively (or attach it here so we can help you). As for how large a deviation from normality (of the residuals, of course) is acceptable, that is rather subjective, and no exact objective cut-offs are defined. But if you have a large enough sample (again, "enough" is subjective) you can worry less about normality (of the residuals).

For checking the Q-Q plot: it draws a straight oblique line and places the residuals around it. If your residuals fall more or less on that oblique line, they are normally distributed. Here "more or less" is again subjective, but if there are only slight deviations you can use ANOVA, especially if your sample is large. And if the assumptions are not met, you can still use ANOVA's non-parametric alternatives.
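
If you end up doing this in Python rather than JMP, the plot is a couple of lines (using the residuals from whatever fit you ran):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Q-Q plot of the residuals; line="s" draws the fitted reference line
sm.qqplot(model.resid, line="s")
plt.show()
```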

I see Greta is suggesting something other than ANOVA though. I recommend listening to her high-quality advice (although she knows almost nothing about subjectivity!). :)
 

rogojel

TS Contributor
#6
Hi,
thanks for the great example; it is nice to play with data that comes from such an interesting application. However, I believe there is a bit of a problem with your experiment, or maybe we just need additional information.

The problem I see is with the location: there is obviously a lot of variability in the data due to the location changes, and this variability might be confounded with the variation in numbers due to depth changes. For example, in the location Boo at 5 m you saw 8 sharks, and in the location Bow at 30 m, 0 sharks. Is this difference due to the different locations or to the different depths?

The way your data is structured, you cannot run a two-way ANOVA, because the design is not crossed: that method would expect to have data at both depths for the same location. So I would guess that one of the reasons you do not see a significant difference could be that the variation due to locations masks any signal from the depths. This would be a problem, IMHO, for any other analysis method too (a GLM, for instance).

Is there some additional info on the locations, something like Boo being close to Bow? That would help analyse the data, I think.

Regards
Sandor
 

rogojel

TS Contributor
#7
oops,
one additional comment: the 5th value for Har does look extraordinary (36 sharks). Is there any explanation for this large number, some different bait maybe, or a different time of day? In a GLM this line does have some effect; the method kind of shows that the variation due to locations is far larger than the variation due to depth, BTW.

Does the time of day play any role? Was the data collected always at the same time, or at different times?
 

Clem

New Member
#8
Thanks guys, I appreciate the feedback. It will take me a little time to digest it and get a decent response back to you. I will add some more details on the methods first, and then go away and look into the other methods suggested, and also into checking the Q-Q plot. I have a very limited stats background, so I know there are often many analysis methods to consider; however, I have generally never heard of them, so I have to go and read up on them when they are suggested.

PS: I would say it's safe to swim in most places :). There is one beach here that often has 10+ sharks just offshore with swimmers in the water, and this has been the case for 50+ years. There has never been an attack at that beach. I would not suggest swimming around seal colonies, though!
 

Clem

New Member
#9
What was measured:
Relative Abundance

We had a camera and a bait at each replicate (replicates were separated by 1 km). Each camera filmed for 1 hr, and a measure of relative abundance called MaxN was taken. MaxN is the maximum number of one species seen in any one frame during the hour. We do this because otherwise you can't be sure you aren't counting the same shark twice when it swims in and out of the field of view.
I will try and explain this a little better.

The cameras are video cameras, and they film continuously for one hour. I could just count every single shark that swims into the field of view over the course of the hour and get a total count.

But the camera only has a small field of view (say 120 degrees), so a shark swims into view and then out again, and 10 seconds later another shark swims into the field of view... is it the same shark or a new one? That is the issue with just counting every shark.

For example, take a simple case with two different 1-hour samples, where counting every shark that appears in the field of view gives a total count of 100 for each. By that count they appear to have the same number of sharks: 100 at each. However, at one replicate four sharks just swam around in circles for an hour, passing in and out of the field of view 25 times or so; there were only four sharks, but each was counted about 25 times, for a total count of 100. At the other replicate there really were 100 different sharks that swam through the field of view at various times. So in reality one sample had four sharks present and the other had 100, but using a total count both get a result of 100. This is a problem with the methodology that cannot be changed at present.

So to get around this (and this is the most common way in the literature), you film for an hour but stop the video on the single frame in that hour with the most sharks in the field of view, so that you aren't counting the same shark more than once (this is called MaxN). Going back to the simple example above with two replicates: in the first replicate you could only possibly get four sharks in frame at a time, as only four were present; you probably won't get them all in the field of view at once, so perhaps the most you get in any one frame is 2 together, giving a MaxN of 2. In the second replicate you may get a frame with 50 sharks in it at once (you rarely get them all in one frame, and you cannot tell whether you did anyway), so in that case you get a MaxN of 50.

So using MaxN you get replicate one MaxN = 2 and replicate two MaxN = 50, whereas using the total count you get replicate one = 100 and replicate two = 100.
You know that in reality there were four sharks at replicate one and 100 at replicate two, so you can see that MaxN gives a more realistic comparison of relative abundance (though neither gives an exact count of the actual sharks).
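
As a toy calculation of the same example (the per-frame counts below are made up):

```python
# Sharks visible per frame over the hour (made-up numbers for the example)
rep1_frames = [1, 2, 0, 1, 2, 1]   # four sharks circling; never more than 2 in view at once
rep2_frames = [3, 50, 12, 0, 7]    # ~100 distinct sharks passing through

maxn_rep1 = max(rep1_frames)   # MaxN = 2
maxn_rep2 = max(rep2_frames)   # MaxN = 50
```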
 
#10
oops,
one additional comment: the 5th value for Har does look extraordinary (36 sharks). Is there any explanation for this large number, some different bait maybe, or a different time of day? In a GLM this line does have some effect; the method kind of shows that the variation due to locations is far larger than the variation due to depth, BTW.

Does the time of day play any role? Was the data collected always at the same time, or at different times?
The 5th one is correct. There must have been a whale carcass in the water in the area, which would draw in a lot of sharks. Most of them were small juveniles about 2 m long.

The problem is that the locations that have deep water generally don't also have shallow-water sites. So you can't use a 5 m site and a 30 m site at the same location, because they don't both exist: all of the shallow locations only have shallow water, and all of the deeper locations only have 30 m-plus water. BUT all of the deeper locations are straight offshore from the shallow locations, if that helps.

All data was collected in the morning during daylight.
 

rogojel

TS Contributor
#11
Hi Clem,
I would still take out the 5th Har data point - not because it is an incorrect measurement, but because it is not, I would say, representative of the question you are asking.
As I understand it, the question is whether the number of sharks differs at different depths under normal circumstances. Can the presence of a whale carcass be considered "normal"? If yes, then obviously you should keep the data point, but I am assuming the answer is no.

Could you build pairs of locations, a shallow one paired with the deep one situated straight offshore from it, and analyse the data as if each pair belonged to the same "location pair"? I'm just curious whether that would show some effect of depth.
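
If the pairing is known, a hedged sketch of what that could look like (the pair assignments below are invented; the real offshore pairings would need to be filled in), re-using the df from post #1 and the Poisson GLM idea from earlier:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical pairing of each shallow site with the deep site straight offshore
pair_map = {"Boo": "P1", "Bow": "P1",
            "Cal": "P2", "Grop": "P2",
            "Har": "P3", "Hy": "P3"}
df["Pair"] = df["Location"].map(pair_map)

# Depth effect with the pair as a blocking factor, after dropping the
# whale-carcass point discussed above
sub = df[df["MaxN"] != 36]
blocked = smf.glm("MaxN ~ C(Depth) + C(Pair)", data=sub,
                  family=sm.families.Poisson()).fit()
print(blocked.summary())
```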
Regards
rogojel