How to analyze different, complex data sets?

#1
I'm looking for a good way to compare data sets.

For 20 months, we collected larval fish using standard sampling gear (produces larval fish/cubic meter) at 31 sites along 95 km river miles. We identified, categorized to developmental stage, and measured several hundred thousand larval fish and have entered the data into Access. We also collected a suite of environmental data while sampling (dissolved oxygen, temperature, conductivity, etc.), and I can also add rainfall, flow rate, and several other parameters to the complete data set. Now, I'm planning on using NMDS and some ArcInfo tools to take a look at the crucial environmental variables that influence spawning in various species, but what is puzzling me is how to compare our data sets to various other data sets.

I can obtain trawl data for black crappie, electrofishing data for centrarchids, pushnetting data for American shad, gizzard shad, and threadfin herring, and even potentially electrofishing and/or creel survey data for adult American and hickory shad. One of the big questions in larval fish research is whether density of larvae translates into recruitment of young-of-the-year (YOY) fish, particularly since mortality is incredibly high in larval fish and there are many instances of mass mortality.

So, I need a statistical test that can determine if there is a relationship between the larval fish data and the resulting YOY data. Others have suggested regression or Bray-Curtis similarity, but as I've investigated these techniques, none seem ideal for answering whether a relationship exists between the larval fish and juvenile data.

I am prepared to analyze species by species or use the ArcInfo extensions, but I can't figure out exactly how to compare these disparate data sets.
 

bugman

Super Moderator
#2
I'm looking for a good way to compare data sets.
Now, I'm planning on using NMDS and some ArcInfo tools to take a look at the crucial environmental variables that influence spawning in various species, but what is puzzling me is how to compare our data sets to various other data sets.
NMDS is probably OK for exploratory purposes, but Canonical Correlation Analysis might be a better option given that you have two matrices (fish & environmental). Also, you say you want to compare data sets with others. What are you looking for: spatial trends across sites? Comparisons of abundance data? Biodiversity differences between the different sampling techniques? This will determine which method to use.




So, I need a statistical test that can determine if there is a relationship between the larval fish data and the resulting YOY data.
I am prepared to analyze species by species or use the ArcInfo extensions, but I can't figure out exactly how to compare these disparate data sets.
Again, in terms of abundance? Spatial information? Are you looking at community assemblages or single species info?

Do you have access to PRIMER v6?
 
#3
Let me just clarify what data I'm working with:
My larval data set contains environmental data collected while sampling, larval density, larval diversity, and species measurements (ID, size, developmental stage). It was collected from Feb 11, 2008 to Sept 30, 2009. We collected twice a week at every site (31 sites total) until the end of May, then down-shifted to once a week for the rest of the project. We also collected night samples in 3 regions so we could compare night to day.
Other data sets:
1) push-netting for threadfin, gizzard, and anadromous shad, which produces shad/m^3 and this data is generally collected May-August.
2) Trawl data for Young-of-the-Year black crappie (I believe this is also fish/m^3, but it may be time deployed) which is done once a year in October. Multiple trawls per lake.
3) Electrofishing littoral zones in the fall. This data is primarily fish/hour, so effectively CPUE (catch per unit effort) or encounter rate. It's probably the most different kind of data I would like to compare with our data set.
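Because these data sets come in different units (fish/m^3 versus fish/hour), one option I could sketch is a rank-based comparison, which only uses the ordering of values and so does not care about the units. Below is a minimal Spearman rank correlation in plain Python. The numbers and the variable names (`larval_per_m3`, `yoy_per_hour`) are made up for illustration, and it assumes the two lists have already been paired by site or lake.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical paired values, one pair per site (not real data).
larval_per_m3 = [0.8, 2.4, 1.1, 5.0, 3.2, 0.3]      # larval density
yoy_per_hour = [12.0, 30.0, 18.0, 41.0, 55.0, 9.0]  # electrofishing CPUE

rho = spearman(larval_per_m3, yoy_per_hour)
print(f"Spearman rho = {rho:.3f}")
```

Rescaling either list (say, fish/hour to fish/minute) leaves `rho` unchanged, which is the property that matters here. It does not give a significance test; that would need a permutation test or a stats package.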

Here are my core research questions:
Which variables are most important in driving larval abundance?
What spatial trends exist in larval abundance?
Is there a relationship between larval density and juvenile density?

I don't currently have PRIMER, but I have PCORD 5 and SPSS.

CCA is potentially a valuable method, but McCune and Grace's "Analysis of Ecological Communities" warns that "as the number of environmental variables increases relative to the number of observations, the results become increasingly dubious" and "as the number of environmental variables approaches the number of sites, CCA becomes very similar to CA." In our data set, we measured multiple environmental parameters per sample. And if I add even more variables (e.g., flow rate, rainfall, moon phase, barometric pressure, etc.), I'm concerned a method like CCA will really start to fall apart compared to NMDS.

I am looking for spatial trends across sites (processing the larvae has already told me there are differences), and was considering Spatial Analyst in ArcInfo to explore and display many of those differences. I may also use cluster analysis for at least a basic statistical differentiation within my data set.
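As a rough illustration of what would feed a cluster analysis, here is a minimal sketch that builds a Bray-Curtis dissimilarity matrix between sites from per-taxon densities. The site names and densities are hypothetical, and the choice of Bray-Curtis as the distance measure is an assumption on my part, not something dictated by the data.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    0 means identical composition, 1 means no taxa in common.
    """
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Hypothetical mean larval densities (fish/m^3) for three taxa at four sites.
site_densities = {
    "site_01": [4.0, 1.0, 0.0],
    "site_02": [3.0, 2.0, 0.0],
    "site_03": [0.0, 1.0, 6.0],
    "site_04": [0.5, 0.5, 5.0],
}

sites = sorted(site_densities)

# Full symmetric matrix, keyed by (site, site).
matrix = {
    (s, t): bray_curtis(site_densities[s], site_densities[t])
    for s in sites
    for t in sites
}

# Each unordered pair once, sorted from most to least similar.
pairs = sorted(
    (matrix[(s, t)], s, t)
    for i, s in enumerate(sites)
    for t in sites[i + 1:]
)
most_similar = pairs[0]

for dissimilarity, s, t in pairs:
    print(f"{s} vs {t}: {dissimilarity:.3f}")
```

A hierarchical clustering routine would then merge sites starting from the smallest entries in this matrix.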

Because the other data sets are very limited in terms of collection duration and species diversity, I am going to have to compare on a single species basis. The black crappie larval data will be compared to trawl data. The threadfin shad larval data will be compared to the push-netting data.

Thanks for your input! I appreciate the different perspective and approaches suggested.
 

bugman

Super Moderator
#4

Here are my core research questions:
1) Which variables are most important in driving larval abundance?
2) What spatial trends exist in larval abundance?
3) Is there a relationship between larval density and juvenile density?
Hi Hand Hunter

Here is my two cents' worth.

In answer to research questions:

1) If you are considering only one species at a time, this can be modelled with a multiple regression. The problem is that linearity is assumed, which is unlikely with species abundances, but you could look at the diagnostics and decide. Hard transformations might help.

Since you are only considering one species, this simplifies matters.

2) Your Arc methods sound good and cluster analysis sounds reasonable (I would follow this with a SIMPROF test to test significance amongst clusters).

3) As long as you have a standardised method and your residuals look ok, a simple linear regression should work here.
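To make point 3 concrete, here is a minimal sketch of a simple linear regression of juvenile density on larval density, returning the residuals so they can be inspected. The values are invented, and the log transform is just one assumed option for abundance data (it fails on zeros, so a constant would have to be added first if any are present).

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared, residuals).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared, residuals


# Hypothetical paired densities for one species (not real data).
larval_density = [0.8, 2.4, 1.1, 5.0, 3.2, 0.3]
yoy_density = [0.05, 0.11, 0.07, 0.20, 0.16, 0.02]

# Assumed transformation: natural log of both variables.
log_larval = [math.log(v) for v in larval_density]
log_yoy = [math.log(v) for v in yoy_density]

slope, intercept, r_squared, residuals = fit_line(log_larval, log_yoy)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
for x_value, residual in zip(log_larval, residuals):
    # Look for curvature or a funnel shape in these residuals.
    print(f"log larval = {x_value:6.3f}   residual = {residual:7.3f}")
```

If the residuals show a clear pattern against the predictor, the linear model is probably not adequate and a different transformation or model would be worth trying.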

Be aware that if you are looking at environmental factors on an NMDS with vector overlays, you are effectively looking at descriptives only.

RDA and multiple regression will allow variance partitioning and hypothesis testing.
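As a small illustration of variance partitioning in the multiple regression case, here is a sketch for exactly two predictors, where the full-model R^2 can be computed from the three pairwise correlations. The predictor names (`temperature`, `flow`) and all values are hypothetical, and this is only the two-predictor special case, not a general routine.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partition_variance(y, x1, x2):
    """Split the R^2 of y ~ x1 + x2 into unique and shared fractions.

    Assumes x1 and x2 are not perfectly correlated.
    """
    r_y1 = pearson(y, x1)
    r_y2 = pearson(y, x2)
    r_12 = pearson(x1, x2)
    # R^2 of the two-predictor model, from the pairwise correlations.
    full = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return {
        "full": full,
        "unique_x1": full - r_y2 ** 2,          # gained by adding x1 to x2
        "unique_x2": full - r_y1 ** 2,          # gained by adding x2 to x1
        "shared": r_y1 ** 2 + r_y2 ** 2 - full, # can be negative (suppression)
    }


# Hypothetical per-sample values (not real data).
temperature = [14, 16, 18, 20, 22, 24, 26, 28]
flow = [30, 42, 25, 38, 20, 33, 18, 27]
larval_density = [0.4, 0.5, 1.3, 1.2, 2.4, 2.1, 3.3, 3.0]

parts = partition_variance(larval_density, temperature, flow)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

The three fractions add up to the full R^2. With more than two predictors the same idea holds, but the R^2 values would come from fitting the nested models in a stats package.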

If I have misunderstood any of this let me know, but I hope this has given you a bit of help.

P.S. Re: the CCA, you are right, but it is worth comparing the two methods to see if either or both are giving you sensible results. I would be inclined to start with CCA and compare this to your NMDS plot. See how you go :D