Distribution, data aggregation/simplification




I'm working with fisheries data: I have the quantity, in kg, of each species caught per boat per day. One species is dominant, and my objective is to build models for its catches. One of these models would be a multiple regression and the other an artificial neural network.

So far I'm stuck with pretty basic stuff concerning the distribution of the data.

I want all the data for all the species in one spreadsheet to be able to use them together, right? But, for instance, on Jan 15 1988 the captures may be 15 kg of the target species, plus 5 kg of species x and 6 kg of species y. This means that every other species listed in the datasheet will have zero kg on that day. This situation happens a lot, and I end up with a spreadsheet containing more of these spurious zeros than actual captures.
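To make the layout concrete, the situation can be sketched in Python as follows. This is only an illustration under assumptions: the boat name, the second day, and `species_z` are made up, and only the 15/5/6 kg figures come from the example above. It turns one-record-per-capture data into the one-column-per-species table and counts how many cells are zeros that were filled in rather than observed.

```python
from collections import defaultdict

# Long format: one record per (date, boat, species) actually caught.
# Only the 15/5/6 kg values come from the example; the rest is hypothetical.
records = [
    ("1988-01-15", "boat_A", "target", 15.0),
    ("1988-01-15", "boat_A", "species_x", 5.0),
    ("1988-01-15", "boat_A", "species_y", 6.0),
    ("1988-01-16", "boat_A", "target", 22.0),
    ("1988-01-16", "boat_A", "species_z", 3.0),
]

species = sorted({r[2] for r in records})

# Wide format: one row per (date, boat), one column per species,
# with 0.0 filled in wherever a species was not recorded that day.
wide = defaultdict(lambda: dict.fromkeys(species, 0.0))
for date, boat, sp, kg in records:
    wide[(date, boat)][sp] += kg

n_cells = len(wide) * len(species)
n_zeros = sum(1 for row in wide.values() for v in row.values() if v == 0.0)
zero_fraction = n_zeros / n_cells
print(f"{n_zeros} of {n_cells} cells are filled-in zeros ({zero_fraction:.1%})")
```

With many species and few of them caught on any given day, `zero_fraction` grows quickly, which is the pattern described above.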

My first thought was to do a PCA or factor analysis to try to condense all these species into a smaller number of groups (and get rid of a lot of 'fake' zeros in the process). I've been reading webpages and a couple of statistics books and I simply can't figure out which method is the most appropriate, or what to do after using it. Can I just add up the captures in the columns that belong to the same group and use the aggregated data for the models?
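The aggregation step asked about here, adding up the columns that fall in the same group, would look roughly like this in Python. The group assignment is a placeholder, not the result of any analysis, and the sketch says nothing about whether summing is statistically appropriate; it only shows the mechanics.

```python
# Hypothetical assignment of species to groups (for example from a PCA,
# a cluster analysis, or expert knowledge). Placeholder labels only.
groups = {
    "species_x": "group_1",
    "species_y": "group_1",
    "species_z": "group_2",
}

# One row of the wide table: catches in kg for a single boat and day.
row = {"target": 15.0, "species_x": 5.0, "species_y": 6.0, "species_z": 0.0}


def aggregate(row, groups):
    """Sum the catches of all species that share a group label.

    Species without a group (here the target) keep their own column.
    """
    out = {}
    for sp, kg in row.items():
        key = groups.get(sp, sp)
        out[key] = out.get(key, 0.0) + kg
    return out


aggregated = aggregate(row, groups)
print(aggregated)
```

A grouped column is zero only when every species in the group is zero on that day, so the number of zeros can only go down as groups get larger.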

Anyway, to use either of these methods I need to know the distribution of each variable, don't I? As far as I understand, I have to choose between Pearson and Spearman correlation depending on whether the data are normal or not. Is it correct to apply a Kolmogorov-Smirnov test to these data? I wouldn't be looking at the real distribution, but rather at what I get after adding all the zeros.
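The worry about the added zeros can be illustrated with a small Python sketch. The catch values are made up, and the two correlation functions are written out by hand (Spearman as Pearson on average ranks) so that nothing depends on a particular statistics package. It compares the correlation between two species on days when both were caught with the correlation after padding both series with days on which neither was recorded.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks, so tied values (such as the many zeros) share one rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up catches (kg) of two species on days when both were caught.
x_caught = [5.0, 8.0, 12.0, 20.0]
y_caught = [9.0, 6.0, 4.0, 1.0]

# The same series padded with 10 days on which neither species was recorded.
x_all = x_caught + [0.0] * 10
y_all = y_caught + [0.0] * 10

print("caught days only:", pearson(x_caught, y_caught), spearman(x_caught, y_caught))
print("with added zeros:", pearson(x_all, y_all), spearman(x_all, y_all))
```

In this toy case the two species are negatively related on the days they were caught, yet both coefficients turn positive once the shared zeros are included, because the joint absences dominate. Any normality test run on the padded columns would likewise be describing the zeros more than the catches.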

One of my supervisors thinks that I should forget about the PCA and just do a cluster analysis, then use the results to combine the species into groups, while the other insists that cluster analysis is a semi-quantitative method and I shouldn't use it... advice, please?

I'm really confused about all of this and I would appreciate any help in clarifying these issues.

Thanks. :)