# Poisson GLM: data set use; influence of counts

#### Columbine

Hello Regression forum,

I am after an explanation, in layman's terms, of what is going on behind the scenes in a couple of GLM analyses I'm doing.

Data:
Counts of animals within certain habitat descriptors.

Aim:
Select the best model from the habitat descriptors I have.
(Depth, Slope, Substrate, Distance to Land)

I am using a Poisson GLM with a log link, with each descriptor entered as a categorical variable,
i.e. depth 0-5m, depth 5-10m, etc.

I would like to know how grouping the data affects the outcome.

For example, suppose I create two data sets from the original data set. The first counts the animals seen each day in the slope categories only, so the data will be:
Date, Count Slope 0-5%, Count Slope +5-10%, Count Slope +10-15%, Count Slope +15-20%
and run the analysis:
count ~ s5-10 + s10-15 + s15-20 where (s0-5) is the base

The second data set is created by counting the animals each day across all categories (depth, slope, habitat, distance to shore), but I still run the same analysis:

count ~ s5-10 + s10-15 + s15-20 where (s0-5) is the base

These have yielded different (though similar) results. Why, when they are essentially counting the same thing?

Here are the log-likelihoods and the exponentiated coefficients (I work in R):

Data set (1) that counts only slope:
'log Lik.' -1286.276 (df=4)
(Intercept) s10 s15 s20
14.1025641 0.2757576 0.2521212 0.1595455

Data set (2) that counts all fields but analysis is on only slope
'log Lik.' -6350.718 (df=4)
(Intercept) s10 s15 s20
19.0641248 0.7139646 0.7067560 0.4720909
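
To make the exponentiated coefficients above easier to read: under a log link, exp(intercept) is the expected count per row of data in the base category (slope 0-5%), and each other value is a multiplicative ratio relative to that base. Below is a minimal sketch of that arithmetic, written in Python purely for illustration (the fits themselves were done in R). The variable and function names are hypothetical; the numbers are copied from the two outputs above.

```python
# Exponentiated coefficients copied from the two R outputs above.
# The dictionary and function names are illustrative, not part of the analysis.
fit_slope_only = {"(Intercept)": 14.1025641, "s10": 0.2757576,
                  "s15": 0.2521212, "s20": 0.1595455}
fit_all_fields = {"(Intercept)": 19.0641248, "s10": 0.7139646,
                  "s15": 0.7067560, "s20": 0.4720909}


def expected_counts(exp_coefs):
    """Turn exp(coefficients) from a log-link model into expected counts.

    exp(intercept) is the expected count in the base category; every other
    exponentiated coefficient multiplies that base value.
    """
    base = exp_coefs["(Intercept)"]
    counts = {"s0-5 (base)": base}
    for name, ratio in exp_coefs.items():
        if name != "(Intercept)":
            counts[name] = base * ratio
    return counts


for label, fit in [("data set (1)", fit_slope_only),
                   ("data set (2)", fit_all_fields)]:
    print(label)
    for category, mu in expected_counts(fit).items():
        print(f"  {category}: {mu:.2f}")
```

Because the model contains only one categorical predictor, these expected counts are simply the mean count per row within each slope category, so they depend directly on how the rows of each data set were built.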

So the idea is to choose the best habitat descriptors that describe where the animals are.

I can do many different analyses and then compare them to see which is the best/ most appropriate.

If I use data set (2), which has all the counts, I can run ANOVAs to test between models. However, if I use AIC to compare models (AIC is based on the log-likelihood and is another valid way of comparing model fit), I get a different ranking of the best models than I get from the set of models fitted like (1), each to its own counts.
For example, if I create 15 models:
4 models with single descriptor (multiple categories) i.e. one that tests just slope as above, one for depth etc.
6 models that combine 2 descriptors
4 models that combine 3 descriptors and
1 model that has all 4 descriptors
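
The candidate set listed above can be enumerated mechanically. This is a minimal sketch using the four descriptor names from the Aim section; the printed formulas are only labels for the models, not R code that was actually run. Taking every non-empty combination of four descriptors gives 4 + 6 + 4 + 1 = 15 candidates.

```python
from itertools import combinations

# Descriptor names as given in the Aim section of this post.
descriptors = ["Depth", "Slope", "Substrate", "Distance to Land"]

# Every non-empty subset of descriptors: sizes 1, 2, 3 and 4.
candidate_models = [
    combo
    for size in range(1, len(descriptors) + 1)
    for combo in combinations(descriptors, size)
]

for combo in candidate_models:
    # Label only; in R each descriptor would be a factor column.
    print("count ~ " + " + ".join(combo))

print(len(candidate_models), "candidate models")
```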

If I fit each model to a data set built specifically for it (i.e. counts by slope only, then a model of slope), the models with fewer descriptors produce lower AICs than those with more descriptors.

The opposite was found for data set (2), made with all descriptors: the lowest AICs came from the analyses with 3 and 4 descriptors.
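
For reference, the AIC values being compared come straight from the log-likelihoods via AIC = -2 * logLik + 2 * df. The sketch below applies that formula to the two slope-only fits reported earlier; the function and variable names are hypothetical, and Python is used only for the arithmetic.

```python
def aic(log_lik, df):
    """Akaike information criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * log_lik + 2.0 * df


# Log-likelihoods and df copied from the R output earlier in this post.
aic_slope_only = aic(-1286.276, 4)   # data set (1): counts by slope only
aic_all_fields = aic(-6350.718, 4)   # data set (2): counts over all fields

print(f"AIC, data set (1): {aic_slope_only:.3f}")
print(f"AIC, data set (2): {aic_all_fields:.3f}")
```

Both fits have the same model formula and the same df, so the large gap comes entirely from the log-likelihood, which is computed over different rows of data in each case. The standard caution is that AIC is only meaningful for comparing models fitted to the same response data.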

There was a smaller trend within this: with data set (2), habitat was better than depth, then slope, then distance, while with data sets (1), slope was better than habitat, then depth, then distance.

Why?
The choice of descriptor is often arbitrary. As this is ecology, it is our best guess, and it is often precedent, availability of data, logistics and finance that influence our choice of descriptors.

What I am wondering is: what if, for example, Distance to Shore isn't viewed as "important" in the model, so I redo the counts without it, and the data then give me a different answer?

I hope I have explained this well enough for someone to devise an answer.