I've been collecting hosts in 3 different habitats, about 20 hosts per sampling trip. For each individual I've been recording the presence of a parasite (1 or 0, used to estimate prevalence) and its abundance (number of parasites per host). I'd like to know:

- Whether abundance/prevalence (DV) are the same across the 3 habitats (IV).
- Whether some other variables (temperature, soil type, etc.) (IV) affect the overall prevalence/abundance (DV) of the parasite.

I'd say I need to use a linear regression, but I'm afraid that the hosts collected on each trip might not be independent.

Since Generalized Estimating Equations (GEE) work at the population level (they estimate population-averaged effects), I was thinking about applying them, but I'd like to confirm this with you.

**Should I use GEE ('geeglm' {geepack}) using each sampling trip as a clustering vector?**
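To make the question concrete, here is a minimal sketch of the kind of call I have in mind. The data are simulated and every column name (`trip`, `habitat`, `temperature`, `infected`, `abundance`) is a placeholder, not my real data. The binomial and Poisson families and the exchangeable working correlation are only assumptions for illustration; whether they are appropriate is part of what I'm asking.

```r
library(geepack)

set.seed(1)
n_trips <- 30   # 3 habitats x 10 trips each (simulated)
n_hosts <- 20   # hosts per trip

# Simulated stand-in for the real data; rows of the same trip are contiguous,
# which geeglm expects for the clustering variable.
dat <- data.frame(
  trip        = rep(seq_len(n_trips), each = n_hosts),
  habitat     = factor(rep(rep(c("A", "B", "C"), each = 10), each = n_hosts)),
  temperature = rep(rnorm(n_trips, mean = 20, sd = 3), each = n_hosts)
)
dat$infected  <- rbinom(nrow(dat), size = 1, prob = 0.3)  # presence/absence
dat$abundance <- rnbinom(nrow(dat), mu = 2, size = 1)     # parasites per host

# Prevalence: binary outcome, sampling trip as the cluster
fit_prev <- geeglm(infected ~ habitat + temperature,
                   family = binomial, data = dat,
                   id = trip, corstr = "exchangeable")

# Abundance: count outcome, same clustering
fit_abun <- geeglm(abundance ~ habitat + temperature,
                   family = poisson, data = dat,
                   id = trip, corstr = "exchangeable")

summary(fit_prev)  # coefficients with robust (sandwich) standard errors
anova(fit_prev)    # sequential Wald tests for habitat and temperature
```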

More info:

3 habitats; about 10 sampling trips to each; 20 hosts per trip. About 600 hosts in total.

After fitting the linear regression ('glm'), the residuals are not normally distributed ('shapiro.test'). The Durbin-Watson test ('durbinWatsonTest' {car}) and the Breusch-Pagan test ('bptest' {lmtest}) both give p-values > 0.05.
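For reference, the checks were run roughly along these lines. This is a sketch on simulated data with placeholder column names, not my actual data or output. It uses `lm`, which gives the same fit as `glm` with its default Gaussian family.

```r
library(car)     # durbinWatsonTest
library(lmtest)  # bptest

set.seed(2)
# Simulated stand-in: 30 trips of 20 hosts, rows ordered by trip
dat <- data.frame(
  trip    = rep(1:30, each = 20),
  habitat = factor(rep(rep(c("A", "B", "C"), each = 10), each = 20))
)
dat$abundance <- rnbinom(nrow(dat), mu = 2, size = 1)  # parasites per host

fit <- lm(abundance ~ habitat, data = dat)

sw <- shapiro.test(residuals(fit))  # normality of residuals
dw <- durbinWatsonTest(fit)         # serial correlation; depends on row order
bp <- bptest(fit)                   # heteroscedasticity

sw$p.value
dw$p
bp$p.value
```

Note that the Durbin-Watson test only looks at correlation between consecutive rows, so it says little about within-trip clustering unless the rows happen to be ordered by trip.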

I'm using R, so any answer tailored to it would be more than welcome.