Dumbed down multiple linear regression

Kiwi_Tim

New Member
I have measurements of an ecological variable; let’s call it insect Species Diversity.
This has been measured across 22 catchments. We want to be able to predict what insect diversity might be in a specific catchment.
It’s been quite a while since I did grad school and my stats are rusty.
What I have done is produce my own dumbed down version of multiple linear regression within a simple linear regression model. What matters to us are ecological realities, rather than a model that is rigorous from a statistical theory perspective.
What I would like is for some members of this forum to identify problems with the methods I have used.
We have 22 samples of insect diversity from 22 different catchments, but we would like to predict insect diversity across 500 or so catchments.
We have a bunch of potential predictor variables for which we have run paired correlations. These predictor variables are percent land cover type within a catchment (land cover area / total catchment area). We have dropped out variables that appear to be highly correlated with variables that we are keeping for further analysis.
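The screening step above can be sketched in Python. This is only a minimal illustration, not the exact Excel procedure: the 0.8 cut-off, the variable names, and the cover values are all hypothetical, and the rule used here (keep a variable only if it is not strongly correlated with any variable already kept) is one assumed way of deciding which member of a correlated pair to drop.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.8):
    """Keep a variable only if |r| with every already-kept variable
    is at or below the (assumed) threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical percent land cover for five catchments.
cover = {
    "forest": [10.0, 20.0, 30.0, 40.0, 50.0],
    "native_forest": [9.0, 21.0, 29.0, 41.0, 52.0],
    "pasture": [30.0, 10.0, 40.0, 20.0, 25.0],
}
kept = drop_correlated(cover)
print(kept)
```

Note that the result depends on the order in which the variables are listed, because an earlier variable is always kept in preference to a later one it is correlated with.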
To the remaining p potential predictor variables we have applied weightings, so that
Score = variable1 * weighting1 + variable2 * weighting2 + … + variablep * weightingp
Score is then rescaled to the range 0-100, using Rescale = (Score - Score(min)) / (Score(max) - Score(min)) * 100.
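As a rough sketch, the weighted score and the 0-100 rescale could be written in Python as follows. The three predictors, their weightings, and the cover values are made up for illustration, and the minimum and maximum are taken over whichever set of catchments is passed in.

```python
def weighted_score(values, weights):
    """Sum of each predictor value times its weighting."""
    return sum(v * w for v, w in zip(values, weights))

def rescale_0_100(scores):
    """Min-max rescale so the lowest score is 0 and the highest is 100."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("all scores are identical; cannot rescale")
    return [(s - lo) / (hi - lo) * 100 for s in scores]

# Hypothetical percent land cover (three predictors) for three catchments.
catchments = [
    [60.0, 30.0, 10.0],
    [20.0, 50.0, 30.0],
    [40.0, 40.0, 20.0],
]
weights = [1.0, 0.5, -2.0]  # hypothetical weightings

scores = [weighted_score(row, weights) for row in catchments]
rescaled = rescale_0_100(scores)
print(scores)
print(rescaled)
```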
We then fit a simple linear regression of Species Diversity on Rescale. This is all done in MS Excel, so we have the regression chart with fitted line, and have R^2, s, and SSE in cells next to the chart.
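For anyone wanting to check the Excel numbers, here is a minimal Python sketch of the same simple regression. It assumes s is the standard error of the regression, sqrt(SSE / (n - 2)), and the data values are invented.

```python
from math import sqrt

def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with R^2, SSE and s."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return {
        "slope": slope,
        "intercept": intercept,
        "sse": sse,
        "r2": 1 - sse / sst,
        "s": sqrt(sse / (n - 2)),  # assumed definition of s
    }

# Hypothetical Rescale values and Species Diversity for five catchments.
rescale = [0.0, 25.0, 50.0, 75.0, 100.0]
diversity = [2.0, 3.0, 5.0, 4.0, 6.0]

fit = simple_regression(rescale, diversity)
print(fit)
```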
We have tried to adjust the weighting of each predictor to minimize SSE and s, and to maximize R^2. Because we used Excel, we can instantly see the change in model performance as the weightings are adjusted.
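The manual adjustment described above is roughly equivalent to the following Python sketch, which nudges one weighting at a time and keeps the change only if SSE drops. This is an assumed automation of the hand-tuning, not what was actually done in Excel; the step size, starting weightings, and data are hypothetical. The 0-100 rescale is left out because a linear rescale of the predictor does not change the SSE of the fitted line.

```python
def sse_for_weights(rows, y, weights):
    """SSE of a simple linear regression of y on the weighted score."""
    x = [sum(v * w for v, w in zip(row, weights)) for row in rows]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        # Flat score: the best the line can do is predict the mean of y.
        return sum((b - my) ** 2 for b in y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

def tune_weights(rows, y, weights, step=0.1, sweeps=50):
    """Adjust one weighting at a time, keeping only changes that lower SSE."""
    weights = list(weights)
    best = sse_for_weights(rows, y, weights)
    for _ in range(sweeps):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                sse = sse_for_weights(rows, y, trial)
                if sse < best:
                    weights, best = trial, sse
                    improved = True
        if not improved:
            break  # no single-step change helps any more
    return weights, best

# Hypothetical percent land cover and Species Diversity for five catchments.
rows = [
    [60.0, 30.0, 10.0],
    [20.0, 50.0, 30.0],
    [40.0, 40.0, 20.0],
    [10.0, 20.0, 70.0],
    [30.0, 60.0, 10.0],
]
diversity = [2.0, 3.0, 5.0, 4.0, 6.0]
start = [1.0, 0.0, 0.0]

start_sse = sse_for_weights(rows, diversity, start)
tuned, tuned_sse = tune_weights(rows, diversity, start)
print(start_sse, tuned_sse, tuned)
```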
We have a total of 19 predictor variables, but I get the smallest SSE if I drop 6 of them from the model (give them a weighting of zero). However, doing this does not make any sense from an ecological perspective. Because we know these variables have some ecological value, and have significant individual correlations with Rescale, we keep them in the model. The model predictions across catchments seem to make a lot more ecological sense when these 6 variables are included.
Some time ago I attempted to produce a multiple linear regression model, but found that it had a much lower R^2 than our Excel model above; also, its predictions seem to make less ecological sense.
Can I please have some opinions on the process I used above? I want to publish our model and results in a science journal, but would prefer to iron out criticisms now.
