Physiology grad student needing some help - linear regression through the origin

#1
Hi all. I'm a PhD student in physiology, and I seem to be suffering from mild amnesia when it comes to statistics. My old course notes are out of my reach at the moment, so I was hoping to get some help from this forum. It's an extremely simple question, I'm sure, but I just want to make sure that what I'm doing is right.

What I want to do is perform a linear regression on my data. The fit is great, but its intercept is not at the origin - it's slightly above it. For my analysis, however, it's important that the fit passes exactly through zero. (I'm well aware that in general, forcing a fit through zero is a bad idea, but it's essential for what I'm doing). I don't want to simply remove "b" from my linear fit (f = ax + b --> f= ax ) because that would reduce my correlation. Rather, I want to "re-fit" the data with (0,0) as a datapoint, *and* make sure that the fit passes through the origin (in other words, "b" would be zero, but the slope of my fit ("a") would be different from the slope of my original "non-forced" fit). Could anyone give me some hints as to what the correct way of doing this would be? I remember from undergrad courses that there was a "well-established", statistically acceptable way of doing this, but as I said above, my course notes are unavailable to me right now.

If it helps, I'm using Matlab for my statistical analysis. Thanks a lot in advance.
 
#2
What you do not want to do is actually what you should do.

There is always an option to not include a constant in your model, and thereby effectively fit your model with y = ax which is the same as y = 0 + ax, meaning that your line will go through the origin. However, your beta estimates will change slightly in most cases, but this should not be a problem I guess.
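
As a rough sketch of what "not including a constant" does, the snippet below fits both models to the same points. The data are made up for illustration and the function names are hypothetical; it is plain Python rather than Matlab. In Matlab, the backslash operator on column vectors (`a = x\y`) should give the same no-constant slope.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares for y = a*x + b. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def fit_through_origin(x, y):
    """Least squares for y = a*x (no constant). Returns the slope a."""
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return sum_xy / sum_xx


# Made-up example data (an assumption, not from this thread).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 4.1, 6.2, 8.0, 10.1]

a_full, b_full = fit_with_intercept(x, y)   # slope and intercept of y = a*x + b
a_zero = fit_through_origin(x, y)           # slope of y = a*x

print("with constant:    a =", a_full, " b =", b_full)
print("without constant: a =", a_zero)
```

With these points the intercept of the full model comes out slightly above zero, and the slope of the no-constant model differs a little from the slope of the full model, which is the "beta estimates will change slightly" effect described above.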

Don't add the (0,0) data point however, because that would be manipulation of the data which you want to avoid in scientific research as much as possible. Just unclick the 'Include constant in regression' box (or whatever it may be in your software) and run the regression again.

Another option would be to check the significance of your constant (the intercept, "b" in your notation). If it's not significant, then you can't reject H0 that b = 0, and you can therefore put forward a model in which b = 0. You say that b is slightly positive, so maybe it's insignificant. Remember that your sample is just a sample, so a constant of 0.05 or so would not be a problem if it's insignificant. The same holds for a constant of 20 if it is still insignificant. Samples have errors, but if you have a significant non-zero intercept, you should think about your theory and how it fits reality.
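
A minimal sketch of that check, assuming the usual OLS assumptions (independent errors with constant variance). It computes the standard error and t-statistic of the constant by hand; the data and the function name are made up for illustration. Most statistics packages report the same quantities along with a p-value.

```python
import math


def intercept_t_statistic(x, y):
    """Fit y = slope*x + const by OLS; return (const, std_error, t_stat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    const = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - (slope * xi + const)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    # Standard error of the constant term.
    std_error = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    return const, std_error, const / std_error


# Made-up example data (an assumption, not from this thread).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 4.1, 6.2, 8.0, 10.1]

const, std_error, t_stat = intercept_t_statistic(x, y)
print("constant =", const, " s.e. =", std_error, " t =", t_stat)

# Compare |t| with the two-sided critical value of the t distribution
# with n - 2 degrees of freedom (about 3.182 for 3 d.f. at the 5% level).
```

If |t| is below the critical value, H0 (constant = 0) cannot be rejected at that level. That is not proof that the constant is zero, only that the sample does not contradict it.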

But you can't just add datapoints. And why is it a problem that your correlation (R) changes? If you want to do regression, you have to respect the fact that the line the model returns is the best fit because it minimizes the sum of squared errors between the datapoints and the estimated points... If you exclude a constant, you'll still have the best fit possible for a y = ax model. But if you start manipulating, you will not get the best fit anymore... And you are in essence not doing OLS anymore.
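
To make the "best fit for a y = ax model" point concrete, here is a small sketch with made-up data and hypothetical helper names. It compares three things: refitting without a constant, keeping the slope of the full fit and simply dropping the constant, and adding (0,0) as an extra datapoint.

```python
def ols(x, y):
    """Ordinary least squares for y = a*x + b. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def slope_through_origin(x, y):
    """Least-squares slope for y = a*x (no constant)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def sse(x, y, a, b=0.0):
    """Sum of squared errors of the line y = a*x + b on the data."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


# Made-up example data (an assumption, not from this thread).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 4.1, 6.2, 8.0, 10.1]

a_full, b_full = ols(x, y)
a_zero = slope_through_origin(x, y)

sse_full = sse(x, y, a_full, b_full)   # full model y = a*x + b
sse_refit = sse(x, y, a_zero)          # refit as y = a*x
sse_dropped = sse(x, y, a_full)        # full-model slope, constant deleted

# Adding (0,0) as a datapoint still fits a constant, so the line is
# pulled toward the origin but not forced through it.
a_aug, b_aug = ols([0.0] + x, [0.0] + y)

print("SSE full / refit / dropped:", sse_full, sse_refit, sse_dropped)
print("constant after adding (0,0):", b_aug)
```

On these points the refit has a smaller sum of squared errors than the "drop the constant" shortcut, and the fit with the extra (0,0) point still has a non-zero constant, so it does not achieve what was intended anyway.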

So I'd go with excluding the constant, if it really has to be zero to fit your theory. Otherwise, there might be flaws in your theory: if your theory predicts a zero intercept but your empirical results don't return one, you should look into that.
 

TheEcologist

Global Moderator
#3
In addition to Riverdale27's comments: it might be a good idea to look into a non-linear fit to your data. If you're not sure which non-linear model is best, then post your scatterplot.

It could be very logical that a linear model gives a non-zero intercept where in real life that is impossible (when x is measured at zero, y = zero); you might just need to adjust your model.
 
#4
Good comment Ecologist, but I assumed a linear model is fine, because he stated that the fit was great... so nonlinear models will probably not have a decent fit... But he could try it anyway; maybe some other interesting things will come out of that analysis...
 

TheEcologist

Global Moderator
#6
so nonlinear models will probably not have a decent fit
Can't agree with that; I see no basis to assume a nonlinear model would have a less 'decent' fit. Based on experience, I'd say the opposite (especially with physiological data, which have inspired many a non-linear model, e.g. Michaelis-Menten, etc.). Though until I see the scatterplot, it's all speculation.

I still do believe that Pascal's trouble with the non-zero intercept can possibly be solved this way, as forcing a zero intercept does not suffice (due to loss of fit). Ergo it seems that a non-linear model is a more natural model for these data.
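
Purely as an illustration of a non-linear model that passes through the origin by construction (not a claim that it suits this particular dataset), the Michaelis-Menten form can be written as below. The symbols $V_{\max}$ and $K_m$ are the standard parameter names for that model and do not come from this thread.

```latex
y = \frac{V_{\max}\, x}{K_m + x},
\qquad
y(0) = \frac{V_{\max} \cdot 0}{K_m + 0} = 0
\quad (K_m > 0).
```

For small x the curve is approximately the straight line $y \approx (V_{\max}/K_m)\,x$, which is one reason a linear fit can look good over a limited range.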

Your solution is however, by far, the simplest :D
 
#7
Thank you all for your replies - they've been very helpful indeed. As for the linear vs non-linear debate: I'm working on a system (the cochlea) which is known to be non-linear, but which is almost always modeled as being linear (in a certain frequency range). So my choice of using a linear fit has as much to do with the fact that the linear fit "suffices" as with the fact that most of the existing models ignore the non-linearities.

Riverdale: as I read your message, I realized that adding (0,0) as a data-point was a terrible idea of mine indeed. Thanks for setting me straight on that one. I went ahead with fitting a simple y = ax line to the data, and am using that in the analysis at the moment. What I said about not "wanting to delete the constant" was a bit unclear, I guess - what I meant was that I didn't want to perform a normal y = ax + b fit and then simply delete the "b" term (shifting the entire line by the constant). For some reason I thought that simply fitting a y = ax line wouldn't do.

Again, thanks for the help everyone - this forum is grand. I wish it was around back when I was struggling with undergrad statistics ;).