Multivariate outliers

Hi, I hope some of you can help me a bit, or share what you would do. I've been working on my master's thesis and I can't get my head around how I should treat outliers. I am running a moderation analysis, and I checked all the detection measures available in SPSS (Cook's distance, leverage, Mahalanobis distance, ...). Some cases exceeded the critical values; however, when I checked their answers for abnormalities, nothing seemed strange. On visual inspection, one point really stands out, so I ran the moderation analysis twice: first with all of the data, then without this outlier. This happened:

Model with all of the data
            coeff    se(HC4)   t         p       LLCI     ULCI
constant    6.302    .272      23.140    .000    5.759    6.845
SDES        -.002    .034      -.074     .941    -.070    .065
DAQ_DP      .003     .029      .107      .915    -.054    .061
Int_1       .006     .003      1.840     .070    -.001    .013

Model without outlier
            coeff    se(HC4)   t         p       LLCI     ULCI
constant    6.375    .267      23.857    .000    5.842    6.908
SDES        -.016    .033      -.477     .634    -.081    .050
DAQ_DP      .011     .030      .386      .700    -.048    .070
Int_1       .005     .003      1.447     .152    -.002    .012
I suspect I am worrying over nothing: one point slightly changed the coefficient of determination, but is that change important? (I know nothing is significant; we were expecting that.)
I would love to get some help on how to decide ;)
Thank you very much!
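
The with/without comparison above can also be scripted outside SPSS. The following is only a minimal sketch in Python with NumPy, on simulated data: the variable names `x`, `w`, `y` and the planted extreme point are assumptions for illustration, not the thesis data, and it reports plain OLS coefficients rather than HC4 standard errors.

```python
import numpy as np

def fit_moderation(x, w, y):
    """OLS fit of y = b0 + b1*x + b2*w + b3*x*w; returns the four coefficients."""
    X = np.column_stack([np.ones_like(x), x, w, x * w])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (hypothetical stand-ins for predictor, moderator and outcome).
rng = np.random.default_rng(0)
n = 100
x = rng.normal(10, 3, n)
w = rng.normal(20, 5, n)
y = 6.3 + 0.005 * x * w + rng.normal(0, 1, n)

# Plant one extreme case at index 0 so there is something to remove.
x[0], w[0], y[0] = 25.0, 40.0, 20.0

full = fit_moderation(x, w, y)

# Refit with the suspicious case dropped.
keep = np.arange(n) != 0
reduced = fit_moderation(x[keep], w[keep], y[keep])

# Raw change in each coefficient caused by that single case.
change = full - reduced

print(f"{'term':<10}{'all data':>10}{'dropped':>10}{'change':>10}")
for name, b_all, b_red, d in zip(["constant", "x", "w", "x*w"], full, reduced, change):
    print(f"{name:<10}{b_all:>10.4f}{b_red:>10.4f}{d:>10.4f}")
```

Putting the change next to the coefficients makes it easier to judge whether a single case alters the substantive conclusion or only shifts the estimates within their confidence intervals.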


Active Member
"Outlier" does not necessarily mean "bad data point". Measures like Cook's distance, DFFITS and DFBETAS are highly informal, and researchers disagree on their thresholds. Even if some of those thresholds are taken as gospel, they work only for data where the residuals are normally distributed. This means that the thresholds are meaningless for most real-world data.

A more robust approach is to think about where your data came from and whether you can trust the values. "Extreme" does not mean "wrong". Otherwise our society would not have presidents, Nobel laureates and Victoria's Secret models.


Fortran must die
With an interval DV you can also run robust regression, although that is more complex. I have never heard measures like Cook's distance described as informal before :p It is a good point that these measures assume normality and that a lot of real-world data is not normal (data is often skewed).
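
As a rough illustration of what robust regression does, here is a sketch of one common variant, Huber M-estimation fitted by iteratively reweighted least squares, using only NumPy. The choice of Huber weights, the tuning constant `k=1.345`, the MAD scale estimate and the simulated data are all assumptions made for this example; other robust estimators exist and may suit a given data set better.

```python
import numpy as np

def huber_regression(X, y, k=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimate of regression coefficients via IRLS (illustrative sketch)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS solution
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust residual scale: median absolute deviation, rescaled for normal data.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, k/u for large ones.
        weights = np.minimum(1.0, k / np.maximum(u, 1e-12))
        sw = np.sqrt(weights)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Simulated data: true intercept 2.0, true slope 0.5.
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.5, n)

# Contaminate the five cases with the largest x values.
idx = np.argsort(x)[-5:]
y[idx] += 30.0

X = np.column_stack([np.ones(n), x])
ols = np.linalg.lstsq(X, y, rcond=None)[0]
robust = huber_regression(X, y)

print("OLS   intercept, slope:", ols)
print("Huber intercept, slope:", robust)
```

The robust fit downweights cases with large residuals instead of deleting them, so no case has to be declared an outlier beforehand.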


Ambassador to the humans
The measures don't assume normality. The 'accepted' thresholds for whether the measure is too big/small are usually based on normality assumptions.
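
To make the distinction concrete, here is a small NumPy sketch that computes leverage and Cook's distance directly from their algebraic definitions; nothing in the computation refers to a distribution. The cutoff `4 / n` used at the end is just one widely quoted rule of thumb, included as an assumption for illustration, and the data are simulated.

```python
import numpy as np

def influence_measures(X, y):
    """Return (leverage, Cook's distance) for an OLS fit of y on design matrix X."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverage: diagonal of the hat matrix X (X'X)^-1 X'.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    # Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2.
    cooks = resid**2 / (p * mse) * h / (1.0 - h) ** 2
    return h, cooks

# Simulated data with one planted extreme case at index 0.
rng = np.random.default_rng(2)
n = 50
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
x[0], y[0] = 6.0, -10.0

X = np.column_stack([np.ones(n), x])
h, cooks = influence_measures(X, y)

# The measures are pure algebra; the judgment enters only with the cutoff.
cutoff = 4 / n  # rule-of-thumb convention, not a universal standard
flagged = np.flatnonzero(cooks > cutoff)
print("largest Cook's distance at case", int(np.argmax(cooks)))
print("cases above 4/n:", flagged)
```

Changing `cutoff` changes which cases get flagged without changing a single computed value, which is why the choice of threshold, not the measure itself, carries the assumptions.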