Winsorization using SPSS

#1
Hello everyone! =)

I have to winsorize my data (replace outliers with the next highest/lowest score that is not an outlier).
I have been searching the web for a week already but could not find any explicit information about how to do it in SPSS.
Does anyone know how to apply winsorization in SPSS? If so, could you please explain the steps I have to follow to winsorize the data?
 
#2
The only way I can think of is to sort the data by the value of the variable and manually replace the maximum and minimum based on the content of the adjacent record. If you need to be able to restore the original sequence of the data, use the $CASENUM system variable in a COMPUTE statement to create a case-number variable before the sort.
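For anyone who wants to check the logic outside SPSS, here is a minimal Python sketch of the sort-and-replace idea. It assumes the outlier cutoffs are already known; the function name `winsorize_sorted` and the `lower`/`upper` parameters are made up for illustration and are not SPSS features. The original position of each value plays the role of the case-number variable.

```python
def winsorize_sorted(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest value inside it.

    `lower` and `upper` are assumed outlier cutoffs supplied by the analyst.
    """
    # Tag each value with its original position (the role of the case number),
    # then sort by value.
    cases = sorted(enumerate(values), key=lambda pair: pair[1])

    # The lowest and highest scores that are not outliers.
    inliers = [v for _, v in cases if lower <= v <= upper]
    if not inliers:
        raise ValueError("no values fall inside the cutoffs")
    lowest_ok, highest_ok = inliers[0], inliers[-1]

    # Pull every outlier in to the adjacent non-outlier value.
    replaced = [(i, min(max(v, lowest_ok), highest_ok)) for i, v in cases]

    # Restore the original order using the saved positions.
    replaced.sort(key=lambda pair: pair[0])
    return [v for _, v in replaced]


scores = [12, 15, 14, 98, 13, 2, 16]
print(winsorize_sorted(scores, lower=10, upper=20))
```

With these example scores and cutoffs, 98 is replaced by 16 and 2 by 12, and the remaining values keep their original positions.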
 
#3
Thank you for the quick reply. I have tried to replace the values manually; however, I believe I am not doing it right.

First I ranked the variables; then I copied the nearest value that is not an outlier and pasted it over the value of the outlier (basically using copy/paste).

I believe that is not the right approach, because after I replaced all the outliers with the different values and ran the regression again, my model became highly insignificant (even though it was significant before I changed the values of the outliers).

What do you think about it? Do you know any other way to replace the values (and which steps I have to follow to do it)?
 
#4
Whether or not this result is reasonable depends on your sample (including its size) and how extreme the outliers were. I think both results should be reported together with a scatter plot if that is appropriate for your data. There are several notorious examples of outliers completely distorting the interpretation of some results. That said, outliers should not be ignored and are sometimes the most interesting aspects of a study, as long as they are physically real and not just data entry errors, calibration errors, etc. My own feeling is that the only outliers that should be removed are those that are proven errors or physically impossible. Other outliers might be an indication of multiple populations within a sample.
 
#5
I should add that there is some argument that trimming is better than winsorizing. With trimming you might just remove the two, four, six, etc. most extreme values. Another, more common, approach is by percentage, e.g. remove the most extreme 5% of the values. In any case I still think you should also report the results of analyzing the raw data, after removal of known errors.
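To make the difference from winsorizing concrete, here is a small Python sketch of percentage trimming. It assumes the stated fraction is removed from each tail, which is only one reading of "the most extreme 5%"; the function name `trim` is hypothetical.

```python
def trim(values, fraction):
    """Drop the given fraction of cases from EACH tail (an assumption).

    Returns the surviving values in sorted order, so the original
    case order is not preserved.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    k = int(len(values) * fraction)  # number of cases cut per tail
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]


raw = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
trimmed = trim(raw, 0.1)
print(sum(raw) / len(raw))          # mean of the raw data
print(sum(trimmed) / len(trimmed))  # mean after trimming one case per tail
```

Unlike winsorizing, trimming reduces the sample size, which is worth keeping in mind when comparing the two sets of results.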
 
#6
I do quite a lot of data analysis using Winsorization due to outliers in research on psychopathology, which often involves very skewed distributions. These are very tricky decisions, and if you are not experienced, you may need advice from someone familiar with research on the type of data you are analyzing. Winsorization of data can definitely make a significant regression become non-significant. I agree with Robert that simply removing (or replacing) outliers can be very problematic because you are eliminating or changing meaningful data. In some cases, I do Winsorize because the extreme values have a distorting effect on statistics. Ultimately, you want to understand what is happening in the data and represent it without distortion. There are different ways to Winsorize, and if the effect you first observed is real, then you may need to Winsorize differently, say by replacing values that are beyond the 99th percentile rather than the 95th percentile.
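As a rough illustration of how the choice of percentile changes what gets replaced, here is a Python sketch. It assumes a symmetric rule (values above the upper percentile and below the matching lower percentile are pulled in) and a nearest-rank percentile definition; both are assumptions, and SPSS may compute percentiles differently. The function names are hypothetical.

```python
import math


def percentile_nearest_rank(ordered, pct):
    """Nearest-rank percentile of an already sorted list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def winsorize_percentile(values, upper_pct):
    """Pull values beyond the upper_pct / (100 - upper_pct) percentiles inward."""
    if not 50 < upper_pct <= 100:
        raise ValueError("upper_pct must be in (50, 100]")
    ordered = sorted(values)
    high = percentile_nearest_rank(ordered, upper_pct)
    low = percentile_nearest_rank(ordered, 100 - upper_pct)
    # Clamp each value to [low, high]; original order is kept.
    return [min(max(v, low), high) for v in values]


data = list(range(1, 101))
for pct in (95, 99):
    w = winsorize_percentile(data, pct)
    changed = sum(1 for old, new in zip(data, w) if old != new)
    print(pct, min(w), max(w), changed)
```

Running it on the same variable at both cutoffs shows how many cases each rule touches, which helps when deciding how aggressive the winsorization should be.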
 
#7
Irrespective of the Winsorizing questions above, the actual procedure should be fairly straightforward. Assuming you're letting SPSS determine the outliers, then you know what the cutoff points are, yes? (Analyze > Descriptive Statistics > Explore > Statistics > Outliers).

Then select Transform > Recode into Different Variables > (assign a new variable name) > Old and New Values > Range, value through HIGHEST > (enter your extreme value from above) > (enter your replacement value in the new value box) > Add > All other values > Copy old value(s) > Add > Continue

If you've got low values, add an entry for Range, LOWEST through value as well.

Isn't that what you're trying to do?
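In case it helps to see what that recode does to the values, here is a minimal Python sketch of the same logic. The cutoffs and replacement values are made-up numbers, the function name `recode` is hypothetical, and the ranges are treated as inclusive here.

```python
def recode(values, high_cutoff, high_replacement,
           low_cutoff=None, low_replacement=None):
    """Mimic 'recode into a different variable' on a list of scores.

    - high_cutoff through HIGHEST  -> high_replacement
    - LOWEST through low_cutoff    -> low_replacement (optional)
    - all other values             -> copied unchanged
    """
    recoded = []
    for v in values:
        if v >= high_cutoff:
            recoded.append(high_replacement)
        elif low_cutoff is not None and v <= low_cutoff:
            recoded.append(low_replacement)
        else:
            recoded.append(v)
    return recoded


old_scores = [12, 15, 98, 2]
new_scores = recode(old_scores, high_cutoff=98, high_replacement=16,
                    low_cutoff=2, low_replacement=12)
print(new_scores)
```

The original list is left untouched and the result goes into a new one, which mirrors writing the recoded values to a new variable so the raw data stay available.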
 

Jinn
#8
Hello Mr. Jones, this ties in with your mention of multiple populations within one sample. I believe I am working with such data. My concern is how to verify whether that is actually the case or whether they are just simple outliers. Do you have any comments on this? Any insight on this issue is highly welcome.