Do I even need a statistical method??

#1
After rethinking my problem with finding outliers in some GPS/SONAR data, I started to wonder whether I even need a statistical test for outliers at all. What I am trying to do is identify potentially erroneous readings and remove them from the data.

If I were to remove a valid data point, the impact on the resulting chart would be virtually nothing (I would be removing one of thousands of data points). So I wonder if all I need to do is simply identify suspicious data and toss it out. If it was bad, I will markedly improve my map; if it was good, I will hardly degrade the map at all.

I will try a test of this over the weekend to see what happens if I remove some valid data.

Any thoughts?

Jerry
 
#3
Jerry,

I don't think it's a problem to toss out the extreme outliers and form a new data set, as long as you keep most (>99%) of the data points. If there's any problem with the map, you can always go back and analyze the original data.

Tossing out outliers would be a no-no if you were doing biomedical studies, especially clinical trials, where sample size is usually small and drug interactions, however rare, are critical. Your study should be much more robust to that kind of trimming. :D
 
#4
Thanks, guys.

I think I'll try this over the weekend:

Take one of the data sets that I have been using for testing (which contains an extreme outlier) and remove the outlier, then remake the map. Then I will remove three other data points at random and remake the map again. Then I'll post the three maps so that we can compare the effect of removing a known outlier with the effect of removing valid data points from the set.
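In case it's useful, here's roughly how I plan to script that, assuming the soundings are in a CSV with lat, lon, and depth columns (the file and column names are just placeholders, and I'm treating the single deepest reading as the known outlier for the sketch):

```python
import pandas as pd

# Placeholder file/column names -- the real data set will differ.
df = pd.read_csv("soundings.csv")  # columns: lat, lon, depth

# Map 1: drop the known extreme outlier (here taken to be the single deepest reading).
outlier_idx = df["depth"].idxmax()
df_no_outlier = df.drop(index=outlier_idx)
df_no_outlier.to_csv("soundings_no_outlier.csv", index=False)

# Map 2: additionally drop three valid points chosen at random.
random_rows = df_no_outlier.sample(n=3, random_state=1)
df_random_drop = df_no_outlier.drop(index=random_rows.index)
df_random_drop.to_csv("soundings_random_drop.csv", index=False)
```

Remaking the map from the original file and from the two trimmed files should show the two effects side by side.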

I know that removing the outlier from that set will have a dramatic effect; we'll see about removing some good data. This file is a relatively small data set, ~2000 points if I recall correctly, taken over roughly a one-mile area, so any negative effects should show up nicely.

Cheers,
Jerry
 
#5
Results of the test

Hi,

So I tested the effect of removing the suspicious data points and then replotted.

The original problems created by the bad data were cleaned up nicely!

So I went ahead and did some trimming of the data: I removed all measurements with depth values below 2 feet, since the depths from 2 feet up to the shoreline do not really add anything to the map's purpose.

The result: no significant change except for a "smoothing" effect on the shallowest contour line, which is good.
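For anyone curious, the depth trim itself is a one-liner once the data is loaded as a table; a minimal sketch with the same placeholder file/column names as before, depths in feet:

```python
import pandas as pd

df = pd.read_csv("soundings.csv")  # placeholder name; depth assumed to be in feet

# Keep only readings of 2 ft or deeper; the near-shore shallows add nothing to this map.
df_trimmed = df[df["depth"] >= 2.0]
df_trimmed.to_csv("soundings_trimmed.csv", index=False)
```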

So then I trimmed some more: I removed 20% of the data. Using a convenience selection process, I removed every fifth data point from the list while it was sorted according to depth, so I should have removed roughly the same number of points in every depth range. The result: some smoothing and almost no loss of definition in the contours.
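A sketch of that decimation, again with placeholder names: sort by depth, then drop every fifth row, so roughly 20% comes out of every depth range:

```python
import pandas as pd

df = pd.read_csv("soundings_trimmed.csv")  # placeholder: output of the depth trim above

# Sort by depth, then drop every fifth row (indices 0, 5, 10, ...), i.e. ~20% of each depth range.
df_sorted = df.sort_values("depth").reset_index(drop=True)
df_decimated = df_sorted[df_sorted.index % 5 != 0]
df_decimated.to_csv("soundings_decimated.csv", index=False)
```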

This data set contained just short of 5000 data points at the start and ended up with about 3000 points over an area of about 4 square miles.

I think I am satisfied that simply removing any and all suspicious data will not harm the overall utility of the map, so I am going to work on a method for detecting suspicious data.

Any input is most welcome.

Cheers,
Jerry
 
#6
Jerry,

I think removing data systematically is OK. You can also remove data at random. As long as your map is useful, the less computation the better; just my $.02. :yup:
 
#7
Quark,

I found a textbook with a method for detecting outliers in multidimensional data. The method given is basically the one I had in mind to try anyway (although fully developed ;) ), so I am going to follow my idea through and see which points it flags in the data set I just worked with. There is one section of data where I am worried about removing useful and needed information, so we'll see whether that data makes it through or not.
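To give a flavor of what I mean, here is a minimal sketch of one common multivariate outlier score (squared Mahalanobis distance from the centroid); I'm not claiming it's exactly the textbook's method, and the column names are the same placeholders as before:

```python
import numpy as np
import pandas as pd

# Placeholder file/column names; Mahalanobis distance is one common choice,
# not necessarily the textbook's exact procedure.
df = pd.read_csv("soundings.csv")
X = df[["lat", "lon", "depth"]].to_numpy()

# Squared Mahalanobis distance of each point from the centroid of the data.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag the most extreme points for manual review rather than automatic deletion.
df["suspicious"] = d2 > np.percentile(d2, 99.5)
print(df[df["suspicious"]])
```

Reviewing the flagged points by hand before dropping them should protect the section of data I'm worried about.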

Thanks again for the interest.

Jerry
 