need a method for detecting outliers

#1
Hi,

i just joined this group and hope to find some help. i'll begin with a brief introduction of myself, then my question:

i am a professor of mathematics and physics at a two-year college and an avid fisherman. I have been working on a personal research project that involves making hydrographic maps from GPS/SONAR data using consumer equipment. My statistics background is limited to that needed for a BS in math and an MS in science education, so i know the basics and can execute the mathematics once i have the right method at hand, and i can usually spot when i don't have the right method.

so, now the question:

the data is latitude, longitude and depth of water. this can be thought of as (x,y,z) triplets where the water depth is always recorded as a positive real number. the lat and long are in a geographical unit known as Mercator meters (integers), but i don't think that will matter. lat and long are independent and depth is dependent.

the difficulty that i need to solve is how to detect outliers in the depth value as collected by the sonar unit. on occasion the sonar will record a "bad" data point; this typically happens when there is a sunken tree, for example. the sonar will record the depth of the top of the tree rather than the depth of the bottom of the lake where the tree is resting, so the depth reading might be off by 20 feet or more. in a small data set i can simply plot all of the data, see where the outliers are influencing the contours, and then weed them out of the data and get a good map. the problem with that method is that once i have this all working the way i want, the data sets will likely grow into millions of data points and manual cleanup will no longer be practical.

also, the data is collected by driving a boat around on the lake and collecting the GPS and SONAR data along the path of the boat.

so, does anyone have any insight into this type of outlier detection? i have so far been unable to find anyone who has published on this type of problem, though my search has not been exhaustive. i do have a book coming on interlibrary loan which might contain something useful.

i do have an idea that i would post for discussion if no one has any ideas.

thanks for any help,
jerry
 
#3
quark,

i'm not sure how to answer your question, because i'm not sure which of two things you're suggesting. one is to only consider depth in relation to other depths. in that case i know the answer is no: it won't work, because depths from 1 foot to 100 feet are likely in the same lake, so a depth reading of 20 that should have been 35 would not stand out of the overall data. the other is to only look for outliers in one dimension of the three dimensional data (i.e. i am not worried about outliers in x and y).
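To put numbers on why a lake-wide comparison fails, here is a small sketch in Python. The depths are made up for illustration (a track running evenly from 1 to 100 feet), and the 3-sigma cutoff is just a common rule of thumb, not something from this thread.

```python
from statistics import mean, stdev

# Hypothetical depths in feet: one reading at each depth from 1 ft to 100 ft.
depths = [float(d) for d in range(1, 101)]

# Assume the true depth at index 34 is 35 ft, but the sonar
# returned 20 ft because it hit the top of a sunken tree.
bad_index = 34
depths[bad_index] = 20.0

# Global z-score: compare the bad reading to ALL depths in the lake.
mu = mean(depths)
sigma = stdev(depths)
z = (depths[bad_index] - mu) / sigma

# A common rule of thumb flags |z| > 3 as an outlier.
flagged = abs(z) > 3.0
print(f"z-score of the bad reading: {z:.2f}, flagged: {flagged}")
```

The bad reading sits only about one standard deviation from the lake-wide mean, so a test that ignores position never sees it.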

if you meant the second then yes that is exactly what i want to do. only look for outliers in z. this is unlike looking for outliers in a set of data that are distributed about a mean. let me try another way to describe what is going on.

think of an experiment where x and y are measured and we find the least squares fit line. the data will be distributed about the line, some very near, some further away. suppose that one data point was substantially far from the line; that is the point that i want to find. but i am working in x,y,z, so think of a soup bowl as the model for a best fit surface to lake bottom data. as i collect data my depth points will vary around the shape of the bowl, but every now and then i have one erroneous reading that is well above the bowl (shallower than the true bottom). i want to find that reading and remove it.
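One way to sketch the "distance from the bowl" idea without fitting a global surface is to use the median depth of nearby points as a local stand-in for the bowl, then flag any reading that is much shallower than that. This is a swapped-in simplification, not a least squares fit. The function name, the radius, the max_rise threshold and the synthetic bowl below are all assumptions for illustration, and the brute-force neighbor search would need a spatial index for millions of points.

```python
from statistics import median

def flag_shallow_outliers(points, radius, max_rise):
    """Flag readings much shallower than the local median depth.

    points   : list of (x, y, depth) tuples
    radius   : neighbor search radius, same units as x and y
    max_rise : how far above the local bottom a reading may sit
               before it is flagged, same units as depth
    """
    r2 = radius * radius
    flags = []
    for i, (xi, yi, zi) in enumerate(points):
        # Depths of all other points within the search radius (O(n^2) sketch).
        neighbours = [z for j, (x, y, z) in enumerate(points)
                      if j != i and (x - xi) ** 2 + (y - yi) ** 2 <= r2]
        if len(neighbours) < 3:
            flags.append(False)  # too few neighbors to judge
            continue
        # Positive residual means the reading is shallower than its surroundings.
        flags.append(median(neighbours) - zi > max_rise)
    return flags

# Synthetic soup bowl: 40 deep in the middle, shallower toward the edges.
coords = range(-30, 31, 5)
points = [(x, y, 40.0 - 0.01 * (x * x + y * y)) for x in coords for y in coords]

# Plant one bad reading at the center: 20 instead of 40.
center = 6 * len(coords) + 6          # index of the point at (0, 0)
points[center] = (0, 0, 20.0)

flags = flag_shallow_outliers(points, radius=12, max_rise=10)
print([points[i] for i, f in enumerate(flags) if f])
```

The median is used rather than the mean so that the bad reading itself does not drag down the estimate for its neighbors.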

i hope this helps to clarify what i am working on.

thanks for the response, i will read through that article.

cheers
jerry
 
#5
Jerry,

I was thinking of looking at depth in relation to other depths and ignoring the x and y, but that won't work on your soup bowl model. I wonder if it is possible to divide your data into groups based on the shape (x and y). In the center of the bowl we look at depth only. On the periphery we look at all three variables.

I think your data is similar to time series data. Adjacent data points are correlated in a time series. That's true in your case as well, only the relationship is spatial, not temporal. It may be helpful to see how outliers are identified in time series as well.

Thanks for sharing this interesting problem.
 
#6
quark,

i think you may be onto something.

a quick google on time series uncovered an article about detecting outliers that looks like it may have promise. in particular it discusses a method for finding "additive outliers", which are outliers that do not affect the measurements before or after themselves. i think this strategy could work since the data IS collected as a series of measurements over time that should vary in a reasonable way. the data is recorded in order of collection, so i would think it could be handled as a time series by the method of this article if we think of x, y, z as three parameters measured at a specific time in a series of many such measurements. looking for outliers in the z values would seem to fit what the author is talking about.
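As a rough illustration of treating the depths as a series in collection order, here is a sketch of a Hampel-style filter: each reading is compared to the median of a short window of readings before and after it. This is a generic technique for isolated spikes, not the method from the article. The function name, window size, threshold and sample track are assumptions for illustration.

```python
from statistics import median

def hampel_flags(depths, half_window=3, n_sigmas=3.0):
    """Flag isolated spikes in a sequence of depths (in collection order).

    A reading is flagged when it differs from the median of its window
    by more than n_sigmas robust standard deviations, where the robust
    scale is 1.4826 * MAD (median absolute deviation).
    """
    flags = [False] * len(depths)
    for i in range(len(depths)):
        lo = max(0, i - half_window)
        hi = min(len(depths), i + half_window + 1)
        window = depths[lo:hi]
        med = median(window)
        mad = median([abs(d - med) for d in window])
        scale = 1.4826 * mad
        # scale == 0 means the window is essentially flat; skip to avoid
        # flagging on a zero threshold.
        if scale > 0 and abs(depths[i] - med) > n_sigmas * scale:
            flags[i] = True
    return flags

# Hypothetical track: the bottom drops steadily from 30 ft to 39.5 ft.
track = [30.0 + 0.5 * i for i in range(20)]

# One reading hits a tree top: 15 ft where the bottom is at 35 ft.
track[10] = 15.0

spikes = hampel_flags(track)
print([i for i, f in enumerate(spikes) if f])
```

Because the spike does not change the readings on either side of it, the window median stays close to the true bottom and only the one bad reading is flagged.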

i'll read the article another time or two and see if it still seems to make sense. here is a link:

http://aws.tt.utu.fi/tolvi2.pdf

thanks for the input, i'll report back.

jerry

PS: glad you found this to be interesting!
 
#7
I have been doing some reading on this and i got to thinking that i may need to take another approach to the problem with the data, rather than looking for outliers.

i am going to make a new post on my alternate thinking.

jerry