# Outliers

#### Almostasmartstudent

##### New Member
Hey guys,

My teacher told me to make a scatterplot of Cook's distance against leverage to detect outliers, but I don't know which cluster of dots counts as outliers. Are all the dots on the right outliers, or just the ones above 0.01? Can the "middle" group be considered outliers? Can every dot above 0.01 be considered an outlier?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Just curious how many dots are in that SW corner?

#### Almostasmartstudent

##### New Member
> Just curious how many dots are in that SW corner?

N = 1045, so I don't know actually. Because where does the "corner" end? haha

#### hlsmith

##### Less is more. Stay pure. Stay poor.
If you made the figure, can you add some transparency to the dots? I would say that if only 1-2% of the dots are out on the fringe, then yes, they are outliers or come from a slightly different data-generating process. However, given the sample size, they likely don't have much leverage or influence.
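One way to answer the "how many dots" question without reading it off the plot is to count them directly. This is only a minimal sketch: it assumes the leverage and Cook's distance values are already available as plain lists, the function name `fringe_share` is made up for illustration, and the cutoffs are placeholders (0.01 is the value from the original question, the leverage cutoff is arbitrary), not recommended thresholds.

```python
def fringe_share(leverage, cooks_d, lev_cut, cook_cut):
    """Return (count, share) of points outside the dense lower-left corner.

    A point counts as 'on the fringe' if its leverage OR its Cook's
    distance exceeds the given cutoff. Both cutoffs are illustrative
    assumptions, not established thresholds.
    """
    if len(leverage) != len(cooks_d):
        raise ValueError("leverage and cooks_d must have the same length")
    n = len(leverage)
    count = sum(
        1 for h, d in zip(leverage, cooks_d) if h > lev_cut or d > cook_cut
    )
    return count, (count / n if n else 0.0)


# Toy values, not the poster's data.
lev = [0.001, 0.002, 0.001, 0.020, 0.003]
cook = [0.0005, 0.0010, 0.0150, 0.0020, 0.0004]
count, share = fringe_share(lev, cook, lev_cut=0.01, cook_cut=0.01)
print(count, share)
```

If the share comes out around 1-2%, that matches the "few points on the fringe" situation described above.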

#### noetsi

##### Fortran must die
There is no agreed-upon way to detect outliers, or rather there are many ways and none of them is agreed on. You can look at specific values of standardized or studentized residuals; I think the rule of thumb is an absolute value of 3 or greater, but you can look this up. There are also rules for when Cook's distance matters; again, there are many and no consensus.
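To make the quantities in these rules of thumb concrete, here is a rough sketch for a simple regression with one predictor, written in plain Python. The function name and the toy data are made up for illustration. The |residual| > 3 cutoff is the one mentioned above; the 4/n cutoff for Cook's distance is just one commonly quoted convention and is an assumption here, not something from this thread.

```python
import math


def simple_ols_diagnostics(x, y):
    """Leverage, internally studentized residuals and Cook's distance
    for the model y = a + b*x fitted by ordinary least squares."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    # Leverage of each point (diagonal of the hat matrix).
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # Internally studentized residuals.
    stud = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, lev)]
    # Cook's distance from studentized residual and leverage.
    cooks = [r * r / p * h / (1 - h) for r, h in zip(stud, lev)]
    return lev, stud, cooks


# Toy data: a straight line with small noise and one wild last point.
x = list(range(1, 11))
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, 8.0]
y = [2 * xi + 1 + e for xi, e in zip(x, noise)]

lev, stud, cooks = simple_ols_diagnostics(x, y)
n = len(x)
# Assumed cutoffs: |studentized residual| > 3, or Cook's distance > 4/n.
flagged = [i for i in range(n) if abs(stud[i]) > 3 or cooks[i] > 4 / n]
print(flagged)
```

Different cutoffs will flag different points, which is exactly the lack of consensus described above.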

#### hlsmith

##### Less is more. Stay pure. Stay poor.
The big question is whether they are valid measurements or errors. If they are valid and that extreme, they may have come from a different generating function. Say you are looking at salaries and you have a couple of people with advanced degrees in the sample space. Well, big picture, who cares; it all comes down to whom you want to infer your results to. If they are not valid, what caused them to be off? If they are valid but outside your target sample space, remove them as well and be transparent about it. If they are within your sample space and you know why they are extreme, add an indicator term to the model to address this source of variability in the DV. That is it.

Like @noetsi said, there are general rules, but who cares. It comes down to the purpose of your analyses!!! All systems should expect some spread in the data, but knowing why it is there and whether points are beyond normal dispersion is important, and may require contextual knowledge.
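The indicator-term suggestion can be sketched as follows, using the salary example with an advanced-degree flag. Everything here is hypothetical: the numbers are invented and noise-free so the result is easy to check, and the helper names `solve_linear` and `fit_ols` are made up. In practice you would use a regression routine from a statistics package; the hand-rolled least-squares solver only keeps the example self-contained.

```python
def solve_linear(a, b):
    """Solve a * beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations X'X beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Invented data: salary rises with experience, plus a fixed bump for
# the two people with an advanced degree (the "extreme" points).
years = [1, 2, 3, 4, 5, 6, 7, 8]
degree = [0, 0, 0, 1, 0, 0, 1, 0]
salary = [30 + 2 * x + 15 * d for x, d in zip(years, degree)]

# Model without the indicator: intercept and years only.
plain = fit_ols([[1.0, x] for x in years], salary)
# Model with the indicator: intercept, years, degree.
with_flag = fit_ols([[1.0, x, d] for x, d in zip(years, degree)], salary)
print(plain)
print(with_flag)
```

With the indicator in the model, the bump for the advanced-degree group is absorbed by its own coefficient instead of showing up as two large residuals.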