Am I dealing with outliers, or something else (skewness of 106)?

#1
So this is not exactly a homework problem, but I just discovered how fun statistical modelling is, and I usually work with datasets that are already clean.
However, I am now dealing with a credit default dataset that a lecturer showed me as a challenge. I want to fit a logistic regression, a random forest, and an XGBoost model where the dependent variable is whether a company has defaulted or not, and there are 18 other variables in the dataset to use as predictors.
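For context, here is roughly the model setup I have in mind in R (a sketch only; `credit_data` and the column names are placeholders for my actual data):

```r
# Sketch of the planned logistic regression (placeholder data frame and column names)
fit_logit <- glm(defaulted ~ equity + total_assets + revenue,  # plus the other predictors
                 data = credit_data, family = binomial())
summary(fit_logit)
```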

Three of the variables I want to use stand out (equity, total assets, and revenue); see the attached picture of their boxplots below.


At first I thought I was dealing with outliers, but when I removed the outliers in these variables with a command in R, I was left with just 30 000 observations. I also tried capping and flooring values beyond 1.5 times the interquartile range, but that gave questionable results too. I checked the skewness and got 106 for equity.
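For reference, this is roughly what I tried (a sketch, with `equity` standing in for each of the three variables; skewness from the e1071 package):

```r
library(e1071)  # for skewness()

x <- credit_data$equity

# Skewness check -- this is where I got roughly 106
skewness(x, na.rm = TRUE)

# Cap and floor at 1.5 * IQR beyond the quartiles (winsorizing)
q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
x_capped <- pmin(pmax(x, q[1] - 1.5 * iqr), q[2] + 1.5 * iqr)
```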

Aside from log transformations, what other ways are there to deal with this? I am relatively new to R, and with so many packages it's easy to get lost sometimes. Answers don't have to involve R, but if you know packages that make this data-cleaning part easier, that's a plus.
 

[Attachment: boxplots of equity, total assets, and revenue]


obh

#2
Hi Aite,

You should "deal" with outliers only if you can identify mistakes (experimental mistakes, human error, measurement mistakes).

Example: a teacher measured the heights of the kids in a class (cm): 120, 133, 111, 109, 141, 741, 122.
Clearly you should remove the 741 or measure that child again.
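In R, for example, you could flag it with a simple plausibility check (toy numbers from above):

```r
heights <- c(120, 133, 111, 109, 141, 741, 122)
# Flag anything outside a physically plausible range for child heights (cm)
heights[heights < 50 | heights > 250]   # returns 741 -- clearly a recording error
```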

If you remove valid outliers, you damage your data!
 
#3
I don't know that I've ever seen a skew of 106 in real data -- cool! (Until you have to analyze it.)

I gather a log transform didn't work.

People sometimes convert data like this into quarters or fifths (quartiles or quintiles).

In some cases a dichotomy makes sense, *if* there is a logical cutpoint. You might argue that for purposes of predicting who buys a fancy yacht, all values of $500,000 and above are equal, and cut it there. Done right, this can be helpful (because you eliminate a lot of useless variation). Arbitrary dichotomies cost a lot of power, however.
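If you do go the dichotomy route, it is a one-liner in R (the variable name and the $500,000 cutpoint are just the yacht illustration above, not a recommendation for your data):

```r
# Dichotomize at a substantively justified cutpoint (illustrative values only)
buyers$big_spender <- as.integer(buyers$purchase_value >= 500000)
```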

I think OBH may have overstated the case against outlier control. It is OK to do an analysis without non-error outliers, as long as you clearly state that you are doing so. Like OBH, I normally much prefer to keep them in and either do a nonparametric test or take some other approach. But once in a while, especially with small sample sizes, it is helpful to see what would happen if that one odd value were removed. Your report MUST say that you did that, though.

You don't have true outliers -- you just have an exponential or other very non-normal distribution. So I wouldn't remove any values. And with those kinds of extremes, trimmed means and such won't help much.

So consider dividing the data into quarters or fifths.
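A sketch of the fifths (quintile) version in base R, assuming your data frame is called `credit_data` (if the break points have ties, `dplyr::ntile()` is a simpler alternative):

```r
# Bin a heavily skewed variable into quintiles by rank
breaks <- quantile(credit_data$equity, probs = seq(0, 1, by = 0.2), na.rm = TRUE)
credit_data$equity_q5 <- cut(credit_data$equity, breaks = breaks,
                             include.lowest = TRUE, labels = paste0("Q", 1:5))
table(credit_data$equity_q5)
```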