This is not exactly a homework problem, but I just discovered how fun statistical modelling is, and I usually work with datasets that are already clean.
However, I am now dealing with a credit default dataset that a lecturer showed me as a challenge. I want to fit a logistic regression, a random forest, and an XGBoost model where the dependent variable is whether a company has defaulted or not; there are 18 other variables in the dataset.
Three of the variables I want to use stand out (equity, total assets, and revenue); see below for a picture of their boxplots.
At first I thought I might be dealing with outliers, but when I removed the outliers with a command in R I had only 30 000 observations left. I also tried putting a cap and a floor on values lying beyond 1.5 times the interquartile range, but that gave questionable results too. I checked the skewness and got 106 for equity.
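For reference, the two things I tried look roughly like the sketch below. It runs on simulated data, not the real dataset, and the helper names `skewness` and `iqr_fences` are made up for illustration (the e1071 and moments packages ship their own `skewness` functions). I am also assuming the usual rule where the fences sit 1.5 times the IQR below the first quartile and above the third quartile.

```r
# Simulated stand-in for one heavy-tailed variable (NOT the real data).
set.seed(1)
equity <- c(rlnorm(1000, meanlog = 10, sdlog = 2),
            -rlnorm(50, meanlog = 8, sdlog = 2))

# Moment-based sample skewness (hypothetical helper, base R only).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Fences at Q1 - k * IQR and Q3 + k * IQR (assumed rule, k = 1.5).
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

f <- iqr_fences(equity)

# Attempt 1: drop every observation outside the fences.
equity_trimmed <- equity[equity >= f[["lower"]] & equity <= f[["upper"]]]

# Attempt 2: cap and floor instead of dropping, so no rows are lost.
equity_capped <- pmin(pmax(equity, f[["lower"]]), f[["upper"]])

# Compare how many values survive and how skewed each version is.
cat("n original:", length(equity), " n trimmed:", length(equity_trimmed), "\n")
cat("skewness original:", skewness(equity), "\n")
cat("skewness capped:  ", skewness(equity_capped), "\n")
```

Dropping shrinks the sample, while capping keeps every row but piles many values up exactly at the fences, which may be part of why my results looked questionable.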
Aside from log transformations, what other ways are there to deal with this? I am relatively new to R, and there are so many packages that it is easy to get lost. Answers don't have to involve R, but if you know of packages that make these data-cleaning steps easier, that's a plus.
[Attachment: boxplots of equity, total assets, and revenue (image, 18.3 KB)]