I know that this method is suitable for large data sets, can handle `NA` values, and selects by itself which variables enter the regression and which do not.

**What I'm doing**

After the regression is estimated, I detect outliers in the residuals using `boxplot`, remove the observations flagged as extreme values from the data, and fit the model again.

I repeat this until `grsq` and `rsq` stop improving, i.e. until their maximum is reached.

**CODE**

```
library(earth)

# `data`, `distances` and `weights` are assumed to exist already (not shown here)
model <- earth(log(price) ~ ., data = data, weights = weights)
max_grsq <- round(model$grsq, digits = 4)
max_rsq <- round(model$rsq, digits = 4)
min_diff <- abs(max_grsq - max_rsq)

done <- FALSE
while (!done) {
  residuals_abs <- abs(as.vector(model$residuals))
  bp <- boxplot(residuals_abs, plot = FALSE)

  # bp$stats[2] and bp$stats[4] are the lower and upper hinges (roughly Q1 and Q3)
  indexes_to_remove <- which(residuals_abs > bp$stats[4] | residuals_abs < bp$stats[2])

  if (length(indexes_to_remove) > 0) {
    data <- data[-indexes_to_remove, ]
    distances <- distances[-indexes_to_remove]
    # Re-normalize the inverse-distance weights so they sum to 1
    weights <- (1 / distances) / sum(1 / distances)
  }

  # Refit on the reduced data and compare against the best fit so far
  tempModel <- earth(log(price) ~ ., data = data, weights = weights)
  temp_grsq <- round(tempModel$grsq, digits = 4)
  temp_rsq <- round(tempModel$rsq, digits = 4)
  temp_diff <- abs(temp_grsq - temp_rsq)

  # Accept only if neither measure got worse and at least one improved
  if ((temp_grsq > max_grsq && temp_rsq >= max_rsq) || (temp_grsq >= max_grsq && temp_rsq > max_rsq)) {
    model <- tempModel
    max_grsq <- temp_grsq
    max_rsq <- temp_rsq
    min_diff <- temp_diff
  } else {
    done <- TRUE
  }
}
```
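
To make the removal rule easier to discuss, here is a minimal sketch on made-up numbers (not my real residuals) showing what `boxplot(...)$stats` contains and which points each rule flags. The names `bp`, `outside_hinges` and `beyond_whiskers` are only for illustration.

```r
# Toy absolute residuals (made-up values, for illustration only)
residuals_abs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 5.0)

bp <- boxplot(residuals_abs, plot = FALSE)

# bp$stats holds five values, in this order:
# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)

# Rule used in my loop: everything outside the two hinges
outside_hinges <- which(residuals_abs > bp$stats[4] | residuals_abs < bp$stats[2])

# Points beyond the whiskers, i.e. what boxplot itself reports in bp$out
beyond_whiskers <- which(residuals_abs %in% bp$out)

print(outside_hinges)
print(beyond_whiskers)
```

On these nine values the hinge rule flags four points, while only the value 5.0 lies beyond the whiskers.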

**QUESTION**

I'm not a statistician, so I don't know a better way of removing outliers.

- Is my approach correct?

- Should I use another approach?

- I know that there are bad outliers and good outliers (leverage points). How can I remove only the bad ones?

- I'm using the semi-log form of the regression, because the dummy variables prevent me from using the log-log form. Is there another approach to transforming the data, or should I standardize (min-max scale) it with `x <- (x - x_min)/(x_max - x_min)`?
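
To be concrete about the last point, this is roughly the rescaling I have in mind. It is only a sketch: the helper `min_max`, the data frame `toy` and its column names are made up for illustration, and I'm assuming the dummy columns should be left untouched.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would divide by zero, so map it to 0 instead
  if (rng[1] == rng[2]) return(x - rng[1])
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up example data: two numeric predictors and one dummy variable
toy <- data.frame(
  area       = c(50, 75, 100),
  rooms      = c(1, 2, 3),
  has_garage = c(0, 1, 1)
)

# Rescale only the continuous columns; the dummy stays as 0/1
numeric_cols <- c("area", "rooms")
toy[numeric_cols] <- lapply(toy[numeric_cols], min_max)

print(toy)
```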

Does anyone have any hints?