Finding the best fit and removing outliers with MARS regression

#1
I am using the regression method called `MARS` (in `R` it is implemented as the function `earth` in the package `earth`) to find the best regression model for my data.

As far as I know, this method is suitable for large data sets, can handle `NA`, and decides by itself which variables enter the regression and which do not.

What I'm doing

After the regression is estimated, I detect the `outliers` using `boxplot`, remove the observations with `extreme values` from the data, and compute the model again.

I repeat this until the maximum `grsq` and `rsq` are found.

CODE

    library(earth)

    # `data`, `weights` and `distances` are assumed to be defined beforehand.
    model <- earth(log(price) ~ ., data = data, weights = weights)
    max_grsq <- round(model$grsq, digits = 4)
    max_rsq <- round(model$rsq, digits = 4)
    min_diff <- abs(max_grsq - max_rsq)

    done <- FALSE  # loop flag
    while (!done) {
      residuals_abs <- abs(model$residuals)
      bp <- boxplot(residuals_abs, plot = FALSE)
      # bp$stats[2] and bp$stats[4] are the lower and upper hinges (quartiles);
      # the whisker ends are bp$stats[1] and bp$stats[5].
      indexes_to_remove <- which(residuals_abs > bp$stats[4] |
                                   residuals_abs < bp$stats[2])

      if (length(indexes_to_remove) > 0) {
        data <- data[-indexes_to_remove, ]
        distances <- distances[-indexes_to_remove]
        weights <- (1 / distances) / sum(1 / distances)
      }

      # Refit on the trimmed data.
      temp_model <- earth(log(price) ~ ., data = data, weights = weights)
      temp_grsq <- round(temp_model$grsq, digits = 4)
      temp_rsq <- round(temp_model$rsq, digits = 4)
      temp_diff <- abs(temp_grsq - temp_rsq)

      # Keep the new model only if grsq or rsq improved and neither got worse.
      if ((temp_grsq > max_grsq && temp_rsq >= max_rsq) ||
          (temp_grsq >= max_grsq && temp_rsq > max_rsq)) {
        model <- temp_model
        max_grsq <- temp_grsq
        max_rsq <- temp_rsq
        min_diff <- temp_diff
      } else {
        done <- TRUE
      }
    }
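
To make clear what the indices of `boxplot(...)$stats` refer to, here is a minimal base-R sketch on made-up toy numbers (the vector `x` is purely illustrative and not taken from the car data). It compares a cut at the hinges, as in the loop, with a cut at the whiskers.

```r
# Toy data: 1..9 plus one extreme value (illustrative only).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

b <- boxplot(x, plot = FALSE)

# b$stats is a 5 x 1 matrix:
#   [1] lower whisker, [2] lower hinge, [3] median,
#   [4] upper hinge,   [5] upper whisker
print(b$stats)

# Points beyond the whiskers (the ones a boxplot draws individually).
print(b$out)

# Indices outside the hinges versus indices outside the whiskers.
outside_hinges   <- which(x < b$stats[2] | x > b$stats[4])
outside_whiskers <- which(x < b$stats[1] | x > b$stats[5])

print(outside_hinges)
print(outside_whiskers)
```

On this toy vector the hinge rule flags four of the ten points, while the whisker rule flags only the value 100, which is also what `b$out` reports.
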
QUESTION

I'm not a statistician, so I don't know a better way of removing the outliers.

- Is my approach correct?
- Should I use another approach?
- I know that there are bad outliers and good outliers (leverage points). How can I remove only the bad outliers?
- I'm using the `semi-log form` of the regression. Because of the `dummy variables` I can't use the `log-log form`. Is there another approach to data transformation, or should I standardize the data, e.g. `x <- (x - x_min)/(x_max - x_min)`?
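
For reference, the standardization in the last question (min-max scaling) could be sketched in base R roughly as follows. The data frame `car_data` and its column names are hypothetical and only serve as an illustration; the dummy column is left untouched because it is already coded 0/1. Note that this is a linear rescaling, so unlike a log transform it does not change the shape of a variable's distribution.

```r
# Hypothetical toy data: two continuous predictors and one 0/1 dummy.
car_data <- data.frame(
  mileage   = c(10000, 50000, 120000, 80000),
  power_kw  = c(55, 90, 110, 77),
  is_diesel = c(0, 1, 1, 0)
)

# Min-max scaling to [0, 1]; a constant column is returned as zeros
# to avoid dividing by zero.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Scale only the continuous columns; leave the dummy as it is.
numeric_cols <- c("mileage", "power_kw")
car_scaled <- car_data
car_scaled[numeric_cols] <- lapply(car_data[numeric_cols], min_max)

print(car_scaled)
```
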

Does anyone have any hints?
 

hlsmith

#2
What is the purpose of these analytics? It seems like you may be taking a blind approach (e.g., having a program decide on the model, then eliminating the tails of the distribution to maximize the R^2).
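
A rough illustration of that concern, as a sketch only: the data below are simulated with no outliers at all, and `lm` is used as a simple stand-in for `earth` so that the example runs in base R. Dropping the rows with the largest absolute residuals and refitting still raises R^2, even though nothing was wrong with the removed rows.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data with NO outliers: a linear signal plus Gaussian noise.
n <- 500
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 1)
d <- data.frame(x = x, y = y)

fit_full <- lm(y ~ x, data = d)
r2_full  <- summary(fit_full)$r.squared

# Drop every row whose absolute residual lies above the upper hinge
# of the boxplot (roughly the largest quarter of the residuals).
abs_res <- abs(residuals(fit_full))
b       <- boxplot(abs_res, plot = FALSE)
keep    <- abs_res <= b$stats[4]

fit_trim <- lm(y ~ x, data = d[keep, ])
r2_trim  <- summary(fit_trim)$r.squared

cat("R^2 on all rows:    ", round(r2_full, 3), "\n")
cat("R^2 after trimming: ", round(r2_trim, 3), "\n")
cat("Rows kept:          ", sum(keep), "of", n, "\n")
```

The fit looks better after trimming only because the noisiest rows were thrown away, not because the model predicts unseen data any better.
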


Side note: with roughly normal data, about 5% of the observations will always be more than 2 SD away from the mean! So chopping at the ends ad nauseam is a recursive process, because every trimmed data set has new tails. What is stopping you from building this model yourself based on content knowledge? Yes, most models have a leverage value for observations.
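
The side note can be checked with a small simulation. This is only a sketch on simulated standard normal data; the sample size and the number of passes are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

x <- rnorm(10000)  # clean Gaussian data, no contamination

# Share of points more than 2 SD from the mean (about 5% for normal data).
frac_beyond_2sd <- mean(abs(x - mean(x)) > 2 * sd(x))

# Trim beyond 2 SD repeatedly: each pass shrinks the SD, so points that
# were inside the limits before are flagged on the next pass.
sizes <- length(x)
for (pass in 1:5) {
  x <- x[abs(x - mean(x)) <= 2 * sd(x)]
  sizes <- c(sizes, length(x))
}

print(frac_beyond_2sd)
print(sizes)
```

Each pass removes further observations, although the data never contained an outlier in the first place.
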
 
#3
The purpose of the model is prediction.
I want to use the regression to estimate the price of a car from its characteristics.

I stop trimming the data at the moment when the maximum grsq is achieved.