Outliers Parametric model

#1
I'm struggling to decide how to normalise my data for modeling.

I am dealing with crowdfunded projects, and a large share of them (15%) raised only $0 to $10 and therefore failed. These produce a very strong positive skew that I have not been able to normalise (I tried log, z-scores and cubic transformations), and they will not add value to my model.

Therefore I decided to remove them (although these outliers are valid).

I tried Winsorizing them, but suspected that was wrong. So I concluded that my only option is to trim them by dropping the top and bottom 10% of values.
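For anyone reading along, here is a rough sketch of how the two options differ. The `funding` numbers are made up for illustration (not from the Kickstarter file), and `percentile_bounds`, `winsorize` and `trim` are hypothetical helper names using a simple nearest-rank cutoff, so treat it as a sketch rather than a recommended procedure.

```python
def percentile_bounds(values, lower=0.10, upper=0.90):
    """Return the cutoff values at the lower/upper quantile positions (nearest rank)."""
    ordered = sorted(values)
    n = len(ordered)
    lo_idx = round(lower * (n - 1))
    hi_idx = round(upper * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


def winsorize(values, lower=0.10, upper=0.90):
    """Clamp extreme values to the cutoffs; the number of rows is unchanged."""
    lo, hi = percentile_bounds(values, lower, upper)
    return [min(max(v, lo), hi) for v in values]


def trim(values, lower=0.10, upper=0.90):
    """Drop values outside the cutoffs; rows are lost."""
    lo, hi = percentile_bounds(values, lower, upper)
    return [v for v in values if lo <= v <= hi]


# Made-up funding amounts with a pile of zeros and one huge success.
funding = [0, 0, 5, 10, 250, 800, 1200, 3000, 9000, 15000, 250000]

winsorized = winsorize(funding)
trimmed = trim(funding)

print(len(funding), len(winsorized), len(trimmed))
print(max(winsorized), max(trimmed))
```

Note that when many values are tied at 0, the lower cutoff is itself 0, so neither approach touches the zeros when cutting by value; only the upper tail changes.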

Is this approach correct, or is there a better method?
Perhaps a non-parametric model?
 
#2
Please don't remove a correct (valid) outlier; that is a mistake.

Maybe there is a way to transform the data to normal. If you want, you can send example data.
Anyway, non-parametric tests are a good choice, since they make no normality assumption.
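As one example of what "no normality assumption" buys you, here is a minimal sketch of Spearman's rank correlation written with the standard library only. The `backers` and `funding` values are invented, and `average_ranks`, `pearson` and `spearman` are illustrative helper names, not anything from the poster's data or code. Because only the ranks are used, a monotone transform such as log makes no difference to the result.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented, heavily skewed example data.
backers = [0, 0, 1, 3, 12, 40, 95, 300, 2100]
funding = [0, 0, 5, 10, 800, 250, 1200, 9000, 250000]

rho_raw = spearman(backers, funding)
rho_log = spearman(backers, [math.log1p(v) for v in funding])

print(rho_raw, rho_log)
```

The two printed values agree, since the log transform does not change the ordering of the funding amounts.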
 
#3
Hi, thanks for replying. That was my impression as well.

Here is the link to the data: https://ufile.io/vgxm1
It has 4 features of crowdfunding campaigns on Kickstarter: ID, Backers, Funding, Goal (numeric)

The data is raw, so I can post a Jupyter notebook tomorrow if that helps. However, it should be clean enough for a quick look through.

P.S. Some outliers are indeed wrong. E.g. it's unreasonable that a project seeks to crowdfund $1. But values of 0-5 for funding or backers are reasonable, due to high failure rates (they account for at least 1/4 of the data).

Thanks!
 

Dason

Ambassador to the humans
#4
Is it really unreasonable? My understanding is that they don't get the money unless they reach their goal, so setting it to $1 and being happy with whatever they raise might be something that makes sense to them.
 
#5
That's a great observation, Dason. I didn't think of that.

Unfortunately for me this puts trimming out of the question. Any ideas on how to proceed?

Thanks
 
#7
Can you post a sample of the data and say what your goal/question is
I can only upload this in txt format and it's quite a small sample. The full thing can be found here as a CSV: https://ufile.io/vgxm1

The goal is to construct a regression model to predict the funding of campaigns based on the other input variables.

Even some pointers for non-parametric models or outlier detection would be useful, since it seems I spent a lot of time doing the wrong thing.
So far I have trimmed, removed outliers more than 2 standard deviations from the mean, and added a constant to change the 1s so that I could use log...
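On the add-a-constant-then-log step, a small sketch in case it helps others: assuming the constant is there to cope with the $0 campaigns, the usual form is log(1 + x), which Python exposes as `math.log1p`, with `math.expm1` as its inverse for getting back to dollars. The `funding` values below are made up.

```python
import math

# Made-up funding amounts, including a $0 campaign.
funding = [0, 1, 5, 10, 250, 9000, 250000]

# log(1 + x) is defined at 0 and maps 0 to 0, so no rows need to be dropped.
log_funding = [math.log1p(v) for v in funding]

# expm1 undoes log1p, e.g. to turn model predictions back into dollars.
recovered = [math.expm1(v) for v in log_funding]

for original, transformed in zip(funding, log_funding):
    print(original, round(transformed, 3))
```

This compresses the long right tail but cannot spread out a spike of identical values: every $0 campaign still lands on the same point after the transform, which may be why no transform made the data look normal.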

I'm new to this stuff so thanks for the patience and interest!
 

#9