I'm doing an extreme outlier analysis on the price distribution of the New York Airbnb listings in 2019.
I split the overall distribution into 15 distributions, grouping the listings by "room_type" (3 types) and "borough" (5 boroughs): for example, one distribution is ("Private room", "Queens"), another is ("Shared room", "Brooklyn"), and so on.
I did this because the price distribution of ("Entire apartment", "Manhattan") is obviously very different from that of ("Shared room", "Bronx").
Moreover, because the price distributions are right-skewed and have no negative values, I used the median as the measure of location and the interquartile range (IQR) as the measure of dispersion.
If I had only one overall distribution, I would use Q3 + 3*IQR as the outlier threshold.
In the link below you can see that I computed the threshold for each of the 15 distributions, but to simplify the analysis I treated the distributions as independent. This assumption does not hold: for example, if I know that the prices of ("Shared room", "Bronx") increase, I expect the prices of ("Private room", "Bronx") to increase as well.
Another problem is that the empirical distributions have different sample sizes.
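To make the per-group calculation concrete, here is a minimal sketch in plain Python. It is not the code from the linked page: the function name `upper_fences`, the `(room_type, borough, price)` tuple layout, and the toy prices are all made up for illustration, and the quartiles use the "inclusive" interpolation of `statistics.quantiles`, which may differ slightly from other quartile conventions. It also reports the group size n next to each threshold, since the sample sizes differ.

```python
from collections import defaultdict
from statistics import quantiles


def upper_fences(listings, k=3.0):
    """Return {(room_type, borough): (n, q3, iqr, fence)}, fence = Q3 + k*IQR.

    `listings` is an iterable of (room_type, borough, price) tuples
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for room_type, borough, price in listings:
        groups[(room_type, borough)].append(price)

    result = {}
    for key, prices in groups.items():
        if len(prices) < 2:
            continue  # quantiles() needs at least two data points
        q1, _, q3 = quantiles(prices, n=4, method="inclusive")
        iqr = q3 - q1
        result[key] = (len(prices), q3, iqr, q3 + k * iqr)
    return result


# Toy prices, invented for illustration only (not taken from the dataset).
listings = [("Private room", "Queens", p) for p in [40, 50, 55, 60, 70, 500]]
listings += [("Shared room", "Bronx", p) for p in [20, 25, 30, 35, 40]]

fences = upper_fences(listings)
for (room_type, borough), (n, q3, iqr, fence) in fences.items():
    flagged = [p for rt, b, p in listings
               if (rt, b) == (room_type, borough) and p > fence]
    print(room_type, borough, "n =", n, "fence =", fence, "flagged =", flagged)
```

Each threshold here is computed from its own group only, which is exactly the "independent distributions" simplification described above.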
My question is: does it make sense to calculate the Q3 + 3*IQR threshold separately for each distribution, given that the distributions are not independent and have different sample sizes?
How can I define an outlier detection method for this exploratory descriptive analysis?
If you want to see the problem in more detail, here is the link: https://antonio-catalano.github.io/NY_Airbnb_outliers.html
The same page also contains the first part of the article (but it is not needed to understand the problem I have described).
In other words: does the Q3 + 3*IQR threshold make sense when it is calculated for distributions that are different but not independent?
Thanks.