Removing outliers - an endless story?

#1
I've been collecting some data for the first time, investigating differences between three groups of people and their numbers of hospital visits. Before running any tests (probably ANCOVA with a few covariates), I made a box-plot in SPSS to visualize the data. There are many outliers, which I guess are not good for my analysis. However, if I remove the outliers, or set a cut-off threshold removing everyone over a certain number of visits, NEW outliers show up the next time I make the box-plot. It is intuitive that when I remove cases, some that were "normal" before suddenly become outliers, but what is the solution here?

Thank you in advance!
 
#3
I was thinking that ANCOVA, or whatever test I choose, will fit poorly to the data if there are outliers. However, this is pure speculation on my part, so please correct me if I am wrong!
 

CB

Super Moderator
#4
Statistical tests do not assume that there are no outliers. The presence of substantial outliers can in rare cases reflect data collection errors (e.g., misrecorded values). If that's the case, maybe deleting such values would be ok.

ANCOVA and other linear models do assume that the distribution of the dependent variable is normal conditional on any given combination of the predictor variables, and big outliers may cause normality breaches... but normality is the least important distributional assumption of linear models. Whatever problems are caused by a breach of normality are likely to be far smaller than the problems introduced by subjectively removing observations from your dataset.

This all said... sometimes the apparent presence of outliers is a tip-off that something else is awry. In your case, "number of hospital visits" is not an unbounded continuous variable, as we'd want as the DV for a linear model. Number of hospital visits is a count variable. A model that assumes a normal distribution with constant variance for any given combination of predictor values may not be the best choice for count data. Perhaps a model designed for count data like the Poisson, quasi-Poisson or negative binomial might be a better choice.
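For illustration only, here is a minimal sketch of what such a count-data model could look like outside SPSS, e.g. in Python's statsmodels. The data and the variable names (visits, group, age) are simulated and hypothetical, not taken from the thread.

```python
# Minimal sketch (not the original analysis): fitting count models with
# statsmodels instead of an ordinary linear model. The data below are
# simulated just to make the example self-contained.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),
    "age": rng.normal(50, 10, size=n),
})
# Simulated counts whose mean depends on group membership.
rate = np.where(df["group"] == "A", 2.0, np.where(df["group"] == "B", 4.0, 6.0))
df["visits"] = rng.poisson(rate)

# Poisson regression: models the log of the expected count.
poisson_fit = smf.glm("visits ~ C(group) + age", data=df,
                      family=sm.families.Poisson()).fit()
print(poisson_fit.summary())

# If the counts are noisier than a Poisson allows (overdispersion), a
# negative binomial family is a common next step (the dispersion parameter
# is held fixed at its default value here).
nb_fit = smf.glm("visits ~ C(group) + age", data=df,
                 family=sm.families.NegativeBinomial()).fit()
print(nb_fit.summary())
```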
 
#5
Thank you for your reply. So you suggest another test? I have never heard of the tests you mention. Are they available in SPSS? My professor suggests ANCOVA in order to include important covariates; however, I realize that my outcome variable is not normally distributed at all (not even close), and that other assumptions may also be violated. The professor still insists that ANCOVA is the best option (he is not a statistician), even though assumptions may be violated. Hmm, I am really getting confused by this.
 

CB

Super Moderator
#6
Plenty of researchers stick to using the tests they're most familiar with, instead of the tests that are actually appropriate for the questions at hand. If this is an independent project (e.g., postgraduate thesis), at the end of the day it's your project, and you don't necessarily have to do what your professor thinks is best. A breach of the normality assumption is not a big deal, but with count data the homoscedasticity assumption of ANCOVA will probably be breached too, which is more of a real problem, as it'll compromise the efficiency of the least squares estimator. So I would be thinking about a count data model here.

Yep, you can apply the count data models I mentioned in SPSS, using the generalized linear model dialog. And they do allow covariates to be specified. Though probably the first thing to do is to have a look around for some background reading on models for count data.
 
#7
Thank you for your help so far! I will definitely try to learn more about count data models. However, I am not sure if these methods are really applicable to my data. From a quick review, it seems that count data models deal with categories (where each count is a category) and frequencies within each count, creating contingency tables. In my case, every subject in the set has an individual score for number of illness episodes (it is not about hospital visits, that was only a simplification). My plan was to run ANCOVA or similar in order to find out whether an independent categorical variable (three categories in total) can predict the number of illness episodes.

Thankful for any clarification!
 

maartenbuis

TS Contributor
#8
However, I am not sure if these methods are really applicable to my data. From a quick review, it seems that count data models deal with categories (where each count is a category) and frequencies within each count, creating contingency tables.
I suspect you have come across what is sometimes called a log-linear model. This is a very specific sub-class of count models, and certainly not representative of what is possible with count models.

In my case, every subject in the set has an individual score for number of illness episodes (it is not about hospital visits, that was only a simplification). My plan was to run ANCOVA or similar in order to find out whether an independent categorical variable (three categories in total) can predict the number of illness episodes.
That is not a problem for Poisson, quasi-Poisson, zero-inflated Poisson, negative binomial, or zero-inflated negative binomial regression models. In fact, that is exactly the kind of problem they were designed to tackle. I would start by looking at a Poisson regression and move up as required. One place (of many) where you could start looking is Chapter 8 of J. Scott Long (1997), Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks: Sage.
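To make the "start with a Poisson regression and move up as required" advice a little more concrete, here is a rough sketch in Python/statsmodels with simulated data and hypothetical variable names. The Pearson chi-square/df ratio used below is just one common overdispersion heuristic, not the only way to decide.

```python
# Sketch of "start with Poisson and move up as required": after fitting a
# Poisson model, a rough overdispersion check is the Pearson chi-square
# divided by the residual degrees of freedom (values well above 1 suggest
# the Poisson variance assumption is too restrictive). Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_obs = 200
df = pd.DataFrame({"group": rng.choice(["g1", "g2", "g3"], size=n_obs)})
means = df["group"].map({"g1": 6.0, "g2": 4.0, "g3": 2.0}).to_numpy()
# Negative binomial draws are deliberately noisier than Poisson draws.
df["episodes"] = rng.negative_binomial(2, 2.0 / (2.0 + means))

pois = smf.glm("episodes ~ C(group)", data=df,
               family=sm.families.Poisson()).fit()
print("Pearson chi2 / df:", pois.pearson_chi2 / pois.df_resid)

# If that ratio is clearly above 1, a negative binomial (or quasi-Poisson)
# model is the usual next step.
nb = smf.glm("episodes ~ C(group)", data=df,
             family=sm.families.NegativeBinomial()).fit()
# exp(coefficient) can be read as a rate ratio relative to the reference group.
print(np.exp(nb.params))
```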
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
Yes-yes, tell us more about this episode variable. Also, can patients seek care outside of your clinic/health system (i.e., are you losing any data from potential re-admissions elsewhere)?
 
#11
Could you tell us a little more about this dependent variable?
The dependent variable is the number of illness episodes in patients with a chronic disease. The disease course is fluctuating, and patients have recurring episodes of illness with healthy periods in between. We are investigating whether some genetic variants (three categories) are associated with a higher number of recurrences (i.e. number of illness episodes). The genetic categories are ordinally ranked, i.e. category 1 should have higher numbers than category 2, which in turn should have higher numbers than category 3. Because the number of episodes is also affected by disease duration, disease duration is going to be included as a covariate (even though I realize that it might not fix the problem).
 
#13
That does sound like a count variable to me, but to be sure; the dependent variable values are restricted to non-negative integers, right?
Correct! Is there any obvious reason why models such as ANCOVA won't work on count data? I've seen several publications using ANCOVA or equivalent on these types of problems. I don't doubt that count models are superior in this case, but I just need some arguments in case anyone asks me why I disregarded the methods that would be expected in most cases :) Also, in what regard would ANCOVA be worse, statistical power or interpretation of results?
 

CB

Super Moderator
#14
Correct! Is there any obvious reason why models such as ANCOVA won't work on count data? I've seen several publications using ANCOVA or equivalent on these types of problems. I don't doubt that count models are superior in this case, but I just need some arguments in case anyone asks me why I disregarded the methods that would be expected in most cases :)
When you use ANCOVA or other linear models that use ordinary least squares estimation, there is an assumption made that the conditional distribution of the dependent variable is normal, independent, and identically distributed (homoscedastic) for any combination of levels of the predictor variables. For example, in a simple ANOVA model, this means that we assume (among other things) that the distribution of the dependent variable has the same variance within each group (and is normally distributed within each group).

When working with count data, the normality assumption is obviously breached: The normal distribution is continuous and unbounded, whereas count data is restricted to a discrete set of values that is bounded at zero (i.e., restricted to the non-negative integers). This often manifests in a positively-skewed distribution. The ordinary least squares estimator can remain unbiased, consistent, and efficient in the case of a non-normal conditional distribution of the response variable - i.e., the point estimates of means or regression coefficients are still good estimates. But with relatively small sample sizes, confidence intervals and significance tests will be untrustworthy. With larger sample sizes this becomes less of an issue.

The more important issue is the likely breach of homoscedasticity. Specifically, with count data, groups that have higher means (higher average counts) tend to have higher variances as well. This breaches the assumption that the variance of the response variable is identical across all combinations of levels of the independent variables. I wouldn't be surprised if you can actually see this in your own data (if you are comparing groups of patients with quite different frequencies of illness) - you may see that the group with the higher mean has a higher variance as well. This will result in the OLS estimator that ANCOVA uses not being an efficient estimator. I.e., it is not as accurate as other estimation methods in this scenario.
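To make the mean-variance point concrete, here is a tiny, purely illustrative simulation with made-up Poisson counts:

```python
# Illustrative only: in Poisson-like count data the variance tracks the
# mean, so a group with a higher average count also has a larger variance,
# which is exactly the pattern that breaks the constant-variance
# (homoscedasticity) assumption of ANCOVA/OLS.
import numpy as np

rng = np.random.default_rng(0)
low = rng.poisson(lam=2.0, size=10_000)    # group with few episodes
high = rng.poisson(lam=10.0, size=10_000)  # group with many episodes

print("low-mean group:  mean %.2f, variance %.2f" % (low.mean(), low.var()))
print("high-mean group: mean %.2f, variance %.2f" % (high.mean(), high.var()))
# The variances differ by roughly the same factor as the means.
```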

Also, in what regard would ANCOVA be worse, statistical power or interpretation of results?
I'm not sure about statistical power, but the interpretation of the results is compromised in the ANCOVA case - i.e., you might report confidence intervals and significance tests and so on, but they may be untrustworthy due to assumption breaches.
 
#15
Thank you for all the replies, and I am now bumping my own thread. I had the opportunity to meet with a statistician and we decided to use negative binomial regression. The statistician somehow plotted Q-Q plots for different distributions and drew the conclusion that the negative binomial distribution was the best. Is this the correct way to choose which distribution is best, and which tests to use?
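For what it's worth, here is a guess at roughly what such a Q-Q comparison involves, sketched in Python with simulated counts and a simple method-of-moments fit; the statistician's actual procedure may well have been different (and done in other software).

```python
# Rough sketch of a Q-Q check against a negative binomial distribution.
# This is only a guess at the procedure; the data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.negative_binomial(2, 0.3, size=500)  # pretend observed counts

# Method-of-moments fit of the negative binomial (valid when var > mean).
m, v = y.mean(), y.var()
p = m / v
r = m * p / (1 - p)

# Compare empirical quantiles with theoretical quantiles.
probs = (np.arange(1, len(y) + 1) - 0.5) / len(y)
theoretical = stats.nbinom.ppf(probs, r, p)
empirical = np.sort(y)

# Points lying close to the 45-degree line suggest a reasonable fit.
for q in (0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    i = int(q * (len(y) - 1))
    print(f"q={q:.2f}  empirical={empirical[i]:.0f}  theoretical={theoretical[i]:.0f}")
```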

I also have some other questions:

1. What is actually the difference between negative binomial regression and linear models such as ANCOVA? Is negative binomial regression not linear? Not using the method of least squares?

2. Is it possible to include covariates in negative binomial regression? We included some other factors, potential confounders. Does this have the same effect as covariates in ANCOVA, i.e. "controlling" for factors?

3. What are the assumptions of negative binomial regression and how do I make sure that the test results (p-values) are generalizable?

Thankful for any more input!
 

CB

Super Moderator
#16
Is negative binomial regression not linear?
Negative binomial regression is a generalized linear model.

Not using the method of least squares?
Not ordinary least squares. Maximum likelihood, e.g., via iteratively reweighted least squares (that is what R uses; I don't know whether other programs do the same).
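For the curious, the IRLS idea can be sketched in a few lines for a toy Poisson regression. This is only meant to show that each iteration solves a weighted least-squares problem; it is not what SPSS or R literally run.

```python
# Toy illustration of maximum likelihood via iteratively reweighted least
# squares (IRLS) for a Poisson regression with a log link. Not production
# code; just to show the weighted least-squares update at each step.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # intercept + one predictor
true_beta = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ true_beta))

beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta                # linear predictor
    mu = np.exp(eta)              # inverse link
    W = mu                        # IRLS weights for Poisson with log link
    z = eta + (y - mu) / mu       # working response
    # Weighted least-squares update: beta = (X'WX)^(-1) X'Wz
    XtW = X.T * W
    beta = np.linalg.solve(XtW @ X, XtW @ z)

print("estimated:", beta)   # should be close to true_beta
```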

2. Is it possible to include covariates in negative binomial regression? We included some other factors, potential confounders. Does this have the same effect as covariates in ANCOVA, i.e. "controlling" for factors?
Yes/yes.

3. What are the assumptions of negative binomial regression and how do I make sure that the test results (p-values) are generalizable?
Crumbs, from memory I think it would be:
  • Errors have conditional mean zero for any given combination of values of the predictors
  • Errors independent
  • For any given combination of values of the predictors, the dependent variable follows a negative binomial distribution

That might be a bit vague though.
 
#17
Thank you once again! So, regarding the covariates: In linear multiple regression, the predictors are generally not supposed to be "covariates" in the same sense as in ANCOVA, right? Doesn't that depend on the order of entry for the predictors? In order for covariates to "control for" unwanted effects, they need to be introduced to the model before the predictor variable, correct?
 

CB

Super Moderator
#18
Not really - the interpretation of a coefficient for an independent variable in a model that includes covariates is the same regardless of which was entered "first". E.g., if I have a model:

Y = B0 + B1*X1 + B2*X2 + e

And I regard X1 as my IV and X2 as the "control", then:

B1 is the expected increase in Y for a one-unit increase in X1, while holding X2 constant. I don't need to enter one or other variable first to achieve this interpretation. A covariate in ANCOVA is no different from a predictor in regression, as far as I know. ANCOVA is just regression, after all.
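A quick way to convince yourself of this is a toy check with simulated data (hypothetical variables x1, x2, y): fit the same two-predictor model with the variables supplied in either order and compare the coefficient for x1.

```python
# Toy check: the coefficient for x1 in a model containing both x1 and x2
# does not depend on which variable is "entered" first; both orderings
# describe the same model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

fit_a = smf.ols("y ~ x1 + x2", data=df).fit()
fit_b = smf.ols("y ~ x2 + x1", data=df).fit()
print(fit_a.params["x1"], fit_b.params["x1"])   # identical estimates for x1
```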
 

noetsi

Fortran must die
#19
Well, in what some call hierarchical regression you do enter variables in different steps, and the interpretation does change, at least in terms of the R squared: it now reflects the increase in predictive ability that the new variables bring (if any). There is a small sketch of this below.

Or so I was taught :p
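A tiny illustration of that point, with made-up data: the coefficients of the final model are unchanged, but entering the variables in blocks lets you report the increase in R squared that the later block adds.

```python
# Toy illustration of "hierarchical" (blockwise) entry: the quantity that
# depends on the order of entry is the incremental R-squared, not the
# coefficients of the final model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

block1 = smf.ols("y ~ x2", data=df).fit()        # covariate only
block2 = smf.ols("y ~ x2 + x1", data=df).fit()   # add the IV of interest
print("R2, block 1:", round(block1.rsquared, 3))
print("R2, block 2:", round(block2.rsquared, 3))
print("R2 change  :", round(block2.rsquared - block1.rsquared, 3))
```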