What test to report?

#1
Hi everyone,

So first let me give some background information on my study.
Research question: Are there differences in helpfulness between online vinyl record reviews with the indication of a verified purchase and online vinyl record reviews without this indication?

This indication of a verified purchase is a label on Amazon that can be added to a review when the reviewer bought the exact same product as presented on the Amazon page.

Anyway, I gathered 80 reviews (I know, these are too few reviews, but it was part of the assignment): 40 reviews with the indication and 40 reviews without.

I had some big outliers, but my teacher told me that I should only remove them if they influence my outcome.

I checked normality and homogeneity of variance for both data sets (with and without the outliers) and the results were negative in both cases (the assumptions were violated). I learned in class that I can deal with this by doing a bootstrapped parametric test or a non-parametric test. To be sure, I did both, again for both data files (with and without outliers). This gave me the following results:

With the outliers
Bootstrapped independent t-test

Mdif = -2.33, t(78) = -2.02, p = .054 ----> not significant, but a tendency towards significance
95% CI [-4.759, -0.159], d = 0.04 ----> and an extremely small effect size (I know, calculating an effect size makes no sense if your difference is non-significant, but mine is close to significance, so I figured I'd calculate it as a bonus ;))

Chi-Square test
χ² = 16.608, p = .343 ----> not significant

Without the outliers
Bootstrapped independent t-test

Mdif = -1.97, t(71) = -3.69, p = .002 ----> significant
95% CI [-3.056, -0.938], d = 0.09 ----> but a very small effect size (I have never seen an effect size this small, so can I even use this?)

Chi-Square test
χ² = 14.264, p = .161 ----> not significant
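In case it helps to see the mechanics, this is roughly what a bootstrapped mean difference with a percentile CI and a pooled-SD Cohen's d looks like (a minimal Python sketch with made-up placeholder numbers, not my actual SPSS data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder helpfulness ratings (made up; the real analysis was done in SPSS)
with_badge = rng.poisson(3.0, size=40).astype(float)      # reviews WITH the verified-purchase indication
without_badge = rng.poisson(5.0, size=40).astype(float)   # reviews WITHOUT the indication

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Observed mean difference
diff = with_badge.mean() - without_badge.mean()

# Percentile bootstrap for the mean difference (resample each group with replacement)
boot = np.array([
    rng.choice(with_badge, len(with_badge)).mean() - rng.choice(without_badge, len(without_badge)).mean()
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"Mdif = {diff:.2f}, 95% bootstrap CI [{ci_low:.3f}, {ci_high:.3f}], d = {cohens_d(with_badge, without_badge):.2f}")
```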

OK, now here is my question: which test should I report in my paper, given that some give significant differences and some don't? I don't want to do any p-hacking, so what is the real/correct result I have to report here?

Thank you so much in advance!! Please let me know if something is unclear.
 
#2
Hi,

What are you comparing? The average rank?
This doesn't sound like a case for a classic chi-square test. How did you run this test?

An outlier should be removed only if there is a good reason to remove it. How many outliers did you get, and with what method?
"Remove these if they influence my outcome" is not a good reason.
To cut a long list short, you generally want to remove an outlier only in one of the following very high-level cases:
1. It is any type of mistake.
2. It is a valid but rare case that happened to fall into your not-too-big sample.

I don't really see any meaningful difference between p = .054 and p = .05.

If the data is not normal, is it at least symmetrical? The t-test can handle reasonably symmetrical data even if it is not normally distributed.

Did you consider the Mann-Whitney U test? It doesn't require the normality assumption and is more robust to outliers than the t-test
(although the t-test should be a bit more powerful when its assumptions hold, so the U test is probably worth a try).
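For example, something like this if the two groups were in plain lists (a Python/SciPy sketch with hypothetical numbers; the same test exists in SPSS):

```python
from scipy import stats

# Hypothetical helpfulness ratings for the two groups of reviews
verified = [12, 0, 3, 7, 1, 45, 2, 9]
unverified = [5, 0, 1, 2, 0, 3, 8, 1]

u, p = stats.mannwhitneyu(verified, unverified, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")
```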
 
#3
Hi,

Thank you for your reply!

I am comparing the helpfulness ratings of the two types of reviews (with and without the indication of a verified purchase).
I ran a chi-square test because that is the test I was taught to use when my data is all nominal and not normally distributed. I also learned that I can only use a Mann-Whitney test if my data is not nominal. What are your thoughts on this? I am using this picture (from my textbook) to determine which test to use:
What test to use.png

I determined my outliers with a boxplot, this one:
Outliers.png
The tests without the outliers were done without all the cases that fall outside the boxplot whiskers (the dots and stars).
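Roughly, the rule behind those dots and stars is the usual boxplot convention; a small sketch with made-up numbers, assuming the standard 1.5×IQR and 3×IQR cut-offs:

```python
import numpy as np

def boxplot_outliers(values, factor=1.5):
    """Return points lying more than factor * IQR outside the quartiles (the usual boxplot rule)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - factor * iqr) | (x > q3 + factor * iqr)]

ratings = [0, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 25, 60]   # made-up helpfulness counts

print(boxplot_outliers(ratings, factor=1.5))  # the "dots" in an SPSS-style boxplot
print(boxplot_outliers(ratings, factor=3.0))  # the "stars" (extreme values)
```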

Let me know what your thoughts are on this! :)
 
#4
So you are comparing the average "helpfulness rating" between the 2 groups? (Is the "helpfulness rating" on a scale of 1-100?)
If yes, then this is a continuous variable (the "helpfulness rating", not the group), so I don't think you read the decision tree correctly ...

There shouldn't be measurement mistakes here.
Are only the stars outliers, or also the circles?
If there are 7 outliers out of 80 reviews, that is 8.75%, which is not a random rare case. => Mann-Whitney ...
 
#5
Oh yeah, you're right! The helpfulness rating can be endless, which makes it a continuous variable of course! How stupid of me.

I will try the Mann-Whitney test now!
 
#6
The Mann-Whitney test gave me the following results:

With all data
Mannwhitney1.png
mann-whitney_zonder.PNG

Without the seven outliers
Manwhitney2.png
mann-whitney.PNG

Does this make any sense? And would you pick the one with or without the outliers?
 
#7
The U test is a test on ranks, and there can be no outliers with ranks: the highest rank will be the highest rank, regardless of whether its original value was 27.2 or 27,200,000,000,000,000.
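A tiny illustration with made-up numbers (Python here, but the point is tool-independent):

```python
from scipy.stats import rankdata

a = [3, 8, 15, 27.2]
b = [3, 8, 15, 27_200_000_000_000_000]

print(rankdata(a))  # [1. 2. 3. 4.]
print(rankdata(b))  # [1. 2. 3. 4.] -- the extreme value still just gets the top rank
```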

Outliers are removed if they are obvious errors, so I am surprised your teacher asked you to perform crude data manipulation.

By the way
I checked normality and homogeneity of variance
Normality (within each group, not of the total sample) is not important if n > 30, and unequal variances can be dealt with by using the Welch version of the t-test.
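For example, a minimal sketch of the Welch version with hypothetical numbers (Python/SciPy; equal_var=False is what requests the Welch test there):

```python
from scipy import stats

# Hypothetical helpfulness ratings for the two groups
group_a = [12, 0, 3, 7, 1, 45, 2, 9]
group_b = [5, 0, 1, 2, 0, 3, 8, 1]

# equal_var=False gives the Welch version (no equal-variances assumption)
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.3f}")
```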

With kind regards

Karabiner
 
#8
I agree with your point about the outliers indeed... Weird that she told me that.

With the N > 30 point I cannot go along, because she will give me a very bad grade if I do not mention normality tests haha
 
#9
The teacher’s advice and that textbook are the equivalent of toilet paper. No offense to you.
 
#10
Hi Karabiner :)

You can check for and find outliers with any regular method, independently of the test.
As you said, outliers are checked mainly for experiment errors, so we want those out even for the U test.
For example, you would still want to fix or exclude an observation of a 5-meter-tall person ...

But the U test is very robust to outliers, since it works on ranks and doesn't give extra weight to outliers the way a variance (a squared quantity) does.

For example, a height of 5 meters or of 2.09 meters may get the same rank (unlike with a variance).
But we would still want to correct it if it was meant to be 1.55.
You could say that, since it usually isn't easy to identify a real error outlier like the one in my example, in practice you can usually ignore outliers for the U test.

I assume normality is important even if N > 30 when the data is not reasonably symmetrical (skewed data).
So the decision to use a t-test depends on the combination of the skewness level and N.
For extremely skewed data I probably wouldn't use a t-test
(and if I read the U test results correctly, it gave a more significant result, despite the fact that the U test is less powerful than the t-test).
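If you want to check how skewed the data actually is, something along these lines works (a Python/SciPy sketch with made-up data; SPSS reports the equivalent skewness statistic and its standard error):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ratings = rng.exponential(scale=3.0, size=40).round()   # made-up, strongly right-skewed data

print(stats.skew(ratings))       # sample skewness
print(stats.skewtest(ratings))   # z-statistic and p-value for the skewness differing from normal
```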
 
#12
Thanks for this reply! I really appreciate your help. The z-scores for skewness were very bad (see below) and my N was 80. What is your conclusion on that? No t-test, right? And the U test does seem to be more powerful here. The U test gave me a significant p-value with (I checked) a small to medium effect size. The t-test gave me a marginally significant effect with a veeeeerrry small effect size.
Screenshot_20181206-080912~2.png
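One common effect size for a Mann-Whitney comparison is the rank-biserial correlation; a small sketch of how it can be computed (made-up numbers, not my data):

```python
from scipy import stats

verified = [12, 0, 3, 7, 1, 45, 2, 9]   # hypothetical helpfulness ratings
unverified = [5, 0, 1, 2, 0, 3, 8, 1]

u1, p = stats.mannwhitneyu(verified, unverified, alternative="two-sided")

# Rank-biserial correlation: 2 * U1 / (n1 * n2) - 1, ranging from -1 to 1
n1, n2 = len(verified), len(unverified)
r_rb = 2 * u1 / (n1 * n2) - 1
print(f"U = {u1}, p = {p:.3f}, rank-biserial r = {r_rb:.2f}")
```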
 
#13
Hi Marit :)

The results you describe support the view that it is better to use the Mann-Whitney test in this case.
Despite the fact that N = 80, the distribution is very highly skewed.

So I believe you should choose the Mann-Whitney U test.

What do you think, Karabiner?
 
#14
Well, your answer is an equivalent of toilet paper as well. Thanks but no thanks
You already received your answers: just discarding outliers to "not influence the outcome" is incredibly wrong; there's a problem in statistical education these days, mainly because many people are allowed to teach without a good background.

In reference to "and an extremely small effect size (I know, calculating an effect size makes no sense if your difference is non-significant, but mine is close to significance, so I figured I'd calculate it as a bonus ;))": this is also incorrect; the p-value has no bearing on whether it makes sense to calculate an effect size. In fact, it is often recommended to provide point and interval estimates regardless of whether a p-value is presented.

The book is poor because "Wilcoxon" and "Mann-Whitney U" can refer to the same thing, so its graphic is confusing. The Mann-Whitney U test (also called Mann-Whitney-Wilcoxon, MWU) is equivalent to the Wilcoxon rank-sum test (both are for independent samples), and both are different from the "Wilcoxon" mentioned in the chart, which is the Wilcoxon signed-rank test for dependent samples. A book meant to introduce students to these topics should spend the extra words to spell out that difference for students, who will otherwise run into the problem of having to choose between one of two "Wilcoxon" tests that the book failed to delineate.
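To make the naming concrete, this is how the names map onto functions in SciPy, for instance (illustrative values only):

```python
from scipy import stats

x = [3, 8, 15, 2, 9]     # independent sample 1 (hypothetical)
y = [5, 1, 4, 7, 12]     # independent sample 2 (hypothetical)

# Same family of test: rank-based comparison of two *independent* samples
print(stats.mannwhitneyu(x, y, alternative="two-sided"))  # Mann-Whitney U
print(stats.ranksums(x, y))                               # Wilcoxon rank-sum

# A different test: Wilcoxon signed-rank, for *paired* samples
print(stats.wilcoxon(x, y))
```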

I hope the extra detail provides insight into why I made those quick points; some of them had also already been made by others.
 
#15
I assume normality is important even if N > 30 when the data is not reasonably symmetrical (skewed data).
Do you maybe have a reference for this assumption? I was not aware that the random sampling distribution of the mean is affected by asymmetry if n > 30.

Another question is how a mean is interpreted in case of a heavily skewed distribution, but this does not affect the statistical inference, as far as I can see.

With kind regards

Karabiner
 
#16
I know I'm not directly helping with the references, but I've read some good discussions and seen simulations showing that the tendency of the sampling distribution to be approximately normal (at various sample sizes) does depend on the degree of non-normality in the underlying distribution of the Y values.
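A quick way to see it for yourself is a small simulation; a sketch assuming an exponential parent distribution as an example of strong skew:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps = 40, 20_000

# Draw many samples of size n from a strongly right-skewed parent distribution
means = rng.exponential(scale=3.0, size=(reps, n)).mean(axis=1)

print(stats.skew(means))        # leftover skew in the simulated sampling distribution of the mean
print(stats.normaltest(means))  # D'Agostino-Pearson normality test on the simulated means
```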
 
#18
I totally get what you are saying. Though, if that one Wilcoxon test is the same as the Mann-Whitney test, does it matter which name I write down?