Min sample size for Wilcoxon Rank Sum test

#1
Hi,

I have data measurements (e.g., running cost) that keep coming into the system continuously. There are two groups (say X and Y) that I want to compare, to see which one is systematically larger (H1: X > Y?). However, I have a couple of issues with my data: i) I can't assume normality; and ii) my data gets deleted periodically, and once deleted it's no longer available to me. So at any given moment, X and Y may have lots of data, or only a few measurements.

Based on this, it was suggested that I use the Wilcoxon rank-sum test, since it doesn't assume normality. However, I couldn't find any information about:
When can I apply Wilcoxon? What should the minimum sample size be?

This is important since at any moment, X and Y may have only a few data points (say 2 data points each); hence, running Wilcoxon on such a sample would be misleading.

Any thoughts on this?
 
#2
Hi :welcome:

Please give us more detail on your study. What are these measurements that get deleted automatically (if not confidential)? What are your group sizes?

You have been recommended a test based on two incorrect assumptions. A t-test can still be used even if normality is not met.

You can apply a Wilcoxon test whenever you would use a t-test but its (true) assumptions do not hold; note that the Wilcoxon signed-rank test replaces the paired t-test, while the rank-sum test you mention replaces the two-sample t-test. It will run on samples as small as 3 observations per group, regardless of the reliability of the result. However, the minimum sample size is a totally different thing: you should have large-enough samples, and there are formulas for calculating that size.
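For what it's worth, here is one common way such a calculation is done in practice: size a two-sample t-test for an assumed effect, then inflate by the rank-sum test's asymptotic relative efficiency. This is only a minimal sketch, assuming Python with statsmodels; every number in it (effect size, alpha, power) is an illustrative assumption.

```python
# Sketch: sample size for a Wilcoxon rank-sum test via the t-test + ARE trick.
# ARE = 3/pi holds under roughly normal data; all numbers are assumptions.
import math
from statsmodels.stats.power import TTestIndPower

effect_size = 0.5  # assumed standardized difference (Cohen's d)
alpha = 0.05
power = 0.80

# Per-group sample size for the equivalent one-sided two-sample t-test
n_ttest = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="larger"
)

# Inflate for the rank-sum test via the ARE correction
n_wilcoxon = math.ceil(n_ttest / (3 / math.pi))
print(f"t-test: ~{math.ceil(n_ttest)} per group; Wilcoxon: ~{n_wilcoxon} per group")
```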

It sounds like radioactive isotopes to me, decaying away together!! But isn't it possible to record their state before annihilation?

Overall, MANY more details are needed.
 
#3
Hi,

Thanks for your response :D
No, it's not confidential at all lol.

Overall, I want to compare different computer configurations. Some configurations may give good performance, while others may give poor performance. A single measurement for a given configuration is the execution time of running a certain task under that configuration. Each time you run that same task under the same configuration, you get a somewhat different reading.
Example (I just made up the data):
config X = {2, 44, 4.5, 5, 66.77, 7, 3.4, 2.11, 100.3, 22, ...}
config Y = {4, 36.7, 24.5, 22, 12, 9.9931, 22, 22, 23, 22.33, 25, 100.2, ...}
config Z = {15, 100, 102, 102.11, 200.22, 204.4, 11.2, 1.22, 1.4, 1.2, 1.45}

I want to compare these computer configurations in terms of their performance. Generally, if I'm statistically confident that some configs work better than others, I can eliminate all the bad configs and recommend that my users use the configs with good performance. That's where I wanted to use Wilcoxon.
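To make that concrete, here is a minimal sketch of the pairwise comparison I have in mind, assuming Python with scipy and truncating my made-up lists to the values shown above:

```python
# Sketch: scipy's mannwhitneyu implements the Wilcoxon rank-sum /
# Mann-Whitney U test; alternative="greater" matches a one-sided H1: X > Y.
from scipy.stats import mannwhitneyu

config_x = [2, 44, 4.5, 5, 66.77, 7, 3.4, 2.11, 100.3, 22]
config_y = [4, 36.7, 24.5, 22, 12, 9.9931, 22, 22, 23, 22.33, 25, 100.2]

stat, p_value = mannwhitneyu(config_x, config_y, alternative="greater")
print(f"U = {stat}, p = {p_value:.4f}")
```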

Further, since my environment is highly dynamic, I decided to rely only on the most recent measurements and ignore the ones taken in the past. For that I'm implementing a sliding window: the only measurements I consider are the ones taken during the past time window W. So during this window, some configs may have only a few measurements, while others may have lots.
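A minimal sketch of that sliding window, under the assumption that each measurement is stored as a (timestamp, value) pair (the names here are illustrative, not from any real system):

```python
# Sketch: keep only measurements taken within the past window_seconds.
import time

def measurements_in_window(measurements, window_seconds):
    """Return only the values measured during the past window_seconds."""
    cutoff = time.time() - window_seconds
    return [value for (timestamp, value) in measurements if timestamp >= cutoff]

# Example: keep only the last hour of readings for config X
# recent_x = measurements_in_window(config_x_log, window_seconds=3600)
```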
So my question is: within a given time window, which configs can I compare against each other? What's the minimum number of measurements a config should have in order to be comparable against the others?

The reason why I can't collect a large number of measurements for each config is that there are hundreds of different configurations, and testing each configuration by running lots of measurements would cost a lot of money.
 
#4
It is becoming clearer, but not quite there yet. So what is that "given time" window? Just a limitation of resources and time? And do you want to find the minimum number of configurations to keep your experiment affordable while making sure the result is valid? Or do you want to keep the number of your measurements low?

For that purpose you need to do a sample size calculation based on the information you already have, and determine the minimum sample sizes from it. That is not just a characteristic of the test used; other factors are involved.

I want to help more, but the design is still extremely vague to me. What exactly is a config? A combination of different computer hardware? A combination of firmware tweaks? A combination of both? How many programs do you test on each setup?

Why do you think the configurations are matched? Do you know about matching? If so, according to which factor are your configs matched? I see you want to compare more than two configs against each other (X, Y, Z)... that is not a case for Wilcoxon. There are many other open questions about your experiment.
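If a single rank-based test across more than two independent groups is really what's wanted, the Kruskal-Wallis test is the usual choice. A minimal sketch, assuming Python with scipy and borrowing the made-up numbers from the earlier post:

```python
# Sketch: Kruskal-Wallis tests H0 that all three configs come from the
# same distribution. Data are the made-up values from the earlier post.
from scipy.stats import kruskal

config_x = [2, 44, 4.5, 5, 66.77, 7, 3.4, 2.11, 100.3, 22]
config_y = [4, 36.7, 24.5, 22, 12, 9.9931, 22, 22, 23, 22.33, 25, 100.2]
config_z = [15, 100, 102, 102.11, 200.22, 204.4, 11.2, 1.22, 1.4, 1.2, 1.45]

stat, p_value = kruskal(config_x, config_y, config_z)
print(f"H = {stat:.3f}, p = {p_value:.4f}")
```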

Based on the information received, I would suggest you let a statistician handle this for you. Otherwise you would need to study the basics of stats (of course TS members would help you in that matter) before deciding on the test and the number of measurements. :)
 
#5
Hi,
And do you want to find the minimum number of configurations to keep your experiment affordable while making sure the result is valid? Or do you want to keep the number of your measurements low?
I actually want to lower both, since exploring more configurations means more money, and doing more tests per configuration also increases the experimentation cost.
My first step would be to answer: what's the minimum number of measurements I need for each config/group?
And the next step would be: once I have enough measurements for each group, which configs (or groups) have statistically larger values than others, and which ones have lower values?

I want to help more, but the design is still extremely vague to me. What exactly is a config?
I'm really sorry my explanation was a bit vague and very high level. A configuration denotes a collection of properties for each computer. For example: if we assume there are three computer sizes = {small, medium, large} and three CPU types for each, CPU = {A, B, C}, then a config defines which machine size and CPU I'm using. In this example, there is a total of 3 x 3 = 9 configurations (or groups, or classes): {A-small, A-medium, A-large, B-small, ..., C-large}.
Now, for each config, I want to run a program a couple of times to test the execution time under that config.
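Just to make the enumeration concrete, a tiny sketch in Python (the labels are the ones from my example above):

```python
# Sketch: the configuration space is every (CPU, size) combination,
# giving 3 x 3 = 9 configs.
from itertools import product

cpus = ["A", "B", "C"]
sizes = ["small", "medium", "large"]

configs = [f"{cpu}-{size}" for cpu, size in product(cpus, sizes)]
print(configs)  # ['A-small', 'A-medium', 'A-large', 'B-small', ..., 'C-large']
```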

However, in the real world there can be hundreds of such configs, and testing all of them blindly would cost a lot of money. The issue is that I wasn't sure of the minimum number of measurements I need for each config so that I can confidently compare configs against each other (config X is higher than config Y at alpha = 0.05, etc.).


Why do you think the configurations are matched? Do you know about matching?
Not really, I'll read about this :p
 
#6
Ah, I see. I would say forget about statistical analyses (seriously): just record the performance of each setup, sort the performances from shortest elapsed time to longest, and recommend the one with the shortest time if financial issues are not a problem. If finances matter as well, you can sort your list according to both performance and price.
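For instance, a minimal sketch of that sorting in Python, using the made-up numbers from the earlier post (summarizing repeated runs by the median is my assumption; a mean would work just as well):

```python
# Sketch: rank configs from best (shortest) to worst (longest) elapsed time.
from statistics import median

results = {
    "X": [2, 44, 4.5, 5, 66.77, 7, 3.4, 2.11, 100.3, 22],
    "Y": [4, 36.7, 24.5, 22, 12, 9.9931, 22, 22, 23, 22.33, 25, 100.2],
    "Z": [15, 100, 102, 102.11, 200.22, 204.4, 11.2, 1.22, 1.4, 1.2, 1.45],
}

for name, times in sorted(results.items(), key=lambda kv: median(kv[1])):
    print(f"config {name}: median elapsed time = {median(times):.2f}")
```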

The reason I would say to forget about "statistical" analyses is that, from what I see, you have a long way ahead of you, and your experiment has serious flaws that are difficult for me to elaborate on, since that would need at least some basic knowledge of stats on your side. Besides, I "guess" (not sure) that you don't really want a statistical analysis. In my humble opinion, you just want a confident and reliable answer; and in your case, if your A-large tower gives you the best performance among the other 8 setups, that would count as a reliable result. So a sorting would be fine for your case.

But in any case, if you still wanted to do statistical analyses, your test is not Wilcoxon.