Using the bootstrap to calculate the sample size for a median?

rogojel

TS Contributor
#1
Hi,
I have a small(ish) sample of 31 data points from a highly right-skewed distribution and need to determine the sample size required to estimate the median.

My idea is to randomly select (with replacement) 10 data points, then use the bootstrap to estimate the 0.025 and 0.975 percentiles of the median and calculate the width of that interval. Repeat the procedure 100 times and store the widths in a vector.

Then repeat the whole procedure again with a sample of 15 data points, then with 20 and so on up to 30, build a data set that has the sample size as the IV and the interval width as the DV and run a regression on this set.

This would give me a model to extrapolate to find the necessary sample size for a required interval width.
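
A minimal sketch of the procedure described above, assuming NumPy is available. The function names, the 1000 bootstrap replicates, and the gamma-distributed placeholder data are illustrative assumptions; the actual 31 measurements are not shown in this thread.

```python
import numpy as np

def bootstrap_ci_width(sample, n_boot=1000, rng=None):
    """Width of the 95% percentile-bootstrap interval for the median."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample)
    # Each row of idx indexes one bootstrap resample of the same size as `sample`.
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    medians = np.median(sample[idx], axis=1)
    lo, hi = np.percentile(medians, [2.5, 97.5])
    return hi - lo

def widths_by_size(data, sizes=(10, 15, 20, 25, 30), n_rep=100, seed=0):
    """Return an array of (sub-sample size, interval width) rows."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    rows = []
    for n in sizes:
        for _ in range(n_rep):
            # Draw n points with replacement from the original data.
            sub = rng.choice(data, size=n, replace=True)
            rows.append((n, bootstrap_ci_width(sub, rng=rng)))
    return np.array(rows)

# Placeholder data (assumption): 31 draws from a right-skewed gamma.
data = np.random.default_rng(1).gamma(shape=1.0, scale=2.0, size=31)
table = widths_by_size(data)
for n in (10, 15, 20, 25, 30):
    print(n, table[table[:, 0] == n, 1].mean())
```

The resulting `table` has the sample size in the first column and the interval width in the second, which is the data set the regression would be run on.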

Is there anything I am missing? I am aware that extrapolating too far is risky, but other than that?

Thanks and regards
rogojel
 

Dason

Ambassador to the humans
#2
Well, you certainly have the tools to assess how well this will perform. Why not give Monte Carlo simulation a shot to see how well your method works in a case where you *know* the answer? Since your data is skewed, you could simulate from an exponential or gamma distribution with a similar median and variance, then apply your procedure to the generated data to get your sample size estimate. Once you do this a lot of times, you'll have a distribution for the sample sizes that your procedure produces. You could then also use simulation to see what the true coverage rate and/or interval length would be at each of those sample sizes, to get a feel for how well the procedure does on average.
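
As a rough sketch of the coverage part of this check, assuming NumPy: the exponential distribution is used because its median is known exactly (scale times ln 2). The scale of 2.0, the number of simulations, and the function names are arbitrary assumptions, not values from the thread.

```python
import numpy as np

def percentile_ci(sample, n_boot=500, rng=None):
    """95% percentile-bootstrap interval for the median of `sample`."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    medians = np.median(sample[idx], axis=1)
    return np.percentile(medians, [2.5, 97.5])

def coverage_and_width(n, scale=2.0, n_sim=500, seed=0):
    """Simulate exponential samples of size n; return (coverage, mean width)."""
    rng = np.random.default_rng(seed)
    true_median = scale * np.log(2.0)  # exact median of an exponential
    hits, widths = 0, []
    for _ in range(n_sim):
        sample = rng.exponential(scale, size=n)
        lo, hi = percentile_ci(sample, rng=rng)
        hits += int(lo <= true_median <= hi)
        widths.append(hi - lo)
    return hits / n_sim, float(np.mean(widths))

results = {n: coverage_and_width(n) for n in (10, 20, 30, 60)}
for n, (cov, width) in results.items():
    print(f"n={n}: coverage={cov:.3f}, mean width={width:.3f}")
```

Comparing the printed mean widths with the widths predicted by the regression model would show how far the extrapolation can be trusted.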
 

rogojel

TS Contributor
#3
Thanks Dason,
I take it there is no obvious flaw in this idea. I will try the MC simulation as you suggested; I am really curious to see how it turns out.


regards
rogojel
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
Have you seen the method you originally reported in post #1 performed before within the literature? And if so, can you post the source? I would be curious to read a description that was put into practice.
 
#5
It seems (to me) to make sense to simulate from a known distribution, something like the gamma distribution. But I was wondering how sensitive the results are to the chosen distribution. I guess that if the simulation is made from a bimodal distribution then the variance of the median will be larger. We can't know what the real distribution is. Suppose it is a mixture (40%, 60%) of two gammas with different expected values.

Is it reasonable to use the current data set (n=31) and estimate a couple of alternative distributions to get an empirical base for the simulation? Or would that lead to over adjustment?

I cannot see anything wrong with using the bootstrap on the current data set (the n=31) to estimate the variance of the median for n=10, n=20 and n=31. But of course it is not possible to get estimates for larger samples.

Also searching the web could give many better links than this first find:

http://artent.net/2013/08/12/the-exact-standard-deviation-of-the-sample-median/
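
For orientation, a standard large-sample approximation (stated here as general background, not taken from the link above) relates the spread of the sample median to n. It assumes the density f is continuous and positive at the true median m, and the symbols f, m and W are introduced here only for illustration:

```latex
\operatorname{sd}(\hat{m}_n) \approx \frac{1}{2 f(m)\sqrt{n}},
\qquad
W_n \approx \frac{2 \cdot 1.96}{2 f(m)\sqrt{n}} = \frac{1.96}{f(m)\sqrt{n}},
\qquad
n \approx \left(\frac{1.96}{W f(m)}\right)^{2}
```

Here W_n is the approximate width of a 95% interval for the median, and the last expression is the sample size implied by a required width W. Under this approximation the width shrinks like 1/sqrt(n), so halving the width takes roughly four times the data.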
 

rogojel

TS Contributor
#6
"Have you seen the method you originally reported in post #1 performed before within the literature? And if so, can you post the source? I would be curious to read a description that was put into practice."
Nope, it is just something that seemed like a good idea to me :) . Right now I am trying to figure out whether it works with the gamma distribution - my original data is also a bit like a gamma (p=0.07).

regards
rogojel
 

rogojel

TS Contributor
#7
I played a bit with this idea and it seems to work, but a crucial element is the right model for the regression - CI width against N fits pretty badly, width against 1/N seems to be a bit better, and so on. Probably some more advanced nonlinear regression will work even better; I'll have to try it.

regards
rogojel
 
#8
So far rogojel has only mentioned the median. Now regression is mentioned.

"the right model for the regression"
Maybe rogojel would find "quantile regression" interesting. Search for it! :)

Also, it would be interesting if rogojel shared with us a little bit about the results of the simulation of the median.
 

rogojel

TS Contributor
#9
Hi,
just to give a quick summary: the problem was that we had a small sample taken from a skewed distribution and we were asked the question whether the sample size is large enough to estimate the median and if not what would be the right sample size?

My idea was to use the bootstrap:
1. generate samples of a given size n
2. estimate the width of the confidence interval of the median estimates (P95-P5)
3. repeat with a slightly greater n
4. build a regression with the confidence interval width as the DV and sample size as the IV
5. use the regression to estimate the necessary sample size.

The method seems to work in that I can see a nice shrinking of the CI width with n. The regression did not work with n as the IV because the relationship is non-linear and saturates quite quickly (no big surprise when I think about it): R-squared was 5%. Using 1/n gave much better results (R-squared almost 70%).
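
A rough sketch of how such regressors could be compared, assuming NumPy and exponential placeholder data in place of the real measurements. The 1/sqrt(n) regressor is an addition that was not tried in this thread, and the R-squared values it prints will not match the ones reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(2.0, size=31)  # placeholder for the real 31 points

def ci_width(sample, n_boot=500):
    """P95 - P5 width of the bootstrap distribution of the median."""
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    lo, hi = np.percentile(np.median(sample[idx], axis=1), [5, 95])
    return hi - lo

# 100 repetitions at each sub-sample size, as in the original proposal.
sizes = np.repeat([10, 15, 20, 25, 30], 100)
widths = np.array([ci_width(rng.choice(data, size=n, replace=True))
                   for n in sizes])

def r_squared(x, y):
    """R-squared of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var()

fits = {
    "n": r_squared(sizes.astype(float), widths),
    "1/n": r_squared(1.0 / sizes, widths),
    "1/sqrt(n)": r_squared(1.0 / np.sqrt(sizes), widths),
}
for name, r2 in fits.items():
    print(f"width ~ {name}: R-squared = {r2:.3f}")

# Extrapolated width at n = 60 from the 1/sqrt(n) fit (treat with caution).
slope, intercept = np.polyfit(1.0 / np.sqrt(sizes), widths, 1)
predicted_60 = intercept + slope / np.sqrt(60)
print("predicted width at n=60:", predicted_60)
```

Fitting to the individual widths rather than to their per-size means keeps R-squared low because of replicate-to-replicate noise, so the comparison between regressors is more informative than the absolute values.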

It turned out that the decrease in CI width for samples about double the size of what I actually had was not practically significant, the data being very skewed. So I simulated a data set that was very similar to the one I had (the suggestion of Greta), and tried the method on that set. The effect was a lot more visible, but the decrease was again quite slow (as a dependence on 1/n would suggest). Probably a better model would give more insight into the number of samples needed, but as measurements are quite costly in our case it did not make much sense for me to pursue this further.

In summary, the idea seems to work, but care must be taken when building the model of CI width vs. sample size. It might also turn out that the needed sample size is completely unrealistic.

regards
rogojel