Comparing group means with drastically different sample sizes

#1
I have two groups: a control group and a test group. For context, I'm working with insurance data. Each month, the sizes of the groups grow because the groups are assigned some unknown number of claims to work with (we can think of this process as a random allocation). We can't control how many claims we get. An artifact of gathering this data is that the control group is much larger than the test group. Now, I've been asked if the sample sizes are large enough to yield credible results. Being the data person, I know this question depends on the goal of the analysis as well as the analytic approach.

For simplicity, let's say I know ahead of time that I want to compare group means. The issue I have is that the control group has 8000 units and the test group has 100 units. Not only that, the variances of the measures are unequal. The only thing I can think of to do is use some bootstrap method or nonparametric approach. Anyone have thoughts?
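For concreteness, the kind of bootstrap I have in mind is a percentile bootstrap of the difference in means; a minimal sketch, assuming the raw values sit in placeholder vectors y_ctrl and y_test:

set.seed(42)
B <- 10000
# Resample each group with replacement and take the difference in means
boot_diff <- replicate(B,
  mean(sample(y_test, replace = TRUE)) - mean(sample(y_ctrl, replace = TRUE)))
quantile(boot_diff, c(0.025, 0.975))  # 95% percentile interval for the difference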
 
#2
Calculate the two means and standard deviations. Calculate Z_test = (x̄1 − x̄2) / (s / √100) twice, once with s set to each of the two standard deviations. Then look up P(Z ≤ Z_test) for both values. Look at all the info. If you can't make a conclusion, bring it here and I'll comment.
Looking at the original data and some calculations will often lead to a conclusion; that conclusion can then be tested.
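In R, that back-of-the-envelope check is just the following; xbar_ctrl, xbar_test, s_ctrl, and s_test are placeholder names for the summary statistics above:

n <- 100  # size of the smaller (test) group
z_1 <- (xbar_test - xbar_ctrl) / (s_test / sqrt(n))
z_2 <- (xbar_test - xbar_ctrl) / (s_ctrl / sqrt(n))
pnorm(z_1)  # P(Z <= z_1)
pnorm(z_2)  # P(Z <= z_2)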
 
#4
Even if n1 = 100 and n2 = 8000? Doesn't s2 sorta disappear?
I'd look at the numbers first; I always look at the numbers first.
Doesn't the Welch test turn into a one-sample t-test as one n goes to infinity?
 
#5
Thanks Greta, I will look into this further. I'm hesitant to use Student's t-test because the distribution of values looks more like a Gamma distribution (i.e., highly skewed), and the standard t-test assumes equal variances. But the sample variance for the control group is nearly twice that of the test group.
 
#6
And the standard t-test assumes equal variances.
Yes, that is the standard, usual t-test with a pooled standard deviation. But the Welch t-test allows for different variances.

But with such large samples, the sampling distributions of both means would tend to approximate normality.
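For reference, Welch is what R's t.test() does by default, so a sketch (my_data, with columns y and group, is an assumed name) is simply:

# Welch's t-test; var.equal = FALSE is the default in R
t.test(y ~ group, data = my_data)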

the distribution of values looks more like a Gamma distribution (aka: highly skewed).
Then a possibility is to model it with the gamlss package in R, where the response can be modeled as gamma distributed and both its mean and its standard deviation can be modeled as functions of the grouping variable.
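A minimal sketch of that kind of model, assuming a data frame my_data with outcome y and grouping factor group:

library(gamlss)

# Gamma response (GA family, log links by default); the mean (mu) and the
# scale parameter (sigma) each get their own formula, so the standard
# deviation is allowed to differ by group.
m <- gamlss(y ~ group,
            sigma.formula = ~ group,
            family = GA,
            data = my_data)
summary(m)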
 
#7
Thanks! That package looks useful for plenty of distributions. In this case, I believe it's similar to fitting a glm:

model <- glm(y ~ group,
             family = Gamma(link = "identity"),
             data = my_data)
 
#8
Can anyone comment on the drastic difference in group size? I don't know why, but I'm always nervous that I'm missing something in cases like this. All of the toy data in school was fairly balanced. lol
 
#9
I believe it's similar to fitting a glm:

model <- glm(y ~ group,
             family = Gamma(link = "identity"),
             data = my_data)

Yes, of course it is possible to estimate a glm model with gamlss. In a gamma model the coefficient of variation (sigma/mu) is constant. (Just like the variance is constant in a normal distribution model.) So in a gamma model the standard deviation is proportional to the mean.

But in a gamlss model you can also estimate whether the standard deviation has a different relationship to the mean or to the grouping variable.

But since you have so many observations, the mean in the large group will essentially be a constant. So you would almost be comparing the test-group mean against a constant, i.e., doing a simple one-sample t-test.
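In R terms, that is roughly the following, with y_test and y_ctrl again being placeholder vectors:

# With n = 8000, the control-group mean is essentially a known constant,
# so the comparison is close to a one-sample t-test against that constant.
t.test(y_test, mu = mean(y_ctrl))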

There is nothing wrong with having many observations in one group.

But if you run a designed experiment, it is optimal to have the same number of observations in both groups.
But if it is an observational study, then you get what you get.
 

hlsmith

#10
Just for clarity: the independence assumption isn't broken in the sample, is it?

These days I just use quantile regression for this, with one predictor (group). The intercept represents the control group and the group coefficient represents the difference between groups. You can fit it twice, with either group as the reference, to get all three visualizations (control group, non-control group, and difference). The approach lets you compare any percentiles and provides an easily interpretable picture for coworkers. I just started using Bayesian quantile regression, and I am going to submit an abstract to a conference using it next month.

Also, you can control for differences between groups with this approach (covariates).
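A minimal quantreg sketch of that setup; my_data, y, and group are assumed names:

library(quantreg)

# Intercept = control group; the group coefficient = the difference
# between groups at each requested percentile (tau).
fit <- rq(y ~ group, tau = c(0.25, 0.50, 0.75, 0.95), data = my_data)
summary(fit)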
 
#11
Thanks for the replies. I don't have any reason to believe the observations wouldn't be independent (from a non-statistical perspective). I didn't see any weird patterns in the residuals for the model I played around with. I will take a look at quantile regression. Ideally, I would like to use something easily interpretable for the audience.
 
#12
These days I just use quantile regression for this, with one predictor (group). [...] You can fit it twice, with either group as the reference, to get all three visualizations (control group, non-control group, and difference).
Hi hlsmith,

Do you have an example (from your conference or any other papers) of using quantile regression (classical or Bayesian) to visualize the difference between two groups? I am very interested in how I can apply it to my data. Thank you so much!
 

hlsmith

#13
What program do you think you will be using? I just grabbed this image off the internet, but this is what you would expect as the output:

[Attached image: example quantile regression output, with coefficient estimates plotted against percentile (tau)]
You set them up similar to a regression and get output like this for each variable. The y-axis is the coefficient for the outcome and the x-axis is the percentile (tau), so you can compare the outcome values across percentiles between, say, two groups. The content will differ here, but if it were, say, weight for females vs. males, you might see that for the upper 95th percentile of values women weigh less than males, etc. It is pretty intuitive once you use them for a while. I use them when the outcome may be skewed or there are outliers. So say three men weigh 400 pounds: quantile regression would show that for most people the weight differences were minimal, and those outliers would only show up at, say, the top 95th percentile of the weight differences.
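If it helps, a plot along those lines can be produced with the quantreg package (same assumed names as before):

library(quantreg)

# Fit across a grid of percentiles, then plot each coefficient against
# tau with a confidence band; one panel per model term.
fits <- rq(y ~ group, tau = seq(0.05, 0.95, by = 0.05), data = my_data)
plot(summary(fits))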

Once you get going create a thread and I would be happy to help as time allows.
 
#14
I'm using JASP and R. I have a lot of outliers, and that's why I was interested too. Thanks, I will look into it and post a thread if needed. Thanks a lot!