Logistic regression, sample sizes and bootstrapping


New Member
Hello, this is my first post. I am currently working on my PhD, fitting a multivariate logistic model, but I have a problem regarding the sample size of my observations:

- The "success" (1) event group has a sample size of 249 distinct observations.
- The "non success" (0) event group has a sample size of 48,957; it is a significant part of the population and many times larger than the "success" group.

So when I fit a multivariate logistic regression model, two things happen:
- The p-values of the model coefficients are always significant at alpha = 1%.
- The fitted predicted probabilities vary very little between the groups, even if the independent variables are good predictors.

So it was suggested that I bootstrap the model. Here is my doubt:

- Should I do it the more usual way, taking smaller random samples of arbitrary size from the complete sample (both groups), with replacement (or without replacement in this case)?

- Or should I keep the small "success" group constant and then add an equal number of different, randomly picked "non success" cases in each sample?

In both cases this is not exactly the usual bootstrap, as I wish to draw sub-samples from a larger group (the original big sample), and perhaps what I want is not a bootstrap at all. My idea is to minimize the discrepancy between the two group sizes. Is this theoretically correct?

The final objective is to obtain an "average" model with the mean coefficients and use it to calculate the probability of the "non success" cases actually being "successes".
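
For reference, the second scheme (all "success" cases plus an equal number of randomly drawn "non success" cases, refit many times, coefficients averaged) could be sketched roughly as below. This is only a minimal sketch on synthetic data: the helper names `fit_logit` and `balanced_subsample_coefs`, the simulated predictor, and the "true" coefficients are all assumptions made for illustration, not anything taken from the real data set. If the logistic model is correctly specified, subsampling the zeros this way should leave the slope roughly unchanged but shift the intercept by about log(n0/n1), so probabilities from the averaged model would not be population probabilities unless the intercept is shifted back.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        w = p * (1.0 - p)                     # Bernoulli variance weights
        grad = X.T @ (y - p)                  # score vector
        hess = X.T @ (X * w[:, None])         # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def balanced_subsample_coefs(X, y, n_draws=50, seed=0):
    """Keep every y == 1 row, add an equal number of random y == 0 rows, refit each time."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    coefs = []
    for _ in range(n_draws):
        neg_draw = rng.choice(neg, size=pos.size, replace=False)
        idx = np.concatenate([pos, neg_draw])
        coefs.append(fit_logit(X[idx], y[idx]))
    return np.asarray(coefs)

# Synthetic rare-event data (hypothetical: one predictor, intercept -5, slope 1).
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * x))))

full_beta = fit_logit(X, y)                 # model on the complete sample
coefs = balanced_subsample_coefs(X, y)      # one row of coefficients per balanced draw
mean_beta = coefs.mean(axis=0)              # the "average" model

n1, n0 = int((y == 1).sum()), int((y == 0).sum())
shift = np.log(n0 / n1)                     # expected intercept shift from subsampling zeros

print("full sample     :", full_beta)
print("balanced average:", mean_beta)
print("intercept shifted back:", mean_beta[0] - shift)
```

Comparing the two printed coefficient vectors on data like this gives a quick check of whether the balanced scheme changes anything other than the intercept.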

Or does anybody have another idea for this question?


TS Contributor
I would just stick with your original model. I don't see why you would consider significant results and small differences in predicted probability a problem. Your results just sound like an accurate representation of your data to me.


Less is more. Stay pure. Stay poor.
I agree with maartenbuis. In addition, in many cases the bootstrap sample rate is 1, meaning each resample is the size of the whole sample, but drawn with replacement. Going that general route does not seem like it would provide any added benefit to your analyses, unless for some reason you think your original sampling design was systematically flawed.
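
To make the "sample rate = 1, with replacement" idea concrete, here is a small sketch that uses only the group sizes from the first post (249 and 48,957). The variable names and the number of replicates are assumptions chosen for illustration. It suggests that each full-size resample keeps roughly the same imbalance as the original data, which is one way to see why this route would not change the picture:

```python
import numpy as np

# Outcome vector with the group sizes quoted in the first post.
n_pos, n_neg = 249, 48_957
y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
n = y.size

rng = np.random.default_rng(0)
n_reps = 200                      # number of bootstrap replicates (arbitrary choice)
pos_counts = np.empty(n_reps)     # "success" cases in each resample
unique_frac = np.empty(n_reps)    # share of distinct original rows in each resample

for r in range(n_reps):
    # Sample rate 1: draw n row indices from the n rows, with replacement.
    idx = rng.integers(0, n, size=n)
    pos_counts[r] = y[idx].sum()
    unique_frac[r] = np.unique(idx).size / n

print("mean successes per resample:", pos_counts.mean())
print("mean share of distinct rows:", unique_frac.mean())
```

Each resample contains on average about 249 "success" cases and about 63% of the distinct original rows (the familiar 1 - 1/e), so the ratio between the groups stays essentially what it was.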