Choosing a data sample for logistic regression analysis



I have several thousand experiments from 6056 different individuals, and I'm using logistic regression (in R) to study a binary outcome (correct/incorrect).
I've chosen one random experiment from each individual and stored the result in the data frame datos. The best fit I've found is the following:
fit <- glm(formula = correcto ~ usa_reken + tipo_cambio + curso +
               usa_reken:tipo_cambio,
           family = binomial(link = "logit"),
           data = datos)

This model gives me these values for my sampled data:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   1.11245    0.18416   6.041 1.53e-09 ***
usa_reken1                    1.33353    0.15534   8.585  < 2e-16 ***
tipo_cambioRESTA              0.65133    0.19147   3.402  0.00067 ***
curso                        -0.39029    0.09804  -3.981 6.86e-05 ***
usa_reken1:tipo_cambioRESTA  -0.50046    0.25859  -1.935  0.05295 .

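For anyone who wants to reproduce the sampling step, here is a minimal sketch of how one random experiment per individual could be drawn. It runs on simulated data: the full data frame `experimentos`, the ID column `individuo`, the helper `muestrear_uno`, and the factor level "SUMA" are all made-up names for illustration, not taken from my real data.

```r
set.seed(1)  # reproducible draw

# Simulated stand-in for the full data: 2 to 6 experiments per individual.
# `experimentos`, `individuo` and the level "SUMA" are hypothetical names.
individuo <- rep(1:50, times = sample(2:6, 50, replace = TRUE))
n <- length(individuo)
experimentos <- data.frame(
  individuo   = individuo,
  correcto    = rbinom(n, 1, 0.7),
  usa_reken   = factor(rbinom(n, 1, 0.5)),
  tipo_cambio = factor(sample(c("SUMA", "RESTA"), n, replace = TRUE),
                       levels = c("SUMA", "RESTA")),
  curso       = sample(1:3, n, replace = TRUE)
)

# Pick exactly one row at random for each value of the ID column.
# Indexing with sample.int() avoids the sample(x, 1) pitfall that
# appears when an individual has a single experiment.
muestrear_uno <- function(df, id = "individuo") {
  filas <- vapply(split(seq_len(nrow(df)), df[[id]]),
                  function(i) i[sample.int(length(i), 1)],
                  integer(1))
  df[filas, , drop = FALSE]
}

datos <- muestrear_uno(experimentos)
```

Each call to `muestrear_uno(experimentos)` gives a different subset unless the seed is fixed, so `set.seed()` matters if the results need to be reproduced later.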
I'm no expert, so I tend to doubt all my results. To check that my conclusions are valid and that my subsampling is not introducing noise, I drew six different random subsets from the original data.

Today I found that the averages of the estimates over the six sampled datasets are consistent with this model, and that the estimates from each individual subset agree with it as well.
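The check described above (refit the same model on six independent one-per-individual draws, then compare and average the estimates) could be sketched roughly like this. Again, this uses simulated data, and `experimentos`, `individuo`, `muestrear_uno`, and the level "SUMA" are assumed names for illustration only, so the numbers it prints say nothing about my real results.

```r
set.seed(42)  # reproducible draws

# Simulated stand-in for the full data (hypothetical names and values).
individuo <- rep(1:300, times = sample(2:6, 300, replace = TRUE))
n <- length(individuo)
experimentos <- data.frame(
  individuo   = individuo,
  correcto    = rbinom(n, 1, 0.7),
  usa_reken   = factor(rbinom(n, 1, 0.5)),
  tipo_cambio = factor(sample(c("SUMA", "RESTA"), n, replace = TRUE),
                       levels = c("SUMA", "RESTA")),
  curso       = sample(1:3, n, replace = TRUE)
)

# One random row per individual.
muestrear_uno <- function(df, id = "individuo") {
  filas <- vapply(split(seq_len(nrow(df)), df[[id]]),
                  function(i) i[sample.int(length(i), 1)],
                  integer(1))
  df[filas, , drop = FALSE]
}

formula_modelo <- correcto ~ usa_reken + tipo_cambio + curso +
  usa_reken:tipo_cambio

# Refit the same model on six independent subsamples.
# Result: a matrix with one row per coefficient and one column per draw.
estimaciones <- replicate(6, {
  d <- muestrear_uno(experimentos)
  coef(glm(formula_modelo, family = binomial(link = "logit"), data = d))
})

# Mean and standard deviation of each coefficient across the six draws.
resumen <- cbind(media    = rowMeans(estimaciones),
                 desv_est = apply(estimaciones, 1, sd))
print(round(resumen, 3))
```

The `desv_est` column only gives a rough idea of how much each estimate depends on which experiment happened to be picked. It is not a substitute for the standard errors reported by `summary()`.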

Here are the questions that I'd very much appreciate your advice and references on:
- Is what I'm doing a known technique? What is it called?
- Am I being too strict by allowing only one experiment per individual? (The outcome depends on the individual as well as on the number of experiments per individual.)
- Would a referee complain if I chose two or three or maybe more experiments per individual?

I hope I've given you enough insight into my experiments, but I'm happy to answer any questions you have :)