One population, two sampling methods, how many logistic regression models?

Can I include an explanatory variable that was used to stratify the population in the model?

I have 400 randomly selected urban households. I have 400 rural households selected from 15 villages census style. Food insecurity (dichotomous) is my outcome variable. I want to find determinants of household food insecurity. I know urban/rural is likely a factor.

Can I create ONE regression model with urban/rural as one of several explanatory variables or do I need TWO separate models? I believe the argument for one model is that sampling methods are not as important for identifying potential determinants and putting the samples together increases my sample size. I have chosen to use TWO models but am not 100% confident on the rationale. I believe that it is not appropriate to enter explanatory variables that were used in defining sample selection (the sample was stratified on urban/rural). But I do not know why one cannot enter a variable on which the sample was stratified. Can anyone help explain?

Thank you!


No cake for spunky
I have never seen two populations used for the same model, although obviously I have read only a very small percentage of the total literature on regression. I would think the variation within the two populations could differ, and that might influence the result. Whether that is true or not, I think you could only do this if you could reasonably assume that the two samples you generated actually represent the population you are interested in (stating the obvious, I know). Different urban and rural populations could vary significantly due to economic, ethnic, religious, etc. factors (i.e., these differences exist among different urban populations and among different rural populations).

It would seem extremely difficult to find a sample that gets at urbanicity generally; you would seem to be limited to urban/rural differences in, say, a given country or maybe even a region.
Thank you for your response. This is the same population, but the two groups were sampled differently and were divided along urban and rural lines. I'm wondering if I can pool the data for one regression model and use urban/rural as an explanatory variable. I suspect I cannot because the proportion of urban and rural households was predetermined. Additionally, the samples are not directly comparable due to the different sampling methods. Would you agree?


No cake for spunky
I do not know the regression issues, as I have never seen this done in regression. In terms of the design issues, I would think the real issue is how well the sampling got at the underlying groups you want to compare. If you used different sampling methods, but both ended up representing their subpopulation well, then I think it would be OK, although this too is not an issue I have ever seen formally addressed in the literature.

How specifically did the sampling techniques differ?
I have 400 randomly selected urban households. I have 400 rural households selected from 15 villages census style. One general population.


No cake for spunky
If the village sampling involved a stratified random sampling method, then I believe the variation will differ from what you get with simple random sampling, just as it does with block sampling (i.e., sampling at different levels in a geographic area when the true population sampling frame is not available to sample from at random). I do not know how that affects regression, but I believe it will lead to different results in how much variation there is from the true population.

To ask another question: do the 15 villages you sampled from reasonably represent the rural population that you want to compare the urban sample to? Or do they only represent those 15 villages, which vary systematically from villages generally?
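
To put a rough number on the point about variation under village-level sampling, here is a minimal sketch using Kish's design-effect approximation, deff = 1 + (m - 1) * ICC. It assumes roughly equal village sizes, and the intraclass correlation of 0.05 is a made-up placeholder, not something estimated from the data in this thread.

```python
def design_effect(avg_cluster_size, icc):
    """Kish's approximate design effect for cluster samples of equal size."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

n_rural = 400      # rural households, from the original post
n_villages = 15    # villages, from the original post
icc = 0.05         # ASSUMED intraclass correlation, for illustration only

avg_size = n_rural / n_villages          # about 26.7 households per village
deff = design_effect(avg_size, icc)      # variance inflation vs. simple random sampling
n_effective = n_rural / deff             # approximate "effective" sample size

print(f"average households per village: {avg_size:.1f}")
print(f"design effect: {deff:.2f}")
print(f"effective rural sample size: {n_effective:.0f}")
```

Under those assumptions the 400 rural households would carry about as much information as a much smaller simple random sample, so the rural and urban halves would not be equally precise even though both have n = 400. A different ICC would change the numbers considerably.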


TS Contributor
This looks like a simple application of an indicator variable to me, or am I missing something? I mean, using a variable in the model that is 1 for village and 0 for urban.
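
As a minimal sketch of what that would look like, the following fits one pooled logistic regression with a 0/1 rural indicator plus one other covariate. Everything here is an assumption for illustration: the data are simulated, `hh_size` is a hypothetical covariate, and the "true" coefficients are invented. The fit uses a plain Newton-Raphson loop in pure Python rather than any particular statistics package, and it ignores the clustering of rural households within villages.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logit(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Each row of X must already include a leading 1 for the intercept."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * row[a]
                for c in range(k):
                    hess[a][c] += w * row[a] * row[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Simulated data: 400 urban (rural = 0) and 400 rural (rural = 1) households.
random.seed(0)
X, y = [], []
for i in range(800):
    rural = 1 if i >= 400 else 0
    hh_size = random.randint(1, 8)                 # hypothetical covariate
    eta = -1.0 + 0.8 * rural + 0.2 * hh_size       # invented "true" model
    p = 1.0 / (1.0 + math.exp(-eta))
    X.append([1.0, float(rural), float(hh_size)])
    y.append(1 if random.random() < p else 0)      # 1 = food insecure

beta = fit_logit(X, y)
print("intercept, rural, hh_size:", [round(b, 3) for b in beta])
print("odds ratio, rural vs urban:", round(math.exp(beta[1]), 2))
```

The coefficient on the indicator is the log odds ratio of food insecurity for rural versus urban households, holding the other covariate fixed. Whether the standard errors from such a pooled fit can be trusted given the two sampling schemes is a separate question that this sketch does not address.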



No cake for spunky
I think the concern is not how to run the regression, but whether you can use data gathered in different places and by different sampling schemes in the regression.