# Sufficient sample size

#### PinballWizard

##### New Member
I'm very new to the world of statistics.
Could someone explain to me in layman's terms the following:
I have complete data on a population of ~37,000. I can show, across the entire population, that there are trends between certain variables (e.g. age, length of tenure, etc.) and the amount of debt.
What I'm trying to do is establish whether the effect of these independent variables on debt varies between geographical locations in a statistically significant way.
Partitioning the data by geographical location results in sample sizes of between 300 and 3000 records. Dividing these data again into age ranges (e.g. 25 to 35, 36 to 45 years old, etc.) further reduces the size of each sample.
I've produced a number of graphs using seaborn/matplotlib (e.g. regression/scatter plots, grouped column graphs), and the results fluctuate, especially with smaller sample sizes (e.g. smaller geographical locations).
What I'm trying to establish is whether there is a statistical test I can carry out that indicates whether a difference between geographical locations (e.g. average debt for 25 to 35 year olds in location A versus location B) is statistically significant or due to random error introduced by the size of an individual sample.
Many thanks
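
One way to make the question above concrete is a permutation test on the difference in average debt between two locations. The sketch below is only an illustration under stated assumptions: the debt figures are simulated, the helper name `permutation_test` is hypothetical, and the test treats each location's records as exchangeable draws, which may or may not suit the real data. Welch's t-test would be a common alternative for the same comparison.

```python
import random


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and an
    approximate p-value.
    """
    rng = random.Random(seed)
    n_a, n_b = len(a), len(b)
    observed = sum(a) / n_a - sum(b) / n_b
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Reassign records to the two locations at random.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the p-value from being exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Simulated (hypothetical) debts for 25 to 35 year olds in two locations.
data_rng = random.Random(42)
location_a = [data_rng.gauss(5000, 1500) for _ in range(60)]
location_b = [data_rng.gauss(5400, 1500) for _ in range(400)]

diff, p_value = permutation_test(location_a, location_b)
print(f"difference in means: {diff:.1f}, p-value: {p_value:.4f}")
```

A small p-value suggests that a difference this large would rarely arise from randomly reassigning records to locations; it says nothing about why the locations differ.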

#### hlsmith

##### Less is more. Stay pure. Stay poor.
@PinballWizard - I saw your other introductory post, so I thought I would provide a little general information. You use the word population. To use this word you need to have every single observation (or at least almost every value). If you have the population, then statistics are not really needed, since you know the truth and you are not trying to generalize back to a superpopulation. So if I had the ages of all the firefighters in two states, I could calculate their mean ages, and if they were different, no statistical test would be needed - I could make statements using just those values.

So do you have the population or a sample? This gets a little tricky at times if you have a population but want to 'predict' the future, which is a new population.

#### PinballWizard

##### New Member
I have data for the entire population. Taking average debt as an example, there may be two different kinds of reason why observations in location A have a different average debt from those in location B. Bear in mind that the sizes of the locations are set in stone and I can't change the location boundaries; some locations have a relatively small number of observations. So I am using the term 'sample' for the different locations and probably shouldn't be.

I'm interested in this type of variation in, e.g., average debt if it covaries with, say, access to services, crime or a health percentile index. What I need to establish is whether there is a way of quantifying the extent to which the average debt for location A could be a result of its smaller number of observations, as opposed to, say, an underlying socio-economic pressure or the availability of staff in that area.

Please feel free to tell me if my question is nonsensical in statistical circles.
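
A rough way to quantify how far a location's average could drift purely because it is small is to compute funnel-style limits around the overall mean: if a location of size n were just a random draw of n records, its mean would fall within about `overall_mean ± 1.96 * sd / sqrt(n)` roughly 95% of the time. The sketch below assumes this simple normal approximation and ignores the finite-population correction; the function names and the data are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def funnel_limits(overall_mean, overall_sd, n, z=1.96):
    """Range in which the mean of n random records is expected to fall."""
    half_width = z * overall_sd / math.sqrt(n)
    return overall_mean - half_width, overall_mean + half_width


def flag_locations(debts_by_location, z=1.96):
    """Compare each location's mean with size-dependent limits."""
    all_debts = [d for debts in debts_by_location.values() for d in debts]
    mu = fmean(all_debts)
    sd = pstdev(all_debts)  # population SD, since all records are available
    report = {}
    for name, debts in debts_by_location.items():
        low, high = funnel_limits(mu, sd, len(debts), z)
        location_mean = fmean(debts)
        report[name] = {
            "n": len(debts),
            "mean": location_mean,
            "low": low,
            "high": high,
            "outside": not (low <= location_mean <= high),
        }
    return report


# Simulated (hypothetical) debts for three locations of different sizes.
rng = random.Random(1)
example = {
    "large": [rng.gauss(5000, 1500) for _ in range(3000)],
    "medium": [rng.gauss(5000, 1500) for _ in range(800)],
    "small": [rng.gauss(5600, 1500) for _ in range(300)],
}
for name, row in flag_locations(example).items():
    print(name, row["n"], round(row["mean"]), round(row["low"]),
          round(row["high"]), row["outside"])
```

The limits widen as n shrinks, so a small location has to deviate further from the overall mean before its average stands out from what size alone would produce.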

#### hlsmith

##### Less is more. Stay pure. Stay poor.
And you have individual-level data per location? If so, you may be looking at multilevel regression, where observations are clustered inside groups. These models allow you to account for both within-group and between-group variability.
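
As a minimal sketch of the idea behind a multilevel model, the code below fits the simplest case, a random intercept per location with no covariates, using the classical one-way ANOVA (method-of-moments) variance estimates. This is a simplification and not a full multilevel regression: a real analysis would add predictors such as age and would normally use a dedicated mixed-model routine. The function name `partial_pooling` and the data are hypothetical.

```python
import random
from statistics import fmean


def partial_pooling(groups):
    """Random-intercept estimates from a dict {location: [values]}.

    Needs at least two groups and more records than groups.
    """
    k = len(groups)
    sizes = {g: len(v) for g, v in groups.items()}
    total_n = sum(sizes.values())
    grand_mean = fmean([x for v in groups.values() for x in v])
    raw_means = {g: fmean(v) for g, v in groups.items()}

    ss_within = sum((x - raw_means[g]) ** 2
                    for g, v in groups.items() for x in v)
    ss_between = sum(sizes[g] * (raw_means[g] - grand_mean) ** 2
                     for g in groups)
    within_var = ss_within / (total_n - k)
    ms_between = ss_between / (k - 1)
    # Effective group size for unbalanced data.
    n0 = (total_n - sum(n * n for n in sizes.values()) / total_n) / (k - 1)
    between_var = max(0.0, (ms_between - within_var) / n0)

    weights, shrunken = {}, {}
    for g in groups:
        if between_var > 0:
            weights[g] = between_var / (between_var + within_var / sizes[g])
        else:
            weights[g] = 0.0
        # Small groups get a lower weight and are pulled toward the grand mean.
        shrunken[g] = grand_mean + weights[g] * (raw_means[g] - grand_mean)
    return {
        "grand_mean": grand_mean,
        "within_var": within_var,
        "between_var": between_var,
        "raw_means": raw_means,
        "weights": weights,
        "shrunken_means": shrunken,
    }


# Simulated (hypothetical) debts: each location has its own true mean.
rng = random.Random(7)
sample = {
    name: [rng.gauss(center, 1500) for _ in range(n)]
    for name, center, n in [("A", 5000, 3000), ("B", 5400, 900),
                            ("C", 4700, 300), ("D", 5900, 40)]
}
fit = partial_pooling(sample)
for name in sample:
    print(name, round(fit["raw_means"][name]),
          round(fit["shrunken_means"][name]), round(fit["weights"][name], 3))
```

The ratio of `between_var` to `between_var + within_var` indicates how much of the variation in debt sits between locations rather than between individuals within a location.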

However, something else to think about is geographic similarity: the closer I am to another group, the more I resemble it. For example, people in the US state of Florida are not all the same. Residents of the panhandle may have different tendencies than those further south in the large urban centers. However, the closer you are to a group, the more likely you are to be similar to it (e.g. geospatial covariance).
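
One common summary of this kind of geographic similarity is Moran's I, which measures whether neighbouring locations tend to have similar values. It is not named in the thread; it is shown here only as one possible illustration. The sketch assumes a hand-built neighbour matrix in which `weights[i][j]` is 1 when locations i and j are adjacent and 0 otherwise; the layout and values are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for one value per location and a square weight matrix."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    weight_sum = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    squared_deviations = sum(d * d for d in deviations)
    return (n / weight_sum) * cross_products / squared_deviations


# Four hypothetical locations arranged in a line: 0 - 1 - 2 - 3.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)       # neighbours are alike
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain)  # neighbours differ
print(smooth, alternating)
```

Positive values indicate that neighbouring locations resemble each other, and negative values indicate that they tend to differ.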

#### PinballWizard

##### New Member