Stats for whole population data

#1
Hi all - I'm an employee of a government department who has been asked to answer some questions to do with a whole population data set. In short, here in NZ we only have one accident insurance scheme, so data from this scheme represents all the claims lodged for accidents in this country. I've got this data longitudinally and am being asked whether recent changes in trends represent a statistically significant change from previous periods.

so, a couple of questions:

1) I think I'm dealing with a 'whole population' but of course don't have data that will exist in the future, so is it really a 'whole' population?

2) My previous use of inferential stats has always been around applying results from a sample to a wider population - I'm not doing this here, so is inferential stats ever valid in a setting like this? I've got very little experience of longitudinal, whole population data sets, so this is new territory for me.

3) What tests should I consider when looking at whether these changes are 'significant'?

Any help would be most appreciated.

Kind regards

Jim
 

terzi

TS Contributor
#2
Hi JimRobertsonnz,

From your explanation, I assume that you may need some modeling techniques. I'd say this is a "changepoint" regression problem. Let's say you are modeling a relationship between X and Y, but you suspect that this relationship may have changed at some point based on information you have available. There are tests designed for that. This is an analysis common in econometrics, so you should search for that.
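As a rough illustration of what such a test looks like, here is a minimal sketch of a Chow-style F statistic for one candidate break date. The function names (`fit_line`, `chow_f`) and the monthly figures are made up for the example, and it assumes you already know where the suspected break is and that the errors are independent, which claims data over time may well not satisfy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sum of squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, ssr


def chow_f(xs, ys, split):
    """Chow-style F statistic for a break just before index `split`.

    Compares one line fitted to all points against two separate lines
    (k = 2 parameters each) fitted before and after the break.
    """
    k = 2
    n = len(xs)
    ssr_pooled = fit_line(xs, ys)[2]
    ssr_split = fit_line(xs[:split], ys[:split])[2] + fit_line(xs[split:], ys[split:])[2]
    return ((ssr_pooled - ssr_split) / k) / (ssr_split / (n - 2 * k))


# Made-up monthly claim counts (not real data): the slope changes after month 12.
months = list(range(24))
wobble = [1 if m % 2 else -1 for m in months]  # small deterministic "noise"
claims = [100 + m + 4 * max(0, m - 12) + w for m, w in zip(months, wobble)]

print(chow_f(months, claims, split=12))
```

The statistic would be compared with an F distribution on (k, n - 2k) degrees of freedom; a large value suggests that two separate lines fit much better than one.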

Hope this helps.
 

CB

Super Moderator
#4
Kia ora Jim,

It does sound like you're dealing with a population data set (good old ACC) - making the "statistical significance" question rather strange. A lot of the time people think "statistical significance" means "statistically important", which might be the subtext of the question you're being asked? Your guess that inferential statistics (including statistical significance questions) aren't appropriate here makes sense. When you have the actual population data, using null hypothesis testing to decide whether an effect is likely to be present in the population (i.e. a "statistically significant" effect) doesn't make much sense - you can see directly whether it's present in the population or not!

I'm not sure what dept you're at, but for instance over at Statistics NZ the fact that they're working with population data sets means that complicated inferential processes are generally ignored, and a department with giant loads of data actually mostly uses pretty simple descriptive procedures.

To me, the way to proceed would be to look at how much claim frequency has changed. I'm not sure what the best way to illustrate this is; maybe changepoint regression as Terzi suggests, maybe tabulated figures of claims-by-month data, maybe just graphs.
 
#5
I think it depends on what sort of question you're asking, but for this post consider that you're interested in how the frequency of claims has changed since last year.

One approach would be to look at all the data, which includes every single claim, and say "Yes, there are more claims per person now than last year," and there's no error in that estimate at all, because you have measured the true population parameter.

But you might imagine that the event that any one person files a claim is a random variable, say with a Bernoulli distribution, so P(claim) = p and P(no claim) = 1-p. If you also assume that this distribution holds for everyone (a big assumption, but it's just an example), then the total number of claims, X, from all n people has a Binomial distribution w/ parameters n and p. You have X from this year and X from last year, and you can ask, has p changed from last year to this year? Here's where statistical inference will come in--even if X is higher this year, it may not be large enough to suggest a significant change in p. (Flip a coin 100 times and then 100 more. If you get more heads the second time, do you assume the coin has become biased towards heads?) You're still doing inferential stats, even though you have all available data, because your parameter of interest is never directly observed.
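To make the coin example concrete, here is a minimal sketch of a two-sided, two-proportion z-test using the pooled normal approximation to the Binomial. The function name `two_proportion_z_test` and the head counts are invented for illustration; it inherits the same big assumption as above (one p for everyone, independent claims).

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2, given x1 successes in n1 trials and x2 in n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # estimate of the common p under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) written with the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical coin: 48 heads in the first 100 flips, 55 in the next 100.
print(two_proportion_z_test(48, 100, 55, 100))

# Same proportions with a million flips each time.
print(two_proportion_z_test(480_000, 1_000_000, 550_000, 1_000_000))
```

With 100 flips a seven-point difference is well within chance, while the same difference over a million flips is not. With n in the millions, almost any visible change in p will come out "significant", so the size of the change probably matters more than the p-value.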


Does this make sense?
 
#6
Thanks cowboybear. Nice to hear from someone just up the road! :) I'm at DoL as part of a qual team, so hard number crunching is not a strength for me. Looks like my hunch that the numbers are simply what they are is right. We often get asked by policy analysts whether numbers are 'significant', without much thought about what this means, I think. In many ways, they'd be happier with a sample that was 'significant' than with the whole population data!

Thanks a heap for the help

J
 
#7
Hi Atlas - certainly does make sense. An interesting point - the probability of any one person lodging a claim could indeed be a random variable - not sure about the distribution though - will have a think. Thanks a heap for the help.

J
 
#8
Sample of whole population

Sorry to tag on to your question...but I have a similar question. I have a sample (with 46 survey questions, some research-specific stuff, demographics etc), the respondents for which were chosen from the whole population (the population dataset has some variables in common with the sample, specifically mean water use per annum, and lot size). What I want to do is run a test on whether my sample is representative of the population (for my independent variable, which is the water data). I did this as follows, but would like to know if this is the correct methodology, or not...

I calculated the mean water use per annum for each dataset, then ran a one-sample t-test in SPSS (PASW 17.0) on the sample's water use, entering the population mean water use as the test value. The result was significant for some years and not significant for others. Is this the correct methodology?

But my population (mean water use per annum) data is by no means normally distributed, does this matter?
 
#9
For the topicstarter:

You can also regard a population as a process. Stock returns, for example, are a process: up to today you have all the changes, so you could think of them as a population, but actually they are a sample out of a stock-return-generating process.

In that case, you have a sample, and you should deal with statistical inference.

I don't know what questions you want answered from this data. It's funny that people with questions never really post what they are looking for, since this would help us the most in formulating an answer.

So... what are you looking for in that data? Be concrete and clear please, for us to maximize our helping potential ;).

For the guy above this post:

If your sample is taken randomly from the WHOLE population, it should be representative. It should also be large enough, and 46 surveys seems pretty ok. That's basically all you need to know.
To test it, you could derive some statistics from your sample, like mean, variance, skewness, distribution, trends, etc... and see if hypothesis tests correctly reject or fail to reject certain hypotheses about the true population parameters. So for example, you calculate a mean and a confidence interval from the sample. Is the population mean in that interval? It should be in 95 or 99% of cases, depending on the confidence level you used for the interval. And so on...
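The interval check can be sketched as below, along with a small simulation of how often random samples pass it when the population is skewed. Everything here is an assumption made for illustration: the names `mean_ci` and `coverage`, the lognormal "water use" population, the sample size of 200, and the use of a normal-approximation interval rather than a t interval.

```python
import random
import statistics


def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se


def coverage(population, sample_size, repeats=1000, confidence=0.95, seed=0):
    """Share of simple random samples whose interval contains the population mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    hits = 0
    for _ in range(repeats):
        lo, hi = mean_ci(rng.sample(population, sample_size), confidence)
        hits += lo <= true_mean <= hi
    return hits / repeats


# Made-up, right-skewed population standing in for annual water use; not real data.
gen = random.Random(42)
population = [gen.lognormvariate(5.0, 0.5) for _ in range(20000)]

print(coverage(population, sample_size=200))
```

If the sampling really is random, the printed share should sit near the nominal 95% even though the population is not normal, because the interval relies on the sample mean being roughly normal, not the raw data. With small samples or heavier skew, expect it to drift below the nominal level.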

But I'm wondering: why are you working with samples if you got population data?