# Representative sample - how to decide

#### arlesterc

##### New Member
I am interested in understanding some of the Covid-19 statistics being thrown around.

What I want to know is how to figure out whether a sample used to generate probabilities was actually representative of the population it was supposed to represent. I want to do this 'backwards', so to speak. For instance, say real-world stats show that 2% of the population that has taken Covid vaccines have had side effects, but in the original sample group only 0.2% showed symptoms - in other words, 1/10 the real-world number. What is the probability in that scenario that the original sample was representative? The extreme case would be that no side effects were noted in the original sample but 2% showed side effects in the real world. I understand that the original sample size and the 'real world' numbers are necessary to get the answer I am looking for. I want a general formula into which I can plug those and any other numbers needed, and that will spit out the answer.

To put this another way: let's say a poll predicts that a party will win 30 percent of the vote, but the party wins only 25 percent in the election. How would I calculate whether the original poll was actually representative of the actual voting public, or how close (or far off) it was?
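A back-of-the-envelope way to put a number on the poll example: if the true vote share is 25%, how often would a genuinely random poll of a given size report 30% or more? The thread does not state the poll's sample size, so n = 1000 below is a purely illustrative assumption, and the normal approximation is my choice, not anything from the posts above.

```python
import math

n = 1000          # assumed poll size (not given in the thread)
p_true = 0.25     # actual vote share
p_poll = 0.30     # share reported by the poll

# Standard error of a sample proportion under the true rate
se = math.sqrt(p_true * (1 - p_true) / n)

# z-score: how many standard errors the poll result sits from the truth
z = (p_poll - p_true) / se

# One-sided probability that a truly representative sample overshoots this
# much, via the normal approximation to the binomial
p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))

print(f"z = {z:.2f}, P(poll shows >= 30% given truth is 25%) ~ {p_value:.5f}")
```

With these assumed numbers a representative 1,000-person poll would land 5 points high only about 1 time in 10,000, so a discrepancy that large would be real evidence of a non-representative poll.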

Thanks in advance for any attention to this.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
First step is to assess whether the sample really is a sample from the target super-population. If not, the differences need to be mapped out using directed acyclic graphs and the results have to be 'transported'. This can be done if the sample meets something called selection admissibility. But if there is a variable or factor that differs between the sample and the target population, the estimates may not be identifiable to the target values and can't be recovered. The typical example is comparing NYC to LA: LA has smog that may interact with another variable, and since smog doesn't exist in NYC, you can't recover the values for that population using an LA sample.

So it depends on how the sample was derived. If it was a random sample, you have better assurance of its representativeness, but it can still vary by chance.

However, if it was not a random sample and the results come from the target population, you can have selection bias or systematic differences. In these scenarios, if you have enough data, you can model the probability of being sampled and weight the results to recover the population values. Below is an abstract I had at a conference earlier this year, which may help.

#### arlesterc

##### New Member
Thanks for the two responses.

Maybe I can be clearer.

I have read that with the mRNA vaccines there is something like a 1-2 in 100,000 chance of pericarditis/myocarditis.

(https://www.macleans.ca/news/myocarditis-covid-vaccine-pfizer-moderna-mrna/ - “What we’re calling the signal—the increase in myocarditis—is only seen in younger people, by and large. We’re starting to see some cases in the 12- to 15-year-olds, but the risk seems to be highest in the 16- to 24-year-olds, and more so in males than females and more after the second dose,” says Dr. Karina Top of the Canadian Centre for Vaccinology at Dalhousie University in Halifax. The rate of myocarditis appears to be one or two cases per 100,000, Top says. )

Now let's say I go back to the mRNA trials for the vaccine and find there were 30,000 people involved. As far as I know, no risk of myocarditis/pericarditis was reported in the samples before the vaccines were approved. What I want to figure out is how representative of the population the original sample was, given this discrepancy. There is a big difference in numbers here, so it might not be strange that the sample group had no cases of myocarditis/pericarditis. I would like to figure out how to put a number on how likely it is that the original sample was a good portrait of the actual population.
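For the specific question of a side effect never appearing in a trial, there is a simple closed form under the assumption that the trial is a random sample: the chance of seeing zero events is (1 - r)^n. The 1.5-per-100,000 rate below is an assumed midpoint of the 1-2 per 100,000 quoted above.

```python
# Probability that a trial of n randomly sampled people sees NO cases of an
# event whose true per-person rate is r: (1 - r)**n.
n = 30_000   # trial size from the post above
r = 1.5e-5   # assumed midpoint of the quoted 1-2 per 100,000 rate

p_zero = (1 - r) ** n
print(f"P(no myocarditis cases in the trial) ~ {p_zero:.3f}")  # ~0.64
```

So even a perfectly representative 30,000-person trial would see no cases roughly two times out of three, which matches the intuition in the post that missing a 1-in-100,000 event is not surprising.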

Thanks for any further feedback on this.

#### fed2

##### Active Member
chi^2; use an exact p-value if events are rare.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
> chi^2; use an exact p-value if events are rare.
I would disagree with just doing a test; this question seems bigger than that. You need to think about who participates in trials: it is well known that wealthier and healthier people join trials. Even if you found that the younger people in the trial were similar to those in the population and only their proportions were off, you could do reweighting, but you would need more information about the study participants to do so. So targeting representative results for generalization takes additional information. And it is likely a futile pursuit, unless you are trying to generalize to people who could have been in the trial (exchangeable) - who lived in the same area during the same period but did not participate, for a random reason.

Take-home message: the vaccines work, and I encourage everyone to take them for the greater good of public health!

#### hlsmith

##### Less is more. Stay pure. Stay poor.
> chi^2; use an exact p-value if events are rare.
PS, I would hate to give a non-stats-savvy person what they think is a correct tool to answer their question, which they then use to make inaccurate conclusions and propagate misinformation!

This question can't simply be turned into a basic statistical approach without a truckload of assumptions. And even then it may still miss the mark.

#### arlesterc

##### New Member
A lot is being read into my query. Why would the assumption be that I am going to spread bad information? Why the assumption that I am going to argue against the vaccine? In fact I am not: I am pro-vaccine. I have pushed people with unresearched or badly researched fears to re-think their sources. I posted this query out of technical curiosity about side effects showing up in the real world when such side effects did not show up in the trials.

All the points about the sample possibly being non-representative are conjectures - maybe logical ones - but no numbers have actually been produced to back up the 'reasonable' possibilities. The point of my query was to see whether the sample was or wasn't representative based on statistical probabilities, not on 'theories' of why it was not representative. I have not reached any conclusion - I don't have a clue whether it was or was not representative. I believe the facts about the vaccines' effectiveness based on their real-world use: the numbers are in, they are very safe. What I do know is that some side effects that show up at 1 in 100,000 or 1 in 1,000,000 in the real world did not show up in the trials, and the guidance that goes with the vaccines has been updated to reflect that 'new' knowledge. If only 30,000 people were in the trial, then there is a good chance that a 1-in-100,000 or 1-in-a-million side effect would not have shown up even with a representative sample. That is what I am trying to find a formula to put a number on.

If anyone here is concerned that I am going to somehow miscalculate, or misuse the calculation, feel free to do the math for me. So if this swine is not to be given the pearls of how to calculate it on their own, let me know the chances that a 30,000-person trial - representative or not, for the reasons given by the poster here or for other reasons - would produce no instances of a side effect that is subsequently seen in 1 in 100,000 people when x hundred million use the vaccine.

Before thinking about the reasons why the sample may not be representative, I would like to know there are some grounds for believing it to be non-representative based on the numbers - only if it is out of bounds for representativeness would I think about investigating why. And in fact I am not going down that rabbit hole regardless. Again, I am not saying the sample was or wasn't representative. I am just trying to see what formula in stats might put a number on the 'representativeness' of samples.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
> PS, I would hate to give a non-stats-savvy person what they think is a correct tool to answer their question, which they then use to make inaccurate conclusions and propagate misinformation!
>
> This question can't simply be turned into a basic statistical approach without a truckload of assumptions. And even then it may still miss the mark.
Well, I think I was fairly reasonable.

If I see someone give someone else a quick solution that may not be the correct method to answer their question, I feel inclined to point that out. There is a replicability crisis in science due to similar issues. Thus I said hold on: spoon-feeding someone a quick solution may generate an inaccurate result. I actually did not give the calculation enough thought to posit the direction of the bias, but I guarantee the result would be biased, and if you made conclusions off it they could be wrong, and you could spread wrong information ignorantly - thus misinformation. I guess that term now has a more pointed definition in the social sciences. But regardless, my point was that you could walk away thinking you had an answer. And to be clear, once again, I did not write that you would be pro or anti - just wrong in one direction.

@fed2's quote was an interesting one. I actually had to look it up at the time, and to me it didn't come off as calling you swine; it weighs the merits and utility of giving a person knowledge versus a tangible and immediate solution. But I could have misinterpreted its meaning.


#### hlsmith

##### Less is more. Stay pure. Stay poor.
> its pearls before swine if you ask me.
Given the above comment I don't see the need to delete this post. To keep things transparent this is my attempt to make it searchable.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
I will post a link to a paper, but the concept is transportability, or data fusion: reweighting a sample to generalize its results to a new population. Given that the original estimate came from a trial, I can say from experience that those individuals will not look like the general population. So the next question is whether the differences are associated with the outcome of interest. If not, study estimates may generalize; if yes, you get into the transport concept. Either path requires assumptions.

Or you can ignore the above logic and just try to test stuff, which could result in bias. That route may require having demographic info on the trial participants and the sample. Or, if you just want to see whether the sample was simply too small, you are into rare-event probability, which is addressed with Bayesian methods - the kind of thing doomsday calculations use - not my forte.
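One simple Bayesian angle on "was the sample just too small?" (my own sketch, not from the thread, and the flat prior is an assumption): with a uniform Beta(1, 1) prior on the event rate, observing 0 events in n people gives a Beta(1, n + 1) posterior, whose tail probability has a closed form.

```python
n = 30_000  # assumed trial size

def posterior_prob_rate_exceeds(x: float, n: int) -> float:
    """P(true rate > x given 0 events in n people, flat Beta(1, 1) prior).

    The posterior is Beta(1, n + 1), whose survival function is (1 - x)**(n + 1).
    """
    return (1 - x) ** (n + 1)

# Even after a clean 30,000-person trial, a 1-per-100,000 rate is far from
# ruled out, while a 1-per-10,000 rate is nearly excluded:
print(posterior_prob_rate_exceeds(1e-5, n))   # ~0.74
print(posterior_prob_rate_exceeds(1e-4, n))   # ~0.05
```

The takeaway matches the discussion above: a clean trial of this size simply cannot distinguish "no risk" from "1-in-100,000 risk".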

#### arlesterc

##### New Member
Thanks for persisting. I am reading about it, but there is catch-up reading on stats that I need to do before I can understand the test, as it is not immediately clear to me how I would apply it. So it's back to the books. I was under the impression that my query was a fairly standard statistical question and there was a formula I could use to answer it. It may be that the chi^2 test is that formula; I have to get my head around it theoretically, as the examples I have seen are at first glance not similar enough.

#### fed2

##### Active Member
You may be able to use SAS proc freq to run an exact binomial test. As an adult, you are on your own recognizance to verify that it is appropriate for your purpose, refute it, or whatever.

```
proc freq data = whatever;
    tables isAE / binomial( P=theoreticalRate );
    exact binomial;
run;
```
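For readers without SAS, here is a rough stdlib-Python analogue of the same idea: an exact one-sided binomial p-value for the observed number of adverse events against a hypothesised rate. This is an illustration, not a claim about what PROC FREQ computes internally, and the 1.5-per-100,000 rate plugged in is an assumption taken from earlier in the thread.

```python
from math import comb

def exact_lower_pvalue(k_obs: int, n: int, p0: float) -> float:
    """Exact one-sided binomial p-value: P(X <= k_obs) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(k_obs + 1))

# Zero adverse events in a 30,000-person trial vs. a 1.5-per-100,000 rate:
print(exact_lower_pvalue(0, 30_000, 1.5e-5))  # ~0.64, nowhere near significant
```

Because the p-value is so large, the exact test agrees with the intuition above: a zero-event trial of this size is entirely consistent with the reported real-world rate.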

#### hlsmith

##### Less is more. Stay pure. Stay poor.
@fed2 - for clarification, are you suggesting that they run an exact test comparing the proportion with an adverse event in the randomized controlled trial to the proportion reported in the community?

@arlesterc - back to the approach I referenced: people in the social sciences typically use a different basic approach to restore estimates to a population, called post-stratification. It is what is used for survey data. This method assumes the sample is relatively representative, without any unknown interactions that are absent from the target population. It requires knowledge of the underlying distribution of relevant covariates in both groups (sample and target population), and results stratified by those covariate groups. So it requires more information, or big assumptions.
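A minimal sketch of the post-stratification mechanics described above. Every number here is hypothetical, invented purely to show the reweighting step, not taken from any real trial or census data.

```python
# Event rate observed in each (hypothetical) age stratum of the sample:
sample_rate = {"16-24": 0.004, "25-49": 0.001, "50+": 0.0005}

# Share of each stratum in the sample vs. in the target population:
sample_share = {"16-24": 0.10, "25-49": 0.50, "50+": 0.40}
population_share = {"16-24": 0.20, "25-49": 0.45, "50+": 0.35}

# Naive (unweighted) estimate just mirrors the sample's composition:
naive = sum(sample_rate[s] * sample_share[s] for s in sample_rate)

# Post-stratified estimate reweights each stratum to its population share:
post_stratified = sum(sample_rate[s] * population_share[s] for s in sample_rate)

print(f"naive: {naive:.5f}, post-stratified: {post_stratified:.5f}")
```

Here the sample under-represents the high-risk young stratum, so the post-stratified estimate comes out higher than the naive one, which is exactly the kind of correction being described.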

Of note, the approaches I have mentioned are for trying to identify the unbiased estimate in a target population. What I believe @fed2 references is a test of whether the distribution of events is similar between a sample and a target population. A comment on that approach: if you just run the test and find a high p-value or low standardized residuals, you do not have evidence proving the distributions are similar, since it is a null hypothesis test - you end up failing to reject that they are different. Unless you run an equivalence test with a stated margin of similarity, you never test that they are similar. I get that this is nuanced, but stats can be odd and peculiar.

Your slightly different inquiry - whether the rates are the same but the RCT sample was too small - is still valid, but not an area I have explored. It could be a black swan event: you don't know what you don't know if your sample is too localized. Or, if the community data is correct, you could simulate samples from those estimates and report back the frequency with which a sample the size of the RCT had an event in it, and use that as the probability of an RCT showing an event rate as high as the population data's. That isn't a bad idea either, but it requires the assumption that the sample is representative of the target population.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
OK, if we assume the community rate is true, 3 in 200,000 will have an event. I did a quick simulation, also assuming the RCT is a random sample of that population.

After simulating 1 million samples of size 30K, 36% of them had an event rate equal to or greater than the population event rate. It is a big assumption that the quoted population rate is true and that the RCT sample doesn't systematically differ in characteristics. But given the above parameters, if the population rate is true, a null result like the study's would be expected about 2/3 of the time if the study were repeated.
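For anyone wanting to reproduce this kind of simulation, here is a stdlib-Python sketch. Assumptions: the 3-per-200,000 rate quoted above is true, the trial is a random 30,000-person sample, I use a Poisson approximation to the binomial event count (very accurate at rates this small), and I run fewer replications than the post describes.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

rate, n, reps = 1.5e-5, 30_000, 100_000  # 3/200,000 = 1.5e-5
lam = rate * n  # expected events per simulated trial: 0.45

def poisson_draw(lam: float) -> int:
    """Exact Poisson sample: count exponential inter-arrivals before time 1."""
    t, k = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1:
            return k
        k += 1

at_least_one = sum(poisson_draw(lam) >= 1 for _ in range(reps))
print(f"Fraction of simulated trials with >= 1 event: {at_least_one / reps:.3f}")
# Analytically: 1 - exp(-0.45) ~ 0.362, in line with the ~36% quoted above.
```

The simulation and the closed form agree, which is a useful sanity check on this style of argument.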


#### hlsmith

##### Less is more. Stay pure. Stay poor.
Til the day I die!

#### fed2

##### Active Member
well, not to put the cart before the horse (and not specifically identifying anyone as a horse or cart, lest they be offended), but I think it would be rare or never to see hypothesis tests on AEs in general in an RCT. I think you would be discouraged from submitting it anyway - I guess people can do whatever they want. If there were a 'primary safety endpoint', I'd bet it would be tested by a binomial-type test as described above.