Calculating probabilities with skewed distribution

#1
How do you calculate the probability of a value being part of an array of values when the array does not follow a standard (normal) distribution? In this case we have what we think is a “special” number in terms of its ability to come up with special results in a long series of calculations. In order to test it, we put in 50 random numbers to see what success they had, and then we put in the “special” number to see how it compares. The random number results are skewed, so using SD Z-Score probabilities could be challenged. We have 15 sets of results. Here’s one, ordered:
0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 10, 10, 10
The special number gave 12. What’s the probability?
(2 of the 3 "10"s are where the random number generator hit the "special" number, and a simple multiple of it, and the 3rd is where the random number hit another known candidate. For the sake of honesty we've left them in)
 
#4
what frequency of occurrence would you consider to be not 'special'?
fed2, Bearing in mind that we have multiple results to analyse, any probabilities below 0.1 (of the special number being part of the random results) would be good, as the multiple results would be mutually supportive. The example I've given is not the best and not the worst!
 
#7
Background explanation
This is to give you some idea of where the numbers come from, and why it matters. Forgive me if I’m imprecise, but that is deliberate because we have a book going to print soon, and we don’t want to reveal the detail ahead of that.
Over the last 30 years, studies of ancient artifacts and structures in the UK revealed the probable use of a previously unknown unit of length (UL). (We do have a formal name for this UL.) More recently, research has determined that this “special” UL is not new at all. The same UL was used in Mesopotamia in ancient times, and its use migrated east to China and west to Europe.
Recent study of ancient structures in the UK and western Europe appears to show that not only was this special UL used in some of the structures, but the precise measurements appear to show that the dimensions of these structures stored calendrical information. That information reveals deep knowledge of some aspects of astronomy.
Many of these structures will have been constructed using other ULs and for purposes other than storing calendrical data. So there is a lot of noise in the fundamental data that we are analysing. To try to show beyond reasonable doubt that our findings are not chance, we generated 50 random ULs around the “special” UL, and analysed the structures using each of these random ULs. We were looking for 5 key calendrical numbers to 3 levels of accuracy: 0.2%, 0.5% and 0.8%.
There are 62 of these structures in the UK that have been accurately surveyed.
The numbers given in my original post were for one of the 5 key calendrical numbers, counting the number of hits within 0.5% for the 50 random numbers. And the “special” number for comparison.
Our problem now is that while analysis assuming Standard Distribution Z-scores is strongly supportive of two of the 5 calendrical numbers, we know the distribution is not SD, and is therefore vulnerable to challenge.
While we do have science backgrounds, that was a long time ago and not in statistics. Our attempt to educate ourselves with “Statistics for those who hate statistics” and “Statistics for Dummies” has not helped with analysing this non SD array.
That’s why we are asking for your help!
 
#8
fed2, Bearing in mind that we have multiple results to analyse, any probabilities below 0.1 (of the special number being part of the random results) would be good, as the multiple results would be mutually supportive. The example I've given is not the best and not the worst!
I don't know whether talkstats flags to you entries in the thread that are not direct replies to you. In case it doesn't, please see my entry of Wed 2.12 pm
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
So these are measurements rounded to integers?

How do you know the distribution is skewed? Even with a normally distributed random variable, a small sample can appear skewed?

Probably not related, but look at Benford's Law.

If I were trying to see whether a value was coming up more than expected, I would first define the shape of the underlying distribution. If it is skewed, use the log normal.

The integer thing is still confusing me, why do we end up with whole numbers?
 
#11
So these are measurements rounded to integers?

How do you know the distribution is skewed? Even with a normally distributed random variable, a small sample can appear skewed?

> In most of our arrays, the mean is greater than the median, because there is a tail to the right. I've seen a reference that says that if the median and mean are close together, then SD Z-scores are a good guide. I've been trying to find that reference again, but can't find it anywhere.

Probably not related, but look at Benford's Law.

> Now where have I seen reference to Benford's law recently? Oh yes! The so called smoking gun of votes in ballot packages! A very interesting Law that I hadn't seen before. Makes perfect sense for many circumstances such as identifying potential financial fraud cases. I agree that it's probably not related in our case.

If I were trying to see whether a value was coming up more than expected, I would first define the shape of the underlying distribution. If it is skewed, use the log normal.

> Log-normal is not covered by elementary books, but I see that there are several calculators online. I'll give that a shot. Thanks

The integer thing is still confusing me, why do we end up with whole numbers?
> In the structures we are looking at, there are 2 obvious dimensions. We take the UL we are testing, divide it into each surveyed length, and see if we get one of the key calendrical numbers we are testing for. Besides integers, we accept simple fractions. There is plenty of evidence that the builders were very familiar with fractions (and with triangles, which were used as the basis of their more complicated designs: 45 deg and 30/60 deg triangles and integer right-angled triangles such as 3,4,5; 5,12,13; and 12,35,37 were all used). We set error bars of +/- 0.2%, 0.5% and 0.8%. We test each random UL in turn against both dimensions. If we get a hit within the error we are testing to, that counts "1" towards that random number's total. That's why they're all integers.

Joey
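The counting procedure described above can be sketched roughly as follows. This is only an illustration of why every total is a whole number. All of the values in it (the surveyed lengths, the unit length, the target number and the names) are hypothetical placeholders, not the real data, and the sketch leaves out the "simple fractions" rule because the thread does not spell it out.

```python
# Hypothetical placeholders: two structures with two surveyed dimensions each.
STRUCTURES = [(100.0, 50.0), (73.0, 36.6)]
UNIT_LENGTH = 0.2      # hypothetical unit of length being tested
TARGET = 365.0         # hypothetical stand-in for one "key calendrical number"
TOLERANCE = 0.005      # the 0.5% error bar mentioned in the thread


def count_hits(unit_length, structures, target, tolerance):
    """Count dimensions whose length / unit_length falls within tolerance of target."""
    hits = 0
    for dimensions in structures:
        for length in dimensions:
            ratio = length / unit_length
            # A hit is a ratio within +/- tolerance (relative) of the target.
            if abs(ratio - target) <= tolerance * target:
                hits += 1
    return hits


hits = count_hits(UNIT_LENGTH, STRUCTURES, TARGET, TOLERANCE)
print(hits)
```

Each dimension either counts or does not, so the score for any unit of length is a count, which is why the results are small non-negative integers rather than measurements.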
 
#12
I'd hoped that using Lognorm would give a useful comparison with Normal Distribution, but there's a problem with the comparison. With SD, the Z Score effectively gives me the probability of a result being at its Z Score or anywhere above. But the Lognorm online calculator gives the PDF, which I think is just the probability of the result occurring only where it is. Do you know of a way to make the result for Lognorm the equivalent of the SD Z-score?

Joey
 
#14
Tks katxt. Forgive me for not answering that just now. I will put up a full post after we've published. In the meantime, do you know the answer to the question of how to calculate the cumulative probability above a given value in a Lognorm distribution?
 

katxt

Active Member
#15
There is a LOGNORM.DIST function in Excel with a cumulative option which gives you the area up to a given value. You can then just take that away from 1. kat
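For anyone without Excel, the same "1 minus the cumulative" idea can be sketched in Python using only the standard library. The function names and the parameter values here are placeholders for illustration. As with LOGNORM.DIST, mu and sigma are the mean and standard deviation of ln(x), not of x itself.

```python
from math import log
from statistics import NormalDist


def lognormal_upper_tail(x, mu, sigma):
    """P(X > x) when ln(X) is Normal with mean mu and standard deviation sigma."""
    if x <= 0:
        # A lognormal variable is always positive, so everything lies above x.
        return 1.0
    return 1.0 - NormalDist(mu, sigma).cdf(log(x))


def normal_upper_tail(x, mean, sd):
    """P(X > x) for a Normal distribution: the usual one-sided Z-score probability."""
    return 1.0 - NormalDist(mean, sd).cdf(x)


# Placeholder parameters, not fitted to the data in this thread.
print(lognormal_upper_tail(1.0, 0.0, 1.0))
print(normal_upper_tail(0.0, 0.0, 1.0))
```

One caution: a lognormal model needs strictly positive values, so the zeros in the hit counts would have to be handled somehow before fitting mu and sigma.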
 
#16
Another question!

We've had a good look at different distributions and put data into online calculators to see how it fares. Conclusion. The data distribution is not lognormal and not chi^2.

Then I thought that if I summed all the hits for each of the 50 random numbers, that might smooth things out and make the distribution clearer. The result was, I think, good. The range is 18 to 49, the mean is 32.8 and the median is 31.5. I think that is a reasonable fit for a Normal Distribution. (The special number scores 50.) Now the question. The summed data used in this calculation is the sum of 15 sub-sets (5 calendrical numbers and 3 error ranges: 0-0.2%, 0.2-0.5% and 0.5-0.8%). If the totals give a Normal Distribution, is it valid to treat the sub-sets, and sums of the sub-sets (e.g. adding the first two error ranges to get 0-0.5%), as Normally Distributed even if the limited sampling (only 50 random numbers) sometimes makes them look a bit skewed?

My intention at present is to continue using Normal Distribution to find z-scores for the "special number" but adding a comment concerning uncertainty of distribution.
 

katxt

Active Member
#17
Have you considered a Monte Carlo approach where you use your accumulated data to generate a population?
I feel that your current approach is asking for trouble in the future. I understand that you intend to publish your findings, at which point your work will be scrutinized by hostile critics, especially as it sounds as if it is in a controversial field.
My advice is that you find a professional statistician to help develop a robust, unassailable basis for your analysis. At the moment it seems like a hobby approach with well meaning advice from random folks off the internet. It is not a standard problem and you need to be able to discuss it fully and openly with an expert. If you can't afford a statistician, then perhaps offer them co-authorship. kat
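One simple distribution-free way to read the numbers, which may or may not be exactly what is meant by Monte Carlo above, is to treat the 50 random-number results as draws from the "nothing special" population and ask how often they score at least as high as the special number. The sketch below uses the ordered results from the first post. The function name and the +1 correction are choices made for this illustration, not something taken from the thread.

```python
from statistics import mean, median

# The 50 ordered random-number results from the first post.
random_hits = ([0] * 4 + [1] * 4 + [2] * 7 + [3] * 10 + [4] * 9
               + [5] * 4 + [6] * 8 + [7] + [10] * 3)
special_hit = 12  # result for the "special" number


def empirical_p_value(null_scores, observed):
    """One-sided empirical p-value: share of null scores >= observed.

    The +1 in numerator and denominator counts the observed value itself,
    so the estimate can never be exactly zero.
    """
    at_least = sum(1 for score in null_scores if score >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


p = empirical_p_value(random_hits, special_hit)
print(len(random_hits), mean(random_hits), median(random_hits))
print(round(p, 4))  # 0.0196
```

No assumption about the shape of the distribution is needed. The price is resolution: with only 50 random numbers the smallest value this can report is 1/51, so quoting smaller probabilities would need many more random ULs.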
 
#18
Thank you Kat. Good advice. We hadn't considered either, and will now consider both. Joey
 
#19
I am brand new and want to ask a question. I am having a problem finding where AND HOW to ask a question regarding probability. So, if someone can tell me how to ask it and where, I would be much obliged.

There was a note at the top of the list mentioning the members of the forum expect some work from those asking questions. I get it.

What I wanted to know was the probability of drawing one particular card from a standard, well shuffled 52-card deck, drawing 10 random cards one after another. If all were numbered 1-52, what would be the probability of drawing card #1 in the 10 sequential attempts? For the first, it seemed likely to me that the probability would be 1/52 and the second card would be 1/51, the third 1/50 ... and the last 1/43. My first instinct was to add them all up, thinking, you know, that the answer would be the sum total of the probabilities of each of the 10 events. That came up to 21.13%. Took me by surprise. I began to question my logic. Searching the internet for the problem yielded many other sample problems. But mine was nowhere to be found. So that's my homework. Then I found you guys, registered 'n now, here I am, asking my question, posing it to a body of experts. Am happy to get an answer here or to post elsewhere as may be the case. Would-a-done so. Just dunno how 2 go about it.
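A short worked check of the card question, offered as a sketch rather than a definitive answer. The fractions 1/51, 1/50 and so on are conditional on card #1 not having turned up yet, so each one has to be weighted by the chance of getting that far before they can be added. The variable names are just for this illustration.

```python
from fractions import Fraction

DECK = 52
DRAWS = 10

# The sum from the post: adds the conditional probabilities without weighting.
naive = sum(Fraction(1, DECK - k) for k in range(DRAWS))

# Weighted version: P(card #1 first appears on draw k+1)
#   = P(not seen in the first k draws) * 1 / (cards remaining).
p_total = Fraction(0)
p_not_yet = Fraction(1)
for k in range(DRAWS):
    remaining = DECK - k
    p_total += p_not_yet * Fraction(1, remaining)
    p_not_yet *= Fraction(remaining - 1, remaining)

print(p_total)                   # 5/26
print(round(float(naive), 4))    # 0.2113
```

Every weighted term works out to 1/52, so the total is 10/52 = 5/26, about 19.23%, compared with the 21.13% from the unweighted sum. That matches the shortcut view that card #1 is equally likely to sit in any of the 52 positions of the shuffled deck, and 10 of those positions get drawn.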