Chance?

#1
Hi everybody.

I'm trying to understand the methodology for determining the likelihood a sequence that repeats twice in a time series is not by chance. Assuming the two sequences are identical, I'm guessing the likelihood is some function of the length of the sequence and the size of the series.
 
#3
Thanks for the reply. It may help me to better understand my question.

Assume there is a time series: {1,5,3,5,1,5,3,5,}. Can one determine the probability that the sequence1,5,3,9,5 occurred two times in this series by chance? I would expect that it would be unlikely. Now assume that the series consisted of 10 billion single digit integers. It occurs to me that the probability of finding 1,5,3,5 two times in that series would be very likely.

Is there a generalized method for determining these probabilities?
 
#5
That's a good question, but I'm not sure it is what I was trying to get at.

Perhaps the problem could be simplified and restated thus: suppose we have a random letter generator that outputs lower case letters of the first 10 letters of the alphabet (a,b,c, . . .,h,i,j), and we use this random letter generator to create a series, say, 1000 letters long. Now we create a 3 letter sequence, say: d, g, c. What is the likelihood that that 3 letter sequence will appear in the 1000 letter series?

It seems to me that this should be a straightforward probability problem.
 
#6
So in a 1000 letter series generated by the above random letter generator, one should expect to find approximately 100 "d"s, and approximately 10 "d, g" pairs, and approximately 1 "d, g, c" sequence? Am I on the right track here?
 

fed2

Active Member
#7
yeah i think you are on the right track. at least as long as the the letter series is much longer than the sequence you are counting.
 
#8
Thanks for the reply,

OK, now let’s assume there is a 1000 element series, as above, comprised of the first ten letters of the alphabet: (a, b, c, d, e, f, g, h, i, j). We don't know whether it is randomly generated or not, but we work from the assumption that it is. We note that a three-letter sequence occurs twice in the series. That shouldn't be so unlikely that it would call into question the assumption the series is randomly generated.

But suppose we find a five-letter sequence that occurs twice in the 1000 element series, or a ten-letter sequence, or a twenty-letter sequence that occurs twice in the series? At some length the occurrence of two identical sequences in the 1000 element series is going to be so unlikely that it refutes the assumption that the series is randomly generated.

I'm guessing that this is some sort of elementary probability/statistics problem which can be generalized as a function the length of the series and the length of the identical sequences. I'm hoping someone can help me understand how to do this.

Any thoughts?
 

katxt

Active Member
#9
You can do a simulation for any particular case. I doubt if there is an exact closed form answer because of end effects and the possibility of overlapping sequences.
 
#10
Thank you for the help Katxt,

The possibility of overlapping sequences seems to me to be an excellent point that I hadn't thought of. Also, I guess there isn't any pressing need to to have a closed-form solution since one can calculate the probability for an individual instance.

I feel like I am making some significant progress in understanding this type of problem.

So if one had a 1K letter series as above and it was not known if it was randomly generated or not. And if in it there were two identical fifty-letter sequences (let's say one in the first third, not near the beginning of the series, and one in the last third of the series, not near the end). Then how might one calculate the probability of this happening in a single 1K letter randomly generated series?
 

fed2

Active Member
#12
simulation will probably explode your computer for even moderate n, but is good place to start. I think when it comes to complicated sequence counting like problems the go to stratagem is to establish some sort of recursive relationship, ie of the form x_n+1 = x_n*2 + ...., or something. Then generating function. Thats from distant memory. For simple problems you might be able to count in a simpler way.



Heres some r code to get you goin'


C-like:
library(stringr);
library(gtools);



n = 7;
nchars = 2
alphabet = letters[1:nchars]

words = data.frame(  permutations(nchars, n, v=alphabet, repeats.allowed = T )   );

makeWord = function(j){
  paste(words[j,], collapse = '')
}

words_ = as.vector( unlist(  lapply(1:nrow(words), makeWord )  )  );


subseq = 'ab'

countsubseq = function(j){
  str_count(words_[j], subseq)
}


counts_ = as.vector( unlist(  lapply(1:nrow(words), countsubseq )    )  );
 
#13
Yeah, I don't want to blow up my computer , , , ,

I am not really interested in calculating the probability of a fifty-letter sequence occurring twice in a 1K series (not overlapping, etc.). I am interested in understanding how it is done. And in this case why simulation might be a better way than using traditional statistical analysis.

So perhaps I should start by going back to the simplier form of the question and ask: how would one calculate the actual probability of a four-letter sequence occurring twice (not overlapping, etc.) in a 1K letter series (as described above), using traditional statistical analysis? Is this such a difficult problem that simulation would be a preferable approach to using traditional statistical analysis?
 

fed2

Active Member
#14
for sim just do:

counter=0
generate a sequence
if sequence has 2 occurences then counter + 1

do it until blue in face.
 

katxt

Active Member
#15
Here's a rough start.
10 symbols. A full 1000 string has approx 1000 substrings of 4. There are 10^4 or 10000 possible substrings. So the probability that some given substring appears somewhere in the 1000 string is about 10%.
Not what you asked for but it might be useful.
 
#16
So if one picks a 4 symbol substring from the rqandomly generated 1k symbol series, the probability that it will occur a second time in the series is . . . . ~10%?
 
Last edited:

katxt

Active Member
#17
Yes. But it's virtually certain that there will be at least one matching pair in the 1000 string. Here's what a simulation gave. 2500 runs each of matches in a 1000 string. Probably accurate to nearest % or so.

1605056982644.png
If you are just interested in testing for randomness, there are many suites of tests. This isn't my field but there seems to be plenty of stuff on the net.
 
#18
That simulation was very helpful, thank you.

I'm not interested in testing for randomness per se, although the question of randomness is implied in the problem. What I am investigating is at what length two identical substrings in a series become so large that the probability they occurred randomly in a series approaches zero. Using simulation, you showed that in a 1k series this substring length is around 8 or greater.

The specific problem I am looking at involves substring pairs of ~50 symbols in a ~40k series.

Using your "rough start" methodology from above there are 10^50 possible substrings. I’m not a professional mathematician, but by my estimation, that is a really big number—a really, really big number. Even relative to a 40k series.

On the one hand, one could just say that it is impossible that two 50 symbol substrings could randomly occur in a 40k series. On the other hand, it would be nice to be able to quantify the extent of that impossibility.
 
Last edited:

katxt

Active Member
#19
You could probably put an upper limit on it. The chance that the first 50 symbol substring is repeated is about 40 000/10^50. There are 40 000 strings to make a possible match. 40 000*40 000/10^50 <10^-40. (I think, but it's not easy to get your head round this stuff. Pretty unlikely, anyhow.)