So I've been getting into concepts like the Library of Babel and the infinite monkey theorem, and have been trying to work out how to calculate the probability that a given string of text contains a particular substring at least once. To simplify things, I'm going to talk about numerals. Suppose my phone number is 123-4567, and I generate a random string of 100 decimal digits, with each digit having an equal, independent probability of being anything from 0 to 9. How many of the googol possible 100-digit strings contain my phone number, 1234567?
My initial way of looking at the problem was this. I can treat it as if "1234567" were a single digit separate from the others, since all the strings I'm interested in will have 1234567 "clumped together" in them. If the total string is just 9 digits, then there are 3 places the desired substring can appear: the "beginning, middle, and end". For each of those locations, there are 10^2 permutations for the other 2 digits in the string, and thus the probability of a 9-digit number containing 1234567 is 300/1000000000, or 0.00003%. This seems to work well, but if I use only this as the algorithm, then for much larger numbers my probabilities will exceed 1.
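To make that last point concrete, the count from this simple method is (n-6) * 10^(n-7), so the naive "probability" is just (n-6)/10^7. Here's a throwaway Python sketch (the function name is just mine) showing it sail past 1:

def naive_probability(n):
    # (n - 6) placements of the 1234567 block, times 10^(n - 7) choices
    # for the other digits, out of 10^n total strings: (n - 6) / 10^7.
    return 0.0 if n < 7 else (n - 6) / 10**7

for n in (9, 14, 100, 10_000_007):
    print(n, naive_probability(n))
# n = 9 gives 3e-07 (the 300 / 10^9 above), but around n = 10_000_007 the
# "probability" passes 1 and keeps growing, which obviously can't be right.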
One of the problems with the simple analysis is this. Consider 14-digit numbers containing 1234567. I might suppose that the possibilities can be listed thus:
1234567-------
-1234567------
--1234567-----
---1234567----
----1234567---
-----1234567--
------1234567-
-------1234567
Here a hyphen can be any digit. The problem is that the first and last sets of numbers on the list have exactly one string in common:
12345671234567
and my original analysis counts it twice. Now, for 14 digits, counting only one possibility twice makes a very small error, but the error increases as the length of the total string does.
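The 14-digit case is too big to enumerate directly, but a scaled-down analogue shows the same double-count. Here's a rough Python check (my own toy example, using the 2-digit pattern "12" in 4-digit strings, where the naive count is 3 * 100 = 300):

# Toy analogue of the 14-digit overlap problem: pattern "12" in 4-digit strings.
pattern = "12"
length = 4
naive = (length - len(pattern) + 1) * 10 ** (length - len(pattern))
actual = sum(1 for i in range(10 ** length)
             if pattern in str(i).zfill(length))
print(naive, actual)  # 300 vs 299: "1212" is the one double-counted string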
So here's my next shot at it. I'll call the function that counts the n-digit strings containing 1234567 f(n), where n is the length of the string. f(0) through f(6) are all 0, and f(7) = 1. After that, my formula is:
f(n) = (n-6) * 10^(n-7) - f(n-7)
So from 7 to 13 digits it's relatively easy to calculate. At 14, I end the calculation by subtracting 1 (for the double-counted 12345671234567), resulting in 8 * 10^7 - 1 = 79,999,999. From 15 digits onwards, I'm basically working out the problem in terms of counting up all the places I can "insert" the magic number 1234567, then subtracting the redundant cases where the early hyphens "already contain" some previously-tallied permutations.
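In case it's useful, here's how I've been computing that, as a short memoized Python sketch (this is just my formula above, so it may well over- or under-count; that's part of what I'm asking):

from functools import lru_cache

@lru_cache(maxsize=None)
def f(n):
    # My attempted count of n-digit strings containing 1234567 (possibly wrong).
    if n < 7:
        return 0
    if n == 7:
        return 1
    # (n - 6) placements times 10^(n - 7) fillers, minus my guess at the
    # double-counted strings from the overlap argument.
    return (n - 6) * 10 ** (n - 7) - f(n - 7)

for n in range(7, 16):
    print(n, f(n))
# f(14) comes out to 8 * 10**7 - 1 = 79,999,999, matching the "subtract 1" step.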
So, a couple questions. Firstly, am I close to the right answer? Secondly, what's a good way to do recursive functions like this one? Thirdly, how should I factor in the apparent difference in the probabilities of different substrings? I'll explain that last one now.
If I flip a fair coin twice, I have an equal chance of getting TT, TH, HT, or HH. But if I flip it 4 times, I have different odds of getting those combinations as substrings.
HHHH HHHT HHTH HHTT
HTHH HTHT HTTH HTTT
THHH THHT THTH THTT
TTHH TTHT TTTH TTTT
Of those 16 combinations, 11 contain TH but only 8 contain HH.
From this, I'm assuming that the number 123123 has a worse chance of appearing in a sequence of 100 random numerals than 123456 does, because of the "cycling" possibilities (its copies can overlap with each other, just as HH shows up in fewer of the 4-flip sequences than TH does). Is there an easy way to account for this in my calculations of monkey-typing probabilities, or is it way too complex?
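Both the coin counts above and this digit hunch are small enough to brute-force if the patterns are kept short. Here's the rough Python check I have in mind (my own throwaway script, with "11" and "12" standing in for the repeating and non-repeating patterns):

from itertools import product

# Check the coin-flip table: how many of the 16 length-4 sequences contain TH vs HH?
flips = ["".join(p) for p in product("HT", repeat=4)]
print(sum("TH" in s for s in flips), sum("HH" in s for s in flips))  # 11 and 8

# Same idea with digits: a repeating pattern ("11") vs a non-repeating one ("12")
# across all 10^6 six-digit strings.  I'd expect the repeating pattern to show up
# in fewer strings, just as HH shows up in fewer sequences than TH.
count_11 = count_12 = 0
for i in range(10 ** 6):
    s = str(i).zfill(6)
    count_11 += "11" in s
    count_12 += "12" in s
print(count_11, count_12)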
Thanks in advance for any thoughts you have…