# Probability of a string containing a specific substring

#### Chris Morrow

##### New Member
So I've been getting into concepts like the Library of Babel and infinite monkey theorem, and have been trying to work out how to calculate the probability that a given string of text contains a particular substring at least once. To simplify things, I'm going to talk about numerals. Suppose my phone number is 123-4567, and I generate a random decimal number of 100 digits, with each digit having an equal independent probability of being anything from 0 to 9. How many of the googol possible 100-digit numbers contain my phone number, 1234567?

My initial way of looking at the problem was this. I can treat it as if "1234567" was a single digit separate from the others, since all the strings I'm interested in will have 1234567 "clumped together" in them. If the total string is just 9 digits, then there are 3 places the desired substring can appear: the "beginning, middle, and end". For each of those locations, there are a total of 10^3 permutations for the other 3 digits in the string, and thus the probability of a 9-digit number containing 1234567 is 3000/1000000000, or .003%. This seems to work well, but if I use only this as the algorithm, then for much larger numbers my probabilities will exceed 1.

One of the problems with the simple analysis is this. Consider 14-digit numbers containing 1234567. I might suppose that the possibilities can be listed thus:

1234567-------
-1234567-------
--1234567-----
---1234567----
----1234567---
-----1234567--
------1234567-
-------1234567

Where a hyphen can be any digit. The problem is that both the first and last set of numbers on the list have exactly one string in common:

12345671234567

and my original analysis counts it twice. Now, for 14 digits, counting only one possibility twice makes a very small error, but the error increases as the length of the total string does.

So here's my next shot at it. I call the function for finding the total number of possibilities that contain 1234567 the f(n), where n is the size of the string. The f(0) through f(6) are all 0. Next, f(7) = 1. After that, the formula is:

f(n) = (n-6) * (10^(n-7)) - f(n-6)

So from 7 to 13 digits, it's relatively easy to calculate. At 14, I end the calculation by subtracting 1, resulting in . For 15 digits onwards, I'm basically working out the problem in terms of counting up all the places I can "insert" the magic number 1234567, then subtracting the places of redundancy where the early hyphens "already contain" some previously-tallied permuations.

So, a couple questions. Firstly, am I close to the right answer? Secondly, what's a good way to do recursive functions like this one? Thirdly, how should I factor in the apparent difference in the probabilities of different substrings? I'll explain that last one now.

If I flip a fair coin twice, I have an equal chance of getting TT, TH, HT, or HH. But if I flip it 4 times, I have different odds of getting those combinations as substrings.

HHHH HHHT HHTH HHTT
HTHH HTHT HTTH HTTT
THHH THHT THTH THTT
TTHH TTHT TTTH TTTT

Of those 16 combinations, 11 contain TH but only 8 contain HH.

From this, I'm assuming that the number 123123 has a better chance of appearing in a sequence of 100 random numerals than 123456 does, because of the "cycling" possibilities. Is there an easy way to account for this in my calculations of monkey-typing probabilities, or is it way to complex?

Thanks in advance for any thoughts you have…

#### Miles

##### New Member
I think you are close to the solution. Let us fix notation. We are given alphabet of m characters, random string S of length n and we want to calculate probability that substring K of length k appears in S. As you mentioned we can try to match string K at n-k positions. Unfortunately whether a string matches on the next position depends how much it matched on previous position. Thus Bernoulli model is not enough. Markov chain is one of possible models. (to be specific Time-homogeneous Markov chain with a finite state space). Basically you need to construct transition matrix for specific substring K and take n-th power of that matrix. The value in the upper right corner will be a searched probability. To give you example how to construct transition matrix let us construct it for HH substring. We have three states:
A. no match
B. one match
C. whole strings match
Thus matrix will be 3x3. Entries will be probabilities of going from one state to the next state:

A. B. C. <--next state
A. 1/2 1/2 0
M1=B. 1/2 0 1/2
C. 0 0 1
^ current state

For TH string:
A. B. C.
A. 1/2 1/2 0
M2=B. 0 1/2 1/2
C. 0 0 1

0.313 0.188 0.500
M1^4 = 0.188 0.125 0.688 thus searched probability is 1/2
0.000 0.000 1.000

0.063 0.250 0.6875
M1^4 = 0.000 0.063 0.938 thus probability is 0.6875=11/16
0.000 0.000 1.000

Note that results of matrix multiplication are rounded at a places. If you want more
efficient calculations you may try to read e. g. "A Unified Approach to Word Occurrence Probabilities" by Mireille Regnier.