On the main Pi page, I mentioned that if we view Pi as a big, random string of numbers (which is close enough for our purposes), then we can figure out the odds of finding any string in the first 100 million digits of Pi:
Number Length | Chance of Finding |
---|---|
1-5 | 100% |
6 | Nearly 100% |
7 | 99.995% |
8 | 63% |
9 | 9.5% |
10 | 0.995%% |
11 | 0.09995% |
Where do these numbers come from, and how can you compute them?
Let's say you're searching for a single digit in Pi, and pretend again that Pi is random. If you pick a number between 0 and 9 at random, the chance that it's equal to your search digit is 1 in 10, (10%, or 0.1).
That's pretty simple, but what happens if you want to search for a two digit string? Well, you can approximate this by picking two numbers. If the first doesn't match, then it's over. But if the first does match, you have to try to match the second, too. Each of these has a probability of 0.1, and we'll assume that the numbers are completely independent. So 10% of the time the first number matches, and 10% of 10% of the time, both numbers match, which is just 1%, or a probability of 0.01. We'd have a 1 in 1000 chance (0.001) of finding a three digit search string, and so on.
If we assume that Pi is random, the above formula gives us the chance that any particular position matches. So for a two digit search string, there's a 1% chance that it matches at position 1, a 1% chance that it matches at position 2, and so on. So the chance of finding the search string at all is equal to the chance of finding it at any of those positions. How do we figure that out?
Let's turn the problem on its head. The chance of finding it is simply the opposite of the chance of not finding the search string. "Well, duh, Dave, that's obvious!" you may say -- but wait. We already figured out this kind of chance earlier. How do we not find something? Well, we first have to not find it at position 1, and then not find it at position 2, ... and keep on going all the way to the end of our digits. This is just like what we did earlier to figure out the chance of one position matching!
If we have a 10% chance of matching at any position, then we have a 90% chance of not matching. So the odds of not matching the entire string of pi is equal to 90% of 90% of 90% of ... and so on, for each digit of Pi that we have. Mathematically, this would be 0.9 to the power of "N" (0.9N) if we have N digits. And then the odds of finding the string would just be 1 - (0.9)N.
Putting that all together, we know that the chances of finding a search string at any position are 0.1d, where "d" is the length of our search string. So the entire probability is 1 - (1 - 0.1d)N.
Continuing along the mathematical path, it turns out that we've accidentally stumbled into something called the binomial probabilities. Binomials come about when you ask "what are the odds of getting some number k of heads out of n flips of a coin." Just to make things tricky, let's let the coin be biased in some way - it gets "heads" with probability p (that is, if p = 0.6, then 60% of the time, the coin lands heads).
Luckily for us, asking about zero occurrences of heads is easy, as the formula above showed. But we could ask other questions, like "what are the odds of finding my birthday twice in the first 100,000,000 digits of Pi?" These questions are harder (computationally) to answer than the zero case, because we have lots of different ways to find your birthday twice. We could find it once at position 1 and once at position 2, or once at position 1 and once at position three, and so on. Even very fast computers start to choke when the numbers get big. And then we could make it even worse -- what if we want to know how likely it is to find your birthday at least 100 times in Pi?
The solution to this problem is to use what's known as the Poisson approximation to the binomial, when the numbers are large. We can actually approximate the above formula as:
Odds(finding string of length k in N digits of pi) = 1 - 1/e(N*0.1d).
That looks a little complicated, until we realize that 0.1d is just really just one divided by the number of search strings that have d digits. So if d is three, there are 1000 strings (0, 1, 2, ..., 999). So 0.13 = 1/1000 = 0.001. And N is just the number of digits of pi. Ah-ha! So what this really means is that we can calculate the odds simply as 1 - 1/e(digits of pi / possible searches). So if we have 100,000,000 digits of pi, and we can search for 100,000,000 possible strings (8 digit search strings), then our probability is simply 1 - 1/e. With twice as many digits as search strings, the probability becomes 1 - 1/e2. And so on.
At this point, other people have explained the math far better than I, so I leave you to the good graces of the Internet.
Thanks to Evan Romer for the suggestion to add this explanation.