Magic 8 Ball Problem

This problem is probably simple enough to have a well-known analogue; I just don't know its name, so I'm going to describe it and hopefully somebody can point me in the right direction.



The problem is this: estimating the number of sides of a Magic 8 Ball.



Let's say you perform the following process:



  1. Shake the "Magic 8 Ball" and mark down the result

  2. If the result has not previously been seen, add it to the set of possible results and label the trial "N" (for "new")

  3. If the result has been previously seen, label the trial "O" (for "old")

  4. Repeat all steps

I imagine the following kind of sequence occurring:
NNNNONNOONOONOOONOOOONONNOONOOONNOONOOOOOONOOONOOOOOOONOOOOOOOONOOOOOOOO...



Now imagine we don't know there are twenty sides to a Magic 8 Ball. Or imagine that you have a Magic 8 Ball with 1000 sides. As the number of "new" trials approaches the actual number of sides, we'll see increasingly many "old" trials in the mix. Once we've seen all of the possible results, we'll always get "old" results. But we're never 100% sure we've seen every single possible result.
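
For concreteness, here's a minimal simulation of the process above (a Python sketch; the 20 sides, the uniform-shake model, the fixed seed and the helper name are just assumptions for the demo):

```python
import random

def shake_sequence(n_sides, k_shakes, seed=0):
    """Shake an n_sides-sided ball k_shakes times; label each trial N (new) or O (old)."""
    rng = random.Random(seed)
    seen = set()
    labels = []
    for _ in range(k_shakes):
        side = rng.randrange(n_sides)
        labels.append("O" if side in seen else "N")
        seen.add(side)
    return "".join(labels)

print(shake_sequence(20, 60))  # something like NNNNONNOONOO..., with O's taking over as sides repeat
```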



So here are the questions I'm interested in:



  • As we proceed with trials, can we estimate the total number of "sides" of the Magic 8 Ball based on the number of "new" and "old" trials up to this point?

  • Can we calculate a probability that our current estimate is correct, or a probability that some bounded estimate is correct?






asked Jul 19 '15 at 12:42 by aardvarkk


  • Would you say that without even seeing a result the "magic 8 ball" has an equal chance of having any positive integer number of sides? There is no maximum possible number of sides, and no preference amongst them? (I am looking for the a priori probabilities.)
    – muaddib
    Jul 19 '15 at 12:52







  • These are not the same problems; just to give some idea what sort of approaches are used for such problems: en.wikipedia.org/wiki/German_tank_problem, en.wikipedia.org/wiki/Mark_and_recapture.
    – joriki
    Jul 19 '15 at 13:12











  • @muaddib I would say that it could have any positive integer number of sides, there is no maximum, and no preference among them, yes. Oh, and that all results are equally likely to occur (uniform distribution).
    – aardvarkk
Jul 19 '15 at 13:33

1 Answer

answered Jul 19 '15 at 15:08 by joriki (accepted)










This problem is similar to the German tank problem. You can use the same approaches used for that problem, with the number of distinct sides seen and its conditional distribution taking the place of the highest serial number seen and its conditional distribution. (Note that the only information we have that tells us anything about the number of sides of the magic ball is the number of distinct sides we've seen; all the details of which side we saw when, both the ones you recorded in your sequence and the ones you didn't record, are irrelevant, since their conditional distribution given the number of sides seen is the same irrespective of the total number of sides, i.e. they carry no information on the total number of sides.)



I'll work out the Bayesian approach in analogy to the treatment in the Wikipedia article linked to above; I won't go into much detail for the parts that are common to the two calculations.



So let $n$ be the number of sides of the magic ball, $k$ the number of shakes performed so far, and $m$ the number of distinct sides seen. (All the notation is in analogy to the article.) Then the desired credibility $(n\mid m,k)$ is given by



$$
(n\mid m,k) = (m\mid n,k)\,\frac{(n\mid k)}{(m\mid k)}\;.
$$



We can take the required conditional probability $(m\mid n,k)$ of seeing $m$ distinct sides on $k$ shakes of a ball with $n$ sides from this answer:



$$
(m\mid n,k)=\frac{m!}{n^k}\binom nm\left\{k\atop m\right\}\;,
$$



where the $\left\{k\atop m\right\}$ are Stirling numbers of the second kind. As in the article, we can eliminate $(m\mid k)$ by marginalizing over all possible $n$:



$$
(m\mid k)=\sum_{n=1}^\infty(m\mid n,k)\,(n\mid k)\;.
$$
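
If you want to evaluate $(m\mid n,k)$ numerically, here's a small Python sketch; computing the Stirling numbers by inclusion–exclusion is my own choice of method, and the helper names are mine:

```python
from math import comb, factorial

def stirling2(k, m):
    """Stirling number of the second kind, via the inclusion-exclusion formula."""
    return sum((-1)**j * comb(m, j) * (m - j)**k for j in range(m + 1)) // factorial(m)

def p_m_given_nk(m, n, k):
    """P(m distinct sides observed | n sides total, k shakes)."""
    return factorial(m) * comb(n, m) * stirling2(k, m) / n**k

print(p_m_given_nk(13, 20, 20))  # ~0.28, the modal case used in an example further down
```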



Now comes the shady part, $(n\mid k)$, your prior credibility that there are $n$ sides after having performed $k$ shakes. In contrast to the tank problem, where observing a tank changes not only the highest serial number seen but also the minimum number of tanks, in our case $(n\mid k)$ doesn't depend on $k$. We can use the same trick of introducing a maximum number $\Omega$ of sides up to which the prior credibility is uniform, and hoping that $\Omega$ will magically drop out of the result. Then $(n\mid k)=1/\Omega$ for $1\le n\le\Omega$. Putting it all together, we have



$$
(n\mid m,k)=\frac{(m\mid n,k)}{\sum_{n=1}^\Omega(m\mid n,k)}=\frac{\frac{m!}{n^k}\binom nm\left\{k\atop m\right\}}{\sum_{n=1}^\Omega\frac{m!}{n^k}\binom nm\left\{k\atop m\right\}}=\frac{\frac1{n^k}\binom nm}{\sum_{n=1}^\Omega\frac1{n^k}\binom nm}\;.
$$



Now, interestingly, if we want to let the sum in the denominator run to infinity to get rid of $\Omega$, we find that it converges if and only if $m\le k-2$. That is, if you had no prior idea how many sides the magic ball might have and you've never seen the same side twice, then you still have no idea how many sides it might have. That's perhaps to be expected; what's perhaps not quite as expected is that this is still true (though "only" with a logarithmic divergence in the denominator) if you've seen a single reoccurrence of a side. Only with at least two reoccurrences can you overcome your prior ignorance as to the possible scale of the number of sides.
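
Since the $m!\left\{k\atop m\right\}$ factor cancels, this posterior is cheap to evaluate numerically. A minimal Python sketch, with a finite cutoff `omega` standing in for $\Omega$ (the cutoff value is an arbitrary assumption, large enough that the tail is negligible once $m\le k-2$):

```python
from math import comb

def posterior(m, k, omega=10_000):
    """Posterior over n, given m distinct sides in k shakes; uniform prior on 1..omega."""
    weights = {n: comb(n, m) / n**k for n in range(1, omega + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

print(posterior(1, 3)[1])  # ~0.61, the "lemon" example worked out below
print(posterior(2, 4)[2])  # ~0.28, the m=2, k=4 example below
```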



Let's work out some examples. The first convergent case is $m=1$, $k=3$, i.e. you shook three times and got the same side every time. Then



$$
(n\mid 1,3)=\frac{\frac1{n^2}}{\sum_{n=1}^\infty\frac1{n^2}}=\frac6{(n\pi)^2}\;.
$$



So at this point you're already $6/\pi^2\approx61\%$ sure that the thing is a lemon and only ever shows this one side. More generally, we have



$$
(n\mid 1,k)=\frac1{n^{k-1}\zeta(k-1)}\;,
$$



so after seeing the same side four times in a row your inclination to go to the store to get your money back has risen to $1/\zeta(3)\approx83\%$. On the other hand, if you do see $m=2$ different sides in $k=4$ shakes, you have



$$
(n\mid 2,4)=\frac{\frac1{n^4}\binom n2}{\sum_{n=1}^\infty\frac1{n^4}\binom n2}=\frac{\frac{n-1}{n^3}}{\sum_{n=1}^\infty\frac{n-1}{n^3}}=\frac{\frac{n-1}{n^3}}{\zeta(2)-\zeta(3)}\approx 2.258\,\frac{n-1}{n^3}\;,
$$



so now only about $28\%$ of the credibility is concentrated on the possibility that these are the only two sides.



Now, you'd asked about an estimate of the number of sides. The expected value of $n$ given $m$ and $k$ is



$$
\mathbb E(n\mid m,k)=\frac{\sum_{n=1}^\Omega\frac1{n^{k-1}}\binom nm}{\sum_{n=1}^\Omega\frac1{n^k}\binom nm}\;.
$$



So it turns out that even after two reoccurrences you can't get a finite estimate of the number of sides if you didn't have any prior clue how many there might be – you need three reoccurrences for the sum in the numerator to converge. In the case of always observing a single side, the expected value is



$$
\mathbb E(n\mid1,k)=\frac{\zeta(k-2)}{\zeta(k-1)}\;,
$$



so after seeing the same side four times and nothing else, your estimate of the number of sides is $\zeta(2)/\zeta(3)\approx1.368$.
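
These constants are easy to check numerically; a throwaway Python sketch, with a crude partial sum standing in for $\zeta$:

```python
from math import pi

def zeta(s, terms=200_000):
    """Crude partial sum of the zeta series; adequate at this precision for s >= 2."""
    return sum(n**-s for n in range(1, terms + 1))

print(6 / pi**2)          # ~0.608: credibility of n=1 after m=1, k=3
print(1 / zeta(3))        # ~0.832: credibility of n=1 after m=1, k=4
print(zeta(2) / zeta(3))  # ~1.368: E(n | m=1, k=4)
```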



To give a more realistic example, let's say you shake the ball with $n=20$ sides $k=20$ times. Then the mode of the distribution of the number of distinct sides you see is at $m=13$ (at a probability of about $28\%$). So let's say you observe that most probable value and try to estimate the number of sides. Then



$$
(n\mid 13,20)=\frac{\frac1{n^{20}}\binom n{13}}{\sum_{n=1}^\infty\frac1{n^{20}}\binom n{13}}\approx\frac{8.25\cdot10^{19}}{n^{20}}\binom n{13}\;.
$$



Here's a plot, with a nice mode at $n=20$. The estimate from the expectation value is



$$
\mathbb E(n\mid 13,20)=\frac{\sum_{n=1}^\Omega\frac1{n^{19}}\binom n{13}}{\sum_{n=1}^\Omega\frac1{n^{20}}\binom n{13}}\approx26.5\;,
$$



which perhaps shows that you might be better off using the mode as the estimator rather than the mean, at least if you don't expect huge numbers of sides.



Another question we might ask is: If we've seen all $m=n$ sides, what number $k$ of trials do we need before we're, say, $99\%$ sure that that's all there is? That is, for which $k$ do we first have



$$
(n=m\mid m,k)=\frac{\frac1{m^k}\binom mm}{\sum_{n=1}^\infty\frac1{n^k}\binom nm}=\frac{\frac1{m^k}}{\sum_{n=1}^\infty\frac1{n^k}\binom nm}\ge0.99\;?
$$



Some trial and error on the borders of Wolfram|Alpha's time limits shows that for $m=n=20$ this requires a full $k=157$ trials. (If the calculation times out, just try again.)
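
The same search runs in ordinary Python if you truncate the sum; the cutoff `omega` is an assumption, but near the crossing point the terms decay so fast that the truncation is harmless:

```python
from math import comb

def certainty(m, k, omega=5_000):
    """Truncated version of (n=m | m,k): the chance the m observed sides are all there is."""
    total = sum(comb(n, m) / n**k for n in range(m, omega + 1))
    return (1 / m**k) / total

m, k = 20, 20
while certainty(m, k) < 0.99:  # linear scan upward; takes a few seconds
    k += 1
print(k)  # 157
```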



Regarding the approximation considered in the comments: If the term for $n=m$ is dominant, we can compare it to the next term, for $n=m+1$:



$$
\frac{(m\mid n=m,k)}{(m\mid n=m+1,k)}
=
\frac{\frac1{m^k}\binom mm}{\frac1{(m+1)^k}\binom{m+1}m}
=
\frac{\left(1+\frac1m\right)^k}{m+1}
=
\frac{\mathrm e^{k\log\left(1+\frac1m\right)}}{m+1}\;.
$$



We can already see that this is going to get very large for large $k$ (no approximation so far); to see more clearly how large, we can use $\log(1+x)\simeq x$ to get



$$
\frac{(m\mid n=m,k)}{(m\mid n=m+1,k)}
\simeq
\frac{\mathrm e^{k/m}}{m+1}\;.
$$



Thus, for $k\gg m$ this ratio grows exponentially. The third term is again smaller by a similar factor, so we can ignore it and all remaining terms and approximate the credibility for $n=m$ as



\begin{eqnarray}
(n=m\mid m,k)
&\simeq&
\frac{(m\mid n=m,k)}{(m\mid n=m,k)+(m\mid n=m+1,k)}
\\
&\simeq&
\frac1{1+(m+1)\mathrm e^{-k/m}}
\\
&\simeq&
1-(m+1)\mathrm e^{-k/m}\;.
\end{eqnarray}



So our doubts whether indeed $n=m$ decay exponentially with $k/m$.
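
As a quick sanity check on the approximation, here's a Python comparison against the exact (truncated-sum) value at $m=20$, $k=157$; the cutoff and the helper name are again assumptions of mine:

```python
from math import comb, exp

def exact_certainty(m, k, omega=2_000):
    """(n=m | m,k) with the infinite sum truncated at omega."""
    total = sum(comb(n, m) / n**k for n in range(m, omega + 1))
    return (1 / m**k) / total

m, k = 20, 157
print(exact_certainty(m, k))      # ~0.990
print(1 - (m + 1) * exp(-k / m))  # ~0.992, the exponential approximation, slightly optimistic
```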






  • Wow. This is an incredible answer. You are much smarter than me. I just want to verify: can we do a plot of (n|m,k) for m = 20 and k = 100,000 in Wolfram? I would expect a very strong peak at 20 if my interpretation of your answer is correct?
    – aardvarkk
    Jul 19 '15 at 16:05











  • @aardvarkk: Well, $k=100,000$ in the exponent is a bit much even for Wolfram|Alpha :-) But here's a plot with $k=200$ (and $m=20$), where the probability is already extremely concentrated at $n=20$. (I didn't include the factor of proportionality.)
    – joriki
    Jul 19 '15 at 16:16











  • Very cool. I will mark this as the answer. Thanks so much for your attention on this. I'm very impressed! I'm also curious, though -- do you think there's some approximation that could be done to deal with very large n, k, and m? Say, on the order of 100,000 as I mentioned before? If we're willing to throw away some accuracy?
    – aardvarkk
    Jul 19 '15 at 17:04






  • @aardvarkk: You're very welcome! I added an approximation to the answer; I hope this is roughly what you were looking for? (I also added a bit in a previous edit about the number of trials required for a certain certainty that $n=m$.)
    – joriki
    Jul 19 '15 at 17:39










