From German Tanks to stochastic fortune cookies

I've ignored plenty of fortune cookies in my time, but after lunch the other day I noticed the "lucky numbers" on the back of each fortune cookie and wondered how they were chosen. Here are my group's lucky numbers:

While the lowest number is presumably 1 (which we observed), it was unclear what the maximum lucky number could be. Given the limited number of data points on hand compared to the sample space, it would be hard to address the distribution question directly. But quantifying uncertainty about the maximum number $N$ is extremely similar to the Bayesian German Tank Problem tackled in the last post.

Starting with small data

Let's set up the problem and see what range we can narrow $N$ down to using only these four fortunes.

In [1]:
from __future__ import print_function

%matplotlib inline
import numpy as np
import pymc3 as pm
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

np.random.seed(12345)

# our data
y = np.array([[35, 26, 56, 10, 32, 52],
              [26, 38, 2, 23, 27, 50],
              [11, 34, 24, 49, 2, 45],
              [14, 1, 55, 19, 32, 45]])

def sample_upper_bound(data, upper=10000, burn_in=10000, n=100000):
    """ set up a model for the upper bound N based on the data and sample """
    model = pm.Model()
    with model:
        # the upper bound
        N = pm.DiscreteUniform("N", lower=0, upper=upper)

        # likelihood - P(D|N): y ~ uniform(1, N)
        y_obs = pm.DiscreteUniform("y_obs", lower=1, upper=N, observed=data)

        # Metropolis-Hastings for discrete variables
        step = pm.Metropolis()

        # we'll use four chains, and parallelize to four cores
        trace = pm.sample(n, step, chain=4, njobs=4, progressbar=False)
        
        # discard a chunk of samples from the beginning for burn-in
        trace = trace[burn_in:]

        return trace
    
def show_output(trace):
    """ display a summary of the sampling results """
    print(pm.summary(trace))
    
    # plot the trace
    pm.traceplot(trace)
    plt.show()

trace = sample_upper_bound(y.ravel())
show_output(trace)

N:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  58.079           2.679            0.016            [56.000, 63.000]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  56.000         56.000         57.000         59.000         65.000

None

According to the sampler, 95% of the posterior density for $N$ lies in $[56,63]$. We have narrowed down the range considerably. We can also "ask" our sample about $P(N > 56 \mid \mathcal D)$, or what the probability is that we haven't yet observed the true maximum in the data:

In [2]:
def probability_n_greater_than_max_observed(trace):
    greater = trace.get_values('N') > y.max()
    return greater.sum() / float(greater.size)

print("{:0.4%}".format(probability_n_greater_than_max_observed(trace)))
66.6681%

A digression

Apparently, lots of people use these fortune cookie numbers to pick their lottery numbers. We know this because in 2005 there were 110 second place Powerball winners for exactly this reason:

The Powerball drawing on March 30, 2005 produced 110 second-prize winners. The total payout to these winners was \$19.4 million, with 89 winners receiving \$100,000 each, while the other 21 winners received \$500,000 each as they were Power Play selections.

[Multi-State Lottery Association] officials initially suspected fraud or a reporting error. However, all 110 winners had played numbers from fortune cookies made by Wonton Food Inc. of Long Island City, New York. The factory had printed the numbers "22, 28, 32, 33, 39, 40" on thousands of fortunes.

https://en.wikipedia.org/wiki/Powerball#Fortune_cookie_payout

The New Yorker expanded on this strange incident, interviewing a VP at Wonton Food, Inc which produced the fortune cookies in question:

Each day, Wonton’s factory churns out four million Golden Bowl-brand cookies, which are sold to several hundred venders, who, in turn, sell them to most of the forty thousand Chinese restaurants across the country. Wonton’s primacy in the industry and, for that matter, in the gambler’s imagination is such that when, in March, five of six lucky numbers printed on a fortune happened to coincide with the winning picks for the Powerball lottery, a hundred and ten people, instead of the usual handful, came forward to claim prizes of around a hundred thousand dollars. Lottery officials suspected a scam until they traced the sequence to a fortune printed with the digits "22-28-32-33-39-40" and Donald Lau’s prediction: "All the preparation you’ve done will finally be paying off."

"We’ve had winners before, but never this many," Lau said the other day, in his East Williamsburg office, which is furnished with stacks of financial reports and “A Dictionary of American Proverbs.” "A computer picks the numbers, not me. If only a computer could also write the fortunes."

http://www.newyorker.com/magazine/2005/06/06/cookie-master

Well the good news for Mr. Lau is that a Markov Chain apparently can write semi-plausible fortunes. But as the largest fortune cookie manufacturer in the United States, the way that Wonton Food picks Lucky Numbers has a small but amusing impact on lottery odds.

Gathering more data: the plot thickens

How can we gather more lucky numbers in order to infer more about how they are chosen?

The problem is, we can't assume that every fortune cookie company uses the same method for choosing numbers, so in order to keep digging it will be necessary to establish some fortune cookie provenance: that is, which company produced these fortunes, and how we can observe more numbers.

We do have one clue here that prevents the fortunes we saw from being completely generic. As you may have noticed, there is a message printed above the lucky numbers we saw:

How about another Fortune?
    SecondFortune.com

This site is run by Wonton Food, Inc, so apparently our fortunes came from them; not surprising since they are the largest supplier of fortune cookies to Chinese restaurants in the United States. With this knowledge in hand, I took a closer look at the company's web presence and found their Twitter feed, where they often post pictures of their fortune cookies just like this one:

If you look closely, you can see a few lucky numbers on the other side of the paper. Same thing here:

Sometimes the numbers are even right out front:

So with some very painstaking social media stalking, we could maybe double the amount of the data. But look at these fortunes for a moment — while they are very inconsistently styled, you may have detected that they all have similar blue corner artifacts. According to a recent "99% Invisible" podcast about the history of fortune cookies, these blue corners are a telltale hallmark of Wonton Food fortune cookies.

Armed with this knowledge, the use of Google image search unlocks quite a bit more data. For example, search terms like "fortune cookie lucky numbers" result in many relevant data points:

Here are 85 more fortunes that look to be made by Wonton Food, Inc:

In [3]:
more = """
 1 15 18 20 30 48  # http://cdn.smosh.com/sites/default/files/legacy.images/smosh-pit/092010/cookie-28.jpg
 2 17 28 34  8 22  # http://4.bp.blogspot.com/_ZgdRrxdmW8c/S4VoEESgGCI/AAAAAAAAAnc/2bJJnwlJ6Xo/s400/bday+fortune+cookie.jpg
 2 50 29 13  9 19  # http://www.blog.themacgroups.com/wp-content/uploads/2014/05/0507141400d.jpg
 2 51 55 25 31 50  # http://farm6.static.flickr.com/5294/5579768060_0954248504.jpg
 4 24 17 30 39 43  # http://3.bp.blogspot.com/-AR5mie8G25M/Ucer_nCzj_I/AAAAAAAAKr4/QOaOBS_LHX4/s1600/photo+5.JPG
 4 25 50 11 15 40  # http://2.bp.blogspot.com/_cWYMJRWCfs8/S7tAmnfCDdI/AAAAAAAABFA/IyQ-qugqZ4s/s1600/030.JPG
 4 43  3 32 37  6  # https://secure.cabot.net/images/emails/CWA/Screenshot/Cookie-back.png
 5  8 16 24 32 46  # http://3.bp.blogspot.com/-rvrZeRdobgw/UwpByfWQeuI/AAAAAAAAAJk/EdAys8exDIQ/s1600/fortune+cookie.JPG
 5 22 34 11  2 17  # http://j-walkblog.com/images/fortune1.jpg
 5 31  2 47 39 18  # http://cdn.acidcow.com/pics/20101209/hilarious_fortune_19.jpg
 7 10 17 23 31 40  # http://2.bp.blogspot.com/-LjaA2JqnMcc/TfBCTgLONVI/AAAAAAAAAAY/wya8xFVHnWw/s1600/021.JPG
 7 22 31 43  5 30  # http://3.bp.blogspot.com/-0ay1RDi8_U8/T85T_nJSN7I/AAAAAAAAAKY/YzdFtLOPqa0/s1600/cookie-12.jpg
 8 11 17 23 25 45  # http://www.dumpaday.com/wp-content/uploads/2011/05/fortune-cookies-fortunes32-640x426.jpg
 9 13 18 24 33 46  # http://anunmarriedwoman.com/wp-content/uploads/2010/10/photo3-e1287272247993-225x300.jpg
 9 26  1 50 23 39  # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg
10 13 15 20 25 28  # http://laynewong.com/wp-content/uploads/2013/09/fortune-cookie.jpg
10 13 18 31 35 36  # https://melissajanda.files.wordpress.com/2013/04/fortune1-e1367003950889.jpg
10 15 19 23 33 34  # http://kinetic.org/fortune.jpg
10 22 38 49  5 47  # http://static.fjcdn.com/pictures/Troll_07b744_1477660.jpg
11  7 25 38  5 17  # http://1.bp.blogspot.com/-VWsMoh42x2w/T4blcYLYYrI/AAAAAAAAaxU/0l7-Ojb98tQ/s1600/565%2312yoda.jpg
12  8 56 37 45 26  # https://scontent.cdninstagram.com/hphotos-xaf1/l/t51.2885-15/s640x640/sh0.08/e35/12530997_1123086017703550_1469065506_n.jpg
12 15 19 26 35 40  # http://www.lifeofanarchitect.com/wp-content/uploads/2014/05/Fortune-Cookie-600x450.jpg
14 15 16 17 26 38  # https://247wallst.files.wordpress.com/2015/09/fortune-cookie.png
14 15 20 23 35 50  # http://www.sportsbet.com.au/blog/wp-content/uploads/melbourne-cup-lucky-numbers.jpg
14 21 16 42 32 11  # http://blog.waygoapp.com/wp-content/uploads/2014/02/dscf3302-1.jpg
15  3 42 22 14 25  # http://data.whicdn.com/images/11510916/large.jpg
16 18 22 56 17 38  # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg
16 47 30 51 44 14  # http://i.imgur.com/FhfYlTd.jpg
17 19  4 50 18 27  # https://s-media-cache-ak0.pinimg.com/236x/f6/95/14/f695140d8c53d1e731598120ca97e483.jpg
17 28 38 37 44  2  # http://ide-a.net/images/fortune006.jpg
18 12 11 33 52 55  # http://i.imgur.com/bun1qOh.jpg?1
18 23 32 34 39 41  # http://anunmarriedwoman.com/wp-content/uploads/2010/10/photo3-e1287272247993-225x300.jpg
19  8 24 22 52 53  # https://pbs.twimg.com/media/CNHlEc_WEAAIAB0.jpg
19 21 23 25 27 29  # https://www.chaossoftware.com/blog/wp-content/uploads/2013/06/fortunecookie-300x101.jpg
20 11 31 47 26  8  # https://s-media-cache-ak0.pinimg.com/736x/39/38/45/39384572487bf20c088439b9bed0e153.jpg
20 13  1  9 10 16  # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg
20 37  5 17  3 18  # https://keenanevans.files.wordpress.com/2010/01/fortune-cookies1.jpg
21 17 16 12 27 30  # http://i.imgur.com/B9kD7Ff.png
21 44 28 33 14  8  # http://www.nsmbl.nl/wp-content/uploads/2011/11/345157670_8f57c10e6a_z1-610x250.jpg
22 38 42 55 33 18  # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg
23 22 18 36 38 47  # http://www.dandelionchocolate.com/wp-content/uploads/2011/08/fortune-500x377.png
24 35 36 17  8 39  # http://www.tiptoptens.com/wp-content/uploads/2011/02/funny-fortune-cookies-8.jpg
26 36 46 17 48  6  # http://www.tootimidandsqueamish.com/wp-content/uploads/2012/03/Learn-Chinese-1024x629.jpg
27  5 16 34 14  8  # https://www.playhugelottos.com/uploads/assets//Pictures/fortune_cookie_lottery_numbers.JPG
28 13 19 56 10 23  # http://1.bp.blogspot.com/_xoUjXFG3_sI/SwnCqR206SI/AAAAAAAAACw/aGF-6fri3TU/s1600/cheese+003.jpg
28 24 19  1 36  4  # http://farm4.static.flickr.com/3295/2996524954_9bc76a8cc8.jpg
28 29 16 52 38 14  # https://s-media-cache-ak0.pinimg.com/236x/64/83/00/64830093931ea68f7557b1540854c96d.jpg
28 33 46  5 15  2  # http://www.asian-central.com/stuffasianpeoplelike/wp-content/uploads/HLIC/48816c7764a842c9bebf9ce302a6812a.gif
28 54 22 52 47  9  # http://cdn3.geckoandfly.com/wp-content/uploads/2014/08/Blog-quote-fortune-cookie.jpg
29 27 46 54 28  2  # http://www.blog.themacgroups.com/wp-content/uploads/2014/05/0507141400d.jpg
30  1 46 14 44 26  # http://1.bp.blogspot.com/_1ChRCAfQNM4/SFZzGqNmhSI/AAAAAAAAAQI/R6sf9fe3qac/s400/Fortune+001.jpg
30 25 17 38 46  2  # http://i.imgur.com/0NQDGRJ.jpg
32  1 19 53 13 22  # http://static.fjcdn.com/pictures/When_8a02ca_1502019.jpg
32 31 46 27 38  2  # https://enlustered.files.wordpress.com/2014/09/img_1093.jpg
32 41 12 11 54 45  # https://s-media-cache-ak0.pinimg.com/236x/da/4d/72/da4d7233fc19791f245d96f0e90ccd91.jpg
32 45 15 17 52 11  # https://s-media-cache-ak0.pinimg.com/236x/75/cf/08/75cf082199cb1d61632173bff59d05a7.jpg
33 40  7 25  4 28  # http://www.barnorama.com/wp-content/galleries/02/cookie/10.jpg
34 11 27 42  5 37  # http://www.andrewlove.org/blog/blogpics/fortune.jpg
34 17 22 28 36  5  # http://images.baklol.com/2_jpeg2d02ed04e752e850e7bab0895197b888.jpeg
34 41 52 35 42 51  # https://s-media-cache-ak0.pinimg.com/236x/5d/b6/bb/5db6bb1a37b590e1b385ab8a09519864.jpg
34 56  8 17  7  1  # http://img.ifcdn.com/images/89d23a5ceb47420e268615daf45cdb817ad7a5d3d342f874ee54fff2524e2c34_1.jpg
35 32 39 21 38 48  # https://s-media-cache-ak0.pinimg.com/236x/33/0b/90/330b900102b6a82879f1bdc8631f5e73.jpg
37  7 42 35 22 49  # http://i2.cdn.turner.com/cnnnext/dam/assets/140306221521-dnt-womans-lucky-fortune-cookie-wins-millions-00012928-story-top.jpg
37 18 10 17 26 47  # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg
37 50 30 33 53 52  # http://i.imgur.com/FhfYlTd.jpg
39 24 15 44 21 43  # http://www.barnorama.com/wp-content/galleries/02/cookie/15.jpg
39 53 42 48 31 55  # https://s-media-cache-ak0.pinimg.com/236x/6c/4e/24/6c4e24eb8b53f3a6f4e7def63c2d34e1.jpg
41 20 18  9 22 17  # https://s-media-cache-ak0.pinimg.com/600x315/16/1a/62/161a627822d288cbfd38403c7b56f83c.jpg
41 25 27 52 11 24  # http://3.bp.blogspot.com/-SFfBdkDLk60/Tg1MB3mt9BI/AAAAAAAAAD4/bbvtCwFDPTo/s1600/photo%255B1%255D.PNG
41 35 15 16 29  9  # http://25.media.tumblr.com/tumblr_le6571kvjv1qb87jao1_500.jpg
41 49 46 16 51 12  # http://i.imgur.com/i353DZL.jpg
42  6 47 26 24 46  # http://i.imgur.com/bun1qOh.jpg?1
43 54 38 42 20 18  # https://s-media-cache-ak0.pinimg.com/236x/d4/e9/a5/d4e9a54a4c3c0da773fa72c56d7ef92c.jpg
43 55 26 45  7 10  # https://iamemilyanne.files.wordpress.com/2012/09/fortune-cookie_new-house.jpg
44  1 24 36 47 32  # http://3.bp.blogspot.com/-AR5mie8G25M/Ucer_nCzj_I/AAAAAAAAKr4/QOaOBS_LHX4/s1600/photo+5.JPG
45 21 33 17  3 20  # https://creativejamie.files.wordpress.com/2010/07/you-will-receive-a-fortune-cookie.jpg
46  2 50 10 36 53  # https://c1.staticflickr.com/9/8174/8035925042_13d272f86d_z.jpg
46 19 28  4 27  3  # https://twitter.com/wontonfood/status/737657278740987904
47 37  4 25 44 15  # http://blog.waygoapp.com/wp-content/uploads/2014/05/3183015279_9459e96875_z.jpg
48  3 28 43 24 16  # https://pbs.twimg.com/media/BEn61IgCEAE7tkp.jpg
49 32 28 38  7 43  # http://static.fjcdn.com/pictures/When_8a02ca_1502019.jpg
50 26 14 27 36 30  # http://img.ifcdn.com/images/89d23a5ceb47420e268615daf45cdb817ad7a5d3d342f874ee54fff2524e2c34_1.jpg
50 47 22  1 35 20  # https://www.tomslatin.com/wp-content/uploads/Learn-Chinese-Dry-Cleaning.jpg
53 37  1 17 32 42  # https://webb63.files.wordpress.com/2013/02/fortune-learn-chinese.jpg
55 26 25 51 11 27  # https://ashleyburnette.files.wordpress.com/2011/09/011.jpg"""

# split this text block into numbers
more_numbers = np.array([[int(n) for n in line.split('#')[0].strip().split()]
                         for line in more.split('\n') if line])

# combine the numbers together
y_all = np.vstack([y, more_numbers])
y_all.shape
Out[3]:
(89, 6)

Sampling again with more data

Now that we have assembled all of this extra data, we expect that the variance will be reduced in our posterior distribution over the lucky number upper bound $N$. Let's run the sampler again.

In [4]:
trace = sample_upper_bound(y_all.ravel())
show_output(trace)

N:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  56.000           0.010            0.000            [56.000, 56.000]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  56.000         56.000         56.000         56.000         56.000

None

As we can see, the posterior distribution over $N$ has tightened up considerably. We are now quite sure that $N=56$, implying that all Wonton Food lucky numbers fall in the interval $[1,56]$. Looking at the sampling trace, we can see that virtually any other proposed value for $N$ was rejected except for a handful of times the value 57 was accepted.

What I find really interesting is how close we got using only the first four fortune cookies. In fact, here is how the uncertainty diminishes as we open more and more fortune cookies:

In [5]:
%%time
from tqdm import tqdm

# shuffle the fortunes to random order
shuffled = y_all.copy()
np.random.shuffle(shuffled)

# create an empty matrix to hold all the results
m, n = len(y_all), (50000 - 10000)  * 4  #  n_samples minus burn-in times number of chains
results = np.zeros(shape=(m, n), dtype=np.int)

# now run the sampler with one additional fortune in each iteration and keep track of results
for i in tqdm(range(m)):
    data_so_far = shuffled[:i+1]
    trace = sample_upper_bound(data_so_far.ravel(), n=50000, burn_in=10000)
    results[i, :] = trace.get_values('N')
100%|██████████| 89/89 [12:22<00:00,  8.81s/it]
CPU times: user 2min 40s, sys: 11.2 s, total: 2min 51s
Wall time: 12min 22s



In [9]:
from scipy import stats

xs = np.arange(1, m + 1)
highs = np.percentile(results, 95.0, axis=1)
mode = stats.mode(results, axis=1).mode.ravel()

fig, ax = plt.subplots(figsize=(12, 4))

plt.fill_between(xs, mode, highs.ravel(), alpha=0.5, color="gray", label='95% highest posterior density')
plt.plot(xs, mode, label="Maximum a posteriori estimate")

plt.ylim(45, 80)
plt.xlabel('fortune seen')
plt.ylabel('estimated $N$')
plt.legend()
plt.show()

Unlike the probabilistic higher bound, the lower bound on $N$ is just based on the random order of the fortunes, in that it takes on the value of the highest number we happen to have seen so far. But as we can see, it didn't take very many fortunes before we had extremely high certainty that $N=56$.

A second digression

At the time of writing, if you visit SecondFortune.com you will be greeted with a set of lucky numbers like the following:

As it turns out, they have a snippet of Javascript here choosing numbers from 1 to 99:

for (var i = 0; i < 6; i++) {
    var luckyNumber = Math.floor(Math.random() * 99) + 1;
    $('.lucky-numbers > span').eq(i).html(luckyNumber);
};

Come on guys, get it together.

The distribution question

Now that we feel confident about the sample space, we might wonder about the number selection methodology. One natural question is whether all numbers have an equal probability of being chosen. Here are overall frequency counts for all of the numbers:

In [10]:
fig, ax = plt.subplots(figsize=(12, 4))
counts, edges = np.histogram(y_all.ravel(), bins=np.arange(1, 58))
plt.bar(edges[:-1], counts)
plt.xlim(1, 56)
plt.xlabel('value')
plt.ylabel('count')
plt.show()

We can also see how the counts break down by placement within the ordered lucky number vector:

In [11]:
fig, ax = plt.subplots(figsize=(12, 4))
number_by_position = pd.melt(pd.DataFrame(y_all))
number_by_position = number_by_position.rename(columns={'variable': 'position'})
sns.swarmplot(x='position', y='value', data=number_by_position)
plt.show()

It's hard to tell just by inspection whether certain numbers are favored or not. For example, 17 looks overrepresented, but that could have happened randomly. One thing that is more clear now is that these numbers are being sampled without replacement:

In [12]:
def count_occurrences(fortune):
    values, counts = np.unique(fortune, return_counts=True)
    return counts

def all_unique(fortune):
    counts = count_occurrences(fortune)
    return (counts == 1).all()

unique_within_fortunes = all(all_unique(fortune) for fortune in y_all)
print('Numbers are unique within every fortune:', unique_within_fortunes)
Numbers are unique within every fortune: True

"Future work" (not really though)

Although everyone will rightfully be running out patience with this problem by now, it is interesting to think about how we might model the selection process to learn more about the underlying probabilities (i.e. whether there is bias in how numbers are chosen, and exactly what that bias is).

For example, one approach would be to ignore some complexity by glossing over the sampling without replacement in each individual fortune, and to treat the numbers as 534 rolls of a possibly unbalanced 56-sided die. The distribution involved is the multinomial, so we might try to fit a Dirichlet-multinomial model and draw inferences about the different probabilities by the resulting Dirichlet parameter.

For sampling without replacement, Wikipedia gives the following "urn interpretation" of the multivariate hypergeometric distribution:

The model of an urn with black and white marbles can be extended to the case where there are more than two colors of marbles. If there are $K_i$ marbles of color $i$ in the urn and you take $n$ marbles at random without replacement, then the number of marbles of each color in the sample ($k_1$,$k_2$,...,$k_c$) has the multivariate hypergeometric distribution. This has the same relationship to the multinomial distribution that the hypergeometric distribution has to the binomial distribution—the multinomial distribution is the "with-replacement" distribution and the multivariate hypergeometric is the "without-replacement" distribution.

That seems like a promising place to start, though we will likely end up looking at a version that models bias if we want to allow for the possibility that probabilities are unequal while still requiring only one of each number. It even looks like there is an R library called BiasedUrn which does exactly that.

If anyone tries it out, I'd be interested to hear how it goes.

Any comments or suggestions? Let me know.