I thought it would be interesting to do a post on what we can learn about a business with data analysis using solely open source intel — in other words, bringing together data sources which are publicly available but not necessarily meant to be analyzed by outsiders.
A few days ago, I was trolling around the Dunkin' Donuts store locator. By examining how store location data is requested and transmitted back to the user, I was able to scrape all of it together into one comprehensive table. The result was a data set with geographic and operational information on every single Dunkin' Donuts in the continental United States. In this post, we'll do some exploratory data analysis and visualization with this data set. We'll also examine some aspects of Dunkin's business by mashing the location data up with census and demographic data, and also data on franchisees munged from public filings.
The structure of the post will be to pose a series of hypothetical questions we might want to answer, and then try to either answer them with the data we already have or go fishing for more open source data to figure them out.
Getting the data
I noticed that each search result has some icons representing things like whether the store accepts loyalty cards, whether it sells K-cups, what co-branded company the store is located with (e.g. Hess gas station), and so on. Briefly wondering whether the data used to render these templates was easily available, I opened the Chrome developer tools to take a look at what was going on behind the scenes.
It turns out that Dunkin' Donuts is using MapQuest's enterprise platform to serve up data for their store locator in JSON format. Because all sorts of other data is stored and served in the same requests, the MapQuest API is presumably being used for other internal applications. While small chunks of data are transmitted to the browser of anybody who searches for store locations, we can actually assemble all of it just by making a few similar GET requests and tweaking the parameters for location, radius, and max results.
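As a rough illustration, the collection loop might look something like the sketch below. The endpoint URL, parameter names, and response layout here are placeholders reconstructed from the browser's network tab, not the exact MapQuest API contract, so treat them as assumptions.
import pandas as pd
import requests
# placeholder endpoint and parameter names -- check the actual request in devtools
BASE_URL = 'https://www.mapquestapi.com/search/v2/radius'
seeds = [(42.36, -71.06), (40.71, -74.01), (41.88, -87.63)]  # a few points covering the country
records = []
for lat, lng in seeds:
    resp = requests.get(BASE_URL, params={
        'origin': '{},{}'.format(lat, lng),
        'radius': 500,                   # search radius in miles
        'maxMatches': 4000,              # ask for as many hits as the API will return
        'key': 'KEY_FROM_LOCATOR_PAGE',  # the store locator page embeds its own key
    })
    for hit in resp.json().get('searchResults', []):
        records.append(hit.get('fields', {}))  # one dict of store attributes per hit
# assemble everything, dropping duplicates where the search radii overlap
fulldf = pd.DataFrame(records).drop_duplicates(subset='RecordId')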
We'll save the data in a csv file for analysis. With that done, we can set up our analysis environment and dig in.
# import the tools we're going to work with
from __future__ import division
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# set some options to make things look nicer
import seaborn as sns
sns.set_color_palette("Set1")
set1 = sns.color_palette("Set1")
set2 = sns.color_palette("Set2")
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 20)
pd.set_option('show_dimensions', False)
# read in tables
fulldf = pd.read_csv('data/dd.csv')
hours = pd.read_csv('data/hours.csv')
# narrow the main df down to the columns we care about
wanted_cols = ['RecordId', 'Lat', 'Lng', 'address', 'city', 'state',
'postal', 'PhoneNumber', 'co_brander_cd']
df = fulldf[wanted_cols]
So here is what we have available:
df.head()
# store hours, can be joined with our main df on key `I`
hours.head()
# other fields that are in fulldf but we're not working with at the moment
print '\n'.join([c for c in fulldf.columns if c not in df.columns])
Where exactly are Dunkin' Donuts located?
Living in Massachusetts, I am likely to be within sight of a Dunkin' Donuts at all times. But, the world is an unfair place, and not everyone is so fortunate.
Luckily, our data set came with specific lat/long coordinates, which makes it very easy to do some exploratory data analysis. We'll be using the Basemap toolkit for geographic visualizations. Because we'll be doing multiple visualizations with consistent styling, let's avoid repetition by creating a utility function to set up our maps for us:
def createmap(limits=[22, 50, -125, -67], figsize=(12, 12)):
""" create a Basemap object for plotting """
# get user defined latitude and longitude limits
llcrnrlat, urcrnrlat, llcrnrlon, urcrnrlon = limits
# create the Basemap object
fig, ax = plt.subplots(figsize=figsize)
m = Basemap(projection='merc',
llcrnrlat=llcrnrlat,
urcrnrlat=urcrnrlat,
llcrnrlon=llcrnrlon,
urcrnrlon=urcrnrlon,
resolution='l')
# set some aesthetic options
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')
m.ax = ax
return m, fig, ax
With that done, let's just plot the whole data set and see what we're working with:
m, fig, ax = createmap()
x, y = m(np.array(df.Lng), np.array(df.Lat))
m.scatter(x, y, marker='o', c='#ff6200', s=8,
zorder=2, alpha=0.5, linewidth=0)
plt.show()
What are Dunkin' Donuts' administrative divisions and how many stores are in each?
In the United States, franchising companies are required by the FTC to issue franchise disclosure documents (FDDs) to their potential franchisees. I was able to find Dunkin' Brands' 2011 filing here, from which we can gain some interesting insights into Dunkin's operations and business model.
For instance, here are their regional divisions:
We can add these to the dataframe:
regions = pd.read_csv('data/regions.csv', usecols=range(1, 4))
regions.head()
Now that we have this data in a workable format, let's ask a question we can now answer that we couldn't before. Personally, I assumed that the Northeast states — a traditional Dunkin' stronghold — would easily have the most Dunkin' stores. That turns out not to be the case:
df.merge(regions[['state', 'region']]).groupby('region').RecordId\
.nunique().order(ascending=False)
Instead, the Mid-Atlantic states (DE, NJ, NY, PA) top the list.
What state has the highest number of Dunkin' locations per capita?
Using US Census data (acquired here), we can easily answer that question.
# read the US gov't census data, keeping only the columns we care about
population = pd.read_csv('data/population.csv', usecols=(4, 5))
# get rid of the ugly, capital letter names that came in this csv
population.columns = ['state_name', 'pop2013']
# narrow down to states that actually contain store locations
mask = population.state_name.isin(regions.state_name)
population = population[mask]
# add the state abbreviations and reindex
population = population.merge(regions[['state', 'state_name']], on='state_name')
population = population.set_index('state')
# add locations by state and remove states with none
population['locations'] = df.groupby('state').RecordId.nunique()
population = population.dropna()
# figure out the quantity of interest and display the top 10
population['people_per_dunkin'] = population.pop2013 // population.locations
population.sort('people_per_dunkin').head(10)
That means that in my home state of MA, we take the prize with a staggering one Dunkin' Donuts for every 6,633 people! Let's take a much closer look at the population and demographics affecting how Dunkins are located within the most densely Dunkin'-ed state. To do this, I'll use a data set of population and demographic information on every MA zip code that I scraped from biggestuscities.com which itself repackages US census data.
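For what it's worth, grabbing this kind of table is usually painless with pandas. The URL below is just a stand-in, and the real page's table index and column headers would need checking against the site itself.
# illustrative only -- the actual biggestuscities.com URL and table layout will differ
url = 'http://www.biggestuscities.com/demographics/massachusetts'
tables = pd.read_html(url)   # parses every <table> element on the page into a DataFrame
ma_raw = tables[0]           # pick whichever table holds the zip-level figures
ma_raw.to_csv('data/ma.csv', index=False)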
What factors influence the placement of store locations?
It seems intuitive that population will be the dominant factor in how many stores are situated in each zip code. But we could easily imagine that other demographic factors are also significant. For instance, we might wonder whether stores are more likely to be in wealthier or working class neighborhoods. Let's take a look:
# read in the MA demographics data
ma = pd.read_csv('data/ma.csv', usecols=range(5)).set_index('postal')
# count up dunkins per zip code
mask = df.state == 'MA'
ma_dunkins_by_zip = df[mask].groupby('postal').RecordId.count()
ma_dunkins_by_zip.name = 'dunkins'
ma = ma.join(ma_dunkins_by_zip)
# sort and display the top 5
ma.sort('dunkins', ascending=False).head()
One great tool for looking at multivariate relationships is a scatterplot matrix. Not only does it show how every variable is associated with every other, but the diagonal also shows each variable's marginal distribution.
from pandas.tools.plotting import scatter_matrix
axeslist = scatter_matrix(ma, alpha=0.8, figsize=(12, 12), c=set1[1], diagonal="kde")
for ax in axeslist.flatten():
ax.grid(False)
As expected, the dunkins variable's tightest and most obvious correlation involves population. The more people (or households), the more restaurant locations.
One thing this plot demonstrates very clearly is that population and the number of households are highly correlated. No surprise there, but it reminds us that in any model we might build, we would probably either want to (a) choose one of the two and avoid using the other, or (b) reduce the dimensionality with a method like PCA. The same goes for income per capita and median home value.
The term for this phenomenon is multicollinearity, and it's good practice to be careful about this before jumping into modeling. Let's look at the calculated correlations:
ma.corr()
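As an aside, here is a minimal sketch of what option (b) above might look like, collapsing the two redundant size measures into a single component. I'm assuming the household count column in ma.csv is called households, so swap in the real header.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# `households` is an assumed column name -- substitute whatever ma.csv actually uses
size_cols = ['pop2012', 'households']
scaled = StandardScaler().fit_transform(ma[size_cols].values)
# one principal component should soak up nearly all of the shared variance
pca = PCA(n_components=1)
ma['size_pc1'] = pca.fit_transform(scaled)[:, 0]
print(pca.explained_variance_ratio_)  # close to 1.0 if the two columns are largely redundant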
Can we model the relationship between population and store locations?
There doesn't appear to be a strong linear correlation between the number of Dunkins and any variable except population (or number of households). Simpler models are not always better, and we might be able to squeeze some predictive power out of the other variables given a bit more legwork, but it's often helpful to start with the most basic, "parsimonious" model (relevant Andrew Gelman post here).
Let's build a very simple regression model using pop2012 as the predictor variable for the number of dunkins in a given zip code. We'll use a package called statsmodels, which will look very familiar to users of the R statistical package.
import statsmodels.formula.api as smf
# copy the dataframe so we can manipulate the `pop2012` column -- in
# order not to use such huge numbers (10k-60k) for population compared
# to much smaller numbers for dunkins (1-15), we'll divide population
# by 10,000 so it's roughly the same order of magnitude as the outcome variable
regression_df = ma[['pop2012', 'dunkins']].copy()
regression_df.pop2012 = regression_df.pop2012/10000.
# fit the model
results = smf.ols('dunkins ~ pop2012', data=regression_df).fit()
# print out results
results.summary()
So the equation is something like dunkins = 1.3 * (population/10,000) + 0.6.
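As a quick sanity check on that equation, we can hand the fitted results object a hypothetical zip code. A population of 30,000 people corresponds to pop2012/10,000 = 3, so we'd expect roughly 1.3 * 3 + 0.6, or about 4-5 locations (the exact coefficients in your run may differ slightly).
# predict for a hypothetical zip code of ~30,000 people (remember the /10,000 rescaling)
print(results.predict(pd.DataFrame({'pop2012': [3.0]})))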
Another way to do linear regression is using scikit-learn, the extremely popular machine learning library for Python. Of course, this will be somewhat overkill considering that we aren't even breaking the data up into train and test sets, but it may be useful to see this same modeling task carried out both ways.
from sklearn.linear_model import LinearRegression
# split the data into features and labels
X = ma.pop2012.values.reshape(-1, 1)/10000.
y = ma.dunkins.values
# fit the model
model = LinearRegression()
model.fit(X, y)
# print out the regression coefficients
coeffs = (model.coef_[0], model.intercept_)
print 'dunkins = %0.04f * (population/10,000) + %0.04f' % coeffs
# plot the results
fig, ax = plt.subplots(figsize=(10, 6))
# plot observed data with a small random y-axis jitter to avoid dots stacking up
jitter = np.random.rand(y.size)/2
plt.scatter(X, y + jitter, c=set1[1], label='observed')
# plot fitted prediction line
plt.plot(X, model.predict(X), c=set1[0], alpha=0.5, label='fit')
# set some aesthetic options
ax.set_xlim(0, X.max())
ax.set_ylim(0, y.max())
ax.set_xlabel('population (tens of thousands)')
ax.set_ylabel('dunkin locations')
plt.legend()
plt.show()
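If we did want to take evaluation a bit more seriously, holding out a test set is the usual first step. Here's a minimal sketch of what that might look like; note that older scikit-learn releases keep train_test_split in sklearn.cross_validation rather than sklearn.model_selection.
from sklearn.model_selection import train_test_split
# hold out a quarter of the zip codes and score the fit on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
holdout = LinearRegression().fit(X_train, y_train)
print(holdout.score(X_test, y_test))  # R^2 on the held-out zip codes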
So what have we learned? Well, assuming that franchisees have opened up store locations based on some research and homework, and that stores which don't have enough demand eventually end up closing, it's probably reasonable to assume that this population model roughly represents the number of viable Dunkin' locations the market will bear at a given population level.
Is this usable? Well, let's say I am thinking about opening a DD franchise in Massachusetts. Naturally, I want to figure out the most profitable place to locate my brand new franchise. I certainly wouldn't want to plant myself somewhere that is already highly saturated unless I had some specific reason to believe I could catch some previously unmet demand. What I really want to know is this: which place might have enough population to support more Dunkin' locations than they currently have?
Let's take a look at the data:
# create a quick function to get the place name from the zip code
get_city_name = lambda x: df[df.postal == x].reset_index().ix[0, 'city']
# add the columns we're interested in
ma['predicted'] = model.predict(ma.pop2012.values.reshape(-1, 1)/10000.)
ma['gap'] = ma.predicted - ma.dunkins
ma['city'] = map(get_city_name, ma.index.values)
# show the top 10 potentially underserved zip codes
wanted_cols = ['pop2012', 'income_per_cap', 'dunkins', 'predicted', 'gap', 'city']
ma[wanted_cols].sort('gap', ascending=False).head(10)
Undoubtedly, many of these "Dunkin' gaps" are due to zoning and other local factors we could uncover once we start looking into specifics. However, now we have some candidate zip codes to start doing market research on instead of just guessing.
What co-brands does Dunkin' Donuts work with?
One interesting feature of this data set is the chance to look at the other companies that Dunkin' Brands works with. You may have seen that Dunkin' Donuts locations are sometimes co-located with another business like a Stop & Shop grocery store (in the Northeast) or a Hess gas station.
By simply eyeballing the dataframe grouped by co_brander_cd (and some judicious googling), I was able to put together a partial list of those co-brands:
cobrands = {
'ARA': 'Aramark',
'BCG': 'Boston Culinary Group',
'BRD': 'Baskin Robbins',
'BDT': 'Baskin Robbins',
'BP': 'BP',
'CIT': 'Citgo',
'CMB': 'Cumberland Farms',
'DEL': 'Delta',
'DNC': 'Delaware North Companies',
'EXX': 'Exxon',
'GLF': 'Gulf',
'HES': 'Hess',
'HMD': 'Home Depot',
'KUK': 'Lukoil',
'MOB': 'Mobil',
'OTG': 'OTG Management',
'PET': 'Petro',
'PM': 'Pathmark',
'SHL': 'Shell',
'SHW': 'Shaws',
'SOD': 'Sodexho',
'SR': 'Shop Rite',
'SS': 'Stop & Shop',
'SSP': 'SSP America',
'SUN': 'Sunoco',
'VAL': 'Valero',
'WH': 'WilcoHess',
'WLM': 'Walmart'
}
# add co-brands to our working dataframe
cobrands = pd.Series(cobrands, name='cobrand_name')
df = df.join(cobrands, on='co_brander_cd')
Let's see what other businesses have the most co-locations with Dunkin' Donuts:
cb = df.groupby('cobrand_name').cobrand_name\
.count().order(ascending=False).head(10)
cb
Any seasoned Dunkin' Donuts observer won't be surprised by the top 3 here — Hess and Dunkin' Brands have a longstanding strategic partnership, and the Baskin Robbins brand is actually also owned by Dunkin' Brands Group, Inc.
Where are these co-brandings prevalent?
# create the Basemap object
m, fig, ax = createmap(limits=[23, 47, -95, -65])
# get a list of the top 5 co-brands
top_5 = cb.head(5).index.values
# iterate through the top 5 and plot their locations
for i, name in enumerate(top_5):
mask = (df.cobrand_name == name)
lon, lat = np.array(df[mask].Lng), np.array(df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='o', c=set1[i], s=15, zorder=i+3, linewidth=0,
label=name)
# plot all the rest
mask = ~(df.cobrand_name.isin(top_5))
lon, lat = np.array(df[mask].Lng), np.array(df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='x', c='lightgray', label='other',
s=15, zorder=2, alpha=0.5, linewidth=1)
plt.legend(loc='lower right', fontsize=18)
plt.show()
What's the distribution of small or large franchisees?
Also included in the Franchise Disclosure Document is a list of every single current franchisee. There are pages and pages of the pdf that look just like this:
Now we may be tempted to just throw this out as unusable information — it's in the dreaded pdf format, after all — but at least the lines themselves follow a fairly standard, comma-separated layout. Using the command line utility pdftotext, I was able to dump the text from the desired pages to a text file.
wget -O DD_FDD_11.pdf http://www.bluemaumau.org/sites/default/files/DD_FDD%20UFOC%2003-26-10%20AMND%2011-23-10.pdf
pdftotext -f 126 -l 224 DD_FDD_11.pdf franchisees.txt
pdftotext -f 594 -l 626 DD_FDD_11.pdf franchisees2.txt
cat franchisees.txt franchisees2.txt > franchisees_all.txt
Using some (admittedly annoying and convoluted) regular expressions and a little bit of manual cleanup, I was able to get this text into csv format, ready to import.
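The exact pattern depends on how the FDD tables come out of pdftotext, so the regex below is purely a hypothetical sketch of the approach rather than the one actually used; the field names mirror the columns we end up with in franchisees.csv.
import csv
import re
# hypothetical line format -- the real dump needs its own pattern plus manual cleanup
line_re = re.compile(r'(?P<RecordId>\d+),\s*(?P<franchisee>[^,]+),\s*(?P<address>.+),\s*(?P<PhoneNumber>\(\d{3}\)\s*\d{3}-\d{4})')
with open('franchisees_all.txt') as fin, open('data/franchisees.csv', 'w') as fout:
    writer = csv.writer(fout)
    writer.writerow(['RecordId', 'franchisee', 'address', 'PhoneNumber'])
    for line in fin:
        match = line_re.search(line)
        if match:  # skip page headers, footers, and other pdftotext noise
            writer.writerow([match.group('RecordId'),
                             match.group('franchisee').strip(),
                             match.group('address').strip(),
                             match.group('PhoneNumber')])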
Note: looking at the address and phone number fields, you might worry (as I did) that this data could be considered personally identifiable information. Even though it is publicly available on the internet, we always want to avoid "doxing" private citizens for no reason. Don't worry though — this information pertains to the Dunkin' Donuts location, not the person.
franchisees = pd.read_csv('data/franchisees.csv')
franchisees
Keep in mind that this table is a couple years old (I found the 2011 version online and didn't feel like paying $99 for the 2013 FDD, and part of the fun of this hunt is using free, open source intel). Even so, it's pretty interesting.
franchisees.groupby('franchisee').franchisee\
.count().order(ascending=False).head(12)
Doing some quick googling, we find that the top two franchisees (Richard Lawlor and Stephen Williams) are executives for Hess and WilcoHess, respectively, two of Dunkin's biggest co-brands.
Now that we have this information, the next logical step is to connect it with our store location information. Luckily, Dunkin' Donuts has made this very easy for us by using the same foreign key RecordId in their FDD filing as in their MapQuest API!
# make a smaller df for joining with only the fields we care about
wanted_cols = ['RecordId', 'Lat', 'Lng', 'state']
franch_df = franchisees.merge(df[wanted_cols], on='RecordId')
franch_df
By default, merge operates a lot like a SQL JOIN. Its how keyword argument defaults to 'inner', which gives SQL INNER JOIN-style behavior: any rows that don't find a match on the join key (in this case RecordIds in df that are different or missing) are dropped from the result.
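To make the contrast concrete, here's a quick comparison against a left join, which instead keeps every franchisee row and just leaves the store columns as NaN wherever nothing matched:
# a left join keeps every franchisee, with NaN store fields where no RecordId matched
franch_left = franchisees.merge(df[wanted_cols], on='RecordId', how='left')
print(len(franch_left) - len(franch_df))  # roughly the number of unmatched franchisee rows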
It looks like we only lost around 700 records to missing or unmatched RecordId fields. Much of this is probably because the store data in df is much more up to date than the franchisee data from 2011. Let's see if we can do a little better by trying to match some of the rest on phone numbers:
# pick the columns we want to use for joining
wanted_cols = ['PhoneNumber', 'Lat', 'Lng', 'state']
# figure out which rows we failed to match last time
not_matched = df[~df.RecordId.isin(franch_df.RecordId.values)]
# try to merge on phone number this time around
new_matches = franchisees.merge(not_matched[wanted_cols], on='PhoneNumber')
# add our newly matched stores to the working dataframe
franch_df = pd.concat((franch_df, new_matches))
Not bad — we were able to add over 100 records back in just by trying a different foreign key.
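It's worth a quick sanity check that the phone-number pass didn't sneak in any store we had already matched in the first pass:
# each store should appear at most once after combining the two matching passes
print(franch_df.RecordId.duplicated().sum())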
Where are these large franchisees' holdings geographically clustered in MA?
Maybe we want to see who the largest franchisees are in MA, one of DD's most heavily concentrated states:
m, fig, ax = createmap(limits=[41.4, 43.1, -73.6, -69.4], figsize=(13, 8))
# figure out the top 10 franchisees in MA
mask = (franch_df.state == 'MA')
ma_top = franch_df[mask].groupby('franchisee').franchisee\
.count().order(ascending=False)
# grab just the top 10 for plotting purposes
top_10 = ma_top.head(10).index.values
# get together a long-ish list of different colors to plot with
colors = set2[:6] + set1
# iterate through the top 10 and plot their locations
for i, name in enumerate(top_10):
mask = (franch_df.state == 'MA') & (franch_df.franchisee == name)
lon, lat = np.array(franch_df[mask].Lng), np.array(franch_df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='o', c=colors[i],
s=15, zorder=i+3, linewidth=0,
label='%s (%d)' % (name, ma_top[name]))
# plot all the rest
mask = (franch_df.state == 'MA') & ~(franch_df.franchisee.isin(top_10))
lon, lat = np.array(franch_df[mask].Lng), np.array(franch_df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='x', c='lightgray', label='other',
s=15, zorder=2, alpha=0.5, linewidth=1)
plt.legend(loc='upper right')
plt.show()
So here's an interesting insight: most of the largest franchisees in Massachusetts built their Dunkin' empires outside of the city, even though Boston proper has the highest concentration of store locations. While one of the top 10 (Mr. Goddess, as a director of Watermark Donut Co.) has most of his locations clustered in the Boston metro area, the others seem to focus their efforts mainly on more suburban or rural areas.
Why is this? I have basically no domain knowledge about Dunkin' Donuts or the hospitality industry, but I would surmise that it is much more capital intensive to invest in downtown real estate, and that there is quite a bit more competition in such a densely populated area.
What about the top franchisees for the whole country?
Here are the top franchisees in the country. (We will leave out the two vice presidents of Hess and WilcoHess. Since they operate so many stores, the scatter plot would obscure other franchisees — see above in the plot of co-brands for their holdings.)
m, fig, ax = createmap(limits=[24, 47, -92, -65])
top_10 = franch_df.groupby('franchisee').franchisee\
.count().order(ascending=False).head(18).tail(16) # drop the top 2
# put together two color sets so we have enough for all the franchisees
colors = sns.color_palette('Set1', 8) + sns.color_palette('jet', 8)
# iterate through the top franchisees and plot their locations
for i, name in enumerate(top_10.index.values):
mask = (franch_df.franchisee == name)
lon, lat = np.array(franch_df[mask].Lng), np.array(franch_df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='o', c=colors[i],
s=15, zorder=i+3, linewidth=0,
label='%s (%d)' % (name, top_10[name]))
# plot all the rest
mask = ~franch_df.franchisee.isin(top_10.index)
lon, lat = np.array(franch_df[mask].Lng), np.array(franch_df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='x', c='lightgray', label='other',
s=15, zorder=2, alpha=0.5, linewidth=1)
plt.legend(loc='lower right', fontsize=12)
plt.show()
One thing that is interesting here, and easily researched via Google, is that some franchisees, like Mr. Scrivanos, Mr. Andrade, Mr. Wolak, and Mr. Cafua, are large independent holders whose business appears to center primarily around Dunkin' Donuts franchises, while others, like Mr. Colaianni and Mr. Weissberg, hold these franchises on behalf of large holding companies or banking/investment operations.
Conclusion
It's pretty amazing what kind of data is just freely available online for those willing to go digging, and how it can be mashed up in interesting combinations. Actually, this post is just scratching the surface — many states now keep all corporate filings in publicly searchable (and scrape-able!) databases online (e.g. MA Corporations Division). If we really wanted to start connecting all sorts of dots across different businesses, I think analyzing such data could be eye-opening. Some journalists are already starting to do this kind of thing in an attempt to dig up irregularities in political campaign donations.
Along these lines, there are some fascinating results emerging based on data sets just like this; for instance, check out this paper, titled "Geo-Spotting: Mining Online Location-based Services for Optimal Retail Store Placement." In it, the authors describe how Foursquare check-ins might be able to predict profitable places to put a coffee shop.
Actually, it even looks like Dunkin' Brands is hiring for people who can "Develop probabilistic models, use machine learning algorithms, Bayesian techniques to model business scenarios." Cool stuff.
FULL DISCLOSURE: I own one share of DNKN stock. Other than that, I have no affiliation with this company.