Analyzing Dunkin' Donuts' business with scraped data

I thought it would be interesting to do a post on what we can learn about a business with data analysis using solely open source intel — in other words, bringing together data sources which are publicly available but not necessarily meant to be analyzed by outsiders.

A few days ago, I was trolling around the Dunkin' Donuts store locator. By examining how store location data is requested and transmitted back to the user, I was able to scrape all of it together into one comprehensive table. The result was a data set with geographic and operational information on every single Dunkin' Donuts in the continental United States. In this post, we'll do some exploratory data analysis and visualization with this data set. We'll also examine some aspects of Dunkin's business by mashing the location data up with census and demographic data, and also data on franchisees munged from public filings.

The structure of the post will be to pose a series of hypothetical questions we might want to answer, and then trying to either answer it with the data we already have or go fishing for more open source data in order to figure it out.

Getting the data

I noticed that each search result has some icons representing things like whether the store accepts loyalty cards, whether it sells K-cups, what co-branded company the store is located with (e.g. Hess gas station), and so on. Briefly wondering whether the data used to render these templates was easily available, I opened the Chrome developer tools to take a look at what was going on behind the scenes.

It turns out that Dunkin' Donuts is using MapQuest's enterprise platform to serve up data for their store locator in JSON format. Because all sorts of other data is stored and served in the same requests, the MapQuest API is presumably being used for other internal applications. While small chunks of data are transmitted to the browser of anybody who searches for store locations, we can actually assemble all of it just by making a few similar GET requests and tweaking the parameters for location, radius and max results.

We'll save the data in a csv file for analysis. With that done, we can set up our analysis environment and dig in.

In [1]:
# import the tools we're going to work with
from __future__ import division
from mpl_toolkits.basemap import Basemap
import numpy as np
import pandas as pd

# set some options to make things look nicer
import seaborn as sns
sns.set_color_palette("Set1")
set1 = sns.color_palette("Set1")
set2 = sns.color_palette("Set2")
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 20)
pd.set_option('show_dimensions', False)

# read in tables
fulldf = pd.read_csv('data/dd.csv')
hours = pd.read_csv('data/hours.csv')

# narrow the main df down to the columns we care about
wanted_cols = ['RecordId', 'Lat', 'Lng', 'address', 'city', 'state',
               'postal', 'PhoneNumber', 'co_brander_cd']
df = fulldf[wanted_cols]

So here is what we have available:

In [2]:
df.head()
Out[2]:
RecordId Lat Lng address city state postal PhoneNumber co_brander_cd
0 336416 46.860233 -68.009471 89 High St Caribou ME 4736 207-492-1701 NaN
1 342706 46.699621 -68.010426 781 Main St Presque Isle ME 4769 207-762-3825 WLM
2 336417 46.677409 -68.015942 283 Main St Presque Isle ME 4769 207-762-3835 NaN
3 336418 46.141994 -67.840564 246 North St Houlton ME 4730 207-532-5954 NaN
4 341998 45.657604 -68.691898 719 Central St Millinocket ME 4462 207-723-8749 NaN
In [3]:
# store hours, can be joined with our main df on key `I`
hours.head()
Out[3]:
I Fri_hours Mon_Hours Sat_hours Sun_Hours Thu_hours Tue_Hours Wed_Hours
0 3183 24HOURS 24HOURS 24HOURS 24HOURS 24HOURS 24HOURS 24HOURS
1 4001 07:00-19:00 07:00-19:00 07:00-19:00 07:00-19:00 07:00-19:00 07:00-19:00 07:00-19:00
2 2573 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00
3 276 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00 05:00-23:00
4 5151 04:00-23:00 04:00-23:00 04:00-23:00 05:00-23:00 04:00-23:00 04:00-23:00 04:00-23:00
In [4]:
# other fields that are in fulldf but we're not working with at the moment
print '\n'.join([c for c in fulldf.columns if c not in df.columns])
BeverageOnly
ComboStore
DriveIn
DunkinCardEnabled
I
Ic
Loyalty
Mobile
S
SiteType
TurboOven
Wireless
address2
country
dma_cd
k_cup
matchcode

Where exactly are Dunkin' Donuts located?

Living in Massachusetts, I am likely to be within sight of a Dunkin' Donuts at all times. But, the world is an unfair place, and not everyone is so fortunate.

Luckily, our data set came with specific lat/long coordinates, which makes it very easy to do some exploratory data analysis. We'll be using the Basemap toolkit for geographic visualizations. Because we'll be doing multiple visualizations with consistent styling, let's avoid repetition by creating a utility function to set up our maps for us:

In [5]:
def createmap(limits=[22, 50, -125, -67], figsize=(12, 12)):
    """ create a Basemap object for plotting """
    
    # get user defined latitude and longitude limits
    llcrnrlat, urcrnrlat, llcrnrlon, urcrnrlon = limits
    
    # create the Basemap object
    fig, ax = plt.subplots(figsize=figsize)
    m = Basemap(projection='merc',
                llcrnrlat=llcrnrlat,
                urcrnrlat=urcrnrlat,
                llcrnrlon=llcrnrlon,
                urcrnrlon=urcrnrlon,
                resolution='l')
    
    # set some aesthetic options
    m.fillcontinents(color='white', lake_color='#eeeeee')
    m.drawstates(color='lightgray')
    m.drawcoastlines(color='lightgray')
    m.drawcountries(color='lightgray')
    m.drawmapboundary(fill_color='#eeeeee')
    m.ax = ax
    
    return m, fig, ax

With that done, let's just plot the whole data set and see what we're working with:

In [6]:
m, fig, ax = createmap()

x, y = m(np.array(df.Lng), np.array(df.Lat))
m.scatter(x, y, marker='o', c='#ff6200', s=8,
          zorder=2, alpha=0.5, linewidth=0)

plt.show()

What are Dunkin' Donuts' administrative divisions and how many stores are in each?

In the United States, franchising companies are required by the FTC to issue franchise disclosure documents (FDDs) to their potential franchisees. I was able to find Dunkin' Brands' 2011 filing here, from which we can gain some interesting insights into Dunkin's operations and business model.

For instance, here are their regional divisions:

We can add these to the dataframe:

In [7]:
regions = pd.read_csv('data/regions.csv', usecols=range(1, 4))
regions.head()
Out[7]:
state_name region state
0 Connecticut Northeast States CT
1 Maine Northeast States ME
2 Massachusetts Northeast States MA
3 New Hampshire Northeast States NH
4 Rhode Island Northeast States RI

Now that we have this data in a workable format, let's ask a question we can now answer that we couldn't before. Personally, I assumed that the Northeast states — a traditional Dunkin' stronghold — would easily have the most Dunkin' stores. That turns out not to be the case:

In [8]:
df.merge(regions[['state', 'region']]).groupby('region').RecordId\
    .nunique().order(ascending=False)
Out[8]:
region
Mid-Atlantic States    2645
Northeast States       1985
Southeast States       1611
Mid-West States         710
Central States          214
Southwest States        173
dtype: int64

Instead, the Mid-Atlantic states (DE, NJ, NY, PA) top the list.

What state has the highest number of Dunkin' locations per capita?

Using US Census data (acquired here), we can easily answer that question.

In [9]:
# read the US gov't census data, keeping only the columns we care about
population = pd.read_csv('data/population.csv', usecols=(4, 5))

# get rid of the ugly, capital letter names that came in this csv
population.columns = ['state_name', 'pop2013']

# narrow down to states that actually contain store locations
mask = population.state_name.isin(regions.state_name)
population = population[mask]

# add the state abbreviations and reindex
population = population.merge(regions[['state', 'state_name']], on='state_name')
population = population.set_index('state')

# add locations by state and remove states with none
population['locations'] = df.groupby('state').RecordId.nunique()
population = population.dropna()

# figure out the quantity of interest and display the top 10
population['people_per_dunkin'] = population.pop2013 // population.locations
population.sort('people_per_dunkin').head(10)
Out[9]:
state_name pop2013 locations people_per_dunkin
state
MA Massachusetts 6692824 1009 6633
RI Rhode Island 1051511 158 6655
CT Connecticut 3596080 484 7429
NH New Hampshire 1323459 178 7435
NJ New Jersey 8899339 786 11322
ME Maine 1328302 117 11353
NY New York 19651127 1291 15221
DE Delaware 925749 58 15961
VT Vermont 626630 39 16067
IL Illinois 12882135 529 24351

That means that in my home state of MA, we take the prize with a staggering one Dunkin' Donuts for every 6,633 people! Let's take a much closer look at the population and demographics affecting how Dunkins are located within the most densely Dunkin'-ed state. To do this, I'll use a data set of population and demographic information on every MA zip code that I scraped from biggestuscities.com which itself repackages US census data.

What factors influence the placement of store locations?

It seems intuitive that population will be the dominant factor in how many stores are situated in each zip code. But we could easily imagine that other demographic factors are also significant. For instance, we might wonder whether stores are more likely to be in wealthier or working class neighborhoods. Let's take a look:

In [10]:
# read in the MA demographics data
ma = pd.read_csv('data/ma.csv', usecols=range(5)).set_index('postal')

# count up dunkins per zip code
mask = df.state == 'MA'
ma_dunkins_by_zip = df[mask].groupby('postal').RecordId.count()
ma_dunkins_by_zip.name = 'dunkins'
ma = ma.join(ma_dunkins_by_zip)

# sort and display the top 5
ma.sort('dunkins', ascending=False).head()
Out[10]:
pop2012 households income_per_cap home_median_value dunkins
postal
2301 60850 21941 21629 262900 15
2128 41128 14832 22403 311700 13
2184 35409 13267 37317 372900 11
2151 50845 19425 25085 327800 10
2155 58211 22438 33346 392600 10

One great tool for looking at multivariate relationships is a scatterplot matrix. Not only does it show what every variable's association looks like with respect to any other, looking along the diagonal shows each variable's marginal distribution.

In [11]:
from pandas.tools.plotting import scatter_matrix

axeslist = scatter_matrix(ma, alpha=0.8, figsize=(12, 12), c=set1[1], diagonal="kde")
for ax in axeslist.flatten():
    ax.grid(False)

As expected, the dunkins variable's tightest and most obvious correlation involves population. The more people (or households), the more restaurant locations.

One obvious thing that this plot demonstrates very clearly is that population and the number of households are highly correlated. No surprise there, but it reminds us that in any model we might build, we would probably either want to (a) choose one of the two and avoid using the other, or (b) reduce the dimensionality with a method like PCA. The same goes for income per capita and median home value.

The term for this phenomenon is multicollinearity, and it's good practice to be careful about this before jumping into modeling. Let's look at the calculated correlations:

In [12]:
ma.corr()
Out[12]:
pop2012 households income_per_cap home_median_value dunkins
pop2012 1.000000 0.986214 -0.260980 -0.143104 0.689992
households 0.986214 1.000000 -0.245324 -0.140970 0.691346
income_per_cap -0.260980 -0.245324 1.000000 0.839948 -0.133693
home_median_value -0.143104 -0.140970 0.839948 1.000000 -0.082714
dunkins 0.689992 0.691346 -0.133693 -0.082714 1.000000

Can we model the relationship between population and store locations?

There doesn't necessarily look to be a strong linear correlation between number of Dunkins and any variable except population (or number of households). Simpler models are not always better, but although we might be able to squeeze some predictive power out of the other variables given a bit more legwork, it's often helpful to start with the most basic, "parsimonious" model (relevant Andrew Gelman post here).

Let's build a very simple regression model using pop2012 as the predictor variable for number of dunkins in a given zip code. We'll use a package called statsmodels, which will look very familiar to users of the R statistical package.

In [13]:
import statsmodels.formula.api as smf

# copy the dataframe so we can manipulate the `pop2012` column -- in
# order not to use such huge numbers (10k-60k) for population compared
# to much smaller numbers for dunkins (1-15), we'll divide population
# by 10,000 so it's roughly the same order of magnitude as the outcome variable
regression_df = ma[['pop2012', 'dunkins']].copy()
regression_df.pop2012 = regression_df.pop2012/10000.

# fit the model
results = smf.ols('dunkins ~ pop2012', data=regression_df).fit()

# print out results
results.summary()
Out[13]:
OLS Regression Results
Dep. Variable: dunkins R-squared: 0.476
Model: OLS Adj. R-squared: 0.475
Method: Least Squares F-statistic: 315.3
Date: Tue, 08 Apr 2014 Prob (F-statistic): 1.21e-50
Time: 15:15:47 Log-Likelihood: -676.97
No. Observations: 349 AIC: 1358.
Df Residuals: 347 BIC: 1366.
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 0.5527 0.159 3.475 0.001 0.240 0.866
pop2012 1.3432 0.076 17.757 0.000 1.194 1.492
Omnibus: 61.177 Durbin-Watson: 2.124
Prob(Omnibus): 0.000 Jarque-Bera (JB): 162.595
Skew: 0.825 Prob(JB): 4.93e-36
Kurtosis: 5.908 Cond. No. 4.31

So the equation is something like dunkins = 1.3 * (population/10,000) + 0.6.

Another way to do linear regression is using scikit-learn, the extremely popular machine learning library for Python. Of course, this will be somewhat overkill considering that we aren't even breaking the data up into train and test sets, but it may be useful to see this same modeling task carried out both ways.

In [14]:
from sklearn.linear_model import LinearRegression

# split the data into features and labels
X = ma.pop2012.values.reshape(-1, 1)/10000.
y = ma.dunkins.values

# fit the model
model = LinearRegression()
model.fit(X, y)

# print out the regression coefficients
coeffs = (model.coef_[0], model.intercept_)
print 'dunkins = %0.04f * (population/10,000) + %0.04f' % coeffs
dunkins = 1.3432 * (population/10,000) + 0.5527

In [15]:
# plot the results
fig, ax = plt.subplots(figsize=(10, 6))

# plot observed data with a small random y-axis jitter to avoid dots stacking up
jitter = np.random.rand(y.size)/2
plt.scatter(X, y + jitter, c=set1[1], label='observed')

# plot fitted prediction line
plt.plot(X, model.predict(X), c=set1[0], alpha=0.5, label='fit')

# set some aesthetic options
ax.set_xlim(0, X.max())
ax.set_ylim(0, y.max())
ax.set_xlabel('population (tens of thousands)')
ax.set_ylabel('dunkin locations')

plt.legend()
plt.show()

So what have we learned? Well, assuming that franchisees have opened up store locations based on some research and homework, and that stores which don't have enough demand eventually end up closing, it's probably reasonable to assume that this population model roughly represents the number of viable Dunkin' locations the market will bear at a given population level.

Is this usable? Well, let's say I am thinking about opening a DD franchise in Massachusetts. Naturally, I want to figure out the most profitable place to locate my brand new franchise. I certainly wouldn't want to plant myself somewhere that is already highly saturated unless I had some specific reason to believe I could catch some previously unmet demand. What I really want to know is this: which place might have enough population to support more Dunkin' locations than they currently have?

Let's take a look at the data:

In [16]:
# create a quick function to get the place name from the zip code
get_city_name = lambda x: df[df.postal == x].reset_index().ix[0, 'city']

# add the columns we're interested in
ma['predicted'] = model.predict(ma.pop2012.values.reshape(-1, 1)/10000.)
ma['gap'] = ma.predicted - ma.dunkins
ma['city'] = map(get_city_name, ma.index.values)

# show the top 10 potentially underserved zip codes
wanted_cols = ['pop2012', 'income_per_cap', 'dunkins', 'predicted', 'gap', 'city']
ma[wanted_cols].sort('gap', ascending=False).head(10)
Out[16]:
pop2012 income_per_cap dunkins predicted gap city
postal
1841 47741 15197 1 6.965108 5.965108 Lawrence
1902 44991 20012 1 6.595737 5.595737 Lynn
1040 39897 20370 1 5.911528 4.911528 Holyoke
2139 35994 39875 1 5.387290 4.387290 Cambridge
2186 26828 44718 1 4.156144 3.156144 Milton
2124 48384 22095 4 7.051473 3.051473 Dorchester
2121 24972 18187 1 3.906852 2.906852 Dorchester
1602 22353 30601 1 3.555077 2.555077 Worcester
1851 29726 22782 2 4.545394 2.545394 Lowell
2467 21952 55947 1 3.501216 2.501216 Chestnut Hill

Undoubtedly, many of these "Dunkin' gaps" are due to zoning and other stuff we could uncover once we start looking into specifics. However, now we have some candidate zip codes to start doing market research on instead of just guessing.

What co-brands does Dunkin' Donuts work with?

One interesting feature in this data set is looking at other companies that Dunkin' Brands works with. You may have seen that Dunkin' Donuts are sometimes co-located with another business like a Stop & Shop grocery store (in the Northeast) or a Hess gass station. By simple eyeballing of the dataframe grouped by co_brander_cd (and some judicious googling), I was able to put together a partial list of those co-brands:

In [17]:
cobrands = {
    'ARA': 'Aramark',
    'BCG': 'Boston Culinary Group',
    'BRD': 'Baskin Robbins',
    'BDT': 'Baskin Robbins',
    'BP':  'BP',
    'CIT': 'Citgo',
    'CMB': 'Cumberland Farms',
    'DEL': 'Delta',
    'DNC': 'Delaware North Companies',
    'EXX': 'Exxon',
    'GLF': 'Gulf',
    'HES': 'Hess',
    'HMD': 'Home Depot',
    'KUK': 'Lukoil',
    'MOB': 'Mobil',
    'OTG': 'OTG Management',
    'PET': 'Petro',
    'PM':  'Pathmark',
    'SHL': 'Shell',
    'SHW': 'Shaws',
    'SOD': 'Sodexho',
    'SR':  'Shop Rite',
    'SS':  'Stop & Shop',
    'SSP': 'SSP America',
    'SUN': 'Sunoco',
    'VAL': 'Valero',
    'WH':  'WilcoHess',
    'WLM': 'Walmart'
}

# add co-brands to our working dataframe
cobrands = pd.Series(cobrands, name='cobrand_name')
df = df.join(cobrands, on='co_brander_cd')

Let's see what other businesses have the most co-locations with Dunkin' Donuts:

In [18]:
cb = df.groupby('cobrand_name').cobrand_name\
        .count().order(ascending=False).head(10)
cb
Out[18]:
cobrand_name
Hess              350
WilcoHess         264
Baskin Robbins    235
Mobil              80
Walmart            75
Shell              73
Exxon              49
BP                 40
Stop & Shop        38
Citgo              29
dtype: int64

Any seasoned Dunkin' Donuts observer won't be surprised by the top 3 here — Hess and Dunkin' Brands have a longstanding strategic parternership, and the Baskin Robbins brand is actually also owned by Dunkin’ Brands Group, Inc.

Where are these co-brandings prevalent?

In [19]:
# create the Basemap object
m, fig, ax = createmap(limits=[23, 47, -95, -65])

# get a list of the top 5 co-brands
top_5 = cb.head(5).index.values

# iterate through the top 5 and plot their locations
for i, name in enumerate(top_5):
    mask = (df.cobrand_name == name)
    lon, lat = np.array(df[mask].Lng), np.array(df[mask].Lat)
    x, y = m(lon, lat)
    m.scatter(x, y, marker='o', c=set1[i], s=15, zorder=i+3, linewidth=0,
              label=name)
    
# plot all the rest
mask = ~(df.cobrand_name.isin(top_5))
lon, lat = np.array(df[mask].Lng), np.array(df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='x', c='lightgray', label='other',
               s=15, zorder=2, alpha=0.5, linewidth=1)

plt.legend(loc='lower right', fontsize=18)
plt.show()

What's the distribution of small or large franchisees?

Also included in the Franchise Disclosure Document is a list of every single current franchisee. There are pages and pages of the pdf that look just like this:

Now we may be tempted to just throw this out as unusable information — it's in the dreaded pdf format after all — but at least all of these lines are in fairly standard, comma separated format. Using the command line utility pdftotext, I was able to dump text information from the desired pages to a text file.

wget -O DD_FDD_11.pdf http://www.bluemaumau.org/sites/default/files/DD_FDD%20UFOC%2003-26-10%20AMND%2011-23-10.pdf
pdftotext -f 126 -l 224 DD_FDD_11.pdf franchisees.txt
pdftotext -f 594 -l 626 DD_FDD_11.pdf franchisees2.txt
cat franchisees.txt franchisees2.txt > franchisees_all.txt

Using some (admittedly annoying and convoluted) regular expressions and little bit of manual cleanup, I was able to get this text into csv format and ready to import.

Note: looking at the address and phone number fields, you might worry (as I did) that this data could be considered personally identifiable information. Even though it is publicly available on the internet, we always want to avoid "doxing" private citizens for no reason. Don't worry though — this information pertains to the Dunkin' Donuts location, not the person.

In [20]:
franchisees = pd.read_csv('data/franchisees.csv')
franchisees
Out[20]:
RecordId franchisee address PhoneNumber
0 338639 Scott Fanning 497 Route 6, Andover, CT, 06232 1320 860-742-5772
1 336142 Maria Micciche 39 Pershing Dr, Ansonia, CT, 06401 2214 203-732-5787
2 335770 Virginio Sardinha 11 Nott Highway, Ashford, CT, 06278 1316 860-429-3100
3 340368 Scott Fanning 75 E. Main St, Avon, CT, 06001 860-674-9257
4 342508 Cary Gagnon 1100 Berlin Tpke, Berlin, CT, 06037 860-828-6002
5 336743 Cary Gagnon 1113 Farmington Avenue, Berlin, CT, 06037 1218 860-828-0083
6 340508 Cary Gagnon 2005 Berlin Turnpike Wilbur Cross Highway, Ber... 860-828-5885
7 330641 Cary Gagnon 223 New Britain Ave, Berlin, CT, 06037 1353 860-229-0606
8 344540 Eduardo Batista 203 Greenwood Ave, Bethel, CT, 06801 2113 203-781-2576
9 303311 Eduardo Batista 25 Grassy Plain Rd, Bethel, CT, 06801 1703 203-797-9221
10 310247 James Zafiris 413 Cottage Grove Rd, Bloomfield, CT, 06002 3119 860-286-1175
11 301753 James Zafiris 760 Park Ave, Bloomfield, CT, 06002 2457 860-243-5633
12 341358 Christina Zafiris 903-905 Blue Hills Ave, Bloomfield, CT, 06002 860-286-9061
13 343384 James Zafiris 6 Old Windsor Rd Cumberland Farms, Bloomfield,... 860-243-0751
14 301333 Carlos Andrade 858 W Main St, Branford, CT, 06405 3422 203-488-7402
15 344893 Carlos Andrade 364 E. Main St, Branford, CT, 06405 2938 203-488-5938
16 338170 Carlos Andrade 33 Leetes Island Rd, Branford, CT, 06405 6513 203-315-1155
17 330839 Carlos Andrade 207 East Main St, Branford, CT, 06405 3102 203-488-7334
18 300343 Carlos Andrade 112 N Main St, Branford, CT, 06405 3011 203-488-8389
19 335070 Wayne Bowman 815 Lafayette Blvd, Bridgeport, CT, 06604 5724 203-366-1605
... ... ... ...

Keep in mind that this table is a couple years old (I found the 2011 version online and didn't feel like paying \$99 for the 2013 FDD, and part of the fun of this hunt is using free, open source intel). Even so, it's pretty interesting.

In [22]:
franchisees.groupby('franchisee').franchisee\
    .count().order(ascending=False).head(12)
Out[22]:
franchisee
Richard Lawlor           228
Stephen Williams         103
Constantine Scrivanos     90
Carlos Andrade            68
Mark Cafua                58
Edward Wolak              51
Konstantino Skrivanos     48
Anton Nader               48
David Cafua               42
Jose Couto                37
Dinart Serpa              35
Kaushik Patel             33
dtype: int64

Doing some quick googling, we find that the top two franchisees (Richard Lawlor and Stephen Williams) are executives for Hess and WilcoHess, respectively, two of Dunkin's biggest co-brands.

Now that we have this information, the next logical step is to connect it with our store location information. Luckily, Dunkin' Donuts has made this very easy for us by using the same foreign key RecordId in their FDD filing as in their MapQuest API!

In [23]:
# make a smaller df for joining with only the fields we care about
wanted_cols = ['RecordId', 'Lat', 'Lng', 'state']
franch_df = franchisees.merge(df[wanted_cols], on='RecordId')
franch_df
Out[23]:
RecordId franchisee address PhoneNumber Lat Lng state
0 338639 Scott Fanning 497 Route 6, Andover, CT, 06232 1320 860-742-5772 41.732901 -72.356867 CT
1 336142 Maria Micciche 39 Pershing Dr, Ansonia, CT, 06401 2214 203-732-5787 41.334069 -73.083807 CT
2 335770 Virginio Sardinha 11 Nott Highway, Ashford, CT, 06278 1316 860-429-3100 41.857074 -72.178935 CT
3 340368 Scott Fanning 75 E. Main St, Avon, CT, 06001 860-674-9257 41.807312 -72.824295 CT
4 342508 Cary Gagnon 1100 Berlin Tpke, Berlin, CT, 06037 860-828-6002 41.620599 -72.742552 CT
5 336743 Cary Gagnon 1113 Farmington Avenue, Berlin, CT, 06037 1218 860-828-0083 41.629956 -72.758391 CT
6 340508 Cary Gagnon 2005 Berlin Turnpike Wilbur Cross Highway, Ber... 860-828-5885 41.597576 -72.754250 CT
7 330641 Cary Gagnon 223 New Britain Ave, Berlin, CT, 06037 1353 860-229-0606 41.642589 -72.769804 CT
8 344540 Eduardo Batista 203 Greenwood Ave, Bethel, CT, 06801 2113 203-781-2576 41.371809 -73.415042 CT
9 303311 Eduardo Batista 25 Grassy Plain Rd, Bethel, CT, 06801 1703 203-797-9221 41.376866 -73.425716 CT
10 310247 James Zafiris 413 Cottage Grove Rd, Bloomfield, CT, 06002 3119 860-286-1175 41.817729 -72.716034 CT
11 301753 James Zafiris 760 Park Ave, Bloomfield, CT, 06002 2457 860-243-5633 41.828488 -72.731447 CT
12 341358 Christina Zafiris 903-905 Blue Hills Ave, Bloomfield, CT, 06002 860-286-9061 41.814550 -72.695089 CT
13 301333 Carlos Andrade 858 W Main St, Branford, CT, 06405 3422 203-488-7402 41.276475 -72.836483 CT
14 344893 Carlos Andrade 364 E. Main St, Branford, CT, 06405 2938 203-488-5938 41.295386 -72.782856 CT
15 338170 Carlos Andrade 33 Leetes Island Rd, Branford, CT, 06405 6513 203-315-1155 41.296891 -72.765736 CT
16 330839 Carlos Andrade 207 East Main St, Branford, CT, 06405 3102 203-488-7334 41.289642 -72.799228 CT
17 300343 Carlos Andrade 112 N Main St, Branford, CT, 06405 3011 203-488-8389 41.286369 -72.817094 CT
18 335070 Wayne Bowman 815 Lafayette Blvd, Bridgeport, CT, 06604 5724 203-366-1605 41.173479 -73.192486 CT
19 302984 Claude Stewart, Inc 979 Main St, Bridgeport, CT, 06604 4268 203-335-8550 41.177877 -73.189243 CT
... ... ... ... ... ... ...

By default, merge operates a lot like SQL JOIN. In fact, it has a default keyword argument how='left' that reflects the SQL-like default LEFT INNER JOIN-style behavior, meaning that if records from the right-hand table don't match (in this case RecordIds in df that are different or missing), they will be dropped.

From the cell above, it looks like we only lost around 700 data points from missing or unmatched RecordId fields. We know that much of this is probably due to the store data in df being much more up to date than the franchisee data from 2011. Let's see if we can do a little better by trying to match some of the rest on phone numbers:

In [24]:
# pick the columns we want to use for joining
wanted_cols = ['PhoneNumber', 'Lat', 'Lng', 'state']

# figure out which rows we failed to match last time
not_matched = df[~df.RecordId.isin(franch_df.RecordId.values)]

# try to merge on phone number this time around
new_matches = franchisees.merge(not_matched[wanted_cols], on='PhoneNumber')

# add our newly matched stores to the working dataframe
franch_df = pd.concat((franch_df, new_matches))

Not bad — we were able to add over 100 records back in just by trying a different foreign key.

Where are these large franchisees' holdings geographically clustered in MA?

Maybe we want see who the largest franchisees are in MA, one of DD's most heavily concentrated states:

In [25]:
m, fig, ax = createmap(limits=[41.4, 43.1, -73.6, -69.4], figsize=(13, 8))

# figure out the top 10 franchisees in MA
mask = (franch_df.state == 'MA')
ma_top = franch_df[mask].groupby('franchisee').franchisee\
    .count().order(ascending=False)
    
# grab just the top 10 for plotting purposes
top_10 = ma_top.head(10).index.values

# get together a long-ish list of different colors to plot with
colors = set2[:6] + set1

# iterate through the top 10 and plot their locations
for i, name in enumerate(top_10):
    mask = (franch_df.state == 'MA') & (franch_df.franchisee == name)
    lon, lat = np.array(franch_df[mask].Lng), np.array(franch_df[mask].Lat)
    x, y = m(lon, lat)
    m.scatter(x, y, marker='o', c=colors[i],
              s=15, zorder=i+3, linewidth=0,
              label='%s (%d)' % (name, ma_top[name]))
    
# plot all the rest
mask = (franch_df.state == 'MA') & ~(franch_df.franchisee.isin(top_10))
lon, lat = np.array(franch_df[mask].Lng), np.array(franch_df[mask].Lat)
x, y = m(lon, lat)
m.scatter(x, y, marker='x', c='lightgray', label='other',
               s=15, zorder=2, alpha=0.5, linewidth=1)

plt.legend(loc='upper right')
plt.show()