Data Science project — The hidden taste of US zip codes — part 2

Downloading the data:

  1. For the US zip code data you can simply download the file from OpenDataSoft, and since we are working with Pandas the best format is CSV. (https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/)
  2. I downloaded the house value information from Zillow's research section. Simply download the preferred data (under Geography, select zipcode). After downloading, you will have a roughly 60 MB file with prices dating back to 1996; open the file and remove all unnecessary data points (a small pandas sketch of this is shown after this list). It'll be faster to work with a smaller file.
  3. Foursquare: this is our third dataset, but since we can access it only using an address or geocode, we need to limit our number of requests (one request per geolocation). I'll explain more in the Foursquare API part later.
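
To illustrate step 2, here is a minimal pandas sketch of trimming the Zillow export down to the columns we need; the file name and column names (RegionName, State, City) are assumptions based on a typical Zillow export, so adjust them to whatever your download actually contains.

import pandas as pd

# hypothetical file name -- use whatever Zillow exported for you
zillow_df = pd.read_csv('Zip_Zhvi_AllHomes.csv')

# keep only the identifying columns plus the most recent month of prices
# (column names are assumptions -- check your file's header)
latest_month = zillow_df.columns[-1]  # the last column is the newest month
zillow_df = zillow_df[['RegionName', 'State', 'City', latest_month]]
zillow_df = zillow_df.rename(columns={'RegionName': 'Zip', latest_month: 'House Value'})

# write the much smaller file back to disk
zillow_df.to_csv('zillow_trimmed.csv', index=False)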

Importing the data:

If you know how to import the files and merge them, you can skip this part.
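
In case it helps, here is a minimal sketch of the import and merge, assuming the trimmed Zillow file from the previous sketch and the OpenDataSoft export (the separator and column names are assumptions; check your files). I name the result grouped_df to match the variable used below.

import pandas as pd

# OpenDataSoft CSV exports are typically semicolon-separated
zip_df = pd.read_csv('us-zip-code-latitude-and-longitude.csv', sep=';')
zillow_df = pd.read_csv('zillow_trimmed.csv')

# merge on the zip code and drop rows that have no house value
merged_df = zip_df.merge(zillow_df[['Zip', 'House Value']], on='Zip', how='left')
grouped_df = merged_df.dropna(subset=['House Value']).reset_index(drop=True)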

Cleaning the data

Perfect, we have a DataFrame with our downloaded data, and all the zip codes have house values (if you skipped the import part: I removed the rows without house values). The data also includes US territories such as American Samoa, Guam and Puerto Rico, so I drop those rows to keep only the 50 states plus DC:

import numpy as np

print('Number of states in DF: {}'.format(len(grouped_df['State'].unique())))

# non-state territories we don't want in the analysis
territories = ['AS', 'GU', 'MP', 'PR', 'VI']
for state in territories:
    rows = len(np.where(grouped_df['State'] == state)[0])
    print('Deleting {} rows of state {}'.format(rows, state))
    grouped_df = grouped_df[grouped_df['State'] != state]

Exploring The Data

Before we retrieve the restaurant data, we should limit the number of locations we are going to work with; of course, you can work with all of them, but the Foursquare API has a daily request limit and it would take you a lot of time.

import seaborn as sns
sns.distplot(state_df['House Value'])

from matplotlib import pyplot as plt
import matplotlib as mpl
import math

fig = plt.figure(figsize=(30, 40))
states_array = grouped_df['State'].unique()
rows = math.ceil(len(states_array) / 2)
col = 2
# we can explore this by state to see if we find any differences
for i, state in enumerate(states_array):
    state_df = grouped_df[grouped_df['State'] == state]
    fig.add_subplot(rows, col, i + 1)
    ax = sns.distplot(state_df['House Value'])
    ax.set_title(state)
    ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(float(x), ',')))
plt.tight_layout()
  1. grouped_df['State'].unique() — returns a list of all the unique states.
  2. This command just selects all the rows for a specific state:
    state_df = grouped_df[grouped_df['State'] == state]
  3. Plotting this data as a distribution plot and saving it into a variable ax:
    ax = sns.distplot(state_df['House Value'])
    I'm doing this for two reasons: first to add the state as a title, ax.set_title(state), and second to format how prices are displayed, since I wanted a comma for thousands:
    ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(float(x), ',')))
  4. from matplotlib import pyplot as plt — this is a MATLAB-style plotting library for Python, a very useful one, and we use it here to customize the plot size.
  5. fig = plt.figure(figsize=(30, 40)) — by calling plt.figure we can pass the size of our plotting canvas. I passed a height bigger than the width since we have a long list of plots.
  6. Plotting two in a row requires us to calculate the number of rows; we could simply enter 26 (51 / 2, rounded up), or calculate it in the code as I did here: `rows = math.ceil(len(states_array) / 2)`
  7. For columns I simply decided to use 2; more would be harder to read.
  8. In the loop, we call add_subplot before each new distribution plot, so we can tell matplotlib at which index the plot should be drawn: fig.add_subplot(rows, col, i + 1). The add_subplot function takes the following params (rows, columns, index); this way we get a two-column layout.
  9. plt.tight_layout() — simply adjusts the padding around the plots so the figure looks a little bit better.
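
Once the per-state distributions have helped you pick a state, the working set can be narrowed to just its zip codes. A minimal sketch, using 'NY' purely as an example:

# keep only the zip codes of the chosen state ('NY' is just an example)
chosen_state = 'NY'
state_df = grouped_df[grouped_df['State'] == chosen_state].reset_index(drop=True)
print('Working with {} zip codes in {}'.format(len(state_df), chosen_state))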
To see where the selected zip codes sit geographically, I plotted them on a folium map with a marker cluster:

import folium
from folium.plugins import MarkerCluster

# base map centered on the state (latitude/longitude of its center), with a marker cluster on top
venues_map = folium.Map(location=[latitude, longitude], zoom_start=10, prefer_canvas=True)
marker_cluster = MarkerCluster().add_to(venues_map)

# one circle marker per location
folium.CircleMarker(
    [lat, lng],
    radius=5.0,
    popup=label,
    color=color,
    fill=True,
    fill_color=color,
    fill_opacity=0.7,
    parse_html=False).add_to(marker_cluster)
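
The CircleMarker call is meant to run once per zip code; below is a hedged sketch of the loop it might sit in, where the column names, label format, and color choice are my assumptions rather than the original code.

# one circle marker per zip code, colored the same for simplicity
for _, row in state_df.iterrows():
    lat, lng = row['Latitude'], row['Longitude']
    label = '{}: ${:,.0f}'.format(row['Zip'], row['House Value'])
    color = 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=5.0,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7,
        parse_html=False).add_to(marker_cluster)

venues_map  # display the map in the notebook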

Calling the Foursquare API

After we have found the state we are going to work with, we can use the API to get a list of restaurants around each location. You can use the API to retrieve other kinds of venues as well. Be aware that there is a daily limit for an account without a credit card, and you can extend it without charge by attaching a credit card to the account.

import requests
import numpy as np

CLIENT_ID = ''      # your Foursquare ID here
CLIENT_SECRET = ''  # your Foursquare Secret here
VERSION = '20180605'  # Foursquare API version
LIMIT = 50  # a max Foursquare API limit value
CATEGORYID = '4d4b7105d754a06374d81259'  # Foursquare category ID for Food venues

# lat, lng and zipcode refer to the zip code currently being queried
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&limit={}&radius={}&categoryId={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    lat,
    lng,
    LIMIT,
    1000,  # radius of 1000 m
    CATEGORYID)

# make the GET request
response = requests.get(url).json()['response']

# in case there are no venues, just continue with an empty array
if 'venues' not in response:
    response = []
else:
    response = response['venues']

# formatting the result
# since it's a nested object we have to format it manually
arr = [(zipcode,
        v['name'],
        v['location']['distance'],
        v['location']['lat'],
        v['location']['lng'],
        v['categories'][0]['name'] if len(v['categories']) > 0 else np.nan) for v in response]
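
The code above handles a single location; to cover every zip code in the chosen state, you would call it in a loop and stack the results. A hedged sketch, assuming the request-and-format code is wrapped in a hypothetical helper get_venues(zipcode, lat, lng) that returns the arr list, and using column names that match the next part:

import time
import pandas as pd

all_venues = []
for _, row in state_df.iterrows():
    # get_venues is a hypothetical wrapper around the request/format code shown above
    all_venues.extend(get_venues(row['Zip'], row['Latitude'], row['Longitude']))
    time.sleep(0.5)  # stay well under the rate limit

aggregated_venues_df = pd.DataFrame(all_venues, columns=[
    'ZipCode', 'Restaurant Name', 'Distance', 'Latitude', 'Longitude', 'Restaurant Category'])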

Cleaning the restaurants’ data

The first thing I did was check for duplicates within the same zip code. What are the chances of the same place appearing twice in the same zip code? If that happens, I prefer to treat it as a duplicate data point.

duplicateRowsDF = aggregated_venues_df[aggregated_venues_df.duplicated(['ZipCode', 'Restaurant Name'])]
print(duplicateRowsDF.shape)
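
If the check does find duplicates, one straightforward way to drop them (keeping the first occurrence of each zip code / name pair) could be:

# keep only the first occurrence of each (ZipCode, Restaurant Name) pair
aggregated_venues_df = aggregated_venues_df.drop_duplicates(
    subset=['ZipCode', 'Restaurant Name'], keep='first')
print(aggregated_venues_df.shape)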
A few more things worth checking:

  • The number of unique restaurant categories
  • The number of restaurants in our data set
  • The most frequent category / restaurant
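
A minimal sketch of how to pull those numbers out of the DataFrame; it also builds the per-category counts that I assume the sized variable in the next snippet holds:

# per-category counts; the next snippet filters rare categories with this
sized = aggregated_venues_df.groupby('Restaurant Category').size()

print('Unique restaurant categories: {}'.format(aggregated_venues_df['Restaurant Category'].nunique()))
print('Restaurants in the data set: {}'.format(len(aggregated_venues_df)))
print('Most frequent category: {}'.format(sized.idxmax()))
print('Most frequent restaurant: {}'.format(aggregated_venues_df['Restaurant Name'].value_counts().idxmax()))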
# get the list of categories with 10 or fewer items, and remove them
categories_to_remove = sized[sized <= 10].index.values
# remove all rows with those categories and keep the aggregated df
aggregated_venues_df = aggregated_venues_df[~aggregated_venues_df['Restaurant Category'].isin(categories_to_remove)]
aggregated_venues_df.describe(include='all')
def replace_category(df, cat):
    print('Number of places with category {}: {}'.format(cat, len(df[df['Restaurant Category'] == cat])))
    # unique business names that carry this category
    names = df[df['Restaurant Category'] == cat]['Restaurant Name'].unique()
    # df of items with the generic category
    food_cat_df = df[df['Restaurant Category'] == cat]
    # df of items with a category other than the generic one
    non_food_df = df[df['Restaurant Category'] != cat]
    # join them together so we can recover the right category:
    # manually replace the generic category with another category of the same business name
    for index, row in food_cat_df.iterrows():
        # look for the same restaurant name under a non-generic category
        search_df = non_food_df[non_food_df['Restaurant Name'] == row['Restaurant Name']].head(1)
        if len(search_df) > 0:
            cond = (df.ZipCode == row['ZipCode']) & (df['Restaurant Name'] == row['Restaurant Name'])
            df.loc[cond, 'Restaurant Category'] = search_df['Restaurant Category'].values[0]
    print('Number of places with category {} after correction: {}'.format(cat, len(df[df['Restaurant Category'] == cat])))
    return df
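
The function is then applied to the generic category; based on the print statements inside it, that category appears to be 'Food', so the call would look something like this:

# replace the generic 'Food' category with a more specific one where we can find it
aggregated_venues_df = replace_category(aggregated_venues_df, 'Food')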
# save the file to csv 
aggregated_venues_df.to_csv('aggregated_venues.csv', index=False)
