In part 1 of this post I described the research I did, trying to segment US zip codes by different restaurant categories.
The second and third parts will be more technical, going through the how-to of the project.
Downloading the data:
- For US zip code data you can simply download the file from OpenDataSoft, and since we are working with Pandas the best format will be CSV. (https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/)
- I downloaded the house values information from Zillow's research section. Simply download the preferred data (under Geography select Zip Code). After downloading, you will have a roughly 60 MB file with prices dating back to 1996; open the file and remove all the unnecessary data points, it'll be faster to work with a smaller file.
- Foursquare - This is our third dataset, but since we can access it only using an address or geocode, we need to limit our number of requests (one request per geolocation). I'll explain more in the Foursquare API part later.
Importing the data:
If you know how to import the files and merge them you can skip this part.
First, we need to import the pandas package: `import pandas as pd`
Pandas is a great open-source package for data analysis built for Python. If you are not familiar with it, I recommend learning more about it, but you can manage to follow this post without prior knowledge.
Pandas can read a CSV file and import it directly into a data frame; think about a data frame like an Excel sheet. By default it uses the column names as they are in the CSV.
Importing the file: `pd.read_csv({file path}, usecols=[{a list of column names, if we want to limit which columns to import}])`
From the housing data I decided to import the following columns: Region Name (the zip code) and 2020-10-31 (the house value columns are named by date, so this is the most recent value).
Those column names don't make much sense, so let's rename them using the following code: `houses_df.columns = ['Zip', 'House Value']`. "houses_df" is the name of my house values data frame.
If you want to check the size of the data frame after importing the data, all you need to do is call df_name.shape. Calling houses_df.shape results in the following size: (30230, 2), meaning we have 30,230 rows of data with 2 columns.
Importing the zip codes data: if you just try to run read_csv on the zip code file, the data will not be parsed into separate columns. That is because CSV files are usually separated with commas, but this one uses ";", which requires us to set the separator with the following param: sep=";". This will make it work.
Time to merge them. Pandas has a few different methods to merge data frames: merge, join, concat. For more information refer to this great post: https://realpython.com/pandas-merge-join-and-concat/
In this case, a merge is the simplest to use; we just need a shared key column in both DF's (data frames). You can set it as the index on both using set_index, or simply pass the on param to the merge function so it knows which column to merge on.
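As a minimal sketch, assuming the zip code file was loaded into a data frame I'll call zip_df and that both frames share a 'Zip' column (as after the renaming step above):

```python
# merge the zip code data with the house values on the shared 'Zip' column
grouped_df = zip_df.merge(houses_df, on='Zip')
```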
Some of the zip codes don't have house value data, so we can remove them by creating a new DF without null values in the house value column.
To select the null values simply use isnull(); calling it on the whole DF will mark every cell as null or not, and it's possible to check a specific column by first selecting it and then calling the function.
For example, the following code will tell us which rows have a null value in the House Value column:
`grouped_df['House Value'].isnull()`
If you want to see the rows themselves, you will have to select them from the DF:
`grouped_df[grouped_df['House Value'].isnull()]`
This will show all the rows which return True when calling isnull. But we are interested in those that don't have nulls, so we can use the tilde operator (~) to select all the other rows from our data frame:
`grouped_df[~grouped_df['House Value'].isnull()]`
If you save this to the same or another data frame, you've got yourself a DF without null house values :)
My full import code:
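In short, it boils down to something like the sketch below; the file paths and the exact Zillow column names are placeholders, so adjust them to the files you downloaded:

```python
import pandas as pd

# house values: keep only the zip code column and the latest price column
# (placeholder names; check the exact column names in your Zillow file)
houses_df = pd.read_csv('zillow_house_values.csv', usecols=['RegionName', '2020-10-31'])
houses_df.columns = ['Zip', 'House Value']

# zip code geolocation data, separated with ';'
zip_df = pd.read_csv('us-zip-code-latitude-and-longitude.csv', sep=';')

# merge on the shared 'Zip' column
grouped_df = zip_df.merge(houses_df, on='Zip')

# keep only the rows that actually have a house value
grouped_df = grouped_df[~grouped_df['House Value'].isnull()]
print(grouped_df.shape)
```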
Cleaning the data
Perfect, we have a DF with our downloaded data and all zip codes have house values (if you skipped the import part: I removed the rows without house values).
First let's check if the number of states makes sense. By selecting the "State" column from our data frame and using the unique function we get a list of all the states in the data; then we simply use Python's "len" function to check how many states this list contains:
print('Number of states in DF: {}'.format(len(grouped_df['State'].unique())))
Running the code you will see the number 53. Huh?
A quick search on Wikipedia shows that the US has 5 other territories that are not part of the 50 states, plus DC which counts as another one.
Let's check if some of those territories are part of the data set; if removing them leaves us with the right number we're fine, otherwise we have a problem.
I created a list of those 5 territories and looped through it, each time checking if the data contains that territory; if it does, I printed how many rows it has for this state and deleted them.
import numpy as np

territories = ['AS', 'GU', 'MP', 'PR', 'VI']
for state in territories:
    rows = len(np.where(grouped_df['State'] == state)[0])
    print("Delete {} rows of state {}".format(rows, state))
    grouped_df = grouped_df[grouped_df['State'] != state]
You can skip the printing and looping and just remove those states in one go, for example by selecting everything except those territories:
grouped_df = grouped_df[~grouped_df['State'].isin(territories)]
As for DC, since it's part of the mainland I decided not to remove it, but to check that it exists, by simply calling this:
np.any(grouped_df['State'] == 'DC')
np.any returns True if any element matches the condition and False otherwise.
Removing PR and VI (the only two territories that were actually in the data) and keeping DC results in a data set with 51 states, 👌.
Exploring The Data
Before we retrieve the restaurant data, we should limit the number of locations we are going to work with; of course, you can work with all of them, but the Foursquare API has a limit on the number of daily requests and it would take you a lot of time.
By exploring the data we can find which states or other criteria we want to use to reduce our list of locations.
I started by looking at the housing prices, checking whether the data we have makes sense, and plotting a distribution of the prices per state to see how it looks.
Seaborn is a great package for visualization in Python; it has a distribution plot built in, and all we have to do is give this function our data.
import seaborn as sns
sns.distplot(state_df[‘House Value’])
This is all it takes to plot a distribution plot using Seaborn.
But this command will plot one graph with all prices combined, and my goal is to plot one for each state. To achieve this we will loop over a list of all the states we have, select the data just for that state, and plot it.
from matplotlib import pyplot as plt
import matplotlib as mpl
import math

fig = plt.figure(figsize=(30, 40))
states_array = grouped_df['State'].unique()
rows = math.ceil(len(states_array) / 2)
col = 2

# we can explore this by state to see if we find any differences
for i, state in enumerate(states_array):
    state_df = grouped_df[grouped_df['State'] == state]
    fig.add_subplot(rows, col, i + 1)
    ax = sns.distplot(state_df['House Value'])
    ax.set_title(state)
    ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(float(x), ',')))

plt.tight_layout()
What is going on:
- grouped_df['State'].unique() — returns a unique list of all the states.
- This command selects all the rows for a specific state:
state_df = grouped_df[grouped_df['State'] == state]
- This plots the data as a distribution plot and saves the axes object into a variable, ax:
ax = sns.distplot(state_df['House Value'])
I'm doing so for two reasons: first, to add the state title with ax.set_title(state), and second, to format how prices are displayed; I wanted to add a comma for thousands:
ax.get_xaxis().set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: format(float(x), ',')))
Plotting:
- from matplotlib import pyplot as plt — matplotlib is a Python plotting library (with a MATLAB-inspired interface); it's very useful and we use it here to customize the plot size.
- fig = plt.figure(figsize=(30, 40)) — by calling plt.figure we can set the size of our plotting canvas. I passed a height bigger than the width since we have a long list of plots.
- Plotting two per row requires us to calculate the number of rows; we could simply enter 26 (51 / 2, rounded up), or calculate it in the code as I did here: `rows = math.ceil(len(states_array) / 2)`
- For the columns I simply decided to use 2; more would be harder to read.
- In the loop, we call "add_subplot" before each new distribution plot so we can tell the figure where the plot should go, at which index: fig.add_subplot(rows, col, i+1). The "add_subplot" function takes the following params: (rows, columns, index). This way we get a two-column layout.
- plt.tight_layout() — simply adjusts the padding around the plot so it looks a little bit better.
Next, I'm going to plot a map with all our zip codes as another data validation; our zip code data should cover all of the US. Be aware we have 30K points, so the map can be laggy; you can try plotting just a few random states to check how it works.
Another validation you can do: google the number of zip codes in a state, then group the data frame by state and count how many zip codes we have for each one (I'll leave this for you to do).
Plotting a map:
There are a few different ways to plot a map in python, I’m going to use a package called ‘Folium’ but there are also some other packages you can use.
You may need to install the package before you are able to use it, with pip: `pip install folium`, or if you are working with conda: `conda install folium -c conda-forge --yes`
First, we need to create a map object with folium.Map(). By default this displays a global map without a specific location; passing the center and zoom params as well will make it look better:
folium.Map(location=[latitude, longitude], zoom_start=10, prefer_canvas=True)
You can choose whatever lat long coordinates you prefer, I simply took the geolocation of the first item in the list.
The third param, prefer_canvas, is a boolean that makes the base map render with an HTML5 canvas instead of SVG, which should make the map more responsive with a lot of markers.
To create a marker on the map simply call folium.CircleMarker with the params you want; the required one is the location as a [lat, lng] pair so the map knows where the marker should be.
The CircleMarker function can receive more params such as a popup label, fill color, radius, and more. Here is the full code of my marker:
folium.CircleMarker(
[lat, lng],
radius=5.0,
popup=label,
color=color,
fill=True,
fill_color=color,
fill_opacity=0.7,
parse_html=False).add_to(marker_cluster)
I passed color as a param, but this can be any hex color (e.g. #32a852 is green).
After creating a marker you must append it to a map or to a marker cluster for it to be displayed. The simplest option is to append it directly to the map, but I preferred to cluster them.
How to cluster? Start by importing the cluster plugin from the folium library:
from folium.plugins import MarkerCluster
Attach the cluster to the map and then append the markers to the cluster. The reason I chose to create a cluster is that it gives us a less messy map: the plugin groups the circles together by proximity, and as we zoom into an area it unpacks them into smaller batches until we are close enough to display the individual markers.
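Putting it together, here is a minimal sketch of a map with a cluster attached and circle markers added to it; the column names (Zip, House Value, Latitude, Longitude) and the color are placeholders for whatever your data frame actually uses:

```python
import folium
from folium.plugins import MarkerCluster

# center the map on the first location in the data frame
lat = grouped_df.iloc[0]['Latitude']
lng = grouped_df.iloc[0]['Longitude']
zip_map = folium.Map(location=[lat, lng], zoom_start=10, prefer_canvas=True)

# the cluster is added to the map, and every marker is added to the cluster
marker_cluster = MarkerCluster().add_to(zip_map)

for _, row in grouped_df.iterrows():
    label = '{}: {:,.0f}'.format(row['Zip'], row['House Value'])
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5.0,
        popup=label,
        color='#3186cc',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(marker_cluster)

zip_map  # in a notebook this line renders the map
```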
Here is an example of the map from an overview and when zoomed in (the image shows a smaller dataset of states):
Box plot — filtering our list of states:
A box plot is a useful statistical tool that describes how our numerical data looks. The plot is a box that displays the following information about each feature: the minimum value, the maximum value, the median, the first quartile (the value below which 25% of our house values fall), and the third quartile. Any point displayed above or below the whiskers is an outlier: a point that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile (1.5 is the default multiplier but can be changed).
I decided to use this plot to find a way to reduce the number of states we're going to research. Since the goal is to find distinct patterns between the states, I chose the states with the highest number of outliers. There are better ways to find those, for example actually calculating the number or percentage of outliers, but in this case I just looked at the plot. Some may argue that we should instead choose the states without outliers, or with few of them; I can agree with that, it can be done in the future, and maybe those states would give better results.
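A per-state box plot is a one-liner with Seaborn; here is a minimal sketch, assuming grouped_df still has the 'State' and 'House Value' columns:

```python
import seaborn as sns
from matplotlib import pyplot as plt

# one box per state; a wide figure keeps the state labels readable
plt.figure(figsize=(30, 8))
ax = sns.boxplot(x='State', y='House Value', data=grouped_df)
ax.set_title('House value distribution per state')
plt.tight_layout()
```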
Finally, I used this information to choose the following states: 'CA', 'FL', 'NY', 'MD', 'CO', 'NJ', 'MA', 'WA', 'IN', 'SC', 'GA', 'MS', 'WY', 'AZ', 'MO'.
I later removed WY since it had only 200 zip codes.
Calling the Foursquare API
After we have found which states we are going to work with, we can use the API to get a list of restaurants around each location. You can use the API to retrieve other kinds of venues as well. Be aware they have a daily limit for an account without a credit card; you can extend it without charge by attaching a credit card to the account.
Step 1:
Create a Foursquare developer account by registering here: https://developer.foursquare.com/developer/
Step 2:
Create a new app. This will require you to enter a server web address; you can enter any valid URL.
Step 3:
Save your credentials: the Client Id and Client Secret.
Now you are ready to call the API:
Using the requests package we can call a URL and parse the JSON response back into our code. All we have to do is create a GET request to the API with our credentials and parse the result.
I recommend that you first try with one location and check that everything works and parses correctly before starting to go over the full list of zip codes.
Building the request URL:
Here is the link for the docs: https://developer.foursquare.com/docs/api-reference/venues/search-enterprise/
For a basic request, you must pass the client id, client secret, version, and the geolocation (another option is a string with the name of the location). A basic request will return all types of venues around that point, but we can limit it to restaurants only by sending a category id.
I also decided to extend the default search radius to 1000 meters.
Here is the complete string format:
CLIENT_ID = '' # your FS ID here
CLIENT_SECRET = '' # your FS Secret here
VERSION = '20180605' # FS API version
LIMIT = 50 # A max FS API limit value
CATEGORYID = '4d4b7105d754a06374d81259'

url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&limit={}&radius={}&categoryId={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
lat,
lng,
LIMIT,
1000, # radius of 1000 m
CATEGORYID)
If our call is successful, the returned JSON will contain a response key, and inside it a 'venues' key with a list of all the venues around this location.
We can simply parse this information into an array which will later be converted into a Pandas data frame:
I just used this script to save the name of the venue, its category, and its location; the API returns more information, which you can also save in case you need it.
# make the GET request
response = requests.get(url).json()["response"]

# in case there are no venues, just continue with an empty array
if('venues' not in response):
    response = []
else:
    response = response['venues']

# formatting the result
# since it's a nested object we have to format it manually
arr = [(zipcode,
        v['name'],
        v['location']['distance'],
        v['location']['lat'],
        v['location']['lng'],
        v['categories'][0]['name'] if len(v['categories']) > 0 else np.nan) for v in response]
Using the above code we can loop through all our zip codes and collect the venue lists. Since this takes time and may fail midway, I created another stage: I looped over each state separately, saving the information into a CSV file per state; after retrieving all the requested information I loaded the files back and grouped them into a single data frame.
The complete code to get all the information can be found in this notebook
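As a rough sketch of that stage, assuming the filtered zip codes (only the chosen states) live in a data frame called states_df with 'State', 'Zip', 'Latitude' and 'Longitude' columns, and that get_venues is a hypothetical helper wrapping the request and parsing code from above; the output column names are placeholders too:

```python
import glob
import pandas as pd

columns = ['ZipCode', 'Restaurant Name', 'Distance', 'Lat', 'Lng', 'Restaurant Category']

for state in states_df['State'].unique():
    venues = []
    for _, row in states_df[states_df['State'] == state].iterrows():
        # get_venues returns the arr list from the snippet above for one zip code
        venues.extend(get_venues(row['Zip'], row['Latitude'], row['Longitude']))
    # one CSV per state, so a failed run doesn't lose everything
    pd.DataFrame(venues, columns=columns).to_csv('venues_{}.csv'.format(state), index=False)

# load all the per-state files back into a single data frame
aggregated_venues_df = pd.concat(
    [pd.read_csv(f) for f in glob.glob('venues_*.csv')], ignore_index=True)
```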
Cleaning the restaurants’ data
The first thing I did was check for duplicates within the same zip code. What are the chances of the same place appearing twice in the same zip code? If that happens, I prefer to assume it's a duplicate data point.
To check whether we have duplicated data by a few keys, we can call the "duplicated" function on our data frame and pass an array of columns by which to check for duplicates. We have to pass both the zip code and the restaurant name, otherwise it will flag restaurants as duplicates by name alone, even when the locations don't match.
duplicateRowsDF = aggregated_venues_df[aggregated_venues_df.duplicated(['ZipCode', 'Restaurant Name'])]
print(duplicateRowsDF.shape)
To drop the duplicates use "drop_duplicates" with a few keyword arguments. subset is the array of columns by which we check for duplicates (same as in duplicated). The function removes all duplicated values unless we specify which of them to keep by adding the "keep" param; I set its value to 'first' to keep the first duplicate record and remove the rest. You can also specify keeping the last, or deleting all of them by setting it to False. Set "inplace" to True if you want to drop the values in the existing data frame.
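In my case, the call looks roughly like this:

```python
# keep the first record of every (zip code, restaurant name) pair and drop the rest
aggregated_venues_df.drop_duplicates(
    subset=['ZipCode', 'Restaurant Name'], keep='first', inplace=True)
```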
Describe:
Call the describe method on our data frame to check how it looks; since it's mostly categorical data we will get the basic statistics, for example:
- The number of unique restaurant categories
- The number of restaurants in our data set
- The most frequent category / restaurant
Understanding the categories:
Our data set contains 380 different categories, which makes it hard to understand just by reading through, so let's start by plotting the distribution of the data across categories, in other words, how many venues we have in each category.
380 is still a lot, so first I decided to use a simple distribution plot to understand how the data looks and to filter out the small categories; in the next stage we will look at what remains.
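For example, a quick sketch of how many venues each category has, plotted with Seaborn again:

```python
# number of venues per category, plotted as a distribution
category_counts = aggregated_venues_df['Restaurant Category'].value_counts()
sns.distplot(category_counts)
```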
Small categories:
The method above shows that we have 380 different categories. When we analyze the data all of those will become features, and that's a huge number. So I decided to check for small categories, for example those with fewer than 10 places in them.
Why 10? I grouped the data by category and sized it, which gives the number of rows we have for each category (in other words, the number of restaurants in each category), and then called the describe method on the result. It shows that more than 50% of the categories have fewer than 10 venues.
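The grouping itself is short; the sized series here is what the removal code below relies on:

```python
# number of restaurants per category
sized = aggregated_venues_df.groupby('Restaurant Category').size()
sized.describe()
```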
So we simply select the categories with 10 or fewer venues and remove them from the data frame.
# Get the list of categories with 10 or fewer venues, and remove them
categories_to_remove = sized[sized <= 10].index.values

# remove all rows with those categories and save an aggregated df
aggregated_venues_df = aggregated_venues_df[~aggregated_venues_df['Restaurant Category'].isin(categories_to_remove)]
aggregated_venues_df.describe(include='all')
Eventually, we will have a data set with 158K restaurants across 185 different categories.
Categories distribution:
Sizing the data frame again, this time sorting the values from highest to lowest, I generated the top 20 categories by size.
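A short sketch of that step, reusing the sized series from before:

```python
# top 20 categories by number of venues
top_20 = sized.sort_values(ascending=False).head(20)
print(top_20)
```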
The results show two categories that are very general, "restaurant" and "food"; those don't help us segment by food type. To try and solve the problem, I decided to check whether some of those places exist in our data under a different category. The data is based on Foursquare users' uploads, so it makes sense that some users are more careful about selecting the right category while others just pick the first option.
To solve the issue, I wrote a function and called it with each generic category name. The function selects the rows that match the category we wish to replace and creates another DF with all the others.
The next step is to loop over all the rows we want to replace and check if the other data set has a restaurant with the same name; if it exists, we can just take the first match and replace the category value in the original data frame. The match for each row is done by checking two conditions: the zip code and the restaurant name. When searching a data frame in Pandas with multiple conditions, simply combine them in the following way: (condition 1) & (condition 2) …
def replace_category(df, cat):
    print('Number of places with category {} : {}'.format(cat, len(df[df['Restaurant Category'] == cat])))
    # df of items with the category we want to replace
    food_cat_df = df[df['Restaurant Category'] == cat]
    # df of items with a category different from the one we're replacing
    non_food_df = df[df['Restaurant Category'] != cat]
    # manually replace the generic category with the category of the same business name elsewhere
    for index, row in food_cat_df.iterrows():
        # look for the same restaurant name outside the generic category
        search_df = non_food_df[non_food_df['Restaurant Name'] == row['Restaurant Name']].head(1)
        if(len(search_df) > 0):
            # print(row['ZipCode'], row['Restaurant Name'])
            cond = (df.ZipCode == row['ZipCode']) & (df['Restaurant Name'] == row['Restaurant Name'])
            df.loc[cond, 'Restaurant Category'] = search_df['Restaurant Category'].values[0]
    print('Number of places with category {} after correction: {}'.format(cat, len(df[df['Restaurant Category'] == cat])))
    return df
Since this is a function and we are changing a DF inside it, the returned value must be assigned back to the original data frame, or to any other DF you wish to continue working with.
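For example (assuming the two generic categories show up as 'Food' and 'Restaurant' in your data; check the exact names in the top-20 list):

```python
# run the correction for each of the generic categories
for generic_category in ['Food', 'Restaurant']:
    aggregated_venues_df = replace_category(aggregated_venues_df, generic_category)
```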
# save the file to csv
aggregated_venues_df.to_csv('aggregated_venues.csv', index=False)
The last step is to save the clean file into a CSV so we have a safe copy and can start to analyze it. I’ll walk you through the analysis process in the next post.