Data Science project — The hidden taste of US zip codes — part 3

This is the last post in my 3-part series analyzing US zip codes by types of restaurants. In part 1 I described the project in general and presented the results, which found no way to segment locations by restaurant categories. In part 2 I showed how the data was collected and cleaned.

In this part, I’ll walk you through the clustering and regression techniques used during the project.

The full code is available in this repo.

Part 1, Part 2

Clustering

With cluster analysis, our goal is to find a set of distinct groups into which the data can be divided. In this research, a group might be, for example, zip codes dominated by Chinese, Italian, or fast-food restaurants. A successful clustering is one where we can clearly identify the differences between the clusters and assign a new data point to one of them.

To try and achieve this, we have to convert our categories into dummies, which basically means that each category becomes a column in our data frame, and each row gets true (=1) or false (=0) depending on whether it belongs to that category or not.

Let’s start by loading the restaurant data we downloaded in part 2 and applying pandas’ “get_dummies”. This command keeps the rows in the same order but returns a new data frame containing only the category columns, so we have to add the zip code back.
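For illustration, here is a minimal sketch of this step. The file name and column names (restaurants.csv, category, zipcode) are assumptions, not necessarily the exact ones from the repo.

```python
import pandas as pd

# Hypothetical file and column names; adjust to match the data from part 2.
restaurants = pd.read_csv('restaurants.csv')

# One-hot encode the category column: each category becomes a 0/1 column.
dummies = pd.get_dummies(restaurants['category'])

# get_dummies keeps the row order, so we can attach the zip code back.
dummies['zipcode'] = restaurants['zipcode']
```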

Dummies head example

Most frequent places

We have 184 different categories, so it’s hard to understand what our data really looks like. One thing we can do (an idea used during the Coursera course) is group the data by zip code and then sort each row by the frequency of the places. This way we can find the most frequent place types in each location. For example, if a zip code has 3 restaurants, 2 fast food and one coffee shop, grouping them and taking the mean of the row gives fast food a score of 0.66 and the coffee shop a score of 0.33, which makes fast food the more frequent category.

How to implement:

First group the data by zip codes and take the mean:
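A minimal sketch of this step, continuing the dummies frame from above:

```python
# Relative frequency of each category per zip code.
grouped = dummies.groupby('zipcode').mean().reset_index()
```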

Then we have to loop through the grouped data frame and sort each row by frequency. I chose to keep the top 15 categories.

There are a few different ways to loop through a DataFrame, some better than others. Here I decided to work with iterrows, which should be faster than a regular index-based loop. This returns, for each row, a sorted series of categories, which can then be converted back into a DataFrame.

Note: the new frame will have no column headers. You can simply attach an array of strings as the column names, or dynamically generate a list of columns based on the number of categories you want to keep.

Here is an example of dynamically generated columns, based on code from the course:
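Below is a hedged sketch of the whole step in the spirit of the course code; the helper name return_most_common_venues and the top_venues frame are my own names.

```python
num_top = 15
indicators = ['st', 'nd', 'rd']

# Dynamically build the column names: "1st Most Common Venue", "2nd ...", etc.
columns = ['zipcode']
for ind in range(num_top):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))


def return_most_common_venues(row, num_top_venues):
    # Skip the zipcode column, sort the categories by frequency, keep the top ones.
    row_categories = row.iloc[1:]
    return row_categories.sort_values(ascending=False).index.values[:num_top_venues]


# Loop with iterrows over the grouped frame and build the sorted table.
top_venues = pd.DataFrame(columns=columns)
top_venues['zipcode'] = grouped['zipcode']
for i, row in grouped.iterrows():
    top_venues.iloc[i, 1:] = return_most_common_venues(row, num_top)
```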

The code above generates the table of most common places. I later reuse this table to analyze the generated clusters: each time new clusters are created, I call it again to understand how they are spread, which is why it’s wrapped as a function.

We can now use this data to see the most frequent category in each location, or plot a chart of the most frequent places.

Here is the code I used to plot the top places in each frequency rank, for example, which categories most often appear in the first position:
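Something along these lines; the column names follow the sketch above and the plot details are placeholders:

```python
import matplotlib.pyplot as plt

# How often each category shows up as the most frequent place in a zip code.
col = '1st Most Common Venue'
counts = top_venues[col].value_counts().head(10)

counts.plot(kind='bar', figsize=(10, 5), title='Top categories in the ' + col)
plt.ylabel('Number of zip codes')
plt.tight_layout()
plt.show()
```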

*In the full code I also plotted this by state to get another point of view.

Comparing clustering algorithms

The goal of this research is to find a common pattern between different zip codes and check whether or not we can segment them by types of restaurants. There are many clustering techniques; here I compare k-means, one of the simplest and best known, with DBSCAN, which does not require us to set the number of clusters in advance.

If you want to explore how either algorithm clusters the data with different coefficients, you can simply run it with arbitrary values. A better way is to run it over a range of parameters, based on each algorithm’s inputs, and score the results. Using the scores we can then decide which algorithm gives better results and whether they actually improve.

A good way to score a clustering result is the `silhouette score`, which measures how well separated the clusters are. The score ranges from -1 to 1: a score near -1 means something is wrong (points were likely assigned to the wrong clusters), a score around 0 means the clusters overlap or sit very close to each other, and a score close to 1 means the clusters are well separated.
Here is a great post that explains this scoring: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c

Now that we understand what we are going to do, let’s implement it, starting with k-means.

I decided to write each algorithm as a function that receives a range of values to iterate over and returns the best result; using functions keeps the rest of the workflow consistent across feature sets.

Note: the full code also includes a plot of the scores and the base algorithm’s fitting score.

Sklearn has a K-means implementation, which we will use. It receives the number of clusters to fit the data into, plus optional parameters like n_init and random_state, which I set to constant values after experimenting with them manually; you could implement a function to iterate over them as well.

In each iteration of our loop we set the number of clusters and fit the data with the algorithm. After the fitting is complete we calculate the silhouette score. This score can only be calculated if there are at least two clusters, so to prevent errors I compute it only when the number of unique cluster labels is at least two; otherwise I set the score to -2, a value the metric can never produce.

After the loop finishes running, I find the number of clusters corresponding to the best score. The function prints the score and the number of clusters for debugging and returns k, so we can later use it to check how the data is clustered.
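Here is a rough sketch of how such a function can look. The function name, the k range, and the constant n_init / random_state values are placeholders, not the exact values from the repo.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def find_best_kmeans(features, k_range=range(2, 16)):
    """Fit k-means for every k in the range and return the k with the best silhouette score."""
    scores = {}
    for k in k_range:
        kmeans = KMeans(n_clusters=k, n_init=12, random_state=0)
        labels = kmeans.fit_predict(features)

        # The silhouette score is only defined for at least two clusters;
        # otherwise store -2, a value the metric can never produce.
        scores[k] = silhouette_score(features, labels) if len(set(labels)) >= 2 else -2

    best_k = max(scores, key=scores.get)
    print('Best k: {}, silhouette score: {:.3f}'.format(best_k, scores[best_k]))
    return best_k
```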

Implementing the same idea for DBSCAN: in this case we don’t need to set the number of clusters, but we still have a parameter to tune, the maximum distance between two points for them to be considered neighbors. If the distance is larger, they won’t be placed in the same cluster.

This is very similar to the k-means implementation; the only difference is that I also added a steps parameter to the function, because the step size of the range can change as well. For k-means the step is always 1 (or any integer; I assumed values other than 1 won’t be used), while here it can be any positive number.

Setting min_samples is optional; I decided to set it to 9 after manually comparing a few different results. It could also be tuned with a loop.
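A similar sketch for DBSCAN; the eps range, step size, and min_samples=9 are the kind of values discussed above, but treat them as placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def find_best_dbscan(features, eps_start=0.1, eps_stop=2.0, step=0.1, min_samples=9):
    """Try a range of eps values and return the one with the best silhouette score."""
    scores = {}
    for eps in np.arange(eps_start, eps_stop, step):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)

        # DBSCAN may return a single cluster (or only noise), in which case
        # the silhouette score is undefined.
        scores[eps] = silhouette_score(features, labels) if len(set(labels)) >= 2 else -2

    best_eps = max(scores, key=scores.get)
    print('Best eps: {:.2f}, silhouette score: {:.3f}'.format(best_eps, scores[best_eps]))
    return best_eps
```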

To ease the rest of the implementation and the comparison of results, I implemented four more functions. The exact number doesn’t matter; the point is to split the logic into smaller functions.

So what do they actually do?

Run benchmark function:

The first function gets the data for testing and the test name (used to keep track), then calls each of the previous functions to find the best scores for DBSCAN and k-means. After receiving the results, it calls another function to analyze the result for this data set. Eventually it returns the DataFrame with the corresponding cluster labels and an array of results; the array will be used later to display a DataFrame summarizing all the different runs.
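Roughly, the benchmark function could look like this (retrieve_results is the helper sketched in the next section; all names here are mine):

```python
def run_benchmark(features, test_name):
    """Find the best parameters for both algorithms on one feature set."""
    print('--- ' + test_name + ' ---')
    best_k = find_best_kmeans(features)
    best_eps = find_best_dbscan(features)

    # Refit with the best parameters and attach the cluster labels.
    labeled = retrieve_results(features, best_k, best_eps)

    results = [test_name, best_k, best_eps]
    return labeled, results
```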

Retrieve result function:

This function gets our dataset and the results of the scoring functions, the best number of clusters and the best max distance. It runs each clustering algorithm again with those parameters and adds a cluster column for each algorithm to our feature set. Using the new columns we create a new data frame with the top places in each cluster, so we can see how the clusters differ.
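A minimal sketch, assuming the cluster columns are simply appended to a copy of the feature set:

```python
def retrieve_results(features, best_k, best_eps, min_samples=9):
    """Refit both algorithms with their best parameters and attach the cluster labels."""
    labeled = features.copy()
    labeled['kmeans_cluster'] = KMeans(n_clusters=best_k, n_init=12,
                                       random_state=0).fit_predict(features)
    labeled['dbscan_cluster'] = DBSCAN(eps=best_eps,
                                       min_samples=min_samples).fit_predict(features)
    return labeled
```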

Print DBSCAN stats function:

This function could easily be folded into the previous one, but I defined it separately for my own convenience. It takes the dataset and the result of the DBSCAN algorithm, then prints and returns the score and the number of clusters the algorithm managed to build.

Cluster Data Function:

This function helps us understand what the clusters look like, or rather how they differ. It gets the dataset and the cluster column we are comparing (the benchmark function runs both algorithms and creates a new column for each, which is why we specify which column to use here). It groups the data by cluster and by the most common venue (the most common category), uses the size function to get the size of each group, sorts and selects the top 3, and finally returns the DataFrame.
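A possible sketch of this helper, assuming the cluster labels have been joined with the top-venues table from earlier (so the '1st Most Common Venue' column is available):

```python
def cluster_top_venues(labeled_venues, cluster_col, top_n=3):
    """For each cluster, return the 3 most common '1st Most Common Venue' values."""
    sizes = (labeled_venues
             .groupby([cluster_col, '1st Most Common Venue'])
             .size()
             .reset_index(name='count'))

    # Within every cluster keep only the largest groups.
    return (sizes.sort_values([cluster_col, 'count'], ascending=[True, False])
                 .groupby(cluster_col)
                 .head(top_n))
```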

Defining the feature sets

After implementing all the helper functions we can define the different feature sets we are going to use. Our benchmark function returns the data frame with all the results; all that is left to do is group them together into a summary DataFrame.

To run the algorithms we need to remove the zip codes from the data, leaving the dataset with just the dummies. But since we are interested in the data per zip code, we group by zip code before removing the column.
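A minimal sketch of the base feature set:

```python
# One row per zip code, with only the category dummies left for clustering.
base_features = dummies.groupby('zipcode').mean().reset_index(drop=True)
```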

After we have the base set of data we can create a few other sets.

Starting with the PCA algorithm to reduce the data’s dimensionality: using PCA we can find which features best describe our data. Running it over the data we have, it turns out that 44 features out of the total 183 explain more than 95% of the variance. Here is the PCA test code:
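A hedged version of that test; the threshold handling and plotting details are mine:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components and look at the cumulative explained variance.
pca = PCA()
pca.fit(base_features)
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100

plt.plot(cum_var)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance (%)')
plt.show()

# Number of components needed to explain at least 95% of the variance (~44 here).
n_components = int(np.argmax(cum_var >= 95)) + 1
print(n_components)
```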

The first few lines produce the variance plot; the last line before printing selects the number of components whose cumulative explained variance passes 95%, which gives the 44 PCA features.

This will be the second set of features. But since we are trying to find a hidden pattern, maybe using the other features instead, the ones that don’t describe most of the data, will be more helpful in finding a segmentation.

After finding the number of features we want to use, let’s create a DataFrame with those features by fitting the original data with PCA using 44 components.

Using this information we can create another DF with all the features that are not in this set.
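Here is one possible reading of how the two PCA-based sets could be built. The “non-PCA” selection below, which drops the original column with the largest loading in each of the 44 components, is my own assumption about what “features that are not in this set” means; the repo may do this differently.

```python
# Second set: the data projected onto the first 44 principal components.
pca_44 = PCA(n_components=44)
pca_features = pd.DataFrame(pca_44.fit_transform(base_features))

# Third set (assumption): for each component take the original column with the
# largest absolute loading, then keep every column that is NOT one of those.
top_loading_cols = {base_features.columns[int(np.abs(comp).argmax())]
                    for comp in pca_44.components_}
non_pca_features = base_features.drop(columns=list(top_loading_cols))
```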

Here is an example of how to run one of the sets through our functions. I save the result of the function into two variables, one for the DataFrame and a second for the array of results; in each call I give the variables different names so I can aggregate them at the end.
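For example (the variable names are mine):

```python
base_df, base_results = run_benchmark(base_features, 'base set')
pca_df, pca_results = run_benchmark(pca_features, 'pca 44 features')
non_pca_df, non_pca_results = run_benchmark(non_pca_features, 'non pca features')
```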

By printing the result DataFrame we can see the difference between the clusters. It’s OK if the non-PCA selection includes clusters identified by features that are in the PCA set: the algorithm builds the clusters from the other features, but we use the full feature list to label and identify the clusters.

And this is how the clusters in this case will be identified:

I have run a few more tests with the following sets of features:

  1. A set without the 3 most common places: Fast Food, Pizza, and American Restaurant.
  2. A set grouped by state.

Eventually I summarize the results into a DataFrame, which displays them in a more ordered form:

As can be seen in the results summary, there is no big difference between the data sets: the non-PCA set has the best result for DBSCAN, and k-means performs best with the PCA set.

But when looking at how the clusters are built, we can’t really differentiate between them in a way that tells us anything about differences in restaurant preferences between zip codes.

Regression / Correlation

One of my other ideas was to check whether there is any connection between the price of a house and the restaurants around it, either the number of restaurants or their type.

To check this I decided to run a few different regressions and check how they score and what coefficients are assigned to our features (the features are the restaurant categories).

I defined a few helper functions:

Split x_y:

This function receives our feature set and the dependent variable, then splits them into train and test sets using Sklearn’s “train_test_split”.

We can also fit and transform the x values.

get_cv_score:

A function to calculate the cross-validation score of our model. This scoring technique trains and scores the model on a different split of the data in each iteration; we then take the mean as our cross-validation score.

scoring:

We can also check a basic hold-out score just in case, either mean squared error or R²; I implemented both in this function.
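A hedged sketch of these three helpers. The scaler, test size, number of folds, and scoring metric are assumptions, not necessarily what the repo uses:

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


def split_x_y(features, target):
    """Split into train/test sets and scale the features."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=0)
    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)
    return x_train, x_test, y_train, y_test


def get_cv_score(model, x, y):
    """Mean cross-validation score (negative mean squared error here)."""
    return cross_val_score(model, x, y, cv=5,
                           scoring='neg_mean_squared_error').mean()


def scoring(model, x_test, y_test):
    """Basic hold-out scores: mean squared error and R²."""
    predictions = model.predict(x_test)
    return mean_squared_error(y_test, predictions), r2_score(y_test, predictions)
```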

With these basic functions in place we can implement each of the regressions and save the scoring results. I also save a data frame with the coefficients so we can compare them.

An example of the linear regression:
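Something like the following; house_values stands for the house value per zip code and is a placeholder name:

```python
from sklearn.linear_model import LinearRegression

x_train, x_test, y_train, y_test = split_x_y(base_features, house_values)

lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

print('CV score:', get_cv_score(lin_reg, x_train, y_train))
print('MSE, R2:', scoring(lin_reg, x_test, y_test))

# Keep the coefficients in a DataFrame so the models can be compared later.
lin_coefs = pd.DataFrame({'category': base_features.columns,
                          'coefficient': lin_reg.coef_})
```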

If we print the DF sorted ascending or descending, we can see which coefficients may have anything to do with the house price.

Linear Regression coefficients, left — top ten, right — bottom ten

As can be seen in the screenshot above, places like a coffee shop or a juice bar seem to have a positive correlation. (I divided house prices by 1,000, so multiply back by 1,000 to get the real numbers.)

Now that we’ve seen how one regression model works and how the results are collected, we can use the same idea to test a few different regression models and compare them.

I decided to test the question with the following regression models:

  1. Polynomial — very similar to simple linear regression, but here we first raise the features to a polynomial order. I tested with a 3rd-order polynomial (on selected features — see below).
  2. Ridge — adds a penalty term to the loss function while trying to minimize it. The penalty is a squared term multiplied by a value called “lambda”; changing this value affects how much noise the model captures.
  3. Lasso — very similar to ridge, but with a different penalty. This algorithm can be useful for feature selection, since it tends to shrink the coefficients of unimportant features to zero.
  4. Gradient Boosting — boosting is a machine learning technique that can be used for both classification and regression. It builds an ensemble that minimizes a loss function step by step, so even features that seem weak can become significant if they help minimize the loss.

The implementation of all the models is in the git repo. Below are a few notes about feature selection and grid search.

Our model is huge, with a lot of features, but maybe we don’t need all of them and can reduce their number. This can help the linear regression by keeping only the important features, and it saves a lot of running time for the polynomial regression as well. I performed feature selection using two different algorithms, mutual_info_regression and f_regression, both from the Sklearn package. Running each separately, I saw that most of the selected features were the same with only small differences, so I decided to work with the combined result.

Grid Search — the other three models were trained using grid search. It’s basically a function that receives an Sklearn model and a list of parameters you want to test, each with an array of possible values. The function tests the model with all of the parameter combinations you provide and returns the best one based on the score each fit received.
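For example, with the ridge model (the parameter grid values below are placeholders):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(x_train, y_train)

print(grid.best_params_, grid.best_score_)
ridge_model = grid.best_estimator_
```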

I hope you managed to keep track and train your models. Now we can combine the results and check how the models performed.

Gradient boosting had the best result of all the algorithms, but an R² of just 0.28 is still not significant enough, and its cross-validation score of -143,274.92 is only 32,814 better than the Lasso model, which ended with the worst result.

Regression model results summary

Another thing I did was check the coefficient scores produced by each model. If you remember, our scoring function returned the score and a DF of the coefficients. Let’s combine them and sort by the top results; before sorting I take the absolute value so we can check the bottom of the list as well.

This is how I sorted them (if you have a better solution I’ll be glad to hear it):

I created a simple sorting function that performs the following steps (see the sketch after the list):

  1. Add a sign column to the DF based on each row’s coefficient
  2. Convert the coefficients to their absolute values
  3. Sort the DF by the coefficient in descending order
  4. Multiply the coefficient back by its sign for each row
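A minimal sketch of that function, assuming the coefficient column is called 'coefficient':

```python
import numpy as np


def sort_by_abs_coefficient(coef_df, col='coefficient'):
    """Sort a coefficient DataFrame by absolute value while keeping the sign."""
    out = coef_df.copy()
    out['sign'] = np.sign(out[col])                 # 1. remember the sign
    out[col] = out[col].abs()                       # 2. absolute value
    out = out.sort_values(col, ascending=False)     # 3. sort descending
    out[col] = out[col] * out['sign']               # 4. restore the sign
    return out.drop('sign', axis=1)
```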

And then I simply applied this function to all of the coefficient DFs from the different models we tested:

As I noted in the first post, a quick look at the coefficients shows that some of them may actually have an impact on the price of a house, like having a coffee shop or a juice bar around. But since the regression scores are low, I won’t treat this as a significant impact.

This table was edited in Excel, based on the coefficients summary DataFrame:

Coefficients result summary

Does the number of restaurants around have a correlation with the house value?

This was another question I wanted to test, and it’s a simple one since we already have all the data: just sum up the number of restaurants in each zip code and run a simple regression between this total and the house values.

In the notebook this appears near the beginning, but these are the steps, with a sketch after the list:

  1. From our dummies data (the restaurant categories), group by zip code and sum — we want the total number of places in each zip code.
  2. Drop the zip code column (and state), since the zip code is a numeric value, and sum the DF across rows, i.e. sum with axis=1.
  3. From the zip codes data, select the zip code, state, and house value.
  4. Merge them together and drop any NA values, just in case.
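In pandas, those four steps could look roughly like this (the zipcodes frame and its column names are assumptions):

```python
# 1. Total number of places per category in each zip code.
totals = dummies.groupby('zipcode').sum().reset_index()

# 2. Drop the zip code column (numeric, would contaminate the sum) and sum across rows.
totals['total_restaurants'] = totals.drop('zipcode', axis=1).sum(axis=1)

# 3-4. Select zip code, state, and house value, merge, and drop missing rows.
merged = totals[['zipcode', 'total_restaurants']].merge(
    zipcodes[['zipcode', 'state', 'house_value']], on='zipcode').dropna()
```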

After having the total number let’s perform a simple linear regression:

  • We reshape the data because the linear regression helper we wrote passes the data through the transform function first. That function requires a two-dimensional array, and this is the workaround (sketched below).
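Putting it together (again assuming the column names from the sketch above):

```python
# The scaler inside split_x_y expects a 2-D array, hence the reshape.
x = merged['total_restaurants'].values.reshape(-1, 1)
y = merged['house_value']

x_train, x_test, y_train, y_test = split_x_y(x, y)

model = LinearRegression().fit(x_train, y_train)
print('R2 on the test set:', model.score(x_test, y_test))
```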

And here is the result, nothing significant was found:

Thanks for reading and keeping up with the long posts.

This is my first experience with blogging and with a data science project.

If you have any suggestions or improvements, feel free to comment or reach out; I’m always happy to learn.
