# D.S project — The hidden taste of US zip codes — part 3

This is the last of my 3 posts analyzing US zip codes by types of restaurants. In part 1 I described the project in general and presented the results, which found no clear way to segment locations by restaurant category. In part 2 I showed how the data was collected and cleaned.

In this part, I’ll walk you through the clustering and regression techniques that were used during the project.

The full code is available in this repo.

## Clustering

Using cluster analysis, our goal is to find a number of distinct groups that our data can be divided into. In the case of this research, a group could be, for example, Chinese, Italian, or fast food restaurants. A successful clustering is one where we can clearly identify the differences between the groups and assign a new data point to one of the identified clusters.

To try and achieve this we have to convert our categories into dummies, which basically means that each category becomes a column in our data frame, and each row gets true (=1) or false (=0) depending on whether it belongs to that category.

Let’s start by importing the restaurant data we downloaded in part 2 and applying the “get_dummies” command. This command keeps the rows in the same order but returns a new data frame that contains only the categories, so we have to add the zip code back.

```python
import pandas as pd

# import CSV, creating a data frame with the grouped venues
aggregated_venues_df = pd.read_csv('aggregated_venues.csv')

# one hot encoding
dummies = pd.get_dummies(aggregated_venues_df[['Restaurant Category']], prefix="", prefix_sep="")

# add the zip code column back to the dataframe
dummies['ZipCode'] = aggregated_venues_df['ZipCode']

# move the ZipCode column to the first position
fixed_columns = [dummies.columns[-1]] + dummies.columns[:-1].to_list()
dummies = dummies[fixed_columns]

print(dummies.shape)
dummies.head()
```

## Most frequent places

We have 184 different categories, so it’s hard to understand what our data really looks like. One thing we can do (an idea used during the Coursera course) is to group our data by zip code and then sort each row by the frequency of the places. This way, we can find the most frequent places in each location. For example, if a zip code has 3 restaurants, 2 fast food and one coffee shop, grouping them and taking the mean of the row gives fast food a score of 0.66 and the coffee shop a score of 0.33, which makes fast food the more frequent category.
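Here is a tiny made-up example of that frequency calculation, using one hypothetical zip code with three venues:

```python
import pandas as pd

# hypothetical zip code with 3 venues: 2 fast food, 1 coffee shop
toy = pd.DataFrame({
    'ZipCode': ['10001', '10001', '10001'],
    'Fast Food': [1, 1, 0],
    'Coffee Shop': [0, 0, 1],
})

# grouping by zip code and taking the mean turns the 0/1 dummies into frequencies
freqs = toy.groupby('ZipCode').mean()
print(freqs.round(2))  # Fast Food -> 0.67, Coffee Shop -> 0.33
```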

How to implement:

First group the data by zip codes and take the mean:

`res_grouped = dummies.groupby('ZipCode').mean().reset_index()`

Then we have to loop through the grouped data frame and sort each row by frequency. I chose to select the top 15 categories.

There are a few different ways to loop through a DataFrame, some better than others; here I decided to work with iterrows, which should be faster than a plain loop. This returns a series of the sorted data, which can then be converted back into a DataFrame.

```python
res = pd.Series(
    [row[0], *row[1:].sort_values(ascending=False).index.values[0:number_of_top_reseturants]]
    for index, row in res_grouped.iterrows()
)

# series to data-frame
neighborhoods_venues_sorted = pd.DataFrame.from_records(res)
```

*Note: the new frame will have no column headers, you can simply attach an array of strings as the names of the columns, or dynamically generate a list of columns based on the number of categories you want to select.*

Here is an example of dynamically generated columns based on a code from the course:

```python
# create columns according to number of top venues
columns = ['ZipCode']
indicators = ['st', 'nd', 'rd']

for ind in np.arange(number_of_top_reseturants):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))
```

This is the code for generating the table of common places. I later used this table to analyze the generated clusters, and since I reuse it every time new clusters are generated to understand how they are spread, it is wrapped in a method.

We can now use this data to see the most frequent category in each location, or plot a chart of the most frequent places.

Here is the code I used to plot the top places at each rank, for example, which places most often appear in the first position:

```python
# plot the 3 most common places
def plot_most_freq(cat):
    fig = plt.figure(figsize=(16, 7))
    plt.title(cat)
    neighborhoods_venues_sorted[['ZipCode', cat]].groupby(cat).size().sort_values(ascending=False).head(20).plot.bar()
    plt.tight_layout()
    # save the figure
    fig.savefig(cat + '.png', bbox_inches='tight')

plot_most_freq('1st Most Common Venue')
plot_most_freq('2nd Most Common Venue')
plot_most_freq('3rd Most Common Venue')
```

*In the full code I also plotted this by state to get another point of view.*

## Clustering

The goal of this research is to find a common pattern between different zip codes and check whether we can or can’t segment them by types of restaurants. There are many different clustering techniques. Here I’ll compare k-means, one of the simplest and best-known algorithms, with DBSCAN, which doesn’t require us to set the number of clusters in advance.

If you want to explore how either algorithm clusters the data with different coefficients, you can simply run it with random values. A better way, though, is to run it over a range of parameters based on each algorithm’s inputs and score the results. Using the scores, we can decide which algorithm performs better and whether the results actually improve.

A good way to score a clustering result is the `silhouette score`, which is based on the distances within and between clusters and tells us how well separated they are. The score ranges from -1 to 1: a score near -1 means something is wrong (points likely sit in the wrong clusters), a score around 0 means the clusters are not on top of each other but pretty close, and a score close to 1 means they are far apart. *Here is a great post that explains this scoring: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c*
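To get a feel for the score, here is a small sketch on synthetic data (the blobs and labels are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)

# two well-separated blobs with correct labels -> score close to 1
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])
labels = np.array([0] * 50 + [1] * 50)
good = silhouette_score(X, labels)

# the same points with shuffled labels -> score near 0
bad = silhouette_score(X, rng.permutation(labels))
print(round(good, 3), round(bad, 3))
```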

Now that we understand what we are going to do, let’s implement it starting with the K-means.

I decided to write each one as a function that gets a range of values to iterate over and returns the best result; using functions keeps the process consistent between feature sets.

*Note: The full code also has a plotting of the score and the base algorithm fitting score.*

Sklearn has an implementation of K-means which we will be using. It receives the number of clusters to divide the data into, plus optional params like n_init and random_state, which I decided to set to constant values after manually playing with them, but you could implement a function to iterate over them as well.

In each iteration of the loop, we set the number of clusters and fit the data with the algorithm. After the fitting is complete, we calculate the silhouette score. This score can only be calculated with at least two clusters, so to prevent errors I run it only if the number of unique labels is higher than two; otherwise I set the score to -2, a value the real score can never reach.

After the loop finishes running, I find the number of clusters corresponding to the best score we calculated. The function prints the score and the number of clusters for debugging and returns K so we can later use it to check how the data is clustered.

```python
# df - values to run the algorithm on
# s_value - the min value the loop will start from
# e_value - the maximum value to go to
def test_kmeans(df, s_value, e_value):
    scores = []
    silhouette_coefficients_means = []
    x_range = range(s_value, e_value)

    for k in x_range:
        kmeans = KMeans(n_clusters=k, random_state=40, n_init=12)
        kmeans.fit(df)
        scores.append(kmeans.inertia_)
        if len(np.unique(kmeans.labels_)) > 2 and len(np.unique(kmeans.labels_)) < len(df):
            silhouette_coefficients_means.append(silhouette_score(df, kmeans.labels_))
        else:
            silhouette_coefficients_means.append(-2)

    # find the best number of clusters
    loc = np.where(silhouette_coefficients_means == np.max(silhouette_coefficients_means))[0][0]
    print('Best Kmeans coeff {} at K= {}'.format(np.max(silhouette_coefficients_means), x_range[loc]))
    return x_range[loc]
```

Implementing the same idea for DBSCAN: in this case we don’t need to set the number of clusters, but we still have a parameter to tune, eps, the maximum distance between data points in the same cluster; if the distance is higher, they won’t be placed in the same cluster.

This is very similar to the K-means implementation; the only difference is that I also added a param named steps, because the step size of the range can vary as well. For K-means the step is always 1 (or any integer; I assumed values other than 1 won’t be used), whereas here it can be any positive number.

Setting min_samples is not required; I decided to set it to 9 after manually comparing different results. It could also be tuned with a loop.
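If you do want to tune min_samples with a loop, here is a sketch on toy data (the blobs, the fixed eps, and the range are my own choices; the guard is the same one used in the eps scan):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# toy data standing in for the restaurant feature set
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.0, random_state=42)

best = (-2, None)
for m in range(3, 15):
    labels = DBSCAN(eps=1.5, min_samples=m).fit(X).labels_
    n = len(np.unique(labels))
    if 2 < n < len(X):  # silhouette needs a sane number of labels
        score = silhouette_score(X, labels)
        if score > best[0]:
            best = (score, m)

print('best min_samples:', best[1], 'score:', round(best[0], 3))
```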

```python
# X - values to run the algorithm on
# s_value - the min value the loop will start from
# e_value - the maximum value to go to
# steps - optional step size of the range
def test_dbsacn(X, s_value, e_value, steps=0.1):
    x_range = np.arange(s_value, e_value, steps)
    silhouette_coefficients = []

    for e in x_range:
        clustering = DBSCAN(eps=e, min_samples=9).fit(X)
        if len(np.unique(clustering.labels_)) > 2 and len(np.unique(clustering.labels_)) < len(X):
            silhouette_coefficients.append(silhouette_score(X, clustering.labels_))
        else:
            silhouette_coefficients.append(-2)

    loc = np.where(silhouette_coefficients == np.max(silhouette_coefficients))[0][0]
    print('Best DBSCAN coeff {} at eps= {}'.format(np.max(silhouette_coefficients), x_range[loc]))
    return x_range[loc]
```

To ease the rest of the implementation and the comparison of results, I implemented four more functions; the number of functions doesn’t matter, the point is to split the logic into smaller pieces.

## So what do they actually do?

**Run benchmark function:**

The first function gets the data for testing and the test name (used to keep track), then calls each of the previous functions to find the best scores for DBSCAN and K-means. After receiving the results, it calls another function to analyze them for this data set, eventually returning the DataFrame with the corresponding clusters and an array of results; the array will be used later to display a DataFrame with all the different results.

```python
def run_benchmark(df, test_name):
    # test the algorithms on the given data
    best_eps = test_dbsacn(df, 0.1, 1, steps=0.1, plot_name='{}_dbsacn'.format(test_name))

    # run the kmeans test
    best_k = test_kmeans(df, 2, np.min([20, len(df)]), '{}_kmeans'.format(test_name))
    print('test printing: eps={}, k={}'.format(best_eps, best_k))

    # retrieve the result of the best estimators
    res, d_clusters_, d_noise_, d_score_, k_score_ = retrieve_result(df, best_k, best_eps)
    return res, [test_name, best_eps, d_clusters_, d_noise_, d_score_, best_k, k_score_]
```

**Retrieve result function:**

This function gets our dataset and the results of the scoring functions: the best number of clusters and the best max distance. It runs each clustering algorithm again with those values and adds a cluster column for each algorithm to our features data set. Using the new columns, we create a new data frame with the top places in each cluster, which lets us see how the clusters differ.

```python
def retrieve_result(df, best_k, best_eps):
    ## cluster the data using both algorithms
    neighborhoods_venues_sorted = generate_ven_sorted()

    ## clustering using Kmeans, with the best k found by the scan
    kmeans = KMeans(n_clusters=best_k, random_state=40, n_init=12).fit(df)
    k_score_ = -2
    # find the silhouette score for kmeans
    if len(np.unique(kmeans.labels_)) > 2 and len(np.unique(kmeans.labels_)) < len(df):
        k_score_ = silhouette_score(df, kmeans.labels_)

    ## cluster using DBSCAN, with the best eps found by the scan
    dbscan = DBSCAN(eps=best_eps, min_samples=9).fit(df)
    # print and collect the stats for dbscan
    d_clusters_, d_noise_, d_score_ = print_dbscan_state(dbscan, df)

    # add clustering labels to our dataframe
    neighborhoods_venues_sorted.insert(0, 'Kmeasn clusters', kmeans.labels_)
    neighborhoods_venues_sorted.insert(0, 'DBSCAN clusters', dbscan.labels_)

    # merge clustered results with the original data to add zip codes and locations
    df_merged_pca = aggregated_venues_df.join(neighborhoods_venues_sorted.set_index('ZipCode'), on='ZipCode')

    ## aggregate results into one DF
    res_df = pd.DataFrame()

    ## check clusters data for kmeans
    temp = clusters_data('Kmeasn clusters', df_merged_pca)
    temp.insert(0, 'algo', ['kmeans' for _ in range(len(temp))])
    res_df = pd.concat([res_df, temp])

    ## check clusters data for dbscan
    temp = clusters_data('DBSCAN clusters', df_merged_pca)
    temp.insert(0, 'algo', ['dbscan' for _ in range(len(temp))])
    res_df = pd.concat([res_df, temp])

    return res_df, d_clusters_, d_noise_, d_score_, k_score_
```

**Print DBscan stats function:**

This function could easily be folded into the previous one, but I defined it separately for my own convenience. It takes our dataset and the result of the DBSCAN algorithm, then prints and returns the score and the number of clusters the algorithm managed to build.

```python
def print_dbscan_state(db_cluster, df):
    # number of clusters in labels, ignoring noise if present
    labels = db_cluster.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)

    score_ = -2
    if len(np.unique(labels)) > 2 and len(np.unique(labels)) < len(df):
        score_ = silhouette_score(df, labels)

    print('Estimated number of clusters dbs: %d' % n_clusters_)
    print('Estimated number of noise points dbs: %d' % n_noise_)
    if score_ > -2:
        print("Silhouette Coefficient dbs: %0.3f" % score_)

    return n_clusters_, n_noise_, score_
```

**Cluster Data Function:**

This is a function to help us understand what the clusters look like, or rather how they differ. It gets the dataset and the cluster column to compare (the benchmark function runs both algorithms and creates one cluster column per algorithm, which is why we have to specify which column to use). It groups the data by cluster and by the most common venue (the most common category), uses the size function to get the size of each group, then sorts and selects the top 3, and finally returns the DF.

```python
def clusters_data(cluster_label, df):
    temp = df.groupby([cluster_label, '1st Most Common Venue']).size().sort_values(ascending=False).reset_index()
    temp_df = pd.DataFrame(temp)

    summary_df = pd.DataFrame()
    k = len(np.unique(df[cluster_label]))  # the number of unique clusters

    for i in range(0, k):
        summary_df = pd.concat([summary_df, temp_df[temp_df[cluster_label] == i].head(3)])

    summary_df.columns = ['Cluster', 'Restaurants', 'Count']
    summary_df.groupby(['Cluster', 'Count']).head()
    return summary_df
```

## Defining the features set

After implementing all our helper functions, we can define the different feature sets we are going to use. Our benchmark function returns a data frame with all the results; all that’s left is to group them into a summary DF.

To run the algorithms we need to remove the zip codes from the data and leave just the dummies. But since we are interested in the data per zip code, before removing the column we simply group the data by it.

```python
X = dummies.groupby('ZipCode').mean().reset_index()
X_raw = X.drop('ZipCode', axis=1)
```

After we have the base set of data we can create a few other sets.

Starting with the PCA algorithm to reduce the data’s dimensionality. Using this algorithm we can find which features best describe our data; running it over the data we have, it turns out 44 features out of the total 183 explain more than 95% of the variance. Here is the PCA test code:

The snippet also plots the cumulative variance of the base algorithm; the last line before the print selects the number of components with cumulative variance above 95%, which gives the 44 PCA features.

```python
from sklearn.decomposition import PCA

# one component per feature, so we can inspect the full variance curve
pca = PCA(n_components=X_raw.shape[1])
pca.fit(X_raw)
variance = pca.explained_variance_ratio_
var = np.cumsum(np.round(variance, 3) * 100)

fig = plt.figure(figsize=(12, 6))
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(0, 100.5)
plt.plot(var)
fig.savefig('pca_plot.png', bbox_inches='tight')

# find the number of components explaining 95% and above
com95 = np.where(var > 95)[0][0]
print('95% of our data can be explained using {} features'.format(com95))

This will be the second set of features. But since we are trying to find a hidden pattern, maybe the other features, the ones that don’t describe most of the data, will actually be more helpful in finding a segmentation.

After finding the number of features we want to use, let’s create a DF with those features by fitting the original data using the PCA function with 44 components:

```python
# X_raw = original data set with all 185 features
pca = PCA(n_components=com95)
pca.fit(X_raw)
pca_scale = pca.transform(X_raw)
pca_df = pd.DataFrame(pca_scale, columns=X.columns[:com95])
pca_df
```

Using this information we can create another DF with all the features that are not in this set.
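The snippet for that complementary set isn’t shown in the post; one possible sketch, assuming “not in this set” means the PCA components beyond the 95% cutoff (the toy data and variable names here are mine, not the repo’s):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# toy stand-in for X_raw (the zip-code category frequencies)
rng = np.random.RandomState(0)
X_raw = pd.DataFrame(rng.rand(100, 10))

pca = PCA(n_components=X_raw.shape[1]).fit(X_raw)
var = np.cumsum(pca.explained_variance_ratio_) * 100
com95 = np.where(var > 95)[0][0]

# components past the 95% cutoff form the complementary feature set
scores = pca.transform(X_raw)
non_pca_df = pd.DataFrame(scores[:, com95:])
print(non_pca_df.shape)
```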

Here is an example of how to run one of the sets with our functions. I save the result of the function into 2 variables, one for the DataFrame and one for the array of results; in each call I gave the variables different names so I can aggregate them at the end.

`npca_res_df, npca_res_results = run_benchmark(non_pca_df, 'no_pca_test')`

By printing the result DF we can see the differences between the clusters. It’s OK if the non-PCA selection includes clusters identified by features that are in the PCA set: the algorithm selects the clusters using the other features, but we use the full feature list to label the clusters.

And this is how the clusters in this case will be identified:

I have run a few more tests with the following sets of features:

- A set without the 3 most common places, Fast Food, Pizza, American Restaurant.
- A set grouped by states

Eventually I summarized the results into a DataFrame to display them in a more ordered form:

```python
columns_res = ['test name', 'best_eps', 'dbscan clusters', 'dbscan noise', 'dbs - s_c', 'kmeasn clusters', 'kmeans s_c']

summary_df = pd.DataFrame([
    raw_res_results,
    pca_res_results,
    npca_res_results,
    no_common_res_results,
], columns=columns_res)

summary_df
```

As can be seen in the results summary, there is no big difference between the data sets; the non-PCA set gives the best result for DBSCAN, and K-means scores best on the PCA set.

But when looking at how the clusters are built, we can’t really differentiate between them in a way that tells us anything about differences in restaurant preferences between zip codes.

## Regression / Correlation

One of my other ideas was to check if there is any possible connection between the price of a house and the restaurants around it. It can be either the number of restaurants or the type.

To check this I decided to run a few different regressions and see how they score and what coefficients are assigned to our features (the features are the restaurant categories).

I defined a few helper functions:

**Split x_y:**

This function receives our feature set and the dependent variable, then splits them into train and test sets using Sklearn’s “train_test_split”.

It also standardizes the x values, fitting the scaler on the train set and transforming both sets.

```python
def split_x_y(X, Y):
    from sklearn.preprocessing import StandardScaler

    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

    # standardization scaler
    s_scaler = StandardScaler()
    X_train = s_scaler.fit_transform(X_train.astype(float))
    X_test = s_scaler.transform(X_test.astype(float))

    print('Train size ({}, {})'.format(X_train.shape, y_train.shape))
    print('Test size ({}, {})'.format(X_test.shape, y_test.shape))
    return X_train, X_test, y_train, y_test
```

**get_cv_score:**

A function to calculate the cross-validation score of our model. This scoring technique holds out a different part of the data in each iteration and scores it; we then take the mean as our cross-validation score.

```python
def get_cv_scores(model, Xtrain, ytrain):
    # cross_val_score lives in scikit-learn's model_selection submodule
    from sklearn.model_selection import cross_val_score

    mse = cross_val_score(model, Xtrain, ytrain, scoring='neg_mean_squared_error', cv=10)
    # take the mean of the mse as an indicator of how good the model is
    return mse.mean()
```

**scoring:**

We can also check basic scores just in case, namely the mean squared error and R²; I implemented them both in this function.
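The body of this function isn’t shown in the post; here is a sketch of what it could look like, based on the labels printed in the results at the end (the exact repo implementation may differ):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def scoring(y_test, y_pred, cv_score):
    # root of the mean squared error, plus R^2 on the test set
    msq = np.sqrt(mean_squared_error(y_test, y_pred))
    r_2 = r2_score(y_test, y_pred)
    print('mean squared error(sqrt):', msq)
    print('R**2:', r_2)
    print('CV score:', cv_score)
    return msq, r_2
```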

After we have those basic functions we can implement each regression and save the scoring results; I’ll also save a data frame with the coefficients so we can compare them.

An example of the linear regression:

```python
X_train, X_test, y_train, y_test = split_x_y(X, Y)

# run the regression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# predict the test set
y_pred = regressor.predict(X_test)

# evaluate the model (intercept and slope)
print('intercept: ', regressor.intercept_)
regressor.score(X_test, y_test)

cv_score = get_cv_scores(regressor, X_train, y_train)
msq, r_2 = scoring(y_test, y_pred, cv_score)

# add results into a DataFrame
coeff_df = pd.DataFrame(regressor.coef_, f_col, columns=['Coefficient'])
```

If we print the DF sorted ascending or descending, we can see which coefficients may have something to do with the house price.

As can be seen from the screenshot above, places like a coffee shop or a juice bar seem to have a positive correlation. (I divided the house prices by 1000, so if you want the real number just multiply by 1000.)

Now that we’ve seen how one regression model works and how its results are collected, we can use the same idea to test a few different regression models and compare them.

I decided to test the question with the following regression models:

- Polynomial — very similar to simple linear regression, but here we first raise all the features to the powers of the polynomial. I tested a 3rd order. (On selected features — see below.)
- Ridge — adds a penalty term to the loss function while trying to minimize it. The penalty is a squared term multiplied by a value called “lambda”; changing this value affects how much noise the loss function captures.
- Lasso — very similar to the ridge algorithm, but with a different penalty term. This algorithm can be useful for feature selection, since it shrinks the features it finds unimportant toward zero.
- Gradient Boosting — boosting is a machine learning technique that can be used for both classification and regression. The goal is to iteratively minimize a loss function; this way, even features that seem weak can become significant if they help minimize the loss.
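As an illustration of the polynomial variant, here is a sketch on toy data using a scikit-learn pipeline (this is not the repo code; the data is made up):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# toy single-feature data with a cubic relationship
rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = x[:, 0] ** 3 + rng.normal(scale=0.1, size=200)

# 3rd-order polynomial regression: expand the features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(round(model.score(x, y), 3))
```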

The implementation of all the models is in the git repo. Below are a few notes about feature selection and grid search.

Our model is huge, with a lot of features, but maybe we don’t need all of them. Reducing the number of features can help the linear regression result by keeping only the important ones, and it saves a lot of running time for the polynomial regression as well. I performed feature selection using two different algorithms, mutual_info_regression and f_regression, both from the SKlearn package. Running each separately, most of the selected features were the same with some small differences, so I decided to work with the combined result.
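The exact selection code isn’t shown here; a minimal sketch of combining the two selectors on toy data (the use of SelectKBest, the value of k, and the toy data are my own choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# toy stand-in for the category features and house values
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=42)

k = 10
sel_f = SelectKBest(f_regression, k=k).fit(X, y)
sel_mi = SelectKBest(mutual_info_regression, k=k).fit(X, y)

# combine: keep any feature picked by either selector
combined = np.union1d(np.where(sel_f.get_support())[0],
                      np.where(sel_mi.get_support())[0])
print('selected feature indices:', combined)
```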

Grid Search — the other three models were trained using grid search. It’s basically a function that receives an Sklearn model and a list of params you want to test, each param with an array of possible values. The function tests the model with all the combinations you provided and returns the best one based on the score each fit received.
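A minimal GridSearchCV sketch on toy data (the Ridge model and the param grid here are illustrative, not the repo’s exact grids):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# every combination of the listed params is fitted and cross-validated
params = {'alpha': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), params, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print('best params:', search.best_params_)
```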

I hope you managed to keep track and train your models. Now we can combine the results and check how the models performed.

Gradient boosting had the best result of the algorithms, but an R² of just 0.28 is still not significant enough, and its cross-validation score of -143,274.92 is only 32,814 better than the Lasso model, which ended with the worst result.

Another thing I did was check the coefficient scores produced by each model. If you remember, our scoring function returned the score and a DF of the coefficients. Let’s combine them and sort by the top results; before sorting I took the absolute value so we can check the bottom as well.

**This is how I sorted them; if you have a better solution I’ll be glad to hear it.**

I created a simple sorting function that performs the following steps:

- Add a sign column to the DF based on each row
- Convert coefficients to their absolute value
- Sort the DF by the coefficient in descending order
- Multiply the sign back into the coefficient of each row

```python
def sort_top(df_in):
    df = df_in.copy()
    # take the sign out
    df['sign'] = np.sign(df['Coefficient'])
    # abs coeff
    df['Coefficient'] = np.abs(df['Coefficient'])
    # sort
    df = df.sort_values(by='Coefficient', ascending=False)
    # apply the sign back to get the true value
    df['Coefficient'] = df['Coefficient'] * df['sign']
    return df
```

And then I simply applied this function to all the coefficient DFs from the different models we tested:

```python
coef_sum = pd.concat([
    sort_top(coeff_df).head(10),
    sort_top(coeff_sel_df).head(10),
    sort_top(rr_coef_df).head(10),
    sort_top(lsm_coef_df).head(10)
])

coef_sum = coef_sum.reset_index()
coef_sum.columns = ['Feature', 'Model', 'Coefficient', 'Sign']
```

As I noted in the first post, a quick look at the coefficients shows that some of them may actually have an impact on the price of a house, like having a coffee shop or a juice bar around. But since the scores of the regressions are low, I won’t treat this as a significant impact.

This table is an Excel-edited version of the coefficients summary DataFrame:

## Does the number of restaurants around have a correlation with the house value?

This was another question I wanted to test, and it’s a simple one since we already have all the data: just sum the number of restaurants in each zip code and run a simple regression between this total and the house values.

In the notebook this appears near the beginning, but here is the code:

- From our dummies data (the restaurant categories), group by zip code and sum — we want the total number of places in each zip code.
- Drop the zip code column before summing (it’s numeric, so it would pollute the total), then sum the DF by rows, i.e. the sum function with axis equal to 1.
- From the zip codes data, select the zip code, state, and house value.
- Merge them together and drop any NA values, just in case.

```python
# group by zip codes
temp = zipcodes_df[['Zip', 'State', 'House Value']]
resturants_grouped = dummies.groupby('ZipCode').sum().reset_index()

# add the sum as the number of restaurants in each zip code
resturants_grouped["Total Restaurants"] = resturants_grouped.drop(columns=['ZipCode']).sum(axis=1)
resturants_grouped = pd.merge(left=resturants_grouped, right=temp, how='left', left_on='ZipCode', right_on='Zip')
resturants_grouped.dropna(inplace=True)

# drop the duplicated Zip column, keeping the restaurant types, zip code and house values
resturants_grouped = resturants_grouped.drop(columns=['Zip'])
resturants_grouped
```

After having the total number let’s perform a simple linear regression:

```python
x_sum = resturants_grouped["Total Restaurants"].to_numpy().reshape(-1, 1)
coeff_total_df, result = run_linear_reg(x_sum, y, ['Total'])
```

- We reshape the data because the linear regression function we wrote passes the data through the scaler’s transform function first. That function requires a two-dimensional array, and this is the workaround.
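A tiny example of what that reshape does (the numbers are made up):

```python
import numpy as np

totals = np.array([12, 4, 7])   # one restaurant total per zip code, shape (3,)
x_sum = totals.reshape(-1, 1)   # scikit-learn expects shape (n_samples, n_features)
print(x_sum.shape)              # (3, 1)
```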

And here is the result, nothing significant was found:

```
Train size ((3688, 1), (3688,))
Test size ((1817, 1), (1817,))
intercept: 417.87958486984814
mean squared error(sqrt): 361.61671503111216
R**2: 0.10646684634273684
CV score: -173045.15649448492
```

Thanks for reading and keeping up with the long posts.

This is my first experience with blogging and with a data science project.

If you have any suggestions or improvements, feel free to comment or reach out; always happy to learn.