D.S. project — The hidden taste of US zip codes — part 3

Clustering

Using cluster analysis, our goal is to find a number of distinct groups that our data can be divided into. In the case of this research, the groups could be, for example, Chinese, Italian, or fast food restaurants. A successful clustering is one where we can clearly tell the groups apart and assign a new data point to one of the identified clusters.
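As a minimal sketch of that last point (toy numbers, not the project data), fitting a clusterer and then assigning a new point to one of the discovered clusters could look like this:

import numpy as np
from sklearn.cluster import KMeans

# toy feature matrix, e.g. [fast food share, coffee shop share] per zip code
toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy)
# an unseen zip-code profile gets assigned to one of the two clusters
print(kmeans.predict(np.array([[0.85, 0.15]])))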

import pandas as pd

# import CSV
aggregated_venues_df = pd.read_csv('aggregated_venues.csv')
# create a new data frame with grouped venues
# one hot encoding
dummies = pd.get_dummies(aggregated_venues_df[['Restaurant Category']], prefix="", prefix_sep="")
# add the zip code column back to the dataframe
dummies['ZipCode'] = aggregated_venues_df['ZipCode']
# move the ZipCode column to the first position
fixed_columns = [dummies.columns[-1]] + dummies.columns[:-1].to_list()
dummies = dummies[fixed_columns]
print(dummies.shape)
dummies.head()
Dummies head example

Most frequent places

We have 184 different categories, so it's hard to understand what our data really looks like. One thing we can do (an idea used during the Coursera course) is to group the data by zip code and then sort each row by the frequency of the places. This way we can find the most frequent places in each location. For example, if a zip code has 3 restaurants, 2 fast food and one coffee shop, grouping them and taking the mean of the row gives fast food a score of about 0.67 and the coffee shop a score of about 0.33, which makes fast food the most frequent category.
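To make that scoring concrete, here is a tiny worked example with made-up data, using the same groupby/mean logic as the code below:

import pandas as pd

# three venues in one (hypothetical) zip code: two fast food, one coffee shop
toy = pd.DataFrame({
    'ZipCode': ['10001', '10001', '10001'],
    'Fast Food': [1, 1, 0],
    'Coffee Shop': [0, 0, 1],
})
# mean per zip code: Fast Food ~0.67, Coffee Shop ~0.33
print(toy.groupby('ZipCode').mean())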

import numpy as np
import matplotlib.pyplot as plt

# keep the 3 most common venue categories per zip code (3 are plotted below)
number_of_top_restaurants = 3

res_grouped = dummies.groupby('ZipCode').mean().reset_index()
res = pd.Series(
    [row.iloc[0], *row.iloc[1:].sort_values(ascending=False).index.values[0:number_of_top_restaurants]]
    for index, row in res_grouped.iterrows()
)
# series to data-frame
neighborhoods_venues_sorted = pd.DataFrame.from_records(res)
# create columns according to the number of top venues
columns = ['ZipCode']
indicators = ['st', 'nd', 'rd']
for ind in np.arange(number_of_top_restaurants):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# name the columns (ZipCode, 1st/2nd/3rd Most Common Venue)
neighborhoods_venues_sorted.columns = columns

# plot the 3 most common places
def plot_most_freq(cat):
    fig = plt.figure(figsize=(16, 7))
    plt.title(cat)
    neighborhoods_venues_sorted[['ZipCode', cat]].groupby(cat).size().sort_values(ascending=False).head(20).plot.bar()
    plt.tight_layout()

    # save the figure
    fig.savefig(cat + '.png', bbox_inches='tight')

plot_most_freq('1st Most Common Venue')
plot_most_freq('2nd Most Common Venue')
plot_most_freq('3rd Most Common Venue')

Clustering

The goal of this research is to find a common pattern between different zip codes and check whether or not we can segment them by types of restaurants. There are many different clustering techniques. Here I'll compare k-means, which is one of the simplest and best-known algorithms, with DBSCAN, which does not require us to set the number of clusters in advance.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# df - values to run the algorithm on
# s_value - the minimum k the loop will start from
# e_value - the maximum k to go up to
def test_kmeans(df, s_value, e_value):
    scores = []
    silhouette_coefficients_means = []
    x_range = range(s_value, e_value)
    for k in x_range:
        kmeans = KMeans(n_clusters=k, random_state=40, n_init=12)
        kmeans.fit(df)
        scores.append(kmeans.inertia_)
        if(len(np.unique(kmeans.labels_)) > 2 and len(np.unique(kmeans.labels_)) < len(df)):
            silhouette_coefficients_means.append(silhouette_score(df, kmeans.labels_))
        else:
            silhouette_coefficients_means.append(-2)

    # find the best number of clusters
    loc = int(np.argmax(silhouette_coefficients_means))
    print('Best Kmeans coeff {} at K= {}'.format(np.max(silhouette_coefficients_means), x_range[loc]))

    return x_range[loc]
from sklearn.cluster import DBSCAN

# X - values to run the algorithm on
# s_value - the minimum eps the loop will start from
# e_value - the maximum eps to go up to
# steps - optional, the step size of the eps range
def test_dbscan(X, s_value, e_value, steps=0.1):
    x_range = np.arange(s_value, e_value, steps)
    silhouette_coefficients = []
    for e in x_range:
        clustering = DBSCAN(eps=e, min_samples=9).fit(X)
        if(len(np.unique(clustering.labels_)) > 2 and len(np.unique(clustering.labels_)) < len(X)):
            silhouette_coefficients.append(silhouette_score(X, clustering.labels_))
        else:
            silhouette_coefficients.append(-2)
    loc = int(np.argmax(silhouette_coefficients))
    print('Best DBSCAN coeff {} at eps= {}'.format(np.max(silhouette_coefficients), x_range[loc]))

    return x_range[loc]

So what do they actually do? Both functions sweep over a range of values, test_kmeans over the number of clusters k and test_dbscan over the eps radius. For each value they fit the corresponding model, compute the silhouette coefficient of the resulting labels (falling back to -2 when the score cannot be computed), and finally return the value that gave the best silhouette score.

Run benchmark function:

def run_benchmark(df, test_name):
    # test the algorithms on the given data
    best_eps = test_dbscan(df, 0.1, 1, steps=0.1)
    # run the kmeans test
    best_k = test_kmeans(df, 2, np.min([20, len(df)]))

    print('test printing: eps={}, k={}'.format(best_eps, best_k))

    ## retrieve the result of the best estimators
    res, d_clusters_, d_noise_, d_score_, k_score_ = retrieve_result(df, best_k, best_eps)

    return res, [test_name, best_eps, d_clusters_, d_noise_, d_score_, best_k, k_score_]
def retrieve_result(df, best_k, best_eps):
    ## cluster the data using both algorithms
    neighborhoods_venues_sorted = generate_ven_sorted()
    ## clustering using Kmeans
    # based on our analysis kmeans gives the best result with 3 clusters
    kmeans = KMeans(n_clusters=best_k, random_state=40, n_init=12).fit(df)
    k_score_ = -2
    # find the silhouette score for kmeans
    if(len(np.unique(kmeans.labels_)) > 2 and len(np.unique(kmeans.labels_)) < len(df)):
        k_score_ = silhouette_score(df, kmeans.labels_)
    ## cluster using DBSCAN
    # based on our analysis we found that eps=0.7 gives the best results
    dbscan = DBSCAN(eps=best_eps, min_samples=9).fit(df)

    # report the best score we found for dbscan
    d_clusters_, d_noise_, d_score_ = print_dbscan_state(dbscan, df)
    # add the clustering labels to our dataframe
    neighborhoods_venues_sorted.insert(0, 'Kmeans clusters', kmeans.labels_)
    neighborhoods_venues_sorted.insert(0, 'DBSCAN clusters', dbscan.labels_)
    # merge the clustered results with the original data to add zip codes and locations
    df_merged_pca = aggregated_venues_df.join(neighborhoods_venues_sorted.set_index('ZipCode'), on='ZipCode')

    ## aggregate the results into one DF
    res_df = pd.DataFrame()

    ## check the clusters data for kmeans
    temp = clusters_data('Kmeans clusters', df_merged_pca)
    temp.insert(0, 'algo', ['kmeans' for _ in range(len(temp))])
    res_df = pd.concat([res_df, temp])
    ## check the clusters data for dbscan
    temp = clusters_data('DBSCAN clusters', df_merged_pca)
    temp.insert(0, 'algo', ['dbscan' for _ in range(len(temp))])
    res_df = pd.concat([res_df, temp])

    return res_df, d_clusters_, d_noise_, d_score_, k_score_
def print_dbscan_state(db_cluster, df):
    # number of clusters in labels, ignoring noise if present
    labels = db_cluster.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)
    score_ = -2

    if(len(np.unique(labels)) > 2 and len(np.unique(labels)) < len(df)):
        score_ = silhouette_score(df, labels)

    print('Estimated number of clusters dbs: %d' % n_clusters_)
    print('Estimated number of noise points dbs: %d' % n_noise_)
    if(score_ > -2):
        print("Silhouette Coefficient dbs: %0.3f" % score_)

    return n_clusters_, n_noise_, score_
def clusters_data(cluster_label, df):
    temp = df.groupby([cluster_label, '1st Most Common Venue']).size().sort_values(ascending=False).reset_index()
    temp_df = pd.DataFrame(temp)
    summary_df = pd.DataFrame()
    k = len(np.unique(df[cluster_label]))  # the number of unique clusters
    for i in range(0, k):
        summary_df = pd.concat([summary_df, temp_df[temp_df[cluster_label] == i].head(3)])
    summary_df.columns = ['Cluster', 'Restaurants', 'Count']
    summary_df.groupby(['Cluster', 'Count']).head()

    return summary_df

Defining the feature sets

After implementing all our helper functions, we can define the different feature sets we are going to use. Our benchmark function returns a data frame with all the results, so all that is left to do is group them together into one summary DF.

X = dummies.groupby('ZipCode').mean().reset_index()
X_raw = X.drop('ZipCode', axis=1)

from sklearn.decomposition import PCA

# fit PCA keeping all components
pca = PCA()
pca.fit(X_raw)
variance = pca.explained_variance_ratio_
var = np.cumsum(np.round(variance, 3) * 100)
fig = plt.figure(figsize=(12, 6))
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(0, 100.5)
plt.plot(var)
fig.savefig('pca_plot.png', bbox_inches='tight')
# find the number of components explaining 95% and above
com95 = np.where(var > 95)[0][0]
print('95% of our data can be explained using {} features'.format(com95))
# X_raw = original data set with all the features
pca = PCA(n_components=com95)
pca.fit(X_raw)
pca_scale = pca.transform(X_raw)
# label the principal components PC1..PCn
pca_df = pd.DataFrame(pca_scale, columns=['PC{}'.format(i + 1) for i in range(com95)])
pca_df
npca_res_df, npca_res_results = run_benchmark(non_pca_df, 'no_pca_test')

In addition to these, I defined a few more feature sets:

  1. A set without the 3 most common places: Fast Food, Pizza, and American Restaurant.
  2. A set grouped by states.
columns_res = ['test name', 'best_eps', 'dbscan clusters', 'dbscan noise', 'dbs - s_c', 'kmeans clusters', 'kmeans s_c']
summary_df = pd.DataFrame([
    raw_res_results,
    pca_res_results,
    npca_res_results,
    no_common_res_results,
], columns=columns_res)
summary_df

Regression / Correlation

Another idea of mine was to check whether there is any connection between the price of a house and the restaurants around it, either the number of restaurants or their type.
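As a quick sanity check before fitting any regression model, we can simply look at the correlation itself. A minimal sketch, assuming the restaurants_grouped data frame built at the end of this section (with its Total Restaurants and House Value columns):

import numpy as np

# Pearson correlation between the restaurant count and the house value per zip code
counts = restaurants_grouped['Total Restaurants'].to_numpy()
values = restaurants_grouped['House Value'].to_numpy()
print(np.corrcoef(counts, values)[0, 1])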

def split_x_y(X, Y):
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
    # standardization scaler
    s_scaler = StandardScaler()
    X_train = s_scaler.fit_transform(X_train.astype(np.float64))
    X_test = s_scaler.transform(X_test.astype(np.float64))
    print('Train size ({}, {})'.format(X_train.shape, y_train.shape))
    print('Test size ({}, {})'.format(X_test.shape, y_test.shape))

    return X_train, X_test, y_train, y_test

def get_cv_scores(model, Xtrain, ytrain):
    # cross_val_score comes from the model_selection submodule of scikit-learn
    from sklearn.model_selection import cross_val_score
    mse = cross_val_score(model, Xtrain, ytrain, scoring='neg_mean_squared_error', cv=10)
    # take the mean of the (negative) MSE as an indicator of how good the model is
    return mse.mean()
X_train, X_test, y_train, y_test = split_x_y(X, Y)

# run the regression
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
# predict the test set result
y_pred = regressor.predict(X_test)
# evaluate the model (intercept and slope)
print('intercept: ', regressor.intercept_)
regressor.score(X_test, y_test)

cv_score = get_cv_scores(regressor, X_train, y_train)
# scoring is a small helper that returns the mean squared error and R^2
msq, r_2 = scoring(y_test, y_pred, cv_score)

# add the results into a DataFrame (f_col holds the feature column names)
coeff_df = pd.DataFrame(regressor.coef_, f_col, columns=['Coefficient'])
Linear Regression coefficients, left — top ten, right — bottom ten
  1. Polynomial — very similar to simple linear regression, but we first expand the features with polynomial terms. I tested a 3rd-order polynomial (on selected features — see below).
  2. Ridge — adds a penalty term to the loss function it tries to minimize. The penalty is the sum of the squared coefficients multiplied by a value called "lambda"; changing lambda controls how strongly the model is regularized and how much noise it ends up fitting.
  3. Lasso — very similar to the ridge algorithm, but with a different penalty term. This algorithm can be useful for feature selection, since it tends to shrink the coefficients of unimportant features to zero.
  4. Gradient Boosting — boosting is a machine learning technique that can be used for both classification and regression. It builds an ensemble of weak learners, each fitted to the errors of the previous ones, by minimizing a loss function; in this way even weak learners can add up to a strong model. A sketch of how these four models can be set up follows below.
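The code for these four models did not make it into this snippet, so here is a hedged sketch of how such a comparison could be set up with scikit-learn. The alpha values, and using the first 5 columns as a stand-in for the selected features of the polynomial model, are illustrative assumptions, not the exact values used in the project:

from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingRegressor

models = {
    # ridge: squared (L2) penalty on the coefficients, scaled by alpha ("lambda")
    'ridge': Ridge(alpha=1.0),
    # lasso: L1 penalty that shrinks unimportant coefficients towards zero
    'lasso': Lasso(alpha=0.1),
    # gradient boosting: an ensemble of weak learners fitted to the residuals
    'gradient boosting': GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # R^2 on the test set

# the polynomial model is fitted on a handful of selected columns only, since
# expanding all 184 features to 3rd order would explode the feature count
# (the first 5 columns here are just a stand-in for the selected features)
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly.fit(X_train[:, :5], y_train)
print('polynomial', poly.score(X_test[:, :5], y_test))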
Regression model result summary
  1. Add a sign column to the DF based on each row's coefficient.
  2. Convert the coefficients to their absolute values.
  3. Sort the DF by the coefficient in descending order.
  4. Multiply the sign back into the coefficient of each row.
def sort_top(df_in):
    df = df_in.copy()
    # take the sign out
    df['sign'] = np.sign(df['Coefficient'])
    # absolute value of the coefficient
    df['Coefficient'] = np.abs(df['Coefficient'])
    # sort
    df = df.sort_values(by='Coefficient', ascending=False)
    # apply the sign back to get the true value
    df['Coefficient'] = df['Coefficient'] * df['sign']
    return df

coef_sum = pd.concat([
    sort_top(coeff_df).head(10),
    sort_top(coeff_sel_df).head(10),
    sort_top(rr_coef_df).head(10),
    sort_top(lsm_coef_df).head(10)
])
coef_sum = coef_sum.reset_index()
coef_sum.columns = ['Feature', 'Model', 'Coefficient', 'Sign']
Coefficients result summary

Does the number of restaurants around have a correlation with the house value?

This was another question I wanted to test, and it's a simple one since we already have all the data: just sum up the number of restaurants in each zip code and run a simple regression between that total and the value of the houses.

  1. From our dummies data (the restaurant categories data), group by zip code and sum — we want the total number of places in each zip code.
  2. Drop the ZipCode column (it is numeric, so it would otherwise be included in the sum) and sum the DF across the rows (sum with axis=1).
  3. From the zip codes data, select the zip code, state, and house value.
  4. Merge them together and drop any NA values, just in case.
# select the zip code, state, and house value columns
temp = zipcodes_df[['Zip', 'State', 'House Value']]
# group the dummies by zip code
restaurants_grouped = dummies.groupby('ZipCode').sum().reset_index()
# add the sum as the number of restaurants in each zip code
restaurants_grouped["Total Restaurants"] = restaurants_grouped.drop(columns=['ZipCode']).sum(axis=1)
# merge with the house values and drop any NA values just in case
restaurants_grouped = pd.merge(left=restaurants_grouped, right=temp, how='left', left_on='ZipCode', right_on='Zip')
restaurants_grouped.dropna(inplace=True)
# drop the duplicated Zip column, keeping the restaurant types, zip code, and house values
restaurants_grouped = restaurants_grouped.drop(columns=['Zip'])
restaurants_grouped

x_sum = restaurants_grouped["Total Restaurants"].to_numpy().reshape(-1, 1)
coeff_total_df, result = run_linear_reg(x_sum, y, ['Total'])
  • We reshape the data because the linear regression helper we wrote first passes the data through the scaler's transform function, which requires a two-dimensional array; reshape(-1, 1) is the workaround (a tiny sketch of this follows after the output below).
Train size ((3688, 1), (3688,))
Test size ((1817, 1), (1817,))
intercept: 417.87958486984814
mean squared error(sqrt): 361.61671503111216
R**2: 0.10646684634273684
CV score: -173045.15649448492
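To make the reshape note above concrete, here is a minimal sketch (toy numbers) of why StandardScaler needs a two-dimensional array:

import numpy as np
from sklearn.preprocessing import StandardScaler

totals = np.array([12, 3, 45, 7])  # a single feature as a 1-D array, shape (4,)
# StandardScaler expects a 2-D array, so we reshape to one column, shape (4, 1)
scaled = StandardScaler().fit_transform(totals.reshape(-1, 1))
print(scaled.shape)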
