K-Nearest Neighbors

April 2nd, 2020

In this project we will be implementing the KNN (K-Nearest Neighbors) algorithm from scratch. We will run the algorithm for the K values 1, 3, 5, and 7 and see which one performs best. We will also implement multiple distance measures and compare accuracy across all of them.

We begin by importing the necessary libraries.

import math

import matplotlib.pyplot as mp
import numpy as np
import pandas as pd
DATA = "../data/iris.data"
SPLIT_PERCENT = 0.7
K_VALUES = [1, 3, 5, 7]
DIST_METRICS = ["euc", "norm-euc", "cosine-sim"]

We now prepare the dataset for further processing. To do this, we load the data, shuffle it randomly, and map the class names to integers for easier processing.

dataset = pd.read_csv(DATA).sample(frac=1).reset_index(drop=True)

# Map class names to integers
iris_class_map = {v: k + 1 for k, v in enumerate(dataset['class'].unique())}
dataset['class'] = dataset['class'].map(iris_class_map)

RECORDS_COUNT, ATTR_COUNT = dataset.shape
ATTRS = dataset.columns.values[0:ATTR_COUNT - 1]
SPLIT_SIZE = math.floor(RECORDS_COUNT * SPLIT_PERCENT)

# List of prediction columns, one for every K value
K_COL_STRINGS = ["pred_k_{}".format(k) for k in K_VALUES]
for col in K_COL_STRINGS:
    dataset[col] = np.nan

# Split dataset for dev and test
dev_set = dataset[:SPLIT_SIZE].copy(deep=True)
test_set = dataset[SPLIT_SIZE:].copy(deep=True)

print("Dev data")
dev_set.head()
Dev data

   sepal_len  sepal_wid  petal_len  petal_wid  class  pred_k_1  pred_k_3  pred_k_5  pred_k_7
0        5.0        3.6        1.4        0.2      1       NaN       NaN       NaN       NaN
1        7.0        3.2        4.7        1.4      2       NaN       NaN       NaN       NaN
2        6.8        2.8        4.8        1.4      2       NaN       NaN       NaN       NaN
3        5.7        2.8        4.1        1.3      2       NaN       NaN       NaN       NaN
4        5.5        2.4        3.8        1.1      2       NaN       NaN       NaN       NaN
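For reference, iris_class_map assigns each class name an integer in the order the classes are first encountered in the shuffled data. With the standard Iris dataset this would be, for example, {'Iris-setosa': 1, 'Iris-versicolor': 2, 'Iris-virginica': 3}, though the exact assignment varies with the shuffle.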

Now that our data is prepared, we need to find the nearest neighbors for every K value. To do this, we first calculate the Euclidean distance from every row in the dev set to every other row. Once we have the distances, we take the K nearest neighbors for each K value and predict the class of the selected vector by majority vote.
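As a quick aside, the pandas chain used below (sub → pow(2) → sum → pow(0.5)) is just the textbook formula d(a, b) = sqrt(sum((a_i - b_i)^2)). Here is a minimal standalone sketch, using the first two rows of the dev set shown above:

# Standalone sketch: Euclidean distance between two rows from the head above
import numpy as np

a = np.array([5.0, 3.6, 1.4, 0.2])  # row 0 of dev_set
b = np.array([7.0, 3.2, 4.7, 1.4])  # row 1 of dev_set

manual = np.sqrt(((a - b) ** 2).sum())  # textbook formula
builtin = np.linalg.norm(a - b)         # numpy's built-in equivalent

print(manual, builtin)  # both print the same value, ~4.06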

Once we make our predictions, we need to calculate the accuracy as well.

# Find K nearest neighbors for all values of K
for index, k in enumerate(K_VALUES):
    calculated_pred = []
    for i, row in dev_set.iterrows():
        # Calculate Euclidean distance
        calculated_dist = (dev_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()
        # Get indices of nearest neighbors
        nearest_neighbor_indices = calculated_dist.iloc[0:k].index.values
        # Get nearest neighbors
        nearest_neighbors = dev_set.loc[nearest_neighbor_indices, :]['class']
        # Predict class of the vector
        prediction = nearest_neighbors.mode().values[0]
        calculated_pred.append(prediction)
    dev_set[K_COL_STRINGS[index]] = calculated_pred

# Calculating accuracy
euc_accuracy = []

for col in dev_set[K_COL_STRINGS]:
    column = dev_set[col]
    total_rows = dev_set.shape[0]
    num = dev_set.loc[dev_set['class'] == column].shape[0]
    acc = round((num / total_rows) * 100, 5)
    euc_accuracy.append(acc)

print(euc_accuracy)
[100.0, 99.04762, 99.04762, 98.09524]
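As an optional sanity check, and assuming scikit-learn is installed (it is not used anywhere else in this project), sklearn's KNeighborsClassifier scored on the same dev set should land close to the numbers above; small differences are possible because its tie-breaking differs from mode():

# Optional sanity check against scikit-learn (not part of the original project)
from sklearn.neighbors import KNeighborsClassifier

X = dev_set[ATTRS].values
y = dev_set['class'].values

for k in K_VALUES:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Scoring on the training data mirrors the dev-on-dev evaluation above
    print(k, round(clf.score(X, y) * 100, 5))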

Now we follow the same process, but use a normalized Euclidean distance as the metric. To do this, we min-max normalize each attribute, x' = (x - min) / (max - min), and calculate the Euclidean distance again. This puts every attribute on the same 0-to-1 scale, so no single attribute dominates the distance.
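To see why scaling matters, here is a toy sketch with invented numbers: when one attribute has a much larger range than another, it dominates the raw Euclidean distance, and rescaling both to [0, 1] removes that bias:

# Toy sketch: an attribute with a large range dominates the raw distance
import numpy as np

# Two made-up rows: attribute 1 ranges over [0, 1], attribute 2 over [0, 1000]
p = np.array([0.1, 900.0])
q = np.array([0.9, 905.0])
print(np.linalg.norm(p - q))  # ~5.06: driven almost entirely by attribute 2

# Min-max scale each attribute by its assumed range
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1000.0])
p_s = (p - lo) / (hi - lo)
q_s = (q - lo) / (hi - lo)
print(np.linalg.norm(p_s - q_s))  # ~0.80: attribute 1 now matters again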

# Normalize data
def normalize(dataframe):
    df = dataframe.copy(deep=True)
    for col in df[ATTRS]:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df

norm_dev_set = normalize(dev_set)

# Reset the prediction columns
for col in K_COL_STRINGS:
    norm_dev_set[col] = np.nan

norm_dev_set.head()
   sepal_len  sepal_wid  petal_len  petal_wid  class  pred_k_1  pred_k_3  pred_k_5  pred_k_7
0   0.194444   0.842105   0.067797   0.043478      1       NaN       NaN       NaN       NaN
1   0.750000   0.631579   0.627119   0.565217      2       NaN       NaN       NaN       NaN
2   0.694444   0.421053   0.644068   0.565217      2       NaN       NaN       NaN       NaN
3   0.388889   0.421053   0.525424   0.521739      2       NaN       NaN       NaN       NaN
4   0.333333   0.210526   0.474576   0.434783      2       NaN       NaN       NaN       NaN

Once the dataset is normalized, we follow the same process to calculate the Euclidean distance and predict.

# Predict using normalized data for all K values
for index, k in enumerate(K_VALUES):
    calculated_pred = []
    for i, row in norm_dev_set.iterrows():
        calculated_dist = (norm_dev_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()
        nearest_neighbor_indices = calculated_dist.iloc[0:k].index.values
        nearest_neighbors = norm_dev_set.loc[nearest_neighbor_indices, :]['class']
        prediction = nearest_neighbors.mode().values[0]
        calculated_pred.append(prediction)
    norm_dev_set[K_COL_STRINGS[index]] = calculated_pred

# Calculating accuracy
norm_euc_accuracy = []

for col in norm_dev_set[K_COL_STRINGS]:
    column = norm_dev_set[col]
    total_rows = norm_dev_set.shape[0]
    num = norm_dev_set.loc[norm_dev_set['class'] == column].shape[0]
    acc = round((num / total_rows) * 100, 5)
    norm_euc_accuracy.append(acc)

print(norm_euc_accuracy)
[100.0, 98.09524, 99.04762, 99.04762]

For the final iteration, we will use cosine similarity as the distance metric. Since cosine similarity measures closeness (a value of 1 means identical direction), we convert it to a distance by taking 1 minus the similarity, so that smaller values again mean closer neighbors.

# Dot product
def dot(A, B):
    return sum(a * b for a, b in zip(A, B))

# Cosine distance: 1 minus the cosine similarity, so smaller means closer
def cosine_similarity(a, b):
    return 1 - dot(a, b) / ((np.sqrt(dot(a, a))) * (np.sqrt(dot(b, b))))

cosine_dev_set = dev_set.copy(deep=True)

for index, k in enumerate(K_VALUES):
    calculated_pred = []
    for i, row in cosine_dev_set.iterrows():
        calculated_dist = cosine_dev_set[ATTRS].apply(lambda a: cosine_similarity(np.array(a), np.array(row[ATTRS])), axis=1).sort_values()
        nearest_neighbor_indices = calculated_dist.iloc[0:k].index.values
        nearest_neighbors = cosine_dev_set.loc[nearest_neighbor_indices, :]['class']
        prediction = nearest_neighbors.mode().values[0]
        calculated_pred.append(prediction)
    cosine_dev_set[K_COL_STRINGS[index]] = calculated_pred

# Calculating accuracy
cosine_accuracy = []

for col in cosine_dev_set[K_COL_STRINGS]:
    column = cosine_dev_set[col]
    total_rows = cosine_dev_set.shape[0]
    num = cosine_dev_set.loc[cosine_dev_set['class'] == column].shape[0]
    acc = round((num / total_rows) * 100, 5)
    cosine_accuracy.append(acc)

print(cosine_accuracy)
[100.0, 98.09524, 98.09524, 99.04762]
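As a quick check, and assuming SciPy is available (it is not otherwise used in this project), the hand-rolled function should agree with scipy.spatial.distance.cosine, which is defined the same way (1 minus the cosine similarity):

# Optional check of the hand-rolled cosine distance against SciPy
from scipy.spatial.distance import cosine

u = [5.0, 3.6, 1.4, 0.2]
v = [7.0, 3.2, 4.7, 1.4]

print(cosine_similarity(u, v))  # hand-rolled version from above
print(cosine(u, v))             # SciPy's cosine distance, same value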

Now, we need to analyze the accuracy of all these distance metrics across all the K values and pick the one that does the best job. To do this, we tabulate the accuracies and plot them side by side.

acc_table = pd.DataFrame(index=K_VALUES)
acc_table[DIST_METRICS[0]] = euc_accuracy
acc_table[DIST_METRICS[1]] = norm_euc_accuracy
acc_table[DIST_METRICS[2]] = cosine_accuracy
acc_table
         euc   norm-euc  cosine-sim
1  100.00000  100.00000   100.00000
3   95.23810   95.23810    98.09524
5   96.19048   95.23810    98.09524
7   98.09524   96.19048    98.09524
width = 0.3
mp.figure(figsize=(15, 10))
mp.ylim(0, 115)
e = mp.bar(x=np.add(K_VALUES, width * -1), height=euc_accuracy, width=width, color='#663399')
n = mp.bar(x=np.add(K_VALUES, width * 0), height=norm_euc_accuracy, width=width, color='#669933')
c = mp.bar(x=np.add(K_VALUES, width * 1), height=cosine_accuracy, width=width, color='#994d33')

mp.legend(DIST_METRICS, loc="best", fontsize=12)
mp.xlabel("K Value")
mp.ylabel("Accuracy (%)")
mp.show()

[Figure: grouped bar chart of accuracy (%) against K value for the euc, norm-euc, and cosine-sim metrics]

We see that at K = 1 we get 100% accuracy. This is not a sign of a good model: since the query row is never excluded from the candidate neighbors, its distance to itself is zero, so at K = 1 every vector simply predicts its own class. In other words, the model is overfitting. From my observations, K values of 3 and 5 are the most stable, and the accuracy at K = 7 usually decreases. I will therefore choose K = 3 with the normalized Euclidean distance, since it is stable, to measure accuracy on the test dataset.
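As an aside, one way around the trivial K = 1 result would be to exclude the query row from its own neighbor list. A minimal sketch (not part of the project code): since the query row always sorts first with distance zero, we can simply skip the first neighbor before voting.

# Sketch: dev-set evaluation that skips the query row itself
k = 1
loo_pred = []
for i, row in dev_set.iterrows():
    dists = (dev_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()
    # Skip the first entry, which is the query's own zero distance
    # (assuming no duplicate rows in the dev set)
    neighbor_idx = dists.iloc[1:k + 1].index.values
    loo_pred.append(dev_set.loc[neighbor_idx, 'class'].mode().values[0])

At K = 1 this gives a more honest error estimate, since the prediction can no longer copy the true label.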

k_val = 3
norm_test_set = normalize(test_set)

calculated_pred = []
for i, row in norm_test_set.iterrows():
    calculated_dist = (norm_test_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()
    nearest_neighbor_indices = calculated_dist.iloc[0:k_val].index.values
    nearest_neighbors = norm_test_set.loc[nearest_neighbor_indices, :]['class']
    prediction = nearest_neighbors.mode().values[0]
    calculated_pred.append(prediction)
norm_test_set[K_COL_STRINGS[1]] = calculated_pred

# Calculating accuracy on the test set
norm_euc_accuracy = []

column = norm_test_set[K_COL_STRINGS[1]]
total_rows = norm_test_set.shape[0]
num = norm_test_set.loc[norm_test_set['class'] == column].shape[0]
acc = round((num / total_rows) * 100, 5)
norm_euc_accuracy.append(acc)

print(norm_euc_accuracy)
[98.09524]

The final accuracy we get on the test set is around 98%, which is pretty good.

The notebook can be downloaded here if you'd like to review it.
