# 2020 April 2nd

In this project we will be implementing the KNN algorithm from scratch. We will implement the algorithm for the K values 1, 3, 5 and 7 and see which one is performing better. Also, we will implement multiple distance measures and compare accurac across all of them.

We begin by importing the necessary libraries

```1import math2
3import matplotlib.pyplot as mp4import numpy as np5import pandas as pd```
`1DATA = "../data/iris.data"2SPLIT_PERCENT = 0.73K_VALUES = [1, 3, 5, 7]4DIST_METRICS = ["euc", "norm-euc", "cosine-sim"]`

We now prepare the dataset for further processing. To do this we first load the data, shuffle it randomly and map the classes to integers for easier processing

```1dataset = pd.read_csv(DATA).sample(frac=1).reset_index(drop=True)2
3# Map class names to integers4iris_class_map = {v: k + 1 for k, v in enumerate(dataset['class'].unique())}5dataset['class'] = dataset['class'].map(iris_class_map)6
7RECORDS_COUNT, ATTR_COUNT = dataset.shape8ATTRS = dataset.columns.values[0:ATTR_COUNT - 1]9SPLIT_SIZE = math.floor(RECORDS_COUNT * SPLIT_PERCENT)10
11# List of columns for every K Value12K_COL_STRINGS = ["pred_k_{}".format(k) for k in K_VALUES]13for col in K_COL_STRINGS:14    dataset[col] = np.nan15
16# Split dataset for dev and test17dev_set = dataset[:SPLIT_SIZE].copy(deep=True)18test_set = dataset[SPLIT_SIZE:].copy(deep=True)19
`1Dev data`
sepal_lensepal_widpetal_lenpetal_widclasspred_k_1pred_k_3pred_k_5pred_k_7
05.03.61.40.21NaNNaNNaNNaN
17.03.24.71.42NaNNaNNaNNaN
26.82.84.81.42NaNNaNNaNNaN
35.72.84.11.32NaNNaNNaNNaN
45.52.43.81.12NaNNaNNaNNaN

Now that our data is prepared, we need to find the nearest neighbors for every K value. To do this, we first need to calculate the Euclidian distance for every row in the dev set against every other row. Once we have the distances, we find the K Nearest Neighbors for every K value and use this to predict the class of the selected vector.

Once we make our predictions, we need to calculate the accuracy as well.

```1# Find K nearest neighbors for all values of K2for index, k in enumerate(K_VALUES):3    calculated_pred = []4    for i, row in dev_set.iterrows():5        # Calculate euclidian distance        6        calculated_dist = (dev_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()7        # Get indices of nearest neighbors        8        nearest_neighbor_indices = calculated_dist.iloc[0:k].index.values9        # Get nearest neighbors        10        nearest_neighbors = dev_set.loc[nearest_neighbor_indices, :]['class']11        # Predict class of the vector        12        prediction = nearest_neighbors.mode().values[0]13        calculated_pred.append(prediction)14    dev_set[K_COL_STRINGS[index]] = calculated_pred15
16# Calculating accuracy17euc_accuracy = []18
19for col in dev_set[K_COL_STRINGS]:20    column = dev_set[col]21    total_rows = dev_set.shape[0]22    num = dev_set.loc[dev_set['class'] == column].shape[0]23    acc = round((num/total_rows) * 100, 5)24    euc_accuracy.append(acc)25
26print(euc_accuracy)```
`1[100.0, 99.04762, 99.04762, 98.09524]`

Now, follow the same process but instead we use a normalized euclidian distance as a metric. To do this, we normalize the dataset and calculate the euclidian distance again. This allows us to deal with outliers and ensure we are correctly scaling the data

```1# Normalize data2def normalize(dataframe):3    df = dataframe.copy(deep=True)4    for col in df[ATTRS]:5        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())6    return df7
8norm_dev_set = normalize(dev_set)9
10# Reset the prediction columns11for col in K_COL_STRINGS:12    norm_dev_set[col] = np.nan13
sepal_lensepal_widpetal_lenpetal_widclasspred_k_1pred_k_3pred_k_5pred_k_7
00.1944440.8421050.0677970.0434781NaNNaNNaNNaN
10.7500000.6315790.6271190.5652172NaNNaNNaNNaN
20.6944440.4210530.6440680.5652172NaNNaNNaNNaN
30.3888890.4210530.5254240.5217392NaNNaNNaNNaN
40.3333330.2105260.4745760.4347832NaNNaNNaNNaN

Once the dataset is normalized, we follow the same process to calculate the euclidian distance and predict

```1# Predict using normalized data for all K values2for index, k in enumerate(K_VALUES):3    calculated_pred = []4    for i, row in norm_dev_set.iterrows():5        calculated_dist = (norm_dev_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()6        nearest_neighbor_indices = calculated_dist.iloc[0:k].index.values7        nearest_neighbors = norm_dev_set.loc[nearest_neighbor_indices, :]['class']8        prediction = nearest_neighbors.mode().values[0]9        calculated_pred.append(prediction)10    norm_dev_set[K_COL_STRINGS[index]] = calculated_pred11
12norm_euc_accuracy = []13
14for col in norm_dev_set[K_COL_STRINGS]:15    column = norm_dev_set[col]16    total_rows = norm_dev_set.shape[0]17    num = norm_dev_set.loc[dev_set['class'] == column].shape[0]18    acc = round((num/total_rows) * 100, 5)19    norm_euc_accuracy.append(acc)20
21print(norm_euc_accuracy)```
`1[100.0, 98.09524, 99.04762, 99.04762]`

For the final iteration, we will use Cosine Similarity as a distance metric

```1# Dot product2def dot(A, B):3    return sum(a * b for a, b in zip(A, B))4
5# Cosine Similarity6def cosine_similarity(a, b):7    return 1 - dot(a, b) / ((np.sqrt(dot(a, a))) * (np.sqrt(dot(b, b))))8
9cosine_dev_set = dev_set.copy(deep=True)10
11for index, k in enumerate(K_VALUES):12    calculated_pred = []13    for i, row in cosine_dev_set.iterrows():14        calculated_dist = cosine_dev_set[ATTRS].apply(lambda a: cosine_similarity(np.array(a), np.array(row[ATTRS])), axis=1).sort_values()15        nearest_neighbor_indices = calculated_dist.iloc[0:k].index.values16        nearest_neighbors = cosine_dev_set.loc[nearest_neighbor_indices, :]['class']17        prediction = nearest_neighbors.mode().values[0]18        calculated_pred.append(prediction)19    cosine_dev_set[K_COL_STRINGS[index]] = calculated_pred20
21cosine_accuracy = []22
23for col in cosine_dev_set[K_COL_STRINGS]:24    column = cosine_dev_set[col]25    total_rows = cosine_dev_set.shape[0]26    num = cosine_dev_set.loc[dev_set['class'] == column].shape[0]27    acc = round((num/total_rows) * 100, 5)28    cosine_accuracy.append(acc)29
30print(cosine_accuracy)```
`1[100.0, 98.09524, 98.09524, 99.04762]`

Now, we need to analyze the accuracy of all these distance metrics across all the K values and pick the one that's doing the best job. To do this

`1acc_table = pd.DataFrame(index=K_VALUES)2acc_table[DIST_METRICS[0]] = euc_accuracy3acc_table[DIST_METRICS[1]] = norm_euc_accuracy4acc_table[DIST_METRICS[2]] = cosine_accuracy5acc_table`
eucnorm-euccosine-sim
1100.00000100.00000100.00000
395.2381095.2381098.09524
596.1904895.2381098.09524
798.0952496.1904898.09524
```1width = 0.32mp.figure(figsize=(15, 10))3mp.ylim(0, 115)4e = mp.bar(x=np.add(K_VALUES, width * -1), height=euc_accuracy, width=width, color='#663399')5n = mp.bar(x=np.add(K_VALUES, width * 0), height=norm_euc_accuracy, width=width, color='#669933')6c = mp.bar(x=np.add(K_VALUES, width * 1), height=cosine_accuracy, width=width, color='#994d33')7
8mp.legend(DIST_METRICS, loc="best", fontsize=12)9mp.xlabel("K Value")10mp.ylabel("Accuracy (%)")11mp.show()```

We see that at K = 1, we get 100% accuracy. This is mainly because overfitting is occuring. From my observations K values 3 and 5 seem to be most stable and the accuracy for K = 7 usually decreases. This is why I will be choosing K = 3 and the Normalized Euclidian distance since it is stable to test the accuracy of the test dataset.

```1k_val = 32norm_test_set = normalize(test_set)3
4calculated_pred = []5for i, row in norm_test_set.iterrows():6    calculated_dist = (norm_test_set[ATTRS].sub(row[ATTRS]).pow(2).sum(1).pow(0.5)).sort_values()7    nearest_neighbor_indices = calculated_dist.iloc[0:k_val].index.values8    nearest_neighbors = norm_test_set.loc[nearest_neighbor_indices, :]['class']9    prediction = nearest_neighbors.mode().values[0]10    calculated_pred.append(prediction)11norm_test_set[K_COL_STRINGS[1]] = calculated_pred12
13norm_euc_accuracy = []14
15column = norm_dev_set[K_COL_STRINGS[1]]16total_rows = norm_dev_set.shape[0]17num = norm_dev_set.loc[dev_set['class'] == column].shape[0]18acc = round((num/total_rows) * 100, 5)19norm_euc_accuracy.append(acc)20
21print(norm_euc_accuracy)```
`1[98.09524]`

The final accuracy we get on the test set is around ~98% which is pretty good

## References:

Develop k-Nearest Neighbors in Python From Scratch - Machine Learning Mastery